Commit 06da3c5

Rework subtransaction commit protocol for hot standby.

This patch eliminates the marking of subtransactions as SUBCOMMITTED in pg_clog during their commit; instead they remain in-progress until main transaction commit. At main transaction commit, the commit protocol is atomic-by-page instead of one transaction at a time. To avoid a race condition with some subtransactions appearing committed before others in the case where they span more than one pg_clog page, we conserve the logic that marks them subcommitted before marking the parent committed.

Simon Riggs with minor help from me
1 parent 3afffbc commit 06da3c5
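
The "atomic-by-page" wording in the commit message depends on knowing which clog page each XID lands on. The following is only a minimal sketch of that mapping, not the code from this commit. It assumes 2 status bits per transaction and the default 8192-byte block size; every `sketch_`/`SKETCH_` name is hypothetical and stands in for the `TransactionIdToPage()` logic used in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values: default 8192-byte block, 2 status bits (4 xacts) per byte. */
#define SKETCH_BLCKSZ          8192
#define SKETCH_XACTS_PER_BYTE  4
#define SKETCH_XACTS_PER_PAGE  (SKETCH_BLCKSZ * SKETCH_XACTS_PER_BYTE)

typedef uint32_t SketchXid;     /* hypothetical stand-in for TransactionId */

/* Map an XID to the number of the clog page that holds its status bits. */
static int
sketch_xid_to_page(SketchXid xid)
{
    return (int) (xid / SKETCH_XACTS_PER_PAGE);
}

/*
 * Count how many leading subxids share the parent's page.  The scan stops
 * at the first subxid that falls on a different page.
 */
static int
sketch_count_on_parent_page(SketchXid xid, int nsubxids,
                            const SketchXid *subxids)
{
    int pageno = sketch_xid_to_page(xid);
    int i;

    assert(nsubxids >= 0);
    for (i = 0; i < nsubxids; i++)
    {
        if (sketch_xid_to_page(subxids[i]) != pageno)
            break;
    }
    return i;
}

/* True when the whole tree can be updated under a single page visit. */
static int
sketch_tree_fits_one_page(SketchXid xid, int nsubxids,
                          const SketchXid *subxids)
{
    return sketch_count_on_parent_page(xid, nsubxids, subxids) == nsubxids;
}
```

With these assumed constants each page covers 32768 XIDs, so a parent with XID 32760 and subxids 32761, 32767, 32768, 40000 has two subxids on its own page and two on the next one, which is the case that needs the sub-committed intermediate state.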
7 files changed, +279 -212 lines changed

‎src/backend/access/transam/README

15 additions & 6 deletions
@@ -1,4 +1,4 @@
-$PostgreSQL: pgsql/src/backend/access/transam/README,v 1.11 2008/03/21 13:23:28 momjian Exp $
+$PostgreSQL: pgsql/src/backend/access/transam/README,v 1.12 2008/10/20 19:18:18 alvherre Exp $
 
 The Transaction System
 ======================
@@ -341,11 +341,20 @@ from disk. They also allow information to be permanent across server restarts.
 pg_clog records the commit status for each transaction that has been assigned
 an XID. A transaction can be in progress, committed, aborted, or
 "sub-committed". This last state means that it's a subtransaction that's no
-longer running, but its parent has not updated its state yet (either it is
-still running, or the backend crashed without updating its status). A
-sub-committed transaction's status will be updated again to the final value as
-soon as the parent commits or aborts, or when the parent is detected to be
-aborted.
+longer running, but its parent has not updated its state yet. It is not
+necessary to update a subtransaction's transaction status to subcommit, so we
+can just defer it until main transaction commit. The main role of marking
+transactions as sub-committed is to provide an atomic commit protocol when
+transaction status is spread across multiple clog pages. As a result, whenever
+transaction status spreads across multiple pages we must use a two-phase commit
+protocol: the first phase is to mark the subtransactions as sub-committed, then
+we mark the top level transaction and all its subtransactions committed (in
+that order). Thus, subtransactions that have not aborted appear as in-progress
+even when they have already finished, and the subcommit status appears as a
+very short transitory state during main transaction commit. Subtransaction
+abort is always marked in clog as soon as it occurs. When the transaction
+status all fit in a single CLOG page, we atomically mark them all as committed
+without bothering with the intermediate sub-commit state.
 
 Savepoints are implemented using subtransactions. A subtransaction is a
 transaction inside a transaction; its commit or abort status is not only
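
The two-phase protocol described in the README hunk above can be illustrated with a toy model. This is a sketch under stated assumptions, not the patch's implementation: it uses a tiny hypothetical page size of 4 transactions, keeps one status byte per XID rather than packed bits, ignores locking and WAL, and assumes that a reader resolves a sub-committed entry by consulting its parent's status. All `toy_`/`ST_` names are made up for illustration.

```c
#include <assert.h>

enum { ST_IN_PROGRESS = 0, ST_COMMITTED = 1, ST_ABORTED = 2, ST_SUB_COMMITTED = 3 };

#define TOY_XACTS_PER_PAGE 4    /* hypothetical, tiny page for illustration */
#define TOY_MAX_XID        32

static unsigned char toy_status[TOY_MAX_XID];   /* zero means in progress */

static int toy_page(unsigned xid) { return (int) (xid / TOY_XACTS_PER_PAGE); }

static void
toy_set(unsigned xid, int status)
{
    assert(xid < TOY_MAX_XID);
    toy_status[xid] = (unsigned char) status;
}

/* Phase 1: subxids on pages other than the parent's become sub-committed. */
static void
toy_phase1_subcommit_other_pages(unsigned xid, int n, const unsigned *subxids)
{
    for (int i = 0; i < n; i++)
        if (toy_page(subxids[i]) != toy_page(xid))
            toy_set(subxids[i], ST_SUB_COMMITTED);
}

/* Phase 2: on the parent's page, sub-commit the subxids, commit the parent,
 * then commit those subxids. */
static void
toy_phase2_commit_parent_page(unsigned xid, int n, const unsigned *subxids)
{
    for (int i = 0; i < n; i++)
        if (toy_page(subxids[i]) == toy_page(xid))
            toy_set(subxids[i], ST_SUB_COMMITTED);
    toy_set(xid, ST_COMMITTED);
    for (int i = 0; i < n; i++)
        if (toy_page(subxids[i]) == toy_page(xid))
            toy_set(subxids[i], ST_COMMITTED);
}

/* Phase 3: revisit the other pages and mark their subxids committed. */
static void
toy_phase3_commit_other_pages(unsigned xid, int n, const unsigned *subxids)
{
    for (int i = 0; i < n; i++)
        if (toy_page(subxids[i]) != toy_page(xid))
            toy_set(subxids[i], ST_COMMITTED);
}

/* Assumed reader rule: sub-committed counts as committed once the parent is. */
static int
toy_appears_committed(unsigned xid, unsigned parent)
{
    if (toy_status[xid] == ST_COMMITTED)
        return 1;
    if (toy_status[xid] == ST_SUB_COMMITTED)
        return toy_status[parent] == ST_COMMITTED;
    return 0;
}

static void
toy_commit_tree(unsigned xid, int n, const unsigned *subxids)
{
    toy_phase1_subcommit_other_pages(xid, n, subxids);   /* no-op on one page */
    toy_phase2_commit_parent_page(xid, n, subxids);
    toy_phase3_commit_other_pages(xid, n, subxids);
}
```

For a parent with XID 5 and subxids 6, 8, 9, 12 (pages 1, 1, 2, 2, 3 in this toy layout), no subtransaction appears committed after phase 1, and all of them appear committed as soon as phase 2 flips the parent, even though phase 3 has not yet rewritten pages 2 and 3.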

‎src/backend/access/transam/clog.c

214 additions & 15 deletions
@@ -26,7 +26,7 @@
  * Portions Copyright (c) 1996-2008, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
- * $PostgreSQL: pgsql/src/backend/access/transam/clog.c,v 1.47 2008/08/01 13:16:08 alvherre Exp $
+ * $PostgreSQL: pgsql/src/backend/access/transam/clog.c,v 1.48 2008/10/20 19:18:18 alvherre Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -80,32 +80,182 @@ static int ZeroCLOGPage(int pageno, bool writeXlog);
 static bool CLOGPagePrecedes(int page1, int page2);
 static void WriteZeroPageXlogRec(int pageno);
 static void WriteTruncateXlogRec(int pageno);
+static void TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
+                           TransactionId *subxids, XidStatus status,
+                           XLogRecPtr lsn, int pageno);
+static void TransactionIdSetStatusBit(TransactionId xid, XidStatus status,
+                          XLogRecPtr lsn, int slotno);
+static void set_status_by_pages(int nsubxids, TransactionId *subxids,
+                    XidStatus status, XLogRecPtr lsn);
 
 
 /*
- * Record the final state of a transaction in the commit log.
+ * TransactionIdSetTreeStatus
+ *
+ * Record the final state of transaction entries in the commit log for
+ * a transaction and its subtransaction tree. Take care to ensure this is
+ * efficient, and as atomic as possible.
+ *
+ * xid is a single xid to set status for. This will typically be
+ * the top level transactionid for a top level commit or abort. It can
+ * also be a subtransaction when we record transaction aborts.
+ *
+ * subxids is an array of xids of length nsubxids, representing subtransactions
+ * in the tree of xid. In various cases nsubxids may be zero.
  *
  * lsn must be the WAL location of the commit record when recording an async
  * commit. For a synchronous commit it can be InvalidXLogRecPtr, since the
  * caller guarantees the commit record is already flushed in that case. It
  * should be InvalidXLogRecPtr for abort cases, too.
  *
+ * In the commit case, atomicity is limited by whether all the subxids are in
+ * the same CLOG page as xid. If they all are, then the lock will be grabbed
+ * only once, and the status will be set to committed directly. Otherwise
+ * we must
+ *   1. set sub-committed all subxids that are not on the same page as the
+ *      main xid
+ *   2. atomically set committed the main xid and the subxids on the same page
+ *   3. go over the first bunch again and set them committed
+ * Note that as far as concurrent checkers are concerned, main transaction
+ * commit as a whole is still atomic.
+ *
+ * Example:
+ *      TransactionId t commits and has subxids t1, t2, t3, t4
+ *      t is on page p1, t1 is also on p1, t2 and t3 are on p2, t4 is on p3
+ *      1. update pages2-3:
+ *              page2: set t2,t3 as sub-committed
+ *              page3: set t4 as sub-committed
+ *      2. update page1:
+ *              set t1 as sub-committed,
+ *              then set t as committed,
+                then set t1 as committed
+ *      3. update pages2-3:
+ *              page2: set t2,t3 as committed
+ *              page3: set t4 as committed
+ *
  * NB: this is a low-level routine and is NOT the preferred entry point
- * for most uses; TransactionLogUpdate() in transam.c is the intended caller.
+ * for most uses; functions in transam.c are the intended callers.
+ *
+ * XXX Think about issuing FADVISE_WILLNEED on pages that we will need,
+ * but aren't yet in cache, as well as hinting pages not to fall out of
+ * cache yet.
  */
 void
-TransactionIdSetStatus(TransactionId xid, XidStatus status, XLogRecPtr lsn)
+TransactionIdSetTreeStatus(TransactionId xid, int nsubxids,
+                    TransactionId *subxids, XidStatus status, XLogRecPtr lsn)
+{
+    int pageno = TransactionIdToPage(xid); /* get page of parent */
+    int i;
+
+    Assert(status == TRANSACTION_STATUS_COMMITTED ||
+           status == TRANSACTION_STATUS_ABORTED);
+
+    /*
+     * See how many subxids, if any, are on the same page as the parent, if any.
+     */
+    for (i = 0; i < nsubxids; i++)
+    {
+        if (TransactionIdToPage(subxids[i]) != pageno)
+            break;
+    }
+
+    /*
+     * Do all items fit on a single page?
+     */
+    if (i == nsubxids)
+    {
+        /*
+         * Set the parent and all subtransactions in a single call
+         */
+        TransactionIdSetPageStatus(xid, nsubxids, subxids, status, lsn,
+                                   pageno);
+    }
+    else
+    {
+        int nsubxids_on_first_page = i;
+
+        /*
+         * If this is a commit then we care about doing this correctly (i.e.
+         * using the subcommitted intermediate status). By here, we know we're
+         * updating more than one page of clog, so we must mark entries that
+         * are *not* on the first page so that they show as subcommitted before
+         * we then return to update the status to fully committed.
+         *
+         * To avoid touching the first page twice, skip marking subcommitted
+         * for the subxids on that first page.
+         */
+        if (status == TRANSACTION_STATUS_COMMITTED)
+            set_status_by_pages(nsubxids - nsubxids_on_first_page,
+                                subxids + nsubxids_on_first_page,
+                                TRANSACTION_STATUS_SUB_COMMITTED, lsn);
+
+        /*
+         * Now set the parent and subtransactions on same page as the parent,
+         * if any
+         */
+        pageno = TransactionIdToPage(xid);
+        TransactionIdSetPageStatus(xid, nsubxids_on_first_page, subxids, status,
+                                   lsn, pageno);
+
+        /*
+         * Now work through the rest of the subxids one clog page at a time,
+         * starting from the second page onwards, like we did above.
+         */
+        set_status_by_pages(nsubxids - nsubxids_on_first_page,
+                            subxids + nsubxids_on_first_page,
+                            status, lsn);
+    }
+}
+
+/*
+ * Helper for TransactionIdSetTreeStatus: set the status for a bunch of
+ * transactions, chunking in the separate CLOG pages involved. We never
+ * pass the whole transaction tree to this function, only subtransactions
+ * that are on different pages to the top level transaction id.
+ */
+static void
+set_status_by_pages(int nsubxids, TransactionId *subxids,
+                    XidStatus status, XLogRecPtr lsn)
+{
+    int pageno = TransactionIdToPage(subxids[0]);
+    int offset = 0;
+    int i = 0;
+
+    while (i < nsubxids)
+    {
+        int num_on_page = 0;
+
+        while (TransactionIdToPage(subxids[i]) == pageno && i < nsubxids)
+        {
+            num_on_page++;
+            i++;
+        }
+
+        TransactionIdSetPageStatus(InvalidTransactionId,
+                                   num_on_page, subxids + offset,
+                                   status, lsn, pageno);
+        offset = i;
+        pageno = TransactionIdToPage(subxids[offset]);
+    }
+}
+
+/*
+ * Record the final state of transaction entries in the commit log for
+ * all entries on a single page. Atomic only on this page.
+ *
+ * Otherwise API is same as TransactionIdSetTreeStatus()
+ */
+static void
+TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
+                           TransactionId *subxids, XidStatus status,
+                           XLogRecPtr lsn, int pageno)
 {
-    int pageno = TransactionIdToPage(xid);
-    int byteno = TransactionIdToByte(xid);
-    int bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT;
     int slotno;
-    char *byteptr;
-    char byteval;
+    int i;
 
     Assert(status == TRANSACTION_STATUS_COMMITTED ||
            status == TRANSACTION_STATUS_ABORTED ||
-           status == TRANSACTION_STATUS_SUB_COMMITTED);
+           (status == TRANSACTION_STATUS_SUB_COMMITTED && !TransactionIdIsValid(xid)));
 
     LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
 
@@ -116,9 +266,62 @@ TransactionIdSetStatus(TransactionId xid, XidStatus status, XLogRecPtr lsn)
      * mustn't let it reach disk until we've done the appropriate WAL flush.
      * But when lsn is invalid, it's OK to scribble on a page while it is
      * write-busy, since we don't care if the update reaches disk sooner than
-     * we think. Hence, pass write_ok = XLogRecPtrIsInvalid(lsn).
+     * we think.
      */
     slotno = SimpleLruReadPage(ClogCtl, pageno, XLogRecPtrIsInvalid(lsn), xid);
+
+    /*
+     * Set the main transaction id, if any.
+     *
+     * If we update more than one xid on this page while it is being written
+     * out, we might find that some of the bits go to disk and others don't.
+     * If we are updating commits on the page with the top-level xid that could
+     * break atomicity, so we subcommit the subxids first before we mark the
+     * top-level commit.
+     */
+    if (TransactionIdIsValid(xid))
+    {
+        /* Subtransactions first, if needed ... */
+        if (status == TRANSACTION_STATUS_COMMITTED)
+        {
+            for (i = 0; i < nsubxids; i++)
+            {
+                Assert(ClogCtl->shared->page_number[slotno] == TransactionIdToPage(subxids[i]));
+                TransactionIdSetStatusBit(subxids[i],
+                                          TRANSACTION_STATUS_SUB_COMMITTED,
+                                          lsn, slotno);
+            }
+        }
+
+        /* ... then the main transaction */
+        TransactionIdSetStatusBit(xid, status, lsn, slotno);
+    }
+
+    /* Set the subtransactions */
+    for (i = 0; i < nsubxids; i++)
+    {
+        Assert(ClogCtl->shared->page_number[slotno] == TransactionIdToPage(subxids[i]));
+        TransactionIdSetStatusBit(subxids[i], status, lsn, slotno);
+    }
+
+    ClogCtl->shared->page_dirty[slotno] = true;
+
+    LWLockRelease(CLogControlLock);
+}
+
+/*
+ * Sets the commit status of a single transaction.
+ *
+ * Must be called with CLogControlLock held
+ */
+static void
+TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, int slotno)
+{
+    int byteno = TransactionIdToByte(xid);
+    int bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT;
+    char *byteptr;
+    char byteval;
+
     byteptr = ClogCtl->shared->page_buffer[slotno] + byteno;
 
     /* Current state should be 0, subcommitted or target state */
@@ -132,8 +335,6 @@ TransactionIdSetStatus(TransactionId xid, XidStatus status, XLogRecPtr lsn)
     byteval |= (status << bshift);
     *byteptr = byteval;
 
-    ClogCtl->shared->page_dirty[slotno] = true;
-
     /*
      * Update the group LSN if the transaction completion LSN is higher.
      *
@@ -149,8 +350,6 @@ TransactionIdSetStatus(TransactionId xid, XidStatus status, XLogRecPtr lsn)
         if (XLByteLT(ClogCtl->shared->group_lsn[lsnindex], lsn))
             ClogCtl->shared->group_lsn[lsnindex] = lsn;
     }
-
-    LWLockRelease(CLogControlLock);
 }
 
 /*
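
The bit manipulation that `TransactionIdSetStatusBit` performs (clear the 2-bit field, then OR in the new status) may be easier to follow outside the shared-memory context. The sketch below is a standalone illustration only. It assumes 2 bits per transaction and an 8192-byte page, operates on a plain byte array instead of an SLRU buffer, and omits locking, the state assertions and the group-LSN update; all `sk_`/`SK_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: 2 status bits per transaction, 4 transactions per byte. */
#define SK_BITS_PER_XACT   2
#define SK_XACTS_PER_BYTE  4
#define SK_PAGE_BYTES      8192
#define SK_XACTS_PER_PAGE  (SK_PAGE_BYTES * SK_XACTS_PER_BYTE)
#define SK_XACT_BITMASK    ((1 << SK_BITS_PER_XACT) - 1)

/* Byte offset of an XID's status field within its page. */
static uint32_t
sk_byteno(uint32_t xid)
{
    return (xid % SK_XACTS_PER_PAGE) / SK_XACTS_PER_BYTE;
}

/* Bit shift of an XID's status field within its byte. */
static int
sk_bshift(uint32_t xid)
{
    return (int) (xid % SK_XACTS_PER_BYTE) * SK_BITS_PER_XACT;
}

/* Overwrite the 2-bit status of xid, leaving its three neighbours untouched. */
static void
sk_set_status(unsigned char *page, uint32_t xid, int status)
{
    unsigned char byteval = page[sk_byteno(xid)];
    int           bshift = sk_bshift(xid);

    assert(status >= 0 && status <= SK_XACT_BITMASK);
    byteval &= (unsigned char) ~(SK_XACT_BITMASK << bshift);   /* clear field */
    byteval |= (unsigned char) (status << bshift);             /* store status */
    page[sk_byteno(xid)] = byteval;
}

/* Read back the 2-bit status of xid. */
static int
sk_get_status(const unsigned char *page, uint32_t xid)
{
    return (page[sk_byteno(xid)] >> sk_bshift(xid)) & SK_XACT_BITMASK;
}
```

Because four transactions share one byte, several status updates on the same page touch overlapping bytes, which is one reason the patch performs all updates for a page while holding the lock once and marks the page dirty a single time afterwards.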
