UN-3660 [FIX] PG barrier decrement — phase-split reconnect-retry on stale connection#2130
UN-3660 [FIX] PG barrier decrement — phase-split reconnect-retry on stale connection#2130muhammad-ali-e merged 2 commits intofeat/UN-3445-pg-queue-integrationZipstack/unstract:feat/UN-3445-pg-queue-integrationfrom UN-3660-pg-barrier-decrement-retryZipstack/unstract:UN-3660-pg-barrier-decrement-retryCopy head branch name to clipboard
Conversation
…tale connection The barrier decrement was the last uncovered idle-reap hang path (after UN-3654 dispatch send, UN-3651 enqueue, UN-3659 store_result). Its try/except caught only psycopg2.DataError, so a PgBouncer-reaped cached connection failed the UPDATE, propagated to the in-body abort, and tore the whole execution down on a transient blip (all files COMPLETED, execution ERROR/stuck). The decrement is non-idempotent and a double-apply is harmful (fires the callback early with partial results, or skips past 0 and strands the barrier), so a blind retry — or even send()'s reused-guard alone — is unsafe: a reused-conn failure at commit time is ambiguous. Instead split the statement into its two phases and retry only the safe one: - EXECUTE phase fails on a reused (cached) conn → idle-reap, never committed → reconnect and re-apply exactly once. - COMMIT phase fails → ambiguous (server may have applied it) → never retry; propagate so the caller tears the barrier down (fail-fast, unchanged). Safety rests on the connection being non-autocommit (verified): an execute-phase failure is never durable even if the server ran the UPDATE. Adds the shared _CONN_DEAD_ERRORS constant (parity with client.py / result_backend.py) and a _recover_after_error helper now reused by _cursor. Tests: 6 unit (execute-retry, commit-no-retry, fresh-conn-no-retry, DataError-no-retry, both error types, one-shot bound), a real-DB exactly-once decrement, and a real pg_terminate_backend run asserting the callback fires exactly once. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
| Filename | Overview |
|---|---|
| workers/queue_backend/pg_barrier.py | Adds _recover_after_error helper (refactors _cursor), _DecrementRow NamedTuple, and the key _apply_decrement phase-split retry function; wires them into _barrier_pg_decrement with a pre-guard conn_was_cached sample. Logic is correct — execute-phase retries only when reused=True and conn_dead and attempt < limit, commit failures are never retried. |
| workers/queue_backend/pg_queue/connection.py | Promotes CONN_DEAD_ERRORS to a shared module-level constant so all three PG-queue sites import from one place; drops Final annotation present in the prior per-module definitions (the only finding in this file). |
| workers/queue_backend/pg_queue/client.py | Removes local _CONN_DEAD_ERRORS definition and imports it from connection.py; no behavioural change. |
| workers/queue_backend/pg_queue/result_backend.py | Same deduplication as client.py — local _CONN_DEAD_ERRORS removed, imported from connection.py; no behavioural change. |
| workers/tests/test_pg_barrier.py | Adds TestDecrementPhaseSplitRetry (6 unit tests covering execute-retry heals, commit-fail no-retry, fresh-conn no-retry, DataError no-retry, one-shot bound, and end-to-end wrapper wiring) plus two real-DB tests (idle-reap self-heal and pg_terminate_backend exactly-once). _FakeConn extended with commit_error and get_transaction_status to support new paths. |
Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant Caller as _barrier_pg_decrement
participant AD as _apply_decrement
participant TL as Thread-local conn
participant DB as Postgres / PgBouncer
Caller->>TL: sample conn_was_cached (before entry guard)
Caller->>TL: _get_conn() [entry-guard txn check]
Caller->>AD: "_apply_decrement(reused=conn_was_cached)"
loop attempt 1..2
AD->>TL: "conn = _get_conn()"
AD->>DB: cur.execute(UPDATE RETURNING)
alt "Execute fails — conn-dead AND reused AND attempt < limit"
AD->>AD: _recover_after_error → discard conn
AD->>AD: sleep + continue (retry)
else Execute fails — not retried (fresh-conn or budget spent)
AD->>AD: _recover_after_error → discard conn
AD-->>Caller: raise OperationalError/InterfaceError
else Execute fails — non-conn error (DataError)
AD->>AD: _recover_after_error → rollback only, conn kept
AD-->>Caller: raise DataError
else Execute succeeds
AD->>DB: conn.commit()
alt Commit fails — AMBIGUOUS, never retried
AD->>AD: _recover_after_error → discard conn
AD-->>Caller: raise (fail-fast)
else Commit succeeds
AD-->>Caller: _DecrementRow(remaining, results)
end
end
end
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
participant Caller as _barrier_pg_decrement
participant AD as _apply_decrement
participant TL as Thread-local conn
participant DB as Postgres / PgBouncer
Caller->>TL: sample conn_was_cached (before entry guard)
Caller->>TL: _get_conn() [entry-guard txn check]
Caller->>AD: "_apply_decrement(reused=conn_was_cached)"
loop attempt 1..2
AD->>TL: "conn = _get_conn()"
AD->>DB: cur.execute(UPDATE RETURNING)
alt "Execute fails — conn-dead AND reused AND attempt < limit"
AD->>AD: _recover_after_error → discard conn
AD->>AD: sleep + continue (retry)
else Execute fails — not retried (fresh-conn or budget spent)
AD->>AD: _recover_after_error → discard conn
AD-->>Caller: raise OperationalError/InterfaceError
else Execute fails — non-conn error (DataError)
AD->>AD: _recover_after_error → rollback only, conn kept
AD-->>Caller: raise DataError
else Execute succeeds
AD->>DB: conn.commit()
alt Commit fails — AMBIGUOUS, never retried
AD->>AD: _recover_after_error → discard conn
AD-->>Caller: raise (fail-fast)
else Commit succeeds
AD-->>Caller: _DecrementRow(remaining, results)
end
end
end
Reviews (2): Last reviewed commit: "UN-3660 [FIX] Address PR review — fix re..." | Re-trigger Greptile
muhammad-ali-e
left a comment
There was a problem hiding this comment.
Automated PR review (PR Review Toolkit)
Reviewed via six specialised agents (Code Reviewer, Silent-Failure Hunter, Type-Design Analyzer, PR Test Analyzer, Comment Analyzer, Code Simplifier) scoped to the two changed files. Overall: a well-reasoned, safety-critical change. The execute-vs-commit phase split is structurally sound, the no-double-count argument holds, and the real-DB pg_terminate_backend test is a genuine exactly-once proof.
The findings below are mostly Low/Nit. The one that matters most is #1: the reused fresh-vs-cached distinction is inoperative in the production path (the entry guard's _get_conn() pre-populates _local.conn, so reused is always True on attempt 1). Not a correctness/double-count bug — the execute-phase retry is unconditionally safe — but the documented invariant isn't realized in prod and the unit test asserting it bypasses the real entry point.
See inline comments for specifics.
Priority summary
| # | Sev | Location | Finding |
|---|---|---|---|
| 1 | Med | pg_barrier.py:675 | reused always True in prod path; fresh-conn-not-retried invariant inoperative |
| 2 | Med | pg_barrier.py:137 | _recover_after_error swallows rollback failure with no log; can reclassify a non-conn error as conn-death |
| 3 | Low | pg_barrier.py:647 | Load-bearing non-autocommit invariant is documented but unenforced — add assert not conn.autocommit |
| 4 | Low | pg_barrier.py:656 | Docstring names wrong caller for commit-phase teardown (Celery .link path has no teardown) |
| 5 | Low | pg_barrier.py:104 | _CONN_DEAD_ERRORS triplicated across 3 modules — hoist to shared module to make the no-drift invariant real |
| 6 | Low | pg_barrier.py:631 | Return tuple[int, Any] | None → NamedTuple for clarity / typed results |
| 7 | Low | pg_barrier.py:701 | Execute-phase give-up raise emits no decrement-level log (asymmetric trail) |
| 8 | Nit | pg_barrier.py:719 | AssertionError comment's "keeps the type checker honest" rationale is wrong (None is a valid return) |
| 9 | Test | test_pg_barrier.py:275 | time.sleep backoff never asserted — deleting it wouldn't fail any test |
| 10 | Test | test_pg_barrier.py:329 | Fresh-conn test bypasses the real entry point (ties to #1) |
| 11 | Test | test_pg_barrier.py:368 | Test rationale ("one-shot bound") doesn't match the actual gating branch |
| for attempt in range(1, _BARRIER_DECREMENT_ATTEMPTS + 1): | ||
| # Only a connection cached BEFORE this attempt can be a stale idle-reaped | ||
| # handle; a freshly-reconnected one failing is a genuine error, not a reap. | ||
| reused = getattr(_local, "conn", None) is not None and not _local.conn.closed |
There was a problem hiding this comment.
[Medium] The reused fresh-vs-cached distinction is inoperative in the production path.
The only production caller, _barrier_pg_decrement, unconditionally runs the entry-guard _get_conn() (~line 765) before _apply_decrement. That call populates _local.conn with a live, open connection. So by the time reused is sampled here, _local.conn is always non-None and open → reused is always True on attempt 1 — including the case the design explicitly wants to exclude: a brand-new connection whose first statement dies against a genuinely-down DB.
This is not a double-count bug (the execute-phase retry is unconditionally safe since nothing committed). The practical impact is bounded: one wasted reconnect + a 0.5s time.sleep before surfacing a real DB-down error. But:
- the documented invariant ("a freshly-created one failing is a genuine DB error… so it is not retried", lines 645-647) is not realized through the production entry point — here
reused's only real effect is the one-shot bound, whichattempt < _BARRIER_DECREMENT_ATTEMPTSalready provides; test_fresh_conn_execute_failure_is_not_retriedpasses only because it calls_apply_decrementdirectly with_local.conn = None, bypassing the guard — false confidence;- the "Mirrors
PgQueueClient.send's reused-guard" claim is operationally inaccurate: insend,reusedis sampled before any_get_conn.
Fix: capture reused in _barrier_pg_decrement before the entry-guard _get_conn() and thread it into _apply_decrement. Then add an end-to-end test (through _barrier_pg_decrement) for the fresh-conn case. Alternatively, if the team accepts the execute retry is always safe regardless, soften the docstrings/comments to stop claiming a fresh conn isn't retried.
Minor (same line): capture _local.conn once to avoid the double lookup —
cached = getattr(_local, "conn", None)
reused = cached is not None and not cached.closed| conn_dead = isinstance(exc, _CONN_DEAD_ERRORS) | ||
| try: | ||
| conn.rollback() | ||
| except Exception: |
There was a problem hiding this comment.
[Medium] Rollback failure is swallowed with no log, and can silently reclassify a non-connection error as a connection death.
Two compounding issues:
- The rollback exception is discarded with zero logging. The original op error does surface later via the caller's
raise, but the rollback failure itself leaves no trace — if rollback fails for an unexpected reason (e.g. anInternalError), that diagnostic is gone. conn_dead = Trueon a failed rollback is the exact gate that decides whether_apply_decrementretries the execute phase (line 687). The docstring at lines 659-660 promises aDataError"propagates unchanged", but a non-_CONN_DEAD_ERRORSerror whose rollback also fails gets relabeledconn_dead=Trueand routed into the retry path — contradicting the stated invariant. The counter stays safe (execute phase committed nothing), but the real error is masked behind a misleading "cached connection died, reconnecting" warning and re-run.
Fix: log the rollback failure with exc_info before flipping conn_dead:
except Exception:
logger.warning(
"PgBarrier: rollback during error recovery failed (original error %s) "
"— treating the connection as dead and discarding it.",
type(exc).__name__, exc_info=True,
)
conn_dead = True| rolled back on disconnect. So the decrement provably never landed → | ||
| reconnect and re-apply it exactly once. (Only a *cached* connection can be a | ||
| stale idle-reaped handle; a freshly-created one failing is a genuine DB | ||
| error, not a reap, so it is not retried.) **This safety relies on the |
There was a problem hiding this comment.
[Low] The load-bearing "non-autocommit" invariant is documented but unenforced.
The entire exactly-once argument for the execute-phase retry rests on the connection being non-autocommit (so the uncommitted UPDATE rolls back on disconnect). This is currently true only because create_pg_connection (pg_queue/connection.py) never sets autocommit and thus inherits psycopg2's False default — it does not actively open non-autocommit, so the phrase "create_pg_connection opens it that way" slightly overstates. There is no runtime guard: a future change adding conn.autocommit = True there (e.g. to match the queue's commit-immediately pattern) would silently reintroduce the double-count this function exists to prevent — and the entry guard's TRANSACTION_STATUS_IDLE check does not imply non-autocommit.
Fix: (a) soften the wording to "non-autocommit (psycopg2's default; create_pg_connection does not override it)"; and (b) as a code change, assert the invariant loudly, e.g. assert not conn.autocommit at the top of _apply_decrement, so it fails loudly rather than corrupting the counter.
| - **COMMIT phase** — ``conn.commit()``. A failure here is AMBIGUOUS: the | ||
| server may have applied the commit before the socket dropped. Re-applying | ||
| could double-count, so it is NEVER retried — it propagates, and the caller | ||
| (:func:`run_batch_with_barrier`) tears the barrier down so the execution |
There was a problem hiding this comment.
[Low / comment accuracy] This docstring names the wrong caller for commit-phase teardown.
The direct caller, _barrier_pg_decrement, does not catch a commit-phase error — its only except is psycopg2.DataError (~line 791). So a commit-phase failure propagates to whichever ultimate caller drove it:
run_batch_with_barrier(in-body PG path) — caught and_abort_barrier_in_bodyruns → teardown. Fail-fast holds. ✅barrier_pg_decr_and_check(the Celery.linkpath, today's live path) — no teardown; the barrier strands toexpires_at(max_retries=0, no replay).
The no-double-count guarantee still holds (commit is never retried), but the "tears the barrier down so the execution fails fast" claim is false for the path that's actually live today. Suggest rewording to cover both paths and drop the single-caller framing.
| # decrement phase-split retry, so the "was this a connection death?" test can't | ||
| # drift across them. Mirrors ``PgQueueClient._CONN_DEAD_ERRORS`` (UN-3654) and | ||
| # ``PgResultBackend._CONN_DEAD_ERRORS`` (UN-3659), the siblings this models on. | ||
| _CONN_DEAD_ERRORS: Final[tuple[type[Exception], ...]] = ( |
There was a problem hiding this comment.
[Low] _CONN_DEAD_ERRORS is now copy #3 of an identical constant (pg_queue/client.py, pg_queue/result_backend.py). The comment here explicitly worries about the "was this a connection death?" test drifting — but the three module-level copies can drift independently, which is the exact risk the constant was meant to kill, just moved up to the module boundary.
Fix: hoist a single _CONN_DEAD_ERRORS into a shared location (e.g. pg_queue/connection.py, which all three already import for create_pg_connection) and import it in all three. If that's deemed too invasive, at minimum add a test asserting the three tuples are equal.
Comment nit: PgQueueClient._CONN_DEAD_ERRORS / PgResultBackend._CONN_DEAD_ERRORS are module-level constants, not class attributes — PgQueueClient._CONN_DEAD_ERRORS would AttributeError. Refer to them by module (pg_queue.client / pg_queue.result_backend).
| ) | ||
| time.sleep(_BARRIER_RETRY_BACKOFF_SECONDS) | ||
| continue | ||
| raise |
There was a problem hiding this comment.
[Low] The execute-phase give-up raise emits no decrement-level log.
The retry branch logs a warning (lines 688-698) and the commit branch logs (709-716), but this propagate path is silent. For a fresh-conn death (reused False) or a non-conn error, _barrier_pg_decrement only catches DataError, so an OperationalError here propagates with no barrier-specific breadcrumb — and on the Celery .link path the barrier isn't torn down, so it hangs to expires_at. The result is an asymmetric trail: retryable blips are logged, but the terminal "gave up / wasn't retryable" decision isn't. Consider one warning before this raise naming the execution, error type, and that the decrement is propagating without retry (mirroring the commit-phase log).
| ) | ||
| raise | ||
| return row | ||
| # Unreachable: the loop either returns the row or raises. The bare raise keeps |
There was a problem hiding this comment.
[Nit / comment accuracy] The "keeps the type checker honest" rationale is wrong.
The return annotation is tuple[int, Any] | None, so an implicit fall-through off the loop would return None — a valid return (the function legitimately returns None at line 718 when fetchone() finds no row). A type checker would therefore not flag a missing return whether or not this line exists. The AssertionError is a pure defensive runtime guard, not a type-checker aid. Reword to e.g. "Defensive: the loop always returns or raises before here." (Keeping the guard is fine — cheap insurance in concurrency code; just fix the justification.)
| """ | ||
|
|
||
| @pytest.fixture(autouse=True) | ||
| def _no_sleep(self, monkeypatch): |
There was a problem hiding this comment.
[Test gap] The time.sleep(_BARRIER_RETRY_BACKOFF_SECONDS) backoff is never asserted.
This _no_sleep fixture (and the inline monkeypatches at ~456/601/631) replace pg_barrier.time.sleep with a bare no-op purely to skip the wall-clock wait — none record or assert the call. If someone deleted the time.sleep in _apply_decrement (line 699), every test would still pass, and a "tighten the retry loop" refactor could silently start hammering a struggling DB.
Fix: have the fixture append to a list, then assert sleeps == [pg_barrier._BARRIER_RETRY_BACKOFF_SECONDS] in the retry test and sleeps == [] in the no-retry tests (commit-phase / fresh-conn).
| assert reconnects == [] # NEVER reconnected/retried after a commit failure | ||
| assert conn.closed is True # the dead conn was discarded | ||
|
|
||
| def test_fresh_conn_execute_failure_is_not_retried( |
There was a problem hiding this comment.
[Test gap — ties to the reused finding] This test bypasses the real entry point.
It calls _apply_decrement(...) directly with _local.conn = None, so it exercises a fresh-conn state that the production wrapper (_barrier_pg_decrement → barrier_pg_decr_and_check) never produces: that wrapper's entry-guard _get_conn() always pre-populates _local.conn, making reused always True on attempt 1. So this test asserts a property the production path does not actually provide. Add an end-to-end test through _barrier_pg_decrement for the fresh-conn-not-retried case (or, if reused is fixed to sample before the entry guard, this becomes a true assertion).
| assert reconnects == [] # no retry on a logical/data error | ||
| assert conn.closed is False # a live conn after a data error is kept | ||
|
|
||
| def test_reraises_after_second_execute_failure(self, _clean_local, monkeypatch): |
There was a problem hiding this comment.
[Test quality / informational] The stated rationale doesn't match the actual gating branch.
This test is framed around "the one-shot bound" (attempt < _BARRIER_DECREMENT_ATTEMPTS), but trace the path: on attempt 1, _recover_after_error sets _local.conn = None before the continue; on attempt 2, reused recomputes to False, so the retry is refused by reused, not by the attempt bound — both independently block it. The attempt < term can never be the sole reason a retry is refused, because every retry discards the connection. Net: the decrement is hard-capped at one self-heal by the reused logic regardless of the attempt constant, and bumping _BARRIER_DECREMENT_ATTEMPTS would NOT increase self-heals — an untested assumption worth a comment. The propagation outcome asserted here is correct; only the rationale needs fixing.
…ed constant Addresses the PR Review Toolkit findings on #2130: - reused-guard was inoperative in production: `_barrier_pg_decrement` runs the entry-guard `_get_conn()` before `_apply_decrement`, so `reused` (sampled inside the loop) was always True — the fresh-conn-not-retried case never fired. Now sample it in `_barrier_pg_decrement` BEFORE the guard and thread it in. Adds an end-to-end test through the wrapper for the fresh-conn case (the prior direct-call test bypassed the guard → false confidence). - Hoist `_CONN_DEAD_ERRORS` to `pg_queue.connection` (the module all three sites import) and import it in client.py / result_backend.py / pg_barrier.py, so the three copies can't drift. (psycopg2 import drops out of the two that only used it for the constant.) - `_recover_after_error`: log the rollback failure with exc_info before flipping `conn_dead` — it was silently swallowed and can reclassify a non-conn error. - Non-autocommit invariant: soften the docstring ("psycopg2's default; not overridden") and pin it at the source with `test_create_pg_connection_is_non_autocommit` (a runtime assert in `_apply_decrement` would break the autocommit test fixture). - Return a `_DecrementRow` NamedTuple instead of `tuple[int, Any]` so the caller reads `.remaining` / `.results`. - Log the execute-phase give-up (fresh-conn death / retry exhausted) — the retryable and commit paths logged but this terminal path was silent. - Commit-phase docstring: name both caller paths (in-body teardown vs Celery .link reclaim-at-expiry) instead of only run_batch_with_barrier. - Reword the unreachable-raise comment (it's a defensive runtime guard, not a type-checker aid) and assert the backoff sleep fires exactly on retry. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
|
Thanks for the thorough review — addressed in [Medium] [Medium] Rollback failure swallowed — fixed. [Low] [Low] Non-autocommit invariant unenforced — softened the wording ("psycopg2's default; [Low] Commit-phase docstring names wrong caller — fixed. Now covers both paths: in-body ( [Low] NamedTuple return — done. [Low] Silent execute-phase give-up — fixed. A conn-dead give-up (fresh-conn death / retry exhausted) now logs a warning before propagating, matching the retryable and commit paths. (A non-conn [Nit] Unreachable-raise rationale — reworded to "defensive runtime guard" (the [Test gap] Backoff never asserted — fixed. The [Test gap / quality] Fresh-conn bypass + one-shot rationale — the new end-to-end test covers the real fresh-conn path; Re-verified: full |
4ffc41b
into
feat/UN-3445-pg-queue-integration
What
Closes the last uncovered stale-connection-after-idle hang path in the PG-queue execution engine, after UN-3654 (dispatch
send), UN-3651 (barrier enqueue) and UN-3659 (store_result).The
PgBarrierdecrement —runs on the worker's cached thread-local connection. After a worker sits idle >
server_idle_timeout(PgBouncer) the connection is reaped server-side. The decrement'stry/exceptcaught onlypsycopg2.DataError, so anOperationalError/InterfaceErrorpropagated, the batch's decrement was lost,remainingnever reached 0, and the execution hung at the fan-in (all filesCOMPLETEDbut execution stuckEXECUTING) until the ~6h barrier expiry. Same class as the earlier hangs, one step later in the pipeline.Why not a plain retry (or
send's reused-guard)The decrement is non-idempotent, and a double-apply is harmful: it can fire the callback early with incomplete results, or drive
remainingpast 0 and strand the barrier. Evensend()'s reused-guard is not enough on its own here — a reused-connection failure at commit time is ambiguous (the server may have applied it).Fix — phase-split retry
Split the statement into its two phases and retry only the one where a re-apply is provably exactly-once:
DataError/ non-connection error → propagate unchanged.Safety rests on the connection being non-autocommit (verified in
create_pg_connection): an execute-phase failure is never durable even if the server ran theUPDATE, because the open transaction is rolled back on disconnect. Durability happens only atcommit(), the phase we never retry — so re-applying can never double-count.Also adds the shared
_CONN_DEAD_ERRORSconstant (parity withclient.py/result_backend.py, so the "is this a connection death?" test can't drift) and a_recover_after_errorhelper now reused by_cursor.Tests
TestDecrementPhaseSplitRetry): execute-retry-heals (both error types), commit-fail-does-NOT-retry, fresh-conn-no-retry, DataError-no-retry, one-shot bound.pg_terminate_backend: a genuinely killed backend mid-decrement self-heals and fires the callback exactly once.test_pg_barrier.py(63) + sibling barrier suites (35) green; pre-commit clean.Dev-tested end-to-end inside the live PG worker fleet against the real database (real backend termination mid-decrement → self-heal → callback once → barrier consumed).
After this, no stale-connection-after-idle path can hang an execution.