fix(core): sm.list() no longer writes terminated state to disk#1737
fix(core): sm.list() no longer writes terminated state to disk#1737
Conversation
sm.list() was bypassing the lifecycle manager's probe decision matrix by persisting terminated state immediately on a single isAlive() failure. The dashboard's 3s poll via /api/sessions/patches called sm.list() ~10x more often than the lifecycle manager, so a transient runtime failure would permanently kill the session before the lifecycle manager could evaluate all three probes (runtime, process, activity). Changes: - sm.list() now persists "detecting" instead of "terminated" when it detects a dead runtime, so the lifecycle manager's resolveProbeDecision pipeline remains the single authority on terminal decisions. - /api/sessions/patches now calls listCached() instead of list(), preventing the dashboard's 3s poll from probing runtimes directly. The cache TTL (35s) aligns with the lifecycle manager's 30s poll. - Updated CLAUDE.md invariants to reflect the new behavior.
Test Coverage Report
Per-file breakdown
Uncovered lines
|
Greptile SummaryThis PR prevents
Confidence Score: 4/5Safe to merge with one follow-up fix: sessions already in detecting on disk will display as killed on every cache refresh until the lifecycle manager resolves them. The on-disk guard correctly prevents lastTransitionAt from being reset, but it also skips the only code path that corrects session.status from killed back to detecting. Every 35 s cache refresh that hits list() with a still-dead runtime will cache status = killed rather than detecting, directly undermining the PR's stated goal. packages/core/src/session-manager.ts — specifically the interaction between the enrichSessionWithRuntimeState status assignment and the new onDiskLifecycle guard in the persist block.
|
| Filename | Overview |
|---|---|
| packages/core/src/session-manager.ts | The new onDiskLifecycle guard correctly prevents lastTransitionAt from being reset on every call, but it leaves session.status = "killed" (set in enrichSessionWithRuntimeState) uncorrected whenever the session is already "detecting" on disk — causing the cache to store "killed" on every refresh. |
| packages/web/src/app/api/sessions/patches/route.ts | Switched from list() to listCached() — correct change to reduce runtime probing frequency from the 3 s dashboard poll to the 35 s cache TTL. |
| packages/core/src/tests/session-manager/query.test.ts | Tests updated to expect "detecting" instead of "killed". They correctly validate the first-call path but do not cover the second-call scenario where on-disk state is already "detecting", which is where the status regression would surface. |
| CLAUDE.md | Invariant comment updated to reflect that sm.list() now writes detecting rather than terminated — accurate and useful documentation. |
Sequence Diagram
sequenceDiagram
participant Dashboard as Dashboard (3 s poll)
participant Route as /api/sessions/patches
participant Cache as listCached() [35 s TTL]
participant List as sm.list()
participant Enrich as enrichSessionWithRuntimeState
participant Disk as Session metadata (disk)
participant LM as Lifecycle Manager (30 s poll)
Dashboard->>Route: GET /api/sessions/patches
Route->>Cache: listCached()
alt cache hit
Cache-->>Route: cached sessions
else cache miss
Cache->>List: list()
List->>Disk: read raw metadata
List->>Enrich: enrichSessionWithRuntimeState()
Enrich->>Enrich: isAlive() returns false
Enrich->>Enrich: "lifecycle.session.state = detecting in-memory"
Enrich->>Enrich: "session.status = killed in-memory"
Enrich-->>List: returns
List->>Disk: check onDiskLifecycle.session.state
alt not detecting or terminated or done on disk
List->>Disk: persist detecting + runtime_lost
List->>List: "session.status = deriveLegacyStatus = detecting"
else already detecting on disk
Note over List: persist skipped, session.status stays killed
end
List-->>Cache: sessions
Cache-->>Route: sessions cached 35 s
end
Route-->>Dashboard: patches
LM->>LM: resolveProbeDecision 3 signals
LM->>Disk: write terminated or retry detecting
Comments Outside Diff (1)
-
packages/core/src/session-manager.ts, line 1915-1938 (link)session.statusstays"killed"when session is already"detecting"on diskenrichSessionWithRuntimeStateunconditionally setssession.status = "killed"at line 1030 (theTERMINAL_SESSION_STATUSESguard doesn't protect"detecting"since it isn't in that set). On the firstlist()call the persist block runs and correctssession.statusto"detecting"viaderiveLegacyStatus(persisted). But when the session is already"detecting"on disk, the newonDiskLifecycle.session.state !== "detecting"guard skips the entire block — sosession.statusis never corrected and the caller receivesstatus = "killed".The 35 s
listCached()cache defers the problem, but every cache refresh hitslist()directly; if the runtime is still dead and the on-disk state is"detecting", the cached result flips back to"killed"until the lifecycle manager's probe pipeline resolves the session. This is the exact premature termination the PR is trying to prevent.The simplest targeted fix: after the
ifblock, sync the legacy status back when the lifecycle is already"detecting":if (session.lifecycle?.session.state === "detecting" && session.status === "killed") { session.status = "detecting"; }
Alternatively, add an
elsebranch inside the existing block that callsderiveLegacyStatuswhen the guard fails.Prompt To Fix With AI
This is a comment left during a code review. Path: packages/core/src/session-manager.ts Line: 1915-1938 Comment: **`session.status` stays `"killed"` when session is already `"detecting"` on disk** `enrichSessionWithRuntimeState` unconditionally sets `session.status = "killed"` at line 1030 (the `TERMINAL_SESSION_STATUSES` guard doesn't protect `"detecting"` since it isn't in that set). On the first `list()` call the persist block runs and corrects `session.status` to `"detecting"` via `deriveLegacyStatus(persisted)`. But when the session is **already** `"detecting"` on disk, the new `onDiskLifecycle.session.state !== "detecting"` guard skips the entire block — so `session.status` is never corrected and the caller receives `status = "killed"`. The 35 s `listCached()` cache defers the problem, but every cache refresh hits `list()` directly; if the runtime is still dead and the on-disk state is `"detecting"`, the cached result flips back to `"killed"` until the lifecycle manager's probe pipeline resolves the session. This is the exact premature termination the PR is trying to prevent. The simplest targeted fix: after the `if` block, sync the legacy status back when the lifecycle is already `"detecting"`: ```ts if (session.lifecycle?.session.state === "detecting" && session.status === "killed") { session.status = "detecting"; } ``` Alternatively, add an `else` branch inside the existing block that calls `deriveLegacyStatus` when the guard fails. How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
packages/core/src/session-manager.ts:1915-1938
**`session.status` stays `"killed"` when session is already `"detecting"` on disk**
`enrichSessionWithRuntimeState` unconditionally sets `session.status = "killed"` at line 1030 (the `TERMINAL_SESSION_STATUSES` guard doesn't protect `"detecting"` since it isn't in that set). On the first `list()` call the persist block runs and corrects `session.status` to `"detecting"` via `deriveLegacyStatus(persisted)`. But when the session is **already** `"detecting"` on disk, the new `onDiskLifecycle.session.state !== "detecting"` guard skips the entire block — so `session.status` is never corrected and the caller receives `status = "killed"`.
The 35 s `listCached()` cache defers the problem, but every cache refresh hits `list()` directly; if the runtime is still dead and the on-disk state is `"detecting"`, the cached result flips back to `"killed"` until the lifecycle manager's probe pipeline resolves the session. This is the exact premature termination the PR is trying to prevent.
The simplest targeted fix: after the `if` block, sync the legacy status back when the lifecycle is already `"detecting"`:
```ts
if (session.lifecycle?.session.state === "detecting" && session.status === "killed") {
session.status = "detecting";
}
```
Alternatively, add an `else` branch inside the existing block that calls `deriveLegacyStatus` when the guard fails.
Reviews (2): Last reviewed commit: "fix(core): skip re-persisting detecting ..." | Re-trigger Greptile
Check the on-disk lifecycle state (raw metadata) instead of the in-memory state when deciding whether to persist. Enrichment already sets detecting in-memory, so the previous guard always skipped the persist block. Using the on-disk state ensures: - First detection: persists detecting + lastTransitionAt - Subsequent calls: skips re-write, preserving the original timestamp
|
Addressed the bot review comment: the persist guard now checks the on-disk lifecycle state ( |
Summary
Fixes #1735
sm.list()no longer persiststerminatedto disk — it now writesdetectingstate withruntime_lostreason, so the lifecycle manager'sresolveProbeDecisionpipeline remains the single authority on terminal decisions. A singleisAlive()failure no longer permanently kills a session./api/sessions/patchesnow callslistCached()instead oflist()— the dashboard's 3s poll was the most dangerous caller, probing runtimes ~10x more often than the lifecycle manager. The cache TTL (35s) aligns with the lifecycle manager's 30s poll cycle.What was happening
sm.list()detected a dead runtime viaisAlive()and immediately persistedsession.state = "terminated"withreason = "runtime_lost"to disk. This bypassed the lifecycle manager's carefully designed probe pipeline which evaluates three independent signals (runtime, process, activity) and routes 4/5 outcomes todetecting(retry). Only unanimous agreement across all signals produces a terminal decision.The dashboard's
/api/sessions/patchesendpoint polled every 3s callingsm.list()directly — so a single transientisAlive()failure would permanently kill the session before the lifecycle manager's 30s poll could run.Test plan
killedstatus to expectdetectingpnpm buildpassespnpm typecheckpassespnpm lintpasses (0 errors, pre-existing warnings only)pnpm testpasses (all core + cli tests green; 3 pre-existing web test failures unrelated)🤖 Generated with Claude Code