Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

fix(core): sm.list() no longer writes terminated state to disk#1737

Open
whoisasx wants to merge 2 commits intomainComposioHQ/agent-orchestrator:mainfrom
feat/issue-1735ComposioHQ/agent-orchestrator:feat/issue-1735Copy head branch name to clipboard
Open

fix(core): sm.list() no longer writes terminated state to disk#1737
whoisasx wants to merge 2 commits intomainComposioHQ/agent-orchestrator:mainfrom
feat/issue-1735ComposioHQ/agent-orchestrator:feat/issue-1735Copy head branch name to clipboard

Conversation

@whoisasx
Copy link
Copy Markdown
Collaborator

@whoisasx whoisasx commented May 8, 2026

Summary

Fixes #1735

  • sm.list() no longer persists terminated to disk — it now writes detecting state with runtime_lost reason, so the lifecycle manager's resolveProbeDecision pipeline remains the single authority on terminal decisions. A single isAlive() failure no longer permanently kills a session.
  • /api/sessions/patches now calls listCached() instead of list() — the dashboard's 3s poll was the most dangerous caller, probing runtimes ~10x more often than the lifecycle manager. The cache TTL (35s) aligns with the lifecycle manager's 30s poll cycle.
  • Updated CLAUDE.md invariants to reflect the new behavior.

What was happening

sm.list() detected a dead runtime via isAlive() and immediately persisted session.state = "terminated" with reason = "runtime_lost" to disk. This bypassed the lifecycle manager's carefully designed probe pipeline which evaluates three independent signals (runtime, process, activity) and routes 4/5 outcomes to detecting (retry). Only unanimous agreement across all signals produces a terminal decision.

The dashboard's /api/sessions/patches endpoint polled every 3s calling sm.list() directly — so a single transient isAlive() failure would permanently kill the session before the lifecycle manager's 30s poll could run.

Test plan

  • Updated existing tests that expected killed status to expect detecting
  • pnpm build passes
  • pnpm typecheck passes
  • pnpm lint passes (0 errors, pre-existing warnings only)
  • pnpm test passes (all core + cli tests green; 3 pre-existing web test failures unrelated)

🤖 Generated with Claude Code

sm.list() was bypassing the lifecycle manager's probe decision matrix by
persisting terminated state immediately on a single isAlive() failure.
The dashboard's 3s poll via /api/sessions/patches called sm.list() ~10x
more often than the lifecycle manager, so a transient runtime failure
would permanently kill the session before the lifecycle manager could
evaluate all three probes (runtime, process, activity).

Changes:
- sm.list() now persists "detecting" instead of "terminated" when it
  detects a dead runtime, so the lifecycle manager's resolveProbeDecision
  pipeline remains the single authority on terminal decisions.
- /api/sessions/patches now calls listCached() instead of list(),
  preventing the dashboard's 3s poll from probing runtimes directly.
  The cache TTL (35s) aligns with the lifecycle manager's 30s poll.
- Updated CLAUDE.md invariants to reflect the new behavior.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

Test Coverage Report

Metric Value
Lines covered 1044/1161
Lines not covered 117/1161
Overall coverage 89.9%
Per-file breakdown
File Coverage
packages/core/src/session-manager.ts 1010/1127 (89.6%)
packages/web/src/app/api/sessions/patches/route.ts 34/34 (100.0%)

Uncovered lines

  • packages/core/src/session-manager.ts: L295, L299-L300, L305-L306, L308, L321, L426, L501-L502, L504, L584, L642, L713-L715, L842-L843, L846, L857, L935, L1055-L1057, L1093, L1193, L1306, L1422, L1426-L1427, L1480, L1549-L1550, L1569-L1570, L1584-L1585, L1593-L1594, L1617-L1618, L1644, L1747, L1751-L1752, L1775-L1779, L1781, L1783, L1789, L1813, L1818, L1827-L1828, L1831-L1833, L1840, L2039-L2041, L2053, L2160, L2186-L2187, L2239, L2272-L2273, L2285, L2289, L2341, L2346, L2353, L2362, L2386, L2395-L2396, L2426, L2436, L2442-L2443, L2467, L2488, L2510-L2511, L2517-L2518, L2523, L2535, L2579, L2581-L2583, L2585, L2588-L2590, L2592-L2593, L2595, L2610, L2619, L2627, L2652, L2703, L2707, L2731, L2794, L2797, L2812, L2828, L2831, L2845, L2933

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 8, 2026

Greptile Summary

This PR prevents sm.list() from writing terminated directly to disk when it detects a dead runtime — it now writes detecting instead, deferring terminal decisions exclusively to the lifecycle manager's probe pipeline. The dashboard's /api/sessions/patches endpoint is also switched from list() to listCached() to reduce runtime probe frequency from every 3 s to every 35 s.

  • sm.list() state change: Dead-runtime detection now persists detecting + runtime_lost to disk (not terminated), and the new onDiskLifecycle guard prevents lastTransitionAt from being reset on every list() invocation when the session is already \"detecting\".
  • /api/sessions/patches uses cache: Switching to listCached() aligns the dashboard poll cadence with the lifecycle manager's 30 s cycle, eliminating the most dangerous caller of raw list().
  • Tests and docs updated: Existing tests now expect \"detecting\" status, and CLAUDE.md invariants are updated to reflect the new authority boundary.

Confidence Score: 4/5

Safe to merge with one follow-up fix: sessions already in detecting on disk will display as killed on every cache refresh until the lifecycle manager resolves them.

The on-disk guard correctly prevents lastTransitionAt from being reset, but it also skips the only code path that corrects session.status from killed back to detecting. Every 35 s cache refresh that hits list() with a still-dead runtime will cache status = killed rather than detecting, directly undermining the PR's stated goal.

packages/core/src/session-manager.ts — specifically the interaction between the enrichSessionWithRuntimeState status assignment and the new onDiskLifecycle guard in the persist block.

Important Files Changed

Filename Overview
packages/core/src/session-manager.ts The new onDiskLifecycle guard correctly prevents lastTransitionAt from being reset on every call, but it leaves session.status = "killed" (set in enrichSessionWithRuntimeState) uncorrected whenever the session is already "detecting" on disk — causing the cache to store "killed" on every refresh.
packages/web/src/app/api/sessions/patches/route.ts Switched from list() to listCached() — correct change to reduce runtime probing frequency from the 3 s dashboard poll to the 35 s cache TTL.
packages/core/src/tests/session-manager/query.test.ts Tests updated to expect "detecting" instead of "killed". They correctly validate the first-call path but do not cover the second-call scenario where on-disk state is already "detecting", which is where the status regression would surface.
CLAUDE.md Invariant comment updated to reflect that sm.list() now writes detecting rather than terminated — accurate and useful documentation.

Sequence Diagram

sequenceDiagram
    participant Dashboard as Dashboard (3 s poll)
    participant Route as /api/sessions/patches
    participant Cache as listCached() [35 s TTL]
    participant List as sm.list()
    participant Enrich as enrichSessionWithRuntimeState
    participant Disk as Session metadata (disk)
    participant LM as Lifecycle Manager (30 s poll)

    Dashboard->>Route: GET /api/sessions/patches
    Route->>Cache: listCached()
    alt cache hit
        Cache-->>Route: cached sessions
    else cache miss
        Cache->>List: list()
        List->>Disk: read raw metadata
        List->>Enrich: enrichSessionWithRuntimeState()
        Enrich->>Enrich: isAlive() returns false
        Enrich->>Enrich: "lifecycle.session.state = detecting in-memory"
        Enrich->>Enrich: "session.status = killed in-memory"
        Enrich-->>List: returns
        List->>Disk: check onDiskLifecycle.session.state
        alt not detecting or terminated or done on disk
            List->>Disk: persist detecting + runtime_lost
            List->>List: "session.status = deriveLegacyStatus = detecting"
        else already detecting on disk
            Note over List: persist skipped, session.status stays killed
        end
        List-->>Cache: sessions
        Cache-->>Route: sessions cached 35 s
    end
    Route-->>Dashboard: patches
    LM->>LM: resolveProbeDecision 3 signals
    LM->>Disk: write terminated or retry detecting
Loading

Comments Outside Diff (1)

  1. packages/core/src/session-manager.ts, line 1915-1938 (link)

    P1 session.status stays "killed" when session is already "detecting" on disk

    enrichSessionWithRuntimeState unconditionally sets session.status = "killed" at line 1030 (the TERMINAL_SESSION_STATUSES guard doesn't protect "detecting" since it isn't in that set). On the first list() call the persist block runs and corrects session.status to "detecting" via deriveLegacyStatus(persisted). But when the session is already "detecting" on disk, the new onDiskLifecycle.session.state !== "detecting" guard skips the entire block — so session.status is never corrected and the caller receives status = "killed".

    The 35 s listCached() cache defers the problem, but every cache refresh hits list() directly; if the runtime is still dead and the on-disk state is "detecting", the cached result flips back to "killed" until the lifecycle manager's probe pipeline resolves the session. This is the exact premature termination the PR is trying to prevent.

    The simplest targeted fix: after the if block, sync the legacy status back when the lifecycle is already "detecting":

    if (session.lifecycle?.session.state === "detecting" && session.status === "killed") {
      session.status = "detecting";
    }

    Alternatively, add an else branch inside the existing block that calls deriveLegacyStatus when the guard fails.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: packages/core/src/session-manager.ts
    Line: 1915-1938
    
    Comment:
    **`session.status` stays `"killed"` when session is already `"detecting"` on disk**
    
    `enrichSessionWithRuntimeState` unconditionally sets `session.status = "killed"` at line 1030 (the `TERMINAL_SESSION_STATUSES` guard doesn't protect `"detecting"` since it isn't in that set). On the first `list()` call the persist block runs and corrects `session.status` to `"detecting"` via `deriveLegacyStatus(persisted)`. But when the session is **already** `"detecting"` on disk, the new `onDiskLifecycle.session.state !== "detecting"` guard skips the entire block — so `session.status` is never corrected and the caller receives `status = "killed"`.
    
    The 35 s `listCached()` cache defers the problem, but every cache refresh hits `list()` directly; if the runtime is still dead and the on-disk state is `"detecting"`, the cached result flips back to `"killed"` until the lifecycle manager's probe pipeline resolves the session. This is the exact premature termination the PR is trying to prevent.
    
    The simplest targeted fix: after the `if` block, sync the legacy status back when the lifecycle is already `"detecting"`:
    ```ts
    if (session.lifecycle?.session.state === "detecting" && session.status === "killed") {
      session.status = "detecting";
    }
    ```
    Alternatively, add an `else` branch inside the existing block that calls `deriveLegacyStatus` when the guard fails.
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
packages/core/src/session-manager.ts:1915-1938
**`session.status` stays `"killed"` when session is already `"detecting"` on disk**

`enrichSessionWithRuntimeState` unconditionally sets `session.status = "killed"` at line 1030 (the `TERMINAL_SESSION_STATUSES` guard doesn't protect `"detecting"` since it isn't in that set). On the first `list()` call the persist block runs and corrects `session.status` to `"detecting"` via `deriveLegacyStatus(persisted)`. But when the session is **already** `"detecting"` on disk, the new `onDiskLifecycle.session.state !== "detecting"` guard skips the entire block — so `session.status` is never corrected and the caller receives `status = "killed"`.

The 35 s `listCached()` cache defers the problem, but every cache refresh hits `list()` directly; if the runtime is still dead and the on-disk state is `"detecting"`, the cached result flips back to `"killed"` until the lifecycle manager's probe pipeline resolves the session. This is the exact premature termination the PR is trying to prevent.

The simplest targeted fix: after the `if` block, sync the legacy status back when the lifecycle is already `"detecting"`:
```ts
if (session.lifecycle?.session.state === "detecting" && session.status === "killed") {
  session.status = "detecting";
}
```
Alternatively, add an `else` branch inside the existing block that calls `deriveLegacyStatus` when the guard fails.

Reviews (2): Last reviewed commit: "fix(core): skip re-persisting detecting ..." | Re-trigger Greptile

Comment thread packages/core/src/session-manager.ts
Check the on-disk lifecycle state (raw metadata) instead of the
in-memory state when deciding whether to persist. Enrichment already
sets detecting in-memory, so the previous guard always skipped the
persist block. Using the on-disk state ensures:
- First detection: persists detecting + lastTransitionAt
- Subsequent calls: skips re-write, preserving the original timestamp
@whoisasx
Copy link
Copy Markdown
Collaborator Author

whoisasx commented May 9, 2026

Addressed the bot review comment: the persist guard now checks the on-disk lifecycle state (raw metadata) instead of the in-memory state. The in-memory guard wouldn't work because enrichment already sets detecting before the persist block runs — so it would skip persisting entirely on the first call too. With the on-disk check: first detection persists detecting + lastTransitionAt, subsequent calls skip the re-write since the on-disk state is already detecting. See 350f4a7.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sm.list() bypasses lifecycle probe decision matrix — single isAlive() failure writes terminated to disk

1 participant

Morty Proxy This is a proxified and sanitized view of the page, visit original site.