Update trigger#4
Open
tylerc-govsignals wants to merge 1028 commits into
GovSignals:ConProgramming/two-phase-deployGovSignals/trigger.dev:ConProgramming/two-phase-deployfrom
…xDuration (TRI-9117) (#3529) When a Node EventEmitter (e.g. node-redis) emits an "error" event with no listener attached, Node escalates it to process.on("uncaughtException") in the task worker. The worker reported the error via the UNCAUGHT_EXCEPTION IPC event but did not exit, and the supervisor-side handler in taskRunProcess only logged the message at debug level — leaving the run() promise orphaned until maxDuration fired and producing empty attempts (durationMs=0, costInCents=0). The supervisor now rejects the in-flight attempt with an UncaughtExceptionError and gracefully terminates the worker (preserving the OTEL flush window) on UNCAUGHT_EXCEPTION. The attempt fails fast with TASK_EXECUTION_FAILED, surfacing the original error name, message, and stack trace, and falls under the normal retry policy. This mirrors the existing indexing-side behavior in indexWorkerManifest. Apply the same handling to unhandled promise rejections, which Node already routes through uncaughtException by default.
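A minimal sketch of the escalation described above, using a bare `EventEmitter` as a stand-in for a long-lived client (the `client` object and the error messages are hypothetical). It only illustrates Node's behavior when `"error"` is emitted with and without a listener; it is not the supervisor-side fix itself.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical stand-in for a long-lived client such as a Redis connection.
const client = new EventEmitter();

// With no "error" listener attached, emit("error", err) throws err.
// Inside a library this usually happens asynchronously, so the throw
// surfaces as an uncaughtException in the worker process.
function emitWithoutListener(): string {
  try {
    client.emit("error", new Error("connection lost"));
    return "no throw";
  } catch (err) {
    return (err as Error).message;
  }
}

const unhandled = emitWithoutListener();

// Attaching a listener keeps the error inside application code.
const seen: string[] = [];
client.on("error", (err: Error) => {
  seen.push(err.message);
});
const handled = client.emit("error", new Error("connection lost again"));
```

Here `unhandled` holds the thrown message, while the second emit is delivered to the listener and `emit` returns `true`.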
## Summary
Both Claude Code workflows (`claude.yml` and `claude-md-audit.yml`)
authenticated via `CLAUDE_CODE_OAUTH_TOKEN`, which broke when the org
disabled Claude subscription access for Claude Code:
> Your organization has disabled Claude subscription access for Claude
Code · Use an Anthropic API key instead, or ask your admin to enable
access
This switches both workflows to `anthropic_api_key: ${{
secrets.ANTHROPIC_API_KEY }}` (secret already added to the repo).
## Test plan
- [ ] Confirm `📝 CLAUDE.md Audit` runs to completion on this PR
- [ ] Confirm `@claude` mention in a PR comment still triggers the
`Claude Code` workflow successfully
## Summary
Stamps the active OpenTelemetry `trace_id` and `span_id` onto every Sentry event captured from the webapp, so engineers can copy a `trace_id` from a Sentry issue and search for the corresponding trace in any OTel-aware backend. Also adds an `otel_sampled` tag to indicate whether the trace was head-sampled, a cheap signal for whether the link will resolve to span data or hit a missing trace.
## Why
Sentry and OTel were disconnected: `apps/webapp/sentry.server.ts` initialised Sentry with `skipOpenTelemetrySetup: true`, and no error-capture site (`logger.server.ts`, the Remix-wrapped `handleError`, the root `ErrorBoundary`) attached OTel context to the event. With many spans/sec across services, getting from a Sentry issue to its trace was guesswork.
## Approach
Single global Sentry event processor, registered immediately after `Sentry.init`. On each event it reads `trace.getActiveSpan()?.spanContext()` via `@opentelemetry/api`, then writes:
- `event.contexts.trace.trace_id` and `event.contexts.trace.span_id` (Sentry's native trace context fields)
- `event.tags.otel_sampled` = `"true"` | `"false"` (derived from `traceFlags`)
If there is no active span (module-load errors, scheduled timers without a context, primary cluster process), the processor returns the event unmodified and Sentry's default propagation context fills in.
Implementation is co-located in `apps/webapp/sentry.server.ts` (no separate helper module: `sentry.server.ts` is built standalone by esbuild and a separate import would have required a new bundling step). Helper functions are exported so the unit tests can reach them without re-running `Sentry.init`.
## Non-goals (deliberate)
- No sample rate change. ~95% of Sentry events will carry a `trace_id` that returns no spans in the tracing backend (head-sampled out). The `otel_sampled` tag makes that obvious at a glance. Raising find-rate is a separate conversation with cost trade-offs.
- No user/org tags or `Sentry.setUser` (would need auth-helper + per-request scope wiring across multiple worker entrypoints; separate ticket).
- Webapp image only. No changes to supervisor or CLI workers.
## Test plan
- [x] Unit tests in `apps/webapp/test/sentryTraceContext.server.test.ts`: 9 tests covering: helper returns `undefined` with no active span; returns `traceId`/`spanId`/`sampled=true` for a recording span; returns `sampled=false` for a non-recording span; processor leaves the event unchanged with no active span; processor stamps `trace_id`/`span_id` onto `contexts.trace`; preserves existing `contexts.trace` fields; tags `otel_sampled` correctly for both sampled and non-sampled cases; never throws if `@opentelemetry/api` access throws.
- [x] `pnpm run typecheck --filter webapp` passes.
- [x] Manually verified end-to-end against a sandboxed Sentry project: confirmed both sampled and non-sampled traces correctly populate `contexts.trace.trace_id` matching the OTel ids logged from the loader, and the `otel_sampled` tag appears with the expected value.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
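A rough sketch of the stamping logic described in the Approach section, written without the real Sentry or OpenTelemetry types so it stays self-contained. `EventLike`, `SpanContextLike`, and `stampTraceContext` are hypothetical names, and the span lookup is injected as a callback; per the description, the actual processor reads `trace.getActiveSpan()?.spanContext()` from `@opentelemetry/api`.

```typescript
// Hypothetical minimal shapes; the real types come from the Sentry SDK
// and @opentelemetry/api.
interface SpanContextLike {
  traceId: string;
  spanId: string;
  traceFlags: number;
}

interface EventLike {
  contexts?: Record<string, Record<string, unknown> | undefined>;
  tags?: Record<string, string>;
}

// Bit 0 of traceFlags marks a sampled trace (TraceFlags.SAMPLED in OTel).
const SAMPLED_FLAG = 0x01;

function stampTraceContext(
  event: EventLike,
  getSpanContext: () => SpanContextLike | undefined
): EventLike {
  let ctx: SpanContextLike | undefined;
  try {
    ctx = getSpanContext();
  } catch {
    // Never let a failing span lookup break error reporting.
    return event;
  }
  // No active span: hand the event back untouched.
  if (!ctx) return event;

  const sampled = (ctx.traceFlags & SAMPLED_FLAG) === SAMPLED_FLAG;
  return {
    ...event,
    contexts: {
      ...event.contexts,
      // Keep any existing trace fields and overwrite only the ids.
      trace: { ...event.contexts?.trace, trace_id: ctx.traceId, span_id: ctx.spanId },
    },
    tags: { ...event.tags, otel_sampled: sampled ? "true" : "false" },
  };
}
```

Injecting the lookup keeps the function pure, which is one way to unit test the "no active span" and "lookup throws" cases without initialising either SDK.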
) When a webapp API route's catch-all 500 branch handles a non-typed exception, it returns the raw `error.message` to the caller. If the exception originates from an internal subsystem (the ORM client, an infra dependency, etc.) the server-side error string is surfaced verbatim in the response body — exposing implementation details the API surface shouldn't carry. The leak shows up in three shapes across the routes: - `return json({ error: error.message }, { status: 500 })` - `return json({ error: error instanceof Error ? error.message : "Internal Server Error" }, { status: 500 })` - ``return json({ error: `Internal server error: ${error.message}` }, { status: 500 })`` (plus a couple of analogous neverthrow-Result variants on admin routes.) ## Fix Across 19 webapp routes, replace each leaking branch with a generic body (`"Something went wrong"` / `"Internal Server Error"` to match the file's existing fallback) and add `logger.error(...)` so full visibility is preserved server-side. Catch blocks that branch on typed user-input errors (`ServiceValidationError`, `EngineServiceValidationError`, `OutOfEntitlementError`, `PrismaClientKnownRequestError`) are left intact — those messages are constructed deliberately and intended to be customer-facing. ## Test plan - [x] `pnpm run typecheck --filter webapp` - [x] Per-route manual probe: inject a synthetic `Error` at the top of the catch'd `try` block (or fake the wrapped call's rejection / Result error), curl the route with the dev API key, confirm the response body changed from the synthetic message verbatim → generic body. 21/21 leak sites verified end-to-end. - [x] 4xx-typed-error paths spot-checked: throwing `ServiceValidationError` from inside the catch'd try still surfaces its message at 422 as intended.
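The pattern in the Fix section can be sketched roughly as below. `toErrorResponse`, `ErrorResponse`, and the injected `log` callback are hypothetical names for illustration; the real routes return `json(...)` and call `logger.error(...)`. The typed-error check uses the error `name` only to keep the sketch dependency-free.

```typescript
interface ErrorResponse {
  status: number;
  body: { error: string };
}

// Hypothetical helper: builds an error that is deliberately customer-facing.
function makeValidationError(message: string): Error {
  const err = new Error(message);
  err.name = "ServiceValidationError";
  return err;
}

function toErrorResponse(
  error: unknown,
  log: (message: string, meta: unknown) => void
): ErrorResponse {
  // Typed user-input errors keep their message; it was written for the caller.
  if (error instanceof Error && error.name === "ServiceValidationError") {
    return { status: 422, body: { error: error.message } };
  }
  // Everything else is logged in full server-side and hidden from the caller.
  log("Unhandled error in API route", { error });
  return { status: 500, body: { error: "Internal Server Error" } };
}
```

For example, an ORM connection failure would be logged with its full message but reach the caller only as `"Internal Server Error"`.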
## Lots of filter UX improvements across lots of routes ### General - Promoted important filters out of the "More filters" so they're always visible - SearchInput primitive is now reusable and Esc now clears the field (AI filter input also clears with Esc) - Tooltips + keyboard shortcuts on every primary filter button - Brighter text on selected filter items / queue items - Filter dropdowns reordered for better hierarchy - Removed debounce on Tasks page search for faster filtering ### Tasks page search - Esc now clears the field - ENTER submits a search to improve performance when you have lots of tasks https://github.com/user-attachments/assets/4b30521e-dbc4-4468-b2af-8c85bdfb9002 ### Runs filters - Moves Status and Tasks out of the More filters menu - "Root only" toggle is set to false when you filter for a Task. This state isn't stored and flips back to the stored value if filters are cleared <img width="1690" height="986" alt="CleanShot 2026-04-26 at 19 24 08@2x" src="https://github.com/user-attachments/assets/b07da73c-140e-451f-a7bf-c32129317f63" /> ### Batches filters - General consistency improvements <img width="1429" height="948" alt="CleanShot 2026-05-08 at 09 50 35" src="https://github.com/user-attachments/assets/e5ec267f-2aa3-43ef-991e-93bf01bdaea5" /> ### Schedules - General consistency improvements <img width="1567" height="1141" alt="CleanShot 2026-05-08 at 09 51 11" src="https://github.com/user-attachments/assets/34b7da88-87c6-4e4d-a70f-fe13ea9f87ec" /> ### Queues - General consistency improvements <img width="824" height="416" alt="CleanShot 2026-05-08 at 09 52 02" src="https://github.com/user-attachments/assets/b4adc102-8192-4a68-b199-a175c2645a6c" /> ### Waitpoint tokens - General consistency improvements <img width="941" height="363" alt="CleanShot 2026-05-08 at 09 52 19" src="https://github.com/user-attachments/assets/d43aeb3f-7f80-454d-b183-fd077a4e3ff7" /> ### Models - General consistency improvements <img width="1570" height="509" 
alt="CleanShot 2026-05-08 at 09 53 17" src="https://github.com/user-attachments/assets/066d7646-4672-4cae-8ec0-e30a82889914" /> ### AI metrics - General consistency improvements <img width="1568" height="624" alt="CleanShot 2026-05-08 at 09 53 43" src="https://github.com/user-attachments/assets/fdfc4806-26fa-458d-a5ed-5c226b3bbc9f" /> ### Logs - General consistency improvements <img width="1267" height="752" alt="CleanShot 2026-05-08 at 09 54 30" src="https://github.com/user-attachments/assets/3e9ba871-b9dd-490e-aded-5d87134fd2bb" /> ### Errors - General consistency improvements <img width="1568" height="670" alt="CleanShot 2026-05-08 at 09 54 50" src="https://github.com/user-attachments/assets/fdda027a-e24f-4804-b4bb-203a6c2db960" /> ### Query - General consistency improvements - History, Scope, Triggered (date) filters all have shortcut tooltips - Scope filter now reuses the metrics ScopeFilter component <img width="1566" height="716" alt="CleanShot 2026-05-08 at 09 55 22" src="https://github.com/user-attachments/assets/0130b4a2-9daf-4edc-bada-3380aff4022a" /> ### Dashboards - General consistency improvements - Scope filter gets nicer icons and a shortcut - Nice icons for the Scope menu items <img width="1567" height="769" alt="CleanShot 2026-05-08 at 09 56 10" src="https://github.com/user-attachments/assets/7bea25f7-6c33-4d4a-a36d-3a1cb56afe09" /> ### Custom dashboard - General consistency improvements - Add chart, Add title, and the kebab menu now have tooltips + shortcuts <img width="1566" height="782" alt="CleanShot 2026-05-08 at 09 58 11" src="https://github.com/user-attachments/assets/9df4db25-b2c0-43a2-b92f-00256337d5a9" /> ### Environment variables - General consistency improvements <img width="1569" height="930" alt="CleanShot 2026-05-08 at 09 58 55" src="https://github.com/user-attachments/assets/26e614b4-88e7-400b-aa6d-a96bad488fb8" /> ### Preview branches - General consistency improvements <img width="1570" height="986" alt="CleanShot 2026-05-08 at 09 
59 17" src="https://github.com/user-attachments/assets/57a2b939-3670-4252-ab2c-d6dc65bdda1b" />
## Summary Adds a Redis pub/sub reload path to the webapp's in-memory LLM pricing registry. When enabled on a process, the registry reloads from the database whenever a publish lands on the configured channel — instead of waiting for the existing 5-minute interval. Lets pricing/model changes propagate to cost enrichment within seconds. Subscription is **off by default** and opt-in per process. Only OTel-ingesting services need real-time freshness; dashboard and worker services run fine on the periodic interval and shouldn't pile onto each publish with a full-table reload. ## Design When `LLM_PRICING_RELOAD_PUBSUB_ENABLED=true`, subscribes via `createRedisClient` against `COMMON_WORKER_REDIS_*` and listens on `LLM_PRICING_RELOAD_CHANNEL` (default `llm-registry:reload`). The 5-minute periodic reload stays as a backstop, and a SIGTERM/SIGINT handler closes the subscription cleanly. The publisher side lives outside this PR — any process running in the same Redis namespace can trigger a reload by `PUBLISH llm-registry:reload <anything>`. Includes a `.server-changes/` note for the changelog. ### Debounced reload Bursts of publishes are coalesced. The first publish schedules a reload at T+`LLM_PRICING_RELOAD_DEBOUNCE_MS` (default 1s); subsequent publishes during that window are no-ops because the trailing reload picks up everything when it queries the DB. Bounds reload rate to at most 1 per debounce window regardless of publisher chattiness, so a runaway upstream publisher can't fan out into a flood of full-table-scan reloads. 
## Test plan - [ ] With `LLM_PRICING_RELOAD_PUBSUB_ENABLED=false` (default): `redis-cli PUBSUB NUMSUB llm-registry:reload` returns `0` while the webapp is up - [ ] With it set to `true`: returns `>= 1` - [ ] `redis-cli PUBLISH llm-registry:reload test` returns `1` (one subscriber received) on a subscribed process - [ ] Mutate an `LlmModel` row externally, publish on the channel, observe the registry's match() picks up the change without waiting for the 5-min tick - [ ] Publish 100x in rapid succession; confirm only one reload fires within the debounce window
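The coalescing behaviour from the "Debounced reload" section could look something like this. `createDebouncedReload` and the injectable `schedule` parameter are assumptions made for the sketch (the PR only names the `LLM_PRICING_RELOAD_DEBOUNCE_MS` setting); the scheduler is injectable so the behaviour can be checked without real timers.

```typescript
type Schedule = (fn: () => void, delayMs: number) => void;

// Returns an onPublish handler that triggers at most one reload per window.
function createDebouncedReload(
  reload: () => void,
  debounceMs: number,
  schedule: Schedule = (fn, ms) => {
    setTimeout(fn, ms);
  }
): () => void {
  let pending = false;
  return function onPublish(): void {
    // A trailing reload is already scheduled; it will see this change too
    // because it queries the database when it runs.
    if (pending) return;
    pending = true;
    schedule(() => {
      pending = false;
      reload();
    }, debounceMs);
  };
}
```

With this shape, 100 publishes inside one window schedule a single reload, and the first publish after that reload fires opens a new window.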
…ze (#3538) ## Summary - Run-view inspector panel was glitching out on Firefox: visual flicker on close, locking up at min size, and intermittent `panelHasSpace` invariant errors. Root cause is the underlying `react-window-splitter` library's collapse animation, which uses `@react-spring/rafz` and interacts poorly with Firefox. - Disabled the library's collapse animation on Firefox only, app-wide (every consumer of `RESIZABLE_PANEL_ANIMATION`). Chromium and Safari behaviour is unchanged. ## Changes - **Firefox animation skip** in `RESIZABLE_PANEL_ANIMATION` — UA-detected at module load, resolves to `undefined` for Firefox so the library's animation actor completes in one frame instead of running its rAF loop. - **Inspector min raised 50px → 250px** so dragging can't shrink the panel into a near-useless width. - **`autosaveId` bumped `v2` → `v3`** to invalidate stale persisted snapshots (the library has a `// TODO` branch that ignores prop changes for already-registered panels, so existing users would otherwise still see the old 50px min). - **`react-window-splitter` pinned** to exact `0.4.1` to protect the patch from drifting if line offsets change in a patch release. - **Two hunks added to the existing `@window-splitter/state` patch:** - Removed the library's auto-collapse-on-drag block entirely. Every collapsible panel in the app is parent-controlled, and that block was triggering state-machine deadlocks when handlers were no-ops. Drag-to-collapse is now disabled across the app; collapse is only triggered explicitly (close button, ESC, URL change, etc.). - In `getDeltaForEvent`, fall back to the panel's `default` before its `min` when expanding — so the first ever click on a span opens the inspector at 500px, not 250px. ## Local testing confirmed - [x] Firefox: open a run, click various spans → panel opens instantly at 500px, drags freely between 250px and max, closes instantly to 0. No console errors. 
- [x] Chrome/Chromium: same flow, but with smooth open/close animation as before. - [x] Safari: same as Chrome. - [x] Reload mid-session → panel restores cleanly to the dragged size. - [x] Other resizable panels in the app (logs, deployments, schedules, batches, bulk-actions, runs index) still animate on Chromium/Safari. ## Notes - Linear: TRI-8584 - Branch contains intermediate commits exploring an unsuccessful snapshot-validator approach; they're reverted by the final commit. Cumulative diff is 6 files. Squash on merge if you'd prefer a clean history. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…over (#3548)
## Summary
During an ElastiCache role swap (failover) or node-type change (vertical scale), the ioredis TCP/TLS connection stays open but the server starts answering with `READONLY` (the client is talking to a node that became a replica) or `LOADING` (node still loading data from disk). Without an explicit hook, those errors surface to caller code as `ReplyError` instances, and every write op on the affected connection fails until the cluster fully cuts over.
This PR adds `reconnectOnError` to every prod ioredis client so the disconnect + reconnect + retry cycle absorbs these errors and caller code never sees them.
## Fix
```ts
export function defaultReconnectOnError(err: Error): boolean | 1 | 2 {
  const msg = err.message ?? "";
  if (msg.startsWith("READONLY") || msg.startsWith("LOADING")) return 2;
  return false;
}
```
Returning `2` tells ioredis to disconnect, reconnect, and re-issue the failed command. After reconnect, DNS / SG state routes the new socket to a writable node.
The helper lives in `@internal/redis` and is wired into both the shared `createRedisClient` (which covers RunQueue, schedule-engine, redis-worker, and every other internal-package consumer) and the direct `new Redis(...)` call sites in the webapp. V1-only marqs files are intentionally not migrated.
## Test plan
- [x] `pnpm run typecheck --filter webapp`
- [x] `pnpm run typecheck --filter @internal/run-engine`
- [x] Verified end-to-end against a live ElastiCache vertical-scale event: caller-surfaced errors went from tens of thousands during the cutover window down to a handful per ioredis client
- [ ] Confirm steady-state behavior unchanged after deploy
…3549)
## Summary
When ElastiCache demotes a primary to replica (during a Multi-AZ failover or a vertical node-type change), the demoting primary issues an `UNBLOCKED` reply to any in-flight blocking commands (`BLPOP`, `BRPOP`, `BLMOVE`, `XREADGROUP ... BLOCK`, etc.) to clear them before the role flips. ioredis surfaces these as `ReplyError` to caller code.
The shared `defaultReconnectOnError` added in #3548 only matches `READONLY` and `LOADING`. This extends it to `UNBLOCKED` so the disconnect-reconnect-retry cycle handles BLPOP-shaped errors the same way the existing two cases handle non-blocking-command errors.
## Fix
```ts
export function defaultReconnectOnError(err: Error): boolean | 1 | 2 {
  const msg = err.message ?? "";
  if (
    msg.startsWith("READONLY") ||
    msg.startsWith("LOADING") ||
    msg.startsWith("UNBLOCKED")
  ) {
    return 2;
  }
  return false;
}
```
Returning `2` tells ioredis to disconnect, reconnect, and re-issue the command. For a BLPOP that means a fresh BLPOP against the new primary instead of the `UNBLOCKED` error escaping to the caller.
## Test plan
- [ ] CI green
- [ ] Trigger a Multi-AZ failover or a vertical scale event on an ElastiCache replication group whose clients are running blocking commands and confirm no `UNBLOCKED` errors surface to caller code during the cutover.
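To make the decision table concrete, here is the helper from the Fix section run against a few sample reply messages. The message texts are illustrative assumptions about what a Redis server might send, not captured logs; only the prefixes matter to the helper.

```typescript
// Same logic as the helper in the Fix section, repeated so the sketch runs alone.
function defaultReconnectOnError(err: Error): boolean | 1 | 2 {
  const msg = err.message ?? "";
  if (
    msg.startsWith("READONLY") ||
    msg.startsWith("LOADING") ||
    msg.startsWith("UNBLOCKED")
  ) {
    return 2;
  }
  return false;
}

// Illustrative reply messages (assumed wording, real prefixes).
const sampleMessages = [
  "READONLY You can't write against a read only replica.",
  "LOADING Redis is loading the dataset in memory",
  "UNBLOCKED force unblock from blocking operation, instance state changed",
  "WRONGTYPE Operation against a key holding the wrong kind of value",
];

// 2 = disconnect, reconnect, and re-issue the command; false = surface the error.
const decisions = sampleMessages.map((m) => defaultReconnectOnError(new Error(m)));
```

The first three messages map to `2`, while the `WRONGTYPE` error maps to `false` and still reaches the caller. In ioredis the function is passed as the `reconnectOnError` client option.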
…h rate limit (#3475) ## Summary - Adds admin-only editors on the back-office org page for `Organization.maximumProjectCount` and `Organization.batchRateLimitConfig`, alongside the existing API rate limit editor. - Splits the back-office org page into per-section components (`ApiRateLimitSection`, `BatchRateLimitSection`, `MaxProjectsSection`) so each tool is self-contained — adding new sections later doesn't bloat the route. - Generalizes the rate-limit form into a reusable `RateLimitSection` component + `RateLimitDomain` server config so API and batch share the same UI, validation, and action handler. Each domain only owns its env defaults, DB column, and logger key. - "Saved." banner and validation errors are scoped to the section that submitted, not the page. Heads-up: the API rate-limit log key was renamed `admin.backOffice.rateLimit` → `admin.backOffice.apiRateLimit` for symmetry with the new `admin.backOffice.batchRateLimit`. ## Test plan - [ ] As an admin, visit `/admin/back-office/orgs/:orgId` and confirm all three sections render with the org's current values (or system defaults). - [ ] Edit and save each section; confirm only that section shows the "Saved." banner. - [ ] Submit invalid input (e.g. `0` tokens, malformed interval); confirm errors render in the offending form only and the other sections stay closed. - [ ] Confirm a non-admin user is redirected away from the route. - [ ] After saving a rate-limit override, hit the org with traffic and confirm the new limit is enforced (API rate limit + batch rate limit code paths read the column at request time).
#3554) ## Summary TTL expiration on queued runs was being scheduled twice: once via a per-run `expireRun` worker job (the original implementation) and once via the batch TTL system (added more recently). Both paths attempt to flip the same run to `EXPIRED`. The per-run job almost always won the race, leaving the batch consumer to observe runs already expired by the older path. This collapses TTL expiration onto the batch path so every queued TTLed run goes through a single Redis-backed sorted set + batch consumer instead of also getting its own scheduled redis-worker job. ## Design `engine.trigger` and `delayedRunSystem.enqueueDelayedRun` no longer call `ttlSystem.scheduleExpireRun`. The remaining `enqueueSystem.enqueueRun({ includeTtl: true })` already adds the run to the TTL sorted set; `TtlSystem.expireRunsBatch` flips it to `EXPIRED` when the TTL fires. Delayed runs get the same coverage by passing `includeTtl: true` on their post-delay enqueue, so the TTL is armed from the moment the run enters the queue (matching how the old job behaved — `parseNaturalLanguageDuration` is evaluated at enqueue time). The new path explicitly does not re-expire runs once they have been allocated a concurrency slot. That is intentional: TTL is for runs that are queued and have never started. Once a run has a slot it is on its way to executing. ## Test plan - [x] `pnpm run test --filter @internal/run-engine ./src/engine/tests/ttl.test.ts` — 15 tests, including a new "Re-enqueued runs are not expired by TTL once they have started" that locks in the queued-and-never-started contract. - [x] `pnpm run test --filter @internal/run-engine ./src/engine/tests/delays.test.ts` — 5 tests, including "Delayed run with a ttl" which now also asserts the TTL is armed from queue-enter time, not `createdAt`. - [x] `pnpm run test --filter @internal/run-engine ./src/engine/tests/lazyWaitpoint.test.ts` — 12 tests. - [x] `pnpm run typecheck --filter @internal/run-engine`.
…ges (#3559) ## Summary Make `taskIdentifier` optional on the run-queue message schema. No behavior change in this PR; readers continue to accept payloads that include the field. A separate change will stop writing it on the wire to shrink the per-run payload that lives in Redis while runs wait to be dequeued. ## Design The field is written into every payload at enqueue time but no consumer reads it back on the dequeue path. Both the run-engine and supervisor derive `taskIdentifier` from the loaded `TaskRun` row instead. Relaxing the schema first means readers tolerate payloads that omit it, so the writer-side change can ship without producing schema-parse errors during a rolling deploy. `projectId` is left required: `WorkerQueueResolver.#getOverride` reads it for project-scoped runtime worker-queue overrides. ## Test plan - [x] `pnpm run typecheck --filter @internal/run-engine` - [x] `pnpm run typecheck --filter webapp` - [x] `pnpm run test ./src/run-queue/tests/enqueueMessage.test.ts ./src/run-queue/tests/workerQueueResolver.test.ts --run` (28/28 passing)
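A hand-rolled sketch of what "relaxing the schema first" means for readers. The real schema and its validation library are not shown in this description, so `RunQueueMessage`, `parseRunQueueMessage`, and the `runId` field are hypothetical; only `projectId` (required) and `taskIdentifier` (now optional) come from the text above.

```typescript
interface RunQueueMessage {
  runId: string; // hypothetical field, for illustration only
  projectId: string; // stays required
  taskIdentifier?: string; // optional after this change
}

function parseRunQueueMessage(raw: unknown): RunQueueMessage {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("message must be an object");
  }
  const { runId, projectId, taskIdentifier } = raw as Record<string, unknown>;
  if (typeof runId !== "string") throw new Error("runId is required");
  if (typeof projectId !== "string") throw new Error("projectId is required");
  // Present-but-malformed is still an error; absent is now accepted.
  if (taskIdentifier !== undefined && typeof taskIdentifier !== "string") {
    throw new Error("taskIdentifier must be a string when present");
  }
  return taskIdentifier === undefined
    ? { runId, projectId }
    : { runId, projectId, taskIdentifier };
}
```

A reader built this way accepts payloads from both old writers (field present) and new writers (field omitted), which is what lets the writer-side change roll out without parse errors.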
### Style updates to the notifications
- Tightened up the typography
- Brighter background to make it stand out a bit more
- A bit more padding to make it more readable
- Show the close button on hover instead
- Turned the notification into a separate component as it's shared on the admin page modal
- Minor tweaks to the behavior of toggling the notification between open/closed side menu states
### Before
<img width="224" height="313" alt="before" src="https://github.com/user-attachments/assets/c9a9377c-4a3b-4477-921a-3c86385d3f0b" />
### After (with image)
<img width="239" height="284" alt="CleanShot 2026-05-11 at 17 22 01" src="https://github.com/user-attachments/assets/311b4dbc-4853-4e6c-9f83-8173b38bd466" />
### After (no image)
<img width="239" height="189" alt="after" src="https://github.com/user-attachments/assets/884e062b-3608-4cb3-a462-d50597257753" />
---------
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
## Summary
1 improvement, 1 bug fix.
## Improvements
- Fail attempts on uncaught exceptions instead of hanging to
`MAX_DURATION_EXCEEDED`. A Node `EventEmitter` (e.g. `node-redis`)
emitting `"error"` with no `.on("error", ...)` listener escalates to
`uncaughtException`, which the worker previously reported but did not
act on — runs drifted to maxDuration with empty attempts. They now fail
fast with the original error and status `FAILED`, and respect the task's
normal retry policy. You should still attach `.on("error", ...)`
listeners to long-lived clients to handle errors gracefully.
([#3529](#3529))
## Bug fixes
- Fix dev workers spinning at 100% CPU after the parent CLI disconnects.
Orphaned `trigger-dev-run-worker` (and indexer) processes were caught in
an `uncaughtException` feedback loop: a periodic IPC send via
`process.send` would throw `ERR_IPC_CHANNEL_CLOSED` once the parent
closed the channel, which re-entered the same handler that itself called
`process.send`, scheduled via `setImmediate` and amplified by
source-map-support's `prepareStackTrace`. Fixed by (1) silently dropping
packets in `ZodIpcConnection` when the channel is disconnected, (2)
adding a `process.on("disconnect", ...)` handler in dev workers so they
exit cleanly when the CLI closes the IPC channel, and (3) wrapping all
`uncaughtException`-path `process.send` calls in a `safeSend` guard that
checks `process.connected` and swallows synchronous throws.
([#3491](#3491))
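The `safeSend` guard from fix (3) might look roughly like this. The signature is an assumption: the sketch takes the channel as a parameter (`IpcChannelLike` is a hypothetical interface) so it can be exercised without a forked process, whereas the description says the real guard checks `process.connected` directly.

```typescript
// Hypothetical shape covering the two members of `process` the guard needs.
interface IpcChannelLike {
  connected: boolean;
  send?: (message: unknown) => boolean;
}

// Returns true if the message was handed to the channel, false if dropped.
function safeSend(channel: IpcChannelLike, message: unknown): boolean {
  // Drop silently when the parent has closed the channel or none exists.
  if (!channel.connected || typeof channel.send !== "function") return false;
  try {
    channel.send(message);
    return true;
  } catch {
    // Swallow synchronous throws (e.g. ERR_IPC_CHANNEL_CLOSED) so the
    // uncaughtException handler cannot re-trigger itself.
    return false;
  }
}
```

In a Node worker the call would presumably be `safeSend(process, payload)`, since `process` exposes both `connected` and an optional `send`.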
<details>
<summary>Raw changeset output</summary>
# Releases
## @trigger.dev/build@4.4.6
### Patch Changes
- Updated dependencies:
- `@trigger.dev/core@4.4.6`
## trigger.dev@4.4.6
### Patch Changes
- Fix dev workers spinning at 100% CPU after the parent CLI disconnects.
Orphaned `trigger-dev-run-worker` (and indexer) processes were caught in
an `uncaughtException` feedback loop: a periodic IPC send via
`process.send` would throw `ERR_IPC_CHANNEL_CLOSED` once the parent
closed the channel, which re-entered the same handler that itself called
`process.send`, scheduled via `setImmediate` and amplified by
source-map-support's `prepareStackTrace`. Fixed by (1) silently dropping
packets in `ZodIpcConnection` when the channel is disconnected, (2)
adding a `process.on("disconnect", ...)` handler in dev workers so they
exit cleanly when the CLI closes the IPC channel, and (3) wrapping all
`uncaughtException`-path `process.send` calls in a `safeSend` guard that
checks `process.connected` and swallows synchronous throws.
([#3491](#3491))
- Fail attempts on uncaught exceptions instead of hanging to
`MAX_DURATION_EXCEEDED`. A Node `EventEmitter` (e.g. `node-redis`)
emitting `"error"` with no `.on("error", ...)` listener escalates to
`uncaughtException`, which the worker previously reported but did not
act on — runs drifted to maxDuration with empty attempts. They now fail
fast with the original error and status `FAILED`, and respect the task's
normal retry policy. You should still attach `.on("error", ...)`
listeners to long-lived clients to handle errors gracefully.
([#3529](#3529))
- Updated dependencies:
- `@trigger.dev/core@4.4.6`
- `@trigger.dev/build@4.4.6`
- `@trigger.dev/schema-to-json@4.4.6`
## @trigger.dev/core@4.4.6
### Patch Changes
- Fix dev workers spinning at 100% CPU after the parent CLI disconnects.
Orphaned `trigger-dev-run-worker` (and indexer) processes were caught in
an `uncaughtException` feedback loop: a periodic IPC send via
`process.send` would throw `ERR_IPC_CHANNEL_CLOSED` once the parent
closed the channel, which re-entered the same handler that itself called
`process.send`, scheduled via `setImmediate` and amplified by
source-map-support's `prepareStackTrace`. Fixed by (1) silently dropping
packets in `ZodIpcConnection` when the channel is disconnected, (2)
adding a `process.on("disconnect", ...)` handler in dev workers so they
exit cleanly when the CLI closes the IPC channel, and (3) wrapping all
`uncaughtException`-path `process.send` calls in a `safeSend` guard that
checks `process.connected` and swallows synchronous throws.
([#3491](#3491))
- Fail attempts on uncaught exceptions instead of hanging to
`MAX_DURATION_EXCEEDED`. A Node `EventEmitter` (e.g. `node-redis`)
emitting `"error"` with no `.on("error", ...)` listener escalates to
`uncaughtException`, which the worker previously reported but did not
act on — runs drifted to maxDuration with empty attempts. They now fail
fast with the original error and status `FAILED`, and respect the task's
normal retry policy. You should still attach `.on("error", ...)`
listeners to long-lived clients to handle errors gracefully.
([#3529](#3529))
## @trigger.dev/python@4.4.6
### Patch Changes
- Updated dependencies:
- `@trigger.dev/core@4.4.6`
- `@trigger.dev/build@4.4.6`
- `@trigger.dev/sdk@4.4.6`
## @trigger.dev/react-hooks@4.4.6
### Patch Changes
- Updated dependencies:
- `@trigger.dev/core@4.4.6`
## @trigger.dev/redis-worker@4.4.6
### Patch Changes
- Updated dependencies:
- `@trigger.dev/core@4.4.6`
## @trigger.dev/rsc@4.4.6
### Patch Changes
- Updated dependencies:
- `@trigger.dev/core@4.4.6`
## @trigger.dev/schema-to-json@4.4.6
### Patch Changes
- Updated dependencies:
- `@trigger.dev/core@4.4.6`
## @trigger.dev/sdk@4.4.6
### Patch Changes
- Updated dependencies:
- `@trigger.dev/core@4.4.6`
</details>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…3552) Closes [TRI-9234](https://linear.app/triggerdotdev/issue/TRI-9234/retry-task-process-sigsegv-errors-respecting-user-retry-config) ## What this changes SIGSEGV crashes (`TASK_PROCESS_SIGSEGV`) will now be **retried when an attempt fails**, in line with the task's configured retry settings (`retry.maxAttempts` etc.) — the same path SIGTERM and uncaught exceptions already use. Previously SIGSEGV was hard-classified as non-retriable and failed the run on the first segfault, ignoring the user's retry policy. Tasks without a retry policy still fail fast on the first SIGSEGV. Behaviour is unchanged for OOM kills (separate machine-bump retry path) and SIGKILL_TIMEOUT. ## Deploy **Only the webapp needs to ship.** The retry decision lives entirely in the webapp: - V2 path: `internal-packages/run-engine` (bundled into the webapp) - V1 path: `apps/webapp/app/v3/services/completeAttempt.server.ts` No supervisor, CLI, SDK, or customer-task-image changes required. Customers do not need to redeploy. The `@trigger.dev/core` changeset is just keeping the public package in sync — the published npm version isn't what makes the fix work. ## Why retry SIGSEGV in Node tasks is frequently non-deterministic across processes: - **Native addon races** (`sharp`, `canvas`, `better-sqlite3`, `node-rdkafka`, `bcrypt`, …) — libuv thread-pool work stepping on V8 handles. Different heap layout / thread schedule on a fresh process → retry often succeeds. - **JIT / GC interaction** — V8 turbofan deopt or GC during a native callback. Timing-dependent. - **Near-OOM in native code** — when RSS approaches the cgroup limit, native allocations fail and poorly-written addons dereference NULL → SIGSEGV instead of clean OOM-kill. - **Host / hardware issues** — bit flips, kernel quirks. Retry lands on a different host. The genuinely deterministic case (a user-code bug always tripping the same addon) is real, but a subset — and `maxAttempts` bounds the damage. 
## Pre-existing inconsistency this resolves - `shouldRetryError` returned `false` for `TASK_PROCESS_SIGSEGV` → `fail_run`. - `shouldLookupRetrySettings` already listed `TASK_PROCESS_SIGSEGV` as retry-config-aware — but that branch was unreachable because `shouldRetryError` short-circuited first in `retrying.ts:86-90`. - We already retry `TASK_RUN_UNCAUGHT_EXCEPTION` (clearly a user-code bug) under the user's retry policy; refusing to retry SIGSEGV was the odd one out. ## Test plan - [x] `pnpm exec vitest run test/errors.test.ts` in `packages/core` — 26/26 pass (4 new) - [x] `pnpm run build --filter @trigger.dev/core` - [ ] CI green on PR 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary

Adds `.claude/REVIEW.md`, a repo-specific source of truth for what AI / agent code reviewers should treat as critical in this codebase (rolling-deploy safety, hot-table indexes, recovery-path queries, testcontainers usage, etc.). Pairs with a Claude-based PR audit that flags drift between REVIEW.md and the code as it evolves.

## How the audit works

Mirrors the existing `.github/workflows/claude-md-audit.yml` pattern. On non-draft, non-fork PRs that touch code, `anthropics/claude-code-action` reads REVIEW.md, samples the PR diff, and posts a sticky comment with up to 3 of:

- `[stale]`: rule cites a path / function / table that's been removed or renamed
- `[contradiction]`: code in the PR violates a current rule
- `[missing]`: PR introduces a new pattern future reviewers should know about
- `[obsolete]`: rule asserts a constraint the repo has moved past

If nothing's off, posts `✅ REVIEW.md looks current for this PR.`

## Test plan

- [ ] Convert this PR to ready-for-review, confirm the audit runs and posts a sticky comment
- [ ] Verify the audit doesn't run on fork PRs (gated by `head.repo.full_name == github.repository`)
- [ ] Verify suggestions are actionable on at least one follow-up PR
…3499) ## Summary Consolidates the webapp's authentication and authorization into a small set of route helpers, replacing the ad-hoc `requireUser` / `requireUserId` / `authenticatedEnvironmentForAuthentication` calls scattered across routes. Same security model, but the per-request flow (authenticate → authorize → load) now lives in one place per route family. Introduces a plugin seam (`@trigger.dev/plugins`) that lets the cloud build install a richer RBAC implementation without touching webapp code. The OSS fallback keeps the pre-RBAC permissive behaviour intact, so self-hosted deployments work unchanged. Adds a comprehensive end-to-end auth test suite that didn't exist before — 193 `it()` blocks (vitest reports ~199 after `it.each` expansion) covering API key, PAT and JWT auth across the public API surface, plus dashboard session auth for admin pages. ## Changes ### Plugin contract — `@trigger.dev/plugins` `RoleBaseAccessController` interface authoritative for both OSS (fallback) and cloud (enterprise plugin): - `authenticateBearer(request, { allowJWT? })` — API-key / public-JWT auth, returns env + ability - `authenticateSession(request, { userId, organizationId?, projectId? })` — dashboard auth, caller resolves `userId` from the session cookie and passes it in (no `helpers.getSessionUserId` callback — decouples the plugin host from session-cookie code) - `authenticatePat(request, { organizationId?, projectId? 
})` — PAT auth, returns identity + `lastAccessedAt` so the host can throttle the per-request update - `authenticateAuthorize*` variants for the auth-and-check-in-one-call cases - `isUsingPlugin(): Promise<boolean>` — capability flag for UI / branching where plugin-present-ness matters; replaces the sentinel-string coupling that had `personalAccessToken.server` matching `"RBAC plugin not installed"` literally ### Dashboard auth (started, partial rollout) Admin and settings pages migrated to a unified `dashboardLoader` / `dashboardAction` helper that authenticates the session, runs an authorization check, and exposes the result to the route. Other dashboard routes still on the old pattern; remaining migration tracked in TRI-8730. Migrated routes: - `admin.*` (14 admin / back-office / feature-flags / LLM-models / notifications / orgs / concurrency pages) - `_app.orgs.$organizationSlug.settings.team` - `_app.orgs.$organizationSlug.settings.roles` ### API / realtime / engine auth (complete for the migrated families) 71 routes migrated to a unified `apiBuilder` that centralizes Bearer / PAT / Public-JWT authentication and applies the per-route authorization check before the handler runs. Includes: - `api.v1.*` and `api.v2.*` and `api.v3.*` — tasks, runs, batches, queues, prompts, deployments, query, sessions, waitpoints, packets, workers, idempotency keys - `realtime.v1.*` — runs, batches, sessions, streams - `engine.v1.*` — dev / worker-action protocols 29 routes still on the legacy `authenticateApiRequest*` helpers — tracked as a post-deploy follow-up in TRI-9228. Multi-resource auth direction is now explicit at the call site via `anyResource(...)` (OR) and `everyResource(...)` (AND). Bare arrays no longer typecheck — fixes a class of bug where a JWT scoped to one resource could implicitly access others under OR semantics. 
PAT auth path consolidated: was three DB queries per request (legacy `authenticateApiRequestWithPersonalAccessToken` findFirst + `rbac.authenticatePat` join + `lastAccessedAt` update). Now one query in the steady state — plugin returns `lastAccessedAt`, host smart-skips the update via JS-side throttle when fresh. Side effect: action aliases preserved historic JWT scope semantics where the new model is stricter (e.g. a `write:tasks` JWT now also satisfies `trigger` / `batchTrigger` / `update` actions on the same resource — matched at the auth boundary, not in the route handler). ### Backwards-compat fixes The strict-match model regressed several real-world JWT shapes. Each preserved via explicit `anyResource(...)` entries in the route's authz block: - **Batch retrieve routes** (`api.v1.batches.$batchId`, `api.v2.*`, `realtime.v1.batches.*`) accept `read:runs` JWTs again (pre-RBAC literal-match superScope behaviour) - **Runs list routes** (`api.v1.runs`, `realtime.v1.runs`) accept type-level `read:tasks` / `read:tags` on unfiltered queries (matched the legacy `Object.keys` iteration semantic) - **PAT/OAT auth shape** normalized through `toAuthenticated` so all auth methods return the same slim `AuthenticatedEnvironment` (was: API-key returned the slim shape but PAT/OAT returned raw Prisma `Decimal` / no `orgMember`) - **Scope `:` preservation** in resource ids — `read:tags:env:staging` now correctly identifies the tag id as `env:staging`, not `env` ### Slim `AuthenticatedEnvironment` Extracted to `@trigger.dev/core/v3/auth/environment` — a structural shape independent of `@trigger.dev/database`. The plugin contract returns this; webapp consumers import from there; the cloud plugin (Drizzle) returns the same shape without Prisma's `Decimal` class leaking into the public surface. Lets internal-packages (run-engine, etc.) refer to `AuthenticatedEnvironment` without pulling Prisma in. 
### Auth test suite (new — `*.e2e.full.test.ts`) 193 e2e tests run against a real spawned webapp + Postgres (no mocks). Coverage matrix: - **API key auth** — read / write / trigger / batchTrigger / deploy actions across runs, batches, deployments, prompts, queues, query, sessions, input-streams, waitpoints, tasks, idempotency keys; multi-key resources (a run carries batch / tag / task identifiers — auth must accept any matching scope) - **Personal Access Token auth** — comprehensive matrix: scope match, scope mismatch, missing scope, expired token, malformed token - **Public JWT auth** — sub-vs-URL environment resolution, expired JWTs, signature verification, scope checking, otu (one-time-use) token semantics, branch-environment signing-key fallback - **Dashboard session auth** — admin-only pages reject non-admins; per-action gating - **Cross-cutting edge cases** — revoked API key grace window, JWT cross-environment isolation, MissingResource branch behaviour ### Hygiene cleanups - Deleted dead `app/services/authorization.server.ts` (legacy `checkAuthorization` + types — no live consumers post-migration) and its orphaned test - Dropped the never-populated `scopes` field from `ApiAuthenticationResultSuccess` - `scheduleEmail` moved out of `email.server.ts` into its own module — breaks a `commonWorker → marqs/V1` import chain that was poisoning the auth test graph - OSS Roles page shows a deployment-aware empty state ("Roles aren't available in this self-hosted deployment" vs the plan-upsell copy) via `rbac.isUsingPlugin()` - Team action handler: explicit per-intent ability gates (`manage:billing` for purchase-seats, `manage:members` for set-role + remove-member with self-leave carve-out) ### Cross-repo coordination All public-package contract changes paired in `triggerdotdev/cloud#763` (rbac-packages branch) — the enterprise plugin implements the same `RoleBaseAccessController` interface against Drizzle. 
## Test plan - [x] `pnpm run typecheck --filter webapp` clean - [x] `pnpm --filter webapp exec vitest run --config vitest.e2e.full.config.ts` — 193/193 pass (requires Docker for testcontainers) - [x] Spot-check an authed API endpoint with a valid + invalid API key against a local stack - [x] Spot-check the migrated admin pages render and gate non-admins --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
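The scope `:` preservation fix listed under the backwards-compat fixes can be illustrated with a small parser. This is a sketch under stated assumptions: `parseScope` and `ParsedScope` are hypothetical names, and the webapp's actual scope parsing may be structured differently. The point is only that the string is split on the first two colons, so everything after the second colon stays part of the resource id.

```typescript
// Hypothetical sketch: split a scope on the first two colons only.
type ParsedScope = {
  action: string;
  resourceType?: string;
  resourceId?: string;
};

function parseScope(scope: string): ParsedScope {
  const first = scope.indexOf(":");
  if (first === -1) return { action: scope };

  const action = scope.slice(0, first);
  const rest = scope.slice(first + 1);

  const second = rest.indexOf(":");
  if (second === -1) return { action, resourceType: rest };

  // Everything after the second colon is the id, including further colons.
  return {
    action,
    resourceType: rest.slice(0, second),
    resourceId: rest.slice(second + 1),
  };
}
```

A naive `scope.split(":")` that reads only the third element would yield `env` for `read:tags:env:staging`, which is the regression the PR describes.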
… queues (#3558) ## Summary Queues that use concurrency keys can no longer bypass the per-queue length cap, and the "Queued | Running" columns in the dashboard now show the true total across all CK variants instead of 0. The cap and the dashboard both relied on `ZCARD` of the base queue key, but CK-keyed runs live under `<base>:ck:<variant>` keys. Any queue that used concurrency keys read 0 — letting a single CK variant grow unbounded past the user's configured cap. ## Fix Two per-base-queue counters are maintained inside the CK Lua scripts: `<base>:lengthCounter` and `<base>:runningCounter`. Non-CK enqueue/dequeue paths are untouched. Counters are lazy-initialized the first time a CK enqueue (or nack) lands on a queue: the Lua script sums `ZCARD` across the variants tracked by `ckIndex`, sets the counter, then `INCR`s. Pre-existing CK backlog on already-populated queues is captured automatically — no batch migration required. `INCR`/`DECR` is gated on `ZADD`/`SADD` returning 1 (a new entry vs an idempotent no-op), so duplicate enqueues or re-dequeues don't inflate the counter. The counter is `SET` with a 24-hour TTL on init. `INCR`/`DECR` do not extend the TTL, so the counter expires daily and the next CK operation re-seeds it from `ckIndex`. This bounds any drift that accumulates during the rolling-deploy overlap window — where old (un-Tracked) and new (Tracked) webapp instances briefly coexist — to ≤24 hours, with no admin sweep or background reconciler needed. Read paths pipeline `ZCARD`/`SCARD` on the base key + `GET` on the counter and sum. A missing counter is treated as 0, so pure non-CK queues see the same answer as before. The counter-aware scripts ship alongside the originals with a `Tracked` suffix for rolling-deploy safety; a follow-up PR will drop the originals once this has rolled out. 
## Test plan - [ ] `pnpm run test --filter @internal/run-engine` — 116 tests pass, including a new `ckCounters.test.ts` covering lazy init from pre-existing backlog, churn, floor-at-zero, the non-CK regression case, mixed CK + non-CK on the same base queue, idempotent re-enqueue (ZADD-already-exists), 24h TTL on the counter, and nack re-seeding after counter expiry. - [ ] Verified end-to-end against a live local environment: - Triggered 24 CK enqueues across 4 variants → `lengthCounter=16`, `runningCounter=8`, dashboard showed Queued=16 / Running=8 for the CK queue. - Set the env queue cap to 16, triggered 12 more enqueues → 8 succeeded, 4 rejected with `QueueSizeLimitExceededError`. - Deleted the counter on a queue with 31 messages already sitting in CK variants, triggered one more enqueue → counter materialized to 31 from the `ckIndex` sum, then INCR'd.
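The counter behaviour described in the fix (lazy seeding from existing variants, and incrementing only when an entry is genuinely new) can be modelled in memory. The real logic lives in Redis Lua scripts; the `CkQueueModel` class and its method names below are hypothetical stand-ins, and the 24-hour TTL and the running counter are left out to keep the sketch short.

```typescript
// In-memory model of the per-base-queue length counter. Not the Lua code.
class CkQueueModel {
  // Stands in for the <base>:ck:<variant> sorted sets tracked by ckIndex.
  private variants = new Map<string, Set<string>>();
  // Stands in for <base>:lengthCounter; undefined means the key is absent.
  private lengthCounter: number | undefined;

  // Models backlog written by older, un-Tracked scripts: no counter update.
  seedBacklog(variant: string, runId: string): void {
    this.membersOf(variant).add(runId);
  }

  enqueue(variant: string, runId: string): void {
    let counter = this.lengthCounter;
    if (counter === undefined) {
      // Lazy init: sum the sizes of every known variant (ZCARD per variant).
      counter = 0;
      for (const members of this.variants.values()) counter += members.size;
    }
    const members = this.membersOf(variant);
    const isNew = !members.has(runId); // ZADD returns 1 only for new entries
    members.add(runId);
    if (isNew) counter += 1; // a duplicate enqueue must not inflate the count
    this.lengthCounter = counter;
  }

  // Models the counter key expiring; the next enqueue re-seeds it.
  expireCounter(): void {
    this.lengthCounter = undefined;
  }

  // A missing counter reads as 0, matching the described read path.
  length(): number {
    return this.lengthCounter ?? 0;
  }

  private membersOf(variant: string): Set<string> {
    let members = this.variants.get(variant);
    if (members === undefined) {
      members = new Set<string>();
      this.variants.set(variant, members);
    }
    return members;
  }
}
```

Seeding three runs and then enqueueing one more leaves the model at 4, which mirrors the "materialized from the `ckIndex` sum, then INCR'd" step in the test plan.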
## Summary

Local ClickHouse was burning ~325% CPU endlessly merging its own telemetry tables (`metric_log`, `asynchronous_metric_log`, `part_log`, `trace_log`) after the container had been running long enough to accumulate hundreds of GB of system-log data. OrbStack Helper reflected this on the host (~400% CPU).

These tables are not used by anything in the dev stack. They only exist for ClickHouse to log itself, so disabling them eliminates the merge churn entirely.

## Changes

- Adds `docker/config/clickhouse-disable-system-logs.xml`, mounted into `/etc/clickhouse-server/config.d/`, that removes the noisy system log tables via `<table remove="1"/>`.
- Mounts the override file in `docker/docker-compose.yml`.

After applying, idle CPU dropped from 325% to ~12% on my machine.

## Test plan

- [ ] `pnpm run docker` brings up the stack cleanly
- [ ] `docker stats clickhouse` shows low idle CPU
- [ ] App functionality unaffected (system log tables are not queried by the webapp)
…mpling (#3567) ## Summary Follow-up to #3561. The drift-audit workflow timed out on PR #3542 (92 files, +5962 lines) by hitting `--max-turns 15` before reaching a verdict, leaving a red ❌ on that PR with no sticky comment. ## Changes - `--max-turns` bumped from 15 to 30. - Prompt now opens with an explicit "Strategy" section: read REVIEW.md once, scan the file-list only, open at most 5 files (3-5 on PRs >50 files), and bias toward finishing over exploring. - Final rule: *"when in doubt between one more file read and finish now — finish now."* The audit is allowed to miss things. It is not allowed to time out and leave a red X. ## Test plan - [ ] Verify this PR's audit posts `✅ REVIEW.md looks current for this PR.` (small diff) - [ ] After merge, retry the audit on #3542 or a similarly large PR and confirm it completes
…#3564) ## Summary - Users on production are hitting `QuotaExceededError: Failed to execute 'setItem' on 'Storage'` when navigating runs, because their localStorage is full of orphaned `panel-group-react-aria<n>-:<rid>:` entries. - Each entry is a session-unique key written by the resizable panel library; they accumulated to thousands per user over the last two months and now block legitimate `setItem` calls (the run-view inspector can no longer persist its layout, and the page crashes mid-render). - This PR evicts the legacy entries once on client boot. The leak itself is already plugged by the v1.1.3 upgrade in #XXXX — this is the cleanup that recovers the wasted quota on existing users' machines. ## Root cause (already fixed, for context) In v0.4.1 of the underlying library, `PanelGroupImpl` defaulted `autosaveStrategy` to `"localStorage"` unconditionally — so *every* `PanelGroup` wrote to localStorage on every autosave trigger, including the four in `QueryEditor`, the one in `ReplayRunDialog`, the storybook routes, etc. Without an `autosaveId`, the key fell back to `panel-group-${useId()}`, and React Aria's `useId()` produces a new session-unique prefix each visit. Result: entries accumulated without bound across sessions. The condition was introduced when [#3282](#3282) removed the wrapper's explicit `autosaveStrategy="cookie"` override (to fix HTTP 431 cookie-size errors). That worked, but the library default that took over silently caused this leak. The v1.1.3 upgrade in the resizable-panel PR changed the default to `autosaveStrategy = autosaveId ? "localStorage" : undefined`, so no new entries are being written. Existing residue still needs to be removed from users' browsers. ## Changes - New file [`apps/webapp/app/clientBeforeFirstRender.ts`](apps/webapp/app/clientBeforeFirstRender.ts) — exports a `clientBeforeFirstRender()` function that runs synchronously, before React hydrates. 
Encapsulates a small cleanup helper that scans `localStorage` and removes: - Every key starting with `panel-group-react-aria` (the legacy auto-generated keys). - The orphan `panel-run-parent-v2` key from before the autosaveId v2→v3 bump. - [`apps/webapp/app/entry.client.tsx`](apps/webapp/app/entry.client.tsx) — imports and invokes `clientBeforeFirstRender()` once, before `hydrateRoot()`. This guarantees the cleanup completes before any `ResizablePanelGroup` mounts and tries to write. The cleanup is wrapped in `try/catch` so private-browsing / disabled-storage scenarios fail silently. Idempotent: subsequent loads find no matching keys and exit immediately. ## Test plan - [x] Locally seed ~50 fake `panel-group-react-aria…` entries plus a `panel-run-parent-v2` entry via DevTools console, hard reload → legacy entries gone, real entries (`panel-run-parent-v3`, `panel-run-tree`) preserved. - [x] Idempotency: reload a second time, no errors, no state changes. - [x] Add a control entry (`panel-run-parent-v3-but-different-suffix`) — confirmed not over-matched. - [x] Simulate broken `Storage.setItem` throwing — page still renders, cleanup swallows the error. - [x] Typecheck clean. ## Notes - Customer report: `QuotaExceededError: Failed to execute 'setItem' on 'Storage': Setting the value of 'panel-run-parent-v3' exceeded the quota.` - The cleanup runs once per page load. Once a user has loaded the app after this deploys, their localStorage is clean and the function becomes a no-op forever.
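The cleanup described above can be sketched as a function over a storage-like object. This is an illustration, not the contents of `clientBeforeFirstRender.ts`: the `evictLegacyPanelEntries` name and the `StorageLike` interface are assumptions made here so the logic can run outside a browser. In the browser, `window.localStorage` satisfies the same shape.

```typescript
// Minimal subset of the Web Storage API that the cleanup needs.
interface StorageLike {
  readonly length: number;
  key(index: number): string | null;
  removeItem(key: string): void;
}

const LEGACY_PREFIX = "panel-group-react-aria";
const ORPHAN_KEYS: readonly string[] = ["panel-run-parent-v2"];

// Hypothetical helper: returns how many entries were removed.
function evictLegacyPanelEntries(storage: StorageLike): number {
  try {
    // Collect first, then remove, so indices don't shift mid-scan.
    const doomed: string[] = [];
    for (let i = 0; i < storage.length; i++) {
      const key = storage.key(i);
      if (key === null) continue;
      if (key.startsWith(LEGACY_PREFIX) || ORPHAN_KEYS.includes(key)) {
        doomed.push(key);
      }
    }
    for (const key of doomed) storage.removeItem(key);
    return doomed.length;
  } catch {
    // Private browsing or disabled storage: fail silently.
    return 0;
  }
}
```

Because the orphan key is matched exactly rather than by prefix, a key such as `panel-run-parent-v3` is left alone, and a second call finds nothing to remove.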
## Summary

- Recommend deploying NodeLocal DNS and lowering `ndots` to `1` in the Kubernetes self-hosting guide.
- Recommend storing task events in ClickHouse (`EVENT_REPOSITORY_DEFAULT_STORE=clickhouse_v2`) in both the Docker and Kubernetes guides, plus a new row in the webapp env var reference.
`pr_checks` runs the full matrix on every PR. #3609 touched only `apps/webapp/app/routes/admin.tsx` and still ran the 4-job CLI e2e matrix and 5-job sdk-compat suite.

Adds a `changes` job using `dorny/paths-filter` and gates each tier:

- webapp + e2e-webapp: `apps/webapp/**`, `packages/**`, `internal-packages/**`
- packages: `packages/**`
- internal: `internal-packages/**` + `packages/**` (cross-deps)
- e2e (cli-v3): `packages/{cli-v3,build,core,schema-to-json}/**`
- sdk-compat: `packages/{trigger-sdk,core}/**`

`.configs/**`, `package.json`, `pnpm-lock.yaml`, `pnpm-workspace.yaml`, `turbo.json` are also included in every filter since they affect the whole workspace.

Inlines the `units` reusable-workflow children so each can be gated independently (status check names also flatten from `units / webapp / ...` to `webapp / ...`). `unit-tests.yml` is unaffected; it is still used by `publish.yml`.

Adds an `all-checks` gate that always runs and short-circuits to success when every dependent is success-or-skipped. With this in place a single required status check (`All PR Checks`) is enough; before this, `paths-ignore` would have left required checks Pending on docs/changeset PRs ([gh docs](https://docs.github.com/en/actions/managing-workflow-runs/skipping-workflow-runs)).
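The `all-checks` rule above (pass when every dependent job is success-or-skipped) is small enough to state as a predicate. The actual gate is a workflow job, so this TypeScript version is only a sketch of the decision; `allChecksPassed` and the job names in the usage note are hypothetical.

```typescript
// Job conclusions as GitHub Actions reports them for dependent jobs.
type JobResult = "success" | "failure" | "cancelled" | "skipped";

// Hypothetical predicate: a skipped job counts as non-failure.
function allChecksPassed(results: Record<string, JobResult>): boolean {
  return Object.values(results).every(
    (result) => result === "success" || result === "skipped"
  );
}
```

For a docs-only PR, an input such as `{ webapp: "skipped", packages: "skipped", typecheck: "skipped" }` satisfies the predicate, which is why the single required status check can go green instead of staying Pending.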
…nizations (#3609) Switching between the Users and Organizations tabs in the admin dashboard now keeps the current `?search=` value, so you can flip between the two without re-typing your filter. Other admin tabs don't take `search` and so don't carry it.
Adds Sessions, a durable, run-aware stream primitive that scopes session.in / session.out records to a session (not a single run). Records survive run boundaries; reconnect-from-last-event-id is built in.

Server foundation:

- New /realtime/v1/sessions/:session/:io/append + /records routes
- sessionRunManager + sessionsRepository + clickhouseSessionsRepository
- mintRunToken for short-lived per-session tokens
- s2Append retry-with-backoff + undici cause diagnostics
- /api/v[12]/packets/* exempt from customer rate limits
- BackgroundWorker schema gains taskKind enum (TASK, AGENT, SCHEDULED)
- TaskRun.taskKind column + clickhouse 029_add_task_kind_to_task_runs_v2

Core types:

- new sessionStreams, inputStreams, realtimeStreams packages in @trigger.dev/core
- session-streams-api / realtime-streams-api surface

Sessions dashboard UI (the primitive's own viewer):

- /sessions index + detail routes
- SessionsTable, SessionFilters, SessionStatus, CloseSessionDialog
- AGENT/SCHEDULED filter in RunFilters + TaskTriggerSource

Includes the sessions-primitive changeset.
`tasks.trigger`, `tasks.batchTrigger`, `batch.create`, `wait.createToken`, `wait.forDuration`, and the input/session stream waitpoint endpoints all accept a caller-supplied `idempotencyKey` and store it verbatim against a composite-unique index on `TaskRun`, `BatchTaskRun`, or `Waitpoint`. The schemas had no length cap, so a sufficiently long high-entropy key produced an index row larger than the underlying storage layer can hold. The insert failed at the database, and the caller saw a generic 500 from `RunEngineTriggerTaskService.call()` / `CreateBatchService` / waitpoint creation, depending on the endpoint. Keys produced by `idempotencyKeys.create()` are 64-character SHA-256 hashes and never trip this — it only manifests for direct REST callers (or SDK callers passing a raw string they generated themselves). Low-entropy keys also sail through, because the storage layer compresses repeated bytes before they reach the index, which is why the failure mode is intermittent and tied to caller-side key shape. ## Fix Add `.max(2048, "<field> must be 2048 characters or less")` to the seven schemas that feed an indexed `idempotencyKey` column: - `TriggerTaskRequestBody.options.idempotencyKey` - `BatchTriggerTaskItem.options.idempotencyKey` - `CreateBatchRequestBody.idempotencyKey` - `CreateWaitpointTokenRequestBody.idempotencyKey` - `CreateInputStreamWaitpointRequestBody.idempotencyKey` - `CreateSessionStreamWaitpointRequestBody.idempotencyKey` - `WaitForDurationRequestBody.idempotencyKey` Plus the `idempotency-key` HTTP header on the trigger route (and the three batch routes that re-export `HeadersSchema`). The header schema is lifted out of `api.v1.tasks.$taskId.trigger.ts` into `apps/webapp/app/v3/triggerHeaders.server.ts` so it can be exercised in tests without dragging the route's import-time side effects. The 2048 character ceiling is chosen to sit safely under the per-row index limit while staying generous against existing callers — keys that fit before still fit. 
Oversized keys now return a structured Zod 400 instead of a generic 500. Limit is documented under `Idempotency key` in `docs/limits.mdx` and as a `<Note>` on `docs/idempotency.mdx`. ## Test plan - [x] 15 schema unit tests added (`packages/core/src/v3/schemas/idempotencyKey.test.ts`, `apps/webapp/test/routes/triggerHeaders.test.ts`) — rejection-with-message + boundary acceptance for each capped schema. The webapp test exercises the extracted `TriggerHeadersSchema` directly with no mocks. - [x] `pnpm run build --filter @trigger.dev/core` - [x] `pnpm run typecheck --filter webapp` - [x] End-to-end verified locally: baseline (small key) → 200; 3000-char high-entropy header → 400 with the expected Zod error; same key at the 2048 boundary → 200; same key with the cap reverted → the database rejected the insert and the route returned 500 to the caller. Cap restored. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
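The length cap can be illustrated without any schema library. The PR itself adds a Zod `.max(2048, ...)` to each schema; the dependency-free `validateIdempotencyKey` function and `ValidationResult` type below are hypothetical stand-ins that reproduce the same boundary and error message so the behaviour is easy to check in isolation.

```typescript
const MAX_IDEMPOTENCY_KEY_LENGTH = 2048;

type ValidationResult =
  | { ok: true; value: string }
  | { ok: false; status: 400; message: string };

// Hypothetical stand-in for the schema check: reject keys over the cap
// with a structured 400 rather than letting the insert fail later.
function validateIdempotencyKey(field: string, value: string): ValidationResult {
  if (value.length > MAX_IDEMPOTENCY_KEY_LENGTH) {
    return {
      ok: false,
      status: 400,
      message: `${field} must be ${MAX_IDEMPOTENCY_KEY_LENGTH} characters or less`,
    };
  }
  return { ok: true, value };
}
```

A key of exactly 2048 characters is accepted and a 3000-character key is rejected, matching the boundary cases exercised in the test plan.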
…3542)

## Summary

A `/sessions` dashboard for inspecting durable Sessions, an `AGENT` / `SCHEDULED` task-kind filter for the runs list, and the server-side hardening (rate-limit exemption for packets, retry-with-backoff on stream appends, typed too-large-chunk error) that the `chat.agent` runtime in #3543 needs. Builds on the Sessions primitive shipped in #3417.

## Design

The Sessions list + detail routes mirror the run inspector pattern. `TaskTriggerSource` gains `AGENT` and `SCHEDULED` values, persisted on `BackgroundWorker.taskKind` and `TaskRun.taskKind` (plus a matching Clickhouse column), so the runs list can filter by kind.

New `@trigger.dev/core` modules (`sessionStreams`, `inputStreams`, a `sessionStreamInstance` for realtime streams, and the `realtime-streams-api` / `session-streams-api` surfaces) expose the typed shapes that chat.agent will use to drive `session.out`. `ChatChunkTooLargeError` lets the runtime drop oversized chunks with a typed surface instead of failing the run. `s2Append` retries transient failures with exponential backoff. `/api/v[12]/packets/*` is exempt from customer rate limits so chat snapshot reads and writes don't get throttled under load.

## Stack

Part of a 4-PR stack. Merge bottom-up.

1. **This PR** (#3542) → `main`
2. #3543 → #3542: `chat.agent` runtime + browser transport
3. #3545 → #3543: agent-view dashboard
4. #3546 → #3545: ai-chat reference + MCP tooling

Replaces #3173 (closed).

---

This is **part 5 of 5 in a stack** made with GitButler:

- <kbd> 5 </kbd> #3612
- <kbd> 4 </kbd> #3546
- <kbd> 3 </kbd> #3545
- <kbd> 2 </kbd> #3543
- <kbd> 1 </kbd> #3542 👈
The `code` paths filter currently matches `**` minus a tiny exclusion list, so a PR that only touches `.github/workflows/*.yml` still flips `code == true` and runs typecheck (~2 min on the runner).

Exclude `.github/**` from `code`, then re-include just `pr_checks.yml` and `typecheck.yml` so a change to either of those still triggers the full code check matrix.

Effect:

- workflow-only PRs (this one, future dependabot/codeql/etc.) skip typecheck; `all-checks` treats the skipped job as non-failure so the required status passes.
- modifying `pr_checks.yml` or `typecheck.yml` themselves still triggers typecheck.
- the existing per-suite filters (`webapp`, `packages`, `internal`, `cli`, `sdk`) already re-include the specific workflows that gate them, so they're unaffected.
Two defensive fixes to the native realtime backend's run-change publishing (behind a feature flag, off by default), so turning it on can never destabilize the run lifecycle.

**Never throws at the caller.** Publish sites run synchronously on the run-engine event bus and the metadata flush loop. The internal publish was already wrapped in try/catch, but lazy construction (singleton + metrics) and record encoding ran before that guard, so a throw could propagate into a run lifecycle operation. The public `publishChangeRecord` / `publishManyChangeRecords` helpers now wrap the whole call and log-and-drop on failure.

**Bounds outage buffering.** The publisher connection caps `maxRetriesPerRequest` at 1 (vs ioredis's default of 20), so during a pub/sub Redis outage a publish rejects after ~1 reconnect cycle instead of holding commands in memory for ~20s. A dropped publish is latency-only, since the consumer has a periodic backstop full-resolve. The offline queue stays on, so the first publish after a process boots still flushes once the connection is ready.
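The "never throws at the caller" fix comes down to moving lazy construction inside the guard. The sketch below shows that shape only: `makeSafePublish`, `Publisher`, and `ChangeRecord` are hypothetical names, the publish call is synchronous here for brevity, and the real `publishChangeRecord` helper may differ in signature and return value.

```typescript
type ChangeRecord = { runId: string };

interface Publisher {
  publish(record: ChangeRecord): void;
}

// Hypothetical factory: returns a publish function that logs and drops
// on any failure, including failures while building the publisher.
function makeSafePublish(
  createPublisher: () => Publisher,
  logError: (error: unknown) => void
): (record: ChangeRecord) => boolean {
  let publisher: Publisher | undefined;

  return (record) => {
    try {
      // Lazy construction happens inside the try block, so a throw here
      // cannot propagate into the caller's run lifecycle operation.
      if (publisher === undefined) publisher = createPublisher();
      publisher.publish(record);
      return true;
    } catch (error) {
      logError(error);
      return false; // dropped; the consumer's backstop resolve covers it
    }
  };
}
```

If construction fails, `publisher` stays unset and the next call tries again, so a transient startup error does not disable publishing permanently.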
…age (#3947)

## Summary

The webapp Docker image build runs `pnpm run build --filter=webapp...`, which builds `@trigger.dev/sdk` as a dependency. The SDK's `build` script recently gained a `bundle-docs` step (`tsx ../../scripts/bundleSdkDocs.ts`), but the build couldn't run it in the pruned image, breaking the image build. Two things were missing:

- `docker/Dockerfile` copied `scripts/updateVersion.ts` into the builder stage but not `scripts/bundleSdkDocs.ts`, so the step failed with `ERR_MODULE_NOT_FOUND`.
- Even with the script present, the repo-level `docs/` tree it reads is a separate workspace package that isn't in webapp's dependency graph, so `turbo prune --scope=webapp` excludes it. The script's missing-docs guard would then fail the build.

## Design

The Dockerfile now copies `bundleSdkDocs.ts` alongside `updateVersion.ts`. `bundleSdkDocs.ts` skips gracefully when the repo `docs/` tree is absent, which is exactly the pruned-dependency-build case (the SDK is compiled there but never published). Publishing always runs from the full monorepo where `docs/` exists, so the missing-docs guard still protects releases: it only fires when `docs/` is present but a cited doc is genuinely missing, rather than when the whole tree was pruned away. This avoids dragging 27M of docs into a throwaway builder stage.

## Test plan

- [x] `bundle-docs` from the full monorepo still bundles all cited docs (exit 0)
- [x] Simulated pruned tree without `docs/` skips cleanly instead of failing
- [ ] Webapp Docker image build succeeds in CI

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e menu (#3941) Major dashboard restructure plus the new task landing pages and self-serve schedules add-on integration. ## Side menu - Full restructure: standalone Tasks / Runs / Sessions block at the top; new collapsible sections for AI, Observability, Deployments, Manage - Persisted collapse state per section in `dashboardPreferences` - New / updated icons across the menu - Dashboards section: built-in Run metrics + AI metrics + custom dashboards, with drag-to-reorder via ReactGridLayout (`DashboardList.tsx`) - DevPresence connection indicator in the env selector (DEV + V2) ## Tasks (`_index` — unified Tasks page) - Replaces the separated Agents / Standard / Schedules listing pages with one table - New `UnifiedTaskListPresenter` composes `TaskListPresenter` + `AgentListPresenter` (shared `currentWorker` lookup) - Columns: Type (with kind badge), ID, File, Running (numeric for tasks; running + suspended pills for agents), Activity (24h stacked-by-status), sticky menu - Search + "Task type" multi-select filter (URL-synced) - Client-side pagination at 25/page - Right-hand "useful links" panel (cookie-persisted state) - Live-reload SSE: page revalidates on `WORKER_CREATED` so onboarding `trigger dev` flips the blank state automatically ## Agent landing page (`/agents/$agentParam`) - New per-agent detail page - Top tabs (Sessions / Runs) toggle both the chart panel and the table - Three dashboard-style chart cards: Sessions/Runs activity, LLM spend, Tokens - `AgentDetailPresenter` queries ClickHouse for run activity, session activity (with FINAL on `sessions_v1`), and LLM cost/token activity from `llm_metrics_v1` - TimeFilter at the top drives all three charts - Sticky table header, resizable horizontal handle, sidebar with Test agent button + properties - Docs link → `ai-chat/overview` ## Standard Task landing page (`/tasks/standard/$taskParam`) - New per-task detail page mirroring the Agent layout - `TaskDetailPresenter` for activity + properties - Chart panel 
wrapped in a Card with "Runs by status" header - Top bar with title, TimeFilter, pagination - Right sidebar: Test task + identifier, queue, machine, retry, TTL, payload schema, etc. ## Scheduled Task landing page (`/tasks/scheduled/$taskParam`) - New per-task detail page mirroring the Agent / Standard layout - Top-bar actions (right → left): pagination, Bulk replay…, View all runs, TimeFilter, Create schedule - Connected schedules mini-table in the sidebar - **Self-serve schedules add-on integration** (reincarnated from the now-removed `/schedules` listing page during the `origin/main` merge): - Bottom usage bar pinned via `grid-rows-[auto_1fr_auto]` — progress ring + "X/Y of your schedules" + Purchase / Upgrade / Request CTA - At-limit "Create schedule" intercept dialog - `PurchaseSchedulesModal` extracted as a shared component (`apps/webapp/app/components/schedules/PurchaseSchedulesModal.tsx`) handling increase / decrease / above-quota / need-to-delete states - New resource action route at `/resources/orgs/$organizationSlug/schedules-addon` ## Sessions - Index page: list, filters, blank state, help tooltip rework - Detail page: combined input/output chronological view (replaces split tabs) - Improved raw-message view layout (full-height) - AI payload UI: `data-*` parts grouped under "AI SDK data parts:" label - `toSafeUrl` helper guards rendered URLs from streamed content - Fix: duplicate assistant content on inspector tab switch ## Playground (Test agent) - Restructured top menu; back button + agent-selector popover - Improved blank state - Recent agent chat history moved into the tabbed menu - Better message-scroll container (full height) ## Dashboards - New Dashboards landing page (`/dashboards`) — Run metrics, AI metrics, Create your own CTAs - `BuiltInDashboards` updated; new `TasksDashboardPresenter` for the tasks overview - Custom dashboards section gains drag-to-reorder; cosmetic fix for active-row drag-handle blending ## PageHeader / shared primitives - 
`PageTitle` gains an `accessory` prop supporting string (auto-wrapped in tooltip) and ReactNode - Help tooltips on Tasks, Runs, Sessions PageTitles explaining the concept and sub-categories - `Card` primitive used for dashboard-style chart panels throughout ## Code review fixes (last batch on this branch) - ClickHouse activity queries hardened: `FINAL` + `_is_deleted = 0` on `task_runs_v2` (ReplacingMergeTree); `organization_id` + `project_id` filters for sort-key prefix; `inserted_at` partition filter on `llm_metrics_v1` - `UnifiedTaskListPresenter`: shared `currentWorker` lookup; slug-collision guard in `mergeRunningStates`; off-by-one fixed in 24h bucket alignment - `ScheduleListPresenter`: halved platform RPCs by deriving limit from `currentPlan` instead of calling `getLimit` - Sessions detail: stopped IntersectionObserver / scroll listener re-attach on every chunk; `requestAnimationFrame` deferral on auto-scroll to avoid virtualizer race - URL hardening: `?types=` validated against known kinds; new `parseFiniteInt` helper applied to `from`/`to`/`page` params - AgentView: HITL resolution buffer now cleared once parts reach a terminal state (was an unbounded Map on long sessions); subscription effect deps documented with eslint suppression - `PurchaseSchedulesModal`: bundle state resets on each open instead of persisting stale drafts ## Manual testing Manual smoke-test plan is tracked under [TRI-10883](https://linear.app/triggerdotdev/issue/TRI-10883), broken into 20 sub-issues covering onboarding, self-serve schedules, side menu, the four landing pages, sessions, runs, dashboards, regressions and performance. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
) ## Summary Several optional workflow jobs fail on forks and private mirrors that lack org-specific secrets or registry permissions. This adds per-job repository-variable gates so those deployments can switch them off without editing workflows — matching the pattern from #3901 (`ENABLE_CLAUDE_CODE` / `ENABLE_WORKFLOW_SECURITY_SCAN`). Two variables, both **default-enabled** (a job runs unless its variable is explicitly `'false'`), so canonical-repo behaviour is unchanged where the variables are unset: **`ENABLE_HELM_PRERELEASE`** — gates the chart-publish jobs that push to `oci://ghcr.io/<owner>/charts` (needs `write_package` on the owner's charts namespace): - `helm-prerelease.yml` → `prerelease` job - `release-helm.yml` → `release` job Without the permission these fail with `403: denied: permission_denied: write_package` on every PR / `helm-v*` tag. The `lint-and-test` jobs (lint + template + kubeconform, no push) always run, so chart validity is still enforced everywhere. **`ENABLE_DEPENDABOT_ALERTS`** — gates the Dependabot notifier crons that need `DEPENDABOT_ALERTS_TOKEN` / `SLACK_BOT_TOKEN` and post to a specific Slack: - `dependabot-critical-alerts.yml` → `alert` job (daily cron) - `dependabot-weekly-summary.yml` → `summary` job (weekly cron) On a fork/mirror these otherwise fire on schedule and fail (or post nowhere) indefinitely. ## Test plan - Variables unset (default): all jobs run as today. - `ENABLE_HELM_PRERELEASE=false`: helm `lint-and-test` runs, publish jobs skip — no 403 on repos lacking `write_package`. - `ENABLE_DEPENDABOT_ALERTS=false`: the two cron jobs skip cleanly (neutral, not failed). 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
)
### Summary
Self-serve billing UI is now hidden for managed-billing organizations. Plan pickers, upgrade actions, billing alerts, and related upgrade prompts are replaced with a "Contact us" option where appropriate. This uses the new `showSelfServe` subscription flag, which defaults to true for existing self-serve organizations.
### Testing
- [x] Billing pages render correctly for self-serve organizations.
- [x] Managed-billing organizations no longer see self-serve upgrade flows.
- [x] "Contact us" actions are shown instead of upgrade actions where applicable.
### Changelog
Hide self-serve billing flows for managed-billing organizations behind the new `showSelfServe` subscription flag.
## Summary
`chat.agent`'s system prompt (the `chat.prompt` text plus any skills
preamble) could not carry a provider cache breakpoint, so the largest
and most stable part of the prompt re-paid full input price on every
turn. `chat.toStreamTextOptions()` now emits the system prompt as a
structured message carrying `providerOptions` when you opt in, so a
provider can cache the system block. Without an option, `system` stays a
plain string, so existing behavior is unchanged.
## API
Three ways to opt in (most specific wins, no deep merge):
```ts
// Anthropic sugar
chat.toStreamTextOptions({ cacheControl: { type: "ephemeral" } });
// provider-agnostic (also covers Amazon Bedrock's cachePoint)
chat.toStreamTextOptions({ systemProviderOptions: { anthropic: { cacheControl: { type: "ephemeral" } } } });
// at the definition site
chat.prompt.set(SYSTEM_PROMPT, { providerOptions: { anthropic: { cacheControl: { type: "ephemeral" } } } });
```
The `cacheControl` shorthand is Anthropic-only; `systemProviderOptions`
is the general form. Pairs with a `prepareMessages` cache breakpoint to
cache the conversation prefix too.
Docs guide: #3951
…locks (#3954)
## Summary
The script that generates the changeset release PR description was silently dropping some changelog entries and stripping code examples. In [#3932](#3932), entry [#3937](#3937) was missing entirely from the Improvements list and [#3952](#3952)'s code block was gone, even though both were present in the raw changeset output.
## Root cause
`parsePrBody` parsed the raw changeset body line by line:
- The dependency-bump filter matched any entry whose text *began* with a backticked package name, so a real changelog entry like `` `@trigger.dev/sdk` now bundles... `` got thrown out along with the genuine version-bump lines.
- Only the first line of each bullet was kept, so fenced code blocks, sub-bullets, and continuation paragraphs were discarded.
## Fix
Group each top-level bullet with its indented continuation (code blocks, sub-bullets, paragraphs), dedent it, and re-emit it intact. The dependency filter is now anchored so it only matches lines that are *entirely* a package bump, leaving real entries that merely start with a package name.
Verified by replaying #3932's raw body through the script: #3937 returns to the list, #3952's code block is preserved, and #3936's sub-bullets nest correctly under their parent.
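The grouping and anchored-filter approach described in this fix could be sketched roughly as follows. This is not the actual `parsePrBody` code: the function name `groupBullets`, the two-space continuation indent, and the exact `PACKAGE_BUMP` pattern are assumptions made for illustration.

```ts
// Assumption: a "package bump" entry is a single backticked `name@version`
// token and nothing else. Anchoring with ^ and $ keeps real entries that
// merely start with a backticked package name.
const PACKAGE_BUMP = /^`[^`\s]+@\d[^`\s]*`$/;

// Hypothetical helper: groups each top-level bullet with its indented
// continuation lines, dedents them, and drops pure package-bump entries.
function groupBullets(body: string): string[] {
  const entries: string[][] = [];
  let current: string[] | null = null;
  for (const line of body.split("\n")) {
    if (line.startsWith("- ")) {
      current = [line.slice(2)]; // a new top-level bullet starts an entry
      entries.push(current);
    } else if (current !== null && line.startsWith("  ")) {
      current.push(line.slice(2)); // dedent sub-bullets, code, paragraphs
    } else if (current !== null && line.trim() === "") {
      current.push(""); // keep blank lines inside an entry
    } else {
      current = null; // headings and other text end the current entry
    }
  }
  return entries
    .map((lines) => lines.join("\n").replace(/\s+$/, ""))
    .filter((entry) => !PACKAGE_BUMP.test(entry));
}

const raw = [
  "## Patch Changes",
  "- `@trigger.dev/core@4.5.0-rc.7`",
  "- `@trigger.dev/sdk` now bundles the agent skills.",
  "",
  "  A continuation paragraph that used to be dropped.",
  "- Three fixes for custom agent loops:",
  "  - Continuation runs no longer replay answered messages.",
].join("\n");

const kept = groupBullets(raw);
```

In this sample the version-bump bullet is filtered out, while the entry that only begins with a backticked package name survives together with its continuation paragraph.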
Adds `pnpm.overrides` pinning a few transitive deps to their current releases: - `js-cookie` → 3.0.7 - `tmp` → 0.2.7 - `brace-expansion` → 1.1.13 / 2.0.3 / 5.0.6 (one entry per major) Each override is scoped to the affected major range so unaffected majors aren't dragged forward. Also drops the `fast-xml-builder` override, which no longer resolves to anything in the tree. Lockfile-only - no published package's dependencies change. `js-cookie`/`tmp` parents pin ranges that can't reach the new versions on their own, so overrides (not a plain lockfile refresh) are needed to hold them.
Currently the `db:seed` script just hangs on success. This PR adds `process.exit(0)` to the finally block after db disconnect so the script exits properly. --------- Co-authored-by: Chris Arderne <chris@trigger.dev>
…t dispatches (#3918) ### Problem Firestarter's `didWarmStart: true` means the response was written to a long-poll socket — not that the runner received it. A silently dead poller (no FIN, e.g. a VM torn down mid-poll) leaves the dispatched run stuck in `PENDING_EXECUTING` until the run engine's heartbeat redrive, and each redrive burns a queue redelivery toward `TASK_RUN_DEQUEUED_MAX_RETRIES`. ### Change After a warm-start hit, the supervisor retains the `DequeuedMessage` (TimerWheel, default 10s), then probes the existing `getLatestSnapshot` API. If the run is still on the exact dequeued snapshot, no runner ever acted — it falls through to the regular cold-create path. Recovery: ~10s + cold start, no new APIs, no CLI changes. - **Double-start safe**: `startRunAttempt` runs under a per-run lock and 409s stale snapshot ids, so a reviving runner and the fallback workload can't both execute; the loser exits before running anything. - **Probe errors → do nothing**: healthy runners legitimately act late during platform brownouts (nested attempt-start retries), so falling back on uncertainty would stampede duplicates. The heartbeat redrive stays as the backstop (also covers supervisor restarts dropping timers). - **Off by default**: `TRIGGER_WARM_START_VERIFY_ENABLED` (+ `TRIGGER_WARM_START_VERIFY_DELAY_MS`, 1–60s, default 10s). Disabled = complete no-op. Works for all workload managers (compute/k8s/docker) since it hooks the shared dequeue path. - Emits `warmstart.verify` wide events (`outcome: delivered | fallback | probe_error`), making the silent-loss rate directly measurable.
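The verification decision described above reduces to a three-way classification once the probe has returned. A minimal sketch, assuming a hypothetical `ProbeResult` shape wrapped around whatever `getLatestSnapshot` returns (the outcome names come from the `warmstart.verify` event described above; everything else is illustrative):

```ts
// Outcome names taken from the `warmstart.verify` wide event.
type VerifyOutcome = "delivered" | "fallback" | "probe_error";

// Hypothetical wrapper around the result of probing `getLatestSnapshot`.
type ProbeResult =
  | { ok: true; latestSnapshotId: string }
  | { ok: false; error: unknown };

// Decide what to do after the verify delay has elapsed.
function classifyWarmStart(
  dequeuedSnapshotId: string,
  probe: ProbeResult
): VerifyOutcome {
  // Probe errors mean "do nothing": the heartbeat redrive is the backstop.
  if (!probe.ok) return "probe_error";
  // Still on the exact dequeued snapshot: no runner acted, so cold-create.
  return probe.latestSnapshotId === dequeuedSnapshotId ? "fallback" : "delivered";
}

const stuck = classifyWarmStart("snap_1", { ok: true, latestSnapshotId: "snap_1" });
const moved = classifyWarmStart("snap_1", { ok: true, latestSnapshotId: "snap_2" });
const failed = classifyWarmStart("snap_1", { ok: false, error: new Error("timeout") });
```

Keeping the decision pure makes the "uncertainty never triggers a fallback" rule easy to unit test separately from the timer and the network call.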
…3963) ## Summary `chat.headStart` (the warm step-1 fast path) previously handed its response over only to `chat.agent`. This extends handover to the other two backends: `chat.customAgent` consumes it with `conversation.consumeHandover({ payload })` on turn 0, and `chat.createSession` surfaces it as `turn.handover` (call `turn.complete()` with no source to finalize a pure-text handover). The low-level `chat.waitForHandover()` and `accumulator.applyHandover()` are exported for hand-rolled loops. It also adds `triggerConfig` to `chat.headStart()` and `chat.openSession()`, so the auto-triggered handover-prepare run inherits tags, queue, machine, and the other session run options the same way `chat.createStartSessionAction()` does. The `chat:{chatId}` tag is prepended automatically. Because the session is created once on the first head-start turn (idempotent on the chat id), this is the only place those options can be set for a head-start chat's lifetime. ## Fix: tool-call resume When the warm step-1 hands over a pending tool call (rather than pure text), the agent loop resumes that tool round. For it to merge cleanly the pipe threads the spliced partial as `originalMessages`, so the resumed tool-output chunk attaches to the handed-over tool-call instead of throwing `No tool invocation found`. `MessageAccumulator.addResponse` now also dedups by id (replace-in-place), so the persisted history doesn't carry a duplicate assistant message when the resumed response reuses the partial's id. Incorporates the `triggerConfig` work from [#3933](#3933) by @saasjesus, with `createStartSessionAction` extended to also forward `maxDuration`, `region`, and `lockToVersion` so the two session entry points stay consistent. Verified end-to-end against a local environment: handover (pure-text and tool-call) on both new backends, a `chat.agent` regression pass, and `triggerConfig` tags and queue landing on the run. --------- Co-authored-by: saasjesus <armin@chatarmin.com>
## Summary
Reworks the scheduled task page right-hand sidebar.
- Adds **Overview** / **Schedules** tabs. The Schedules tab is a paginated table of all schedules attached to the task, declarative first.
- Surfaces schedule fields (ID, CRON + human-readable description, next/last run, status) directly in the Overview property table.
- Sidebar can be dragged much wider (up to 80% of the viewport).
- "No schedules attached" panel explains declarative vs imperative and links to docs.
- Schedule **create / edit / enable / disable / delete** all happen inside the existing Sheet, with no more navigating to the standalone schedule page. Toasts confirm each action.
## Test plan
- Open a scheduled task page and verify the new tabs
- Create, edit, enable/disable, and delete a schedule; confirm you stay on the page and see a toast each time
- Visit a task with no schedules attached and confirm the info panel renders
- Drag the sidebar wider; confirm pagination shows when there are >25 schedules
## Summary
Docs deploy from the `docs-live` branch via Mintlify, so merging to `main` no longer publishes docs on its own. To publish, push a `docs-release-*` tag at the commit you want live. The workflow runs the Mintlify broken-links check against that commit, then fast-forwards `docs-live` to it, which is what Mintlify deploys from.
## Design
The ref move uses the GitHub API with `force=false`, making it fast-forward only: a tag that is not ahead of `docs-live` fails the job rather than rewinding production. Mintlify's GitHub app reacts to the resulting push and deploys, so no extra deploy credentials are needed.
Usage:
```bash
git tag docs-release-2026.06.16   # tag the main commit you want live
git push origin docs-release-2026.06.16
```
…3964) ## Summary `chat.headStart` now works with the `chat.customAgent` and `chat.createSession` backends (not just `chat.agent`), and takes a `triggerConfig` option. These docs cover both. The Fast starts guide gets a "Handover with custom agents" section showing how each backend consumes the handover (`consumeHandover` returning `{ isFinal, skipped }` for custom agents, `turn.handover` for createSession), including threading `originalMessages` so a resumed tool round merges into the handed-over assistant. The `chat.headStart` API section documents `triggerConfig` (tags, queue, machine, and the rest) on the auto-triggered run. The reference picks up `ChatTurn.handover`, `turn.complete()` with no source, `chat.waitForHandover`, and a new `HeadStartHandlerOptions` table. Docs for the SDK changes in [#3963](#3963).
…served keys (#3966) Fix Vercel onboarding wizard to properly filter out reserved TRIGGER_ env vars
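The commit message does not say which keys count as reserved beyond the `TRIGGER_` prefix. As a rough sketch only, assuming every key with that prefix is reserved and using a hypothetical helper name, the filter could look like this:

```ts
// Assumption: every key starting with this prefix is reserved. The real
// wizard may use an explicit allow/deny list instead.
const RESERVED_PREFIX = "TRIGGER_";

// Hypothetical helper: returns a copy of `vars` without reserved keys.
function withoutReservedKeys(
  vars: Record<string, string>
): Record<string, string> {
  const result: Record<string, string> = {};
  for (const key of Object.keys(vars)) {
    if (!key.startsWith(RESERVED_PREFIX)) {
      result[key] = vars[key];
    }
  }
  return result;
}

const synced = withoutReservedKeys({
  DATABASE_URL: "postgres://example",
  TRIGGER_SECRET_KEY: "tr_dev_example",
  TRIGGERED_AT: "2026-01-01", // not reserved: no underscore after TRIGGER
});
```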
## Summary New `/ai-chat/prompt-caching` guide covering how to cache a chat agent's prompt prefix with Anthropic prompt caching: the system prompt, the conversation history (a `prepareMessages` breakpoint), and how caching interacts with compaction. It also shows how to verify cache hits via usage and the dashboard, the prefix-stability footguns, and an "Other providers" section (OpenAI and Google cache automatically; Amazon Bedrock uses `cachePoint` through `systemProviderOptions`). Registered under Features in the AI Agents nav, next to Compaction. --------- Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com> Co-authored-by: Eric Allam <ericallam@users.noreply.github.com>
## Summary The "What extractNewToolResults returns" reference in the tool-result-auditing guide did not match the SDK. It listed an `input` field that `chat.history.extractNewToolResults()` never returns, and marked `output` as optional when it is always present. This corrects the block to the real `ChatNewToolResult` shape (`toolCallId`, `toolName`, `output`, optional `errorText`). Every usage example in the same guide already reads only those fields, so the reference now matches both the examples and the code.
…3958) ## Summary The Models page is now split into two tabs. **Your models** shows the models your project has actually used in the selected time range, with usage charts (cost over time, tokens over time, calls by model), a per-model table of calls / cost / avg TTFC / avg tokens-per-sec, and calls/tokens trend sparklines. **Model library** is the full catalog, reordered from alphabetical to a relevance-based provider order (Anthropic, OpenAI, Google, then the rest), newest models first within each provider, with a "New" badge on models released in the last 7 days. One time-range selector drives the whole Your models tab, so the charts, the table, and the sparklines all share the same window. Opening a model shows its own metrics with an independent range picker and a "View in AI metrics" link that opens the AI metrics dashboard filtered to that model. The active tab is kept in the URL so it survives a refresh and is shareable. ## Prompt caching & cost accuracy Both the Your models tab and the AI metrics dashboard now surface prompt-cache usage: a cache-savings column plus per-model cached-tokens and cache-hit-rate views, and a caching section on the dashboard (hit rate, cached tokens, estimated savings, and hit rate by model). Building this surfaced a cost bug. `input_tokens` is the total prompt count and already includes cache-read and cache-creation tokens, but the cost pipeline charged the full input at the input price and then added a separate cache line, so cached tokens were billed twice (and on Anthropic, cache reads were never discounted because their price is keyed differently). The input price now applies only to the non-cached remainder, with cache prices resolved across the provider-specific keys, so LLM cost and the cache hit-rate metric are accurate. Hit rate is computed as cached reads over total input. 
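The corrected cost rule can be illustrated with a small worked example. This is a sketch under stated assumptions, not the pipeline's code: the `Usage` and `Prices` shapes are hypothetical, prices are arbitrary per-token units chosen to keep the arithmetic exact, and only the result field names (`inputCost`, `cachedCost`, `cacheCreationCost`, `outputCost`, `totalCost`) mirror the run span API described in this PR list.

```ts
// Hypothetical shapes; `inputTokens` is the total prompt count and already
// includes cache-read and cache-creation tokens.
interface Usage {
  inputTokens: number;
  cacheReadTokens: number;
  cacheCreationTokens: number;
  outputTokens: number;
}

interface Prices {
  input: number; // price per token, arbitrary units
  cacheRead: number;
  cacheCreation: number;
  output: number;
}

function computeCost(usage: Usage, prices: Prices) {
  // Charge the input price only on the non-cached remainder.
  const nonCached = Math.max(
    0,
    usage.inputTokens - usage.cacheReadTokens - usage.cacheCreationTokens
  );
  const inputCost = nonCached * prices.input;
  const cachedCost = usage.cacheReadTokens * prices.cacheRead;
  const cacheCreationCost = usage.cacheCreationTokens * prices.cacheCreation;
  const outputCost = usage.outputTokens * prices.output;
  const totalCost = inputCost + cachedCost + cacheCreationCost + outputCost;
  return { inputCost, cachedCost, cacheCreationCost, outputCost, totalCost };
}

// Hit rate: cached reads over total input.
function cacheHitRate(usage: Usage): number {
  return usage.inputTokens > 0 ? usage.cacheReadTokens / usage.inputTokens : 0;
}

const usage: Usage = {
  inputTokens: 1000,
  cacheReadTokens: 600,
  cacheCreationTokens: 100,
  outputTokens: 50,
};
const prices: Prices = { input: 10, cacheRead: 1, cacheCreation: 12, output: 50 };

const cost = computeCost(usage, prices);
const hitRate = cacheHitRate(usage);
```

With these numbers the old behavior would have charged all 1000 input tokens at the input price and then added the cache lines on top; the corrected rule charges only the 300 non-cached tokens at the input price.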
## Notes Also fixes React "invalid DOM property" console warnings from the provider icons (the Llama and DeepSeek SVGs used raw `fill-rule` / `clip-rule` / `clip-path` attributes), which this page surfaces by rendering more provider icons. ## Screenshots **Your models tab:** usage charts and a per-model table with calls/tokens trend sparklines. <img width="2560" height="1267" alt="1-your-models-tab" src="https://github.com/user-attachments/assets/859bd24f-9047-4828-8bbb-83e5882846d6" /> **Model library:** provider-relevance ordering with a "New" badge on models released in the last 7 days. <img width="2560" height="1267" alt="2-model-library-tab" src="https://github.com/user-attachments/assets/46dd54b9-80f9-4922-ade9-5935b08dfebc" /> **Model detail, Metrics tab:** per-model range picker and a "View in AI metrics" link. <img width="2560" height="1267" alt="3-model-detail-metrics" src="https://github.com/user-attachments/assets/0f65d9d0-6142-4918-93f0-110bb277101a" /> **View in AI metrics:** the dashboard deep-linked and filtered to the selected model. <img width="2560" height="1267" alt="4-ai-metrics-filtered" src="https://github.com/user-attachments/assets/821f256c-e305-493c-98c7-eafaf2f57f83" />
…#3939) ## Summary The agent skills' deep guidance now ships inside `@trigger.dev/sdk` and is read from `node_modules`, so it tracks the `@trigger.dev/sdk` version installed in your project automatically. This updates the Skills page, the Building with AI step, and the rules-redirect page to drop the old "pinned to the CLI version, re-run to refresh" framing and describe the version-pinned reference instead. Pairs with the SDK/CLI change in #3937. Keep this draft until that ships, since it describes behavior that is not released yet.
## Summary
Typing in the search bar on the task page could clear or reset the input mid-keystroke. This fixes the re-render race so the field stays stable while you type.
## Root cause
Two things compounded:
- `SearchInput`'s sync effect depended on `text`, so it re-ran on every keystroke and could overwrite the input with the URL/controlled value while focused.
- Each task row unmounted and remounted its activity chart during the side-panel open/close animation (25 charts at once), forcing heavy re-renders that the search effect raced against.
## Fix
- `SearchInput` now tracks the last synced value in a ref instead of comparing against `text`, keeping the effect off the keystroke path. It only writes to state when the incoming URL/controlled value actually changes, and never while the input is focused.
- Activity charts are now hidden (`hidden` attribute) instead of unmounted during the panel animation, so the rows don't churn the tree and the resize stays smooth.
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ngs (#3970) ## Summary Three improvements to the SDK-bundled agent skills (follow-up to the skills installer): - **`trigger-` namespace.** The installed skills (`authoring-tasks`, `getting-started`, …) had generic names that collide with unrelated skills in a shared agent skills directory. They're now prefixed — `trigger-authoring-tasks`, `trigger-getting-started`, etc. — matching the convention the public skills repo already uses. - **New `trigger-cost-savings` skill.** An MCP-driven cost audit: right-sizes machines, flags missing `maxDuration`, spots sequential triggers that could batch, and reviews schedule frequency, using `list_runs` / `get_run_details` for live analysis. - **Bundle the full docs.** `@trigger.dev/sdk` now bundles the entire "Documentation" section of the docs (157 pages) instead of a curated 55-page subset, so an agent has the complete, version-pinned reference in `node_modules`. ## How the bundling works `scripts/bundleSdkDocs.ts` now reads `docs/docs.json`, walks the "Documentation" dropdown, and copies every page under it into the SDK. The set tracks the docs navigation automatically — add a page to the nav and it ships, no skill edits needed. The API reference and Guides & examples dropdowns are intentionally excluded. A skill's `sources:` frontmatter is now informational only. The dropped idea of a dedicated `trigger-config` skill is replaced by references to the bundled build-extension docs (`config/extensions/*`) from the `trigger-authoring-tasks` config section and the chat-agent skills.
Adds an opt-in mechanism to route a configurable percentage of organizations onto the compute (MicroVM) backing of their region at trigger time, without changing their stored region settings. Routing is gated by three global feature flags - `computeMigrationEnabled`, `computeMigrationFreePercentage`, `computeMigrationPaidPercentage` - plus a per-org `computeMigrationEnabled` override that wins in both directions. A region's compute backing is resolved from a new `WorkerInstanceGroup.region` column: a container group and its MicroVM group share one geo `region`, so the migration swaps the resolved worker queue to the backing group's queue. Orgs are bucketed deterministically by id, so ramping a percentage down keeps a strict subset rather than reshuffling, and a region with no compute backing is never touched. Everything is off by default - behaviour is unchanged unless the flags are set. The flags and the worker-region groups are read on the trigger hot path from in-memory snapshots rather than the database: a small `createReloadingRegistry` helper loads each at startup and refreshes them on an interval, so no per-trigger query is added and a percentage or kill-switch change propagates within the reload interval. A cold replica whose snapshot hasn't loaded yet reads as not-migrated (the container path) and self-corrects on the next load - the same cold-start contract as the datastore / LLM-pricing registries, with a `reloading_registry_loaded` metric so a never-loaded registry is alertable. The same migration decision is consulted at deploy-time template creation so a migrated org gets a compute template built ahead of its first run. This runs in shadow mode (best-effort, never fails the deploy) by default, or - when the `computeMigrationRequireTemplate` flag is on - in required mode, built synchronously at deploy so the first run never builds on-demand and template errors surface at deploy time. 
So operators keep "which runs ran where" while customers only see geography: the run's actual worker queue is stored raw, and the geo region is stamped separately on `TaskRun.region` (and a new ClickHouse `region` column) at trigger time. Read surfaces - the dashboard, the API, and the Query/Logs page - show the geo region, falling back to the worker queue for runs written before the column existed. Minor follow-ups left out of scope: the percentage flags render as text inputs on the admin flags page (the catalog UI has no numeric control type yet), and `createReloadingRegistry` could later gain pub/sub for sub-second cross-replica propagation if the reload interval proves too slow.
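The "bucketed deterministically by id" property described above, where ramping a percentage down keeps a strict subset, follows from assigning each org a fixed bucket and comparing it to the threshold. The sketch below assumes an FNV-1a-style hash over the id's UTF-16 code units and 100 buckets; the real hash, bucket count, and function names are not given in the source and are hypothetical here, as is the simplification that only the per-org override and one percentage are consulted.

```ts
// Hypothetical: FNV-1a-style 32-bit hash over UTF-16 code units, mapped to
// one of 100 buckets. The bucket depends only on the org id.
function bucketOf(orgId: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < orgId.length; i++) {
    hash ^= orgId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0) % 100;
}

// Assumption: a per-org override, when set, wins in both directions;
// otherwise the org migrates if its bucket falls below the percentage.
function isMigrated(
  orgId: string,
  percentage: number,
  override?: boolean
): boolean {
  if (override !== undefined) return override;
  return bucketOf(orgId) < percentage;
}

// Ramping from 50% down to 10% keeps a subset of the previously migrated orgs.
const orgIds = Array.from({ length: 200 }, (_, i) => `org_${i}`);
const at50 = orgIds.filter((id) => isMigrated(id, 50));
const at10 = orgIds.filter((id) => isMigrated(id, 10));
const isSubset = at10.every((id) => at50.includes(id));
```

Because an org's bucket never changes, any org below the 10% threshold is necessarily below the 50% threshold too, which is why no reshuffling occurs when the percentage moves.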
## Summary 7 improvements. ## Improvements - `@trigger.dev/sdk` now bundles the Trigger.dev agent skills and a curated snapshot of the docs those skills reference. The skills that `trigger skills` installs into your coding agent read this content from node_modules, so the guidance your AI assistant follows is pinned to the SDK version installed in your project and stays current across upgrades instead of going stale until the next reinstall. ([#3937](#3937)) - Running a CLI command like `dev`, `deploy`, `preview`, or `update` before initializing a project no longer crashes with a raw `Cannot find matching package.json` stack trace. The CLI now detects the missing project and points you to `npx trigger.dev@latest init` instead. ([#3929](#3929)) - The agent skills installed by `trigger skills` are now namespaced with a `trigger-` prefix (e.g. `trigger-authoring-tasks`, `trigger-getting-started`) so they don't collide with unrelated skills in your coding agent's skills directory. Adds a `trigger-cost-savings` skill for auditing and reducing compute spend (right-sizing machines, `maxDuration`, batching, debounce), and `@trigger.dev/sdk` now bundles the full Trigger.dev documentation so your agent can read the complete, version-pinned reference directly from node_modules. ([#3970](#3970)) - The run span API response now includes `cachedCost` and `cacheCreationCost` on the `ai` object, alongside the existing `inputCost` / `outputCost` / `totalCost`. `inputCost` reflects only the non-cached input, so these fields let you reconstruct the full cost breakdown for prompt-cached calls. ([#3958](#3958)) - `chat.headStart` now works with the `chat.customAgent` and `chat.createSession` backends, not only `chat.agent`. The warm step-1 response hands over to your loop the same way it does for a managed agent. 
([#3963](#3963))

  In a `chat.customAgent` loop, consume the handover on turn 0:

  ```ts
  const conversation = new chat.MessageAccumulator();
  const { isFinal, skipped } = await conversation.consumeHandover({ payload });
  if (skipped) return; // warm handler aborted, so exit without a turn
  if (isFinal) {
    await chat.writeTurnComplete(); // step 1 is the response, no streamText
  } else {
    const result = streamText({ model, messages: conversation.modelMessages, tools });
    // Pass originalMessages so the handed-over tool round merges into the
    // step-1 assistant instead of starting a new message.
    const response = await chat.pipeAndCapture(result, {
      originalMessages: conversation.uiMessages,
    });
    if (response) await conversation.addResponse(response);
  }
  ```

  With `chat.createSession`, the iterator surfaces it as `turn.handover`; call `turn.complete()` with no argument on a final handover. The lower-level `chat.waitForHandover()` and `accumulator.applyHandover()` are also exported for hand-rolled loops.
- Cache your chat agent's system prompt with Anthropic prompt caching. `chat.toStreamTextOptions()` now emits the system prompt as a cacheable message when you opt in, so a large, stable system block is billed at cache-read rates on every turn instead of full price. ([#3952](#3952))

  ```ts
  // at the streamText call site (Anthropic sugar)
  streamText({
    ...chat.toStreamTextOptions({ cacheControl: { type: "ephemeral" } }),
    messages,
  });
  // provider-agnostic equivalent
  chat.toStreamTextOptions({
    systemProviderOptions: { anthropic: { cacheControl: { type: "ephemeral" } } },
  });
  // or where the prompt is defined
  chat.prompt.set(SYSTEM_PROMPT, {
    providerOptions: { anthropic: { cacheControl: { type: "ephemeral" } } },
  });
  ```

  Without an option, `system` stays a plain string. Pairs with a `prepareMessages` cache breakpoint to cache the conversation prefix across turns too.
- Three fixes for custom agent loops (`chat.customAgent`, `chat.createSession`, and hand-rolled `MessageAccumulator` loops): ([#3936](#3936)) - Continuation runs no longer replay already-answered user messages into the first turn. The `.in` resume cursor is now seeded before any listener attaches (the same boot logic `chat.agent` uses), so a chat that continues after a cancel, crash, or upgrade only sees genuinely new messages. - Steering a hand-rolled loop mid-stream no longer wipes the in-flight assistant response. `chat.pipeAndCapture` now stamps a server-generated message id on the stream, so a `prepareStep` injection keeps the partial text instead of replacing the message. - Task-backed tools (`ai.toolExecute`) now work from custom agent loops: the parent's session is threaded to the child run, so child tasks can stream progress into the chat with `chat.stream.writer({ target: "root" })` instead of failing with "session handle is not initialized". <details> <summary>Raw changeset output</summary>⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ `main` is currently in **pre mode** so this branch has prereleases rather than normal releases. If you want to exit prereleases, run `changeset pre exit` on `main`.⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ # Releases ## @trigger.dev/build@4.5.0-rc.7 ### Patch Changes - Updated dependencies: - `@trigger.dev/core@4.5.0-rc.7` ## trigger.dev@4.5.0-rc.7 ### Patch Changes - `@trigger.dev/sdk` now bundles the Trigger.dev agent skills and a curated snapshot of the docs those skills reference. The skills that `trigger skills` installs into your coding agent read this content from node_modules, so the guidance your AI assistant follows is pinned to the SDK version installed in your project and stays current across upgrades instead of going stale until the next reinstall. ([#3937](#3937)) - Running a CLI command like `dev`, `deploy`, `preview`, or `update` before initializing a project no longer crashes with a raw `Cannot find matching package.json` stack trace. 
The CLI now detects the missing project and points you to `npx trigger.dev@latest init` instead. ([#3929](#3929)) - The agent skills installed by `trigger skills` are now namespaced with a `trigger-` prefix (e.g. `trigger-authoring-tasks`, `trigger-getting-started`) so they don't collide with unrelated skills in your coding agent's skills directory. Adds a `trigger-cost-savings` skill for auditing and reducing compute spend (right-sizing machines, `maxDuration`, batching, debounce), and `@trigger.dev/sdk` now bundles the full Trigger.dev documentation so your agent can read the complete, version-pinned reference directly from node_modules. ([#3970](#3970)) - Updated dependencies: - `@trigger.dev/core@4.5.0-rc.7` - `@trigger.dev/build@4.5.0-rc.7` - `@trigger.dev/schema-to-json@4.5.0-rc.7` ## @trigger.dev/core@4.5.0-rc.7 ### Patch Changes - The run span API response now includes `cachedCost` and `cacheCreationCost` on the `ai` object, alongside the existing `inputCost` / `outputCost` / `totalCost`. `inputCost` reflects only the non-cached input, so these fields let you reconstruct the full cost breakdown for prompt-cached calls. ([#3958](#3958)) ## @trigger.dev/python@4.5.0-rc.7 ### Patch Changes - Updated dependencies: - `@trigger.dev/sdk@4.5.0-rc.7` - `@trigger.dev/core@4.5.0-rc.7` - `@trigger.dev/build@4.5.0-rc.7` ## @trigger.dev/react-hooks@4.5.0-rc.7 ### Patch Changes - Updated dependencies: - `@trigger.dev/core@4.5.0-rc.7` ## @trigger.dev/redis-worker@4.5.0-rc.7 ### Patch Changes - Updated dependencies: - `@trigger.dev/core@4.5.0-rc.7` ## @trigger.dev/rsc@4.5.0-rc.7 ### Patch Changes - Updated dependencies: - `@trigger.dev/core@4.5.0-rc.7` ## @trigger.dev/schema-to-json@4.5.0-rc.7 ### Patch Changes - Updated dependencies: - `@trigger.dev/core@4.5.0-rc.7` ## @trigger.dev/sdk@4.5.0-rc.7 ### Patch Changes - `@trigger.dev/sdk` now bundles the Trigger.dev agent skills and a curated snapshot of the docs those skills reference. 
The skills that `trigger skills` installs into your coding agent read this content from node_modules, so the guidance your AI assistant follows is pinned to the SDK version installed in your project and stays current across upgrades instead of going stale until the next reinstall. ([#3937](#3937))

- `chat.headStart` now works with the `chat.customAgent` and `chat.createSession` backends, not only `chat.agent`. The warm step-1 response hands over to your loop the same way it does for a managed agent. ([#3963](#3963))

  In a `chat.customAgent` loop, consume the handover on turn 0:

  ```ts
  const conversation = new chat.MessageAccumulator();
  const { isFinal, skipped } = await conversation.consumeHandover({ payload });
  if (skipped) return; // warm handler aborted, so exit without a turn

  if (isFinal) {
    await chat.writeTurnComplete(); // step 1 is the response, no streamText
  } else {
    const result = streamText({ model, messages: conversation.modelMessages, tools });
    // Pass originalMessages so the handed-over tool round merges into the
    // step-1 assistant instead of starting a new message.
    const response = await chat.pipeAndCapture(result, {
      originalMessages: conversation.uiMessages,
    });
    if (response) await conversation.addResponse(response);
  }
  ```

  With `chat.createSession`, the iterator surfaces it as `turn.handover`; call `turn.complete()` with no argument on a final handover. The lower-level `chat.waitForHandover()` and `accumulator.applyHandover()` are also exported for hand-rolled loops.

- Add `triggerConfig` support to `chat.headStart()` and `chat.openSession()`, so the auto-triggered handover-prepare run inherits tags, queue, machine, and other session trigger options the same way `chat.createStartSessionAction()` does. The `chat:{chatId}` tag is prepended automatically. ([#3963](#3963))

  ```ts
  export const POST = chat.headStart({
    agentId: "my-agent",
    triggerConfig: { tags: ["org:acme"], queue: "chat" },
    run: async ({ chat }) => streamText({ ...chat.toStreamTextOptions(), model }),
  });
  ```

  Because the session is created once on the first head-start turn and is idempotent on the chat id, this is the only place to set those options for a head-start chat's lifetime. `chat.createStartSessionAction()` now also forwards `maxDuration`, `region`, and `lockToVersion` so both session entry points stay consistent.

- Cache your chat agent's system prompt with Anthropic prompt caching. `chat.toStreamTextOptions()` now emits the system prompt as a cacheable message when you opt in, so a large, stable system block is billed at cache-read rates on every turn instead of full price. ([#3952](#3952))

  ```ts
  // at the streamText call site (Anthropic sugar)
  streamText({
    ...chat.toStreamTextOptions({ cacheControl: { type: "ephemeral" } }),
    messages,
  });

  // provider-agnostic equivalent
  chat.toStreamTextOptions({
    systemProviderOptions: { anthropic: { cacheControl: { type: "ephemeral" } } },
  });

  // or where the prompt is defined
  chat.prompt.set(SYSTEM_PROMPT, {
    providerOptions: { anthropic: { cacheControl: { type: "ephemeral" } } },
  });
  ```

  Without an option, `system` stays a plain string. Pairs with a `prepareMessages` cache breakpoint to cache the conversation prefix across turns too.

- Three fixes for custom agent loops (`chat.customAgent`, `chat.createSession`, and hand-rolled `MessageAccumulator` loops): ([#3936](#3936))

  - Continuation runs no longer replay already-answered user messages into the first turn. The `.in` resume cursor is now seeded before any listener attaches (the same boot logic `chat.agent` uses), so a chat that continues after a cancel, crash, or upgrade only sees genuinely new messages.
  - Steering a hand-rolled loop mid-stream no longer wipes the in-flight assistant response. `chat.pipeAndCapture` now stamps a server-generated message id on the stream, so a `prepareStep` injection keeps the partial text instead of replacing the message.
  - Task-backed tools (`ai.toolExecute`) now work from custom agent loops: the parent's session is threaded to the child run, so child tasks can stream progress into the chat with `chat.stream.writer({ target: "root" })` instead of failing with "session handle is not initialized".

- The agent skills installed by `trigger skills` are now namespaced with a `trigger-` prefix (e.g. `trigger-authoring-tasks`, `trigger-getting-started`) so they don't collide with unrelated skills in your coding agent's skills directory. Adds a `trigger-cost-savings` skill for auditing and reducing compute spend (right-sizing machines, `maxDuration`, batching, debounce), and `@trigger.dev/sdk` now bundles the full Trigger.dev documentation so your agent can read the complete, version-pinned reference directly from node_modules. ([#3970](#3970))

- Updated dependencies:
  - `@trigger.dev/core@4.5.0-rc.7`

## @trigger.dev/plugins@4.5.0-rc.7

### Patch Changes

- Updated dependencies:
  - `@trigger.dev/core@4.5.0-rc.7`

</details>

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Replicates `TaskRun.planType` into the `task_runs_v2` ClickHouse table so run analytics can group by plan type. Adds a `plan_type` column (goose migration `033`, `LowCardinality(String)`), the replication insert mapping, and the matching schema/column/type entries - same shape as the recent `region` addition. Write-once at trigger, so it just rides along on existing replicated rows. Internal analytics only; not exposed in the Query API.
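As a rough illustration of the replication insert mapping described above, the sketch below turns a run record into a ClickHouse row. The type names, the `run_id` column, and the `toTaskRunV2Row` helper are hypothetical; only `planType`, `plan_type`, and `region` come from the description. It also assumes that a missing plan type is written as an empty string, since `LowCardinality(String)` is not nullable.

```typescript
// Hypothetical shapes: the real replication code and full column list are not shown here.
type TaskRunRecord = { id: string; region: string | null; planType: string | null };
type TaskRunV2Row = { run_id: string; region: string; plan_type: string };

// Assumption: null values are stored as "" because the column type is
// LowCardinality(String), not LowCardinality(Nullable(String)).
function toTaskRunV2Row(run: TaskRunRecord): TaskRunV2Row {
  return {
    run_id: run.id,
    region: run.region ?? "",
    plan_type: run.planType ?? "",
  };
}
```

Because the value is written once at trigger time, a mapping like this would never need to emit an update for `plan_type` on later replicated versions of the same row.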
#3960)

## Summary

Prisma infrastructure failures (P1xxx-class: database unreachable, timed out, connection dropped, engine init/panic) carry the database hostname in their `.message`. This captures them centrally for observability and ensures they never reach API clients verbatim.

## Design

A `$allOperations` client extension on the writer and replica clients logs infrastructure errors with the originating model and operation, then rethrows the **original** error unchanged, so call sites that branch on `error.code` (unique-violation idempotency, not-found handling) and transaction retries keep working. Only infrastructure errors are logged; routine query/validation errors (P2xxx) are left alone.

`$allOperations` can't see the transaction boundary (`$transaction` is a client method, not an operation), so infrastructure errors surfacing from `$transaction()` without a Prisma code (e.g. `PrismaClientInitializationError`) are logged separately at the transaction wrapper, where the existing coded-error path would otherwise miss them.

`clientSafeErrorMessage()` swaps an infrastructure error's message for `"Internal Server Error"` at the API routes that previously returned `error.message` raw. Status codes, headers, and every non-infrastructure message are unchanged.

## Test plan

- [x] P2002 / P2025 rethrow with code intact and are not logged
- [x] Statement errors inside `$transaction` keep their code (retry logic intact)
- [x] Raw queries wrapped without crashing on the undefined model
- [x] A genuine connectivity failure is logged with model/operation/code
- [x] `clientSafeErrorMessage` obfuscates infra messages, preserves all others
- [x] `pnpm run typecheck --filter webapp` (12/12)

## Note

Overlaps with #3391 (Prisma 7 migration) on `apps/webapp/app/db.server.ts`; coordinate rebasing.
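The log-then-rethrow design above can be sketched without depending on Prisma itself. This is a minimal illustration under stated assumptions, not the PR's implementation: `isInfrastructureError` and `withInfraLogging` are hypothetical helpers, the P1xxx check is a simple pattern match on `code`, and the signature of `clientSafeErrorMessage` is a guess at how such a function might look.

```typescript
// Hypothetical sketch: helper names and bodies are illustrative, not taken from the PR.
type ErrorLike = { name?: unknown; code?: unknown };

// Prisma error classes that carry no `code` but still indicate an infrastructure failure.
const UNCODED_INFRA_ERRORS = new Set([
  "PrismaClientInitializationError",
  "PrismaClientRustPanicError",
]);

function isInfrastructureError(error: unknown): boolean {
  if (typeof error !== "object" || error === null) return false;
  const { name, code } = error as ErrorLike;
  // P1xxx codes cover connectivity, timeout, and engine-level failures.
  if (typeof code === "string" && /^P1\d{3}$/.test(code)) return true;
  return typeof name === "string" && UNCODED_INFRA_ERRORS.has(name);
}

// Assumed signature: hides infrastructure messages, passes every other message through.
function clientSafeErrorMessage(error: unknown): string {
  if (isInfrastructureError(error)) return "Internal Server Error";
  return error instanceof Error ? error.message : String(error);
}

// Wraps one operation: logs infrastructure errors with their context,
// then rethrows the original error so callers can still branch on `error.code`.
async function withInfraLogging<T>(
  context: { model?: string; operation: string },
  run: () => Promise<T>,
  log: (entry: Record<string, unknown>) => void = console.error
): Promise<T> {
  try {
    return await run();
  } catch (error) {
    if (isInfrastructureError(error)) {
      log({ ...context, code: (error as ErrorLike).code, error });
    }
    throw error;
  }
}
```

In a real client, the body of `withInfraLogging` would sit inside the `$allOperations` hook, with `model` left undefined for raw queries. Rethrowing the same error object, rather than a wrapped one, is what keeps P2002 and P2025 handling at call sites intact.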
The global feature flags admin page had a few rough edges.

The percentage flags are numeric (`z.coerce.number()`) but rendered as free-text inputs, so you could type non-numeric values that only failed validation after submitting - and the error surfaced behind the confirm dialog. The control-type detection now recognises numbers and renders a proper number input, with the min/max range as the placeholder so the type is clear even when the field is unset. The save error also shows inside the confirm dialog now, not just behind it.

The action buttons were unreachable without zooming out. The admin layout wrapped each page in a plain block, so `h-full` page content overran the viewport by the height of the tab bar and got clipped by the `overflow-hidden` body. Making the layout a flex column bounds each page to the space below the tabs, so the existing per-page scroll works and the feature flags page scrolls like the Users/Orgs tabs. Also capped the confirm dialog's diff list so its footer stays on screen when there are many changes.
Closes #
✅ Checklist
Testing
[Describe the steps you took to test this change]
Changelog
[Short description of what has changed]
Screenshots
[Screenshots]
💯