Update trigger#4

Open
tylerc-govsignals wants to merge 1028 commits into GovSignals:ConProgramming/two-phase-deploy from triggerdotdev:main

Conversation

@tylerc-govsignals

Collaborator

Closes #

✅ Checklist

  • I have followed every step in the contributing guide
  • The PR title follows the convention.
  • I ran and tested the code works

Testing

[Describe the steps you took to test this change]


Changelog

[Short description of what has changed]


Screenshots

[Screenshots]

💯

matt-aitken and others added 30 commits May 6, 2026 19:35
…xDuration (TRI-9117) (#3529)

When a Node EventEmitter (e.g. node-redis) emits an "error" event with
no listener attached, Node escalates it to
process.on("uncaughtException") in the task worker. The worker reported
the error via the UNCAUGHT_EXCEPTION IPC event but did not exit, and the
supervisor-side handler in taskRunProcess only logged the message at
debug level, leaving the run() promise orphaned until maxDuration fired
and producing empty attempts (durationMs=0, costInCents=0).

The supervisor now rejects the in-flight attempt with an
UncaughtExceptionError and gracefully terminates the worker (preserving
the OTEL flush window) on UNCAUGHT_EXCEPTION. The attempt fails fast
with TASK_EXECUTION_FAILED, surfacing the original error name, message,
and stack trace, and falls under the normal retry policy. This mirrors
the existing indexing-side behavior in indexWorkerManifest. Apply the
same handling to unhandled promise rejections, which Node already routes
through uncaughtException by default.
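
A minimal sketch of the fail-fast behaviour described above, assuming a callback-style attempt tracker. `AttemptSupervisorSketch`, `AttemptResult`, and `SerializedError` are hypothetical names for illustration, not the real `taskRunProcess` API; only the `TASK_EXECUTION_FAILED` code comes from the text.

```typescript
// Hypothetical shapes for illustration; not the real supervisor types.
interface SerializedError {
  name: string;
  message: string;
  stackTrace?: string;
}

type AttemptResult =
  | { ok: true; output: string }
  | { ok: false; code: "TASK_EXECUTION_FAILED"; error: SerializedError };

class AttemptSupervisorSketch {
  private onSettled: ((result: AttemptResult) => void) | undefined = undefined;
  terminateRequested = false;

  // Register the in-flight attempt.
  startAttempt(onSettled: (result: AttemptResult) => void): void {
    this.onSettled = onSettled;
  }

  // Stand-in for the handler of the worker's UNCAUGHT_EXCEPTION IPC event.
  handleUncaughtException(error: SerializedError): void {
    const settle = this.onSettled;
    this.onSettled = undefined; // settle the attempt at most once
    settle?.({ ok: false, code: "TASK_EXECUTION_FAILED", error });
    // Stand-in for graceful termination (the real code keeps a flush window).
    this.terminateRequested = true;
  }
}
```

The point of the sketch is that the attempt settles as soon as the exception is reported, instead of waiting for a maxDuration timer.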
## Summary

Both Claude Code workflows (`claude.yml` and `claude-md-audit.yml`)
authenticated via `CLAUDE_CODE_OAUTH_TOKEN`, which broke when the org
disabled Claude subscription access for Claude Code:

> Your organization has disabled Claude subscription access for Claude
Code · Use an Anthropic API key instead, or ask your admin to enable
access

This switches both workflows to `anthropic_api_key: ${{
secrets.ANTHROPIC_API_KEY }}` (secret already added to the repo).

## Test plan

- [ ] Confirm `📝 CLAUDE.md Audit` runs to completion on this PR
- [ ] Confirm `@claude` mention in a PR comment still triggers the
`Claude Code` workflow successfully
## Summary

Stamps the active OpenTelemetry `trace_id` and `span_id` onto every
Sentry event captured from the webapp, so engineers can copy a
`trace_id` from a Sentry issue and search for the corresponding trace in
any OTel-aware backend. Also adds an `otel_sampled` tag to indicate
whether the trace was head-sampled — a cheap signal for whether the link
will resolve to span data or hit a missing trace.

## Why

Sentry and OTel were disconnected: `apps/webapp/sentry.server.ts`
initialised Sentry with `skipOpenTelemetrySetup: true`, and no
error-capture site (`logger.server.ts`, the Remix-wrapped `handleError`,
the root `ErrorBoundary`) attached OTel context to the event. With many
spans/sec across services, getting from a Sentry issue to its trace was
guesswork.

## Approach

Single global Sentry event processor, registered immediately after
`Sentry.init`. On each event it reads
`trace.getActiveSpan()?.spanContext()` via `@opentelemetry/api`, then
writes:

- `event.contexts.trace.trace_id` and `event.contexts.trace.span_id`
(Sentry's native trace context fields)
- `event.tags.otel_sampled` = `"true"` | `"false"` (derived from
`traceFlags`)

If no active span (module-load errors, scheduled timers without a
context, primary cluster process), the processor returns the event
unmodified — Sentry's default propagation context fills in.

Implementation is co-located in `apps/webapp/sentry.server.ts` (no
separate helper module — `sentry.server.ts` is built standalone by
esbuild and a separate import would have required a new bundling step).
Helper functions are exported so the unit tests can reach them without
re-running `Sentry.init`.
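
For reviewers, a rough sketch of what such a processor could look like. It uses structural stand-ins instead of importing `@sentry/*` or `@opentelemetry/api`, so `EventLike`, `SpanContextLike`, and `stampTraceContext` are assumptions for illustration and not the exported helpers from this PR. The sampled bit (`0x01`) is the W3C trace-flags sampled flag.

```typescript
// Structural stand-ins for the OTel span context and the Sentry event.
interface SpanContextLike {
  traceId: string;
  spanId: string;
  traceFlags: number;
}

interface EventLike {
  contexts?: {
    trace?: Record<string, unknown>;
    [key: string]: Record<string, unknown> | undefined;
  };
  tags?: Record<string, string>;
}

const SAMPLED_FLAG = 0x01; // W3C trace-flags "sampled" bit

function stampTraceContext(
  event: EventLike,
  getSpanContext: () => SpanContextLike | undefined
): EventLike {
  let ctx: SpanContextLike | undefined;
  try {
    ctx = getSpanContext();
  } catch {
    return event; // never let context lookup break error reporting
  }
  if (!ctx) return event; // no active span: leave the event untouched

  const sampled = (ctx.traceFlags & SAMPLED_FLAG) === SAMPLED_FLAG;
  return {
    ...event,
    contexts: {
      ...event.contexts,
      // Preserve existing trace fields, then overwrite the ids.
      trace: { ...event.contexts?.trace, trace_id: ctx.traceId, span_id: ctx.spanId },
    },
    tags: { ...event.tags, otel_sampled: sampled ? "true" : "false" },
  };
}
```

In the webapp the second argument would presumably wrap `trace.getActiveSpan()?.spanContext()`; it is injected here only to keep the sketch dependency-free.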

## Non-goals (deliberate)

- No sample rate change. ~95% of Sentry events will carry a `trace_id`
that returns no spans in the tracing backend (head-sampled out). The
`otel_sampled` tag makes that obvious at a glance. Raising find-rate is
a separate conversation with cost trade-offs.
- No user/org tags or `Sentry.setUser` (would need auth-helper +
per-request scope wiring across multiple worker entrypoints — separate
ticket).
- Webapp image only. No changes to supervisor or CLI workers.

## Test plan

- [x] Unit tests in `apps/webapp/test/sentryTraceContext.server.test.ts`,
9 tests covering: helper returns `undefined` with no active span;
returns `traceId`/`spanId`/`sampled=true` for a recording span;
returns `sampled=false` for a non-recording span; processor leaves the
event unchanged with no active span; processor stamps
`trace_id`/`span_id` onto `contexts.trace`; preserves existing
`contexts.trace` fields; tags `otel_sampled` correctly for both
sampled and non-sampled cases; never throws if `@opentelemetry/api`
access throws.
- [x] `pnpm run typecheck --filter webapp` passes.
- [x] Manually verified end-to-end against a sandboxed Sentry project:
confirmed both sampled and non-sampled traces correctly populate
`contexts.trace.trace_id` matching the OTel ids logged from the
loader, and the `otel_sampled` tag appears with the expected value.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
)

When a webapp API route's catch-all 500 branch handles a non-typed
exception, it returns the raw `error.message` to the caller. If the
exception originates from an internal subsystem (the ORM client, an
infra dependency, etc.) the server-side error string is surfaced
verbatim in the response body — exposing implementation details the API
surface shouldn't carry.

The leak shows up in three shapes across the routes:

- `return json({ error: error.message }, { status: 500 })`
- `return json({ error: error instanceof Error ? error.message :
"Internal Server Error" }, { status: 500 })`
- ``return json({ error: `Internal server error: ${error.message}` }, {
status: 500 })``

(plus a couple of analogous neverthrow-Result variants on admin routes.)

## Fix

Across 19 webapp routes, replace each leaking branch with a generic body
(`"Something went wrong"` / `"Internal Server Error"` to match the
file's existing fallback) and add `logger.error(...)` so full visibility
is preserved server-side. Catch blocks that branch on typed user-input
errors (`ServiceValidationError`, `EngineServiceValidationError`,
`OutOfEntitlementError`, `PrismaClientKnownRequestError`) are left
intact — those messages are constructed deliberately and intended to be
customer-facing.
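
The shape of the fix can be sketched roughly as follows. This is an illustration only: `ServiceValidationErrorSketch`, `JsonResult`, and `toErrorResponse` are hypothetical stand-ins, and the real routes return Remix `json(...)` responses and call the webapp's `logger.error(...)`.

```typescript
// Hypothetical stand-in for a typed, deliberately customer-facing error.
class ServiceValidationErrorSketch extends Error {
  status: number;
  constructor(message: string, status: number = 422) {
    super(message);
    this.name = "ServiceValidationErrorSketch";
    this.status = status;
  }
}

interface JsonResult {
  status: number;
  body: { error: string };
}

function toErrorResponse(
  error: unknown,
  logError: (message: string, fields: Record<string, unknown>) => void
): JsonResult {
  // Typed user-input errors keep their message: it was written for the caller.
  if (error instanceof ServiceValidationErrorSketch) {
    return { status: error.status, body: { error: error.message } };
  }
  // Everything else is logged in full server-side and answered generically.
  logError("Unhandled route error", {
    error: error instanceof Error ? { name: error.name, message: error.message } : String(error),
  });
  return { status: 500, body: { error: "Internal Server Error" } };
}
```

The internal message still reaches the logs, but the response body no longer varies with it.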

## Test plan

- [x] `pnpm run typecheck --filter webapp`
- [x] Per-route manual probe: inject a synthetic `Error` at the top of
the catch'd `try` block (or fake the wrapped call's rejection / Result
error), curl the route with the dev API key, confirm the response body
changed from the synthetic message verbatim → generic body. 21/21 leak
sites verified end-to-end.
- [x] 4xx-typed-error paths spot-checked: throwing
`ServiceValidationError` from inside the catch'd try still surfaces its
message at 422 as intended.
## Lots of filter UX improvements across lots of routes

### General
- Promoted important filters out of the "More filters" so they're always
visible
- SearchInput primitive is now reusable and Esc now clears the field (AI
filter input also clears with Esc)
- Tooltips + keyboard shortcuts on every primary filter button
- Brighter text on selected filter items / queue items 
- Filter dropdowns reordered for better hierarchy
- Removed debounce on Tasks page search for faster filtering

### Tasks page search
- Esc now clears the field
- ENTER submits a search to improve performance when you have lots of
tasks


https://github.com/user-attachments/assets/4b30521e-dbc4-4468-b2af-8c85bdfb9002

### Runs filters
- Moves Status and Tasks out of the More filters menu
- "Root only" toggle is set to false when you filter for a Task. This
state isn't stored and flips back to the stored value if filters are
cleared
<img width="1690" height="986" alt="CleanShot 2026-04-26 at 19 24 08@2x"
src="https://github.com/user-attachments/assets/b07da73c-140e-451f-a7bf-c32129317f63"
/>

### Batches filters
- General consistency improvements
<img width="1429" height="948" alt="CleanShot 2026-05-08 at 09 50 35"
src="https://github.com/user-attachments/assets/e5ec267f-2aa3-43ef-991e-93bf01bdaea5"
/>

### Schedules
- General consistency improvements
<img width="1567" height="1141" alt="CleanShot 2026-05-08 at 09 51 11"
src="https://github.com/user-attachments/assets/34b7da88-87c6-4e4d-a70f-fe13ea9f87ec"
/>

### Queues
- General consistency improvements
<img width="824" height="416" alt="CleanShot 2026-05-08 at 09 52 02"
src="https://github.com/user-attachments/assets/b4adc102-8192-4a68-b199-a175c2645a6c"
/>

### Waitpoint tokens
- General consistency improvements
<img width="941" height="363" alt="CleanShot 2026-05-08 at 09 52 19"
src="https://github.com/user-attachments/assets/d43aeb3f-7f80-454d-b183-fd077a4e3ff7"
/>

### Models
- General consistency improvements
<img width="1570" height="509" alt="CleanShot 2026-05-08 at 09 53 17"
src="https://github.com/user-attachments/assets/066d7646-4672-4cae-8ec0-e30a82889914"
/>

### AI metrics
- General consistency improvements
<img width="1568" height="624" alt="CleanShot 2026-05-08 at 09 53 43"
src="https://github.com/user-attachments/assets/fdfc4806-26fa-458d-a5ed-5c226b3bbc9f"
/>

### Logs
- General consistency improvements
<img width="1267" height="752" alt="CleanShot 2026-05-08 at 09 54 30"
src="https://github.com/user-attachments/assets/3e9ba871-b9dd-490e-aded-5d87134fd2bb"
/>

### Errors
- General consistency improvements
<img width="1568" height="670" alt="CleanShot 2026-05-08 at 09 54 50"
src="https://github.com/user-attachments/assets/fdda027a-e24f-4804-b4bb-203a6c2db960"
/>

### Query
- General consistency improvements
- History, Scope, Triggered (date) filters all have shortcut tooltips
- Scope filter now reuses the metrics ScopeFilter component
<img width="1566" height="716" alt="CleanShot 2026-05-08 at 09 55 22"
src="https://github.com/user-attachments/assets/0130b4a2-9daf-4edc-bada-3380aff4022a"
/>

### Dashboards
- General consistency improvements
- Scope filter gets nicer icons and a shortcut
- Nice icons for the Scope menu items
<img width="1567" height="769" alt="CleanShot 2026-05-08 at 09 56 10"
src="https://github.com/user-attachments/assets/7bea25f7-6c33-4d4a-a36d-3a1cb56afe09"
/>

### Custom dashboard
- General consistency improvements
- Add chart, Add title, and the kebab menu now have tooltips + shortcuts
<img width="1566" height="782" alt="CleanShot 2026-05-08 at 09 58 11"
src="https://github.com/user-attachments/assets/9df4db25-b2c0-43a2-b92f-00256337d5a9"
/>

### Environment variables
- General consistency improvements
<img width="1569" height="930" alt="CleanShot 2026-05-08 at 09 58 55"
src="https://github.com/user-attachments/assets/26e614b4-88e7-400b-aa6d-a96bad488fb8"
/>

### Preview branches
- General consistency improvements
<img width="1570" height="986" alt="CleanShot 2026-05-08 at 09 59 17"
src="https://github.com/user-attachments/assets/57a2b939-3670-4252-ab2c-d6dc65bdda1b"
/>
## Summary

Adds a Redis pub/sub reload path to the webapp's in-memory LLM pricing
registry. When enabled on a process, the registry reloads from the
database whenever a publish lands on the configured channel — instead of
waiting for the existing 5-minute interval. Lets pricing/model changes
propagate to cost enrichment within seconds.

Subscription is **off by default** and opt-in per process. Only
OTel-ingesting services need real-time freshness; dashboard and worker
services run fine on the periodic interval and shouldn't pile onto each
publish with a full-table reload.

## Design

When `LLM_PRICING_RELOAD_PUBSUB_ENABLED=true`, subscribes via
`createRedisClient` against `COMMON_WORKER_REDIS_*` and listens on
`LLM_PRICING_RELOAD_CHANNEL` (default `llm-registry:reload`). The
5-minute periodic reload stays as a backstop, and a SIGTERM/SIGINT
handler closes the subscription cleanly.

The publisher side lives outside this PR — any process running in the
same Redis namespace can trigger a reload by `PUBLISH
llm-registry:reload <anything>`. Includes a `.server-changes/` note for
the changelog.

### Debounced reload

Bursts of publishes are coalesced. The first publish schedules a reload
at T+`LLM_PRICING_RELOAD_DEBOUNCE_MS` (default 1s); subsequent publishes
during that window are no-ops because the trailing reload picks up
everything when it queries the DB. Bounds reload rate to at most 1 per
debounce window regardless of publisher chattiness, so a runaway
upstream publisher can't fan out into a flood of full-table-scan
reloads.
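
A small sketch of the coalescing described above, assuming an injectable scheduler so the behaviour can be exercised without real timers. `createCoalescedReloader` and `Schedule` are illustrative names, not identifiers from this PR.

```typescript
// Scheduler abstraction: run `fn` after `ms` milliseconds.
type Schedule = (fn: () => void, ms: number) => void;

function createCoalescedReloader(
  reload: () => void,
  debounceMs: number,
  schedule: Schedule
): () => void {
  let pending = false;
  return function onPublish(): void {
    // A reload is already scheduled; it will read the latest DB state anyway.
    if (pending) return;
    pending = true;
    schedule(() => {
      pending = false;
      reload();
    }, debounceMs);
  };
}
```

In a real process `schedule` would most likely wrap `setTimeout`. Because the flag is cleared before `reload()` runs, a publish that arrives during the reload schedules one more trailing reload rather than being lost.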

## Test plan

- [ ] With `LLM_PRICING_RELOAD_PUBSUB_ENABLED=false` (default):
`redis-cli PUBSUB NUMSUB llm-registry:reload` returns `0` while the
webapp is up
- [ ] With it set to `true`: returns `>= 1`
- [ ] `redis-cli PUBLISH llm-registry:reload test` returns `1` (one
subscriber received) on a subscribed process
- [ ] Mutate an `LlmModel` row externally, publish on the channel,
observe the registry's match() picks up the change without waiting for
the 5-min tick
- [ ] Publish 100x in rapid succession; confirm only one reload fires
within the debounce window
…ze (#3538)

## Summary

- Run-view inspector panel was glitching out on Firefox: visual flicker
on close, locking up at min size, and intermittent `panelHasSpace`
invariant errors. Root cause is the underlying `react-window-splitter`
library's collapse animation, which uses `@react-spring/rafz` and
interacts poorly with Firefox.
- Disabled the library's collapse animation on Firefox only, app-wide
(every consumer of `RESIZABLE_PANEL_ANIMATION`). Chromium and Safari
behaviour is unchanged.

## Changes

- **Firefox animation skip** in `RESIZABLE_PANEL_ANIMATION` —
UA-detected at module load, resolves to `undefined` for Firefox so the
library's animation actor completes in one frame instead of running its
rAF loop.
- **Inspector min raised 50px → 250px** so dragging can't shrink the
panel into a near-useless width.
- **`autosaveId` bumped `v2` → `v3`** to invalidate stale persisted
snapshots (the library has a `// TODO` branch that ignores prop changes
for already-registered panels, so existing users would otherwise still
see the old 50px min).
- **`react-window-splitter` pinned** to exact `0.4.1` to protect the
patch from drifting if line offsets change in a patch release.
- **Two hunks added to the existing `@window-splitter/state` patch:**
  - Removed the library's auto-collapse-on-drag block entirely. Every
    collapsible panel in the app is parent-controlled, and that block was
    triggering state-machine deadlocks when handlers were no-ops.
    Drag-to-collapse is now disabled across the app; collapse is only
    triggered explicitly (close button, ESC, URL change, etc.).
  - In `getDeltaForEvent`, fall back to the panel's `default` before its
    `min` when expanding, so the first ever click on a span opens the
    inspector at 500px, not 250px.
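
The Firefox skip in the first bullet can be pictured with a sketch like the one below. The `PanelAnimationSketch` shape and the `resolvePanelAnimation` helper are assumptions for illustration; the actual `RESIZABLE_PANEL_ANIMATION` value and its detection code may differ.

```typescript
// Hypothetical animation config shape; the real constant may look different.
interface PanelAnimationSketch {
  easing: string;
  duration: number;
}

function resolvePanelAnimation(
  userAgent: string | undefined,
  animation: PanelAnimationSketch
): PanelAnimationSketch | undefined {
  // Desktop Firefox identifies itself with a "Firefox/<version>" token.
  const isFirefox = typeof userAgent === "string" && /firefox\//i.test(userAgent);
  // `undefined` means "no animation": the panel snaps open or closed.
  return isFirefox ? undefined : animation;
}
```

Evaluating this once at module load, with the browser's user agent string, would match the "UA-detected at module load" behaviour described above.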

## Local testing confirmed

- [x] Firefox: open a run, click various spans → panel opens instantly
at 500px, drags freely between 250px and max, closes instantly to 0. No
console errors.
- [x] Chrome/Chromium: same flow, but with smooth open/close animation
as before.
- [x] Safari: same as Chrome.
- [x] Reload mid-session → panel restores cleanly to the dragged size.
- [x] Other resizable panels in the app (logs, deployments, schedules,
batches, bulk-actions, runs index) still animate on Chromium/Safari.

## Notes

- Linear: TRI-8584
- Branch contains intermediate commits exploring an unsuccessful
snapshot-validator approach; they're reverted by the final commit.
Cumulative diff is 6 files. Squash on merge if you'd prefer a clean
history.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…over (#3548)

## Summary

During an ElastiCache role swap (failover) or node-type change (vertical
scale), the ioredis TCP/TLS connection stays open but the server starts
answering with `READONLY` (the client is talking to a node that became a
replica) or `LOADING` (node still loading data from disk). Without an
explicit hook, those errors surface to caller code as `ReplyError`
instances — every write op on the affected connection fails until the
cluster fully cuts over.

This PR adds `reconnectOnError` to every prod ioredis client so the
disconnect + reconnect + retry cycle absorbs these errors and caller
code never sees them.

## Fix

```ts
export function defaultReconnectOnError(err: Error): boolean | 1 | 2 {
  const msg = err.message ?? "";
  if (msg.startsWith("READONLY") || msg.startsWith("LOADING")) return 2;
  return false;
}
```

Returning `2` tells ioredis to disconnect, reconnect, and re-issue the
failed command. After reconnect, DNS / SG state routes the new socket to
a writable node.

The helper lives in `@internal/redis` and is wired into both the shared
`createRedisClient` (which covers RunQueue, schedule-engine,
redis-worker, and every other internal-package consumer) and the direct
`new Redis(...)` call sites in the webapp.

V1-only marqs files are intentionally not migrated.

## Test plan

- [x] `pnpm run typecheck --filter webapp`
- [x] `pnpm run typecheck --filter @internal/run-engine`
- [x] Verified end-to-end against a live ElastiCache vertical-scale
event — caller-surfaced errors went from tens of thousands during the
cutover window down to a handful per ioredis client
- [ ] Confirm steady-state behavior unchanged after deploy
…3549)

## Summary

When ElastiCache demotes a primary to replica — during a Multi-AZ
failover or a vertical node-type change — the demoting primary issues an
`UNBLOCKED` reply to any in-flight blocking commands (`BLPOP`, `BRPOP`,
`BLMOVE`, `XREADGROUP ... BLOCK`, etc.) to clear them before the role
flips. ioredis surfaces these as `ReplyError` to caller code.

The shared `defaultReconnectOnError` added in #3548 only matches
`READONLY` and `LOADING`. This extends it to `UNBLOCKED` so the
disconnect-reconnect-retry cycle handles BLPOP-shaped errors the same
way the existing two cases handle non-blocking-command errors.

## Fix

```ts
export function defaultReconnectOnError(err: Error): boolean | 1 | 2 {
  const msg = err.message ?? "";
  if (
    msg.startsWith("READONLY") ||
    msg.startsWith("LOADING") ||
    msg.startsWith("UNBLOCKED")
  ) {
    return 2;
  }
  return false;
}
```

Returning `2` tells ioredis to disconnect, reconnect, and re-issue the
command. For a BLPOP that means a fresh BLPOP against the new primary
instead of the `UNBLOCKED` error escaping to the caller.

## Test plan

- [ ] CI green
- [ ] Trigger a Multi-AZ failover or a vertical scale event on an
ElastiCache replication group whose clients are running blocking
commands and confirm no `UNBLOCKED` errors surface to caller code during
the cutover.
…h rate limit (#3475)

## Summary

- Adds admin-only editors on the back-office org page for
`Organization.maximumProjectCount` and
`Organization.batchRateLimitConfig`, alongside the existing API rate
limit editor.
- Splits the back-office org page into per-section components
(`ApiRateLimitSection`, `BatchRateLimitSection`, `MaxProjectsSection`)
so each tool is self-contained — adding new sections later doesn't bloat
the route.
- Generalizes the rate-limit form into a reusable `RateLimitSection`
component + `RateLimitDomain` server config so API and batch share the
same UI, validation, and action handler. Each domain only owns its env
defaults, DB column, and logger key.
- "Saved." banner and validation errors are scoped to the section that
submitted, not the page.

Heads-up: the API rate-limit log key was renamed
`admin.backOffice.rateLimit` → `admin.backOffice.apiRateLimit` for
symmetry with the new `admin.backOffice.batchRateLimit`.

## Test plan

- [ ] As an admin, visit `/admin/back-office/orgs/:orgId` and confirm
all three sections render with the org's current values (or system
defaults).
- [ ] Edit and save each section; confirm only that section shows the
"Saved." banner.
- [ ] Submit invalid input (e.g. `0` tokens, malformed interval);
confirm errors render in the offending form only and the other sections
stay closed.
- [ ] Confirm a non-admin user is redirected away from the route.
- [ ] After saving a rate-limit override, hit the org with traffic and
confirm the new limit is enforced (API rate limit + batch rate limit
code paths read the column at request time).
#3554)

## Summary

TTL expiration on queued runs was being scheduled twice: once via a
per-run `expireRun` worker job (the original implementation) and once
via the batch TTL system (added more recently). Both paths attempt to
flip the same run to `EXPIRED`. The per-run job almost always won the
race, leaving the batch consumer to observe runs already expired by the
older path.

This collapses TTL expiration onto the batch path so every queued TTLed
run goes through a single Redis-backed sorted set + batch consumer
instead of also getting its own scheduled redis-worker job.

## Design

`engine.trigger` and `delayedRunSystem.enqueueDelayedRun` no longer call
`ttlSystem.scheduleExpireRun`. The remaining `enqueueSystem.enqueueRun({
includeTtl: true })` already adds the run to the TTL sorted set;
`TtlSystem.expireRunsBatch` flips it to `EXPIRED` when the TTL fires.

Delayed runs get the same coverage by passing `includeTtl: true` on
their post-delay enqueue, so the TTL is armed from the moment the run
enters the queue (matching how the old job behaved —
`parseNaturalLanguageDuration` is evaluated at enqueue time).

The new path explicitly does not re-expire runs once they have been
allocated a concurrency slot. That is intentional: TTL is for runs that
are queued and have never started. Once a run has a slot it is on its
way to executing.
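
To make the contract concrete, here is an in-memory sketch of the batch path. It is not the real `TtlSystem`: `TtlSetSketch` stands in for the Redis sorted set, and the `EXECUTING` status is used as a simplified stand-in for "has been allocated a concurrency slot".

```typescript
type RunStatus = "QUEUED" | "EXECUTING" | "EXPIRED";

interface RunSketch {
  id: string;
  status: RunStatus;
}

// In-memory stand-in for a sorted set scored by expiry time (ms).
class TtlSetSketch {
  private entries: Array<{ runId: string; expiresAt: number }> = [];

  add(runId: string, expiresAt: number): void {
    this.entries.push({ runId, expiresAt });
    this.entries.sort((a, b) => a.expiresAt - b.expiresAt);
  }

  // Remove and return up to `limit` run ids whose TTL has fired.
  popDue(now: number, limit: number): string[] {
    const due: string[] = [];
    while (due.length < limit) {
      const head = this.entries[0];
      if (!head || head.expiresAt > now) break;
      this.entries.shift();
      due.push(head.runId);
    }
    return due;
  }
}

function expireRunsBatchSketch(
  ttl: TtlSetSketch,
  runs: Map<string, RunSketch>,
  now: number,
  limit: number = 100
): string[] {
  const expired: string[] = [];
  for (const runId of ttl.popDue(now, limit)) {
    const run = runs.get(runId);
    // Only runs that are still queued and never started get expired.
    if (run && run.status === "QUEUED") {
      run.status = "EXPIRED";
      expired.push(runId);
    }
  }
  return expired;
}
```

A run whose TTL has fired but which has already started is dropped from the set and left alone.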

## Test plan

- [x] `pnpm run test --filter @internal/run-engine
./src/engine/tests/ttl.test.ts` — 15 tests, including a new "Re-enqueued
runs are not expired by TTL once they have started" that locks in the
queued-and-never-started contract.
- [x] `pnpm run test --filter @internal/run-engine
./src/engine/tests/delays.test.ts` — 5 tests, including "Delayed run
with a ttl" which now also asserts the TTL is armed from queue-enter
time, not `createdAt`.
- [x] `pnpm run test --filter @internal/run-engine
./src/engine/tests/lazyWaitpoint.test.ts` — 12 tests.
- [x] `pnpm run typecheck --filter @internal/run-engine`.
…ges (#3559)

## Summary

Make `taskIdentifier` optional on the run-queue message schema. No
behavior change in this PR; readers continue to accept payloads that
include the field. A separate change will stop writing it on the wire to
shrink the per-run payload that lives in Redis while runs wait to be
dequeued.

## Design

The field is written into every payload at enqueue time but no consumer
reads it back on the dequeue path. Both the run-engine and supervisor
derive `taskIdentifier` from the loaded `TaskRun` row instead. Relaxing
the schema first means readers tolerate payloads that omit it, so the
writer-side change can ship without producing schema-parse errors during
a rolling deploy.

`projectId` is left required: `WorkerQueueResolver.#getOverride` reads
it for project-scoped runtime worker-queue overrides.
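
A hand-rolled sketch of the reader-side relaxation, for illustration only. The real schema is not shown in this PR description, so `QueueMessageSketch`, `parseQueueMessage`, and the `runId` field are assumptions; only `taskIdentifier` (now optional) and `projectId` (still required) come from the text.

```typescript
interface QueueMessageSketch {
  runId: string; // hypothetical field, for illustration
  projectId: string; // stays required
  taskIdentifier?: string; // optional after this change
}

function parseQueueMessage(raw: unknown): QueueMessageSketch {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("message must be an object");
  }
  const record = raw as Record<string, unknown>;
  const runId = record["runId"];
  const projectId = record["projectId"];
  const taskIdentifier = record["taskIdentifier"];

  if (typeof runId !== "string") throw new Error("runId is required");
  if (typeof projectId !== "string") throw new Error("projectId is required");
  // Present-but-wrong type is still an error; absent is now fine.
  if (taskIdentifier !== undefined && typeof taskIdentifier !== "string") {
    throw new Error("taskIdentifier must be a string when present");
  }
  return taskIdentifier === undefined
    ? { runId, projectId }
    : { runId, projectId, taskIdentifier };
}
```

With readers deployed in this form first, old writers (field present) and new writers (field omitted) both parse during a rolling deploy.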

## Test plan

- [x] `pnpm run typecheck --filter @internal/run-engine`
- [x] `pnpm run typecheck --filter webapp`
- [x] `pnpm run test ./src/run-queue/tests/enqueueMessage.test.ts
./src/run-queue/tests/workerQueueResolver.test.ts --run` (28/28 passing)
### Style updates to the notifications
- Tightened up the typography
- Brighter background to make it stand out a bit more
- A bit more padding to make it more readable
- Show the close button on hover instead
- Turned the notification into a separate component as it's shared on
the admin page modal
- Minor tweaks to the behavior of toggling the notification between
open/closed side menu states

### Before
<img width="224" height="313" alt="before"
src="https://github.com/user-attachments/assets/c9a9377c-4a3b-4477-921a-3c86385d3f0b"
/>

### After (with image)
<img width="239" height="284" alt="CleanShot 2026-05-11 at 17 22 01"
src="https://github.com/user-attachments/assets/311b4dbc-4853-4e6c-9f83-8173b38bd466"
/>

### After (no image)
<img width="239" height="189" alt="after"
src="https://github.com/user-attachments/assets/884e062b-3608-4cb3-a462-d50597257753"
/>

---------

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
## Summary
1 improvement, 1 bug fix.

## Improvements
- Fail attempts on uncaught exceptions instead of hanging to
`MAX_DURATION_EXCEEDED`. A Node `EventEmitter` (e.g. `node-redis`)
emitting `"error"` with no `.on("error", ...)` listener escalates to
`uncaughtException`, which the worker previously reported but did not
act on — runs drifted to maxDuration with empty attempts. They now fail
fast with the original error and status `FAILED`, and respect the task's
normal retry policy. You should still attach `.on("error", ...)`
listeners to long-lived clients to handle errors gracefully.
([#3529](#3529))

## Bug fixes
- Fix dev workers spinning at 100% CPU after the parent CLI disconnects.
Orphaned `trigger-dev-run-worker` (and indexer) processes were caught in
an `uncaughtException` feedback loop: a periodic IPC send via
`process.send` would throw `ERR_IPC_CHANNEL_CLOSED` once the parent
closed the channel, which re-entered the same handler that itself called
`process.send`, scheduled via `setImmediate` and amplified by
source-map-support's `prepareStackTrace`. Fixed by (1) silently dropping
packets in `ZodIpcConnection` when the channel is disconnected, (2)
adding a `process.on("disconnect", ...)` handler in dev workers so they
exit cleanly when the CLI closes the IPC channel, and (3) wrapping all
`uncaughtException`-path `process.send` calls in a `safeSend` guard that
checks `process.connected` and swallows synchronous throws.
([#3491](#3491))
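
A sketch of what a `safeSend`-style guard might look like, based only on the description above (check `connected`, swallow synchronous throws). `IpcChannelLike` is a structural stand-in so the example does not depend on Node typings; the real helper's signature may differ.

```typescript
// Structural stand-in for the parts of `process` the guard needs.
interface IpcChannelLike {
  connected?: boolean;
  send?: (message: unknown) => boolean;
}

function safeSend(channel: IpcChannelLike, message: unknown): boolean {
  // No IPC channel, or the parent already disconnected: drop the packet.
  if (!channel.connected || typeof channel.send !== "function") {
    return false;
  }
  try {
    return channel.send(message);
  } catch {
    // e.g. ERR_IPC_CHANNEL_CLOSED racing with the connected check.
    return false;
  }
}
```

Called as `safeSend(process, packet)` from an `uncaughtException` handler, a failed send can no longer throw back into the same handler.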

<details>
<summary>Raw changeset output</summary>

# Releases
## @trigger.dev/build@4.4.6

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.4.6`

## trigger.dev@4.4.6

### Patch Changes

- Fix dev workers spinning at 100% CPU after the parent CLI disconnects.
Orphaned `trigger-dev-run-worker` (and indexer) processes were caught in
an `uncaughtException` feedback loop: a periodic IPC send via
`process.send` would throw `ERR_IPC_CHANNEL_CLOSED` once the parent
closed the channel, which re-entered the same handler that itself called
`process.send`, scheduled via `setImmediate` and amplified by
source-map-support's `prepareStackTrace`. Fixed by (1) silently dropping
packets in `ZodIpcConnection` when the channel is disconnected, (2)
adding a `process.on("disconnect", ...)` handler in dev workers so they
exit cleanly when the CLI closes the IPC channel, and (3) wrapping all
`uncaughtException`-path `process.send` calls in a `safeSend` guard that
checks `process.connected` and swallows synchronous throws.
([#3491](#3491))
- Fail attempts on uncaught exceptions instead of hanging to
`MAX_DURATION_EXCEEDED`. A Node `EventEmitter` (e.g. `node-redis`)
emitting `"error"` with no `.on("error", ...)` listener escalates to
`uncaughtException`, which the worker previously reported but did not
act on — runs drifted to maxDuration with empty attempts. They now fail
fast with the original error and status `FAILED`, and respect the task's
normal retry policy. You should still attach `.on("error", ...)`
listeners to long-lived clients to handle errors gracefully.
([#3529](#3529))
-   Updated dependencies:
    -   `@trigger.dev/core@4.4.6`
    -   `@trigger.dev/build@4.4.6`
    -   `@trigger.dev/schema-to-json@4.4.6`

## @trigger.dev/core@4.4.6

### Patch Changes

- Fix dev workers spinning at 100% CPU after the parent CLI disconnects.
Orphaned `trigger-dev-run-worker` (and indexer) processes were caught in
an `uncaughtException` feedback loop: a periodic IPC send via
`process.send` would throw `ERR_IPC_CHANNEL_CLOSED` once the parent
closed the channel, which re-entered the same handler that itself called
`process.send`, scheduled via `setImmediate` and amplified by
source-map-support's `prepareStackTrace`. Fixed by (1) silently dropping
packets in `ZodIpcConnection` when the channel is disconnected, (2)
adding a `process.on("disconnect", ...)` handler in dev workers so they
exit cleanly when the CLI closes the IPC channel, and (3) wrapping all
`uncaughtException`-path `process.send` calls in a `safeSend` guard that
checks `process.connected` and swallows synchronous throws.
([#3491](#3491))
- Fail attempts on uncaught exceptions instead of hanging to
`MAX_DURATION_EXCEEDED`. A Node `EventEmitter` (e.g. `node-redis`)
emitting `"error"` with no `.on("error", ...)` listener escalates to
`uncaughtException`, which the worker previously reported but did not
act on — runs drifted to maxDuration with empty attempts. They now fail
fast with the original error and status `FAILED`, and respect the task's
normal retry policy. You should still attach `.on("error", ...)`
listeners to long-lived clients to handle errors gracefully.
([#3529](#3529))

## @trigger.dev/python@4.4.6

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.4.6`
    -   `@trigger.dev/build@4.4.6`
    -   `@trigger.dev/sdk@4.4.6`

## @trigger.dev/react-hooks@4.4.6

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.4.6`

## @trigger.dev/redis-worker@4.4.6

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.4.6`

## @trigger.dev/rsc@4.4.6

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.4.6`

## @trigger.dev/schema-to-json@4.4.6

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.4.6`

## @trigger.dev/sdk@4.4.6

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.4.6`

</details>

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…3552)

Closes
[TRI-9234](https://linear.app/triggerdotdev/issue/TRI-9234/retry-task-process-sigsegv-errors-respecting-user-retry-config)

## What this changes

SIGSEGV crashes (`TASK_PROCESS_SIGSEGV`) will now be **retried when an
attempt fails**, in line with the task's configured retry settings
(`retry.maxAttempts` etc.) — the same path SIGTERM and uncaught
exceptions already use. Previously SIGSEGV was hard-classified as
non-retriable and failed the run on the first segfault, ignoring the
user's retry policy.

Tasks without a retry policy still fail fast on the first SIGSEGV.
Behaviour is unchanged for OOM kills (separate machine-bump retry path)
and SIGKILL_TIMEOUT.

## Deploy

**Only the webapp needs to ship.** The retry decision lives entirely in
the webapp:
- V2 path: `internal-packages/run-engine` (bundled into the webapp)
- V1 path: `apps/webapp/app/v3/services/completeAttempt.server.ts`

No supervisor, CLI, SDK, or customer-task-image changes required.
Customers do not need to redeploy. The `@trigger.dev/core` changeset is
just keeping the public package in sync — the published npm version
isn't what makes the fix work.

## Why retry

SIGSEGV in Node tasks is frequently non-deterministic across processes:

- **Native addon races** (`sharp`, `canvas`, `better-sqlite3`,
`node-rdkafka`, `bcrypt`, …) — libuv thread-pool work stepping on V8
handles. Different heap layout / thread schedule on a fresh process →
retry often succeeds.
- **JIT / GC interaction** — V8 turbofan deopt or GC during a native
callback. Timing-dependent.
- **Near-OOM in native code** — when RSS approaches the cgroup limit,
native allocations fail and poorly-written addons dereference NULL →
SIGSEGV instead of clean OOM-kill.
- **Host / hardware issues** — bit flips, kernel quirks. Retry lands on
a different host.

The genuinely deterministic case (a user-code bug always tripping the
same addon) is real, but a subset — and `maxAttempts` bounds the damage.

## Pre-existing inconsistency this resolves

- `shouldRetryError` returned `false` for `TASK_PROCESS_SIGSEGV` →
`fail_run`.
- `shouldLookupRetrySettings` already listed `TASK_PROCESS_SIGSEGV` as
retry-config-aware — but that branch was unreachable because
`shouldRetryError` short-circuited first in `retrying.ts:86-90`.
- We already retry `TASK_RUN_UNCAUGHT_EXCEPTION` (clearly a user-code
bug) under the user's retry policy; refusing to retry SIGSEGV was the
odd one out.
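
The decision described above can be sketched roughly as follows. This is a simplified model, not the real `shouldRetryError` / `shouldLookupRetrySettings` code: the function name, the `RetryConfig` shape, and the rule that any code outside the set fails the run are assumptions made for illustration.

```typescript
// Hypothetical, simplified model of the retry decision after a failed attempt.
type RetryConfig = { maxAttempts: number };
type Decision = "retry" | "fail_run";

// Error codes named in this description; illustrative, not exhaustive.
const RETRY_CONFIG_AWARE = new Set<string>([
  "TASK_PROCESS_SIGSEGV",
  "TASK_RUN_UNCAUGHT_EXCEPTION",
]);

function decideAfterFailedAttempt(
  errorCode: string,
  attemptNumber: number, // 1-based number of the attempt that just failed
  retry?: RetryConfig
): Decision {
  // Assumption: codes this sketch does not know about fail the run.
  if (!RETRY_CONFIG_AWARE.has(errorCode)) return "fail_run";
  // No retry policy configured: fail fast on the first crash.
  if (retry === undefined) return "fail_run";
  // maxAttempts bounds the damage for deterministic crashes.
  return attemptNumber < retry.maxAttempts ? "retry" : "fail_run";
}
```

With `maxAttempts: 3`, a segfault on attempt 1 or 2 is retried and a segfault on attempt 3 fails the run.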

## Test plan

- [x] `pnpm exec vitest run test/errors.test.ts` in `packages/core` —
26/26 pass (4 new)
- [x] `pnpm run build --filter @trigger.dev/core`
- [ ] CI green on PR

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary

Adds `.claude/REVIEW.md` — a repo-specific source of truth for what AI /
agent code reviewers should treat as critical in this codebase
(rolling-deploy safety, hot-table indexes, recovery-path queries,
testcontainers usage, etc.). Pairs with a Claude-based PR audit that
flags drift between REVIEW.md and the code as it evolves.

## How the audit works

Mirrors the existing `.github/workflows/claude-md-audit.yml` pattern. On
non-draft, non-fork PRs that touch code, `anthropics/claude-code-action`
reads REVIEW.md, samples the PR diff, and posts a sticky comment with up
to 3 of:

- `[stale]` — rule cites a path / function / table that's been removed
or renamed
- `[contradiction]` — code in the PR violates a current rule
- `[missing]` — PR introduces a new pattern future reviewers should know
about
- `[obsolete]` — rule asserts a constraint the repo has moved past

If nothing's off, posts `✅ REVIEW.md looks current for this PR.`

## Test plan

- [ ] Convert this PR to ready-for-review, confirm the audit runs and
posts a sticky comment
- [ ] Verify the audit doesn't run on fork PRs (gated by
`head.repo.full_name == github.repository`)
- [ ] Verify suggestions are actionable on at least one follow-up PR
…3499)

## Summary

Consolidates the webapp's authentication and authorization into a small
set of route helpers, replacing the ad-hoc `requireUser` /
`requireUserId` / `authenticatedEnvironmentForAuthentication` calls
scattered across routes. Same security model, but the per-request flow
(authenticate → authorize → load) now lives in one place per route
family.

Introduces a plugin seam (`@trigger.dev/plugins`) that lets the cloud
build install a richer RBAC implementation without touching webapp code.
The OSS fallback keeps the pre-RBAC permissive behaviour intact, so
self-hosted deployments work unchanged.

Adds a comprehensive end-to-end auth test suite that didn't exist before
— 193 `it()` blocks (vitest reports ~199 after `it.each` expansion)
covering API key, PAT and JWT auth across the public API surface, plus
dashboard session auth for admin pages.

## Changes

### Plugin contract — `@trigger.dev/plugins`

The `RoleBaseAccessController` interface is authoritative for both OSS
(fallback) and cloud (enterprise plugin):
- `authenticateBearer(request, { allowJWT? })` — API-key / public-JWT
auth, returns env + ability
- `authenticateSession(request, { userId, organizationId?, projectId?
})` — dashboard auth, caller resolves `userId` from the session cookie
and passes it in (no `helpers.getSessionUserId` callback — decouples the
plugin host from session-cookie code)
- `authenticatePat(request, { organizationId?, projectId? })` — PAT
auth, returns identity + `lastAccessedAt` so the host can throttle the
per-request update
- `authenticateAuthorize*` variants for the auth-and-check-in-one-call
cases
- `isUsingPlugin(): Promise<boolean>` — capability flag for UI /
branching where plugin-present-ness matters; replaces the
sentinel-string coupling that had `personalAccessToken.server` matching
`"RBAC plugin not installed"` literally

### Dashboard auth (started, partial rollout)

Admin and settings pages migrated to a unified `dashboardLoader` /
`dashboardAction` helper that authenticates the session, runs an
authorization check, and exposes the result to the route. Other
dashboard routes still on the old pattern; remaining migration tracked
in TRI-8730.

Migrated routes:
- `admin.*` (14 admin / back-office / feature-flags / LLM-models /
notifications / orgs / concurrency pages)
- `_app.orgs.$organizationSlug.settings.team`
- `_app.orgs.$organizationSlug.settings.roles`

### API / realtime / engine auth (complete for the migrated families)

71 routes migrated to a unified `apiBuilder` that centralizes Bearer /
PAT / Public-JWT authentication and applies the per-route authorization
check before the handler runs. Includes:
- `api.v1.*` and `api.v2.*` and `api.v3.*` — tasks, runs, batches,
queues, prompts, deployments, query, sessions, waitpoints, packets,
workers, idempotency keys
- `realtime.v1.*` — runs, batches, sessions, streams
- `engine.v1.*` — dev / worker-action protocols

29 routes still on the legacy `authenticateApiRequest*` helpers —
tracked as a post-deploy follow-up in TRI-9228.

Multi-resource auth direction is now explicit at the call site via
`anyResource(...)` (OR) and `everyResource(...)` (AND). Bare arrays no
longer typecheck — fixes a class of bug where a JWT scoped to one
resource could implicitly access others under OR semantics.
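
A minimal sketch of why forcing the direction at the call site helps. The `Resource` and `ResourceMatcher` shapes, the variadic signatures, and `isAuthorized` are assumptions for illustration; only the names `anyResource` and `everyResource` come from this PR.

```typescript
// Hypothetical shapes; the real helper signatures may differ.
type Resource = { type: string; id?: string };
type ResourceMatcher =
  | { mode: "any"; resources: Resource[] }
  | { mode: "every"; resources: Resource[] };

function anyResource(...resources: Resource[]): ResourceMatcher {
  return { mode: "any", resources };
}

function everyResource(...resources: Resource[]): ResourceMatcher {
  return { mode: "every", resources };
}

// `granted` stands in for "does the caller's token cover this resource?".
// A bare Resource[] is not a ResourceMatcher, so it fails to typecheck here.
function isAuthorized(
  matcher: ResourceMatcher,
  granted: (resource: Resource) => boolean
): boolean {
  // Sketch choice: an empty resource list never authorizes.
  if (matcher.resources.length === 0) return false;
  return matcher.mode === "any"
    ? matcher.resources.some(granted)
    : matcher.resources.every(granted);
}
```

A token scoped to a single run passes `anyResource(run, batch)` but not `everyResource(run, batch)`, so the route author has to pick the semantics deliberately.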

PAT auth path consolidated: was three DB queries per request (legacy
`authenticateApiRequestWithPersonalAccessToken` findFirst +
`rbac.authenticatePat` join + `lastAccessedAt` update). Now one query in
the steady state — plugin returns `lastAccessedAt`, host smart-skips the
update via JS-side throttle when fresh.
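
The smart-skip could look something like the sketch below. The function name and the five-minute window are assumptions; the PR only states that the host throttles the update on the JS side when `lastAccessedAt` is fresh.

```typescript
// Assumed freshness window; the real threshold is not stated in this PR.
const LAST_ACCESSED_THROTTLE_MS = 5 * 60 * 1000;

// Returns true when the host should write a new lastAccessedAt value.
function shouldUpdateLastAccessedAt(
  lastAccessedAt: Date | null,
  now: Date,
  throttleMs: number = LAST_ACCESSED_THROTTLE_MS
): boolean {
  // Never recorded before: always write.
  if (lastAccessedAt === null) return true;
  // Skip the write while the stored timestamp is still fresh.
  return now.getTime() - lastAccessedAt.getTime() >= throttleMs;
}
```

In the steady state the authentication query is the only one that runs, because most requests fall inside the window and skip the update.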

Side effect: action aliases preserved historic JWT scope semantics where
the new model is stricter (e.g. a `write:tasks` JWT now also satisfies
`trigger` / `batchTrigger` / `update` actions on the same resource —
matched at the auth boundary, not in the route handler).
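
One way to picture the alias matching, as a sketch only: the `ACTION_ALIASES` table and `actionSatisfies` are hypothetical names, and the table holds just the `write` example given above.

```typescript
// Hypothetical alias table: a granted action also satisfies the listed actions.
const ACTION_ALIASES: Record<string, readonly string[]> = {
  write: ["trigger", "batchTrigger", "update"],
};

// Checked at the auth boundary, before the route handler runs.
function actionSatisfies(grantedAction: string, requiredAction: string): boolean {
  if (grantedAction === requiredAction) return true;
  const aliases = ACTION_ALIASES[grantedAction] ?? [];
  return aliases.indexOf(requiredAction) !== -1;
}
```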

### Backwards-compat fixes

The strict-match model regressed several real-world JWT shapes. Each
preserved via explicit `anyResource(...)` entries in the route's authz
block:

- **Batch retrieve routes** (`api.v1.batches.$batchId`, `api.v2.*`,
`realtime.v1.batches.*`) accept `read:runs` JWTs again (pre-RBAC
literal-match superScope behaviour)
- **Runs list routes** (`api.v1.runs`, `realtime.v1.runs`) accept
type-level `read:tasks` / `read:tags` on unfiltered queries (matched the
legacy `Object.keys` iteration semantic)
- **PAT/OAT auth shape** normalized through `toAuthenticated` so all
auth methods return the same slim `AuthenticatedEnvironment` (was:
API-key returned the slim shape but PAT/OAT returned raw Prisma
`Decimal` / no `orgMember`)
- **Scope `:` preservation** in resource ids — `read:tags:env:staging`
now correctly identifies the tag id as `env:staging`, not `env`
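
The `:` preservation fix amounts to splitting off only the first two segments of a scope. A minimal sketch, assuming an `action:resourceType[:resourceId]` layout; `parseScope` and `ParsedScope` are illustrative names, not the webapp's actual parser.

```typescript
type ParsedScope = { action: string; resourceType: string; resourceId?: string };

function parseScope(scope: string): ParsedScope {
  // Only the first two segments are structural; the rest belongs to the id.
  const [action, resourceType, ...rest] = scope.split(":");
  if (!action || !resourceType) {
    throw new Error(`Malformed scope: ${scope}`);
  }
  return rest.length > 0
    ? { action, resourceType, resourceId: rest.join(":") }
    : { action, resourceType };
}
```

For `read:tags:env:staging` this yields the id `env:staging`, whereas taking only the third segment would yield `env`.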

### Slim `AuthenticatedEnvironment`

Extracted to `@trigger.dev/core/v3/auth/environment` — a structural
shape independent of `@trigger.dev/database`. The plugin contract
returns this; webapp consumers import from there; the cloud plugin
(Drizzle) returns the same shape without Prisma's `Decimal` class
leaking into the public surface. Lets internal-packages (run-engine,
etc.) refer to `AuthenticatedEnvironment` without pulling Prisma in.

### Auth test suite (new — `*.e2e.full.test.ts`)

193 e2e tests run against a real spawned webapp + Postgres (no mocks).
Coverage matrix:

- **API key auth** — read / write / trigger / batchTrigger / deploy
actions across runs, batches, deployments, prompts, queues, query,
sessions, input-streams, waitpoints, tasks, idempotency keys; multi-key
resources (a run carries batch / tag / task identifiers — auth must
accept any matching scope)
- **Personal Access Token auth** — comprehensive matrix: scope match,
scope mismatch, missing scope, expired token, malformed token
- **Public JWT auth** — sub-vs-URL environment resolution, expired JWTs,
signature verification, scope checking, otu (one-time-use) token
semantics, branch-environment signing-key fallback
- **Dashboard session auth** — admin-only pages reject non-admins;
per-action gating
- **Cross-cutting edge cases** — revoked API key grace window, JWT
cross-environment isolation, MissingResource branch behaviour

### Hygiene cleanups

- Deleted dead `app/services/authorization.server.ts` (legacy
`checkAuthorization` + types — no live consumers post-migration) and its
orphaned test
- Dropped the never-populated `scopes` field from
`ApiAuthenticationResultSuccess`
- `scheduleEmail` moved out of `email.server.ts` into its own module —
breaks a `commonWorker → marqs/V1` import chain that was poisoning the
auth test graph
- OSS Roles page shows a deployment-aware empty state ("Roles aren't
available in this self-hosted deployment" vs the plan-upsell copy) via
`rbac.isUsingPlugin()`
- Team action handler: explicit per-intent ability gates
(`manage:billing` for purchase-seats, `manage:members` for set-role +
remove-member with self-leave carve-out)

### Cross-repo coordination

All public-package contract changes paired in `triggerdotdev/cloud#763`
(rbac-packages branch) — the enterprise plugin implements the same
`RoleBaseAccessController` interface against Drizzle.

## Test plan

- [x] `pnpm run typecheck --filter webapp` clean
- [x] `pnpm --filter webapp exec vitest run --config
vitest.e2e.full.config.ts` — 193/193 pass (requires Docker for
testcontainers)
- [x] Spot-check an authed API endpoint with a valid + invalid API key
against a local stack
- [x] Spot-check the migrated admin pages render and gate non-admins

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… queues (#3558)

## Summary

Queues that use concurrency keys can no longer bypass the per-queue
length cap, and the "Queued | Running" columns in the dashboard now show
the true total across all CK variants instead of 0.

The cap and the dashboard both relied on `ZCARD` of the base queue key,
but CK-keyed runs live under `<base>:ck:<variant>` keys. Any queue that
used concurrency keys read 0 — letting a single CK variant grow
unbounded past the user's configured cap.

## Fix

Two per-base-queue counters are maintained inside the CK Lua scripts:
`<base>:lengthCounter` and `<base>:runningCounter`. Non-CK
enqueue/dequeue paths are untouched.

Counters are lazy-initialized the first time a CK enqueue (or nack)
lands on a queue: the Lua script sums `ZCARD` across the variants
tracked by `ckIndex`, sets the counter, then `INCR`s. Pre-existing CK
backlog on already-populated queues is captured automatically — no batch
migration required.

`INCR`/`DECR` is gated on `ZADD`/`SADD` returning 1 (a new entry vs an
idempotent no-op), so duplicate enqueues or re-dequeues don't inflate
the counter.

The counter is `SET` with a 24-hour TTL on init. `INCR`/`DECR` do not
extend the TTL, so the counter expires daily and the next CK operation
re-seeds it from `ckIndex`. This bounds any drift that accumulates
during the rolling-deploy overlap window — where old (un-Tracked) and
new (Tracked) webapp instances briefly coexist — to ≤24 hours, with no
admin sweep or background reconciler needed.

Read paths pipeline `ZCARD`/`SCARD` on the base key + `GET` on the
counter and sum. A missing counter is treated as 0, so pure non-CK
queues see the same answer as before.
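
The counter lifecycle can be modelled in memory as below. This is a sketch of the behaviour described above, not the Lua scripts: `CkQueueModel` and its methods are hypothetical, plain `Set`s stand in for Redis sorted sets, and only the length counter on the enqueue path is modelled (no dequeue or running counter).

```typescript
// In-memory model of one base queue.
class CkQueueModel {
  private readonly baseRuns = new Set<string>(); // non-CK runs on the base key
  private readonly variants = new Map<string, Set<string>>(); // `<base>:ck:<variant>`
  private lengthCounter: number | undefined; // undefined = missing or expired

  // Simulates CK backlog written before the counter existed.
  seedUntracked(variant: string, runId: string): void {
    this.runsFor(variant).add(runId);
  }

  // Non-CK path: the counter is untouched.
  enqueue(runId: string): void {
    this.baseRuns.add(runId);
  }

  enqueueCk(variant: string, runId: string): void {
    // Lazy init: seed from the existing backlog before applying this enqueue.
    const seeded = this.lengthCounter ?? this.sumVariants();
    const runs = this.runsFor(variant);
    const isNew = !runs.has(runId); // models ZADD returning 1
    runs.add(runId);
    // Increment only for a genuinely new entry.
    this.lengthCounter = isNew ? seeded + 1 : seeded;
  }

  // Models the 24-hour TTL elapsing.
  expireCounter(): void {
    this.lengthCounter = undefined;
  }

  // Read path: base cardinality plus counter, missing counter treated as 0.
  queuedLength(): number {
    return this.baseRuns.size + (this.lengthCounter ?? 0);
  }

  private runsFor(variant: string): Set<string> {
    let runs = this.variants.get(variant);
    if (runs === undefined) {
      runs = new Set<string>();
      this.variants.set(variant, runs);
    }
    return runs;
  }

  private sumVariants(): number {
    let total = 0;
    this.variants.forEach((runs) => {
      total += runs.size;
    });
    return total;
  }
}
```

In this model a queue with untracked backlog reads 0 until the next CK enqueue seeds the counter, and after an expiry the next CK enqueue re-seeds it from the variant sets.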

The counter-aware scripts ship alongside the originals with a `Tracked`
suffix for rolling-deploy safety; a follow-up PR will drop the originals
once this has rolled out.

## Test plan

- [ ] `pnpm run test --filter @internal/run-engine` — 116 tests pass,
including a new `ckCounters.test.ts` covering lazy init from
pre-existing backlog, churn, floor-at-zero, the non-CK regression case,
mixed CK + non-CK on the same base queue, idempotent re-enqueue
(ZADD-already-exists), 24h TTL on the counter, and nack re-seeding after
counter expiry.
- [ ] Verified end-to-end against a live local environment:
  - Triggered 24 CK enqueues across 4 variants → `lengthCounter=16`,
    `runningCounter=8`, dashboard showed Queued=16 / Running=8 for the CK
    queue.
  - Set the env queue cap to 16, triggered 12 more enqueues → 8 succeeded,
    4 rejected with `QueueSizeLimitExceededError`.
  - Deleted the counter on a queue with 31 messages already sitting in CK
    variants, triggered one more enqueue → counter materialized to 31 from
    the `ckIndex` sum, then INCR'd.
## Summary

Local ClickHouse was burning ~325% CPU endlessly merging its own
telemetry tables (`metric_log`, `asynchronous_metric_log`, `part_log`,
`trace_log`) after the container had been running long enough to
accumulate hundreds of GB of system-log data. OrbStack Helper reflected
this on the host (~400% CPU).

These tables are not used by anything in the dev stack. They only exist
for ClickHouse to log itself, so disabling them eliminates the merge
churn entirely.

## Changes

- Adds `docker/config/clickhouse-disable-system-logs.xml`, mounted into
`/etc/clickhouse-server/config.d/`, that removes the noisy system log
tables via `<table remove="1"/>`.
- Mounts the override file in `docker/docker-compose.yml`.

After applying, idle CPU dropped from 325% to ~12% on my machine.

## Test plan

- [ ] `pnpm run docker` brings up the stack cleanly
- [ ] `docker stats clickhouse` shows low idle CPU
- [ ] App functionality unaffected (system log tables are not queried by
the webapp)
…mpling (#3567)

## Summary

Follow-up to #3561. The drift-audit workflow timed out on PR #3542 (92
files, +5962 lines) by hitting `--max-turns 15` before reaching a
verdict, leaving a red ❌ on that PR with no sticky comment.

## Changes

- `--max-turns` bumped from 15 to 30.
- Prompt now opens with an explicit "Strategy" section: read REVIEW.md
once, scan the file-list only, open at most 5 files (3-5 on PRs >50
files), and bias toward finishing over exploring.
- Final rule: *"when in doubt between one more file read and finish now
— finish now."*

The audit is allowed to miss things. It is not allowed to time out and
leave a red X.

## Test plan

- [ ] Verify this PR's audit posts `✅ REVIEW.md looks current for this
PR.` (small diff)
- [ ] After merge, retry the audit on #3542 or a similarly large PR and
confirm it completes
…#3564)

## Summary

- Users on production are hitting `QuotaExceededError: Failed to execute
'setItem' on 'Storage'` when navigating runs, because their localStorage
is full of orphaned `panel-group-react-aria<n>-:<rid>:` entries.
- Each entry is a session-unique key written by the resizable panel
library; they accumulated to thousands per user over the last two months
and now block legitimate `setItem` calls (the run-view inspector can no
longer persist its layout, and the page crashes mid-render).
- This PR evicts the legacy entries once on client boot. The leak itself
is already plugged by the v1.1.3 upgrade in #XXXX — this is the cleanup
that recovers the wasted quota on existing users' machines.

## Root cause (already fixed, for context)

In v0.4.1 of the underlying library, `PanelGroupImpl` defaulted
`autosaveStrategy` to `"localStorage"` unconditionally — so *every*
`PanelGroup` wrote to localStorage on every autosave trigger, including
the four in `QueryEditor`, the one in `ReplayRunDialog`, the storybook
routes, etc. Without an `autosaveId`, the key fell back to
`panel-group-${useId()}`, and React Aria's `useId()` produces a new
session-unique prefix each visit. Result: entries accumulated without
bound across sessions.

The condition was introduced when
[#3282](#3282) removed
the wrapper's explicit `autosaveStrategy="cookie"` override (to fix HTTP
431 cookie-size errors). That worked, but the library default that took
over silently caused this leak.

The v1.1.3 upgrade in the resizable-panel PR changed the default to
`autosaveStrategy = autosaveId ? "localStorage" : undefined`, so no new
entries are being written. Existing residue still needs to be removed
from users' browsers.

## Changes

- New file
[`apps/webapp/app/clientBeforeFirstRender.ts`](apps/webapp/app/clientBeforeFirstRender.ts)
— exports a `clientBeforeFirstRender()` function that runs
synchronously, before React hydrates. Encapsulates a small cleanup
helper that scans `localStorage` and removes:
  - Every key starting with `panel-group-react-aria` (the legacy
    auto-generated keys).
  - The orphan `panel-run-parent-v2` key from before the autosaveId v2→v3
    bump.
- [`apps/webapp/app/entry.client.tsx`](apps/webapp/app/entry.client.tsx)
— imports and invokes `clientBeforeFirstRender()` once, before
`hydrateRoot()`. This guarantees the cleanup completes before any
`ResizablePanelGroup` mounts and tries to write.

The cleanup is wrapped in `try/catch` so private-browsing /
disabled-storage scenarios fail silently. Idempotent: subsequent loads
find no matching keys and exit immediately.
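
A rough sketch of what such a cleanup helper might look like. The `StorageLike` interface and the name `removeLegacyPanelKeys` are assumptions so the logic can run outside a browser; the actual `clientBeforeFirstRender.ts` may be structured differently.

```typescript
// Minimal subset of the Web Storage API that the cleanup needs.
interface StorageLike {
  readonly length: number;
  key(index: number): string | null;
  removeItem(key: string): void;
}

const LEGACY_PREFIX = "panel-group-react-aria";
const ORPHAN_KEY = "panel-run-parent-v2";

// Returns the number of entries removed; never throws.
function removeLegacyPanelKeys(storage: StorageLike): number {
  try {
    // Collect first: removing while iterating would shift the indexes.
    const doomed: string[] = [];
    for (let i = 0; i < storage.length; i++) {
      const key = storage.key(i);
      if (key !== null && (key.startsWith(LEGACY_PREFIX) || key === ORPHAN_KEY)) {
        doomed.push(key);
      }
    }
    for (const key of doomed) {
      storage.removeItem(key);
    }
    return doomed.length;
  } catch {
    // Private browsing or disabled storage: fail silently.
    return 0;
  }
}
```

In the browser this would be called with `window.localStorage` before hydration. The orphan key is matched exactly rather than by prefix, so `panel-run-parent-v3` and similarly named keys are left alone.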

## Test plan

- [x] Locally seed ~50 fake `panel-group-react-aria…` entries plus a
`panel-run-parent-v2` entry via DevTools console, hard reload → legacy
entries gone, real entries (`panel-run-parent-v3`, `panel-run-tree`)
preserved.
- [x] Idempotency: reload a second time, no errors, no state changes.
- [x] Add a control entry (`panel-run-parent-v3-but-different-suffix`) —
confirmed not over-matched.
- [x] Simulate broken `Storage.setItem` throwing — page still renders,
cleanup swallows the error.
- [x] Typecheck clean.

## Notes

- Customer report: `QuotaExceededError: Failed to execute 'setItem' on
'Storage': Setting the value of 'panel-run-parent-v3' exceeded the
quota.`
- The cleanup runs once per page load. Once a user has loaded the app
after this deploys, their localStorage is clean and the function becomes
a no-op forever.
## Summary
- Recommend deploying NodeLocal DNS and lowering `ndots` to `1` in the
Kubernetes self-hosting guide.
- Recommend storing task events in ClickHouse
(`EVENT_REPOSITORY_DEFAULT_STORE=clickhouse_v2`) in both the Docker and
Kubernetes guides, plus a new row in the webapp env var reference.

`pr_checks` runs the full matrix on every PR. #3609 touched only
`apps/webapp/app/routes/admin.tsx` and still ran the 4-job CLI e2e
matrix and 5-job sdk-compat suite.

Adds a `changes` job using `dorny/paths-filter` and gates each tier:

- webapp + e2e-webapp: `apps/webapp/**`, `packages/**`,
`internal-packages/**`
- packages: `packages/**`
- internal: `internal-packages/**` + `packages/**` (cross-deps)
- e2e (cli-v3): `packages/{cli-v3,build,core,schema-to-json}/**`
- sdk-compat: `packages/{trigger-sdk,core}/**`

`.configs/**`, `package.json`, `pnpm-lock.yaml`, `pnpm-workspace.yaml`,
`turbo.json` are also included in every filter since they affect the
whole workspace.

Inlines the `units` reusable-workflow children so each can be gated
independently (status check names also flatten from `units / webapp /
...` to `webapp / ...`). `unit-tests.yml` is unaffected - still used by
`publish.yml`.

Adds an `all-checks` gate that always runs and short-circuits to success
when every dependent is success-or-skipped. With this in place a single
required status check (`All PR Checks`) is enough; before this,
`paths-ignore` would have left required checks Pending on docs/changeset
PRs ([gh
docs](https://docs.github.com/en/actions/managing-workflow-runs/skipping-workflow-runs)).
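
The gate's rule is small enough to state as a function. A sketch in TypeScript rather than workflow YAML; `allChecksPassed` is an illustrative name, and the result strings are the job result values GitHub Actions reports for dependent jobs.

```typescript
type JobResult = "success" | "skipped" | "failure" | "cancelled";

// Passes only when every dependent job succeeded or was skipped by its filter.
function allChecksPassed(results: Record<string, JobResult>): boolean {
  return Object.keys(results).every((job) => {
    const result = results[job];
    return result === "success" || result === "skipped";
  });
}
```

A docs-only PR, where every tier is skipped, passes the gate; a single failed or cancelled tier fails it.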
…nizations (#3609)

Switching between the Users and Organizations tabs in the admin
dashboard now keeps the current `?search=` value, so you can flip
between the two without re-typing your filter. Other admin tabs don't
take `search` and so don't carry it.

Adds Sessions, a durable, run-aware stream primitive that scopes
session.in / session.out records to a session (not a single run).
Records survive run boundaries; reconnect-from-last-event-id is built in.

Server foundation:
- New /realtime/v1/sessions/:session/:io/append + /records routes
- sessionRunManager + sessionsRepository + clickhouseSessionsRepository
- mintRunToken for short-lived per-session tokens
- s2Append retry-with-backoff + undici cause diagnostics
- /api/v[12]/packets/* exempt from customer rate limits
- BackgroundWorker schema gains taskKind enum (TASK, AGENT, SCHEDULED)
- TaskRun.taskKind column + clickhouse 029_add_task_kind_to_task_runs_v2

Core types:
- new sessionStreams, inputStreams, realtimeStreams packages in @trigger.dev/core
- session-streams-api / realtime-streams-api surface

Sessions dashboard UI (the primitive's own viewer):
- /sessions index + detail routes
- SessionsTable, SessionFilters, SessionStatus, CloseSessionDialog
- AGENT/SCHEDULED filter in RunFilters + TaskTriggerSource

Includes the sessions-primitive changeset.

`tasks.trigger`, `tasks.batchTrigger`, `batch.create`,
`wait.createToken`, `wait.forDuration`, and the input/session stream
waitpoint endpoints all accept a caller-supplied `idempotencyKey` and
store it verbatim against a composite-unique index on `TaskRun`,
`BatchTaskRun`, or `Waitpoint`. The schemas had no length cap, so a
sufficiently long high-entropy key produced an index row larger than the
underlying storage layer can hold. The insert failed at the database,
and the caller saw a generic 500 from
`RunEngineTriggerTaskService.call()` / `CreateBatchService` / waitpoint
creation, depending on the endpoint.

Keys produced by `idempotencyKeys.create()` are 64-character SHA-256
hashes and never trip this — it only manifests for direct REST callers
(or SDK callers passing a raw string they generated themselves).
Low-entropy keys also sail through, because the storage layer compresses
repeated bytes before they reach the index, which is why the failure
mode is intermittent and tied to caller-side key shape.

## Fix

Add `.max(2048, "<field> must be 2048 characters or less")` to the seven
schemas that feed an indexed `idempotencyKey` column:

- `TriggerTaskRequestBody.options.idempotencyKey`
- `BatchTriggerTaskItem.options.idempotencyKey`
- `CreateBatchRequestBody.idempotencyKey`
- `CreateWaitpointTokenRequestBody.idempotencyKey`
- `CreateInputStreamWaitpointRequestBody.idempotencyKey`
- `CreateSessionStreamWaitpointRequestBody.idempotencyKey`
- `WaitForDurationRequestBody.idempotencyKey`

Plus the `idempotency-key` HTTP header on the trigger route (and the
three batch routes that re-export `HeadersSchema`). The header schema is
lifted out of `api.v1.tasks.$taskId.trigger.ts` into
`apps/webapp/app/v3/triggerHeaders.server.ts` so it can be exercised in
tests without dragging the route's import-time side effects.

The 2048 character ceiling is chosen to sit safely under the per-row
index limit while staying generous against existing callers — keys that
fit before still fit. Oversized keys now return a structured Zod 400
instead of a generic 500.
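
The effect of the cap, sketched without Zod so the example stays dependency-free. `validateIdempotencyKey` and `ValidationResult` are hypothetical; the real change is a `.max(2048, ...)` on the schemas listed above, and this sketch assumes the same `string.length` comparison.

```typescript
const MAX_IDEMPOTENCY_KEY_LENGTH = 2048;

type ValidationResult =
  | { ok: true }
  | { ok: false; status: 400; message: string };

// Rejects oversized keys up front instead of letting the insert fail later.
function validateIdempotencyKey(field: string, key: string | undefined): ValidationResult {
  if (key !== undefined && key.length > MAX_IDEMPOTENCY_KEY_LENGTH) {
    return {
      ok: false,
      status: 400,
      message: `${field} must be ${MAX_IDEMPOTENCY_KEY_LENGTH} characters or less`,
    };
  }
  return { ok: true };
}
```

A key of exactly 2048 characters is accepted; 2049 characters is rejected with the structured 400.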

Limit is documented under `Idempotency key` in `docs/limits.mdx` and as
a `<Note>` on `docs/idempotency.mdx`.

## Test plan

- [x] 15 schema unit tests added
(`packages/core/src/v3/schemas/idempotencyKey.test.ts`,
`apps/webapp/test/routes/triggerHeaders.test.ts`) —
rejection-with-message + boundary acceptance for each capped schema. The
webapp test exercises the extracted `TriggerHeadersSchema` directly with
no mocks.
- [x] `pnpm run build --filter @trigger.dev/core`
- [x] `pnpm run typecheck --filter webapp`
- [x] End-to-end verified locally: baseline (small key) → 200; 3000-char
high-entropy header → 400 with the expected Zod error; same key at the
2048 boundary → 200; same key with the cap reverted → the database
rejected the insert and the route returned 500 to the caller. Cap
restored.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…3542)

## Summary

A `/sessions` dashboard for inspecting durable Sessions, an `AGENT` /
`SCHEDULED` task-kind filter for the runs list, and the server-side
hardening (rate-limit exemption for packets, retry-with-backoff on
stream appends, typed too-large-chunk error) that the `chat.agent`
runtime in #3543 needs. Builds on the Sessions primitive shipped in
#3417.

## Design

The Sessions list + detail routes mirror the run inspector pattern.
`TaskTriggerSource` gains `AGENT` and `SCHEDULED` values, persisted on
`BackgroundWorker.taskKind` and `TaskRun.taskKind` (plus a matching
ClickHouse column), so the runs list can filter by kind.

New `@trigger.dev/core` modules — `sessionStreams`, `inputStreams`, a
`sessionStreamInstance` for realtime streams, and the
`realtime-streams-api` / `session-streams-api` surfaces — expose the
typed shapes that chat.agent will use to drive `session.out`.
`ChatChunkTooLargeError` lets the runtime drop oversized chunks with a
typed surface instead of failing the run. `s2Append` retries transient
failures with exponential backoff. `/api/v[12]/packets/*` is exempt from
customer rate limits so chat snapshot reads and writes don't get
throttled under load.
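
Retry with exponential backoff, as mentioned for `s2Append`, generally follows the pattern below. This is a generic sketch, not the actual implementation: the base delay, cap, retry count, and the `isTransient` / `sleep` parameters are all assumptions.

```typescript
// Delay before retry number `attempt` (0-based), doubled each time and capped.
function backoffDelayMs(attempt: number, baseMs: number = 100, capMs: number = 5000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Retries `op` on transient errors; rethrows anything else or the final failure.
async function retryWithBackoff<T>(
  op: () => Promise<T>,
  maxRetries: number,
  isTransient: (error: unknown) => boolean,
  sleep: (ms: number) => Promise<void>
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (error) {
      if (attempt >= maxRetries || !isTransient(error)) throw error;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

Injecting `sleep` keeps the helper testable without real timers; production code would typically pass a `setTimeout`-based promise and often add jitter.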

## Stack

Part of a 4-PR stack. Merge bottom-up.

1. **This PR** (#3542) → `main`
2. #3543 → #3542: `chat.agent` runtime + browser transport
3. #3545 → #3543: agent-view dashboard
4. #3546 → #3545: ai-chat reference + MCP tooling

Replaces #3173 (closed).

<!-- GitButler Footer Boundary Top -->
---
This is **part 5 of 5 in a stack** made with GitButler:
- <kbd>&nbsp;5&nbsp;</kbd> #3612
- <kbd>&nbsp;4&nbsp;</kbd> #3546
- <kbd>&nbsp;3&nbsp;</kbd> #3545
- <kbd>&nbsp;2&nbsp;</kbd> #3543
- <kbd>&nbsp;1&nbsp;</kbd> #3542 👈 
<!-- GitButler Footer Boundary Bottom -->
The `code` paths filter currently matches `**` minus a tiny exclusion
list, so a PR that only touches `.github/workflows/*.yml` still flips
`code == true` and runs typecheck (~2 min on the runner).

Exclude `.github/**` from `code`, then re-include just `pr_checks.yml`
and `typecheck.yml` so a change to either of those still triggers the
full code check matrix.

Effect:
- workflow-only PRs (this one, future dependabot/codeql/etc.) skip
typecheck; `all-checks` treats the skipped job as non-failure so the
required status passes.
- modifying `pr_checks.yml` or `typecheck.yml` themselves still triggers
typecheck.
- the existing per-suite filters (`webapp`, `packages`, `internal`,
`cli`, `sdk`) already re-include the specific workflows that gate them,
so they're unaffected.
ericallam and others added 30 commits June 15, 2026 11:55

Two defensive fixes to the native realtime backend's run-change
publishing (behind a feature flag, off by default), so turning it on can
never destabilize the run lifecycle.

**Never throws at the caller.** Publish sites run synchronously on the
run-engine event bus and the metadata flush loop. The internal publish
was already wrapped in try/catch, but lazy construction (singleton +
metrics) and record encoding ran before that guard, so a throw could
propagate into a run lifecycle operation. The public
`publishChangeRecord` / `publishManyChangeRecords` helpers now wrap the
whole call and log-and-drop on failure.
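
The important detail is that lazy construction and encoding sit inside the guard, not just the publish itself. A sketch under assumed names (`Publisher`, `publishChangeRecordSafely`, and the injected `getPublisher` / `encode` / `logError` are all illustrative):

```typescript
interface Publisher {
  publish(payload: string): void;
}

// Returns true if the record was handed to the publisher, false if dropped.
function publishChangeRecordSafely<R>(
  record: R,
  getPublisher: () => Publisher, // lazy singleton construction may throw
  encode: (record: R) => string, // encoding may throw
  logError: (message: string, error: unknown) => void
): boolean {
  try {
    const publisher = getPublisher();
    publisher.publish(encode(record));
    return true;
  } catch (error) {
    // Log and drop: a lost publish is latency-only, never a lifecycle failure.
    logError("Dropping run-change publish", error);
    return false;
  }
}
```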

**Bounds outage buffering.** The publisher connection caps
`maxRetriesPerRequest` at 1 (vs ioredis's default of 20), so during a
pub/sub Redis outage a publish rejects after ~1 reconnect cycle instead
of holding commands in memory for ~20s. A dropped publish is
latency-only, since the consumer has a periodic backstop full-resolve.
The offline queue stays on, so the first publish after a process boots
still flushes once the connection is ready.
…age (#3947)

## Summary

The webapp Docker image build runs `pnpm run build --filter=webapp...`,
which builds `@trigger.dev/sdk` as a dependency. The SDK's `build`
script recently gained a `bundle-docs` step (`tsx
../../scripts/bundleSdkDocs.ts`), but the build couldn't run it in the
pruned image, breaking the image build.

Two things were missing:

- `docker/Dockerfile` copied `scripts/updateVersion.ts` into the builder
stage but not `scripts/bundleSdkDocs.ts`, so the step failed with
`ERR_MODULE_NOT_FOUND`.
- Even with the script present, the repo-level `docs/` tree it reads is
a separate workspace package that isn't in webapp's dependency graph, so
`turbo prune --scope=webapp` excludes it — the script's missing-docs
guard would then fail the build.

## Design

The Dockerfile now copies `bundleSdkDocs.ts` alongside
`updateVersion.ts`. `bundleSdkDocs.ts` skips gracefully when the repo
`docs/` tree is absent, which is exactly the pruned-dependency-build
case (the SDK is compiled there but never published). Publishing always
runs from the full monorepo where `docs/` exists, so the missing-docs
guard still protects releases — it only fires when `docs/` is present
but a cited doc is genuinely missing, rather than when the whole tree
was pruned away. This avoids dragging 27M of docs into a throwaway
builder stage.
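
The guard's two cases can be separated like this. A sketch only: `planDocsBundle`, `BundlePlan`, and the injected `docExists` predicate are hypothetical, and the real `bundleSdkDocs.ts` presumably checks the filesystem directly.

```typescript
type BundlePlan = { action: "skip" } | { action: "bundle"; docs: string[] };

function planDocsBundle(
  docsTreeExists: boolean,
  citedDocs: string[],
  docExists: (path: string) => boolean
): BundlePlan {
  // Pruned build: the whole docs/ tree is absent, so skip quietly.
  if (!docsTreeExists) return { action: "skip" };
  // Full monorepo: a missing cited doc is a real error and must fail the build.
  const missing = citedDocs.filter((doc) => !docExists(doc));
  if (missing.length > 0) {
    throw new Error(`Missing cited docs: ${missing.join(", ")}`);
  }
  return { action: "bundle", docs: citedDocs };
}
```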

## Test plan

- [x] `bundle-docs` from the full monorepo still bundles all cited docs
(exit 0)
- [x] Simulated pruned tree without `docs/` skips cleanly instead of
failing
- [ ] Webapp Docker image build succeeds in CI

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e menu (#3941)

Major dashboard restructure plus the new task landing pages and
self-serve schedules add-on integration.

## Side menu

- Full restructure: standalone Tasks / Runs / Sessions block at the top;
new collapsible sections for AI, Observability, Deployments, Manage
- Persisted collapse state per section in `dashboardPreferences`
- New / updated icons across the menu
- Dashboards section: built-in Run metrics + AI metrics + custom
dashboards, with drag-to-reorder via ReactGridLayout
(`DashboardList.tsx`)
- DevPresence connection indicator in the env selector (DEV + V2)

## Tasks (`_index` — unified Tasks page)

- Replaces the separated Agents / Standard / Schedules listing pages
with one table
- New `UnifiedTaskListPresenter` composes `TaskListPresenter` +
`AgentListPresenter` (shared `currentWorker` lookup)
- Columns: Type (with kind badge), ID, File, Running (numeric for tasks;
running + suspended pills for agents), Activity (24h stacked-by-status),
sticky menu
- Search + "Task type" multi-select filter (URL-synced)
- Client-side pagination at 25/page
- Right-hand "useful links" panel (cookie-persisted state)
- Live-reload SSE: page revalidates on `WORKER_CREATED` so onboarding
`trigger dev` flips the blank state automatically

## Agent landing page (`/agents/$agentParam`)

- New per-agent detail page
- Top tabs (Sessions / Runs) toggle both the chart panel and the table
- Three dashboard-style chart cards: Sessions/Runs activity, LLM spend,
Tokens
- `AgentDetailPresenter` queries ClickHouse for run activity, session
activity (with FINAL on `sessions_v1`), and LLM cost/token activity from
`llm_metrics_v1`
- TimeFilter at the top drives all three charts
- Sticky table header, resizable horizontal handle, sidebar with Test
agent button + properties
- Docs link → `ai-chat/overview`

## Standard Task landing page (`/tasks/standard/$taskParam`)

- New per-task detail page mirroring the Agent layout
- `TaskDetailPresenter` for activity + properties
- Chart panel wrapped in a Card with "Runs by status" header
- Top bar with title, TimeFilter, pagination
- Right sidebar: Test task + identifier, queue, machine, retry, TTL,
payload schema, etc.

## Scheduled Task landing page (`/tasks/scheduled/$taskParam`)

- New per-task detail page mirroring the Agent / Standard layout
- Top-bar actions (right → left): pagination, Bulk replay…, View all
runs, TimeFilter, Create schedule
- Connected schedules mini-table in the sidebar
- **Self-serve schedules add-on integration** (reincarnated from the
now-removed `/schedules` listing page during the `origin/main` merge):
  - Bottom usage bar pinned via `grid-rows-[auto_1fr_auto]`: progress
    ring + "X/Y of your schedules" + Purchase / Upgrade / Request CTA
  - At-limit "Create schedule" intercept dialog
  - `PurchaseSchedulesModal` extracted as a shared component
    (`apps/webapp/app/components/schedules/PurchaseSchedulesModal.tsx`)
    handling increase / decrease / above-quota / need-to-delete states
  - New resource action route at
    `/resources/orgs/$organizationSlug/schedules-addon`

## Sessions

- Index page: list, filters, blank state, help tooltip rework
- Detail page: combined input/output chronological view (replaces split
tabs)
- Improved raw-message view layout (full-height)
- AI payload UI: `data-*` parts grouped under "AI SDK data parts:" label
- `toSafeUrl` helper guards rendered URLs from streamed content
- Fix: duplicate assistant content on inspector tab switch

## Playground (Test agent)

- Restructured top menu; back button + agent-selector popover
- Improved blank state
- Recent agent chat history moved into the tabbed menu
- Better message-scroll container (full height)

## Dashboards

- New Dashboards landing page (`/dashboards`) — Run metrics, AI metrics,
Create your own CTAs
- `BuiltInDashboards` updated; new `TasksDashboardPresenter` for the
tasks overview
- Custom dashboards section gains drag-to-reorder; cosmetic fix for
active-row drag-handle blending

## PageHeader / shared primitives

- `PageTitle` gains an `accessory` prop supporting string (auto-wrapped
in tooltip) and ReactNode
- Help tooltips on Tasks, Runs, Sessions PageTitles explaining the
concept and sub-categories
- `Card` primitive used for dashboard-style chart panels throughout

## Code review fixes (last batch on this branch)

- ClickHouse activity queries hardened: `FINAL` + `_is_deleted = 0` on
`task_runs_v2` (ReplacingMergeTree); `organization_id` + `project_id`
filters for sort-key prefix; `inserted_at` partition filter on
`llm_metrics_v1`
- `UnifiedTaskListPresenter`: shared `currentWorker` lookup;
slug-collision guard in `mergeRunningStates`; off-by-one fixed in 24h
bucket alignment
- `ScheduleListPresenter`: halved platform RPCs by deriving limit from
`currentPlan` instead of calling `getLimit`
- Sessions detail: stopped IntersectionObserver / scroll listener
re-attach on every chunk; `requestAnimationFrame` deferral on
auto-scroll to avoid virtualizer race
- URL hardening: `?types=` validated against known kinds; new
`parseFiniteInt` helper applied to `from`/`to`/`page` params
- AgentView: HITL resolution buffer now cleared once parts reach a
terminal state (was an unbounded Map on long sessions); subscription
effect deps documented with eslint suppression
- `PurchaseSchedulesModal`: bundle state resets on each open instead of
persisting stale drafts
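
The URL-hardening item above could look roughly like the following. This is a minimal sketch: the PR names `parseFiniteInt` but does not show it, so the signature, the fallback behaviour, and the `KNOWN_KINDS` list are assumptions made for illustration.

```ts
// Hypothetical sketch; the real helper's signature is not shown in this PR.
function parseFiniteInt(raw: string | null | undefined, fallback?: number): number | undefined {
  if (raw == null || raw.trim() === "") return fallback;
  const n = Number(raw);
  // Reject NaN, Infinity, and non-integers such as "1.5" or "12abc".
  return Number.isFinite(n) && Number.isInteger(n) ? n : fallback;
}

// Assumed set of task kinds; the actual values live in the webapp.
const KNOWN_KINDS = ["agent", "standard", "scheduled"] as const;
type Kind = (typeof KNOWN_KINDS)[number];

// Keep only recognised kinds from a comma-separated `?types=` value.
function parseTypes(raw: string | null): Kind[] {
  if (!raw) return [];
  return raw.split(",").filter((k): k is Kind => KNOWN_KINDS.indexOf(k as Kind) !== -1);
}
```

Unknown kinds are dropped rather than rejected here, which keeps a stale or hand-edited URL usable; the actual route may choose to handle them differently.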

## Manual testing

Manual smoke-test plan is tracked under
[TRI-10883](https://linear.app/triggerdotdev/issue/TRI-10883), broken
into 20 sub-issues covering onboarding, self-serve schedules, side menu,
the four landing pages, sessions, runs, dashboards, regressions and
performance.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
)

## Summary

Several optional workflow jobs fail on forks and private mirrors that
lack org-specific secrets or registry permissions. This adds per-job
repository-variable gates so those deployments can switch them off
without editing workflows — matching the pattern from #3901
(`ENABLE_CLAUDE_CODE` / `ENABLE_WORKFLOW_SECURITY_SCAN`).

Two variables, both **default-enabled** (a job runs unless its variable
is explicitly `'false'`), so canonical-repo behaviour is unchanged where
the variables are unset:

**`ENABLE_HELM_PRERELEASE`** — gates the chart-publish jobs that push to
`oci://ghcr.io/<owner>/charts` (needs `write_package` on the owner's
charts namespace):
- `helm-prerelease.yml` → `prerelease` job
- `release-helm.yml` → `release` job

Without the permission these fail with `403: denied: permission_denied:
write_package` on every PR / `helm-v*` tag. The `lint-and-test` jobs
(lint + template + kubeconform, no push) always run, so chart validity
is still enforced everywhere.

**`ENABLE_DEPENDABOT_ALERTS`** — gates the Dependabot notifier crons
that need `DEPENDABOT_ALERTS_TOKEN` / `SLACK_BOT_TOKEN` and post to a
specific Slack:
- `dependabot-critical-alerts.yml` → `alert` job (daily cron)
- `dependabot-weekly-summary.yml` → `summary` job (weekly cron)

On a fork/mirror these otherwise fire on schedule and fail (or post
nowhere) indefinitely.
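
The default-enabled rule can be modelled in a few lines. This is only a sketch of the semantics described above, written in TypeScript for illustration; in the workflows themselves the gate is a job-level condition, and `isJobEnabled` is a hypothetical name. How the workflow expression treats casing or whitespace is not modelled here.

```ts
// Models the gate semantics only: a job runs unless its variable is the string "false".
function isJobEnabled(variable: string | undefined): boolean {
  // Unset, empty, or any other value keeps the job on.
  return variable !== "false";
}
```

Treating "unset" as enabled is what keeps the canonical repository's behaviour unchanged without anyone having to define the variables there.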

## Test plan

- Variables unset (default): all jobs run as today.
- `ENABLE_HELM_PRERELEASE=false`: helm `lint-and-test` runs, publish
jobs skip — no 403 on repos lacking `write_package`.
- `ENABLE_DEPENDABOT_ALERTS=false`: the two cron jobs skip cleanly
(neutral, not failed).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
)

### Summary 
Self-serve billing UI is now hidden for managed-billing organizations.

Plan pickers, upgrade actions, billing alerts, and related upgrade
prompts are replaced with a "Contact us" option where appropriate.

Uses the new `showSelfServe` subscription flag, defaulting to `true` for
existing self-serve organizations.

### Testing

- [x] billing pages render correctly for self-serve organizations.
- [x] managed-billing organizations no longer see self-serve upgrade
flows.
- [x] "Contact us" actions are shown instead of upgrade actions where
applicable.

### Changelog

Hide self-serve billing flows for managed-billing organizations behind
the new `showSelfServe` subscription flag.
## Summary

`chat.agent`'s system prompt (the `chat.prompt` text plus any skills
preamble) could not carry a provider cache breakpoint, so the largest
and most stable part of the prompt re-paid full input price on every
turn. `chat.toStreamTextOptions()` now emits the system prompt as a
structured message carrying `providerOptions` when you opt in, so a
provider can cache the system block. Without an option, `system` stays a
plain string, so existing behavior is unchanged.

## API

Three ways to opt in (most specific wins, no deep merge):

```ts
// Anthropic sugar
chat.toStreamTextOptions({ cacheControl: { type: "ephemeral" } });
// provider-agnostic (also covers Amazon Bedrock's cachePoint)
chat.toStreamTextOptions({ systemProviderOptions: { anthropic: { cacheControl: { type: "ephemeral" } } } });
// at the definition site
chat.prompt.set(SYSTEM_PROMPT, { providerOptions: { anthropic: { cacheControl: { type: "ephemeral" } } } });
```

The `cacheControl` shorthand is Anthropic-only; `systemProviderOptions`
is the general form. Pairs with a `prepareMessages` cache breakpoint to
cache the conversation prefix too.

Docs guide: #3951
…locks (#3954)

## Summary

The script that generates the changeset release PR description was
silently dropping some changelog entries and stripping code examples. In
[#3932](#3932), entry [#3937](#3937) was missing entirely from the
Improvements list and [#3952](#3952)'s code block was gone, even though
both were present in the raw changeset output.

## Root cause

`parsePrBody` parsed the raw changeset body line by line:

- The dependency-bump filter matched any entry whose text *began* with a
backticked package name, so a real changelog entry like ``
`@trigger.dev/sdk` now bundles... `` got thrown out along with the
genuine version-bump lines.
- Only the first line of each bullet was kept, so fenced code blocks,
sub-bullets, and continuation paragraphs were discarded.

## Fix

Group each top-level bullet with its indented continuation (code blocks,
sub-bullets, paragraphs), dedent it, and re-emit it intact. The
dependency filter is now anchored so it only matches lines that are
*entirely* a package bump, leaving real entries that merely start with a
package name.

Verified by replaying #3932's raw body through the script: #3937 returns
to the list, #3952's code block is preserved, and #3936's sub-bullets
nest correctly under their parent.
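
The two fixes can be sketched as follows. This is an illustration under assumptions, not the script's code: `parseEntries` and `PACKAGE_BUMP` are hypothetical names, and the regex is one plausible way to anchor the filter so it matches only lines that consist entirely of a backticked `name@version`.

```ts
// Matches a bullet whose whole text is a package bump, e.g. `@trigger.dev/core@4.5.0-rc.7`.
const PACKAGE_BUMP = /^`(?:@[\w.-]+\/)?[\w.-]+@[^`\s]+`$/;

function parseEntries(raw: string): string[] {
  const groups: string[][] = [];
  let open = false;
  for (const line of raw.split("\n")) {
    const bullet = /^-\s+(.*)$/.exec(line);
    if (bullet) {
      groups.push([bullet[1]]); // a new top-level bullet starts a group
      open = true;
    } else if (open && (line.trim() === "" || /^\s/.test(line))) {
      groups[groups.length - 1].push(line); // blank or indented line: continuation
    } else {
      open = false; // headings or other top-level text end the current bullet
    }
  }
  return groups
    .filter(([first]) => !PACKAGE_BUMP.test(first.trim()))
    .map(([first, ...rest]) => {
      // Dedent continuation lines by their smallest shared indent, then nest them.
      const indents = rest.filter((l) => l.trim() !== "").map((l) => /^\s*/.exec(l)![0].length);
      const cut = indents.length > 0 ? Math.min(...indents) : 0;
      const tail = rest.map((l) => (l.trim() === "" ? "" : "  " + l.slice(cut)));
      return ["- " + first, ...tail].join("\n").replace(/\s+$/, "");
    });
}
```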
Adds `pnpm.overrides` pinning a few transitive deps to their current
releases:

- `js-cookie` → 3.0.7
- `tmp` → 0.2.7
- `brace-expansion` → 1.1.13 / 2.0.3 / 5.0.6 (one entry per major)

Each override is scoped to the affected major range so unaffected majors
aren't dragged forward. Also drops the `fast-xml-builder` override,
which no longer resolves to anything in the tree.

Lockfile-only - no published package's dependencies change.
`js-cookie`/`tmp` parents pin ranges that can't reach the new versions
on their own, so overrides (not a plain lockfile refresh) are needed to
hold them.
Currently the `db:seed` script just hangs on success.

This PR adds `process.exit(0)` to the finally block after db disconnect
so the script exits properly.

---------

Co-authored-by: Chris Arderne <chris@trigger.dev>
…t dispatches (#3918)

### Problem

Firestarter's `didWarmStart: true` means the response was written to a
long-poll socket — not that the runner received it. A silently dead
poller (no FIN, e.g. a VM torn down mid-poll) leaves the dispatched run
stuck in `PENDING_EXECUTING` until the run engine's heartbeat redrive,
and each redrive burns a queue redelivery toward
`TASK_RUN_DEQUEUED_MAX_RETRIES`.

### Change

After a warm-start hit, the supervisor retains the `DequeuedMessage`
(TimerWheel, default 10s), then probes the existing `getLatestSnapshot`
API. If the run is still on the exact dequeued snapshot, no runner ever
acted — it falls through to the regular cold-create path. Recovery: ~10s
+ cold start, no new APIs, no CLI changes.

- **Double-start safe**: `startRunAttempt` runs under a per-run lock and
409s stale snapshot ids, so a reviving runner and the fallback workload
can't both execute; the loser exits before running anything.
- **Probe errors → do nothing**: healthy runners legitimately act late
during platform brownouts (nested attempt-start retries), so falling
back on uncertainty would stampede duplicates. The heartbeat redrive
stays as the backstop (also covers supervisor restarts dropping timers).
- **Off by default**: `TRIGGER_WARM_START_VERIFY_ENABLED` (+
`TRIGGER_WARM_START_VERIFY_DELAY_MS`, 1–60s, default 10s). Disabled =
complete no-op. Works for all workload managers (compute/k8s/docker)
since it hooks the shared dequeue path.
- Emits `warmstart.verify` wide events (`outcome: delivered | fallback |
probe_error`), making the silent-loss rate directly measurable.
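
The decision made after the delay can be summarised as a small pure function. This is a sketch under assumptions: `ProbeResult` and `classifyWarmStart` are hypothetical names, and the real supervisor works with the snapshot returned by `getLatestSnapshot` rather than this simplified shape.

```ts
// Hypothetical types; the real supervisor and run-engine shapes are not shown in this PR.
type ProbeResult =
  | { ok: true; latestSnapshotId: string }
  | { ok: false; error: string };

type VerifyOutcome = "delivered" | "fallback" | "probe_error";

function classifyWarmStart(dequeuedSnapshotId: string, probe: ProbeResult): VerifyOutcome {
  // Probe failed: do nothing and leave the heartbeat redrive as the backstop.
  if (!probe.ok) return "probe_error";
  // Still on the exact dequeued snapshot: no runner acted, so fall back to cold-create.
  if (probe.latestSnapshotId === dequeuedSnapshotId) return "fallback";
  // The snapshot moved on, so a runner picked the run up.
  return "delivered";
}
```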
…3963)

## Summary

`chat.headStart` (the warm step-1 fast path) previously handed its
response over only to `chat.agent`. This extends handover to the other
two backends: `chat.customAgent` consumes it with
`conversation.consumeHandover({ payload })` on turn 0, and
`chat.createSession` surfaces it as `turn.handover` (call
`turn.complete()` with no source to finalize a pure-text handover). The
low-level `chat.waitForHandover()` and `accumulator.applyHandover()` are
exported for hand-rolled loops.

It also adds `triggerConfig` to `chat.headStart()` and
`chat.openSession()`, so the auto-triggered handover-prepare run
inherits tags, queue, machine, and the other session run options the
same way `chat.createStartSessionAction()` does. The `chat:{chatId}` tag
is prepended automatically. Because the session is created once on the
first head-start turn (idempotent on the chat id), this is the only
place those options can be set for a head-start chat's lifetime.

## Fix: tool-call resume

When the warm step-1 hands over a pending tool call (rather than pure
text), the agent loop resumes that tool round. For it to merge cleanly
the pipe threads the spliced partial as `originalMessages`, so the
resumed tool-output chunk attaches to the handed-over tool-call instead
of throwing `No tool invocation found`. `MessageAccumulator.addResponse`
now also dedups by id (replace-in-place), so the persisted history
doesn't carry a duplicate assistant message when the resumed response
reuses the partial's id.

Incorporates the `triggerConfig` work from
[#3933](#3933) by
@saasjesus, with `createStartSessionAction` extended to also forward
`maxDuration`, `region`, and `lockToVersion` so the two session entry
points stay consistent.

Verified end-to-end against a local environment: handover (pure-text and
tool-call) on both new backends, a `chat.agent` regression pass, and
`triggerConfig` tags and queue landing on the run.

---------

Co-authored-by: saasjesus <armin@chatarmin.com>
## Summary

Reworks the scheduled task page right-hand sidebar.

- Adds **Overview** / **Schedules** tabs. The Schedules tab is a
paginated table of all schedules attached to the task, declarative
first.
- Surfaces schedule fields (ID, CRON + human-readable description,
next/last run, status) directly in the Overview property table.
- Sidebar can be dragged much wider (up to 80% of the viewport).
- "No schedules attached" panel explains declarative vs imperative and
links to docs.
- Schedule **create / edit / enable / disable / delete** all happen
inside the existing Sheet — no more navigating to the standalone
schedule page. Toasts confirm each action.

## Test plan

- Open a scheduled task page and verify the new tabs
- Create, edit, enable/disable, and delete a schedule — confirm you stay
on the page and see a toast each time
- Visit a task with no schedules attached and confirm the info panel
renders
- Drag the sidebar wider; confirm pagination shows when there are >25
schedules
## Summary

Docs deploy from the `docs-live` branch via Mintlify, so merging to
`main` no longer publishes docs on its own. To publish, push a
`docs-release-*` tag at the commit you want live. The workflow runs the
Mintlify broken-links check against that commit, then fast-forwards
`docs-live` to it, which is what Mintlify deploys from.

## Design

The ref move uses the GitHub API with `force=false`, making it
fast-forward only: a tag that is not ahead of `docs-live` fails the job
rather than rewinding production. Mintlify's GitHub app reacts to the
resulting push and deploys, so no extra deploy credentials are needed.
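
For reference, the fast-forward-only move corresponds to GitHub's "update a reference" REST endpoint called with `force: false`. The sketch below only builds the request; `buildFastForward` is a hypothetical helper, and the workflow may well issue the call through `gh api` or an action instead.

```ts
// Builds the REST call only; owner, repo, and token handling are placeholders.
interface RefUpdateRequest {
  method: "PATCH";
  url: string;
  body: { sha: string; force: boolean };
}

function buildFastForward(owner: string, repo: string, branch: string, sha: string): RefUpdateRequest {
  return {
    method: "PATCH",
    url: `https://api.github.com/repos/${owner}/${repo}/git/refs/heads/${branch}`,
    // force: false makes GitHub reject the update unless it is a fast-forward.
    body: { sha, force: false },
  };
}
```

Because the request never sets `force: true`, a tag that points behind or beside `docs-live` produces an API error, which is what fails the job.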

Usage:

```bash
git tag docs-release-2026.06.16   # tag the main commit you want live
git push origin docs-release-2026.06.16
```
…3964)

## Summary

`chat.headStart` now works with the `chat.customAgent` and
`chat.createSession` backends (not just `chat.agent`), and takes a
`triggerConfig` option. These docs cover both.

The Fast starts guide gets a "Handover with custom agents" section
showing how each backend consumes the handover (`consumeHandover`
returning `{ isFinal, skipped }` for custom agents, `turn.handover` for
createSession), including threading `originalMessages` so a resumed tool
round merges into the handed-over assistant. The `chat.headStart` API
section documents `triggerConfig` (tags, queue, machine, and the rest)
on the auto-triggered run.

The reference picks up `ChatTurn.handover`, `turn.complete()` with no
source, `chat.waitForHandover`, and a new `HeadStartHandlerOptions`
table.

Docs for the SDK changes in
[#3963](#3963).
…served keys (#3966)

Fix Vercel onboarding wizard to properly filter out reserved TRIGGER_
env vars
## Summary

New `/ai-chat/prompt-caching` guide covering how to cache a chat agent's
prompt prefix with Anthropic prompt caching: the system prompt, the
conversation history (a `prepareMessages` breakpoint), and how caching
interacts with compaction. It also shows how to verify cache hits via
usage and the dashboard, the prefix-stability footguns, and an "Other
providers" section (OpenAI and Google cache automatically; Amazon
Bedrock uses `cachePoint` through `systemProviderOptions`).

Registered under Features in the AI Agents nav, next to Compaction.

---------

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Eric Allam <ericallam@users.noreply.github.com>
## Summary

The "What extractNewToolResults returns" reference in the
tool-result-auditing guide did not match the SDK. It listed an `input`
field that `chat.history.extractNewToolResults()` never returns, and
marked `output` as optional when it is always present.

This corrects the block to the real `ChatNewToolResult` shape
(`toolCallId`, `toolName`, `output`, optional `errorText`). Every usage
example in the same guide already reads only those fields, so the
reference now matches both the examples and the code.
…3958)

## Summary

The Models page is now split into two tabs. **Your models** shows the
models your project has actually used in the selected time range, with
usage charts (cost over time, tokens over time, calls by model), a
per-model table of calls / cost / avg TTFC / avg tokens-per-sec, and
calls/tokens trend sparklines. **Model library** is the full catalog,
reordered from alphabetical to a relevance-based provider order
(Anthropic, OpenAI, Google, then the rest), newest models first within
each provider, with a "New" badge on models released in the last 7 days.

One time-range selector drives the whole Your models tab, so the charts,
the table, and the sparklines all share the same window. Opening a model
shows its own metrics with an independent range picker and a "View in AI
metrics" link that opens the AI metrics dashboard filtered to that
model. The active tab is kept in the URL so it survives a refresh and is
shareable.

## Prompt caching & cost accuracy

Both the Your models tab and the AI metrics dashboard now surface
prompt-cache usage: a cache-savings column plus per-model cached-tokens
and cache-hit-rate views, and a caching section on the dashboard (hit
rate, cached tokens, estimated savings, and hit rate by model).

Building this surfaced a cost bug. `input_tokens` is the total prompt
count and already includes cache-read and cache-creation tokens, but the
cost pipeline charged the full input at the input price and then added a
separate cache line, so cached tokens were billed twice (and on
Anthropic, cache reads were never discounted because their price is
keyed differently). The input price now applies only to the non-cached
remainder, with cache prices resolved across the provider-specific keys,
so LLM cost and the cache hit-rate metric are accurate. Hit rate is
computed as cached reads over total input.
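
A worked sketch of the corrected arithmetic may help. The field names and per-token prices below are assumptions chosen for illustration, not the cost pipeline's actual schema.

```ts
// Illustrative only: field names and prices are assumptions, not the pipeline's schema.
interface Usage {
  inputTokens: number; // total prompt tokens, already including cached tokens
  cacheReadTokens: number;
  cacheCreationTokens: number;
  outputTokens: number;
}
interface Prices {
  input: number; // all prices are per token
  cacheRead: number;
  cacheCreation: number;
  output: number;
}

function llmCost(u: Usage, p: Prices): number {
  // Price only the non-cached remainder at the input rate to avoid double billing.
  const nonCached = Math.max(0, u.inputTokens - u.cacheReadTokens - u.cacheCreationTokens);
  return (
    nonCached * p.input +
    u.cacheReadTokens * p.cacheRead +
    u.cacheCreationTokens * p.cacheCreation +
    u.outputTokens * p.output
  );
}

function cacheHitRate(u: Usage): number {
  // Cached reads over total input.
  return u.inputTokens === 0 ? 0 : u.cacheReadTokens / u.inputTokens;
}
```

With 1,000 input tokens of which 800 are cache reads and 100 are cache writes, only the remaining 100 tokens are charged at the input price, and the hit rate is 800 / 1,000.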

## Notes

Also fixes React "invalid DOM property" console warnings from the
provider icons (the Llama and DeepSeek SVGs used raw `fill-rule` /
`clip-rule` / `clip-path` attributes), which this page surfaces by
rendering more provider icons.

## Screenshots

**Your models tab:** usage charts and a per-model table with
calls/tokens trend sparklines.

<img width="2560" height="1267" alt="1-your-models-tab"
src="https://github.com/user-attachments/assets/859bd24f-9047-4828-8bbb-83e5882846d6"
/>


**Model library:** provider-relevance ordering with a "New" badge on
models released in the last 7 days.

<img width="2560" height="1267" alt="2-model-library-tab"
src="https://github.com/user-attachments/assets/46dd54b9-80f9-4922-ade9-5935b08dfebc"
/>


**Model detail, Metrics tab:** per-model range picker and a "View in AI
metrics" link.

<img width="2560" height="1267" alt="3-model-detail-metrics"
src="https://github.com/user-attachments/assets/0f65d9d0-6142-4918-93f0-110bb277101a"
/>


**View in AI metrics:** the dashboard deep-linked and filtered to the
selected model.

<img width="2560" height="1267" alt="4-ai-metrics-filtered"
src="https://github.com/user-attachments/assets/821f256c-e305-493c-98c7-eafaf2f57f83"
/>
…#3939)

## Summary

The agent skills' deep guidance now ships inside `@trigger.dev/sdk` and
is read from `node_modules`, so it tracks the `@trigger.dev/sdk` version
installed in your project automatically. This updates the Skills page,
the Building with AI step, and the rules-redirect page to drop the old
"pinned to the CLI version, re-run to refresh" framing and describe the
version-pinned reference instead.

Pairs with the SDK/CLI change in #3937. Keep this draft until that
ships, since it describes behavior that is not released yet.
## Summary

Typing in the search bar on the task page could clear or reset the input
mid-keystroke. This fixes the re-render race so the field stays stable
while you type.

## Root cause

Two things compounded:

- `SearchInput`'s sync effect depended on `text`, so it re-ran on every
keystroke and could overwrite the input with the URL/controlled value
while focused.
- Each task row unmounted and remounted its activity chart during the
side-panel open/close animation (25 charts at once), forcing heavy
re-renders that the search effect raced against.

## Fix

- `SearchInput` now tracks the last synced value in a ref instead of
comparing against `text`, keeping the effect off the keystroke path. It
only writes to state when the incoming URL/controlled value actually
changes, and never while the input is focused.
- Activity charts are now hidden (`hidden` attribute) instead of
unmounted during the panel animation, so the rows don't churn the tree
and the resize stays smooth.
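
The `SearchInput` rule can be modelled without React as a pure function. This is a simplified sketch under assumptions: `SyncState` and `syncFromUrl` are hypothetical names, `lastSynced` stands in for the ref, and what the real component does with an incoming change that arrives while the input is focused is not shown in this PR.

```ts
// Framework-free model of the sync rule; the real component uses React state and a ref.
interface SyncState {
  text: string; // what the user currently sees in the input
  lastSynced: string; // last URL/controlled value written into the input
}

function syncFromUrl(state: SyncState, incoming: string, focused: boolean): SyncState {
  // Nothing new from the URL, or the user is typing: leave the input alone.
  if (incoming === state.lastSynced || focused) return state;
  return { text: incoming, lastSynced: incoming };
}
```

Because the comparison is against `lastSynced` rather than `text`, a keystroke that changes `text` does not by itself trigger a write back into the input.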

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ngs (#3970)

## Summary

Three improvements to the SDK-bundled agent skills (follow-up to the
skills installer):

- **`trigger-` namespace.** The installed skills (`authoring-tasks`,
`getting-started`, …) had generic names that collide with unrelated
skills in a shared agent skills directory. They're now prefixed —
`trigger-authoring-tasks`, `trigger-getting-started`, etc. — matching
the convention the public skills repo already uses.
- **New `trigger-cost-savings` skill.** An MCP-driven cost audit:
right-sizes machines, flags missing `maxDuration`, spots sequential
triggers that could batch, and reviews schedule frequency, using
`list_runs` / `get_run_details` for live analysis.
- **Bundle the full docs.** `@trigger.dev/sdk` now bundles the entire
"Documentation" section of the docs (157 pages) instead of a curated
55-page subset, so an agent has the complete, version-pinned reference
in `node_modules`.

## How the bundling works

`scripts/bundleSdkDocs.ts` now reads `docs/docs.json`, walks the
"Documentation" dropdown, and copies every page under it into the SDK.
The set tracks the docs navigation automatically — add a page to the nav
and it ships, no skill edits needed. The API reference and Guides &
examples dropdowns are intentionally excluded. A skill's `sources:`
frontmatter is now informational only.
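
The navigation walk might look something like this. It is a sketch, not the contents of `scripts/bundleSdkDocs.ts`: the `NavItem` and `Dropdown` shapes are assumptions about how `docs/docs.json` nests groups and pages, and `pagesForDropdown` is a hypothetical name.

```ts
// Assumed nav shape: a dropdown holds groups, and pages are page paths or nested groups.
type NavItem = string | { group: string; pages: NavItem[] };
interface Dropdown {
  dropdown: string;
  groups?: NavItem[];
  pages?: NavItem[];
}

// Depth-first walk that flattens nested groups into a list of page paths.
function collectPages(items: NavItem[], out: string[] = []): string[] {
  for (const item of items) {
    if (typeof item === "string") out.push(item);
    else collectPages(item.pages, out);
  }
  return out;
}

function pagesForDropdown(dropdowns: Dropdown[], name: string): string[] {
  const match = dropdowns.filter((d) => d.dropdown === name)[0];
  if (!match) return [];
  return collectPages([...(match.groups ?? []), ...(match.pages ?? [])]);
}
```

Selecting by dropdown name is what leaves the other dropdowns out of the bundle without an explicit exclude list.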

The dropped idea of a dedicated `trigger-config` skill is replaced by
references to the bundled build-extension docs (`config/extensions/*`)
from the `trigger-authoring-tasks` config section and the chat-agent
skills.
Adds an opt-in mechanism to route a configurable percentage of
organizations onto the compute (MicroVM) backing of their region at
trigger time, without changing their stored region settings.

Routing is gated by three global feature flags -
`computeMigrationEnabled`, `computeMigrationFreePercentage`,
`computeMigrationPaidPercentage` - plus a per-org
`computeMigrationEnabled` override that wins in both directions. A
region's compute backing is resolved from a new
`WorkerInstanceGroup.region` column: a container group and its MicroVM
group share one geo `region`, so the migration swaps the resolved worker
queue to the backing group's queue. Orgs are bucketed deterministically
by id, so ramping a percentage down keeps a strict subset rather than
reshuffling, and a region with no compute backing is never touched.
Everything is off by default - behaviour is unchanged unless the flags
are set.
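
A minimal sketch of deterministic bucketing shows why ramping down keeps a strict subset. The hash (FNV-1a), the 100-bucket scheme, and the function names are assumptions; the PR does not say how ids are bucketed, and the global enable flag and the free/paid split are left out here.

```ts
// Hypothetical bucketing: the hash (FNV-1a) and the 100-bucket scheme are assumptions.
function bucketFor(orgId: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < orgId.length; i++) {
    hash ^= orgId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0) % 100; // stable bucket in [0, 100)
}

function isMigrated(orgId: string, percentage: number, override?: boolean): boolean {
  // A per-org override wins in both directions.
  if (override !== undefined) return override;
  // The same id always lands in the same bucket, so lowering the percentage
  // only removes orgs; it never swaps in new ones.
  return bucketFor(orgId) < percentage;
}
```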

The flags and the worker-region groups are read on the trigger hot path
from in-memory snapshots rather than the database: a small
`createReloadingRegistry` helper loads each at startup and refreshes
them on an interval, so no per-trigger query is added and a percentage
or kill-switch change propagates within the reload interval. A cold
replica whose snapshot hasn't loaded yet reads as not-migrated (the
container path) and self-corrects on the next load - the same cold-start
contract as the datastore / LLM-pricing registries, with a
`reloading_registry_loaded` metric so a never-loaded registry is
alertable.

The same migration decision is consulted at deploy-time template
creation so a migrated org gets a compute template built ahead of its
first run. This runs in shadow mode (best-effort, never fails the
deploy) by default, or - when the `computeMigrationRequireTemplate` flag
is on - in required mode, built synchronously at deploy so the first run
never builds on-demand and template errors surface at deploy time.

So operators keep "which runs ran where" while customers only see
geography: the run's actual worker queue is stored raw, and the geo
region is stamped separately on `TaskRun.region` (and a new ClickHouse
`region` column) at trigger time. Read surfaces - the dashboard, the
API, and the Query/Logs page - show the geo region, falling back to the
worker queue for runs written before the column existed.

Minor follow-ups left out of scope: the percentage flags render as text
inputs on the admin flags page (the catalog UI has no numeric control
type yet), and `createReloadingRegistry` could later gain pub/sub for
sub-second cross-replica propagation if the reload interval proves too
slow.
## Summary
7 improvements.

## Improvements
- `@trigger.dev/sdk` now bundles the Trigger.dev agent skills and a
curated snapshot of the docs those skills reference. The skills that
`trigger skills` installs into your coding agent read this content from
node_modules, so the guidance your AI assistant follows is pinned to the
SDK version installed in your project and stays current across upgrades
instead of going stale until the next reinstall.
([#3937](#3937))
- Running a CLI command like `dev`, `deploy`, `preview`, or `update`
before initializing a project no longer crashes with a raw `Cannot find
matching package.json` stack trace. The CLI now detects the missing
project and points you to `npx trigger.dev@latest init` instead.
([#3929](#3929))
- The agent skills installed by `trigger skills` are now namespaced with
a `trigger-` prefix (e.g. `trigger-authoring-tasks`,
`trigger-getting-started`) so they don't collide with unrelated skills
in your coding agent's skills directory. Adds a `trigger-cost-savings`
skill for auditing and reducing compute spend (right-sizing machines,
`maxDuration`, batching, debounce), and `@trigger.dev/sdk` now bundles
the full Trigger.dev documentation so your agent can read the complete,
version-pinned reference directly from node_modules.
([#3970](#3970))
- The run span API response now includes `cachedCost` and
`cacheCreationCost` on the `ai` object, alongside the existing
`inputCost` / `outputCost` / `totalCost`. `inputCost` reflects only the
non-cached input, so these fields let you reconstruct the full cost
breakdown for prompt-cached calls.
([#3958](#3958))
- `chat.headStart` now works with the `chat.customAgent` and
`chat.createSession` backends, not only `chat.agent`. The warm step-1
response hands over to your loop the same way it does for a managed
agent. ([#3963](#3963))
  
  In a `chat.customAgent` loop, consume the handover on turn 0:
  
  ```ts
  const conversation = new chat.MessageAccumulator();
  const { isFinal, skipped } = await conversation.consumeHandover({ payload });
  if (skipped) return; // warm handler aborted, so exit without a turn
  if (isFinal) {
    await chat.writeTurnComplete(); // step 1 is the response, no streamText
  } else {
    const result = streamText({ model, messages: conversation.modelMessages, tools });
    // Pass originalMessages so the handed-over tool round merges into the
    // step-1 assistant instead of starting a new message.
    const response = await chat.pipeAndCapture(result, {
      originalMessages: conversation.uiMessages,
    });
    if (response) await conversation.addResponse(response);
  }
  ```
  
  With `chat.createSession`, the iterator surfaces it as `turn.handover`;
  call `turn.complete()` with no argument on a final handover. The
  lower-level `chat.waitForHandover()` and `accumulator.applyHandover()`
  are also exported for hand-rolled loops.
- Cache your chat agent's system prompt with Anthropic prompt caching.
`chat.toStreamTextOptions()` now emits the system prompt as a cacheable
message when you opt in, so a large, stable system block is billed at
cache-read rates on every turn instead of full price.
([#3952](#3952))
  
  ```ts
  // at the streamText call site (Anthropic sugar)
  streamText({
    ...chat.toStreamTextOptions({ cacheControl: { type: "ephemeral" } }),
    messages,
  });
  
  // provider-agnostic equivalent
  chat.toStreamTextOptions({
    systemProviderOptions: { anthropic: { cacheControl: { type: "ephemeral" } } },
  });
  
  // or where the prompt is defined
  chat.prompt.set(SYSTEM_PROMPT, {
    providerOptions: { anthropic: { cacheControl: { type: "ephemeral" } } },
  });
  ```
  
  Without an option, `system` stays a plain string. Pairs with a
  `prepareMessages` cache breakpoint to cache the conversation prefix
  across turns too.
- Three fixes for custom agent loops (`chat.customAgent`,
`chat.createSession`, and hand-rolled `MessageAccumulator` loops):
([#3936](#3936))
  
  - Continuation runs no longer replay already-answered user messages into
    the first turn. The `.in` resume cursor is now seeded before any
    listener attaches (the same boot logic `chat.agent` uses), so a chat
    that continues after a cancel, crash, or upgrade only sees genuinely new
    messages.
  - Steering a hand-rolled loop mid-stream no longer wipes the in-flight
    assistant response. `chat.pipeAndCapture` now stamps a server-generated
    message id on the stream, so a `prepareStep` injection keeps the partial
    text instead of replacing the message.
  - Task-backed tools (`ai.toolExecute`) now work from custom agent loops:
    the parent's session is threaded to the child run, so child tasks can
    stream progress into the chat with `chat.stream.writer({ target: "root" })`
    instead of failing with "session handle is not initialized".

<details>
<summary>Raw changeset output</summary>

⚠️⚠️⚠️⚠️⚠️⚠️

`main` is currently in **pre mode** so this branch has prereleases
rather than normal releases. If you want to exit prereleases, run
`changeset pre exit` on `main`.

⚠️⚠️⚠️⚠️⚠️⚠️

# Releases
## @trigger.dev/build@4.5.0-rc.7

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.5.0-rc.7`

## trigger.dev@4.5.0-rc.7

### Patch Changes

- `@trigger.dev/sdk` now bundles the Trigger.dev agent skills and a
curated snapshot of the docs those skills reference. The skills that
`trigger skills` installs into your coding agent read this content from
node_modules, so the guidance your AI assistant follows is pinned to the
SDK version installed in your project and stays current across upgrades
instead of going stale until the next reinstall.
([#3937](#3937))
- Running a CLI command like `dev`, `deploy`, `preview`, or `update`
before initializing a project no longer crashes with a raw `Cannot find
matching package.json` stack trace. The CLI now detects the missing
project and points you to `npx trigger.dev@latest init` instead.
([#3929](#3929))
- The agent skills installed by `trigger skills` are now namespaced with
a `trigger-` prefix (e.g. `trigger-authoring-tasks`,
`trigger-getting-started`) so they don't collide with unrelated skills
in your coding agent's skills directory. Adds a `trigger-cost-savings`
skill for auditing and reducing compute spend (right-sizing machines,
`maxDuration`, batching, debounce), and `@trigger.dev/sdk` now bundles
the full Trigger.dev documentation so your agent can read the complete,
version-pinned reference directly from node_modules.
([#3970](#3970))
-   Updated dependencies:
    -   `@trigger.dev/core@4.5.0-rc.7`
    -   `@trigger.dev/build@4.5.0-rc.7`
    -   `@trigger.dev/schema-to-json@4.5.0-rc.7`

## @trigger.dev/core@4.5.0-rc.7

### Patch Changes

- The run span API response now includes `cachedCost` and
`cacheCreationCost` on the `ai` object, alongside the existing
`inputCost` / `outputCost` / `totalCost`. `inputCost` reflects only the
non-cached input, so these fields let you reconstruct the full cost
breakdown for prompt-cached calls.
([#3958](#3958))
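
A minimal sketch of how a consumer might rebuild the input-side breakdown from these fields. The `AiCosts` type and the `inputCostBreakdown` helper are hypothetical; only the field names come from the entry above. Treating missing fields as zero, and reading `cachedCost` as the cost of cache reads, are assumptions.

```ts
// Hypothetical shape of the cost fields on the span's `ai` object.
// Field names are from the changelog entry; optionality is an assumption.
interface AiCosts {
  inputCost?: number; // non-cached input only
  outputCost?: number;
  totalCost?: number;
  cachedCost?: number; // presumably the cost of cache reads
  cacheCreationCost?: number; // presumably the cost of cache writes
}

interface InputCostBreakdown {
  uncachedInput: number;
  cachedInput: number;
  cacheCreation: number;
  allInput: number;
}

// Sum the three input-side fields, treating absent fields as zero.
function inputCostBreakdown(ai: AiCosts): InputCostBreakdown {
  const uncachedInput = ai.inputCost ?? 0;
  const cachedInput = ai.cachedCost ?? 0;
  const cacheCreation = ai.cacheCreationCost ?? 0;
  return {
    uncachedInput,
    cachedInput,
    cacheCreation,
    allInput: uncachedInput + cachedInput + cacheCreation,
  };
}
```

The sketch deliberately does not relate `allInput` to `totalCost`, since the entry does not say whether `totalCost` already includes the cache fields.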

## @trigger.dev/python@4.5.0-rc.7

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/sdk@4.5.0-rc.7`
    -   `@trigger.dev/core@4.5.0-rc.7`
    -   `@trigger.dev/build@4.5.0-rc.7`

## @trigger.dev/react-hooks@4.5.0-rc.7

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.5.0-rc.7`

## @trigger.dev/redis-worker@4.5.0-rc.7

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.5.0-rc.7`

## @trigger.dev/rsc@4.5.0-rc.7

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.5.0-rc.7`

## @trigger.dev/schema-to-json@4.5.0-rc.7

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.5.0-rc.7`

## @trigger.dev/sdk@4.5.0-rc.7

### Patch Changes

- `@trigger.dev/sdk` now bundles the Trigger.dev agent skills and a
curated snapshot of the docs those skills reference. The skills that
`trigger skills` installs into your coding agent read this content from
node_modules, so the guidance your AI assistant follows is pinned to the
SDK version installed in your project and stays current across upgrades
instead of going stale until the next reinstall.
([#3937](#3937))

- `chat.headStart` now works with the `chat.customAgent` and
`chat.createSession` backends, not only `chat.agent`. The warm step-1
response hands over to your loop the same way it does for a managed
agent. ([#3963](#3963))

    In a `chat.customAgent` loop, consume the handover on turn 0:

    ```ts
    const conversation = new chat.MessageAccumulator();
    const { isFinal, skipped } = await conversation.consumeHandover({ payload });
    if (skipped) return; // warm handler aborted, so exit without a turn
    if (isFinal) {
      await chat.writeTurnComplete(); // step 1 is the response, no streamText
    } else {
      const result = streamText({ model, messages: conversation.modelMessages, tools });
      // Pass originalMessages so the handed-over tool round merges into the
      // step-1 assistant instead of starting a new message.
      const response = await chat.pipeAndCapture(result, {
        originalMessages: conversation.uiMessages,
      });
      if (response) await conversation.addResponse(response);
    }
    ```

    With `chat.createSession`, the iterator surfaces it as `turn.handover`;
    call `turn.complete()` with no argument on a final handover. The
    lower-level `chat.waitForHandover()` and `accumulator.applyHandover()`
    are also exported for hand-rolled loops.

- Add `triggerConfig` support to `chat.headStart()` and
`chat.openSession()`, so the auto-triggered handover-prepare run
inherits tags, queue, machine, and other session trigger options the
same way `chat.createStartSessionAction()` does. The `chat:{chatId}` tag
is prepended automatically.
([#3963](#3963))

    ```ts
    export const POST = chat.headStart({
      agentId: "my-agent",
      triggerConfig: { tags: ["org:acme"], queue: "chat" },
      run: async ({ chat }) => streamText({ ...chat.toStreamTextOptions(), model }),
    });
    ```

    Because the session is created once on the first head-start turn and is
    idempotent on the chat id, this is the only place to set those options
    for a head-start chat's lifetime. `chat.createStartSessionAction()` now
    also forwards `maxDuration`, `region`, and `lockToVersion` so both
    session entry points stay consistent.

- Cache your chat agent's system prompt with Anthropic prompt caching.
`chat.toStreamTextOptions()` now emits the system prompt as a cacheable
message when you opt in, so a large, stable system block is billed at
cache-read rates on every turn instead of full price.
([#3952](#3952))

    ```ts
    // at the streamText call site (Anthropic sugar)
    streamText({
      ...chat.toStreamTextOptions({ cacheControl: { type: "ephemeral" } }),
      messages,
    });

    // provider-agnostic equivalent
    chat.toStreamTextOptions({
      systemProviderOptions: { anthropic: { cacheControl: { type: "ephemeral" } } },
    });

    // or where the prompt is defined
    chat.prompt.set(SYSTEM_PROMPT, {
      providerOptions: { anthropic: { cacheControl: { type: "ephemeral" } } },
    });
    ```

    Without an option, `system` stays a plain string. Pairs with a
    `prepareMessages` cache breakpoint to cache the conversation prefix
    across turns too.

- Three fixes for custom agent loops (`chat.customAgent`,
`chat.createSession`, and hand-rolled `MessageAccumulator` loops):
([#3936](#3936))

    - Continuation runs no longer replay already-answered user messages into
      the first turn. The `.in` resume cursor is now seeded before any
      listener attaches (the same boot logic `chat.agent` uses), so a chat
      that continues after a cancel, crash, or upgrade only sees genuinely new
      messages.
    - Steering a hand-rolled loop mid-stream no longer wipes the in-flight
      assistant response. `chat.pipeAndCapture` now stamps a server-generated
      message id on the stream, so a `prepareStep` injection keeps the partial
      text instead of replacing the message.
    - Task-backed tools (`ai.toolExecute`) now work from custom agent loops:
      the parent's session is threaded to the child run, so child tasks can
      stream progress into the chat with `chat.stream.writer({ target: "root"
      })` instead of failing with "session handle is not initialized".

- The agent skills installed by `trigger skills` are now namespaced with
a `trigger-` prefix (e.g. `trigger-authoring-tasks`,
`trigger-getting-started`) so they don't collide with unrelated skills
in your coding agent's skills directory. Adds a `trigger-cost-savings`
skill for auditing and reducing compute spend (right-sizing machines,
`maxDuration`, batching, debounce), and `@trigger.dev/sdk` now bundles
the full Trigger.dev documentation so your agent can read the complete,
version-pinned reference directly from node_modules.
([#3970](#3970))

-   Updated dependencies:
    -   `@trigger.dev/core@4.5.0-rc.7`

## @trigger.dev/plugins@4.5.0-rc.7

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.5.0-rc.7`

</details>

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Replicates `TaskRun.planType` into the `task_runs_v2` ClickHouse table
so run analytics can group by plan type.

Adds a `plan_type` column (goose migration `033`,
`LowCardinality(String)`), the replication insert mapping, and the
matching schema/column/type entries - same shape as the recent `region`
addition. Write-once at trigger, so it just rides along on existing
replicated rows. Internal analytics only; not exposed in the Query API.
#3960)

## Summary

Prisma infrastructure failures (P1xxx-class: database unreachable, timed
out, connection dropped, engine init/panic) carry the database hostname
in their `.message`. This captures them centrally for observability and
ensures they never reach API clients verbatim.

## Design

A `$allOperations` client extension on the writer and replica clients
logs infrastructure errors with the originating model and operation,
then rethrows the **original** error unchanged — call sites that branch
on `error.code` (unique-violation idempotency, not-found handling) and
transaction retries keep working. Only infrastructure errors are logged;
routine query/validation errors (P2xxx) are left alone.

`$allOperations` can't see the transaction boundary (`$transaction` is a
client method, not an operation), so infrastructure errors surfacing
from `$transaction()` without a Prisma code — e.g.
`PrismaClientInitializationError` — are logged separately at the
transaction wrapper, where the existing coded-error path would otherwise
miss them.

`clientSafeErrorMessage()` swaps an infrastructure error's message for
`"Internal Server Error"` at the API routes that previously returned
`error.message` raw. Status codes, headers, and every non-infrastructure
message are unchanged.
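
A rough sketch of the two behaviours described above: log-and-rethrow around an operation, and message obfuscation at the API boundary. This is not the PR's implementation. `isInfrastructureError` and `withInfraErrorLogging` are hypothetical names, the body of `clientSafeErrorMessage` is a guess at the behaviour described, and the detection rule (a `P1xxx` code, or a code-less Prisma error class matched by `name`) is an assumption.

```ts
// Assumption: Prisma error classes that carry no `code` are matched by name.
const CODELESS_INFRA_ERRORS = new Set([
  "PrismaClientInitializationError",
  "PrismaClientRustPanicError",
]);

// Hypothetical helper: true for P1xxx-coded errors or the code-less classes above.
function isInfrastructureError(error: unknown): boolean {
  if (typeof error !== "object" || error === null) return false;
  const { code, name } = error as { code?: unknown; name?: unknown };
  if (typeof code === "string" && /^P1\d{3}$/.test(code)) return true;
  return typeof name === "string" && CODELESS_INFRA_ERRORS.has(name);
}

type LogFn = (message: string, fields: Record<string, unknown>) => void;

// Hypothetical wrapper: logs infrastructure errors with their origin, then
// rethrows the original error so callers can still branch on `error.code`.
async function withInfraErrorLogging<T>(
  context: { model?: string; operation: string },
  run: () => Promise<T>,
  log: LogFn,
): Promise<T> {
  try {
    return await run();
  } catch (error) {
    if (isInfrastructureError(error)) {
      const code = (error as { code?: unknown }).code;
      log("Prisma infrastructure error", { ...context, code });
    }
    throw error; // unchanged: same object, same code
  }
}

// Guessed body: hide infrastructure messages, pass everything else through.
function clientSafeErrorMessage(error: unknown): string {
  if (isInfrastructureError(error)) return "Internal Server Error";
  return error instanceof Error ? error.message : String(error);
}
```

Because the wrapper rethrows the same error object, a P2002 unique-violation check at the call site behaves exactly as it did before the wrapper was added.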

## Test plan

- [x] P2002 / P2025 rethrow with code intact and are not logged
- [x] Statement errors inside `$transaction` keep their code (retry
logic intact)
- [x] Raw queries wrapped without crashing on the undefined model
- [x] A genuine connectivity failure is logged with model/operation/code
- [x] `clientSafeErrorMessage` obfuscates infra messages, preserves all
others
- [x] `pnpm run typecheck --filter webapp` (12/12)

## Note

Overlaps with #3391 (Prisma 7 migration) on
`apps/webapp/app/db.server.ts` — coordinate rebasing.
The global feature flags admin page had a few rough edges.

The percentage flags are numeric (`z.coerce.number()`) but rendered as
free-text inputs, so you could type non-numeric values that only failed
validation after submitting - and the error surfaced behind the confirm
dialog. The control-type detection now recognises numbers and renders a
proper number input, with the min/max range as the placeholder so the
type is clear even when the field is unset. The save error also shows
inside the confirm dialog now, not just behind it.

The action buttons were unreachable without zooming out. The admin
layout wrapped each page in a plain block, so `h-full` page content
overran the viewport by the height of the tab bar and got clipped by the
`overflow-hidden` body. Making the layout a flex column bounds each page
to the space below the tabs, so the existing per-page scroll works and
the feature flags page scrolls like the Users/Orgs tabs. Also capped the
confirm dialog's diff list so its footer stays on screen when there are
many changes.