Reducing production MTTR from 30 minutes to 15 seconds through autonomous runtime-aware incident triage and root cause analysis.
Overwatch is an AI-native autonomous incident commander for local backend/runtime systems. It watches a process, detects deterministic or unexpected crashes, captures runtime telemetry, parses stack traces, scans related workspace context, estimates blast radius, classifies severity, proposes a read-only patch, and generates a production-style postmortem.
The goal is not to make AI write more code. The goal is to compress operational recovery time.
Production systems fail. The highest-leverage engineering move is to reduce the time between "something broke" and "we understand the failure, the risk, the likely fix, and the verification path." Overwatch targets the expensive part of incident response: context reconstruction under pressure.
Overwatch is an AI-native operational engineering agent that autonomously detects backend failures, analyzes runtime crash context, identifies probable root causes, estimates blast radius, and generates production-grade postmortems with suggested remediation patches.
The system is designed around one core principle:
Production failures are inevitable. Operational recovery speed is the highest leverage optimization.
Most incidents are not slowed down by typing speed. They are slowed down by:
- finding the first meaningful stack frame,
- recovering runtime context,
- identifying the subsystem owner,
- estimating blast radius,
- separating symptoms from root cause,
- deciding whether retry, rollback, or patch is safest,
- and producing a clear incident record after the system is stable.
Default AI coding workflows still require a human to notice the crash, copy logs, explain repo context, ask the right questions, and preserve the output. That is too much human coordination in the incident hot path.
AI has already made implementation faster. That makes priority definition more valuable, not less. If execution speed is commoditized, the scarce skill is choosing where autonomy creates operational leverage.
Incident response is a leverage-rich target because every minute saved compounds across:
- customer availability,
- engineer cognitive load,
- on-call fatigue,
- leadership visibility,
- rollback safety,
- and institutional learning.
Junior AI projects optimize code generation volume. Overwatch optimizes mean time to recovery, decision quality, and operational memory.
Many AI-agent repositories are wrappers around a model call. They demonstrate that an LLM can produce text. They rarely demonstrate a mature understanding of production operations.
Overwatch is intentionally different:
- it starts from a real operational pain,
- it models the incident lifecycle,
- it captures machine context before asking AI,
- it treats patching as a safety-sensitive proposal,
- it persists a postmortem artifact,
- and it benchmarks against a real baseline: a human using Cursor manually during a crash.
Overwatch assumes:
- systems fail,
- logs are partial,
- stack traces are clues rather than truth,
- autonomous writes to production code are dangerous,
- humans should retain final authority,
- and the best AI systems reduce cognitive load before they increase automation scope.
The product posture is "autonomous triage, human-approved recovery."
An AI-native incident system should not wait for a prompt. It should:
- observe runtime behavior,
- capture context at failure time,
- select the relevant files,
- produce an operationally useful analysis,
- generate a safe patch proposal,
- provide rollback and verification guidance,
- and leave behind an auditable incident record.
That is what Overwatch implements.
```mermaid
flowchart TD
    A["target_app.js<br/>backend process crashes"] --> B["runner.ts<br/>watchdog captures stderr, stdout, exit code"]
    B --> C["crash_parser.ts<br/>extracts stack trace and failure class"]
    C --> D["workspace_scanner.ts<br/>recovers local source context"]
    D --> E["severity + blast radius + retry safety<br/>operational risk model"]
    E --> F["agent.ts<br/>Cursor/LLM path or offline deterministic analysis"]
    F --> G["POST_MORTEM.md<br/>RCA, patch, rollback, verification"]
```
- Launch monitored process.
- Stream stdout and stderr with timestamps.
- Detect non-zero exit or fatal signal.
- Parse failure context.
- Scan related workspace files.
- Score severity, blast radius, retry safety, regression risk, and confidence.
- Build an operational prompt for Cursor.
- Generate analysis and read-only patch proposal.
- Write a durable postmortem.
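The stack-parsing step in this loop can be sketched as a small pure function over captured stderr. This is an illustrative sketch under assumed names and regexes, not the actual `crash_parser.ts` API:

```typescript
// Hypothetical sketch of the stack-parsing step (names differ from crash_parser.ts).
interface ParsedCrash {
  exceptionClass: string;    // e.g. "SyntaxError"
  message: string;           // the error message line
  firstFrame: string | null; // first frame with a file:line:col location
}

function parseCrash(stderr: string): ParsedCrash {
  // Node prints "ErrorClass: message" followed by "    at ..." frames.
  const headMatch = stderr.match(/^(\w*Error):\s*(.*)$/m);
  // Skip frames without a file:line:col location (e.g. "<anonymous>").
  const frameMatch = stderr.match(/^\s+at .*?\(?([^\s()]+:\d+:\d+)\)?$/m);
  return {
    exceptionClass: headMatch?.[1] ?? "UnknownError",
    message: headMatch?.[2] ?? "",
    firstFrame: frameMatch?.[1] ?? null,
  };
}
```

A parser like this treats the stack trace as a clue rather than truth: it recovers the first frame with a real source location and leaves deeper attribution to the analysis step.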
```shell
git clone https://github.com/Build4mBottom/overwatch.git
cd overwatch
npm install
cp .env.example .env
npm run start:watchdog
```

The live watchdog demo runs `target_app.js`, which deterministically crashes using `SCENARIO=malformed-json` unless another scenario is selected.
Example:
```shell
SCENARIO=api-contract npm run start:watchdog
```

Offline deterministic mode requires no API keys:

```shell
npm run demo:offline
```

That mode reads `examples/sample_crash.log`, reconstructs the incident, and generates `POST_MORTEM.md`. It proves the architecture works even if the evaluator has not configured Cursor or an LLM provider.
Supported scenarios:
`malformed-json`, `async-promise`, `serialization-corruption`, `invalid-env`, `api-contract`, `dependency-resolution`, `database-connection`, `timeout-explosion`, `worker-crash`, `memory-pressure`, `event-loop-starvation`, `retry-storm`
Overwatch treats a crash as an incident object with phases:
- `detected`: process exited unexpectedly,
- `captured`: logs, metadata, and exit state are persisted,
- `classified`: severity and probable subsystem are inferred,
- `contextualized`: related source files and configs are scanned,
- `analyzed`: root cause and blast radius are estimated,
- `proposed`: read-only diff and rollback plan are generated,
- `documented`: `POST_MORTEM.md` is written.
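The phase progression above can be modeled as a forward-only state machine. A minimal sketch with hypothetical names; the real incident object in the codebase may be shaped differently:

```typescript
// Illustrative sketch of the incident lifecycle (phase names from the README;
// the type and helper are hypothetical, not the project's actual API).
type IncidentPhase =
  | "detected"
  | "captured"
  | "classified"
  | "contextualized"
  | "analyzed"
  | "proposed"
  | "documented";

const PHASE_ORDER: IncidentPhase[] = [
  "detected", "captured", "classified", "contextualized",
  "analyzed", "proposed", "documented",
];

// Phases only advance forward; an incident never skips or revisits a step.
function nextPhase(current: IncidentPhase): IncidentPhase | null {
  const i = PHASE_ORDER.indexOf(current);
  return i >= 0 && i < PHASE_ORDER.length - 1 ? PHASE_ORDER[i + 1] : null;
}
```

Modeling the lifecycle as an ordered enum makes the audit trail cheap: each phase transition can be timestamped and persisted alongside the incident evidence.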
Baseline manual Cursor workflow:
- human notices crash: 1-3 minutes,
- copies terminal logs: 1 minute,
- opens relevant files: 3-8 minutes,
- reconstructs runtime context: 5-10 minutes,
- asks AI for help: 2-5 minutes,
- writes postmortem notes: 10+ minutes.
Overwatch demo path:
- crash detection: immediate,
- log capture: immediate,
- stack parsing: milliseconds,
- workspace scan: milliseconds to seconds,
- analysis artifact: seconds,
- postmortem generation: seconds.
The target compression is from roughly 30 minutes to roughly 15 seconds for first-pass triage.
Overwatch is Cursor-native in three ways:
- `.cursorrules` defines operational behavior for AI-assisted incident response.
- `prompt_builder.ts` creates workspace-aware FDE prompts optimized for Cursor context.
- `agent.ts` supports `CURSOR_AGENT_COMMAND`, allowing a local Cursor-compatible agent command to receive the generated prompt. That command can use `CURSOR_API_KEY`, `OPENAI_API_KEY`, or another local provider configuration.
If `CURSOR_AGENT_COMMAND` is not set, Overwatch uses a deterministic local analysis fallback so the repository remains demoable and reliable.
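That selection logic can be sketched as a pure function over the environment. The names below are illustrative assumptions, not the actual `agent.ts` implementation:

```typescript
// Hypothetical sketch of provider selection: route to the configured
// Cursor-compatible agent command if present, else fall back to the
// deterministic local analyzer.
type AnalysisMode =
  | { kind: "cursor-agent"; command: string }
  | { kind: "offline-deterministic" };

function selectAnalysisMode(env: Record<string, string | undefined>): AnalysisMode {
  const command = env.CURSOR_AGENT_COMMAND?.trim();
  return command
    ? { kind: "cursor-agent", command }
    : { kind: "offline-deterministic" };
}
```

Keeping the fallback decision in one pure function means the demo path stays testable without spawning any external process.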
Overwatch is read-only by default.
Safety guarantees:
- no secrets committed,
- `.env` excluded by `.gitignore`,
- `.env.example` documents configuration,
- environment variables are validated,
- source files are scanned with size limits,
- generated patches are written as proposals only,
- no production code is auto-modified,
- human approval is required before applying a patch.
Autonomous hot-path writes are intentionally avoided. In incident response, a wrong automated patch can expand blast radius faster than the original outage. Overwatch optimizes for triage acceleration and decision support, not unchecked mutation.
Severity is scored using:
- process exit code,
- exception type,
- subsystem criticality,
- customer-facing likelihood,
- retry storm risk,
- data corruption risk,
- and confidence in root cause.
Severity labels:
- `SEV1`: customer-facing outage, data loss, or cascading failure risk,
- `SEV2`: degraded production path or high operational urgency,
- `SEV3`: contained runtime failure with clear owner,
- `SEV4`: local or low-risk failure.
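One way the scoring signals could collapse into a label, sketched with illustrative rules; the actual weights and thresholds in `severity_classifier.ts` are not shown here:

```typescript
// Hedged sketch: boolean signals from the scoring inputs above collapse
// to a SEV label. Rules are hypothetical, for illustration only.
interface SeveritySignals {
  customerFacing: boolean;
  dataCorruptionRisk: boolean;
  retryStormRisk: boolean;
  subsystemCritical: boolean;
}

function classifySeverity(s: SeveritySignals): "SEV1" | "SEV2" | "SEV3" | "SEV4" {
  // Customer impact plus data or cascade risk is the worst case.
  if (s.customerFacing && (s.dataCorruptionRisk || s.retryStormRisk)) return "SEV1";
  // Customer impact or a critical subsystem alone degrades production.
  if (s.customerFacing || s.subsystemCritical) return "SEV2";
  // Contained but risky failures still need a clear owner.
  if (s.retryStormRisk || s.dataCorruptionRisk) return "SEV3";
  return "SEV4";
}
```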
Blast radius is estimated from:
- failing files,
- imported modules,
- package/config proximity,
- subsystem ownership,
- stateful resources,
- and failure class.
The output separates:
- affected user paths,
- affected technical components,
- operational dependencies,
- rollback surface,
- and verification scope.
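A minimal sketch of the import-graph side of this estimate, assuming a hypothetical module-graph shape: the affected surface is every module that transitively imports the failing file.

```typescript
// Illustrative blast-radius estimate over an assumed import graph.
// The graph shape and function names are hypothetical.
type ImportGraph = Record<string, string[]>; // module -> modules it imports

function blastRadius(graph: ImportGraph, failingModule: string): string[] {
  const affected = new Set<string>([failingModule]);
  // Fixed-point iteration: keep adding modules that import anything affected.
  let grew = true;
  while (grew) {
    grew = false;
    for (const [mod, imports] of Object.entries(graph)) {
      if (!affected.has(mod) && imports.some((dep) => affected.has(dep))) {
        affected.add(mod);
        grew = true;
      }
    }
  }
  return Array.from(affected);
}
```

Note the direction: the radius grows upward toward importers, since a module's own dependencies keep working while its callers observe the failure.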
Overwatch uses a Triage Efficiency Score (TES), ranging from 1 to 10,000.
Formula:
```text
TES =
    Base Automation Value
  + MTTR Compression
  + Context Recovery Accuracy
  + Patch Quality
  + Operational Safety
  + Blast Radius Reduction
  + Confidence Reliability
  - Human Intervention Cost
  - Regression Risk
  - False Positive Risk
```
Target score: 9,450 / 10,000.
See docs/BENCHMARK.md for the full calculation.
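The arithmetic of the formula is straightforward; the sketch below uses hypothetical component values and clamps to the stated 1 to 10,000 range. The real weights live in docs/BENCHMARK.md:

```typescript
// Sketch of the TES arithmetic. Component values here are illustrative,
// not the benchmark's actual numbers.
interface TesComponents {
  baseAutomationValue: number;
  mttrCompression: number;
  contextRecoveryAccuracy: number;
  patchQuality: number;
  operationalSafety: number;
  blastRadiusReduction: number;
  confidenceReliability: number;
  humanInterventionCost: number;
  regressionRisk: number;
  falsePositiveRisk: number;
}

function triageEfficiencyScore(c: TesComponents): number {
  const raw =
    c.baseAutomationValue + c.mttrCompression + c.contextRecoveryAccuracy +
    c.patchQuality + c.operationalSafety + c.blastRadiusReduction +
    c.confidenceReliability -
    c.humanInterventionCost - c.regressionRisk - c.falsePositiveRisk;
  // Clamp to the stated 1..10,000 range (an assumption of this sketch).
  return Math.min(10_000, Math.max(1, raw));
}
// Example: seven components of 1,500 minus three costs of 350 gives 9,450.
```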
| Metric | Default Cursor Workflow | Project Overwatch |
|---|---|---|
| Triggering | Manual | Automatic process watcher |
| Context Gathering | Human copies logs | Structured telemetry capture |
| Workspace Awareness | Prompt-dependent | Scanner-driven |
| MTTR | ~30 minutes | ~15 seconds first-pass triage |
| Cognitive Load | High | Low |
| Reliability | Varies by prompt quality | Deterministic demo path |
| Repeatability | Weak | Same scenario, same artifact shape |
| Output Persistence | Manual | POST_MORTEM.md and crash.log |
| Operational Safety | Human discretion | Read-only patch generation |
| Failure Analysis Depth | Ad hoc | Severity, blast radius, retry, regression |
| Patch Confidence | Prompt-dependent | Scored with risk annotations |
| Ownership Modeling | Manual | Subsystem inference |
- `target_app.js` receives a malformed JSON payload.
- The parser throws a `SyntaxError`.
- `runner.ts` captures stderr, exit code, timestamps, and runtime metadata.
- `crash_parser.ts` extracts the failing frame and exception class.
- `workspace_scanner.ts` reads `target_app.js`, package metadata, and related source files.
- `severity_classifier.ts` marks the incident as likely `SEV3`.
- `blast_radius.ts` identifies request ingestion and payload parsing as affected.
- `retry_safety.ts` warns that blind retry will not repair malformed input.
- `postmortem_generator.ts` writes a postmortem with RCA, patch proposal, rollback plan, and verification steps.
See examples/sample_postmortem.md.
Overwatch does not:
- apply patches automatically,
- write to production services,
- transmit secrets intentionally,
- assume retries are safe,
- hide uncertainty,
- or treat AI output as authoritative.
Overwatch does:
- preserve evidence,
- expose confidence,
- show blast radius,
- recommend verification,
- and keep humans in the approval loop.
```shell
npm install
cp .env.example .env
npm run start:watchdog
```

No-key offline proof:

```shell
npm run demo:offline
```

Local Loom dashboard:

```shell
npm run dashboard
```

Open http://localhost:3000. The dashboard is localhost-only and exposes only the predefined offline and watchdog demo commands.
With Docker:

```shell
docker compose up --build overwatch-demo
```

Optional Cursor command integration:

```shell
CURSOR_AGENT_COMMAND="cursor-agent --stdin" npm run start:watchdog
```

Expected demo path:
```text
target_app.js crashes
-> runner.ts captures stderr
-> agent.ts analyzes crash
-> POST_MORTEM.md is generated
-> evaluator opens POST_MORTEM.md and sees RCA + patch
```
See docs/LOOM_SCRIPT.md.
The repo includes a dark social preview asset at docs/social-preview.png and a final polish checklist at docs/REPO_POLISH.md. Suggested GitHub topics are included there for discoverability and evaluation polish.
See docs/ARCHITECTURE_DECISIONS.md.
Key choices:
- read-only patch proposals,
- deterministic demo reliability,
- local-first context gathering,
- explicit severity and confidence models,
- small composable modules,
- and no UI until the operational loop proves value.
High-leverage next steps:
- GitHub issue/PR incident linking,
- OpenTelemetry ingestion,
- Kubernetes event capture,
- CI regression replay,
- service ownership maps,
- and incident trend analysis.
This project chooses downtime reduction over demo spectacle.
It shows the core AI-native insight: the scarce skill is no longer producing more code faster. The scarce skill is deciding which operational bottleneck deserves autonomy.
Overwatch attacks a real engineering bottleneck with a narrow, safe, measurable loop:
- detect failure,
- recover context,
- classify risk,
- propose action,
- preserve learning.
That is leverage.
