Reducing production MTTR from 30 minutes to 15 seconds through autonomous runtime-aware incident triage and root cause analysis.
Overwatch is an AI-native autonomous incident commander for local backend/runtime systems. It watches a process, detects deterministic or unexpected crashes, captures runtime telemetry, parses stack traces, scans related workspace context, estimates blast radius, classifies severity, proposes a read-only patch, and generates a production-style postmortem.
The goal is not to make AI write more code. The goal is to compress operational recovery time.
Production systems fail. The highest-leverage engineering move is to reduce the time between "something broke" and "we understand the failure, the risk, the likely fix, and the verification path." Overwatch targets the expensive part of incident response: context reconstruction under pressure.
Overwatch is an AI-native operational engineering agent that autonomously detects backend failures, analyzes runtime crash context, identifies probable root causes, estimates blast radius, and generates production-grade postmortems with suggested remediation patches.
The system is designed around one core principle:
Production failures are inevitable. Operational recovery speed is the highest leverage optimization.
Most incidents are not slowed down by typing speed. They are slowed down by:
- finding the first meaningful stack frame,
- recovering runtime context,
- identifying the subsystem owner,
- estimating blast radius,
- separating symptoms from root cause,
- deciding whether retry, rollback, or patch is safest,
- and producing a clear incident record after the system is stable.
Default AI coding workflows still require a human to notice the crash, copy logs, explain repo context, ask the right questions, and preserve the output. That is too much human coordination in the incident hot path.
AI has already made implementation faster. That makes priority definition more valuable, not less. If execution speed is commoditized, the scarce skill is choosing where autonomy creates operational leverage.
Incident response is a leverage-rich target because every minute saved compounds across:
- customer availability,
- engineer cognitive load,
- on-call fatigue,
- leadership visibility,
- rollback safety,
- and institutional learning.
Junior AI projects optimize code generation volume. Overwatch optimizes mean time to recovery, decision quality, and operational memory.
Many AI-agent repositories are wrappers around a model call. They demonstrate that an LLM can produce text. They rarely demonstrate a mature understanding of production operations.
Overwatch is intentionally different:
- it starts from a real operational pain,
- it models the incident lifecycle,
- it captures machine context before asking AI,
- it treats patching as a safety-sensitive proposal,
- it persists a postmortem artifact,
- and it benchmarks against a real baseline: a human using Cursor manually during a crash.
Overwatch assumes:
- systems fail,
- logs are partial,
- stack traces are clues rather than truth,
- autonomous writes to production code are dangerous,
- humans should retain final authority,
- and the best AI systems reduce cognitive load before they increase automation scope.
The product posture is "autonomous triage, human-approved recovery."
An AI-native incident system should not wait for a prompt. It should:
- observe runtime behavior,
- capture context at failure time,
- select the relevant files,
- produce an operationally useful analysis,
- generate a safe patch proposal,
- provide rollback and verification guidance,
- and leave behind an auditable incident record.
That is what Overwatch implements.
```mermaid
flowchart TD
    A["target_app.js<br/>backend process crashes"] --> B["runner.ts<br/>watchdog captures stderr, stdout, exit code"]
    B --> C["crash_parser.ts<br/>extracts stack trace and failure class"]
    C --> D["workspace_scanner.ts<br/>recovers local source context"]
    D --> E["severity + blast radius + retry safety<br/>operational risk model"]
    E --> F["agent.ts<br/>Cursor/LLM path or offline deterministic analysis"]
    F --> G["POST_MORTEM.md<br/>RCA, patch, rollback, verification"]
```
- Launch monitored process.
- Stream stdout and stderr with timestamps.
- Detect non-zero exit or fatal signal.
- Parse failure context.
- Scan related workspace files.
- Score severity, blast radius, retry safety, regression risk, and confidence.
- Build an operational prompt for Cursor.
- Generate analysis and read-only patch proposal.
- Write a durable postmortem.
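The stack-parsing step in this loop can be sketched as a small pure function over captured stderr. This is an illustrative sketch under assumed names and regexes, not the actual `crash_parser.ts` API:

```typescript
// Hypothetical sketch of the stack-parsing step (names differ from crash_parser.ts).
interface ParsedCrash {
  exceptionClass: string;    // e.g. "SyntaxError"
  message: string;           // the error message line
  firstFrame: string | null; // first frame with a file:line:col location
}

function parseCrash(stderr: string): ParsedCrash {
  // Node prints "ErrorClass: message" followed by "    at ..." frames.
  const headMatch = stderr.match(/^(\w*Error):\s*(.*)$/m);
  // Skip frames without a file:line:col location (e.g. "<anonymous>").
  const frameMatch = stderr.match(/^\s+at .*?\(?([^\s()]+:\d+:\d+)\)?$/m);
  return {
    exceptionClass: headMatch?.[1] ?? "UnknownError",
    message: headMatch?.[2] ?? "",
    firstFrame: frameMatch?.[1] ?? null,
  };
}
```

A parser like this treats the stack trace as a clue rather than truth: it recovers the first frame with a real source location and leaves deeper attribution to the analysis step.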
```shell
git clone https://github.com/Build4mBottom/overwatch.git
cd overwatch
npm install
cp .env.example .env
npm run start:watchdog
```

The live watchdog demo runs `target_app.js`, which deterministically crashes using `SCENARIO=malformed-json` unless another scenario is selected.
Example:
```shell
SCENARIO=api-contract npm run start:watchdog
```

Offline deterministic mode requires no API keys:

```shell
npm run demo:offline
```

That mode reads `examples/sample_crash.log`, reconstructs the incident, and generates `POST_MORTEM.md`. It proves the architecture works even if the evaluator has not configured Cursor or an LLM provider.
Supported scenarios:
`malformed-json`, `async-promise`, `serialization-corruption`, `invalid-env`, `api-contract`, `dependency-resolution`, `database-connection`, `timeout-explosion`, `worker-crash`, `memory-pressure`, `event-loop-starvation`, `retry-storm`
Overwatch treats a crash as an incident object with phases:
- `detected`: process exited unexpectedly,
- `captured`: logs, metadata, and exit state are persisted,
- `classified`: severity and probable subsystem are inferred,
- `contextualized`: related source files and configs are scanned,
- `analyzed`: root cause and blast radius are estimated,
- `proposed`: read-only diff and rollback plan are generated,
- `documented`: `POST_MORTEM.md` is written.
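The phase progression above can be modeled as a forward-only state machine. A minimal sketch with hypothetical names; the real incident object in the codebase may be shaped differently:

```typescript
// Illustrative sketch of the incident lifecycle (phase names from the README;
// the type and helper are hypothetical, not the project's actual API).
type IncidentPhase =
  | "detected"
  | "captured"
  | "classified"
  | "contextualized"
  | "analyzed"
  | "proposed"
  | "documented";

const PHASE_ORDER: IncidentPhase[] = [
  "detected", "captured", "classified", "contextualized",
  "analyzed", "proposed", "documented",
];

// Phases only advance forward; an incident never skips or revisits a step.
function nextPhase(current: IncidentPhase): IncidentPhase | null {
  const i = PHASE_ORDER.indexOf(current);
  return i >= 0 && i < PHASE_ORDER.length - 1 ? PHASE_ORDER[i + 1] : null;
}
```

Modeling the lifecycle as an ordered enum makes the audit trail cheap: each phase transition can be timestamped and persisted alongside the incident evidence.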
Baseline manual Cursor workflow:
- human notices crash: 1-3 minutes,
- copies terminal logs: 1 minute,
- opens relevant files: 3-8 minutes,
- reconstructs runtime context: 5-10 minutes,
- asks AI for help: 2-5 minutes,
- writes postmortem notes: 10+ minutes.
Overwatch demo path:
- crash detection: immediate,
- log capture: immediate,
- stack parsing: milliseconds,
- workspace scan: milliseconds to seconds,
- analysis artifact: seconds,
- postmortem generation: seconds.
The target compression is from roughly 30 minutes to roughly 15 seconds for first-pass triage.
Overwatch is Cursor-native in three ways:
- `.cursorrules` defines operational behavior for AI-assisted incident response.
- `prompt_builder.ts` creates workspace-aware FDE prompts optimized for Cursor context.
- `agent.ts` supports `CURSOR_AGENT_COMMAND`, allowing a local Cursor-compatible agent command to receive the generated prompt. That command can use `CURSOR_API_KEY`, `OPENAI_API_KEY`, or another local provider configuration.
If `CURSOR_AGENT_COMMAND` is not set, Overwatch uses a deterministic local analysis fallback so the repository remains demoable and reliable.
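That selection logic can be sketched as a pure function over the environment. The names below are illustrative assumptions, not the actual `agent.ts` implementation:

```typescript
// Hypothetical sketch of provider selection: route to the configured
// Cursor-compatible agent command if present, else fall back to the
// deterministic local analyzer.
type AnalysisMode =
  | { kind: "cursor-agent"; command: string }
  | { kind: "offline-deterministic" };

function selectAnalysisMode(env: Record<string, string | undefined>): AnalysisMode {
  const command = env.CURSOR_AGENT_COMMAND?.trim();
  return command
    ? { kind: "cursor-agent", command }
    : { kind: "offline-deterministic" };
}
```

Keeping the fallback decision in one pure function means the demo path stays testable without spawning any external process.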
Overwatch is read-only by default.
Safety guarantees:
- no secrets committed,
- `.env` excluded by `.gitignore`,
- `.env.example` documents configuration,
- environment variables are validated,
- source files are scanned with size limits,
- generated patches are written as proposals only,
- no production code is auto-modified,
- human approval is required before applying a patch.
Autonomous hot-path writes are intentionally avoided. In incident response, a wrong automated patch can expand blast radius faster than the original outage. Overwatch optimizes for triage acceleration and decision support, not unchecked mutation.
Severity is scored using:
- process exit code,
- exception type,
- subsystem criticality,
- customer-facing likelihood,
- retry storm risk,
- data corruption risk,
- and confidence in root cause.
Severity labels:
- `SEV1`: customer-facing outage, data loss, or cascading failure risk,
- `SEV2`: degraded production path or high operational urgency,
- `SEV3`: contained runtime failure with clear owner,
- `SEV4`: local or low-risk failure.
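One way the scoring signals could collapse into a label, sketched with illustrative rules; the actual weights and thresholds in `severity_classifier.ts` are not shown here:

```typescript
// Hedged sketch: boolean signals from the scoring inputs above collapse
// to a SEV label. Rules are hypothetical, for illustration only.
interface SeveritySignals {
  customerFacing: boolean;
  dataCorruptionRisk: boolean;
  retryStormRisk: boolean;
  subsystemCritical: boolean;
}

function classifySeverity(s: SeveritySignals): "SEV1" | "SEV2" | "SEV3" | "SEV4" {
  // Customer impact plus data or cascade risk is the worst case.
  if (s.customerFacing && (s.dataCorruptionRisk || s.retryStormRisk)) return "SEV1";
  // Customer impact or a critical subsystem alone degrades production.
  if (s.customerFacing || s.subsystemCritical) return "SEV2";
  // Contained but risky failures still need a clear owner.
  if (s.retryStormRisk || s.dataCorruptionRisk) return "SEV3";
  return "SEV4";
}
```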
Blast radius is estimated from:
- failing files,
- imported modules,
- package/config proximity,
- subsystem ownership,
- stateful resources,
- and failure class.
The output separates:
- affected user paths,
- affected technical components,
- operational dependencies,
- rollback surface,
- and verification scope.
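A minimal sketch of the import-graph side of this estimate, assuming a hypothetical module-graph shape: the affected surface is every module that transitively imports the failing file.

```typescript
// Illustrative blast-radius estimate over an assumed import graph.
// The graph shape and function names are hypothetical.
type ImportGraph = Record<string, string[]>; // module -> modules it imports

function blastRadius(graph: ImportGraph, failingModule: string): string[] {
  const affected = new Set<string>([failingModule]);
  // Fixed-point iteration: keep adding modules that import anything affected.
  let grew = true;
  while (grew) {
    grew = false;
    for (const [mod, imports] of Object.entries(graph)) {
      if (!affected.has(mod) && imports.some((dep) => affected.has(dep))) {
        affected.add(mod);
        grew = true;
      }
    }
  }
  return Array.from(affected);
}
```

Note the direction: the radius grows upward toward importers, since a module's own dependencies keep working while its callers observe the failure.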
Overwatch uses a Triage Efficiency Score (TES), ranging from 1 to 10,000.
Formula:
```text
TES =
    Base Automation Value
  + MTTR Compression
  + Context Recovery Accuracy
  + Patch Quality
  + Operational Safety
  + Blast Radius Reduction
  + Confidence Reliability
  - Human Intervention Cost
  - Regression Risk
  - False Positive Risk
```
Target score: 9,450 / 10,000.
See docs/BENCHMARK.md for the full calculation.
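The arithmetic of the formula is straightforward; the sketch below uses hypothetical component values and clamps to the stated 1 to 10,000 range. The real weights live in docs/BENCHMARK.md:

```typescript
// Sketch of the TES arithmetic. Component values here are illustrative,
// not the benchmark's actual numbers.
interface TesComponents {
  baseAutomationValue: number;
  mttrCompression: number;
  contextRecoveryAccuracy: number;
  patchQuality: number;
  operationalSafety: number;
  blastRadiusReduction: number;
  confidenceReliability: number;
  humanInterventionCost: number;
  regressionRisk: number;
  falsePositiveRisk: number;
}

function triageEfficiencyScore(c: TesComponents): number {
  const raw =
    c.baseAutomationValue + c.mttrCompression + c.contextRecoveryAccuracy +
    c.patchQuality + c.operationalSafety + c.blastRadiusReduction +
    c.confidenceReliability -
    c.humanInterventionCost - c.regressionRisk - c.falsePositiveRisk;
  // Clamp to the stated 1..10,000 range (an assumption of this sketch).
  return Math.min(10_000, Math.max(1, raw));
}
// Example: seven components of 1,500 minus three costs of 350 gives 9,450.
```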
| Metric | Default Cursor Workflow | Project Overwatch |
|---|---|---|
| Triggering | Manual | Automatic process watcher |
| Context Gathering | Human copies logs | Structured telemetry capture |
| Workspace Awareness | Prompt-dependent | Scanner-driven |
| MTTR | ~30 minutes | ~15 seconds first-pass triage |
| Cognitive Load | High | Low |
| Reliability | Varies by prompt quality | Deterministic demo path |
| Repeatability | Weak | Same scenario, same artifact shape |
| Output Persistence | Manual | POST_MORTEM.md and crash.log |
| Operational Safety | Human discretion | Read-only patch generation |
| Failure Analysis Depth | Ad hoc | Severity, blast radius, retry, regression |
| Patch Confidence | Prompt-dependent | Scored with risk annotations |
| Ownership Modeling | Manual | Subsystem inference |
- `target_app.js` receives a malformed JSON payload.
- The parser throws a `SyntaxError`.
- `runner.ts` captures stderr, exit code, timestamps, and runtime metadata.
- `crash_parser.ts` extracts the failing frame and exception class.
- `workspace_scanner.ts` reads `target_app.js`, package metadata, and related source files.
- `severity_classifier.ts` marks the incident as likely `SEV3`.
- `blast_radius.ts` identifies request ingestion and payload parsing as affected.
- `retry_safety.ts` warns that blind retry will not repair malformed input.
- `postmortem_generator.ts` writes a postmortem with RCA, patch proposal, rollback plan, and verification steps.
See examples/sample_postmortem.md.
Overwatch does not:
- apply patches automatically,
- write to production services,
- transmit secrets intentionally,
- assume retries are safe,
- hide uncertainty,
- or treat AI output as authoritative.
Overwatch does:
- preserve evidence,
- expose confidence,
- show blast radius,
- recommend verification,
- and keep humans in the approval loop.
```shell
npm install
cp .env.example .env
npm run start:watchdog
```

No-key offline proof:

```shell
npm run demo:offline
```

Local Loom dashboard:

```shell
npm run dashboard
```

Open http://localhost:3000. The dashboard is localhost-only and exposes only the predefined offline and watchdog demo commands.
With Docker:

```shell
docker compose up --build overwatch-demo
```

Optional Cursor command integration:

```shell
CURSOR_AGENT_COMMAND="cursor-agent --stdin" npm run start:watchdog
```

Expected demo path:
```text
target_app.js crashes
-> runner.ts captures stderr
-> agent.ts analyzes crash
-> POST_MORTEM.md is generated
-> evaluator opens POST_MORTEM.md and sees RCA + patch
```
See docs/LOOM_SCRIPT.md.
The repo includes a dark social preview asset at docs/social-preview.png and a final polish checklist at docs/REPO_POLISH.md. Suggested GitHub topics are included there for discoverability and evaluation polish.
See docs/ARCHITECTURE_DECISIONS.md.
Key choices:
- read-only patch proposals,
- deterministic demo reliability,
- local-first context gathering,
- explicit severity and confidence models,
- small composable modules,
- and no UI until the operational loop proves value.
High-leverage next steps:
- GitHub issue/PR incident linking,
- OpenTelemetry ingestion,
- Kubernetes event capture,
- CI regression replay,
- service ownership maps,
- and incident trend analysis.
This project chooses downtime reduction over demo spectacle.
It shows the core AI-native insight: the scarce skill is no longer producing more code faster. The scarce skill is deciding which operational bottleneck deserves autonomy.
Overwatch attacks a real engineering bottleneck with a narrow, safe, measurable loop:
- detect failure,
- recover context,
- classify risk,
- propose action,
- preserve learning.
That is leverage.
