The open-source framework for AI SRE agents, and the training and evaluation environment they need to improve. Connect the 60+ tools you already run, define your own workflows, and investigate incidents on your own infrastructure.
Quickstart · Docs · FAQ · Security
🚧 Public Alpha: Core workflows are usable for early exploration, though not yet fully stable. The project is in active development, and APIs and integrations may evolve.
- Why OpenSRE?
- Install
- Quick Start
- Official Deployment (LangGraph)
- Development
- How OpenSRE Works
- Benchmark
- Capabilities
- Integrations
- Contributing
- Security
- Telemetry
- License
- Citations
When something breaks in production, the evidence is scattered across logs, metrics, traces, runbooks, and Slack threads. OpenSRE is an open-source framework for AI SRE agents that resolve production incidents, built to run on your own infrastructure.
We are building it because SWE-bench¹ gave coding agents scalable training data and clear feedback; production incident response still lacks an equivalent.
Distributed failures are slower, noisier, and harder to simulate and evaluate than local code tasks, which is why AI SRE, and AI for production debugging more broadly, remains unsolved.
OpenSRE is building that missing layer:
an open reinforcement learning environment for agentic infrastructure incident response, with end-to-end tests and synthetic incident simulations for realistic production failures
We do that by:
- building easy-to-deploy, customizable AI SRE agents for production incident investigation and response
- running scored synthetic RCA suites that check root-cause accuracy, required evidence, and adversarial red herrings (tests/synthetic)
- running real-world end-to-end tests across cloud-backed scenarios including Kubernetes, EC2, CloudWatch, Lambda, ECS Fargate, and Flink (tests/e2e); a sketch of running both suites follows this list
- keeping semantic test-catalog naming so e2e vs synthetic and local vs cloud boundaries stay obvious (tests/README.md)
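As a rough sketch of how these suites are invoked locally, assuming the uv-based dev setup from the Development section and a standard pytest layout (the exact targets and required cloud credentials are documented in tests/README.md):

```bash
# Assumption: suites are plain pytest targets; see tests/README.md for the
# authoritative catalog and which e2e scenarios need live cloud credentials.
uv run pytest tests/synthetic          # scored synthetic RCA suites
uv run pytest tests/e2e/kubernetes     # cloud-backed Kubernetes e2e scenarios
```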
Our mission is to build AI SRE agents on top of this, scale it to thousands of realistic infrastructure failure scenarios, and establish OpenSRE as the benchmark and training ground for AI SRE.
¹ https://arxiv.org/abs/2310.06770
The root installer URL auto-detects Unix shell vs PowerShell. Add --main when you want the latest rolling build from main instead of the latest stable release.
Latest stable release:
```bash
curl -fsSL https://install.opensre.com | bash
```

Latest build from main:

```bash
curl -fsSL https://install.opensre.com | bash -s -- --main
```

Homebrew:

```bash
brew tap tracer-cloud/tap
brew install tracer-cloud/tap/opensre
```

PowerShell:

```powershell
irm https://install.opensre.com | iex
```

Then onboard and run your first investigation:

```bash
opensre onboard
opensre investigate -i tests/e2e/kubernetes/fixtures/datadog_k8s_alert.json
```
```bash
opensre update
opensre uninstall   # remove opensre and all local data
```

Running opensre with no arguments enters a persistent REPL session: an incident response terminal in the style of Claude Code. Describe an alert in plain text, watch the investigation stream live, then ask follow-up questions that stay grounded in what just ran.
```bash
opensre
# › MongoDB orders cluster is dropping connections since 14:00 UTC
# ...live streaming investigation...
# › why was the connection pool exhausted?
# ...grounded follow-up answer...
# › /status
# › /exit
```

Slash commands: `/help`, `/status`, `/clear`, `/reset`, `/trust`, `/exit`. Ctrl+C cancels an in-flight investigation while keeping the session state intact.
OpenSRE's official deployment path is LangGraph Platform.
- Create a deployment on LangGraph Platform and connect this repository.
- Keep `langgraph.json` at the repo root so LangGraph can load the graph entrypoint.
- Add your model provider in environment variables (for example `LLM_PROVIDER=anthropic`).
- Add the matching API key for your provider (for example `ANTHROPIC_API_KEY` or `OPENAI_API_KEY`).
- Add any additional runtime env vars your deployment needs (for example integration credentials and optional storage settings).
Minimum LLM env setup:
```bash
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=...
```

For other providers, set the same `LLM_PROVIDER` plus the matching key from `.env.example` (for example `OPENAI_API_KEY`, `GEMINI_API_KEY`, or `OPENROUTER_API_KEY`).
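For instance, a minimal sketch for OpenAI; the provider value `openai` is an assumption inferred from the pattern above, so check `.env.example` for the exact accepted values:

```bash
# Assumed provider value; confirm against .env.example
LLM_PROVIDER=openai
OPENAI_API_KEY=...
```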
If you prefer a self-hosted path, you can still deploy to Railway.
Before running opensre deploy railway, make sure the target Railway project has
both Postgres and Redis services, and that your OpenSRE service has DATABASE_URI
and REDIS_URI set to those connection strings. The containerized LangGraph runtime
will not boot without those backing services wired in.
```bash
# create/link Railway Postgres and Redis first, then set DATABASE_URI and REDIS_URI
opensre deploy railway --project <project> --service <service> --yes
```

If the deploy starts but the service never becomes healthy, verify that DATABASE_URI and REDIS_URI are present on the Railway service and point to the project Postgres and Redis instances.
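For reference, the two connection strings take the standard Postgres and Redis URI shapes; the values below are placeholders, and the real strings come from your Railway Postgres and Redis services:

```bash
# Placeholder values only; copy the actual connection strings from the
# Railway Postgres and Redis services in the same project.
DATABASE_URI=postgresql://user:password@host:5432/opensre
REDIS_URI=redis://default:password@host:6379
```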
After deploying a hosted service, you can run post-deploy operations from the CLI:
```bash
# inspect service status, URL, deployment metadata
opensre remote ops --provider railway --project <project> --service <service> status

# tail recent logs
opensre remote ops --provider railway --project <project> --service <service> logs --lines 200

# stream logs live
opensre remote ops --provider railway --project <project> --service <service> logs --follow

# trigger restart/redeploy
opensre remote ops --provider railway --project <project> --service <service> restart --yes
```

OpenSRE saves your last used provider, so you can run:

```bash
opensre remote ops status
opensre remote ops logs --follow
```

New to OpenSRE? See SETUP.md for detailed platform-specific setup instructions, including Windows setup, environment configuration, and more.
Local development installs use `uv` and a committed `uv.lock` (`make install` runs `uv sync --frozen --extra dev`). Install uv first, then:
```bash
git clone https://github.com/Tracer-Cloud/opensre
cd opensre
make install

# run opensre onboard to configure your local LLM provider
# and optionally validate/save Grafana, Datadog, Honeycomb, Coralogix, Slack, AWS, GitHub MCP, and Sentry integrations
opensre onboard
opensre investigate -i tests/e2e/kubernetes/fixtures/datadog_k8s_alert.json
```

If you use VS Code, the repo now includes a ready-to-use devcontainer under .devcontainer/devcontainer.json. Open the repo in VS Code and run Dev Containers: Reopen in Container to get the project on Python 3.13 with the contributor toolchain preinstalled. Keep Docker Desktop, OrbStack, Colima, or another Docker-compatible runtime running on the host, since VS Code devcontainers rely on your local Docker engine.
When an alert fires, OpenSRE automatically:
- Fetches the alert context and correlated logs, metrics, and traces
- Reasons across your connected systems to identify anomalies
- Generates a structured investigation report with probable root cause
- Suggests next steps and, optionally, executes remediation actions
- Posts a summary directly to Slack or PagerDuty - no context switching needed
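To exercise that flow by hand, you can pass opensre investigate your own alert file. The payload below is purely illustrative; its field names are assumptions rather than a documented schema, and the real inputs used in testing live under tests/e2e/*/fixtures (for example datadog_k8s_alert.json):

```bash
# Hypothetical alert payload; field names are illustrative assumptions,
# not a documented schema. See tests/e2e/*/fixtures for real examples.
cat > my_alert.json <<'EOF'
{
  "title": "MongoDB orders cluster is dropping connections",
  "source": "datadog",
  "started_at": "2025-06-01T14:00:00Z"
}
EOF
opensre investigate -i my_alert.json
```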
Generate the benchmark report:

```bash
make benchmark
```

| Capability | Description |
|---|---|
| 🔍 Structured incident investigation | Correlated root-cause analysis across all your signals |
| 📋 Runbook-aware reasoning | OpenSRE reads your runbooks and applies them automatically |
| 🔮 Predictive failure detection | Catch emerging issues before they page you |
| 🔗 Evidence-backed root cause | Every conclusion is linked to the data behind it |
| 🤖 Full LLM flexibility | Bring your own model — Anthropic, OpenAI, Ollama, Gemini, OpenRouter, NVIDIA NIM |
OpenSRE connects to 60+ tools and services across the modern cloud stack, from LLM providers and observability platforms to infrastructure, databases, and incident management.
| Category | Integrations | Roadmap |
|---|---|---|
| AI / LLM Providers | Anthropic · OpenAI · Ollama · Google Gemini · OpenRouter · NVIDIA NIM · Bedrock | |
| Observability | Splunk · New Relic · Victoria Logs | |
| Infrastructure | Helm · ArgoCD | |
| Database | MongoDB · ClickHouse · PostgreSQL · MySQL · MariaDB · MongoDB Atlas · Azure SQL · Snowflake | RDS |
| Data Platform | Apache Airflow · Apache Kafka · Apache Spark · Prefect · RabbitMQ | |
| Dev Tools | | |
| Incident Management | Trello · ServiceNow · incident.io · Linear | |
| Communication | Notion · Teams · WhatsApp · Confluence | |
| Agent Deployment | | |
| Protocols | | |
OpenSRE is community-built. Every integration, improvement, and bug fix makes it better for thousands of engineers. We actively review PRs and welcome contributors of all experience levels.
Good first issues are labeled `good first issue`. Ways to contribute:
- 🐛 Report bugs or missing edge cases
- 🔌 Add a new tool integration
- 📖 Improve documentation or runbook examples
- ⭐ Star the repo - it helps other engineers find OpenSRE
See CONTRIBUTING.md for the full guide.
Thanks to these amazing people:
OpenSRE is designed with production environments in mind:
- No storing of raw log data beyond the investigation session
- All LLM calls use structured, auditable prompts
- Log transcripts are kept locally - never sent externally by default
See SECURITY.md for responsible disclosure.
opensre ships with two telemetry stacks, both opt-out:
- PostHog for anonymous product analytics (which commands are used, success/failure, rough runtime, CLI version, Python version, OS family, machine architecture, and a small amount of command-specific metadata such as which subcommand ran). For `opensre onboard` and `opensre investigate`, we may also collect the selected model/provider and whether the command used flags such as `--interactive` or `--input`.
- Sentry for crash and error reports (stack traces, environment, release tag). Stack traces are scrubbed for home-directory paths; auth headers, cookies, query strings on HTTP breadcrumbs, and obvious secret keys (`*_token`, `*_key`, `*_secret`, `*_password`) are filtered before transport.
A randomly generated anonymous install ID is created on first run and stored in `~/.config/opensre/anonymous_id`. PostHog `distinct_id` values are scoped to that install ID, so unique-user counts represent unique CLI installs/devices rather than command invocations. One-time lifecycle events use deterministic event IDs to avoid duplicate rows if they are retried.
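To check these values on a local install (paths as above; the events file is described under local logging below):

```bash
# anonymous install ID created on first run
cat ~/.config/opensre/anonymous_id
# locally mirrored PostHog events (local logging is on by default; see below)
tail -n 5 ~/.config/opensre/posthog_events.txt
```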
We never collect alert contents, file contents, hostnames, credentials, raw command arguments, or any other personally identifiable information. Telemetry is automatically disabled in GitHub Actions and pytest runs.
| Env var | PostHog | Sentry |
|---|---|---|
| `OPENSRE_NO_TELEMETRY=1` | disabled | disabled |
| `DO_NOT_TRACK=1` | disabled | disabled |
| `OPENSRE_ANALYTICS_DISABLED=1` | disabled | unaffected |
| `OPENSRE_SENTRY_DISABLED=1` | unaffected | disabled |
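Per the table, you can disable one stack while keeping the other:

```bash
# keep Sentry crash reporting but turn off PostHog product analytics
export OPENSRE_ANALYTICS_DISABLED=1
```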
For full opt-out:

```bash
export OPENSRE_NO_TELEMETRY=1
```

Self-hosted users can route errors to their own Sentry project by setting `SENTRY_DSN` in the environment before invoking opensre. Leaving it unset uses the bundled default DSN. Setting `SENTRY_DSN=` (empty) drops all events at the `before_send` hook.
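For example (the DSN below is Sentry's documented placeholder format, not a real project):

```bash
# placeholder DSN in the standard Sentry format; substitute your project's DSN
export SENTRY_DSN="https://examplePublicKey@o0.ingest.sentry.io/0"
# or set it empty to drop all events at the before_send hook
export SENTRY_DSN=
```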
To inspect what opensre is sending to PostHog, every event is also appended to `~/.config/opensre/posthog_events.txt` by default. The file rotates at 1000 lines (older lines move to `posthog_events.txt.1`, overwriting any prior backup) so it never grows unbounded. To disable local logging:

```bash
export OPENSRE_ANALYTICS_LOG_EVENTS=0
```

Apache 2.0 - see LICENSE for details.
