A governed agent runtime built around inspectability — OpenTelemetry-shaped traces, bounded retries, a policy-gate layer that denies unsafe tool calls, and a documented failure-mode catalog — designed to run on a laptop and audit, not a tracing dashboard with agents bolted on.
The runtime ships v1 recorded runs on main. The canonical run runs/2026-05-06_240d1c56_0/ walks search → fetch → read → summarize → final_answer end-to-end via make canonical, against the synthetic 25-document fixture corpus and a deterministic stub LLM (src/runtime/stub_llm/) that keeps the canonical run hosted-LLM-free.
Seven policy-gate runs under runs/policy_gates/ and five failure-mode runs under runs/failure_modes/ complete the v1 recorded-run set. Each committed run carries the four required artifacts: trace.json, state.jsonl, run_report.md, and manifest.json.
| Capability | Captured run | Trace signature |
|---|---|---|
| Canonical demo task — single-agent loop reaches a final answer | runs/2026-05-06_240d1c56_0/run_report.md |
agent_step ×5, llm_call ×5, tool_call ×4, policy_check ×4 (all allow), retry_attempt ×4 (all success); terminal final_answer |
| Policy gates that deny unsafe tool calls | ||
| Off-allowlist URL (PG1) | runs/policy_gates/pg1_off_allowlist_url/run_report.md |
policy_check deny; agent.policy.rule_id = url_allowlist; terminates after deny |
| Sandbox escape (PG2) | runs/policy_gates/pg2_sandbox_escape/run_report.md |
policy_check deny; agent.policy.rule_id = sandbox_path; terminates after deny |
| Loop / iteration budget (PG3) | runs/policy_gates/pg3_loop_budget/run_report.md |
terminates with agent.terminal_reason = loop_budget; deny-span emission is a documented runtime gap (locked in tests/evidence/test_pg3_runs_terminate_with_loop_budget_reason) |
| ↳ Token-budget variant (PG3-tokens) | runs/policy_gates/pg3_loop_budget_tokens/run_report.md |
terminates with agent.terminal_reason = loop_budget; deny-span emission is the same documented runtime gap as PG3 |
| Forbidden tool (PG4) | runs/policy_gates/pg4_forbidden_tool/run_report.md |
policy_check deny; agent.policy.rule_id = tool_registry; terminates after deny |
| ↳ Precedence variant (PG4 with arg-schema violation) | runs/policy_gates/pg4_forbidden_tool_with_arg_schema_violation/run_report.md |
policy_check deny; agent.policy.rule_id = tool_registry (precedence rule fires before arg_schema) |
| Argument-shape violation (PG5) | runs/policy_gates/pg5_arg_schema/run_report.md |
policy_check deny; agent.policy.rule_id = arg_schema; parent agent_step carries agent.failure_mode = schema_mismatch (F3) |
| Failure modes with reproducible triggers | ||
tool_call_failure (F1) |
runs/failure_modes/tool_call_failure/run_report.md |
retry_attempt with outcome = transient_failure then success; agent.failure_mode = tool_call_failure; terminal final_answer |
retry_exhaustion (F2) |
runs/failure_modes/retry_exhaustion/run_report.md |
final retry_attempt with outcome = exhausted; agent.failure_mode = retry_exhaustion; terminal_reason = failure_mode_terminal |
schema_mismatch (F3) |
runs/failure_modes/schema_mismatch/run_report.md |
policy_check deny rule_id = arg_schema; agent.failure_mode = schema_mismatch; terminal_reason = failure_mode_terminal |
cycle_detection (F4) |
runs/failure_modes/cycle_detection/run_report.md |
policy_check deny rule_id = cycle_detection; agent.failure_mode = cycle_detection; terminal_reason = policy_denial_terminal |
catalogued_unhandled (F5) |
runs/failure_modes/catalogued_unhandled/run_report.md |
tool_call span carries agent.tool.exception_class; agent.failure_mode = catalogued_unhandled; terminal_reason = failure_mode_terminal |
The canonical run (2026-05-06_240d1c56_0) executes 5 agent steps with 4 tool calls and 4 retry attempts (0 exhausted; all retries succeeded), emits 22 trace spans (agent_step ×5, llm_call ×5, tool_call ×4, policy_check ×4, retry_attempt ×4), and terminates with final_answer against the 25-document synthetic fixture corpus. Of the five named policy-gate scenarios, four (PG1, PG2, PG4, PG5) fire deny spans with the documented rule_id set in the trace; PG3 (loop budget; both iteration and token-budget variants) terminates via agent.terminal_reason = loop_budget rather than a deny span — a known gap, tracked at runs/policy_gates/pg3_loop_budget/run_report.md. All five canonical failure modes fire agent.failure_mode on the offending step and reach their documented terminal state. Numbers and structural facts come from the captured run_report.md and manifest.json files; nothing in this prose claims more than those files support.
The runtime, policy, trace, tool, fixture, and evidence slices verify the runtime / tool boundary with:
pytest tests/make smoke— verify file structuremake smoke-runtime— runtime-skeleton testsmake policy-gates— exercise the seven policy-gate scenarios;SCENARIO=<id>selects onemake policy-gates-check— re-emit and diff each policy-gate run against its committed artifactsmake failure-modes— exercise the five canonical modes;SCENARIO=<id>selects onemake failure-modes-check— re-emit and diff each failure-mode run against its committed artifactsmake trace-smoke— drive the in-tree trace fixture through the OTLP-JSON exporter and validate against the subset schemamake tool-smoke— drive the five v1 tools through a realAgent.runwith strict-modearg_schemaenforcementmake fixture-build— (re)build the deterministic fixture corpus from the documented seedpython3 -m scripts.build_fixture_corpus --check— verify on-disk fixture matches the manifestmake canonical— drive the canonical task fixture through a realAgent.runagainst the fixture corpus and the v1 tool, policy, and trace surfacesmake canonical-check— re-emit and diff the canonical run against the committed artifactsmake evidence-buildandmake evidence-check— aggregate emit / diff across canonical + policy-gate + failure-mode runs
Scope: this is a bounded reference implementation for governed tool use, schema enforcement, policy gates, failure classification, and traceable agent execution — not a production agent platform.
| Path | What is here today |
|---|---|
src/runtime/ |
Single-agent loop (agent.py), state ledger (state.py), policy seam (policy.py), bounded retry (retry.py), input-schema enforcement (_schema.py), and the deterministic stub LLM driver (stub_llm/). |
src/tracing/ |
OTLP-JSON-subset writer (otel_exporter.py) and the subset-schema validator (otlp_subset_schema.py). |
src/fail/ |
Failure-mode classifier mapping spans to the five catalogued modes (catalog.py). |
src/evidence/ |
Captured-run helpers: manifest.py for the per-run reproducibility envelope, run_report.py for the human-readable headline, and emit.py for writing the four artifact files atomically. |
policy/ |
Canonical YAML policy spec (v1.yaml), JSON meta-schema (v1.schema.json), and policy-gate documentation. The runtime self-validates the spec at startup. |
tools/ |
Five v1 tools with INPUT_SCHEMA per tool: search.py, fetch.py, read.py, write.py, summarize.py. |
data/corpus/v1/ |
Synthetic 25-document fixture corpus (deterministic, license-clean CC0-1.0, isolated to data/). Attestation in data/DATA-SOURCE.md. |
tasks/ |
Canonical task fixture, seven policy-gate fixtures (PG1, PG2, PG3, PG3-tokens, PG4, PG4-precedence, PG5), and five failure-mode triggers. |
tests/ |
Runtime-smoke, structure-smoke (smoke.sh), per-policy-gate scenario tests, per-failure-mode trigger tests, fixture-driven integration tests, and the evidence-suite tests that lock the captured-run shape and reproducibility envelope. |
scripts/ |
run_canonical_smoke.py, run_policy_gates.py, run_failure_modes.py, run_trace_smoke.py, run_tool_smoke.py, build_fixture_corpus.py. |
runs/ |
Committed captured runs: the canonical demo at runs/2026-05-06_240d1c56_0/, seven policy-gate runs under runs/policy_gates/, and five failure-mode runs under runs/failure_modes/. See runs/README.md for the per-run file inventory. |
docs/ |
Architecture, runtime model, policy gates, failure modes, evidence-anchoring discipline. |
failure_modes.md |
Top-level documented failure-mode catalog summary. |
Makefile |
The targets named in Verification surface above. |
git clone https://github.com/Ariel-J-Lee/agent-runtime-observability.git
cd agent-runtime-observability
pip install -r requirements.txt -r requirements-dev.txt
make smoke # verify file structure
make smoke-runtime # runtime-skeleton tests
make policy-gates # exercise all seven policy-gate scenarios
make failure-modes # exercise all five failure-mode triggers
make trace-smoke # trace exporter + subset-schema validation
make tool-smoke # five tools through a real Agent.run
make fixture-build # (re)build the deterministic fixture corpus from the documented seed
python3 -m scripts.build_fixture_corpus --check # verify on-disk fixture matches the manifest
make canonical # canonical task through Agent.run on the fixture corpus
make canonical-check # re-emit and diff the canonical run vs committed artifacts
make policy-gates-check # re-emit and diff all seven policy-gate runs
make failure-modes-check # re-emit and diff all five failure-mode runsThe canonical run is deterministic given the pinned seed, the deterministic stub LLM, and the policy-spec hash. Two reviewers running make canonical-check against the same code state produce byte-identical trace.json, state.jsonl, and run_report.md. The manifest.json is byte-identical except on three documented per-run-volatile keys (timestamp, wall_clock_seconds, code.git_sha) that are excluded from the reproducibility diff.
This is a reference implementation, not a benchmark.
- Deterministic stub LLM canonical default. The committed runs use a deterministic stub LLM at
src/runtime/stub_llm/canned.py, keyed by the canonical fixture. TheAgentconstructor accepts aCallable[[LLMInput], LLMOutput]seam, so a caller can plug in a live local-LLM or hosted-API adapter that satisfies that shape. No live local-LLM or hosted-API adapter ships in this release; the recorded runs use the stub default exclusively. - Synthetic public-safe fixture corpus. Twenty-five hand-authored CC0-1.0 documents under
data/corpus/v1/. The runtime is exercised against these only; results do not generalize beyond this corpus. - Single-agent only at v1. No multi-agent coordination. Multi-agent claims do not appear anywhere in this README,
docs/, or the repo description. - Filesystem-path-based sandbox. Sandbox isolation is
realpath-based against an allowlisted per-run directory. Not container-isolated and not capability-restricted at v1. - PG3 known gap. The loop-budget policy (PG3 and PG3-tokens) terminates via
agent.terminal_reason = loop_budgetrather than emitting apolicy_checkdeny span. The gap is locked intests/evidence/test_pg3_runs_terminate_with_loop_budget_reason. Surfacing the deny span on PG3 is a follow-on runtime fix. - Scope: reference, not benchmark. The captured runs demonstrate that the runtime components, the policy-gate denials, and the failure-mode triggers fire end-to-end with reproducible traces. They are not a performance, latency, throughput, or accuracy benchmark.
- Reproducibility envelope is bounded. Two reviewers running the same
make canonical-checkinvocation produce byte-identicaltrace.jsonandstate.jsonlgiven the pinned policy-spec hash, stub-LLM script hash, code SHA, and seed. Reproducibility across substantively different code or policy revisions is not claimed. - No coding-agent claim, no MCP-server claim. Both deferred; neither is in v1 scope.
- No production deployment claim, no customer-deployment claim, no autonomous-agent operation, no large-scale inference platform claim, no RLHF / DPO / LoRA training. This is a local reference implementation.
production-rag-eval-harness— retrieval-quality evaluation harness.aws-bedrock-iac-reference— AWS / Bedrock infrastructure-as-code reference.
Cross-references are descriptive only; this repository does not import or deploy them.