AgentContract-Bench v2

Benchmarks

293 scenarios across 12 domains. 100% pass rate. Live LLM results.

Benchmark Overview

A behavioral contract benchmark for AI agents, covering 293 scenarios across 12 enterprise domains and evaluated against 3 LLMs.

293 Benchmark Scenarios
100% Pass Rate
12 Domains
3 LLMs Evaluated

Live LLM Results

Aggregate compliance scores across all 293 scenarios.

Claude Sonnet 4.6: Θ = 0.823 (Highest Compliance)
Mistral-Large-3: Θ = 0.813 (Strong Compliance)
GPT-5.3: Θ = 0.688 (Moderate Compliance)

Domain Breakdown

Scenario distribution across 12 enterprise domains.

Domain               Scenarios   Coverage
E-Commerce           42          Order, inventory, pricing
Finance              38          Transactions, compliance, risk
Healthcare           35          Triage, records, referrals
Retail               28          Returns, support, recommendations
Telecom              26          Billing, provisioning, support
Dev Tools            24          Code generation, review, CI/CD
Research             22          Literature search, summarization
General              20          Multi-domain, cross-cutting
MCP                  18          Tool calling, protocol compliance
RAG                  16          Retrieval accuracy, grounding
Content Moderation   14          Safety, policy enforcement
Legal                10          Document review, compliance
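
As a quick sanity check on the domain breakdown, the per-domain scenario counts can be summed and turned into percentage shares. The sketch below is illustrative only: the dictionary name `DOMAIN_SCENARIOS` and the helper `domain_shares` are hypothetical and are not part of the benchmark's source code; the counts are copied from the breakdown above.

```python
# Scenario counts per domain, copied from the domain breakdown.
# DOMAIN_SCENARIOS and domain_shares are hypothetical names used for illustration.
DOMAIN_SCENARIOS = {
    "E-Commerce": 42,
    "Finance": 38,
    "Healthcare": 35,
    "Retail": 28,
    "Telecom": 26,
    "Dev Tools": 24,
    "Research": 22,
    "General": 20,
    "MCP": 18,
    "RAG": 16,
    "Content Moderation": 14,
    "Legal": 10,
}


def domain_shares(counts):
    """Return each domain's share of all scenarios as a percentage (1 decimal)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


total = sum(DOMAIN_SCENARIOS.values())
shares = domain_shares(DOMAIN_SCENARIOS)

print(f"{len(DOMAIN_SCENARIOS)} domains, {total} scenarios")
for name, pct in shares.items():
    print(f"{name:<20}{DOMAIN_SCENARIOS[name]:>4}{pct:>7.1f}%")
```

The counts add up to the 293 scenarios quoted in the overview, with E-Commerce, Finance, and Healthcare together accounting for a little under 40% of the total.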

Explore the Benchmarks

Full benchmark source code, scenarios, and results are available on GitHub.
