imran31415/codemode_python_benchmark

Code Mode Benchmark

"LLMs are better at writing code to call tools than at calling tools directly."Cloudflare Code Mode Research

A comprehensive benchmark comparing Code Mode (code generation) vs Traditional Function Calling for LLM tool interactions. Demonstrates that Code Mode achieves 60% faster execution, 68% fewer tokens, and 88% fewer API round trips while maintaining equal accuracy.

Python 3.11+ | License: MIT


🎯 Key Results

| Metric | Regular Agent | Code Mode | Improvement |
|---|---|---|---|
| Average Latency | 11.88s | 4.71s | 60.4% faster |
| API Round Trips | 8.0 iterations | 1.0 iteration | 87.5% reduction 🔄 |
| Token Usage | 144,250 tokens | 45,741 tokens | 68.3% savings 💰 |
| Success Rate | 6/8 (75%) | 7/8 (88%) | +13% higher |
| Validation Accuracy | 100% | 100% | Equal accuracy |

Annual Cost Savings: $9,536/year at 1,000 scenarios/day (Claude Haiku pricing)

📊 View Full Results | 📈 Raw Data Tables


🚀 Quick Start

Prerequisites

  • Python 3.11+
  • Anthropic API key (for Claude)
  • Google API key (for Gemini, optional)

Installation

# Clone the repository
git clone <repository-url>
cd codemode_benchmark

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env and add your API keys

Run the Benchmark

# Run full benchmark with Claude
make run

# Run with Gemini
python benchmark.py --model gemini

# Run specific scenario
python benchmark.py --scenario 1

# Run limited scenarios
python benchmark.py --limit 3

📁 Repository Structure

codemode_benchmark/
├── README.md                 # This file
├── benchmark.py             # Main benchmark runner
├── requirements.txt         # Python dependencies
├── Makefile                 # Convenient commands
│
├── agents/                  # Agent implementations
│   ├── __init__.py
│   ├── codemode_agent.py           # Code Mode (code generation)
│   ├── regular_agent.py            # Traditional function calling
│   ├── gemini_codemode_agent.py    # Gemini Code Mode
│   └── gemini_regular_agent.py     # Gemini function calling
│
├── tools/                   # Tool definitions
│   ├── __init__.py
│   ├── business_tools.py           # Accounting/invoicing tools
│   ├── accounting_tools.py         # Core accounting logic
│   └── example_tools.py            # Simple example tools
│
├── sandbox/                 # Secure code execution
│   ├── __init__.py
│   └── executor.py                 # RestrictedPython sandbox
│
├── tests/                   # Test files
│   ├── test_api.py
│   ├── test_scenarios.py           # Scenario definitions
│   └── ...
│
├── debug/                   # Debug scripts (development)
│   └── debug_*.py
│
├── docs/                    # Documentation
│   ├── BENCHMARK_SUMMARY.md        # Comprehensive analysis
│   ├── RESULTS_DATA.md             # Raw data tables
│   ├── QUICKSTART.md               # Quick start guide
│   ├── TOOLS.md                    # Tool API documentation
│   ├── CHANGELOG.md                # Version history
│   └── GEMINI.md                   # Gemini-specific notes
│
└── results/                 # Benchmark results
    ├── benchmark_results_claude.json
    ├── benchmark_results_gemini.json
    ├── results.log
    └── results-gemini.log

🔬 What is Code Mode?

Traditional Function Calling (Regular Agent)

User Query → LLM → Tool Call #1 → Execute → Result
          ↓
       LLM processes result → Tool Call #2 → Execute → Result
          ↓
       [Repeat 5-16 times...]
          ↓
       Final Response

Problems:

  • Multiple API round trips
  • Neural network processing between each tool call
  • Context grows with each iteration
  • High latency and token costs

Code Mode

User Query → LLM generates complete code → Executes all tools → Final Response

Advantages:

  • Single code generation pass
  • Batch multiple operations
  • No context re-processing
  • Natural programming constructs (loops, variables, conditionals)

Example:

Regular Agent sees this as 3 separate tool calls:

{"name": "create_transaction", "input": {"amount": 2500, ...}}
{"name": "create_transaction", "input": {"amount": 150, ...}}
{"name": "get_financial_summary", "input": {}}

Code Mode generates efficient code:

expenses = [
    ("rent", 2500, "Monthly rent"),
    ("utilities", 150, "Electricity")
]
for category, amount, desc in expenses:
    tools.create_transaction("expense", category, amount, desc)

summary = json.loads(tools.get_financial_summary())
result = f"Total: ${summary['summary']['total_expenses']}"
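For readers who want to run the generated snippet outside the benchmark, here is a minimal in-memory stand-in for the tools object. The stub below is illustrative only; the real tools live in tools/business_tools.py and behave differently (the opening balance here is an arbitrary assumption):

```python
import json

class ToolsStub:
    """Minimal in-memory stand-in for the benchmark's tools API (illustrative only)."""
    def __init__(self, opening_balance=10000.0):
        self.balance = opening_balance
        self.transactions = []

    def create_transaction(self, transaction_type, category, amount, description, account="checking"):
        if transaction_type == "expense":
            self.balance -= amount
        elif transaction_type == "income":
            self.balance += amount
        txn = {"type": transaction_type, "category": category,
               "amount": amount, "description": description, "account": account}
        self.transactions.append(txn)
        return json.dumps({"status": "success", "transaction": txn, "new_balance": self.balance})

    def get_financial_summary(self):
        total_expenses = sum(t["amount"] for t in self.transactions if t["type"] == "expense")
        total_income = sum(t["amount"] for t in self.transactions if t["type"] == "income")
        return json.dumps({"summary": {"total_expenses": total_expenses,
                                       "total_income": total_income,
                                       "balance": self.balance}})

# Run the Code Mode snippet from above against the stub.
tools = ToolsStub()
for category, amount, desc in [("rent", 2500, "Monthly rent"), ("utilities", 150, "Electricity")]:
    tools.create_transaction("expense", category, amount, desc)
summary = json.loads(tools.get_financial_summary())
print(f"Total: ${summary['summary']['total_expenses']}")  # Total: $2650
```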

🎯 Test Scenarios

The benchmark includes 8 realistic business scenarios:

  1. Monthly Expense Recording - Record 4 expenses and generate summary
  2. Client Invoicing Workflow - Create 2 invoices, update status, summarize
  3. Payment Processing - Create invoice, process partial payments
  4. Mixed Income/Expense Tracking - 7 transactions with financial analysis
  5. Multi-Account Management - Complex transfers between 3 accounts
  6. Quarter-End Analysis - Simulate 3 months of business activity
  7. Complex Multi-Client Invoicing - 3 invoices with partial payments (16 operations)
  8. Budget Tracking - 14 categorized expenses with analysis

Each scenario includes automated validation to ensure correctness.
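As a sketch of what such a validation check can look like, consider the function below. The function name and state shape are hypothetical, not the repo's actual test code; the point is that both agents must reach the same ground-truth totals to count as a pass:

```python
# Hypothetical scenario validator in the spirit of the benchmark's
# automated checks (names and state shape are illustrative).
def validate_expense_scenario(final_state: dict) -> bool:
    recorded = [t for t in final_state.get("transactions", [])
                if t["type"] == "expense"]
    expected_total = sum(t["amount"] for t in recorded)
    summary = final_state.get("summary", {})
    # Pass only if the right number of expenses was recorded and the
    # reported total matches the recomputed ground truth.
    return (len(recorded) == final_state["expected_count"]
            and abs(summary.get("total_expenses", 0.0) - expected_total) < 1e-6)

state = {
    "transactions": [{"type": "expense", "amount": 2500.0},
                     {"type": "expense", "amount": 150.0}],
    "summary": {"total_expenses": 2650.0},
    "expected_count": 2,
}
print(validate_expense_scenario(state))  # True
```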


🛠️ Implementation Details

Code Mode Architecture

class CodeModeAgent:
    def run(self, user_message: str) -> Dict[str, Any]:
        # 1. Send message with tools API documentation
        response = self.client.messages.create(
            system=self._create_system_prompt(),  # Contains tools API
            messages=[{"role": "user", "content": user_message}]
        )

        # 2. Extract generated code
        code = extract_code_from_response(response)

        # 3. Execute in sandbox
        result = self.executor.execute(code)

        return result
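The extract_code_from_response step can be sketched as a small fence-matching helper. This is an assumption about its behavior, not the repo's actual implementation:

```python
import re

# Sketch of the extract_code_from_response helper named above; the repo's
# real implementation may differ. It pulls the first fenced Python block
# out of the model's reply text, falling back to the whole reply.
FENCE = "`" * 3  # three backticks

CODE_BLOCK = re.compile(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, re.DOTALL)

def extract_code_from_response(text: str) -> str:
    match = CODE_BLOCK.search(text)
    return match.group(1).strip() if match else text.strip()

reply = f"Here is the code:\n{FENCE}python\nx = 1 + 1\nresult = x\n{FENCE}\nDone."
code = extract_code_from_response(reply)
print(code)
```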

Tools API with TypedDict

from typing import TypedDict, Literal

class TransactionResponse(TypedDict):
    status: Literal["success"]
    transaction: TransactionDict
    new_balance: float

def create_transaction(
    transaction_type: Literal["income", "expense", "transfer"],
    category: str,
    amount: float,
    description: str,
    account: str = "checking"
) -> str:
    """
    Create a new transaction.

    Returns: JSON string with TransactionResponse structure

    Example:
        result = tools.create_transaction("expense", "rent", 2500.0, "Monthly rent")
        data = json.loads(result)
        print(data["new_balance"])  # 7500.0
    """
    # Implementation...

Security with RestrictedPython

Code execution uses RestrictedPython for sandboxing:

  • No filesystem access
  • No network access
  • No dangerous imports
  • Controlled builtins
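As a much-simplified illustration of the idea, the sketch below runs code against a whitelist of builtins using plain exec. This is weaker than the repo's RestrictedPython executor, which adds compile-time restriction, but it shows the shape of the sandbox:

```python
# Whitelist-only sandbox sketch (illustrative; NOT the repo's
# RestrictedPython-based executor and not production-safe).
import json

SAFE_BUILTINS = {"len": len, "sum": sum, "range": range, "print": print,
                 "float": float, "abs": abs}

def run_sandboxed(code: str, tools):
    # Only whitelisted names are visible: no open(), no __import__,
    # no filesystem or network access via builtins.
    scope = {"__builtins__": SAFE_BUILTINS, "tools": tools, "json": json}
    exec(code, scope)
    return scope.get("result")

class EchoTools:
    def ping(self):
        return "pong"

result = run_sandboxed("result = tools.ping()", EchoTools())
print(result)  # pong
```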

📊 Performance Breakdown

By Scenario Complexity

| Complexity | Scenarios | Avg Speedup | Avg Token Savings |
|---|---|---|---|
| High (10+ ops) | 2 | 79.2% | 36,389 tokens |
| Medium (5-9 ops) | 3 | 47.5% | 8,774 tokens |
| Low (3-4 ops) | 1 | 45.3% | 6,209 tokens |

Key Insight: Code Mode's advantage scales with complexity, but even simple tasks benefit significantly.

Cost Analysis at Scale

| Daily Volume | Regular Annual | Code Mode Annual | Annual Savings |
|---|---|---|---|
| 100 | $252 | $77 | $175 |
| 1,000 | $2,519 | $766 | $1,753 |
| 10,000 | $25,185 | $7,665 | $17,520 |
| 100,000 | $251,850 | $76,650 | $175,200 |

(Based on Claude Haiku pricing: $0.25/1M input, $1.25/1M output)
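A back-of-the-envelope sketch of how such annual figures can be derived from the per-scenario token averages. The input/output token split below is an assumption (the table above is computed from the full raw data), so the exact dollar amounts will differ:

```python
# Rough annual cost estimate from per-scenario token averages.
# ASSUMPTION: 90% input / 10% output token split; the README's tables
# use the actual per-direction counts, so numbers will not match exactly.
INPUT_PRICE = 0.25 / 1_000_000   # Claude Haiku, $ per input token
OUTPUT_PRICE = 1.25 / 1_000_000  # Claude Haiku, $ per output token

def annual_cost(tokens_per_scenario: int, daily_volume: int,
                output_share: float = 0.1) -> float:
    in_tokens = tokens_per_scenario * (1 - output_share)
    out_tokens = tokens_per_scenario * output_share
    per_run = in_tokens * INPUT_PRICE + out_tokens * OUTPUT_PRICE
    return per_run * daily_volume * 365

regular = annual_cost(144_250, 1_000)   # Regular Agent average tokens
codemode = annual_cost(45_741, 1_000)   # Code Mode average tokens
print(f"regular ~ ${regular:,.0f}, code mode ~ ${codemode:,.0f}, "
      f"savings ~ ${regular - codemode:,.0f}")
```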


🤖 Supported Models

Claude (Anthropic)

  • Model: Claude 3 Haiku
  • Performance: 60.4% faster, 68.3% fewer tokens
  • Best For: Cost-sensitive production workloads
  • Status: ✅ Fully tested (8/8 scenarios)

Gemini (Google)

  • Model: Gemini 2.0 Flash Experimental
  • Performance: 15.1% faster, 70.6% fewer iterations
  • Best For: Low-latency requirements
  • Status: ✅ Partially tested (2/8 scenarios)
  • Note: Faster baseline but more verbose code generation

🧪 Running Tests

# Run all tests
make test

# Run specific test file
python -m pytest tests/test_scenarios.py

# Test Code Mode agent directly
python agents/codemode_agent.py

# Test Regular Agent directly
python agents/regular_agent.py

# Test sandbox execution
python sandbox/executor.py

📚 Documentation

  • docs/BENCHMARK_SUMMARY.md - Comprehensive analysis
  • docs/RESULTS_DATA.md - Raw data tables
  • docs/QUICKSTART.md - Quick start guide
  • docs/TOOLS.md - Tool API documentation
  • docs/CHANGELOG.md - Version history
  • docs/GEMINI.md - Gemini-specific notes

💡 Key Learnings

Why Code Mode Wins

  1. Batching Advantage

    • Single code block replaces multiple API calls
    • No neural network processing between operations
    • Example: 16 iterations → 1 iteration (Scenario 7)
  2. Cognitive Efficiency

    • LLMs have extensive training on code generation
    • Natural programming constructs (loops, variables, conditionals)
    • TypedDict provides clear type contracts
  3. Computational Efficiency

    • No context re-processing between tool calls
    • Direct code execution in sandbox
    • Reduced token overhead

When to Use Code Mode

✅ Multi-step workflows - Greatest benefit with many operations
✅ Complex business logic - Invoicing, accounting, data processing
✅ Batch operations - Similar actions on multiple items
✅ Cost-sensitive workloads - Production at scale
✅ Latency-critical applications - User-facing systems

Best Practices

  1. Use TypedDict for response types - Provides clear structure to LLM
  2. Include examples in docstrings - Shows correct usage patterns
  3. Batch similar operations - Leverage loops in code
  4. Validate results - Automated checks ensure correctness
  5. Handle errors gracefully - Try-except in generated code
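Practice 5 in action: a sketch of generated code that keeps processing a batch when one tool call fails, instead of aborting the whole run. Tool names follow the examples above; the failing tool and error format are contrived for illustration:

```python
import json

def record_expenses(tools, expenses):
    """Record each expense; collect failures instead of aborting the batch."""
    failures = []
    for category, amount, desc in expenses:
        try:
            tools.create_transaction("expense", category, amount, desc)
        except Exception as exc:  # keep going; report at the end
            failures.append({"category": category, "error": str(exc)})
    return failures

class FlakyTools:
    """Contrived tool that rejects negative amounts (illustration only)."""
    def create_transaction(self, *args):
        if args[2] < 0:
            raise ValueError("amount must be positive")

errors = record_expenses(FlakyTools(), [("rent", 2500, "ok"), ("oops", -1, "bad")])
print(json.dumps(errors))
```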

🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Run tests (make test)
  5. Commit (git commit -m 'Add amazing feature')
  6. Push (git push origin feature/amazing-feature)
  7. Open a Pull Request

📖 References

  • Cloudflare Code Mode research (quoted above)

📄 License

MIT License - See LICENSE file for details



📞 Contact

For questions or feedback, please open an issue on GitHub.


Benchmark Date: January 2025
Models Tested: Claude 3 Haiku, Gemini 2.0 Flash Experimental
Test Scenarios: 8 realistic business workflows
Result: Code Mode is 60% faster, uses 68% fewer tokens, with equal accuracy
