imran31415/codemode_python_benchmark

Code Mode Benchmark

"LLMs are better at writing code to call tools than at calling tools directly."Cloudflare Code Mode Research

A comprehensive benchmark comparing Code Mode (code generation) vs Traditional Function Calling for LLM tool interactions. Demonstrates that Code Mode achieves 60% faster execution, 68% fewer tokens, and 88% fewer API round trips while maintaining equal accuracy.

Python 3.11+ | License: MIT


🎯 Key Results

| Metric | Regular Agent | Code Mode | Improvement |
|---|---|---|---|
| Average Latency | 11.88s | 4.71s | 60.4% faster |
| API Round Trips | 8.0 iterations | 1.0 iteration | 87.5% reduction 🔄 |
| Token Usage | 144,250 tokens | 45,741 tokens | 68.3% savings 💰 |
| Success Rate | 6/8 (75%) | 7/8 (88%) | +13% higher |
| Validation Accuracy | 100% | 100% | Equal accuracy |

Annual Cost Savings: $9,536/year at 1,000 scenarios/day (Claude Haiku pricing)

📊 View Full Results | 📈 Raw Data Tables


🚀 Quick Start

Prerequisites

  • Python 3.11+
  • Anthropic API key (for Claude)
  • Google API key (for Gemini, optional)

Installation

# Clone the repository
git clone <repository-url>
cd codemode_benchmark

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env and add your API keys

Run the Benchmark

# Run full benchmark with Claude
make run

# Run with Gemini
python benchmark.py --model gemini

# Run specific scenario
python benchmark.py --scenario 1

# Run limited scenarios
python benchmark.py --limit 3

📁 Repository Structure

codemode_benchmark/
├── README.md                 # This file
├── benchmark.py             # Main benchmark runner
├── requirements.txt         # Python dependencies
├── Makefile                 # Convenient commands
│
├── agents/                  # Agent implementations
│   ├── __init__.py
│   ├── codemode_agent.py           # Code Mode (code generation)
│   ├── regular_agent.py            # Traditional function calling
│   ├── gemini_codemode_agent.py    # Gemini Code Mode
│   └── gemini_regular_agent.py     # Gemini function calling
│
├── tools/                   # Tool definitions
│   ├── __init__.py
│   ├── business_tools.py           # Accounting/invoicing tools
│   ├── accounting_tools.py         # Core accounting logic
│   └── example_tools.py            # Simple example tools
│
├── sandbox/                 # Secure code execution
│   ├── __init__.py
│   └── executor.py                 # RestrictedPython sandbox
│
├── tests/                   # Test files
│   ├── test_api.py
│   ├── test_scenarios.py           # Scenario definitions
│   └── ...
│
├── debug/                   # Debug scripts (development)
│   └── debug_*.py
│
├── docs/                    # Documentation
│   ├── BENCHMARK_SUMMARY.md        # Comprehensive analysis
│   ├── RESULTS_DATA.md             # Raw data tables
│   ├── QUICKSTART.md               # Quick start guide
│   ├── TOOLS.md                    # Tool API documentation
│   ├── CHANGELOG.md                # Version history
│   └── GEMINI.md                   # Gemini-specific notes
│
└── results/                 # Benchmark results
    ├── benchmark_results_claude.json
    ├── benchmark_results_gemini.json
    ├── results.log
    └── results-gemini.log

🔬 What is Code Mode?

Traditional Function Calling (Regular Agent)

User Query → LLM → Tool Call #1 → Execute → Result
          ↓
       LLM processes result → Tool Call #2 → Execute → Result
          ↓
       [Repeat 5-16 times...]
          ↓
       Final Response

Problems:

  • Multiple API round trips
  • Neural network processing between each tool call
  • Context grows with each iteration
  • High latency and token costs

Code Mode

User Query → LLM generates complete code → Executes all tools → Final Response

Advantages:

  • Single code generation pass
  • Batch multiple operations
  • No context re-processing
  • Natural programming constructs (loops, variables, conditionals)

Example:

Regular Agent sees this as 3 separate tool calls:

{"name": "create_transaction", "input": {"amount": 2500, ...}}
{"name": "create_transaction", "input": {"amount": 150, ...}}
{"name": "get_financial_summary", "input": {}}

Code Mode generates efficient code:

expenses = [
    ("rent", 2500, "Monthly rent"),
    ("utilities", 150, "Electricity")
]
for category, amount, desc in expenses:
    tools.create_transaction("expense", category, amount, desc)

summary = json.loads(tools.get_financial_summary())
result = f"Total: ${summary['summary']['total_expenses']}"
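For readers who want to run the generated snippet outside the benchmark, here is a minimal in-memory stand-in for the tools object. The stub below is illustrative only; the real tools live in tools/business_tools.py and behave differently (the opening balance here is an arbitrary assumption):

```python
import json

class ToolsStub:
    """Minimal in-memory stand-in for the benchmark's tools API (illustrative only)."""
    def __init__(self, opening_balance=10000.0):
        self.balance = opening_balance
        self.transactions = []

    def create_transaction(self, transaction_type, category, amount, description, account="checking"):
        if transaction_type == "expense":
            self.balance -= amount
        elif transaction_type == "income":
            self.balance += amount
        txn = {"type": transaction_type, "category": category,
               "amount": amount, "description": description, "account": account}
        self.transactions.append(txn)
        return json.dumps({"status": "success", "transaction": txn, "new_balance": self.balance})

    def get_financial_summary(self):
        total_expenses = sum(t["amount"] for t in self.transactions if t["type"] == "expense")
        total_income = sum(t["amount"] for t in self.transactions if t["type"] == "income")
        return json.dumps({"summary": {"total_expenses": total_expenses,
                                       "total_income": total_income,
                                       "balance": self.balance}})

# Run the Code Mode snippet from above against the stub.
tools = ToolsStub()
for category, amount, desc in [("rent", 2500, "Monthly rent"), ("utilities", 150, "Electricity")]:
    tools.create_transaction("expense", category, amount, desc)
summary = json.loads(tools.get_financial_summary())
print(f"Total: ${summary['summary']['total_expenses']}")  # Total: $2650
```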

🎯 Test Scenarios

The benchmark includes 8 realistic business scenarios:

  1. Monthly Expense Recording - Record 4 expenses and generate summary
  2. Client Invoicing Workflow - Create 2 invoices, update status, summarize
  3. Payment Processing - Create invoice, process partial payments
  4. Mixed Income/Expense Tracking - 7 transactions with financial analysis
  5. Multi-Account Management - Complex transfers between 3 accounts
  6. Quarter-End Analysis - Simulate 3 months of business activity
  7. Complex Multi-Client Invoicing - 3 invoices with partial payments (16 operations)
  8. Budget Tracking - 14 categorized expenses with analysis

Each scenario includes automated validation to ensure correctness.
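As a sketch of what such a validation check can look like, consider the function below. The function name and state shape are hypothetical, not the repo's actual test code; the point is that both agents must reach the same ground-truth totals to count as a pass:

```python
# Hypothetical scenario validator in the spirit of the benchmark's
# automated checks (names and state shape are illustrative).
def validate_expense_scenario(final_state: dict) -> bool:
    recorded = [t for t in final_state.get("transactions", [])
                if t["type"] == "expense"]
    expected_total = sum(t["amount"] for t in recorded)
    summary = final_state.get("summary", {})
    # Pass only if the right number of expenses was recorded and the
    # reported total matches the recomputed ground truth.
    return (len(recorded) == final_state["expected_count"]
            and abs(summary.get("total_expenses", 0.0) - expected_total) < 1e-6)

state = {
    "transactions": [{"type": "expense", "amount": 2500.0},
                     {"type": "expense", "amount": 150.0}],
    "summary": {"total_expenses": 2650.0},
    "expected_count": 2,
}
print(validate_expense_scenario(state))  # True
```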


🛠️ Implementation Details

Code Mode Architecture

class CodeModeAgent:
    def run(self, user_message: str) -> Dict[str, Any]:
        # 1. Send message with tools API documentation
        response = self.client.messages.create(
            system=self._create_system_prompt(),  # Contains tools API
            messages=[{"role": "user", "content": user_message}]
        )

        # 2. Extract generated code
        code = extract_code_from_response(response)

        # 3. Execute in sandbox
        result = self.executor.execute(code)

        return result
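The extract_code_from_response step can be sketched as a small fence-matching helper. This is an assumption about its behavior, not the repo's actual implementation:

```python
import re

# Sketch of the extract_code_from_response helper named above; the repo's
# real implementation may differ. It pulls the first fenced Python block
# out of the model's reply text, falling back to the whole reply.
FENCE = "`" * 3  # three backticks

CODE_BLOCK = re.compile(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, re.DOTALL)

def extract_code_from_response(text: str) -> str:
    match = CODE_BLOCK.search(text)
    return match.group(1).strip() if match else text.strip()

reply = f"Here is the code:\n{FENCE}python\nx = 1 + 1\nresult = x\n{FENCE}\nDone."
code = extract_code_from_response(reply)
print(code)
```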

Tools API with TypedDict

from typing import TypedDict, Literal

class TransactionResponse(TypedDict):
    status: Literal["success"]
    transaction: TransactionDict
    new_balance: float

def create_transaction(
    transaction_type: Literal["income", "expense", "transfer"],
    category: str,
    amount: float,
    description: str,
    account: str = "checking"
) -> str:
    """
    Create a new transaction.

    Returns: JSON string with TransactionResponse structure

    Example:
        result = tools.create_transaction("expense", "rent", 2500.0, "Monthly rent")
        data = json.loads(result)
        print(data["new_balance"])  # 7500.0
    """
    # Implementation...

Security with RestrictedPython

Code execution uses RestrictedPython for sandboxing:

  • No filesystem access
  • No network access
  • No dangerous imports
  • Controlled builtins
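As a much-simplified illustration of the idea, the sketch below runs code against a whitelist of builtins using plain exec. This is weaker than the repo's RestrictedPython executor, which adds compile-time restriction, but it shows the shape of the sandbox:

```python
# Whitelist-only sandbox sketch (illustrative; NOT the repo's
# RestrictedPython-based executor and not production-safe).
import json

SAFE_BUILTINS = {"len": len, "sum": sum, "range": range, "print": print,
                 "float": float, "abs": abs}

def run_sandboxed(code: str, tools):
    # Only whitelisted names are visible: no open(), no __import__,
    # no filesystem or network access via builtins.
    scope = {"__builtins__": SAFE_BUILTINS, "tools": tools, "json": json}
    exec(code, scope)
    return scope.get("result")

class EchoTools:
    def ping(self):
        return "pong"

result = run_sandboxed("result = tools.ping()", EchoTools())
print(result)  # pong
```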

📊 Performance Breakdown

By Scenario Complexity

| Complexity | Scenarios | Avg Speedup | Avg Token Savings |
|---|---|---|---|
| High (10+ ops) | 2 | 79.2% | 36,389 tokens |
| Medium (5-9 ops) | 3 | 47.5% | 8,774 tokens |
| Low (3-4 ops) | 1 | 45.3% | 6,209 tokens |

Key Insight: Code Mode's advantage scales with complexity, but even simple tasks benefit significantly.

Cost Analysis at Scale

| Daily Volume | Regular Annual | Code Mode Annual | Annual Savings |
|---|---|---|---|
| 100 | $252 | $77 | $175 |
| 1,000 | $2,519 | $766 | $1,753 |
| 10,000 | $25,185 | $7,665 | $17,520 |
| 100,000 | $251,850 | $76,650 | $175,200 |

(Based on Claude Haiku pricing: $0.25/1M input, $1.25/1M output)
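A back-of-the-envelope sketch of how such annual figures can be derived from the per-scenario token averages. The input/output token split below is an assumption (the table above is computed from the full raw data), so the exact dollar amounts will differ:

```python
# Rough annual cost estimate from per-scenario token averages.
# ASSUMPTION: 90% input / 10% output token split; the README's tables
# use the actual per-direction counts, so numbers will not match exactly.
INPUT_PRICE = 0.25 / 1_000_000   # Claude Haiku, $ per input token
OUTPUT_PRICE = 1.25 / 1_000_000  # Claude Haiku, $ per output token

def annual_cost(tokens_per_scenario: int, daily_volume: int,
                output_share: float = 0.1) -> float:
    in_tokens = tokens_per_scenario * (1 - output_share)
    out_tokens = tokens_per_scenario * output_share
    per_run = in_tokens * INPUT_PRICE + out_tokens * OUTPUT_PRICE
    return per_run * daily_volume * 365

regular = annual_cost(144_250, 1_000)   # Regular Agent average tokens
codemode = annual_cost(45_741, 1_000)   # Code Mode average tokens
print(f"regular ~ ${regular:,.0f}, code mode ~ ${codemode:,.0f}, "
      f"savings ~ ${regular - codemode:,.0f}")
```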


🤖 Supported Models

Claude (Anthropic)

  • Model: Claude 3 Haiku
  • Performance: 60.4% faster, 68.3% fewer tokens
  • Best For: Cost-sensitive production workloads
  • Status: ✅ Fully tested (8/8 scenarios)

Gemini (Google)

  • Model: Gemini 2.0 Flash Experimental
  • Performance: 15.1% faster, 70.6% fewer iterations
  • Best For: Low-latency requirements
  • Status: ✅ Partially tested (2/8 scenarios)
  • Note: Faster baseline but more verbose code generation

🧪 Running Tests

# Run all tests
make test

# Run specific test file
python -m pytest tests/test_scenarios.py

# Test Code Mode agent directly
python agents/codemode_agent.py

# Test Regular Agent directly
python agents/regular_agent.py

# Test sandbox execution
python sandbox/executor.py

📚 Documentation

  • docs/BENCHMARK_SUMMARY.md - Comprehensive analysis
  • docs/RESULTS_DATA.md - Raw data tables
  • docs/QUICKSTART.md - Quick start guide
  • docs/TOOLS.md - Tool API documentation
  • docs/CHANGELOG.md - Version history
  • docs/GEMINI.md - Gemini-specific notes

💡 Key Learnings

Why Code Mode Wins

  1. Batching Advantage

    • Single code block replaces multiple API calls
    • No neural network processing between operations
    • Example: 16 iterations → 1 iteration (Scenario 7)
  2. Cognitive Efficiency

    • LLMs have extensive training on code generation
    • Natural programming constructs (loops, variables, conditionals)
    • TypedDict provides clear type contracts
  3. Computational Efficiency

    • No context re-processing between tool calls
    • Direct code execution in sandbox
    • Reduced token overhead

When to Use Code Mode

✅ Multi-step workflows - Greatest benefit with many operations
✅ Complex business logic - Invoicing, accounting, data processing
✅ Batch operations - Similar actions on multiple items
✅ Cost-sensitive workloads - Production at scale
✅ Latency-critical applications - User-facing systems

Best Practices

  1. Use TypedDict for response types - Provides clear structure to LLM
  2. Include examples in docstrings - Shows correct usage patterns
  3. Batch similar operations - Leverage loops in code
  4. Validate results - Automated checks ensure correctness
  5. Handle errors gracefully - Try-except in generated code
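Practice 5 in action: a sketch of generated code that keeps processing a batch when one tool call fails, instead of aborting the whole run. Tool names follow the examples above; the failing tool and error format are contrived for illustration:

```python
import json

def record_expenses(tools, expenses):
    """Record each expense; collect failures instead of aborting the batch."""
    failures = []
    for category, amount, desc in expenses:
        try:
            tools.create_transaction("expense", category, amount, desc)
        except Exception as exc:  # keep going; report at the end
            failures.append({"category": category, "error": str(exc)})
    return failures

class FlakyTools:
    """Contrived tool that rejects negative amounts (illustration only)."""
    def create_transaction(self, *args):
        if args[2] < 0:
            raise ValueError("amount must be positive")

errors = record_expenses(FlakyTools(), [("rent", 2500, "ok"), ("oops", -1, "bad")])
print(json.dumps(errors))
```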

🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Run tests (make test)
  5. Commit (git commit -m 'Add amazing feature')
  6. Push (git push origin feature/amazing-feature)
  7. Open a Pull Request

📖 References

  • Cloudflare Code Mode research (quoted above)

📄 License

MIT License - See LICENSE file for details



📞 Contact

For questions or feedback, please open an issue on GitHub.


Benchmark Date: January 2025
Models Tested: Claude 3 Haiku, Gemini 2.0 Flash Experimental
Test Scenarios: 8 realistic business workflows
Result: Code Mode is 60% faster, uses 68% fewer tokens, with equal accuracy
