Otter

An agent code evaluation framework with native multi-turn feedback iteration.

Why Otter

Mainstream code benchmarks use snapshot-style evaluation — one input, one output. But real-world programming involves iterating based on compiler errors, test failures, and other feedback. This feedback-driven iteration is the core of programming ability.

Otter integrates evaluation feedback into the evaluation loop, letting agents work like real developers: write code → run → read errors → fix → run again, until the tests pass or the maximum number of turns is reached.

   ┌────────────────────────────┐
   ↓                            │
Proposer ───→ Executor ───→ Evaluator
                                │
                          Pass? │
                                ↓
                               End
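
Conceptually, one problem's loop can be sketched as below. This is an illustrative outline only: the propose/execute/evaluate callables and their return shapes are assumptions made for the sketch, not Otter's actual API (only the verdict shape mirrors the documented meta.json format).

# Illustrative sketch of the multi-turn loop -- component interfaces are hypothetical
def run_problem(propose, execute, evaluate, problem, max_turns):
    """One problem's loop: propose -> execute -> evaluate, repeat on failure."""
    feedback = None
    for turn in range(1, max_turns + 1):
        code = propose(problem, feedback)       # write or revise code using prior feedback
        result = execute(code)                  # run it in an isolated sandbox (e.g. Docker)
        verdict = evaluate(problem, result)     # check the run against the problem's tests
        if verdict["passed"]:                   # same shape as the per-turn meta.json verdict
            return turn                         # tests pass: stop early
        feedback = result                       # feed compiler errors / test failures forward
    return None                                 # max_turns exhausted without passing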

Quick Start

Prerequisites: Python >= 3.11, Docker

# Install
pip install -e .

# Configure
cp .env.example .env
# Edit .env with your API credentials

# Run evaluation
otter run

Configuration

All parameters are managed via .env files. The CLI only accepts --env to select a config file:

otter run                    # uses .env by default
otter run --env .env.local   # specify a config file

See Environment Variable Configuration for the full parameter reference.
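
As a purely illustrative sketch, a config file could hold entries along the following lines. The key names here are placeholders invented for this example; the authoritative names live in .env.example and the configuration reference.

# Hypothetical .env sketch -- key names are placeholders, not Otter's real variables
LLM_API_KEY=your-api-key                 # API credentials mentioned in the Quick Start
LLM_BASE_URL=https://api.example.com/v1  # placeholder endpoint
MAX_TURNS=5                              # maximum feedback iterations per problem (illustrative)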

Output Structure

Results are saved under experiments/ as a directory tree, with a full record for each turn of each problem:

experiments/{experiment_id}/
└── {task_id}#{sample_id}/
    ├── turn_1/
    │   ├── prop_input/    # Proposer input (created if proposer enabled)
    │   ├── prop_output/   # Proposer output (created if proposer enabled)
    │   ├── exec_input/    # Executor input (created if executor enabled)
    │   ├── exec_output/   # Executor output (created if executor enabled)
    │   ├── eval_input/    # Evaluator input (created if evaluator enabled)
    │   ├── eval_output/   # Evaluator output (created if evaluator enabled)
    │   └── meta.json     # Turn verdict {"passed": true/false}
    ├── turn_2/           # Turn 2 (if turn 1 failed and max_turns > 1)
    │   └── ...
    └── ...
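
Because every turn records its verdict in meta.json, results can be post-processed straight from this tree. A minimal sketch, assuming only the layout shown above (this script is not part of Otter, and the experiment id is a placeholder):

import json
from pathlib import Path

def summarize(experiment_dir):
    # Each sample directory is named {task_id}#{sample_id} and holds turn_1/, turn_2/, ...
    solved, total = 0, 0
    for sample_dir in sorted(Path(experiment_dir).iterdir()):
        if not sample_dir.is_dir():
            continue
        total += 1
        # A sample counts as solved if any turn's meta.json records "passed": true
        passed = any(
            json.loads(meta.read_text()).get("passed", False)
            for meta in sample_dir.glob("turn_*/meta.json")
        )
        solved += passed
        print(f"{sample_dir.name}: {'passed' if passed else 'failed'}")
    if total:
        print(f"pass rate: {solved}/{total} = {solved / total:.1%}")

summarize("experiments/my_experiment")   # placeholder experiment id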

Supported Datasets

Dataset                  Status            Description
MBPP+                    Fully supported   Function-level Python problems
EvalPlus (HumanEval+)    Fully supported   Rigorous LLM4Code benchmarks
LiveCodeBench            Planned           Contamination-free live coding problems
SWE-Bench                Planned           Real-world GitHub issue resolution
Tau2Bench                Planned           Multi-turn agentic task evaluation
TerminalBench            Planned           Terminal-based coding tasks
SWE-CI                   Planned           CI-driven software engineering tasks

License

Apache License 2.0
