Otter

An agent code evaluation framework with native multi-turn feedback iteration.

Why Otter

Mainstream code benchmarks use snapshot-style evaluation — one input, one output. But real-world programming involves iterating based on compiler errors, test failures, and other feedback. This feedback-driven iteration is the core of programming ability.

Otter integrates evaluation feedback into the evaluation loop, letting agents work like real developers: write code → run → read errors → fix → run again, until the tests pass or the maximum number of turns is reached.

   ┌────────────────────────────┐
   ↓                            │
Proposer ───→ Executor ───→ Evaluator
                                │
                          Pass? │
                                ↓
                               End
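
Conceptually, one problem's loop can be sketched as below. This is an illustrative outline only: the propose/execute/evaluate callables and their return shapes are assumptions made for the sketch, not Otter's actual API (only the verdict shape mirrors the documented meta.json format).

# Illustrative sketch of the multi-turn loop -- component interfaces are hypothetical
def run_problem(propose, execute, evaluate, problem, max_turns):
    """One problem's loop: propose -> execute -> evaluate, repeat on failure."""
    feedback = None
    for turn in range(1, max_turns + 1):
        code = propose(problem, feedback)       # write or revise code using prior feedback
        result = execute(code)                  # run it in an isolated sandbox (e.g. Docker)
        verdict = evaluate(problem, result)     # check the run against the problem's tests
        if verdict["passed"]:                   # same shape as the per-turn meta.json verdict
            return turn                         # tests pass: stop early
        feedback = result                       # feed compiler errors / test failures forward
    return None                                 # max_turns exhausted without passing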

Quick Start

Prerequisites: Python >= 3.11, Docker

# Install
pip install -e .

# Configure
cp .env.example .env
# Edit .env with your API credentials

# Run evaluation
otter run

Configuration

All parameters are managed via .env files. The CLI only accepts --env to select a config file:

otter run                    # uses .env by default
otter run --env .env.local   # specify a config file

See Environment Variable Configuration for the full parameter reference.
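
As a purely illustrative sketch, a config file could hold entries along the following lines. The key names here are placeholders invented for this example; the authoritative names live in .env.example and the configuration reference.

# Hypothetical .env sketch -- key names are placeholders, not Otter's real variables
LLM_API_KEY=your-api-key                 # API credentials mentioned in the Quick Start
LLM_BASE_URL=https://api.example.com/v1  # placeholder endpoint
MAX_TURNS=5                              # maximum feedback iterations per problem (illustrative)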

Output Structure

Results are saved under experiments/ as a directory tree, with a full record for each turn of each problem:

experiments/{experiment_id}/
└── {task_id}#{sample_id}/
    ├── turn_1/
    │   ├── prop_input/    # Proposer input (created if proposer enabled)
    │   ├── prop_output/   # Proposer output (created if proposer enabled)
    │   ├── exec_input/    # Executor input (created if executor enabled)
    │   ├── exec_output/   # Executor output (created if executor enabled)
    │   ├── eval_input/    # Evaluator input (created if evaluator enabled)
    │   ├── eval_output/   # Evaluator output (created if evaluator enabled)
    │   └── meta.json     # Turn verdict {"passed": true/false}
    ├── turn_2/           # Turn 2 (if turn 1 failed and max_turns > 1)
    │   └── ...
    └── ...
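
Because every turn records its verdict in meta.json, results can be post-processed straight from this tree. A minimal sketch, assuming only the layout shown above (this script is not part of Otter, and the experiment id is a placeholder):

import json
from pathlib import Path

def summarize(experiment_dir):
    # Each sample directory is named {task_id}#{sample_id} and holds turn_1/, turn_2/, ...
    solved, total = 0, 0
    for sample_dir in sorted(Path(experiment_dir).iterdir()):
        if not sample_dir.is_dir():
            continue
        total += 1
        # A sample counts as solved if any turn's meta.json records "passed": true
        passed = any(
            json.loads(meta.read_text()).get("passed", False)
            for meta in sample_dir.glob("turn_*/meta.json")
        )
        solved += passed
        print(f"{sample_dir.name}: {'passed' if passed else 'failed'}")
    if total:
        print(f"pass rate: {solved}/{total} = {solved / total:.1%}")

summarize("experiments/my_experiment")   # placeholder experiment id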

Supported Datasets

Dataset                  Status            Description
MBPP+                    Fully supported   Function-level Python problems
EvalPlus (HumanEval+)    Fully supported   Rigorous LLM4Code benchmarks
LiveCodeBench            Planned           Contamination-free live coding problems
SWE-Bench                Planned           Real-world GitHub issue resolution
Tau2Bench                Planned           Multi-turn agentic task evaluation
TerminalBench            Planned           Terminal-based coding tasks
SWE-CI                   Planned           CI-driven software engineering tasks

License

Apache License 2.0
