A-EVO-Lab

A-Evo Lab (Agentic Evolution Laboratory) 🧬

The path to recursive self-improvement (RSI) is to let AI take over how humans build AI.

We study self-evolving agents under one thesis, AI-as-researcher: frontier agents and models play the researcher in the loop that builds better AI. Today humans build AI in three critical stages: pre-training → post-training → harness building. We are building an autonomous AI researcher for each stage, have reached SOTA results in the programs we've shipped, and develop everything on one shared stack, A-Evolve, so we can iterate fast.

🗺 The Map

Human stage of building AI	Our program	What the AI researcher does	Status
Harness building	AI-Harness	Evolves prompts / skills / memory / tools around a frozen model	✅ SOTA across benchmarks
↳ long-running deployment	AI-Harness · Adaptive	Sustains performance on open-ended task streams	✅ Leads every reported stream metric
Post-training	AI-Training	Designs data mixtures, schedules, HPs & ablations end-to-end	🔜 First public datapoint of Auto-post-training on 30B scale
Pre-training	AI-Pretraining	—	🧭 The open frontier

🛠 AI-Harness — replacing human harness engineering

With zero manual harness engineering, A-Evolve's reference algorithms push a single Claude Opus-4.6 base model to top-tier performance across diverse agentic benchmarks:

Benchmark	Standing	Result (gain)
🟢 MCP-Atlas	🥇 #1	Baseline → 79.4% (+3.4pp)
🔵 SWE-bench Verified	~#5	Baseline → 76.8% (+2.6pp)
🟣 Terminal-Bench 2.0	~#7	Baseline → 76.5% (+13.0pp)
🟡 SkillsBench	#2	Baseline → 34.9% (+15.2pp)
🟢 ARC-AGI	🥇 #2 Community Leaderboard	Baseline → 12.3% (+2.2pp)
🔵 OSWorld	—	Baseline → 69.6% (+3.9pp)
🟣 SWE-bench Lite	Evolved	63.7 → 67.0% (+3.3pp)
🟡 τ-bench	Evolved	72.7 → 77.0% (+4.3pp)
🟢 CL-Bench	Evolved	29.5 → 34.0% (+4.5pp)
🔵 WebArena-Infinity	Evolved	72.5 → 76.3% (+3.8pp)

Single Claude Opus-4.6 base model, evolved with A-Evolve's reference algorithms. 0 hours of human harness engineering. CL-Bench, SWE-bench Lite, τ-bench & WebArena-Infinity show before → after on the same base model. Data checked March 2026.
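Gains are reported in percentage points (pp), meaning the absolute difference between the evolved score and the baseline. For the benchmarks where only the evolved score and the gain are shown, the baseline can be recovered by subtraction. The sketch below does that arithmetic using the figures above; the helper names are ours, for illustration only.

```python
# (evolved score in %, gain in pp), copied from the benchmark results above.
REPORTED = {
    "MCP-Atlas": (79.4, 3.4),
    "SWE-bench Verified": (76.8, 2.6),
    "Terminal-Bench 2.0": (76.5, 13.0),
    "SkillsBench": (34.9, 15.2),
    "ARC-AGI": (12.3, 2.2),
    "OSWorld": (69.6, 3.9),
}


def implied_baseline(evolved: float, gain_pp: float) -> float:
    """Baseline score implied by an evolved score and an absolute gain in pp."""
    return round(evolved - gain_pp, 1)


def relative_gain(evolved: float, gain_pp: float) -> float:
    """Gain expressed as a percentage of the implied baseline (not in pp)."""
    return round(100.0 * gain_pp / (evolved - gain_pp), 1)


for name, (evolved, gain) in REPORTED.items():
    base = implied_baseline(evolved, gain)
    print(f"{name}: {base}% -> {evolved}% (+{gain}pp, {relative_gain(evolved, gain)}% relative)")
```

The distinction matters when comparing benchmarks: the same pp gain is a much larger relative improvement on a benchmark with a low baseline than on one with a high baseline.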

Key finding: evolver capability decouples from harness quality. A 9B model (Qwen3.5) writes harness updates as good as those from Claude Opus 4.6 (best-vs-worst evolver gap ≤ 3.1pp). The benefit is non-monotonic: mid-tier agents gain most, while weak agents fail to even load the harness. Implication: put your capability budget on the agent, not the evolver.
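One plausible reading of the "best-vs-worst evolver" gap is: for a fixed solver agent, take the score under the best evolver's harness minus the score under the worst evolver's harness. The sketch below computes that under this assumption. The function name, the data layout, and the numbers are all hypothetical placeholders and are not taken from the paper.

```python
from typing import Dict


def evolver_spread(scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """scores[evolver][solver] is the solver's score (%) with that evolver's harness.

    Returns, per solver, the best-evolver score minus the worst-evolver score in pp.
    """
    solvers = {s for per_solver in scores.values() for s in per_solver}
    spread = {}
    for solver in sorted(solvers):
        vals = [per_solver[solver] for per_solver in scores.values() if solver in per_solver]
        spread[solver] = round(max(vals) - min(vals), 1)
    return spread


# Placeholder numbers for illustration only; NOT results from the paper.
toy = {
    "small_evolver": {"mid_solver": 60.0, "strong_solver": 75.0},
    "large_evolver": {"mid_solver": 62.5, "strong_solver": 76.0},
}
print(evolver_spread(toy))
```

A small spread for every solver is what "evolver capability decouples from harness quality" would look like in such a grid.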

📄 Evolver-Solver-Bench — Harness Updating Is Not Harness Benefit. arXiv 2605.30621 · HF Daily

📄 Evo-Harness — Context-to-Harness Skill Compilation (online evolution: feedback grounding, abstraction level, solver–evolver alignment). Releasing soon.

↳ Adaptive — sustaining agents on long-running streams

Naive self-evolving agents peak early and then decline — a single dense harness overfits to early evidence. Adaptive Auto-Harness fixes this with a stateful multi-agent evolver, a harness tree with solve-time routing, and scoped human-steering hooks — leading every reported metric against five auto-harness baselines plus the human-designed OctoTools:

Stream	Domain	A-Evolve-Adaptive	Next best
PolyBench	Prediction markets	80.9% Accuracy	50.8%
CTF-Dojo	Security competitions	50.2% Pass	45.2%
FutureX	Event forecasting	49.5% Pass	47.5%
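To make "a harness tree with solve-time routing" concrete, here is a minimal sketch of the general idea: harness content lives in a tree of increasingly specific nodes, and at solve time a task is routed down the tree so that it only sees the guidance scoped to it. Every name here (`HarnessNode`, `route`, `compose_prompt`, the keyword matchers) is hypothetical; this is not the Adaptive Auto-Harness implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class HarnessNode:
    name: str
    instructions: str                      # guidance contributed by this node
    matches: Callable[[str], bool]         # does this node apply to the task?
    children: List["HarnessNode"] = field(default_factory=list)


def route(root: HarnessNode, task: str) -> List[HarnessNode]:
    """Descend from the root, following the first matching child at each level."""
    path = [root]
    node = root
    while True:
        nxt = next((c for c in node.children if c.matches(task)), None)
        if nxt is None:
            return path
        path.append(nxt)
        node = nxt


def compose_prompt(path: List[HarnessNode]) -> str:
    """Concatenate guidance from general (root) to specific (leaf)."""
    return "\n".join(n.instructions for n in path)


# Hypothetical tree: a general root plus two domain-scoped children.
root = HarnessNode("root", "Think step by step.", lambda task: True, [
    HarnessNode("forecasting", "State a probability.", lambda t: "forecast" in t.lower()),
    HarnessNode("security", "Enumerate the attack surface first.", lambda t: "ctf" in t.lower()),
])

path = route(root, "Solve this CTF challenge")
print([n.name for n in path])
print(compose_prompt(path))
```

Scoping updates to a subtree is one way to avoid the failure described above, where a single dense harness absorbs lessons from early tasks and applies them everywhere.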

📄 Adaptive Auto-Harness — Sustained Self-Improvement on Open-Ended Task Streams. arXiv 2606.01770

🧪 AI-Training — replacing human post-training

The same loop, carried all the way into model weights: an evolver autonomously runs end-to-end 30B post-training — designing data mixtures, training schedules, hyperparameter regimes, and ablation protocols — reaching parity with a human post-training team. To our knowledge, the first time an autonomous system has done so at this scale.

Four self-directed rounds on a production GPU cluster. The autonomously produced model placed 8th of ~4,000 on NVIDIA's Nemotron Reasoning Challenge (snapshot 6/1/2026) — one point behind the top human team.

The same autonomous system has since post-trained the 120B and 550B Nemotron models end-to-end — evidence the loop closes at that scale too. (No public human baseline exists there yet, so we report it as infrastructure evidence, not a competitiveness claim.)

Tech report: Tech Blog · arXiv 2606.20657.

🧭 AI-Pretraining — the open frontier

The largest and most expensive stage of building AI — and the one we have not automated yet. It is where this thesis goes next.

⚙️ One Shared Stack: A-Evolve

Every result above was developed on A-Evolve, our open-source infrastructure for self-improving agents — "the PyTorch for Agentic AI." It evolves any agent, in any domain, with any evolution algorithm, and is what makes fast iteration across all three programs possible.

```python
import agent_evolve as ae

# Point the evolver at an agent directory and a benchmark.
evolver = ae.Evolver(agent="./my_agent", benchmark="swe-verified")
# SOTA agent. 3 lines. 0 hours of manual harness engineering.
results = evolver.run(cycles=10)
```
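The snippet above does not show what happens inside a cycle. As a rough mental model only, the sketch below runs a greedy accept-if-better loop (plain hill climbing) over a toy "harness" made of skill names, with the agent itself left frozen. It does not use `agent_evolve`; the `evolve`, `make_proposer`, and `evaluate` functions and the skill names are hypothetical stand-ins, and A-Evolve's reference algorithms may work quite differently.

```python
import random
from typing import Callable, FrozenSet, List, Tuple

Harness = FrozenSet[str]  # toy harness: the set of skills enabled around a frozen agent


def evolve(
    harness: Harness,
    propose: Callable[[Harness], Harness],
    evaluate: Callable[[Harness], float],
    cycles: int,
) -> Tuple[Harness, List[float]]:
    """Greedy loop: keep a proposed harness only if it scores strictly higher."""
    best, best_score = harness, evaluate(harness)
    history = [best_score]
    for _ in range(cycles):
        candidate = propose(best)
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
        history.append(best_score)
    return best, history


POOL = ["run_tests", "read_traceback", "verbose_logging", "retry_blindly"]
USEFUL = frozenset({"run_tests", "read_traceback"})  # hypothetical ground truth


def evaluate(h: Harness) -> float:
    """Toy benchmark: reward useful skills, lightly penalise clutter."""
    return len(h & USEFUL) - 0.1 * len(h - USEFUL)


def make_proposer(pool: List[str], seed: int = 0) -> Callable[[Harness], Harness]:
    """Stand-in for an evolver model: toggle one randomly chosen skill."""
    rng = random.Random(seed)

    def propose(h: Harness) -> Harness:
        return h ^ frozenset({rng.choice(pool)})

    return propose


best, history = evolve(frozenset(), make_proposer(POOL), evaluate, cycles=10)
print(sorted(best), history[-1])
```

In a real system the proposer would be a model writing prompt, skill, memory, or tool updates, and the evaluator would be a benchmark run, but the accept-or-reject structure is the part this sketch is meant to show.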

Adopted & integrated by: OpenRLHF · DeepSpeed · SGLang · GEPA · AutoResearch

⭐ Star the repo → github.com/A-EVO-Lab/a-evolve

📢 News

06/11 New Tech Report, A-Evolve-Training: Autonomous Post-Training of a 30B Model (arXiv 2606.20657). We present an autonomous system that runs the post-training loop with no human in the loop, post-training a 30B Nemotron model across four rounds over multiple weeks. The autonomously produced model reaches a held-out score of 0.86 against the top human submission's 0.87 on the public NVIDIA Nemotron Reasoning Challenge leaderboard, placing 8th of ~4,000 at the time of writing. To the best of our knowledge, this is the first publicly reported autonomous post-training run at this scale; prior public autonomous-ML-research demonstrations sit at GPT-2-class (~124M) budgets. The same system also post-trains the 120B and 550B Nemotron models.
06/01 New Research Paper, Adaptive Auto-Harness: Sustained Self-Improvement for Agentic System Deployment on Open-Ended Task Streams (arXiv 2606.01770). We address the brittleness of traditional auto-harness systems when moving from fixed benchmarks to open-ended, shifting task streams. We introduce Adaptive Auto-Harness, a framework that significantly outperforms five existing auto-harness baselines across prediction-market, security-competition, and event-forecasting streams. Code and algorithms are available at A-Evolve.
05/30 New Paper, Harness Updating Is Not Harness Benefit (arXiv 2605.30621). 7 evolver models × 6 solver agents × 3 benchmarks, with counterintuitive answers on who produces good harness updates and who benefits. Code and algorithms are available at A-Evolve.
05/04 New Benchmark Results: A-Evolve on ARC-AGI-3, evolving a multi-agent system from 10% to 12%.
04/20 New Algorithm: GEPA, submitted by the GEPA team.
04/10 Integration: A-Evolve added to the Orch-Research Skills Library, alongside AutoResearch, OpenRLHF, DeepSpeed, SGLang.
04/07 New Agent: transplanted our Terminal-Bench 2.0 harness onto ClawCode, 67.8% → 72.9% (+5.1pp).
04/03 New Algorithm: Meta-Harness.
03/25 🚀 Open-sourced A-Evolve + 4 reference algorithms achieving SOTA (#1, ~#5, ~#7, #2) on MCP-Atlas, SWE-bench Verified, Terminal-Bench 2.0, SkillsBench.
02/17 📄 Position paper: Agentic Evolution is the Path to Evolving LLMs (arXiv 2602.00359).

We are evolving fast — support our research by leaving a ⭐ on A-Evolve.
