A-EVO-Lab

A-Evo Lab (Agentic Evolution Laboratory) 🧬

An AI researcher for every stage of building AI

The path to recursive self-improvement (RSI) is to let AI take over how humans build AI.

A-Evo Lab, led by Henry Lu, studies self-evolving agents under one thesis — AI-as-researcher: frontier agents and models play the researcher in the loop that builds better AI. Today humans build AI in three critical stages — pre-training → post-training → harness building. We are building an autonomous AI researcher for each, have reached SOTA results where we've shipped, and develop everything on one shared stack, A-Evolve, so we can iterate fast.


🗺 The Map

| Human stage of building AI | Our program | What the AI researcher does | Status |
|---|---|---|---|
| Harness building | AI-Harness | Evolves prompts / skills / memory / tools around a frozen model | ✅ SOTA across benchmarks |
| ↳ long-running deployment | AI-Harness · Adaptive | Sustains performance on open-ended task streams | ✅ Leads every reported stream metric |
| Post-training | AI-Training | Designs data mixtures, schedules, HPs & ablations end-to-end | 🔜 Human-team parity @ 30B, report in prep |
| Pre-training | AI-Pretraining | | 🧭 The open frontier |

🛠 AI-Harness — replacing human harness engineering

With zero manual harness engineering, A-Evolve's reference algorithms push a single Claude Opus 4.6 base model to top-tier performance across diverse agentic benchmarks:

| Benchmark | Rank / status | Score (gain) |
|---|---|---|
| 🟢 MCP-Atlas | 🥇 #1 | Baseline → 79.4% (+3.4pp) |
| 🔵 SWE-bench Verified | ~#5 | Baseline → 76.8% (+2.6pp) |
| 🟣 Terminal-Bench 2.0 | ~#7 | Baseline → 76.5% (+13.0pp) |
| 🟡 SkillsBench | #2 | Baseline → 34.9% (+15.2pp) |
| 🟢 ARC-AGI | 🥇 #2 Community Leaderboard | Baseline → 12.3% (+2.2pp) |
| 🔵 OSWorld | | Baseline → 69.6% (+3.9pp) |
| 🟣 SWE-bench Lite | Evolved | 63.7 → 67.0% (+3.3pp) |
| 🟡 τ-bench | Evolved | 72.7 → 77.0% (+4.3pp) |
| 🟢 CL-Bench | Evolved | 29.5 → 34.0% (+4.5pp) |
| 🔵 WebArena-Infinity | Evolved | 72.5 → 76.3% (+3.8pp) |

Single Claude Opus 4.6 base model, evolved with A-Evolve's reference algorithms. 0 hours of human harness engineering. CL-Bench, SWE-bench Lite, τ-bench & WebArena-Infinity show before → after on the same base model. Data checked March 2026.
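The gains above are plain percentage-point differences, so they can be checked with a few lines of arithmetic. The sketch below recomputes the four before → after gains and, for the cards that only say "Baseline", derives the baseline implied by the reported score and gain. The dictionary and function names are illustrative only; the implied baselines are simple subtraction, not separately reported figures.

```python
# Before/after scores (percent) copied from the four "Evolved" entries above.
before_after = {
    "SWE-bench Lite": (63.7, 67.0),
    "τ-bench": (72.7, 77.0),
    "CL-Bench": (29.5, 34.0),
    "WebArena-Infinity": (72.5, 76.3),
}

# Final score (percent) and reported gain (pp) for the "Baseline → score" entries.
score_and_gain = {
    "MCP-Atlas": (79.4, 3.4),
    "SWE-bench Verified": (76.8, 2.6),
    "Terminal-Bench 2.0": (76.5, 13.0),
    "SkillsBench": (34.9, 15.2),
    "ARC-AGI": (12.3, 2.2),
    "OSWorld": (69.6, 3.9),
}


def gain_pp(before: float, after: float) -> float:
    """Gain in percentage points, rounded to one decimal like the reported numbers."""
    return round(after - before, 1)


def implied_baseline(score: float, gain: float) -> float:
    """Baseline implied by a final score and its reported gain (simple subtraction)."""
    return round(score - gain, 1)


gains = {name: gain_pp(b, a) for name, (b, a) in before_after.items()}
baselines = {name: implied_baseline(s, g) for name, (s, g) in score_and_gain.items()}

for name, g in gains.items():
    print(f"{name}: +{g}pp")
for name, b in baselines.items():
    print(f"{name}: implied baseline {b}%")
```

The recomputed gains match the reported +3.3, +4.3, +4.5 and +3.8pp.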

Key finding: evolver capability decouples from harness quality. A 9B model (Qwen3.5) writes harness updates as good as those from Claude Opus 4.6 (best-vs-worst evolver gap ≤ 3.1pp). The benefit is non-monotonic: mid-tier agents gain the most, while weak agents fail to even load the harness. Implication: put your capability budget on the agent, not the evolver.
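To make the two quantities in this finding concrete, here is a minimal sketch of how a best-vs-worst evolver gap and a per-solver gain could be computed from an evolver × solver score grid. Every number below is a made-up placeholder, not a result from the paper, and the exact metric definitions used in the paper may differ; the names `scores`, `baseline`, `evolver_spread` and `solver_gain` are assumptions for illustration.

```python
from statistics import mean

# HYPOTHETICAL placeholder scores (percent), NOT taken from the paper.
# scores[evolver][solver] = solver score after applying that evolver's harness update.
scores = {
    "evolver_small": {"solver_weak": 20.0, "solver_mid": 55.0, "solver_strong": 80.0},
    "evolver_large": {"solver_weak": 20.4, "solver_mid": 57.0, "solver_strong": 81.0},
}

# HYPOTHETICAL scores of each solver without any evolved harness.
baseline = {"solver_weak": 20.0, "solver_mid": 45.0, "solver_strong": 78.0}


def evolver_spread(scores: dict) -> float:
    """Best-minus-worst evolver, comparing each evolver's mean score across solvers (pp)."""
    means = [mean(row.values()) for row in scores.values()]
    return round(max(means) - min(means), 1)


def solver_gain(scores: dict, baseline: dict) -> dict:
    """Per-solver gain over its baseline, averaged across evolvers (pp)."""
    return {
        solver: round(mean(row[solver] for row in scores.values()) - baseline[solver], 1)
        for solver in baseline
    }


spread = evolver_spread(scores)
gains = solver_gain(scores, baseline)
print(f"best-vs-worst evolver gap: {spread}pp")
print(f"gain per solver: {gains}")
```

In this toy grid the evolver gap is small while the gain varies strongly by solver, with the mid-tier solver gaining most. That is the shape of result the finding describes.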

Evolver capability barely matters — a 9B model matches Opus 4.6

📄 Evolver-Solver-Bench: Harness Updating Is Not Harness Benefit. arXiv 2605.30621 · HF Daily

📄 Evo-Harness: Context-to-Harness Skill Compilation (online evolution: feedback grounding, abstraction level, solver–evolver alignment). Releasing soon.

↳ Adaptive — sustaining agents on long-running streams

Naive self-evolving agents peak early and then decline — a single dense harness overfits to early evidence. Adaptive Auto-Harness fixes this with a stateful multi-agent evolver, a harness tree with solve-time routing, and scoped human-steering hooks — leading every reported metric against five auto-harness baselines plus the human-designed OctoTools:

| Stream | Domain | A-Evolve-Adaptive | Next best |
|---|---|---|---|
| PolyBench | Prediction markets | 80.9% Accuracy | 50.8% |
| CTF-Dojo | Security competitions | 50.2% Pass | 45.2% |
| FutureX | Event forecasting | 49.5% Pass | 47.5% |

Self-evolving agents peak early then decline; Adaptive sustains the gains

📄 Adaptive Auto-Harness: Sustained Self-Improvement on Open-Ended Task Streams. Releasing soon.
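One way to picture a harness tree with solve-time routing is a tree of increasingly specialised harnesses, where each incoming task descends to the most specific node that matches it. The sketch below is only an illustration of that general idea under simple assumptions (tag-based matching, first matching child wins). It is not the Adaptive Auto-Harness implementation, and `HarnessNode`, `route`, and the tags are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class HarnessNode:
    """A harness variant plus the task tags it specialises in (hypothetical structure)."""
    name: str
    tags: frozenset          # a task must carry all of these tags to be routed here
    prompt: str              # stand-in for the prompts / skills / tools of this harness
    children: list = field(default_factory=list)


def route(node: HarnessNode, task_tags: set) -> HarnessNode:
    """Descend to the deepest node whose tags are all present on the task."""
    for child in node.children:
        if child.tags <= task_tags:      # subset test: child applies to this task
            return route(child, task_tags)
    return node                          # no more specific child matches


# Build a small tree: a general root with two specialised branches.
root = HarnessNode("general", frozenset(), "General-purpose instructions.")
security = HarnessNode("security", frozenset({"security"}), "Security-specific skills.")
web_exploit = HarnessNode("web-exploit", frozenset({"security", "web"}), "Web exploitation skills.")
forecasting = HarnessNode("forecasting", frozenset({"forecast"}), "Forecasting skills.")
security.children.append(web_exploit)
root.children += [security, forecasting]

for tags in ({"security", "web"}, {"security"}, {"math"}):
    print(sorted(tags), "->", route(root, tags).name)
```

Because unmatched tasks fall back to a broader parent, evidence from one kind of task only needs to change the branch that handles it. This is one plausible reading of why a tree would overfit less than a single dense harness.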


🧪 AI-Training — replacing human post-training

The same loop, carried all the way into model weights: an evolver autonomously runs end-to-end 30B post-training (designing data mixtures, training schedules, hyperparameter regimes, and ablation protocols) and reaches parity with a human post-training team. To our knowledge, this is the first time an autonomous system has done so at this scale.

Tech report in preparation — full results and methodology on release.


🧭 AI-Pretraining — the open frontier

The largest and most expensive stage of building AI — and the one we have not automated yet. It is where this thesis goes next.


⚙️ One Shared Stack: A-Evolve

Every result above was developed on A-Evolve, our open-source infrastructure for self-improving agents — "the PyTorch for Agentic AI." It evolves any agent, in any domain, with any evolution algorithm, and is what makes fast iteration across all three programs possible.

import agent_evolve as ae

evolver = ae.Evolver(agent="./my_agent", benchmark="swe-verified")
results = evolver.run(cycles=10)        # SOTA agent. 3 lines. 0 hours of manual harness engineering.
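For readers wondering what a run of several evolution cycles might involve conceptually, the following is a generic propose-evaluate-select loop written from scratch. It does not use or describe the `agent_evolve` API: `evolve`, `toy_evaluate` and `toy_mutate` are hypothetical stand-ins, and the "harness" here is just a dict with one numeric knob, so the loop stays runnable without any agent or benchmark.

```python
import random


def evolve(harness, evaluate, mutate, cycles, seed=0):
    """Greedy evolution: each cycle proposes one change and keeps it only if the score improves."""
    rng = random.Random(seed)
    best, best_score = harness, evaluate(harness)
    history = [best_score]                    # best score seen so far, after each cycle
    for _ in range(cycles):
        candidate = mutate(best, rng)         # propose a modified harness
        score = evaluate(candidate)           # stand-in for running the benchmark
        if score > best_score:                # select: accept only strict improvements
            best, best_score = candidate, score
        history.append(best_score)
    return best, history


def toy_evaluate(harness):
    """Toy score in [0, 1] that peaks when the knob equals 0.7 (arbitrary choice)."""
    return 1.0 - abs(harness["knob"] - 0.7)


def toy_mutate(harness, rng):
    """Nudge the knob by a small random amount, clamped to [0, 1]."""
    nudged = harness["knob"] + rng.uniform(-0.2, 0.2)
    return {"knob": min(1.0, max(0.0, nudged))}


best, history = evolve({"knob": 0.1}, toy_evaluate, toy_mutate, cycles=10)
print("best harness:", best)
print("best score per cycle:", [round(s, 3) for s in history])
```

A real evolver would replace the toy functions with benchmark runs and model-written harness edits; the accept-if-better rule is the simplest possible selection strategy and is used here only for illustration.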

Adopted & integrated by: OpenRLHF · DeepSpeed · SGLang · GEPA · AutoResearch

⭐ Star the repo → github.com/A-EVO-Lab/a-evolve

A-Evolve framework


📫 Contact

Building in this direction, or want to collaborate? Reach out — X / Twitter · LinkedIn.


📢 News

  • 05/30 New Paper: Harness Updating Is Not Harness Benefit (arXiv 2605.30621). 7 evolver models × 6 solver agents × 3 benchmarks: counterintuitive answers on who produces good harness updates and who benefits.
  • 05/04 New Benchmark Results — A-Evolve results on ARC-AGI-3, evolving a multi-agent system from 10% → 12%.
  • 04/20 New Algorithm: GEPA, submitted by the GEPA team.
  • 04/10 Integration — into Orch-Research Skills Library, alongside AutoResearch, OpenRLHF, DeepSpeed, SGLang.
  • 04/07 New Agent — transplanted our Terminal-Bench 2.0 harness onto ClawCode: 67.8% → 72.9% (+5.1pp).
  • 04/03 New Algorithm: Meta-Harness.
  • 03/25 🚀 Open-sourced A-Evolve + 4 reference algorithms achieving SOTA (#1, ~#5, ~#7, #2) on MCP-Atlas, SWE-bench Verified, Terminal-Bench 2.0, SkillsBench.
  • 02/17 📄 Position paper: Agentic Evolution is the Path to Evolving LLMs (arXiv 2602.00359).

We are evolving fast — support our research by leaving a ⭐ on A-Evolve.

LinkedIn | Twitter/X

Pinned

  1. a-evolve (Public)

    The official repository of "Position: Agentic Evolution is the Path to Evolving LLMs".

    Python 587 74

  2. CrowdResearch (Public)

    Python 6

  3. CrowdResearch-demo-logs (Public)
