AgentV

CLI-first AI agent evaluation. No server. No signup. No overhead.

AgentV evaluates your agents locally with multi-objective scoring (correctness, latency, cost, safety) from YAML specifications. Deterministic code judges + customizable LLM judges, all version-controlled in Git.

Installation

1. Install:

npm install -g agentv

2. Initialize your workspace:

agentv init

3. Configure environment variables:

  • The init command creates a .env.example file in your project root
  • Copy .env.example to .env and fill in your API keys, endpoints, and other configuration values
  • Update the environment variable names in .agentv/targets.yaml to match those defined in your .env file
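
For example, a minimal .env for an Azure OpenAI target might look like this (the variable names match the targets example later in this README; replace the placeholder values with your own):

AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com
AZURE_OPENAI_API_KEY=<your-api-key>
AZURE_DEPLOYMENT_NAME=<your-deployment-name>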

4. Create an eval (./evals/example.yaml):

description: Math problem solving evaluation
execution:
  target: default

tests:
  - id: addition
    criteria: Correctly calculates 15 + 27 = 42
    input: What is 15 + 27?
    expected_output: "42"
    assert:
      - name: math_check
        type: code_judge
        script: ./validators/check_math.py

5. Run the eval:

agentv eval ./evals/example.yaml

Results appear in .agentv/results/eval_<timestamp>.jsonl with scores, reasoning, and execution traces.
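
Each line in the results file is one JSON object per test case. The exact schema may differ by version, but conceptually a line carries something along these lines (field names here are illustrative):

{"test_id": "addition", "score": 1.0, "verdict": "pass", "reasoning": "Answer contains correct value (42)", "latency_ms": 1840}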

Learn more in the examples/ directory. For a detailed comparison with other frameworks, see docs/COMPARISON.md.

Why AgentV?

Feature          AgentV               LangWatch                LangSmith                LangFuse
Setup            npm install          Cloud account + API key  Cloud account + API key  Cloud account + API key
Server           None (local)         Managed cloud            Managed cloud            Managed cloud
Privacy          All local            Cloud-hosted             Cloud-hosted             Cloud-hosted
CLI-first        ✓                    Limited                  Limited
CI/CD ready      ✓                    Requires API calls       Requires API calls       Requires API calls
Version control  ✓ (YAML in Git)
Evaluators       Code + LLM + Custom  LLM only                 LLM + Code               LLM only

Best for: Developers who want evaluation in their workflow, not a separate dashboard. Teams prioritizing privacy and reproducibility.

Features

  • Multi-objective scoring: Correctness, latency, cost, safety in one run
  • Multiple evaluator types: Code validators, LLM judges, custom Python/TypeScript
  • Built-in targets: VS Code Copilot, Codex CLI, Pi Coding Agent, Azure OpenAI, local CLI agents
  • Structured evaluation: Rubric-based grading with weights and requirements
  • Batch evaluation: Run hundreds of test cases in parallel
  • Export: JSON, JSONL, YAML formats
  • Compare results: Compute deltas between evaluation runs for A/B testing

Development

Contributing to AgentV? Clone and set up the repository:

git clone https://github.com/EntityProcess/agentv.git
cd agentv

# Install Bun if you don't have it
curl -fsSL https://bun.sh/install | bash

# Install dependencies and build
bun install && bun run build

# Run tests
bun test

See AGENTS.md for development guidelines and design principles.

Releasing

Stable release:

bun run release          # patch bump
bun run release minor
bun run release major
bun run publish          # publish to npm `latest`

Prerelease (next) channel:

bun run release:next         # bump/increment `-next.N`
bun run release:next major   # start new major prerelease line
bun run publish:next         # publish to npm `next`

Core Concepts

Evaluation files (.yaml or .jsonl) define test cases with expected outcomes. Targets specify which agent/provider to evaluate. Judges (code or LLM) score results. Results are written as JSONL/YAML for analysis and comparison.

JSONL Format Support

For large-scale evaluations, AgentV supports JSONL (JSON Lines) format as an alternative to YAML:

{"id": "test-1", "criteria": "Calculates correctly", "input": "What is 2+2?"}
{"id": "test-2", "criteria": "Provides explanation", "input": "Explain variables"}

Optional sidecar YAML metadata file (dataset.eval.yaml alongside dataset.jsonl):

description: Math evaluation dataset
dataset: math-tests
execution:
  target: azure_base
evaluator: llm_judge

Benefits: Streaming-friendly, Git-friendly diffs, programmatic generation, industry standard (DeepEval, LangWatch, Hugging Face).
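
Because each line is an independent JSON object, datasets are easy to generate from code. A minimal Python sketch producing the fields shown above (the output path is illustrative):

import json

# Write simple addition test cases, one JSON object per line.
with open("evals/math.jsonl", "w") as f:
    for a, b in [(2, 2), (15, 27), (9, 16)]:
        case = {
            "id": f"add-{a}-{b}",
            "criteria": f"Correctly calculates {a} + {b} = {a + b}",
            "input": f"What is {a} + {b}?",
        }
        f.write(json.dumps(case) + "\n")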

Usage

Running Evaluations

# Validate evals
agentv validate evals/my-eval.yaml

# Run an eval with default target (from eval file or targets.yaml)
agentv eval evals/my-eval.yaml

# Override target
agentv eval --target azure_base evals/**/*.yaml

# Run specific test
agentv eval --test-id case-123 evals/my-eval.yaml

# Dry-run with mock provider
agentv eval --dry-run evals/my-eval.yaml

See agentv eval --help for all options: workers, timeouts, output formats, trace dumping, and more.
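
Because runs are local and need no server, wiring AgentV into CI is a matter of installing and invoking it. A minimal GitHub Actions sketch using only the commands documented above (the dry run uses the mock provider, so no API keys are required):

steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-node@v4
    with:
      node-version: 20
  - run: npm install -g agentv
  - run: agentv validate evals/my-eval.yaml
  - run: agentv eval --dry-run evals/my-eval.yaml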

Create Custom Evaluators

Write code judges in Python or TypeScript:

# validators/check_answer.py
import json, sys
data = json.load(sys.stdin)
answer = data.get("answer", "")

hits = []
misses = []

if "42" in answer:
    hits.append("Answer contains correct value (42)")
else:
    misses.append("Answer does not contain expected value (42)")

score = 1.0 if hits else 0.0

print(json.dumps({
    "score": score,
    "hits": hits,
    "misses": misses,
    "reasoning": f"Passed {len(hits)} check(s)"
}))
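
To smoke-test a judge locally, pipe a payload to it on stdin; the fields follow the contract described under Code Judges below (the answer text here is illustrative):

echo '{"question": "What is 15 + 27?", "criteria": "Correctly calculates 15 + 27 = 42", "answer": "15 + 27 = 42"}' | python validators/check_answer.py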

Reference evaluators in your eval file:

assert:
  - name: my_validator
    type: code_judge
    script: ./validators/check_answer.py

For complete templates, examples, and evaluator patterns, see: custom-evaluators

Compare Evaluation Results

Run two evaluations and compare them:

agentv eval evals/my-eval.yaml --out before.jsonl
# ... make changes to your agent ...
agentv eval evals/my-eval.yaml --out after.jsonl
agentv compare before.jsonl after.jsonl --threshold 0.1

Output shows wins, losses, ties, and mean delta to identify improvements.

Targets Configuration

Define execution targets in .agentv/targets.yaml to decouple evals from providers:

targets:
  - name: azure_base
    provider: azure
    endpoint: ${{ AZURE_OPENAI_ENDPOINT }}
    api_key: ${{ AZURE_OPENAI_API_KEY }}
    model: ${{ AZURE_DEPLOYMENT_NAME }}

  - name: vscode_dev
    provider: vscode
    workspace_template: ${{ WORKSPACE_PATH }}
    judge_target: azure_base

  - name: local_agent
    provider: cli
    command_template: 'python agent.py --prompt {PROMPT}'
    judge_target: azure_base

Supports: azure, anthropic, gemini, codex, copilot, pi-coding-agent, claude, vscode, vscode-insiders, cli, and mock.

Use ${{ VARIABLE_NAME }} syntax to reference your .env file. See .agentv/targets.yaml after agentv init for detailed examples and all provider-specific fields.

Evaluation Features

Code Judges

Write validators in any language (Python, TypeScript, Node, etc.):

# Input: stdin JSON with question, criteria, answer
# Output: stdout JSON with score (0-1), hits, misses, reasoning

For complete examples and patterns, see: custom-evaluators

LLM Judges

Create markdown judge files with evaluation criteria and scoring guidelines:

assert:
  - name: semantic_check
    type: llm_judge
    prompt: ./judges/correctness.md

Your judge prompt file defines criteria and scoring guidelines.
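
An illustrative sketch of what ./judges/correctness.md might contain (the layout is an example, not a prescribed schema):

# Correctness Judge

Score how well the answer satisfies the stated criteria.

Scoring guidelines:
- 1.0: factually correct and complete
- 0.5: partially correct or missing key details
- 0.0: incorrect or off-topic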

Rubric-Based Evaluation

Define structured criteria directly in your test:

tests:
  - id: quicksort-explain
    criteria: Explain how quicksort works
    input: Explain quicksort algorithm
    assert:
      - type: rubrics
        criteria:
          - Mentions divide-and-conquer approach
          - Explains partition step
          - States time complexity

Scoring: (satisfied weights) / (total weights) → verdicts: pass (≥ 0.8), borderline (≥ 0.6), fail (< 0.6)
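
For example, if three criteria carry weights 2, 1, and 1 and the answer satisfies only the first two, the score is (2 + 1) / (2 + 1 + 1) = 0.75, a borderline verdict (≥ 0.6 but < 0.8). The weights here are illustrative; the criteria in the example above are unweighted.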

Auto-generate rubrics from expected outcomes:

agentv generate rubrics evals/my-eval.yaml

See rubric evaluator for detailed patterns.

Advanced Configuration

Retry Behavior

Configure automatic retry with exponential backoff:

targets:
  - name: azure_base
    provider: azure
    max_retries: 5
    retry_initial_delay_ms: 2000
    retry_max_delay_ms: 120000
    retry_backoff_factor: 2
    retry_status_codes: [500, 408, 429, 502, 503, 504]

Automatically retries on rate limits, transient 5xx errors, and network failures with jitter.
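
The resulting delay schedule follows the standard exponential-backoff-with-jitter pattern. A Python sketch of the semantics implied by the settings above (the jitter range shown is an assumption, not AgentV's exact implementation):

import random

# Delay schedule implied by the retry settings above:
# delay_n = min(initial * factor**n, max), randomized by jitter.
def retry_delays(max_retries=5, initial_ms=2000, factor=2, max_ms=120_000):
    for attempt in range(max_retries):
        base = min(initial_ms * factor**attempt, max_ms)
        yield base * random.uniform(0.5, 1.0)  # jitter range is an assumption

for delay in retry_delays():
    print(f"retry after ~{delay:.0f} ms")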

Documentation & Learning

Getting Started:

  • Run agentv init to set up your first evaluation workspace
  • Check examples/README.md for demos (math, code generation, tool use)
  • AI agents: ask Claude Code to run /agentv-eval-builder to create and iterate on evals

Reference:

  • Monorepo structure: packages/core/ (engine), packages/eval/ (evaluation logic), apps/cli/ (commands)

Contributing

See AGENTS.md for development guidelines, design principles, and quality assurance workflow.

License

MIT License - see LICENSE for details.
