# Batch CLI Evaluation
Batch CLI evaluation handles tools that process multiple inputs at once — bulk classifiers, screening engines, or any runner that reads all tests and outputs results in one pass.
## Overview

Use batch CLI evaluation when:

- An external tool processes multiple inputs in a single invocation (e.g., AML screening, bulk classification)
- The runner reads the eval YAML directly to extract all tests
- Output is JSONL with records keyed by test `id`
- Each test has its own evaluator to validate its corresponding output record
## Execution Flow

1. AgentV invokes the batch runner once, passing `--eval <yaml-path>` and `--output <jsonl-path>`
2. The batch runner reads the eval YAML, extracts all tests, processes them, and writes JSONL output keyed by `id`
3. AgentV parses the JSONL and routes each record to its matching test by `id`
4. Per-test evaluators validate the output for each test independently
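The runner side of this flow can be sketched as follows. This is a minimal illustration, not AgentV code: the `Test` shape, the `parseArgs` helper, and the hard-coded `CLEAR` decision are assumptions for the sketch, and YAML parsing plus file I/O are omitted for brevity.

```ts
// Illustrative test shape; a real eval YAML carries more fields.
type Test = { id: string; input: unknown };

// Extract the --eval and --output values from argv (hypothetical helper).
function parseArgs(argv: string[]): { evalPath: string; outputPath: string } {
  const get = (flag: string): string => {
    const i = argv.indexOf(flag);
    if (i === -1 || i + 1 >= argv.length) throw new Error(`missing ${flag}`);
    return argv[i + 1];
  };
  return { evalPath: get('--eval'), outputPath: get('--output') };
}

// Produce one JSONL line per test, keyed by id. The decision here is
// a placeholder; a real runner would apply its own logic per test.
function runBatch(tests: Test[]): string {
  return tests
    .map((t) => JSON.stringify({ id: t.id, text: JSON.stringify({ decision: 'CLEAR' }) }))
    .join('\n');
}
```

A real runner would read the file at `evalPath`, parse its tests, and write the string returned by `runBatch` to `outputPath`.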
## Eval File Structure

```yaml
description: Batch CLI demo using structured input
execution:
  target: batch_cli

tests:
  - id: case-001
    criteria: |-
      Batch runner returns JSON with decision=CLEAR.
    expected_output:
      - role: assistant
        content:
          decision: CLEAR
    input:
      - role: system
        content: You are a batch processor.
      - role: user
        content:
          request:
            type: screening_check
            jurisdiction: AU
          row:
            id: case-001
            name: Example A
            amount: 5000
    assert:
      - name: decision-check
        type: code_judge
        script: bun run ./scripts/check-output.ts
        cwd: .

  - id: case-002
    criteria: |-
      Batch runner returns JSON with decision=REVIEW.
    expected_output:
      - role: assistant
        content:
          decision: REVIEW
    input:
      - role: system
        content: You are a batch processor.
      - role: user
        content:
          request:
            type: screening_check
            jurisdiction: AU
          row:
            id: case-002
            name: Example B
            amount: 25000
    assert:
      - name: decision-check
        type: code_judge
        script: bun run ./scripts/check-output.ts
        cwd: .
```

## Batch Runner Contract
The batch runner reads the eval YAML directly and processes all tests in one invocation.
The runner receives the eval file path via `--eval` and an output path via `--output`:

```sh
bun run batch-runner.ts --eval ./my-eval.yaml --output ./results.jsonl
```

## Output
JSONL where each line is a JSON object with an `id` matching a test:

```jsonl
{"id": "case-001", "text": "{\"decision\": \"CLEAR\", ...}"}
{"id": "case-002", "text": "{\"decision\": \"REVIEW\", ...}"}
```

The `id` field must match the test `id` for AgentV to route output to the correct evaluator.
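Conceptually, this routing step amounts to indexing records by `id`. The sketch below illustrates the idea (it is not AgentV's actual implementation; the `BatchRecord` type is assumed from the record shape shown above):

```ts
type BatchRecord = { id: string; text: string };

// Index JSONL records by test id so each per-test evaluator can
// find the output record that belongs to it.
function routeRecords(jsonl: string): Map<string, BatchRecord> {
  const byId = new Map<string, BatchRecord>();
  for (const line of jsonl.split('\n')) {
    if (!line.trim()) continue; // tolerate trailing blank lines
    const record = JSON.parse(line) as BatchRecord;
    byId.set(record.id, record);
  }
  return byId;
}
```

A record whose `id` matches no test would simply never be looked up, which is why unique, matching IDs matter.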
## Output with Tool Trajectory

To enable `tool_trajectory` evaluation, include `output` with `tool_calls`:

```json
{
  "id": "case-001",
  "text": "{\"decision\": \"CLEAR\", ...}",
  "output": [
    {
      "role": "assistant",
      "tool_calls": [
        {
          "tool": "screening_check",
          "input": { "origin_country": "NZ", "amount": 5000 },
          "output": { "decision": "CLEAR", "reasons": [] }
        }
      ]
    },
    {
      "role": "assistant",
      "content": { "decision": "CLEAR" }
    }
  ]
}
```

AgentV extracts tool calls directly from `output[].tool_calls[]` for `tool_trajectory` evaluators.
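That extraction boils down to flattening the `tool_calls` arrays across messages. A minimal sketch, with types assumed from the record shape above:

```ts
type ToolCall = { tool: string; input: unknown; output: unknown };
type OutputMessage = { role: string; content?: unknown; tool_calls?: ToolCall[] };

// Collect every tool call across all output messages, preserving order.
function extractToolCalls(output: OutputMessage[]): ToolCall[] {
  return output.flatMap((msg) => msg.tool_calls ?? []);
}
```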
## Evaluator Implementation

Each test has its own evaluator that validates the batch runner output. The evaluator receives the standard `code_judge` input via stdin.

Input (stdin):

```json
{
  "answer": "{\"id\":\"case-001\",\"decision\":\"CLEAR\",...}",
  "expected_output": [{ "role": "assistant", "content": { "decision": "CLEAR" } }],
  "input": [...]
}
```

Output (stdout):

```json
{
  "score": 1.0,
  "hits": ["decision matches: CLEAR"],
  "misses": [],
  "reasoning": "Batch runner decision matches expected."
}
```

## Example Evaluator
Section titled “Example Evaluator”import fs from 'node:fs';
type EvalInput = { answer?: string; expected_output?: Array<{ role: string; content: unknown }>;};
function main() { const stdin = fs.readFileSync(0, 'utf8'); const input = JSON.parse(stdin) as EvalInput;
const expectedDecision = findExpectedDecision(input.expected_output);
let candidateDecision: string | undefined; try { const parsed = JSON.parse(input.answer ?? ''); candidateDecision = parsed.decision; } catch { candidateDecision = undefined; }
const hits: string[] = []; const misses: string[] = [];
if (expectedDecision === candidateDecision) { hits.push(`decision matches: ${expectedDecision}`); } else { misses.push(`mismatch: expected=${expectedDecision} actual=${candidateDecision}`); }
const score = misses.length === 0 ? 1 : 0;
process.stdout.write(JSON.stringify({ score, hits, misses, reasoning: score === 1 ? 'Batch runner output matches expected.' : 'Batch runner output did not match expected.', }));}
function findExpectedDecision(messages?: Array<{ role: string; content: unknown }>) { if (!messages) return undefined; for (const msg of messages) { if (typeof msg.content === 'object' && msg.content !== null) { return (msg.content as Record<string, unknown>).decision as string; } } return undefined;}
main();Structured Content
Use structured objects in `expected_output` to define expected output fields for easy validation:

```yaml
expected_output:
  - role: assistant
    content:
      decision: CLEAR
      confidence: high
      reasons: []
```

The evaluator extracts these fields and compares them against the parsed candidate output.
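One way to do that comparison is to treat the expected object as a set of required fields and report a miss for each field that differs. A sketch: `fieldsMatch` is a hypothetical helper, and JSON-based deep equality is a simplification that ignores key ordering inside nested objects.

```ts
// Compare each expected field against the candidate output.
// Extra candidate fields are ignored; missing or differing
// expected fields become misses.
function fieldsMatch(
  expected: Record<string, unknown>,
  candidate: Record<string, unknown>,
): string[] {
  const misses: string[] = [];
  for (const [key, value] of Object.entries(expected)) {
    if (JSON.stringify(candidate[key]) !== JSON.stringify(value)) {
      misses.push(`mismatch on ${key}: expected=${JSON.stringify(value)}`);
    }
  }
  return misses;
}
```

The returned list slots directly into the `misses` array of the evaluator output, with an empty list meaning a score of 1.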
## Target Configuration

Configure the batch CLI provider in your targets file or eval file:

```yaml
# In agentv-targets.yaml or eval file
targets:
  batch_cli:
    provider: cli
    commandTemplate: bun run ./scripts/batch-runner.ts --eval {EVAL_FILE} --output {OUTPUT_FILE}
    provider_batching: true
```

Key settings:
| Setting | Description |
|---|---|
| `provider: cli` | Use the CLI provider |
| `provider_batching: true` | Run once for all tests instead of per-test |
| `{EVAL_FILE}` | Placeholder replaced with the eval file path |
| `{OUTPUT_FILE}` | Placeholder replaced with the JSONL output path |
## Best Practices

- Use unique test IDs: the batch runner and AgentV use `id` to route outputs to the correct evaluator
- Structured input: put structured data in `user.content` for the runner to extract
- Structured expected_output: define expected output as objects for easy comparison
- Deterministic runners: batch runners should produce consistent output for reliable testing
- Healthcheck support: add a `--healthcheck` flag for runner validation:

```ts
if (args.includes('--healthcheck')) {
  console.log('batch-runner: healthy');
  return;
}
```