
Batch CLI Evaluation

Batch CLI evaluation handles tools that process multiple inputs at once — bulk classifiers, screening engines, or any runner that reads all tests and outputs results in one pass.

Use batch CLI evaluation when:

  • An external tool processes multiple inputs in a single invocation (e.g., AML screening, bulk classification)
  • The runner reads the eval YAML directly to extract all tests
  • Output is JSONL with records keyed by test id
  • Each test has its own evaluator to validate its corresponding output record

How it works:

  1. AgentV invokes the batch runner once, passing --eval <yaml-path> and --output <jsonl-path>
  2. Batch runner reads the eval YAML, extracts all tests, processes them, and writes JSONL output keyed by id
  3. AgentV parses the JSONL and routes each record to its matching test by id
  4. Per-test evaluators validate the output for each test independently

An example eval file:

description: Batch CLI demo using structured input
execution:
  target: batch_cli
tests:
  - id: case-001
    criteria: |-
      Batch runner returns JSON with decision=CLEAR.
    expected_output:
      - role: assistant
        content:
          decision: CLEAR
    input:
      - role: system
        content: You are a batch processor.
      - role: user
        content:
          request:
            type: screening_check
            jurisdiction: AU
            row:
              id: case-001
              name: Example A
              amount: 5000
    assert:
      - name: decision-check
        type: code_judge
        script: bun run ./scripts/check-output.ts
        cwd: .
  - id: case-002
    criteria: |-
      Batch runner returns JSON with decision=REVIEW.
    expected_output:
      - role: assistant
        content:
          decision: REVIEW
    input:
      - role: system
        content: You are a batch processor.
      - role: user
        content:
          request:
            type: screening_check
            jurisdiction: AU
            row:
              id: case-002
              name: Example B
              amount: 25000
    assert:
      - name: decision-check
        type: code_judge
        script: bun run ./scripts/check-output.ts
        cwd: .

The batch runner reads the eval YAML directly and processes all tests in one invocation.

The runner receives the eval file path via --eval and an output path via --output:

bun run batch-runner.ts --eval ./my-eval.yaml --output ./results.jsonl
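
A minimal batch runner might look like the sketch below. It assumes the yaml npm package for parsing the eval file and uses a placeholder decide() helper where the real screening or classification logic would go; adapt the field access to your own input schema.

// scripts/batch-runner.ts -- illustrative sketch, not a canonical implementation
import fs from 'node:fs';
import { parse } from 'yaml';

type Message = { role: string; content: unknown };
type Test = { id: string; input?: Message[] };

// Placeholder decision logic; a real runner would call its screening engine here.
function decide(row: { amount?: number }): string {
  return (row.amount ?? 0) >= 10000 ? 'REVIEW' : 'CLEAR';
}

function main() {
  const args = process.argv.slice(2);
  const evalPath = args[args.indexOf('--eval') + 1];
  const outputPath = args[args.indexOf('--output') + 1];

  const evalFile = parse(fs.readFileSync(evalPath, 'utf8')) as { tests: Test[] };

  // One JSONL line per test, keyed by the test id so AgentV can route each record.
  const lines = evalFile.tests.map((test) => {
    const userMsg = test.input?.find((m) => m.role === 'user');
    const request = (userMsg?.content as { request?: { row?: { amount?: number } } })?.request;
    const decision = decide(request?.row ?? {});
    return JSON.stringify({ id: test.id, text: JSON.stringify({ decision }) });
  });

  fs.writeFileSync(outputPath, lines.join('\n') + '\n');
}

main();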

The runner writes JSONL output, where each line is a JSON object with an id matching a test:

{"id": "case-001", "text": "{\"decision\": \"CLEAR\", ...}"}
{"id": "case-002", "text": "{\"decision\": \"REVIEW\", ...}"}

The id field must match the test id for AgentV to route output to the correct evaluator.

To enable tool_trajectory evaluation, include output with tool_calls:

{
  "id": "case-001",
  "text": "{\"decision\": \"CLEAR\", ...}",
  "output": [
    {
      "role": "assistant",
      "tool_calls": [
        {
          "tool": "screening_check",
          "input": { "origin_country": "NZ", "amount": 5000 },
          "output": { "decision": "CLEAR", "reasons": [] }
        }
      ]
    },
    {
      "role": "assistant",
      "content": { "decision": "CLEAR" }
    }
  ]
}

AgentV extracts tool calls directly from output[].tool_calls[] for tool_trajectory evaluators.
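
If the runner records its own tool calls, it can attach them when building each JSONL record. A small sketch of that record construction, following the format above (the screening_check call and the buildRecord helper are illustrative):

// Sketch: build a JSONL record carrying both the final text and the tool trajectory.
type ToolCall = { tool: string; input: unknown; output: unknown };

function buildRecord(id: string, decision: string, toolCalls: ToolCall[]): string {
  return JSON.stringify({
    id,
    text: JSON.stringify({ decision }),
    output: [
      { role: 'assistant', tool_calls: toolCalls },
      { role: 'assistant', content: { decision } },
    ],
  });
}

// Example for case-001:
const line = buildRecord('case-001', 'CLEAR', [
  {
    tool: 'screening_check',
    input: { origin_country: 'NZ', amount: 5000 },
    output: { decision: 'CLEAR', reasons: [] },
  },
]);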

Each test has its own evaluator that validates the batch runner output. The evaluator receives the standard code_judge input via stdin.

Input (stdin):

{
  "answer": "{\"id\":\"case-001\",\"decision\":\"CLEAR\",...}",
  "expected_output": [{ "role": "assistant", "content": { "decision": "CLEAR" } }],
  "input": [...]
}

Output (stdout):

{
  "score": 1.0,
  "hits": ["decision matches: CLEAR"],
  "misses": [],
  "reasoning": "Batch runner decision matches expected."
}

An example evaluator script (the ./scripts/check-output.ts referenced in the eval file above):

import fs from 'node:fs';

type EvalInput = {
  answer?: string;
  expected_output?: Array<{ role: string; content: unknown }>;
};

function main() {
  // code_judge evaluators receive their input as JSON on stdin.
  const stdin = fs.readFileSync(0, 'utf8');
  const input = JSON.parse(stdin) as EvalInput;

  const expectedDecision = findExpectedDecision(input.expected_output);

  // The candidate answer is the JSON string written by the batch runner.
  let candidateDecision: string | undefined;
  try {
    const parsed = JSON.parse(input.answer ?? '');
    candidateDecision = parsed.decision;
  } catch {
    candidateDecision = undefined;
  }

  const hits: string[] = [];
  const misses: string[] = [];
  if (expectedDecision === candidateDecision) {
    hits.push(`decision matches: ${expectedDecision}`);
  } else {
    misses.push(`mismatch: expected=${expectedDecision} actual=${candidateDecision}`);
  }

  const score = misses.length === 0 ? 1 : 0;
  process.stdout.write(JSON.stringify({
    score,
    hits,
    misses,
    reasoning: score === 1
      ? 'Batch runner output matches expected.'
      : 'Batch runner output did not match expected.',
  }));
}

// Pull the expected decision out of the first structured assistant message.
function findExpectedDecision(messages?: Array<{ role: string; content: unknown }>) {
  if (!messages) return undefined;
  for (const msg of messages) {
    if (typeof msg.content === 'object' && msg.content !== null) {
      return (msg.content as Record<string, unknown>).decision as string;
    }
  }
  return undefined;
}

main();

Use structured objects in expected_output to define expected output fields for easy validation:

expected_output:
  - role: assistant
    content:
      decision: CLEAR
      confidence: high
      reasons: []

The evaluator extracts these fields and compares them against the parsed candidate output.
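
As a sketch, that comparison can be done field by field in a generic way. This assumes the same stdin contract as the evaluator above; the compareFields helper is illustrative:

// Sketch: compare every expected field against the parsed candidate output.
function compareFields(
  expected: Record<string, unknown>,
  candidate: Record<string, unknown>,
) {
  const hits: string[] = [];
  const misses: string[] = [];
  for (const [key, value] of Object.entries(expected)) {
    if (JSON.stringify(candidate[key]) === JSON.stringify(value)) {
      hits.push(`${key} matches: ${JSON.stringify(value)}`);
    } else {
      misses.push(`${key} mismatch: expected=${JSON.stringify(value)} actual=${JSON.stringify(candidate[key])}`);
    }
  }
  return { score: misses.length === 0 ? 1 : 0, hits, misses };
}

// Usage, given the expected_output above and the parsed runner answer:
// compareFields({ decision: 'CLEAR', confidence: 'high', reasons: [] }, JSON.parse(answer))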

Configure the batch CLI provider in your targets file or eval file:

# In agentv-targets.yaml or eval file
targets:
  batch_cli:
    provider: cli
    commandTemplate: bun run ./scripts/batch-runner.ts --eval {EVAL_FILE} --output {OUTPUT_FILE}
    provider_batching: true

Key settings:

Setting                     Description
provider: cli               Use the CLI provider
provider_batching: true     Run once for all tests instead of per test
{EVAL_FILE}                 Placeholder replaced with the eval file path
{OUTPUT_FILE}               Placeholder replaced with the JSONL output path
  1. Use unique test IDs — the batch runner and AgentV use id to route outputs to the correct evaluator
  2. Structured input — put structured data in user.content for the runner to extract
  3. Structured expected_output — define expected output as objects for easy comparison
  4. Deterministic runners — batch runners should produce consistent output for reliable testing
  5. Healthcheck support — add a --healthcheck flag for runner validation:
    if (args.includes('--healthcheck')) {
      console.log('batch-runner: healthy');
      return;
    }