# Batch CLI Evaluation
Batch CLI evaluation handles tools that process multiple inputs at once — bulk classifiers, screening engines, or any runner that reads all tests and outputs results in one pass.
## Overview

Use batch CLI evaluation when:

- An external tool processes multiple inputs in a single invocation (e.g., AML screening, bulk classification)
- The runner reads the eval YAML directly to extract all tests
- Output is JSONL with records keyed by test `id`
- Each test has its own evaluator to validate its corresponding output record
## Execution Flow

1. AgentV invokes the batch runner once, passing `--eval <yaml-path>` and `--output <jsonl-path>`
2. The batch runner reads the eval YAML, extracts all tests, processes them, and writes JSONL output keyed by `id`
3. AgentV parses the JSONL and routes each record to its matching test by `id`
4. Per-test evaluators validate the output for each test independently
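The runner side of this flow can be sketched as follows. This is a minimal illustration, not AgentV code: the `Test` shape, the `parseArgs` helper, and the hard-coded `CLEAR` decision are assumptions for the sketch, and YAML parsing plus file I/O are omitted for brevity.

```ts
// Illustrative test shape; a real eval YAML carries more fields.
type Test = { id: string; input: unknown };

// Extract the --eval and --output values from argv (hypothetical helper).
function parseArgs(argv: string[]): { evalPath: string; outputPath: string } {
  const get = (flag: string): string => {
    const i = argv.indexOf(flag);
    if (i === -1 || i + 1 >= argv.length) throw new Error(`missing ${flag}`);
    return argv[i + 1];
  };
  return { evalPath: get('--eval'), outputPath: get('--output') };
}

// Produce one JSONL line per test, keyed by id. The decision here is
// a placeholder; a real runner would apply its own logic per test.
function runBatch(tests: Test[]): string {
  return tests
    .map((t) => JSON.stringify({ id: t.id, text: JSON.stringify({ decision: 'CLEAR' }) }))
    .join('\n');
}
```

A real runner would read the file at `evalPath`, parse its tests, and write the string returned by `runBatch` to `outputPath`.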
## Eval File Structure

```yaml
description: Batch CLI demo using structured input
execution:
  target: batch_cli

tests:
  - id: case-001
    criteria: |-
      Batch runner returns JSON with decision=CLEAR.
    expected_output:
      - role: assistant
        content:
          decision: CLEAR
    input:
      - role: system
        content: You are a batch processor.
      - role: user
        content:
          request:
            type: screening_check
            jurisdiction: AU
          row:
            id: case-001
            name: Example A
            amount: 5000
    assert:
      - name: decision-check
        type: code_judge
        script: bun run ./scripts/check-output.ts
        cwd: .

  - id: case-002
    criteria: |-
      Batch runner returns JSON with decision=REVIEW.
    expected_output:
      - role: assistant
        content:
          decision: REVIEW
    input:
      - role: system
        content: You are a batch processor.
      - role: user
        content:
          request:
            type: screening_check
            jurisdiction: AU
          row:
            id: case-002
            name: Example B
            amount: 25000
    assert:
      - name: decision-check
        type: code_judge
        script: bun run ./scripts/check-output.ts
        cwd: .
```

## Batch Runner Contract
The batch runner reads the eval YAML directly and processes all tests in one invocation.
The runner receives the eval file path via `--eval` and an output path via `--output`:

```sh
bun run batch-runner.ts --eval ./my-eval.yaml --output ./results.jsonl
```

## Output
JSONL where each line is a JSON object with an `id` matching a test:

```jsonl
{"id": "case-001", "text": "{\"decision\": \"CLEAR\", ...}"}
{"id": "case-002", "text": "{\"decision\": \"REVIEW\", ...}"}
```

The `id` field must match the test `id` for AgentV to route output to the correct evaluator.
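Conceptually, this routing step amounts to indexing records by `id`. The sketch below illustrates the idea (it is not AgentV's actual implementation; the `BatchRecord` type is assumed from the record shape shown above):

```ts
type BatchRecord = { id: string; text: string };

// Index JSONL records by test id so each per-test evaluator can
// find the output record that belongs to it.
function routeRecords(jsonl: string): Map<string, BatchRecord> {
  const byId = new Map<string, BatchRecord>();
  for (const line of jsonl.split('\n')) {
    if (!line.trim()) continue; // tolerate trailing blank lines
    const record = JSON.parse(line) as BatchRecord;
    byId.set(record.id, record);
  }
  return byId;
}
```

A record whose `id` matches no test would simply never be looked up, which is why unique, matching IDs matter.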
## Output with Tool Trajectory

To enable `tool_trajectory` evaluation, include `output` with `tool_calls`:

```json
{
  "id": "case-001",
  "text": "{\"decision\": \"CLEAR\", ...}",
  "output": [
    {
      "role": "assistant",
      "tool_calls": [
        {
          "tool": "screening_check",
          "input": { "origin_country": "NZ", "amount": 5000 },
          "output": { "decision": "CLEAR", "reasons": [] }
        }
      ]
    },
    {
      "role": "assistant",
      "content": { "decision": "CLEAR" }
    }
  ]
}
```

AgentV extracts tool calls directly from `output[].tool_calls[]` for `tool_trajectory` evaluators.
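That extraction boils down to flattening the `tool_calls` arrays across messages. A minimal sketch, with types assumed from the record shape above:

```ts
type ToolCall = { tool: string; input: unknown; output: unknown };
type OutputMessage = { role: string; content?: unknown; tool_calls?: ToolCall[] };

// Collect every tool call across all output messages, preserving order.
function extractToolCalls(output: OutputMessage[]): ToolCall[] {
  return output.flatMap((msg) => msg.tool_calls ?? []);
}
```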
## Evaluator Implementation

Each test has its own evaluator that validates the batch runner output. The evaluator receives the standard `code_judge` input via stdin.

Input (stdin):

```json
{
  "answer": "{\"id\":\"case-001\",\"decision\":\"CLEAR\",...}",
  "expected_output": [{ "role": "assistant", "content": { "decision": "CLEAR" } }],
  "input": [...]
}
```

Output (stdout):

```json
{
  "score": 1.0,
  "hits": ["decision matches: CLEAR"],
  "misses": [],
  "reasoning": "Batch runner decision matches expected."
}
```

## Example Evaluator
Section titled “Example Evaluator”import fs from 'node:fs';
type EvalInput = { answer?: string; expected_output?: Array<{ role: string; content: unknown }>;};
function main() { const stdin = fs.readFileSync(0, 'utf8'); const input = JSON.parse(stdin) as EvalInput;
const expectedDecision = findExpectedDecision(input.expected_output);
let candidateDecision: string | undefined; try { const parsed = JSON.parse(input.answer ?? ''); candidateDecision = parsed.decision; } catch { candidateDecision = undefined; }
const hits: string[] = []; const misses: string[] = [];
if (expectedDecision === candidateDecision) { hits.push(`decision matches: ${expectedDecision}`); } else { misses.push(`mismatch: expected=${expectedDecision} actual=${candidateDecision}`); }
const score = misses.length === 0 ? 1 : 0;
process.stdout.write(JSON.stringify({ score, hits, misses, reasoning: score === 1 ? 'Batch runner output matches expected.' : 'Batch runner output did not match expected.', }));}
function findExpectedDecision(messages?: Array<{ role: string; content: unknown }>) { if (!messages) return undefined; for (const msg of messages) { if (typeof msg.content === 'object' && msg.content !== null) { return (msg.content as Record<string, unknown>).decision as string; } } return undefined;}
main();Structured Content
Use structured objects in `expected_output` to define expected output fields for easy validation:

```yaml
expected_output:
  - role: assistant
    content:
      decision: CLEAR
      confidence: high
      reasons: []
```

The evaluator extracts these fields and compares them against the parsed candidate output.
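One way to do that comparison is to treat the expected object as a set of required fields and report a miss for each field that differs. A sketch: `fieldsMatch` is a hypothetical helper, and JSON-based deep equality is a simplification that ignores key ordering inside nested objects.

```ts
// Compare each expected field against the candidate output.
// Extra candidate fields are ignored; missing or differing
// expected fields become misses.
function fieldsMatch(
  expected: Record<string, unknown>,
  candidate: Record<string, unknown>,
): string[] {
  const misses: string[] = [];
  for (const [key, value] of Object.entries(expected)) {
    if (JSON.stringify(candidate[key]) !== JSON.stringify(value)) {
      misses.push(`mismatch on ${key}: expected=${JSON.stringify(value)}`);
    }
  }
  return misses;
}
```

The returned list slots directly into the `misses` array of the evaluator output, with an empty list meaning a score of 1.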
## Target Configuration

Configure the batch CLI provider in your targets file or eval file:

```yaml
# In agentv-targets.yaml or eval file
targets:
  batch_cli:
    provider: cli
    commandTemplate: bun run ./scripts/batch-runner.ts --eval {EVAL_FILE} --output {OUTPUT_FILE}
    provider_batching: true
```

Key settings:
| Setting | Description |
|---|---|
| `provider: cli` | Use the CLI provider |
| `provider_batching: true` | Run once for all tests instead of per-test |
| `{EVAL_FILE}` | Placeholder replaced with the eval file path |
| `{OUTPUT_FILE}` | Placeholder replaced with the JSONL output path |
## Best Practices

- Use unique test IDs: the batch runner and AgentV use `id` to route outputs to the correct evaluator
- Structured input: put structured data in `user.content` for the runner to extract
- Structured expected_output: define expected output as objects for easy comparison
- Deterministic runners: batch runners should produce consistent output for reliable testing
- Healthcheck support: add a `--healthcheck` flag for runner validation:

```ts
if (args.includes('--healthcheck')) {
  console.log('batch-runner: healthy');
  return;
}
```