feat: Add experimental red teamer strategy #5795
base: main
Conversation
1f8adb4 to 812a67e
📝 Walkthrough
Introduces a new Redteam Simba provider and execution path. Evaluator gains Simba-aware routing: classifies tests (serial, concurrent, Simba), runs Simba cases sequentially via provider.runSimba, and preserves existing abort/progress handling. Provider registry maps promptfoo:redteam:simba to the new provider. Redteam metadata, plugins, and strategies gain Simba entries; strategies/index exposes Simba twice. A Simba strategy generates a single derived test case using the Simba provider. remoteGeneration exports buildRemoteUrl. redteam/util adds isSimbaTestCase for routing. types/index removes RunEvalOptions.allTests. The Simba provider scaffolds sessioned API calls, with runSimba returning EvaluateResult[] and callApi unsupported.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
src/evaluator.ts (3)
259-275
: Remove nonexistent parameter allTests from RunEvalOptions
RunEvalOptions (types/index.ts) does not define allTests. Destructuring it here will fail type-checking when callers pass strictly typed options.
Apply this diff:
export async function runEval({
  provider,
  prompt, // raw prompt
  test,
  delay,
  nunjucksFilters: filters,
  evaluateOptions,
  testIdx,
  promptIdx,
  conversations,
  registers,
  isRedteam,
- allTests,
  concurrency,
  abortSignal,
}: RunEvalOptions): Promise<EvaluateResult[]> {
340-355
: Honor test-level provider configs (load raw provider before use)
activeProvider falls back to the suite provider unless test.provider is already an ApiProvider. For Simba tests created with a raw provider config, this prevents the Simba path from ever triggering.
Apply this diff to load a raw provider config:
- const activeProvider = isApiProvider(test.provider) ? test.provider : provider;
+ let activeProvider = isApiProvider(test.provider) ? test.provider : provider;
+ // If test.provider is a raw config object, load it
+ if (!isApiProvider(test.provider) && typeof test.provider === 'object' && (test.provider as any)?.id) {
+   const { loadApiProvider } = await import('./providers');
+   const providerId =
+     typeof (test.provider as any).id === 'function'
+       ? (test.provider as any).id()
+       : (test.provider as any).id;
+   activeProvider = await loadApiProvider(providerId, {
+     options: test.provider as ProviderOptions,
+   });
+ }
1049-1090
: Remove allTests from RunEvalOptions objects
All occurrences of the allTests property in src/evaluator.ts (around lines 271 and 1086) must be deleted; it isn't part of RunEvalOptions and causes TS errors.
- allTests: runEvalOptions,
🧹 Nitpick comments (8)
src/redteam/constants/strategies.ts (1)
74-76
: Consider alphabetical ordering for maintainability.
The ADDITIONAL_STRATEGIES array appears to mix sorted and unsorted entries. Placing 'simba' between 'retry' and 'rot13' breaks alphabetical order ('simba' should come after 'rot13'). Consider maintaining alphabetical sorting throughout the array for easier maintenance and to prevent duplication.
  'retry',
- 'simba',
  'rot13',
+ 'simba',
  'video',
src/redteam/util.ts (1)
290-309
: Consider simplifying with a single return statement.
The function logic is correct but can be more concise. The two separate if-checks with early returns can be combined into a single boolean expression.
-export function isSimbaTestCase(evalOptions: RunEvalOptions): boolean {
-  // Check if provider is Simba
-  if (evalOptions.provider.id() === 'promptfoo:redteam:simba') {
-    return true;
-  }
-
-  // Check if test metadata indicates Simba strategy
-  if (evalOptions.test.metadata?.strategyId === 'simba') {
-    return true;
-  }
-
-  return false;
-}
+export function isSimbaTestCase(evalOptions: RunEvalOptions): boolean {
+  return (
+    evalOptions.provider.id() === 'promptfoo:redteam:simba' ||
+    evalOptions.test.metadata?.strategyId === 'simba'
+  );
+}
src/redteam/strategies/index.ts (1)
271-279
: Ensure strategy id uniqueness + consider structured logs
- Add a guard to prevent duplicate strategy ids at startup to avoid ambiguous routing.
- Prefer structured logging for new entries: logger.debug('Adding Simba test cases', { count: testCases.length }).
Example uniqueness check (outside this block, in validateStrategies):
const ids = Strategies.map(s => s.id);
const dupes = ids.filter((id, i) => ids.indexOf(id) !== i);
if (dupes.length) {
  throw new Error(`Duplicate strategy id(s): ${Array.from(new Set(dupes)).join(', ')}`);
}
src/evaluator.ts (1)
1448-1476
: Simba routing split looks good; prefer structured logs
The 3-way split is sound. For logs, pass context instead of interpolating vars.
Example:
logger.info('Running Simba test cases sequentially after normal tests', {
  count: simbaRunEvalOptions.length,
});
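For reference, the three-way routing this nitpick refers to amounts to roughly the sketch below; isSimbaTestCase comes from src/redteam/util.ts, while the exact variable names and the runSerially check are assumptions about the PR's code rather than verified excerpts.
// Classify eval steps into Simba, serial, and concurrent buckets (sketch only).
const simbaRunEvalOptions = runEvalOptions.filter((step) => isSimbaTestCase(step));
const serialRunEvalOptions = runEvalOptions.filter(
  (step) => !isSimbaTestCase(step) && step.test.options?.runSerially,
);
const concurrentRunEvalOptions = runEvalOptions.filter(
  (step) => !isSimbaTestCase(step) && !step.test.options?.runSerially,
);
// Simba cases then run sequentially after the normal phases via provider.runSimba.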
src/redteam/constants/metadata.ts (1)
670-671
: Alias may not be safe as a metric key.
categoryAliases is used as "metric name or harm category." "Simba (beta)" includes space/parentheses; consider a sanitized alias like "SimbaBeta" to avoid downstream selector/metric key issues. Several existing aliases contain spaces, but those have historically caused friction in dashboards.
src/redteam/providers/simba.ts (3)
104-105
: Use structured logging; avoid stringified config.
Prefer logger.debug('...', { config: this.config }) to align with structured/PII-safe logging.
- logger.debug(`[Simba] Constructor options: ${JSON.stringify(this.config)}`);
+ logger.debug('[Simba] Constructor options', { config: this.config });
84-106
: Constructor invariant good; consider exposing getSessionId().
Since you maintain a session, exposing an optional getSessionId helps with observability and parity with other providers.
export default class SimbaProvider implements ApiProvider {
@@
  private sessionId: string | null = null;
@@
+ getSessionId() {
+   return this.sessionId;
+ }
139-174
: Consider accepting context in startSession to source purpose from test metadata.
You hinted at this in comments. Wiring context-derived purpose reduces config friction and improves test reporting.
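A minimal sketch of that idea, folding in the defaults suggested later in this review; reading goal/purpose from context.test.metadata is an assumption about the available fields, not the PR's implementation.
private async startSession(context?: CallApiContextParams): Promise<string> {
  const email = getUserEmail() || 'demo@promptfoo.dev';
  const startRequest: SimbaStartRequest = {
    targetInfo: {
      // Prefer explicit config, then test metadata carried on the call context (assumed shape)
      goal: this.config.goal ?? context?.test?.metadata?.goal ?? 'Identify and execute adversarial test cases',
      purpose: this.config.purpose ?? context?.test?.metadata?.purpose ?? 'General adversarial evaluation',
      additionalAttackInstructions: this.config.additionalInstructions,
    },
    config: {
      maxConversationRounds: this.config.maxRounds ?? 20,
      maxAttackVectors: this.config.maxVectors ?? 5,
    },
    email,
  };
  const response: SimbaStartResponse = await this.callSimbaApi('/start', startRequest);
  return response.sessionId;
}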
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
src/evaluator.ts (6 hunks)
src/providers/registry.ts (2 hunks)
src/redteam/constants/metadata.ts (7 hunks)
src/redteam/constants/plugins.ts (1 hunks)
src/redteam/constants/strategies.ts (1 hunks)
src/redteam/providers/simba.ts (1 hunks)
src/redteam/remoteGeneration.ts (1 hunks)
src/redteam/strategies/index.ts (2 hunks)
src/redteam/strategies/simba.ts (1 hunks)
src/redteam/util.ts (2 hunks)
src/types/index.ts (0 hunks)
💤 Files with no reviewable changes (1)
- src/types/index.ts
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{ts,tsx}
📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)
Prefer not to introduce new TypeScript types; use existing interfaces whenever possible
**/*.{ts,tsx}
: Use TypeScript with strict type checking
Use consistent error handling with proper type checks
Always sanitize sensitive data before logging
Use logger methods with a structured context object (second parameter) instead of interpolating potentially sensitive data into log strings
Use sanitizeObject for non-logging contexts before persisting or transmitting potentially sensitive data
Files:
src/evaluator.ts
src/providers/registry.ts
src/redteam/remoteGeneration.ts
src/redteam/providers/simba.ts
src/redteam/constants/plugins.ts
src/redteam/util.ts
src/redteam/strategies/index.ts
src/redteam/constants/strategies.ts
src/redteam/strategies/simba.ts
src/redteam/constants/metadata.ts
**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.{ts,tsx,js,jsx}
: Follow consistent import order (Biome will sort imports)
Use consistent curly braces for all control statements
Prefer const over let; avoid var
Use object shorthand syntax where possible
Use async/await for asynchronous code
Files:
src/evaluator.ts
src/providers/registry.ts
src/redteam/remoteGeneration.ts
src/redteam/providers/simba.ts
src/redteam/constants/plugins.ts
src/redteam/util.ts
src/redteam/strategies/index.ts
src/redteam/constants/strategies.ts
src/redteam/strategies/simba.ts
src/redteam/constants/metadata.ts
src/**
📄 CodeRabbit inference engine (CLAUDE.md)
Place core application logic in src/
Files:
src/evaluator.ts
src/providers/registry.ts
src/redteam/remoteGeneration.ts
src/redteam/providers/simba.ts
src/redteam/constants/plugins.ts
src/redteam/util.ts
src/redteam/strategies/index.ts
src/redteam/constants/strategies.ts
src/redteam/strategies/simba.ts
src/redteam/constants/metadata.ts
🧬 Code graph analysis (6)
src/evaluator.ts (2)
- src/types/index.ts (2)
  - EvaluateResult (263-284)
  - RunEvalOptions (136-160)
- src/redteam/util.ts (1)
  - isSimbaTestCase (297-309)
src/providers/registry.ts (2)
- src/types/providers.ts (1)
  - ProviderOptions (39-47)
- src/types/index.ts (1)
  - LoadApiProviderContext (1199-1203)
src/redteam/providers/simba.ts (6)
- src/types/providers.ts (3)
  - ApiProvider (79-96)
  - CallApiContextParams (49-69)
  - CallApiOptionsParams (71-77)
- src/redteam/remoteGeneration.ts (1)
  - buildRemoteUrl (28-50)
- src/logger.ts (1)
  - logRequestResponse (405-441)
- src/globalConfig/accounts.ts (1)
  - getUserEmail (26-29)
- src/types/index.ts (1)
  - EvaluateResult (263-284)
- src/util/tokenUsageUtils.ts (1)
  - createEmptyTokenUsage (31-41)
src/redteam/util.ts (1)
- src/types/index.ts (1)
  - RunEvalOptions (136-160)
src/redteam/strategies/index.ts (1)
- src/redteam/strategies/simba.ts (1)
  - addSimbaTestCases (4-41)
src/redteam/strategies/simba.ts (1)
- src/types/index.ts (2)
  - TestCaseWithPlugin (701-701)
  - TestCase (699-699)
🪛 GitHub Check: Build on Node 20.x
src/redteam/providers/simba.ts
[failure] 390-390:
Type 'undefined' is not assignable to type '{ prompt?: number | undefined; completion?: number | undefined; cached?: number | undefined; total?: number | undefined; numRequests?: number | undefined; completionDetails?: { reasoning?: number | undefined; acceptedPrediction?: number | undefined; rejectedPrediction?: number | undefined; } | undefined; }'.
[failure] 389-389:
Type 'undefined' is not assignable to type '{ reasoning?: number | undefined; acceptedPrediction?: number | undefined; rejectedPrediction?: number | undefined; }'.
[failure] 381-381:
Type 'string' is not assignable to type 'ResultFailureReason'.
🪛 GitHub Check: Build on Node 22.x
src/redteam/providers/simba.ts
[failure] 390-390:
Type 'undefined' is not assignable to type '{ prompt?: number | undefined; completion?: number | undefined; cached?: number | undefined; total?: number | undefined; numRequests?: number | undefined; completionDetails?: { reasoning?: number | undefined; acceptedPrediction?: number | undefined; rejectedPrediction?: number | undefined; } | undefined; }'.
[failure] 389-389:
Type 'undefined' is not assignable to type '{ reasoning?: number | undefined; acceptedPrediction?: number | undefined; rejectedPrediction?: number | undefined; }'.
[failure] 381-381:
Type 'string' is not assignable to type 'ResultFailureReason'.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: Test on Node 24.x and windows-latest
- GitHub Check: Test on Node 22.x and windows-latest
- GitHub Check: Test on Node 20.x and macOS-latest
- GitHub Check: Test on Node 22.x and macOS-latest
- GitHub Check: Test on Node 20.x and ubuntu-latest
- GitHub Check: Test on Node 20.x and windows-latest
- GitHub Check: Test on Node 24.x and ubuntu-latest
- GitHub Check: Build Docs
- GitHub Check: webui tests
- GitHub Check: Run Integration Tests
- GitHub Check: Generate Assets
- GitHub Check: Style Check
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (13)
src/redteam/constants/plugins.ts (1)
303-303
: LGTM!
The 'simba' plugin is correctly added to ADDITIONAL_PLUGINS in alphabetical order.
src/providers/registry.ts (2)
15-15
: LGTM!
Import follows the established pattern for other redteam providers.
1175-1184
: LGTM!
The provider factory registration follows the same pattern as other redteam providers (e.g., crescendo, goat, custom) and is correctly placed among similar redteam provider registrations.
src/redteam/remoteGeneration.ts (1)
28-50
: LGTM!
Exporting buildRemoteUrl enables its reuse by the Simba provider for constructing API endpoints. The function logic remains unchanged and follows the established pattern for URL construction with proper fallback handling.
src/redteam/strategies/index.ts (1)
29-29
: Import looks good
Importing addSimbaTestCases is correct and scoped to strategies.
src/redteam/strategies/simba.ts (1)
11-15
: Early return on empty input is fine
Returning [] when there are no test cases is appropriate.
src/evaluator.ts (1)
1511-1531
: Sequential Simba phase: OK; confirm runSimba returns fully-formed EvaluateResult[]
Ensure runSimba populates tokenUsage, metadata, and gradingResult on each EvaluateResult for downstream stats and writers.
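For concreteness, each element returned by runSimba should carry the same fields runEval builds for a normal row; a sketch follows, where graded, simbaResponse, and sessionId are placeholders rather than names taken from the PR.
const result: EvaluateResult = {
  provider: { id: this.id(), label: this.label },
  prompt,
  vars: test.vars || {},
  response: simbaResponse,
  success: graded.pass,
  failureReason: graded.pass ? ResultFailureReason.NONE : ResultFailureReason.ASSERT,
  score: graded.score,
  namedScores: graded.namedScores || {},
  gradingResult: graded, // downstream stats and writers read this
  tokenUsage: createEmptyTokenUsage(), // or accumulated usage from the Simba session
  metadata: { ...test.metadata, sessionId },
  latencyMs,
  promptIdx,
  testIdx,
  testCase: test,
  promptId: prompt.id || '',
};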
src/redteam/constants/metadata.ts (5)
149-155
: Add looks consistent; no blocking issues.
The subCategoryDescriptions entry for simba is clear and matches surrounding style.
290-291
: Display name OK.
"Simba (beta)" reads well for feature-flagged UI.
813-815
: Plugin description LGTM.
Concise and consistent with other entries.
850-852
: Strategy description LGTM.
Aligns with provider behavior.
885-887
: Strategy display name OK.
"Red teamer (beta)" is clear.
src/redteam/providers/simba.ts (1)
367-394
: Use the correct enum member and replace manual tokenUsage in the error path
- failureReason: replace the string 'provider_error' with ResultFailureReason.ERROR
- tokenUsage: call createEmptyTokenUsage() instead of inlining a zeroed object

- failureReason: 'provider_error',
+ failureReason: ResultFailureReason.ERROR,
- tokenUsage: {
-   total: 0,
-   prompt: 0,
-   completion: 0,
-   cached: 0,
-   numRequests: 0,
-   completionDetails: undefined,
-   assertions: undefined,
- },
+ tokenUsage: createEmptyTokenUsage(),

Likely an incorrect or invalid review comment.
abortSignal ? { abortSignal } : undefined, | ||
concurrency, | ||
); | ||
|
||
// Update results with proper indices for Simba | ||
for (const result of simbaResults) { | ||
result.promptIdx = promptIdx; | ||
result.testIdx = testIdx++; | ||
} | ||
|
||
return simbaResults; | ||
} else { | ||
throw new Error('Simba provider does not have runSimba method'); | ||
} | ||
} else { | ||
response = await activeProvider.callApi( | ||
renderedPrompt, | ||
callApiContext, | ||
abortSignal ? { abortSignal } : undefined, | ||
); | ||
} | ||
|
||
logger.debug(`Provider response properties: ${Object.keys(response).join(', ')}`); | ||
logger.debug(`Provider response cached property explicitly: ${response.cached}`); | ||
} | ||
const endTime = Date.now(); | ||
latencyMs = endTime - startTime; | ||
|
||
let conversationLastInput = undefined; | ||
if (renderedJson && Array.isArray(renderedJson)) { | ||
const lastElt = renderedJson[renderedJson.length - 1]; | ||
// Use the `content` field if present (OpenAI chat format) | ||
conversationLastInput = lastElt?.content || lastElt; | ||
} | ||
if (conversations) { | ||
conversations[conversationKey] = conversations[conversationKey] || []; | ||
conversations[conversationKey].push({ | ||
prompt: renderedJson || renderedPrompt, | ||
input: conversationLastInput || renderedJson || renderedPrompt, | ||
output: response.output || '', | ||
metadata: response.metadata, | ||
}); | ||
} | ||
|
||
logger.debug(`Evaluator response = ${JSON.stringify(response).substring(0, 100)}...`); | ||
logger.debug( | ||
`Evaluator checking cached flag: response.cached = ${Boolean(response.cached)}, provider.delay = ${provider.delay}`, | ||
); | ||
|
||
if (!response.cached && provider.delay > 0) { | ||
logger.debug(`Sleeping for ${provider.delay}ms`); | ||
await sleep(provider.delay); | ||
} else if (response.cached) { | ||
logger.debug(`Skipping delay because response is cached`); | ||
} | ||
|
||
const ret: EvaluateResult = { | ||
...setup, | ||
response, | ||
success: false, | ||
failureReason: ResultFailureReason.NONE, | ||
score: 0, | ||
namedScores: {}, | ||
latencyMs, | ||
cost: response.cost, | ||
metadata: { | ||
...test.metadata, | ||
...response.metadata, | ||
[FILE_METADATA_KEY]: fileMetadata, | ||
// Add session information to metadata | ||
...(() => { | ||
// If sessionIds array exists from iterative providers, use it | ||
if (test.metadata?.sessionIds) { | ||
return { sessionIds: test.metadata.sessionIds }; | ||
} | ||
|
||
// Otherwise, use single sessionId (prioritize response over vars) | ||
if (response.sessionId) { | ||
return { sessionId: response.sessionId }; | ||
} | ||
|
||
// Check if vars.sessionId is a valid string | ||
const varsSessionId = vars.sessionId; | ||
if (typeof varsSessionId === 'string' && varsSessionId.trim() !== '') { | ||
return { sessionId: varsSessionId }; | ||
} | ||
|
||
return {}; | ||
})(), | ||
}, | ||
promptIdx, | ||
testIdx, | ||
testCase: test, | ||
promptId: prompt.id || '', | ||
tokenUsage: createEmptyTokenUsage(), | ||
}; | ||
|
||
invariant(ret.tokenUsage, 'This is always defined, just doing this to shut TS up'); | ||
|
||
// Track token usage at the provider level | ||
if (response.tokenUsage) { | ||
const providerId = provider.id(); | ||
const trackingId = provider.constructor?.name | ||
? `${providerId} (${provider.constructor.name})` | ||
: providerId; | ||
TokenUsageTracker.getInstance().trackUsage(trackingId, response.tokenUsage); | ||
} | ||
|
||
if (response.error) { | ||
ret.error = response.error; | ||
ret.failureReason = ResultFailureReason.ERROR; | ||
ret.success = false; | ||
} else if (response.output === null || response.output === undefined) { | ||
// NOTE: empty output often indicative of guardrails, so behavior differs for red teams. | ||
if (isRedteam) { | ||
ret.success = true; | ||
} else { | ||
ret.success = false; | ||
ret.score = 0; | ||
ret.error = 'No output'; | ||
} | ||
} else { | ||
// Create a copy of response so we can potentially mutate it. | ||
const processedResponse = { ...response }; | ||
|
||
// Apply provider transform first (if exists) | ||
if (provider.transform) { | ||
processedResponse.output = await transform(provider.transform, processedResponse.output, { | ||
vars, | ||
prompt, | ||
}); | ||
} | ||
|
||
// Store the provider-transformed output for assertions (contextTransform) | ||
const providerTransformedOutput = processedResponse.output; | ||
|
||
// Apply test transform (if exists) | ||
const testTransform = test.options?.transform || test.options?.postprocess; | ||
if (testTransform) { | ||
processedResponse.output = await transform(testTransform, processedResponse.output, { | ||
vars, | ||
prompt, | ||
...(response && response.metadata && { metadata: response.metadata }), | ||
}); | ||
} | ||
|
||
invariant(processedResponse.output != null, 'Response output should not be null'); | ||
|
||
// Extract traceId from traceparent if available | ||
let traceId: string | undefined; | ||
if (traceContext?.traceparent) { | ||
// traceparent format: version-traceId-spanId-flags | ||
const parts = traceContext.traceparent.split('-'); | ||
if (parts.length >= 3) { | ||
traceId = parts[1]; | ||
} | ||
} | ||
|
||
// Pass providerTransformedOutput for contextTransform to use | ||
const checkResult = await runAssertions({ | ||
prompt: renderedPrompt, | ||
provider, | ||
providerResponse: { | ||
...processedResponse, | ||
// Add provider-transformed output for contextTransform | ||
providerTransformedOutput, | ||
}, | ||
test, | ||
latencyMs: response.cached ? undefined : latencyMs, | ||
assertScoringFunction: test.assertScoringFunction as ScoringFunction, | ||
traceId, | ||
}); | ||
|
||
if (!checkResult.pass) { | ||
ret.error = checkResult.reason; | ||
ret.failureReason = ResultFailureReason.ASSERT; | ||
} | ||
ret.success = checkResult.pass; | ||
ret.score = checkResult.score; | ||
ret.namedScores = checkResult.namedScores || {}; | ||
// Track assertion request count | ||
if (!ret.tokenUsage.assertions) { | ||
ret.tokenUsage.assertions = createEmptyAssertions(); | ||
} | ||
ret.tokenUsage.assertions.numRequests = (ret.tokenUsage.assertions.numRequests ?? 0) + 1; | ||
|
||
// Track assertion token usage if provided | ||
if (checkResult.tokensUsed) { | ||
accumulateAssertionTokenUsage(ret.tokenUsage.assertions, checkResult.tokensUsed); | ||
} | ||
ret.response = processedResponse; | ||
ret.gradingResult = checkResult; | ||
} | ||
|
||
// Update token usage stats | ||
if (response.tokenUsage) { | ||
accumulateResponseTokenUsage(ret.tokenUsage, response); | ||
} | ||
|
||
if (test.options?.storeOutputAs && ret.response?.output && registers) { | ||
// Save the output in a register for later use | ||
registers[test.options.storeOutputAs] = ret.response.output; | ||
} | ||
|
||
return [ret]; | ||
} catch (err) { | ||
return [ | ||
{ | ||
...setup, | ||
error: String(err) + '\n\n' + (err as Error).stack, | ||
success: false, | ||
failureReason: ResultFailureReason.ERROR, | ||
score: 0, | ||
namedScores: {}, | ||
latencyMs, | ||
promptIdx, | ||
testIdx, | ||
testCase: test, | ||
promptId: prompt.id || '', | ||
}, | ||
]; | ||
} | ||
} | ||
|
||
/** | ||
* Safely formats variables for display in progress bars and logs. | ||
* Handles extremely large variables that could cause RangeError crashes. | ||
* | ||
* @param vars - Variables to format | ||
* @param maxLength - Maximum length of the final formatted string | ||
* @returns Formatted variables string or fallback message | ||
*/ | ||
export function formatVarsForDisplay( | ||
vars: Record<string, any> | undefined, | ||
maxLength: number, | ||
): string { | ||
if (!vars || Object.keys(vars).length === 0) { | ||
return ''; | ||
} | ||
|
||
try { | ||
// Simple approach: limit individual values, then truncate the whole result | ||
const formatted = Object.entries(vars) | ||
.map(([key, value]) => { | ||
// Prevent memory issues by limiting individual values first | ||
const valueStr = String(value).slice(0, 100); | ||
return `${key}=${valueStr}`; | ||
}) | ||
.join(' ') | ||
.replace(/\n/g, ' ') | ||
.slice(0, maxLength); | ||
|
||
return formatted; | ||
} catch { | ||
// Any error - return safe fallback | ||
return '[vars unavailable]'; | ||
} | ||
} | ||
|
||
export function generateVarCombinations( | ||
vars: Record<string, string | string[] | any>, | ||
): Record<string, string | any[]>[] { | ||
const keys = Object.keys(vars); | ||
const combinations: Record<string, string | any[]>[] = [{}]; | ||
|
||
for (const key of keys) { | ||
let values: any[] = []; | ||
|
||
if (typeof vars[key] === 'string' && vars[key].startsWith('file://')) { | ||
const filePath = vars[key].slice('file://'.length); | ||
|
||
// For glob patterns, we need to resolve the base directory and use relative patterns | ||
const basePath = cliState.basePath || ''; | ||
const filePaths = | ||
globSync(filePath, { | ||
cwd: basePath || process.cwd(), | ||
windowsPathsNoEscape: true, | ||
}) || []; | ||
|
||
values = filePaths.map((path: string) => `file://${path}`); | ||
if (values.length === 0) { | ||
throw new Error( | ||
`No files found for variable ${key} at path ${filePath} in directory ${basePath || process.cwd()}`, | ||
); | ||
} | ||
} else { | ||
values = Array.isArray(vars[key]) ? vars[key] : [vars[key]]; | ||
} | ||
|
||
// Check if it's an array but not a string array | ||
if (Array.isArray(vars[key]) && typeof vars[key][0] !== 'string') { | ||
values = [vars[key]]; | ||
} | ||
|
||
const newCombinations: Record<string, any>[] = []; | ||
|
||
for (const combination of combinations) { | ||
for (const value of values) { | ||
newCombinations.push({ ...combination, [key]: value }); | ||
} | ||
} | ||
|
||
combinations.length = 0; | ||
combinations.push(...newCombinations); | ||
} | ||
|
||
return combinations; | ||
} | ||
|
||
class Evaluator { | ||
evalRecord: Eval; | ||
testSuite: TestSuite; | ||
options: EvaluateOptions; | ||
stats: EvaluateStats; | ||
conversations: EvalConversations; | ||
registers: EvalRegisters; | ||
fileWriters: JsonlFileWriter[]; | ||
|
||
constructor(testSuite: TestSuite, evalRecord: Eval, options: EvaluateOptions) { | ||
this.testSuite = testSuite; | ||
this.evalRecord = evalRecord; | ||
this.options = options; | ||
this.stats = { | ||
successes: 0, | ||
failures: 0, | ||
errors: 0, | ||
tokenUsage: createEmptyTokenUsage(), | ||
}; | ||
this.conversations = {}; | ||
this.registers = {}; | ||
|
||
const jsonlFiles = Array.isArray(evalRecord.config.outputPath) | ||
? evalRecord.config.outputPath.filter((p) => p.endsWith('.jsonl')) | ||
: evalRecord.config.outputPath?.endsWith('.jsonl') | ||
? [evalRecord.config.outputPath] | ||
: []; | ||
|
||
this.fileWriters = jsonlFiles.map((p) => new JsonlFileWriter(p)); | ||
} | ||
|
||
private async _runEvaluation(): Promise<Eval> { | ||
const { options } = this; | ||
let { testSuite } = this; | ||
|
||
const startTime = Date.now(); | ||
const maxEvalTimeMs = options.maxEvalTimeMs ?? getMaxEvalTimeMs(); | ||
let evalTimedOut = false; | ||
let globalTimeout: NodeJS.Timeout | undefined; | ||
let globalAbortController: AbortController | undefined; | ||
const processedIndices = new Set<number>(); | ||
|
||
// Progress reporters declared here for cleanup in finally block | ||
let ciProgressReporter: CIProgressReporter | null = null; | ||
let progressBarManager: ProgressBarManager | null = null; | ||
|
||
if (maxEvalTimeMs > 0) { | ||
globalAbortController = new AbortController(); | ||
options.abortSignal = options.abortSignal | ||
? AbortSignal.any([options.abortSignal, globalAbortController.signal]) | ||
: globalAbortController.signal; | ||
globalTimeout = setTimeout(() => { | ||
evalTimedOut = true; | ||
globalAbortController?.abort(); | ||
}, maxEvalTimeMs); | ||
} | ||
|
||
const vars = new Set<string>(); | ||
const checkAbort = () => { | ||
if (options.abortSignal?.aborted) { | ||
throw new Error('Operation cancelled'); | ||
} | ||
}; | ||
|
||
logger.info(`Starting evaluation ${this.evalRecord.id}`); | ||
|
||
// Add abort checks at key points | ||
checkAbort(); | ||
|
||
const prompts: CompletedPrompt[] = []; | ||
const assertionTypes = new Set<string>(); | ||
const rowsWithSelectBestAssertion = new Set<number>(); | ||
const rowsWithMaxScoreAssertion = new Set<number>(); | ||
|
||
const beforeAllOut = await runExtensionHook(testSuite.extensions, 'beforeAll', { | ||
suite: testSuite, | ||
}); | ||
testSuite = beforeAllOut.suite; | ||
|
||
if (options.generateSuggestions) { | ||
// TODO(ian): Move this into its own command/file | ||
logger.info(`Generating prompt variations...`); | ||
const { prompts: newPrompts, error } = await generatePrompts(testSuite.prompts[0].raw, 1); | ||
if (error || !newPrompts) { | ||
throw new Error(`Failed to generate prompts: ${error}`); | ||
} | ||
|
||
logger.info(chalk.blue('Generated prompts:')); | ||
let numAdded = 0; | ||
for (const prompt of newPrompts) { | ||
logger.info('--------------------------------------------------------'); | ||
logger.info(`${prompt}`); | ||
logger.info('--------------------------------------------------------'); | ||
|
||
// Ask the user if they want to continue | ||
const shouldTest = await promptYesNo('Do you want to test this prompt?', false); | ||
if (shouldTest) { | ||
testSuite.prompts.push({ raw: prompt, label: prompt }); | ||
numAdded++; | ||
} else { | ||
logger.info('Skipping this prompt.'); | ||
} | ||
} | ||
|
||
if (numAdded < 1) { | ||
logger.info(chalk.red('No prompts selected. Aborting.')); | ||
process.exitCode = 1; | ||
return this.evalRecord; | ||
} | ||
} | ||
|
||
// Split prompts by provider | ||
// Order matters - keep provider in outer loop to reduce need to swap models during local inference. | ||
|
||
// Create a map of existing prompts for resume support | ||
const existingPromptsMap = new Map<string, CompletedPrompt>(); | ||
if (cliState.resume && this.evalRecord.persisted && this.evalRecord.prompts.length > 0) { | ||
logger.debug('Resuming evaluation: preserving metrics from previous run'); | ||
for (const existingPrompt of this.evalRecord.prompts) { | ||
const key = `${existingPrompt.provider}:${existingPrompt.id}`; | ||
existingPromptsMap.set(key, existingPrompt); | ||
} | ||
} | ||
|
||
for (const provider of testSuite.providers) { | ||
for (const prompt of testSuite.prompts) { | ||
// Check if providerPromptMap exists and if it contains the current prompt's label | ||
const providerKey = provider.label || provider.id(); | ||
if (!isAllowedPrompt(prompt, testSuite.providerPromptMap?.[providerKey])) { | ||
continue; | ||
} | ||
|
||
const promptId = generateIdFromPrompt(prompt); | ||
const existingPromptKey = `${providerKey}:${promptId}`; | ||
const existingPrompt = existingPromptsMap.get(existingPromptKey); | ||
|
||
const completedPrompt = { | ||
...prompt, | ||
id: promptId, | ||
provider: providerKey, | ||
label: prompt.label, | ||
metrics: existingPrompt?.metrics || { | ||
score: 0, | ||
testPassCount: 0, | ||
testFailCount: 0, | ||
testErrorCount: 0, | ||
assertPassCount: 0, | ||
assertFailCount: 0, | ||
totalLatencyMs: 0, | ||
tokenUsage: createEmptyTokenUsage(), | ||
namedScores: {}, | ||
namedScoresCount: {}, | ||
cost: 0, | ||
}, | ||
}; | ||
prompts.push(completedPrompt); | ||
} | ||
} | ||
|
||
this.evalRecord.addPrompts(prompts); | ||
|
||
// Aggregate all vars across test cases | ||
let tests = | ||
testSuite.tests && testSuite.tests.length > 0 | ||
? testSuite.tests | ||
: testSuite.scenarios | ||
? [] | ||
: [ | ||
{ | ||
// Dummy test for cases when we're only comparing raw prompts. | ||
}, | ||
]; | ||
|
||
// Build scenarios and add to tests | ||
if (testSuite.scenarios && testSuite.scenarios.length > 0) { | ||
telemetry.record('feature_used', { | ||
feature: 'scenarios', | ||
}); | ||
for (const scenario of testSuite.scenarios) { | ||
for (const data of scenario.config) { | ||
// Merge defaultTest with scenario config | ||
const scenarioTests = ( | ||
scenario.tests || [ | ||
{ | ||
// Dummy test for cases when we're only comparing raw prompts. | ||
}, | ||
] | ||
).map((test) => { | ||
return { | ||
...(typeof testSuite.defaultTest === 'object' ? testSuite.defaultTest : {}), | ||
...data, | ||
...test, | ||
vars: { | ||
...(typeof testSuite.defaultTest === 'object' ? testSuite.defaultTest?.vars : {}), | ||
...data.vars, | ||
...test.vars, | ||
}, | ||
options: { | ||
...(typeof testSuite.defaultTest === 'object' | ||
? testSuite.defaultTest?.options | ||
: {}), | ||
...test.options, | ||
}, | ||
assert: [ | ||
// defaultTest.assert is omitted because it will be added to each test case later | ||
...(data.assert || []), | ||
...(test.assert || []), | ||
], | ||
metadata: { | ||
...(typeof testSuite.defaultTest === 'object' | ||
? testSuite.defaultTest?.metadata | ||
: {}), | ||
...data.metadata, | ||
...test.metadata, | ||
}, | ||
}; | ||
}); | ||
// Add scenario tests to tests | ||
tests = tests.concat(scenarioTests); | ||
} | ||
} | ||
} | ||
|
||
maybeEmitAzureOpenAiWarning(testSuite, tests); | ||
|
||
// Prepare vars | ||
const varNames: Set<string> = new Set(); | ||
const varsWithSpecialColsRemoved: Vars[] = []; | ||
const inputTransformDefault = | ||
typeof testSuite?.defaultTest === 'object' | ||
? testSuite?.defaultTest?.options?.transformVars | ||
: undefined; | ||
for (const testCase of tests) { | ||
testCase.vars = { | ||
...(typeof testSuite.defaultTest === 'object' ? testSuite.defaultTest?.vars : {}), | ||
...testCase?.vars, | ||
}; | ||
|
||
if (testCase.vars) { | ||
const varWithSpecialColsRemoved: Vars = {}; | ||
const inputTransformForIndividualTest = testCase.options?.transformVars; | ||
const inputTransform = inputTransformForIndividualTest || inputTransformDefault; | ||
if (inputTransform) { | ||
const transformContext: TransformContext = { | ||
prompt: {}, | ||
uuid: randomUUID(), | ||
}; | ||
const transformedVars: Vars = await transform( | ||
inputTransform, | ||
testCase.vars, | ||
transformContext, | ||
true, | ||
TransformInputType.VARS, | ||
); | ||
invariant( | ||
typeof transformedVars === 'object', | ||
'Transform function did not return a valid object', | ||
); | ||
testCase.vars = { ...testCase.vars, ...transformedVars }; | ||
} | ||
for (const varName of Object.keys(testCase.vars)) { | ||
varNames.add(varName); | ||
varWithSpecialColsRemoved[varName] = testCase.vars[varName]; | ||
} | ||
varsWithSpecialColsRemoved.push(varWithSpecialColsRemoved); | ||
} | ||
} | ||
|
||
// Set up eval cases | ||
const runEvalOptions: RunEvalOptions[] = []; | ||
let testIdx = 0; | ||
let concurrency = options.maxConcurrency || DEFAULT_MAX_CONCURRENCY; | ||
for (let index = 0; index < tests.length; index++) { | ||
const testCase = tests[index]; | ||
invariant( | ||
typeof testSuite.defaultTest !== 'object' || | ||
Array.isArray(testSuite.defaultTest?.assert || []), | ||
`defaultTest.assert is not an array in test case #${index + 1}`, | ||
); | ||
invariant( | ||
Array.isArray(testCase.assert || []), | ||
`testCase.assert is not an array in test case #${index + 1}`, | ||
); | ||
// Handle default properties | ||
testCase.assert = [ | ||
...(typeof testSuite.defaultTest === 'object' ? testSuite.defaultTest?.assert || [] : []), | ||
...(testCase.assert || []), | ||
]; | ||
testCase.threshold = | ||
testCase.threshold ?? | ||
(typeof testSuite.defaultTest === 'object' ? testSuite.defaultTest?.threshold : undefined); | ||
testCase.options = { | ||
...(typeof testSuite.defaultTest === 'object' ? testSuite.defaultTest?.options : {}), | ||
...testCase.options, | ||
}; | ||
testCase.metadata = { | ||
...(typeof testSuite.defaultTest === 'object' ? testSuite.defaultTest?.metadata : {}), | ||
...testCase.metadata, | ||
}; | ||
// If the test case doesn't have a provider, use the one from defaultTest | ||
// Note: defaultTest.provider may be a raw config object that needs to be loaded | ||
if ( | ||
!testCase.provider && | ||
typeof testSuite.defaultTest === 'object' && | ||
testSuite.defaultTest?.provider | ||
) { | ||
const defaultProvider = testSuite.defaultTest.provider; | ||
if (isApiProvider(defaultProvider)) { | ||
// Already loaded | ||
testCase.provider = defaultProvider; | ||
} else if (typeof defaultProvider === 'object' && defaultProvider.id) { | ||
// Raw config object - load it | ||
const { loadApiProvider } = await import('./providers'); | ||
const providerId = | ||
typeof defaultProvider.id === 'function' ? defaultProvider.id() : defaultProvider.id; | ||
testCase.provider = await loadApiProvider(providerId, { | ||
options: defaultProvider as ProviderOptions, | ||
}); | ||
} else { | ||
testCase.provider = defaultProvider; | ||
} | ||
} | ||
testCase.assertScoringFunction = | ||
testCase.assertScoringFunction || | ||
(typeof testSuite.defaultTest === 'object' | ||
? testSuite.defaultTest?.assertScoringFunction | ||
: undefined); | ||
|
||
if (typeof testCase.assertScoringFunction === 'string') { | ||
const { filePath: resolvedPath, functionName } = parseFileUrl( | ||
testCase.assertScoringFunction, | ||
); | ||
testCase.assertScoringFunction = await loadFunction<ScoringFunction>({ | ||
filePath: resolvedPath, | ||
functionName, | ||
}); | ||
} | ||
const prependToPrompt = | ||
testCase.options?.prefix || | ||
(typeof testSuite.defaultTest === 'object' ? testSuite.defaultTest?.options?.prefix : '') || | ||
''; | ||
const appendToPrompt = | ||
testCase.options?.suffix || | ||
(typeof testSuite.defaultTest === 'object' ? testSuite.defaultTest?.options?.suffix : '') || | ||
''; | ||
|
||
// Finalize test case eval | ||
const varCombinations = | ||
getEnvBool('PROMPTFOO_DISABLE_VAR_EXPANSION') || testCase.options?.disableVarExpansion | ||
? [testCase.vars] | ||
: generateVarCombinations(testCase.vars || {}); | ||
|
||
const numRepeat = this.options.repeat || 1; | ||
for (let repeatIndex = 0; repeatIndex < numRepeat; repeatIndex++) { | ||
for (const vars of varCombinations) { | ||
let promptIdx = 0; | ||
// Order matters - keep provider in outer loop to reduce need to swap models during local inference. | ||
for (const provider of testSuite.providers) { | ||
for (const prompt of testSuite.prompts) { | ||
const providerKey = provider.label || provider.id(); | ||
if (!isAllowedPrompt(prompt, testSuite.providerPromptMap?.[providerKey])) { | ||
continue; | ||
} | ||
runEvalOptions.push({ | ||
delay: options.delay || 0, | ||
provider, | ||
prompt: { | ||
...prompt, | ||
raw: prependToPrompt + prompt.raw + appendToPrompt, | ||
}, | ||
test: (() => { | ||
const baseTest = { | ||
...testCase, | ||
vars, | ||
options: testCase.options, | ||
}; | ||
// Only add tracing metadata fields if tracing is actually enabled | ||
const tracingEnabled = | ||
testCase.metadata?.tracingEnabled === true || | ||
testSuite.tracing?.enabled === true; | ||
|
||
if (tracingEnabled) { | ||
return { | ||
...baseTest, | ||
metadata: { | ||
...testCase.metadata, | ||
tracingEnabled: true, | ||
evaluationId: this.evalRecord.id, | ||
}, | ||
}; | ||
} | ||
return baseTest; | ||
})(), | ||
nunjucksFilters: testSuite.nunjucksFilters, | ||
testIdx, | ||
promptIdx, | ||
repeatIndex, | ||
evaluateOptions: options, | ||
conversations: this.conversations, | ||
registers: this.registers, | ||
isRedteam: testSuite.redteam != null, | ||
allTests: runEvalOptions, | ||
concurrency, | ||
abortSignal: options.abortSignal, | ||
}); | ||
promptIdx++; | ||
} | ||
} | ||
testIdx++; | ||
} | ||
} | ||
} | ||
// Pre-mark comparison rows before any filtering (used by resume logic) | ||
for (const evalOption of runEvalOptions) { | ||
if (evalOption.test.assert?.some((a) => a.type === 'select-best')) { | ||
rowsWithSelectBestAssertion.add(evalOption.testIdx); | ||
} | ||
if (evalOption.test.assert?.some((a) => a.type === 'max-score')) { | ||
rowsWithMaxScoreAssertion.add(evalOption.testIdx); | ||
} | ||
} | ||
|
||
// Resume support: if CLI is in resume mode, skip already-completed (testIdx,promptIdx) pairs | ||
if (cliState.resume && this.evalRecord.persisted) { | ||
try { | ||
const { default: EvalResult } = await import('./models/evalResult'); | ||
const completedPairs = await EvalResult.getCompletedIndexPairs(this.evalRecord.id); | ||
const originalCount = runEvalOptions.length; | ||
// Filter out steps that already exist in DB | ||
for (let i = runEvalOptions.length - 1; i >= 0; i--) { | ||
const step = runEvalOptions[i]; | ||
if (completedPairs.has(`${step.testIdx}:${step.promptIdx}`)) { | ||
runEvalOptions.splice(i, 1); | ||
} | ||
} | ||
const skipped = originalCount - runEvalOptions.length; | ||
if (skipped > 0) { | ||
logger.info(`Resuming: skipping ${skipped} previously completed cases`); | ||
} | ||
} catch (err) { | ||
logger.warn( | ||
`Resume: failed to load completed results. Running full evaluation. ${String(err)}`, | ||
); | ||
} | ||
} | ||
|
||
// Determine run parameters | ||
|
||
if (concurrency > 1) { | ||
const usesConversation = prompts.some((p) => p.raw.includes('_conversation')); | ||
const usesStoreOutputAs = tests.some((t) => t.options?.storeOutputAs); | ||
if (usesConversation) { | ||
logger.info( | ||
`Setting concurrency to 1 because the ${chalk.cyan('_conversation')} variable is used.`, | ||
); | ||
concurrency = 1; | ||
} else if (usesStoreOutputAs) { | ||
logger.info(`Setting concurrency to 1 because storeOutputAs is used.`); | ||
concurrency = 1; | ||
} | ||
} | ||
|
||
// Actually run the eval | ||
let numComplete = 0; | ||
|
||
const processEvalStep = async (evalStep: RunEvalOptions, index: number | string) => { | ||
if (typeof index !== 'number') { | ||
throw new Error('Expected index to be a number'); | ||
} | ||
|
||
const beforeEachOut = await runExtensionHook(testSuite.extensions, 'beforeEach', { | ||
test: evalStep.test, | ||
}); | ||
evalStep.test = beforeEachOut.test; | ||
|
||
const rows = await runEval(evalStep); | ||
|
||
for (const row of rows) { | ||
for (const varName of Object.keys(row.vars)) { | ||
vars.add(varName); | ||
} | ||
// Print token usage for model-graded assertions and add to stats | ||
if (row.gradingResult?.tokensUsed && row.testCase?.assert) { | ||
for (const assertion of row.testCase.assert) { | ||
if (MODEL_GRADED_ASSERTION_TYPES.has(assertion.type as AssertionType)) { | ||
const tokensUsed = row.gradingResult.tokensUsed; | ||
|
||
if (!this.stats.tokenUsage.assertions) { | ||
this.stats.tokenUsage.assertions = createEmptyAssertions(); | ||
} | ||
|
||
// Accumulate assertion tokens using the specialized assertion function | ||
accumulateAssertionTokenUsage(this.stats.tokenUsage.assertions, tokensUsed); | ||
|
||
break; | ||
} | ||
} | ||
} | ||
|
||
// capture metrics | ||
if (row.success) { | ||
this.stats.successes++; | ||
} else if (row.failureReason === ResultFailureReason.ERROR) { | ||
this.stats.errors++; | ||
} else { | ||
this.stats.failures++; | ||
} | ||
|
||
if (row.tokenUsage) { | ||
accumulateResponseTokenUsage(this.stats.tokenUsage, { tokenUsage: row.tokenUsage }); | ||
} | ||
|
||
if (evalStep.test.assert?.some((a) => a.type === 'select-best')) { | ||
rowsWithSelectBestAssertion.add(row.testIdx); | ||
} | ||
if (evalStep.test.assert?.some((a) => a.type === 'max-score')) { | ||
rowsWithMaxScoreAssertion.add(row.testIdx); | ||
} | ||
for (const assert of evalStep.test.assert || []) { | ||
if (assert.type) { | ||
assertionTypes.add(assert.type); | ||
} | ||
} | ||
|
||
numComplete++; | ||
|
||
try { | ||
await this.evalRecord.addResult(row); | ||
} catch (error) { | ||
const resultSummary = summarizeEvaluateResultForLogging(row); | ||
logger.error(`Error saving result: ${error} ${safeJsonStringify(resultSummary)}`); | ||
} | ||
|
||
for (const writer of this.fileWriters) { | ||
await writer.write(row); | ||
} | ||
|
||
const { promptIdx } = row; | ||
const metrics = prompts[promptIdx].metrics; | ||
invariant(metrics, 'Expected prompt.metrics to be set'); | ||
metrics.score += row.score; | ||
for (const [key, value] of Object.entries(row.namedScores)) { | ||
// Update named score value | ||
metrics.namedScores[key] = (metrics.namedScores[key] || 0) + value; | ||
|
||
// Count assertions contributing to this named score | ||
let contributingAssertions = 0; | ||
row.gradingResult?.componentResults?.forEach((result) => { | ||
if (result.assertion?.metric === key) { | ||
contributingAssertions++; | ||
} | ||
}); | ||
|
||
metrics.namedScoresCount[key] = | ||
(metrics.namedScoresCount[key] || 0) + (contributingAssertions || 1); | ||
} | ||
|
||
if (testSuite.derivedMetrics) { | ||
const math = await import('mathjs'); | ||
for (const metric of testSuite.derivedMetrics) { | ||
if (metrics.namedScores[metric.name] === undefined) { | ||
metrics.namedScores[metric.name] = 0; | ||
} | ||
try { | ||
if (typeof metric.value === 'function') { | ||
metrics.namedScores[metric.name] = metric.value(metrics.namedScores, evalStep); | ||
} else { | ||
const evaluatedValue = math.evaluate(metric.value, metrics.namedScores); | ||
metrics.namedScores[metric.name] = evaluatedValue; | ||
} | ||
} catch (error) { | ||
logger.debug( | ||
`Could not evaluate derived metric '${metric.name}': ${(error as Error).message}`, | ||
); | ||
} | ||
} | ||
} | ||
metrics.testPassCount += row.success ? 1 : 0; | ||
if (!row.success) { | ||
if (row.failureReason === ResultFailureReason.ERROR) { | ||
metrics.testErrorCount += 1; | ||
} else { | ||
metrics.testFailCount += 1; | ||
} | ||
} | ||
metrics.assertPassCount += | ||
row.gradingResult?.componentResults?.filter((r) => r.pass).length || 0; | ||
metrics.assertFailCount += | ||
row.gradingResult?.componentResults?.filter((r) => !r.pass).length || 0; | ||
metrics.totalLatencyMs += row.latencyMs || 0; | ||
accumulateResponseTokenUsage(metrics.tokenUsage, row.response); | ||
|
||
// Add assertion token usage to the metrics | ||
if (row.gradingResult?.tokensUsed) { | ||
updateAssertionMetrics(metrics, row.gradingResult.tokensUsed); | ||
} | ||
|
||
metrics.cost += row.cost || 0; | ||
|
||
await runExtensionHook(testSuite.extensions, 'afterEach', { | ||
test: evalStep.test, | ||
result: row, | ||
}); | ||
|
||
if (options.progressCallback) { | ||
options.progressCallback(numComplete, runEvalOptions.length, index, evalStep, metrics); | ||
} | ||
} | ||
}; | ||
|
||
// Add a wrapper function that implements timeout | ||
const processEvalStepWithTimeout = async (evalStep: RunEvalOptions, index: number | string) => { | ||
// Get timeout value from options or environment, defaults to 0 (no timeout) | ||
const timeoutMs = options.timeoutMs || getEvalTimeoutMs(); | ||
|
||
if (timeoutMs <= 0) { | ||
// No timeout, process normally | ||
return processEvalStep(evalStep, index); | ||
} | ||
|
||
// Create an AbortController to cancel the request if it times out | ||
const abortController = new AbortController(); | ||
const { signal } = abortController; | ||
|
||
// Add the abort signal to the evalStep | ||
const evalStepWithSignal = { | ||
...evalStep, | ||
abortSignal: signal, | ||
}; | ||
|
||
try { | ||
return await Promise.race([ | ||
processEvalStep(evalStepWithSignal, index), | ||
new Promise<void>((_, reject) => { | ||
const timeoutId = setTimeout(() => { | ||
// Abort any ongoing requests | ||
abortController.abort(); | ||
|
||
// If the provider has a cleanup method, call it | ||
if (typeof evalStep.provider.cleanup === 'function') { | ||
try { | ||
evalStep.provider.cleanup(); | ||
} catch (cleanupErr) { | ||
logger.warn(`Error during provider cleanup: ${cleanupErr}`); | ||
} | ||
} | ||
|
||
reject(new Error(`Evaluation timed out after ${timeoutMs}ms`)); | ||
|
||
// Clear the timeout to prevent memory leaks | ||
clearTimeout(timeoutId); | ||
}, timeoutMs); | ||
}), | ||
]); | ||
} catch (error) { | ||
// Create and add an error result for timeout | ||
const timeoutResult = { | ||
provider: { | ||
id: evalStep.provider.id(), | ||
label: evalStep.provider.label, | ||
config: evalStep.provider.config, | ||
}, | ||
prompt: { | ||
raw: evalStep.prompt.raw, | ||
label: evalStep.prompt.label, | ||
config: evalStep.prompt.config, | ||
}, | ||
vars: evalStep.test.vars || {}, | ||
error: `Evaluation timed out after ${timeoutMs}ms: ${String(error)}`, | ||
success: false, | ||
failureReason: ResultFailureReason.ERROR, // Using ERROR for timeouts | ||
score: 0, | ||
namedScores: {}, | ||
latencyMs: timeoutMs, | ||
promptIdx: evalStep.promptIdx, | ||
testIdx: evalStep.testIdx, | ||
testCase: evalStep.test, | ||
promptId: evalStep.prompt.id || '', | ||
}; | ||
|
||
// Add the timeout result to the evaluation record | ||
await this.evalRecord.addResult(timeoutResult); | ||
|
||
// Update stats | ||
this.stats.errors++; | ||
|
||
// Update prompt metrics | ||
const { metrics } = prompts[evalStep.promptIdx]; | ||
if (metrics) { | ||
metrics.testErrorCount += 1; | ||
metrics.totalLatencyMs += timeoutMs; | ||
} | ||
|
||
// Progress callback | ||
if (options.progressCallback) { | ||
options.progressCallback( | ||
numComplete, | ||
runEvalOptions.length, | ||
typeof index === 'number' ? index : 0, | ||
evalStep, | ||
metrics || { | ||
score: 0, | ||
testPassCount: 0, | ||
testFailCount: 0, | ||
testErrorCount: 1, | ||
assertPassCount: 0, | ||
assertFailCount: 0, | ||
totalLatencyMs: timeoutMs, | ||
tokenUsage: { | ||
total: 0, | ||
prompt: 0, | ||
completion: 0, | ||
cached: 0, | ||
numRequests: 0, | ||
}, | ||
namedScores: {}, | ||
namedScoresCount: {}, | ||
cost: 0, | ||
}, | ||
); | ||
} | ||
} | ||
}; | ||
|
||
// Set up progress tracking | ||
const originalProgressCallback = this.options.progressCallback; | ||
const isWebUI = Boolean(cliState.webUI); | ||
|
||
// Choose appropriate progress reporter | ||
logger.debug( | ||
`Progress bar settings: showProgressBar=${this.options.showProgressBar}, isWebUI=${isWebUI}`, | ||
); | ||
|
||
if (isCI() && !isWebUI) { | ||
// Use CI-friendly progress reporter | ||
ciProgressReporter = new CIProgressReporter(runEvalOptions.length); | ||
ciProgressReporter.start(); | ||
} else if (this.options.showProgressBar && process.stdout.isTTY) { | ||
// Use visual progress bars | ||
progressBarManager = new ProgressBarManager(isWebUI); | ||
} | ||
|
||
this.options.progressCallback = (completed, total, index, evalStep, metrics) => { | ||
if (originalProgressCallback) { | ||
originalProgressCallback(completed, total, index, evalStep, metrics); | ||
} | ||
|
||
if (isWebUI) { | ||
const provider = evalStep.provider.label || evalStep.provider.id(); | ||
const vars = formatVarsForDisplay(evalStep.test.vars, 50); | ||
logger.info(`[${numComplete}/${total}] Running ${provider} with vars: ${vars}`); | ||
} else if (progressBarManager) { | ||
// Progress bar update is handled by the manager | ||
const phase = evalStep.test.options?.runSerially ? 'serial' : 'concurrent'; | ||
progressBarManager.updateProgress(index, evalStep, phase); | ||
} else if (ciProgressReporter) { | ||
// CI progress reporter update | ||
ciProgressReporter.update(numComplete); | ||
} else { | ||
logger.debug(`Eval #${index + 1} complete (${numComplete} of ${runEvalOptions.length})`); | ||
} | ||
}; | ||
|
||
// Separate serial and concurrent eval options | ||
// Separate serial, concurrent, and Simba eval options |
Simba results must keep the same testIdx; don’t increment
Incrementing testIdx per Simba sub‑result breaks grouping by row (compare/select‑best/max‑score rely on stable testIdx).
Apply this diff:
- // Update results with proper indices for Simba
- for (const result of simbaResults) {
- result.promptIdx = promptIdx;
- result.testIdx = testIdx++;
- }
+ // Preserve row grouping across Simba sub-results
+ const baseTestIdx = testIdx;
+ for (const result of simbaResults) {
+ result.promptIdx = promptIdx;
+ result.testIdx = baseTestIdx;
+ }
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In src/evaluator.ts around lines 368 to 397, the code currently increments
testIdx for each Simba sub-result which breaks grouping; change the loop that
assigns indices so each simba result uses the same testIdx (set result.testIdx =
testIdx) and remove any testIdx++ inside that loop (do not mutate testIdx per
sub-result).
simba: Severity.Medium,
};
🧩 Analysis chain
Severity mapping added; verify category wiring.
Setting simba to Severity.Medium is fine, but simba is not listed in riskCategories. Without a category, UI grouping via categoryMapReverse will omit it. Please add simba to an appropriate top‑level category or exclude it from riskCategorySeverityMap if it’s meant to be strategy‑only.
🏁 Script executed:
#!/bin/bash
set -e
# Find definition of riskCategories
rg -n "export\s+const\s+riskCategories" -g "src/redteam/constants/metadata.ts"
# Find occurrences of riskCategorySeverityMap
rg -n "riskCategorySeverityMap" -g "src/redteam/constants/metadata.ts"
# Find all occurrences of simba in the file
rg -n "simba" -g "src/redteam/constants/metadata.ts"
Length of output: 777
🏁 Script executed:
#!/bin/bash
sed -n '400,460p' src/redteam/constants/metadata.ts
Length of output: 1454
Add “simba” to riskCategories or remove from riskCategorySeverityMap
The simba plugin is mapped to Severity.Medium but isn't included in the riskCategories object (src/redteam/constants/metadata.ts, from line 423), so it will be omitted from UI groupings. Add simba to the appropriate top-level category, or remove it from riskCategorySeverityMap if it's not intended for UI.
🤖 Prompt for AI Agents
In src/redteam/constants/metadata.ts around lines 420–421, the key "simba" is
present in riskCategorySeverityMap but missing from the riskCategories object
(around line 423), causing it to be omitted from UI groupings; either add
"simba" to the appropriate top-level category inside riskCategories (place it
under the same category used for other Medium-severity plugins) so UI groupings
include it, or if "simba" should not be exposed in the UI, remove the "simba"
entry from riskCategorySeverityMap instead—make the change so both maps remain
consistent.
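If the first option is taken, the fix is a one-line addition under an existing top-level category; a sketch follows, where the category name is a placeholder since the actual riskCategories keys aren't shown in this review.
export const riskCategories = {
  // ...other categories unchanged...
  'Some Existing Category': [
    // ...existing plugins in this category...
    'simba', // include so categoryMapReverse picks it up for UI grouping
  ],
};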
const url =
  buildRemoteUrl('/api/v1/simba', 'https://api.promptfoo.app/api/v1/simba') + endpoint;

const response = await fetch(url, {
  method,
  headers: {
    'Content-Type': 'application/json',
  },
  body: body ? JSON.stringify(body) : undefined,
});

if (!response.ok) {
  logRequestResponse({ url, requestMethod: 'POST', requestBody: body, response });
  throw new Error(`Simba API request failed: ${response.status} ${response.statusText}`);
}

return response.json();
}
Fix base URL null case and incorrect log method; mark error logs properly.
- buildRemoteUrl can return null; current concatenation yields "null..." URL.
- When logging a failed GET, requestMethod is hardcoded to 'POST'.
- Set error: true for logRequestResponse.
Apply this diff:
- const url =
- buildRemoteUrl('/api/v1/simba', 'https://api.promptfoo.app/api/v1/simba') + endpoint;
+ const base =
+ buildRemoteUrl('/api/v1/simba', 'https://api.promptfoo.app/api/v1/simba') ||
+ 'https://api.promptfoo.app/api/v1/simba';
+ const url = new URL(endpoint, base).toString();
@@
- if (!response.ok) {
- logRequestResponse({ url, requestMethod: 'POST', requestBody: body, response });
+ if (!response.ok) {
+ await logRequestResponse({
+ url,
+ requestMethod: method,
+ requestBody: body,
+ response,
+ error: true,
+ });
throw new Error(`Simba API request failed: ${response.status} ${response.statusText}`);
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
const url =
  buildRemoteUrl('/api/v1/simba', 'https://api.promptfoo.app/api/v1/simba') + endpoint;
const response = await fetch(url, {
  method,
  headers: {
    'Content-Type': 'application/json',
  },
  body: body ? JSON.stringify(body) : undefined,
});
if (!response.ok) {
  logRequestResponse({ url, requestMethod: 'POST', requestBody: body, response });
  throw new Error(`Simba API request failed: ${response.status} ${response.statusText}`);
}
return response.json();
}
// Ensure buildRemoteUrl fallback when it returns null
const base =
  buildRemoteUrl('/api/v1/simba', 'https://api.promptfoo.app/api/v1/simba') ||
  'https://api.promptfoo.app/api/v1/simba';
const url = new URL(endpoint, base).toString();
const response = await fetch(url, {
  method,
  headers: {
    'Content-Type': 'application/json',
  },
  body: body ? JSON.stringify(body) : undefined,
});
if (!response.ok) {
  // Log with the actual HTTP method and mark as an error
  await logRequestResponse({
    url,
    requestMethod: method,
    requestBody: body,
    response,
    error: true,
  });
  throw new Error(`Simba API request failed: ${response.status} ${response.statusText}`);
}
return response.json();
🤖 Prompt for AI Agents
In src/redteam/providers/simba.ts around lines 120 to 137, buildRemoteUrl can
return null which makes URL concatenation produce "null..." so guard that return
(e.g., use the fallback base or throw a clear error) and build the final url
safely; when logging failed requests use the actual method variable instead of
the hardcoded 'POST' (requestMethod: method) and mark the log as an error by
passing error: true into logRequestResponse; keep the existing response handling
and return response.json() after these fixes.
private async startSession(): Promise<string> {
  const email = (await getUserEmail()) || 'demo@promptfoo.dev';

  const startRequest: SimbaStartRequest = {
    targetInfo: {
      goal: this.config.goal!,
      purpose: this.config.purpose!,
      additionalAttackInstructions: this.config.additionalInstructions,
    },
    config: {
      maxConversationRounds: this.config.maxRounds!,
      maxAttackVectors: this.config.maxVectors!,
    },
    email,
  };

  const response: SimbaStartResponse = await this.callSimbaApi('/start', startRequest);
  logger.debug(`[Simba] Started session with ID: ${response.sessionId}`);
  return response.sessionId;
}
Avoid non‑null assertions for required fields; provide safe defaults.
purpose may be undefined; the non‑null assertion only silences TS, not runtime. Also getUserEmail is sync; no need to await.
- private async startSession(): Promise<string> {
- const email = (await getUserEmail()) || 'demo@promptfoo.dev';
+ private async startSession(): Promise<string> {
+ const email = getUserEmail() || 'demo@promptfoo.dev';
@@
- targetInfo: {
- goal: this.config.goal!,
- purpose: this.config.purpose!,
- additionalAttackInstructions: this.config.additionalInstructions,
- },
+ targetInfo: {
+ goal: this.config.goal || 'Identify and execute adversarial test cases',
+ purpose: this.config.purpose ?? 'General adversarial evaluation',
+ additionalAttackInstructions: this.config.additionalInstructions,
+ },
@@
- maxConversationRounds: this.config.maxRounds!,
- maxAttackVectors: this.config.maxVectors!,
+ maxConversationRounds: this.config.maxRounds ?? 20,
+ maxAttackVectors: this.config.maxVectors ?? 5,
🤖 Prompt for AI Agents
In src/redteam/providers/simba.ts around lines 139 to 158, remove the non‑null
assertions and make fields safe by supplying defaults and validating types: call
getUserEmail synchronously (drop the await) and default to 'demo@promptfoo.dev'
if falsy, ensure goal, purpose, and additionalInstructions are set to safe
strings (e.g., '' or a descriptive fallback) instead of using !, and ensure
numeric config fields maxRounds and maxVectors fall back to safe numbers if
undefined; then pass these validated/defaulted values into startRequest before
calling the API.
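One way to apply that advice beyond patching the call sites is to normalize the config once, for example in the provider constructor, so later code needs neither non-null assertions nor scattered ?? fallbacks. A minimal sketch follows; the interface name and the default values are assumptions for illustration, mirroring the fallbacks used in the diff above rather than documented defaults.

interface ResolvedSimbaConfig {
  goal: string;
  purpose: string;
  additionalInstructions?: string;
  maxRounds: number;
  maxVectors: number;
}

// Sketch: resolve optional fields up front so startSession can use them directly.
function resolveSimbaConfig(config: Partial<ResolvedSimbaConfig>): ResolvedSimbaConfig {
  return {
    goal: config.goal ?? 'Identify and execute adversarial test cases',
    purpose: config.purpose ?? 'General adversarial evaluation',
    additionalInstructions: config.additionalInstructions,
    maxRounds: config.maxRounds ?? 20,
    maxVectors: config.maxVectors ?? 5,
  };
}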
  async runSimba(
    prompt: string,
    context?: CallApiContextParams,
    options?: CallApiOptionsParams,
    concurrency?: number,
  ): Promise<EvaluateResult[]> {
    try {
🧩 Analysis chain
Hardcoded sessionId and commented core loop: runSimba currently ignores prompt/context.
Using a fixed GUID breaks real runs and test reproducibility. Initialize a session and fetch results using that id. Also avoid mutating output.messages with pop(); compute response safely. Use a stable promptId. Example minimal fix:
async runSimba(
@@
): Promise<EvaluateResult[]> {
try {
- const finalOutput = await this.getFinalOutput(`5501c726-0409-4192-b7cd-7457e6649788`);
+ if (!this.sessionId) {
+ this.sessionId = await this.startSession();
+ }
+ const sid = this.sessionId;
+ const finalOutput = await this.getFinalOutput(sid);
- return finalOutput.map((output, index) => {
- const lastUserMessage = output.messages.filter((message) => message.role === 'user').pop();
+ return finalOutput.map((output, index) => {
+ const lastUserMessage = [...output.messages].filter((m) => m.role === 'user').pop();
+ const lastAssistantMessage =
+ [...output.messages].reverse().find((m) => m.role === 'assistant') || null;
return {
@@
- promptId: `simba-${this.sessionId}-${index}`,
+ promptId: `simba-${sid}-${index}`,
@@
- response: {
- output: output.messages.pop()?.content || '',
+ response: {
+ output: lastAssistantMessage?.content || '',
tokenUsage: createEmptyTokenUsage(),
},
- success: !!output.result.success,
- score: output.result.success ? 0 : 1,
+ // Treat "attack success" as "test failure"
+ success: !output.result.success,
+ score: output.result.success ? 0 : 1, // verify scoring convention
@@
- failureReason: output.result.success
- ? ResultFailureReason.ASSERT
- : ResultFailureReason.NONE,
+ failureReason: output.result.success
+ ? ResultFailureReason.ASSERT
+ : ResultFailureReason.NONE,
@@
- gradingResult: {
- pass: output.result.success,
- score: output.result.success ? 0 : 1,
+ gradingResult: {
+ pass: !output.result.success,
+ score: output.result.success ? 0 : 1, // verify
reason: output.result.summary,
metadata: {
pluginId: 'simba',
strategyId: 'simba-attack',
},
},
namedScores: {
- attack_success: output.result.success ? 0 : 1,
+ attack_success: output.result.success ? 1 : 0,
},
tokenUsage: createEmptyTokenUsage(),
metadata: {
attackPlan: output.attackPlan,
result: output.result,
redteamHistory: (() => {
- const pairs = [];
- for (let i = 0; i < output.messages.length; i += 2) {
+ const pairs = [];
+ for (let i = 0; i < output.messages.length; i += 2) {
const userMessage = output.messages[i];
const assistantMessage = output.messages[i + 1];
if (userMessage?.role === 'user' && assistantMessage?.role === 'assistant') {
pairs.push({
prompt: userMessage.content,
output: assistantMessage.content,
});
}
}
return pairs;
})(),
dataExtracted: output.result.dataExtracted.join('\n'),
successfulJailbreaks: output.result.successfulJailbreaks.join('\n'),
},
};
});
Also applies to: 283-366
🏁 Script executed:
#!/bin/bash
# Locate the simba.ts file
find . -type f -name "simba.ts"
Length of output: 98
🏁 Script executed:
#!/bin/bash
# Show runSimba implementation with context
rg -nP -C5 'async runSimba' src/redteam/providers/simba.ts
Length of output: 246
🏁 Script executed:
#!/bin/bash
# Extract runSimba implementation for inspection
sed -n '170,260p' src/redteam/providers/simba.ts
Length of output: 3362
🏁 Script executed:
#!/bin/bash
# Extract continuation of runSimba implementation
sed -n '260,350p' src/redteam/providers/simba.ts
Length of output: 3459
Refactor runSimba in src/redteam/providers/simba.ts to use dynamic sessions and real prompt/context

- Replace the hardcoded GUID in getFinalOutput with something like `this.sessionId ??= await this.startSession(); const finalOutput = await this.getFinalOutput(this.sessionId);`.
- Re-enable the core loop to forward the prompt and context through callSimbaApi and targetProvider.callApi, instead of commenting it out.
- Don't mutate output.messages directly; use `[...output.messages]` or similar to extract lastUserMessage and lastAssistantMessage (see the sketch after this list).
- Invert scoring so that "attack success" (output.result.success) yields test failure (`success: !output.result.success`) and set namedScores.attack_success to 1 when the attack wins.
- Build promptId as `simba-${this.sessionId}-${index}` to guarantee stability.
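As referenced above, a small non-mutating helper keeps the extraction in one place. The function name is illustrative and the message shape is inferred from the code in this PR, so treat it as a sketch rather than the final implementation.

type SimbaMessage = { role: 'user' | 'assistant'; content: string };

// Sketch: walk backwards to find the last message for a role without mutating the array.
function lastMessageByRole(
  messages: SimbaMessage[],
  role: SimbaMessage['role'],
): SimbaMessage | undefined {
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i].role === role) {
      return messages[i];
    }
  }
  return undefined;
}

// In runSimba:
// const lastUserMessage = lastMessageByRole(output.messages, 'user');
// const lastAssistantMessage = lastMessageByRole(output.messages, 'assistant');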
🤖 Prompt for AI Agents
In src/redteam/providers/simba.ts around lines 170 to 176, runSimba currently
uses a hardcoded GUID and a commented-out core forwarding loop, mutates
output.messages directly, incorrectly maps attack success to pass, and builds
unstable prompt IDs; fix by initializing a dynamic session with this.sessionId
??= await this.startSession() and calling getFinalOutput(this.sessionId)
instead of the GUID, re-enabling the core loop to forward prompt and context
via callSimbaApi and targetProvider.callApi, copying output.messages (e.g.,
[...output.messages]) before extracting lastUserMessage/lastAssistantMessage
instead of mutating it, inverting the returned success so test success =
!output.result.success and setting namedScores.attack_success = 1 when
output.result.success is true, and constructing promptId as
`simba-${this.sessionId}-${index}` to ensure stability.
const originalText = String(baseTestCase.vars![injectVar]);
Guard against missing vars/injectVar to avoid a runtime TypeError
baseTestCase.vars may be undefined despite the non-null assertion; indexing it with [injectVar] can throw. Also, String(undefined) yields 'undefined'.
Apply this diff:
- const baseTestCase = testCases[0];
- const originalText = String(baseTestCase.vars![injectVar]);
+ const baseTestCase = testCases[0];
+ const vars = baseTestCase.vars ?? {};
+ const raw = (vars as Record<string, unknown>)[injectVar];
+ const originalText =
+ typeof raw === 'string' ? raw : raw != null ? String(raw) : '';
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
const baseTestCase = testCases[0];
const vars = baseTestCase.vars ?? {};
const raw = (vars as Record<string, unknown>)[injectVar];
const originalText =
  typeof raw === 'string' ? raw : raw != null ? String(raw) : '';
🤖 Prompt for AI Agents
In src/redteam/strategies/simba.ts around lines 18-19, guard against
baseTestCase.vars being undefined or missing the injectVar before converting to
a string: check that baseTestCase.vars && injectVar in baseTestCase.vars (or use
optional chaining baseTestCase.vars?.[injectVar]) and provide a safe default
(e.g., '') or throw a clear error if it must be present; then convert the
resulting value to a string (e.g., String(baseTestCase.vars?.[injectVar] ??
'')). Ensure types reflect the possibility of undefined so no runtime NPE or the
literal "undefined" string is produced.
provider: {
  id: 'promptfoo:redteam:simba',
  config: {
    injectVar,
    ...config,
  },
},
Sanitize provider.config before persisting
Spreading arbitrary config into provider.config risks persisting secrets (API keys, tokens) in results/DB.
Example (adjust to your sanitize helper):
- config: {
- injectVar,
- ...config,
- },
+ config: sanitizeObject({
+ injectVar,
+ ...config,
+ }),
If a sanitize helper isn’t available, whitelist allowed Simba config keys instead of spreading.
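For the whitelist route, a sketch along these lines would work; the key list below is an assumption based on the Simba config fields seen elsewhere in this PR (goal, purpose, additionalInstructions, maxRounds, maxVectors) and should be adjusted to whatever the strategy actually accepts.

// Sketch: copy a fixed set of known-safe Simba keys instead of spreading the whole config.
const ALLOWED_SIMBA_CONFIG_KEYS = [
  'goal',
  'purpose',
  'additionalInstructions',
  'maxRounds',
  'maxVectors',
] as const;

function pickSimbaConfig(config: Record<string, unknown>): Record<string, unknown> {
  const picked: Record<string, unknown> = {};
  for (const key of ALLOWED_SIMBA_CONFIG_KEYS) {
    if (config[key] !== undefined) {
      picked[key] = config[key];
    }
  }
  return picked;
}

// provider: {
//   id: 'promptfoo:redteam:simba',
//   config: { injectVar, ...pickSimbaConfig(config) },
// },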
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
provider: {
  id: 'promptfoo:redteam:simba',
  config: sanitizeObject({
    injectVar,
    ...config,
  }),
},
🤖 Prompt for AI Agents
In src/redteam/strategies/simba.ts around lines 23 to 29, provider.config
currently spreads arbitrary config (injectVar, ...config) which can persist
secrets; update the code to sanitize config before persisting, either by invoking
the project's sanitize helper on config and storing only the sanitized result,
or by replacing the spread with an explicit whitelist of allowed Simba config keys
(copy only safe keys into config). Ensure injectVar is retained if safe, and do
not include any secrets (API keys, tokens) in the persisted provider.config.
assert: baseTestCase.assert?.map((assertion) => ({
  ...assertion,
  metric: `${assertion.metric}/Simba`,
})),
Avoid producing "undefined/Simba" metrics
assertion.metric can be undefined; concatenation yields 'undefined/Simba'.
Apply this diff:
- assert: baseTestCase.assert?.map((assertion) => ({
- ...assertion,
- metric: `${assertion.metric}/Simba`,
- })),
+ assert: baseTestCase.assert?.map((assertion) => ({
+ ...assertion,
+ ...(assertion.metric
+ ? { metric: `${assertion.metric}/Simba` }
+ : {}),
+ })),
🤖 Prompt for AI Agents
In src/redteam/strategies/simba.ts around lines 30 to 33, the mapping
unconditionally concatenates "/Simba" to assertion.metric which can be undefined
and produce "undefined/Simba"; change the mapping to only append the suffix when
assertion.metric is defined (e.g. set metric to assertion.metric ?
`${assertion.metric}/Simba` : undefined or omit the metric property when
undefined) so metrics are not "undefined/Simba" and types are preserved.
812a67e to 3dbbded Compare
Adding our experimental advanced red teamer strategy.