feat: Add experimental red teamer strategy #5795
base: main
Conversation
1f8adb4 to 812a67e
📝 Walkthrough
Introduces a new Redteam Simba provider and execution path. Evaluator gains Simba-aware routing: classifies tests (serial, concurrent, Simba), runs Simba cases sequentially via provider.runSimba, and preserves existing abort/progress handling. Provider registry maps promptfoo:redteam:simba to the new provider. Redteam metadata, plugins, and strategies gain Simba entries; strategies/index exposes Simba twice. A Simba strategy generates a single derived test case using the Simba provider. remoteGeneration exports buildRemoteUrl. redteam/util adds isSimbaTestCase for routing. types/index removes RunEvalOptions.allTests. The Simba provider scaffolds sessioned API calls, with runSimba returning EvaluateResult[] and callApi unsupported.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
src/evaluator.ts (3)
259-275
: Remove nonexistent parameter allTests from RunEvalOptions
RunEvalOptions (types/index.ts) does not define allTests. Destructuring it here will fail type-checking when callers pass strictly typed options.
Apply this diff:
export async function runEval({
  provider,
  prompt, // raw prompt
  test,
  delay,
  nunjucksFilters: filters,
  evaluateOptions,
  testIdx,
  promptIdx,
  conversations,
  registers,
  isRedteam,
- allTests,
  concurrency,
  abortSignal,
}: RunEvalOptions): Promise<EvaluateResult[]> {
340-355
: Honor test-level provider configs (load raw provider before use)
activeProvider falls back to the suite provider unless test.provider is already an ApiProvider. For Simba tests created with a raw provider config, this prevents the Simba path from ever triggering.
Apply this diff to load a raw provider config:
- const activeProvider = isApiProvider(test.provider) ? test.provider : provider;
+ let activeProvider = isApiProvider(test.provider) ? test.provider : provider;
+ // If test.provider is a raw config object, load it
+ if (!isApiProvider(test.provider) && typeof test.provider === 'object' && (test.provider as any)?.id) {
+   const { loadApiProvider } = await import('./providers');
+   const providerId =
+     typeof (test.provider as any).id === 'function'
+       ? (test.provider as any).id()
+       : (test.provider as any).id;
+   activeProvider = await loadApiProvider(providerId, {
+     options: test.provider as ProviderOptions,
+   });
+ }
1049-1090
: Remove allTests from RunEvalOptions objects
All occurrences of the allTests property in src/evaluator.ts (around lines 271 and 1086) must be deleted; it isn't part of RunEvalOptions and causes TS errors.
- allTests: runEvalOptions,
🧹 Nitpick comments (8)
src/redteam/constants/strategies.ts (1)
74-76
: Consider alphabetical ordering for maintainability.
The ADDITIONAL_STRATEGIES array appears to mix sorted and unsorted entries. Placing 'simba' between 'retry' and 'rot13' breaks alphabetical order ('simba' should come after 'rot13'). Consider maintaining alphabetical sorting throughout the array for easier maintenance and to prevent duplication.
  'retry',
- 'simba',
  'rot13',
+ 'simba',
  'video',
src/redteam/util.ts (1)
290-309
: Consider simplifying with a single return statement.
The function logic is correct but can be more concise. The two separate if-checks with early returns can be combined into a single boolean expression.
-export function isSimbaTestCase(evalOptions: RunEvalOptions): boolean {
-  // Check if provider is Simba
-  if (evalOptions.provider.id() === 'promptfoo:redteam:simba') {
-    return true;
-  }
-
-  // Check if test metadata indicates Simba strategy
-  if (evalOptions.test.metadata?.strategyId === 'simba') {
-    return true;
-  }
-
-  return false;
-}
+export function isSimbaTestCase(evalOptions: RunEvalOptions): boolean {
+  return (
+    evalOptions.provider.id() === 'promptfoo:redteam:simba' ||
+    evalOptions.test.metadata?.strategyId === 'simba'
+  );
+}
src/redteam/strategies/index.ts (1)
271-279
: Ensure strategy id uniqueness + consider structured logs
- Add a guard to prevent duplicate strategy ids at startup to avoid ambiguous routing.
- Prefer structured logging for new entries: logger.debug('Adding Simba test cases', { count: testCases.length }).
Example uniqueness check (outside this block, in validateStrategies):
const ids = Strategies.map(s => s.id);
const dupes = ids.filter((id, i) => ids.indexOf(id) !== i);
if (dupes.length) {
  throw new Error(`Duplicate strategy id(s): ${Array.from(new Set(dupes)).join(', ')}`);
}
src/evaluator.ts (1)
1448-1476
: Simba routing split looks good; prefer structured logs
The 3-way split is sound. For logs, pass context instead of interpolating vars.
Example:
logger.info('Running Simba test cases sequentially after normal tests', {
  count: simbaRunEvalOptions.length,
});
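For reference, the three-way routing this nitpick refers to amounts to roughly the sketch below; isSimbaTestCase comes from src/redteam/util.ts, while the exact variable names and the runSerially check are assumptions about the PR's code rather than verified excerpts.
// Classify eval steps into Simba, serial, and concurrent buckets (sketch only).
const simbaRunEvalOptions = runEvalOptions.filter((step) => isSimbaTestCase(step));
const serialRunEvalOptions = runEvalOptions.filter(
  (step) => !isSimbaTestCase(step) && step.test.options?.runSerially,
);
const concurrentRunEvalOptions = runEvalOptions.filter(
  (step) => !isSimbaTestCase(step) && !step.test.options?.runSerially,
);
// Simba cases then run sequentially after the normal phases via provider.runSimba.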
src/redteam/constants/metadata.ts (1)
670-671
: Alias may not be safe as a metric key.
categoryAliases is used as "metric name or harm category." "Simba (beta)" includes space/parentheses; consider a sanitized alias like "SimbaBeta" to avoid downstream selector/metric key issues. Several existing aliases contain spaces, but those have historically caused friction in dashboards.
src/redteam/providers/simba.ts (3)
104-105
: Use structured logging; avoid stringified config.
Prefer logger.debug('...', { config: this.config }) to align with structured/PII-safe logging.
- logger.debug(`[Simba] Constructor options: ${JSON.stringify(this.config)}`);
+ logger.debug('[Simba] Constructor options', { config: this.config });
84-106
: Constructor invariant good; consider exposing getSessionId().
Since you maintain a session, exposing an optional getSessionId helps with observability and parity with other providers.
export default class SimbaProvider implements ApiProvider {
@@
  private sessionId: string | null = null;
@@
+ getSessionId() {
+   return this.sessionId;
+ }
139-174
: Consider accepting context in startSession to source purpose from test metadata.
You hinted at this in comments. Wiring context-derived purpose reduces config friction and improves test reporting.
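A minimal sketch of that idea, folding in the defaults suggested later in this review; reading goal/purpose from context.test.metadata is an assumption about the available fields, not the PR's implementation.
private async startSession(context?: CallApiContextParams): Promise<string> {
  const email = getUserEmail() || 'demo@promptfoo.dev';
  const startRequest: SimbaStartRequest = {
    targetInfo: {
      // Prefer explicit config, then test metadata carried on the call context (assumed shape)
      goal: this.config.goal ?? context?.test?.metadata?.goal ?? 'Identify and execute adversarial test cases',
      purpose: this.config.purpose ?? context?.test?.metadata?.purpose ?? 'General adversarial evaluation',
      additionalAttackInstructions: this.config.additionalInstructions,
    },
    config: {
      maxConversationRounds: this.config.maxRounds ?? 20,
      maxAttackVectors: this.config.maxVectors ?? 5,
    },
    email,
  };
  const response: SimbaStartResponse = await this.callSimbaApi('/start', startRequest);
  return response.sessionId;
}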
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
src/evaluator.ts (6 hunks)
src/providers/registry.ts (2 hunks)
src/redteam/constants/metadata.ts (7 hunks)
src/redteam/constants/plugins.ts (1 hunks)
src/redteam/constants/strategies.ts (1 hunks)
src/redteam/providers/simba.ts (1 hunks)
src/redteam/remoteGeneration.ts (1 hunks)
src/redteam/strategies/index.ts (2 hunks)
src/redteam/strategies/simba.ts (1 hunks)
src/redteam/util.ts (2 hunks)
src/types/index.ts (0 hunks)
💤 Files with no reviewable changes (1)
- src/types/index.ts
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{ts,tsx}
📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)
Prefer not to introduce new TypeScript types; use existing interfaces whenever possible
**/*.{ts,tsx}
: Use TypeScript with strict type checking
Use consistent error handling with proper type checks
Always sanitize sensitive data before logging
Use logger methods with a structured context object (second parameter) instead of interpolating potentially sensitive data into log strings
Use sanitizeObject for non-logging contexts before persisting or transmitting potentially sensitive data
Files:
src/evaluator.ts
src/providers/registry.ts
src/redteam/remoteGeneration.ts
src/redteam/providers/simba.ts
src/redteam/constants/plugins.ts
src/redteam/util.ts
src/redteam/strategies/index.ts
src/redteam/constants/strategies.ts
src/redteam/strategies/simba.ts
src/redteam/constants/metadata.ts
**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.{ts,tsx,js,jsx}
: Follow consistent import order (Biome will sort imports)
Use consistent curly braces for all control statements
Prefer const over let; avoid var
Use object shorthand syntax where possible
Use async/await for asynchronous code
Files:
src/evaluator.ts
src/providers/registry.ts
src/redteam/remoteGeneration.ts
src/redteam/providers/simba.ts
src/redteam/constants/plugins.ts
src/redteam/util.ts
src/redteam/strategies/index.ts
src/redteam/constants/strategies.ts
src/redteam/strategies/simba.ts
src/redteam/constants/metadata.ts
src/**
📄 CodeRabbit inference engine (CLAUDE.md)
Place core application logic in src/
Files:
src/evaluator.ts
src/providers/registry.ts
src/redteam/remoteGeneration.ts
src/redteam/providers/simba.ts
src/redteam/constants/plugins.ts
src/redteam/util.ts
src/redteam/strategies/index.ts
src/redteam/constants/strategies.ts
src/redteam/strategies/simba.ts
src/redteam/constants/metadata.ts
🧬 Code graph analysis (6)
src/evaluator.ts (2)
- src/types/index.ts (2)
  - EvaluateResult (263-284)
  - RunEvalOptions (136-160)
- src/redteam/util.ts (1)
  - isSimbaTestCase (297-309)
src/providers/registry.ts (2)
- src/types/providers.ts (1)
  - ProviderOptions (39-47)
- src/types/index.ts (1)
  - LoadApiProviderContext (1199-1203)
src/redteam/providers/simba.ts (6)
- src/types/providers.ts (3)
  - ApiProvider (79-96)
  - CallApiContextParams (49-69)
  - CallApiOptionsParams (71-77)
- src/redteam/remoteGeneration.ts (1)
  - buildRemoteUrl (28-50)
- src/logger.ts (1)
  - logRequestResponse (405-441)
- src/globalConfig/accounts.ts (1)
  - getUserEmail (26-29)
- src/types/index.ts (1)
  - EvaluateResult (263-284)
- src/util/tokenUsageUtils.ts (1)
  - createEmptyTokenUsage (31-41)
src/redteam/util.ts (1)
- src/types/index.ts (1)
  - RunEvalOptions (136-160)
src/redteam/strategies/index.ts (1)
- src/redteam/strategies/simba.ts (1)
  - addSimbaTestCases (4-41)
src/redteam/strategies/simba.ts (1)
- src/types/index.ts (2)
  - TestCaseWithPlugin (701-701)
  - TestCase (699-699)
🪛 GitHub Check: Build on Node 20.x
src/redteam/providers/simba.ts
[failure] 390-390:
Type 'undefined' is not assignable to type '{ prompt?: number | undefined; completion?: number | undefined; cached?: number | undefined; total?: number | undefined; numRequests?: number | undefined; completionDetails?: { reasoning?: number | undefined; acceptedPrediction?: number | undefined; rejectedPrediction?: number | undefined; } | undefined; }'.
[failure] 389-389:
Type 'undefined' is not assignable to type '{ reasoning?: number | undefined; acceptedPrediction?: number | undefined; rejectedPrediction?: number | undefined; }'.
[failure] 381-381:
Type 'string' is not assignable to type 'ResultFailureReason'.
🪛 GitHub Check: Build on Node 22.x
src/redteam/providers/simba.ts
[failure] 390-390:
Type 'undefined' is not assignable to type '{ prompt?: number | undefined; completion?: number | undefined; cached?: number | undefined; total?: number | undefined; numRequests?: number | undefined; completionDetails?: { reasoning?: number | undefined; acceptedPrediction?: number | undefined; rejectedPrediction?: number | undefined; } | undefined; }'.
[failure] 389-389:
Type 'undefined' is not assignable to type '{ reasoning?: number | undefined; acceptedPrediction?: number | undefined; rejectedPrediction?: number | undefined; }'.
[failure] 381-381:
Type 'string' is not assignable to type 'ResultFailureReason'.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: Test on Node 24.x and windows-latest
- GitHub Check: Test on Node 22.x and windows-latest
- GitHub Check: Test on Node 20.x and macOS-latest
- GitHub Check: Test on Node 22.x and macOS-latest
- GitHub Check: Test on Node 20.x and ubuntu-latest
- GitHub Check: Test on Node 20.x and windows-latest
- GitHub Check: Test on Node 24.x and ubuntu-latest
- GitHub Check: Build Docs
- GitHub Check: webui tests
- GitHub Check: Run Integration Tests
- GitHub Check: Generate Assets
- GitHub Check: Style Check
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (13)
src/redteam/constants/plugins.ts (1)
303-303
: LGTM!
The 'simba' plugin is correctly added to ADDITIONAL_PLUGINS in alphabetical order.
src/providers/registry.ts (2)
15-15
: LGTM!
Import follows the established pattern for other redteam providers.
1175-1184
: LGTM!
The provider factory registration follows the same pattern as other redteam providers (e.g., crescendo, goat, custom) and is correctly placed among similar redteam provider registrations.
src/redteam/remoteGeneration.ts (1)
28-50
: LGTM!
Exporting buildRemoteUrl enables its reuse by the Simba provider for constructing API endpoints. The function logic remains unchanged and follows the established pattern for URL construction with proper fallback handling.
src/redteam/strategies/index.ts (1)
29-29
: Import looks good
Importing addSimbaTestCases is correct and scoped to strategies.
src/redteam/strategies/simba.ts (1)
11-15
: Early return on empty input is fine
Returning [] when there are no test cases is appropriate.
src/evaluator.ts (1)
1511-1531
: Sequential Simba phase: OK; confirm runSimba returns fully-formed EvaluateResult[]
Ensure runSimba populates tokenUsage, metadata, and gradingResult on each EvaluateResult for downstream stats and writers.
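For concreteness, each element returned by runSimba should carry the same fields runEval builds for a normal row; a sketch follows, where graded, simbaResponse, and sessionId are placeholders rather than names taken from the PR.
const result: EvaluateResult = {
  provider: { id: this.id(), label: this.label },
  prompt,
  vars: test.vars || {},
  response: simbaResponse,
  success: graded.pass,
  failureReason: graded.pass ? ResultFailureReason.NONE : ResultFailureReason.ASSERT,
  score: graded.score,
  namedScores: graded.namedScores || {},
  gradingResult: graded, // downstream stats and writers read this
  tokenUsage: createEmptyTokenUsage(), // or accumulated usage from the Simba session
  metadata: { ...test.metadata, sessionId },
  latencyMs,
  promptIdx,
  testIdx,
  testCase: test,
  promptId: prompt.id || '',
};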
src/redteam/constants/metadata.ts (5)
149-155
: Add looks consistent; no blocking issues.
The subCategoryDescriptions entry for simba is clear and matches surrounding style.
290-291
: Display name OK.
"Simba (beta)" reads well for feature-flagged UI.
813-815
: Plugin description LGTM.
Concise and consistent with other entries.
850-852
: Strategy description LGTM.
Aligns with provider behavior.
885-887
: Strategy display name OK.
"Red teamer (beta)" is clear.
src/redteam/providers/simba.ts (1)
367-394
: Use the correct enum member and replace manual tokenUsage in the error path
- failureReason: replace the string 'provider_error' with ResultFailureReason.ERROR
- tokenUsage: call createEmptyTokenUsage() instead of inlining a zeroed object

- failureReason: 'provider_error',
+ failureReason: ResultFailureReason.ERROR,
- tokenUsage: {
-   total: 0,
-   prompt: 0,
-   completion: 0,
-   cached: 0,
-   numRequests: 0,
-   completionDetails: undefined,
-   assertions: undefined,
- },
+ tokenUsage: createEmptyTokenUsage(),

Likely an incorrect or invalid review comment.
abortSignal ? { abortSignal } : undefined, | ||
concurrency, | ||
); | ||
|
||
// Update results with proper indices for Simba | ||
for (const result of simbaResults) { | ||
result.promptIdx = promptIdx; | ||
result.testIdx = testIdx++; | ||
} | ||
|
||
return simbaResults; | ||
} else { | ||
throw new Error('Simba provider does not have runSimba method'); | ||
} | ||
} else { | ||
response = await activeProvider.callApi( | ||
renderedPrompt, | ||
callApiContext, | ||
abortSignal ? { abortSignal } : undefined, | ||
); | ||
} | ||
|
||
logger.debug(`Provider response properties: ${Object.keys(response).join(', ')}`); | ||
logger.debug(`Provider response cached property explicitly: ${response.cached}`); | ||
} | ||
const endTime = Date.now(); | ||
latencyMs = endTime - startTime; | ||
|
||
let conversationLastInput = undefined; | ||
if (renderedJson && Array.isArray(renderedJson)) { | ||
const lastElt = renderedJson[renderedJson.length - 1]; | ||
// Use the `content` field if present (OpenAI chat format) | ||
conversationLastInput = lastElt?.content || lastElt; | ||
} | ||
if (conversations) { | ||
conversations[conversationKey] = conversations[conversationKey] || []; | ||
conversations[conversationKey].push({ | ||
prompt: renderedJson || renderedPrompt, | ||
input: conversationLastInput || renderedJson || renderedPrompt, | ||
output: response.output || '', | ||
metadata: response.metadata, | ||
}); | ||
} | ||
|
||
logger.debug(`Evaluator response = ${JSON.stringify(response).substring(0, 100)}...`); | ||
logger.debug( | ||
`Evaluator checking cached flag: response.cached = ${Boolean(response.cached)}, provider.delay = ${provider.delay}`, | ||
); | ||
|
||
if (!response.cached && provider.delay > 0) { | ||
logger.debug(`Sleeping for ${provider.delay}ms`); | ||
await sleep(provider.delay); | ||
} else if (response.cached) { | ||
logger.debug(`Skipping delay because response is cached`); | ||
} | ||
|
||
const ret: EvaluateResult = { | ||
...setup, | ||
response, | ||
success: false, | ||
failureReason: ResultFailureReason.NONE, | ||
score: 0, | ||
namedScores: {}, | ||
latencyMs, | ||
cost: response.cost, | ||
metadata: { | ||
...test.metadata, | ||
...response.metadata, | ||
[FILE_METADATA_KEY]: fileMetadata, | ||
// Add session information to metadata | ||
...(() => { | ||
// If sessionIds array exists from iterative providers, use it | ||
if (test.metadata?.sessionIds) { | ||
return { sessionIds: test.metadata.sessionIds }; | ||
} | ||
|
||
// Otherwise, use single sessionId (prioritize response over vars) | ||
if (response.sessionId) { | ||
return { sessionId: response.sessionId }; | ||
} | ||
|
||
// Check if vars.sessionId is a valid string | ||
const varsSessionId = vars.sessionId; | ||
if (typeof varsSessionId === 'string' && varsSessionId.trim() !== '') { | ||
return { sessionId: varsSessionId }; | ||
} | ||
|
||
return {}; | ||
})(), | ||
}, | ||
promptIdx, | ||
testIdx, | ||
testCase: test, | ||
promptId: prompt.id || '', | ||
tokenUsage: createEmptyTokenUsage(), | ||
}; | ||
|
||
invariant(ret.tokenUsage, 'This is always defined, just doing this to shut TS up'); | ||
|
||
// Track token usage at the provider level | ||
if (response.tokenUsage) { | ||
const providerId = provider.id(); | ||
const trackingId = provider.constructor?.name | ||
? `${providerId} (${provider.constructor.name})` | ||
: providerId; | ||
TokenUsageTracker.getInstance().trackUsage(trackingId, response.tokenUsage); | ||
} | ||
|
||
if (response.error) { | ||
ret.error = response.error; | ||
ret.failureReason = ResultFailureReason.ERROR; | ||
ret.success = false; | ||
} else if (response.output === null || response.output === undefined) { | ||
// NOTE: empty output often indicative of guardrails, so behavior differs for red teams. | ||
if (isRedteam) { | ||
ret.success = true; | ||
} else { | ||
ret.success = false; | ||
ret.score = 0; | ||
ret.error = 'No output'; | ||
} | ||
} else { | ||
// Create a copy of response so we can potentially mutate it. | ||
const processedResponse = { ...response }; | ||
|
||
// Apply provider transform first (if exists) | ||
if (provider.transform) { | ||
processedResponse.output = await transform(provider.transform, processedResponse.output, { | ||
vars, | ||
prompt, | ||
}); | ||
} | ||
|
||
// Store the provider-transformed output for assertions (contextTransform) | ||
const providerTransformedOutput = processedResponse.output; | ||
|
||
// Apply test transform (if exists) | ||
const testTransform = test.options?.transform || test.options?.postprocess; | ||
if (testTransform) { | ||
processedResponse.output = await transform(testTransform, processedResponse.output, { | ||
vars, | ||
prompt, | ||
...(response && response.metadata && { metadata: response.metadata }), | ||
}); | ||
} | ||
|
||
invariant(processedResponse.output != null, 'Response output should not be null'); | ||
|
||
// Extract traceId from traceparent if available | ||
let traceId: string | undefined; | ||
if (traceContext?.traceparent) { | ||
// traceparent format: version-traceId-spanId-flags | ||
const parts = traceContext.traceparent.split('-'); | ||
if (parts.length >= 3) { | ||
traceId = parts[1]; | ||
} | ||
} | ||
|
||
// Pass providerTransformedOutput for contextTransform to use | ||
const checkResult = await runAssertions({ | ||
prompt: renderedPrompt, | ||
provider, | ||
providerResponse: { | ||
...processedResponse, | ||
// Add provider-transformed output for contextTransform | ||
providerTransformedOutput, | ||
}, | ||
test, | ||
latencyMs: response.cached ? undefined : latencyMs, | ||
assertScoringFunction: test.assertScoringFunction as ScoringFunction, | ||
traceId, | ||
}); | ||
|
||
if (!checkResult.pass) { | ||
ret.error = checkResult.reason; | ||
ret.failureReason = ResultFailureReason.ASSERT; | ||
} | ||
ret.success = checkResult.pass; | ||
ret.score = checkResult.score; | ||
ret.namedScores = checkResult.namedScores || {}; | ||
// Track assertion request count | ||
if (!ret.tokenUsage.assertions) { | ||
ret.tokenUsage.assertions = createEmptyAssertions(); | ||
} | ||
ret.tokenUsage.assertions.numRequests = (ret.tokenUsage.assertions.numRequests ?? 0) + 1; | ||
|
||
// Track assertion token usage if provided | ||
if (checkResult.tokensUsed) { | ||
accumulateAssertionTokenUsage(ret.tokenUsage.assertions, checkResult.tokensUsed); | ||
} | ||
ret.response = processedResponse; | ||
ret.gradingResult = checkResult; | ||
} | ||
|
||
// Update token usage stats | ||
if (response.tokenUsage) { | ||
accumulateResponseTokenUsage(ret.tokenUsage, response); | ||
} | ||
|
||
if (test.options?.storeOutputAs && ret.response?.output && registers) { | ||
// Save the output in a register for later use | ||
registers[test.options.storeOutputAs] = ret.response.output; | ||
} | ||
|
||
return [ret]; | ||
} catch (err) { | ||
return [ | ||
{ | ||
...setup, | ||
error: String(err) + '\n\n' + (err as Error).stack, | ||
success: false, | ||
failureReason: ResultFailureReason.ERROR, | ||
score: 0, | ||
namedScores: {}, | ||
latencyMs, | ||
promptIdx, | ||
testIdx, | ||
testCase: test, | ||
promptId: prompt.id || '', | ||
}, | ||
]; | ||
} | ||
} | ||
|
||
/** | ||
* Safely formats variables for display in progress bars and logs. | ||
* Handles extremely large variables that could cause RangeError crashes. | ||
* | ||
* @param vars - Variables to format | ||
* @param maxLength - Maximum length of the final formatted string | ||
* @returns Formatted variables string or fallback message | ||
*/ | ||
export function formatVarsForDisplay( | ||
vars: Record<string, any> | undefined, | ||
maxLength: number, | ||
): string { | ||
if (!vars || Object.keys(vars).length === 0) { | ||
return ''; | ||
} | ||
|
||
try { | ||
// Simple approach: limit individual values, then truncate the whole result | ||
const formatted = Object.entries(vars) | ||
.map(([key, value]) => { | ||
// Prevent memory issues by limiting individual values first | ||
const valueStr = String(value).slice(0, 100); | ||
return `${key}=${valueStr}`; | ||
}) | ||
.join(' ') | ||
.replace(/\n/g, ' ') | ||
.slice(0, maxLength); | ||
|
||
return formatted; | ||
} catch { | ||
// Any error - return safe fallback | ||
return '[vars unavailable]'; | ||
} | ||
} | ||
|
||
export function generateVarCombinations( | ||
vars: Record<string, string | string[] | any>, | ||
): Record<string, string | any[]>[] { | ||
const keys = Object.keys(vars); | ||
const combinations: Record<string, string | any[]>[] = [{}]; | ||
|
||
for (const key of keys) { | ||
let values: any[] = []; | ||
|
||
if (typeof vars[key] === 'string' && vars[key].startsWith('file://')) { | ||
const filePath = vars[key].slice('file://'.length); | ||
|
||
// For glob patterns, we need to resolve the base directory and use relative patterns | ||
const basePath = cliState.basePath || ''; | ||
const filePaths = | ||
globSync(filePath, { | ||
cwd: basePath || process.cwd(), | ||
windowsPathsNoEscape: true, | ||
}) || []; | ||
|
||
values = filePaths.map((path: string) => `file://${path}`); | ||
if (values.length === 0) { | ||
throw new Error( | ||
`No files found for variable ${key} at path ${filePath} in directory ${basePath || process.cwd()}`, | ||
); | ||
} | ||
} else { | ||
values = Array.isArray(vars[key]) ? vars[key] : [vars[key]]; | ||
} | ||
|
||
// Check if it's an array but not a string array | ||
if (Array.isArray(vars[key]) && typeof vars[key][0] !== 'string') { | ||
values = [vars[key]]; | ||
} | ||
|
||
const newCombinations: Record<string, any>[] = []; | ||
|
||
for (const combination of combinations) { | ||
for (const value of values) { | ||
newCombinations.push({ ...combination, [key]: value }); | ||
} | ||
} | ||
|
||
combinations.length = 0; | ||
combinations.push(...newCombinations); | ||
} | ||
|
||
return combinations; | ||
} | ||
|
||
class Evaluator { | ||
evalRecord: Eval; | ||
testSuite: TestSuite; | ||
options: EvaluateOptions; | ||
stats: EvaluateStats; | ||
conversations: EvalConversations; | ||
registers: EvalRegisters; | ||
fileWriters: JsonlFileWriter[]; | ||
|
||
constructor(testSuite: TestSuite, evalRecord: Eval, options: EvaluateOptions) { | ||
this.testSuite = testSuite; | ||
this.evalRecord = evalRecord; | ||
this.options = options; | ||
this.stats = { | ||
successes: 0, | ||
failures: 0, | ||
errors: 0, | ||
tokenUsage: createEmptyTokenUsage(), | ||
}; | ||
this.conversations = {}; | ||
this.registers = {}; | ||
|
||
const jsonlFiles = Array.isArray(evalRecord.config.outputPath) | ||
? evalRecord.config.outputPath.filter((p) => p.endsWith('.jsonl')) | ||
: evalRecord.config.outputPath?.endsWith('.jsonl') | ||
? [evalRecord.config.outputPath] | ||
: []; | ||
|
||
this.fileWriters = jsonlFiles.map((p) => new JsonlFileWriter(p)); | ||
} | ||
|
||
private async _runEvaluation(): Promise<Eval> { | ||
const { options } = this; | ||
let { testSuite } = this; | ||
|
||
const startTime = Date.now(); | ||
const maxEvalTimeMs = options.maxEvalTimeMs ?? getMaxEvalTimeMs(); | ||
let evalTimedOut = false; | ||
let globalTimeout: NodeJS.Timeout | undefined; | ||
let globalAbortController: AbortController | undefined; | ||
const processedIndices = new Set<number>(); | ||
|
||
// Progress reporters declared here for cleanup in finally block | ||
let ciProgressReporter: CIProgressReporter | null = null; | ||
let progressBarManager: ProgressBarManager | null = null; | ||
|
||
if (maxEvalTimeMs > 0) { | ||
globalAbortController = new AbortController(); | ||
options.abortSignal = options.abortSignal | ||
? AbortSignal.any([options.abortSignal, globalAbortController.signal]) | ||
: globalAbortController.signal; | ||
globalTimeout = setTimeout(() => { | ||
evalTimedOut = true; | ||
globalAbortController?.abort(); | ||
}, maxEvalTimeMs); | ||
} | ||
|
||
const vars = new Set<string>(); | ||
const checkAbort = () => { | ||
if (options.abortSignal?.aborted) { | ||
throw new Error('Operation cancelled'); | ||
} | ||
}; | ||
|
||
logger.info(`Starting evaluation ${this.evalRecord.id}`); | ||
|
||
// Add abort checks at key points | ||
checkAbort(); | ||
|
||
const prompts: CompletedPrompt[] = []; | ||
const assertionTypes = new Set<string>(); | ||
const rowsWithSelectBestAssertion = new Set<number>(); | ||
const rowsWithMaxScoreAssertion = new Set<number>(); | ||
|
||
const beforeAllOut = await runExtensionHook(testSuite.extensions, 'beforeAll', { | ||
suite: testSuite, | ||
}); | ||
testSuite = beforeAllOut.suite; | ||
|
||
if (options.generateSuggestions) { | ||
// TODO(ian): Move this into its own command/file | ||
logger.info(`Generating prompt variations...`); | ||
const { prompts: newPrompts, error } = await generatePrompts(testSuite.prompts[0].raw, 1); | ||
if (error || !newPrompts) { | ||
throw new Error(`Failed to generate prompts: ${error}`); | ||
} | ||
|
||
logger.info(chalk.blue('Generated prompts:')); | ||
let numAdded = 0; | ||
for (const prompt of newPrompts) { | ||
logger.info('--------------------------------------------------------'); | ||
logger.info(`${prompt}`); | ||
logger.info('--------------------------------------------------------'); | ||
|
||
// Ask the user if they want to continue | ||
const shouldTest = await promptYesNo('Do you want to test this prompt?', false); | ||
if (shouldTest) { | ||
testSuite.prompts.push({ raw: prompt, label: prompt }); | ||
numAdded++; | ||
} else { | ||
logger.info('Skipping this prompt.'); | ||
} | ||
} | ||
|
||
if (numAdded < 1) { | ||
logger.info(chalk.red('No prompts selected. Aborting.')); | ||
process.exitCode = 1; | ||
return this.evalRecord; | ||
} | ||
} | ||
|
||
// Split prompts by provider | ||
// Order matters - keep provider in outer loop to reduce need to swap models during local inference. | ||
|
||
// Create a map of existing prompts for resume support | ||
const existingPromptsMap = new Map<string, CompletedPrompt>(); | ||
if (cliState.resume && this.evalRecord.persisted && this.evalRecord.prompts.length > 0) { | ||
logger.debug('Resuming evaluation: preserving metrics from previous run'); | ||
for (const existingPrompt of this.evalRecord.prompts) { | ||
const key = `${existingPrompt.provider}:${existingPrompt.id}`; | ||
existingPromptsMap.set(key, existingPrompt); | ||
} | ||
} | ||
|
||
for (const provider of testSuite.providers) { | ||
for (const prompt of testSuite.prompts) { | ||
// Check if providerPromptMap exists and if it contains the current prompt's label | ||
const providerKey = provider.label || provider.id(); | ||
if (!isAllowedPrompt(prompt, testSuite.providerPromptMap?.[providerKey])) { | ||
continue; | ||
} | ||
|
||
const promptId = generateIdFromPrompt(prompt); | ||
const existingPromptKey = `${providerKey}:${promptId}`; | ||
const existingPrompt = existingPromptsMap.get(existingPromptKey); | ||
|
||
const completedPrompt = { | ||
...prompt, | ||
id: promptId, | ||
provider: providerKey, | ||
label: prompt.label, | ||
metrics: existingPrompt?.metrics || { | ||
score: 0, | ||
testPassCount: 0, | ||
testFailCount: 0, | ||
testErrorCount: 0, | ||
assertPassCount: 0, | ||
assertFailCount: 0, | ||
totalLatencyMs: 0, | ||
tokenUsage: createEmptyTokenUsage(), | ||
namedScores: {}, | ||
namedScoresCount: {}, | ||
cost: 0, | ||
}, | ||
}; | ||
prompts.push(completedPrompt); | ||
} | ||
} | ||
|
||
this.evalRecord.addPrompts(prompts); | ||
|
||
// Aggregate all vars across test cases | ||
let tests = | ||
testSuite.tests && testSuite.tests.length > 0 | ||
? testSuite.tests | ||
: testSuite.scenarios | ||
? [] | ||
: [ | ||
{ | ||
// Dummy test for cases when we're only comparing raw prompts. | ||
}, | ||
]; | ||
|
||
// Build scenarios and add to tests | ||
if (testSuite.scenarios && testSuite.scenarios.length > 0) { | ||
telemetry.record('feature_used', { | ||
feature: 'scenarios', | ||
}); | ||
for (const scenario of testSuite.scenarios) { | ||
for (const data of scenario.config) { | ||
// Merge defaultTest with scenario config | ||
const scenarioTests = ( | ||
scenario.tests || [ | ||
{ | ||
// Dummy test for cases when we're only comparing raw prompts. | ||
}, | ||
] | ||
).map((test) => { | ||
return { | ||
...(typeof testSuite.defaultTest === 'object' ? testSuite.defaultTest : {}), | ||
...data, | ||
...test, | ||
vars: { | ||
...(typeof testSuite.defaultTest === 'object' ? testSuite.defaultTest?.vars : {}), | ||
...data.vars, | ||
...test.vars, | ||
}, | ||
options: { | ||
...(typeof testSuite.defaultTest === 'object' | ||
? testSuite.defaultTest?.options | ||
: {}), | ||
...test.options, | ||
}, | ||
assert: [ | ||
// defaultTest.assert is omitted because it will be added to each test case later | ||
...(data.assert || []), | ||
...(test.assert || []), | ||
], | ||
metadata: { | ||
...(typeof testSuite.defaultTest === 'object' | ||
? testSuite.defaultTest?.metadata | ||
: {}), | ||
...data.metadata, | ||
...test.metadata, | ||
}, | ||
}; | ||
}); | ||
// Add scenario tests to tests | ||
tests = tests.concat(scenarioTests); | ||
} | ||
} | ||
} | ||
|
||
maybeEmitAzureOpenAiWarning(testSuite, tests); | ||
|
||
// Prepare vars | ||
const varNames: Set<string> = new Set(); | ||
const varsWithSpecialColsRemoved: Vars[] = []; | ||
const inputTransformDefault = | ||
typeof testSuite?.defaultTest === 'object' | ||
? testSuite?.defaultTest?.options?.transformVars | ||
: undefined; | ||
for (const testCase of tests) { | ||
testCase.vars = { | ||
...(typeof testSuite.defaultTest === 'object' ? testSuite.defaultTest?.vars : {}), | ||
...testCase?.vars, | ||
}; | ||
|
||
if (testCase.vars) { | ||
const varWithSpecialColsRemoved: Vars = {}; | ||
const inputTransformForIndividualTest = testCase.options?.transformVars; | ||
const inputTransform = inputTransformForIndividualTest || inputTransformDefault; | ||
if (inputTransform) { | ||
const transformContext: TransformContext = { | ||
prompt: {}, | ||
uuid: randomUUID(), | ||
}; | ||
const transformedVars: Vars = await transform( | ||
inputTransform, | ||
testCase.vars, | ||
transformContext, | ||
true, | ||
TransformInputType.VARS, | ||
); | ||
invariant( | ||
typeof transformedVars === 'object', | ||
'Transform function did not return a valid object', | ||
); | ||
testCase.vars = { ...testCase.vars, ...transformedVars }; | ||
} | ||
for (const varName of Object.keys(testCase.vars)) { | ||
varNames.add(varName); | ||
varWithSpecialColsRemoved[varName] = testCase.vars[varName]; | ||
} | ||
varsWithSpecialColsRemoved.push(varWithSpecialColsRemoved); | ||
} | ||
} | ||
|
||
// Set up eval cases | ||
const runEvalOptions: RunEvalOptions[] = []; | ||
let testIdx = 0; | ||
let concurrency = options.maxConcurrency || DEFAULT_MAX_CONCURRENCY; | ||
for (let index = 0; index < tests.length; index++) { | ||
const testCase = tests[index]; | ||
invariant( | ||
typeof testSuite.defaultTest !== 'object' || | ||
Array.isArray(testSuite.defaultTest?.assert || []), | ||
`defaultTest.assert is not an array in test case #${index + 1}`, | ||
); | ||
invariant( | ||
Array.isArray(testCase.assert || []), | ||
`testCase.assert is not an array in test case #${index + 1}`, | ||
); | ||
// Handle default properties | ||
testCase.assert = [ | ||
...(typeof testSuite.defaultTest === 'object' ? testSuite.defaultTest?.assert || [] : []), | ||
...(testCase.assert || []), | ||
]; | ||
testCase.threshold = | ||
testCase.threshold ?? | ||
(typeof testSuite.defaultTest === 'object' ? testSuite.defaultTest?.threshold : undefined); | ||
testCase.options = { | ||
...(typeof testSuite.defaultTest === 'object' ? testSuite.defaultTest?.options : {}), | ||
...testCase.options, | ||
}; | ||
testCase.metadata = { | ||
...(typeof testSuite.defaultTest === 'object' ? testSuite.defaultTest?.metadata : {}), | ||
...testCase.metadata, | ||
}; | ||
// If the test case doesn't have a provider, use the one from defaultTest | ||
// Note: defaultTest.provider may be a raw config object that needs to be loaded | ||
if ( | ||
!testCase.provider && | ||
typeof testSuite.defaultTest === 'object' && | ||
testSuite.defaultTest?.provider | ||
) { | ||
const defaultProvider = testSuite.defaultTest.provider; | ||
if (isApiProvider(defaultProvider)) { | ||
// Already loaded | ||
testCase.provider = defaultProvider; | ||
} else if (typeof defaultProvider === 'object' && defaultProvider.id) { | ||
// Raw config object - load it | ||
const { loadApiProvider } = await import('./providers'); | ||
const providerId = | ||
typeof defaultProvider.id === 'function' ? defaultProvider.id() : defaultProvider.id; | ||
testCase.provider = await loadApiProvider(providerId, { | ||
options: defaultProvider as ProviderOptions, | ||
}); | ||
} else { | ||
testCase.provider = defaultProvider; | ||
} | ||
} | ||
testCase.assertScoringFunction = | ||
testCase.assertScoringFunction || | ||
(typeof testSuite.defaultTest === 'object' | ||
? testSuite.defaultTest?.assertScoringFunction | ||
: undefined); | ||
|
||
if (typeof testCase.assertScoringFunction === 'string') { | ||
const { filePath: resolvedPath, functionName } = parseFileUrl( | ||
testCase.assertScoringFunction, | ||
); | ||
testCase.assertScoringFunction = await loadFunction<ScoringFunction>({ | ||
filePath: resolvedPath, | ||
functionName, | ||
}); | ||
} | ||
const prependToPrompt = | ||
testCase.options?.prefix || | ||
(typeof testSuite.defaultTest === 'object' ? testSuite.defaultTest?.options?.prefix : '') || | ||
''; | ||
const appendToPrompt = | ||
testCase.options?.suffix || | ||
(typeof testSuite.defaultTest === 'object' ? testSuite.defaultTest?.options?.suffix : '') || | ||
''; | ||
|
||
// Finalize test case eval | ||
const varCombinations = | ||
getEnvBool('PROMPTFOO_DISABLE_VAR_EXPANSION') || testCase.options?.disableVarExpansion | ||
? [testCase.vars] | ||
: generateVarCombinations(testCase.vars || {}); | ||
|
||
const numRepeat = this.options.repeat || 1; | ||
for (let repeatIndex = 0; repeatIndex < numRepeat; repeatIndex++) { | ||
for (const vars of varCombinations) { | ||
let promptIdx = 0; | ||
// Order matters - keep provider in outer loop to reduce need to swap models during local inference. | ||
for (const provider of testSuite.providers) { | ||
for (const prompt of testSuite.prompts) { | ||
const providerKey = provider.label || provider.id(); | ||
if (!isAllowedPrompt(prompt, testSuite.providerPromptMap?.[providerKey])) { | ||
continue; | ||
} | ||
runEvalOptions.push({ | ||
delay: options.delay || 0, | ||
provider, | ||
prompt: { | ||
...prompt, | ||
raw: prependToPrompt + prompt.raw + appendToPrompt, | ||
}, | ||
test: (() => { | ||
const baseTest = { | ||
...testCase, | ||
vars, | ||
options: testCase.options, | ||
}; | ||
// Only add tracing metadata fields if tracing is actually enabled | ||
const tracingEnabled = | ||
testCase.metadata?.tracingEnabled === true || | ||
testSuite.tracing?.enabled === true; | ||
|
||
if (tracingEnabled) { | ||
return { | ||
...baseTest, | ||
metadata: { | ||
...testCase.metadata, | ||
tracingEnabled: true, | ||
evaluationId: this.evalRecord.id, | ||
}, | ||
}; | ||
} | ||
return baseTest; | ||
})(), | ||
nunjucksFilters: testSuite.nunjucksFilters, | ||
testIdx, | ||
promptIdx, | ||
repeatIndex, | ||
evaluateOptions: options, | ||
conversations: this.conversations, | ||
registers: this.registers, | ||
isRedteam: testSuite.redteam != null, | ||
allTests: runEvalOptions, | ||
concurrency, | ||
abortSignal: options.abortSignal, | ||
}); | ||
promptIdx++; | ||
} | ||
} | ||
testIdx++; | ||
} | ||
} | ||
} | ||
// Pre-mark comparison rows before any filtering (used by resume logic) | ||
for (const evalOption of runEvalOptions) { | ||
if (evalOption.test.assert?.some((a) => a.type === 'select-best')) { | ||
rowsWithSelectBestAssertion.add(evalOption.testIdx); | ||
} | ||
if (evalOption.test.assert?.some((a) => a.type === 'max-score')) { | ||
rowsWithMaxScoreAssertion.add(evalOption.testIdx); | ||
} | ||
} | ||
|
||
// Resume support: if CLI is in resume mode, skip already-completed (testIdx,promptIdx) pairs | ||
if (cliState.resume && this.evalRecord.persisted) { | ||
try { | ||
const { default: EvalResult } = await import('./models/evalResult'); | ||
const completedPairs = await EvalResult.getCompletedIndexPairs(this.evalRecord.id); | ||
const originalCount = runEvalOptions.length; | ||
// Filter out steps that already exist in DB | ||
for (let i = runEvalOptions.length - 1; i >= 0; i--) { | ||
const step = runEvalOptions[i]; | ||
if (completedPairs.has(`${step.testIdx}:${step.promptIdx}`)) { | ||
runEvalOptions.splice(i, 1); | ||
} | ||
} | ||
const skipped = originalCount - runEvalOptions.length; | ||
if (skipped > 0) { | ||
logger.info(`Resuming: skipping ${skipped} previously completed cases`); | ||
} | ||
} catch (err) { | ||
logger.warn( | ||
`Resume: failed to load completed results. Running full evaluation. ${String(err)}`, | ||
); | ||
} | ||
} | ||
|
||
// Determine run parameters | ||
|
||
if (concurrency > 1) { | ||
const usesConversation = prompts.some((p) => p.raw.includes('_conversation')); | ||
const usesStoreOutputAs = tests.some((t) => t.options?.storeOutputAs); | ||
if (usesConversation) { | ||
logger.info( | ||
`Setting concurrency to 1 because the ${chalk.cyan('_conversation')} variable is used.`, | ||
); | ||
concurrency = 1; | ||
} else if (usesStoreOutputAs) { | ||
logger.info(`Setting concurrency to 1 because storeOutputAs is used.`); | ||
concurrency = 1; | ||
} | ||
} | ||
|
||
// Actually run the eval | ||
let numComplete = 0; | ||
|
||
const processEvalStep = async (evalStep: RunEvalOptions, index: number | string) => { | ||
if (typeof index !== 'number') { | ||
throw new Error('Expected index to be a number'); | ||
} | ||
|
||
const beforeEachOut = await runExtensionHook(testSuite.extensions, 'beforeEach', { | ||
test: evalStep.test, | ||
}); | ||
evalStep.test = beforeEachOut.test; | ||
|
||
const rows = await runEval(evalStep); | ||
|
||
for (const row of rows) { | ||
for (const varName of Object.keys(row.vars)) { | ||
vars.add(varName); | ||
} | ||
// Print token usage for model-graded assertions and add to stats | ||
if (row.gradingResult?.tokensUsed && row.testCase?.assert) { | ||
for (const assertion of row.testCase.assert) { | ||
if (MODEL_GRADED_ASSERTION_TYPES.has(assertion.type as AssertionType)) { | ||
const tokensUsed = row.gradingResult.tokensUsed; | ||
|
||
if (!this.stats.tokenUsage.assertions) { | ||
this.stats.tokenUsage.assertions = createEmptyAssertions(); | ||
} | ||
|
||
// Accumulate assertion tokens using the specialized assertion function | ||
accumulateAssertionTokenUsage(this.stats.tokenUsage.assertions, tokensUsed); | ||
|
||
break; | ||
} | ||
} | ||
} | ||
|
||
// capture metrics | ||
if (row.success) { | ||
this.stats.successes++; | ||
} else if (row.failureReason === ResultFailureReason.ERROR) { | ||
this.stats.errors++; | ||
} else { | ||
this.stats.failures++; | ||
} | ||
|
||
if (row.tokenUsage) { | ||
accumulateResponseTokenUsage(this.stats.tokenUsage, { tokenUsage: row.tokenUsage }); | ||
} | ||
|
||
if (evalStep.test.assert?.some((a) => a.type === 'select-best')) { | ||
rowsWithSelectBestAssertion.add(row.testIdx); | ||
} | ||
if (evalStep.test.assert?.some((a) => a.type === 'max-score')) { | ||
rowsWithMaxScoreAssertion.add(row.testIdx); | ||
} | ||
for (const assert of evalStep.test.assert || []) { | ||
if (assert.type) { | ||
assertionTypes.add(assert.type); | ||
} | ||
} | ||
|
||
numComplete++; | ||
|
||
try { | ||
await this.evalRecord.addResult(row); | ||
} catch (error) { | ||
const resultSummary = summarizeEvaluateResultForLogging(row); | ||
logger.error(`Error saving result: ${error} ${safeJsonStringify(resultSummary)}`); | ||
} | ||
|
||
for (const writer of this.fileWriters) { | ||
await writer.write(row); | ||
} | ||
|
||
const { promptIdx } = row; | ||
const metrics = prompts[promptIdx].metrics; | ||
invariant(metrics, 'Expected prompt.metrics to be set'); | ||
metrics.score += row.score; | ||
for (const [key, value] of Object.entries(row.namedScores)) { | ||
// Update named score value | ||
metrics.namedScores[key] = (metrics.namedScores[key] || 0) + value; | ||
|
||
// Count assertions contributing to this named score | ||
let contributingAssertions = 0; | ||
row.gradingResult?.componentResults?.forEach((result) => { | ||
if (result.assertion?.metric === key) { | ||
contributingAssertions++; | ||
} | ||
}); | ||
|
||
metrics.namedScoresCount[key] = | ||
(metrics.namedScoresCount[key] || 0) + (contributingAssertions || 1); | ||
} | ||
|
||
if (testSuite.derivedMetrics) { | ||
const math = await import('mathjs'); | ||
for (const metric of testSuite.derivedMetrics) { | ||
if (metrics.namedScores[metric.name] === undefined) { | ||
metrics.namedScores[metric.name] = 0; | ||
} | ||
try { | ||
if (typeof metric.value === 'function') { | ||
metrics.namedScores[metric.name] = metric.value(metrics.namedScores, evalStep); | ||
} else { | ||
const evaluatedValue = math.evaluate(metric.value, metrics.namedScores); | ||
metrics.namedScores[metric.name] = evaluatedValue; | ||
} | ||
} catch (error) { | ||
logger.debug( | ||
`Could not evaluate derived metric '${metric.name}': ${(error as Error).message}`, | ||
); | ||
} | ||
} | ||
} | ||
metrics.testPassCount += row.success ? 1 : 0; | ||
if (!row.success) { | ||
if (row.failureReason === ResultFailureReason.ERROR) { | ||
metrics.testErrorCount += 1; | ||
} else { | ||
metrics.testFailCount += 1; | ||
} | ||
} | ||
metrics.assertPassCount += | ||
row.gradingResult?.componentResults?.filter((r) => r.pass).length || 0; | ||
metrics.assertFailCount += | ||
row.gradingResult?.componentResults?.filter((r) => !r.pass).length || 0; | ||
metrics.totalLatencyMs += row.latencyMs || 0; | ||
accumulateResponseTokenUsage(metrics.tokenUsage, row.response); | ||
|
||
// Add assertion token usage to the metrics | ||
if (row.gradingResult?.tokensUsed) { | ||
updateAssertionMetrics(metrics, row.gradingResult.tokensUsed); | ||
} | ||
|
||
metrics.cost += row.cost || 0; | ||
|
||
await runExtensionHook(testSuite.extensions, 'afterEach', { | ||
test: evalStep.test, | ||
result: row, | ||
}); | ||
|
||
if (options.progressCallback) { | ||
options.progressCallback(numComplete, runEvalOptions.length, index, evalStep, metrics); | ||
} | ||
} | ||
}; | ||
|
||
// Add a wrapper function that implements timeout | ||
const processEvalStepWithTimeout = async (evalStep: RunEvalOptions, index: number | string) => { | ||
// Get timeout value from options or environment, defaults to 0 (no timeout) | ||
const timeoutMs = options.timeoutMs || getEvalTimeoutMs(); | ||
|
||
if (timeoutMs <= 0) { | ||
// No timeout, process normally | ||
return processEvalStep(evalStep, index); | ||
} | ||
|
||
// Create an AbortController to cancel the request if it times out | ||
const abortController = new AbortController(); | ||
const { signal } = abortController; | ||
|
||
// Add the abort signal to the evalStep | ||
const evalStepWithSignal = { | ||
...evalStep, | ||
abortSignal: signal, | ||
}; | ||
|
||
try { | ||
return await Promise.race([ | ||
processEvalStep(evalStepWithSignal, index), | ||
new Promise<void>((_, reject) => { | ||
const timeoutId = setTimeout(() => { | ||
// Abort any ongoing requests | ||
abortController.abort(); | ||
|
||
// If the provider has a cleanup method, call it | ||
if (typeof evalStep.provider.cleanup === 'function') { | ||
try { | ||
evalStep.provider.cleanup(); | ||
} catch (cleanupErr) { | ||
logger.warn(`Error during provider cleanup: ${cleanupErr}`); | ||
} | ||
} | ||
|
||
reject(new Error(`Evaluation timed out after ${timeoutMs}ms`)); | ||
|
||
// Clear the timeout to prevent memory leaks | ||
clearTimeout(timeoutId); | ||
}, timeoutMs); | ||
}), | ||
]); | ||
} catch (error) { | ||
// Create and add an error result for timeout | ||
const timeoutResult = { | ||
provider: { | ||
id: evalStep.provider.id(), | ||
label: evalStep.provider.label, | ||
config: evalStep.provider.config, | ||
}, | ||
prompt: { | ||
raw: evalStep.prompt.raw, | ||
label: evalStep.prompt.label, | ||
config: evalStep.prompt.config, | ||
}, | ||
vars: evalStep.test.vars || {}, | ||
error: `Evaluation timed out after ${timeoutMs}ms: ${String(error)}`, | ||
success: false, | ||
failureReason: ResultFailureReason.ERROR, // Using ERROR for timeouts | ||
score: 0, | ||
namedScores: {}, | ||
latencyMs: timeoutMs, | ||
promptIdx: evalStep.promptIdx, | ||
testIdx: evalStep.testIdx, | ||
testCase: evalStep.test, | ||
promptId: evalStep.prompt.id || '', | ||
}; | ||
|
||
// Add the timeout result to the evaluation record | ||
await this.evalRecord.addResult(timeoutResult); | ||
|
||
// Update stats | ||
this.stats.errors++; | ||
|
||
// Update prompt metrics | ||
const { metrics } = prompts[evalStep.promptIdx]; | ||
if (metrics) { | ||
metrics.testErrorCount += 1; | ||
metrics.totalLatencyMs += timeoutMs; | ||
} | ||
|
||
// Progress callback | ||
if (options.progressCallback) { | ||
options.progressCallback( | ||
numComplete, | ||
runEvalOptions.length, | ||
typeof index === 'number' ? index : 0, | ||
evalStep, | ||
metrics || { | ||
score: 0, | ||
testPassCount: 0, | ||
testFailCount: 0, | ||
testErrorCount: 1, | ||
assertPassCount: 0, | ||
assertFailCount: 0, | ||
totalLatencyMs: timeoutMs, | ||
tokenUsage: { | ||
total: 0, | ||
prompt: 0, | ||
completion: 0, | ||
cached: 0, | ||
numRequests: 0, | ||
}, | ||
namedScores: {}, | ||
namedScoresCount: {}, | ||
cost: 0, | ||
}, | ||
); | ||
} | ||
} | ||
}; | ||
|
||
// Set up progress tracking | ||
const originalProgressCallback = this.options.progressCallback; | ||
const isWebUI = Boolean(cliState.webUI); | ||
|
||
// Choose appropriate progress reporter | ||
logger.debug( | ||
`Progress bar settings: showProgressBar=${this.options.showProgressBar}, isWebUI=${isWebUI}`, | ||
); | ||
|
||
if (isCI() && !isWebUI) { | ||
// Use CI-friendly progress reporter | ||
ciProgressReporter = new CIProgressReporter(runEvalOptions.length); | ||
ciProgressReporter.start(); | ||
} else if (this.options.showProgressBar && process.stdout.isTTY) { | ||
// Use visual progress bars | ||
progressBarManager = new ProgressBarManager(isWebUI); | ||
} | ||
|
||
this.options.progressCallback = (completed, total, index, evalStep, metrics) => { | ||
if (originalProgressCallback) { | ||
originalProgressCallback(completed, total, index, evalStep, metrics); | ||
} | ||
|
||
if (isWebUI) { | ||
const provider = evalStep.provider.label || evalStep.provider.id(); | ||
const vars = formatVarsForDisplay(evalStep.test.vars, 50); | ||
logger.info(`[${numComplete}/${total}] Running ${provider} with vars: ${vars}`); | ||
} else if (progressBarManager) { | ||
// Progress bar update is handled by the manager | ||
const phase = evalStep.test.options?.runSerially ? 'serial' : 'concurrent'; | ||
progressBarManager.updateProgress(index, evalStep, phase); | ||
} else if (ciProgressReporter) { | ||
// CI progress reporter update | ||
ciProgressReporter.update(numComplete); | ||
} else { | ||
logger.debug(`Eval #${index + 1} complete (${numComplete} of ${runEvalOptions.length})`); | ||
} | ||
}; | ||
|
||
// Separate serial and concurrent eval options | ||
// Separate serial, concurrent, and Simba eval options |
Simba results must keep the same testIdx; don’t increment
Incrementing testIdx per Simba sub‑result breaks grouping by row (compare/select‑best/max‑score rely on stable testIdx).
Apply this diff:
- // Update results with proper indices for Simba
- for (const result of simbaResults) {
- result.promptIdx = promptIdx;
- result.testIdx = testIdx++;
- }
+ // Preserve row grouping across Simba sub-results
+ const baseTestIdx = testIdx;
+ for (const result of simbaResults) {
+ result.promptIdx = promptIdx;
+ result.testIdx = baseTestIdx;
+ }
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In src/evaluator.ts around lines 368 to 397, the code currently increments
testIdx for each Simba sub-result which breaks grouping; change the loop that
assigns indices so each simba result uses the same testIdx (set result.testIdx =
testIdx) and remove any testIdx++ inside that loop (do not mutate testIdx per
sub-result).
simba: Severity.Medium,
};
🧩 Analysis chain
Severity mapping added; verify category wiring.
Setting simba to Severity.Medium is fine, but simba is not listed in riskCategories. Without a category, UI grouping via categoryMapReverse will omit it. Please add simba to an appropriate top‑level category or exclude it from riskCategorySeverityMap if it’s meant to be strategy‑only.
🏁 Script executed:
#!/bin/bash
set -e
# Find definition of riskCategories
rg -n "export\s+const\s+riskCategories" -g "src/redteam/constants/metadata.ts"
# Find occurrences of riskCategorySeverityMap
rg -n "riskCategorySeverityMap" -g "src/redteam/constants/metadata.ts"
# Find all occurrences of simba in the file
rg -n "simba" -g "src/redteam/constants/metadata.ts"
Length of output: 777
🏁 Script executed:
#!/bin/bash
sed -n '400,460p' src/redteam/constants/metadata.ts
Length of output: 1454
Add “simba” to riskCategories or remove from riskCategorySeverityMap
The simba plugin is mapped to Severity.Medium but isn't included in the riskCategories object (src/redteam/constants/metadata.ts, from line 423), so it will be omitted from UI groupings. Add simba to the appropriate top-level category, or remove it from riskCategorySeverityMap if it's not intended for UI.
🤖 Prompt for AI Agents
In src/redteam/constants/metadata.ts around lines 420–421, the key "simba" is
present in riskCategorySeverityMap but missing from the riskCategories object
(around line 423), causing it to be omitted from UI groupings; either add
"simba" to the appropriate top-level category inside riskCategories (place it
under the same category used for other Medium-severity plugins) so UI groupings
include it, or if "simba" should not be exposed in the UI, remove the "simba"
entry from riskCategorySeverityMap instead—make the change so both maps remain
consistent.
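If the first option is taken, the fix is a one-line addition under an existing top-level category; a sketch follows, where the category name is a placeholder since the actual riskCategories keys aren't shown in this review.
export const riskCategories = {
  // ...other categories unchanged...
  'Some Existing Category': [
    // ...existing plugins in this category...
    'simba', // include so categoryMapReverse picks it up for UI grouping
  ],
};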
const url =
  buildRemoteUrl('/api/v1/simba', 'https://api.promptfoo.app/api/v1/simba') + endpoint;

const response = await fetch(url, {
  method,
  headers: {
    'Content-Type': 'application/json',
  },
  body: body ? JSON.stringify(body) : undefined,
});

if (!response.ok) {
  logRequestResponse({ url, requestMethod: 'POST', requestBody: body, response });
  throw new Error(`Simba API request failed: ${response.status} ${response.statusText}`);
}

return response.json();
}
Fix base URL null case and incorrect log method; mark error logs properly.
- buildRemoteUrl can return null; current concatenation yields "null..." URL.
- When logging a failed GET, requestMethod is hardcoded to 'POST'.
- Set error: true for logRequestResponse.
Apply this diff:
- const url =
- buildRemoteUrl('/api/v1/simba', 'https://api.promptfoo.app/api/v1/simba') + endpoint;
+ const base =
+ buildRemoteUrl('/api/v1/simba', 'https://api.promptfoo.app/api/v1/simba') ||
+ 'https://api.promptfoo.app/api/v1/simba';
+ const url = new URL(endpoint, base).toString();
@@
- if (!response.ok) {
- logRequestResponse({ url, requestMethod: 'POST', requestBody: body, response });
+ if (!response.ok) {
+ await logRequestResponse({
+ url,
+ requestMethod: method,
+ requestBody: body,
+ response,
+ error: true,
+ });
throw new Error(`Simba API request failed: ${response.status} ${response.statusText}`);
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
const url =
  buildRemoteUrl('/api/v1/simba', 'https://api.promptfoo.app/api/v1/simba') + endpoint;
const response = await fetch(url, {
  method,
  headers: {
    'Content-Type': 'application/json',
  },
  body: body ? JSON.stringify(body) : undefined,
});
if (!response.ok) {
  logRequestResponse({ url, requestMethod: 'POST', requestBody: body, response });
  throw new Error(`Simba API request failed: ${response.status} ${response.statusText}`);
}
return response.json();
}
// Ensure buildRemoteUrl fallback when it returns null
const base =
  buildRemoteUrl('/api/v1/simba', 'https://api.promptfoo.app/api/v1/simba') ||
  'https://api.promptfoo.app/api/v1/simba';
const url = new URL(endpoint, base).toString();
const response = await fetch(url, {
  method,
  headers: {
    'Content-Type': 'application/json',
  },
  body: body ? JSON.stringify(body) : undefined,
});
if (!response.ok) {
  // Log with the actual HTTP method and mark as an error
  await logRequestResponse({
    url,
    requestMethod: method,
    requestBody: body,
    response,
    error: true,
  });
  throw new Error(`Simba API request failed: ${response.status} ${response.statusText}`);
}
return response.json();
🤖 Prompt for AI Agents
In src/redteam/providers/simba.ts around lines 120 to 137, buildRemoteUrl can
return null which makes URL concatenation produce "null..." so guard that return
(e.g., use the fallback base or throw a clear error) and build the final url
safely; when logging failed requests use the actual method variable instead of
the hardcoded 'POST' (requestMethod: method) and mark the log as an error by
passing error: true into logRequestResponse; keep the existing response handling
and return response.json() after these fixes.
private async startSession(): Promise<string> {
  const email = (await getUserEmail()) || 'demo@promptfoo.dev';

  const startRequest: SimbaStartRequest = {
    targetInfo: {
      goal: this.config.goal!,
      purpose: this.config.purpose!,
      additionalAttackInstructions: this.config.additionalInstructions,
    },
    config: {
      maxConversationRounds: this.config.maxRounds!,
      maxAttackVectors: this.config.maxVectors!,
    },
    email,
  };

  const response: SimbaStartResponse = await this.callSimbaApi('/start', startRequest);
  logger.debug(`[Simba] Started session with ID: ${response.sessionId}`);
  return response.sessionId;
}
Avoid non‑null assertions for required fields; provide safe defaults.
purpose may be undefined; the non‑null assertion only silences TS, not runtime. Also getUserEmail is sync; no need to await.
- private async startSession(): Promise<string> {
- const email = (await getUserEmail()) || 'demo@promptfoo.dev';
+ private async startSession(): Promise<string> {
+ const email = getUserEmail() || 'demo@promptfoo.dev';
@@
- targetInfo: {
- goal: this.config.goal!,
- purpose: this.config.purpose!,
- additionalAttackInstructions: this.config.additionalInstructions,
- },
+ targetInfo: {
+ goal: this.config.goal || 'Identify and execute adversarial test cases',
+ purpose: this.config.purpose ?? 'General adversarial evaluation',
+ additionalAttackInstructions: this.config.additionalInstructions,
+ },
@@
- maxConversationRounds: this.config.maxRounds!,
- maxAttackVectors: this.config.maxVectors!,
+ maxConversationRounds: this.config.maxRounds ?? 20,
+ maxAttackVectors: this.config.maxVectors ?? 5,
🤖 Prompt for AI Agents
In src/redteam/providers/simba.ts around lines 139 to 158, remove the non‑null
assertions and make fields safe by supplying defaults and validating types: call
getUserEmail synchronously (drop the await) and default to 'demo@promptfoo.dev'
if falsy, ensure goal, purpose, and additionalInstructions are set to safe
strings (e.g., '' or a descriptive fallback) instead of using !, and ensure
numeric config fields maxRounds and maxVectors fall back to safe numbers if
undefined; then pass these validated/defaulted values into startRequest before
calling the API.
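One way to apply that advice beyond patching the call sites is to normalize the config once, for example in the provider constructor, so later code needs neither non-null assertions nor scattered ?? fallbacks. A minimal sketch follows; the interface name and the default values are assumptions for illustration, mirroring the fallbacks used in the diff above rather than documented defaults.

interface ResolvedSimbaConfig {
  goal: string;
  purpose: string;
  additionalInstructions?: string;
  maxRounds: number;
  maxVectors: number;
}

// Sketch: resolve optional fields up front so startSession can use them directly.
function resolveSimbaConfig(config: Partial<ResolvedSimbaConfig>): ResolvedSimbaConfig {
  return {
    goal: config.goal ?? 'Identify and execute adversarial test cases',
    purpose: config.purpose ?? 'General adversarial evaluation',
    additionalInstructions: config.additionalInstructions,
    maxRounds: config.maxRounds ?? 20,
    maxVectors: config.maxVectors ?? 5,
  };
}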
  async runSimba(
    prompt: string,
    context?: CallApiContextParams,
    options?: CallApiOptionsParams,
    concurrency?: number,
  ): Promise<EvaluateResult[]> {
    try {
🧩 Analysis chain
Hardcoded sessionId and commented core loop: runSimba currently ignores prompt/context.
Using a fixed GUID breaks real runs and test reproducibility. Initialize a session and fetch results using that id. Also avoid mutating output.messages with pop(); compute response safely. Use a stable promptId. Example minimal fix:
async runSimba(
@@
): Promise<EvaluateResult[]> {
try {
- const finalOutput = await this.getFinalOutput(`5501c726-0409-4192-b7cd-7457e6649788`);
+ if (!this.sessionId) {
+ this.sessionId = await this.startSession();
+ }
+ const sid = this.sessionId;
+ const finalOutput = await this.getFinalOutput(sid);
- return finalOutput.map((output, index) => {
- const lastUserMessage = output.messages.filter((message) => message.role === 'user').pop();
+ return finalOutput.map((output, index) => {
+ const lastUserMessage = [...output.messages].filter((m) => m.role === 'user').pop();
+ const lastAssistantMessage =
+ [...output.messages].reverse().find((m) => m.role === 'assistant') || null;
return {
@@
- promptId: `simba-${this.sessionId}-${index}`,
+ promptId: `simba-${sid}-${index}`,
@@
- response: {
- output: output.messages.pop()?.content || '',
+ response: {
+ output: lastAssistantMessage?.content || '',
tokenUsage: createEmptyTokenUsage(),
},
- success: !!output.result.success,
- score: output.result.success ? 0 : 1,
+ // Treat "attack success" as "test failure"
+ success: !output.result.success,
+ score: output.result.success ? 0 : 1, // verify scoring convention
@@
- failureReason: output.result.success
- ? ResultFailureReason.ASSERT
- : ResultFailureReason.NONE,
+ failureReason: output.result.success
+ ? ResultFailureReason.ASSERT
+ : ResultFailureReason.NONE,
@@
- gradingResult: {
- pass: output.result.success,
- score: output.result.success ? 0 : 1,
+ gradingResult: {
+ pass: !output.result.success,
+ score: output.result.success ? 0 : 1, // verify
reason: output.result.summary,
metadata: {
pluginId: 'simba',
strategyId: 'simba-attack',
},
},
namedScores: {
- attack_success: output.result.success ? 0 : 1,
+ attack_success: output.result.success ? 1 : 0,
},
tokenUsage: createEmptyTokenUsage(),
metadata: {
attackPlan: output.attackPlan,
result: output.result,
redteamHistory: (() => {
- const pairs = [];
- for (let i = 0; i < output.messages.length; i += 2) {
+ const pairs = [];
+ for (let i = 0; i < output.messages.length; i += 2) {
const userMessage = output.messages[i];
const assistantMessage = output.messages[i + 1];
if (userMessage?.role === 'user' && assistantMessage?.role === 'assistant') {
pairs.push({
prompt: userMessage.content,
output: assistantMessage.content,
});
}
}
return pairs;
})(),
dataExtracted: output.result.dataExtracted.join('\n'),
successfulJailbreaks: output.result.successfulJailbreaks.join('\n'),
},
};
});
Also applies to: 283-366
🏁 Script executed:
#!/bin/bash
# Locate the simba.ts file
find . -type f -name "simba.ts"
Length of output: 98
🏁 Script executed:
#!/bin/bash
# Show runSimba implementation with context
rg -nP -C5 'async runSimba' src/redteam/providers/simba.ts
Length of output: 246
🏁 Script executed:
#!/bin/bash
# Extract runSimba implementation for inspection
sed -n '170,260p' src/redteam/providers/simba.ts
Length of output: 3362
🏁 Script executed:
#!/bin/bash
# Extract continuation of runSimba implementation
sed -n '260,350p' src/redteam/providers/simba.ts
Length of output: 3459
Refactor runSimba in src/redteam/providers/simba.ts to use dynamic sessions and real prompt/context

- Replace the hardcoded GUID in getFinalOutput with something like `this.sessionId ??= await this.startSession(); const finalOutput = await this.getFinalOutput(this.sessionId);`.
- Re-enable the core loop to forward the prompt and context through callSimbaApi and targetProvider.callApi, instead of commenting it out.
- Don't mutate output.messages directly; use `[...output.messages]` or similar to extract lastUserMessage and lastAssistantMessage (see the sketch after this list).
- Invert scoring so that "attack success" (output.result.success) yields test failure (`success: !output.result.success`) and set namedScores.attack_success to 1 when the attack wins.
- Build promptId as `simba-${this.sessionId}-${index}` to guarantee stability.
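As referenced above, a small non-mutating helper keeps the extraction in one place. The function name is illustrative and the message shape is inferred from the code in this PR, so treat it as a sketch rather than the final implementation.

type SimbaMessage = { role: 'user' | 'assistant'; content: string };

// Sketch: walk backwards to find the last message for a role without mutating the array.
function lastMessageByRole(
  messages: SimbaMessage[],
  role: SimbaMessage['role'],
): SimbaMessage | undefined {
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i].role === role) {
      return messages[i];
    }
  }
  return undefined;
}

// In runSimba:
// const lastUserMessage = lastMessageByRole(output.messages, 'user');
// const lastAssistantMessage = lastMessageByRole(output.messages, 'assistant');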
🤖 Prompt for AI Agents
In src/redteam/providers/simba.ts around lines 170 to 176, runSimba currently
uses a hardcoded GUID and a commented-out core forwarding loop, mutates
output.messages directly, incorrectly maps attack success to pass, and builds
unstable prompt IDs; fix by initializing a dynamic session with this.sessionId
??= await this.startSession() and calling getFinalOutput(this.sessionId)
instead of the GUID, re-enabling the core loop to forward prompt and context
via callSimbaApi and targetProvider.callApi, copying output.messages (e.g.,
[...output.messages]) before extracting lastUserMessage/lastAssistantMessage
instead of mutating it, inverting the returned success so test success =
!output.result.success and setting namedScores.attack_success = 1 when
output.result.success is true, and constructing promptId as
`simba-${this.sessionId}-${index}` to ensure stability.
const originalText = String(baseTestCase.vars![injectVar]);
Guard against missing vars/injectVar to avoid a runtime TypeError
baseTestCase.vars may be undefined despite the non-null assertion; indexing it with [injectVar] can throw. Also, String(undefined) yields 'undefined'.
Apply this diff:
- const baseTestCase = testCases[0];
- const originalText = String(baseTestCase.vars![injectVar]);
+ const baseTestCase = testCases[0];
+ const vars = baseTestCase.vars ?? {};
+ const raw = (vars as Record<string, unknown>)[injectVar];
+ const originalText =
+ typeof raw === 'string' ? raw : raw != null ? String(raw) : '';
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
const baseTestCase = testCases[0];
const vars = baseTestCase.vars ?? {};
const raw = (vars as Record<string, unknown>)[injectVar];
const originalText =
  typeof raw === 'string' ? raw : raw != null ? String(raw) : '';
🤖 Prompt for AI Agents
In src/redteam/strategies/simba.ts around lines 18-19, guard against
baseTestCase.vars being undefined or missing the injectVar before converting to
a string: check that baseTestCase.vars && injectVar in baseTestCase.vars (or use
optional chaining baseTestCase.vars?.[injectVar]) and provide a safe default
(e.g., '') or throw a clear error if it must be present; then convert the
resulting value to a string (e.g., String(baseTestCase.vars?.[injectVar] ??
'')). Ensure types reflect the possibility of undefined so no runtime NPE or the
literal "undefined" string is produced.
provider: {
  id: 'promptfoo:redteam:simba',
  config: {
    injectVar,
    ...config,
  },
},
Sanitize provider.config before persisting
Spreading arbitrary config into provider.config risks persisting secrets (API keys, tokens) in results/DB.
Example (adjust to your sanitize helper):
- config: {
- injectVar,
- ...config,
- },
+ config: sanitizeObject({
+ injectVar,
+ ...config,
+ }),
If a sanitize helper isn’t available, whitelist allowed Simba config keys instead of spreading.
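For the whitelist route, a sketch along these lines would work; the key list below is an assumption based on the Simba config fields seen elsewhere in this PR (goal, purpose, additionalInstructions, maxRounds, maxVectors) and should be adjusted to whatever the strategy actually accepts.

// Sketch: copy a fixed set of known-safe Simba keys instead of spreading the whole config.
const ALLOWED_SIMBA_CONFIG_KEYS = [
  'goal',
  'purpose',
  'additionalInstructions',
  'maxRounds',
  'maxVectors',
] as const;

function pickSimbaConfig(config: Record<string, unknown>): Record<string, unknown> {
  const picked: Record<string, unknown> = {};
  for (const key of ALLOWED_SIMBA_CONFIG_KEYS) {
    if (config[key] !== undefined) {
      picked[key] = config[key];
    }
  }
  return picked;
}

// provider: {
//   id: 'promptfoo:redteam:simba',
//   config: { injectVar, ...pickSimbaConfig(config) },
// },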
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
provider: {
  id: 'promptfoo:redteam:simba',
  config: sanitizeObject({
    injectVar,
    ...config,
  }),
},
🤖 Prompt for AI Agents
In src/redteam/strategies/simba.ts around lines 23 to 29, provider.config
currently spreads arbitrary config (injectVar, ...config) which can persist
secrets; update the code to sanitize config before persisting, either by invoking
the project's sanitize helper on config and storing only the sanitized result,
or by replacing the spread with an explicit whitelist of allowed Simba config keys
(copy only safe keys into config). Ensure injectVar is retained if safe, and do
not include any secrets (API keys, tokens) in the persisted provider.config.
assert: baseTestCase.assert?.map((assertion) => ({
  ...assertion,
  metric: `${assertion.metric}/Simba`,
})),
Avoid producing "undefined/Simba" metrics
assertion.metric can be undefined; concatenation yields 'undefined/Simba'.
Apply this diff:
- assert: baseTestCase.assert?.map((assertion) => ({
- ...assertion,
- metric: `${assertion.metric}/Simba`,
- })),
+ assert: baseTestCase.assert?.map((assertion) => ({
+ ...assertion,
+ ...(assertion.metric
+ ? { metric: `${assertion.metric}/Simba` }
+ : {}),
+ })),
🤖 Prompt for AI Agents
In src/redteam/strategies/simba.ts around lines 30 to 33, the mapping
unconditionally concatenates "/Simba" to assertion.metric which can be undefined
and produce "undefined/Simba"; change the mapping to only append the suffix when
assertion.metric is defined (e.g. set metric to assertion.metric ?
`${assertion.metric}/Simba` : undefined or omit the metric property when
undefined) so metrics are not "undefined/Simba" and types are preserved.
812a67e to 3dbbded Compare
Adding our experimental advanced red teamer strategy.