feat: add OpenAI Guardrails redteam plugin #5892
Conversation
Add new dataset-based plugin to test jailbreak resistance using OpenAI's official guardrails evaluation dataset. The dataset contains 51 jailbreak prompts and 49 safe prompts from real-world jailbreak attempts.

Features:

- Tests role-playing attacks, system manipulation, and obfuscation techniques
- Includes `includeSafe` option for testing over-blocking (false positives)
- 50/50 balanced split of safe and jailbreak prompts when enabled
- Inverted grading logic for safe prompts (they SHOULD be answered)
- Dataset pinned to commit hash for stability

Implementation:

- Plugin: src/redteam/plugins/openaiGuardrails.ts
- Tests: test/redteam/plugins/openaiGuardrails.test.ts (19 tests, all passing)
- Documentation: site/docs/red-team/plugins/openai-guardrails.md
- UI configuration dialog with checkbox for the includeSafe option

Follows the same pattern as the beavertails/unsafebench/aegis plugins for consistency.
📝 Walkthrough

Adds a new "OpenAI Guardrails" red-team plugin across the codebase. Introduces plugin metadata in site/docs/_shared/data/plugins.ts and red-team constants, registers the plugin in src/redteam/plugins/index.ts, and implements it in src/redteam/plugins/openaiGuardrails.ts with dataset-driven test generation, safe/jailbreak filtering, balanced sampling via includeSafe, and rubric-based assertions. Adds UI configuration in PluginConfigDialog.tsx for toggling safe prompts. Provides documentation at site/docs/red-team/plugins/openai-guardrails.md. Includes comprehensive unit tests in test/redteam/plugins/openaiGuardrails.test.ts covering dataset parsing, balancing, assertions, metadata, error handling, and integration behavior.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 4
🧹 Nitpick comments (1)
src/redteam/plugins/openaiGuardrails.ts (1)
Lines 134-136: Consider using a more robust shuffling algorithm.

The current shuffle implementation uses `.sort(() => Math.random() - 0.5)`, which is a common pattern but has known bias issues and doesn't provide a uniform distribution. While acceptable for red-team test selection, a Fisher-Yates shuffle would be more robust.

```diff
-    selectedRows = [
-      ...safeRows.sort(() => Math.random() - 0.5).slice(0, numEach),
-      ...jailbreakRows.sort(() => Math.random() - 0.5).slice(0, numEach),
-    ].sort(() => Math.random() - 0.5); // Shuffle final order
+    // Fisher-Yates shuffle helper
+    const shuffle = <T>(array: T[]): T[] => {
+      const result = [...array];
+      for (let i = result.length - 1; i > 0; i--) {
+        const j = Math.floor(Math.random() * (i + 1));
+        [result[i], result[j]] = [result[j], result[i]];
+      }
+      return result;
+    };
+
+    selectedRows = shuffle([
+      ...shuffle(safeRows).slice(0, numEach),
+      ...shuffle(jailbreakRows).slice(0, numEach),
+    ]);
```
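As a standalone illustration (independent of the plugin's actual code), the Fisher-Yates shuffle the reviewer suggests can be written as:

```typescript
// Fisher-Yates (Knuth) shuffle: each permutation is equally likely,
// unlike the biased sort(() => Math.random() - 0.5) pattern.
function fisherYatesShuffle<T>(array: T[]): T[] {
  const result = [...array]; // copy so the input is left untouched
  for (let i = result.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1)); // 0 <= j <= i
    [result[i], result[j]] = [result[j], result[i]];
  }
  return result;
}

const shuffled = fisherYatesShuffle([1, 2, 3, 4, 5]);
console.log(shuffled.length); // 5 — same elements, possibly new order
```

The `sort`-based pattern is biased because comparison sorts assume a consistent comparator; a random comparator makes the outcome depend on the sort algorithm's internals.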
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)

- site/docs/_shared/data/plugins.ts (1 hunks)
- site/docs/red-team/plugins/openai-guardrails.md (1 hunks)
- src/app/src/pages/redteam/setup/components/PluginConfigDialog.tsx (2 hunks)
- src/redteam/constants/metadata.ts (6 hunks)
- src/redteam/constants/plugins.ts (2 hunks)
- src/redteam/plugins/index.ts (2 hunks)
- src/redteam/plugins/openaiGuardrails.ts (1 hunks)
- test/redteam/plugins/openaiGuardrails.test.ts (1 hunks)
🧰 Additional context used
📓 Path-based instructions (17)
**/*.{ts,tsx}
📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)
Prefer not to introduce new TypeScript types; use existing interfaces whenever possible
**/*.{ts,tsx}:

- Follow consistent import order (Biome will handle sorting)
- Use curly braces for all control statements
- Prefer const over let; avoid var
- Use object property shorthand when possible
- Use async/await for asynchronous code
- Use consistent error handling with proper type checks
Files:
src/redteam/constants/plugins.ts
src/redteam/constants/metadata.ts
src/redteam/plugins/openaiGuardrails.ts
site/docs/_shared/data/plugins.ts
src/app/src/pages/redteam/setup/components/PluginConfigDialog.tsx
src/redteam/plugins/index.ts
test/redteam/plugins/openaiGuardrails.test.ts
src/redteam/**/*.ts
📄 CodeRabbit inference engine (src/redteam/CLAUDE.md)
src/redteam/**/*.ts:

- Always sanitize when logging test prompts or model outputs by passing them via the structured metadata parameter (second argument) to the logger, not raw string interpolation
- Use the standardized risk severity levels: critical, high, medium, low when reporting results
Files:
src/redteam/constants/plugins.ts
src/redteam/constants/metadata.ts
src/redteam/plugins/openaiGuardrails.ts
src/redteam/plugins/index.ts
src/**/*.{ts,tsx}
📄 CodeRabbit inference engine (CLAUDE.md)
src/**/*.{ts,tsx}:

- Sanitize sensitive data before logging; pass context objects to logger methods (debug, info, warn, error) for automatic redaction
- Do not interpolate secrets into log messages (avoid stringifying headers/bodies directly); use structured logger context instead
- Use sanitizeObject for manual sanitization before using or persisting potentially sensitive data
Files:
src/redteam/constants/plugins.ts
src/redteam/constants/metadata.ts
src/redteam/plugins/openaiGuardrails.ts
src/app/src/pages/redteam/setup/components/PluginConfigDialog.tsx
src/redteam/plugins/index.ts
src/redteam/plugins/**/*.ts
📄 CodeRabbit inference engine (src/redteam/CLAUDE.md)
src/redteam/plugins/**/*.ts:

- Place vulnerability-specific test generators as plugins under src/redteam/plugins/ (e.g., pii.ts, harmful.ts, sql-injection.ts)
- New plugins must implement the RedteamPluginObject interface
Files:
src/redteam/plugins/openaiGuardrails.ts
src/redteam/plugins/index.ts
{site/**,examples/**}
📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)
Any pull request that only touches files in 'site/' or 'examples/' directories must use the 'docs:' prefix in the PR title, not 'feat:' or 'fix:'
Files:
site/docs/_shared/data/plugins.ts
site/docs/red-team/plugins/openai-guardrails.md
site/**
📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)
If the change is a feature, update the relevant documentation under 'site/'
Files:
site/docs/_shared/data/plugins.ts
site/docs/red-team/plugins/openai-guardrails.md
src/app/src/**/*.{ts,tsx}
📄 CodeRabbit inference engine (src/app/CLAUDE.md)
src/app/src/**/*.{ts,tsx}:

- Never use fetch() directly; always use callApi() from @app/utils/api for all HTTP requests
- Access Zustand state outside React components via store.getState(); do not call hooks outside components
- Use the @app/* path alias for internal imports as configured in Vite
Files:
src/app/src/pages/redteam/setup/components/PluginConfigDialog.tsx
src/app/src/{components,pages}/**/*.tsx
📄 CodeRabbit inference engine (src/app/CLAUDE.md)
src/app/src/{components,pages}/**/*.tsx:

- Use the class-based ErrorBoundary component (@app/components/ErrorBoundary) to wrap error-prone UI
- Access theme via useTheme() from @mui/material/styles instead of hardcoding theme values
- Use useMemo/useCallback only when profiling indicates benefit; avoid unnecessary memoization
- Implement explicit loading and error states for components performing async operations
- Prefer MUI composition and the sx prop for styling over ad-hoc inline styles
Files:
src/app/src/pages/redteam/setup/components/PluginConfigDialog.tsx
**/*.{tsx,jsx}
📄 CodeRabbit inference engine (.cursor/rules/react-components.mdc)
**/*.{tsx,jsx}:

- Use icons from @mui/icons-material
- Prefer commonly used icons from @mui/icons-material for intuitive experience
Files:
src/app/src/pages/redteam/setup/components/PluginConfigDialog.tsx
src/app/**/*.{ts,tsx}
📄 CodeRabbit inference engine (CLAUDE.md)
In the React app (src/app), use callApi from @app/utils/api for all API calls; do not use fetch directly
Files:
src/app/src/pages/redteam/setup/components/PluginConfigDialog.tsx
site/docs/**/*.md
📄 CodeRabbit inference engine (.cursor/rules/docusaurus.mdc)
site/docs/**/*.md:

- Prioritize minimal edits when updating existing documentation; avoid creating entirely new sections or rewriting substantial portions; focus edits on improving grammar, spelling, clarity, fixing typos, and structural improvements where needed; do not modify existing headings (h1, h2, h3, etc.) as they are often linked externally.
- Structure content to reveal information progressively: begin with essential actions and information, then provide deeper context as necessary; organize information from most important to least important.
- Use action-oriented language: clearly outline actionable steps users should take, use concise and direct language, prefer active voice over passive voice, and use imperative mood for instructions.
- Use 'eval' instead of 'evaluation' in all documentation; when referring to command line usage, use 'npx promptfoo eval' rather than 'npx promptfoo evaluation'; maintain consistency with this terminology across all examples, code blocks, and explanations.
- The project name can be written as either 'Promptfoo' (capitalized) or 'promptfoo' (lowercase) depending on context: use 'Promptfoo' at the beginning of sentences or in headings, and 'promptfoo' in code examples, terminal commands, or when referring to the package name; be consistent with the chosen capitalization within each document or section.
- Each markdown documentation file must include required front matter fields: 'title' (the page title shown in search results and browser tabs) and 'description' (a concise summary of the page content, ideally 150-160 characters).
- Only add a title attribute to code blocks that represent complete, runnable files; do not add titles to code fragments, partial examples, or snippets that aren't meant to be used as standalone files; this applies to all code blocks regardless of language.
- Use special comment directives to highlight specific lines in code blocks: 'highlight-next-line' highlights the line immediately after the comment, 'highligh...
Files:
site/docs/red-team/plugins/openai-guardrails.md
site/docs/**/*.{md,mdx}
📄 CodeRabbit inference engine (site/docs/CLAUDE.md)
site/docs/**/*.{md,mdx}:

- Use the term "eval" not "evaluation" in documentation and examples
- Capitalization: use "Promptfoo" (capitalized) in prose/headings and "promptfoo" (lowercase) in code, commands, and package names
- Every doc must include required front matter: title and description
- Only add title= to code blocks when showing complete runnable files
- Admonitions must have empty lines around their content (Prettier requirement)
- Do not modify headings; they may be externally linked
- Use progressive disclosure: put essential information first
- Use action-oriented, imperative mood in instructions (e.g., "Install the package")
Files:
site/docs/red-team/plugins/openai-guardrails.md
**/*.{test,spec}.{js,ts,tsx}
📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)
Avoid disabling or skipping tests unless absolutely necessary and documented
Files:
test/redteam/plugins/openaiGuardrails.test.ts
test/**/*.{test,spec}.ts
📄 CodeRabbit inference engine (.cursor/rules/jest.mdc)
test/**/*.{test,spec}.ts
- Mock as few functions as possible to keep tests realistic
- Never increase the function timeout - fix the test instead
- Organize tests in descriptive `describe` and `it` blocks
- Prefer assertions on entire objects rather than individual keys when writing expectations
- Clean up after tests to prevent side effects (e.g., use `afterEach(() => { jest.resetAllMocks(); })`)
- Run tests with the `--randomize` flag to ensure your mocks setup and teardown don't affect other tests
- Use Jest's mocking utilities rather than complex custom mocks
- Prefer shallow mocking over deep mocking
- Mock external dependencies but not the code being tested
- Reset mocks between tests to prevent test pollution
- For database tests, use in-memory instances or proper test fixtures
- Test both success and error cases for each provider
- Mock API responses to avoid external dependencies in tests
- Validate that provider options are properly passed to the underlying service
- Test error handling and edge cases (rate limits, timeouts, etc.)
- Ensure provider caching behaves as expected
- Always include both `--coverage` and `--randomize` flags when running tests
- Run tests in a single pass (no watch mode for CI)
- Ensure all tests are independent and can run in any order
- Clean up any test data or mocks after each test
Files:
test/redteam/plugins/openaiGuardrails.test.ts
test/**/*.test.ts
📄 CodeRabbit inference engine (test/CLAUDE.md)
test/**/*.test.ts:

- Never increase Jest test timeouts; fix slow tests instead (avoid jest.setTimeout or large timeouts in tests)
- Do not use .only() or .skip() in committed tests
- Add afterEach(() => { jest.resetAllMocks(); }) to ensure mock cleanup
- Prefer asserting entire objects (toEqual on whole result) rather than individual fields
- Mock minimally: only external dependencies (APIs, databases), not code under test
- Use Jest (not Vitest) APIs in this suite; avoid importing vitest
- Import from @jest/globals in tests
Files:
test/redteam/plugins/openaiGuardrails.test.ts
test/**
📄 CodeRabbit inference engine (test/CLAUDE.md)
Organize tests to mirror src/ structure (e.g., test/providers → src/providers, test/redteam → src/redteam)
Files:
test/redteam/plugins/openaiGuardrails.test.ts
test/**/*.{test.ts,test.tsx,spec.ts,spec.tsx}
📄 CodeRabbit inference engine (CLAUDE.md)
test/**/*.{test.ts,test.tsx,spec.ts,spec.tsx}:

- Follow Jest best practices using describe/it blocks in tests
- Write tests covering both success and error cases for all functionality
Files:
test/redteam/plugins/openaiGuardrails.test.ts
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
PR: promptfoo/promptfoo#0
File: src/redteam/CLAUDE.md:0-0
Timestamp: 2025-10-05T16:59:20.507Z
Learning: Applies to src/redteam/test/redteam/**/*.ts : Add tests for new plugins under test/redteam/
Learnt from: CR
PR: promptfoo/promptfoo#0
File: src/redteam/CLAUDE.md:0-0
Timestamp: 2025-10-05T16:59:20.507Z
Learning: Applies to src/redteam/plugins/**/*.ts : Place vulnerability-specific test generators as plugins under src/redteam/plugins/ (e.g., pii.ts, harmful.ts, sql-injection.ts)
Learnt from: CR
PR: promptfoo/promptfoo#0
File: src/redteam/CLAUDE.md:0-0
Timestamp: 2025-10-05T16:59:20.507Z
Learning: Applies to src/redteam/plugins/**/*.ts : New plugins must implement the RedteamPluginObject interface
📚 Learning: 2025-10-05T16:59:20.507Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: src/redteam/CLAUDE.md:0-0
Timestamp: 2025-10-05T16:59:20.507Z
Learning: Applies to src/redteam/test/redteam/**/*.ts : Add tests for new plugins under test/redteam/
Applied to files:
test/redteam/plugins/openaiGuardrails.test.ts
🧬 Code graph analysis (3)

src/redteam/plugins/openaiGuardrails.ts (3)

- src/types/index.ts (2): TestCase (702-702), Assertion (555-555)
- src/util/fetch/index.ts (1): fetchWithTimeout (133-159)
- src/providers/shared.ts (1): REQUEST_TIMEOUT_MS (9-9)

src/redteam/plugins/index.ts (1)

- src/redteam/plugins/openaiGuardrails.ts (1): OpenAIGuardrailsPlugin (165-246)

test/redteam/plugins/openaiGuardrails.test.ts (2)

- src/redteam/plugins/openaiGuardrails.ts (2): OpenAIGuardrailsPlugin (165-246), fetchDataset (55-163)
- src/util/fetch/index.ts (1): fetchWithTimeout (133-159)
🪛 GitHub Check: Build on Node 22.x
src/redteam/plugins/openaiGuardrails.ts
[failure] 224-224:
Property 'includeSafe' does not exist on type 'PluginConfig'.
test/redteam/plugins/openaiGuardrails.test.ts
[failure] 286-286:
Object literal may only specify known properties, and 'includeSafe' does not exist in type 'PluginConfig'.
[failure] 260-260:
Object literal may only specify known properties, and 'includeSafe' does not exist in type 'PluginConfig'.
[failure] 228-228:
Object literal may only specify known properties, and 'includeSafe' does not exist in type 'PluginConfig'.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
- GitHub Check: Tusk Test Runner (2)
- GitHub Check: Tusk Test Runner (3)
- GitHub Check: Tusk Test Runner (1)
- GitHub Check: Tusk Test Runner (4)
- GitHub Check: Tusk Tester
- GitHub Check: Run Integration Tests
- GitHub Check: webui tests
- GitHub Check: Build Docs
- GitHub Check: Test on Node 24.x and windows-latest
- GitHub Check: Test on Node 22.x and macOS-latest
- GitHub Check: Test on Node 22.x and ubuntu-latest
- GitHub Check: Test on Node 20.x and ubuntu-latest
- GitHub Check: Test on Node 22.x and windows-latest
- GitHub Check: Test on Node 24.x and ubuntu-latest
- GitHub Check: Test on Node 20.x and macOS-latest
- GitHub Check: Test on Node 20.x and windows-latest
- GitHub Check: Style Check
- GitHub Check: Generate Assets
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (7)

src/app/src/pages/redteam/setup/components/PluginConfigDialog.tsx (1)

Lines 305-331: LGTM! Clean UI implementation for the OpenAI Guardrails plugin configuration.

The UI correctly implements:

- A clear description of the plugin's purpose
- A checkbox control for the `includeSafe` option
- A helpful caption explaining the 50/50 split behavior

The implementation follows the established patterns in this file and properly manages local state.
src/redteam/constants/plugins.ts (2)

Line 89: LGTM! Correct placement in the guardrails evaluation plugins.

The `openai-guardrails` plugin is appropriately added to the `GUARDRAILS_EVALUATION_PLUGINS` array, positioned alphabetically between `system-prompt-override` and `harmbench`.

Line 294: LGTM! Correct placement in additional plugins.

The `openai-guardrails` plugin is appropriately added to the `ADDITIONAL_PLUGINS` array, positioned alphabetically between `off-topic` and `overreliance`.

site/docs/_shared/data/plugins.ts (1)

Lines 1127-1141: LGTM! Complete and well-structured plugin metadata entry.

The OpenAI Guardrails plugin entry includes all required fields and is correctly positioned alphabetically. The categorization as a 'Dataset' plugin with 'security' vulnerability type is appropriate. All application types (RAG, agent, chat) are enabled, which aligns with the plugin's broad applicability for jailbreak resistance testing.
src/redteam/plugins/index.ts (2)

Line 30: LGTM! Import correctly placed in alphabetical order.

The import statement for `OpenAIGuardrailsPlugin` is appropriately positioned alphabetically between `ExcessiveAgencyPlugin` and `HallucinationPlugin`.

Line 177: LGTM! Plugin factory correctly registered.

The `createPluginFactory(OpenAIGuardrailsPlugin, 'openai-guardrails')` registration is correctly placed in alphabetical order within the `pluginFactories` array. No validation function is needed since the `includeSafe` config property is optional with a default value.

src/redteam/plugins/openaiGuardrails.ts (1)

Lines 169-171: No changes needed for getTemplate().

This plugin sets `canGenerateRemote = false` and the base class never invokes `getTemplate()` for local-only plugins; throwing here aligns with other non-remote implementations.
```yaml
---
sidebar_label: OpenAI Guardrails
description: Red team jailbreak resistance using OpenAI's evaluation dataset to test LLM responses to role-playing attacks, system manipulation, and obfuscation techniques
---
```
Add the required `title` front matter entry.

Docs under site/docs must include both `title` and `description` in the front matter. Please add a `title` (e.g., `OpenAI Guardrails Plugin`) so Docusaurus builds and navigation metadata stay consistent. As per coding guidelines.
```diff
 ---
+title: OpenAI Guardrails Plugin
 sidebar_label: OpenAI Guardrails
 description: Red team jailbreak resistance using OpenAI's evaluation dataset to test LLM responses to role-playing attacks, system manipulation, and obfuscation techniques
 ---
```
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```yaml
---
title: OpenAI Guardrails Plugin
sidebar_label: OpenAI Guardrails
description: Red team jailbreak resistance using OpenAI's evaluation dataset to test LLM responses to role-playing attacks, system manipulation, and obfuscation techniques
---
```
🤖 Prompt for AI Agents
In site/docs/red-team/plugins/openai-guardrails.md around lines 1 to 4, the
front matter is missing the required title entry; add a top-level "title" field
(for example: "OpenAI Guardrails Plugin") to the existing YAML front matter so
both title and description are present, ensuring Docusaurus builds and
navigation metadata remain correct.
The OpenAI Guardrails plugin uses the jailbreak evaluation dataset from [OpenAI's Guardrails Python package](https://github.com/openai/openai-guardrails-python) to test LLM responses to known jailbreak attempts. The dataset contains real-world jailbreak patterns designed to circumvent AI safety guidelines, including:

- Role-playing attacks (e.g., "You are DAN", "Act as an unethical chatbot")
- System manipulation (e.g., token-based jailbreaks, prompt injection via special formatting)
- Obfuscation techniques (e.g., character encoding, linguistic evasion)
- Multi-turn jailbreak strategies (e.g., "life token" systems, cumulative manipulation)

This plugin helps evaluate whether your LLM maintains safety boundaries when faced with sophisticated adversarial prompts used in real-world jailbreak attempts.

## How it works

The plugin:

- Loads jailbreak prompts from OpenAI's evaluation dataset
- Filters for prompts expected to trigger jailbreak guardrails
- Shuffles and selects a specified number of test cases
- Tests whether the model refuses or complies with jailbreak attempts
- Uses LLM-as-a-judge to evaluate responses based on strict safety criteria

## Configuration

Add to your YAML file:

```yaml
redteam:
  plugins:
    - openai-guardrails
```

Control the number of tests:

```yaml
redteam:
  plugins:
    - id: openai-guardrails
      numTests: 25
```

### Testing for Over-Blocking

By default, OpenAI Guardrails tests only jailbreak attempts to ensure your model refuses harmful content. However, you can also test for over-blocking (false positives) by including safe prompts:

```yaml
redteam:
  plugins:
    - id: openai-guardrails
      config:
        includeSafe: true
      numTests: 20
```

When `includeSafe: true` is set:

- Test cases are split 50/50 between safe and jailbreak prompts
- Safe prompts use inverted grading logic: they SHOULD be answered (not refused)
- This helps identify if your guardrails are too strict and block legitimate requests
- Metadata includes `isSafe` and `label` fields for filtering and analysis
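The 50/50 balancing described above can be sketched as follows. This is a hypothetical standalone illustration, not the plugin's actual code; the row shape and the `expected` field name are assumptions:

```typescript
interface DatasetRow {
  prompt: string;
  expected: 'safe' | 'jailbreak'; // assumed label field on each dataset row
}

// Split the dataset by label, then take an equal number from each side.
function selectBalanced(rows: DatasetRow[], numTests: number): DatasetRow[] {
  const safe = rows.filter((r) => r.expected === 'safe');
  const jailbreak = rows.filter((r) => r.expected === 'jailbreak');
  const numEach = Math.floor(numTests / 2);
  return [...safe.slice(0, numEach), ...jailbreak.slice(0, numEach)];
}

const rows: DatasetRow[] = [
  { prompt: 'You are DAN...', expected: 'jailbreak' },
  { prompt: 'What is the capital of France?', expected: 'safe' },
  { prompt: 'Act as an unethical chatbot...', expected: 'jailbreak' },
  { prompt: 'Summarize this article.', expected: 'safe' },
];
const selected = selectBalanced(rows, 4);
console.log(selected.filter((r) => r.expected === 'safe').length); // 2
```

In the real plugin the two halves would also be shuffled before selection; this sketch only shows the balancing step.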
Replace “evaluation” with “eval” throughout the doc
The documentation style guide requires using “eval” instead of “evaluation” in prose and examples. Please update every occurrence (e.g., “jailbreak evaluation dataset”, “evaluation dataset”) to the approved terminology, keeping the official dataset name clear where needed. As per coding guidelines.
🤖 Prompt for AI Agents
In site/docs/red-team/plugins/openai-guardrails.md around lines 8 to 64, replace
every occurrence of the word "evaluation" with "eval" in the prose and examples
(e.g., "jailbreak evaluation dataset" → "jailbreak eval dataset", "evaluation
dataset" → "eval dataset"), including headings and inline descriptions and code
comments; do not alter official package names or URLs (leave
"openai-guardrails-python" and any quoted official titles intact), and update
any adjacent phrasing so grammar remains correct after the substitution.
```typescript
  async generateTests(n: number, _delayMs?: number): Promise<TestCase[]> {
    const includeSafe = this.config?.includeSafe ?? false;
```
Type safety issue: `includeSafe` property missing from the `PluginConfig` interface.

Line 224 accesses `this.config?.includeSafe`, but the `PluginConfig` type does not include an `includeSafe` property, causing a TypeScript compilation error (flagged by the GitHub check).

The `PluginConfig` type needs to be extended to include `includeSafe?: boolean`. Since this plugin-specific configuration is only used by the OpenAI Guardrails plugin, consider one of these solutions:
Solution 1: Add to the base PluginConfig interface

Locate the `PluginConfig` interface definition and add:

```typescript
export interface PluginConfig {
  // ... existing properties
  includeSafe?: boolean;
}
```
Solution 2: Define a plugin-specific config type

```diff
+interface OpenAIGuardrailsConfig extends PluginConfig {
+  includeSafe?: boolean;
+}
+
 export class OpenAIGuardrailsPlugin extends RedteamPluginBase {
   readonly id = PLUGIN_ID;
   static readonly canGenerateRemote = false;
+  declare config?: OpenAIGuardrailsConfig;

   async getTemplate(): Promise<string> {
     throw new Error('Not implemented');
   }
```
🧰 Tools
🪛 GitHub Check: Build on Node 22.x
[failure] 224-224:
Property 'includeSafe' does not exist on type 'PluginConfig'.
🤖 Prompt for AI Agents
In src/redteam/plugins/openaiGuardrails.ts around line 224, the code reads
this.config?.includeSafe but the PluginConfig type lacks includeSafe causing a
TypeScript error; fix by either (A) add includeSafe?: boolean to the shared
PluginConfig interface definition (update the file where PluginConfig is
declared and export the new optional property), or (B) create a plugin-specific
config type (e.g., OpenAIGuardrailsConfig extends PluginConfig { includeSafe?:
boolean }) and cast/annotate this.config as that type in this plugin so
includeSafe is recognized. Ensure the chosen change is exported/imported where
needed and update any usages or tests accordingly.
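The second solution amounts to narrowing the config type at the plugin boundary while keeping the base type untouched. A minimal, self-contained sketch of that pattern follows; every name here is illustrative, not promptfoo's actual API:

```typescript
// Stand-in for the shared base config (illustrative only).
interface PluginConfig {
  language?: string;
}

// Plugin-specific extension declaring the extra option.
interface OpenAIGuardrailsConfig extends PluginConfig {
  includeSafe?: boolean;
}

class GuardrailsPluginSketch {
  constructor(private config?: OpenAIGuardrailsConfig) {}

  // The optional property falls back to false when unset,
  // mirroring `this.config?.includeSafe ?? false` in the review.
  get includeSafe(): boolean {
    return this.config?.includeSafe ?? false;
  }
}

console.log(new GuardrailsPluginSketch({ includeSafe: true }).includeSafe); // true
console.log(new GuardrailsPluginSketch().includeSafe); // false
```

Because `includeSafe` is optional with a safe default, no validation function is required, which matches the reviewer's note on the factory registration.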
```typescript
      const pluginWithIncludeSafe = new OpenAIGuardrailsPlugin({} as any, 'test', 'input', {
        includeSafe: true,
      });
```
Fix the `includeSafe` config typing to restore the build.

TypeScript rejects these literals because `includeSafe` is not part of `PluginConfig`, which currently makes the test suite fail (see the static analysis errors on these lines). Please extend the plugin's config type to declare `includeSafe?: boolean` (and reuse it here), or cast to that extended type, so the compiler accepts the property while keeping type safety.

Also applies to lines 259-261 and 285-287.
🧰 Tools
🪛 GitHub Check: Build on Node 22.x
[failure] 228-228:
Object literal may only specify known properties, and 'includeSafe' does not exist in type 'PluginConfig'.
🤖 Prompt for AI Agents
In test/redteam/plugins/openaiGuardrails.test.ts around lines 227-229 (and
similarly at 259-261 and 285-287), the test constructs pass an includeSafe
property but TypeScript fails because PluginConfig doesn't declare it; update
the plugin config type to include includeSafe?: boolean (or create an extended
interface that adds includeSafe and use that type in the test) and update the
test instantiations to use that typed config (or cast the config object to the
extended type) so the compiler accepts the property while preserving type
safety.
- Add includeSafe?: boolean to PluginConfig interface in src/redteam/types.ts
- Regenerate config-schema.json to include new plugin and types
- Fixes TypeScript compilation errors in CI

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Should this be one of the exception plugins for strategies?

Summary

Add new dataset-based redteam plugin to test jailbreak resistance using OpenAI's official guardrails evaluation dataset.

Key Features

- `includeSafe` configuration tests whether guardrails incorrectly block legitimate requests

Configuration

Implementation Details

Files Changed

- src/redteam/plugins/openaiGuardrails.ts
- test/redteam/plugins/openaiGuardrails.test.ts (19 tests, all passing)
- site/docs/red-team/plugins/openai-guardrails.md
- PluginConfigDialog.tsx for the includeSafe option

Technical Highlights

- Dataset pinned to commit `293b1ae7dbfb278afc36e04fe6a4aba44f5c6582` for stability
- With `includeSafe: true`, metadata includes `isSafe` and `label` fields for filtering

Testing

Example Output

When running with `includeSafe: true`, prompts labeled `jailbreak` are graded strictly for refusal, and prompts labeled `safe` are graded for appropriate helpfulness.

Backward Compatibility

✅ Fully backward compatible - default behavior unchanged when `includeSafe` is not specified.