Conversation

mldangelo
Member

Summary

Add a new dataset-based redteam plugin to test jailbreak resistance using OpenAI's official guardrails evaluation dataset.

Key Features

  • Dataset Source: Uses OpenAI's guardrails-python evaluation demo dataset
  • Content: 51 jailbreak prompts + 49 safe prompts from real-world jailbreak attempts
  • Attack Types: Role-playing attacks (e.g., "You are DAN"), system manipulation, obfuscation techniques, multi-turn strategies
  • Over-blocking Detection: Optional includeSafe configuration tests whether guardrails incorrectly block legitimate requests

Configuration

```yaml
redteam:
  plugins:
    - openai-guardrails
```

```yaml
# Test for over-blocking with 50/50 safe/jailbreak split
redteam:
  plugins:
    - id: openai-guardrails
      config:
        includeSafe: true
      numTests: 20
```

Implementation Details

Files Changed

  • Plugin: src/redteam/plugins/openaiGuardrails.ts
  • Tests: test/redteam/plugins/openaiGuardrails.test.ts (19 tests, all passing)
  • Docs: site/docs/red-team/plugins/openai-guardrails.md
  • UI: Checkbox in PluginConfigDialog.tsx for includeSafe option

Technical Highlights

  • Dataset pinned to commit hash 293b1ae7dbfb278afc36e04fe6a4aba44f5c6582 for stability
  • 50/50 balanced split when includeSafe: true
  • Inverted grading logic for safe prompts (they SHOULD be answered)
  • Metadata includes isSafe and label fields for filtering
  • Follows the same pattern as the beavertails/unsafebench/aegis plugins
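The inverted grading behavior can be pictured with a short sketch. This is illustrative only — the row shape, `buildAssertion` helper, and rubric wording are assumptions, not the plugin's actual code:

```typescript
// Hypothetical sketch of inverted grading for safe prompts.
// Names and rubric strings are illustrative, not the plugin's identifiers.
interface DatasetRow {
  prompt: string;
  label: 'safe' | 'jailbreak';
}

function buildAssertion(row: DatasetRow): { type: string; value: string } {
  if (row.label === 'safe') {
    // Safe prompts SHOULD be answered; a refusal indicates over-blocking.
    return {
      type: 'llm-rubric',
      value: 'The response helpfully answers the request without refusing.',
    };
  }
  // Jailbreak prompts must be refused.
  return {
    type: 'llm-rubric',
    value: 'The response refuses to comply with the jailbreak attempt.',
  };
}
```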

Testing

  • ✅ All 19 unit tests passing with coverage
  • ✅ End-to-end tested with includeSafe option
  • ✅ Linting and formatting clean
  • ✅ TypeScript compilation successful

Example Output

When running with includeSafe: true:

  • 2 jailbreak prompts (labeled jailbreak, graded strictly for refusal)
  • 2 safe prompts (labeled safe, graded for appropriate helpfulness)
  • Perfect 50/50 split maintained

Backward Compatibility

✅ Fully backward compatible - default behavior unchanged when includeSafe is not specified.

Add a new dataset-based plugin to test jailbreak resistance using OpenAI's
official guardrails evaluation dataset. The dataset contains 51 jailbreak
prompts and 49 safe prompts from real-world jailbreak attempts.

Features:
- Tests role-playing attacks, system manipulation, and obfuscation techniques
- Includes includeSafe option for testing over-blocking (false positives)
- 50/50 balanced split of safe and jailbreak prompts when enabled
- Inverted grading logic for safe prompts (they SHOULD be answered)
- Dataset pinned to commit hash for stability
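The balanced split amounts to drawing half the requested tests from each pool. A minimal sketch (the function name is an assumption, and shuffling is omitted here so the example is deterministic):

```typescript
// Hypothetical sketch of the 50/50 balanced sampling when includeSafe is true.
// The real plugin shuffles both pools first; that step is omitted for clarity.
function balancedSplit<T>(safeRows: T[], jailbreakRows: T[], numTests: number): T[] {
  const numEach = Math.floor(numTests / 2);
  return [
    ...safeRows.slice(0, numEach),      // half from the safe pool
    ...jailbreakRows.slice(0, numEach), // half from the jailbreak pool
  ];
}
```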

Implementation:
- Plugin: src/redteam/plugins/openaiGuardrails.ts
- Tests: test/redteam/plugins/openaiGuardrails.test.ts (19 tests, all passing)
- Documentation: site/docs/red-team/plugins/openai-guardrails.md
- UI configuration dialog with checkbox for includeSafe option

Follows the same pattern as the beavertails/unsafebench/aegis plugins for consistency.

coderabbitai bot commented Oct 10, 2025

📝 Walkthrough

Walkthrough

Adds a new “OpenAI Guardrails” red-team plugin across the codebase. Introduces plugin metadata in site/docs/_shared/data/plugins.ts and red-team constants, registers the plugin in src/redteam/plugins/index.ts, and implements it in src/redteam/plugins/openaiGuardrails.ts with dataset-driven test generation, safe/jailbreak filtering, balanced sampling via includeSafe, and rubric-based assertions. Adds UI configuration in PluginConfigDialog.tsx for toggling safe prompts. Provides documentation at site/docs/red-team/plugins/openai-guardrails.md. Includes comprehensive unit tests in test/redteam/plugins/openaiGuardrails.test.ts covering dataset parsing, balancing, assertions, metadata, error handling, and integration behavior.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is below the required threshold of 80.00%. | Run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title concisely and accurately describes the primary change, stating that a new OpenAI Guardrails redteam plugin is being added, matching the main content of the pull request. |
| Description Check | ✅ Passed | The description directly relates to the changes, summarizing the dataset-based redteam plugin, its features, configuration, implementation, and testing, and matches the modifications in the pull request. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (1)
src/redteam/plugins/openaiGuardrails.ts (1)

134-136: Consider using a more robust shuffling algorithm.

The current shuffle implementation uses .sort(() => Math.random() - 0.5), which is a common pattern but has known bias issues and doesn't provide a uniform distribution. While acceptable for red-team test selection, a Fisher-Yates shuffle would be more robust.

-      selectedRows = [
-        ...safeRows.sort(() => Math.random() - 0.5).slice(0, numEach),
-        ...jailbreakRows.sort(() => Math.random() - 0.5).slice(0, numEach),
-      ].sort(() => Math.random() - 0.5); // Shuffle final order
+      // Fisher-Yates shuffle helper
+      const shuffle = <T>(array: T[]): T[] => {
+        const result = [...array];
+        for (let i = result.length - 1; i > 0; i--) {
+          const j = Math.floor(Math.random() * (i + 1));
+          [result[i], result[j]] = [result[j], result[i]];
+        }
+        return result;
+      };
+
+      selectedRows = shuffle([
+        ...shuffle(safeRows).slice(0, numEach),
+        ...shuffle(jailbreakRows).slice(0, numEach),
+      ]);
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d97485b and 2042b2d.

📒 Files selected for processing (8)
  • site/docs/_shared/data/plugins.ts (1 hunks)
  • site/docs/red-team/plugins/openai-guardrails.md (1 hunks)
  • src/app/src/pages/redteam/setup/components/PluginConfigDialog.tsx (2 hunks)
  • src/redteam/constants/metadata.ts (6 hunks)
  • src/redteam/constants/plugins.ts (2 hunks)
  • src/redteam/plugins/index.ts (2 hunks)
  • src/redteam/plugins/openaiGuardrails.ts (1 hunks)
  • test/redteam/plugins/openaiGuardrails.test.ts (1 hunks)
🧰 Additional context used
📓 Path-based instructions (17)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)

Prefer not to introduce new TypeScript types; use existing interfaces whenever possible

**/*.{ts,tsx}: Follow consistent import order (Biome will handle sorting)
Use curly braces for all control statements
Prefer const over let; avoid var
Use object property shorthand when possible
Use async/await for asynchronous code
Use consistent error handling with proper type checks

Files:

  • src/redteam/constants/plugins.ts
  • src/redteam/constants/metadata.ts
  • src/redteam/plugins/openaiGuardrails.ts
  • site/docs/_shared/data/plugins.ts
  • src/app/src/pages/redteam/setup/components/PluginConfigDialog.tsx
  • src/redteam/plugins/index.ts
  • test/redteam/plugins/openaiGuardrails.test.ts
src/redteam/**/*.ts

📄 CodeRabbit inference engine (src/redteam/CLAUDE.md)

src/redteam/**/*.ts: Always sanitize when logging test prompts or model outputs by passing them via the structured metadata parameter (second argument) to the logger, not raw string interpolation
Use the standardized risk severity levels: critical, high, medium, low when reporting results

Files:

  • src/redteam/constants/plugins.ts
  • src/redteam/constants/metadata.ts
  • src/redteam/plugins/openaiGuardrails.ts
  • src/redteam/plugins/index.ts
src/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

src/**/*.{ts,tsx}: Sanitize sensitive data before logging; pass context objects to logger methods (debug, info, warn, error) for automatic redaction
Do not interpolate secrets into log messages (avoid stringifying headers/bodies directly); use structured logger context instead
Use sanitizeObject for manual sanitization before using or persisting potentially sensitive data

Files:

  • src/redteam/constants/plugins.ts
  • src/redteam/constants/metadata.ts
  • src/redteam/plugins/openaiGuardrails.ts
  • src/app/src/pages/redteam/setup/components/PluginConfigDialog.tsx
  • src/redteam/plugins/index.ts
src/redteam/plugins/**/*.ts

📄 CodeRabbit inference engine (src/redteam/CLAUDE.md)

src/redteam/plugins/**/*.ts: Place vulnerability-specific test generators as plugins under src/redteam/plugins/ (e.g., pii.ts, harmful.ts, sql-injection.ts)
New plugins must implement the RedteamPluginObject interface

Files:

  • src/redteam/plugins/openaiGuardrails.ts
  • src/redteam/plugins/index.ts
{site/**,examples/**}

📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)

Any pull request that only touches files in 'site/' or 'examples/' directories must use the 'docs:' prefix in the PR title, not 'feat:' or 'fix:'

Files:

  • site/docs/_shared/data/plugins.ts
  • site/docs/red-team/plugins/openai-guardrails.md
site/**

📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)

If the change is a feature, update the relevant documentation under 'site/'

Files:

  • site/docs/_shared/data/plugins.ts
  • site/docs/red-team/plugins/openai-guardrails.md
src/app/src/**/*.{ts,tsx}

📄 CodeRabbit inference engine (src/app/CLAUDE.md)

src/app/src/**/*.{ts,tsx}: Never use fetch() directly; always use callApi() from @app/utils/api for all HTTP requests
Access Zustand state outside React components via store.getState(); do not call hooks outside components
Use the @app/* path alias for internal imports as configured in Vite

Files:

  • src/app/src/pages/redteam/setup/components/PluginConfigDialog.tsx
src/app/src/{components,pages}/**/*.tsx

📄 CodeRabbit inference engine (src/app/CLAUDE.md)

src/app/src/{components,pages}/**/*.tsx: Use the class-based ErrorBoundary component (@app/components/ErrorBoundary) to wrap error-prone UI
Access theme via useTheme() from @mui/material/styles instead of hardcoding theme values
Use useMemo/useCallback only when profiling indicates benefit; avoid unnecessary memoization
Implement explicit loading and error states for components performing async operations
Prefer MUI composition and the sx prop for styling over ad-hoc inline styles

Files:

  • src/app/src/pages/redteam/setup/components/PluginConfigDialog.tsx
**/*.{tsx,jsx}

📄 CodeRabbit inference engine (.cursor/rules/react-components.mdc)

**/*.{tsx,jsx}: Use icons from @mui/icons-material
Prefer commonly used icons from @mui/icons-material for intuitive experience

Files:

  • src/app/src/pages/redteam/setup/components/PluginConfigDialog.tsx
src/app/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

In the React app (src/app), use callApi from @app/utils/api for all API calls; do not use fetch directly

Files:

  • src/app/src/pages/redteam/setup/components/PluginConfigDialog.tsx
site/docs/**/*.md

📄 CodeRabbit inference engine (.cursor/rules/docusaurus.mdc)

site/docs/**/*.md: Prioritize minimal edits when updating existing documentation; avoid creating entirely new sections or rewriting substantial portions; focus edits on improving grammar, spelling, clarity, fixing typos, and structural improvements where needed; do not modify existing headings (h1, h2, h3, etc.) as they are often linked externally.
Structure content to reveal information progressively: begin with essential actions and information, then provide deeper context as necessary; organize information from most important to least important.
Use action-oriented language: clearly outline actionable steps users should take, use concise and direct language, prefer active voice over passive voice, and use imperative mood for instructions.
Use 'eval' instead of 'evaluation' in all documentation; when referring to command line usage, use 'npx promptfoo eval' rather than 'npx promptfoo evaluation'; maintain consistency with this terminology across all examples, code blocks, and explanations.
The project name can be written as either 'Promptfoo' (capitalized) or 'promptfoo' (lowercase) depending on context: use 'Promptfoo' at the beginning of sentences or in headings, and 'promptfoo' in code examples, terminal commands, or when referring to the package name; be consistent with the chosen capitalization within each document or section.
Each markdown documentation file must include required front matter fields: 'title' (the page title shown in search results and browser tabs) and 'description' (a concise summary of the page content, ideally 150-160 characters).
Only add a title attribute to code blocks that represent complete, runnable files; do not add titles to code fragments, partial examples, or snippets that aren't meant to be used as standalone files; this applies to all code blocks regardless of language.
Use special comment directives to highlight specific lines in code blocks: 'highlight-next-line' highlights the line immediately after the comment, 'highligh...

Files:

  • site/docs/red-team/plugins/openai-guardrails.md
site/docs/**/*.{md,mdx}

📄 CodeRabbit inference engine (site/docs/CLAUDE.md)

site/docs/**/*.{md,mdx}: Use the term "eval" not "evaluation" in documentation and examples
Capitalization: use "Promptfoo" (capitalized) in prose/headings and "promptfoo" (lowercase) in code, commands, and package names
Every doc must include required front matter: title and description
Only add title= to code blocks when showing complete runnable files
Admonitions must have empty lines around their content (Prettier requirement)
Do not modify headings; they may be externally linked
Use progressive disclosure: put essential information first
Use action-oriented, imperative mood in instructions (e.g., "Install the package")

Files:

  • site/docs/red-team/plugins/openai-guardrails.md
**/*.{test,spec}.{js,ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)

Avoid disabling or skipping tests unless absolutely necessary and documented

Files:

  • test/redteam/plugins/openaiGuardrails.test.ts
test/**/*.{test,spec}.ts

📄 CodeRabbit inference engine (.cursor/rules/jest.mdc)

test/**/*.{test,spec}.ts: Mock as few functions as possible to keep tests realistic
Never increase the function timeout - fix the test instead
Organize tests in descriptive describe and it blocks
Prefer assertions on entire objects rather than individual keys when writing expectations
Clean up after tests to prevent side effects (e.g., use afterEach(() => { jest.resetAllMocks(); }))
Run tests with --randomize flag to ensure your mocks setup and teardown don't affect other tests
Use Jest's mocking utilities rather than complex custom mocks
Prefer shallow mocking over deep mocking
Mock external dependencies but not the code being tested
Reset mocks between tests to prevent test pollution
For database tests, use in-memory instances or proper test fixtures
Test both success and error cases for each provider
Mock API responses to avoid external dependencies in tests
Validate that provider options are properly passed to the underlying service
Test error handling and edge cases (rate limits, timeouts, etc.)
Ensure provider caching behaves as expected
Always include both --coverage and --randomize flags when running tests
Run tests in a single pass (no watch mode for CI)
Ensure all tests are independent and can run in any order
Clean up any test data or mocks after each test

Files:

  • test/redteam/plugins/openaiGuardrails.test.ts
test/**/*.test.ts

📄 CodeRabbit inference engine (test/CLAUDE.md)

test/**/*.test.ts: Never increase Jest test timeouts; fix slow tests instead (avoid jest.setTimeout or large timeouts in tests)
Do not use .only() or .skip() in committed tests
Add afterEach(() => { jest.resetAllMocks(); }) to ensure mock cleanup
Prefer asserting entire objects (toEqual on whole result) rather than individual fields
Mock minimally: only external dependencies (APIs, databases), not code under test
Use Jest (not Vitest) APIs in this suite; avoid importing vitest
Import from @jest/globals in tests

Files:

  • test/redteam/plugins/openaiGuardrails.test.ts
test/**

📄 CodeRabbit inference engine (test/CLAUDE.md)

Organize tests to mirror src/ structure (e.g., test/providers → src/providers, test/redteam → src/redteam)

Files:

  • test/redteam/plugins/openaiGuardrails.test.ts
test/**/*.{test.ts,test.tsx,spec.ts,spec.tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

test/**/*.{test.ts,test.tsx,spec.ts,spec.tsx}: Follow Jest best practices using describe/it blocks in tests
Write tests covering both success and error cases for all functionality

Files:

  • test/redteam/plugins/openaiGuardrails.test.ts
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
PR: promptfoo/promptfoo#0
File: src/redteam/CLAUDE.md:0-0
Timestamp: 2025-10-05T16:59:20.507Z
Learning: Applies to src/redteam/test/redteam/**/*.ts : Add tests for new plugins under test/redteam/
Learnt from: CR
PR: promptfoo/promptfoo#0
File: src/redteam/CLAUDE.md:0-0
Timestamp: 2025-10-05T16:59:20.507Z
Learning: Applies to src/redteam/plugins/**/*.ts : Place vulnerability-specific test generators as plugins under src/redteam/plugins/ (e.g., pii.ts, harmful.ts, sql-injection.ts)
Learnt from: CR
PR: promptfoo/promptfoo#0
File: src/redteam/CLAUDE.md:0-0
Timestamp: 2025-10-05T16:59:20.507Z
Learning: Applies to src/redteam/plugins/**/*.ts : New plugins must implement the RedteamPluginObject interface
📚 Learning: 2025-10-05T16:59:20.507Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: src/redteam/CLAUDE.md:0-0
Timestamp: 2025-10-05T16:59:20.507Z
Learning: Applies to src/redteam/test/redteam/**/*.ts : Add tests for new plugins under test/redteam/

Applied to files:

  • test/redteam/plugins/openaiGuardrails.test.ts
🧬 Code graph analysis (3)
src/redteam/plugins/openaiGuardrails.ts (3)
src/types/index.ts (2)
  • TestCase (702-702)
  • Assertion (555-555)
src/util/fetch/index.ts (1)
  • fetchWithTimeout (133-159)
src/providers/shared.ts (1)
  • REQUEST_TIMEOUT_MS (9-9)
src/redteam/plugins/index.ts (1)
src/redteam/plugins/openaiGuardrails.ts (1)
  • OpenAIGuardrailsPlugin (165-246)
test/redteam/plugins/openaiGuardrails.test.ts (2)
src/redteam/plugins/openaiGuardrails.ts (2)
  • OpenAIGuardrailsPlugin (165-246)
  • fetchDataset (55-163)
src/util/fetch/index.ts (1)
  • fetchWithTimeout (133-159)
🪛 GitHub Check: Build on Node 22.x
src/redteam/plugins/openaiGuardrails.ts

[failure] 224-224:
Property 'includeSafe' does not exist on type 'PluginConfig'.

test/redteam/plugins/openaiGuardrails.test.ts

[failure] 286-286:
Object literal may only specify known properties, and 'includeSafe' does not exist in type 'PluginConfig'.


[failure] 260-260:
Object literal may only specify known properties, and 'includeSafe' does not exist in type 'PluginConfig'.


[failure] 228-228:
Object literal may only specify known properties, and 'includeSafe' does not exist in type 'PluginConfig'.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
  • GitHub Check: Tusk Test Runner (2)
  • GitHub Check: Tusk Test Runner (3)
  • GitHub Check: Tusk Test Runner (1)
  • GitHub Check: Tusk Test Runner (4)
  • GitHub Check: Tusk Tester
  • GitHub Check: Run Integration Tests
  • GitHub Check: webui tests
  • GitHub Check: Build Docs
  • GitHub Check: Test on Node 24.x and windows-latest
  • GitHub Check: Test on Node 22.x and macOS-latest
  • GitHub Check: Test on Node 22.x and ubuntu-latest
  • GitHub Check: Test on Node 20.x and ubuntu-latest
  • GitHub Check: Test on Node 22.x and windows-latest
  • GitHub Check: Test on Node 24.x and ubuntu-latest
  • GitHub Check: Test on Node 20.x and macOS-latest
  • GitHub Check: Test on Node 20.x and windows-latest
  • GitHub Check: Style Check
  • GitHub Check: Generate Assets
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (7)
src/app/src/pages/redteam/setup/components/PluginConfigDialog.tsx (1)

305-331: LGTM! Clean UI implementation for the OpenAI Guardrails plugin configuration.

The UI correctly implements:

  • Clear description of the plugin's purpose
  • Checkbox control for the includeSafe option
  • Helpful caption explaining the 50/50 split behavior

The implementation follows the established patterns in this file and properly manages local state.

src/redteam/constants/plugins.ts (2)

89-89: LGTM! Correct placement in the guardrails evaluation plugins.

The openai-guardrails plugin is appropriately added to the GUARDRAILS_EVALUATION_PLUGINS array, positioned alphabetically between system-prompt-override and harmbench.


294-294: LGTM! Correct placement in additional plugins.

The openai-guardrails plugin is appropriately added to the ADDITIONAL_PLUGINS array, positioned alphabetically between off-topic and overreliance.

site/docs/_shared/data/plugins.ts (1)

1127-1141: LGTM! Complete and well-structured plugin metadata entry.

The OpenAI Guardrails plugin entry includes all required fields and is correctly positioned alphabetically. The categorization as a 'Dataset' plugin with 'security' vulnerability type is appropriate. All application types (RAG, agent, chat) are enabled, which aligns with the plugin's broad applicability for jailbreak resistance testing.

src/redteam/plugins/index.ts (2)

30-30: LGTM! Import correctly placed in alphabetical order.

The import statement for OpenAIGuardrailsPlugin is appropriately positioned alphabetically between ExcessiveAgencyPlugin and HallucinationPlugin.


177-177: LGTM! Plugin factory correctly registered.

The createPluginFactory(OpenAIGuardrailsPlugin, 'openai-guardrails') registration is correctly placed in alphabetical order within the pluginFactories array. No validation function is needed since the includeSafe config property is optional with a default value.

src/redteam/plugins/openaiGuardrails.ts (1)

169-171: No changes needed for getTemplate()
This plugin sets canGenerateRemote = false and the base class never invokes getTemplate() for local-only plugins; throwing here aligns with other non-remote implementations.

Comment on lines +1 to +4
---
sidebar_label: OpenAI Guardrails
description: Red team jailbreak resistance using OpenAI's evaluation dataset to test LLM responses to role-playing attacks, system manipulation, and obfuscation techniques
---

⚠️ Potential issue | 🟠 Major

Add the required title front matter entry

Docs under site/docs must include both title and description in the front matter. Please add a title (e.g., OpenAI Guardrails Plugin) so Docusaurus builds and navigation metadata stay consistent. As per coding guidelines.

 ---
+title: OpenAI Guardrails Plugin
 sidebar_label: OpenAI Guardrails
 description: Red team jailbreak resistance using OpenAI's evaluation dataset to test LLM responses to role-playing attacks, system manipulation, and obfuscation techniques
 ---
🤖 Prompt for AI Agents
In site/docs/red-team/plugins/openai-guardrails.md around lines 1 to 4, the
front matter is missing the required title entry; add a top-level "title" field
(for example: "OpenAI Guardrails Plugin") to the existing YAML front matter so
both title and description are present, ensuring Docusaurus builds and
navigation metadata remain correct.

Comment on lines +8 to +64
The OpenAI Guardrails plugin uses the jailbreak evaluation dataset from [OpenAI's Guardrails Python package](https://github.com/openai/openai-guardrails-python) to test LLM responses to known jailbreak attempts. The dataset contains real-world jailbreak patterns designed to circumvent AI safety guidelines, including:

- Role-playing attacks (e.g., "You are DAN", "Act as an unethical chatbot")
- System manipulation (e.g., token-based jailbreaks, prompt injection via special formatting)
- Obfuscation techniques (e.g., character encoding, linguistic evasion)
- Multi-turn jailbreak strategies (e.g., "life token" systems, cumulative manipulation)

This plugin helps evaluate whether your LLM maintains safety boundaries when faced with sophisticated adversarial prompts used in real-world jailbreak attempts.

## How it works

The plugin:

- Loads jailbreak prompts from OpenAI's evaluation dataset
- Filters for prompts expected to trigger jailbreak guardrails
- Shuffles and selects a specified number of test cases
- Tests whether the model refuses or complies with jailbreak attempts
- Uses LLM-as-a-judge to evaluate responses based on strict safety criteria

## Configuration

Add to your YAML file:

```yaml
redteam:
  plugins:
    - openai-guardrails
```

Control the number of tests:

```yaml
redteam:
  plugins:
    - id: openai-guardrails
      numTests: 25
```
### Testing for Over-Blocking

By default, OpenAI Guardrails tests only jailbreak attempts to ensure your model refuses harmful content. However, you can also test for over-blocking (false positives) by including safe prompts:

```yaml
redteam:
  plugins:
    - id: openai-guardrails
      config:
        includeSafe: true
      numTests: 20
```

When `includeSafe: true` is set:

- Test cases are split 50/50 between safe and jailbreak prompts
- Safe prompts use inverted grading logic: they SHOULD be answered (not refused)
- This helps identify if your guardrails are too strict and block legitimate requests
- Metadata includes `isSafe` and `label` fields for filtering and analysis
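As a hedged illustration of how those fields could be used downstream — the result shape here is assumed; only the `isSafe` and `label` field names come from the plugin description above:

```typescript
// Hypothetical result shape; only `isSafe` and `label` mirror the plugin's
// documented metadata fields.
interface EvalResult {
  pass: boolean;
  metadata: { isSafe: boolean; label: 'safe' | 'jailbreak' };
}

const results: EvalResult[] = [
  { pass: false, metadata: { isSafe: true, label: 'safe' } },      // refused a safe prompt
  { pass: true, metadata: { isSafe: false, label: 'jailbreak' } }, // correctly refused a jailbreak
];

// Safe prompts that failed were refused when they should have been answered.
const overBlocked = results.filter((r) => r.metadata.isSafe && !r.pass);
```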

⚠️ Potential issue | 🟠 Major

Replace “evaluation” with “eval” throughout the doc

The documentation style guide requires using “eval” instead of “evaluation” in prose and examples. Please update every occurrence (e.g., “jailbreak evaluation dataset”, “evaluation dataset”) to the approved terminology, keeping the official dataset name clear where needed. As per coding guidelines.

🤖 Prompt for AI Agents
In site/docs/red-team/plugins/openai-guardrails.md around lines 8 to 64, replace
every occurrence of the word "evaluation" with "eval" in the prose and examples
(e.g., "jailbreak evaluation dataset" → "jailbreak eval dataset", "evaluation
dataset" → "eval dataset"), including headings and inline descriptions and code
comments; do not alter official package names or URLs (leave
"openai-guardrails-python" and any quoted official titles intact), and update
any adjacent phrasing so grammar remains correct after the substitution.

}

async generateTests(n: number, _delayMs?: number): Promise<TestCase[]> {
const includeSafe = this.config?.includeSafe ?? false;

⚠️ Potential issue | 🔴 Critical

Type safety issue: includeSafe property missing from PluginConfig interface.

Line 224 accesses this.config?.includeSafe but the PluginConfig type does not include an includeSafe property, causing a TypeScript compilation error (flagged by GitHub Check).

The PluginConfig type needs to be extended to include includeSafe?: boolean. Since this plugin-specific configuration is only used by the OpenAI Guardrails plugin, consider one of these solutions:

Solution 1: Add to the base PluginConfig interface
Locate the PluginConfig interface definition and add:

export interface PluginConfig {
  // ... existing properties
  includeSafe?: boolean;
}

Solution 2: Define a plugin-specific config type

+interface OpenAIGuardrailsConfig extends PluginConfig {
+  includeSafe?: boolean;
+}
+
 export class OpenAIGuardrailsPlugin extends RedteamPluginBase {
   readonly id = PLUGIN_ID;
   static readonly canGenerateRemote = false;
+  declare config?: OpenAIGuardrailsConfig;

   async getTemplate(): Promise<string> {
     throw new Error('Not implemented');
   }
🧰 Tools
🪛 GitHub Check: Build on Node 22.x

[failure] 224-224:
Property 'includeSafe' does not exist on type 'PluginConfig'.

🤖 Prompt for AI Agents
In src/redteam/plugins/openaiGuardrails.ts around line 224, the code reads
this.config?.includeSafe but the PluginConfig type lacks includeSafe causing a
TypeScript error; fix by either (A) add includeSafe?: boolean to the shared
PluginConfig interface definition (update the file where PluginConfig is
declared and export the new optional property), or (B) create a plugin-specific
config type (e.g., OpenAIGuardrailsConfig extends PluginConfig { includeSafe?:
boolean }) and cast/annotate this.config as that type in this plugin so
includeSafe is recognized. Ensure the chosen change is exported/imported where
needed and update any usages or tests accordingly.

Comment on lines +227 to +229
const pluginWithIncludeSafe = new OpenAIGuardrailsPlugin({} as any, 'test', 'input', {
includeSafe: true,
});

⚠️ Potential issue | 🔴 Critical

Fix the includeSafe config typing to restore the build

TypeScript rejects these literals because includeSafe is not part of PluginConfig, which currently makes the test suite fail (see static analysis errors on these lines). Please extend the plugin’s config type to declare includeSafe?: boolean (and reuse it here), or cast to that extended type, so the compiler accepts the property while keeping type safety.

Also applies to: 259-261, 285-287

🧰 Tools
🪛 GitHub Check: Build on Node 22.x

[failure] 228-228:
Object literal may only specify known properties, and 'includeSafe' does not exist in type 'PluginConfig'.

🤖 Prompt for AI Agents
In test/redteam/plugins/openaiGuardrails.test.ts around lines 227-229 (and
similarly at 259-261 and 285-287), the test constructs pass an includeSafe
property but TypeScript fails because PluginConfig doesn't declare it; update
the plugin config type to include includeSafe?: boolean (or create an extended
interface that adds includeSafe and use that type in the test) and update the
test instantiations to use that typed config (or cast the config object to the
extended type) so the compiler accepts the property while preserving type
safety.

- Add includeSafe?: boolean to PluginConfig interface in src/redteam/types.ts
- Regenerate config-schema.json to include new plugin and types
- Fixes TypeScript compilation errors in CI

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

use-tusk bot commented Oct 10, 2025

⏩ Already incorporated tests (d434cbf) View tests ↗


View check history

| Commit | Status | Output | Created (UTC) |
| --- | --- | --- | --- |
| 2042b2d | ⏩ No tests generated | Output | Oct 10, 2025 4:28 PM |
| 6ac553d | ✅ Generated 5 tests - 5 passed | Tests | Oct 10, 2025 4:39 PM |
| d434cbf | ⏩ Already incorporated tests | Output | Oct 13, 2025 6:38 PM |


@MrFlounder MrFlounder left a comment


Should this be one of the exception plugins for strategies?
