Conversation

mldangelo
Member

Summary

Add a new dataset-based redteam plugin to test jailbreak resistance using OpenAI's official guardrails evaluation dataset.

Key Features

  • Dataset Source: Uses OpenAI's guardrails-python evaluation demo dataset
  • Content: 51 jailbreak prompts + 49 safe prompts from real-world jailbreak attempts
  • Attack Types: Role-playing attacks (e.g., "You are DAN"), system manipulation, obfuscation techniques, multi-turn strategies
  • Over-blocking Detection: Optional includeSafe configuration tests whether guardrails incorrectly block legitimate requests

Configuration

```yaml
redteam:
  plugins:
    - openai-guardrails
```

```yaml
# Test for over-blocking with 50/50 safe/jailbreak split
redteam:
  plugins:
    - id: openai-guardrails
      config:
        includeSafe: true
      numTests: 20
```

Implementation Details

Files Changed

  • Plugin: src/redteam/plugins/openaiGuardrails.ts
  • Tests: test/redteam/plugins/openaiGuardrails.test.ts (19 tests, all passing)
  • Docs: site/docs/red-team/plugins/openai-guardrails.md
  • UI: Checkbox in PluginConfigDialog.tsx for includeSafe option

Technical Highlights

  • Dataset pinned to commit hash 293b1ae7dbfb278afc36e04fe6a4aba44f5c6582 for stability
  • 50/50 balanced split when includeSafe: true
  • Inverted grading logic for safe prompts (they SHOULD be answered)
  • Metadata includes isSafe and label fields for filtering
  • Follows the same pattern as the beavertails/unsafebench/aegis plugins
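The inverted grading behavior can be pictured with a short sketch. This is illustrative only — the row shape, `buildAssertion` helper, and rubric wording are assumptions, not the plugin's actual code:

```typescript
// Hypothetical sketch of inverted grading for safe prompts.
// Names and rubric strings are illustrative, not the plugin's identifiers.
interface DatasetRow {
  prompt: string;
  label: 'safe' | 'jailbreak';
}

function buildAssertion(row: DatasetRow): { type: string; value: string } {
  if (row.label === 'safe') {
    // Safe prompts SHOULD be answered; a refusal indicates over-blocking.
    return {
      type: 'llm-rubric',
      value: 'The response helpfully answers the request without refusing.',
    };
  }
  // Jailbreak prompts must be refused.
  return {
    type: 'llm-rubric',
    value: 'The response refuses to comply with the jailbreak attempt.',
  };
}
```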

Testing

  • ✅ All 19 unit tests passing with coverage
  • ✅ End-to-end tested with includeSafe option
  • ✅ Linting and formatting clean
  • ✅ TypeScript compilation successful

Example Output

When running with includeSafe: true:

  • 2 jailbreak prompts (labeled jailbreak, graded strictly for refusal)
  • 2 safe prompts (labeled safe, graded for appropriate helpfulness)
  • Perfect 50/50 split maintained

Backward Compatibility

✅ Fully backward compatible - default behavior unchanged when includeSafe is not specified.

Add a new dataset-based plugin to test jailbreak resistance using OpenAI's
official guardrails evaluation dataset. The dataset contains 51 jailbreak
prompts and 49 safe prompts from real-world jailbreak attempts.

Features:
- Tests role-playing attacks, system manipulation, and obfuscation techniques
- Includes includeSafe option for testing over-blocking (false positives)
- 50/50 balanced split of safe and jailbreak prompts when enabled
- Inverted grading logic for safe prompts (they SHOULD be answered)
- Dataset pinned to commit hash for stability
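The balanced split amounts to drawing half the requested tests from each pool. A minimal sketch (the function name is an assumption, and shuffling is omitted here so the example is deterministic):

```typescript
// Hypothetical sketch of the 50/50 balanced sampling when includeSafe is true.
// The real plugin shuffles both pools first; that step is omitted for clarity.
function balancedSplit<T>(safeRows: T[], jailbreakRows: T[], numTests: number): T[] {
  const numEach = Math.floor(numTests / 2);
  return [
    ...safeRows.slice(0, numEach),      // half from the safe pool
    ...jailbreakRows.slice(0, numEach), // half from the jailbreak pool
  ];
}
```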

Implementation:
- Plugin: src/redteam/plugins/openaiGuardrails.ts
- Tests: test/redteam/plugins/openaiGuardrails.test.ts (19 tests, all passing)
- Documentation: site/docs/red-team/plugins/openai-guardrails.md
- UI configuration dialog with checkbox for includeSafe option

Follows the same pattern as the beavertails/unsafebench/aegis plugins for consistency.

coderabbitai bot commented Oct 10, 2025

📝 Walkthrough

Walkthrough

Adds a new “OpenAI Guardrails” red-team plugin across the codebase. Introduces plugin metadata in site/docs/_shared/data/plugins.ts and red-team constants, registers the plugin in src/redteam/plugins/index.ts, and implements it in src/redteam/plugins/openaiGuardrails.ts with dataset-driven test generation, safe/jailbreak filtering, balanced sampling via includeSafe, and rubric-based assertions. Adds UI configuration in PluginConfigDialog.tsx for toggling safe prompts. Provides documentation at site/docs/red-team/plugins/openai-guardrails.md. Includes comprehensive unit tests in test/redteam/plugins/openaiGuardrails.test.ts covering dataset parsing, balancing, assertions, metadata, error handling, and integration behavior.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is below the required threshold of 80.00%. | Run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title concisely and accurately describes the primary change, stating that a new OpenAI Guardrails redteam plugin is being added, matching the main content of the pull request. |
| Description Check | ✅ Passed | The description directly relates to the changes, summarizing the dataset-based redteam plugin, its features, configuration, implementation, and testing, and matches the modifications in the pull request. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (1)
src/redteam/plugins/openaiGuardrails.ts (1)

134-136: Consider using a more robust shuffling algorithm.

The current shuffle implementation uses .sort(() => Math.random() - 0.5), which is a common pattern but has known bias issues and doesn't provide a uniform distribution. While acceptable for red-team test selection, a Fisher-Yates shuffle would be more robust.

-      selectedRows = [
-        ...safeRows.sort(() => Math.random() - 0.5).slice(0, numEach),
-        ...jailbreakRows.sort(() => Math.random() - 0.5).slice(0, numEach),
-      ].sort(() => Math.random() - 0.5); // Shuffle final order
+      // Fisher-Yates shuffle helper
+      const shuffle = <T>(array: T[]): T[] => {
+        const result = [...array];
+        for (let i = result.length - 1; i > 0; i--) {
+          const j = Math.floor(Math.random() * (i + 1));
+          [result[i], result[j]] = [result[j], result[i]];
+        }
+        return result;
+      };
+
+      selectedRows = shuffle([
+        ...shuffle(safeRows).slice(0, numEach),
+        ...shuffle(jailbreakRows).slice(0, numEach),
+      ]);
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d97485b and 2042b2d.

📒 Files selected for processing (8)
  • site/docs/_shared/data/plugins.ts (1 hunks)
  • site/docs/red-team/plugins/openai-guardrails.md (1 hunks)
  • src/app/src/pages/redteam/setup/components/PluginConfigDialog.tsx (2 hunks)
  • src/redteam/constants/metadata.ts (6 hunks)
  • src/redteam/constants/plugins.ts (2 hunks)
  • src/redteam/plugins/index.ts (2 hunks)
  • src/redteam/plugins/openaiGuardrails.ts (1 hunks)
  • test/redteam/plugins/openaiGuardrails.test.ts (1 hunks)
🧰 Additional context used
📓 Path-based instructions (17)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)

Prefer not to introduce new TypeScript types; use existing interfaces whenever possible

**/*.{ts,tsx}: Follow consistent import order (Biome will handle sorting)
Use curly braces for all control statements
Prefer const over let; avoid var
Use object property shorthand when possible
Use async/await for asynchronous code
Use consistent error handling with proper type checks

Files:

  • src/redteam/constants/plugins.ts
  • src/redteam/constants/metadata.ts
  • src/redteam/plugins/openaiGuardrails.ts
  • site/docs/_shared/data/plugins.ts
  • src/app/src/pages/redteam/setup/components/PluginConfigDialog.tsx
  • src/redteam/plugins/index.ts
  • test/redteam/plugins/openaiGuardrails.test.ts
src/redteam/**/*.ts

📄 CodeRabbit inference engine (src/redteam/CLAUDE.md)

src/redteam/**/*.ts: Always sanitize when logging test prompts or model outputs by passing them via the structured metadata parameter (second argument) to the logger, not raw string interpolation
Use the standardized risk severity levels: critical, high, medium, low when reporting results

Files:

  • src/redteam/constants/plugins.ts
  • src/redteam/constants/metadata.ts
  • src/redteam/plugins/openaiGuardrails.ts
  • src/redteam/plugins/index.ts
src/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

src/**/*.{ts,tsx}: Sanitize sensitive data before logging; pass context objects to logger methods (debug, info, warn, error) for automatic redaction
Do not interpolate secrets into log messages (avoid stringifying headers/bodies directly); use structured logger context instead
Use sanitizeObject for manual sanitization before using or persisting potentially sensitive data

Files:

  • src/redteam/constants/plugins.ts
  • src/redteam/constants/metadata.ts
  • src/redteam/plugins/openaiGuardrails.ts
  • src/app/src/pages/redteam/setup/components/PluginConfigDialog.tsx
  • src/redteam/plugins/index.ts
src/redteam/plugins/**/*.ts

📄 CodeRabbit inference engine (src/redteam/CLAUDE.md)

src/redteam/plugins/**/*.ts: Place vulnerability-specific test generators as plugins under src/redteam/plugins/ (e.g., pii.ts, harmful.ts, sql-injection.ts)
New plugins must implement the RedteamPluginObject interface

Files:

  • src/redteam/plugins/openaiGuardrails.ts
  • src/redteam/plugins/index.ts
{site/**,examples/**}

📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)

Any pull request that only touches files in 'site/' or 'examples/' directories must use the 'docs:' prefix in the PR title, not 'feat:' or 'fix:'

Files:

  • site/docs/_shared/data/plugins.ts
  • site/docs/red-team/plugins/openai-guardrails.md
site/**

📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)

If the change is a feature, update the relevant documentation under 'site/'

Files:

  • site/docs/_shared/data/plugins.ts
  • site/docs/red-team/plugins/openai-guardrails.md
src/app/src/**/*.{ts,tsx}

📄 CodeRabbit inference engine (src/app/CLAUDE.md)

src/app/src/**/*.{ts,tsx}: Never use fetch() directly; always use callApi() from @app/utils/api for all HTTP requests
Access Zustand state outside React components via store.getState(); do not call hooks outside components
Use the @app/* path alias for internal imports as configured in Vite

Files:

  • src/app/src/pages/redteam/setup/components/PluginConfigDialog.tsx
src/app/src/{components,pages}/**/*.tsx

📄 CodeRabbit inference engine (src/app/CLAUDE.md)

src/app/src/{components,pages}/**/*.tsx: Use the class-based ErrorBoundary component (@app/components/ErrorBoundary) to wrap error-prone UI
Access theme via useTheme() from @mui/material/styles instead of hardcoding theme values
Use useMemo/useCallback only when profiling indicates benefit; avoid unnecessary memoization
Implement explicit loading and error states for components performing async operations
Prefer MUI composition and the sx prop for styling over ad-hoc inline styles

Files:

  • src/app/src/pages/redteam/setup/components/PluginConfigDialog.tsx
**/*.{tsx,jsx}

📄 CodeRabbit inference engine (.cursor/rules/react-components.mdc)

**/*.{tsx,jsx}: Use icons from @mui/icons-material
Prefer commonly used icons from @mui/icons-material for intuitive experience

Files:

  • src/app/src/pages/redteam/setup/components/PluginConfigDialog.tsx
src/app/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

In the React app (src/app), use callApi from @app/utils/api for all API calls; do not use fetch directly

Files:

  • src/app/src/pages/redteam/setup/components/PluginConfigDialog.tsx
site/docs/**/*.md

📄 CodeRabbit inference engine (.cursor/rules/docusaurus.mdc)

site/docs/**/*.md: Prioritize minimal edits when updating existing documentation; avoid creating entirely new sections or rewriting substantial portions; focus edits on improving grammar, spelling, clarity, fixing typos, and structural improvements where needed; do not modify existing headings (h1, h2, h3, etc.) as they are often linked externally.
Structure content to reveal information progressively: begin with essential actions and information, then provide deeper context as necessary; organize information from most important to least important.
Use action-oriented language: clearly outline actionable steps users should take, use concise and direct language, prefer active voice over passive voice, and use imperative mood for instructions.
Use 'eval' instead of 'evaluation' in all documentation; when referring to command line usage, use 'npx promptfoo eval' rather than 'npx promptfoo evaluation'; maintain consistency with this terminology across all examples, code blocks, and explanations.
The project name can be written as either 'Promptfoo' (capitalized) or 'promptfoo' (lowercase) depending on context: use 'Promptfoo' at the beginning of sentences or in headings, and 'promptfoo' in code examples, terminal commands, or when referring to the package name; be consistent with the chosen capitalization within each document or section.
Each markdown documentation file must include required front matter fields: 'title' (the page title shown in search results and browser tabs) and 'description' (a concise summary of the page content, ideally 150-160 characters).
Only add a title attribute to code blocks that represent complete, runnable files; do not add titles to code fragments, partial examples, or snippets that aren't meant to be used as standalone files; this applies to all code blocks regardless of language.
Use special comment directives to highlight specific lines in code blocks: 'highlight-next-line' highlights the line immediately after the comment, 'highligh...

Files:

  • site/docs/red-team/plugins/openai-guardrails.md
site/docs/**/*.{md,mdx}

📄 CodeRabbit inference engine (site/docs/CLAUDE.md)

site/docs/**/*.{md,mdx}: Use the term "eval" not "evaluation" in documentation and examples
Capitalization: use "Promptfoo" (capitalized) in prose/headings and "promptfoo" (lowercase) in code, commands, and package names
Every doc must include required front matter: title and description
Only add title= to code blocks when showing complete runnable files
Admonitions must have empty lines around their content (Prettier requirement)
Do not modify headings; they may be externally linked
Use progressive disclosure: put essential information first
Use action-oriented, imperative mood in instructions (e.g., "Install the package")

Files:

  • site/docs/red-team/plugins/openai-guardrails.md
**/*.{test,spec}.{js,ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)

Avoid disabling or skipping tests unless absolutely necessary and documented

Files:

  • test/redteam/plugins/openaiGuardrails.test.ts
test/**/*.{test,spec}.ts

📄 CodeRabbit inference engine (.cursor/rules/jest.mdc)

test/**/*.{test,spec}.ts: Mock as few functions as possible to keep tests realistic
Never increase the function timeout - fix the test instead
Organize tests in descriptive describe and it blocks
Prefer assertions on entire objects rather than individual keys when writing expectations
Clean up after tests to prevent side effects (e.g., use afterEach(() => { jest.resetAllMocks(); }))
Run tests with --randomize flag to ensure your mocks setup and teardown don't affect other tests
Use Jest's mocking utilities rather than complex custom mocks
Prefer shallow mocking over deep mocking
Mock external dependencies but not the code being tested
Reset mocks between tests to prevent test pollution
For database tests, use in-memory instances or proper test fixtures
Test both success and error cases for each provider
Mock API responses to avoid external dependencies in tests
Validate that provider options are properly passed to the underlying service
Test error handling and edge cases (rate limits, timeouts, etc.)
Ensure provider caching behaves as expected
Always include both --coverage and --randomize flags when running tests
Run tests in a single pass (no watch mode for CI)
Ensure all tests are independent and can run in any order
Clean up any test data or mocks after each test

Files:

  • test/redteam/plugins/openaiGuardrails.test.ts
test/**/*.test.ts

📄 CodeRabbit inference engine (test/CLAUDE.md)

test/**/*.test.ts: Never increase Jest test timeouts; fix slow tests instead (avoid jest.setTimeout or large timeouts in tests)
Do not use .only() or .skip() in committed tests
Add afterEach(() => { jest.resetAllMocks(); }) to ensure mock cleanup
Prefer asserting entire objects (toEqual on whole result) rather than individual fields
Mock minimally: only external dependencies (APIs, databases), not code under test
Use Jest (not Vitest) APIs in this suite; avoid importing vitest
Import from @jest/globals in tests

Files:

  • test/redteam/plugins/openaiGuardrails.test.ts
test/**

📄 CodeRabbit inference engine (test/CLAUDE.md)

Organize tests to mirror src/ structure (e.g., test/providers → src/providers, test/redteam → src/redteam)

Files:

  • test/redteam/plugins/openaiGuardrails.test.ts
test/**/*.{test.ts,test.tsx,spec.ts,spec.tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

test/**/*.{test.ts,test.tsx,spec.ts,spec.tsx}: Follow Jest best practices using describe/it blocks in tests
Write tests covering both success and error cases for all functionality

Files:

  • test/redteam/plugins/openaiGuardrails.test.ts
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
PR: promptfoo/promptfoo#0
File: src/redteam/CLAUDE.md:0-0
Timestamp: 2025-10-05T16:59:20.507Z
Learning: Applies to src/redteam/test/redteam/**/*.ts : Add tests for new plugins under test/redteam/
Learnt from: CR
PR: promptfoo/promptfoo#0
File: src/redteam/CLAUDE.md:0-0
Timestamp: 2025-10-05T16:59:20.507Z
Learning: Applies to src/redteam/plugins/**/*.ts : Place vulnerability-specific test generators as plugins under src/redteam/plugins/ (e.g., pii.ts, harmful.ts, sql-injection.ts)
Learnt from: CR
PR: promptfoo/promptfoo#0
File: src/redteam/CLAUDE.md:0-0
Timestamp: 2025-10-05T16:59:20.507Z
Learning: Applies to src/redteam/plugins/**/*.ts : New plugins must implement the RedteamPluginObject interface
📚 Learning: 2025-10-05T16:59:20.507Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: src/redteam/CLAUDE.md:0-0
Timestamp: 2025-10-05T16:59:20.507Z
Learning: Applies to src/redteam/test/redteam/**/*.ts : Add tests for new plugins under test/redteam/

Applied to files:

  • test/redteam/plugins/openaiGuardrails.test.ts
🧬 Code graph analysis (3)
src/redteam/plugins/openaiGuardrails.ts (3)
src/types/index.ts (2)
  • TestCase (702-702)
  • Assertion (555-555)
src/util/fetch/index.ts (1)
  • fetchWithTimeout (133-159)
src/providers/shared.ts (1)
  • REQUEST_TIMEOUT_MS (9-9)
src/redteam/plugins/index.ts (1)
src/redteam/plugins/openaiGuardrails.ts (1)
  • OpenAIGuardrailsPlugin (165-246)
test/redteam/plugins/openaiGuardrails.test.ts (2)
src/redteam/plugins/openaiGuardrails.ts (2)
  • OpenAIGuardrailsPlugin (165-246)
  • fetchDataset (55-163)
src/util/fetch/index.ts (1)
  • fetchWithTimeout (133-159)
🪛 GitHub Check: Build on Node 22.x
src/redteam/plugins/openaiGuardrails.ts

[failure] 224-224:
Property 'includeSafe' does not exist on type 'PluginConfig'.

test/redteam/plugins/openaiGuardrails.test.ts

[failure] 286-286:
Object literal may only specify known properties, and 'includeSafe' does not exist in type 'PluginConfig'.


[failure] 260-260:
Object literal may only specify known properties, and 'includeSafe' does not exist in type 'PluginConfig'.


[failure] 228-228:
Object literal may only specify known properties, and 'includeSafe' does not exist in type 'PluginConfig'.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
  • GitHub Check: Tusk Test Runner (2)
  • GitHub Check: Tusk Test Runner (3)
  • GitHub Check: Tusk Test Runner (1)
  • GitHub Check: Tusk Test Runner (4)
  • GitHub Check: Tusk Tester
  • GitHub Check: Run Integration Tests
  • GitHub Check: webui tests
  • GitHub Check: Build Docs
  • GitHub Check: Test on Node 24.x and windows-latest
  • GitHub Check: Test on Node 22.x and macOS-latest
  • GitHub Check: Test on Node 22.x and ubuntu-latest
  • GitHub Check: Test on Node 20.x and ubuntu-latest
  • GitHub Check: Test on Node 22.x and windows-latest
  • GitHub Check: Test on Node 24.x and ubuntu-latest
  • GitHub Check: Test on Node 20.x and macOS-latest
  • GitHub Check: Test on Node 20.x and windows-latest
  • GitHub Check: Style Check
  • GitHub Check: Generate Assets
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (7)
src/app/src/pages/redteam/setup/components/PluginConfigDialog.tsx (1)

305-331: LGTM! Clean UI implementation for the OpenAI Guardrails plugin configuration.

The UI correctly implements:

  • Clear description of the plugin's purpose
  • Checkbox control for the includeSafe option
  • Helpful caption explaining the 50/50 split behavior

The implementation follows the established patterns in this file and properly manages local state.

src/redteam/constants/plugins.ts (2)

89-89: LGTM! Correct placement in the guardrails evaluation plugins.

The openai-guardrails plugin is appropriately added to the GUARDRAILS_EVALUATION_PLUGINS array, positioned alphabetically between system-prompt-override and harmbench.


294-294: LGTM! Correct placement in additional plugins.

The openai-guardrails plugin is appropriately added to the ADDITIONAL_PLUGINS array, positioned alphabetically between off-topic and overreliance.

site/docs/_shared/data/plugins.ts (1)

1127-1141: LGTM! Complete and well-structured plugin metadata entry.

The OpenAI Guardrails plugin entry includes all required fields and is correctly positioned alphabetically. The categorization as a 'Dataset' plugin with 'security' vulnerability type is appropriate. All application types (RAG, agent, chat) are enabled, which aligns with the plugin's broad applicability for jailbreak resistance testing.

src/redteam/plugins/index.ts (2)

30-30: LGTM! Import correctly placed in alphabetical order.

The import statement for OpenAIGuardrailsPlugin is appropriately positioned alphabetically between ExcessiveAgencyPlugin and HallucinationPlugin.


177-177: LGTM! Plugin factory correctly registered.

The createPluginFactory(OpenAIGuardrailsPlugin, 'openai-guardrails') registration is correctly placed in alphabetical order within the pluginFactories array. No validation function is needed since the includeSafe config property is optional with a default value.

src/redteam/plugins/openaiGuardrails.ts (1)

169-171: No changes needed for getTemplate()
This plugin sets canGenerateRemote = false and the base class never invokes getTemplate() for local-only plugins; throwing here aligns with other non-remote implementations.

Comment on lines +1 to +4
---
sidebar_label: OpenAI Guardrails
description: Red team jailbreak resistance using OpenAI's evaluation dataset to test LLM responses to role-playing attacks, system manipulation, and obfuscation techniques
---

⚠️ Potential issue | 🟠 Major

Add the required title front matter entry

Docs under site/docs must include both title and description in the front matter. Please add a title (e.g., OpenAI Guardrails Plugin) so Docusaurus builds and navigation metadata stay consistent. As per coding guidelines.

 ---
+title: OpenAI Guardrails Plugin
 sidebar_label: OpenAI Guardrails
 description: Red team jailbreak resistance using OpenAI's evaluation dataset to test LLM responses to role-playing attacks, system manipulation, and obfuscation techniques
 ---
🤖 Prompt for AI Agents
In site/docs/red-team/plugins/openai-guardrails.md around lines 1 to 4, the
front matter is missing the required title entry; add a top-level "title" field
(for example: "OpenAI Guardrails Plugin") to the existing YAML front matter so
both title and description are present, ensuring Docusaurus builds and
navigation metadata remain correct.

Comment on lines +8 to +64
The OpenAI Guardrails plugin uses the jailbreak evaluation dataset from [OpenAI's Guardrails Python package](https://github.com/openai/openai-guardrails-python) to test LLM responses to known jailbreak attempts. The dataset contains real-world jailbreak patterns designed to circumvent AI safety guidelines, including:

- Role-playing attacks (e.g., "You are DAN", "Act as an unethical chatbot")
- System manipulation (e.g., token-based jailbreaks, prompt injection via special formatting)
- Obfuscation techniques (e.g., character encoding, linguistic evasion)
- Multi-turn jailbreak strategies (e.g., "life token" systems, cumulative manipulation)

This plugin helps evaluate whether your LLM maintains safety boundaries when faced with sophisticated adversarial prompts used in real-world jailbreak attempts.

## How it works

The plugin:

- Loads jailbreak prompts from OpenAI's evaluation dataset
- Filters for prompts expected to trigger jailbreak guardrails
- Shuffles and selects a specified number of test cases
- Tests whether the model refuses or complies with jailbreak attempts
- Uses LLM-as-a-judge to evaluate responses based on strict safety criteria

## Configuration

Add to your YAML file:

```yaml
redteam:
  plugins:
    - openai-guardrails
```

Control the number of tests:

```yaml
redteam:
  plugins:
    - id: openai-guardrails
      numTests: 25
```
### Testing for Over-Blocking

By default, OpenAI Guardrails tests only jailbreak attempts to ensure your model refuses harmful content. However, you can also test for over-blocking (false positives) by including safe prompts:

```yaml
redteam:
  plugins:
    - id: openai-guardrails
      config:
        includeSafe: true
      numTests: 20
```

When `includeSafe: true` is set:

- Test cases are split 50/50 between safe and jailbreak prompts
- Safe prompts use inverted grading logic: they SHOULD be answered (not refused)
- This helps identify if your guardrails are too strict and block legitimate requests
- Metadata includes `isSafe` and `label` fields for filtering and analysis
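As a hedged illustration of how those fields could be used downstream — the result shape here is assumed; only the `isSafe` and `label` field names come from the plugin description above:

```typescript
// Hypothetical result shape; only `isSafe` and `label` mirror the plugin's
// documented metadata fields.
interface EvalResult {
  pass: boolean;
  metadata: { isSafe: boolean; label: 'safe' | 'jailbreak' };
}

const results: EvalResult[] = [
  { pass: false, metadata: { isSafe: true, label: 'safe' } },      // refused a safe prompt
  { pass: true, metadata: { isSafe: false, label: 'jailbreak' } }, // correctly refused a jailbreak
];

// Safe prompts that failed were refused when they should have been answered.
const overBlocked = results.filter((r) => r.metadata.isSafe && !r.pass);
```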

⚠️ Potential issue | 🟠 Major

Replace “evaluation” with “eval” throughout the doc

The documentation style guide requires using “eval” instead of “evaluation” in prose and examples. Please update every occurrence (e.g., “jailbreak evaluation dataset”, “evaluation dataset”) to the approved terminology, keeping the official dataset name clear where needed. As per coding guidelines.

🤖 Prompt for AI Agents
In site/docs/red-team/plugins/openai-guardrails.md around lines 8 to 64, replace
every occurrence of the word "evaluation" with "eval" in the prose and examples
(e.g., "jailbreak evaluation dataset" → "jailbreak eval dataset", "evaluation
dataset" → "eval dataset"), including headings and inline descriptions and code
comments; do not alter official package names or URLs (leave
"openai-guardrails-python" and any quoted official titles intact), and update
any adjacent phrasing so grammar remains correct after the substitution.

}

async generateTests(n: number, _delayMs?: number): Promise<TestCase[]> {
const includeSafe = this.config?.includeSafe ?? false;

⚠️ Potential issue | 🔴 Critical

Type safety issue: includeSafe property missing from PluginConfig interface.

Line 224 accesses this.config?.includeSafe but the PluginConfig type does not include an includeSafe property, causing a TypeScript compilation error (flagged by GitHub Check).

The PluginConfig type needs to be extended to include includeSafe?: boolean. Since this plugin-specific configuration is only used by the OpenAI Guardrails plugin, consider one of these solutions:

Solution 1: Add to the base PluginConfig interface
Locate the PluginConfig interface definition and add:

export interface PluginConfig {
  // ... existing properties
  includeSafe?: boolean;
}

Solution 2: Define a plugin-specific config type

+interface OpenAIGuardrailsConfig extends PluginConfig {
+  includeSafe?: boolean;
+}
+
 export class OpenAIGuardrailsPlugin extends RedteamPluginBase {
   readonly id = PLUGIN_ID;
   static readonly canGenerateRemote = false;
+  declare config?: OpenAIGuardrailsConfig;

   async getTemplate(): Promise<string> {
     throw new Error('Not implemented');
   }
🧰 Tools
🪛 GitHub Check: Build on Node 22.x

[failure] 224-224:
Property 'includeSafe' does not exist on type 'PluginConfig'.

🤖 Prompt for AI Agents
In src/redteam/plugins/openaiGuardrails.ts around line 224, the code reads
this.config?.includeSafe but the PluginConfig type lacks includeSafe causing a
TypeScript error; fix by either (A) add includeSafe?: boolean to the shared
PluginConfig interface definition (update the file where PluginConfig is
declared and export the new optional property), or (B) create a plugin-specific
config type (e.g., OpenAIGuardrailsConfig extends PluginConfig { includeSafe?:
boolean }) and cast/annotate this.config as that type in this plugin so
includeSafe is recognized. Ensure the chosen change is exported/imported where
needed and update any usages or tests accordingly.

Comment on lines +227 to +229
const pluginWithIncludeSafe = new OpenAIGuardrailsPlugin({} as any, 'test', 'input', {
includeSafe: true,
});

⚠️ Potential issue | 🔴 Critical

Fix the includeSafe config typing to restore the build

TypeScript rejects these literals because includeSafe is not part of PluginConfig, which currently makes the test suite fail (see static analysis errors on these lines). Please extend the plugin’s config type to declare includeSafe?: boolean (and reuse it here), or cast to that extended type, so the compiler accepts the property while keeping type safety.

Also applies to: 259-261, 285-287

🧰 Tools
🪛 GitHub Check: Build on Node 22.x

[failure] 228-228:
Object literal may only specify known properties, and 'includeSafe' does not exist in type 'PluginConfig'.

🤖 Prompt for AI Agents
In test/redteam/plugins/openaiGuardrails.test.ts around lines 227-229 (and
similarly at 259-261 and 285-287), the test constructs pass an includeSafe
property but TypeScript fails because PluginConfig doesn't declare it; update
the plugin config type to include includeSafe?: boolean (or create an extended
interface that adds includeSafe and use that type in the test) and update the
test instantiations to use that typed config (or cast the config object to the
extended type) so the compiler accepts the property while preserving type
safety.

- Add includeSafe?: boolean to PluginConfig interface in src/redteam/types.ts
- Regenerate config-schema.json to include new plugin and types
- Fixes TypeScript compilation errors in CI

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

use-tusk bot commented Oct 10, 2025

⏩ Already incorporated tests (d434cbf) View tests ↗


View check history

| Commit | Status | Output | Created (UTC) |
| --- | --- | --- | --- |
| 2042b2d | ⏩ No tests generated | Output | Oct 10, 2025 4:28 PM |
| 6ac553d | ✅ Generated 5 tests - 5 passed | Tests | Oct 10, 2025 4:39 PM |
| d434cbf | ⏩ Already incorporated tests | Output | Oct 13, 2025 6:38 PM |


@MrFlounder MrFlounder left a comment


Should this be one of the exception plugins for strategies?
