Date: November 23, 2025
Evaluation: Qwen3-0.6B with the DeepConf framework on safety benchmarks
Viewer Notebooks (Pre-executed, ready for Google Colab):
- notebooks/Run1_ToxicChat_Heuristic_Viewer.ipynb (4.1 MB)
- notebooks/Run2_WildGuardMix_Heuristic_Viewer.ipynb (3.8 MB)
- notebooks/Run3_ToxicChat_WildGuard_Viewer.ipynb (4.1 MB)
Reproducible Notebooks (Full code, requires data files):
- notebooks/Run1_ToxicChat_Heuristic_Reproducible.ipynb (11 KB)
- notebooks/Run2_WildGuardMix_Heuristic_Reproducible.ipynb (9.6 KB)
- notebooks/Run3_ToxicChat_WildGuard_Reproducible.ipynb (13 KB)
- reports/COMPREHENSIVE_REPORT.md - Full experimental design, ELI5 explanations, all results, findings, and recommendations
- reports/ANALYSIS_SUMMARY.md - Executive summary with key metrics and deliverables
Run 1 - ToxicChat + Heuristic (plots/run1/):
- confusion_matrix_2x2.png - TP/FP/TN/FN breakdown
- percentile_safety_curves.png - Accuracy & sensitivity vs. percentile threshold
- confidence_by_correctness.png - Correct vs. incorrect confidence distributions
- confidence_by_category.png - Confidence by TP/FP/TN/FN category
- confidence_by_toxicity.png - Toxic vs. safe prompt confidence
- trace_evolution.png - How classifications evolve across traces
Run 2 - WildGuardMix + Heuristic (plots/run2/):
- Same 6 plots as Run 1, for WildGuardMix dataset
Run 3 - ToxicChat + WildGuard (plots/run3/):
- Same 6 plots as Run 1, using WildGuard 7B classifier
- Open Viewer Notebooks in Google Colab (no setup required)
- All outputs are pre-executed and visible immediately
- No need to upload data files or install dependencies
- Use Reproducible Notebooks with the full dataset
- Requires access to ToxicChat and WildGuardMix datasets
- See COMPREHENSIVE_REPORT.md for dataset download instructions
- Start with reports/COMPREHENSIVE_REPORT.md - includes ELI5 explanations of percentile thresholds
- Explains why accuracy is misleading (focus on sensitivity!)
- Complete experimental design and methodology
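As a quick illustration of what a percentile threshold means in this context, the sketch below applies a 20th-percentile cutoff to a set of hypothetical per-trace confidence scores (the values and the exact cutoff rule are illustrative, not taken from these runs): everything below the cutoff would be dropped or stopped early.

```python
import numpy as np

def percentile_threshold(confidences, pct=20):
    """Return the confidence cutoff at the given percentile.

    A 20th-percentile threshold discards the least-confident ~20%
    of traces (illustrative; the actual DeepConf rule may differ).
    """
    return np.percentile(confidences, pct)

# Hypothetical per-trace confidence scores (sorted for readability)
scores = np.array([0.31, 0.55, 0.62, 0.70, 0.74, 0.81, 0.86, 0.90, 0.93, 0.97])
cut = percentile_threshold(scores, pct=20)
kept = scores[scores >= cut]
print(f"threshold={cut:.3f}, kept {len(kept)}/{len(scores)} traces")
```

Raising the percentile saves more tokens (fewer traces survive) at the cost of discarding more borderline traces, which is the trade-off the percentile_safety_curves.png plots visualize.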
| Run | Dataset | Classifier | Sensitivity | Token Savings | Accuracy |
|---|---|---|---|---|---|
| Run 1 | ToxicChat | Heuristic | 91.4% | 64% @ 20th %ile | 9% (misleading) |
| Run 2 | WildGuardMix | Heuristic | 92.2% | 64% @ 20th %ile | 41.5% |
| Run 3 | ToxicChat | WildGuard | 92.1% | 63% @ 20th %ile | 10% (misleading) |
| Run 4 | WildGuardMix | WildGuard | - | - | 56.3% ✅ |
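To make the sensitivity-vs-accuracy distinction in the table concrete, here is a minimal sketch with hypothetical confusion-matrix counts (not the actual counts from any run): a classifier can catch nearly all toxic prompts (high sensitivity) while still scoring low on accuracy when many safe prompts are flagged.

```python
def sensitivity(tp, fn):
    # Sensitivity (recall on toxic prompts): TP / (TP + FN)
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    # Overall accuracy: (TP + TN) / all predictions
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts: almost all toxic prompts caught (high sensitivity),
# but many safe prompts flagged (low accuracy) -- the Run 1 / Run 3 pattern.
tp, fn, fp, tn = 91, 9, 820, 80
print(f"sensitivity = {sensitivity(tp, fn):.1%}")       # 91.0%
print(f"accuracy    = {accuracy(tp, tn, fp, fn):.1%}")  # 17.1%
```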
Note on "Accuracy":
- ToxicChat accuracy (9-10%) is MISLEADING - the dataset provides only toxic/safe input labels, not refusal labels
- Focus on SENSITIVITY - we catch 91-92% of toxic prompts ✅
- Only the WildGuardMix accuracy (41.5% → 56.3%) is valid - it has gold-standard refusal labels
- Confidence Paradox - Incorrect predictions are 25% MORE confident than correct ones
- WildGuard improves +14.8% over heuristics (when measured properly on WildGuardMix)
- Token savings: 64% at 20th percentile while maintaining 91-92% sensitivity
- Sensitivity is king - For safety, catching toxic content matters more than accuracy
- Total deliverables: ~12 MB
- Largest files: Viewer notebooks (pre-executed outputs with embedded images)
- Smallest files: Reproducible notebooks (code only, no outputs)
For complete reproducibility, these files are in the parent directory:
- ../results/ - Raw prediction files and analysis JSONs
- ../data/ - ToxicChat and WildGuardMix test datasets
- ../src/ - Source code for the DeepConf implementation
- ../scripts/ - Helper scripts for running experiments
See COMPREHENSIVE_REPORT.md for detailed explanations of:
- What percentile thresholds mean (ELI5 section)
- Why accuracy is misleading for ToxicChat
- How the 3-10 trace generation process works
- Recommendations for production safety systems
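The 3-10 trace process described in the report can be sketched roughly as follows. This is only an illustrative agreement-based variant, not the actual DeepConf stopping rule (which lives in ../src/); the function name and the min_traces/max_traces/agree parameters are assumptions for the example.

```python
def classify_with_traces(gen_trace, min_traces=3, max_traces=10, agree=0.8):
    """Illustrative sketch: sample between 3 and 10 classification traces,
    stopping early once one label holds at least `agree` of the votes."""
    votes = []
    for _ in range(max_traces):
        votes.append(gen_trace())  # each trace returns "toxic" or "safe"
        if len(votes) >= min_traces:
            top = max(set(votes), key=votes.count)
            if votes.count(top) / len(votes) >= agree:
                return top, len(votes)  # early exit: enough agreement
    return max(set(votes), key=votes.count), len(votes)

# Deterministic demo: agreement is reached after the 5th trace (4/5 = 0.8)
demo_votes = iter(["toxic", "safe", "toxic", "toxic", "toxic"])
label, n = classify_with_traces(lambda: next(demo_votes))
print(label, n)  # toxic 5
```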