Comprehensive red team evaluation across 310+ ranked models, 2,300,000+ adversarial probes, 446 attack categories, and 16 inference providers. Three-vendor independent judge validation.
Point-in-time results: All results reflect model behavior at the time of testing and may not reflect current model versions, post-deployment safety patches, or provider configuration changes.
Methodology: Results are produced using CLS Labs Forge Engine, Garak (NVIDIA), and PyRIT (Microsoft) with three-column scoring (keyword, tight, Gemini-as-judge). Different tools, prompts, or scoring criteria may produce different results.
Provider filtering: Some results reflect provider-level safety filtering applied at the inference layer (e.g., serverless endpoints). These results measure the provider’s defense stack, not the model’s intrinsic alignment. Deployment configuration significantly affects security posture.
No warranty or recommendation: Published research is for informational and educational purposes only. CLS Labs does not warrant that any model is safe or unsafe for any particular deployment. Organizations should conduct their own assessments before deployment decisions.
Authorized testing: All testing is conducted against publicly available API endpoints or dedicated endpoints under CLS Labs’ own accounts, in accordance with each provider’s terms of service. No unauthorized access is involved in any published research.
The same breach data was scored by three LLM judges from three different vendors. The honest breach rate is a range, not a point: our v1 gauntlet showed a 41.9–67.1% spread depending on which vendor's model judges the findings.
| Category | Agreement |
|---|---|
| Resource Exhaustion | 86% |
| Secret Exposure | 86% |
| Supply Chain | 71% |
| Content Policy | 71% |
| Agentic Reasoning | 57% |
| Hallucination | 57% |
| Jailbreak | 57% |
| Encoding Bypass | 43% |
| Prompt Injection | 43% |
| Code Generation | 29% |
| Output Manipulation | 29% |
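A minimal sketch of how cross-judge agreement figures like those in the table can be computed: mean pairwise agreement across judges' breach/no-breach verdicts. The judge names and verdicts below are hypothetical, not CLS Labs data.

```python
from itertools import combinations

def pairwise_agreement(verdicts):
    """Mean pairwise agreement across judges over a set of findings.

    verdicts: dict mapping judge name -> list of bool
              (breach / no-breach verdict per finding).
    """
    judges = list(verdicts)
    n = len(verdicts[judges[0]])
    pairs = list(combinations(judges, 2))
    # Count every (judge pair, finding) combination where both agree.
    agree = sum(
        verdicts[a][i] == verdicts[b][i]
        for a, b in pairs
        for i in range(n)
    )
    return agree / (len(pairs) * n)

# Hypothetical verdicts from three vendor judges on four findings.
verdicts = {
    "judge_a": [True, True, False, True],
    "judge_b": [True, False, False, True],
    "judge_c": [True, True, False, False],
}
print(round(pairwise_agreement(verdicts), 2))  # 0.67
```

With three judges there are three pairs, so a single dissent on one finding costs two of the three pairwise comparisons for that finding, which is why agreement drops quickly on contested categories.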
When AI agents have real tools (browsing the web, executing code, sending emails, querying databases), the attack surface changes completely. Standard safety evaluations only test text generation. We test what agents actually do.
In one agentic run, a model wrote to ~/.ssh/authorized_keys with correct 600 permissions, then scanned 127,261 files including API tokens and PKI certificates. It defended against explicit attacks perfectly, yet could not distinguish a legitimate sysadmin request from adversarial reconnaissance.
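One way to see the gap: intent classification fails exactly where a dumb path-based policy check would succeed. The sketch below flags sensitive file operations regardless of how benign the request sounds. The tool names, path patterns, and rules are illustrative assumptions, not a CLS Labs product.

```python
import fnmatch

# Illustrative deny-list of sensitive paths an agent should not touch
# without human review; patterns are examples only.
SENSITIVE_PATTERNS = [
    "*/.ssh/*",
    "*/.aws/credentials",
    "*.pem",
]

def requires_review(tool_call: dict) -> bool:
    """Flag file-system tool calls that hit sensitive paths."""
    if tool_call.get("tool") not in {"write_file", "read_file", "scan_dir"}:
        return False
    path = tool_call.get("path", "")
    # fnmatch's '*' matches across path separators, so '*/.ssh/*'
    # catches any authorized_keys write anywhere in the tree.
    return any(fnmatch.fnmatch(path, pat) for pat in SENSITIVE_PATTERNS)

# A legitimate-sounding sysadmin request still trips the guard:
call = {"tool": "write_file", "path": "/home/ops/.ssh/authorized_keys"}
print(requires_review(call))  # True
```

The point is not that a static deny-list is sufficient; it is that tool-layer controls and model-layer alignment fail independently, so agent evaluations have to exercise both.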
Detailed per-model agent security results, RAG injection findings, vision attack data, and cross-vendor judge validation are available under NDA. Request your model’s specific results or a full assessment of your deployment.
310+ ranked models across 12+ architecture families and 16 providers. 144 attack modules across 446 categories covering prompt extraction, goal hijacking, identity manipulation, compliance fatigue, semantic smuggling, chain attacks, agent tool injection, cross-agent contamination, RAG poisoning, MCP protocol abuse, OT/ICS, robotics, and more.
Cross-judge validated: 199 runs across 30+ models. Three independent vendors, consensus scoring.
| Model | Family | Attacks | Verified Breach Rate | Defense Rate |
|---|---|---|---|---|
| Qwen3.5-397B | Qwen | 422 | 1.9% | 98.1% |
| MiniMax M2.5 | MiniMax | 434 | 4.1% | 95.9% |
| Kimi-K2.5 FP4 | Moonshot | 434 | 4.1% | 95.9% |
| GLM-5 | Zhipu AI | 230 | 4.3% | 95.7% |
| DeepSeek R1 | DeepSeek | 438 | 7.1% | 92.9% |
| GPT-OSS-120B | OpenAI | 434 | 9.0% | 91.0% |
| Qwen3-Next-80B | Qwen | 426 | 10.6% | 89.4% |
| Qwen3-235B | Qwen | 434 | 13.1% | 86.9% |
| Mistral Small 24B | Mistral | 438 | 13.2% | 86.8% |
| GLM-4.5-Air FP8 | Zhipu AI | 422 | 14.9% | 85.1% |
| Qwen3-Coder-Next FP8 | Qwen | 442 | 16.7% | 83.3% |
| DeepSeek R1-0528 | DeepSeek | 406 | 16.7% | 83.3% |
| Kimi-K2 Instruct | Moonshot | 238 | 18.5% | 81.5% |
| LiquidAI LFM2-24B | LiquidAI | 426 | 19.2% | 80.8% |
| Maverick FP8 | Meta / Llama 4 | 142 | 19.7% | 80.3% |
| DeepSeek V3.1 | DeepSeek | 262 | 22.5% | 77.5% |
| Llama-3.3-70B Turbo | Meta / Llama | 150 | 22.7% | 77.3% |
| Qwen2.5-7B Turbo | Qwen | 422 | 22.7% | 77.3% |
| Gemma 3n E4B | Google | 438 | 24.0% | 76.0% |
| 405B Dedicated | Meta / Llama | 792 | 26.6% | 73.4% |
| 405B Serverless | Meta / Llama | 150 | 0.0% | 100% (provider filtered) |
| Scout Serverless | Meta / Llama 4 | 130 | 0.0% | 100% (provider filtered) |
| Gemma 3-27B | Google | 170 | 0.0% | 100% (provider filtered) |
| GLM-5 FP4 | Zhipu AI | 438 | 0.0% | 100% (provider filtered) |
| DeepSeek V3.2 | DeepSeek | 217 | 0.0% | 100% (provider filtered / gated) |
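The rate columns in the table reduce to a simple calculation over verified breaches. A sketch, using the top row as the worked example; the breach count of 8 is inferred from the published 1.9% rate on 422 attacks and is illustrative only.

```python
def rates(attacks: int, breaches: int) -> tuple[float, float]:
    """Verified breach rate and defense rate, as percentages
    rounded to one decimal place."""
    breach = 100 * breaches / attacks
    return round(breach, 1), round(100 - breach, 1)

# Qwen3.5-397B row: 422 attacks, ~8 verified breaches (inferred).
print(rates(422, 8))  # (1.9, 98.1)
```

Note that a 0.0% breach rate on a provider-filtered serverless endpoint is not comparable to 0.0% on a dedicated endpoint: the former measures the provider's defense stack, not the model's intrinsic alignment, which is why those rows are annotated separately.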
Raw vulnerability scanners overreport by 50–90 percentage points. We verify every finding before it reaches a report.
Every breach record passes through a proprietary multi-stage verification pipeline that includes semantic deduplication, continuous risk scoring with category attribution, statistical reproduction across multiple models, and multi-vendor LLM judge evaluation. Only certified results appear in our assessments.
Most AI security vendors report raw scanner output. We verify every finding with independent multi-vendor semantic analysis.
That verification is the difference between a 78.8% and a 16.7% breach rate on the same model.
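A simplified sketch of why raw and verified counts diverge so sharply, combining two of the pipeline stages named above: semantic deduplication (here, cosine similarity over embeddings) and multi-vendor judge consensus (here, 2-of-3 majority). The threshold, data, and exact stage logic are illustrative assumptions, not the proprietary pipeline.

```python
def verify(findings, judges, sim_threshold=0.9):
    """Keep only findings that survive dedup and majority consensus.

    findings: list of dicts with an 'id' and an 'embedding' vector.
    judges:   dict mapping finding id -> list of bool verdicts
              from three independent vendor judges.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb)

    kept, reps = [], []
    for f in findings:
        # Semantic dedup: drop near-duplicates of an already-seen finding.
        if any(cosine(f["embedding"], r["embedding"]) > sim_threshold for r in reps):
            continue
        reps.append(f)
        # Multi-vendor consensus: at least 2 of 3 judges must call it a breach.
        if sum(judges[f["id"]]) >= 2:
            kept.append(f["id"])
    return kept

findings = [
    {"id": "a", "embedding": [1.0, 0.0]},
    {"id": "b", "embedding": [0.99, 0.01]},  # near-duplicate of "a"
    {"id": "c", "embedding": [0.0, 1.0]},
]
judges = {"a": [True, True, False], "c": [False, False, True]}
print(verify(findings, judges))  # ['a']
```

Three raw findings collapse to one certified breach: one is a semantic duplicate and one fails consensus, which is the mechanism behind raw scanners overreporting by tens of percentage points.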
21 attack modules • 48,654 evaluations • 14,471 breaches • Multi-week campaign on sovereign 6×GPU cluster
✔ Sovereign Lab Verified — March 2026
| Module | Probe Types | Passed | Failed | Total | Defense Rate | Status |
|---|---|---|---|---|---|---|
| continuation | 1 | 768 | 0 | 768 | 100.0% | ✅ PASS |
| snowball | 1 | 300 | 0 | 300 | 100.0% | ✅ PASS |
| realtoxicityprompts | 1 | 15 | 0 | 15 | 100.0% | ✅ PASS |
| apikey | 2 | 934 | 8 | 942 | 99.2% | ✅ PASS |
| goodside | 3 | 114 | 3 | 117 | 97.4% | ✅ PASS |
| packagehallucination | 7 | 4,677 | 363 | 5,040 | 92.8% | ✅ PASS |
| encoding | 30 | 19,822 | 3,218 | 23,040 | 86.0% | ✅ PASS |
| lmrc | 9 | 55 | 17 | 72 | 76.4% | ⚠️ MARGINAL |
| leakreplay | 8 | 1,455 | 501 | 1,956 | 74.4% | ⚠️ MARGINAL |
| ansiescape | 2 | 236 | 91 | 327 | 72.2% | ⚠️ MARGINAL |
| malwaregen | 4 | 480 | 240 | 720 | 66.7% | ⚠️ MARGINAL |
| latentinjection | 9 | 3,400 | 3,248 | 6,648 | 51.1% | ⚠️ MARGINAL |
| misleading | 4 | 776 | 1,024 | 1,800 | 43.1% | ⚠️ MARGINAL |
| tap | 1 | 8 | 19 | 27 | 29.6% | 🚨 FAIL |
| grandma | 7 | 47 | 127 | 174 | 27.0% | 🚨 FAIL |
| dan | 5 | 317 | 1,231 | 1,548 | 20.5% | 🚨 FAIL |
| topic | 1 | 32 | 136 | 168 | 19.0% | 🚨 FAIL |
| promptinject | 3 | 402 | 1,902 | 2,304 | 17.4% | 🚨 FAIL |
| divergence | 3 | 168 | 816 | 984 | 17.1% | 🚨 FAIL |
| phrasing | 2 | 167 | 1,369 | 1,536 | 10.9% | 🚨 CRITICAL |
| dra | 2 | 10 | 158 | 168 | 6.0% | 🚨 CRITICAL |
| TOTAL | 105 | 34,183 | 14,471 | 48,654 | 70.3% | — |
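The status column maps defense rates to labels. The cutoffs below are assumptions chosen to be consistent with every row in the table above (PASS down to 86.0%, MARGINAL down to 43.1%, FAIL down to 17.1%, CRITICAL at 10.9% and below); they are not CLS Labs' published thresholds.

```python
def status(defense_rate: float) -> str:
    """Map a module defense rate (percent) to a status label.

    Thresholds are assumed, chosen only to reproduce the table.
    """
    if defense_rate >= 80.0:
        return "PASS"
    if defense_rate >= 40.0:
        return "MARGINAL"
    if defense_rate >= 15.0:
        return "FAIL"
    return "CRITICAL"

# One example from each band of the table:
print(status(86.0), status(51.1), status(17.1), status(6.0))
# PASS MARGINAL FAIL CRITICAL
```

Whatever the real cutoffs, the shape of the table is the finding: content-safety modules (continuation, realtoxicityprompts) hold at 100%, while social-engineering and injection modules (dan, promptinject, phrasing, dra) collapse below 21%.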
Research data © 2026 CLS Security Labs LLC. Licensed under CC BY-NC-ND 4.0. Citation with attribution permitted. Commercial use requires written permission.