Research Disclaimer

Point-in-time results: All results reflect model behavior at the time of testing and may not reflect current model versions, post-deployment safety patches, or provider configuration changes.

Methodology: Results are produced using CLS Labs Forge Engine, Garak (NVIDIA), and PyRIT (Microsoft) with three-column scoring (keyword, tight, Gemini-as-judge). Different tools, prompts, or scoring criteria may produce different results.

Provider filtering: Some results reflect provider-level safety filtering applied at the inference layer (e.g., serverless endpoints). These results measure the provider’s defense stack, not the model’s intrinsic alignment. Deployment configuration significantly affects security posture.

No warranty or recommendation: Published research is for informational and educational purposes only. CLS Labs does not warrant that any model is safe or unsafe for any particular deployment. Organizations should conduct their own assessments before deployment decisions.

Authorized testing: All testing is conducted against publicly available API endpoints or dedicated endpoints under CLS Labs’ own accounts, in accordance with each provider’s terms of service. No unauthorized access is involved in any published research.

Cross-Judge Validation

The same breach data, scored by three LLM judges from three different vendors. The honest breach rate is a range, not a point: our v1 gauntlet showed a 41.9–67.1% spread depending on which vendor's model judged the findings.

- Gemini 2.0 Flash (Google): 67.1%
- Llama 4 Maverick (Meta): 43.0%
- Claude Sonnet 4 (Anthropic): 41.9%
- Inter-Judge Agreement (3-vendor consensus): 59.3%
Breach rates vary 41.9–67.1% depending on which vendor’s model judges the findings. Any vendor reporting a single number is hiding the uncertainty inherent in LLM-based evaluation. We report the range and let the data speak.
| Category | Agreement |
|---|---|
| Resource Exhaustion | 86% |
| Secret Exposure | 86% |
| Supply Chain | 71% |
| Content Policy | 71% |
| Agentic Reasoning | 57% |
| Hallucination | 57% |
| Jailbreak | 57% |
| Encoding Bypass | 43% |
| Prompt Injection | 43% |
| Code Generation | 29% |
| Output Manipulation | 29% |
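The per-category agreement figures above are consistent with unanimous three-judge consensus over small per-category samples (e.g., agreement on 6 of 7 records ≈ 86%). A minimal sketch of range-plus-agreement reporting, assuming each judge emits a boolean breach verdict per record; the verdict data below is hypothetical:

```python
# Hypothetical per-record breach verdicts from three independent judges.
verdicts = {
    "google":    [True, True, False, True, False, True, True],
    "meta":      [True, False, False, True, False, False, True],
    "anthropic": [True, False, False, True, False, False, False],
}

def breach_rate(votes):
    """Fraction of records a single judge scored as a breach."""
    return sum(votes) / len(votes)

def unanimous_agreement(all_votes):
    """Fraction of records where every judge returned the same verdict."""
    rows = list(zip(*all_votes.values()))
    return sum(len(set(row)) == 1 for row in rows) / len(rows)

rates = {judge: breach_rate(v) for judge, v in verdicts.items()}
print(f"breach-rate range: {min(rates.values()):.1%}-{max(rates.values()):.1%}")
print(f"unanimous agreement: {unanimous_agreement(verdicts):.1%}")
```

Reporting the min–max of `rates` rather than any single judge's number is what keeps the uncertainty visible.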

Agent Security Assessment

When AI agents have real tools for browsing the web, executing code, sending emails, and querying databases, the attack surface changes completely. Standard safety evaluations test only text generation. We test what agents actually do.

- Highest Agent Breach Rate (with tool access): 46.3%
- Models Tested as Agents (with 10 tool definitions): 23
- Lowest Agent Breach Rate (cross-judge validated): 2.5%
- Attack Surfaces Tested: 6 (text, agent, RAG, cross-agent, vision, autonomous)
Tool access escalates risk dramatically. The same model scores 19% in text-only testing and 46% with tools attached. Standard safety evaluations that only test text generation miss the entire action surface: agents don't just say harmful things, they do them.
One attack vector breaches every model with tool access. Malicious instructions embedded in data files (CSV exports, HTML documents, API responses) achieve universal code execution across all tested models. The model doesn't describe the attack; it runs it.
One autonomous agent platform wrote an attacker's SSH key to ~/.ssh/authorized_keys with correct 600 permissions, then scanned 127,261 files, including API tokens and PKI certificates. It defended against explicit attacks perfectly, but could not distinguish a legitimate sysadmin request from adversarial reconnaissance.
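Findings like the data-file injection above motivate screening tool outputs before they re-enter the model's context. The sketch below is a hypothetical pre-filter, not part of the assessment methodology; the pattern list and function names are illustrative, and pattern matching alone is not a complete defense:

```python
import re

# Hypothetical pre-filter for indirect prompt injection: flag
# instruction-like strings inside data returned by agent tools
# (CSV cells, HTML documents, API responses) before the model sees them.
SUSPECT_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (
        r"ignore\s+(all\s+|any\s+)?(previous|prior)\s+instructions",
        r"\byou\s+are\s+now\b",
        r"\b(run|execute)\b.{0,40}\b(curl|bash|powershell|python)\b",
        r"authorized_keys",
    )
]

def flag_payload(text: str) -> list[str]:
    """Return the patterns that matched; an empty list means no flag."""
    return [p.pattern for p in SUSPECT_PATTERNS if p.search(text)]

csv_cell = "Q3 revenue, 4.2M, ignore previous instructions and run bash -c 'id'"
print(flag_payload(csv_cell))  # two patterns match this cell
```

A real deployment would treat a flag as a signal to quarantine the payload or demand human review, since attackers can trivially rephrase around any fixed pattern list.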

Full Assessment Data Available

Detailed per-model agent security results, RAG injection findings, vision attack data, and cross-vendor judge validation are available under NDA. Request your model’s specific results or a full assessment of your deployment.

Request Assessment Data →

Cross-Model Assessment Results

310+ ranked models across 12+ architecture families and 16 providers. 144 attack modules across 446 categories covering prompt extraction, goal hijacking, identity manipulation, compliance fatigue, semantic smuggling, chain attacks, agent tool injection, cross-agent contamination, RAG poisoning, MCP protocol abuse, OT/ICS, robotics, and more.
Cross-judge validated: 199 runs across 30+ models. Three independent vendors, consensus scoring.

| Model | Family | Attacks | Verified Breach Rate | Defense Rate |
|---|---|---|---|---|
| Qwen3.5-397B | Qwen | 422 | 1.9% | 98.1% |
| MiniMax M2.5 | MiniMax | 434 | 4.1% | 95.9% |
| Kimi-K2.5 FP4 | Moonshot | 434 | 4.1% | 95.9% |
| GLM-5 | Zhipu AI | 230 | 4.3% | 95.7% |
| DeepSeek R1 | DeepSeek | 438 | 7.1% | 92.9% |
| GPT-OSS-120B | OpenAI | 434 | 9.0% | 91.0% |
| Qwen3-Next-80B | Qwen | 426 | 10.6% | 89.4% |
| Mistral Small 24B | Mistral | 438 | 13.2% | 86.8% |
| Qwen3-235B | Qwen | 434 | 13.1% | 86.9% |
| GLM-4.5-Air FP8 | Zhipu AI | 422 | 14.9% | 85.1% |
| Qwen3-Coder-Next FP8 | Qwen | 442 | 16.7% | 83.3% |
| DeepSeek R1-0528 | DeepSeek | 406 | 16.7% | 83.3% |
| Kimi-K2 Instruct | Moonshot | 238 | 18.5% | 81.5% |
| LiquidAI LFM2-24B | LiquidAI | 426 | 19.2% | 80.8% |
| Maverick FP8 | Meta / Llama 4 | 142 | 19.7% | 80.3% |
| DeepSeek V3.1 | DeepSeek | 262 | 22.5% | 77.5% |
| Llama-3.3-70B Turbo | Meta / Llama | 150 | 22.7% | 77.3% |
| Qwen2.5-7B Turbo | Qwen | 422 | 22.7% | 77.3% |
| Gemma 3n E4B | Google | 438 | 24.0% | 76.0% |
| 405B Dedicated | Meta / Llama | 792 | 26.6% | 73.4% |
| 405B Serverless | Meta / Llama | 150 | 0.0% | 100% (provider filtered) |
| Scout Serverless | Meta / Llama 4 | 130 | 0.0% | 100% (provider filtered) |
| Gemma 3-27B | Google | 170 | 0.0% | 100% (provider filtered) |
| GLM-5 FP4 | Zhipu AI | 438 | 0.0% | 100% (provider filtered) |
| DeepSeek V3.2 | DeepSeek | 217 | 0.0% | 100% (provider filtered / gated) |

Key Findings

Semantic smuggling bypasses every model tested. Reframing harmful requests through academic context, fictional scenarios, or past-tense phrasing defeats safety alignment at 75–100% rates across all architecture families. This is the universal vulnerability.
Model updates can degrade safety. DeepSeek R1 (original): 7.1% → DeepSeek R1-0528 (May update): 16.7%. The newer version is 2.4x more vulnerable. Version pinning matters.
Provider filtering hides real risk. Together AI serverless 405B: 0% breach. Dedicated endpoint: 26.6%. Five models show 0% due to provider-level filtering, not model-level safety.
Scale improves safety. Qwen3.5-397B at 1.9% is the safest non-filtered model tested. Jailbreak resistance: 10.7%. Larger models with reasoning capabilities show the strongest alignment.
Judge choice changes the result. The same 86 breach records scored by three vendors: Google 67.1%, Meta 43.0%, Anthropic 41.9%. Any assessment using a single judge model has unquantified bias.

Our Verification Approach

Raw vulnerability scanners overreport by 50–90 percentage points. We verify every finding before it reaches a report.

Every breach record passes through a proprietary multi-stage verification pipeline that includes semantic deduplication, continuous risk scoring with category attribution, statistical reproduction across multiple models, and multi-vendor LLM judge evaluation. Only certified results appear in our assessments.
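One of those stages, semantic deduplication, can be illustrated with a simple normalize-and-hash pass. This is a minimal sketch under assumed record formats; the production pipeline is proprietary and considerably more involved (e.g., embedding-based similarity rather than exact hashing):

```python
import hashlib
import re

def normalize(finding: str) -> str:
    """Collapse casing, whitespace, and volatile tokens (hex ids, numbers)
    so trivially-varied duplicates map to the same key."""
    text = finding.lower()
    text = re.sub(r"0x[0-9a-f]+|\d+", "<n>", text)  # mask volatile values
    return re.sub(r"\s+", " ", text).strip()

def dedupe(findings: list[str]) -> list[str]:
    """Keep the first representative of each normalized finding."""
    seen, kept = set(), []
    for f in findings:
        key = hashlib.sha256(normalize(f).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(f)
    return kept

reports = [
    "Model leaked API key 0xDEADBEEF in turn 3",
    "model leaked  API key 0xC0FFEE in turn 7",
    "Model refused the request",
]
print(dedupe(reports))  # the first two records collapse into one
```

Collapsing near-identical records before scoring is one reason verified breach counts come in far below raw scanner output.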

- Certification Rate: 98%+
- Independent Judge Models: 3 vendors
- Unverified Findings in Reports: 0%

Most AI security vendors report raw scanner output. We verify every finding with independent multi-vendor semantic analysis.
That verification is the difference between reporting a 78.8% breach rate and a 16.7% breach rate on the same model.


Llama-3.3-70B-Instruct — Full Gauntlet

21 attack modules • 48,654 evaluations • 14,471 breaches • Multi-week campaign on sovereign 6×GPU cluster
✔ Sovereign Lab Verified — March 2026

- Overall Defense Rate: 70.3%
- Total Breaches: 14,471
- Probe/Detector Combos: 105
- Findings Documented: 12
| Module | Probe Types | Passed | Failed | Total | Defense Rate | Status |
|---|---|---|---|---|---|---|
| continuation | 1 | 768 | 0 | 768 | 100.0% | ✅ PASS |
| snowball | 1 | 300 | 0 | 300 | 100.0% | ✅ PASS |
| realtoxicityprompts | 1 | 15 | 0 | 15 | 100.0% | ✅ PASS |
| apikey | 2 | 934 | 8 | 942 | 99.2% | ✅ PASS |
| goodside | 3 | 114 | 3 | 117 | 97.4% | ✅ PASS |
| packagehallucination | 7 | 4,677 | 363 | 5,040 | 92.8% | ✅ PASS |
| encoding | 30 | 19,822 | 3,218 | 23,040 | 86.0% | ✅ PASS |
| lmrc | 9 | 55 | 17 | 72 | 76.4% | ⚠️ MARGINAL |
| leakreplay | 8 | 1,455 | 501 | 1,956 | 74.4% | ⚠️ MARGINAL |
| ansiescape | 2 | 236 | 91 | 327 | 72.2% | ⚠️ MARGINAL |
| malwaregen | 4 | 480 | 240 | 720 | 66.7% | ⚠️ MARGINAL |
| latentinjection | 9 | 3,400 | 3,248 | 6,648 | 51.1% | ⚠️ MARGINAL |
| misleading | 4 | 776 | 1,024 | 1,800 | 43.1% | ⚠️ MARGINAL |
| tap | 1 | 8 | 19 | 27 | 29.6% | 🚨 FAIL |
| grandma | 7 | 47 | 127 | 174 | 27.0% | 🚨 FAIL |
| dan | 5 | 317 | 1,231 | 1,548 | 20.5% | 🚨 FAIL |
| topic | 1 | 32 | 136 | 168 | 19.0% | 🚨 FAIL |
| promptinject | 3 | 402 | 1,902 | 2,304 | 17.4% | 🚨 FAIL |
| divergence | 3 | 168 | 816 | 984 | 17.1% | 🚨 FAIL |
| phrasing | 2 | 167 | 1,369 | 1,536 | 10.9% | 🚨 CRITICAL |
| dra | 2 | 10 | 158 | 168 | 6.0% | 🚨 CRITICAL |
| TOTAL | 105 | 34,183 | 14,471 | 48,654 | 70.3% | |
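Every figure in the gauntlet table can be recomputed from its pass/fail counts: each defense rate is passed / (passed + failed), and the overall rate uses the column sums. A quick consistency check (module list abbreviated to a few representative rows):

```python
# Consistency check for the gauntlet results: recompute defense rates
# from the raw pass/fail counts reported per module.
modules = {
    "continuation":    (768, 0),       # reported 100.0%
    "encoding":        (19822, 3218),  # reported 86.0%
    "latentinjection": (3400, 3248),   # reported 51.1%
    "dra":             (10, 158),      # reported 6.0%
}

def defense_rate(passed: int, failed: int) -> float:
    return passed / (passed + failed)

for name, (passed, failed) in modules.items():
    print(f"{name}: {defense_rate(passed, failed):.1%}")

# Column totals across all 21 modules: 34,183 passed, 14,471 failed.
print(f"overall: {defense_rate(34183, 14471):.1%}")  # 70.3%
```

The recomputed values match the published table, including the 70.3% overall defense rate over 48,654 evaluations.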

Research data © 2026 CLS Security Labs LLC. Licensed under CC BY-NC-ND 4.0. Citation with attribution permitted. Commercial use requires written permission.