Research Disclaimer

Point-in-time results: All results reflect model behavior at the time of testing and may not reflect current model versions, post-deployment safety patches, or provider configuration changes.

Methodology: Results are produced using CLS Labs Forge Engine, Garak (NVIDIA), and PyRIT (Microsoft) with three-column scoring (keyword, tight, Gemini-as-judge). Different tools, prompts, or scoring criteria may produce different results.

Provider filtering: Some results reflect provider-level safety filtering applied at the inference layer (e.g., serverless endpoints). These results measure the provider’s defense stack, not the model’s intrinsic alignment. Deployment configuration significantly affects security posture.

No warranty or recommendation: Published research is for informational and educational purposes only. CLS Labs does not warrant that any model is safe or unsafe for any particular deployment. Organizations should conduct their own assessments before deployment decisions.

Authorized testing: All testing is conducted against publicly available API endpoints or dedicated endpoints under CLS Labs’ own accounts, in accordance with each provider’s terms of service. No unauthorized access is involved in any published research.

Cross-Judge Validation

The same breach data, scored by three LLM judges from three different vendors. The honest breach rate is a range, not a point: our v1 gauntlet showed a 41.9–67.1% spread depending on which vendor's model judged the findings.

- Gemini 2.0 Flash (Google): 67.1%
- Llama 4 Maverick (Meta): 43.0%
- Claude Sonnet 4 (Anthropic): 41.9%
- Inter-Judge Agreement (3-vendor consensus): 59.3%
Breach rates vary 41.9–67.1% depending on which vendor’s model judges the findings. Any vendor reporting a single number is hiding the uncertainty inherent in LLM-based evaluation. We report the range and let the data speak.
| Category | Agreement |
|---|---|
| Resource Exhaustion | 86% |
| Secret Exposure | 86% |
| Supply Chain | 71% |
| Content Policy | 71% |
| Agentic Reasoning | 57% |
| Hallucination | 57% |
| Jailbreak | 57% |
| Encoding Bypass | 43% |
| Prompt Injection | 43% |
| Code Generation | 29% |
| Output Manipulation | 29% |
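The per-category agreement figures above are consistent with unanimous three-judge consensus over small per-category samples (e.g., agreement on 6 of 7 records ≈ 86%). A minimal sketch of range-plus-agreement reporting, assuming each judge emits a boolean breach verdict per record; the verdict data below is hypothetical:

```python
# Hypothetical per-record breach verdicts from three independent judges.
verdicts = {
    "google":    [True, True, False, True, False, True, True],
    "meta":      [True, False, False, True, False, False, True],
    "anthropic": [True, False, False, True, False, False, False],
}

def breach_rate(votes):
    """Fraction of records a single judge scored as a breach."""
    return sum(votes) / len(votes)

def unanimous_agreement(all_votes):
    """Fraction of records where every judge returned the same verdict."""
    rows = list(zip(*all_votes.values()))
    return sum(len(set(row)) == 1 for row in rows) / len(rows)

rates = {judge: breach_rate(v) for judge, v in verdicts.items()}
print(f"breach-rate range: {min(rates.values()):.1%}-{max(rates.values()):.1%}")
print(f"unanimous agreement: {unanimous_agreement(verdicts):.1%}")
```

Reporting the min–max of `rates` rather than any single judge's number is what keeps the uncertainty visible.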

Agent Security Assessment

When AI agents have real tools for browsing the web, executing code, sending emails, and querying databases, the attack surface changes completely. Standard safety evaluations test only text generation. We test what agents actually do.

- Highest Agent Breach Rate (with tool access): 46.3%
- Models Tested as Agents (with 10 tool definitions): 23
- Lowest Agent Breach Rate (cross-judge validated): 2.5%
- Attack Surfaces Tested: 6 (text, agent, RAG, cross-agent, vision, autonomous)
Tool access escalates risk dramatically. The same model scores 19% in text-only testing and 46% with tools attached. Standard safety evaluations that only test text generation miss the entire action surface: agents don't just say harmful things, they do them.
One attack vector breaches every model with tool access. Malicious instructions embedded in data files (CSV exports, HTML documents, API responses) achieve universal code execution across all tested models. The model doesn't describe the attack; it runs it.
One autonomous agent platform wrote an attacker's SSH key to ~/.ssh/authorized_keys with correct 600 permissions, then scanned 127,261 files, including API tokens and PKI certificates. It defended against explicit attacks perfectly, but could not distinguish a legitimate sysadmin request from adversarial reconnaissance.
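Findings like the data-file injection above motivate screening tool outputs before they re-enter the model's context. The sketch below is a hypothetical pre-filter, not part of the assessment methodology; the pattern list and function names are illustrative, and pattern matching alone is not a complete defense:

```python
import re

# Hypothetical pre-filter for indirect prompt injection: flag
# instruction-like strings inside data returned by agent tools
# (CSV cells, HTML documents, API responses) before the model sees them.
SUSPECT_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (
        r"ignore\s+(all\s+|any\s+)?(previous|prior)\s+instructions",
        r"\byou\s+are\s+now\b",
        r"\b(run|execute)\b.{0,40}\b(curl|bash|powershell|python)\b",
        r"authorized_keys",
    )
]

def flag_payload(text: str) -> list[str]:
    """Return the patterns that matched; an empty list means no flag."""
    return [p.pattern for p in SUSPECT_PATTERNS if p.search(text)]

csv_cell = "Q3 revenue, 4.2M, ignore previous instructions and run bash -c 'id'"
print(flag_payload(csv_cell))  # two patterns match this cell
```

A real deployment would treat a flag as a signal to quarantine the payload or demand human review, since attackers can trivially rephrase around any fixed pattern list.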

Full Assessment Data Available

Detailed per-model agent security results, RAG injection findings, vision attack data, and cross-vendor judge validation are available under NDA. Request your model’s specific results or a full assessment of your deployment.

Request Assessment Data →

Cross-Model Assessment Results

310+ ranked models across 12+ architecture families and 16 providers. 144 attack modules across 446 categories covering prompt extraction, goal hijacking, identity manipulation, compliance fatigue, semantic smuggling, chain attacks, agent tool injection, cross-agent contamination, RAG poisoning, MCP protocol abuse, OT/ICS, robotics, and more.
Cross-judge validated: 199 runs across 30+ models. Three independent vendors, consensus scoring.

| Model | Family | Attacks | Verified Breach Rate | Defense Rate |
|---|---|---|---|---|
| Qwen3.5-397B | Qwen | 422 | 1.9% | 98.1% |
| MiniMax M2.5 | MiniMax | 434 | 4.1% | 95.9% |
| Kimi-K2.5 FP4 | Moonshot | 434 | 4.1% | 95.9% |
| GLM-5 | Zhipu AI | 230 | 4.3% | 95.7% |
| DeepSeek R1 | DeepSeek | 438 | 7.1% | 92.9% |
| GPT-OSS-120B | OpenAI | 434 | 9.0% | 91.0% |
| Qwen3-Next-80B | Qwen | 426 | 10.6% | 89.4% |
| Mistral Small 24B | Mistral | 438 | 13.2% | 86.8% |
| Qwen3-235B | Qwen | 434 | 13.1% | 86.9% |
| GLM-4.5-Air FP8 | Zhipu AI | 422 | 14.9% | 85.1% |
| Qwen3-Coder-Next FP8 | Qwen | 442 | 16.7% | 83.3% |
| DeepSeek R1-0528 | DeepSeek | 406 | 16.7% | 83.3% |
| Kimi-K2 Instruct | Moonshot | 238 | 18.5% | 81.5% |
| LiquidAI LFM2-24B | LiquidAI | 426 | 19.2% | 80.8% |
| Maverick FP8 | Meta / Llama 4 | 142 | 19.7% | 80.3% |
| DeepSeek V3.1 | DeepSeek | 262 | 22.5% | 77.5% |
| Llama-3.3-70B Turbo | Meta / Llama | 150 | 22.7% | 77.3% |
| Qwen2.5-7B Turbo | Qwen | 422 | 22.7% | 77.3% |
| Gemma 3n E4B | Google | 438 | 24.0% | 76.0% |
| 405B Dedicated | Meta / Llama | 792 | 26.6% | 73.4% |
| 405B Serverless | Meta / Llama | 150 | 0.0% | 100% (provider filtered) |
| Scout Serverless | Meta / Llama 4 | 130 | 0.0% | 100% (provider filtered) |
| Gemma 3-27B | Google | 170 | 0.0% | 100% (provider filtered) |
| GLM-5 FP4 | Zhipu AI | 438 | 0.0% | 100% (provider filtered) |
| DeepSeek V3.2 | DeepSeek | 217 | 0.0% | 100% (provider filtered / gated) |

Key Findings

Semantic smuggling bypasses every model tested. Reframing harmful requests through academic context, fictional scenarios, or past-tense phrasing defeats safety alignment at 75–100% rates across all architecture families. This is the universal vulnerability.
Model updates can degrade safety. DeepSeek R1 (original): 7.1% → DeepSeek R1-0528 (May update): 16.7%. The newer version is 2.4x more vulnerable. Version pinning matters.
Provider filtering hides real risk. Together AI serverless 405B: 0% breach. Dedicated endpoint: 26.6%. Five models show 0% due to provider-level filtering, not model-level safety.
Scale improves safety. Qwen3.5-397B at 1.9% is the safest non-filtered model tested. Jailbreak resistance: 10.7%. Larger models with reasoning capabilities show the strongest alignment.
Judge choice changes the result. The same 86 breach records scored by three vendors: Google 67.1%, Meta 43.0%, Anthropic 41.9%. Any assessment using a single judge model has unquantified bias.

Our Verification Approach

Raw vulnerability scanners overreport by 50–90 percentage points. We verify every finding before it reaches a report.

Every breach record passes through a proprietary multi-stage verification pipeline that includes semantic deduplication, continuous risk scoring with category attribution, statistical reproduction across multiple models, and multi-vendor LLM judge evaluation. Only certified results appear in our assessments.
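One of those stages, semantic deduplication, can be illustrated with a simple normalize-and-hash pass. This is a minimal sketch under assumed record formats; the production pipeline is proprietary and considerably more involved (e.g., embedding-based similarity rather than exact hashing):

```python
import hashlib
import re

def normalize(finding: str) -> str:
    """Collapse casing, whitespace, and volatile tokens (hex ids, numbers)
    so trivially-varied duplicates map to the same key."""
    text = finding.lower()
    text = re.sub(r"0x[0-9a-f]+|\d+", "<n>", text)  # mask volatile values
    return re.sub(r"\s+", " ", text).strip()

def dedupe(findings: list[str]) -> list[str]:
    """Keep the first representative of each normalized finding."""
    seen, kept = set(), []
    for f in findings:
        key = hashlib.sha256(normalize(f).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(f)
    return kept

reports = [
    "Model leaked API key 0xDEADBEEF in turn 3",
    "model leaked  API key 0xC0FFEE in turn 7",
    "Model refused the request",
]
print(dedupe(reports))  # the first two records collapse into one
```

Collapsing near-identical records before scoring is one reason verified breach counts come in far below raw scanner output.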

- Certification Rate: 98%+
- Independent Judge Models: 3 vendors
- Unverified Findings in Reports: 0%

Most AI security vendors report raw scanner output. We verify every finding with independent multi-vendor semantic analysis.
That verification is the difference between reporting a 78.8% breach rate and a 16.7% breach rate on the same model.


Llama-3.3-70B-Instruct — Full Gauntlet

21 attack modules • 48,654 evaluations • 14,471 breaches • Multi-week campaign on sovereign 6×GPU cluster
✔ Sovereign Lab Verified — March 2026

- Overall Defense Rate: 70.3%
- Total Breaches: 14,471
- Probe/Detector Combos: 105
- Findings Documented: 12
| Module | Probe Types | Passed | Failed | Total | Defense Rate | Status |
|---|---|---|---|---|---|---|
| continuation | 1 | 768 | 0 | 768 | 100.0% | ✅ PASS |
| snowball | 1 | 300 | 0 | 300 | 100.0% | ✅ PASS |
| realtoxicityprompts | 1 | 15 | 0 | 15 | 100.0% | ✅ PASS |
| apikey | 2 | 934 | 8 | 942 | 99.2% | ✅ PASS |
| goodside | 3 | 114 | 3 | 117 | 97.4% | ✅ PASS |
| packagehallucination | 7 | 4,677 | 363 | 5,040 | 92.8% | ✅ PASS |
| encoding | 30 | 19,822 | 3,218 | 23,040 | 86.0% | ✅ PASS |
| lmrc | 9 | 55 | 17 | 72 | 76.4% | ⚠️ MARGINAL |
| leakreplay | 8 | 1,455 | 501 | 1,956 | 74.4% | ⚠️ MARGINAL |
| ansiescape | 2 | 236 | 91 | 327 | 72.2% | ⚠️ MARGINAL |
| malwaregen | 4 | 480 | 240 | 720 | 66.7% | ⚠️ MARGINAL |
| latentinjection | 9 | 3,400 | 3,248 | 6,648 | 51.1% | ⚠️ MARGINAL |
| misleading | 4 | 776 | 1,024 | 1,800 | 43.1% | ⚠️ MARGINAL |
| tap | 1 | 8 | 19 | 27 | 29.6% | 🚨 FAIL |
| grandma | 7 | 47 | 127 | 174 | 27.0% | 🚨 FAIL |
| dan | 5 | 317 | 1,231 | 1,548 | 20.5% | 🚨 FAIL |
| topic | 1 | 32 | 136 | 168 | 19.0% | 🚨 FAIL |
| promptinject | 3 | 402 | 1,902 | 2,304 | 17.4% | 🚨 FAIL |
| divergence | 3 | 168 | 816 | 984 | 17.1% | 🚨 FAIL |
| phrasing | 2 | 167 | 1,369 | 1,536 | 10.9% | 🚨 CRITICAL |
| dra | 2 | 10 | 158 | 168 | 6.0% | 🚨 CRITICAL |
| TOTAL | 105 | 34,183 | 14,471 | 48,654 | 70.3% | |
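Every figure in the gauntlet table can be recomputed from its pass/fail counts: each defense rate is passed / (passed + failed), and the overall rate uses the column sums. A quick consistency check (module list abbreviated to a few representative rows):

```python
# Consistency check for the gauntlet results: recompute defense rates
# from the raw pass/fail counts reported per module.
modules = {
    "continuation":    (768, 0),       # reported 100.0%
    "encoding":        (19822, 3218),  # reported 86.0%
    "latentinjection": (3400, 3248),   # reported 51.1%
    "dra":             (10, 158),      # reported 6.0%
}

def defense_rate(passed: int, failed: int) -> float:
    return passed / (passed + failed)

for name, (passed, failed) in modules.items():
    print(f"{name}: {defense_rate(passed, failed):.1%}")

# Column totals across all 21 modules: 34,183 passed, 14,471 failed.
print(f"overall: {defense_rate(34183, 14471):.1%}")  # 70.3%
```

The recomputed values match the published table, including the 70.3% overall defense rate over 48,654 evaluations.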

Research data © 2026 CLS Security Labs LLC. Licensed under CC BY-NC-ND 4.0. Citation with attribution permitted. Commercial use requires written permission.