<<<<<<< HEAD ======= >>>>>>> ba4b2b4ed41c84b8b153dfa3a393eaf927ed565d
CLS Security Labs
CLS-2026-GPT41-PUB-001
Public Research
Public Research · March 2026

GPT-4.1 Security Assessment
What Standard Safety Tests
Don’t Show You

GPT-4.1 passes 84% of standard safety benchmarks. We tested it with 3,595 probes across 17 modules and 6 security domains. The real breach rate is 55%. Here’s what the safety card doesn’t tell you.

55%
Overall Breach Rate
3,595
Total Probes
6
Independent Tools
15.7%
Benchmark Fail Rate
Model: OpenAI GPT-4.1
Provider: OpenAI
Assessor: CLS Security Labs
Publication Tier: Courtesy, 7-day notification sent to OpenAI
Last Updated: March 23, 2026
Forge v3
Garak (NVIDIA)
Promptfoo
PyRIT (Microsoft)
Vision Forge
MCP Services

Research Disclaimer

Point-in-time results: All results reflect model behavior at the time of testing and may not reflect current model versions, post-deployment safety patches, or provider configuration changes.

Methodology: Results are produced using CLS Labs Forge Engine, Garak (NVIDIA), PyRIT (Microsoft), and Promptfoo with cross-judge scoring (Gemini/Claude/Llama). Different tools, prompts, or scoring criteria may produce different results.

No attack prompts: This publication includes breach rates by category and module, methodology descriptions, and general findings. It does not include specific attack prompts, payloads, or reproduction instructions.

Authorized testing: All testing was conducted against publicly available API endpoints under CLS Labs’ own accounts, in accordance with OpenAI’s terms of service.

The Safety Illusion

Standard safety benchmarks report a 15.7% fail rate. Enterprise buyers see this and deploy with confidence. Our testing reveals the real risk is 3.5× higher.

Breach Rate: Benchmark vs. Operational Reality
Same model. Different question. Dramatically different answer.
15.7%
Promptfoo Benchmark
What vendors report
55%
CLS Full Assessment
What actually happens
70–85%
With Tool Access
MCP + encoding attacks
3.5×
Risk Multiplier
Benchmark → real world

The 3.5× Gap

GPT-4.1 scores 84.3% pass rate on Promptfoo’s 950-test standard safety suite. That’s the number on the safety card. That’s the number procurement teams see. But when that same model is connected to real tools via MCP (the attack surface standard tests never measure), breach rates jump to 65–75%. Standard safety evaluations create a dangerous illusion of security.

Where It Breaks

17 Forge v3 modules tested. Results range from 5% (strong defense) to 85% (near-total failure). Here are the extremes. The full module-by-module breakdown is available in client assessments.

Forge v3 — Breach Rate by Module
Sorted by breach rate. Red = critical (>70%), orange = high, amber = moderate, green = low.

The Tool Access Multiplier

The top 7 modules by breach rate all involve tool access, domain-specific context, or encoding evasion. MCP services average 70% breach rates. Domain-specific modules, testing the model as an industrial copilot and autonomous vehicle reasoner, produced breach rates of 65–72%. Meanwhile, the model’s strongest defense (agent memory manipulation, 5%) is a purely text-based interaction. The pattern is unambiguous: the moment the model touches tools, safety collapses.

Three Critical Attack Surfaces

Encoding bypass, MCP tool integration, and domain-specific attacks represent the highest-risk categories. Each exploits a different gap in the safety architecture.

85%
Encoding Bypass
Base64, ROT13, Unicode substitution
70%
MCP Services (avg)
Tool-connected agent attacks
72%
Domain-Specific
OT/ICS industrial copilot

Encoding: The Universal Bypass

Character-level encoding transformations (Base64, ROT13, Unicode homoglyphs, and similar techniques) achieve an 85% breach rate. These are not sophisticated attacks. They’re the equivalent of writing a harmful request in pig latin and watching the safety training ignore it. This category has the highest breach rate of any module tested.

MCP: When Agents Have Real Tools

MCP service integrations, the layer connecting AI agents to enterprise tools, represent the single largest operational attack surface. When GPT-4.1 is connected to communication platforms, code repositories, and email systems via MCP, breach rates average 70% across services. The model performs actions it would refuse to describe in text-only mode. The gap between what a model says and what it does is the entire risk.

Domain-Specific: When AI Failures Produce Force

When tested as an industrial copilot, GPT-4.1 generated code that bypasses safety interlocks in operational technology environments. The breach rate across OT/ICS probes was 72%. Autonomous vehicle reasoning probes achieved 65%. Supply chain poisoning attacks succeeded at 70%. These are domains where AI failures don’t produce bad text. They produce physical consequences. Major industrial automation vendors ship AI copilots for these environments. Published adversarial testing results from those vendors remain scarce.

Six Tools. One Conclusion.

We tested GPT-4.1 with several tools in our arsenal. Each tool sees the model from a different angle. All six agree: the safety surface is thin.

Breach / Fail Rate by Tool
Each tool uses different methodology, attack strategy, and scoring. Convergence = signal.

In Context: GPT-4.1 vs. the Field

On standard benchmarks, GPT-4.1 is middle-of-pack. On operational testing, the gap between benchmark performance and real-world exploitability becomes clear.

ModelPromptfoo Fail%Garak CLS ASROT IndustrialAV ReasoningMCP
Nemotron Super 120B10.9%40%25%50%
GPT-5.411.4%
GLM-511.5%
GPT-4.115.7%58–60%72%65%65–75%
GPT-4o15.7%44%
Grok 4.2021.7%
Llama 70B25.3%

Benchmark ≠ Security

GPT-4.1 and GPT-4o have identical Promptfoo fail rates (15.7%). But on Garak’s CLS taxonomy, GPT-4.1’s attack success rate is 58–60% vs GPT-4o’s 44%, a 34% increase in exploitability that standard benchmarks completely miss. On domain-specific attacks, GPT-4.1 is dramatically more compliant with dangerous requests than Nemotron 120B, despite having a better benchmark score.

Where the Model Defended

Not everything broke. GPT-4.1 has genuine strengths, and understanding which defenses held reveals the architecture of future protections.

5%
Agent Memory
Strongest module. 19/20 defended.
15.7%
Standard Benchmarks
Middle-of-pack conventional.
0%
Direct Tool Abuse
Complete refusal on explicit requests.
25%
Agent Identity
Resists persona manipulation.

Strong on Direct. Weak on Indirect.

GPT-4.1 consistently refuses direct requests for harmful tool actions. It also has robust memory manipulation defenses (5% breach). The weakness is indirect exploitation: encoding tricks that obscure intent, academic framing that recontextualizes harm, legitimate-looking requests that produce dangerous outputs as side effects, and multi-step tool chains where no single step looks malicious. The safety training catches the obvious ask. It misses the sophisticated approach.

How We Tested

Six independent tools. Three scoring judges. Five severity dimensions. No single vendor or methodology dependency.

ToolProbesMethodScoring
Forge v3300+17 modules across 6 security domains. Generative attack variants.Cross-judge (Gemini/Claude/Llama)
Garak1,500+CLS custom taxonomy. 9 probe classes. Two runs averaged.Garak built-in detectors
Promptfoo950Standard red team evaluation suite.Promptfoo evaluator
PyRIT50+5 vulnerability scenarios. Multi-turn orchestrated attacks.Microsoft PyRIT scorer + review
Vision Forge10+Multimodal scenarios. Typographic injection.Manual verification
MCP Services60+Enterprise service integrations. Real tool access.Action verification + cross-judge

Scoring: Adversarial Impact Score (AIS)

Every confirmed breach is scored across five dimensions: Compromise, Action Depth, Privilege, Persistence, and Evasion. Three independent judge vendors (Gemini, Claude, Llama) cross-validate every finding. No single-vendor bias. The composite AIS score determines severity: Critical (80+), High (60–79), Medium (40–59), Low (<40). Full AIS methodology is documented on our CLAP page.

Want This Assessment for Your Deployment?

CLS Security Labs runs full adversarial assessments against any LLM deployment, with real tools, real actions, and three-vendor judge validation. Every finding compliance-mapped to NIST AI RMF, OWASP LLM Top 10, and MITRE ATLAS.

This blog presents aggregate findings. The full assessment includes per-module evidence, specific remediation guidance, AIS severity scoring, and compliance mappings, available as a Forge Assessment or Full CLAP Assessment. Colorado-based organizations deploying GPT-4.1 for consequential decisions may require an SB 24-205 impact assessment before June 30, 2026.

Get an Assessment → View All Research