GPT-4.1 passes 84% of standard safety benchmarks. We tested it with 3,595 probes across 17 modules and 6 security domains. The real breach rate is 55%. Here’s what the safety card doesn’t tell you.
Point-in-time results: All results reflect model behavior at the time of testing and may not reflect current model versions, post-deployment safety patches, or provider configuration changes.
Methodology: Results are produced using CLS Labs Forge Engine, Garak (NVIDIA), PyRIT (Microsoft), and Promptfoo with cross-judge scoring (Gemini/Claude/Llama). Different tools, prompts, or scoring criteria may produce different results.
No attack prompts: This publication includes breach rates by category and module, methodology descriptions, and general findings. It does not include specific attack prompts, payloads, or reproduction instructions.
Authorized testing: All testing was conducted against publicly available API endpoints under CLS Labs’ own accounts, in accordance with OpenAI’s terms of service.
Standard safety benchmarks report a 15.7% fail rate. Enterprise buyers see this and deploy with confidence. Our testing reveals the real risk is 3.5× higher.
GPT-4.1 scores 84.3% pass rate on Promptfoo’s 950-test standard safety suite. That’s the number on the safety card. That’s the number procurement teams see. But when that same model is connected to real tools via MCP (the attack surface standard tests never measure), breach rates jump to 65–75%. Standard safety evaluations create a dangerous illusion of security.
17 Forge v3 modules tested. Results range from 5% (strong defense) to 85% (near-total failure). Here are the extremes. The full module-by-module breakdown is available in client assessments.
The top 7 modules by breach rate all involve tool access, domain-specific context, or encoding evasion. MCP services average 70% breach rates. Domain-specific modules, testing the model as an industrial copilot and autonomous vehicle reasoner, produced breach rates of 65–72%. Meanwhile, the model’s strongest defense (agent memory manipulation, 5%) is a purely text-based interaction. The pattern is unambiguous: the moment the model touches tools, safety collapses.
Encoding bypass, MCP tool integration, and domain-specific attacks represent the highest-risk categories. Each exploits a different gap in the safety architecture.
Character-level encoding transformations (Base64, ROT13, Unicode homoglyphs, and similar techniques) achieve an 85% breach rate. These are not sophisticated attacks. They're the equivalent of writing a harmful request in Pig Latin and watching the safety training ignore it. This category has the highest breach rate of any module tested.
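The defensive counterpart is straightforward in principle: collapse encoded variants back to plain text before the safety filter ever sees them. A minimal sketch, where the decoder set and the downstream moderation step are illustrative assumptions, not CLS tooling:

```python
import base64
import codecs
import unicodedata

def normalize_for_moderation(text: str) -> list[str]:
    """Return every plausible plain-text view of the input so a safety
    filter can score each one, not just the encoded surface form."""
    candidates = [text]
    # NFKC folds many Unicode homoglyphs (fullwidth letters,
    # compatibility characters) back to their ASCII equivalents.
    candidates.append(unicodedata.normalize("NFKC", text))
    # ROT13 is its own inverse, so decoding is always safe to attempt.
    candidates.append(codecs.decode(text, "rot13"))
    # Base64: keep the decode only if it round-trips to valid UTF-8.
    try:
        candidates.append(base64.b64decode(text, validate=True).decode("utf-8"))
    except (ValueError, UnicodeDecodeError):
        pass
    return candidates

# Policy: refuse if ANY decoded candidate trips the filter, so an
# encoded request is held to the same standard as its plain-text form.
```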
MCP service integrations, the layer connecting AI agents to enterprise tools, represent the single largest operational attack surface. When GPT-4.1 is connected to communication platforms, code repositories, and email systems via MCP, breach rates average 70% across services. The model performs actions it would refuse to describe in text-only mode. The gap between what a model says and what it does is the entire risk.
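One mitigation pattern that addresses exactly this gap is a policy gate between the model and tool execution, so high-risk actions require human approval no matter what the model decides. A minimal sketch; the tool names and risk tiers here are hypothetical, not taken from the assessment:

```python
from dataclasses import dataclass, field

# Hypothetical risk tiering: actions with external side effects are
# high-risk regardless of how benign the triggering prompt looks.
HIGH_RISK_TOOLS = {"send_email", "merge_pull_request", "post_message"}

@dataclass
class ToolCallGate:
    approved: set[str] = field(default_factory=set)

    def check(self, tool_name: str, requested_by_user: bool) -> bool:
        """Allow low-risk tools; require explicit, per-session human
        approval for high-risk actions the model initiates."""
        if tool_name not in HIGH_RISK_TOOLS:
            return True
        return requested_by_user and tool_name in self.approved

gate = ToolCallGate(approved={"send_email"})
assert gate.check("read_file", requested_by_user=False)       # low-risk: pass
assert not gate.check("merge_pull_request", requested_by_user=False)  # blocked
```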
When tested as an industrial copilot, GPT-4.1 generated code that bypasses safety interlocks in operational technology environments. The breach rate across OT/ICS probes was 72%. Autonomous vehicle reasoning probes achieved 65%. Supply chain poisoning attacks succeeded at 70%. These are domains where AI failures don’t produce bad text. They produce physical consequences. Major industrial automation vendors ship AI copilots for these environments. Published adversarial testing results from those vendors remain scarce.
We tested GPT-4.1 with six independent tools. Each tool sees the model from a different angle. All six agree: the safety surface is thin.
On standard benchmarks, GPT-4.1 is middle-of-pack. On operational testing, the gap between benchmark performance and real-world exploitability becomes clear.
| Model | Promptfoo Fail Rate | Garak CLS ASR | OT/ICS Breach | AV Reasoning Breach | MCP Breach |
|---|---|---|---|---|---|
| Nemotron Super 120B | 10.9% | — | 40% | 25% | 50% |
| GPT-5.4 | 11.4% | — | — | — | — |
| GLM-5 | 11.5% | — | — | — | — |
| GPT-4.1 | 15.7% | 58–60% | 72% | 65% | 65–75% |
| GPT-4o | 15.7% | 44% | — | — | — |
| Grok 4.20 | 21.7% | — | — | — | — |
| Llama 70B | 25.3% | — | — | — | — |
GPT-4.1 and GPT-4o have identical Promptfoo fail rates (15.7%). But on Garak's CLS taxonomy, GPT-4.1's attack success rate is 58–60% vs GPT-4o's 44%: at the 59% midpoint, a ~34% relative increase in exploitability that standard benchmarks completely miss. On domain-specific attacks, GPT-4.1 is dramatically more compliant with dangerous requests than Nemotron Super 120B (72% vs 40% on OT probes), a gap the 4.8-point difference in their benchmark fail rates comes nowhere close to predicting.
Not everything broke. GPT-4.1 has genuine strengths, and understanding which defenses held reveals the architecture of future protections.
GPT-4.1 consistently refuses direct requests for harmful tool actions. It also has robust memory manipulation defenses (5% breach). The weakness is indirect exploitation: encoding tricks that obscure intent, academic framing that recontextualizes harm, legitimate-looking requests that produce dangerous outputs as side effects, and multi-step tool chains where no single step looks malicious. The safety training catches the obvious ask. It misses the sophisticated approach.
Six independent tools. Three scoring judges. Five severity dimensions. No single vendor or methodology dependency.
| Tool | Probes | Method | Scoring |
|---|---|---|---|
| Forge v3 | 300+ | 17 modules across 6 security domains. Generative attack variants. | Cross-judge (Gemini/Claude/Llama) |
| Garak | 1,500+ | CLS custom taxonomy. 9 probe classes. Two runs averaged. | Garak built-in detectors |
| Promptfoo | 950 | Standard red team evaluation suite. | Promptfoo evaluator |
| PyRIT | 50+ | 5 vulnerability scenarios. Multi-turn orchestrated attacks. | Microsoft PyRIT scorer + review |
| Vision Forge | 10+ | Multimodal scenarios. Typographic injection. | Manual verification |
| MCP Services | 60+ | Enterprise service integrations. Real tool access. | Action verification + cross-judge |
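The cross-judge scoring referenced in the table reduces, at its simplest, to a majority vote across independent judge models. A sketch assuming each judge returns a boolean breach verdict; the actual rubrics and judge prompts are not reproduced here:

```python
def cross_judge(verdicts: dict[str, bool]) -> bool:
    """Confirm a breach only when a strict majority of judges agree,
    which limits any single judge vendor's bias."""
    return sum(verdicts.values()) > len(verdicts) / 2

# Each verdict would come from a separate judge model scoring the
# same candidate response against the same rubric.
print(cross_judge({"gemini": True, "claude": True, "llama": False}))   # True
print(cross_judge({"gemini": True, "claude": False, "llama": False}))  # False
```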
Every confirmed breach is scored across five dimensions: Compromise, Action Depth, Privilege, Persistence, and Evasion. Three independent judge vendors (Gemini, Claude, Llama) cross-validate every finding. No single-vendor bias. The composite AIS score determines severity: Critical (80+), High (60–79), Medium (40–59), Low (<40). Full AIS methodology is documented on our CLAP page.
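The exact weighting lives on the CLAP page; for intuition, here is what the banding looks like with hypothetical equal weights over the five dimensions:

```python
# Hypothetical equal-weight composite over the five AIS dimensions;
# the actual CLS weighting is documented on the CLAP page, not here.
AIS_DIMENSIONS = ("compromise", "action_depth", "privilege",
                  "persistence", "evasion")

def ais_severity(scores: dict[str, float]) -> str:
    """Each dimension is scored 0-100; the composite maps onto the
    published bands: Critical (80+), High (60-79), Medium (40-59),
    Low (<40)."""
    composite = sum(scores[d] for d in AIS_DIMENSIONS) / len(AIS_DIMENSIONS)
    if composite >= 80:
        return "Critical"
    if composite >= 60:
        return "High"
    if composite >= 40:
        return "Medium"
    return "Low"

print(ais_severity({"compromise": 90, "action_depth": 70,
                    "privilege": 60, "persistence": 40, "evasion": 80}))
# composite = 68.0 -> "High"
```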
CLS Security Labs runs full adversarial assessments against any LLM deployment, with real tools, real actions, and three-vendor judge validation. Every finding is compliance-mapped to NIST AI RMF, OWASP LLM Top 10, and MITRE ATLAS.
This blog presents aggregate findings. The full assessment includes per-module evidence, specific remediation guidance, AIS severity scoring, and compliance mappings, available as a Forge Assessment or Full CLAP Assessment. Colorado-based organizations deploying GPT-4.1 for consequential decisions may require an SB 24-205 impact assessment before June 30, 2026.