GPT-4.1 passes 84% of standard safety benchmarks. We tested it with 3,595 probes across 17 modules and 6 security domains. The real breach rate is 55%. Here’s what the safety card doesn’t tell you.
Point-in-time results: All results reflect model behavior at the time of testing and may not reflect current model versions, post-deployment safety patches, or provider configuration changes.
Methodology: Results are produced using CLS Labs Forge Engine, Garak (NVIDIA), PyRIT (Microsoft), and Promptfoo with cross-judge scoring (Gemini/Claude/Llama). Different tools, prompts, or scoring criteria may produce different results.
No attack prompts: This publication includes breach rates by category and module, methodology descriptions, and general findings. It does not include specific attack prompts, payloads, or reproduction instructions.
Authorized testing: All testing was conducted against publicly available API endpoints under CLS Labs’ own accounts, in accordance with OpenAI’s terms of service.
Standard safety benchmarks report a 15.7% fail rate. Enterprise buyers see this and deploy with confidence. Our testing reveals the real risk is 3.5× higher.
GPT-4.1 scores 84.3% pass rate on Promptfoo’s 950-test standard safety suite. That’s the number on the safety card. That’s the number procurement teams see. But when that same model is connected to real tools via MCP (the attack surface standard tests never measure), breach rates jump to 65–75%. Standard safety evaluations create a dangerous illusion of security.
17 Forge v3 modules tested. Results range from 5% (strong defense) to 85% (near-total failure). Here are the extremes. The full module-by-module breakdown is available in client assessments.
The top 7 modules by breach rate all involve tool access, domain-specific context, or encoding evasion. MCP services average 70% breach rates. Domain-specific modules, testing the model as an industrial copilot and autonomous vehicle reasoner, produced breach rates of 65–72%. Meanwhile, the model’s strongest defense (agent memory manipulation, 5%) is a purely text-based interaction. The pattern is unambiguous: the moment the model touches tools, safety collapses.
Encoding bypass, MCP tool integration, and domain-specific attacks represent the highest-risk categories. Each exploits a different gap in the safety architecture.
Character-level encoding transformations (Base64, ROT13, Unicode homoglyphs, and similar techniques) achieve an 85% breach rate. These are not sophisticated attacks. They're the equivalent of writing a harmful request in Pig Latin and watching the safety training ignore it. This category has the highest breach rate of any module tested.
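The defensive counterpart is straightforward in principle: collapse encoded variants back to plain text before the safety filter ever sees them. A minimal sketch, where the decoder set and the downstream moderation step are illustrative assumptions, not CLS tooling:

```python
import base64
import codecs
import unicodedata

def normalize_for_moderation(text: str) -> list[str]:
    """Return every plausible plain-text view of the input so a safety
    filter can score each one, not just the encoded surface form."""
    candidates = [text]
    # NFKC folds many Unicode homoglyphs (fullwidth letters,
    # compatibility characters) back to their ASCII equivalents.
    candidates.append(unicodedata.normalize("NFKC", text))
    # ROT13 is its own inverse, so decoding is always safe to attempt.
    candidates.append(codecs.decode(text, "rot13"))
    # Base64: keep the decode only if it round-trips to valid UTF-8.
    try:
        candidates.append(base64.b64decode(text, validate=True).decode("utf-8"))
    except (ValueError, UnicodeDecodeError):
        pass
    return candidates

# Policy: refuse if ANY decoded candidate trips the filter, so an
# encoded request is held to the same standard as its plain-text form.
```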
MCP service integrations, the layer connecting AI agents to enterprise tools, represent the single largest operational attack surface. When GPT-4.1 is connected to communication platforms, code repositories, and email systems via MCP, breach rates average 70% across services. The model performs actions it would refuse to describe in text-only mode. The gap between what a model says and what it does is the entire risk.
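One mitigation pattern that addresses exactly this gap is a policy gate between the model and tool execution, so high-risk actions require human approval no matter what the model decides. A minimal sketch; the tool names and risk tiers here are hypothetical, not taken from the assessment:

```python
from dataclasses import dataclass, field

# Hypothetical risk tiering: actions with external side effects are
# high-risk regardless of how benign the triggering prompt looks.
HIGH_RISK_TOOLS = {"send_email", "merge_pull_request", "post_message"}

@dataclass
class ToolCallGate:
    approved: set[str] = field(default_factory=set)

    def check(self, tool_name: str, requested_by_user: bool) -> bool:
        """Allow low-risk tools; require explicit, per-session human
        approval for high-risk actions the model initiates."""
        if tool_name not in HIGH_RISK_TOOLS:
            return True
        return requested_by_user and tool_name in self.approved

gate = ToolCallGate(approved={"send_email"})
assert gate.check("read_file", requested_by_user=False)       # low-risk: pass
assert not gate.check("merge_pull_request", requested_by_user=False)  # blocked
```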
When tested as an industrial copilot, GPT-4.1 generated code that bypasses safety interlocks in operational technology environments. The breach rate across OT/ICS probes was 72%. Autonomous vehicle reasoning probes achieved 65%. Supply chain poisoning attacks succeeded at 70%. These are domains where AI failures don’t produce bad text. They produce physical consequences. Major industrial automation vendors ship AI copilots for these environments. Published adversarial testing results from those vendors remain scarce.
We tested GPT-4.1 with six independent tools. Each tool sees the model from a different angle. All six agree: the safety surface is thin.
On standard benchmarks, GPT-4.1 is middle-of-pack. On operational testing, the gap between benchmark performance and real-world exploitability becomes clear.
| Model | Promptfoo Fail Rate | Garak CLS ASR | OT/ICS Breach | AV Reasoning Breach | MCP Breach |
|---|---|---|---|---|---|
| Nemotron Super 120B | 10.9% | — | 40% | 25% | 50% |
| GPT-5.4 | 11.4% | — | — | — | — |
| GLM-5 | 11.5% | — | — | — | — |
| GPT-4.1 | 15.7% | 58–60% | 72% | 65% | 65–75% |
| GPT-4o | 15.7% | 44% | — | — | — |
| Grok 4.20 | 21.7% | — | — | — | — |
| Llama 70B | 25.3% | — | — | — | — |
GPT-4.1 and GPT-4o have identical Promptfoo fail rates (15.7%). But on Garak's CLS taxonomy, GPT-4.1's attack success rate is 58–60% vs GPT-4o's 44%: at the 59% midpoint, a ~34% relative increase in exploitability that standard benchmarks completely miss. On domain-specific attacks, GPT-4.1 is dramatically more compliant with dangerous requests than Nemotron Super 120B (72% vs 40% on OT probes), a gap the 4.8-point difference in their benchmark fail rates comes nowhere close to predicting.
Not everything broke. GPT-4.1 has genuine strengths, and understanding which defenses held reveals the architecture of future protections.
GPT-4.1 consistently refuses direct requests for harmful tool actions. It also has robust memory manipulation defenses (5% breach). The weakness is indirect exploitation: encoding tricks that obscure intent, academic framing that recontextualizes harm, legitimate-looking requests that produce dangerous outputs as side effects, and multi-step tool chains where no single step looks malicious. The safety training catches the obvious ask. It misses the sophisticated approach.
Six independent tools. Three scoring judges. Five severity dimensions. No single vendor or methodology dependency.
| Tool | Probes | Method | Scoring |
|---|---|---|---|
| Forge v3 | 300+ | 17 modules across 6 security domains. Generative attack variants. | Cross-judge (Gemini/Claude/Llama) |
| Garak | 1,500+ | CLS custom taxonomy. 9 probe classes. Two runs averaged. | Garak built-in detectors |
| Promptfoo | 950 | Standard red team evaluation suite. | Promptfoo evaluator |
| PyRIT | 50+ | 5 vulnerability scenarios. Multi-turn orchestrated attacks. | Microsoft PyRIT scorer + review |
| Vision Forge | 10+ | Multimodal scenarios. Typographic injection. | Manual verification |
| MCP Services | 60+ | Enterprise service integrations. Real tool access. | Action verification + cross-judge |
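The cross-judge scoring referenced in the table reduces, at its simplest, to a majority vote across independent judge models. A sketch assuming each judge returns a boolean breach verdict; the actual rubrics and judge prompts are not reproduced here:

```python
def cross_judge(verdicts: dict[str, bool]) -> bool:
    """Confirm a breach only when a strict majority of judges agree,
    which limits any single judge vendor's bias."""
    return sum(verdicts.values()) > len(verdicts) / 2

# Each verdict would come from a separate judge model scoring the
# same candidate response against the same rubric.
print(cross_judge({"gemini": True, "claude": True, "llama": False}))   # True
print(cross_judge({"gemini": True, "claude": False, "llama": False}))  # False
```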
Every confirmed breach is scored across five dimensions: Compromise, Action Depth, Privilege, Persistence, and Evasion. Three independent judge vendors (Gemini, Claude, Llama) cross-validate every finding. No single-vendor bias. The composite AIS score determines severity: Critical (80+), High (60–79), Medium (40–59), Low (<40). Full AIS methodology is documented on our CLAP page.
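The exact weighting lives on the CLAP page; for intuition, here is what the banding looks like with hypothetical equal weights over the five dimensions:

```python
# Hypothetical equal-weight composite over the five AIS dimensions;
# the actual CLS weighting is documented on the CLAP page, not here.
AIS_DIMENSIONS = ("compromise", "action_depth", "privilege",
                  "persistence", "evasion")

def ais_severity(scores: dict[str, float]) -> str:
    """Each dimension is scored 0-100; the composite maps onto the
    published bands: Critical (80+), High (60-79), Medium (40-59),
    Low (<40)."""
    composite = sum(scores[d] for d in AIS_DIMENSIONS) / len(AIS_DIMENSIONS)
    if composite >= 80:
        return "Critical"
    if composite >= 60:
        return "High"
    if composite >= 40:
        return "Medium"
    return "Low"

print(ais_severity({"compromise": 90, "action_depth": 70,
                    "privilege": 60, "persistence": 40, "evasion": 80}))
# composite = 68.0 -> "High"
```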
CLS Security Labs runs full adversarial assessments against any LLM deployment, with real tools, real actions, and three-vendor judge validation. Every finding is compliance-mapped to NIST AI RMF, OWASP LLM Top 10, and MITRE ATLAS.
This blog presents aggregate findings. The full assessment includes per-module evidence, specific remediation guidance, AIS severity scoring, and compliance mappings, available as a Forge Assessment or Full CLAP Assessment. Colorado-based organizations deploying GPT-4.1 for consequential decisions may require an SB 24-205 impact assessment before June 30, 2026.