AISI Cyber Eval: GPT-5.5 vs Mythos vs Opus (May 2026)
On May 1, 2026, the UK AI Safety Institute (AISI) published its latest cyber capability evaluation — and the headline numbers reset the conversation about AI safety in offensive cyber. GPT-5.5 leads at 71.4% on Expert-tier cyber tasks, with Mythos Preview close behind at 68.6%. Opus 4.7, by contrast, scored 48.6% — a gap that reflects Anthropic’s deliberate safety design as much as raw capability.
Here’s what the eval shows and what it means for AI deployment in May 2026.
Last verified: May 3, 2026
The headline numbers
| Model | AISI Expert cyber tasks (pass rate) | “The Last Ones” 32-step attack range |
|---|---|---|
| GPT-5.5 | 71.4% | 2/10 end-to-end |
| Claude Mythos Preview | 68.6% | Not disclosed (research access only) |
| GPT-5.4 | 52.4% | 0/10 |
| Claude Opus 4.7 | 48.6% | 0/10 |
Source: AISI Cybersecurity Evaluation, May 1, 2026 (per ResultSense reporting on May 1, 2026 and RevolutionInAI analysis on May 2, 2026).
What “Expert-tier cyber tasks” actually measures
AISI’s eval suite spans several capability categories:
- Vulnerability discovery — finding zero-days in code samples, binaries, and web apps
- Exploit writing — turning a known vulnerability into a working exploit
- Reverse engineering — analyzing obfuscated binaries
- Privilege escalation — finding and chaining local-elevation paths
- Lateral movement — moving through network topologies
- Persistence — establishing footholds resistant to detection
- Defense evasion — bypassing AV/EDR with model-generated payloads
Expert-tier tasks are calibrated against the difficulty distribution of professional CTF challenges (hard tier) and real-world penetration test work. A 71.4% pass rate is what AISI describes as “approaching capable junior offensive security professional” performance.
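AISI reports a single aggregate pass rate across these categories. As a rough sketch of how such a figure rolls up, using the category names from the list above (the per-category task and pass counts below are invented for illustration; AISI does not publish this breakdown):

```python
# Hypothetical per-category results as (tasks, passed). The counts are
# invented; only the category names come from the eval description.
results = {
    "vulnerability discovery": (14, 10),
    "exploit writing": (10, 7),
    "reverse engineering": (8, 5),
    "privilege escalation": (8, 6),
    "lateral movement": (10, 7),
    "persistence": (10, 8),
    "defense evasion": (10, 7),
}

total_tasks = sum(n for n, _ in results.values())
total_passed = sum(p for _, p in results.values())
print(f"aggregate pass rate: {total_passed / total_tasks:.1%}")
for name, (n, p) in results.items():
    print(f"  {name}: {p}/{n} = {p / n:.0%}")
```

Note that an aggregate like 71.4% can hide wide per-category variance; the qualitative breakdown matters as much as the headline number.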
Why Opus 4.7 lags GPT-5.5
The 22.8 percentage-point gap (71.4% - 48.6%) isn’t a pure capability difference. Three factors contribute:
1. Anthropic’s safety training
Anthropic’s Constitutional AI + RLHF training emphasizes harm avoidance. Opus 4.7 will refuse offensive cyber tasks more readily than GPT-5.5 — including some legitimate red-team work. AISI’s eval likely captures both:
- Real capability differences (GPT-5.5 may genuinely be better at cyber reasoning)
- Refusal differences (Opus 4.7 declining tasks GPT-5.5 attempts)
In practice, separating these requires running models with refusal-suppression jailbreaks, which AISI does not publish.
2. OpenAI’s permissive dual-use stance
OpenAI’s policies allow legitimate security research and red-team work with appropriate context. Anthropic’s policies are more restrictive on dual-use cyber. The result: same prompt, different responses, different scores.
3. Training data and post-training
Beyond policy, GPT-5.5’s pre-training corpus and post-training cyber-specific work appear deeper based on AISI’s qualitative analysis. Mythos Preview (Anthropic’s locked frontier model) closes most of the gap, suggesting Anthropic can match GPT-5.5 on raw capability when it chooses to.
What “The Last Ones” tells us
The 32-step end-to-end attack range is the most consequential result. GPT-5.5 completing it 2 of 10 times means:
- Reliability is still low. A 20% success rate on a 32-step task isn’t deploying autonomous attackers; it’s showing the capability exists.
- The trajectory matters. GPT-5.4 and Opus 4.7 scored 0/10. GPT-5.5 is the first frontier model ever to finish this range end-to-end. Next-generation models are likely to push reliability into the 50-80% range.
- Defensive implications. Blue teams should assume autonomous offensive AI capability is real, even if not yet reliable. Detection, monitoring, and zero-trust architectures matter more than ever.
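One way to read the 2/10 figure: if each of the 32 steps succeeded independently with some per-step probability p (a simplifying assumption, since real attack chains have correlated failures), then even a 20% end-to-end rate implies very high per-step reliability. A short sketch:

```python
# End-to-end success on an n-step chain with independent per-step
# probability p is p**n. Inverting shows what per-step reliability a
# given end-to-end rate implies. Independence is a simplifying
# assumption; real attack chains fail in correlated ways.
def per_step_reliability(end_to_end: float, steps: int = 32) -> float:
    return end_to_end ** (1 / steps)

for rate in (0.2, 0.5, 0.8):
    print(f"{rate:.0%} end-to-end -> {per_step_reliability(rate):.1%} per step")
```

Under this toy model, 20% end-to-end already implies roughly 95% per-step reliability, and the 50-80% end-to-end range would require roughly 98-99% per step, which is why the jump from 0/10 to 2/10 is a bigger deal than it looks.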
OpenAI’s response
OpenAI added the following safeguards in response to AISI’s eval (as disclosed in their May 2 2026 deployment update):
- Capability thresholds. GPT-5.5 applies an extra refusal stack to offensive cyber requests above an internal benchmark threshold.
- Usage monitoring. Automated detection of cyber-attack patterns in API usage; flagged accounts get reviewed.
- Red team certifications. Enterprise customers doing legitimate red team work can request relaxed restrictions with verification.
- Reporting. Quarterly transparency reports on cyber-related abuse detection.
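OpenAI’s actual detection pipeline is not public. As a minimal illustration of the pattern-plus-threshold idea behind the usage-monitoring bullet (the pattern list, threshold, and account format are all hypothetical):

```python
# Minimal sketch of usage monitoring: count how often an account's
# prompts match offensive-cyber patterns and flag it past a threshold.
# Patterns and threshold are invented; real systems would use trained
# classifiers, not substring matching.
from collections import Counter

ATTACK_PATTERNS = ("privilege escalation", "lateral movement",
                   "av evasion", "reverse shell")

def flag_accounts(requests, threshold=5):
    """requests: iterable of (account_id, prompt_text) pairs.
    Returns the set of account ids with >= threshold pattern hits."""
    hits = Counter()
    for account, prompt in requests:
        text = prompt.lower()
        if any(pattern in text for pattern in ATTACK_PATTERNS):
            hits[account] += 1
    return {account for account, n in hits.items() if n >= threshold}
```

The design point is the threshold, not the matcher: a single red-team query is expected traffic, while a sustained pattern across a session is what triggers review.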
Anthropic’s response was simpler: continue current policies, since Opus 4.7 already scores below its thresholds of concern.
What this means for deployment
For different stakeholder groups:
Security researchers & red teams
GPT-5.5 is the most capable model for offensive cyber work as of May 2026. Use it through OpenAI’s verified red-team channels for compliance. Opus 4.7 will refuse more often but remains useful for defensive work, threat modeling, and blue-team automation.
Blue teams & defenders
The 2/10 “The Last Ones” result is a wake-up call. Build assuming autonomous attacker capability exists. Prioritize:
- Detection over prevention (you can’t prevent what you can’t predict)
- Zero-trust segmentation (limit lateral movement potential)
- AI-augmented defense (use Sonnet 4.7 / GPT-5.5 / Gemini 3.1 Pro for log analysis, anomaly detection, and triage)
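As a hedged sketch of the kind of statistical baseline that AI-augmented triage sits on top of: a robust median/MAD outlier check on per-host event volumes (the hostnames, counts, and cutoff are invented for illustration).

```python
# Flag hosts whose event volume sits far above a robust baseline.
# Uses median absolute deviation (MAD) rather than mean/stdev so a
# single large outlier can't mask itself by inflating the spread.
# The cutoff value is illustrative, not a recommendation.
from statistics import median

def anomalies(counts: dict[str, int], cutoff: float = 5.0) -> list[str]:
    med = median(counts.values())
    mad = median(abs(v - med) for v in counts.values()) or 1.0
    return [host for host, c in counts.items() if (c - med) / mad > cutoff]
```

A check like this surfaces candidates cheaply; the model-based triage in the bullet above is then reserved for the short list of flagged hosts rather than the full log stream.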
Enterprise IT
Most enterprise users are unaffected. The cyber capability gap matters at the threat actor margin, not for typical knowledge work. Continue standard model selection (Sonnet 4.7 / GPT-5.5 / Gemini 3.1 Pro) for normal use; add monitoring for anyone with API-level access.
Policy makers & regulators
AISI’s evaluation is the highest-quality public cyber capability data available. EU AI Act enforcement bodies, US AISI counterparts, and other national AI safety agencies should treat this as the new baseline for what “frontier capability” means in cyber.
What about Mythos Preview?
Claude Mythos Preview scoring 68.6%, within three points of GPT-5.5’s 71.4%, is the most interesting subtlety. Mythos has not been released publicly — Anthropic has indicated it will only release when safeguards meet internal capability-vs-deployment thresholds. AISI’s eval suggests Mythos is at the threshold where Anthropic must decide:
- Release with extra safeguards (OpenAI’s path)
- Hold release until safeguards mature
- Release in restricted form only (research / select enterprise)
Watch for an announcement from Anthropic in Q3 2026.
What this means for AI investment
Capital implications of the AISI report:
- AI cybersecurity startups (defensive) get more credible — autonomous attacker capability raises demand for AI-augmented defense.
- Frontier model labs face higher safety bars — capable cyber models will need more safeguards before deployment.
- Sovereign AI funds will weigh cyber capability into their investment theses — UK, US, EU, and allies have national security interests here.
Bottom line
AISI’s May 2026 cyber eval is a landmark: it shows frontier AI is approaching the threshold where autonomous offensive cyber capability exists, even if not yet reliable. GPT-5.5 leads, Opus 4.7 lags by safety design, Mythos Preview is the model to watch. For most users this changes nothing day-to-day. For security professionals, AI safety researchers, and policy makers, it resets the threshold for what “frontier capability” means.
Sources: AISI Cybersecurity Evaluation, May 1, 2026 (via ResultSense); RevolutionInAI, “GPT-5.5 vs Claude Mythos: AISI Cybersecurity Numbers,” May 2, 2026; OpenAI deployment safeguards update, May 2, 2026; BenchLM.ai Mythos Preview profile, May 2026.