AISI Cyber Eval: GPT-5.5 vs Mythos vs Opus (May 2026)
On May 1, 2026, the UK AI Safety Institute (AISI) published its latest cyber capability evaluation — and the headline numbers reset the conversation about AI safety in offensive cyber. GPT-5.5 leads at 71.4% on Expert-tier cyber tasks, with Mythos Preview close behind at 68.6%. Opus 4.7, by contrast, scored 48.6% — a gap that reflects Anthropic’s deliberate safety design as much as raw capability.
Here’s what the eval shows and what it means for AI deployment in May 2026.
Last verified: May 3, 2026
The headline numbers
| Model | AISI Expert cyber tasks (pass rate) | “The Last Ones” 32-step attack range |
|---|---|---|
| GPT-5.5 | 71.4% | 2/10 end-to-end |
| Claude Mythos Preview | 68.6% | Not disclosed (research access only) |
| GPT-5.4 | 52.4% | 0/10 |
| Claude Opus 4.7 | 48.6% | 0/10 |
Source: AISI Cybersecurity Evaluation, May 1, 2026 (per ResultSense reporting on May 1, 2026 and RevolutionInAI analysis on May 2, 2026).
What “Expert-tier cyber tasks” actually measures
AISI’s eval suite spans several capability categories:
- Vulnerability discovery — finding zero-days in code samples, binaries, and web apps
- Exploit writing — turning a known vulnerability into a working exploit
- Reverse engineering — analyzing obfuscated binaries
- Privilege escalation — finding and chaining local-elevation paths
- Lateral movement — moving through network topologies
- Persistence — establishing footholds resistant to detection
- Defense evasion — bypassing AV/EDR with model-generated payloads
Expert-tier tasks are calibrated against the difficulty distribution of professional CTF challenges (hard tier) and real-world penetration test work. A 71.4% pass rate is what AISI describes as “approaching capable junior offensive security professional” performance.
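AISI reports a single aggregate pass rate across these categories. As a rough sketch of how such a figure rolls up, using the category names from the list above (the per-category task and pass counts below are invented for illustration; AISI does not publish this breakdown):

```python
# Hypothetical per-category results as (tasks, passed). The counts are
# invented; only the category names come from the eval description.
results = {
    "vulnerability discovery": (14, 10),
    "exploit writing": (10, 7),
    "reverse engineering": (8, 5),
    "privilege escalation": (8, 6),
    "lateral movement": (10, 7),
    "persistence": (10, 8),
    "defense evasion": (10, 7),
}

total_tasks = sum(n for n, _ in results.values())
total_passed = sum(p for _, p in results.values())
print(f"aggregate pass rate: {total_passed / total_tasks:.1%}")
for name, (n, p) in results.items():
    print(f"  {name}: {p}/{n} = {p / n:.0%}")
```

Note that an aggregate like 71.4% can hide wide per-category variance; the qualitative breakdown matters as much as the headline number.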
Why Opus 4.7 lags GPT-5.5
The 22.8 percentage-point gap (71.4% - 48.6%) isn’t a pure capability difference. Three factors contribute:
1. Anthropic’s safety training
Anthropic’s Constitutional AI + RLHF training emphasizes harm avoidance. Opus 4.7 will refuse offensive cyber tasks more readily than GPT-5.5 — including some legitimate red-team work. AISI’s eval likely captures both:
- Real capability differences (GPT-5.5 may genuinely be better at cyber reasoning)
- Refusal differences (Opus 4.7 declining tasks GPT-5.5 attempts)
In practice, separating these requires running models with refusal-suppression jailbreaks, which AISI does not publish.
2. OpenAI’s permissive dual-use stance
OpenAI’s policies allow legitimate security research and red-team work with appropriate context. Anthropic’s policies are more restrictive on dual-use cyber. The result: same prompt, different responses, different scores.
3. Training data and post-training
Beyond policy, GPT-5.5’s pre-training corpus and post-training cyber-specific work appear deeper based on AISI’s qualitative analysis. Mythos Preview (Anthropic’s locked frontier model) closes most of the gap, suggesting Anthropic can match GPT-5.5 on raw capability when it chooses to.
What “The Last Ones” tells us
The 32-step end-to-end attack range is the most consequential result. GPT-5.5 completing it 2 of 10 times means:
- Reliability is still low. A 20% success rate on a 32-step task isn’t deploying autonomous attackers; it’s showing the capability exists.
- The trajectory matters. GPT-5.4 and Opus 4.7 scored 0/10. GPT-5.5 is the first frontier model ever to finish this range end-to-end. Next-generation models are likely to push reliability into the 50-80% range.
- Defensive implications. Blue teams should assume autonomous offensive AI capability is real, even if not yet reliable. Detection, monitoring, and zero-trust architectures matter more than ever.
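One way to read the 2/10 figure: if each of the 32 steps succeeded independently with some per-step probability p (a simplifying assumption, since real attack chains have correlated failures), then even a 20% end-to-end rate implies very high per-step reliability. A short sketch:

```python
# End-to-end success on an n-step chain with independent per-step
# probability p is p**n. Inverting shows what per-step reliability a
# given end-to-end rate implies. Independence is a simplifying
# assumption; real attack chains fail in correlated ways.
def per_step_reliability(end_to_end: float, steps: int = 32) -> float:
    return end_to_end ** (1 / steps)

for rate in (0.2, 0.5, 0.8):
    print(f"{rate:.0%} end-to-end -> {per_step_reliability(rate):.1%} per step")
```

Under this toy model, 20% end-to-end already implies roughly 95% per-step reliability, and the 50-80% end-to-end range would require roughly 98-99% per step, which is why the jump from 0/10 to 2/10 is a bigger deal than it looks.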
OpenAI’s response
OpenAI added the following safeguards in response to AISI’s eval (as disclosed in their May 2 2026 deployment update):
- Capability thresholds. GPT-5.5 applies an extra refusal stack to offensive cyber requests above an internal benchmark threshold.
- Usage monitoring. Automated detection of cyber-attack patterns in API usage; flagged accounts get reviewed.
- Red team certifications. Enterprise customers doing legitimate red team work can request relaxed restrictions with verification.
- Reporting. Quarterly transparency reports on cyber-related abuse detection.
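OpenAI’s actual detection pipeline is not public. As a minimal illustration of the pattern-plus-threshold idea behind the usage-monitoring bullet (the pattern list, threshold, and account format are all hypothetical):

```python
# Minimal sketch of usage monitoring: count how often an account's
# prompts match offensive-cyber patterns and flag it past a threshold.
# Patterns and threshold are invented; real systems would use trained
# classifiers, not substring matching.
from collections import Counter

ATTACK_PATTERNS = ("privilege escalation", "lateral movement",
                   "av evasion", "reverse shell")

def flag_accounts(requests, threshold=5):
    """requests: iterable of (account_id, prompt_text) pairs.
    Returns the set of account ids with >= threshold pattern hits."""
    hits = Counter()
    for account, prompt in requests:
        text = prompt.lower()
        if any(pattern in text for pattern in ATTACK_PATTERNS):
            hits[account] += 1
    return {account for account, n in hits.items() if n >= threshold}
```

The design point is the threshold, not the matcher: a single red-team query is expected traffic, while a sustained pattern across a session is what triggers review.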
Anthropic’s response was simpler: continue current policies, since Opus 4.7 already scores below its thresholds of concern.
What this means for deployment
For different stakeholder groups:
Security researchers & red teams
GPT-5.5 is the most capable model for offensive cyber work as of May 2026. Use it through OpenAI’s verified red-team channels for compliance. Opus 4.7 will refuse more often but remains useful for defensive work, threat modeling, and blue-team automation.
Blue teams & defenders
The 2/10 “The Last Ones” result is a wake-up call. Build assuming autonomous attacker capability exists. Prioritize:
- Detection over prevention (you can’t prevent what you can’t predict)
- Zero-trust segmentation (limit lateral movement potential)
- AI-augmented defense (use Sonnet 4.7 / GPT-5.5 / Gemini 3.1 Pro for log analysis, anomaly detection, and triage)
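As a hedged sketch of the kind of statistical baseline that AI-augmented triage sits on top of: a robust median/MAD outlier check on per-host event volumes (the hostnames, counts, and cutoff are invented for illustration).

```python
# Flag hosts whose event volume sits far above a robust baseline.
# Uses median absolute deviation (MAD) rather than mean/stdev so a
# single large outlier can't mask itself by inflating the spread.
# The cutoff value is illustrative, not a recommendation.
from statistics import median

def anomalies(counts: dict[str, int], cutoff: float = 5.0) -> list[str]:
    med = median(counts.values())
    mad = median(abs(v - med) for v in counts.values()) or 1.0
    return [host for host, c in counts.items() if (c - med) / mad > cutoff]
```

A check like this surfaces candidates cheaply; the model-based triage in the bullet above is then reserved for the short list of flagged hosts rather than the full log stream.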
Enterprise IT
Most enterprise users are unaffected. The cyber capability gap matters at the threat actor margin, not for typical knowledge work. Continue standard model selection (Sonnet 4.7 / GPT-5.5 / Gemini 3.1 Pro) for normal use; add monitoring for anyone with API-level access.
Policy makers & regulators
AISI’s evaluation is the highest-quality public cyber capability data available. EU AI Act enforcement bodies, US AISI counterparts, and other national AI safety agencies should treat this as the new baseline for what “frontier capability” means in cyber.
What about Mythos Preview?
Claude Mythos Preview scoring 68.6%, within three points of GPT-5.5’s 71.4%, is the most interesting subtlety. Mythos has not been released publicly — Anthropic has indicated it will only release when safeguards meet internal capability-vs-deployment thresholds. AISI’s eval suggests Mythos is at the threshold where Anthropic must decide:
- Release with extra safeguards (OpenAI’s path)
- Hold release until safeguards mature
- Release in restricted form only (research / select enterprise)
Watch for an announcement from Anthropic in Q3 2026.
What this means for AI investment
Capital implications of the AISI report:
- AI cybersecurity startups (defensive) get more credible — autonomous attacker capability raises demand for AI-augmented defense.
- Frontier model labs face higher safety bars — capable cyber models will need more safeguards before deployment.
- Sovereign AI funds will weigh cyber capability into their investment theses — UK, US, EU, and allies have national security interests here.
Bottom line
AISI’s May 2026 cyber eval is a landmark: it shows frontier AI is approaching the threshold where autonomous offensive cyber capability exists, even if not yet reliable. GPT-5.5 leads, Opus 4.7 lags by safety design, Mythos Preview is the model to watch. For most users this changes nothing day-to-day. For security professionals, AI safety researchers, and policy makers, it resets the threshold for what “frontier capability” means.
Sources: AISI Cybersecurity Evaluation, May 1, 2026 (via ResultSense); RevolutionInAI, “GPT-5.5 vs Claude Mythos: AISI Cybersecurity Numbers,” May 2, 2026; OpenAI deployment safeguards update, May 2, 2026; BenchLM.ai Mythos Preview profile, May 2026.