What is the AI jailbreak consensus framework?

It's a proposed standard, under development as of July 2026 by Anthropic and its Project Glasswing partners, for scoring the severity of AI jailbreaks. It rates each jailbreak on four criteria: capability gain beyond existing tools, breadth of offensive capabilities unlocked, ease of weaponization, and discoverability of the technique.

Why is Anthropic building this now?

In June 2026, Amazon researchers reported a jailbreak in Anthropic's Fable 5 model. The US Commerce Department briefly restricted exports of Fable 5 and Mythos 5. The restrictions were lifted on July 1, 2026, but the incident exposed the lack of any industry standard for distinguishing a serious jailbreak from a theoretical one. The framework is meant to fill that gap.

Who's involved besides Anthropic?

Project Glasswing partners include AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorgan Chase, the Linux Foundation, Microsoft, Nvidia, and Palo Alto Networks. The framework is being developed collaboratively across this group, with the aim of standardizing how AI developers, defenders, and governments evaluate jailbreak risk.

How is this different from existing red-teaming?

Red-teaming finds jailbreaks. The consensus framework scores them once they are found. It is intended as the AI equivalent of CVSS for software vulnerabilities: a shared vocabulary that lets Anthropic, AWS, Microsoft, governments, and researchers agree on severity, prioritize mitigations, and coordinate a response. Without it, every jailbreak incident becomes a bespoke debate.

Anthropic's AI Jailbreak Consensus Framework (July 2026)

Published: July 2, 2026

Anthropic and Project Glasswing partners are developing a “consensus framework” for scoring AI jailbreaks across four severity criteria, following the June 2026 Fable 5 jailbreak incident that briefly triggered US export controls. The framework is the AI-safety equivalent of CVSS — a shared vocabulary for saying which jailbreaks are dangerous and which are theoretical. Here’s what’s in it and why it matters.

Last verified: July 2, 2026

Why the framework exists

Background: In June 2026, Amazon researchers reported a jailbreak in Anthropic’s Fable 5 model. The US Commerce Department temporarily imposed export controls on Fable 5 and Mythos 5. After negotiation, the restrictions were lifted on July 1, 2026, with Anthropic adding a new cybersecurity classifier.

The incident exposed a governance gap. There was no industry-shared way to say:

Is this a theoretical jailbreak (proof-of-concept, hard to weaponize)?
Is this a practical jailbreak (any competent bad actor could reproduce it)?
Is it novel (a new capability the model shouldn’t have) or redundant (information freely available elsewhere)?

Every stakeholder — Anthropic, AWS, the US government, media — had to invent a severity narrative from scratch. That’s slow and inconsistent.

The consensus framework is Anthropic and Project Glasswing partners’ answer.

The four criteria

The framework, as publicly outlined so far, scores jailbreaks on:

Criterion	Question it answers
Capability gain	How much does the jailbroken model exceed capabilities available from existing tools (open web, other models, textbooks)?
Breadth of offensive capabilities	Does the jailbreak unlock one narrow harm or a wide range of harms?
Ease of weaponization	How much technical skill does an attacker need to turn the jailbreak into real damage?
Discoverability	How likely is it that other researchers or attackers will independently find the same technique?

A jailbreak that is high on all four — big capability gain, broad harm surface, easy to weaponize, easily rediscovered — is a five-alarm incident. A jailbreak that is low on all four is essentially a footnote.
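
To make the four criteria concrete, here is a minimal Python sketch of how a single assessment might be recorded and rolled up. Everything numeric in it is an assumption made for illustration: the 0 to 10 scale, the equal weighting, and the band thresholds are invented, and the names JailbreakAssessment, overall, and band are hypothetical. The actual scoring methodology and weightings have not been published.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class JailbreakAssessment:
    """One jailbreak rated on the four publicly outlined criteria.

    Assumption: each criterion is rated on a 0-10 scale. The real
    framework has not published its scale.
    """
    capability_gain: float
    breadth: float
    ease_of_weaponization: float
    discoverability: float

    def __post_init__(self) -> None:
        # Reject ratings outside the assumed 0-10 range.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 10.0:
                raise ValueError(f"{f.name} must be in [0, 10], got {value}")

    def overall(self) -> float:
        # Assumption: equal weights. The real weightings are unpublished.
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)


def band(assessment: JailbreakAssessment) -> str:
    """Map the overall score to a coarse label (hypothetical thresholds)."""
    score = assessment.overall()
    if score >= 7.0:
        return "critical"
    if score >= 4.0:
        return "moderate"
    return "low"


if __name__ == "__main__":
    five_alarm = JailbreakAssessment(9, 9, 9, 9)
    footnote = JailbreakAssessment(1, 1, 1, 1)
    print(band(five_alarm), band(footnote))
```

Under these assumed thresholds, a jailbreak rated 9 on every criterion lands in the "critical" band and one rated 1 on every criterion lands in "low". A real scheme might weight the criteria unequally or combine them non-linearly, as CVSS does for its own metrics.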

Who’s building it

Project Glasswing was launched by Anthropic in April 2026 to leverage Claude Mythos for defensive cybersecurity across critical infrastructure. Its partners include:

Cloud providers: AWS, Google, Microsoft
Silicon and networking: Nvidia, Broadcom, Cisco
Security vendors: CrowdStrike, Palo Alto Networks
Enterprise and finance: Apple, JPMorgan Chase
Open source: Linux Foundation

The consensus framework is being developed collaboratively across this group. It’s not Anthropic imposing a standard — it’s Anthropic corralling the vendors who would have to actually apply the standard.

Why this matters beyond Anthropic

Three structural implications:

Governments get a shared basis for regulation. Instead of every jurisdiction inventing its own severity taxonomy, they can point to the consensus framework. US, UK, EU, and Japan AI safety institutes have all signaled interest in shared severity scoring.
AI incident response gets faster. When the next jailbreak lands, Anthropic, AWS, and the reporting researcher can immediately triage using shared criteria. That compresses the “days-of-uncertainty” phase.
Insurance and compliance follow. Once severity scoring is standardized, cyber insurance policies, SOC 2 controls, and enterprise procurement can all reference it. That drags AI safety practice into mainstream corporate governance.

Compared to existing standards

Standard	Domain	Relationship to the jailbreak framework
CVSS	Software vulnerabilities	The template the AI jailbreak framework is emulating
MITRE ATT&CK	Cyber tactics/techniques	Complementary — describes techniques, not severity
NIST AI RMF	AI risk management	Governance framework, not incident scoring
OWASP LLM Top 10	LLM vulnerabilities	Category-level list, not severity scoring
AI jailbreak consensus framework	AI jailbreak severity	The missing piece: severity scoring, not just categories (scoring method not yet published)

Limitations

Not yet published. As of July 2, 2026, the four criteria are public but the scoring methodology, weightings, and reference examples are still being negotiated among Project Glasswing partners.
Voluntary. Like CVSS, adoption is voluntary. Non-participating labs (Meta, xAI, Chinese labs) may not use it.
Political sensitivity. Any severity score that leads to export controls creates lobbying pressure. Expect debates about what “high capability gain” means in practice.
Model-specific quirks. A jailbreak that scores 9/10 on Fable 5 might score 2/10 on Sonnet 5 if the attack surface differs. Cross-model severity is genuinely hard.
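
The cross-model problem in the last limitation can be illustrated with a short Python sketch that stores one score per model for the same technique instead of a single global number. This is an assumption about how such a record could look, not part of the framework: TechniqueReport, worst_case, and spread are hypothetical names, and the 9 and 2 are the illustrative figures from the limitation above, not real measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TechniqueReport:
    """Per-model severity scores for a single jailbreak technique.

    Assumption: scores share one 0-10 scale across models.
    """
    technique_id: str                  # hypothetical identifier
    scores_by_model: dict[str, float]  # model name -> severity score

    def worst_case(self) -> tuple[str, float]:
        # The model on which the technique is most severe.
        model = max(self.scores_by_model, key=self.scores_by_model.get)
        return model, self.scores_by_model[model]

    def spread(self) -> float:
        # A large gap means one headline number would mislead.
        values = self.scores_by_model.values()
        return max(values) - min(values)


if __name__ == "__main__":
    report = TechniqueReport(
        technique_id="example-technique",
        scores_by_model={"Fable 5": 9.0, "Sonnet 5": 2.0},
    )
    print(report.worst_case(), report.spread())
```

Keeping the per-model scores separate lets a responder prioritize by the worst case while still seeing that the same technique is close to harmless elsewhere.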

What to watch

Publication of the full spec — likely late 2026 or early 2027
Adoption by non-Anthropic labs — OpenAI, Google DeepMind, Mistral
Government endorsement — US AI Safety Institute, UK AISI, EU AI Office
First live scoring — the next reported jailbreak will be the pressure test
Insurance integration — cyber insurers referencing the score in policy language

Bottom line

The consensus framework is the AI-safety equivalent of CVSS — a shared vocabulary for jailbreak severity that Anthropic and its Project Glasswing partners are building in the wake of the June 2026 Fable 5 incident. It won’t stop jailbreaks, but it should make the response coordinated instead of chaotic. Full spec publication and government endorsement are the milestones to watch.