Quick Answer

Anthropic's AI Jailbreak Consensus Framework (July 2026)

Anthropic and Project Glasswing partners are developing a “consensus framework” for scoring AI jailbreaks across four severity criteria, following the June 2026 Fable 5 jailbreak incident that briefly triggered US export controls. The framework is the AI-safety equivalent of CVSS — a shared vocabulary for saying which jailbreaks are dangerous and which are theoretical. Here’s what’s in it and why it matters.

Last verified: July 2, 2026

Why the framework exists

Background: In June 2026, Amazon researchers reported a jailbreak in Anthropic’s Fable 5 model. The US Commerce Department temporarily imposed export controls on Fable 5 and Mythos 5. After negotiation, the restrictions were lifted on July 1, 2026, and Anthropic added a new cybersecurity classifier.

The incident exposed a governance gap. The industry had no shared way to answer basic questions about a reported jailbreak:

  • Is it theoretical (a proof of concept that is hard to weaponize)?
  • Is it practical (something any competent bad actor could reproduce)?
  • Is it novel (a new capability the model shouldn’t have) or redundant (information freely available elsewhere)?

Every stakeholder — Anthropic, AWS, the US government, media — had to invent a severity narrative from scratch. That’s slow and inconsistent.

The consensus framework is Anthropic and Project Glasswing partners’ answer.

The four criteria

The framework, as publicly outlined so far, scores jailbreaks on:

| Criterion | Question it answers |
| --- | --- |
| Capability gain | How much does the jailbroken model exceed capabilities available from existing tools (open web, other models, textbooks)? |
| Breadth of offensive capabilities | Does the jailbreak unlock one narrow harm or a wide range of harms? |
| Ease of weaponization | How much technical skill does an attacker need to turn the jailbreak into real damage? |
| Discoverability | How likely is it that other researchers or attackers will independently find the same technique? |

A jailbreak that is high on all four — big capability gain, broad harm surface, easy to weaponize, easily rediscovered — is a five-alarm incident. A jailbreak that is low on all four is essentially a footnote.
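
To make the four criteria concrete, here is a minimal Python sketch of how a single assessment could be recorded. The scoring methodology has not been published, so the 0 to 3 ordinal scale, the `JailbreakAssessment` class, and its field and method names are all assumptions made for illustration, not part of the framework.

```python
from dataclasses import dataclass, fields

# Hypothetical ordinal scale: 0 = negligible, 3 = severe.
# The real framework's scale and weightings are not yet public.
MIN_LEVEL, MAX_LEVEL = 0, 3


@dataclass(frozen=True)
class JailbreakAssessment:
    """One jailbreak rated on the four publicly outlined criteria."""
    capability_gain: int
    breadth: int
    ease_of_weaponization: int
    discoverability: int

    def __post_init__(self):
        # Reject ratings outside the assumed scale.
        for f in fields(self):
            value = getattr(self, f.name)
            if not MIN_LEVEL <= value <= MAX_LEVEL:
                raise ValueError(
                    f"{f.name} must be in {MIN_LEVEL}..{MAX_LEVEL}, got {value}"
                )

    def levels(self):
        """Return the four ratings in declaration order."""
        return tuple(getattr(self, f.name) for f in fields(self))

    def is_high_on_all(self):
        """True for the 'five-alarm' case: every criterion at the top level."""
        return all(v == MAX_LEVEL for v in self.levels())

    def is_low_on_all(self):
        """True for the 'footnote' case: every criterion at the bottom level."""
        return all(v == MIN_LEVEL for v in self.levels())


five_alarm = JailbreakAssessment(3, 3, 3, 3)
footnote = JailbreakAssessment(0, 0, 0, 0)
mixed = JailbreakAssessment(3, 0, 1, 2)
```

The sketch deliberately keeps the four ratings separate instead of collapsing them into one number, because the weightings are still being negotiated. Most real cases would look like `mixed`, which is neither a five-alarm incident nor a footnote.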

Who’s building it

Project Glasswing was launched by Anthropic in April 2026 to leverage Claude Mythos for defensive cybersecurity across critical infrastructure. Its partners include:

  • Cloud providers: AWS, Google, Microsoft
  • Silicon and networking: Nvidia, Broadcom, Cisco
  • Security vendors: CrowdStrike, Palo Alto Networks
  • Enterprise + finance: Apple, JPMorgan Chase
  • Open source: Linux Foundation

The consensus framework is being developed collaboratively across this group. It’s not Anthropic imposing a standard — it’s Anthropic corralling the vendors who would have to actually apply the standard.

Why this matters beyond Anthropic

Three structural implications:

  1. Governments get a shared basis for regulation. Instead of every jurisdiction inventing its own severity taxonomy, they can point to the consensus framework. US, UK, EU, and Japan AI safety institutes have all signaled interest in shared severity scoring.
  2. AI incident response gets faster. When the next jailbreak lands, Anthropic, AWS, and the reporting researcher can immediately triage using shared criteria. That compresses the “days-of-uncertainty” phase.
  3. Insurance and compliance follow. Once severity scoring is standardized, cyber insurance policies, SOC 2 controls, and enterprise procurement can all reference it. That drags AI safety practice into mainstream corporate governance.

Compared to existing standards

| Standard | Domain | Relationship to the jailbreak framework |
| --- | --- | --- |
| CVSS | Software vulnerabilities | The template the AI jailbreak framework is emulating |
| MITRE ATT&CK | Cyber tactics/techniques | Complementary: describes techniques, not severity |
| NIST AI RMF | AI risk management | Governance framework, not incident scoring |
| OWASP LLM Top 10 | LLM vulnerabilities | Category-level list, not severity scoring |
| AI jailbreak consensus framework | AI jailbreak severity | The missing piece: quantitative severity, not just categories |

Limitations

  • Not yet published. As of July 2, 2026, the four criteria are public but the scoring methodology, weightings, and reference examples are still being negotiated among Project Glasswing partners.
  • Voluntary. As with CVSS, adoption is voluntary. Non-participating labs (Meta, xAI, Chinese labs) may not use it.
  • Political sensitivity. Any severity score that leads to export controls creates lobbying pressure. Expect debates about what “high capability gain” means in practice.
  • Model-specific quirks. A jailbreak that rates, say, 9/10 on Fable 5 might rate 2/10 on Sonnet 5 if the attack surface differs (the numbers are illustrative; no scale has been published). Comparing severity across models is genuinely hard.
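
One way to read the model-specific limitation is that a score belongs to a (technique, model) pair, not to a technique alone. The Python sketch below illustrates that idea only; the `SeverityRegistry` class, the `technique-A` identifier, and the 0 to 10 scale are hypothetical, and the 9 and 2 values simply mirror the illustrative figures in the bullet above.

```python
from collections import defaultdict


class SeverityRegistry:
    """Stores one score per (technique, model) pair. Hypothetical structure."""

    def __init__(self):
        # technique_id -> {model name: score}
        self._scores = defaultdict(dict)

    def record(self, technique_id, model, score):
        self._scores[technique_id][model] = score

    def score(self, technique_id, model):
        # Returns None when the technique has not been assessed on this model.
        return self._scores.get(technique_id, {}).get(model)

    def spread(self, technique_id):
        """Gap between the highest and lowest score across assessed models."""
        per_model = self._scores.get(technique_id)
        if not per_model:
            raise KeyError(technique_id)
        return max(per_model.values()) - min(per_model.values())


registry = SeverityRegistry()
registry.record("technique-A", "Fable 5", 9)
registry.record("technique-A", "Sonnet 5", 2)

print(registry.spread("technique-A"))            # 7
print(registry.score("technique-A", "Mythos 5"))  # None (not assessed)
```

A large spread for one technique is a signal that quoting a single severity number without naming the model would be misleading.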

What to watch

  • Publication of the full spec — likely late 2026 or early 2027
  • Adoption by non-Anthropic labs — OpenAI, Google DeepMind, Mistral
  • Government endorsement — US AI Safety Institute, UK AISI, EU AI Office
  • First live scoring — the next reported jailbreak will be the pressure test
  • Insurance integration — cyber insurers referencing the score in policy language

Bottom line

The consensus framework is the AI-safety equivalent of CVSS — a shared vocabulary for jailbreak severity that Anthropic and its Project Glasswing partners are building in the wake of the June 2026 Fable 5 incident. It won’t stop jailbreaks, but it should make the response coordinated instead of chaotic. Full spec publication and government endorsement are the milestones to watch.

