
What Is Claude Managed Agents Outcomes? (May 2026)

Quick Answer

Anthropic shipped Outcomes for Claude Managed Agents at Code with Claude SF on May 6, 2026, as a public beta. It pairs declarative success criteria with a grader-driven iteration loop, and Anthropic reports a lift of ~10 percentage points in task success on complex tasks. Here’s what Outcomes is, how the grader works, and when to use it.

Last verified: May 10, 2026

The announcement at a glance

| Property | Value |
| --- | --- |
| Launched | May 6, 2026 |
| Status | Public beta |
| Platform | Claude Managed Agents |
| Reported uplift | ~10pp task success on intricate tasks |
| Use cases | File generation, data extraction, code review, structured tasks |
| Sibling launches | Multi-agent orchestration (beta), Dreaming (research preview) |

What Outcomes actually is

Outcomes is a way to tell a Claude managed agent what “done well” looks like rather than what to do step by step. Three pieces:

1. The rubric

You write declarative success criteria for the task, in natural language plus optional structured fields. Examples:

  • “All customer email addresses must be redacted before the document is returned.”
  • “The generated test suite must achieve 80%+ branch coverage on the target module.”
  • “The response must cite at least three sources from the provided document set.”
  • “The CSV output must conform to the provided schema with no null values in required fields.”
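As a rough sketch, criteria like these could travel as plain structured data. The field names below (id, criterion, required) are illustrative assumptions, not the Managed Agents schema:

```python
# Hypothetical rubric expressed as structured data. Field names ("id",
# "criterion", "required") are illustrative, not Anthropic's actual schema.
rubric = [
    {"id": "redaction",
     "criterion": "All customer email addresses are redacted before return.",
     "required": True},
    {"id": "coverage",
     "criterion": "The generated test suite achieves 80%+ branch coverage.",
     "required": True},
    {"id": "citations",
     "criterion": "The response cites at least three sources from the set.",
     "required": True},
]
```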

2. The grader

A separate evaluation pass scores the agent’s output against the rubric. The grader is itself an LLM call but isolated from the producing agent — the work-doer and the evaluator are separate. The grader returns:

  • A pass/fail (or numeric score) per rubric item.
  • Structured feedback explaining why each item passed or failed.
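In shape, a grading pass might return records like the ones below. This is a guess at the structure the description implies, not a documented schema:

```python
from dataclasses import dataclass

@dataclass
class RubricItemResult:
    item_id: str   # which rubric item this verdict covers
    passed: bool   # pass/fail verdict (a numeric score is also possible)
    feedback: str  # why the item passed or failed, fed back to the agent

# Invented example of one grading pass over a three-item rubric:
results = [
    RubricItemResult("redaction", True, "No email addresses remain in the output."),
    RubricItemResult("coverage", False, "Branch coverage is 64%, below the 80% bar."),
    RubricItemResult("citations", True, "Four sources cited from the document set."),
]
```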

3. The iteration loop

The agent receives the grader’s feedback as context and revises its output. This loop continues until the rubric passes or the configured iteration budget runs out.
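A minimal sketch of that loop, with produce() and grade() as stand-ins for the agent and grader calls (in Managed Agents the loop ships built-in; you never write this yourself):

```python
# Minimal sketch of an Outcomes-style loop. produce() and grade() are
# stand-ins for the agent and grader LLM calls.

def produce(task: str, feedback: list[str]) -> str:
    # Stand-in for the work-doing agent; revises using prior grader feedback.
    return f"draft for {task!r} after {len(feedback)} feedback items"

def grade(output: str, rubric: list[str]) -> list[dict]:
    # Stand-in for the isolated grader: one verdict per rubric item.
    return [{"passed": True, "feedback": ""} for _ in rubric]

def run_with_outcomes(task: str, rubric: list[str], max_iterations: int = 3) -> str:
    feedback: list[str] = []
    output = ""
    for _ in range(max_iterations):
        output = produce(task, feedback)        # agent produces or revises
        verdicts = grade(output, rubric)        # separate evaluation pass
        if all(v["passed"] for v in verdicts):  # rubric satisfied: done
            return output
        feedback = [v["feedback"] for v in verdicts if not v["passed"]]
    return output                               # iteration budget exhausted
```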

The architecture is intentionally simple. The contract is the rubric. The loop is built-in. You don’t integrate LangSmith or wire up your own eval framework — Anthropic ships the grader and the iteration logic.

What problem Outcomes solves

The most common production agent failure mode in 2025-2026 wasn’t “the model is too dumb.” It was “the agent doesn’t know how to tell good work from bad work.”

Concrete examples:

  • An agent generating test cases produces tests that compile and run, but cover only the happy path. Without a coverage rubric, it doesn’t know to dig into edge cases.
  • An agent extracting structured data from PDFs misses the address field because it’s in a footer the agent skimmed. Without an “all fields populated” rubric, it doesn’t go back.
  • An agent writing customer support responses produces fluent text that uses internal-only terminology. Without a “no internal jargon” rubric, the agent doesn’t notice.

In all three, the prompt isn’t the bottleneck. The agent’s self-evaluation is. Outcomes adds an explicit, separately-graded self-evaluation step.

When to use Outcomes

Good fits:

  1. Tasks with measurable acceptance criteria. Test coverage, redaction completeness, citation counts, schema compliance, format requirements.
  2. Repeated task types. Same kind of work, run hundreds of times, where consistency matters more than per-run cleverness.
  3. Regulated workloads. “The agent verified its work against a defined rubric” is a useful sentence in a SOC 2 / HIPAA / PCI audit.
  4. Tasks the agent currently flunks. When a stronger prompt isn’t fixing it, an explicit rubric and grader often will.

Bad fits:

  1. Subjective tasks. “Write something delightful” doesn’t grade reliably.
  2. One-off tasks. Writing a rubric for a job you’ll do once is more work than doing the job.
  3. Fine-grained reasoning control. If you need precise control over the agent’s internal steps, Outcomes’ “rubric → iterate until passes” loop is too coarse.

Concrete example: code review agent

A managed agent reviews pull requests. Without Outcomes:

```
You are a code reviewer. Look at this PR and provide feedback.
```

The agent produces feedback that’s plausible but inconsistent — sometimes thorough, sometimes shallow.

With Outcomes:

```yaml
rubric:
  - "All flagged issues must include a specific file and line number."
  - "Security-relevant changes must call out potential vulnerability classes."
  - "Test coverage gaps must be identified for any new public function."
  - "Performance-sensitive changes (loops, queries) must include a complexity comment."
```

The grader checks each rubric item. The agent iterates until all pass. The output is consistently thorough because the contract is explicit.
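To make that concrete, a first grading pass might come back looking like this. The values, including the parse_scope() function name, are invented for illustration:

```python
# Invented first-pass grading result for the PR-review rubric above. The
# failing items' feedback goes back to the agent for the next revision.
first_pass = [
    {"item": "file-and-line", "passed": True,
     "feedback": "All six flagged issues include file and line references."},
    {"item": "security", "passed": False,
     "feedback": "The auth-header change is not flagged; name the relevant "
                 "vulnerability class (e.g. token leakage)."},
    {"item": "test-coverage", "passed": False,
     "feedback": "New public function parse_scope() has no coverage note."},
    {"item": "performance", "passed": True,
     "feedback": "The batched query change includes a complexity comment."},
]
```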

How Outcomes fits with multi-agent orchestration and Dreaming

Three Code with Claude launches, three different agent failure modes:

| Failure mode | Solution shipped May 6, 2026 |
| --- | --- |
| “Agent did the wrong thing because it couldn’t tell good from bad” | Outcomes |
| “One agent overloaded by a complex task” | Multi-agent orchestration |
| “Agent doesn’t learn from past sessions” | Dreaming |

They compose:

  • A multi-agent orchestration could have a lead agent with an Outcomes rubric for the overall result, dispatching specialized sub-agents (each with their own rubrics) to handle pieces; a sketch follows this list.
  • Dreaming could consolidate patterns from past Outcomes runs — “every time the rubric required X, the failing strategies were Y” — into refined memory.
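A hypothetical sketch of that lead-plus-sub-agents arrangement. Every name here is invented; none of it is the Managed Agents API:

```python
# Hypothetical composition: a lead agent with an overall rubric dispatches
# sub-agents that each carry their own rubric. All names are invented.

def run_with_outcomes(task: str, rubric: list[str]) -> str:
    return f"result for {task!r}"  # stub for the Outcomes loop sketched earlier

SUBTASK_RUBRICS = {
    "extract": ["All required CSV fields are populated with no nulls."],
    "review": ["All flagged issues include a specific file and line number."],
}

OVERALL_RUBRIC = ["The final report covers both extraction and review findings."]

def orchestrate(task: str) -> str:
    # Each sub-agent runs against its own rubric...
    parts = {name: run_with_outcomes(name, rubric)
             for name, rubric in SUBTASK_RUBRICS.items()}
    # ...then the lead agent combines them and is graded on the overall rubric.
    combined = f"{task}: combine {', '.join(parts)}"
    return run_with_outcomes(combined, OVERALL_RUBRIC)
```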

The full set positions Managed Agents as Anthropic’s answer to LangGraph + LangSmith + a hosted runtime, with one less integration to manage.

What Outcomes doesn’t replace

  • LangSmith / your own observability. Outcomes gives you per-task evaluation. For organization-wide observability across many agent runs, you still want a tracing/eval product.
  • Human review for high-stakes work. The grader is an LLM. It’s fallible on subjective or adversarial cases.
  • Multi-model setups. Outcomes runs inside Managed Agents. If you need to use Claude here and GPT-5.5 there, you’ll still want LangGraph or your own router on top.

Pricing and availability

  • Public beta as of May 6, 2026. No GA SLA yet.
  • Available to Claude Managed Agents customers — same access tier as the rest of Managed Agents.
  • Grader passes add extra LLM calls each iteration, and every revision is another agent call; budget for 1.5-3x the base agent run cost, depending on iteration count.
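Back-of-envelope, with the per-iteration overhead split purely assumed for illustration:

```python
# Rough cost model consistent with the 1.5-3x guidance above. The split
# between grader and revision overhead per iteration is an assumption.
def outcomes_cost(base_cost: float, iterations: int,
                  grader_frac: float = 0.15, revision_frac: float = 0.5) -> float:
    # Each iteration adds one grader pass plus one revision pass.
    return base_cost * (1 + iterations * (grader_frac + revision_frac))

print(round(outcomes_cost(0.10, 1), 3))  # 0.165 -> ~1.65x base, one iteration
print(round(outcomes_cost(0.10, 3), 3))  # 0.295 -> ~2.95x base, three iterations
```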

What to watch next

  • GA timeline. When does Outcomes leave public beta?
  • Custom grader support. Can you bring your own grader (different model, custom code) instead of Anthropic’s default?
  • Outcomes templates. Reusable rubrics for common tasks (PR review, data extraction, content moderation).
  • Multi-model orchestration. Does Outcomes-style grading reach beyond Claude-only Managed Agents?

Last verified: May 10, 2026 — sources: Anthropic Code with Claude SF announcements, SDTimes, TheNewStack, 9to5Mac, VentureBeat, Simon Willison’s CwC notes.