
What Is Claude Managed Agents Outcomes? (May 2026)

Quick Answer

Anthropic shipped Outcomes for Claude Managed Agents at Code with Claude SF on May 6, 2026, as a public beta. It pairs declarative success criteria with a grader-driven iteration loop, and Anthropic reports a lift of ~10 percentage points in task success on complex tasks. Here’s what Outcomes is, how the grader works, and when to use it.

Last verified: May 10, 2026

The announcement at a glance

| Property | Value |
| --- | --- |
| Launched | May 6, 2026 |
| Status | Public beta |
| Platform | Claude Managed Agents |
| Reported uplift | ~10pp task success on intricate tasks |
| Use cases | File generation, data extraction, code review, structured tasks |
| Sibling launches | Multi-agent orchestration (beta), Dreaming (research preview) |

What Outcomes actually is

Outcomes is a way to tell a Claude managed agent what “done well” looks like rather than what to do step by step. Three pieces:

1. The rubric

You write declarative success criteria for the task, in natural language plus optional structured fields. Examples:

  • “All customer email addresses must be redacted before the document is returned.”
  • “The generated test suite must achieve 80%+ branch coverage on the target module.”
  • “The response must cite at least three sources from the provided document set.”
  • “The CSV output must conform to the provided schema with no null values in required fields.”
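As a rough sketch, criteria like these could travel as plain structured data. The field names below (id, criterion, required) are illustrative assumptions, not the Managed Agents schema:

```python
# Hypothetical rubric expressed as structured data. Field names ("id",
# "criterion", "required") are illustrative, not Anthropic's actual schema.
rubric = [
    {"id": "redaction",
     "criterion": "All customer email addresses are redacted before return.",
     "required": True},
    {"id": "coverage",
     "criterion": "The generated test suite achieves 80%+ branch coverage.",
     "required": True},
    {"id": "citations",
     "criterion": "The response cites at least three sources from the set.",
     "required": True},
]
```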

2. The grader

A separate evaluation pass scores the agent’s output against the rubric. The grader is itself an LLM call but isolated from the producing agent — the work-doer and the evaluator are separate. The grader returns:

  • A pass/fail (or numeric score) per rubric item.
  • Structured feedback explaining why each item passed or failed.
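In shape, a grading pass might return records like the ones below. This is a guess at the structure the description implies, not a documented schema:

```python
from dataclasses import dataclass

@dataclass
class RubricItemResult:
    item_id: str   # which rubric item this verdict covers
    passed: bool   # pass/fail verdict (a numeric score is also possible)
    feedback: str  # why the item passed or failed, fed back to the agent

# Invented example of one grading pass over a three-item rubric:
results = [
    RubricItemResult("redaction", True, "No email addresses remain in the output."),
    RubricItemResult("coverage", False, "Branch coverage is 64%, below the 80% bar."),
    RubricItemResult("citations", True, "Four sources cited from the document set."),
]
```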

3. The iteration loop

The agent receives the grader’s feedback as context and revises its output. This loop continues until the rubric passes or the configured iteration budget runs out.
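A minimal sketch of that loop, with produce() and grade() as stand-ins for the agent and grader calls (in Managed Agents the loop ships built-in; you never write this yourself):

```python
# Minimal sketch of an Outcomes-style loop. produce() and grade() are
# stand-ins for the agent and grader LLM calls.

def produce(task: str, feedback: list[str]) -> str:
    # Stand-in for the work-doing agent; revises using prior grader feedback.
    return f"draft for {task!r} after {len(feedback)} feedback items"

def grade(output: str, rubric: list[str]) -> list[dict]:
    # Stand-in for the isolated grader: one verdict per rubric item.
    return [{"passed": True, "feedback": ""} for _ in rubric]

def run_with_outcomes(task: str, rubric: list[str], max_iterations: int = 3) -> str:
    feedback: list[str] = []
    output = ""
    for _ in range(max_iterations):
        output = produce(task, feedback)        # agent produces or revises
        verdicts = grade(output, rubric)        # separate evaluation pass
        if all(v["passed"] for v in verdicts):  # rubric satisfied: done
            return output
        feedback = [v["feedback"] for v in verdicts if not v["passed"]]
    return output                               # iteration budget exhausted
```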

The architecture is intentionally simple. The contract is the rubric. The loop is built-in. You don’t integrate LangSmith or wire up your own eval framework — Anthropic ships the grader and the iteration logic.

What problem Outcomes solves

The most common production agent failure mode in 2025-2026 wasn’t “the model is too dumb.” It was “the agent doesn’t know how to tell good work from bad work.”

Concrete examples:

  • An agent generating test cases produces tests that compile and run, but cover only the happy path. Without a coverage rubric, it doesn’t know to dig into edge cases.
  • An agent extracting structured data from PDFs misses the address field because it’s in a footer the agent skimmed. Without an “all fields populated” rubric, it doesn’t go back.
  • An agent writing customer support responses produces fluent text that uses internal-only terminology. Without a “no internal jargon” rubric, the agent doesn’t notice.

In all three, the prompt isn’t the bottleneck. The agent’s self-evaluation is. Outcomes adds an explicit, separately-graded self-evaluation step.

When to use Outcomes

Good fits:

  1. Tasks with measurable acceptance criteria. Test coverage, redaction completeness, citation counts, schema compliance, format requirements.
  2. Repeated task types. Same kind of work, run hundreds of times, where consistency matters more than per-run cleverness.
  3. Regulated workloads. “The agent verified its work against a defined rubric” is a useful sentence in a SOC 2 / HIPAA / PCI audit.
  4. Tasks the agent currently flunks. When a stronger prompt isn’t fixing it, an explicit rubric and grader often will.

Bad fits:

  1. Subjective tasks. “Write something delightful” doesn’t grade reliably.
  2. One-off tasks. Writing a rubric for a job you’ll do once is more work than doing the job.
  3. Fine-grained reasoning control. If you need precise control over the agent’s internal steps, Outcomes’ “rubric → iterate until passes” loop is too coarse.

Concrete example: code review agent

A managed agent reviews pull requests. Without Outcomes:

```
You are a code reviewer. Look at this PR and provide feedback.
```

The agent produces feedback that’s plausible but inconsistent — sometimes thorough, sometimes shallow.

With Outcomes:

```yaml
rubric:
  - "All flagged issues must include a specific file and line number."
  - "Security-relevant changes must call out potential vulnerability classes."
  - "Test coverage gaps must be identified for any new public function."
  - "Performance-sensitive changes (loops, queries) must include a complexity comment."
```

The grader checks each rubric item. The agent iterates until all pass. The output is consistently thorough because the contract is explicit.
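To make that concrete, a first grading pass might come back looking like this. The values, including the parse_scope() function name, are invented for illustration:

```python
# Invented first-pass grading result for the PR-review rubric above. The
# failing items' feedback goes back to the agent for the next revision.
first_pass = [
    {"item": "file-and-line", "passed": True,
     "feedback": "All six flagged issues include file and line references."},
    {"item": "security", "passed": False,
     "feedback": "The auth-header change is not flagged; name the relevant "
                 "vulnerability class (e.g. token leakage)."},
    {"item": "test-coverage", "passed": False,
     "feedback": "New public function parse_scope() has no coverage note."},
    {"item": "performance", "passed": True,
     "feedback": "The batched query change includes a complexity comment."},
]
```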

How Outcomes fits with multi-agent orchestration and Dreaming

Three Code with Claude launches, three different agent failure modes:

| Failure mode | Solution shipped May 6, 2026 |
| --- | --- |
| “Agent did the wrong thing because it couldn’t tell good from bad” | Outcomes |
| “One agent overloaded by a complex task” | Multi-agent orchestration |
| “Agent doesn’t learn from past sessions” | Dreaming |

They compose:

  • A multi-agent orchestration could have a lead agent with an Outcomes rubric for the overall result, dispatching specialized sub-agents (each with their own rubrics) to handle pieces; a sketch follows this list.
  • Dreaming could consolidate patterns from past Outcomes runs — “every time the rubric required X, the failing strategies were Y” — into refined memory.
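A hypothetical sketch of that lead-plus-sub-agents arrangement. Every name here is invented; none of it is the Managed Agents API:

```python
# Hypothetical composition: a lead agent with an overall rubric dispatches
# sub-agents that each carry their own rubric. All names are invented.

def run_with_outcomes(task: str, rubric: list[str]) -> str:
    return f"result for {task!r}"  # stub for the Outcomes loop sketched earlier

SUBTASK_RUBRICS = {
    "extract": ["All required CSV fields are populated with no nulls."],
    "review": ["All flagged issues include a specific file and line number."],
}

OVERALL_RUBRIC = ["The final report covers both extraction and review findings."]

def orchestrate(task: str) -> str:
    # Each sub-agent runs against its own rubric...
    parts = {name: run_with_outcomes(name, rubric)
             for name, rubric in SUBTASK_RUBRICS.items()}
    # ...then the lead agent combines them and is graded on the overall rubric.
    combined = f"{task}: combine {', '.join(parts)}"
    return run_with_outcomes(combined, OVERALL_RUBRIC)
```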

The full set positions Managed Agents as Anthropic’s answer to LangGraph + LangSmith + a hosted runtime, with one less integration to manage.

What Outcomes doesn’t replace

  • LangSmith / your own observability. Outcomes gives you per-task evaluation. For organization-wide observability across many agent runs, you still want a tracing/eval product.
  • Human review for high-stakes work. The grader is an LLM. It’s fallible on subjective or adversarial cases.
  • Multi-model setups. Outcomes runs inside Managed Agents. If you need to use Claude here and GPT-5.5 there, you’ll still want LangGraph or your own router on top.

Pricing and availability

  • Public beta as of May 6, 2026. No GA SLA yet.
  • Available to Claude Managed Agents customers — same access tier as the rest of Managed Agents.
  • Grader passes add extra LLM calls each iteration, and every revision is another agent call; budget for 1.5-3x the base agent run cost, depending on iteration count.
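Back-of-envelope, with the per-iteration overhead split purely assumed for illustration:

```python
# Rough cost model consistent with the 1.5-3x guidance above. The split
# between grader and revision overhead per iteration is an assumption.
def outcomes_cost(base_cost: float, iterations: int,
                  grader_frac: float = 0.15, revision_frac: float = 0.5) -> float:
    # Each iteration adds one grader pass plus one revision pass.
    return base_cost * (1 + iterations * (grader_frac + revision_frac))

print(round(outcomes_cost(0.10, 1), 3))  # 0.165 -> ~1.65x base, one iteration
print(round(outcomes_cost(0.10, 3), 3))  # 0.295 -> ~2.95x base, three iterations
```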

What to watch next

  • GA timeline. When does Outcomes leave public beta?
  • Custom grader support. Can you bring your own grader (different model, custom code) instead of Anthropic’s default?
  • Outcomes templates. Reusable rubrics for common tasks (PR review, data extraction, content moderation).
  • Multi-model orchestration. Does Outcomes-style grading reach beyond Claude-only Managed Agents?

Last verified: May 10, 2026 — sources: Anthropic Code with Claude SF announcements, SDTimes, TheNewStack, 9to5Mac, VentureBeat, Simon Willison’s CwC notes.