Claude's Outcomes Feature: Rubric Grading That Knows When an Agent Is Done

Claude Managed Agents now ship with Outcomes, a rubric-driven grading loop where a separate agent scores the work against your definition of done. Here's how it works, who can use it, and how to write a rubric that actually finishes the job.

By Sean

The hardest problem in agentic AI isn't getting an agent to do work. It's knowing when the work is good enough to ship. An agent will happily declare victory on a half-finished spreadsheet, a report that misses half the brief, or a slide deck that technically exists but wouldn't survive a glance from a real reviewer. Someone still has to inspect the output and decide whether it clears the bar.

Anthropic's Outcomes feature, announced May 6, 2026 at the Code with Claude conference in San Francisco, is a direct attack on that problem. Instead of hoping the agent self-assesses honestly, you write down what "done" looks like as a rubric, and a separate grading agent scores the finished artifact against it. If it falls short, the grader says exactly what's missing and the agent takes another pass. It's a quality gate built into the loop rather than bolted on after.

What Outcomes actually is

Outcomes turns a Managed Agents session from a conversation into a unit of work. In Anthropic's words, it "elevates a session from conversation to work": you state the result you want and how to measure quality, and the agent works toward that target, self-evaluating and iterating until it gets there.

The mechanism is the part worth understanding. When you define an outcome, the harness automatically provisions a grader that evaluates the artifact against your rubric. Crucially, the grader runs in a separate context window so it isn't influenced by the main agent's implementation choices. It returns an explanation summarizing which criteria passed or failed, and that feedback is handed back to the agent for the next iteration.

That separation is the whole point. An agent grading its own work is anchored to the reasoning that produced it. A fresh evaluator that has never seen the agent's chain of thought judges the output on its own terms. If you've read our guide to the Claude Code /goal command, this is the same independent-evaluator idea applied to artifacts instead of conversation turns: the thing that decides "done" is deliberately kept separate from the thing doing the work.
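
To make the loop concrete, here is a minimal local simulation of the grade-and-revise cycle. It is a sketch, not Anthropic's harness: `run_outcome_loop`, `Grade`, `toy_worker`, and `toy_grader` are hypothetical names invented for illustration. The one property it tries to mirror is that the grader only ever receives the artifact, and the worker only ever receives the grader's explanation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Grade:
    passed: bool      # True when every rubric criterion is met
    explanation: str  # summary of which criteria passed or failed


def run_outcome_loop(
    work: Callable[[str], str],
    grade: Callable[[str], Grade],
    max_iterations: int = 3,
) -> tuple[str, str]:
    """Return (artifact, result) after at most max_iterations grading cycles."""
    artifact = work("")  # first pass: no grader feedback exists yet
    for _ in range(max_iterations):
        # The grader sees only the artifact, never the worker's reasoning.
        verdict = grade(artifact)
        if verdict.passed:
            return artifact, "satisfied"
        # Only the explanation flows back to the worker for the next pass.
        artifact = work(verdict.explanation)
    return artifact, "max_iterations_reached"


# Hypothetical stand-ins for the working agent and the grader.
def toy_worker(feedback: str) -> str:
    return "draft with sensitivity analysis" if "sensitivity" in feedback else "draft"


def toy_grader(artifact: str) -> Grade:
    ok = "sensitivity analysis" in artifact
    return Grade(ok, "all criteria met" if ok else "missing: sensitivity analysis")


print(run_outcome_loop(toy_worker, toy_grader))
```

In this toy run the first draft fails, the explanation names the gap, and the second draft passes. A worker that never improves would instead come back with `max_iterations_reached` after one last revision.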

Why it matters

Most agent failures in production aren't dramatic. They're quiet quality gaps: a figure that doesn't reconcile, a tone that's slightly off-brand, a missing sensitivity analysis. These are exactly the things a busy human reviewer skims past, and exactly the things a structured rubric catches every time because it scores each criterion independently.

Anthropic's internal benchmarks bear this out. Outcomes improved task success by up to 10 points over a standard prompting loop, with the largest gains on the hardest problems. On file generation specifically, the company reported +8.4% task success on .docx outputs and +10.1% on .pptx outputs.

One number worth getting straight: you may have seen a "~6x completion rate" figure floating around from legal-AI company Harvey, which piloted these features. That gain is attributed to dreaming (a separate Managed Agents feature that lets agents retain learnings between sessions), not Outcomes. Don't conflate the two. Outcomes' own headline result is the up-to-10-point lift above.

Who can use it and when it launched

Outcomes shipped in public beta on May 6, 2026 as part of Claude Managed Agents on the Claude platform, alongside dreaming (research preview) and multiagent orchestration (public beta). It's an API capability, not a chat-window toggle.

To use it, your Managed Agents API requests need the managed-agents-2026-04-01 beta header. The official SDKs (Python, TypeScript, Go, Java, C#, PHP, Ruby) set that header for you. If you upload rubrics through the Files API for reuse, you'll also need the files-api-2025-04-14 header.
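
If you call the API without an SDK, you have to assemble those headers yourself. The sketch below assumes the usual Anthropic API conventions (`x-api-key`, `anthropic-version`, and a comma-separated `anthropic-beta` header). `build_headers` is a hypothetical helper name, and only the two beta values come from this article.

```python
MANAGED_AGENTS_BETA = "managed-agents-2026-04-01"
FILES_API_BETA = "files-api-2025-04-14"


def build_headers(api_key: str, uses_files_api: bool = False) -> dict[str, str]:
    """Build request headers for a raw Managed Agents call (assumed conventions)."""
    betas = [MANAGED_AGENTS_BETA]
    if uses_files_api:
        # Only needed when rubrics are uploaded through the Files API.
        betas.append(FILES_API_BETA)
    return {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",
        "anthropic-beta": ",".join(betas),
        "content-type": "application/json",
    }


print(build_headers("sk-placeholder", uses_files_api=True)["anthropic-beta"])
```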

How to use it

The flow has three parts: write a rubric, attach it to a session, and read the result.

1. Write the rubric. A rubric is just a markdown document with per-criterion scoring. The single most important tip from Anthropic's docs is to make criteria explicit and gradeable. Write "The CSV contains a price column with numeric values," not "the data looks good." Vague criteria produce noisy evaluations because the grader scores each one independently. If you're stuck, hand Claude a known-good example artifact and ask it to reverse-engineer what makes it good, then turn that into criteria.

Here's a trimmed example for a financial model:

# DCF Model Rubric

## Revenue Projections
- Uses historical revenue data from the last 5 fiscal years
- Projects revenue for at least 5 years forward
- Growth rate assumptions are explicitly stated and reasonable

## Discount Rate
- WACC is calculated with stated cost-of-equity and cost-of-debt assumptions
- Beta, risk-free rate, and equity risk premium are sourced or justified

## Output Quality
- All figures are in a single .xlsx file with clearly labeled sheets
- Key assumptions live on a separate "Assumptions" sheet
- Sensitivity analysis on WACC and terminal growth rate is included
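
Because a rubric is plain markdown, you can check it before you ever send it. The following is a rough pre-flight sketch, assuming the layout used above ("## " section headings with "- " criteria underneath). `parse_rubric`, `flag_vague`, and the `VAGUE_TERMS` list are hypothetical and not part of any Anthropic tooling; the word list in particular is an arbitrary starting point.

```python
VAGUE_TERMS = ("looks good", "appropriate", "high quality", "as needed")


def parse_rubric(markdown: str) -> dict[str, list[str]]:
    """Group '- ' criteria under the nearest preceding '## ' heading."""
    sections: dict[str, list[str]] = {}
    current = "Uncategorized"
    for raw in markdown.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            current = line[3:].strip()
            sections.setdefault(current, [])
        elif line.startswith("- "):
            sections.setdefault(current, []).append(line[2:].strip())
    return sections


def flag_vague(sections: dict[str, list[str]]) -> list[str]:
    """Return criteria containing wording a grader cannot score objectively."""
    return [
        criterion
        for criteria in sections.values()
        for criterion in criteria
        if any(term in criterion.lower() for term in VAGUE_TERMS)
    ]


sample = """# CSV Export Rubric

## Columns
- The CSV contains a price column with numeric values
- The data looks good
"""

sections = parse_rubric(sample)
print(flag_vague(sections))
```

A keyword check is crude, but it catches the obvious offenders before they turn into noisy grades.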

2. Attach it to a session. After creating a session, send a user.define_outcome event. The agent starts working the moment it receives the event; no separate prompt is required. You pass the rubric inline or by file ID, plus an optional iteration cap:

{
  "type": "user.define_outcome",
  "description": "Build a DCF model for Costco in .xlsx",
  "rubric": { "type": "text", "content": "# DCF Model Rubric\n..." },
  "max_iterations": 5
}

max_iterations is optional, defaults to 3, and maxes out at 20. Each iteration is one grade-and-revise cycle.
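
A small builder can keep that cap honest before the request leaves your code. This is a sketch under assumptions: the field names mirror the event shown above, `build_define_outcome` is a hypothetical helper, and the lower bound of 1 is my assumption, since only the default (3) and the maximum (20) are stated.

```python
import json
from typing import Optional

MAX_ITERATIONS_LIMIT = 20  # stated maximum; the server default is 3 when omitted


def build_define_outcome(
    description: str,
    rubric_markdown: str,
    max_iterations: Optional[int] = None,
) -> dict:
    """Build a user.define_outcome event with an inline text rubric."""
    event = {
        "type": "user.define_outcome",
        "description": description,
        "rubric": {"type": "text", "content": rubric_markdown},
    }
    if max_iterations is not None:
        # Lower bound of 1 is an assumption; only the upper bound is documented here.
        if not 1 <= max_iterations <= MAX_ITERATIONS_LIMIT:
            raise ValueError(f"max_iterations must be 1..{MAX_ITERATIONS_LIMIT}")
        event["max_iterations"] = max_iterations
    return event


event = build_define_outcome(
    "Build a DCF model for Costco in .xlsx",
    "# DCF Model Rubric\n...",
    max_iterations=5,
)
print(json.dumps(event, indent=2))
```

Leaving `max_iterations` out of the payload entirely, rather than sending 3, lets the service apply its own default.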

3. Read the score. Progress surfaces on the event stream through span.outcome_evaluation_* events, or you can poll GET /v1/sessions/:id and read outcome_evaluations[].result. The grader's internal reasoning is opaque (you see that it's working, not what it's thinking), but the final result is explicit:

| Result | What it means |
| --- | --- |
| satisfied | Rubric met. Session goes idle. |
| needs_revision | Agent starts another iteration cycle. |
| max_iterations_reached | Cap hit; agent may run one final revision, then idle. |
| failed | Rubric fundamentally doesn't match the task (e.g., description and rubric contradict each other). |
| interrupted | A user.interrupt event paused an in-progress evaluation. |
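
If you poll rather than stream, the result values above map naturally onto a small dispatch table. The sketch below assumes the polled session body is already parsed into a dict and that `outcome_evaluations` is ordered oldest to newest; both are assumptions, as are the helper names and the action strings, which are placeholders for whatever your pipeline does next.

```python
from typing import Optional

# Hypothetical follow-up actions, keyed by the documented result values.
RESULT_ACTIONS = {
    "satisfied": "collect outputs",
    "needs_revision": "keep waiting",
    "max_iterations_reached": "review manually",
    "failed": "fix rubric or description",
    "interrupted": "resume or redefine",
}


def latest_result(session: dict) -> Optional[str]:
    """Return the most recent evaluation result, or None if nothing is graded yet."""
    evaluations = session.get("outcome_evaluations") or []
    return evaluations[-1]["result"] if evaluations else None


def next_action(session: dict) -> str:
    result = latest_result(session)
    if result is None:
        return "keep waiting"  # the grader has not reported yet
    return RESULT_ACTIONS.get(result, f"unknown result: {result}")


polled = {"outcome_evaluations": [{"result": "needs_revision"}, {"result": "satisfied"}]}
print(next_action(polled))
```

Handling unknown values explicitly matters during a beta, when new result types may appear.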

A satisfied result also returns an explanation, for example "All 12 criteria met: revenue projections use 5 years of historical data, WACC assumptions are stated, sensitivity table is included." Finished deliverables land in /mnt/session/outputs/ inside the sandbox and are pulled out through the Files API scoped to the session. Only one outcome runs at a time, but you can chain them by sending a new user.define_outcome event after the previous one terminates.
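
Chaining is easy to get wrong if the next event goes out too early. Here is one way to sequence it, sketched with injected callables so it runs offline: `run_chain`, `send_event`, and `wait_for_result` are hypothetical names, and `wait_for_result` is assumed to block until the current outcome has stopped iterating. Stopping the chain at the first result other than satisfied is a design choice of this sketch, not documented behavior.

```python
from typing import Callable, Iterable


def run_chain(
    events: Iterable[dict],
    send_event: Callable[[dict], None],
    wait_for_result: Callable[[], str],
) -> list[str]:
    """Send outcomes one at a time; stop at the first that is not satisfied."""
    results: list[str] = []
    for event in events:
        send_event(event)           # only one outcome may run at a time
        result = wait_for_result()  # assumed to return once the outcome ends
        results.append(result)
        if result != "satisfied":
            break                   # later steps would build on unfinished work
    return results


# Stand-ins so the sketch runs without a network connection.
sent: list[str] = []
canned = iter(["satisfied", "failed"])
results = run_chain(
    [
        {"description": "Build the model"},
        {"description": "Write the memo"},
        {"description": "Make the deck"},
    ],
    send_event=lambda e: sent.append(e["description"]),
    wait_for_result=lambda: next(canned),
)
print(results)
print(sent)
```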

Where it fits in an MSP workflow

If you're building agents that produce client-facing artifacts (security assessments, compliance summaries, network diagrams, audit reports), Outcomes is the difference between an agent you have to babysit and one you can trust to self-correct. Encode your firm's quality bar once as a rubric, version it, reuse it across sessions through the Files API, and let the grader enforce it on every run. That's far more consistent than a human reviewer who's seen forty reports today and is skimming the forty-first.

The bottom line

Outcomes solves a real, unglamorous problem: deciding when an agent's work is good enough to deliver. By handing grading to a separate agent that never saw the reasoning behind the output, Anthropic gets an evaluator that judges results honestly rather than rubber-stamping its own work. Write explicit, gradeable criteria, set a sensible iteration cap, and read the result field. The up-to-10-point benchmark lift is nice, but the more durable win is structural: a built-in quality gate that catches the quiet gaps a tired reviewer would wave through.

Frequently Asked Questions

Outcomes lets you tell a Claude Managed Agent what "done" looks like by attaching a rubric, then a separate grading agent scores the finished work against that rubric. If the output falls short, the grader explains what's missing and the agent revises and tries again. The loop repeats until the rubric is satisfied or the iteration limit is reached.

Anthropic announced Outcomes on May 6, 2026 at the Code with Claude developer conference in San Francisco, alongside dreaming and multiagent orchestration. Outcomes shipped in public beta as part of Managed Agents on the Claude platform. API requests require the managed-agents-2026-04-01 beta header, which the official SDKs set automatically.

When you define an outcome, the harness automatically provisions a grader that evaluates the artifact in its own context window. Because the grader never sees the main agent's reasoning or implementation choices, it judges the output on its own terms instead of being anchored to how the work was done. It returns an explanation of which criteria passed or failed, which is fed back to the agent for the next pass.

A rubric is a markdown document with explicit, gradeable criteria such as "The CSV contains a price column with numeric values" rather than vague statements like "the data looks good." The grader scores each criterion independently, so vague criteria produce noisy results. If you don't have a rubric, give Claude a known-good example and ask it to analyze what makes it good, then turn that into criteria.

The roughly 6x completion-rate gain reported by legal-AI company Harvey does not come from Outcomes. It is attributed to dreaming, a different Managed Agents feature that lets agents retain learnings between sessions. For Outcomes specifically, Anthropic's internal benchmarks showed task success improved by up to 10 points over a standard prompting loop, including +8.4% on .docx generation and +10.1% on .pptx generation.

You control how many times the agent revises with the max_iterations field on the user.define_outcome event. It is optional, defaults to 3, and can be set as high as 20. Each iteration is one grade-and-revise cycle. When the cap is reached, the agent may run one final revision before the session goes idle.

Outcomes and the Claude Code /goal command both use an independent evaluator to decide when work is finished, but they target different surfaces. The /goal command runs inside the Claude Code CLI and checks a plain-language completion condition with a small fast model after each turn. Outcomes runs in Managed Agents on the API, grades a produced artifact against a structured rubric in a separate context window, and iterates until it passes.
