The hardest problem in agentic AI isn't getting an agent to do work. It's knowing when the work is good enough to ship. An agent will happily declare victory on a half-finished spreadsheet, a report that misses half the brief, or a slide deck that technically exists but wouldn't survive a glance from a real reviewer. Someone still has to inspect the output and decide whether it clears the bar.
Anthropic's Outcomes feature, announced May 6, 2026 at the Code with Claude conference in San Francisco, is a direct attack on that problem. Instead of hoping the agent self-assesses honestly, you write down what "done" looks like as a rubric, and a separate grading agent scores the finished artifact against it. If it falls short, the grader says exactly what's missing and the agent takes another pass. It's a quality gate built into the loop rather than bolted on after.
What Outcomes actually is
Outcomes turns a Managed Agents session from a conversation into a unit of work. In Anthropic's words, it "elevates a session from conversation to work": you state the result you want and how to measure quality, and the agent works toward that target, self-evaluating and iterating until it gets there.
The mechanism is the part worth understanding. When you define an outcome, the harness automatically provisions a grader that evaluates the artifact against your rubric. Crucially, the grader runs in a separate context window so it isn't influenced by the main agent's implementation choices. It returns an explanation summarizing which criteria passed or failed, and that feedback is handed back to the agent for the next iteration.
That separation is the whole point. An agent grading its own work is anchored to the reasoning that produced it. A fresh evaluator that has never seen the agent's chain of thought judges the output on its own terms. If you've read our guide to the Claude Code /goal command, this is the same independent-evaluator idea applied to artifacts instead of conversation turns: the thing that decides "done" is deliberately kept separate from the thing doing the work.
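To make the loop concrete, here is a minimal local simulation of the grade-and-revise cycle. It is a sketch of the idea, not Anthropic's harness: `run_outcome_loop`, `Grade`, `toy_worker`, and `toy_grader` are all hypothetical names, and the optional final revision after the cap is hit is not modeled. The detail to notice is that the grader receives only the artifact and the rubric, never the worker's reasoning.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Grade:
    satisfied: bool
    explanation: str  # which criteria passed or failed


def run_outcome_loop(
    work: Callable[[str], str],
    grade: Callable[[str, List[str]], Grade],
    rubric: List[str],
    max_iterations: int = 3,
) -> Tuple[str, str, int]:
    """Alternate work and grading until the rubric is met or the cap is hit."""
    feedback = ""
    artifact = ""
    for iteration in range(1, max_iterations + 1):
        artifact = work(feedback)
        # The grader sees the artifact and the rubric only.
        result = grade(artifact, rubric)
        if result.satisfied:
            return artifact, "satisfied", iteration
        feedback = result.explanation  # handed back for the next pass
    return artifact, "max_iterations_reached", max_iterations


def toy_worker(feedback: str) -> str:
    # Toy stand-in for the agent: forgets the sensitivity analysis until told.
    sections = ["revenue", "wacc"]
    if "sensitivity" in feedback:
        sections.append("sensitivity")
    return ",".join(sections)


def toy_grader(artifact: str, rubric: List[str]) -> Grade:
    # Toy stand-in for the grader: each criterion is checked independently.
    present = artifact.split(",")
    missing = [criterion for criterion in rubric if criterion not in present]
    if missing:
        return Grade(False, "missing: " + ", ".join(missing))
    return Grade(True, f"All {len(rubric)} criteria met")


if __name__ == "__main__":
    print(run_outcome_loop(toy_worker, toy_grader, ["revenue", "wacc", "sensitivity"]))
    # -> ('revenue,wacc,sensitivity', 'satisfied', 2)
```

The first pass falls short, the grader names the missing criterion, and the second pass clears the rubric.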
Why it matters
Most agent failures in production aren't dramatic. They're quiet quality gaps: a figure that doesn't reconcile, a tone that's slightly off-brand, a missing sensitivity analysis. These are exactly the things a busy human reviewer skims past, and exactly the things a structured rubric catches every time because it scores each criterion independently.
Anthropic's internal benchmarks bear this out. Outcomes improved task success by up to 10 points over a standard prompting loop, with the largest gains on the hardest problems. On file generation specifically, the company reported +8.4% task success on .docx outputs and +10.1% on .pptx outputs.
One number worth getting straight: you may have seen a "~6x completion rate" figure floating around from legal-AI company Harvey, which piloted these features. That gain is attributed to dreaming (a separate Managed Agents feature that lets agents retain learnings between sessions), not Outcomes. Don't conflate the two. Outcomes' own headline result is the up-to-10-point lift above.
Who can use it and when it launched
Outcomes shipped in public beta on May 6, 2026 as part of Claude Managed Agents on the Claude platform, alongside dreaming (research preview) and multiagent orchestration (public beta). It's an API capability, not a chat-window toggle.
To use it, your Managed Agents API requests need the managed-agents-2026-04-01 beta header. The official SDKs (Python, TypeScript, Go, Java, C#, PHP, Ruby) set that header for you. If you upload rubrics through the Files API for reuse, you'll also need the files-api-2025-04-14 header.
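If you are calling the API over raw HTTP instead of through an SDK, a small helper like the one below can assemble the headers. This is a sketch under assumptions: it presumes Managed Agents endpoints follow the same convention as the rest of the Anthropic API (`x-api-key`, `anthropic-version`, and a comma-separated `anthropic-beta` header), and `beta_headers` is a hypothetical helper name.

```python
from typing import Dict

MANAGED_AGENTS_BETA = "managed-agents-2026-04-01"
FILES_API_BETA = "files-api-2025-04-14"


def beta_headers(api_key: str, uses_files_api: bool = False) -> Dict[str, str]:
    """Build request headers, adding the Files API beta only when needed."""
    betas = [MANAGED_AGENTS_BETA]
    if uses_files_api:
        betas.append(FILES_API_BETA)
    return {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",  # assumed to apply here as well
        "anthropic-beta": ",".join(betas),
        "content-type": "application/json",
    }


if __name__ == "__main__":
    print(beta_headers("sk-example", uses_files_api=True)["anthropic-beta"])
    # -> managed-agents-2026-04-01,files-api-2025-04-14
```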
How to use it
The flow has three parts: write a rubric, attach it to a session, and read the result.
1. Write the rubric. A rubric is just a markdown document with per-criterion scoring. The single most important tip from Anthropic's docs is to make criteria explicit and gradeable. Write "The CSV contains a price column with numeric values," not "the data looks good." Vague criteria produce noisy evaluations because the grader scores each one independently. If you're stuck, hand Claude a known-good example artifact and ask it to reverse-engineer what makes it good, then turn that into criteria.
Here's a trimmed example for a financial model:
# DCF Model Rubric
## Revenue Projections
- Uses historical revenue data from the last 5 fiscal years
- Projects revenue for at least 5 years forward
- Growth rate assumptions are explicitly stated and reasonable
## Discount Rate
- WACC is calculated with stated cost-of-equity and cost-of-debt assumptions
- Beta, risk-free rate, and equity risk premium are sourced or justified
## Output Quality
- All figures are in a single .xlsx file with clearly labeled sheets
- Key assumptions live on a separate "Assumptions" sheet
- Sensitivity analysis on WACC and terminal growth rate is included
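Before sending a rubric, it can help to sanity-check it locally. The sketch below groups criteria under their section headings and flags wording that is hard to grade. Both `parse_rubric` and `vague_criteria` are hypothetical helpers, the phrase list is an illustrative assumption rather than anything from Anthropic's docs, and the parser assumes the `##` heading plus `- ` bullet layout used above.

```python
from typing import Dict, List, Optional

# Illustrative only: phrases that tend to make a criterion hard to grade.
VAGUE_PHRASES = ("looks good", "high quality", "as needed")


def parse_rubric(markdown: str) -> Dict[str, List[str]]:
    """Group '- ' criteria under the most recent '## ' section heading."""
    sections: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for raw in markdown.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif line.startswith("- ") and current is not None:
            sections[current].append(line[2:].strip())
    return sections


def vague_criteria(sections: Dict[str, List[str]]) -> List[str]:
    """Return criteria containing any of the flagged phrases."""
    return [
        criterion
        for criteria in sections.values()
        for criterion in criteria
        if any(phrase in criterion.lower() for phrase in VAGUE_PHRASES)
    ]


if __name__ == "__main__":
    rubric = (
        "# CSV Export Rubric\n"
        "## Data\n"
        "- The CSV contains a price column with numeric values\n"
        "- The data looks good\n"
    )
    sections = parse_rubric(rubric)
    print({name: len(items) for name, items in sections.items()})  # -> {'Data': 2}
    print(vague_criteria(sections))  # -> ['The data looks good']
```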
2. Attach it to a session. After creating a session, send a user.define_outcome event. The agent starts working the moment it receives the event; no separate prompt is required. You pass the rubric inline or by file ID, plus an optional iteration cap:
{
  "type": "user.define_outcome",
  "description": "Build a DCF model for Costco in .xlsx",
  "rubric": { "type": "text", "content": "# DCF Model Rubric\n..." },
  "max_iterations": 5
}
max_iterations is optional, defaults to 3, and maxes out at 20. Each iteration is one grade-and-revise cycle.
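A small builder can produce that payload and catch an out-of-range cap before the request leaves your machine. This is a sketch: `define_outcome_event` is a hypothetical helper, the lower bound of 1 is an assumption (the text only states the default of 3 and the maximum of 20), and it covers the inline-text rubric form only, since the file-ID shape is not shown here.

```python
import json
from typing import Any, Dict, Optional

MAX_ITERATIONS_LIMIT = 20  # stated maximum; the server default is 3


def define_outcome_event(
    description: str,
    rubric_markdown: str,
    max_iterations: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a user.define_outcome event with an inline text rubric."""
    event: Dict[str, Any] = {
        "type": "user.define_outcome",
        "description": description,
        "rubric": {"type": "text", "content": rubric_markdown},
    }
    if max_iterations is not None:
        # Lower bound of 1 is an assumption, not a documented limit.
        if not 1 <= max_iterations <= MAX_ITERATIONS_LIMIT:
            raise ValueError(
                f"max_iterations must be between 1 and {MAX_ITERATIONS_LIMIT}"
            )
        event["max_iterations"] = max_iterations
    # Omitting the field leaves the default to the server.
    return event


if __name__ == "__main__":
    event = define_outcome_event(
        "Build a DCF model for Costco in .xlsx",
        "# DCF Model Rubric\n...",
        max_iterations=5,
    )
    print(json.dumps(event, indent=2))
```

Leaving `max_iterations` out of the payload, instead of hard-coding 3, means the client does not drift if the default changes.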
3. Read the score. Progress surfaces on the event stream through span.outcome_evaluation_* events, or you can poll GET /v1/sessions/:id and read outcome_evaluations[].result. The grader's internal reasoning is opaque (you see that it's working, not what it's thinking), but the final result is explicit:
| Result | What it means |
|---|---|
| satisfied | Rubric met. Session goes idle. |
| needs_revision | Agent starts another iteration cycle. |
| max_iterations_reached | Cap hit; agent may run one final revision, then idle. |
| failed | Rubric fundamentally doesn't match the task (e.g., description and rubric contradict each other). |
| interrupted | A user.interrupt event paused an in-progress evaluation. |
A satisfied result also returns an explanation, for example "All 12 criteria met: revenue projections use 5 years of historical data, WACC assumptions are stated, sensitivity table is included." Finished deliverables land in /mnt/session/outputs/ inside the sandbox and are pulled out through the Files API scoped to the session. Only one outcome runs at a time, but you can chain them by sending a new user.define_outcome event after the previous one terminates.
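On the client side, the result values map naturally onto a small dispatch table. The sketch below works on an already-fetched session body and makes no network calls. It assumes the body is a dict whose `outcome_evaluations` list is ordered oldest first, and the action labels (`wait`, `collect_outputs`, and so on) are hypothetical names for whatever your pipeline does next.

```python
from typing import Any, Dict, Optional

# Hypothetical pipeline actions keyed by the documented result values.
NEXT_STEP = {
    "satisfied": "collect_outputs",
    "needs_revision": "wait",
    "max_iterations_reached": "review_manually",
    "failed": "fix_rubric_or_description",
    "interrupted": "resume_or_abandon",
}


def latest_result(session: Dict[str, Any]) -> Optional[str]:
    """Return the most recent evaluation result, or None if none exist yet."""
    evaluations = session.get("outcome_evaluations") or []
    if not evaluations:
        return None
    return evaluations[-1].get("result")  # assumes oldest-first ordering


def next_step(session: Dict[str, Any]) -> str:
    """Translate the latest result into a pipeline action label."""
    result = latest_result(session)
    if result is None:
        return "wait"  # no evaluation has finished yet
    return NEXT_STEP.get(result, "unknown")


if __name__ == "__main__":
    session = {
        "outcome_evaluations": [
            {"result": "needs_revision"},
            {"result": "satisfied"},
        ]
    }
    print(next_step(session))  # -> collect_outputs
```

Treating an unrecognized value as `unknown` instead of raising keeps a poller alive if new result types appear during the beta.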
Where it fits in an MSP workflow
If you're building agents that produce client-facing artifacts (security assessments, compliance summaries, network diagrams, audit reports), Outcomes is the difference between an agent you have to babysit and one you can trust to self-correct. Encode your firm's quality bar once as a rubric, version it, reuse it across sessions through the Files API, and let the grader enforce it on every run. That's far more consistent than a human reviewer who's seen forty reports today and is skimming the forty-first.
The bottom line
Outcomes solves a real, unglamorous problem: deciding when an agent's work is good enough to deliver. By handing grading to a separate agent that never saw the reasoning behind the output, Anthropic gets an evaluator that judges results honestly rather than rubber-stamping its own work. Write explicit, gradeable criteria, set a sensible iteration cap, and read the result field. The up-to-10-point benchmark lift is nice, but the more durable win is structural: a built-in quality gate that catches the quiet gaps a tired reviewer would wave through.