Claude Managed Agents

beta · managed-agents-2026-04-01 · Capability survey + kendo application sketches · 2026-05-08
Summary

Anthropic's Managed Agents API is a hosted agent runtime — Anthropic provisions the container, runs the agent loop, executes tool calls, and persists events server-side. We define what the agent is and what it runs in; Anthropic does the rest. Right fit for autonomous coding work and long-running tool-using tasks where we don't want to operate our own sandbox.

Recommendation: Don't pick the application before deciding the goal. Shipping a feature this sprint → Application A on Messages API. Learning CMA on a real workload → Application C, slice 2 (just ResearchAction). Avoid B as v1 — too much new infra in flight at once.

  • Beta header required: 2026-04-01 (SDKs set it automatically)
  • Container ceiling: 8 GB RAM / 10 GB disk · Ubuntu 22.04 x86_64
  • Multiagent depth: 1 (coordinators cannot recurse)
  • Outcome iterations: 3 → 20 (default → max per outcome)

§12 Three plausible kendo applications

Application A (Feature): Issue-quality analyzer

Sidebar card that runs a 4-criteria rubric on the issue body: clear acceptance criteria, bounded scope, no design ambiguity, no cross-cutting concerns. Single Action on top of AiService::generateStructured(). Doesn't need CMA; Messages API is the right tool. Useful as a feature, weak as a CMA testbed.

  • Effort: 3-5 days
  • Blast radius: Low
  • CMA value: None

Application B (Flagship · later): Hand to Claude (autonomous)

Button hands an issue to a CMA session that plans, implements, runs tests, and opens a PR. Outcome rubric = acceptance criteria. Webhook on session.outcome_evaluation_ended posts back to kendo. Largest payoff, biggest blast radius. Worth doing only after CMA experience is established.

  • Effort: Multi-week
  • Blast radius: High
  • CMA value: Maximum

§15 — Decision points if we proceed

  • If the goal is a useful feature shipped this sprint: Application A · Messages API. Skip CMA. Cheap, clean, valuable to the team.
  • If the goal is to learn CMA on a real workload: Application C, slice 2. Migrate just ResearchAction. Production workload, small blast radius.
  • If the goal is the flagship autonomous-execution feature: B, but only after C ships. Otherwise too many new things in flight at once.

Stop-the-presses corrections vs initial impressions

  • Session-completion webhooks do exist: session.status_idled, session.outcome_evaluation_ended, session.thread_*. No polling needed.
  • The Anthropic-blessed GitHub MCP at https://api.githubcopilot.com/mcp/ is alive and documented, not archived. (The archived one is a separate community implementation.)
  • gh CLI is not required. The github_repository resource mounts the repo with git auth pre-wired. Native flow: edit files in mounted repo → git push via bash → create PR via MCP create_pull_request tool.
  • Skills are first-class on the Agent definition but don't sync from local Claude Code. Re-author for CMA: third-person, ≤500 lines, MCP refs fully qualified as ServerName:tool_name.

Beta status & limits

  • Header: managed-agents-2026-04-01 on every endpoint
  • Outcomes + multiagent: public beta as of 2026-05-06, no access form needed
  • Rate limits: 300 req/min for creates · 600 req/min for reads
  • Branding: "Claude Agent" is the preferred surface label
  • Pricing: none published in the beta docs as of this reading

§1 Resource model

Agent (versioned)
Reusable, versioned config: model + system prompt + tools + MCP servers + skills + (optional) multiagent coordinator declaration.
Updates create new versions. Sessions can pin a version or float to latest. Archiving makes it read-only; existing sessions continue.

Environment (not versioned)
Container template: pre-installed packages (apt, pip, npm, cargo, gem, go), networking policy.
Persists until archived. Multiple sessions share an environment but each gets its own container. Because it is not versioned, mutating it changes the template for all future sessions.

Session (per task)
Running agent instance within an environment, tied to a specific task. Maintains conversation history and a checkpointed container filesystem.
Lifecycle: idle → running → terminated. Events persist until session deletion. Container checkpoints expire after 30 days of inactivity.

Vault (per end-user)
Per-end-user credential store, scoped at workspace level, used to inject auth into MCP server calls.
One credential per mcp_server_url per vault, max 20 credentials per vault. OAuth refresh handled by Anthropic.

Resources (at session creation)
Currently documented: github_repository, which mounts a GitHub repo into the container with git auth pre-wired.
Lives at the session-creation boundary. Token can be rotated mid-session via PATCH /v1/sessions/{id}/resources/{rid}.
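
The two vault limits quoted above (one credential per mcp_server_url, at most 20 credentials per vault) can be pre-checked client-side before any API call. A minimal sketch, assuming credentials are held locally as plain dicts; the helper name and the dict shape are hypothetical and not part of Anthropic's API:

```python
from collections import Counter

MAX_CREDENTIALS_PER_VAULT = 20  # limit quoted in this survey


def validate_vault_credentials(credentials: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the set looks valid.

    Each credential is assumed to be a dict with an "mcp_server_url" key
    (hypothetical local shape, used only for this pre-check).
    """
    problems = []
    if len(credentials) > MAX_CREDENTIALS_PER_VAULT:
        problems.append(
            f"{len(credentials)} credentials exceeds the limit of {MAX_CREDENTIALS_PER_VAULT}"
        )
    # The survey states one credential per mcp_server_url per vault.
    counts = Counter(c["mcp_server_url"] for c in credentials)
    for url, n in sorted(counts.items()):
        if n > 1:
            problems.append(f"{n} credentials share mcp_server_url {url}")
    return problems
```

Running this before a vault write turns a server-side rejection into a local error message, which matters most if kendo ends up with one vault per Member.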

§2 The container

Languages (8)

Python 3.12+ · Node 20+ · Go 1.22+ · Rust 1.77+ · Java 21+ · Ruby 3.3+ · PHP 8.3+ · C/C++ (GCC 13+)

Databases (clients only)

  • SQLite running
  • PostgreSQL client only — no server
  • Redis client only — no server

Networking (2 modes)

  • unrestricted — full outbound, except safety blocklist
  • limited — allowlist + MCP/package-mgr carve-outs (recommended for production)

System tools disabled by default: network

git · curl · wget · jq · tar · zip · unzip · ssh · scp · tmux · screen · make · cmake · docker · ripgrep (rg) · tree · htop · sed · awk · grep · vim · nano · diff · patch

Notable absence: gh CLI is not pre-installed. Add via packages.apt: ["gh"] on the environment — Anthropic pre-installs once and caches across sessions.

Specs

  • RAM ceiling: 8 GB
  • Disk ceiling: 10 GB
  • x86_64 · Ubuntu 22.04 LTS

§3 Tools: three categories

Built-in agent toolset (agent_toolset_20260401)

All on by default. Permission policy: always_allow.

bash · read · write · edit · glob · grep · web_fetch · web_search

Disable individually via configs[], or default-disable everything with default_config.enabled: false + per-tool allowlist.
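
The default-off-plus-allowlist pattern can be sketched as a small builder. The key names default_config.enabled and configs[] come from the text above; the shape of each configs[] entry ({"name": ..., "enabled": True}) and the helper name are assumptions for illustration, not a confirmed schema:

```python
TOOLSET_TYPE = "agent_toolset_20260401"  # toolset identifier quoted above
BUILT_IN_TOOLS = {"bash", "read", "write", "edit", "glob", "grep", "web_fetch", "web_search"}


def allowlisted_toolset(allowed: set[str]) -> dict:
    """Build a toolset entry with every tool off except those in `allowed`.

    The per-entry shape inside "configs" is an assumption.
    """
    unknown = allowed - BUILT_IN_TOOLS
    if unknown:
        raise ValueError(f"not built-in tools: {sorted(unknown)}")
    return {
        "type": TOOLSET_TYPE,
        "default_config": {"enabled": False},
        "configs": [{"name": name, "enabled": True} for name in sorted(allowed)],
    }


# A read-only agent: no bash, no write/edit, no web access.
read_only = allowlisted_toolset({"read", "glob", "grep"})
```

Starting from everything disabled means a tool added to the built-in set later stays off until someone opts in.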

MCP toolset (HTTP only)

Two-step wire-up. Permission policy: always_ask.

Declare mcp_servers on the agent + expose via {type: "mcp_toolset", mcp_server_name: "..."}. Streamable HTTP only, no stdio. Auth flows through vault credentials at session creation. Invalid creds → session.error event, retries on next idle→running.
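
Because the wire-up has two steps, a toolset can reference a server that was never declared. A minimal consistency check, assuming the agent definition is a dict with "mcp_servers" and "tools" lists; the "name" and "url" keys on server entries and the "kendo" server name are hypothetical:

```python
def check_mcp_wiring(agent: dict) -> list[str]:
    """Return server names referenced by mcp_toolset tools but not declared."""
    declared = {server["name"] for server in agent.get("mcp_servers", [])}
    referenced = [
        tool["mcp_server_name"]
        for tool in agent.get("tools", [])
        if tool.get("type") == "mcp_toolset"
    ]
    return [name for name in referenced if name not in declared]


agent = {
    # The "name"/"url" keys on mcp_servers entries are assumed for illustration.
    "mcp_servers": [{"name": "github", "url": "https://api.githubcopilot.com/mcp/"}],
    "tools": [
        {"type": "mcp_toolset", "mcp_server_name": "github"},
        {"type": "mcp_toolset", "mcp_server_name": "kendo"},  # never declared above
    ],
}
missing = check_mcp_wiring(agent)
```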

Custom tools (your application executes)

Backend executes the tool. Permission policy: app-decided.

Agent invokes a custom tool → emits agent.custom_tool_use, then session.status_idle with stop_reason.type === "requires_action". Backend executes, then posts user.custom_tool_result. Permission policies do not apply; your application decides.
{
  "type": "custom",
  "name": "get_weather",
  "description": "Get current weather for a location",
  "input_schema": { "type": "object", "properties": {...}, "required": [...] }
}
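
The round-trip for a custom tool can be sketched as a dispatcher that turns an incoming agent.custom_tool_use event into the user.custom_tool_result event the backend posts back. The two event type names come from this survey; every other field name ("id", "name", "input", "custom_tool_use_id", "content", "is_error") is an assumption, and the weather handler is a stub:

```python
import json
from typing import Callable

# Local implementations, keyed by the "name" of the custom tool definition.
HANDLERS: dict[str, Callable[[dict], dict]] = {
    "get_weather": lambda args: {"location": args["location"], "summary": "stub forecast"},
}


def handle_custom_tool_use(event: dict) -> dict:
    """Turn an agent.custom_tool_use event into a user.custom_tool_result event.

    Field names other than the two event types are assumptions for illustration.
    """
    handler = HANDLERS.get(event["name"])
    if handler is None:
        body, is_error = {"error": f"unknown tool {event['name']}"}, True
    else:
        try:
            body, is_error = handler(event["input"]), False
        except Exception as exc:  # report the failure to the agent instead of crashing
            body, is_error = {"error": str(exc)}, True
    return {
        "type": "user.custom_tool_result",
        "custom_tool_use_id": event["id"],
        "content": json.dumps(body),
        "is_error": is_error,
    }
```

Returning an error result rather than raising keeps the session from sitting in requires_action indefinitely when a handler fails.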

§4 Skills: first-class on the agent

Authoring rules from best-practices

  • YAML frontmatter required: name (≤64 chars, [a-z0-9-]+, no "anthropic"/"claude"); description (≤1024 chars)
  • Description must be third-person: "Processes Excel files and generates reports" ✓, "You can help with..." ✗ (causes discovery problems)
  • Body ≤500 lines for performance; split into bundled files referenced from SKILL.md
  • One level deep for file references: Claude reads only the first 100 lines (head -100) of nested references
  • Gerund naming preferred: processing-pdfs, not pdf-helper
  • Inside skill bodies, MCP tools must be fully qualified: ServerName:tool_name
  • Bundled scripts are more reliable than asking the agent to regenerate equivalent code
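
The mechanical rules in the list above lend themselves to a pre-upload lint. A minimal sketch; the function name is hypothetical, and the third-person check is a crude heuristic that only flags descriptions opening with "I", "you" or "we":

```python
import re

NAME_RE = re.compile(r"[a-z0-9-]+")
RESERVED = ("anthropic", "claude")


def validate_skill(name: str, description: str, body: str) -> list[str]:
    """Check the mechanical authoring rules; returns a list of violations."""
    problems = []
    if len(name) > 64 or not NAME_RE.fullmatch(name):
        problems.append("name must be 1-64 chars of [a-z0-9-]")
    if any(word in name for word in RESERVED):
        problems.append("name must not contain 'anthropic' or 'claude'")
    if not description or len(description) > 1024:
        problems.append("description must be 1-1024 chars")
    # Heuristic only: catches obvious first/second-person openings.
    if re.match(r"\s*(i|you|we)\b", description, flags=re.IGNORECASE):
        problems.append("description should be third-person")
    if len(body.splitlines()) > 500:
        problems.append("body exceeds 500 lines; split into bundled files")
    return problems
```

A check like this would flag the existing kendo Claude Code skills before any re-authoring effort is spent on them.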

Cross-surface gotcha

Custom skills do not sync between surfaces:

  • Claude.ai uploads (per-user)
  • Claude API uploads (workspace-wide via /v1/skills)
  • Claude Code filesystem-based, per-project at .claude/skills/

Local kendo Claude Code skills (vue-vitest-testing, php-unit-test) cannot be copy-pasted. They use first/second person and reference Claude Code-only slash commands.

Anthropic pre-built skills

xlsx · docx · pptx · pdf

Use as {type: "anthropic", skill_id: "xlsx"}. Custom org-uploaded skills support version pinning ("latest" or specific version).

Progressive disclosure

Skill metadata (~100 tokens per skill from YAML frontmatter) is always loaded into the system prompt. The SKILL.md body (≤5K tokens) loads only when the agent decides the skill is relevant. Bundled scripts execute via bash; their source code never enters context, only the script's stdout/stderr.

Max 20 skills per session (across all sub-agents in multiagent).

§5 Multiagent coordinator

How it works

A single agent declares a roster of sub-agents. All sub-agents share the same container and filesystem. Each runs in its own session thread with isolated context — own conversation history, model, system prompt, tools, MCP servers, skills.

Threads are persistent — coordinator can send a follow-up to an earlier sub-agent and that sub-agent retains its full prior turns. Maps cleanly to multi-stage workflows: Planner → Implementer → Reviewer.

"multiagent": {
  "type": "coordinator",
  "agents": [
    { "type": "agent", "id": "agent_xxx" },
    { "type": "agent", "id": "agent_yyy", "version": 3 },
    { "type": "self" }
  ]
}

Constraints

  • Max unique agents in roster: 20
  • Max concurrent threads per session: 25
  • Depth: 1 (coordinators cannot recurse)
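
The roster limit can be enforced where the coordinator block is built. A minimal sketch that produces the same JSON shape as the fragment above; the helper name is hypothetical, and counting the "self" entry toward the 20-agent limit is a conservative assumption rather than documented behaviour:

```python
MAX_ROSTER = 20  # max unique agents in a roster, per the constraints above


def build_coordinator(agents: list[dict], include_self: bool = False) -> dict:
    """Build a multiagent block, rejecting rosters the stated limits rule out.

    `agents` entries look like {"id": "agent_xxx"} or {"id": "agent_yyy", "version": 3}.
    """
    ids = [agent["id"] for agent in agents]
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate agent id in roster")
    roster = [{"type": "agent", **agent} for agent in agents]
    if include_self:
        roster.append({"type": "self"})
    # Assumption: "self" counts toward the limit (the stricter reading).
    if len(roster) > MAX_ROSTER:
        raise ValueError(f"roster has {len(roster)} entries; limit is {MAX_ROSTER}")
    return {"type": "coordinator", "agents": roster}
```

The depth limit needs no code: sub-agents in the roster simply cannot declare rosters of their own.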

§6 Outcomes (public beta · 2026-05-06)

Define what "done" looks like

The harness iterates the agent until the artifact passes a separate grader. Rubric is markdown with explicit per-criterion checks. Grader runs in a separate context window from the main agent — isn't biased by the agent's implementation choices.

{
  "type": "user.define_outcome",
  "description": "Build a DCF model for Costco in .xlsx",
  "rubric": { "type": "text", "content": "# DCF Model Rubric\n..." },
  "max_iterations": 5
}

Default max_iterations: 3, max 20. Rubric inline or via Files API. Span events span.outcome_evaluation_* make iteration loop observable. One outcome at a time per session, but outcomes can be chained sequentially.
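
The iteration bounds can be applied when the event is constructed. A minimal sketch that builds the same event shape as the JSON above; the helper name is hypothetical, and the lower bound of 1 is an assumption since only the default (3) and the maximum (20) are stated:

```python
from typing import Optional

DEFAULT_MAX_ITERATIONS = 3
HARD_MAX_ITERATIONS = 20


def define_outcome(
    description: str, rubric_markdown: str, max_iterations: Optional[int] = None
) -> dict:
    """Build a user.define_outcome event with an inline text rubric."""
    if max_iterations is None:
        max_iterations = DEFAULT_MAX_ITERATIONS
    # Lower bound of 1 is assumed; the survey only states the default and the max.
    if not 1 <= max_iterations <= HARD_MAX_ITERATIONS:
        raise ValueError(f"max_iterations must be between 1 and {HARD_MAX_ITERATIONS}")
    return {
        "type": "user.define_outcome",
        "description": description,
        "rubric": {"type": "text", "content": rubric_markdown},
        "max_iterations": max_iterations,
    }
```

For Application B the rubric text would be generated from the issue's acceptance criteria, one check per criterion.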

Result types

  • satisfied: Session transitions to idle
  • needs_revision: New iteration with grader's per-criterion feedback
  • max_iterations_reached: No further evaluation; one final revision
  • failed: Rubric ↔ description fundamentally contradict
  • interrupted: Only if eval already started before user.interrupt

§7 Events & streaming

User events you send

user.message · user.interrupt · user.custom_tool_result · user.tool_confirmation · user.define_outcome

Race condition: open SSE stream before sending kickoff user.message.
Reconnect: open new stream → list past events for dedup IDs → tail.
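
The reconnect recipe boils down to filtering the new stream by the IDs already present in the event history. A minimal sketch, assuming every event is a dict with a unique "id"; how the history is listed and how the stream is opened are left to the caller:

```python
from typing import Iterable, Iterator


def tail_without_duplicates(past: Iterable[dict], live: Iterable[dict]) -> Iterator[dict]:
    """Yield live events that were not already in the history listing.

    The caller is expected to open the live stream first and list past events
    second, so the overlap between the two shows up here as duplicates
    rather than as a gap.
    """
    seen = {event["id"] for event in past}
    for event in live:
        if event["id"] in seen:
            continue
        seen.add(event["id"])
        yield event
```

Opening the stream before listing history is what makes this safe: an event can appear twice, which the filter absorbs, but it cannot fall between the two calls.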

Token usage on session object

  • input_tokens
  • output_tokens
  • cache_creation_input_tokens
  • cache_read_input_tokens

5-minute prompt cache TTL.
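
The four usage counters are enough to tell how much of a session's input was served from the prompt cache. A minimal sketch; the function name is hypothetical and the usage object is assumed to arrive as a plain dict with exactly the four fields listed above:

```python
USAGE_FIELDS = (
    "input_tokens",
    "output_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)


def cache_read_share(usage: dict) -> float:
    """Fraction of all input-side tokens that were served from the prompt cache."""
    missing = [field for field in USAGE_FIELDS if field not in usage]
    if missing:
        raise KeyError(f"usage is missing fields: {missing}")
    total_input = (
        usage["input_tokens"]
        + usage["cache_creation_input_tokens"]
        + usage["cache_read_input_tokens"]
    )
    if total_input == 0:
        # All-zero usage usually means the counters were read too early.
        raise ValueError("usage is all zero; the session may not have reported yet")
    return usage["cache_read_input_tokens"] / total_input
```

Treating an all-zero reading as an error rather than as "0% cached" keeps early reads from polluting cost reports.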

Agent / session / span events you receive

Domain | Event types
Agent | agent.message · agent.thinking · agent.tool_use · agent.tool_result · agent.mcp_tool_use · agent.mcp_tool_result · agent.custom_tool_use · agent.thread_context_compacted · agent.thread_message_received · agent.thread_message_sent
Session | session.status_running · session.status_idle · session.status_rescheduled · session.status_terminated · session.error · session.thread_created · session.thread_status_running · session.thread_status_idle · session.thread_status_terminated
Span | span.model_request_start · span.model_request_end · span.outcome_evaluation_start · span.outcome_evaluation_ongoing · span.outcome_evaluation_end

§9 GitHub access: the canonical path

The Anthropic-blessed GitHub MCP is at https://api.githubcopilot.com/mcp/. Token declared on the session's resources, not the agent — agent stays repo-agnostic and reusable.

  1. Mount repo on session: resources.github_repository
  2. Edit files in mounted repo: read · write · edit
  3. Push branch: bash · git push
  4. Create PR: github:create_pull_request

Multiple repos: add entries to resources array. Repos are cached across sessions sharing them. Token rotates mid-session via PATCH /v1/sessions/{id}/resources/{rid} — useful for short-lived GH App installation tokens. Fine-grained PATs are explicitly recommended over broad-access tokens.
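
Mid-session rotation needs two pieces: a rule for when to rotate and the path to PATCH. A minimal sketch; the PATCH path is taken from the text above and the one-hour lifetime is GitHub's standard for App installation tokens, while the 10-minute safety margin, the helper names and the example IDs are hypothetical:

```python
from datetime import datetime, timedelta, timezone

TOKEN_TTL = timedelta(hours=1)         # GitHub App installation tokens last one hour
ROTATE_MARGIN = timedelta(minutes=10)  # hypothetical safety margin, tune to taste


def needs_rotation(minted_at: datetime, now: datetime) -> bool:
    """True once the token is within ROTATE_MARGIN of its expiry."""
    return now >= minted_at + TOKEN_TTL - ROTATE_MARGIN


def rotation_path(session_id: str, resource_id: str) -> str:
    """Path for the PATCH call that swaps the token on a mounted resource."""
    return f"/v1/sessions/{session_id}/resources/{resource_id}"


minted = datetime(2026, 5, 9, 12, 0, tzinfo=timezone.utc)
due = needs_rotation(minted, minted + timedelta(minutes=55))  # inside the margin
```

Rotating on a margin rather than at expiry avoids a git push that starts with a valid token and ends with an expired one.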

§10 Webhooks

Session events

session.status_run_started · session.status_idled · session.status_rescheduled · session.status_terminated · session.thread_created · session.thread_idled · session.thread_terminated · session.outcome_evaluation_ended

Vault events

vault.created · vault.archived · vault.deleted · vault_credential.created · vault_credential.archived · vault_credential.deleted · vault_credential.refresh_failed

Delivery semantics

  • Verification: SDK unwrap() checks X-Webhook-Signature, rejects payloads >5 min old
  • Payloads carry event type + ID, not the full object — fetch via GET
  • At-least-once retries with same event.id — dedupe accordingly
  • Ordering not guaranteed
  • 3xx counts as failure — redirects not followed
  • Auto-disabled after ~20 consecutive failures or immediately on private-IP hostname
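
The delivery semantics above imply two checks on the receiving side once the signature has been verified: freshness and deduplication. A minimal sketch of just those two checks; signature verification is deliberately left to the SDK's unwrap(), the class name is hypothetical, and the assumption is that a send timestamp (Unix seconds) is available from the verified request:

```python
import time
from typing import Optional

MAX_AGE_SECONDS = 5 * 60  # freshness window quoted above


class WebhookInbox:
    """Tracks seen event IDs in memory; a real deployment would use a shared store."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def classify(self, event_id: str, sent_at: float, now: Optional[float] = None) -> str:
        """Return "stale", "duplicate" or "process" for a verified delivery."""
        now = time.time() if now is None else now
        if now - sent_at > MAX_AGE_SECONDS:
            return "stale"
        if event_id in self._seen:
            return "duplicate"  # at-least-once delivery: acknowledge, do not reprocess
        self._seen.add(event_id)
        return "process"
```

A duplicate should still be answered with a 2xx so retries stop. Since payloads carry only the event type and ID and ordering is not guaranteed, the "process" branch would fetch the current object via GET rather than trust the order of arrival.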

§13 Kendo-specific implications

Concern | What it means for kendo
laravel/ai coverage | Covers Messages API only, not Managed Agents. Need direct HTTP via Http:: or the official PHP SDK if it ships CMA support. New App\Services\ManagedAgentsService in the Services deptrac layer.
kendo-script MCP transport | ✓ Verified streamable HTTP, no blocker (2026-05-09). Mcp::web('/mcp/kendo', KendoServer::class) in backend/routes/ai.php:15 uses laravel/mcp 0.6.6's HttpTransport, which cites MCP spec 2025-06-18/basic/transports verbatim. Live probe: POST → 401 + WWW-Authenticate: Bearer realm="mcp", resource_metadata="..."; GET/DELETE → 405 + Allow: POST; .well-known/oauth-protected-resource (RFC 9728) and .well-known/oauth-authorization-server (RFC 8414, S256 PKCE, refresh tokens, dynamic client registration) both serve correct metadata. Exactly the shape Anthropic's mcp_oauth vault credential type consumes.
GitHub repo-mount auth | ✓ Reuse existing GitHub App installation, no per-repo key (2026-05-09). Kendo connects via per-tenant GitHub App: GithubInstallation (central DB) maps installation_id ↔ tenant_id; ProjectGithubRepo.repo_full_name holds linked repos. GithubAppService::getInstallationToken(int $installationId): string (backend/app/Services/GithubAppService.php:30) mints a fresh 1-hour installation access token via JWT-signed POST /app/installations/{id}/access_tokens. The token covers every repo the installation has access to, with no per-repo keying. Feed it directly to Anthropic's resources: [{type: "github_repository", authorization_token: $token, ...}]. App permissions (pull_requests: write + contents: write, proven by existing createCheckRun/createPrComment) cover Anthropic's required repo scope (§9). Caveats: 1-hour TTL → PATCH /v1/sessions/{id}/resources/{resource_id} rotation for long sessions; repo must be in the installation's selected list (already enforced by ProjectGithubRepo UX).
Reusable AI infra | AiOutboundLogger (hash-chained per ADR-0003) already covers tokens + status + errors. AgentProgressEvent + private Echo channel Tenant.{tenantId}.App.Models.User.{userId} handles client-side streaming.
Pennant pattern | Established: feature classes in app/Features/<Pascal>.php, #[Name('kebab-name')], resolve(): bool (default-off), per-tenant scoped. Frontend bridge: useFeatureActive('flagName').
Webhook intake mirror | Mirror existing GithubWebhookController + VerifyGithubWebhook + ProcessPullRequestWebhookJob. New handler uses X-Webhook-Signature + 5-minute freshness check.
Audit logging | ADR-0001 mandates append-only hash-chained logs. Session lifecycle events would write through new ClaudeSessionAuditLogger mirroring IssueAuditLogger.
Strategic context (KD-0390) | kendo previously had in-house "AI bot assignment": app/Actions/AiRun/*, App\Models\AiRun, AiRunWorkflowJob. Removed 2026-04-25 ("not on roadmap, if we ever want it we'll rebuild it"). Any new autonomous-agent feature is a deliberate revisit of that decision on a different shape.

§14 Open questions

Cost model: Pricing is unpublished in beta docs
Per-token? Per-hour? Per-session-creation overhead? Container boot: billed? Need small experiments before scoping cost guardrails.

Latency: First-call latency vs Messages API
Container boot + session create. Likely meaningful regression for short-lived workloads. Need benchmark, not a guess.

Resolved 2026-05-09: Streamable HTTP on kendo-script MCP
Verified in laravel/mcp 0.6.6 source (cites MCP spec 2025-06-18 + 2025-11-25) and against the live endpoint. Streamable HTTP confirmed; full RFC 9728 + RFC 8414 OAuth discovery in place. Anthropic-compatible, see §13.

Identity model: Vault-per-user vs single bot vault
Per-user means audit trails attribute to the kicking-off Member; single bot vault is simpler but loses attribution granularity.

Failure semantics: failed outcome, what does kendo do?
If grader's failed result fires (rubric ↔ description contradict), auto-comment + unassign Claude is the obvious answer but worth confirming.

Skill scope: Custom-skill authoring scope for multiagent path
MVP: planner + implementer + reviewer + vue-tests + php-tests (5 skills). Third-person, ≤500 lines each, with bundled validation scripts. Not trivial.

Resolved 2026-05-09: Outcomes / multiagent are research preview
Both moved to public beta at Code with Claude 2026 (May 6-8). Available to all developers on the standard managed-agents-2026-04-01 beta header, no access form needed. New research-preview feature is "dreaming" (sessions that learn from past sessions); not a dependency for any of the three sketched apps.

§15 Decisions made 2026-05-09 · superseded

Superseded — migration cancelled 2026-05-09. PR #1101 closed without merging after the post-implementation production-telemetry check showed CMA fails the latency gate against real production p95 (28 s) by 4–7×. The 8 decisions below are preserved as the historical record of what was decided at the time. Decision #8's premise (Application B parked until ResearchAction soaks) cannot be met because ResearchAction never shipped — Application B is therefore re-opened to first-class consideration. Future CMA work must bake in the methodology rules from the production-telemetry postmortem.

Settled in a planning round between CEO and parent agent. These commit the project to a specific path; future /plan-feature rounds (for ResearchAction migration first, then Application B) start from these.

Decision | Detail
1. Next concrete CMA work | Migrate ResearchAction only: the multi-turn codebase-exploration phase of StoryGenerationHarnessAction.php. Other 4 phases stay on Messages API (structured-output calls don't benefit from CMA).
2. No Pennant flag on migration | ResearchAction is internal infra. Pennant pattern is reserved for HandOffToClaude (Application B), the user-visible feature.
3. Pre-migration spike (✓ ran 2026-05-09) | Ran with actual ResearchAction system prompt + 4 story-gen inputs. Prompt 1 produced a clean Sonnet apples-to-apples comparison; prompts 2-4 surfaced harness reliability issues. Single clean datapoint shows 20.7× cheaper, 2.7× faster, 3.3× fewer tool calls; both gates pass with order-of-magnitude margin. Migration green-lit. Full writeup in research/2026-05-08-managed-agents-spike.md.
4. Spike scaffold location | Extend the existing ~/Code/cma-spike/ Python scaffolding with a new prompts/ file. Standalone, throwaway. Numbers stay comparable to the first spike via the same compare.py harness.
5. Pass/fail thresholds | Migration ships only if both hold: cost ≥ 2× cheaper and p95 latency ≤ 1.5× current ResearchAction. Conservative on cost (lab-side 19× gives margin); forgiving on latency (story-gen is async-job, not interactive).
6. Rollback strategy | Replace outright. Trust the benchmark as the gate. If CMA misbehaves in production, rollback via git revert + redeploy. No shadow mode, no in-code A/B.
7. Outcomes + multiagent gating | Not a decision: both already public beta as of 2026-05-06. No access form. Available now on the standard beta header.
8. Application B timing | Park until ResearchAction has shipped and run in production for ≥ 2 weeks. Then /plan-feature with battle-tested CMA infra (ManagedAgentsService + GitHub installation-token rotation + audit logging).

§16 Benchmark results 2026-05-09 · cost ✓ · latency ✗ on production telemetry

Sonnet · prompt 1 · clean comparison

Production ResearchAction system prompt against a story-gen input ("issue pinning" feature). Both paths ran to end_turn with no interruption.

Metric | Messages API | CMA | Ratio
Cost | $11.51 | $0.56 | 20.7× cheaper
Wall time | 483s · 8 min | 180s · 3 min | 2.7× faster
Tool calls | 118 | 36 | 3.3× fewer
Cumulative input | 3.79M tokens · uncached | 38 input + 1.54M cache_read |

Messages-API per-prompt cost was 3.5× higher than the first spike's average ($3.24) — production-shape ResearchAction prompts are far more open-ended than synthetic codebase questions, so the no-cache agent loop balloons accordingly. CMA's automatic prompt caching absorbs this — same prompt cost $0.56.

Verdict at the time: migration green-lit. Cost gate (≥ 2×) passes by 10× margin. Latency gate (CMA p95 ≤ 1.5× current) was scored against the spike's 483 s headline — trivially passable.

Post-implementation correction (2026-05-09): a query against prod-issue-tracker.ai_outbound_logs (n=36, last 30 days, Script tenant) showed real production p95 is 28 s, not 483 s. The spike's "production-shape" prompt was 17× the real production p95. Today's CMA smokes (Sonnet 191 s, Haiku 125 s) are 4–7× slower than production p95 — the latency gate fails against the real baseline. Cost case still holds. Full postmortem with methodology lessons: research/2026-05-09-production-telemetry-correction.md.
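
The two gates from decision #5 are simple ratios, and recomputing them against both baselines shows exactly where the verdict flipped. A minimal sketch using only figures quoted in this section; the function names are hypothetical, and note that it compares single-run wall times against a p95 baseline, as the original scoring did:

```python
COST_GATE_MIN_RATIO = 2.0     # "cost ≥ 2× cheaper"
LATENCY_GATE_MAX_RATIO = 1.5  # "p95 latency ≤ 1.5× current"


def cost_gate(baseline_cost: float, candidate_cost: float) -> tuple[float, bool]:
    """Return (how many times cheaper the candidate is, whether the gate passes)."""
    ratio = baseline_cost / candidate_cost
    return ratio, ratio >= COST_GATE_MIN_RATIO


def latency_gate(baseline_p95_s: float, candidate_s: float) -> tuple[float, bool]:
    """Return (candidate latency as a multiple of baseline, whether the gate passes)."""
    ratio = candidate_s / baseline_p95_s
    return ratio, ratio <= LATENCY_GATE_MAX_RATIO


print(cost_gate(11.51, 0.56))  # cost case: passes under either baseline
print(latency_gate(483, 180))  # scored against the spike's 483 s headline: passes
print(latency_gate(28, 191))   # Sonnet smoke vs real production p95: fails
print(latency_gate(28, 125))   # Haiku smoke vs real production p95: fails
```

With a 28 s baseline the gate allows at most 42 s, so neither smoke run is close. The formula did not change between the two verdicts; only the baseline did.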

Harness reliability findings

Prompts 2-4 surfaced limitations of both the spike scaffold and Anthropic's session-control primitives. None block the migration decision; all flow into the ManagedAgentsService design when /plan-feature runs.

  1. Production prompts run 5–10× longer than synthetic. First spike: 100–180s on CMA. Production-shape: 5–13 min. Patched session timeout 300s → 600s; future runs still need outlier tolerance.
  2. user.interrupt doesn't reliably stop sessions mid-bash-tool-execution. One stuck session ignored 3 interrupts over 15s and kept running for ~13 min until its bash command naturally returned. Interrupts seem to take effect only at model-loop boundaries, not mid-tool-call. Production constraint: bound work via outcome rubrics, not client-side interrupts.
  3. Reading session.usage too early returns 0/0/0/0. Patched to poll for idle status before reading. Without this fix, ~$1 of orphan-session spend went unattributed initially.
  4. Full agent_toolset_20260401 is too broad to faithfully simulate ResearchAction. Production has 3 read-only tools; the spike agent had bash + read + write + edit + grep + glob + originally web_fetch + web_search. On prompt 2, the agent ignored "your ONLY job is to explore" and made 35 edit calls patching 16 files (a complete ValidationException-passthrough fix across all create/update MCP tools). No git push attempted — verified via bash log; the fix lived only inside the ephemeral container and was reaped on archive. Migration plan must use default_config.enabled: false + read-only allowlist.
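
Finding 3 (usage read too early) suggests waiting for a settled status before touching the counters. A minimal sketch of that wait loop; `get_session` is a hypothetical callable that returns the session as a dict, the "status" field name is an assumption, and the 600 s default mirrors the patched spike timeout:

```python
import time
from typing import Callable


def wait_for_idle(
    get_session: Callable[[], dict],
    timeout_s: float = 600.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll until the session is idle or terminated, then return it.

    Usage counters should be read only from the returned session object.
    """
    deadline = clock() + timeout_s
    while True:
        session = get_session()
        if session["status"] in ("idle", "terminated"):
            return session
        if clock() >= deadline:
            raise TimeoutError(f"session still {session['status']} after {timeout_s}s")
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting. In production the session.status_idled webhook could replace the polling entirely.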

Spend tally

Run | Cost | Notes
Messages API · Sonnet · prompt 1 | $11.51 | Full completion · the headline datapoint · exactly the cost the migration eliminates
CMA · Sonnet · prompt 1 | $0.56 | Full completion · the headline datapoint
CMA · Sonnet · prompt 2 (off-script) | ~$2.96 | Agent ignored "research only" prompt and patched 16 files
CMA · Sonnet · orphan-session cleanup | ~$1.06 | Recovered post-hoc via session-usage queries
Total | ~$16 | ~$11 of which was the Messages-API run that the migration removes