Claude Managed Agents

beta · managed-agents-2026-04-01 Capability survey + kendo application sketches 2026-05-08

▸ Summary

Anthropic's Managed Agents API is a hosted agent runtime — Anthropic provisions the container, runs the agent loop, executes tool calls, and persists events server-side. We define what the agent is and what it runs in; Anthropic does the rest. Right fit for autonomous coding work and long-running tool-using tasks where we don't want to operate our own sandbox.

Recommendation: Don't pick the application before deciding the goal. Shipping a feature this sprint → Application A on Messages API. Learning CMA on a real workload → Application C, slice 2 (just ResearchAction). Avoid B as v1 — too much new infra in flight at once.

Beta header required: managed-agents-2026-04-01 (SDKs set it automatically)

Container ceiling: 8 GB RAM / 10 GB disk · Ubuntu 22.04 x86_64

Multiagent depth: coordinators cannot recurse

Outcome iterations: 3 → 20 (default → max per outcome)

§12 Three plausible kendo applications

Feature

Application A

Issue-quality analyzer

Sidebar card that runs a 4-criteria rubric on the issue body — clear acceptance criteria, bounded scope, no design ambiguity, no cross-cutting concerns. Single Action on top of AiService::generateStructured(). Doesn't need CMA — Messages API is the right tool. Useful as a feature, weak as a CMA testbed.

Effort: 3-5 days · Blast radius: Low · CMA value: None

Recommended testbed

Application C — slice 2

Migrate `ResearchAction`

Port just the multi-turn codebase-exploration phase of the existing 5-phase StoryGenerationHarnessAction onto Managed Agents. CMA's mounted-repo + bash + grep is genuinely better than tool-calling Messages API for codebase exploration. Real production workload, smallest informative slice. Highest CMA learning per scope unit.

Effort: ~2 weeks · Blast radius: Medium · CMA value: High

Flagship · later

Application B

Hand to Claude (autonomous)

Button hands an issue to a CMA session that plans, implements, runs tests, opens a PR. Outcome rubric = acceptance criteria. Webhook on session.outcome_evaluation_ended posts back to kendo. Largest payoff, biggest blast radius. Worth doing only after CMA experience is established.

Effort: Multi-week · Blast radius: High · CMA value: Maximum

§15 — Decision points if we proceed

If goal is

A useful feature shipped this sprint

↓

Application A · Messages API

Skip CMA. Cheap, clean, valuable to the team.

If goal is

Learn CMA on a real workload

↓

Application C, slice 2

Migrate just ResearchAction. Production workload, small blast radius.

If goal is

The flagship autonomous-execution feature

↓

B — but only after C ships

Otherwise too many new things in flight at once.

Stop-the-presses corrections vs initial impressions

Session-completion webhooks do exist — session.status_idled, session.outcome_evaluation_ended, session.thread_*. No polling needed.

The Anthropic-blessed GitHub MCP at https://api.githubcopilot.com/mcp/ is alive and documented, not archived. (The archived one is a separate community implementation.)

gh CLI is not required. The github_repository resource mounts the repo with git auth pre-wired. Native flow: edit files in mounted repo → git push via bash → create PR via MCP create_pull_request tool.

Skills are first-class on the Agent definition but don't sync from local Claude Code. Re-author for CMA: third-person, ≤500 lines, MCP refs fully qualified as ServerName:tool_name.

Beta status & limits

Header: managed-agents-2026-04-01 on every endpoint
Outcomes + multiagent: public beta as of 2026-05-06, no access form needed
Rate limits: 300 req/min creates · 600 req/min reads
Branding: "Claude Agent" is the preferred surface label
Pricing: none published in the beta docs as of this reading

§1 Resource model

Agent · versioned

Reusable, versioned config: model + system prompt + tools + MCP servers + skills + (optional) multiagent coordinator declaration.

Updates create new versions. Sessions can pin a version or float to latest. Archive makes it read-only; existing sessions continue.

Environment · not versioned

Container template: pre-installed packages (apt, pip, npm, cargo, gem, go), networking policy.

Persists until archived. Multiple sessions share an environment but each gets its own container. Mutating an environment changes it in place, so the change applies to every session created afterwards.

Session · per task

Running agent instance within an environment, tied to a specific task. Maintains conversation history and a checkpointed container filesystem.

idle → running → terminated. Events persist until session deletion. Container checkpoints expire after 30 days of inactivity.

Vault · per end-user

Per-end-user credential store, scoped at workspace level, used to inject auth into MCP server calls.

One credential per mcp_server_url per vault, max 20 credentials per vault. OAuth refresh handled by Anthropic.

Resources at session creation

Currently documented: github_repository — mounts a GitHub repo into the container with git auth pre-wired.

Lives at the session-creation boundary. Token can be rotated mid-session via PATCH /v1/sessions/{id}/resources/{rid}.

§2 The container

Languages · 8

Python 3.12+ · Node 20+ · Go 1.22+ · Rust 1.77+ · Java 21+ · Ruby 3.3+ · PHP 8.3+ · C/C++ GCC 13+

Databases · clients only

SQLite: running
PostgreSQL: client only, no server
Redis: client only, no server

Networking · 2 modes

unrestricted: full outbound, except safety blocklist
limited: allowlist + MCP/package-mgr carve-outs (recommended for production)

System tools · disabled by default: network

git · curl · wget · jq · tar · zip · unzip · ssh · scp · tmux · screen · make · cmake · docker · ripgrep (rg) · tree · htop · sed · awk · grep · vim · nano · diff · patch

Notable absence: gh CLI is not pre-installed. Add via packages.apt: ["gh"] on the environment — Anthropic pre-installs once and caches across sessions.

Specs

RAM ceiling: 8 GB
Disk ceiling: 10 GB
Platform: x86_64 · Ubuntu 22.04 LTS

§3 Tools: three categories

Built-in agent toolset · agent_toolset_20260401

All on by default · permission policy: always_allow

bash · read · write · edit · glob · grep · web_fetch · web_search

Disable individually via configs[], or default-disable everything with default_config.enabled: false + per-tool allowlist.
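
The default-disable pattern can be sketched as a plain Python dict builder. The `type`, `default_config.enabled`, and `configs` names come from the text above; the shape of each `configs` entry (`name` plus `enabled`) is an assumption and should be checked against the API reference before use.

```python
import json

# Tool names taken from the built-in toolset listed above.
BUILTIN_TOOLS = {"bash", "read", "write", "edit", "glob", "grep", "web_fetch", "web_search"}


def allowlisted_toolset(allowed):
    """Build a toolset config that disables everything except `allowed`.

    ASSUMPTION: each configs[] entry looks like {"name": ..., "enabled": ...}.
    """
    unknown = set(allowed) - BUILTIN_TOOLS
    if unknown:
        raise ValueError(f"not built-in tools: {sorted(unknown)}")
    return {
        "type": "agent_toolset_20260401",
        "default_config": {"enabled": False},
        "configs": [{"name": name, "enabled": True} for name in sorted(allowed)],
    }


# A read-only exploration agent: no bash, no write, no edit, no web access.
read_only = allowlisted_toolset({"read", "glob", "grep"})
print(json.dumps(read_only, indent=2))
```

Rejecting unknown tool names up front keeps a typo from silently producing an agent with no usable tools.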

MCP toolset · HTTP only

Two-step wire-up · permission policy: always_ask

Declare mcp_servers on the agent + expose via {type: "mcp_toolset", mcp_server_name: "..."}. Streamable HTTP only — no stdio. Auth flows through vault credentials at session creation. Invalid creds → session.error event, retries on next idle→running.

Custom tools · your application executes

Backend executes the tool · permission: app-decided

Agent invokes a custom tool → emits agent.custom_tool_use → session.status_idle with stop_reason.type === "requires_action". Backend executes, then posts user.custom_tool_result. Permission policies do not apply — your application decides.

{
  "type": "custom",
  "name": "get_weather",
  "description": "Get current weather for a location",
  "input_schema": { "type": "object", "properties": {...}, "required": [...] }
}
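
The backend half of that round trip can be sketched as a pure function that takes an `agent.custom_tool_use` event and builds the `user.custom_tool_result` event to post back. Only the two event type names come from the text; the event field names (`id`, `name`, `input`, `custom_tool_use_id`, `content`, `is_error`) and the `location` argument are assumptions made for illustration.

```python
def get_weather(location):
    # Stand-in for a real backend lookup.
    return f"No forecast source wired up for {location} in this sketch"


# Maps tool name to a callable that receives the tool's input dict.
HANDLERS = {"get_weather": lambda args: get_weather(args["location"])}


def build_tool_result(event, handlers=HANDLERS):
    """Turn an agent.custom_tool_use event into a user.custom_tool_result event."""
    if event.get("type") != "agent.custom_tool_use":
        raise ValueError("not a custom tool call")
    handler = handlers.get(event["name"])
    if handler is None:
        content, is_error = f"unknown tool: {event['name']}", True
    else:
        content, is_error = handler(event["input"]), False
    return {
        "type": "user.custom_tool_result",
        "custom_tool_use_id": event["id"],  # ASSUMED correlation field
        "content": content,
        "is_error": is_error,
    }
```

Because permission policies do not apply here, any allow/deny decision would live inside the handler lookup, which is why unknown tools return an error result rather than raising.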

§4 Skills: first-class on the agent

Authoring rules from best-practices

YAML frontmatter required — name (≤64 chars, [a-z0-9-]+, no "anthropic"/"claude"); description (≤1024 chars)
Description must be third-person: "Processes Excel files and generates reports" ✓, "You can help with..." ✗ (causes discovery problems)
Body ≤500 lines for performance; split into bundled files referenced from SKILL.md
Keep file references one level deep: Claude only reads the first 100 lines (head -100) of nested references
Gerund naming preferred — processing-pdfs, not pdf-helper
Inside skill bodies, MCP tools must be fully qualified — ServerName:tool_name
Bundled scripts are more reliable than asking the agent to regenerate equivalent code
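
The mechanical subset of these rules is easy to lint before upload. The sketch below checks only the limits stated above (name pattern and length, reserved words, description length, body line count); the function name and the wording of the messages are illustrative, and the third-person rule is left to human review because it cannot be checked reliably with a regex.

```python
import re

NAME_RE = re.compile(r"[a-z0-9-]+")
RESERVED = ("anthropic", "claude")


def check_skill(name, description, body):
    """Return a list of rule violations; an empty list means the skill passes."""
    problems = []
    if not NAME_RE.fullmatch(name):
        problems.append("name must match [a-z0-9-]+")
    if len(name) > 64:
        problems.append("name longer than 64 chars")
    if any(word in name for word in RESERVED):
        problems.append("name contains a reserved word")
    if not description or len(description) > 1024:
        problems.append("description missing or longer than 1024 chars")
    if len(body.splitlines()) > 500:
        problems.append("body longer than 500 lines")
    return problems


print(check_skill("processing-pdfs", "Extracts text and tables from PDF files.", "# Steps\n"))
print(check_skill("claude-pdf_helper", "", "x\n" * 501))
```

The first call passes cleanly; the second trips the pattern, reserved-word, description, and length checks at once.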

Cross-surface gotcha

Custom skills do not sync between surfaces:

Claude.ai uploads (per-user)
Claude API uploads (workspace-wide via /v1/skills)
Claude Code filesystem-based, per-project at .claude/skills/

Local kendo Claude Code skills (vue-vitest-testing, php-unit-test) cannot be copy-pasted. They use first/second person and reference Claude Code-only slash commands.

Anthropic pre-built skills

xlsx · docx · pptx · pdf

Use as {type: "anthropic", skill_id: "xlsx"}. Custom org-uploaded skills support version pinning ("latest" or specific version).

Progressive disclosure

Skill metadata (~100 tokens per skill from YAML frontmatter) is always loaded into the system prompt. The SKILL.md body (≤5K tokens) loads only when the agent decides the skill is relevant. Bundled scripts execute via bash; their source code never enters context, only the script's stdout/stderr.

Max 20 skills per session (across all sub-agents in multiagent).

§5 Multiagent coordinator

How it works

A single agent declares a roster of sub-agents. All sub-agents share the same container and filesystem. Each runs in its own session thread with isolated context — own conversation history, model, system prompt, tools, MCP servers, skills.

Threads are persistent — coordinator can send a follow-up to an earlier sub-agent and that sub-agent retains its full prior turns. Maps cleanly to multi-stage workflows: Planner → Implementer → Reviewer.

"multiagent": {
  "type": "coordinator",
  "agents": [
    { "type": "agent", "id": "agent_xxx" },
    { "type": "agent", "id": "agent_yyy", "version": 3 },
    { "type": "self" }
  ]
}

Constraints

Max unique agents in roster

Max concurrent threads / session

Depth — coordinators cannot recurse

§6 Outcomes · public beta · 2026-05-06

Define what "done" looks like

The harness iterates the agent until the artifact passes a separate grader. Rubric is markdown with explicit per-criterion checks. Grader runs in a separate context window from the main agent — isn't biased by the agent's implementation choices.

{
  "type": "user.define_outcome",
  "description": "Build a DCF model for Costco in .xlsx",
  "rubric": { "type": "text", "content": "# DCF Model Rubric\n..." },
  "max_iterations": 5
}

Default max_iterations: 3, max 20. Rubric inline or via Files API. Span events span.outcome_evaluation_* make iteration loop observable. One outcome at a time per session, but outcomes can be chained sequentially.

Result types

satisfied	Session transitions to idle
needs_revision	New iteration with grader's per-criterion feedback
max_iterations_reached	No further evaluation; one final revision
failed	Rubric ↔ description fundamentally contradict
interrupted	Only if eval already started before `user.interrupt`

§7 Events & streaming

User events you send

user.message · user.interrupt · user.custom_tool_result · user.tool_confirmation · user.define_outcome

Race condition: open SSE stream before sending kickoff user.message.
Reconnect: open new stream → list past events for dedup IDs → tail.
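
The reconnect recipe boils down to an ID-based merge of history and live stream. The sketch below works on plain iterables so that it stays independent of any client library; it assumes every event carries a unique `id` field, and the function and parameter names are hypothetical.

```python
def resume(past_events, live_stream, already_seen):
    """Yield events in order without repeating any event ID.

    past_events: events returned by listing the session's history
    live_stream: iterator over the newly opened stream
    already_seen: IDs the client had processed before the disconnect
    """
    seen = set(already_seen)
    # History first (fills the gap), then the live tail (skipping overlap).
    for source in (past_events, live_stream):
        for event in source:
            if event["id"] in seen:
                continue
            seen.add(event["id"])
            yield event
```

Opening the stream before listing history means the two sources overlap rather than leave a gap, and the `seen` set absorbs the overlap.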

Token usage on session object

input_tokens
output_tokens
cache_creation_input_tokens
cache_read_input_tokens

5-minute prompt cache TTL.

Agent / session / span events you receive

Domain	Event types
Agent	`agent.message` · `agent.thinking` · `agent.tool_use` · `agent.tool_result` · `agent.mcp_tool_use` · `agent.mcp_tool_result` · `agent.custom_tool_use` · `agent.thread_context_compacted` · `agent.thread_message_received` · `agent.thread_message_sent`
Session	`session.status_running` · `session.status_idle` · `session.status_rescheduled` · `session.status_terminated` · `session.error` · `session.thread_created` · `session.thread_status_running` · `session.thread_status_idle` · `session.thread_status_terminated`
Span	`span.model_request_start` · `span.model_request_end` · `span.outcome_evaluation_start` · `span.outcome_evaluation_ongoing` · `span.outcome_evaluation_end`

§9 GitHub access: the canonical path

The Anthropic-blessed GitHub MCP is at https://api.githubcopilot.com/mcp/. Token declared on the session's resources, not the agent — agent stays repo-agnostic and reusable.

Mount repo on session (resources.github_repository)
→ Edit files in mounted repo (read · write · edit)
→ Push branch (bash · git push)
→ Create PR (github:create_pull_request)

Multiple repos: add entries to resources array. Repos are cached across sessions sharing them. Token rotates mid-session via PATCH /v1/sessions/{id}/resources/{rid} — useful for short-lived GH App installation tokens. Fine-grained PATs are explicitly recommended over broad-access tokens.
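
For short-lived tokens the client needs to decide when to rotate. GitHub App installation tokens expire after one hour, so a minimal sketch is a clock check with some headroom. The five-minute margin and both function names are assumptions; the path template is the one quoted above, and the PATCH request body is deliberately left out because its shape is not stated here.

```python
from datetime import datetime, timedelta, timezone

TOKEN_TTL = timedelta(hours=1)        # GitHub App installation-token lifetime
SAFETY_MARGIN = timedelta(minutes=5)  # ASSUMED headroom, tune as needed


def needs_rotation(issued_at, now=None):
    """True once the token is within SAFETY_MARGIN of expiry."""
    now = now or datetime.now(timezone.utc)
    return now >= issued_at + TOKEN_TTL - SAFETY_MARGIN


def rotation_path(session_id, resource_id):
    # Path template as given in the text; send the new token here via PATCH.
    return f"/v1/sessions/{session_id}/resources/{resource_id}"
```

A scheduler or per-turn hook could call `needs_rotation` with the timestamp at which the current token was minted and, when it returns true, mint a new token and PATCH it to `rotation_path(...)`.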

§10 Webhooks

Session events

session.status_run_started · session.status_idled · session.status_rescheduled · session.status_terminated · session.thread_created · session.thread_idled · session.thread_terminated · session.outcome_evaluation_ended

Vault events

vault.created · vault.archived · vault.deleted · vault_credential.created · vault_credential.archived · vault_credential.deleted · vault_credential.refresh_failed

Delivery semantics

Verification: SDK unwrap() checks X-Webhook-Signature, rejects payloads >5 min old
Payloads carry event type + ID, not the full object — fetch via GET
At-least-once retries with same event.id — dedupe accordingly

Ordering not guaranteed
3xx counts as failure — redirects not followed
Auto-disabled after ~20 consecutive failures or immediately on private-IP hostname
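
At-least-once delivery plus unordered arrival means the receiver has to be idempotent. The sketch below covers only the dedupe and freshness steps and leaves signature verification to the SDK's unwrap(). The class name, the `sent_at` epoch-seconds timestamp, and the in-memory set are assumptions; a real handler would keep seen IDs in a durable store.

```python
import time

MAX_AGE_SECONDS = 5 * 60  # matches the 5-minute freshness window above


class WebhookInbox:
    """Accept each event ID once and drop stale deliveries."""

    def __init__(self):
        self._seen = set()  # ASSUMPTION: replace with a durable store in production

    def accept(self, event_id, sent_at, now=None):
        now = time.time() if now is None else now
        if now - sent_at > MAX_AGE_SECONDS:
            return False  # too old, treat as a possible replay
        if event_id in self._seen:
            return False  # retry of an event already handled
        self._seen.add(event_id)
        return True
```

Since payloads carry only the event type and ID, an accepted event would then trigger a GET for the current object, which also sidesteps the ordering problem: the handler acts on the latest state rather than on the order of arrival.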

§13 Kendo-specific implications

Concern	What it means for kendo
laravel/ai coverage	Covers Messages API only — not Managed Agents. Need direct HTTP via `Http::` or official PHP SDK if it ships CMA support. New `App\Services\ManagedAgentsService` in `Services` deptrac layer.
kendo-script MCP transport	✓ Verified streamable HTTP — no blocker (2026-05-09). `Mcp::web('/mcp/kendo', KendoServer::class)` in `backend/routes/ai.php:15` uses `laravel/mcp` 0.6.6's `HttpTransport`, which cites MCP spec `2025-06-18/basic/transports` verbatim. Live probe: `POST` → 401 + `WWW-Authenticate: Bearer realm="mcp", resource_metadata="..."`; `GET`/`DELETE` → 405 + `Allow: POST`; `.well-known/oauth-protected-resource` (RFC 9728) and `.well-known/oauth-authorization-server` (RFC 8414, S256 PKCE, refresh tokens, dynamic client registration) both serve correct metadata. Exactly the shape Anthropic's `mcp_oauth` vault credential type consumes.
GitHub repo-mount auth	✓ Reuse existing GitHub App installation — no per-repo key (2026-05-09). Kendo connects via per-tenant GitHub App: `GithubInstallation` (central DB) maps `installation_id ↔ tenant_id`; `ProjectGithubRepo.repo_full_name` holds linked repos. `GithubAppService::getInstallationToken(int $installationId): string` (`backend/app/Services/GithubAppService.php:30`) mints a fresh 1-hour installation access token via JWT-signed `POST /app/installations/{id}/access_tokens` — covers every repo the installation has access to, no per-repo keying. Feed it directly to Anthropic's `resources: [{type: "github_repository", authorization_token: $token, ...}]`. App permissions (`pull_requests: write` + `contents: write`, proven by existing `createCheckRun`/`createPrComment`) cover Anthropic's required `repo` scope (§9). Caveats: 1-hour TTL → `PATCH /v1/sessions/{id}/resources/{resource_id}` rotation for long sessions; repo must be in the installation's selected list (already enforced by `ProjectGithubRepo` UX).
Reusable AI infra	`AiOutboundLogger` (hash-chained per ADR-0003) already covers tokens + status + errors. `AgentProgressEvent` + private Echo channel `Tenant.{tenantId}.App.Models.User.{userId}` handles client-side streaming.
Pennant pattern	Established: feature classes in `app/Features/<Pascal>.php`, `#[Name('kebab-name')]`, `resolve(): bool` (default-off), per-tenant scoped. Frontend bridge: `useFeatureActive('flagName')`.
Webhook intake mirror	Mirror existing `GithubWebhookController` + `VerifyGithubWebhook` + `ProcessPullRequestWebhookJob`. New handler uses `X-Webhook-Signature` + 5-minute freshness check.
Audit logging	ADR-0001 mandates append-only hash-chained logs. Session lifecycle events would write through new `ClaudeSessionAuditLogger` mirroring `IssueAuditLogger`.
Strategic context (KD-0390)	kendo previously had an in-house "AI bot assignment" feature (`app/Actions/AiRun/`, `App\Models\AiRun`, `AiRunWorkflowJob`). It was removed on 2026-04-25 ("not on roadmap, if we ever want it we'll rebuild it"). Any new autonomous-agent feature deliberately revisits that decision in a different shape.
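
For readers unfamiliar with the hash-chained logging that the audit row refers to, here is a generic sketch of the technique in Python. It is not kendo's ADR-0001 scheme (the real loggers are PHP classes and their field layout is not described here); the entry shape, the SHA-256 choice, and the genesis value are all assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # ASSUMED starting value for an empty chain


def append_entry(chain, payload):
    """Append payload to the chain, linking it to the previous entry's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    body = json.dumps(payload, sort_keys=True)
    digest = hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()
    chain.append({"prev": prev_hash, "payload": payload, "hash": digest})


def verify(chain):
    """Recompute every link; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        body = json.dumps(entry["payload"], sort_keys=True)
        expected = hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True
```

Session lifecycle events would each become one payload; because every hash covers its predecessor, a later edit to an early entry invalidates every entry after it.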

§14 Open questions

Cost model

Pricing is unpublished in beta docs

Per-token? Per-hour? Per-session-creation overhead? Container boot — billed? Need small experiments before scoping cost guardrails.

Latency

First-call latency vs Messages API

Container boot + session create. Likely meaningful regression for short-lived workloads. Need benchmark, not a guess.

Resolved 2026-05-09

Streamable HTTP on kendo-script MCP

Verified in laravel/mcp 0.6.6 source (cites MCP spec 2025-06-18 + 2025-11-25) and against the live endpoint. Streamable HTTP confirmed; full RFC 9728 + RFC 8414 OAuth discovery in place. Anthropic-compatible — see §13.

Identity model

Vault-per-user vs single bot vault

Per-user means audit trails attribute to the kicking-off Member; single bot vault is simpler but loses attribution granularity.

Failure semantics

failed outcome — what does kendo do?

If grader's failed result fires (rubric ↔ description contradicts), auto-comment + unassign Claude is the obvious answer but worth confirming.

Skill scope

Custom-skill authoring scope for multiagent path

MVP: planner + implementer + reviewer + vue-tests + php-tests (5 skills). Third-person, ≤500 lines each, with bundled validation scripts. Not trivial.

Resolved 2026-05-09

Outcomes / multiagent are research preview

Both moved to public beta at Code with Claude 2026 (May 6-8). Available to all developers on the standard managed-agents-2026-04-01 beta header — no access form needed. New research-preview feature is "dreaming" (sessions that learn from past sessions); not a dependency for any of the three sketched apps.

§15 Decisions made 2026-05-09 · superseded

Superseded — migration cancelled 2026-05-09. PR #1101 closed without merging after the post-implementation production-telemetry check showed CMA fails the latency gate against real production p95 (28 s) by 4–7×. The 8 decisions below are preserved as the historical record of what was decided at the time. Decision #8's premise (Application B parked until ResearchAction soaks) cannot be met because ResearchAction never shipped — Application B is therefore re-opened to first-class consideration. Future CMA work must bake in the methodology rules from the production-telemetry postmortem.

Settled in a planning round between CEO and parent agent. These commit the project to a specific path; future /plan-feature rounds (for ResearchAction migration first, then Application B) start from these.

Decision	Detail
1. Next concrete CMA work	Migrate `ResearchAction` only — the multi-turn codebase-exploration phase of `StoryGenerationHarnessAction.php`. Other 4 phases stay on Messages API (structured-output calls don't benefit from CMA).
2. No Pennant flag on migration	ResearchAction is internal infra. Pennant pattern is reserved for HandOffToClaude (Application B), the user-visible feature.
3. Pre-migration spike ✓ ran 2026-05-09	Ran with actual ResearchAction system prompt + 4 story-gen inputs. Prompt 1 produced a clean Sonnet apples-to-apples comparison; prompts 2-4 surfaced harness reliability issues. Single clean datapoint shows 20.7× cheaper, 2.7× faster, 3.3× fewer tool calls — both gates pass with order-of-magnitude margin. Migration green-lit. Full writeup in `research/2026-05-08-managed-agents-spike.md`.
4. Spike scaffold location	Extend the existing `~/Code/cma-spike/` Python scaffolding with a new `prompts/` file. Standalone, throwaway. Numbers stay comparable to the first spike via the same `compare.py` harness.
5. Pass/fail thresholds	Migration ships only if both: cost ≥ 2× cheaper and p95 latency ≤ 1.5× current ResearchAction. Conservative on cost (lab-side 19× gives margin); forgiving on latency (story-gen is async-job, not interactive).
6. Rollback strategy	Replace outright. Trust the benchmark as the gate. If CMA misbehaves in production, rollback via `git revert` + redeploy. No shadow mode, no in-code A/B.
7. Outcomes + multiagent gating	Not a decision — both already public beta as of 2026-05-06. No access form. Available now on the standard beta header.
8. Application B timing	Park until ResearchAction has shipped and run in production for ≥ 2 weeks. Then `/plan-feature` with battle-tested CMA infra (ManagedAgentsService + GitHub installation-token rotation + audit logging).

§16 Benchmark results 2026-05-09 · cost ✓ · latency ✗ on production telemetry

Sonnet · prompt 1 · clean comparison

Production ResearchAction system prompt against a story-gen input ("issue pinning" feature). Both paths ran to end_turn with no interruption.

	Messages API	CMA	Ratio
Cost	$11.51	$0.56	20.7× cheaper
Wall time	483s · 8 min	180s · 3 min	2.7× faster
Tool calls	118	36	3.3× fewer
Cumulative input	3.79M tokens · uncached	38in + 1.54M cache_read	—

Messages-API per-prompt cost was 3.5× higher than the first spike's average ($3.24) — production-shape ResearchAction prompts are far more open-ended than synthetic codebase questions, so the no-cache agent loop balloons accordingly. CMA's automatic prompt caching absorbs this — same prompt cost $0.56.

Verdict at the time: migration green-lit. Cost gate (≥ 2×) passes by 10× margin. Latency gate (CMA p95 ≤ 1.5× current) was scored against the spike's 483 s headline — trivially passable.

Post-implementation correction (2026-05-09): a query against prod-issue-tracker.ai_outbound_logs (n=36, last 30 days, Script tenant) showed that real production p95 is 28 s, not 483 s. The spike's "production-shape" prompt ran 17× longer than the real production p95. Today's CMA smoke runs (Sonnet 191 s, Haiku 125 s) are 4–7× slower than production p95, so the latency gate fails against the real baseline. The cost case still holds. Full postmortem with methodology lessons: research/2026-05-09-production-telemetry-correction.md.
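
The two gates and the effect of swapping the baseline can be replayed with the figures quoted in this section. The thresholds come from decision #5; the function names are illustrative. Two caveats: the smoke runs are single measurements rather than a p95, and the rounded dollar figures give a cost ratio of roughly 20.6×, slightly below the reported 20.7×, which presumably used unrounded costs.

```python
COST_GATE = 2.0     # CMA must be at least 2x cheaper
LATENCY_GATE = 1.5  # CMA latency must be at most 1.5x the baseline p95


def cost_gate(baseline_cost, cma_cost):
    """Return (how many times cheaper CMA is, whether the gate passes)."""
    ratio = baseline_cost / cma_cost
    return ratio, ratio >= COST_GATE


def latency_gate(baseline_p95_s, cma_s):
    """Return (CMA latency as a multiple of baseline, whether the gate passes)."""
    ratio = cma_s / baseline_p95_s
    return ratio, ratio <= LATENCY_GATE


print("cost:", cost_gate(11.51, 0.56))
print("latency vs spike headline (483 s):", latency_gate(483, 180))
print("latency vs production p95, Sonnet:", latency_gate(28, 191))
print("latency vs production p95, Haiku:", latency_gate(28, 125))
```

Against the 483 s headline the latency ratio is well under 1.5, which is why the gate looked trivially passable; against the 28 s production p95 both smoke runs land between 4× and 7×, far above it.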

Harness reliability findings

Prompts 2-4 surfaced limitations of both the spike scaffold and Anthropic's session-control primitives. None block the migration decision; all flow into the ManagedAgentsService design when /plan-feature runs.

Production prompts run 5–10× longer than synthetic. First spike: 100–180s on CMA. Production-shape: 5–13 min. Patched session timeout 300s → 600s; future runs still need outlier tolerance.
user.interrupt doesn't reliably stop sessions mid-bash-tool-execution. One stuck session ignored 3 interrupts over 15s and kept running for ~13 min until its bash command naturally returned. Interrupts seem to take effect only at model-loop boundaries, not mid-tool-call. Production constraint: bound work via outcome rubrics, not client-side interrupts.
Reading session.usage too early returns 0/0/0/0. Patched to poll for idle status before reading. Without this fix, ~$1 of orphan-session spend went unattributed initially.
Full agent_toolset_20260401 is too broad to faithfully simulate ResearchAction. Production has 3 read-only tools; the spike agent had bash + read + write + edit + grep + glob + originally web_fetch + web_search. On prompt 2, the agent ignored "your ONLY job is to explore" and made 35 edit calls patching 16 files (a complete ValidationException-passthrough fix across all create/update MCP tools). No git push attempted — verified via bash log; the fix lived only inside the ephemeral container and was reaped on archive. Migration plan must use default_config.enabled: false + read-only allowlist.
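
The usage-reading fix from the third finding amounts to "wait for a settled status, then read". A minimal sketch, assuming a hypothetical zero-argument `fetch_session` callable that returns a dict with `status` and `usage` keys (it stands in for whatever client call retrieves the session; the dict shape is an assumption):

```python
import time


def read_usage_when_idle(fetch_session, timeout_s=600, interval_s=2.0, sleep=time.sleep):
    """Poll until the session reports idle or terminated, then return its usage.

    Reading usage while the session is still running is what produced
    the 0/0/0/0 readings described above.
    """
    waited = 0.0
    while True:
        session = fetch_session()
        if session["status"] in ("idle", "terminated"):
            return session["usage"]
        if waited >= timeout_s:
            raise TimeoutError("session still running; usage would read as zeros")
        sleep(interval_s)
        waited += interval_s
```

The 600 s default mirrors the patched session timeout; the injectable `sleep` exists only so the loop can be exercised in tests without waiting.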

Spend tally

Run	Cost	Notes
Messages API · Sonnet · prompt 1	$11.51	Full completion · the headline datapoint · exactly the cost the migration eliminates
CMA · Sonnet · prompt 1	$0.56	Full completion · the headline datapoint
CMA · Sonnet · prompt 2 (off-script)	~$2.96	Agent ignored "research only" prompt and patched 16 files
CMA · Sonnet · orphan-session cleanup	~$1.06	Recovered post-hoc via session-usage queries
Total	~$16	~$11 of which was the Messages-API run that the migration removes
