Claude Managed Agents — Capability Survey and Kendo Application Sketches

Summary

Anthropic's Managed Agents API (beta managed-agents-2026-04-01) is a hosted agent runtime. Anthropic provisions the container, runs the agent loop, executes tool calls, and persists events server-side. We define what the agent is (model + system prompt + tools + skills) and what it runs in (container template + network policy); Anthropic does the rest. It is the right fit for autonomous coding work and other long-running tool-using tasks where we don't want to operate our own sandbox.

Three plausible kendo applications, in order of strategic value:

  1. Issue-quality analyzer — sidebar card on issue show page that runs a 4-criteria rubric check before the issue is picked up. Smallest scope, largest ROI on team workflow. Doesn't actually need Managed Agents — a single Messages API call on laravel/ai is sufficient. Useful as a feature, weak as a CMA learning vehicle.
  2. Autonomous issue execution — "Hand to Claude" button hands an issue to a Managed Agents session that plans, implements, and opens a PR. Highest user-visible payoff, biggest infra investment (~weeks), highest blast radius. The flagship application.
  3. Story-generation harness migration — port the existing 5-phase StoryGenerationHarnessAction (Messages-API-on-laravel/ai) onto Managed Agents. Real production workload, smaller blast radius than (2), most informative as a CMA testbed because the workload is already proven. Migrating just ResearchAction (the multi-turn codebase-exploration phase) is the smallest informative slice.

Corrections to initial impressions (the first scoping pass got every one of these wrong):

  • Session-completion webhooks do exist (session.status_idled, session.outcome_evaluation_ended, session.thread_*). No polling needed.
  • The Anthropic-blessed GitHub MCP at https://api.githubcopilot.com/mcp/ is alive and documented, not archived. (The archived one is a separate community implementation in modelcontextprotocol/servers-archived.)
  • gh CLI is not required — github_repository resource mounts the repo with git auth pre-wired; git push works natively from bash. The Anthropic flow is "edit files in mounted repo → push branch via bash → create PR via the GitHub MCP create_pull_request tool".
  • Skills are first-class on the Agent definition (skills: [{type: "anthropic"|"custom", skill_id}]) but don't sync from local Claude Code. We'd re-author them for Managed Agents, third-person, ≤500 lines, with MCP refs fully qualified as ServerName:tool_name.

Recommendation: Don't pick the application before deciding the goal. If the goal is a useful feature shipped this sprint, the analyzer (1) on Messages API is right. If the goal is to learn CMA on a real workload before betting the autonomous-execution feature on it, migrating just ResearchAction of the story-gen harness (3) is the highest-information, smallest-scope option. Avoid (2) as a v1 — too much new infra in flight at once.


1. Resource model

Four primary resources, related but independent:

| Resource | What it is | Lifecycle |
| --- | --- | --- |
| Agent | Reusable, versioned configuration: model + system prompt + tools + MCP servers + skills + (optional) multiagent coordinator declaration. | Updates create new versions. Sessions can pin a version or float to latest. Archive makes it read-only; existing sessions continue. |
| Environment | Container template: pre-installed packages (apt, pip, npm, cargo, gem, go), networking policy. Not versioned; mutating an environment retroactively affects future sessions. | Persists until archived/deleted. Multiple sessions share an environment but each gets its own container. |
| Session | Running agent instance within an environment, tied to a specific task. Maintains conversation history and a checkpointed container filesystem. | idle → running → terminated. Session events persist until session is deleted. Container checkpoints expire after 30 days of inactivity. |
| Vault | Per-end-user credential store, scoped at the workspace level, used to inject auth into MCP server calls during a session. | One credential per mcp_server_url per vault, max 20 credentials per vault. OAuth refresh handled by Anthropic. |

A separate concept attached at session creation only:

  • Resources (resources: [...]) — currently the documented type is github_repository (mounts a GitHub repo into the container with git auth pre-wired). This is separate from environments and vaults and lives at the session-creation boundary.

2. The container

Cloud containers ship with this verbatim list (from /managed-agents/cloud-containers):

Programming languages: Python 3.12+ (pip, uv), Node.js 20+ (npm/yarn/pnpm), Go 1.22+, Rust 1.77+, Java 21+ (maven/gradle), Ruby 3.3+ (bundler/gem), PHP 8.3+ (composer), C/C++ GCC 13+.

Databases: SQLite (running), PostgreSQL client (psql only — no server), Redis client (redis-cli only — no server).

System tools: git, curl, wget, jq, tar/zip/unzip, ssh/scp, tmux/screen, make/cmake, docker (limited availability), ripgrep (rg), tree, htop, sed/awk/grep, vim/nano, diff/patch.

Specs: Ubuntu 22.04 LTS, x86_64, ≤8 GB RAM, ≤10 GB disk, network DISABLED by default.

Notable absence: GitHub CLI (gh) is not pre-installed. To add it: packages.apt: ["gh"] on the environment config — Anthropic pre-installs once and caches across sessions sharing the environment.

Networking has two modes:

```json
// "Full outbound network access, except for a general safety blocklist."
{ "networking": { "type": "unrestricted" } }

// Restricts to allowlisted hosts; toggleable carve-outs for MCP servers and package managers.
{ "networking": {
    "type": "limited",
    "allowed_hosts": ["api.example.com"],
    "allow_mcp_servers": true,
    "allow_package_managers": true
}}
```

For production, the docs explicitly recommend limited with an explicit allowed_hosts list — least privilege.
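
As a minimal sketch of the least-privilege recommendation, the helper below builds a `limited` networking block from a list of hosts. The function name `limited_networking` is hypothetical, and normalizing full URLs down to bare hostnames is an assumption about what `allowed_hosts` expects (the documented example uses a bare hostname).

```python
from urllib.parse import urlparse


def limited_networking(allowed_hosts, allow_mcp_servers=True, allow_package_managers=True):
    """Build a `networking` block of type "limited" for an environment config."""
    hosts = []
    for host in allowed_hosts:
        # ASSUMPTION: allowed_hosts takes bare hostnames, so strip any scheme/path.
        parsed = urlparse(host if "://" in host else f"//{host}")
        if not parsed.hostname:
            raise ValueError(f"not a hostname: {host!r}")
        if parsed.hostname not in hosts:  # de-duplicate, keep first-seen order
            hosts.append(parsed.hostname)
    return {
        "networking": {
            "type": "limited",
            "allowed_hosts": hosts,
            "allow_mcp_servers": allow_mcp_servers,
            "allow_package_managers": allow_package_managers,
        }
    }
```

Keeping the allowlist in one place like this makes it easy to review which hosts a given environment can reach.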


3. Tools

The agent's tool surface is composed of three categories:

Built-in agent toolset (agent_toolset_20260401)

| Tool | Description |
| --- | --- |
| bash | Execute shell commands |
| read | Read file from local filesystem |
| write | Write file |
| edit | String replacement in file |
| glob | File pattern matching |
| grep | Regex text search |
| web_fetch | Fetch URL content |
| web_search | Web search |

All on by default. Disable individual tools via configs[], or default-disable everything with default_config.enabled: false and explicit per-tool allowlist.
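
The default-off-plus-allowlist pattern can be sketched as follows. This matters for research-only agents: the benchmark notes at the end of this document report an agent editing files when it had the full toolset. The helper name `allowlisted_toolset` is hypothetical, and the shape of each `configs[]` entry (`name` plus `enabled`) is an assumption, since the text above only names `configs[]` and `default_config.enabled`.

```python
BUILT_IN_TOOLS = ("bash", "read", "write", "edit", "glob", "grep", "web_fetch", "web_search")


def allowlisted_toolset(allowed=("read", "glob", "grep")):
    """Disable every built-in tool by default, then re-enable only `allowed`."""
    unknown = sorted(set(allowed) - set(BUILT_IN_TOOLS))
    if unknown:
        raise ValueError(f"not built-in tools: {unknown}")
    return {
        "type": "agent_toolset_20260401",
        "default_config": {"enabled": False},
        # ASSUMPTION: per-tool entries are keyed by "name" with an "enabled" flag.
        "configs": [{"name": name, "enabled": True} for name in allowed],
    }
```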

Custom tools (your application executes them)

```json
{
  "type": "custom",
  "name": "get_weather",
  "description": "Get current weather for a location",
  "input_schema": { "type": "object", "properties": {...}, "required": [...] }
}
```

When the agent invokes a custom tool, the session emits agent.custom_tool_use, then session.status_idle with stop_reason.type === "requires_action". Backend executes the tool and posts a user.custom_tool_result event. Permission policies do not apply to custom tools — your application decides.
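
A rough sketch of the backend side of that round trip: take an `agent.custom_tool_use` event, run a local handler, and build the `user.custom_tool_result` event to post back. The event field names (`id`, `name`, `input`, `custom_tool_use_id`, `content`, `is_error`) are assumptions for illustration and should be checked against the event reference before use.

```python
def handle_custom_tool_use(event, handlers):
    """Return a user.custom_tool_result event, or None for unrelated events."""
    if event.get("type") != "agent.custom_tool_use":
        return None
    handler = handlers.get(event["name"])
    if handler is None:
        content, is_error = f"unknown tool: {event['name']}", True
    else:
        try:
            content, is_error = handler(**event["input"]), False
        except Exception as exc:  # report the failure to the agent instead of crashing
            content, is_error = str(exc), True
    return {
        "type": "user.custom_tool_result",
        # ASSUMPTION: the result references the originating event by its id.
        "custom_tool_use_id": event["id"],
        "content": content,
        "is_error": is_error,
    }
```

Because permission policies do not apply here, any authorization check belongs inside the handler itself.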

MCP toolset (mcp_toolset)

Two-step wire-up:

  1. Declare the MCP server on the agent:
    ```json
    "mcp_servers": [{ "type": "url", "name": "github", "url": "https://api.githubcopilot.com/mcp/" }]
    ```
  2. Expose its tools via a toolset entry:
    ```json
    "tools": [{ "type": "mcp_toolset", "mcp_server_name": "github" }]
    ```

Only HTTP MCP supported (must implement streamable HTTP transport — no stdio). Auth flows through vault credentials at session creation. If credentials are invalid or missing, the session still creates — a session.error event fires and auth retries on the next idle → running transition. So credentials can be rotated mid-session without restart.

Permission policies

Two policy types: always_allow (autonomous) and always_ask (emits session.status_idle{stop_reason: requires_action}, expects user.tool_confirmation).

Defaults are deliberate:

  • Agent toolset → always_allow
  • MCP toolset → always_ask (so newly added MCP tools don't auto-fire without operator review)

Set blanket policies via default_config.permission_policy; override per-tool via configs[].permission_policy. deny_message lets you reject with a steering note.
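
The precedence described above (toolset default, then `default_config`, then per-tool `configs[]`) can be sketched as a small resolver. Representing `permission_policy` as a plain string and keying overrides by `name` are assumptions for illustration; the real wire format may differ.

```python
TOOLSET_DEFAULTS = {
    "agent_toolset_20260401": "always_allow",
    "mcp_toolset": "always_ask",
}


def effective_policy(toolset, tool_name):
    """Resolve the permission policy that applies to one tool in a toolset entry."""
    policy = TOOLSET_DEFAULTS[toolset["type"]]
    # A blanket policy on the toolset overrides the built-in default...
    policy = toolset.get("default_config", {}).get("permission_policy", policy)
    # ...and a per-tool override wins over both.
    for override in toolset.get("configs", []):
        if override.get("name") == tool_name and "permission_policy" in override:
            policy = override["permission_policy"]
    return policy
```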


4. Skills

Skills are first-class on the Agent definition:

```json
"skills": [
  { "type": "anthropic", "skill_id": "xlsx" },
  { "type": "custom", "skill_id": "skill_abc123", "version": "latest" }
]
```

Pre-built Anthropic skills: xlsx, docx, pptx, pdf. Custom org-uploaded skills supported with version pinning (latest or pin to a specific version). Max 20 skills per session (across all sub-agents in multiagent).

Progressive disclosure is the load-bearing concept: skill metadata (name, description from YAML frontmatter — ~100 tokens per skill) is always loaded into the system prompt, but the SKILL.md body (≤5K tokens) loads only when the agent decides the skill is relevant. Bundled scripts execute via bash; their source code never enters context, only the script's stdout/stderr.
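
To make the budget concrete, here is a back-of-the-envelope estimate using the figures above (~100 tokens of metadata per attached skill, up to 5K tokens per loaded body). The function is a hypothetical helper, and the result is an upper-bound estimate, not a measurement.

```python
def skill_context_tokens(attached_skills, triggered_skills, metadata_tokens=100, body_tokens=5000):
    """Upper-bound estimate of context spent on skills in one session."""
    if triggered_skills > attached_skills:
        raise ValueError("cannot trigger more skills than are attached")
    # Metadata is always loaded; bodies load only for skills the agent triggers.
    return attached_skills * metadata_tokens + triggered_skills * body_tokens
```

With the maximum of 20 skills attached and none triggered, this gives about 2,000 tokens of standing overhead; each triggered skill adds at most another 5,000.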

Authoring rules

From /agents-and-tools/agent-skills/best-practices:

  • YAML frontmatter required: name (max 64 chars, [a-z0-9-]+, no "anthropic"/"claude"); description (max 1024 chars; must be third person — it's injected into a system prompt; "You can help with..." causes discovery problems, "Processes Excel files and generates reports" is correct).
  • Body ≤500 lines for performance; split into bundled files referenced from SKILL.md.
  • One level deep for file references: for nested references Claude may read only the first 100 lines (head -100) and miss content.
  • Naming convention: gerund form preferred (processing-pdfs, not pdf-helper).
  • Inside skill bodies, MCP tools must be fully qualified: ServerName:tool_name. Without the prefix Claude may fail to locate the tool.
  • Bundled scripts are more reliable than asking the agent to generate equivalent code each time. Pre-shipped validation scripts shine for "plan-validate-execute" patterns.
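
Several of these rules are mechanical enough to lint before upload. The sketch below checks the name, description, and body-length limits listed above; `lint_skill` is a hypothetical helper, and the third-person check is only a crude heuristic (it flags descriptions that open with "You", "I", or "We").

```python
import re

NAME_RE = re.compile(r"[a-z0-9-]+")
RESERVED_WORDS = ("anthropic", "claude")
FIRST_OR_SECOND_PERSON = re.compile(r"^\s*(you|i|we)\b", re.IGNORECASE)


def lint_skill(name, description, body):
    """Return a list of rule violations for one skill; an empty list means it passes."""
    problems = []
    if len(name) > 64 or not NAME_RE.fullmatch(name):
        problems.append("name must match [a-z0-9-]+ and be at most 64 chars")
    if any(word in name for word in RESERVED_WORDS):
        problems.append("name must not contain 'anthropic' or 'claude'")
    if not description or len(description) > 1024:
        problems.append("description must be 1-1024 chars")
    if FIRST_OR_SECOND_PERSON.match(description):
        problems.append("description should be third person")
    if len(body.splitlines()) > 500:
        problems.append("body exceeds 500 lines; split into bundled files")
    return problems
```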

Cross-surface gotcha

Custom skills do not sync between surfaces:

  • Claude.ai uploads (per-user, individual)
  • Claude API uploads (workspace-wide via /v1/skills)
  • Claude Code (filesystem-based, per-project at .claude/skills/)

Local kendo Claude Code skills (vue-vitest-testing, php-unit-test, etc.) cannot be copy-pasted. They use first/second person, reference Claude Code-only slash commands, and assume Claude Code's tool surface. Re-authoring is required.


5. Multiagent coordinator

A single agent can declare a roster of sub-agents:

```json
"multiagent": {
  "type": "coordinator",
  "agents": [
    { "type": "agent", "id": "agent_xxx" },
    { "type": "agent", "id": "agent_yyy", "version": 3 },
    { "type": "self" }
  ]
}
```

Constraints:

  • Max 20 unique agents in the roster
  • Max 25 concurrent threads in a session
  • Depth-1 only — coordinators cannot recurse

All sub-agents share the same container and filesystem. Each runs in its own session thread with isolated context (own conversation history, own model, own system prompt, own tools, own MCP servers, own skills). Threads are persistent — coordinator can send a follow-up to an earlier sub-agent and that sub-agent retains its full prior turns.

This maps cleanly to multi-stage workflows: e.g. one coordinator delegating to a Planner, Implementer, and Reviewer, each with a tighter system prompt and skill set than a single mega-agent could carry.
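
A minimal sketch of assembling such a roster while enforcing the 20-agent cap before the API does. `build_coordinator` is a hypothetical helper; counting the `self` entry toward the cap is a conservative assumption, since the constraint above does not say either way.

```python
MAX_ROSTER_AGENTS = 20


def build_coordinator(agent_refs, include_self=False):
    """Build a `multiagent` block from agent ids or (agent_id, version) tuples."""
    roster, seen = [], set()
    for ref in agent_refs:
        agent_id, version = ref if isinstance(ref, tuple) else (ref, None)
        if agent_id in seen:  # the cap is on unique agents, so drop repeats
            continue
        seen.add(agent_id)
        entry = {"type": "agent", "id": agent_id}
        if version is not None:
            entry["version"] = version  # pin; omit to float to latest
        roster.append(entry)
    if include_self:
        roster.append({"type": "self"})
    # ASSUMPTION: "self" counts toward the limit (conservative reading).
    if len(roster) > MAX_ROSTER_AGENTS:
        raise ValueError(f"roster has {len(roster)} agents, max is {MAX_ROSTER_AGENTS}")
    return {"multiagent": {"type": "coordinator", "agents": roster}}
```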


6. Outcomes (public beta as of 2026-05-06)

Define what "done" looks like, and the harness iterates the agent until the artifact passes a separate grader:

```json
{
  "type": "user.define_outcome",
  "description": "Build a DCF model for Costco in .xlsx",
  "rubric": { "type": "text", "content": "# DCF Model Rubric\n..." },
  "max_iterations": 5
}
```

Rubric is markdown with explicit per-criterion checks. Default max_iterations: 3, max 20. Rubric can be inline or uploaded via Files API for reuse.

The grader runs in a separate context window from the main agent, so it isn't biased by the agent's implementation choices. Result types:

| Result | Meaning |
| --- | --- |
| satisfied | Session transitions to idle |
| needs_revision | Agent starts a new iteration cycle with grader's per-criterion feedback |
| max_iterations_reached | No further evaluation; agent may run one final revision |
| failed | Rubric and description fundamentally contradict each other (rubric is unmappable to the task) |
| interrupted | Only emitted if outcome_evaluation_start already fired before a user.interrupt |

Span events (span.outcome_evaluation_start/ongoing/end) make the iteration loop observable from the event stream.

Only one outcome at a time per session, but outcomes can be chained sequentially. The session can also be continued conversationally after an outcome resolves — history is retained.
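
As an illustration of how an issue's acceptance criteria could become an outcome, the sketch below renders criteria into a markdown rubric and wraps it in a `user.define_outcome` event with the documented iteration bounds (default 3, max 20). Both helper names are hypothetical, the checklist layout is just one plausible rubric format, and the lower bound of 1 is an assumption.

```python
DEFAULT_MAX_ITERATIONS = 3
MAX_ITERATIONS_CAP = 20


def rubric_from_criteria(title, criteria):
    """Render per-criterion checks as a small markdown rubric."""
    lines = [f"# {title}", ""]
    lines += [f"- [ ] {criterion}" for criterion in criteria]
    return "\n".join(lines)


def define_outcome(description, rubric_markdown, max_iterations=DEFAULT_MAX_ITERATIONS):
    """Build a user.define_outcome event with an inline text rubric."""
    # ASSUMPTION: at least one iteration is required.
    if not 1 <= max_iterations <= MAX_ITERATIONS_CAP:
        raise ValueError(f"max_iterations must be between 1 and {MAX_ITERATIONS_CAP}")
    return {
        "type": "user.define_outcome",
        "description": description,
        "rubric": {"type": "text", "content": rubric_markdown},
        "max_iterations": max_iterations,
    }
```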


7. Events and streaming

Sessions are event-driven. Two directions:

User events (you send): user.message, user.interrupt, user.custom_tool_result, user.tool_confirmation, user.define_outcome.

Agent/session/span events (you receive — full list quoted from the docs):

| Domain | Types |
| --- | --- |
| Agent | agent.message, agent.thinking, agent.tool_use, agent.tool_result, agent.mcp_tool_use, agent.mcp_tool_result, agent.custom_tool_use, agent.thread_context_compacted, agent.thread_message_received (multiagent), agent.thread_message_sent (multiagent) |
| Session | session.status_running, session.status_idle, session.status_rescheduled, session.status_terminated, session.error, session.thread_created, session.thread_status_running, session.thread_status_idle, session.thread_status_terminated |
| Span | span.model_request_start, span.model_request_end, span.outcome_evaluation_start, span.outcome_evaluation_ongoing, span.outcome_evaluation_end |

Critical for handling long-running sessions:

  • Open the SSE stream before sending the kickoff user.message to avoid race-conditioned event loss
  • To reconnect cleanly: open new stream, list past events for dedup IDs, then tail
  • Token usage rolls up on the session object (input_tokens, output_tokens, cache_creation_input_tokens, cache_read_input_tokens); 5-minute prompt cache TTL
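
The reconnect rule above amounts to "replay history, then tail, skipping anything already seen". A minimal sketch of that de-duplication, with the transport left out: `past_events` and `live_events` stand in for the list-events response and the SSE stream, and the `id` field on each event is an assumption.

```python
from itertools import chain


def replay_then_tail(past_events, live_events, handled_ids=()):
    """Yield each event exactly once: listed history first, then the live stream."""
    seen = set(handled_ids)  # ids already processed before the disconnect
    for event in chain(past_events, live_events):
        if event["id"] in seen:
            continue  # overlap between history and the freshly opened stream
        seen.add(event["id"])
        yield event
```

Because the new stream is opened before history is listed, the two sources may overlap; the `seen` set is what makes that overlap harmless.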

8. Vaults

```text
# Create vault
POST /v1/vaults { display_name, metadata }

# Add credential
POST /v1/vaults/{vault_id}/credentials
  auth: {
    type: "mcp_oauth" | "static_bearer",
    mcp_server_url: "https://...",
    access_token: "...",
    expires_at: "...",
    refresh: { token_endpoint, client_id, refresh_token, token_endpoint_auth }
  }

# Reference at session creation
POST /v1/sessions { agent, environment_id, vault_ids: ["vlt_..."] }
```

OAuth refresh is handled by Anthropic. Secret fields (token, access_token, refresh_token, client_secret) are write-only; never returned in API responses. mcp_server_url is immutable per credential — to point at a different server, archive the credential and create a new one.

Vaults are workspace-scoped. The docs warn explicitly: anyone with API key access can use them for authorizing an agent. Revoke by archive or delete.

When OAuth refresh fails, a vault_credential.refresh_failed webhook fires; diagnose via POST /v1/vaults/{vault_id}/credentials/{cred_id}/mcp_oauth_validate which returns valid | invalid | unknown plus the failed MCP handshake step.
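
The per-vault limits from § 1 (one credential per mcp_server_url, max 20 per vault) are cheap to check client-side before calling the API. A sketch, with `check_new_credential` as a hypothetical helper that returns a reason string or `None`:

```python
MAX_CREDENTIALS_PER_VAULT = 20


def check_new_credential(existing_server_urls, new_server_url):
    """Return None if the credential can be added, else the reason it cannot."""
    if new_server_url in existing_server_urls:
        # mcp_server_url is immutable, so the old credential must be archived first.
        return "duplicate_server_url"
    if len(existing_server_urls) >= MAX_CREDENTIALS_PER_VAULT:
        return "vault_full"
    return None
```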


9. GitHub access (the canonical path)

Anthropic documents a specific flow at /managed-agents/github:

```json
// On the Agent: declares the GitHub MCP, no token at this stage
{
  "mcp_servers": [{
    "type": "url",
    "name": "github",
    "url": "https://api.githubcopilot.com/mcp/"
  }],
  "tools": [
    { "type": "agent_toolset_20260401" },
    { "type": "mcp_toolset", "mcp_server_name": "github" }
  ]
}

// On the Session: mounts the repo with auth pre-wired
{
  "agent": "...",
  "environment_id": "...",
  "resources": [{
    "type": "github_repository",
    "url": "https://github.com/org/repo",
    "mount_path": "/workspace/repo",
    "authorization_token": "ghp_..."
  }]
}
```

The flow per the docs: "edit files in the mounted repo → push branch via bash → create PR via the MCP create_pull_request tool".

Multiple repositories can be mounted by adding entries to the resources array. Repos are cached across sessions sharing them — second hand-off to a previously-mounted repo starts faster.

The GH token can be rotated mid-session via PATCH /v1/sessions/{session_id}/resources/{resource_id} with a new authorization_token. Useful for long-lived sessions where short-lived GH App installation tokens expire.
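
A sketch of the rotation decision for 1-hour GitHub App installation tokens: rotate once the token is within a safety margin of expiry, then describe the PATCH to send. Both helper names and the 10-minute margin are hypothetical, and the request body shape (a bare `authorization_token` field) is an assumption based on the sentence above.

```python
from datetime import datetime, timedelta, timezone

INSTALLATION_TOKEN_TTL = timedelta(hours=1)  # GitHub App installation token lifetime


def needs_rotation(minted_at, now, margin=timedelta(minutes=10)):
    """True once the token is within `margin` of its expiry."""
    return now >= minted_at + INSTALLATION_TOKEN_TTL - margin


def rotation_request(session_id, resource_id, new_token):
    """Describe the PATCH call that swaps in a freshly minted token."""
    return {
        "method": "PATCH",
        "path": f"/v1/sessions/{session_id}/resources/{resource_id}",
        # ASSUMPTION: the body carries only the new token.
        "json": {"authorization_token": new_token},
    }
```

A scheduled job could call `needs_rotation` for each active session and mint a new token only when it returns true.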

Token permission table from the docs:

| Action | Required scopes |
| --- | --- |
| Clone private repos | repo |
| Create PRs | repo |
| Read issues | repo (private) or public_repo |
| Create issues | repo (private) or public_repo |

Fine-grained personal access tokens are explicitly recommended over broad-access tokens.


10. Webhooks

Webhooks fire for major state changes. Register endpoints at Console → Manage → Webhooks; Anthropic generates a whsec_-prefixed signing secret shown only once.

Session events: session.status_run_started, session.status_idled, session.status_rescheduled, session.status_terminated, session.thread_created, session.thread_idled, session.thread_terminated, session.outcome_evaluation_ended.

Vault events: vault.created, vault.archived, vault.deleted, vault_credential.created, vault_credential.archived, vault_credential.deleted, vault_credential.refresh_failed.

Verification: the SDK's unwrap() helper checks the X-Webhook-Signature header and rejects payloads more than 5 minutes old. Webhook payloads carry the event type and ID, not the full object, so the handler must fetch the object via GET before acting on it.

Delivery semantics: ordering is not guaranteed; retries are at-least-once with the same event.id (dedupe accordingly); a 3xx response counts as failure (redirects not followed); endpoints auto-disable after ~20 consecutive failures or immediately if the hostname resolves to a private IP.
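
Given at-least-once delivery with a stable event.id, the handler needs to be idempotent. A minimal sketch, assuming signature verification has already happened upstream (for example via the SDK's unwrap()): `fetch_object` and `handle` are hypothetical callables, and the in-memory set stands in for whatever durable store (such as a unique index) production code would use.

```python
def process_delivery(event, seen_ids, fetch_object, handle):
    """Process one verified webhook delivery at most once per event id."""
    if event["id"] in seen_ids:
        return "duplicate"  # a retry of something already handled
    # Payloads are thin: fetch the full object before acting on it.
    obj = fetch_object(event)
    handle(event["type"], obj)
    # Record the id only after success, so a failed attempt is retried.
    seen_ids.add(event["id"])
    return "handled"
```

Since ordering is not guaranteed, `handle` should act on the state of the fetched object and not on the order in which deliveries arrive.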


11. Beta status, pricing, branding

  • All endpoints require the managed-agents-2026-04-01 beta header. SDKs set it automatically.
  • Outcomes and multiagent are public beta (updated 2026-05-09). Both were originally research preview behind a request-access form; at Code with Claude 2026 (May 6-8), Anthropic moved them into public beta, available to all developers on the standard managed-agents-2026-04-01 beta header. No access form needed. The overview page has a stale "research preview / request access" note; the feature-specific pages (/managed-agents/define-outcomes, /managed-agents/multi-agent) drop any gating language. The current research-preview slot is dreaming (sessions that learn from past sessions; Harvey reported ~6× task-completion gains). Dreaming is not a dependency for any of the three sketched applications.
  • Rate limits per organization: 300 req/min on creates, 600 req/min on reads. Org-level spend limits and tier rate limits apply on top.
  • Branding rules for partners: "Claude Agent" is the preferred surface. Forbidden: any framing that implies the partner product is Claude Code or Claude Cowork.

Pricing is not published in the docs as of this read; usage shows up on the session via standard token-counting fields.


12. Three plausible kendo applications

A. Issue-quality analyzer (smallest scope)

What it is: sidebar card on the issue show page that runs a 4-criteria rubric on the issue body — clear acceptance criteria, bounded scope, no design ambiguity, no cross-cutting concerns. Returns per-criterion verdict; gates a "Hand to Claude" button or just surfaces a quality score.

Implementation shape: single Action on top of existing AiService::generateStructured() (backend/app/Services/AiService.php). One Messages API call via laravel/ai's Anthropic driver. ~3-5 days of work.

Does NOT need Managed Agents. A Messages API structured-output call is the right tool for this job. Useful as a feature, weak as a CMA testbed.

B. Autonomous issue execution ("Hand to Claude")

What it is: button on the issue sidebar that hands an issue to a Managed Agents session. The agent plans, implements, runs tests, and opens a PR — autonomously. Outcome rubric = the issue's acceptance criteria. Webhook on session.outcome_evaluation_ended posts a comment back to kendo. Existing kendo GitHub webhook moves the issue to Review when the PR opens.

Implementation shape: significant. New Service (ManagedAgentsService), several new Actions (HandIssueToClaudeAction, HandleSessionWebhookAction, RunIssueEligibilityCheckAction), new Model (ClaudeSession), new Controller + Middleware (AnthropicWebhookController + signature verification), new Job, new Audit logger, new Pennant feature, frontend card + composable, custom skills uploaded to our Anthropic org.

Multi-week effort. Highest user-visible payoff, biggest blast radius, most CMA primitives exercised. Worth doing only after CMA experience is established.

C. Story-generation harness migration

What it is: port the existing 5-phase StoryGenerationHarnessAction (backend/app/Actions/Agent/StoryGenerationHarnessAction.php) — currently Messages-API-on-laravel/ai with AgentToolFactory-built tools — onto Managed Agents.

Implementation shape options:

  1. Whole harness, big-bang: replace the orchestrating Action with a session create-and-stream. Keep the controller + Vue UX; only the layer below changes.
  2. One-phase migration (recommended testbed): migrate just ResearchAction (the multi-turn codebase-exploration phase). The other 4 phases (ValidateInputAction, DuplicateCheckAction, ClassifyAction, WriteAction) are essentially structured-output calls that Messages API handles fine. CMA's mounted-repo + bash + grep is genuinely better than tool-calling Messages API for codebase exploration.
  3. Parallel A/B: keep the current harness, add a second pipeline behind AiStoryGenerationCma Pennant flag. Compare quality + latency + cost on real workload before retiring the old path.

Tools mismatch: app/Agent/Tools/*Tool.php implements the laravel/ai Tool contract, not MCP. Three options for crossing the boundary:

  1. Re-expose the eight existing tools as MCP tools on kendo-script (~8 new tools, several have natural MCP analogues already).
  2. Have the CMA agent use bash + repo mount + grep instead of the abstractions.
  3. Wrap them as Custom Tools (agent emits agent.custom_tool_use, our backend executes the existing Tool class).

Strategic value: real production workload, smaller blast radius than (B). Highest information density per dollar of refactor for "is CMA worth betting (B) on?"


13. Kendo-specific implications

  • laravel/ai covers Messages API only, not Managed Agents. CMA integration requires direct HTTP calls (Laravel's Http:: facade) or the official Anthropic PHP SDK if it ships Managed Agents support. New App\Services\ManagedAgentsService lives in the Services deptrac layer.
  • kendo-script MCP transport: ✅ Verified — streamable HTTP, Anthropic-compatible. Mcp::web('/mcp/kendo', KendoServer::class) in backend/routes/ai.php:15 registers the laravel/mcp HttpTransport (vendor/laravel/mcp/src/Server/Transport/HttpTransport.php), which implements MCP spec 2025-06-18 streamable HTTP verbatim — the source even cites https://modelcontextprotocol.io/specification/2025-06-18/basic/transports#sending-messages-to-the-server (line 70) and 2025-11-25/basic/transports#listening-for-messages-from-the-server (Registrar.php:34). Live probe of https://script.kendo.dev/mcp/kendo (2026-05-09) confirms: POST → 401 with WWW-Authenticate: Bearer realm="mcp", resource_metadata="..."; GET/DELETE → 405 with Allow: POST; /.well-known/oauth-protected-resource/mcp/kendo returns {resource, authorization_servers, scopes_supported: [mcp:use]} (RFC 9728); /.well-known/oauth-authorization-server returns issuer + authorize/token/register endpoints with S256 PKCE and authorization_code + refresh_token grants (RFC 8414). This is exactly the shape Anthropic's mcp_oauth vault credential type expects — no transport blocker for attaching kendo-script MCP to a Managed Agents session.
  • GitHub repo-mount auth — reuse the existing GitHub App installation, no per-repo key: kendo connects to GitHub via a per-tenant GitHub App installation (GithubInstallation central-DB model, one installation_id per tenant). Linked repos are ProjectGithubRepo rows with repo_full_name ("owner/repo"). GithubAppService::getInstallationToken(int $installationId): string (backend/app/Services/GithubAppService.php:30) mints a fresh 1-hour installation access token via JWT-signed POST /app/installations/{id}/access_tokens. The same token covers every repo that installation has access to — there is no separate per-repo key. Anthropic's github_repository resource (authorization_token field — see § 9) consumes this directly. CMA wiring: (1) resolve tenant → GithubInstallation.installation_id, (2) mint token via getInstallationToken(), (3) pass it as authorization_token for each resources[] entry (same token for multi-repo sessions). Token TTL is 1 hour — long-running sessions need PATCH /v1/sessions/{id}/resources/{resource_id} rotation before expiry (see § 9). The App already holds pull_requests: write and contents: write permissions (proven by existing createCheckRun / createPrComment calls in GithubAppService), which covers Anthropic's required scopes (repo for clone + PR per § 9 token-permission table). Repo-selection precondition: a user must have granted the App access to the repo at install time — already enforced by ProjectGithubRepo UX, no new infra.
  • Existing AI infra is reusable: AiOutboundLogger (hash-chained per ADR-0003) already covers token + status + error logging — Managed Agents calls would log here too. AgentProgressEvent + private Echo channel Tenant.{tenantId}.App.Models.User.{userId} already handles client-side streaming.
  • Pennant pattern is established: features as classes in app/Features/<Pascal>.php with #[Name('kebab-name')] attribute and resolve(): bool (default-off). Per-tenant scoped via Feature::for($tenant)->active(...). Frontend bridge exposes flags via ProfileResourceData; useFeatureActive('flagName') composable on the Vue side.
  • GitHub webhook intake pattern: GithubWebhookController + VerifyGithubWebhook middleware (HMAC X-Hub-Signature-256) + ProcessPullRequestWebhookJob ($tries=3, backoff [30, 60, 120]). An Anthropic webhook handler would mirror this with X-Webhook-Signature and 5-minute freshness check.
  • Audit logging: ADR-0001 mandates append-only hash-chained logs for all auditable mutations. CMA session lifecycle events (created, completed, failed, terminated) would write through a new ClaudeSessionAuditLogger mirroring IssueAuditLogger etc.
  • Strategic context (KD-0390): kendo previously had an in-house "AI bot assignment" feature — app/Actions/AiRun/*, App\Models\AiRun, AiRunWorkflowJob. Removed on 2026-04-25 ("not on roadmap, if we ever want it we'll rebuild it"). Any new autonomous-agent feature is a deliberate revisit of that decision, on a different shape (hosted runtime, not in-house orchestration).

14. Open questions

  • Pricing. Beta docs do not publish per-token, per-hour, or per-session-creation overhead. Container boot cost — is it billed? Need to check via small experiments before scoping cost guardrails.
  • First-call latency. Container boot + session create vs. one Messages API call. Likely meaningful regression for short-lived workloads (story-gen). Need a benchmark, not a guess.
  • Streamable HTTP transport on kendo-script MCP. Resolved 2026-05-09 — verified in source (vendor/laravel/mcp/src/Server/Transport/HttpTransport.php, vendor/laravel/mcp/src/Server/Registrar.php) and against the live endpoint. Streamable HTTP per MCP spec 2025-06-18 + 2025-11-25, with full RFC 9728 / RFC 8414 OAuth discovery. See section 13.
  • Vault-per-user vs single bot vault. Per-user means audit trails attribute the work to the kicking-off Member; single bot vault is simpler but loses attribution granularity.
  • failed outcome semantics. If the rubric grader's failed result fires (rubric ↔ description contradicts), what does kendo do? Auto-comment + unassign Claude is the obvious answer but worth confirming end-state expectations before building.
  • Custom-skill authoring scope. If we go down the multiagent path: which kendo skills get re-authored for the API? Minimum viable is probably planner + implementer + reviewer + vue-tests + php-tests (5 skills). Authoring is third-person, ≤500 lines each, with bundled validation scripts where useful — not trivial.
  • Outcomes/multiagent gating. Resolved 2026-05-09: both moved to public beta at Code with Claude 2026 (May 6-8). No gating, no form needed. See § 11. (Originally both were research preview behind a request-access form, and the plan was to file the request early.)

15. Decision points if we proceed

If the goal is a useful feature shipped this sprint: Application A (analyzer) on Messages API. Skip CMA. Cheap, clean, valuable to the team.

If the goal is to learn CMA on a real workload before betting bigger features on it: Application C, slice 2 (migrate just ResearchAction). Highest CMA learning per scope unit. Production workload, small blast radius.

If the goal is the flagship autonomous-execution feature: Application B, but only after Application C has put CMA infra in production. Otherwise too many new things in flight at once.

Avoid: starting with B without prior CMA experience; or "do all three at once".


16. Resolved decisions (2026-05-09)

Status: superseded — ResearchAction migration cancelled 2026-05-09. PR #1101 closed without merging. The 8 decisions below are preserved as the historical record of what was decided at the time; they are no longer the project's path. The post-implementation production-telemetry correction (cost gate ✅, latency gate ❌ against real production p95) made the cost-vs-UX-regression tradeoff fail. Decision #8's premise (Application B is parked until ResearchAction soaks ≥ 2 weeks in production) cannot be met because ResearchAction never went to production — Application B is therefore re-opened to first-class consideration. Any future CMA work for kendo must bake in the methodology rules from ./2026-05-09-production-telemetry-correction.md before scoping: query the audit log for the real workload distribution, separate cost claims (single-prompt-OK) from latency claims (distribution-required), and pre-register the AC's "current" baseline against a documented dataset.

Settled in a planning round on 2026-05-09 between CEO and parent agent. These commit the project to a specific path; future planning (/plan-feature for the migration, then for Application B) starts from these.

| # | Decision | Detail |
|---|---|---|
| 1 | Next concrete CMA work | Migrate ResearchAction (slice 2 of Application C), the multi-turn codebase-exploration phase of app/Actions/Agent/StoryGenerationHarnessAction.php. The other 4 phases (ValidateInputAction, DuplicateCheckAction, ClassifyAction, WriteAction) stay on Messages API: they're structured-output calls; CMA adds session-creation overhead with no benefit. |
| 2 | No Pennant flag on ResearchAction | ResearchAction migration is internal infra, not a user-visible feature. Pennant pattern (app/Features/<Pascal>.php) is reserved for HandOffToClaude (Application B), which is per-tenant gated. |
| 3 | Pre-migration benchmark spike | Run a production-prompt benchmark spike before scoping the migration. Closes the cost/latency open questions on the real prompt shape, not the 3 synthetic prompts the first spike used. |
| 4 | Spike scaffold location | Extend the existing ~/Code/cma-spike/ Python scaffolding with a new prompts/ file containing the actual ResearchAction system prompt + 3-5 sampled story-gen inputs. Reuses the same compare.py harness so numbers stay comparable to the first spike. Standalone and throwaway. |
| 5 | Spike pass/fail thresholds | Migration ships only if both hold: cost ratio ≥ 2× cheaper than the current Messages-API ResearchAction and p95 latency ≤ 1.5× current. The 19× ratio from the first spike gives huge margin; demanding 2× on real prompts is conservative. Latency gate stops UX regression even if cost is fantastic. |
| 6 | Rollback strategy | Replace outright. Trust the benchmark spike as the gate. If CMA misbehaves in production, rollback via git revert + redeploy. No config flag, no shadow mode, no A/B safety net beyond the deploy cycle. |
| 7 | Outcomes + multiagent gating | Not a decision: both moved to public beta at Code with Claude 2026 (May 6-8). Available now on the standard managed-agents-2026-04-01 beta header. See § 11. |
| 8 | Application B (Hand to Claude) timing | Park until the ResearchAction migration has shipped and run in production for ≥ 2 weeks. Then scope via /plan-feature. The CMA infra (ManagedAgentsService, ClaudeSession, AnthropicWebhookController, GitHub-installation-token rotation) gets exercised on the low-blast-radius ResearchAction first. |
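
The two thresholds in decision #5 can be written as a single predicate. The following is a minimal sketch, not code from the spike harness: the names `evaluate_gates` and `GateResult` are hypothetical, and the inputs in the usage lines are illustrative values, not measurements.

```python
from dataclasses import dataclass

# Thresholds from decision #5.
MIN_COST_RATIO = 2.0   # CMA must be at least 2x cheaper than the current path
MAX_P95_RATIO = 1.5    # CMA p95 latency may be at most 1.5x the current p95


@dataclass(frozen=True)
class GateResult:
    cost_ratio: float    # current cost / CMA cost (higher is better)
    p95_ratio: float     # CMA p95 / current p95 (lower is better)
    cost_ok: bool
    latency_ok: bool

    @property
    def ships(self) -> bool:
        # The migration ships only if BOTH gates pass.
        return self.cost_ok and self.latency_ok


def evaluate_gates(current_cost: float, cma_cost: float,
                   current_p95_s: float, cma_p95_s: float) -> GateResult:
    """Apply the cost and latency gates to one pair of measurements."""
    if min(current_cost, cma_cost, current_p95_s, cma_p95_s) <= 0:
        raise ValueError("all measurements must be positive")
    cost_ratio = current_cost / cma_cost
    p95_ratio = cma_p95_s / current_p95_s
    return GateResult(
        cost_ratio=cost_ratio,
        p95_ratio=p95_ratio,
        cost_ok=cost_ratio >= MIN_COST_RATIO,
        latency_ok=p95_ratio <= MAX_P95_RATIO,
    )


# Illustrative inputs only (not measured values).
fast_and_cheap = evaluate_gates(1.00, 0.40, current_p95_s=20.0, cma_p95_s=28.0)
cheap_but_slow = evaluate_gates(1.00, 0.05, current_p95_s=20.0, cma_p95_s=60.0)
```

Keeping the two verdicts separate on the result makes it visible when a run is cheap enough but too slow, which is the case the latency gate exists to catch.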

What's still benchmarked-not-decided

The two cost/latency questions in § 14 are owned by the pre-migration spike (decision #3): the pricing model and first-call-latency numbers come out of running it. Application-B-specific decisions (vault model, failed-outcome semantics, custom-skill authoring scope) are deliberately deferred until B gets planned per decision #8.

Update 2026-05-09: production-prompt benchmark ran, migration partially green-lit

The pre-migration benchmark spike from decision #3 ran the same day. Full writeup in ./2026-05-08-managed-agents-spike.md ("Production-prompt benchmark (2026-05-09)") — short version:

  • Prompt 1 (Sonnet, both paths, clean completion): Messages API $11.51 / 483s / 118 tool calls, CMA $0.56 / 180s / 36 tool calls. 20.7× cheaper, 2.7× faster, 3.3× fewer tool calls on this single prompt.
  • Prompts 2-4: harness reliability issues (session-timeout edge cases, user.interrupt not propagating mid-bash-tool, full agent_toolset_20260401 causing the agent to go off-script and edit files when its system prompt said "research only"). Did not produce additional clean datapoints.
  • Migration green-lit at the time on prompt 1 alone. The reasoning was "single-datapoint gap is so large that statistical noise can't close it." That reasoning was wrong on the latency axis; see the post-implementation update below. Decision #1 (ResearchAction migration) and #6 (replace outright, rollback via git revert) still stand on the cost case.
  • Spike total spend ~$16, dominated by the single Messages-API run ($11.51) — exactly the cost the migration removes.
  • Harness lessons that flow into the migration plan: production ManagedAgentsService should expose only read-only tools (no bash/write/edit); long-running CMA sessions need server-side deadlines because user.interrupt is unreliable mid-tool-call; bound work via outcome rubrics rather than interrupts.
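
The server-side deadline lesson can be sketched as plain bookkeeping on our side, independent of any Managed Agents call. This is a minimal sketch under stated assumptions: `TrackedSession` and `overdue_sessions` are hypothetical names, and what to do with an overdue session (which API call stops or archives it) is deliberately left out because this document does not specify it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TrackedSession:
    """Local record of a running session (hypothetical shape)."""
    session_id: str
    started_at: datetime      # timezone-aware, UTC
    max_runtime: timedelta    # budget fixed when the session is created

    @property
    def deadline(self) -> datetime:
        return self.started_at + self.max_runtime


def overdue_sessions(sessions: list[TrackedSession],
                     now: datetime) -> list[TrackedSession]:
    """Return sessions whose deadline has passed as of `now`."""
    return [s for s in sessions if now >= s.deadline]


# Usage: a periodic sweeper would call this and act on the result.
start = datetime(2026, 5, 9, 12, 0, tzinfo=timezone.utc)
sessions = [
    TrackedSession("research-a", start, timedelta(minutes=5)),
    TrackedSession("research-b", start, timedelta(minutes=30)),
]
overdue_ids = [
    s.session_id
    for s in overdue_sessions(sessions, now=start + timedelta(minutes=10))
]
```

Fixing the budget at creation time means the cutoff does not depend on the agent acknowledging an interrupt mid-tool-call.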

Update 2026-05-09 (post-implementation): production telemetry contradicts the spike's wall-clock claim

After the migration code was implementation-complete, a query against prod-issue-tracker.ai_outbound_logs (n=36 over 30 days, Script tenant, feature='agent_research' AND status='success') showed production Messages-API ResearchAction runs at 19 s median / 28 s p95, not the 483 s the spike's headline implied. The spike's "production-shape" prompt ran roughly 17× longer than the real production p95 (483 s vs 28 s).

Verdict by gate (decision #5):

| Gate | Verdict |
|---|---|
| Cost ≥ 2× cheaper | ✅ Likely passes by ~50× margin (Haiku ~$0.12/run vs $11.51 spike; even against real production cost the gate trivially clears). Token-volume on a representative single prompt characterises cost plausibly; this part of the spike's reasoning was sound. |
| p95 latency ≤ 1.5× current | ❌ Likely fails. Production p95 is 28 s; CMA smokes ran 125 s (Haiku) and 191 s (Sonnet), 4–7× the production p95. |

Methodology lesson: when the feature being replaced has Channel-1 audit logs, the spike must include a SELECT over the audit table for the response_time / token-volume distribution. Single-datapoint benchmarks are fine for cost (token volume on a representative prompt is plausibly representative) but not for latency (which depends on workload distribution that can't be inferred from one prompt). Full postmortem with methodology rules and Application-B implications: ./2026-05-09-production-telemetry-correction.md.
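
The audit-table check described above might look like the following sketch. It assumes a duration column named `response_time_s` (hypothetical; the real column name in ai_outbound_logs is not given here), uses an in-memory SQLite table as a stand-in for the production database, and runs on synthetic rows rather than real telemetry. The `feature` and `status` filters mirror the ones quoted earlier in this section.

```python
import sqlite3
import statistics


def nearest_rank_percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100   # integer ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]


def latency_distribution(conn: sqlite3.Connection, feature: str) -> dict:
    """Summarise successful-run latency for one feature."""
    rows = conn.execute(
        "SELECT response_time_s FROM ai_outbound_logs "
        "WHERE feature = ? AND status = 'success'",
        (feature,),
    ).fetchall()
    samples = [row[0] for row in rows]
    return {
        "n": len(samples),
        "median": statistics.median(samples),
        "p95": nearest_rank_percentile(samples, 95),
    }


# Synthetic stand-in data: 20 successful runs of 1..20 s, plus rows the
# filter must exclude (a failed run and a different feature).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ai_outbound_logs "
    "(feature TEXT, status TEXT, response_time_s REAL)"
)
conn.executemany(
    "INSERT INTO ai_outbound_logs VALUES (?, ?, ?)",
    [("agent_research", "success", float(s)) for s in range(1, 21)]
    + [("agent_research", "error", 500.0), ("other_feature", "success", 900.0)],
)
summary = latency_distribution(conn, "agent_research")
```

Reporting `n` alongside the median and p95 keeps the sample size attached to the baseline, which is what pre-registering the "current" figure against a documented dataset requires.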

The KD-0650 PR (#1101 in the kendo repo) is open with this regression captured. Decision pending: ship-as-is, side-by-side dispatcher, abandon, or redesign with max_iterations cap on Haiku.


References

All quotes and schemas in this doc are from the Anthropic Managed Agents documentation at https://platform.claude.com/docs/en/managed-agents/* and the Agent Skills documentation at https://platform.claude.com/docs/en/agents-and-tools/agent-skills/*. Pages read in full as of 2026-05-08.