Anthropic's Managed Agents API is a hosted agent runtime — Anthropic provisions the container, runs the agent loop, executes tool calls, and persists events server-side. We define what the agent is and what it runs in; Anthropic does the rest. Right fit for autonomous coding work and long-running tool-using tasks where we don't want to operate our own sandbox.
Recommendation: Don't pick the application before deciding the goal. Shipping a feature this sprint → Application A on Messages API. Learning CMA on a real workload → Application C, slice 2 (just ResearchAction). Avoid B as v1 — too much new infra in flight at once.
AiService::generateStructured(). Doesn't need CMA — Messages API is the right tool. Useful as a feature, weak as a CMA testbed.
ResearchActionStoryGenerationHarnessAction onto Managed Agents. CMA's mounted-repo + bash + grep is genuinely better than tool-calling Messages API for codebase exploration. Real production workload, smallest informative slice. Highest CMA learning per scope unit.
session.outcome_evaluation_ended posts back to kendo. Largest payoff, biggest blast radius. Worth doing only after CMA experience is established.
ResearchAction. Production workload, small blast radius.
session.status_idled, session.outcome_evaluation_ended, session.thread_*. No polling needed.
https://api.githubcopilot.com/mcp/ is alive and documented, not archived. (The archived one is a separate community implementation.)
gh CLI is not required. The github_repository resource mounts the repo with git auth pre-wired. Native flow: edit files in mounted repo → git push via bash → create PR via MCP create_pull_request tool.
ServerName:tool_name.
managed-agents-2026-04-01 on every endpoint
idle → running → terminated. Events persist until session deletion. Container checkpoints expire after 30 days of inactivity.
mcp_server_url per vault, max 20 credentials per vault. OAuth refresh handled by Anthropic.
github_repository: mounts a GitHub repo into the container with git auth pre-wired.
PATCH /v1/sessions/{id}/resources/{rid}.
Python 3.12+, Node 20+, Go 1.22+, Rust 1.77+, Java 21+, Ruby 3.3+, PHP 8.3+, C/C++ GCC 13+
git, curl, wget, jq, tar, zip, unzip, ssh, scp, tmux, screen, make, cmake, docker, ripgrep (rg), tree, htop, sed, awk, grep, vim, nano, diff, patch
Notable absence: gh CLI is not pre-installed. Add via packages.apt: ["gh"] on the environment — Anthropic pre-installs once and caches across sessions.
bash, read, write, edit, glob, grep, web_fetch, web_search
configs[], or default-disable everything with default_config.enabled: false + per-tool allowlist.
mcp_servers on the agent + expose via {type: "mcp_toolset", mcp_server_name: "..."}. Streamable HTTP only, no stdio. Auth flows through vault credentials at session creation. Invalid creds → session.error event, retries on next idle→running.
agent.custom_tool_use → session.status_idle with stop_reason.type === "requires_action". Backend executes, then posts user.custom_tool_result. Permission policies do not apply; your application decides.
{ "type": "custom", "name": "get_weather", "description": "Get current weather for a location", "input_schema": { "type": "object", "properties": {...}, "required": [...] } }
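The backend half of the custom-tool round trip described above can be sketched as follows. This is a minimal illustration, assuming events arrive as plain dicts. Only the two event type names come from these notes; the fields `id`, `name`, `input`, `custom_tool_use_id`, `content`, `is_error` and the `HANDLERS` registry are hypothetical.

```python
from typing import Any, Callable

# Hypothetical registry: tool name -> local handler. Not part of any SDK.
HANDLERS: dict[str, Callable[[dict[str, Any]], str]] = {
    "get_weather": lambda args: f"Sunny in {args['location']}",
}


def build_tool_result(event: dict[str, Any]) -> dict[str, Any]:
    """Turn an agent.custom_tool_use event into a user.custom_tool_result event.

    Every field name other than the two event types is an assumption.
    """
    if event.get("type") != "agent.custom_tool_use":
        raise ValueError(f"unexpected event type: {event.get('type')!r}")
    handler = HANDLERS.get(event["name"])
    if handler is None:
        # Report unknown tools back to the agent instead of crashing the worker.
        content, is_error = f"unknown tool: {event['name']}", True
    else:
        content, is_error = handler(event["input"]), False
    return {
        "type": "user.custom_tool_result",
        "custom_tool_use_id": event["id"],  # assumed correlation field
        "content": content,
        "is_error": is_error,
    }
```

Because permission policies do not apply to custom tools, any allow/deny decision would have to live inside the handlers themselves.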
name (≤64 chars, [a-z0-9-]+, no "anthropic"/"claude"); description (≤1024 chars)
head -100s nested references
processing-pdfs, not pdf-helper
ServerName:tool_name
Custom skills do not sync between surfaces:
/v1/skills)
.claude/skills/
Local kendo Claude Code skills (vue-vitest-testing, php-unit-test) cannot be copy-pasted. They use first/second person and reference Claude Code-only slash commands.
xlsx, docx, pptx, pdf
Use as {type: "anthropic", skill_id: "xlsx"}. Custom org-uploaded skills support version pinning ("latest" or specific version).
Skill metadata (~100 tokens per skill from YAML frontmatter) is always loaded into the system prompt. The SKILL.md body (≤5K tokens) loads only when the agent decides the skill is relevant. Bundled scripts execute via bash; their source code never enters context, only the script's stdout/stderr.
Max 20 skills per session (across all sub-agents in multiagent).
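The skill limits listed above (name length and character set, reserved words, description length, 20 skills per session) lend themselves to a pre-upload check. The sketch below is one possible shape; the function names are hypothetical and the reserved-word test is a plain substring match, which is an assumption about how the rule is enforced.

```python
import re

NAME_RE = re.compile(r"[a-z0-9-]+")
RESERVED = ("anthropic", "claude")
MAX_SKILLS_PER_SESSION = 20


def skill_problems(name: str, description: str) -> list[str]:
    """Return human-readable violations for a single skill (empty list = ok)."""
    problems = []
    if len(name) > 64:
        problems.append("name longer than 64 chars")
    if not NAME_RE.fullmatch(name):
        problems.append("name must match [a-z0-9-]+")
    if any(word in name for word in RESERVED):
        problems.append("name contains a reserved word")
    if len(description) > 1024:
        problems.append("description longer than 1024 chars")
    return problems


def session_skill_problems(skills: list[tuple[str, str]]) -> list[str]:
    """Check a whole session's (name, description) pairs, including the count limit."""
    problems = []
    if len(skills) > MAX_SKILLS_PER_SESSION:
        problems.append(f"{len(skills)} skills exceeds the limit of {MAX_SKILLS_PER_SESSION}")
    for name, description in skills:
        problems += [f"{name}: {p}" for p in skill_problems(name, description)]
    return problems
```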
A single agent declares a roster of sub-agents. All sub-agents share the same container and filesystem. Each runs in its own session thread with isolated context — own conversation history, model, system prompt, tools, MCP servers, skills.
Threads are persistent — coordinator can send a follow-up to an earlier sub-agent and that sub-agent retains its full prior turns. Maps cleanly to multi-stage workflows: Planner → Implementer → Reviewer.
"multiagent": { "type": "coordinator", "agents": [ { "type": "agent", "id": "agent_xxx" }, { "type": "agent", "id": "agent_yyy", "version": 3 }, { "type": "self" } ] }
The harness iterates the agent until the artifact passes a separate grader. Rubric is markdown with explicit per-criterion checks. Grader runs in a separate context window from the main agent — isn't biased by the agent's implementation choices.
{ "type": "user.define_outcome", "description": "Build a DCF model for Costco in .xlsx", "rubric": { "type": "text", "content": "# DCF Model Rubric\n..." }, "max_iterations": 5 }
Default max_iterations: 3, max 20. Rubric inline or via Files API. Span events span.outcome_evaluation_* make iteration loop observable. One outcome at a time per session, but outcomes can be chained sequentially.
| Result | What happens |
|---|---|
| satisfied | Session transitions to idle |
| needs_revision | New iteration with grader's per-criterion feedback |
| max_iterations_reached | No further evaluation; one final revision |
| failed | Rubric ↔ description fundamentally contradict |
| interrupted | Only if eval already started before user.interrupt |
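A small sketch of how a backend might build the `user.define_outcome` event and react to each evaluation result above. The payload shape mirrors the JSON example in these notes; the lower bound of 1 on `max_iterations` and the action labels in `NEXT_STEP` are assumptions made for illustration.

```python
DEFAULT_MAX_ITERATIONS = 3
MAX_ITERATIONS_LIMIT = 20


def define_outcome(description: str, rubric_markdown: str,
                   max_iterations: int = DEFAULT_MAX_ITERATIONS) -> dict:
    """Build a user.define_outcome event with an inline text rubric."""
    # The lower bound of 1 is an assumption; the notes only state the maximum.
    if not 1 <= max_iterations <= MAX_ITERATIONS_LIMIT:
        raise ValueError(f"max_iterations must be 1..{MAX_ITERATIONS_LIMIT}")
    return {
        "type": "user.define_outcome",
        "description": description,
        "rubric": {"type": "text", "content": rubric_markdown},
        "max_iterations": max_iterations,
    }


# Hypothetical application-side reactions, keyed by evaluation result.
NEXT_STEP = {
    "satisfied": "collect_artifact",
    "needs_revision": "keep_waiting",
    "max_iterations_reached": "review_manually",
    "failed": "fix_rubric_or_description",
    "interrupted": "no_action",
}


def next_step(result: str) -> str:
    """Map an evaluation result to the application's follow-up action."""
    if result not in NEXT_STEP:
        raise ValueError(f"unknown outcome result: {result!r}")
    return NEXT_STEP[result]
```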
user.message, user.interrupt, user.custom_tool_result, user.tool_confirmation, user.define_outcome
Race condition: open SSE stream before sending kickoff user.message.
Reconnect: open new stream → list past events for dedup IDs → tail.
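The reconnect recipe above can be sketched with two injected callables standing in for the real API calls. Both callables and the `id` field on each event are assumptions; the point is only the ordering (stream first, then history) and the dedup set.

```python
from typing import Callable, Iterable, Iterator, Optional


def resume_events(
    list_past_events: Callable[[], Iterable[dict]],
    open_stream: Callable[[], Iterable[dict]],
    already_handled: Optional[set] = None,
) -> Iterator[dict]:
    """Yield each event at most once across a reconnect.

    already_handled holds IDs processed before the disconnect.
    """
    # Open the live stream first; with a real client the connection is
    # established here, so events emitted during the history fetch are buffered.
    live = iter(open_stream())
    seen = set(already_handled or ())
    for event in list_past_events():
        if event["id"] not in seen:
            seen.add(event["id"])
            yield event
    for event in live:
        if event["id"] in seen:
            continue  # already delivered via the history listing
        seen.add(event["id"])
        yield event
```

The same ordering covers the kickoff race: the stream is open before anything is sent, so the first events cannot be missed.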
input_tokens, output_tokens, cache_creation_input_tokens, cache_read_input_tokens
5-minute prompt cache TTL.
| Domain | Event types |
|---|---|
| Agent | agent.message · agent.thinking · agent.tool_use · agent.tool_result · agent.mcp_tool_use · agent.mcp_tool_result · agent.custom_tool_use · agent.thread_context_compacted · agent.thread_message_received · agent.thread_message_sent |
| Session | session.status_running · session.status_idle · session.status_rescheduled · session.status_terminated · session.error · session.thread_created · session.thread_status_running · session.thread_status_idle · session.thread_status_terminated |
| Span | span.model_request_start · span.model_request_end · span.outcome_evaluation_start · span.outcome_evaluation_ongoing · span.outcome_evaluation_end |
The Anthropic-blessed GitHub MCP is at https://api.githubcopilot.com/mcp/. Token declared on the session's resources, not the agent — agent stays repo-agnostic and reusable.
Multiple repos: add entries to resources array. Repos are cached across sessions sharing them. Token rotates mid-session via PATCH /v1/sessions/{id}/resources/{rid} — useful for short-lived GH App installation tokens. Fine-grained PATs are explicitly recommended over broad-access tokens.
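Mid-session token rotation could be driven by a check like the one below. GitHub App installation tokens expire after one hour; the 10-minute safety margin and the PATCH body shape (`authorization_token` as the only field) are assumptions, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

TOKEN_TTL = timedelta(hours=1)           # GitHub App installation token lifetime
ROTATION_MARGIN = timedelta(minutes=10)  # assumed safety margin, tune as needed


def needs_rotation(minted_at: datetime, now: datetime) -> bool:
    """True once the token is within ROTATION_MARGIN of expiring."""
    return now >= minted_at + TOKEN_TTL - ROTATION_MARGIN


def rotation_request(session_id: str, resource_id: str, new_token: str) -> tuple:
    """Return (method, path, json_body) for the resource PATCH; body shape is assumed."""
    path = f"/v1/sessions/{session_id}/resources/{resource_id}"
    return "PATCH", path, {"authorization_token": new_token}
```

A scheduler would call `needs_rotation` for each live session and, when it returns true, mint a fresh token and send the request built by `rotation_request`.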
session.status_run_started, session.status_idled, session.status_rescheduled, session.status_terminated, session.thread_created, session.thread_idled, session.thread_terminated, session.outcome_evaluation_ended
vault.created, vault.archived, vault.deleted, vault_credential.created, vault_credential.archived, vault_credential.deleted, vault_credential.refresh_failed
unwrap() checks X-Webhook-Signature, rejects payloads >5 min old
event.id: dedupe accordingly
| Concern | What it means for kendo |
|---|---|
| laravel/ai coverage | Covers Messages API only — not Managed Agents. Need direct HTTP via Http:: or official PHP SDK if it ships CMA support. New App\Services\ManagedAgentsService in Services deptrac layer. |
| kendo-script MCP transport | ✓ Verified streamable HTTP — no blocker (2026-05-09). Mcp::web('/mcp/kendo', KendoServer::class) in backend/routes/ai.php:15 uses laravel/mcp 0.6.6's HttpTransport, which cites MCP spec 2025-06-18/basic/transports verbatim. Live probe: POST → 401 + WWW-Authenticate: Bearer realm="mcp", resource_metadata="..."; GET/DELETE → 405 + Allow: POST; .well-known/oauth-protected-resource (RFC 9728) and .well-known/oauth-authorization-server (RFC 8414, S256 PKCE, refresh tokens, dynamic client registration) both serve correct metadata. Exactly the shape Anthropic's mcp_oauth vault credential type consumes. |
| GitHub repo-mount auth | ✓ Reuse existing GitHub App installation — no per-repo key (2026-05-09). Kendo connects via per-tenant GitHub App: GithubInstallation (central DB) maps installation_id ↔ tenant_id; ProjectGithubRepo.repo_full_name holds linked repos. GithubAppService::getInstallationToken(int $installationId): string (backend/app/Services/GithubAppService.php:30) mints a fresh 1-hour installation access token via JWT-signed POST /app/installations/{id}/access_tokens — covers every repo the installation has access to, no per-repo keying. Feed it directly to Anthropic's resources: [{type: "github_repository", authorization_token: $token, ...}]. App permissions (pull_requests: write + contents: write, proven by existing createCheckRun/createPrComment) cover Anthropic's required repo scope (§9). Caveats: 1-hour TTL → PATCH /v1/sessions/{id}/resources/{resource_id} rotation for long sessions; repo must be in the installation's selected list (already enforced by ProjectGithubRepo UX). |
| Reusable AI infra | AiOutboundLogger (hash-chained per ADR-0003) already covers tokens + status + errors. AgentProgressEvent + private Echo channel Tenant.{tenantId}.App.Models.User.{userId} handles client-side streaming. |
| Pennant pattern | Established: feature classes in app/Features/<Pascal>.php, #[Name('kebab-name')], resolve(): bool (default-off), per-tenant scoped. Frontend bridge: useFeatureActive('flagName'). |
| Webhook intake mirror | Mirror existing GithubWebhookController + VerifyGithubWebhook + ProcessPullRequestWebhookJob. New handler uses X-Webhook-Signature + 5-minute freshness check. |
| Audit logging | ADR-0001 mandates append-only hash-chained logs. Session lifecycle events would write through new ClaudeSessionAuditLogger mirroring IssueAuditLogger. |
| Strategic context (KD-0390) | kendo previously had in-house "AI bot assignment" — app/Actions/AiRun/*, App\Models\AiRun, AiRunWorkflowJob. Removed 2026-04-25 ("not on roadmap, if we ever want it we'll rebuild it"). Any new autonomous-agent feature is a deliberate revisit of that decision on a different shape. |
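The webhook intake row above names two checks beyond signature verification: a 5-minute freshness window and dedup on `event.id`. A minimal sketch of those two checks follows. Signature verification is deliberately left out (the notes attribute it to `unwrap()`), the in-memory set is a stand-in for whatever shared store production would use, and the class name is hypothetical.

```python
import time
from typing import Optional

MAX_AGE_SECONDS = 5 * 60


class WebhookGate:
    """Accept a webhook only if it is fresh and its event ID is new."""

    def __init__(self) -> None:
        # In-memory only; a real handler would need a store shared across workers.
        self._seen = set()

    def accept(self, event_id: str, sent_at: float, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        if now - sent_at > MAX_AGE_SECONDS:
            return False  # stale payload, possible replay
        if event_id in self._seen:
            return False  # duplicate delivery
        self._seen.add(event_id)
        return True
```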
laravel/mcp 0.6.6 source (cites MCP spec 2025-06-18 + 2025-11-25) and against the live endpoint. Streamable HTTP confirmed; full RFC 9728 + RFC 8414 OAuth discovery in place. Anthropic-compatible: see §13.
failed outcome: what does kendo do?
failed result fires (rubric ↔ description contradicts), auto-comment + unassign Claude is the obvious answer but worth confirming.
managed-agents-2026-04-01 beta header, no access form needed. New research-preview feature is "dreaming" (sessions that learn from past sessions); not a dependency for any of the three sketched apps.
Superseded: migration cancelled 2026-05-09. PR #1101 closed without merging after the post-implementation production-telemetry check showed CMA fails the latency gate against real production p95 (28 s) by 4–7×. The 8 decisions below are preserved as the historical record of what was decided at the time. Decision #8's premise (Application B parked until ResearchAction soaks) cannot be met because ResearchAction never shipped, so Application B is re-opened to first-class consideration. Future CMA work must bake in the methodology rules from the production-telemetry postmortem.
Settled in a planning round between CEO and parent agent. These commit the project to a specific path; future /plan-feature rounds (for ResearchAction migration first, then Application B) start from these.
| Decision | Detail |
|---|---|
| 1. Next concrete CMA work | Migrate ResearchAction only — the multi-turn codebase-exploration phase of StoryGenerationHarnessAction.php. Other 4 phases stay on Messages API (structured-output calls don't benefit from CMA). |
| 2. No Pennant flag on migration | ResearchAction is internal infra. Pennant pattern is reserved for HandOffToClaude (Application B), the user-visible feature. |
| 3. Pre-migration spike ✓ ran 2026-05-09 | Ran with actual ResearchAction system prompt + 4 story-gen inputs. Prompt 1 produced a clean Sonnet apples-to-apples comparison; prompts 2-4 surfaced harness reliability issues. Single clean datapoint shows 20.7× cheaper, 2.7× faster, 3.3× fewer tool calls — both gates pass with order-of-magnitude margin. Migration green-lit. Full writeup in research/2026-05-08-managed-agents-spike.md. |
| 4. Spike scaffold location | Extend the existing ~/Code/cma-spike/ Python scaffolding with a new prompts/ file. Standalone, throwaway. Numbers stay comparable to the first spike via the same compare.py harness. |
| 5. Pass/fail thresholds | Migration ships only if both: cost ≥ 2× cheaper and p95 latency ≤ 1.5× current ResearchAction. Conservative on cost (lab-side 19× gives margin); forgiving on latency (story-gen is async-job, not interactive). |
| 6. Rollback strategy | Replace outright. Trust the benchmark as the gate. If CMA misbehaves in production, rollback via git revert + redeploy. No shadow mode, no in-code A/B. |
| 7. Outcomes + multiagent gating | Not a decision — both already public beta as of 2026-05-06. No access form. Available now on the standard beta header. |
| 8. Application B timing | Park until ResearchAction has shipped and run in production for ≥ 2 weeks. Then /plan-feature with battle-tested CMA infra (ManagedAgentsService + GitHub installation-token rotation + audit logging). |
Production ResearchAction system prompt against a story-gen input ("issue pinning" feature). Both paths ran to end_turn with no interruption.
| Metric | Messages API | CMA | Ratio |
|---|---|---|---|
| Cost | $11.51 | $0.56 | 20.7× cheaper |
| Wall time | 483s · 8 min | 180s · 3 min | 2.7× faster |
| Tool calls | 118 | 36 | 3.3× fewer |
| Cumulative input | 3.79M tokens · uncached | 38 input + 1.54M cache_read | n/a |
Messages-API per-prompt cost was 3.5× higher than the first spike's average ($3.24) — production-shape ResearchAction prompts are far more open-ended than synthetic codebase questions, so the no-cache agent loop balloons accordingly. CMA's automatic prompt caching absorbs this — same prompt cost $0.56.
Verdict at the time: migration green-lit. Cost gate (≥ 2×) passes by 10× margin. Latency gate (CMA p95 ≤ 1.5× current) was scored against the spike's 483 s headline — trivially passable.
Post-implementation correction (2026-05-09): a query against prod-issue-tracker.ai_outbound_logs (n=36, last 30 days, Script tenant) showed real production p95 is 28 s, not 483 s. The spike's "production-shape" prompt was 17× the real production p95. Today's CMA smokes (Sonnet 191 s, Haiku 125 s) are 4–7× slower than production p95 — the latency gate fails against the real baseline. Cost case still holds. Full postmortem with methodology lessons: research/2026-05-09-production-telemetry-correction.md.
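The two gates from decision 5 (cost at least 2× cheaper, latency at most 1.5× the current p95) are easy to re-run against both baselines. The sketch below plugs in the figures quoted in these notes. Two caveats are assumptions of the example: the corrected run reuses the spike's cost figures because no production cost is given, and the CMA latency is a single Sonnet smoke run rather than a measured p95.

```python
def passes_gates(base_cost: float, cma_cost: float,
                 base_p95_s: float, cma_latency_s: float,
                 min_cost_ratio: float = 2.0,
                 max_latency_ratio: float = 1.5) -> tuple:
    """Return (cost_ok, latency_ok) for the migration gates."""
    cost_ok = base_cost / cma_cost >= min_cost_ratio
    latency_ok = cma_latency_s <= max_latency_ratio * base_p95_s
    return cost_ok, latency_ok


# Scored against the spike's 483 s headline: both gates pass.
spike = passes_gates(11.51, 0.56, base_p95_s=483, cma_latency_s=180)
print(spike)  # (True, True)

# Scored against the real production p95 of 28 s: the latency gate fails.
corrected = passes_gates(11.51, 0.56, base_p95_s=28, cma_latency_s=191)
print(corrected)  # (True, False)
```

With a 28 s baseline the latency ceiling is 42 s, so both the 191 s Sonnet run and the 125 s Haiku run exceed it.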
Prompts 2-4 surfaced limitations of both the spike scaffold and Anthropic's session-control primitives. None block the migration decision; all flow into the ManagedAgentsService design when /plan-feature runs.
user.interrupt doesn't reliably stop sessions mid-bash-tool-execution. One stuck session ignored 3 interrupts over 15s and kept running for ~13 min until its bash command naturally returned. Interrupts seem to take effect only at model-loop boundaries, not mid-tool-call. Production constraint: bound work via outcome rubrics, not client-side interrupts.
session.usage too early returns 0/0/0/0. Patched to poll for idle status before reading. Without this fix, ~$1 of orphan-session spend went unattributed initially.
agent_toolset_20260401 is too broad to faithfully simulate ResearchAction. Production has 3 read-only tools; the spike agent had bash + read + write + edit + grep + glob + originally web_fetch + web_search. On prompt 2, the agent ignored "your ONLY job is to explore" and made 35 edit calls patching 16 files (a complete ValidationException-passthrough fix across all create/update MCP tools). No git push attempted (verified via bash log); the fix lived only inside the ephemeral container and was reaped on archive. Migration plan must use default_config.enabled: false + read-only allowlist.
| Run | Cost | Notes |
|---|---|---|
| Messages API · Sonnet · prompt 1 | $11.51 | Full completion · the headline datapoint · exactly the cost the migration eliminates |
| CMA · Sonnet · prompt 1 | $0.56 | Full completion · the headline datapoint |
| CMA · Sonnet · prompt 2 (off-script) | ~$2.96 | Agent ignored "research only" prompt and patched 16 files |
| CMA · Sonnet · orphan-session cleanup | ~$1.06 | Recovered post-hoc via session-usage queries |
| Total | ~$16 | ~$11 of which was the Messages-API run that the migration removes |
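The usage-attribution fix mentioned in the findings (poll for idle before reading usage) might look like the sketch below. The two callables stand in for real API calls and are hypothetical; the status strings follow the idle → running → terminated lifecycle noted earlier, and the timeout and poll interval are arbitrary defaults.

```python
import time
from typing import Callable


def read_usage_when_idle(
    get_status: Callable[[], str],
    get_usage: Callable[[], dict],
    timeout_s: float = 600.0,
    poll_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Wait until the session stops running, then read its usage once."""
    deadline = clock() + timeout_s
    while True:
        # Usage read while the session is still running came back as zeros
        # in the spike, so only read it once the session has settled.
        if get_status() in ("idle", "terminated"):
            return get_usage()
        if clock() >= deadline:
            raise TimeoutError("session did not settle before the timeout")
        sleep(poll_s)
```

Injecting `sleep` and `clock` keeps the helper testable without real waiting.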