Appearance
Managed Agents Spike — Codebase Research, CMA vs Messages API (3 prompts)
2026-05-09 correction: the latency headline ("2.7× faster") in this doc and the "Production-prompt benchmark" section below was based on a single open-ended prompt that ran 483 s on Messages-API. Real production telemetry is 19 s median / 28 s p95 (n=36 over 30 days). Today's CMA smokes are 4–7× slower than the production p95, not faster. The cost case still holds; the latency case is reversed. See
2026-05-09-production-telemetry-correction.mdfor the full postmortem and methodology lessons. The body of this doc is preserved unchanged for historical reference — apply the postmortem's findings before citing any latency number from below.
Summary
Empirical spike: ran 3 representative kendo "research a codebase" prompts through two implementations on the same model (Claude Sonnet 4.6), with equivalent capabilities. Anthropic Managed Agents (CMA) was 19× cheaper, ~18% faster on average, and used 58% fewer tool calls than a custom Messages-API implementation.
| Path | Total wall time | Total cost | Avg tool calls / prompt |
|---|---|---|---|
| Messages API (custom tools, ripgrep over a local kendo clone) | 517s | $9.72 | 64 |
CMA (agent_toolset_20260401 on github_repository mount) | 422s | $0.51 | 27 |
The cost gap is dominated by automatic prompt caching in CMA sessions. My Messages-API implementation didn't set cache_control markers — a fair-fight version would narrow the gap to maybe 3-5×, not 19×. But the gap stays meaningful in any version of the comparison, and it explains a structural property of CMA: long-conversation workloads with many tool calls (the shape of any agent loop) get prompt caching for free, where the equivalent Messages-API code has to opt in deliberately.
Output quality was indistinguishable between the two paths. Both produced clean structured markdown reports with file:line references. Variance in which exact line they cited (e.g. AppServiceProvider.php:165 vs :167) suggests minor hallucination on either side, with no consistent winner.
The spike validates a single specific claim: for codebase research as a workload, CMA is the right tool. It doesn't validate CMA for the parts of an autonomous-PR workflow that involve writing code, running tests, or opening PRs — those are still untested.
Spike setup
Standalone scaffolding at ~/Code/cma-spike/ — Python 3.12, anthropic SDK 0.100.0, python-dotenv. Both paths take the same prompt files; both use Sonnet 4.6; both have MAX_TOKENS_PER_RUN=4096 output cap.
Messages API path (messages_api/run.py): direct client.messages.create with three custom tools — search_code (shells to grep/ripgrep), read_file, list_files — pointed at a local kendo clone. Loop continues until the model emits stop_reason != "tool_use".
CMA path (cma/run.py): creates a Session against a pre-created Agent + Environment, attaches kendo via github_repository resource (read-only fine-grained PAT). Uses the built-in agent_toolset_20260401 (bash, read, write, edit, glob, grep). Streams events; stops on session.status_idle with stop_reason.type === "end_turn".
System prompts are aligned between the two paths so the comparison is on the runtime, not on the framing.
Verbatim numbers
| Prompt | MA wall | CMA wall | MA tokens | CMA tokens | MA $ | CMA $ | MA tools | CMA tools |
|---|---|---|---|---|---|---|---|---|
01-find-branch-linker-usages | 135.3s | 100.2s | 618,968 in / 5,968 out | 10 in / 2,498 out / 105,007 cache reads | $1.9464 | $0.0690 | 42 | 11 |
03-webhook-intake-pattern | 154.4s | 182.3s | 880,797 in / 8,225 out | 19 in / 5,535 out / 512,946 cache reads | $2.7658 | $0.2370 | 57 | 32 |
06-feature-flag-pattern | 227.2s | 139.6s | 1,618,413 in / 9,989 out | 22 in / 5,685 out / 385,398 cache reads | $5.0051 | $0.2010 | 95 | 37 |
| avg | 172.3s | 140.7s | 1,039,392 in / 8,060 out | 17 in / 4,572 out | $3.24 | $0.17 | 64 | 27 |
The CMA "input tokens" column reads as ~10-20 per run because almost everything (system prompt, tool definitions, accumulated history) hits the prompt cache. The cache-read column captures the real conversational scale on the CMA side.
The MA "input tokens" column shows the cumulative input over all turns of the agent loop — at 95 tool calls (prompt 06), the conversation history alone is 1.6M tokens of cumulative input. This is normal for an uncached agent loop; cache markers would dramatically reduce it.
Per-prompt observations
Prompt 01 — find IssueBranchLinker usages. Narrow component-tracing question. CMA: 100s, 11 tool calls. MA: 135s, 42 tool calls. CMA's tool-call efficiency comes from one-shot bash compositions (find ... | xargs grep ...) replacing what MA does as 5-7 sequential round-trips.
Prompt 03 — webhook intake pattern. Multi-file pattern documentation across routes, middleware, controllers, jobs, audit. The only prompt where CMA was slower (182s vs 154s). The CMA agent went deeper — read 32 tool calls' worth of files including the audit logger and the queued job retry config. The MA agent stopped earlier with a thinner answer. So CMA's "loss" on wall time was paired with a richer report. Both produced usable output.
Prompt 06 — Pennant feature flag pattern. Small enumeration with auto-discovery wiring. CMA: 140s, 37 tool calls. MA: 227s, 95 tool calls. The MA agent looped extensively — five separate searches for the #[Name] attribute, repeated reads of AppServiceProvider. The cumulative input hit 1.6M tokens (the full session conversation history flowing through every turn) and the cost ballooned to $5. CMA capped the same exploration with prompt caching. This is the prompt where the cost gap got most extreme (25×).
Findings
1. Cost: CMA wins decisively, ~19× cheaper across the sample. Caveat: this includes the unfair-caching effect. A Messages-API implementation that wires up cache_control: {type: "ephemeral"} on system prompts and tool definitions would narrow the gap to maybe 3-5×. But that gap stays meaningful, and the rule generalises: CMA gets prompt caching for free; custom Messages-API code gets it only when you remember to opt in. For long-running agent loops, that's a real ergonomic win.
2. Speed: CMA wins on average (~18%), but variance is high. Prompt 03 was 28s slower on CMA. The mean tells the story; individual prompts swing both ways depending on how exploratory the agent gets. Repeated runs of the same prompt would also vary — we didn't measure that.
3. Tool-call efficiency: CMA uses ~58% fewer calls. Structural, not caching-related. In-container bash composes (find | xargs grep, cat | head, etc.) replace what tool-calling Messages API needs multiple round-trips for. This effect is independent of caching and would persist in a "fair fight" comparison.
4. Output quality: indistinguishable. Both paths produced clean structured reports with file:line refs. Minor differences in cited line numbers (off-by-2 here and there) on both sides. Neither wins on quality.
Honest limitations
- Sample size of 3. Directional signal, not a tight measurement. Variance per prompt is real (prompt 03 reversed the speed result). Wider sample would tighten the numbers.
- Codebase research only. This validates the read side of an agent. Writing code, running tests, opening PRs — all untested.
- Sonnet 4.6 only. Opus would shift absolute numbers but probably keep the ratio.
- Cold-start tax hidden. Every CMA run started cold (new session per prompt). Production might reuse sessions for warm calls; we didn't measure that.
- The Messages-API path didn't use prompt caching. This was deliberate (matches how
app/Actions/Agent/ResearchAction.phpis wired today, which doesn't usecache_controlmarkers either) but it's the dominant cause of the cost gap. The "real" gap is smaller. - Both paths were single-turn from the user's POV. Multi-turn iterative refinement was not tested.
Strategic read
The signal is strong enough to act on a specific, contained migration:
Migrate ResearchAction of the existing story-generation harness to CMA. That's the multi-turn codebase-exploration phase of app/Actions/Agent/StoryGenerationHarnessAction.php. The other 4 phases (ValidateInputAction, DuplicateCheckAction, ClassifyAction, WriteAction) are essentially structured-output calls that don't benefit from CMA — they should stay on Messages API. Expected outcomes:
- Cost on the research phase drops 5-15× (conservative, after accounting for cache markers being available in either runtime)
- Latency on the research phase ties or improves
- Tool-call count drops by half — fewer round-trips, less load on the kendo backend's MCP tools
- We get production CMA infrastructure stood up (Agent + Environment +
github_repositoryresource + webhook handler + audit logging) on a real workload before betting the autonomous-execution feature on it
What this spike doesn't justify:
- Building the full "Hand to Claude" autonomous-PR feature next. We've only validated the read side. The write side is still entirely untested in our context.
- Migrating phases of the harness that are not tool-heavy. Those are the right shape for Messages API; CMA would just add session-creation overhead.
Reproduce
Spike folder: ~/Code/cma-spike/
bash
cd ~/Code/cma-spike
source .venv/bin/activate
# .env needs ANTHROPIC_API_KEY + GITHUB_TOKEN (read-only PAT scoped to script-development/kendo)
python3 cma/setup.py # creates Agent + Environment, caches IDs
python3 messages_api/run.py prompts/<file>.md # one MA run
python3 cma/run.py prompts/<file>.md # one CMA run
python3 compare.py # produces results/comparison.mdTo extend the sample, drop more prompt files into prompts/ and re-run. Cost ceiling: each MA run ≈ $1-5; each CMA run ≈ $0.05-0.50 on Sonnet at the prompt sizes we tested.
To stop the meter when done: python3 cma/teardown.py (archives the Agent + Environment).
Prerequisite verification — kendo-script MCP transport (2026-05-09)
The capability survey flagged one prerequisite for any CMA migration that wants to attach the existing kendo-script MCP server (the laravel/mcp server backing https://script.kendo.dev/mcp/kendo) to a Managed Agents session: does it speak streamable HTTP MCP? Anthropic's mcp_toolset only accepts streamable-HTTP MCP servers — stdio is not supported, and pre-2025-03-26 separate-endpoint SSE transport is not supported. Verified now so the migration plan doesn't trip on a transport mismatch.
Verdict: ✅ streamable HTTP, OAuth-discoverable, Anthropic-compatible. No transport blocker.
Source verification
backend/routes/ai.php:15 registers the server via Mcp::web('/mcp/kendo', KendoServer::class). Tracing into laravel/mcp 0.6.6 (backend/composer.json pinned at ^0.6.6):
vendor/laravel/mcp/src/Server/Registrar.php:32-56—web()registers a single route URI accepting:POST→ wraps the request inHttpTransport, runs the server, returns eitherapplication/jsonortext/event-stream. Cites the MCP 2025-11-25 transport spec inline.GET→405 Method Not AllowedwithAllow: POST(server-initiated GET stream not supported — spec-permitted)DELETE→405 Method Not AllowedwithAllow: POST(session termination via DELETE not supported — also spec-permitted)
vendor/laravel/mcp/src/Server/Transport/HttpTransport.php— implements both the immediate-response branch and the SSE-upgrade branch on the same POST endpoint. Reads/writes theMCP-Session-Idheader. SetsX-Accel-Buffering: noon streamed responses. Returns202for notifications-only POSTs,200otherwise — explicit reference tohttps://modelcontextprotocol.io/specification/2025-06-18/basic/transports#sending-messages-to-the-serverat line 70.
This is the streamable HTTP transport introduced in MCP spec 2025-03-26 and refined in 2025-06-18 / 2025-11-25 — single endpoint, POST-driven, optional SSE upgrade per request. Not the legacy two-endpoint SSE transport from 2024-11-05.
Live probe
Live endpoint behaviour matches the code (probed 2026-05-09 from this machine):
| Request | Response | Notes |
|---|---|---|
GET https://script.kendo.dev/mcp/kendo | 405 + Allow: POST | Spec-compliant. |
DELETE https://script.kendo.dev/mcp/kendo | 405 + Allow: POST | Spec-compliant. |
POST with no token, Accept: application/json, text/event-stream | 401 + WWW-Authenticate: Bearer realm="mcp", resource_metadata="https://script.kendo.dev/.well-known/oauth-protected-resource/mcp/kendo" | OAuth resource-indicator challenge (RFC 9728). |
GET /.well-known/oauth-protected-resource/mcp/kendo | {"resource":"https://script.kendo.dev/mcp/kendo","authorization_servers":["https://script.kendo.dev"],"scopes_supported":["mcp:use"]} | RFC 9728 metadata. |
GET /.well-known/oauth-authorization-server | {issuer, authorization_endpoint, token_endpoint, registration_endpoint, response_types_supported:["code"], code_challenge_methods_supported:["S256"], scopes_supported:["mcp:use"], grant_types_supported:["authorization_code","refresh_token"]} | RFC 8414 metadata. PKCE S256 + refresh tokens supported. Dynamic client registration available at /oauth/register. |
This is precisely the shape Anthropic's mcp_oauth vault credential type consumes. Per the capability survey (section 8), POST /v1/vaults/{vault_id}/credentials with auth: { type: "mcp_oauth", mcp_server_url, access_token, expires_at, refresh: { token_endpoint, client_id, refresh_token, token_endpoint_auth } } would be filled directly from the metadata above — Anthropic handles refresh.
What this means for the recommended migration
The spike's recommended next step — migrate ResearchAction of StoryGenerationHarnessAction to CMA — does not need kendo-script MCP at all. Codebase research uses the agent_toolset_20260401 over the github_repository resource mount (which is what we benchmarked in this spike, and what gave us the 19× cost win). kendo-script MCP only enters the picture if/when we want CMA agents to read or mutate kendo issues, branches, time entries, etc. — i.e., for the bigger Application B ("Hand to Claude" autonomous PR) flow.
So this verification doesn't unblock the immediate next step, but it does eliminate the largest infrastructure unknown for Application B: we will not have to fork or rewrite the MCP server to make it Anthropic-compatible. We can declare it on the Agent and authenticate per-user via a vault.
Open follow-up
- End-to-end auth flight. Discovery + 401 challenge are verified. The actual
mcp_oauthround-trip — Anthropic exchanging an access token, callingtools/list, calling a tool, refreshing on expiry — has not been exercised. That's a small additional spike once we're scoping Application B; not required to unblock theResearchActionmigration. mcp:usescope sufficiency. The auth server only advertisesmcp:use. If we later want to differentiate read-only vs read-write CMA-driven access (e.g. a "Hand to Claude" feature that should only read but never mutate), we'll need a finer-grained scope set on the kendo-script side. Today, a token withmcp:usecan call every tool the user has permission for.
Production-prompt benchmark (2026-05-09)
Per decisions § 3-5 below, ran the actual production ResearchAction system prompt against realistic story-gen inputs (prompts/research-action/* in ~/Code/cma-spike/). The benchmark hit harness reliability limits before completing the full 4-prompt × 2-path matrix, but prompt 1 produced a clean apples-to-apples Sonnet comparison that's already an order of magnitude past the pass/fail gate. Calling it.
Headline numbers (single clean Sonnet datapoint)
| Messages API | CMA | Ratio | |
|---|---|---|---|
| Cost | $11.51 | $0.56 | 20.7× cheaper |
| Wall time | 483s (8 min) | 180s (3 min) | 2.7× faster |
| Tool calls | 118 | 36 | 3.3× fewer |
| Cumulative input tokens | 3.79M (no caching) | 38in + 1.54M cache_read | — |
The Messages-API per-prompt cost was 3.5× higher than the first spike's average ($3.24) — production-shape ResearchAction prompts are far more open-ended than synthetic codebase questions, and the no-cache agent loop balloons accordingly. That's exactly the workload shape where CMA's automatic prompt caching wins biggest. Same prompt on CMA cost $0.56 — the cache layer absorbed virtually all of the input.
Verdict
Migration green-lit. The cost gate (≥ 2× cheaper) passes by ~10× margin. The latency gate (CMA p95 ≤ 1.5× current) trivially passes — CMA is faster, not slower. Async-job nature of ResearchAction (queued, not interactive) makes latency mostly cosmetic anyway.
One datapoint isn't a tight measurement, but the gap is large enough that statistical noise can't close it.
Harness reliability findings (block on these before re-benchmarking)
The remaining 3 prompts surfaced multiple issues with the spike scaffold and Anthropic's session-control primitives. None block the migration decision, but they'd block any future benchmark that needs reliable per-prompt numbers:
Production-shape prompts run 5–10× longer than synthetic. First-spike prompts averaged 100–180s on CMA; production-shape ran 5–13 minutes. The session timeout in
cma/run.pywas 300s — patched to 600s mid-run. Future runs need ≥ 600s with tolerance for outliers.user.interruptdoesn't reliably stop sessions mid-bash-tool-execution. One stuck session ignored 3 interrupts over 15s and stayedrunninguntil its agent's long bash command naturally returned ~13 minutes later. The interrupt seems to take effect only at the next model-loop boundary, not mid-tool-call. Patched the harness with drain-then-poll-for-idle but it still wasn't sufficient.Reading
session.usagetoo early returns0/0/0/0. The harness fetched usage immediately after the stream closed; this returns zeros for sessions still finalising. Patched to poll foridlestatus (up to 40s) before reading. Without this fix, ~$1 of orphan-session spend went unattributed.The full
agent_toolset_20260401is too broad to faithfully simulate ResearchAction. Production ResearchAction has 3 read-only tools (get_repo_tree,search_code,get_file_content); the spike's agent had bash + read + write + edit + grep + glob + (originally)web_fetch+web_search. On prompt 2, the agent ignored "your ONLY job is to explore" and made 35editcalls patching 16 files (a full ValidationException-passthrough fix across all create/update MCP tools). Cost ($2.96) and behaviour (write-side) both diverged from what production would do. Nogit pushwas attempted — verified via the bash call log; the 16-file fix lived only inside the ephemeral container and was reaped on archive. For an honest production-prompt benchmark, the agent must be created withdefault_config.enabled: false+ an explicit allowlist (read,grep,globonly).
Spend tally
| Run | Cost | Notes |
|---|---|---|
| Messages API · Sonnet · prompt 1 | $11.51 | Full completion, end_turn — the headline datapoint |
| CMA · Sonnet · prompt 1 | $0.56 | Full completion, end_turn — the headline datapoint |
| CMA · Sonnet · prompt 2 (off-script edit campaign) | ~$2.96 | Agent ignored research-only system prompt, did the full fix |
| CMA · Sonnet · orphan-session cleanup | ~$1.06 | Recovered via post-hoc usage queries on archived sessions |
| Total | ~$16 | ~$11 of which was the Messages-API run that this whole exercise plans to eliminate |
What this changes for the migration plan
When /plan-feature runs for the ResearchAction migration:
- Don't re-run the production-prompt benchmark unless the harness gets the fixes above. The migration decision doesn't need more data — prompt 1 is sufficient.
- The migration's
ManagedAgentsServiceshould mirror production ResearchAction's read-only tool surface — wrapget_repo_tree,search_code,get_file_contentas MCP tools or custom tools, and create the CMA Agent withdefault_config.enabled: false+ explicit allowlist. Easier to reason about, cheaper per-run, predictable. - The
user.interruptunreliability becomes a production constraint: long-running CMA sessions need server-side deadlines + spend budgets inManagedAgentsService, not just client-side timeouts. Plan for "the session may run for 15 minutes after we tell it to stop." For ResearchAction specifically: bound the work via Anthropic's outcome rubric (max_iterations) rather than relying on interrupts.
Decisions made (2026-05-09)
After the prerequisite verifications below, a planning round between CEO and parent agent settled the path forward:
- Migrate
ResearchActiononly (slice 2 of capability survey § 12 Application C). Other 4 phases stay on Messages API. - No Pennant flag. ResearchAction is internal infra. Pennant is reserved for HandOffToClaude (Application B, the user-visible feature).
- Pre-migration benchmark spike: extend
~/Code/cma-spike/with a real-prompt file (actual ResearchAction system prompt + 3-5 sampled story-gen inputs), re-runcompare.py. Closes cost+latency uncertainty on the real prompt shape, not the 3 synthetic prompts this spike used. - Pass/fail gate: ship only if cost ≥ 2× cheaper and p95 latency ≤ 1.5× current ResearchAction. Conservative on cost (we saw 19× lab-side; 2× in prod gives margin); forgiving on latency (story-gen is async-job, not interactive UX).
- Rollback strategy: replace outright; trust the benchmark as the gate; rollback via
git revert+ redeploy if needed. No shadow mode, no A/B compare in code. - Outcomes + multiagent: confirmed public beta as of 2026-05-06 (Code with Claude 2026) — no access form needed; remove that prerequisite from the open list.
- Application B (Hand to Claude): park until ResearchAction has shipped and run in production for ≥ 2 weeks. Then
/plan-featureit with battle-tested CMA infra.
Full table with rationale lives in the capability survey at ./managed-agents-kendo-evaluation.md § 16.
Prerequisite verification — github_repository mount auth (2026-05-09)
The spike used a hand-rolled fine-grained PAT on the github_repository resource. For production, the mount needs to authenticate as kendo, not as a developer's PAT. Question we wanted to settle: can we reuse the existing GitHub App installation that kendo already has wired up, or do we need to provision a separate token per linked repo?
Verdict: ✅ reuse the existing installation. No per-repo key.
Source verification
Kendo's GitHub integration uses a per-tenant GitHub App installation:
app/Models/Central/GithubInstallation.php— central-DB row mappinginstallation_id ↔ tenant_id. One row per tenant; the App is installed account-wide for that tenant.app/Models/ProjectGithubRepo.php— per-tenant row holdingrepo_full_name(e.g."owner/repo") linked to a project. Anything in this table is reachable via the tenant's installation token.app/Services/GithubAppService.php:30—getInstallationToken(int $installationId): stringmints a fresh 1-hour installation access token by JWT-signing a call toPOST /app/installations/{id}/access_tokens. The token authorises every repo the installation has access to — not per-repo.
The App's permission set is provable from existing usage in GithubAppService — it issues check runs (createCheckRun), PR comments (createPrComment), and dispatches workflows. That implies at minimum pull_requests: write and contents: write, both covered by repo scope on Anthropic's CMA token-permission table (capability survey § 9: clone private repos = repo, create PRs = repo).
CMA wiring
php
// In ManagedAgentsService::createSession() or equivalent
$installation = GithubInstallation::where('tenant_id', $tenant->id)->firstOrFail();
$token = $githubAppService->getInstallationToken($installation->installation_id);
$payload = [
'agent' => $agentId,
'environment_id' => $envId,
'resources' => [
[
'type' => 'github_repository',
'url' => "https://github.com/{$projectRepo->repo_full_name}",
'mount_path' => '/workspace/repo',
'authorization_token' => $token, // 1-hour TTL
],
// multi-repo: same $token, different url / mount_path per entry
],
];Two real caveats (neither is a per-repo-key problem)
- Token TTL is 1 hour. Anthropic supports
PATCH /v1/sessions/{session_id}/resources/{resource_id}to rotate mid-session — capability survey § 9 calls this out. Long-running sessions (story-genResearchActionis short-lived; Application B "Hand to Claude" is potentially multi-hour) need a refresh job that re-mints viagetInstallationToken()before expiry. - App must have the repo selected. Already enforced upstream by
ProjectGithubRepo(the user picked which repos to grant at install/configure time) — UX precondition, no new infra.
What this means for the migrations
- ResearchAction migration (recommended next step): each session is short (single research phase, well under 1 hour). Mint once at session-create, no rotation needed.
- Application B (Hand to Claude): sessions can run hours. Token rotation is required infrastructure — tied to the long-session lifecycle, alongside outcome evaluation and webhook handling.
References
- The capability survey written before this spike:
./managed-agents-kendo-evaluation.md— covers all 14 Anthropic doc pages, three plausible kendo applications, decision framework - Anthropic Managed Agents docs: https://platform.claude.com/docs/en/managed-agents/overview
- Spike scaffold (local):
~/Code/cma-spike/ - Spike Agent/Environment IDs (cached locally, archive when done):
~/Code/cma-spike/.cache.json - Closest existing kendo precedent:
backend/app/Actions/Agent/StoryGenerationHarnessAction.php(5-phase harness;ResearchActionis the migration target)