Skip to content

Managed Agents Spike — Codebase Research, CMA vs Messages API (3 prompts)

2026-05-09 correction: the latency headline ("2.7× faster") in this doc and the "Production-prompt benchmark" section below was based on a single open-ended prompt that ran 483 s on Messages-API. Real production telemetry is 19 s median / 28 s p95 (n=36 over 30 days). Today's CMA smokes are 4–7× slower than the production p95, not faster. The cost case still holds; the latency case is reversed. See 2026-05-09-production-telemetry-correction.md for the full postmortem and methodology lessons. The body of this doc is preserved unchanged for historical reference — apply the postmortem's findings before citing any latency number from below.

Summary

Empirical spike: ran 3 representative kendo "research a codebase" prompts through two implementations on the same model (Claude Sonnet 4.6), with equivalent capabilities. Anthropic Managed Agents (CMA) was 19× cheaper, ~18% faster on average, and used 58% fewer tool calls than a custom Messages-API implementation.

PathTotal wall timeTotal costAvg tool calls / prompt
Messages API (custom tools, ripgrep over a local kendo clone)517s$9.7264
CMA (agent_toolset_20260401 on github_repository mount)422s$0.5127

The cost gap is dominated by automatic prompt caching in CMA sessions. My Messages-API implementation didn't set cache_control markers — a fair-fight version would narrow the gap to maybe 3-5×, not 19×. But the gap stays meaningful in any version of the comparison, and it explains a structural property of CMA: long-conversation workloads with many tool calls (the shape of any agent loop) get prompt caching for free, where the equivalent Messages-API code has to opt in deliberately.

Output quality was indistinguishable between the two paths. Both produced clean structured markdown reports with file:line references. Variance in which exact line they cited (e.g. AppServiceProvider.php:165 vs :167) suggests minor hallucination on either side, with no consistent winner.

The spike validates a single specific claim: for codebase research as a workload, CMA is the right tool. It doesn't validate CMA for the parts of an autonomous-PR workflow that involve writing code, running tests, or opening PRs — those are still untested.

Spike setup

Standalone scaffolding at ~/Code/cma-spike/ — Python 3.12, anthropic SDK 0.100.0, python-dotenv. Both paths take the same prompt files; both use Sonnet 4.6; both have MAX_TOKENS_PER_RUN=4096 output cap.

Messages API path (messages_api/run.py): direct client.messages.create with three custom tools — search_code (shells to grep/ripgrep), read_file, list_files — pointed at a local kendo clone. Loop continues until the model emits stop_reason != "tool_use".

CMA path (cma/run.py): creates a Session against a pre-created Agent + Environment, attaches kendo via github_repository resource (read-only fine-grained PAT). Uses the built-in agent_toolset_20260401 (bash, read, write, edit, glob, grep). Streams events; stops on session.status_idle with stop_reason.type === "end_turn".

System prompts are aligned between the two paths so the comparison is on the runtime, not on the framing.

Verbatim numbers

PromptMA wallCMA wallMA tokensCMA tokensMA $CMA $MA toolsCMA tools
01-find-branch-linker-usages135.3s100.2s618,968 in / 5,968 out10 in / 2,498 out / 105,007 cache reads$1.9464$0.06904211
03-webhook-intake-pattern154.4s182.3s880,797 in / 8,225 out19 in / 5,535 out / 512,946 cache reads$2.7658$0.23705732
06-feature-flag-pattern227.2s139.6s1,618,413 in / 9,989 out22 in / 5,685 out / 385,398 cache reads$5.0051$0.20109537
avg172.3s140.7s1,039,392 in / 8,060 out17 in / 4,572 out$3.24$0.176427

The CMA "input tokens" column reads as ~10-20 per run because almost everything (system prompt, tool definitions, accumulated history) hits the prompt cache. The cache-read column captures the real conversational scale on the CMA side.

The MA "input tokens" column shows the cumulative input over all turns of the agent loop — at 95 tool calls (prompt 06), the conversation history alone is 1.6M tokens of cumulative input. This is normal for an uncached agent loop; cache markers would dramatically reduce it.

Per-prompt observations

Prompt 01 — find IssueBranchLinker usages. Narrow component-tracing question. CMA: 100s, 11 tool calls. MA: 135s, 42 tool calls. CMA's tool-call efficiency comes from one-shot bash compositions (find ... | xargs grep ...) replacing what MA does as 5-7 sequential round-trips.

Prompt 03 — webhook intake pattern. Multi-file pattern documentation across routes, middleware, controllers, jobs, audit. The only prompt where CMA was slower (182s vs 154s). The CMA agent went deeper — read 32 tool calls' worth of files including the audit logger and the queued job retry config. The MA agent stopped earlier with a thinner answer. So CMA's "loss" on wall time was paired with a richer report. Both produced usable output.

Prompt 06 — Pennant feature flag pattern. Small enumeration with auto-discovery wiring. CMA: 140s, 37 tool calls. MA: 227s, 95 tool calls. The MA agent looped extensively — five separate searches for the #[Name] attribute, repeated reads of AppServiceProvider. The cumulative input hit 1.6M tokens (the full session conversation history flowing through every turn) and the cost ballooned to $5. CMA capped the same exploration with prompt caching. This is the prompt where the cost gap got most extreme (25×).

Findings

1. Cost: CMA wins decisively, ~19× cheaper across the sample. Caveat: this includes the unfair-caching effect. A Messages-API implementation that wires up cache_control: {type: "ephemeral"} on system prompts and tool definitions would narrow the gap to maybe 3-5×. But that gap stays meaningful, and the rule generalises: CMA gets prompt caching for free; custom Messages-API code gets it only when you remember to opt in. For long-running agent loops, that's a real ergonomic win.

2. Speed: CMA wins on average (~18%), but variance is high. Prompt 03 was 28s slower on CMA. The mean tells the story; individual prompts swing both ways depending on how exploratory the agent gets. Repeated runs of the same prompt would also vary — we didn't measure that.

3. Tool-call efficiency: CMA uses ~58% fewer calls. Structural, not caching-related. In-container bash composes (find | xargs grep, cat | head, etc.) replace what tool-calling Messages API needs multiple round-trips for. This effect is independent of caching and would persist in a "fair fight" comparison.

4. Output quality: indistinguishable. Both paths produced clean structured reports with file:line refs. Minor differences in cited line numbers (off-by-2 here and there) on both sides. Neither wins on quality.

Honest limitations

  • Sample size of 3. Directional signal, not a tight measurement. Variance per prompt is real (prompt 03 reversed the speed result). Wider sample would tighten the numbers.
  • Codebase research only. This validates the read side of an agent. Writing code, running tests, opening PRs — all untested.
  • Sonnet 4.6 only. Opus would shift absolute numbers but probably keep the ratio.
  • Cold-start tax hidden. Every CMA run started cold (new session per prompt). Production might reuse sessions for warm calls; we didn't measure that.
  • The Messages-API path didn't use prompt caching. This was deliberate (matches how app/Actions/Agent/ResearchAction.php is wired today, which doesn't use cache_control markers either) but it's the dominant cause of the cost gap. The "real" gap is smaller.
  • Both paths were single-turn from the user's POV. Multi-turn iterative refinement was not tested.

Strategic read

The signal is strong enough to act on a specific, contained migration:

Migrate ResearchAction of the existing story-generation harness to CMA. That's the multi-turn codebase-exploration phase of app/Actions/Agent/StoryGenerationHarnessAction.php. The other 4 phases (ValidateInputAction, DuplicateCheckAction, ClassifyAction, WriteAction) are essentially structured-output calls that don't benefit from CMA — they should stay on Messages API. Expected outcomes:

  • Cost on the research phase drops 5-15× (conservative, after accounting for cache markers being available in either runtime)
  • Latency on the research phase ties or improves
  • Tool-call count drops by half — fewer round-trips, less load on the kendo backend's MCP tools
  • We get production CMA infrastructure stood up (Agent + Environment + github_repository resource + webhook handler + audit logging) on a real workload before betting the autonomous-execution feature on it

What this spike doesn't justify:

  • Building the full "Hand to Claude" autonomous-PR feature next. We've only validated the read side. The write side is still entirely untested in our context.
  • Migrating phases of the harness that are not tool-heavy. Those are the right shape for Messages API; CMA would just add session-creation overhead.

Reproduce

Spike folder: ~/Code/cma-spike/

bash
cd ~/Code/cma-spike
source .venv/bin/activate
# .env needs ANTHROPIC_API_KEY + GITHUB_TOKEN (read-only PAT scoped to script-development/kendo)
python3 cma/setup.py                           # creates Agent + Environment, caches IDs
python3 messages_api/run.py prompts/<file>.md  # one MA run
python3 cma/run.py prompts/<file>.md           # one CMA run
python3 compare.py                             # produces results/comparison.md

To extend the sample, drop more prompt files into prompts/ and re-run. Cost ceiling: each MA run ≈ $1-5; each CMA run ≈ $0.05-0.50 on Sonnet at the prompt sizes we tested.

To stop the meter when done: python3 cma/teardown.py (archives the Agent + Environment).

Prerequisite verification — kendo-script MCP transport (2026-05-09)

The capability survey flagged one prerequisite for any CMA migration that wants to attach the existing kendo-script MCP server (the laravel/mcp server backing https://script.kendo.dev/mcp/kendo) to a Managed Agents session: does it speak streamable HTTP MCP? Anthropic's mcp_toolset only accepts streamable-HTTP MCP servers — stdio is not supported, and pre-2025-03-26 separate-endpoint SSE transport is not supported. Verified now so the migration plan doesn't trip on a transport mismatch.

Verdict: ✅ streamable HTTP, OAuth-discoverable, Anthropic-compatible. No transport blocker.

Source verification

backend/routes/ai.php:15 registers the server via Mcp::web('/mcp/kendo', KendoServer::class). Tracing into laravel/mcp 0.6.6 (backend/composer.json pinned at ^0.6.6):

  • vendor/laravel/mcp/src/Server/Registrar.php:32-56web() registers a single route URI accepting:
    • POST → wraps the request in HttpTransport, runs the server, returns either application/json or text/event-stream. Cites the MCP 2025-11-25 transport spec inline.
    • GET405 Method Not Allowed with Allow: POST (server-initiated GET stream not supported — spec-permitted)
    • DELETE405 Method Not Allowed with Allow: POST (session termination via DELETE not supported — also spec-permitted)
  • vendor/laravel/mcp/src/Server/Transport/HttpTransport.php — implements both the immediate-response branch and the SSE-upgrade branch on the same POST endpoint. Reads/writes the MCP-Session-Id header. Sets X-Accel-Buffering: no on streamed responses. Returns 202 for notifications-only POSTs, 200 otherwise — explicit reference to https://modelcontextprotocol.io/specification/2025-06-18/basic/transports#sending-messages-to-the-server at line 70.

This is the streamable HTTP transport introduced in MCP spec 2025-03-26 and refined in 2025-06-18 / 2025-11-25 — single endpoint, POST-driven, optional SSE upgrade per request. Not the legacy two-endpoint SSE transport from 2024-11-05.

Live probe

Live endpoint behaviour matches the code (probed 2026-05-09 from this machine):

RequestResponseNotes
GET https://script.kendo.dev/mcp/kendo405 + Allow: POSTSpec-compliant.
DELETE https://script.kendo.dev/mcp/kendo405 + Allow: POSTSpec-compliant.
POST with no token, Accept: application/json, text/event-stream401 + WWW-Authenticate: Bearer realm="mcp", resource_metadata="https://script.kendo.dev/.well-known/oauth-protected-resource/mcp/kendo"OAuth resource-indicator challenge (RFC 9728).
GET /.well-known/oauth-protected-resource/mcp/kendo{"resource":"https://script.kendo.dev/mcp/kendo","authorization_servers":["https://script.kendo.dev"],"scopes_supported":["mcp:use"]}RFC 9728 metadata.
GET /.well-known/oauth-authorization-server{issuer, authorization_endpoint, token_endpoint, registration_endpoint, response_types_supported:["code"], code_challenge_methods_supported:["S256"], scopes_supported:["mcp:use"], grant_types_supported:["authorization_code","refresh_token"]}RFC 8414 metadata. PKCE S256 + refresh tokens supported. Dynamic client registration available at /oauth/register.

This is precisely the shape Anthropic's mcp_oauth vault credential type consumes. Per the capability survey (section 8), POST /v1/vaults/{vault_id}/credentials with auth: { type: "mcp_oauth", mcp_server_url, access_token, expires_at, refresh: { token_endpoint, client_id, refresh_token, token_endpoint_auth } } would be filled directly from the metadata above — Anthropic handles refresh.

The spike's recommended next step — migrate ResearchAction of StoryGenerationHarnessAction to CMA — does not need kendo-script MCP at all. Codebase research uses the agent_toolset_20260401 over the github_repository resource mount (which is what we benchmarked in this spike, and what gave us the 19× cost win). kendo-script MCP only enters the picture if/when we want CMA agents to read or mutate kendo issues, branches, time entries, etc. — i.e., for the bigger Application B ("Hand to Claude" autonomous PR) flow.

So this verification doesn't unblock the immediate next step, but it does eliminate the largest infrastructure unknown for Application B: we will not have to fork or rewrite the MCP server to make it Anthropic-compatible. We can declare it on the Agent and authenticate per-user via a vault.

Open follow-up

  • End-to-end auth flight. Discovery + 401 challenge are verified. The actual mcp_oauth round-trip — Anthropic exchanging an access token, calling tools/list, calling a tool, refreshing on expiry — has not been exercised. That's a small additional spike once we're scoping Application B; not required to unblock the ResearchAction migration.
  • mcp:use scope sufficiency. The auth server only advertises mcp:use. If we later want to differentiate read-only vs read-write CMA-driven access (e.g. a "Hand to Claude" feature that should only read but never mutate), we'll need a finer-grained scope set on the kendo-script side. Today, a token with mcp:use can call every tool the user has permission for.

Production-prompt benchmark (2026-05-09)

Per decisions § 3-5 below, ran the actual production ResearchAction system prompt against realistic story-gen inputs (prompts/research-action/* in ~/Code/cma-spike/). The benchmark hit harness reliability limits before completing the full 4-prompt × 2-path matrix, but prompt 1 produced a clean apples-to-apples Sonnet comparison that's already an order of magnitude past the pass/fail gate. Calling it.

Headline numbers (single clean Sonnet datapoint)

Messages APICMARatio
Cost$11.51$0.5620.7× cheaper
Wall time483s (8 min)180s (3 min)2.7× faster
Tool calls118363.3× fewer
Cumulative input tokens3.79M (no caching)38in + 1.54M cache_read

The Messages-API per-prompt cost was 3.5× higher than the first spike's average ($3.24) — production-shape ResearchAction prompts are far more open-ended than synthetic codebase questions, and the no-cache agent loop balloons accordingly. That's exactly the workload shape where CMA's automatic prompt caching wins biggest. Same prompt on CMA cost $0.56 — the cache layer absorbed virtually all of the input.

Verdict

Migration green-lit. The cost gate (≥ 2× cheaper) passes by ~10× margin. The latency gate (CMA p95 ≤ 1.5× current) trivially passes — CMA is faster, not slower. Async-job nature of ResearchAction (queued, not interactive) makes latency mostly cosmetic anyway.

One datapoint isn't a tight measurement, but the gap is large enough that statistical noise can't close it.

Harness reliability findings (block on these before re-benchmarking)

The remaining 3 prompts surfaced multiple issues with the spike scaffold and Anthropic's session-control primitives. None block the migration decision, but they'd block any future benchmark that needs reliable per-prompt numbers:

  1. Production-shape prompts run 5–10× longer than synthetic. First-spike prompts averaged 100–180s on CMA; production-shape ran 5–13 minutes. The session timeout in cma/run.py was 300s — patched to 600s mid-run. Future runs need ≥ 600s with tolerance for outliers.

  2. user.interrupt doesn't reliably stop sessions mid-bash-tool-execution. One stuck session ignored 3 interrupts over 15s and stayed running until its agent's long bash command naturally returned ~13 minutes later. The interrupt seems to take effect only at the next model-loop boundary, not mid-tool-call. Patched the harness with drain-then-poll-for-idle but it still wasn't sufficient.

  3. Reading session.usage too early returns 0/0/0/0. The harness fetched usage immediately after the stream closed; this returns zeros for sessions still finalising. Patched to poll for idle status (up to 40s) before reading. Without this fix, ~$1 of orphan-session spend went unattributed.

  4. The full agent_toolset_20260401 is too broad to faithfully simulate ResearchAction. Production ResearchAction has 3 read-only tools (get_repo_tree, search_code, get_file_content); the spike's agent had bash + read + write + edit + grep + glob + (originally) web_fetch + web_search. On prompt 2, the agent ignored "your ONLY job is to explore" and made 35 edit calls patching 16 files (a full ValidationException-passthrough fix across all create/update MCP tools). Cost ($2.96) and behaviour (write-side) both diverged from what production would do. No git push was attempted — verified via the bash call log; the 16-file fix lived only inside the ephemeral container and was reaped on archive. For an honest production-prompt benchmark, the agent must be created with default_config.enabled: false + an explicit allowlist (read, grep, glob only).

Spend tally

RunCostNotes
Messages API · Sonnet · prompt 1$11.51Full completion, end_turn — the headline datapoint
CMA · Sonnet · prompt 1$0.56Full completion, end_turn — the headline datapoint
CMA · Sonnet · prompt 2 (off-script edit campaign)~$2.96Agent ignored research-only system prompt, did the full fix
CMA · Sonnet · orphan-session cleanup~$1.06Recovered via post-hoc usage queries on archived sessions
Total~$16~$11 of which was the Messages-API run that this whole exercise plans to eliminate

What this changes for the migration plan

When /plan-feature runs for the ResearchAction migration:

  • Don't re-run the production-prompt benchmark unless the harness gets the fixes above. The migration decision doesn't need more data — prompt 1 is sufficient.
  • The migration's ManagedAgentsService should mirror production ResearchAction's read-only tool surface — wrap get_repo_tree, search_code, get_file_content as MCP tools or custom tools, and create the CMA Agent with default_config.enabled: false + explicit allowlist. Easier to reason about, cheaper per-run, predictable.
  • The user.interrupt unreliability becomes a production constraint: long-running CMA sessions need server-side deadlines + spend budgets in ManagedAgentsService, not just client-side timeouts. Plan for "the session may run for 15 minutes after we tell it to stop." For ResearchAction specifically: bound the work via Anthropic's outcome rubric (max_iterations) rather than relying on interrupts.

Decisions made (2026-05-09)

After the prerequisite verifications below, a planning round between CEO and parent agent settled the path forward:

  1. Migrate ResearchAction only (slice 2 of capability survey § 12 Application C). Other 4 phases stay on Messages API.
  2. No Pennant flag. ResearchAction is internal infra. Pennant is reserved for HandOffToClaude (Application B, the user-visible feature).
  3. Pre-migration benchmark spike: extend ~/Code/cma-spike/ with a real-prompt file (actual ResearchAction system prompt + 3-5 sampled story-gen inputs), re-run compare.py. Closes cost+latency uncertainty on the real prompt shape, not the 3 synthetic prompts this spike used.
  4. Pass/fail gate: ship only if cost ≥ 2× cheaper and p95 latency ≤ 1.5× current ResearchAction. Conservative on cost (we saw 19× lab-side; 2× in prod gives margin); forgiving on latency (story-gen is async-job, not interactive UX).
  5. Rollback strategy: replace outright; trust the benchmark as the gate; rollback via git revert + redeploy if needed. No shadow mode, no A/B compare in code.
  6. Outcomes + multiagent: confirmed public beta as of 2026-05-06 (Code with Claude 2026) — no access form needed; remove that prerequisite from the open list.
  7. Application B (Hand to Claude): park until ResearchAction has shipped and run in production for ≥ 2 weeks. Then /plan-feature it with battle-tested CMA infra.

Full table with rationale lives in the capability survey at ./managed-agents-kendo-evaluation.md § 16.

Prerequisite verification — github_repository mount auth (2026-05-09)

The spike used a hand-rolled fine-grained PAT on the github_repository resource. For production, the mount needs to authenticate as kendo, not as a developer's PAT. Question we wanted to settle: can we reuse the existing GitHub App installation that kendo already has wired up, or do we need to provision a separate token per linked repo?

Verdict: ✅ reuse the existing installation. No per-repo key.

Source verification

Kendo's GitHub integration uses a per-tenant GitHub App installation:

  • app/Models/Central/GithubInstallation.php — central-DB row mapping installation_id ↔ tenant_id. One row per tenant; the App is installed account-wide for that tenant.
  • app/Models/ProjectGithubRepo.php — per-tenant row holding repo_full_name (e.g. "owner/repo") linked to a project. Anything in this table is reachable via the tenant's installation token.
  • app/Services/GithubAppService.php:30getInstallationToken(int $installationId): string mints a fresh 1-hour installation access token by JWT-signing a call to POST /app/installations/{id}/access_tokens. The token authorises every repo the installation has access to — not per-repo.

The App's permission set is provable from existing usage in GithubAppService — it issues check runs (createCheckRun), PR comments (createPrComment), and dispatches workflows. That implies at minimum pull_requests: write and contents: write, both covered by repo scope on Anthropic's CMA token-permission table (capability survey § 9: clone private repos = repo, create PRs = repo).

CMA wiring

php
// In ManagedAgentsService::createSession() or equivalent
$installation = GithubInstallation::where('tenant_id', $tenant->id)->firstOrFail();
$token = $githubAppService->getInstallationToken($installation->installation_id);

$payload = [
    'agent' => $agentId,
    'environment_id' => $envId,
    'resources' => [
        [
            'type' => 'github_repository',
            'url' => "https://github.com/{$projectRepo->repo_full_name}",
            'mount_path' => '/workspace/repo',
            'authorization_token' => $token,  // 1-hour TTL
        ],
        // multi-repo: same $token, different url / mount_path per entry
    ],
];

Two real caveats (neither is a per-repo-key problem)

  1. Token TTL is 1 hour. Anthropic supports PATCH /v1/sessions/{session_id}/resources/{resource_id} to rotate mid-session — capability survey § 9 calls this out. Long-running sessions (story-gen ResearchAction is short-lived; Application B "Hand to Claude" is potentially multi-hour) need a refresh job that re-mints via getInstallationToken() before expiry.
  2. App must have the repo selected. Already enforced upstream by ProjectGithubRepo (the user picked which repos to grant at install/configure time) — UX precondition, no new infra.

What this means for the migrations

  • ResearchAction migration (recommended next step): each session is short (single research phase, well under 1 hour). Mint once at session-create, no rotation needed.
  • Application B (Hand to Claude): sessions can run hours. Token rotation is required infrastructure — tied to the long-session lifecycle, alongside outcome evaluation and webhook handling.

References

  • The capability survey written before this spike: ./managed-agents-kendo-evaluation.md — covers all 14 Anthropic doc pages, three plausible kendo applications, decision framework
  • Anthropic Managed Agents docs: https://platform.claude.com/docs/en/managed-agents/overview
  • Spike scaffold (local): ~/Code/cma-spike/
  • Spike Agent/Environment IDs (cached locally, archive when done): ~/Code/cma-spike/.cache.json
  • Closest existing kendo precedent: backend/app/Actions/Agent/StoryGenerationHarnessAction.php (5-phase harness; ResearchAction is the migration target)