CMA vs Messages API

Spike Codebase research workload — Sonnet 4.6 — 3 prompts 2026-05-08

Cost reduction

19×

$9.72 → $0.51 across 3 prompts

Wall time

−18%

avg 172.3s → 140.7s per prompt

Tool calls

−58%

avg 64 → 27 per prompt

Output quality

≈

indistinguishable on rubric

    Validated claim: for codebase research as a workload, CMA is the right tool. The cost gap is dominated
    by automatic prompt caching in CMA sessions — a Messages-API path that opted into cache_control would narrow
    the gap to roughly 3–5×, but CMA gets caching for free where Messages API has to opt in deliberately. Spike does not
    validate CMA for writing code, running tests, or opening PRs. Those remain untested.
  

Aggregate comparison 3 prompts, totals

Total cost USD across 3 prompts

Messages API

$9.72 baseline

CMA

$0.51 −95% (19× cheaper)

Wall time total seconds

Messages API

517s baseline

CMA

422s −18% faster

Tool calls avg per prompt

Messages API

64 baseline

CMA

27 −58% fewer

Spike setup

Messages API path

Runtimedirect client.messages.create loop
Toolssearch_code (ripgrep), read_file, list_files
Repo accesslocal kendo clone on disk
Cachingnone — no cache_control markers
Stopstop_reason != "tool_use"

CMA path

RuntimeSession on a pre-created Agent + Environment
Toolsagent_toolset_20260401 (bash, read, write, edit, glob, grep)
Repo accessgithub_repository resource (read-only PAT)
Cachingautomatic, server-side
Stopsession.status_idle + end_turn

Both paths use Sonnet 4.6, share an output cap of MAX_TOKENS_PER_RUN=4096, and use aligned system prompts. The comparison is on the runtime, not the framing.

Per-prompt verbatim numbers 3 prompts

Prompt	MA wall	CMA wall	MA tool calls	CMA tool calls	MA cost	CMA cost	Cost ratio
01-find-branch-linker-usages	135.3s	100.2s	42	11	$1.95	$0.07	28×
03-webhook-intake-pattern	154.4s	182.3s	57	32	$2.77	$0.24	12×
06-feature-flag-pattern	227.2s	139.6s	95	37	$5.01	$0.20	25×
Average	172.3s	140.7s	64	27	$3.24	$0.17	19×

On token columns: CMA reports ~10–20 input tokens per run because system prompt, tool definitions, and accumulated history all hit the prompt cache (cache reads ranged 105k–513k per prompt). Messages-API input tokens are cumulative — at 95 tool calls on prompt 06, conversation history alone reached 1.6M tokens of replayed input. That cumulative replay is the dominant cost driver, and it's exactly what cache markers would shrink.

Prompt 01 · branch linker

Narrow component-tracing question. CMA: 100s, 11 tool calls. MA: 135s, 42 tool calls.

CMA's tool-call efficiency comes from one-shot bash compositions (find ... | xargs grep ...) replacing what MA does as 5–7 sequential round-trips.

Prompt 03 · webhook intake

Multi-file pattern across routes, middleware, controllers, jobs, audit. The only prompt where CMA was slower (182s vs 154s).

CMA went deeper — 32 tool calls covering audit logger and queued job retry config. MA stopped earlier with a thinner answer. Wall-time loss paired with a richer report.

Prompt 06 · feature flag

Pennant flag enumeration. CMA: 140s, 37 tool calls. MA: 227s, 95 tool calls.

MA looped extensively — five separate searches for the #[Name] attribute, repeated reads of AppServiceProvider. Cumulative input hit 1.6M tokens. Cost gap most extreme here: 25×.

Findings

Cost — CMA wins decisively (~19×)

Caveat: includes the unfair-caching effect. A fair-fight version with cache_control on system prompts and tool defs would narrow to 3–5×. The rule generalises: CMA gets prompt caching for free; Messages API gets it only when you remember to opt in.

Speed — CMA wins on average (~18%), high variance

Prompt 03 was 28s slower on CMA. Mean tells the story; individual prompts swing both ways depending on how exploratory the agent gets. We didn't measure repeat-run variance.

Tool calls — CMA uses ~58% fewer

Structural, not caching-related. In-container bash composes (find | xargs grep, cat | head) replace what tool-calling Messages API needs multiple round-trips for. Effect persists in a fair-fight comparison.

Output quality — indistinguishable

Both paths produced clean structured reports with file:line refs. Minor differences in cited line numbers (off-by-2 here and there) on both sides. Neither wins on quality.

Honest limitations

Sample size of 3. Directional signal, not a tight measurement. Variance per prompt is real (prompt 03 reversed the speed result).
Codebase research only. Validates the read side. Writing code, running tests, opening PRs — all untested.
Sonnet 4.6 only. Opus would shift absolute numbers but probably keep the ratio.
Cold-start tax hidden. Every CMA run started cold (new session per prompt). Production might reuse sessions for warm calls.
MA didn't use prompt caching. Deliberate (matches how ResearchAction is wired today) but the dominant cause of the cost gap. The "real" gap is smaller.
Single-turn from the user's POV. Multi-turn iterative refinement was not tested.

Strategic read · what this spike justifies

Migrate

ResearchAction → CMA

The multi-turn codebase-exploration phase of app/Actions/Agent/StoryGenerationHarnessAction.php. It is the closest existing kendo precedent for the workload this spike validated.

Cost on the research phase drops 5–15× (conservative, after accounting for cache markers being available either runtime)
Latency on the research phase ties or improves
Tool-call count drops by half — fewer round-trips, less load on kendo backend's MCP tools
Production CMA infra (Agent + Environment + github_repository resource + webhook handler + audit logging) gets stood up on a real workload before betting the autonomous-execution feature on it

Keep

ValidateInput · DuplicateCheck · Classify · Write — stay on Messages API

The other 4 phases of the harness are essentially structured-output calls. They don't benefit from CMA — session-creation overhead would only add cost without speedup. CMA is the wrong shape for non-tool-heavy phases.

Don't ship yet

"Hand to Claude" autonomous-PR feature

We've only validated the read side. The write side — editing code, running tests, opening PRs from CMA — is still entirely untested in our context. Building the full autonomous-PR feature next would skip a layer of validation we haven't earned.

Reproduce

Spike folder: ~/Code/cma-spike/

# .env needs ANTHROPIC_API_KEY + GITHUB_TOKEN
# (read-only PAT scoped to script-development/kendo)
cd ~/Code/cma-spike
source .venv/bin/activate

python3 cma/setup.py                          # create Agent + Env, cache IDs
python3 messages_api/run.py prompts/<file>.md  # one MA run
python3 cma/run.py prompts/<file>.md           # one CMA run
python3 compare.py                            # results/comparison.md

# When done, stop the meter:
python3 cma/teardown.py                        # archive Agent + Env

Cost ceiling per run on Sonnet at the prompt sizes tested: $1–5 for MA, $0.05–0.50 for CMA. Drop more prompt files in prompts/ to extend the sample.

References

Capability survey written before this spike: research/managed-agents-kendo-evaluation.md — covers all 14 Anthropic doc pages, three plausible kendo applications, decision framework
Full spike write-up: research/2026-05-08-managed-agents-spike.md
Anthropic Managed Agents docs: platform.claude.com/docs/en/managed-agents/overview
Spike scaffold (local): ~/Code/cma-spike/
Spike Agent/Env IDs (cached, archive when done): ~/Code/cma-spike/.cache.json
Closest existing kendo precedent: backend/app/Actions/Agent/StoryGenerationHarnessAction.php