Production telemetry contradicts the spike's wall-clock claim

TL;DR

The spike's 2.7× wall-clock advantage for CMA over Messages API was driven by a single open-ended benchmark prompt that ran 483 s on the Messages API path. Real production traffic on the same feature='agent_research' Action runs in 19 s median / 28 s p95: the spike's "production-shape" datapoint is 17× the production p95 (25× the median), and the CMA smokes we ran today (125–191 s) are 6–10× slower than the production median. The cost case for CMA still holds (caching is real). The latency case for CMA is the opposite of what the spike implied for typical workloads, and the migration as it stands fails its own AC #7 (p95 ≤ 1.5 × current, i.e. ≤ ~42 s) by ~3–4.5×.

This was caught after the migration code was already merge-ready, by querying prod-issue-tracker's ai_outbound_logs directly. Captured here for future reference and for the eventual Application B (Hand-to-Claude) planning, where the same methodology mistake would be far more expensive.

Numbers

Production Messages-API ResearchAction (real traffic)

Source: prod-issue-tracker Script tenant DB, feature='agent_research' AND status='success' AND created_at >= NOW() - INTERVAL 30 DAY, n=36.

Metric	Value
min	3 451 ms (~3.5 s)
p50 (median)	18 703 ms (~19 s)
p90	26 340 ms (~26 s)
p95	28 036 ms (~28 s)
max	32 214 ms (~32 s)
avg	18 413 ms (~18 s)

Distribution is tight (p50 to p95 spans ~10 s, no long tail) — this is a stable workload, not a noisy one.
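
For reference, a summary like the table above can be recomputed from raw response_time_ms values with a few lines of Python. This is a minimal sketch that assumes the nearest-rank percentile definition; the note does not say which definition produced the published figures, so a different method (for example, interpolation) may give slightly different values. The function names are illustrative only.

```python
def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it.

    pct is an integer percentage (e.g. 95). Integer arithmetic avoids float rounding.
    """
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, (pct * len(sorted_values) + 99) // 100)  # ceil(pct * n / 100)
    return sorted_values[rank - 1]


def summarize_latency(samples_ms):
    """Return the same metrics as the table: n, min, p50, p90, p95, max, avg."""
    values = sorted(samples_ms)
    if not values:
        raise ValueError("no samples")
    return {
        "n": len(values),
        "min": values[0],
        "p50": nearest_rank(values, 50),
        "p90": nearest_rank(values, 90),
        "p95": nearest_rank(values, 95),
        "max": values[-1],
        "avg": sum(values) / len(values),
    }
```

Under the nearest-rank definition, the p95 of n=36 samples is the 35th-smallest value, so a p95 computed this way rests on the two slowest runs in the window. That is worth keeping in mind when reading a percentile from a sample this small.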

Spike's "production-shape" Messages-API datapoint

Source: docs/plans/KD-0650-migrate-research-action-to-cma/research/2026-05-08-managed-agents-spike.md § Production-prompt benchmark.

Metric	Value
Wall time	483 000 ms (8 min)
Tool calls	118
Cumulative input tokens	3.79 M (no caching)

This is 17× the production p95 and 25× the production median. The spike author flagged it as a single datapoint and noted production-shape prompts ran 5–10× longer than their synthetic baseline. What they didn't have was a query against actual production telemetry, which would have shown that real production prompts behave more like the synthetic spike (~100–180 s) than the open-ended one they chose for the headline.

Today's CMA smokes (this branch)

Same StoryGenerationHarness research prompt against the kendo repo:

Run	Wall time	Iterations	Output chars
Sonnet 4.6	191 s	25	10 063
Haiku 4.5	125 s	19	15 752

Both are 4–7× slower than the production p95 (28 s) of the path we're replacing.
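
The multiples quoted in this note follow directly from the table values. The arithmetic, as a small Python check using only figures that appear above (the helper name is illustrative):

```python
# Figures copied from the tables in this note, in milliseconds.
SPIKE_MS = 483_000
PROD_P50_MS = 18_703
PROD_P95_MS = 28_036
SMOKES_MS = {"Sonnet 4.6": 191_000, "Haiku 4.5": 125_000}


def times_slower(candidate_ms, baseline_ms):
    """How many times longer the candidate ran than the baseline, to one decimal."""
    return round(candidate_ms / baseline_ms, 1)


spike_vs_p95 = times_slower(SPIKE_MS, PROD_P95_MS)  # 17.2
spike_vs_p50 = times_slower(SPIKE_MS, PROD_P50_MS)  # 25.8

# Each smoke run against the production p95 and median.
smokes_vs_p95 = {name: times_slower(ms, PROD_P95_MS) for name, ms in SMOKES_MS.items()}
smokes_vs_p50 = {name: times_slower(ms, PROD_P50_MS) for name, ms in SMOKES_MS.items()}

print(spike_vs_p95, spike_vs_p50)
print(smokes_vs_p95)  # {'Sonnet 4.6': 6.8, 'Haiku 4.5': 4.5}
print(smokes_vs_p50)  # {'Sonnet 4.6': 10.2, 'Haiku 4.5': 6.7}
```

Each smoke is a single run rather than a distribution, so these ratios compare one CMA datapoint per model against a production percentile.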

Why the spike's number was misleading

Single open-ended datapoint. The 483 s came from one prompt, deliberately chosen to stress the workload. Production users do not write prompts that open-ended.
Spike compared against a no-cache Messages-API harness. The cumulative 3.79 M input-tokens-with-no-cache figure was real but is also fixable in the Messages-API impl by setting cache_control on the system prompt and tool definitions. kendo's current production ResearchAction (laravel/ai) doesn't enable caching either, but its workload is bounded enough that this barely shows up — 19 s median doesn't have room for the runaway re-send pattern the spike's outlier prompt triggered.
CMA's structural advantage (composing find | xargs grep in-container) is real but only matters when the agent has enough work to need it. On a 19 s workload, CMA's session-creation + container-boot overhead dominates — exactly the regime where Messages API wins.
The spike author noted "one datapoint isn't a tight measurement, but the gap is large enough that statistical noise can't close it." Statistical noise wasn't the problem — workload selection was. The chosen prompt was 25× heavier than the production median.

Methodology lessons

Always query production telemetry before trusting a spike

ai_outbound_logs exists. It has been writing feature='agent_research' rows for months. A 5-minute query against it would have shown the spike's headline was off by an order of magnitude. The correction took less than 10 minutes once the query was actually run.

Rule of thumb for future AI-workload spikes: whenever a feature already has Channel-1 audit logs (AiOutboundLogger), the planning step must include a SELECT response_time_ms, input_tokens, output_tokens FROM ai_outbound_logs WHERE feature=? AND status='success' over a representative window. Replace the spike's headline with that distribution; use the spike only to characterise behaviour the production data can't (e.g., a new model, a new tool surface, a new caching strategy).
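
A minimal sketch of that planning-step query, written against Python's built-in sqlite3 module so that it is self-contained. The table and column names come from this note; the function name, the 30-day default, and the assumption that created_at is stored as 'YYYY-MM-DD HH:MM:SS' text are illustrative. The production database and its driver will differ, so treat this as a shape to adapt rather than the actual tooling.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Parameterised version of the query described above. The cutoff is computed in
# Python instead of with a database-specific INTERVAL expression.
QUERY = """
SELECT response_time_ms, input_tokens, output_tokens
FROM ai_outbound_logs
WHERE feature = ? AND status = 'success' AND created_at >= ?
ORDER BY response_time_ms
"""


def fetch_latency_rows(conn, feature, window_days=30, now=None):
    """Return (response_time_ms, input_tokens, output_tokens) rows for one feature.

    Only successful calls inside the window are returned, fastest first.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=window_days)).strftime("%Y-%m-%d %H:%M:%S")
    return conn.execute(QUERY, (feature, cutoff)).fetchall()
```

Recording the window and the row count next to the result keeps the baseline reproducible, in the same way the Source line above states n=36 and a 30-day window.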

Single-datapoint benchmarks are fine — for cost, not for latency

Cost scales roughly with token volume, which is dominated by the prompt and tool surface, and a single carefully chosen prompt can characterise that. Latency depends on workload distribution, which can't be inferred from one prompt regardless of how representative it feels. The spike's cost claim (20.7× cheaper) survives the production-telemetry cross-check because token volume on the chosen prompt is plausibly representative. The latency claim doesn't.

"AC ≤ 1.5× current" is only meaningful if "current" was measured from production

PLAN.md AC #7 says "p95 wall time stays ≤ 1.5× current ResearchAction." That AC was written against the spike's 483 s figure, which made it trivially passable by anything. Against the actual production p95 (28 s), the same AC is failed by every CMA smoke we've run. The lesson: when an AC contains a multiplier of "current," the AC writer must specify which measurement of current it's gating against — production telemetry, spike, or theoretical lower bound — and bake that source into the AC line.
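
To make the point concrete, here is a small sketch of a latency gate that carries its baseline source with it. The class and field names are hypothetical; the numbers are the ones quoted in this note. The smokes are single runs rather than a measured p95, so this illustrates the gate, not a formal AC evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyGate:
    """An 'at most multiplier x current' gate that records where 'current' came from."""

    baseline_ms: float
    baseline_source: str
    multiplier: float = 1.5

    @property
    def threshold_ms(self):
        return self.baseline_ms * self.multiplier

    def passes(self, candidate_ms):
        return candidate_ms <= self.threshold_ms


# Same AC wording, two different meanings of "current".
prod_gate = LatencyGate(
    28_036, "ai_outbound_logs p95, feature='agent_research', success, 30 days, n=36"
)
spike_gate = LatencyGate(483_000, "spike production-prompt benchmark, single run")

smokes_ms = {"Sonnet 4.6": 191_000, "Haiku 4.5": 125_000}
for name, ms in smokes_ms.items():
    print(name, "prod gate:", prod_gate.passes(ms), "spike gate:", spike_gate.passes(ms))
```

Against the production baseline the threshold is 42 054 ms and both smokes fail; against the spike figure the threshold is 724 500 ms and both pass. The AC text is identical in both cases, which is why the source has to be part of the AC line.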

Implications for this PR (KD-0650)

Not changed tonight — leaving the branch as-is. Decision pending tomorrow. The honest options are:

Ship as-is, accept the latency regression for the cost win. Defensible if the cost win is large enough on real production traffic that it justifies users waiting ~125 s instead of ~19 s. Cost gate AC #6 is still likely to pass; latency gate AC #7 is not.
Ship side-by-side with an env-var dispatcher (the proposal in the side-by-side discussion thread). Default to legacy, flip CMA on for specific tenants/projects, gather real-world cost + latency telemetry, decide later.
Don't ship the migration. The spike's headline argument was wrong; the cost win on real workloads may not be 20× because real workloads aren't 3.79 M-token monsters.
Re-design the migration around a different latency target. E.g., add a max_iterations: 8 cap to the Agent definition, see if it can land under 60 s while still producing usable JSON. Combined with Haiku, this might be reachable.
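
The side-by-side option could look roughly like the sketch below. It is written in Python only to show the dispatch logic; the real change would live in the application's own codebase. The environment variable name RESEARCH_CMA_TENANTS, the function name, and the backend labels are all hypothetical and do not refer to anything that exists in the branch.

```python
import os


def choose_backend(tenant_id, env=None):
    """Pick the research backend for one tenant.

    Defaults to the legacy Messages API path. CMA is used only for tenants
    listed in the (hypothetical) RESEARCH_CMA_TENANTS comma-separated variable.
    """
    env = os.environ if env is None else env
    raw = env.get("RESEARCH_CMA_TENANTS", "")
    cma_tenants = {item.strip() for item in raw.split(",") if item.strip()}
    return "cma" if tenant_id in cma_tenants else "legacy"
```

With an opt-in list, an unset or empty variable leaves every tenant on the legacy path, so the default behaviour stays unchanged until someone explicitly enables CMA for a tenant.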

Implications for Application B (Hand-to-Claude)

This is the bigger learning. Application B will involve much heavier autonomous workloads (planner + implementer + reviewer + opening PRs), and the comparable Messages-API baseline will not have Channel-1 telemetry yet because Application B is greenfield. Three takeaways:

Build the audit-log + telemetry query path first, before the spike, not after. If we can't measure production behaviour from logs, the spike must produce its own distribution (multiple prompts, multiple runs) — not a single headline number.
Be sceptical of "X× faster" headlines on workloads where the comparison baseline is a system that nobody has actually measured at production scale. The whole spike → migration plan → implementation → review chain operated on a number nobody had cross-checked.
Pre-register the latency target against a specific dataset. "≤ 1.5× current" only works if "current" is a documented number from a documented query over a documented window.
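
Where no production logs exist yet, the first takeaway implies a spike harness that records a distribution rather than one number. A minimal sketch, assuming a caller-supplied run_prompt callable that executes one prompt end to end; that callable, the function name, and the default of three runs per prompt are hypothetical.

```python
import time


def measure_distribution(run_prompt, prompts, runs_per_prompt=3, clock=time.perf_counter):
    """Time every prompt several times and return (prompt, wall_ms) samples.

    run_prompt is any callable taking a prompt string; its return value is ignored.
    clock is injectable so the harness can be tested without real waiting.
    """
    samples = []
    for prompt in prompts:
        for _ in range(runs_per_prompt):
            start = clock()
            run_prompt(prompt)
            samples.append((prompt, (clock() - start) * 1000))
    return samples
```

The resulting wall_ms values can then be summarised as percentiles, and the prompt list itself becomes part of the documented dataset that the latency target is registered against.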

Open questions for tomorrow

Run the same query for the other tenants in production (with appropriate authorisation) to confirm 19 s / 28 s isn't Script-specific.
Run the same query for status='error' rows — error paths may have very different timing (timeouts, retries) and could shift the picture.
Estimate real production cost-per-call by joining response_time_ms with input_tokens + output_tokens + cache_* fields (cache columns are now populated post-D8 for new rows). Compare with the smoke's CMA cost figures on the same workload shape.
Decide on the side-by-side question with this data in hand.
