Postmortems

Production-affecting bugs and edge cases that taught us something. Newest first.

The fix lives in code; what's preserved here is the root cause and the generalizable lesson — the part that disappears if you only read commits.


KD-0924 — Subdomain availability check ignores the domains table

  • Severity: medium
  • Symptom: CheckSubdomainAvailabilityAction reported a subdomain as available when it existed in domains.domain but not in tenants.database. The subsequent signup insert then 500'd on the domains_domain_unique constraint. Reproduced in prod.
  • Root cause: The action only queried tenants.database, but Domain is the canonical owner of the subdomain (the unique index lives on domains.domain). When tenants.database diverged from domains.domain — possible after a rolled-back/diverged provisioning run or operator edit — the check passed falsely while the insert collided.
  • Fix: Injected Domain alongside Tenant; execute() short-circuits to available: false on a tenants.database hit, then also checks domains.domain.
  • Lesson: An availability check must consult every table that owns the uniqueness invariant it's predicting — checking one of two tables that can diverge guarantees a false "available" the moment they drift. The DB unique index is the real contract; the pre-check has to query the same column(s) the index covers.
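
The lesson can be sketched in a few lines. This is an illustration in TypeScript rather than the project's PHP, with in-memory sets standing in for the two columns; `isSubdomainAvailable` and `SubdomainSources` are hypothetical names, not the real action's API.

```typescript
// Hypothetical in-memory stand-ins for tenants.database and domains.domain.
type SubdomainSources = {
  tenantDatabases: ReadonlySet<string>;
  domainNames: ReadonlySet<string>;
};

// A subdomain is available only if no source that owns the uniqueness
// invariant already holds it. Checking one source alone misses drift.
function isSubdomainAvailable(subdomain: string, sources: SubdomainSources): boolean {
  if (sources.tenantDatabases.has(subdomain)) return false; // short-circuit on the first hit
  return !sources.domainNames.has(subdomain);
}

// Drifted state: "orphan" has a domain row but no tenant row.
const drifted: SubdomainSources = {
  tenantDatabases: new Set(["acme"]),
  domainNames: new Set(["acme", "orphan"]),
};

console.log(isSubdomainAvailable("orphan", drifted)); // false: the domains source owns it
```

A check limited to `tenantDatabases` would report "orphan" as available, which is the false positive described above. Even with both sources consulted, the pre-check only predicts the outcome; the unique index remains the contract.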

KD-0920 — x-fs-cache-hashes header not CORS-exposed, killing cross-origin cache invalidation

  • Severity: medium
  • Symptom: On any cross-origin setup (every local dev install: script.localhost:3000 → :8000), the browser never handed the x-fs-cache-hashes response header to JS, so the cached-store wrapper saw no invalidation signal and lanes/labels/sprints stayed stale until a full refresh. The failure was invisible: the wrapper degrades to null silently.
  • Root cause: config/cors.php set 'exposed_headers' => []. Browsers expose only the seven CORS-safelisted response headers to cross-origin JS; a custom header is readable only if the server lists it in Access-Control-Expose-Headers, which Laravel's HandleCors emits only when exposed_headers is non-empty. Backend stamping and the SPA wrapper were each correct in isolation — the bug was the missing exposure entry between them.
  • Fix: Added 'x-fs-cache-hashes' to exposed_headers (a config constant, not an env knob — the header is a fixed non-sensitive protocol value).
  • Lesson: Stamping a custom response header does nothing for cross-origin JS unless the server also CORS-exposes it — the two are separate steps and a header present on the wire is still invisible to headers.get() without the expose list. Same-origin prod hid it; the first witness was every dev install. When a protocol depends on a custom header, the CORS expose entry is load-bearing, not optional.
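
To make the two-step nature concrete, here is a small model of which response headers cross-origin JS can read. It is a simplification written for illustration (`readableCrossOrigin` is a hypothetical helper, and the `*` wildcard form of Access-Control-Expose-Headers is deliberately not modelled); real browsers apply this filtering internally.

```typescript
// The seven CORS-safelisted response headers, always readable cross-origin.
const SAFELISTED = new Set([
  "cache-control", "content-language", "content-length",
  "content-type", "expires", "last-modified", "pragma",
]);

// Models the headers visible to cross-origin JS, given the value of
// Access-Control-Expose-Headers (undefined when the server omits it).
function readableCrossOrigin(
  responseHeaders: Record<string, string>,
  exposeHeader?: string,
): Record<string, string> {
  const exposed = new Set(
    (exposeHeader ?? "")
      .split(",")
      .map((h) => h.trim().toLowerCase())
      .filter((h) => h !== ""),
  );
  const visible: Record<string, string> = {};
  for (const [name, value] of Object.entries(responseHeaders)) {
    const key = name.toLowerCase();
    if (SAFELISTED.has(key) || exposed.has(key)) visible[key] = value;
  }
  return visible;
}

const onTheWire = { "Content-Type": "application/json", "x-fs-cache-hashes": "abc123" };

// Before the fix: the header is sent but filtered out for cross-origin JS.
console.log("x-fs-cache-hashes" in readableCrossOrigin(onTheWire)); // false
// After the fix: the expose list makes it readable.
console.log(readableCrossOrigin(onTheWire, "x-fs-cache-hashes")["x-fs-cache-hashes"]); // "abc123"
```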

KD-0919 — Cache-hash header stamped on too few routes to ever reach the SPA

  • Severity: medium
  • Symptom: The cached-store protocol's steady-state invalidation never fired during normal navigation. Client A mutated a sprint/epic/lane/label, the backend bumped the project's *_hash, but an open tab on client B kept serving the stale list — the refetch signal never arrived.
  • Root cause: Registration coverage, not logic. StampCacheHashesMiddleware was mounted on only five narrow routes (project show + the four cached-resource groups). The requests an SPA actually fires while navigating (board, backlog, issue show, comments) were in none of those groups, so the response carried no header. The signal was circular: the only responses announcing "sprints changed" were the sprint requests the wrapper had already decided to suppress.
  • Fix: Hoisted the middleware to the Route::prefix('projects') group (one registration covers every project-scoped route, current and future) and removed the five redundant inline mounts. The middleware already guards itself, leaving index/store responses header-free.
  • Lesson: A change-notification header is only useful on the responses the client actually requests in steady state — stamping it solely on the resource's own endpoints is circular, because those are exactly the requests the cache suppresses. Mount the signal across the whole navigation surface (group-level), and lean on the middleware's self-guards rather than narrow per-route registration that drifts as routes are added.

KD-0918 — Memoized cached stores go deaf to broadcasts after the first page unmount

  • Severity: high
  • Symptom: Sprints/epics/lanes/labels created or changed by another client stopped appearing live after the user's first in-app navigation — they surfaced only after a full refresh. Per-page live data (board, comments, time entries) kept updating fine; only the four project-scoped cached stores went deaf.
  • Root cause: subscribeWithAutoCleanup unconditionally called onScopeDispose(stop) (correct for per-page subscriptions). But the four cached stores are memoized per project and subscribe exactly once, at store-creation time — which runs inside the setup() of whichever component first calls the make…Store factory. When that component unmounted, the scope disposed, stop() fired, and the listener was removed; the memoized store stayed cached but never re-subscribed (fs-adapter-store subscribes once, at construction). Introduced by KD-0680's onScopeDispose auto-cleanup, which optimised per-page teardown without accounting for the memoized-singleton lifetime.
  • Fix: Persistent subscribe + evict-on-leave: a scope-free subscribeProjectChannelPersistent plus a per-project onLeaveProject registry. The four stores subscribe persistently and register an eviction callback that drops their memoized instance on leaveAllProjectChannels/resetEcho, so a revisit rebuilds and re-subscribes. Per-page subscriptions keep scope-bound teardown.
  • Lesson: A subscription's lifetime must match the lifetime of the thing it feeds — scope-bound (onScopeDispose) cleanup is correct for per-component data but wrong for a memoized singleton whose listener should live as long as the cache entry. When you add an auto-cleanup optimisation, audit every consumer whose lifetime is not the mounting component's, or the optimisation silently kills long-lived subscriptions on the first unmount.
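
The persistent-subscribe plus evict-on-leave shape can be sketched without Vue or Echo. Everything below is a hypothetical stand-in (`Channel`, `makeStore`, `leaveProject`), not the real `subscribeProjectChannelPersistent` or `onLeaveProject` code; it only shows the lifetime rule.

```typescript
type Listener = (payload: string) => void;

// Hypothetical stand-in for a project broadcast channel.
class Channel {
  private listeners = new Set<Listener>();
  subscribe(fn: Listener): () => void {
    this.listeners.add(fn);
    return () => { this.listeners.delete(fn); };
  }
  broadcast(payload: string): void {
    this.listeners.forEach((fn) => fn(payload));
  }
}

type Store = { items: string[] };

const stores = new Map<string, Store>();               // memoized per project
const onLeave = new Map<string, Array<() => void>>();  // eviction registry

// The store subscribes once, at creation, with no component scope attached.
// Its listener is removed only when the project is left and the entry evicted.
function makeStore(projectId: string, channel: Channel): Store {
  const cached = stores.get(projectId);
  if (cached) return cached;

  const store: Store = { items: [] };
  const stop = channel.subscribe((payload) => { store.items.push(payload); });
  stores.set(projectId, store);

  const callbacks = onLeave.get(projectId) ?? [];
  callbacks.push(() => { stop(); stores.delete(projectId); });
  onLeave.set(projectId, callbacks);
  return store;
}

// Leaving the project stops listeners and drops the memoized instances,
// so a revisit rebuilds the store and subscribes again.
function leaveProject(projectId: string): void {
  (onLeave.get(projectId) ?? []).forEach((cb) => cb());
  onLeave.delete(projectId);
}
```

Because nothing here is tied to a component scope, unmounting whichever page first created the store has no effect on the listener; only `leaveProject` ends it, and it ends the cache entry at the same time.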

KD-0889 — FilterBar search input has no inline clear (✕) button

  • Severity: low
  • Symptom: The issues-tab filter-bar search input had no inline ✕ to clear the term — users had to select-and-delete or use the separate global "Clear all." The reports page's search already had one, so the two bars felt inconsistent.
  • Root cause: FilterBar.vue's search <input> was bound to the model with no per-input clear control. The inline-clear pattern existed only in the sibling SearchFilter.vue and was never carried into FilterBar.
  • Fix: Added a v-if="searchTerm" ✕ button mirroring SearchFilter's clear control; clicking empties the model, which both hides the button and clears the filter (the search term is the filter). Lands across all 9 pages that mount the shared bar.
  • Lesson: When two sibling components present the same affordance, a pattern added to one but not the other reads as a regression — shared UI affordances should be lifted to the shared component or mirrored deliberately, not implemented per-page.

KD-0882 — ProfileSidebar spec leaks a post-teardown dynamic import, flaking CI

  • Severity: low
  • Symptom: The Test tenant-core frontend CI job intermittently exited code 1 even though all assertions passed, blocking PRs until a manual rerun. The failure was an unhandled EnvironmentTeardownError.
  • Root cause: The spec called the defineAsyncComponent factory (() => import('ProfilePictureForm.vue'), which transitively pulls the browser-only browser-image-compression module) without awaiting the returned promise. The dynamic import raced Vitest's environment teardown; when teardown won, the late-resolving import was recorded as an unhandled rejection.
  • Fix: await the factory call so the import resolves within the test's lifetime.
  • Lesson: An un-awaited promise in a test — especially a dynamic import() — races the runner's environment teardown and surfaces as an intermittent, assertion-passing CI failure. Any async work a test triggers must complete before the test ends, or it leaks into teardown as a flake.

KD-0878 — Epic Board badge stops updating on remote drags (stale positions payload shape)

  • Severity: medium
  • Symptom: The Epic Board's per-epic open/closed badge stopped updating live when other users moved issues; it stayed stale until a full reload. The three issue tabs handled the same broadcast correctly.
  • Root cause: Since KD-0789 the backend IssuePositionEvent broadcasts a single {position} payload on the positions event. The Epic Board's handler still destructured the pre-KD-0789 array shape {positions} and looped it, so positions was undefined and for (const position of positions) threw before any state applied. The issue tabs had been migrated to the single-{position} shape (centralised in useIssueLiveSync); this call site was missed because the Epic Board hand-rolled the same five-event protocol inline instead of consuming the composable. The spec hid it by pinning the old array shape.
  • Fix: Adopted useIssueLiveSync on the Epic Board (widened generic over item type), replacing the five inline Echo registrations with one composable call so the broadcast contract lives in exactly one place.
  • Lesson: A wire-shape change ripples into every consumer that re-implements the protocol by hand — the call site that hand-rolls what a shared composable already does is the one that gets missed when the shape changes. Centralising the contract in one composable is the fix and the prevention. And a test that pins the old shape keeps a broken consumer green while prod throws.
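
One way to keep the contract in a single place is a reader that every consumer goes through and that fails loudly on any other shape. This is a sketch, not the `useIssueLiveSync` implementation; the `Position` fields and `readPositionEvent` are hypothetical.

```typescript
// Hypothetical fields; the real payload shape lives in the backend event.
type Position = { issueId: number; laneId: number };

function isPosition(value: unknown): value is Position {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.issueId === "number" && typeof v.laneId === "number";
}

// The one place that knows the wire shape of the `positions` event.
// The pre-KD-0789 array shape ({ positions: [...] }) is rejected with a
// descriptive error instead of failing later inside a loop.
function readPositionEvent(payload: unknown): Position {
  if (typeof payload === "object" && payload !== null) {
    const position = (payload as Record<string, unknown>).position;
    if (isPosition(position)) return position;
  }
  throw new TypeError("positions event: expected a single { position } payload");
}
```

A spec that builds its fixtures through the same type and reader cannot pin a stale shape without failing.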

KD-0872 — Bearer-token auth failures return a 302 redirect instead of 401 JSON

  • Severity: medium
  • Symptom: API/MCP token clients (CLI, VS Code extension, feedback button) carrying a revoked/expired/unknown Bearer token got a 302 redirect to the login page instead of a machine-readable 401. The client had no actionable signal to re-authenticate — failures were silent. This was the delivery path for the 2026-05-29 feedback-loss incident (the balloon followed the redirect into a fake 201).
  • Root cause: Passport's TokenGuard catches the OAuthServerException internally and returns null, so auth:sanctum throws AuthenticationException. Laravel's default renderer checks $request->expectsJson(); token clients don't send Accept: application/json, so the check fails and the request is redirected — correct for browser navigation, wrong for machine clients on /api/* and /mcp/*. (The issue's premise that OAuthServerException reaches the handler was wrong; the guard swallows it.)
  • Fix: Added a render() for AuthenticationException returning {code: 'TOKEN_INVALID', ...} 401 JSON for api/* / mcp/* paths that don't already expect JSON; passes through (returns null) otherwise so SPA/JSON-client behaviour is unchanged. Mirrors KD-0739's AccessDeniedHttpException expectsJson() scoping.
  • Lesson: The framework default of "redirect non-JSON auth failures to login" is correct only for browsers; machine clients on API/MCP paths need a structured 401, and they don't send Accept: application/json. Path-scope the override (api/* or mcp/* paths, combined with !expectsJson()) so browser flows keep redirecting. Diagnose where the exception is actually thrown, not where the issue claims (the guard caught the OAuth exception two layers up).

KD-0870 — project_tokens.active not synced when the backing PAT is revoked

  • Severity: medium
  • Symptom: Bulk-revoking a user's personal access tokens (e.g. on 2FA enablement) flipped oauth_access_tokens.revoked = 1 but left the matching project_tokens.active = true. The Project Tokens UI reads active, so it kept showing dead tokens as live; operators couldn't tell a working token from a revoked one without querying the DB. Confirmed in prod after the 2026-05-29 sweep.
  • Root cause: RevokeUserTokensAction had no dependency on ProjectToken — it knew only oauth_access_tokens. The other revocation path, DeleteProjectTokenAction, kept both tables in sync; only this path desynced, because the two paths were written independently and only the delete path was built for cross-table consistency.
  • Fix: Injected ProjectToken; after the revocation loop, set active = false on any project_tokens rows whose token_id is in the revoked set, inside the same transaction.
  • Lesson: When two code paths both invalidate the same entity, both must maintain every derived/mirrored column the entity owns — one path keeping a denormalised flag in sync while the sibling forgets it guarantees the UI shows a state that contradicts the source of truth. New invalidation paths must be checked against the full set of side-effects the canonical path performs.

KD-0858 — Board card can't be moved: deterministic rank collision dead-ends MoveIssueAction

  • Severity: high
  • Symptom: Certain board cards refused to move — every drag snapped back, no matter how many retries. The backend returned 409 (RankCollisionException) and the FE reverted. Surfaced in prod (Nightwatch #168). Two coupled defects: the 409 toast showed an empty body ({"message":""}), and the user-facing copy was hardcoded per-HTTP-status in the FE rather than coming from the backend exception.
  • Root cause: Rank::between is fully deterministic (base-26 midpoint, no randomness). When a project's rank space is degraded (zero-width gap, stale neighbour ids) so the midpoint lands on an existing (project_id, rank) UNIQUE value, the 3-attempt retry loop recomputes the same value every time and collides identically — the loop only resolves transient concurrent collisions, never deterministic ones. KD-0808's respread recovery was wired in execute() for RankTooLongException only, so overflow self-healed but a stuck gap dead-ended. The empty body: CustomException subclasses declare copy via a protected $message default, but PHP's Exception::__construct overwrites that default with '' the instant it runs with any argument — including previous: only — and the leak-safe new X(previous: $cause) throw (mandated to avoid leaking the MySQL duplicate-key string) was therefore silently incompatible with the property-default-message pattern.
  • Fix: (1) Widened the reactive-recovery catch in MoveIssueAction and BulkMoveAction to RankTooLongException | RankCollisionException → respread + one bounded retry. (2) Gave CustomException a constructor falling back to $this->message when no explicit message is passed. (3) dontReportWhen filter dropping handled <500 CustomExceptions from the monitor. (4) FE renders the backend exception's data.message instead of a hardcoded status→string map.
  • Lesson: A retry loop only helps when the inputs change between attempts — retrying a deterministic computation reproduces the same collision forever; deterministic exhaustion needs a state-changing recovery (respread), not a retry. Separately: PHP's Exception::__construct clobbers a subclass's protected $message default to '' whenever called with any named/positional argument, so a previous:-only throw silently ships an empty message — override the constructor to restore the declared default. And the backend that owns what failed should own the words the user sees; duplicating copy in the SPA per status code drifts.
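
The retry-versus-recovery point is easy to reproduce with a toy model. The sketch below uses integers in place of the base-26 rank strings, and `between`, `retryOnly`, and `respread` are simplified stand-ins, not the real Rank API.

```typescript
// Deterministic midpoint, standing in for Rank::between.
function between(lo: number, hi: number): number {
  return Math.floor((lo + hi) / 2);
}

// Retrying without changing the inputs recomputes the same colliding value.
function retryOnly(taken: Set<number>, lo: number, hi: number, attempts = 3): number | null {
  for (let i = 0; i < attempts; i++) {
    const rank = between(lo, hi);
    if (!taken.has(rank)) return rank;
  }
  return null; // every attempt collided identically
}

// Respread: reassign evenly spaced ranks so every gap is wide again.
function respread(ranks: number[], step = 100): number[] {
  return ranks.map((_, i) => (i + 1) * step);
}

// Degraded rank space: neighbours 10 and 11 leave a zero-width gap,
// so the midpoint lands on the existing rank 10.
const degraded = [10, 11];
console.log(retryOnly(new Set(degraded), 10, 11)); // null

// State-changing recovery, then one bounded retry.
const healed = respread(degraded); // [100, 200]
console.log(retryOnly(new Set(healed), healed[0], healed[1], 1)); // 150
```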

KD-0852 — My Issues badge shows a stale lane after an agent (MCP) lane change

  • Severity: medium
  • Symptom: When an agent moved an issue to a different lane via the MCP path (UpdateIssueTool, or start-work-on-issue), the My Issues page kept showing the old lane in its Status badge until a manual refresh. The broadcast fired and the row otherwise updated — only lane_title/lane_color were stale. The same move through the web UI updated correctly.
  • Root cause: UpdateIssueAction assigns the new lane_id, saves, then broadcasts via resources that read lane_title/lane_color off the in-memory lane relation, hydrating with loadMissing. UpdateIssueTool resolves the issue with lane already eager-loaded. Changing the lane_id FK does not refresh an already-loaded belongsTo relation, and loadMissing is a no-op when the (now stale) relation is present — so the payload carried the new lane_id but the old lane_title/lane_color. The web path route-model-binds without preloading lane, so loadMissing fetched it fresh — hence agent-specific.
  • Fix: Added a single $issue->refresh() after the audit-log write and before the two broadcast calls, dropping stale relation caches so the resources re-read the new lane. Rejected per-relation unsetRelation() (mock churn, only covers listed relations) and tool-level eager-load removal (leaves the root cause for other callers).
  • Lesson: Mutating a foreign key does not refresh an already-loaded belongsTo relation, and loadMissing won't re-fetch what's already (stalely) present — so a serializer that reads through the relation emits stale nested data whenever a caller preloaded it. A broadcasting Action that mutates FKs must refresh() (or unset the affected relations) before serializing. The bug is caller-dependent: it only appears for the path that preloads, which is why the web UI looked fine while the agent path broke.

KD-0848 — Filter bar hijacks Cmd/Ctrl+F, blocking the browser's native find

  • Severity: low
  • Symptom: On any page rendering the shared FilterBar (9 call sites), pressing Cmd/Ctrl+F opened the Kendo filter popover instead of the browser's native find-in-page. The handler called preventDefault(), so users lost the universal browser find shortcut everywhere the bar appeared.
  • Root cause: FilterBar.vue's isFilterShortcut matched (metaKey || ctrlKey) && key === 'f' and the keydown handler preventDefault()'d before opening the popover. Cmd/Ctrl+F is the browser's universal find accelerator.
  • Fix: Rebound the shortcut to a bare / (the web-standard search key), freeing Cmd/Ctrl+F to fall through untouched. The existing form-control guard already suppresses it while typing.
  • Lesson: Don't bind app shortcuts to the browser's reserved accelerators (Cmd/Ctrl+F/N/T/W…) — preventDefault'ing them strips a universal capability on every page the component mounts. The bare / is the conventional in-app find/search key; reserved-modifier combos belong to the browser.
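
A matcher along these lines keeps modifier combos out of the app's hands. It is a sketch, not the FilterBar.vue code: `KeyLike` is a hypothetical structural type so the example runs outside a browser, and `targetTag` stands in for however the real form-control guard inspects the focused element.

```typescript
// Minimal structural type so the sketch runs outside a browser.
type KeyLike = {
  key: string;
  metaKey: boolean;
  ctrlKey: boolean;
  altKey: boolean;
  targetTag?: string; // hypothetical: tag name of the focused element
};

const FORM_CONTROLS = new Set(["INPUT", "TEXTAREA", "SELECT"]);

function isFilterShortcut(e: KeyLike): boolean {
  // Modifier combos (Cmd/Ctrl+F and friends) belong to the browser or OS.
  if (e.metaKey || e.ctrlKey || e.altKey) return false;
  // While the user is typing in a form control, "/" is just a character.
  if (e.targetTag !== undefined && FORM_CONTROLS.has(e.targetTag)) return false;
  return e.key === "/";
}
```

The handler should call preventDefault() only after this returns true, so every other key press, including the browser's find shortcut, falls through untouched.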

  • Severity: low
  • Symptom: In the RichTextArea comment/description editor, links rendered with no underline and clicking one navigated away instead of letting the user select/edit it. Read-only rendered prose was correct.
  • Root cause: Two config/styling gaps. TipTap v3 StarterKit bundles extension-link, whose Link mark defaults to openOnClick: true; RichTextArea registered StarterKit with no config. Separately, no .tiptap a CSS rule existed (editor content styles live in markdown.css), while read-only prose was underlined via .description-prose a.
  • Fix: StarterKit.configure({link: {openOnClick: false}}) so editor clicks edit rather than navigate; added a .tiptap a rule mirroring .description-prose a. CSS-only, not a link HTMLAttributes class, so the rendered <a> markup is unchanged and the existing real-render assertion stayed green.
  • Lesson: A bundled extension's defaults are inherited silently when you register the kit with no config — TipTap StarterKit's Link defaults openOnClick: true. And the editor surface (.tiptap) and read-only surface (.description-prose) are separate style scopes; a prose rule does not cover the editor. Check both the behaviour defaults and the per-surface CSS coverage.

KD-0841 — Logging time for a past date takes too many interactions

  • Severity: low
  • Symptom: The "Log Time" modal's "Started At" field opened empty (a bare native <input type="datetime-local">, default startedAt: null), so logging against a recent past day forced the user to hand-type every segment or click backward through the native calendar — described as "absurdly long."
  • Root cause: The field had no shortcut affordance and opened empty because new entries default startedAt: null. The friction was purely the date-selection affordance — nothing about calculation or storage.
  • Fix: Prefill "Started At" with now() when the create modal opens (logging against today is now zero interactions; a past date is reached by editing a populated field). The independent "auto-calculate start time" checkbox (back-dates start to now() − duration) was briefly removed as redundant, then restored once it became clear the removal had misread its distinct purpose, and renamed for clarity. Day-preset buttons from an earlier iteration were dropped.
  • Lesson: An empty input is the worst default for an "edit a near value" task — prefilling the common case (now) removes the build-from-empty friction more simply than adding preset buttons. Watch for two controls with overlapping-but-distinct purpose (prefill vs back-date-on-duration): removing one as "redundant" can quietly delete a different behaviour.

KD-0839 — PDF attachment preview renders blank (iframe sandbox blocks JS)

  • Severity: medium
  • Symptom: Clicking a PDF attachment opened the preview modal but the PDF never rendered — the iframe loaded the blob URL yet stayed blank. Images previewed fine (they use <img>, not an iframe).
  • Root cause: The PDF <iframe> had sandbox="allow-same-origin". A sandbox attribute without allow-scripts blocks all JavaScript inside the iframe, and both Chrome's native PDF viewer and Firefox's PDF.js need JS to initialise — so the blob loaded but rendered nothing.
  • Fix: Removed the sandbox attribute. Blob URLs from URL.createObjectURL() are ephemeral and tab-local and the content is our own server's — no cross-origin surface to sandbox.
  • Lesson: sandbox without allow-scripts silently disables the JS that built-in PDF viewers depend on — a sandbox tight enough to block scripts blocks the very feature you're embedding. Don't sandbox an iframe whose source is a same-origin/tab-local blob you produced; there's nothing to isolate.

KD-0838 — AsyncErrorBoundary shows a fatal "Could not load page" for transient/user-action failures

  • Severity: medium
  • Symptom: The shared AsyncErrorBoundary rendered a full-page fatal "Could not load page." for any captured error except EntryNotFoundError. Two everyday events tripped it: a failed comment submission, and clicking an issue then hitting Back before it loaded (the fatal screen then persisted on the /board route that had loaded fine).
  • Root cause: Two defects. (1) Sticky boundary state — onErrorCaptured set hasError = true and never cleared it, and ProjectLayout keeps one boundary instance alive across in-project tab switches (the RouterView child swaps, the boundary doesn't), so an error raised loading one route latched and poisoned the next. (2) submitComment awaited create() with no try/catch, so a rejected POST propagated through onErrorCaptured into the boundary. There is no request-cancellation infra in the FE, so the "aborted load" was the reporter's mental model, not a literal CanceledError.
  • Fix: Initially: reset hasError on navigation via a resetKey prop (the boundary can't read the route — the app skips app.use(router)) plus a local try/catch on submitComment. Per PR review, replaced the per-handler guard with a systemic discrimination: onErrorCaptured now inspects Vue's info arg — 'native event handler'/'component event handler' errors propagate untouched (toast middleware surfaces them, user stays on the page), while async-setup/render/lifecycle/watcher errors still latch the fatal screen. resetKey made required so a future consumer can't forget it.
  • Lesson: An error boundary must distinguish a page-load failure (fatal screen is right) from a user-action failure (stay on the page, toast it) — onErrorCaptured fires for every descendant error including event handlers, so a boundary that treats all errors as fatal turns any failed submit into a full-page crash. Discriminate by Vue's info source. And latched boundary state must clear on navigation, or one route's error poisons its siblings. Prefer a required prop over an optional one for a guard that prevents a known bug (a future consumer can't silently forget it).
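
The discrimination can be sketched as a pure function, detached from Vue so it runs anywhere. `handleCaptured`, `resetOnNavigation`, and `BoundaryState` are hypothetical names, and the sketch assumes `info` carries the two strings named in the fix; whether the real boundary returns false when it latches is also an assumption.

```typescript
// Info strings for errors thrown inside event handlers, as named in the fix.
const USER_ACTION_SOURCES = new Set(["native event handler", "component event handler"]);

type BoundaryState = { hasError: boolean };

// Mirrors the onErrorCaptured return contract: returning false stops
// propagation, any other return lets the error continue upward.
function handleCaptured(state: BoundaryState, info: string): boolean {
  if (USER_ACTION_SOURCES.has(info)) {
    return true; // user-action failure: propagate so it is toasted, stay on the page
  }
  state.hasError = true; // page-load failure: latch the fatal screen
  return false;
}

// Called when the required resetKey changes, so one route's error
// does not poison the next route rendered inside the same boundary.
function resetOnNavigation(state: BoundaryState): void {
  state.hasError = false;
}
```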

KD-0837 — Bar Color picker dims unselected swatches, distorting their hue

  • Severity: low
  • Symptom: In the epic "Bar Color" picker, selecting a colour appeared to change the colour of the other swatches (olive→yellow, orange→brown), as if the picker applied the wrong colour. Both report screenshots were of the picker itself, differing only in which swatch was at full opacity.
  • Root cause: Unselected swatches were dimmed with op-50 (whole-element opacity) over a saturated fill. CSS opacity composites the fill against whatever is behind it — the page background, which differs per theme — so a saturated colour at 50% over a dark page mixes toward black and shifts perceived hue (not just brightness). The stored value was always correct; the defect was purely the picker's render.
  • Fix: Replaced the bespoke opacity-dimmed swatch grid with the SingleSelect colour dropdown already used for lanes/labels, which renders colour names (text) and never a dimmed swatch — sidestepping the root cause entirely.
  • Lesson: Whole-element opacity on a coloured fill is theme-dependent by construction — it blends with the page background, so a "dimmed" selection indicator shifts the perceived hue differently in light vs dark mode. Indicate selection without touching the fill's alpha (a border, a check, or text-based selection), or the colour the user sees is a lie.
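
The arithmetic behind the hue shift is a per-channel blend of the fill with whatever sits behind it. The sketch below is a simplified model (`composite` is a hypothetical helper that blends the 0-255 channel values directly and assumes an opaque backdrop); the colours are examples, not the picker's actual palette.

```typescript
type Rgb = [number, number, number];

// Source-over compositing of a fill at the given opacity onto an opaque backdrop.
function composite(fill: Rgb, backdrop: Rgb, opacity: number): Rgb {
  const mix = (f: number, b: number): number => Math.round(f * opacity + b * (1 - opacity));
  return [mix(fill[0], backdrop[0]), mix(fill[1], backdrop[1]), mix(fill[2], backdrop[2])];
}

const orange: Rgb = [255, 165, 0];

// Same swatch, same 50% opacity, two page backgrounds.
console.log(composite(orange, [0, 0, 0], 0.5));       // [128, 83, 0]: reads as brown
console.log(composite(orange, [255, 255, 255], 0.5)); // [255, 210, 128]: reads as pale peach
```

The stored colour never changes; only the blend does, which is why the same "dimmed" swatch looks like a different colour per theme.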

KD-0836 — Time-log summary cards bucket by logging date, not work date

  • Severity: medium
  • Symptom: On the Time Entries page the Today/Yesterday/Avg-per-day summary cards didn't reconcile with the filtered table — the cards counted hours on days that had no visible rows. Same dataset, different date axis.
  • Root cause: The summary helpers (filterByPeriod, getUniqueDaysCount) bucketed each entry on createdAt (when the entry was recorded) while the table presented each entry under startedAt ?? createdAt (when the work happened). When time is logged after the fact the two axes diverge, so the per-day cards counted hours on days the table never showed. The reporter's "summary uses a different dataset" hypothesis was wrong — same filtered dataset, wrong field.
  • Fix: Bucket the summary helpers on startedAt ?? createdAt, matching the table's date column. Frontend-only; the backend range filter still uses created_at (a broader product question parked).
  • Lesson: Two views over the same dataset must bucket on the same field, or they'll disagree without either being "wrong" — a summary and its table reconciling depends on a shared date axis (work date vs logging date), not just a shared filter. When numbers don't add up, suspect the axis before the dataset.
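
A shared date axis amounts to one helper that both the summary and the table call. This is a sketch with hypothetical names (`TimeEntry`, `workDay`, `hoursByDay`); it assumes ISO-8601 timestamp strings and ignores timezone conversion.

```typescript
type TimeEntry = { hours: number; createdAt: string; startedAt: string | null };

// The work date: when the work happened, falling back to when it was logged.
function workDay(entry: TimeEntry): string {
  return (entry.startedAt ?? entry.createdAt).slice(0, 10); // YYYY-MM-DD prefix
}

// Summary totals keyed on the same axis the table displays.
function hoursByDay(entries: TimeEntry[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const entry of entries) {
    const day = workDay(entry);
    totals.set(day, (totals.get(day) ?? 0) + entry.hours);
  }
  return totals;
}

// Two hours worked on the 1st but logged on the 3rd, plus one hour logged live.
const entries: TimeEntry[] = [
  { hours: 2, createdAt: "2026-05-03T09:00:00Z", startedAt: "2026-05-01T14:00:00Z" },
  { hours: 1, createdAt: "2026-05-03T10:00:00Z", startedAt: null },
];
console.log(hoursByDay(entries)); // 2026-05-01 => 2, 2026-05-03 => 1
```

Bucketing on `createdAt` alone would put all three hours on the 3rd, a day on which the table shows only one.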

KD-0817 — Issue deletion fails on FK constraint for Hand-to-Claude tables

  • Severity: high
  • Symptom: Deleting an issue threw a 1451 FK-constraint violation whenever any Hand-to-Claude row referenced it (claude_issue_eligibilities, its criteria children, or claude_sessions). The same gap hit issue_label, and DeleteProjectAction was additionally missing issue_watchers cleanup and the reports.promoted_issue_id nullify the single/bulk paths already did.
  • Root cause: KD-0658 shipped three tenant-DB tables with restrictOnDelete() FKs to issues but didn't update DeleteIssueAction, BulkDeleteIssuesAction, or DeleteProjectAction. The arch gate that would catch this (CascadeRelationsTest) only walks HasMany/HasOne/MorphMany relations declared on the model — and Issue had no claudeSessions()/eligibility() relation, so the gate had nothing to enforce against. KD-0803 added issue_label 16 days later with the same omission; KD-0709 seeding the tables made it reproduce in dev.
  • Fix: Added Issue::claudeSessions() HasMany + Issue::eligibility() HasOne and listed both in cascadeRelations() so the existing arch gate now catches future regressions. The three delete Actions guard on in-flight sessions (409), archive terminal-uncleaned sessions via a queued job capturing scalar IDs, delegate eligibility cleanup, and detach labels. DeleteProjectAction gained the missing issue_watchers + report-nullify steps.
  • Lesson: An arch gate that enforces "every relation is cleaned on delete" is blind to relations the model never declares — a new table with a restrictOnDelete FK is invisible to the audit until someone adds the corresponding relation. The structural fix is to declare the relation and list it in cascadeRelations() so the gate has teeth. The gate still doesn't cover BelongsToMany pivots (labels), so pivot omissions remain a known blind spot. (Same hand-maintained-cascade-list drift class as KD-0738.)

KD-0757 — Mention menu opens at page far-left on the first @-keystroke

  • Severity: medium
  • Symptom: Two positioning defects in the @-mention menu. (1) Typing @ flashed the menu at the far-left (~0,0) for one frame, then it snapped to the caret on the next character. (2) Once open, the menu didn't follow the caret when the page/editor scrolled.
  • Root cause: (1) mountMentionList appended the element before the async updatePosition applied coordinates — for one frame it had position: static and flowed to its container's top-left; on later keystrokes it already carried position: absolute, hence first-keystroke-only. (2) floating-ui's computePosition is one-shot; position was computed on open/update only, so the body-appended menu drifted from the caret inside its scroll container.
  • Fix: (1) Mount hidden+positioned: set position: absolute + visibility: hidden before appendChild, reveal once the first computed coords are applied (visibility: hidden, not display: none, keeps it measurable). (2) Position via floating-ui autoUpdate (runs immediately and on every scroll/resize) and return its cleanup, run on close/Escape so no listeners leak.
  • Lesson: An element positioned by an async callback flashes at its static-flow origin for the frames before coordinates land — mount it hidden-but-measurable and reveal only after the first compute. And one-shot computePosition doesn't track scroll; use autoUpdate (with a cleanup wired to close) when a floating element must stay glued to a moving anchor.

KD-0752 — Markdown (.md) attachments have no in-app preview

  • Severity: low
  • Symptom: Clicking a .md attachment thumbnail did nothing — it rendered as a generic file icon with only a download button. Upload and MCP fetch already worked; only the web preview failed.
  • Root cause: Markdown was absent from the frontend previewability path. isPreviewableMimeType returned true only for images + PDFs, so the thumbnail click never emitted preview for a .md, and the preview modal had no markdown branch (it would fall through to a raw <iframe>). Stored MIME for .md is unreliable (text/plain from finfo), so detection has to key off the .md/.markdown filename extension, not MIME.
  • Fix: Added isMarkdownFilename (extension-based), made the thumbnail previewable for markdown, and added a preview-modal branch that fetches the bytes through the auth'd download endpoint, reads the blob as text, and renders via the app's existing renderMarkdownDescriptionProse stack. Frontend-only.
  • Lesson: A previewability check keyed on MIME alone misses file types whose stored MIME is generic (.md → text/plain); detect by extension when the MIME is unreliable. And when an app already owns a renderer for a content type, wire the preview path into it rather than falling through to a raw iframe.
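
An extension-based check of this kind might look like the following. `isMarkdownFilename` is named in the fix, but this body is a guess at its shape; the case-insensitive match and the `isPreviewable` wrapper are assumptions made for illustration.

```typescript
// Detect markdown by filename, since the stored MIME is often just text/plain.
function isMarkdownFilename(filename: string): boolean {
  return /\.(md|markdown)$/i.test(filename.trim());
}

// Hypothetical combined check: MIME for images and PDFs, extension for markdown.
function isPreviewable(filename: string, mimeType: string): boolean {
  if (mimeType.startsWith("image/") || mimeType === "application/pdf") return true;
  return isMarkdownFilename(filename);
}

console.log(isPreviewable("notes.md", "text/plain"));  // true
console.log(isPreviewable("notes.txt", "text/plain")); // false
```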

KD-0644 — Empty toast container stays rendered on every page after toasts dismiss

  • Severity: low
  • Symptom: After the fs-toast 0.2.0 migration, the <div popover="manual"> toast container never disappeared once all toasts dismissed — it lingered as an empty fixed box on every route (pointer-events-none, so it didn't block clicks, but always present).
  • Root cause: fs-toast hides the closed container by calling el.hidePopover() and relying on the UA rule [popover]:not(:popover-open){display:none}. But App.vue applied a bare flex utility (→ display: flex) directly to that element. Author-origin display: flex always beats the UA-origin display: none in the cascade regardless of specificity, so the container never collapsed when closed. The flex/fixed/z-1050 attrs predated the migration and were harmless until the popover-based hide began depending on display.
  • Fix: Gated the display to the open state: replaced the unconditional flex with class="popover-open:flex", so display: flex applies only while :popover-open, and the UA rule hides the empty container.
  • Lesson: An author-origin display declaration unconditionally beats a user-agent display: none rule — so any styling that hard-sets display on a [popover] element defeats the Popover API's own hide. Scope display utilities to :popover-open when the hide is delegated to the UA rule. A previously-inert utility can become load-bearing the moment a dependency starts relying on the property it sets.

KD-0624 — CascadeRelationsTest skips Tenant and misses trait-provided relations

  • Severity: medium
  • Symptom: Audit-coverage defect, not a runtime crash: CascadeRelationsTest (the ADR-0002 gate ensuring every tenant model enumerates its cascade relations) was green precisely because two blind spots let it skip the cases it should catch. Un-blinding it surfaced four previously-invisible relations (Tenant::githubInstallations + three Passport relations on User).
  • Root cause: Two intentional skips. (1) Tenant was hardcoded into a $centralModels exclusion list justified as "never deleted via application logic" — false, since DeleteTenantAction cascades real relations. (2) The "all relations listed" test filtered out any relation contributed by a trait; but PHP flattens trait methods onto the using class (reflection reports getDeclaringClass() === Tenant/User), so those relations are reachable and were being thrown away — for every model, not just Tenant.
  • Fix: Removed the $centralModels exclusion and the trait-method filter, and replaced them with an explicit $nonCascadeRelations allowlist (each entry justified inline) so every discovered relation must be either in cascadeRelations() or acknowledged here. Verified by transiently dropping an entry and confirming the test then fails.
  • Lesson: A test exclusion justified by a comment ("never deleted", "trait-provided, skip") is a place bugs hide — the green suite was an artifact of the audit skipping its hardest cases. An audit should never silently drop a category; require every discovered item to be explicitly handled or explicitly acknowledged, so the skip list itself is reviewable. (Same exclusion-comment-rot theme as KD-0786.)
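
The "handled or acknowledged" rule can be sketched as a small audit helper. The real gate is a PHP test; this TypeScript version is only an illustration, and the function name and the relation names other than githubInstallations are hypothetical.

```typescript
// Hypothetical audit helper; names do not come from the real CascadeRelationsTest.
function findUnhandledRelations(
  discovered: string[],
  cascadeRelations: string[],
  nonCascadeRelations: Map<string, string>, // relation name -> written justification
): string[] {
  const cascade = new Set(cascadeRelations);
  // Nothing is filtered out up front: every discovered relation must appear
  // in one of the two lists, otherwise it is reported.
  return discovered.filter(
    (relation) => !cascade.has(relation) && !nonCascadeRelations.has(relation),
  );
}
```

A test built on this would assert that the returned list is empty, so removing an allowlist entry makes the relation show up again instead of disappearing silently.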

KD-0479 — Tooltip layout regressions and silent failures

  • Severity: medium
  • Symptom: Four regressions from the "bundle tooltips into components" pivot, caught in PR review. Tooltip.vue's wrapper <div> participated in layout, breaking call sites relying on flex/absolute positioning (ml-auto watch button, report-card copy button). IconButton/ReportListItem sniffed $attrs['aria-label'], producing empty tooltips and inaccessible buttons when callers used title= or omitted the label. DragElement re-introduced a trailing-space assignee label for single-name users.
  • Root cause: Tooltip.vue wrapped its slot in an inline-block <div> that the parent's flex/grid algorithm laid out as a real item, so layout-affecting attrs (ml-auto, absolute) landed on the inner <button>, not the layout participant. The $attrs['aria-label'] sniff was brittle — any caller using title= or omitting aria-label silently got an empty tooltip and a button with no accessible name.
  • Fix: Tooltip.vue wrapper switched to display: contents (anchoring floating-ui to firstElementChild) so it no longer participates in layout; IconButton got an explicit required label: string prop (sweeping ~25 call sites off the $attrs sniff); DragElement label changed to [firstName, lastName].filter(Boolean).join(' ').
  • Lesson: A wrapper element silently changes its children's layout context — a tooltip/HOC wrapper that isn't display: contents becomes a real flex/grid item and misplaces positioning utilities meant for the wrapped element. And sniffing a value off $attrs is a brittle implicit contract: make it an explicit, required prop so a missing label is a type error at the call site, not an empty tooltip + inaccessible button at runtime.
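
The DragElement label fix is easy to show in isolation. In this sketch the expression is the one quoted in the Fix above; the function names and the naive template-string version used for contrast are assumptions.

```typescript
// Assumed shape of the pre-fix label: a template string that always inserts the space.
function naiveAssigneeLabel(firstName: string, lastName: string): string {
  return `${firstName} ${lastName}`;
}

// Post-fix expression from the entry: empty, null, and undefined parts are dropped
// before joining, so a single-name user gets no trailing space.
function assigneeLabel(firstName: string, lastName?: string | null): string {
  return [firstName, lastName].filter(Boolean).join(" ");
}
```

For a user with only a first name, the naive version yields the name plus a trailing space, while the filtered version yields the name alone.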

KD-0807 — Multi-select drag in Backlog persists only one issue's move

  • Severity: medium
  • Symptom: Multi-selecting N issues in the Backlog and dragging one across sprint sections moved all N cards visually, but only ONE issue's change persisted. The other N−1 silently reverted on the next sync/refresh. The BulkActionBar "Move to" dropdown hit the same bug.
  • Root cause: Regression from KD-0789's fractional-rank drag rewrite. The legacy bulk-update shape posted N updates in one request; the new single-issue moveIssueForProject(issueId, payload) posts one move at a time, and the drag store's diff helper (findLaneChangedItem) returned on the first lane-changed item. So exactly one move request fired regardless of how many cards the user moved. The single-issue endpoint cannot carry N issues atomically.
  • Fix: Dedicated BulkMoveAction + POST /api/projects/{project}/issues/bulk-move (mirroring the precedent BulkAssignEpicAction), with a sprint-only {issue_ids, target_sprint_id, position} payload — each issue keeps its lane + epic, only sprint_id + rank change. N ranks spread logarithmically via Rank::spread. FE drag store gained a bulkUpdate() path that collects every lane-changed item.
  • Lesson: When you replace a bulk endpoint with a single-item one, audit every multi-select code path that fed the old shape — a diff helper that "returns the first changed item" silently drops the rest. And bulk operations need a dedicated atomic endpoint; you can't synthesize N-item atomicity by looping a single-item call. Sequentially Rank::between-ing N cards also degrades rank length ~1 char per 4 cards, so spread balanced midpoints instead of chaining.
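
The difference between "first changed item" and "every changed item" can be sketched as below. The card shape and both function bodies are hypothetical; only the helper name findLaneChangedItem comes from the entry, and a generic laneKey string stands in for whatever the real store compares.

```typescript
// Hypothetical card shape for illustration only.
interface BacklogCard {
  id: number;
  laneKey: string;
}

function hasLaneChanged(previous: Map<number, string>, card: BacklogCard): boolean {
  return previous.has(card.id) && previous.get(card.id) !== card.laneKey;
}

function indexByCard(cards: BacklogCard[]): Map<number, string> {
  return new Map(cards.map((card): [number, string] => [card.id, card.laneKey]));
}

// Pre-fix behaviour (assumed): stops at the first moved card and drops the rest.
function findLaneChangedItem(before: BacklogCard[], after: BacklogCard[]): BacklogCard | undefined {
  const previous = indexByCard(before);
  return after.find((card) => hasLaneChanged(previous, card));
}

// Post-fix behaviour (assumed): collects every moved card for one bulk request.
function findLaneChangedItems(before: BacklogCard[], after: BacklogCard[]): BacklogCard[] {
  const previous = indexByCard(before);
  return after.filter((card) => hasLaneChanged(previous, card));
}
```

The ids returned by the second helper are what a bulk payload's issue_ids would carry, so all N moves travel in a single request.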

KD-0798 — VS Code extension shows no issues after API shape change

  • Severity: high
  • Symptom: Opening a project in the VS Code extension showed no issues and fired an "API error" notification. Assignee avatars rendered [object Object] as the image src.
  • Root cause: KD-0774 changed IssueResourceData to return branch_links as full nested objects ({id, branch_name, branch_url, status}) instead of a flat status array. The extension was only partially updated — it kept the field name but typed each link as {status}, so it still fired N secondary GET .../branch-links calls to fetch data already present in the initial response. Separately, profile_picture was typed string | null but the API now returns {avif, webp} | null, so the raw object was passed through as an image URL.
  • Fix: Widened the extension's branch_links type to the full shape and derived branch names/URLs directly from the initial response (dropping the N secondary calls). Fixed profile_picture type and extracted avif ?? webp ?? null.
  • Lesson: A backend resource shape change ripples into every client that wasn't updated in lockstep — the extension is a separate consumer with no shared type contract, so a partial update left it making redundant calls AND mis-rendering. When a response field changes from scalar to object, every client's type and every place it's interpolated (especially as a URL src) must be revisited.
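
A minimal sketch of the profile_picture handling, assuming each of avif and webp may individually be null (the entry only states the {avif, webp} | null shape). The type and function names are hypothetical.

```typescript
// Assumed response shape; nullability of the individual fields is a guess.
interface ProfilePictureSources {
  avif: string | null;
  webp: string | null;
}

// Prefer avif, fall back to webp, and return null when there is no usable URL.
function profilePictureUrl(picture: ProfilePictureSources | null): string | null {
  return picture?.avif ?? picture?.webp ?? null;
}

// What the pre-fix code effectively did: coerce the object itself into a string.
function brokenProfilePictureUrl(picture: ProfilePictureSources | null): string {
  return String(picture);
}
```

Typing the field as the object shape makes the compiler reject any call site that still passes it straight to an image src.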

KD-0788 — Central-binding arch test over-detects: only 5 of 14 flagged Actions were true gaps

  • Severity: medium
  • Symptom: The KD-0783 broader arch test listed 14 Actions in centralBindingKnownGaps(), framed in the issue as "13 quick-wins, just add the binding." Latent/low-volume — the binding gaps would surface as audit-transaction crashes only when the affected central paths ran in prod.
  • Root cause: The arch test is a structural detector ("Action injects a central model AND has ->transaction(") but the canonical bug is behavioural ("the outer transaction opens on the wrong connection"). Of the 14: 5 were true gaps ($this->db->transaction wrapping central writes); 5 were already correct (transaction opened via $model->getConnection()->transaction(), which the test couldn't distinguish); and 4 were tenant-primary Actions where binding $this->db to central would invert the bug — opening a central transaction while tenant writes committed unsynchronised.
  • Fix: Bound the 5 true gaps in AppServiceProvider. Refined the arch test to require both ConnectionInterface injection AND ->transaction( (dropping the 5 model-getConnection Actions naturally). Added inline @central-binding-exempt: markers to the 4 tenant-primary Actions and taught the test to honour them. Emptied centralBindingKnownGaps() to [].
  • Lesson: A structural arch heuristic and the behavioural bug it targets coincide for the obvious cases and diverge for the rest — a "known gaps" list taken at face value will misclassify. Reading each flagged site beats trusting the count: blindly "adding the binding" to all 14 would have broken 4 working Actions. When a heuristic over-detects, the fix is to tighten the heuristic AND provide an auditable escape-hatch marker, not to suppress with a grandfather list.

KD-0787 — RollbackProvisioningAction unbound + inline DROP DATABASE on the bound connection

  • Severity: medium
  • Symptom: Same latent shape as KD-0783. Dormant in prod (DOMAIN_PROVISIONING_ENABLED=false). The moment the rollback path ran with provisioning enabled, the central audit-write would throw AuditLogWriter must be called within a database transaction because the outer $this->db->transaction(...) opened on the default connection, not central.
  • Root cause: Two coupled defects. (1) The Action injected a generic ConnectionInterface and wasn't contextually bound to central, so the audit-log model's hardcoded central connection saw transactionLevel() === 0. (2) An inline $this->db->statement('DROP DATABASE...') would route through whatever connection got injected — once bound to central, central's MySQL user may lack DROP privileges. (1) couldn't land without (2): binding to central while the inline DDL remained would make rollback DDL fail in any locked-down environment.
  • Fix: Injected the already-tenant-bound DropTenantDatabaseAction for the DDL, added RollbackProvisioningAction to the central ConnectionInterface binding, and promoted its arch-test entry from centralBindingKnownGaps() to centralActionsRequiringBinding() (turning the gate into a regression test).
  • Lesson: Binding an Action's connection and extracting its cross-connection DDL are coupled changes — you can't safely flip the binding while inline statements still inherit it. DDL that needs different privileges (DROP DATABASE on tenant) must be delegated to a connection-explicit sibling Action before the parent's connection is rebound.

KD-0786 — ProvisionDomainAction unbound from central despite being central-only

  • Severity: medium
  • Symptom: Same crash shape as KD-0783, dormant behind DOMAIN_PROVISIONING_ENABLED=false. Would fire LogicException on every provisioning state transition that emits an audit row the moment the flag flipped on.
  • Root cause: The Action's outer transaction resolved ConnectionInterface to the default connection because it wasn't contextually bound to central. It was excluded from the binding with a stale comment claiming it "also does tenant-DB work" — but a post-KD-0580 state-machine refactor had made the Action central-only (Domain model is central, audit is central, the advance() branches call external providers not the DB, no TenantSwitcher). The exclusion rationale had outlived its truth.
  • Fix: Added ProvisionDomainAction to the central ConnectionInterface binding and moved its arch-test entry from gaps to required. No change to the Action — only container wiring was missing.
  • Lesson: Exclusion comments rot. A "we can't bind this because it does X" note must be re-validated against the current source before it's trusted — a later refactor can remove X and leave the stale exclusion silently masking a bug-in-waiting. The arch test's sentinel ("an audit-writing Action must be in exactly one of bound-list or gaps-list") is what forced the decision instead of letting it sit undecided.

KD-0785 — CreateTenantAction crashes on first prod central-admin invite

  • Severity: high
  • Symptom: Same latent KD-0783 shape. The central-admin "invite a tenant" flow (POST /api/central/tenants) worked in dev/test (where DB_CONNECTION falls through to central) but the first invitation in prod would throw the audit-transaction assertion because the outer transaction opened on the default mysql connection while writes hit central models.
  • Root cause: The Action wasn't in the central ConnectionInterface binding because its private createAdminUser opened an inner transaction on the same $this->db after a TenantSwitcher::switchTo() — and those inner writes target the tenant DB (admin User row + role pivot). Binding the whole Action to central would route the inner tenant transaction to central too. Same shape SignupAction had pre-KD-0783.
  • Fix: Mirrored KD-0783's extraction exactly — pulled the tenant-DB work into a new sibling CreateInvitedTenantAdminUserAction (injecting the default tenant ConnectionInterface and owning the switchTo/reset lifecycle), then bound CreateTenantAction to central and promoted its arch-test entry to required.
  • Lesson: An Action that opens transactions on two different connections cannot be contextually bound to either — the only clean fix is to split the second-connection work into its own Action with its own binding. The "extract the tenant-DB half into a sibling Action" pattern is now the canonical resolution for this entire class (KD-0783 → KD-0785 → KD-0787).

KD-0783 — Public signup 500s in prod on audit-log transaction assertion

  • Severity: high
  • Symptom: Every public signup at central.kendo.dev/signup returned HTTP 500 with RuntimeException: AuditLogWriter must be called within a database transaction for hash chain integrity. Production-only — the exact path passed every CI test and worked locally. A partial central row (Tenant insert) could commit before the audit assertion fired.
  • Root cause: SignupAction and CreateDomainAction injected ConnectionInterface with no connection name, so the container resolved the default connection. Prod's .fly/config/prod.toml sets DB_CONNECTION=mysql, so $this->db->transaction(...) opened on mysql. But the audit-log models hardcode $connection = 'central', and assertWithinTransaction checks central's transaction level — found 0 (the open transaction was on mysql) and threw. Tests didn't catch it because DB_CONNECTION is unset in tests, so both connections resolved to the same instance. The same defect was latent in every central audit-writing Action.
  • Fix: Contextually bound ConnectionInterface to central in AppServiceProvider for all 10 central audit-writing Actions. Extracted SignupAction's tenant-DB createAdminUser into CreateTenantAdminUserAction (default connection) and its inline DROP DATABASE into DropTenantDatabaseAction (tenant connection). Added an arch test forcing every newly-discovered central audit-writing Action into either the bound list or a known-gaps list.
  • Lesson: Injecting an unqualified ConnectionInterface silently binds to whatever DB_CONNECTION resolves to — fine until a model hardcodes a different connection, at which point transaction-scoped invariants check the wrong connection. Tests that leave DB_CONNECTION unset collapse distinct connections into one instance and hide the entire class; an arch test that asserts the binding decision is the only reliable gate. (Also surfaced two side findings: DB_CONNECTION=mysql references a connection that doesn't exist in config — it only resolves via Laravel's framework-default merge — and DB_PASSWORD was visible in plaintext via fly ssh console env.)
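
Why the tests stayed green can be modelled with two toy connections. This is a TypeScript illustration of the mechanism only, not the Laravel code: FakeConnection, writeAuditRow, and signup are hypothetical stand-ins for the real classes.

```typescript
// Toy connection that only tracks how many transactions are open on it.
class FakeConnection {
  private level = 0;

  transaction<T>(work: () => T): T {
    this.level += 1;
    try {
      return work();
    } finally {
      this.level -= 1;
    }
  }

  transactionLevel(): number {
    return this.level;
  }
}

// The audit writer always checks the central connection, whatever was injected.
function writeAuditRow(central: FakeConnection): string {
  if (central.transactionLevel() === 0) {
    throw new Error("AuditLogWriter must be called within a database transaction");
  }
  return "audit row written";
}

// The Action opens its transaction on the injected connection.
function signup(injected: FakeConnection, central: FakeConnection): string {
  return injected.transaction(() => writeAuditRow(central));
}
```

When both arguments are the same instance (the test environment), the assertion passes; when they are two instances (prod), the transaction is open on the wrong one and the writer throws.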

KD-0761 — Avatar initials low-contrast, undersized, off-center

  • Severity: low
  • Symptom: Fallback avatar initials were hard to read — in dark mode, white text on bright tint backgrounds (e.g. #4ade80) hit contrast ratios as low as 1.3:1, far below WCAG AA's 4.5:1. Initials also rendered too small and sat slightly high.
  • Root cause: ProfilePicture.vue hardcoded c-white for initials regardless of theme; the dark-mode tint palette uses high-luminance backgrounds where white text fails contrast. Font sizes were ~35-40% of the container, weight was too light, and no optical baseline compensation was applied for uppercase glyphs.
  • Fix: Added a theme-aware --avatar-initials CSS variable (near-black in dark mode, white in light), bumped font sizes and weight to 700, added items-center + a 0.5px translateY optical nudge.
  • Lesson: Hardcoding c-white for text-on-color assumes dark backgrounds — a bright tint palette inverts that assumption. Contrast-sensitive colors must be theme-aware tokens, not literals. (The fix also caught a recurring trap: stale test assertions pinning the pre-fix font sizes blocked the first verification.)
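
The contrast failure can be checked numerically with the WCAG relative-luminance formula. The sketch below is a standalone calculation, not code from the app; the function names are hypothetical, and #111111 is only an assumed stand-in for the "near-black" token value.

```typescript
// Convert one 8-bit sRGB channel to its linear value (WCAG relative luminance definition).
function channelToLinear(channel8bit: number): number {
  const c = channel8bit / 255;
  return c <= 0.04045 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

function relativeLuminance(hex: string): number {
  const match = /^#?([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})$/i.exec(hex);
  if (!match) {
    throw new Error(`Expected a 6-digit hex colour, got: ${hex}`);
  }
  const r = channelToLinear(parseInt(match[1]!, 16));
  const g = channelToLinear(parseInt(match[2]!, 16));
  const b = channelToLinear(parseInt(match[3]!, 16));
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// Contrast ratio is (lighter + 0.05) / (darker + 0.05), ranging from 1 to 21.
function contrastRatio(foreground: string, background: string): number {
  const a = relativeLuminance(foreground);
  const b = relativeLuminance(background);
  return (Math.max(a, b) + 0.05) / (Math.min(a, b) + 0.05);
}
```

Comparing contrastRatio("#ffffff", "#4ade80") with contrastRatio("#111111", "#4ade80") against the 4.5 threshold shows why white initials fail on that tint while a near-black token passes.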

KD-0760 — Delete confirmation dialog shows "Submit" instead of "Delete"

  • Severity: low
  • Symptom: Destructive confirmation modals (delete attachment, delete issue, delete tenant, etc.) showed a generic "Submit" confirm button instead of a destructive verb — ambiguous, and visually identical to a benign save, raising accidental-confirmation risk on irreversible actions.
  • Root cause: confirmModal defaults confirmButtonText to 'Submit'. Nine destructive call sites omitted the third positional argument and inherited the generic label. (~23 other call sites already passed an explicit verb, so the misbehaviour was purely the default kicking in on incomplete calls.)
  • Fix: Passed an explicit 'Delete' (or equivalent verb) at all 9 destructive call sites. Left the helper's 'Submit' default in place but now unreachable from any destructive flow.
  • Lesson: A permissive default on a shared destructive helper is a latent footgun — every incomplete call site silently inherits the wrong label. For destructive actions the safer design is no default (force the caller to name the verb), but at minimum every call site must be audited when a default is too generic to be safe.

KD-0759 — File-upload drop zones too short to hit reliably

  • Severity: low
  • Symptom: Drop zones (attachment uploaders, profile picture modal) rendered short — standard ~100px, compact ~40px, profile ~80px. Files released slightly above/below the dashed border missed the drop and landed on the page.
  • Root cause: None of the three dropzone surfaces set a min-h-*, so the affordance collapsed to its icon + text height. The @drop handler fires on the same <div> that draws the border, so the visible affordance IS the hit area — a short visual yields a short hit target. The compact variant fell below the WCAG 2.5.5 44px floor.
  • Fix: Added min-h-32 (128px) to standard + profile dropzones, min-h-14 (56px) to compact, plus flex centering. No padding/border/copy changes.
  • Lesson: When the visual affordance and the event target are the same element, the visual size directly determines the hit area — sizing it for "looks fine" isn't the same as sizing it for "easy to hit." Set an explicit minimum height against a real interaction target (WCAG 2.5.5's 44px floor as the baseline).

KD-0756 — Invite form not reset after a successful invite

  • Severity: low
  • Symptom: After inviting a user, re-opening the invite modal showed the previous person's data still populated in every field. Users had to refresh to get a clean form, risking re-submitting the same details under a different email.
  • Root cause: newInvite is a module-scoped ref passed by reference to the modal each time it opens; the form mutates that object in place via v-model. The onSubmit success path posted, refreshed, toasted, and closed the modal — but never reset newInvite.value, so stale data persisted across openings.
  • Fix: One line — reassign newInvite.value to empty defaults after closeModal() in onSubmit. The toast captures the invitee's name before the reset; a failed invite throws before the reset, keeping the form populated for retry.
  • Lesson: A reused, mutated-in-place form model needs an explicit reset on the success path — close-and-reopen does not clear it because the same object reference is handed back. Reset after success, but only after success, so failures keep the user's input for retry.
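
The ordering in the fix (send, capture the name, then reset) can be sketched without Vue. Here a plain { value } object stands in for the module-scoped ref, and the field names, the synchronous send callback, and the toast text are all hypothetical.

```typescript
// Hypothetical form shape.
interface InviteForm {
  firstName: string;
  email: string;
}

function emptyInvite(): InviteForm {
  return { firstName: "", email: "" };
}

// Stand-in for the module-scoped ref that the modal mutates in place.
const newInvite: { value: InviteForm } = { value: emptyInvite() };

function submitInvite(send: (form: InviteForm) => void): string {
  // A failed send throws here, before the reset, so the input survives for a retry.
  send(newInvite.value);
  // Capture what the toast needs before the model is cleared.
  const toastMessage = `Invited ${newInvite.value.firstName}`;
  // Reassign rather than relying on close-and-reopen: the same ref is handed back next time.
  newInvite.value = emptyInvite();
  return toastMessage;
}
```

Reassigning a fresh object also means any component still holding the old object cannot leak its contents into the next opening.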

KD-0754 — Tab walks through every WYSIWYG toolbar button

  • Severity: low
  • Symptom: Tabbing from a form's title field into the RichTextArea description forced keyboard users through ~8 formatting toolbar buttons first (H1/H2/H3/Bold/Italic/UL/OL/Raw). Reproduced everywhere RichTextArea is consumed (comments, issue templates, AI story prompt, epic form).
  • Root cause: FormatButton.vue was a plain <button> with default tabindex="0", and the toolbar <div> precedes the editor content in DOM order — so every button became a tab stop before the editor. No role="toolbar" or roving-tabindex consolidated them into one stop.
  • Fix: Applied the WAI-ARIA toolbar pattern (roving tabindex): role="toolbar" + aria-label, exactly one button at tabindex="0" and the rest at -1, arrow keys for internal navigation. Format actions remain reachable via Tiptap shortcuts; the raw-mode toggle stays keyboard-reachable (which a flat tabindex="-1" strategy would have lost).
  • Lesson: A group of related controls should be a single tab stop with internal arrow-key navigation (the WAI-ARIA toolbar pattern), not N sequential tab stops. A shared component is the right fix point — wiring roving tabindex once protects every consumer.
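
The roving-tabindex bookkeeping reduces to two pure functions. This is a sketch of the pattern, not the FormatButton.vue implementation; the function names are hypothetical, and wrapping at the ends plus Home/End support are optional parts of the pattern that are assumed here.

```typescript
// Exactly one control gets tabindex 0 (the single tab stop); the rest get -1.
function rovingTabIndexes(count: number, activeIndex: number): number[] {
  return Array.from({ length: count }, (_, index) => (index === activeIndex ? 0 : -1));
}

// Arrow keys move the active control inside the toolbar; other keys leave it alone.
function nextActiveIndex(count: number, activeIndex: number, key: string): number {
  switch (key) {
    case "ArrowRight":
      return (activeIndex + 1) % count;
    case "ArrowLeft":
      return (activeIndex - 1 + count) % count;
    case "Home":
      return 0;
    case "End":
      return count - 1;
    default:
      return activeIndex;
  }
}
```

A component would apply rovingTabIndexes to its buttons on each render and call focus() on the newly active button after an arrow key, so Tab enters and leaves the toolbar in one step.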

KD-0753 — LinkBranchTool rethrows raw exceptions as JSON-RPC -32603

  • Severity: medium
  • Symptom: Any failure in the MCP LinkBranchTool rethrew the raw Throwable, which the MCP framework mapped to a generic -32603 internal error. Callers couldn't distinguish a duplicate branch link from a cross-project mismatch, a deadlock, or a broadcast failure. Under parallel agent fan-out the error was consistent and unactionable.
  • Root cause: The top-level catch captured the exception for the audit log but then rethrew it unchanged (throw $throwable) — violating the documented Exception-Leak Discipline ("MCP tools must never rethrow raw Throwables from their top-level catch"). Concurrent calls could throw deadlock/unique-constraint exceptions from InnoDB gap locks, all flattened to -32603.
  • Fix: Replaced the rethrow with three ordered catch blocks: BranchAlreadyLinkedException and CrossProjectException return specific structured messages; remaining Throwable is logged with scoped context and returned as a generic structured error instead of rethrown.
  • Lesson: At a protocol boundary (MCP/JSON-RPC), a raw rethrow collapses every distinct failure into one opaque code — the caller (often another agent) loses all ability to act. Top-level catches at such boundaries must map known exceptions to structured errors and log-then-wrap the unknown, never rethrow raw.
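
The "map the known, log-then-wrap the unknown" shape can be sketched as below. The real tool is PHP; this TypeScript version is illustrative only, and the error class names, the ToolError shape, the code strings, and the messages are all hypothetical.

```typescript
class ToolDomainError extends Error {
  constructor(message: string) {
    super(message);
    // Keeps instanceof working even when compiled to an ES5 target.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}
class BranchAlreadyLinkedError extends ToolDomainError {}
class CrossProjectError extends ToolDomainError {}

// Hypothetical structured result returned to the caller instead of a thrown error.
interface ToolError {
  isError: true;
  code: string;
  message: string;
}

function toToolError(error: unknown, log: (cause: unknown) => void): ToolError {
  if (error instanceof BranchAlreadyLinkedError) {
    return { isError: true, code: "branch_already_linked", message: error.message };
  }
  if (error instanceof CrossProjectError) {
    return { isError: true, code: "cross_project", message: error.message };
  }
  // Unknown failure: keep the detail in the log and hand back a generic error.
  log(error);
  return { isError: true, code: "internal_error", message: "Linking the branch failed." };
}
```

The caller can now branch on code, and the raw exception text of an unknown failure stays in the server log rather than crossing the protocol boundary.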

KD-0738 — Project deletion 500s on unhandled RESTRICT foreign keys

  • Severity: high
  • Symptom: DELETE /api/projects/{project} returned 500 for any project that had been used (triggered a Claude session, been watched, linked to a tenant AI key, etc.) with SQLSTATE[23000]: Integrity constraint violation.
  • Root cause: DeleteProjectAction walks a hand-maintained list of descendant tables to delete before the parents. That list was last extended at a 2026-02-18 cascade-to-restrict audit. Six tables with restrictOnDelete() FKs into the project subtree have landed since (claude_sessions, issue_watchers, attachment_extracted_contexts, claude_issue_eligibilities, claude_issue_eligibility_criteria, tenant_ai_key_project) and none were cleaned up, so MySQL blocked the parent delete.
  • Fix: Added six raw db->table(...)->whereIn(...)->delete() calls inside the existing transaction, in FK dependency order (criteria before eligibility, contexts before attachments, pivot before project).
  • Lesson: A hand-maintained cascade-delete list is guaranteed to drift — every new table with a RESTRICT FK into the subtree must be manually added, and nothing fails until a populated project is deleted in prod. This bug class has no automated check today; an arch test that diffs schema FKs into project/issues against the Actions that delete them would close it. (Related drift: BulkDeleteIssuesAction already deleted issue_watchers but DeleteProjectAction didn't — the two tear-down paths had diverged.)

KD-0734 — Complete Sprint modal prompts for incomplete issues when none exist

  • Severity: medium
  • Symptom: Two defects. (1) The Complete Sprint modal always showed "What should we do with incomplete issues?" even when every issue was already Done. (2) After completing a sprint, the board kept showing the completed sprint until a manual reload.
  • Root cause: (1) hasNoIssues checked issuesCount === 0 (total issues) instead of incomplete-issues count — and the backend never exposed an incomplete count, so there was nothing else to check. (2) CompleteSprintAction was the only mutating sprint Action that never called SprintBroadcaster->updated(), so the reactive store was never notified. A follow-up surfaced a third issue: makeSprintStoreForProject wasn't memoized, so the modal's retrieveAll() refreshed a different store instance than Backlog used — and since the broadcast ships with ->toOthers(), the originator's UI had no refresh path at all.
  • Fix: Added lazily-computed incomplete_issues_count to SprintResourceData; switched the modal to check it. Injected SprintBroadcaster into CompleteSprintAction. Memoized the sprint store by projectId and derived hasNoIssues from the freshly-retrieved store value rather than the stale prop snapshot.
  • Lesson: "No issues to move" is a count of incomplete issues, not total — modeling a domain question with the nearest-available field produces noise. And a mutating Action that skips the broadcast its siblings all fire is a silent realtime gap. The deeper trap: ->toOthers() excludes the originator, so the person who triggered the action depends entirely on the HTTP response refreshing the same store instance — an unmemoized store factory quietly breaks that for the one user who most expects to see the result.
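
Memoizing the store factory by projectId can be sketched like this. Only the name makeSprintStoreForProject comes from the entry; the store shape and the module-level Map are assumptions made for illustration.

```typescript
// Hypothetical store shape.
interface SprintStore {
  projectId: number;
  sprints: string[];
}

const sprintStoresByProject = new Map<number, SprintStore>();

// Returns the same store instance for the same project on every call, so a refresh
// triggered from the modal updates the instance the Backlog is reading from.
function makeSprintStoreForProject(projectId: number): SprintStore {
  let store = sprintStoresByProject.get(projectId);
  if (!store) {
    store = { projectId, sprints: [] };
    sprintStoresByProject.set(projectId, store);
  }
  return store;
}
```

Because the originator is excluded from a toOthers() broadcast, sharing one instance per project is what lets their own HTTP-triggered refresh reach the board.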

KD-0733 — Markdown tables render as unstyled plain text

  • Severity: low
  • Symptom: GFM tables in issue descriptions showed header and cell text with no borders, no row separation, no padding — indistinguishable from two lines of plain text.
  • Root cause: marked correctly emitted <table>/<thead>/<tr>/<th> and DOMPurify preserved them, but markdown.css defined .description-prose styles for every other prose element and had no rules for table elements. The browser's default table rendering has zero borders.
  • Fix: Added .description-prose table styles (border-collapse, borders via var(--border), header background, alternating row background, padding) matching the file's existing visual language.
  • Lesson: A prose stylesheet is only complete for the elements it explicitly targets — when a markdown renderer can emit an element type (tables) that the prose CSS never styled, it falls back to unstyled UA defaults. Cross-check the renderer's full output tag set against the prose stylesheet's coverage.

KD-0725 — Modal dialogs overflow viewport on narrow screens

  • Severity: medium
  • Symptom: Large shared modals (1360px design width) overflowed the viewport across the entire 1024–1359px range — the half-screen-of-a-1920px-monitor band up through small desktops. Right edge clipped, close button hidden, horizontal scroll appeared.
  • Root cause: BaseFormModal/BaseShowModal switched to fixed pixel widths (lg:w-100/220/340) at the lg breakpoint with no viewport guard. Below lg the w-90vw fallback was already viewport-capped, so the bug only triggered in the lg+ band where the fixed width exceeded the viewport.
  • Fix: Added max-w-95vw to every entry in both size maps, so resolved width became min(design-width, 95vw).
  • Lesson: A fixed pixel width above a breakpoint assumes the viewport is always wider than the design width — false for half-screen and mid-desktop widths. Any fixed-width element needs a viewport-relative max-width cap. An arch test scanning for lg:w-<n> on a <dialog> child without a matching max-w-<n>vw would prevent the regression class.

KD-0700 — Hand-to-Claude grader verdict read from a key Anthropic never sends

  • Severity: high
  • Symptom: Every graded Hand-to-Claude session was recorded as Failed regardless of the actual grader verdict — the UI said "Claude could not finish the issue" even when the grader explicitly satisfied the rubric. Confirmed in prod on a session whose Anthropic events API showed result: "satisfied" but kendo stored status = Failed.
  • Root cause: handleOutcome read rawPayload['outcome']['result'] from the webhook, but Anthropic's outcome_evaluation_ended webhook is a notification only — it carries no outcome key at any depth. The real verdict lives on the session's outcomeEvaluations[] list, already returned by the getSession retrieve call — but aggregateEvents walked the event stream for tokens/iterations only and never surfaced it. A secondary defect: triggerCleanup archived the session unconditionally on the first status_idled webhook (fired between implementer end_turn and grader start, while Anthropic had flipped back to running), 400ing on every run.
  • Fix: Added outcomeResult to SessionResultData, populated from the latest outcomeEvaluations entry in getSession, and read the verdict from there instead of the webhook payload. Gated triggerCleanup on the session reporting idle/terminated rather than running.
  • Lesson: Reading state from a webhook payload that the provider documents as a notification-only event guarantees a wrong answer — the authoritative state must come from the retrieve call. The test suite hid it because the test helper fabricated the outcome key the provider never sends: a fixture builder that synthesizes a shape no real API produces will keep a bug green forever.

KD-0699 — PR-evidence parser rejects every verbatim MCP response

  • Severity: high
  • Symptom: Every Hand-to-Claude implementer session that successfully opened a PR was terminated as Failed/missing_pr_evidence before the grader ran — burning the full token spend with no verdict and permanently marking the issue Failed. The implementer pasted the MCP tool response verbatim as instructed.
  • Root cause: extractPullRequestUrlFromMessage read $decoded['html_url'], but the GitHub MCP create_pull_request tool returns {id, url} with no html_url. Both the kendo parser AND the implementer system prompt's "good output looks like this" example encoded the GitHub REST API shape rather than the MCP tool's actual shape — so the prompt-mandated verbatim quoting was structurally guaranteed to fail the parse. The test suite passed because the test helper generated the same wrong shape the prompt documented.
  • Fix: Made the parser accept url ?? html_url (MCP shape first, REST shape as forward-compatible fallback), both still validated through the github.com PR-URL regex. Updated the prompt example and test helper to the real MCP shape. Downstream verifyPullRequestOpen still confirms the PR exists, so the looser key set didn't weaken fabrication defence.
  • Lesson: When a prompt instructs the model to quote a tool's output verbatim, the parser must match what the tool actually emits, not what an API doc says — and the prompt example, the parser, and the test fixture must all agree on the real shape. Three places encoded the same imagined shape; prod was the first witness. Audit fixture builders for synthesized-vs-real payloads. (Side note: the diagnosed session cost ~$31 on Opus for well-trodden work — flagged a model-tier question.)
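
The tolerant key lookup can be sketched as follows. The real parser is PHP and extracts the JSON from a longer message; this sketch assumes the whole message is the JSON body, and the function name and the regex are illustrative guesses rather than the production pattern.

```typescript
// Assumed pattern for a github.com pull-request URL; the production regex may differ.
const GITHUB_PR_URL = /^https:\/\/github\.com\/[^\/\s]+\/[^\/\s]+\/pull\/\d+$/;

function extractPullRequestUrl(message: string): string | null {
  let decoded: unknown;
  try {
    decoded = JSON.parse(message);
  } catch {
    return null;
  }
  if (typeof decoded !== "object" || decoded === null) {
    return null;
  }
  const payload = decoded as Record<string, unknown>;
  // MCP tool shape first ({id, url}), REST shape (html_url) as a fallback.
  const candidate = payload["url"] ?? payload["html_url"];
  // Either key still has to pass the same URL validation.
  return typeof candidate === "string" && GITHUB_PR_URL.test(candidate) ? candidate : null;
}
```

Test fixtures for a parser like this are best copied from a captured tool response, so the fixture cannot drift into a shape the tool never emits.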

KD-0693 — Anthropic session cleanup 400s archiving the primary thread

  • Severity: high
  • Symptom: Every terminal Hand-to-Claude session hit 400 invalid_request_error: "The primary thread cannot be archived; archive the session instead." Because the exception threw before cleaned_up_at was stamped, the webhook job retried forever and the hourly prune re-hit the same failure every tick (9 occurrences in two hours post-deploy).
  • Root cause: ArchiveAnthropicSessionResourcesAction walked every thread via streamThreadsForSession and called archiveThread on each — including the primary thread, which Anthropic rejects. The streamer yielded the primary thread (parentThreadID === null) despite its contract implying only archivable threads. Compounding it, the Action never called sessions->archive($sessionId) at all, so sessions were never archived on the Anthropic side even before the 400 surfaced.
  • Fix: Filtered the primary thread (parentThreadID === null) out of streamThreadsForSession, and added an archiveSession primitive called once after the child-thread and vault archives, before stamping cleaned_up_at.
  • Lesson: When a cleanup step throws before its idempotency marker is set, it retries forever and turns a single failure into a recurring incident — cleanup loops must either tolerate the rejecting case or stamp progress before the fragile call. And a streamer whose contract says "things you can archive" must actually filter to that set, or every caller inherits the exception.
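
A minimal sketch of the corrected cleanup order, assuming a simplified Thread shape and a hypothetical CleanupSteps interface (the vault archive step is left out for brevity). Only parentThreadID === null as the primary-thread marker comes from the entry above.

```typescript
interface Thread {
  id: string;
  parentThreadID: string | null; // null marks the primary thread
}

// Hypothetical callbacks standing in for the real API calls.
interface CleanupSteps {
  archiveThread(threadId: string): void;
  archiveSession(): void;
  stampCleanedUp(): void;
}

// Keep only threads that can be archived individually.
function archivableThreads(threads: Thread[]): Thread[] {
  return threads.filter((thread) => thread.parentThreadID !== null);
}

function cleanUpSession(threads: Thread[], steps: CleanupSteps): void {
  for (const thread of archivableThreads(threads)) {
    steps.archiveThread(thread.id);
  }
  steps.archiveSession(); // the primary thread goes away with the session
  steps.stampCleanedUp(); // marker is set only after every call succeeded
}
```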

KD-0691 — session.status_idled webhook events silently dropped

  • Severity: high
  • Symptom: When a Hand-to-Claude session completed naturally, Anthropic emitted session.status_idled — the dedup row was written (so processed_at looked healthy) but no status update, no completion comment, no audit row, no broadcast, and no cleanup ran. From kendo's POV the session was permanently in flight; Anthropic-side resources sat until the 30-day TTL.
  • Root cause: All three match ($data->eventType) blocks in HandleSessionWebhookAction only enumerated outcome_evaluation_ended and status_terminated, so session.status_idled fell through to default => null. The dedup row was written before the inner match, which is exactly what made the failure silent: the webhook-events table looked processed while the session stayed Pending.
  • Fix: Added session.status_idled to all three match blocks and a handleIdled method branching on stopReason (end_turn → Completed, anything else → Failed), mirroring handleTerminated. Defensive early-return-with-warning if the aggregated result is null.
  • Lesson: A match with a silent default => null arm is a trap for event-type handling — a new (or unhandled) event type produces no error, just missing work. And writing a dedup/processed marker before the work means the marker lies when the work is skipped: record "processed" only after the handler actually runs, or unhandled events masquerade as healthy.
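
One way to make an unhandled event type loud instead of silent is an exhaustive switch. The sketch below is a TypeScript analogue of the PHP match, not the real handler: the returned handler names are used only as labels, and handleOutcome is a hypothetical name.

```typescript
type SessionEventType =
  | "outcome_evaluation_ended"
  | "status_terminated"
  | "session.status_idled";

type SessionStatus = "Completed" | "Failed";

// Reaching this at runtime means an event type slipped past the switch.
function assertNever(value: never): never {
  throw new Error(`Unhandled event type: ${String(value)}`);
}

// Mirrors the stopReason branching described in the fix.
function statusForIdled(stopReason: string): SessionStatus {
  return stopReason === "end_turn" ? "Completed" : "Failed";
}

function handlerFor(eventType: SessionEventType): string {
  switch (eventType) {
    case "outcome_evaluation_ended":
      return "handleOutcome"; // hypothetical name
    case "status_terminated":
      return "handleTerminated";
    case "session.status_idled":
      return "handleIdled";
    default:
      // Compile error if a new union member is added without a case,
      // and a thrown error if an unknown string arrives at runtime.
      return assertNever(eventType);
  }
}
```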

KD-0634 — Filter state leaks across projects, blanking Backlog/Board

  • Severity: medium
  • Symptom: Project-scoped issue filters (selected lanes/epics/creators/sprints) persisted across navigation between projects and across reloads. Because lane/epic/sprint IDs are per-project auto-increment keys, a filter from Project A matched nothing in Project B — the middle pane rendered empty with a stale filter chip showing. Affected Backlog, Board, and Overview.
  • Root cause: filters.ts declared module-level singleton refs persisted under global localStorage keys. Four held project-scoped IDs; the matchers did strict ID equality with no project-membership check, so a previous project's IDs filtered out every issue in the current one.
  • Fix: Per-slot storage keys — each project gets its own slot (issue-filters.{projectId}.selectedLanes), hydrated on setFilterProject(projectId). Cross-project MyIssues routes through a fixed myissues slot. Rejected the simpler "single global key + reset on project change" because localStorage is shared across tabs, so a Ctrl+Click into Project B would silently wipe Project A's filter in another tab.
  • Lesson: State persisted under a global key but holding scope-specific identifiers will leak across scopes — and the simpler "reset on change" fix breaks under multi-tab because localStorage is origin-shared. Per-scope storage slots sidestep both the leak and the cross-tab race. (Also: selectedSprints was dormant — declared and cleared but wired into no page; clearing it for hygiene future-proofs whoever wires it up.)
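
The per-slot key scheme can be sketched like this. Only the key format issue-filters.{projectId}.selectedLanes and the myissues slot come from the entry above; the helper names, the selectedEpics and selectedCreators filter names, and the KeyValueStore interface (a stand-in for localStorage) are assumptions.

```typescript
type FilterName = "selectedLanes" | "selectedEpics" | "selectedCreators" | "selectedSprints";
type FilterSlot = number | "myissues";

// Minimal subset of the Storage API, so the sketch runs outside a browser.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

function filterStorageKey(slot: FilterSlot, name: FilterName): string {
  return `issue-filters.${slot}.${name}`;
}

function saveFilter(store: KeyValueStore, slot: FilterSlot, name: FilterName, ids: number[]): void {
  store.setItem(filterStorageKey(slot, name), JSON.stringify(ids));
}

function loadFilter(store: KeyValueStore, slot: FilterSlot, name: FilterName): number[] {
  const raw = store.getItem(filterStorageKey(slot, name));
  if (raw === null) return []; // nothing stored for this slot yet
  try {
    const parsed: unknown = JSON.parse(raw);
    if (!Array.isArray(parsed)) return [];
    return parsed.filter((id): id is number => typeof id === "number");
  } catch {
    return []; // corrupt entry: treat as no filter
  }
}
```

Because each project reads and writes only its own slot, a second tab opened on another project never touches the first tab's stored filter.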

KD-0631 — Blank page when async component setup fails

  • Severity: medium
  • Symptom: When an HTTP request failed during a page's async <script setup> (observed with 429s in prod during rapid navigation), the page content rendered blank — no error state, no retry. Error toasts ("Too Many Attempts.") did appear, but the content never rendered.
  • Root cause: Three layers combined. (1) Pages fired unguarded await Promise.all([...]) at the top of async setup — one rejection killed the whole batch. (2) Layouts wrapped <RouterView> in <Suspense>, which has no native error slot — an async child rejection puts Suspense into an unrecoverable blank state. (3) App-level onErrorCaptured only handled EntryNotFoundError and re-threw everything else.
  • Fix: A shared AsyncErrorBoundary component (using onErrorCaptured) placed at the two Suspense boundaries (ProjectLayout, SharedDomainLayout), rendering "Could not load page" + a "Go back" button instead of blanking. Explicitly passes EntryNotFoundError through to the existing App.vue handler. No per-page changes.
  • Lesson: Vue's <Suspense> has no error slot — an async setup rejection blanks the subtree unrecoverably unless an error boundary wraps it. Toasts surfacing the error don't help; the failure is a separate code path. One boundary at the Suspense seam covers every page beneath it. (Investigation also spun off KD-0679/0680/0635 on the underlying rate-limit pressure from unconditional refetches and orphaned broadcast subscriptions.)

KD-0512 — Reports detail pane cramped at tablet / mid-desktop widths

  • Severity: low
  • Symptom: The Reports page right-hand detail pane became unusable between ~768px and ~1200px — at 960px the report title wrapped character-by-character and the AI stepper labels overlapped into mush; at 768px the pane collapsed to a ~50px sliver.
  • Root cause: The two-pane layout had only one breakpoint guard (lt-md:flex-col at 768px). A fixed 400px left pane + persistent sidebar + padding left the detail pane only ~300-500px across the entire 768-1200px band — below the AI stepper's ~480px minimum. The epic documented a "narrow" breakpoint at <1100px that the Reports page never honoured.
  • Fix: Added a shared isNarrow ref (NARROW_BREAKPOINT = 1100) to the breakpoint service and extended the existing master-detail pattern to fire below 1100px — list OR detail, not both, with a back button.
  • Lesson: A layout built two-pane-first for wide monitors needs a breakpoint between "wide desktop" and "mobile stack" — the half-screen / mid-desktop band gets the worst of both otherwise. When an epic already defines a "narrow" threshold, page-level layouts must honour it via a shared signal rather than inventing per-page breakpoints.

KD-0687 — Implementer agent silently reverts to always_ask on every re-run

  • Severity: high
  • Symptom: The Implementer agent's GitHub MCP calls (create_branch, push_files, create_pull_request, …) parked the session waiting on a human after any re-run of the provisioner. Production worked only because the policy had been patched out-of-band via manual curl.
  • Root cause: backend/scripts/provision-hand-to-claude.php built the Implementer's mcp_toolset without a permission_policy field. Anthropic's Managed Agents API defaults an absent permission_policy to always_ask, so the script — the supposed source of truth — disagreed with the live agent state, and any re-run silently overrode the manual fix.
  • Fix: Extracted the toolset block into a require-returns-array companion (backend/scripts/lib/hand-to-claude-implementer-tools.php) that declares $alwaysAllow = ['type' => 'always_allow'] once and applies it to both the toolset's default_config and every entry in configs. Added a unit test asserting the shape.
  • Lesson: Provisioning scripts that PATCH external systems must declare every policy field explicitly — relying on API defaults means the script and the live state can diverge silently, and any "fix" applied out-of-band is one re-run away from being clobbered.

KD-0663 — Issue show page does not update from broadcasts

  • Severity: medium
  • Symptom: Editing an issue's title in tab B didn't re-render tab A's <h1> until manual reload. The bell-watch toggle had its own GET endpoint, optimistic-rollback try/catch, and a manual race counter — none of which updated when other tabs watched/unwatched.
  • Root cause: Show.vue had no project-channel broadcast subscription at all. A mid-fix attempt introduced a per-issue channel (Tenant.{t}.Project.{p}.Issue.{id}) + page-scoped useLiveIssueDetail composable + applyResource/setById leaks on the issue store — re-implementing what lanes / sprints / comments already did via the project-wide ProjectDomainUpdateEvent channel. The fs-adapter-store package docs explicitly call out exposing setById as an anti-pattern.
  • Fix: Reverted the per-issue channel; broadcast full IssueResourceData on the existing project-wide channel via ProjectDomainUpdateEvent. Added watcher_ids to IssueResourceData, made ToggleIssueWatchAction return the issue and fan out via IssueBroadcaster::updated(), and replaced the watch GET/optimistic plumbing with a one-line computed + an issue.watch() adapter method that uses the package's sanctioned storeModule.setById.
  • Lesson: Before inventing a new realtime channel or store mutator, check whether sister relations (lanes/sprints/comments) already solve it on the project-wide channel — payload-size arguments rarely justify the architectural cost once you measure (typical issue ~2.5 KB compact vs Reverb's 10 KB ceiling). And when an adapter package documents setById as an anti-pattern, exposing it on the store wrapper is a code smell, not a workaround.

KD-0654 — IssueForm submit button not disabled during in-flight save

  • Severity: medium
  • Symptom: Rapid double-click on Update/Create/Promote fired the handler twice in parallel before the first round-trip resolved. Edit popped history twice; Create produced a duplicate issue with orphan attachments associated to only one; ReportDetail promoted the report twice.
  • Root cause: IssueForm.vue's submit button had no :disabled binding, and none of the three call sites (Edit.vue, Create.vue, ReportDetail.vue) wrapped their await in an isSubmitting guard or try/finally. The browser fired duplicate submit events freely; the async operations were independent network requests, so both succeeded.
  • Fix: Two-layer guard. Added optional isSubmitting?: boolean prop to IssueForm bound to the button's :disabled. Each call site got a local isSubmitting ref, an early-return guard at the top of the handler (covers synthetic requestSubmit() paths), and a try/finally around the await so the flag resets on throw.
  • Lesson: A shared form component is the leverage point for double-submit prevention — every consumer is one optional prop away from being protected. And the guard needs both layers: :disabled blocks the click, the early-return covers synthetic submits, and try/finally guarantees the flag resets even when the network call throws (otherwise a failed submit locks the form forever).
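
The handler-side half of the guard (early return plus try/finally) can be sketched without Vue. The createSubmitGuard helper is hypothetical; in the real call sites the flag is a local isSubmitting ref that is also bound to the button's :disabled.

```typescript
// Wraps an async action so overlapping calls are ignored while one is in flight.
function createSubmitGuard(action: () => Promise<void>) {
  let isSubmitting = false;

  async function submit(): Promise<boolean> {
    if (isSubmitting) return false; // early return: also covers synthetic submits
    isSubmitting = true;
    try {
      await action();
      return true;
    } finally {
      isSubmitting = false; // resets even when action() throws
    }
  }

  return { submit, isSubmitting: () => isSubmitting };
}
```

Two synchronous calls to submit() start the action only once, because the flag is set before the first await yields.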

KD-0653 — UpdateIssueAction silently drops attachmentIds from PUT requests

  • Severity: medium
  • Symptom: PUT /api/projects/{id}/issues/{slug} accepted attachment_ids in the body, validated it, populated the DTO, and returned 200 OK — but UpdateIssueAction never read the field. The API advertised behaviour it didn't implement.
  • Root cause: SaveIssueRequest and SaveIssueData were shared across Create and Update because both controller actions used the same FormRequest. The Create path needs attachmentIds for the orphan-claim pattern (uploads happen before the issue has an ID); on Update, attachment edits go through dedicated makeAttachmentStore() endpoints, so UpdateIssueAction correctly didn't act on the field — but the shared DTO kept advertising it.
  • Fix: Per ADR-0020, split SaveIssueData into CreateIssueData (with attachmentIds) and UpdateIssueData (without), and SaveIssueRequest into CreateIssueRequest / UpdateIssueRequest. Update path's validation rule removed entirely. Frontend IssueBase lost attachmentIds; a NewIssueMutable type carries it as a Create-only payload.
  • Lesson: Sharing a FormRequest/DTO across Create and Update sounds DRY but encodes a lie when the two paths have different field surfaces — the silent-drop is the symptom, the type signature is the bug. Direction-specific DTOs make the contract honest and let the type system reject the misuse.

KD-0583 — Dead unsaved-content warning on Create Issue page

  • Severity: low
  • Symptom: The Create Issue page had a "you have unsaved files, leave anyway?" warning that had never fired in this codebase. No user had reported the missing dialog.
  • Root cause: onBeforeRouteLeave from vue-router requires the router to be installed via app.use(router). This app uses a custom createRouterView() shell (shared/services/router/components.ts) and never calls app.use() with the router, so the guard registered against an absent router and silently did nothing. The companion beforeunload listener only fired on tab close, not the in-app navigation case the warning was meant to cover. The author shipped without verifying the guard fired.
  • Fix: Deleted the dead block — onBeforeRouteLeave, the beforeunload listener, the onUnmounted cleanup, and the clearOrphanAttachments helper (no other callers). Orphan attachments are pruned server-side by PruneOrphanedAttachmentsAction after 24h, so no hygiene gap.
  • Lesson: Vue Router composition-API guards (onBeforeRouteLeave, onBeforeRouteUpdate) silently no-op when the router isn't installed as a plugin — apps using a custom router-view shell must verify any router-guard hook actually fires before shipping it, because the failure mode is invisible.

KD-0626 — lint-staged glob never matches, ESLint skipped on commit

  • Severity: low
  • Symptom: Pre-commit hook printed "No staged files match any configured task" and ESLint never ran locally — errors only surfaced on CI.
  • Root cause: lint-staged config in frontend/package.json used globs anchored at repo root (frontend/src/**/*) but the hook ran lint-staged with cwd frontend/, where staged paths resolve to src/.... The globs carried one directory prefix too many, so nothing matched.
  • Fix: Stripped the frontend/ prefix from both glob keys.
  • Lesson: Glob patterns must be relative to the cwd of the tool that evaluates them — when a tool is launched from a subdirectory, every config path inside it is anchored there.

KD-0606 — AI generate-story keys collide with snake_case wire format

  • Severity: high
  • Symptom: "Generate" button on Reports/Issues AI panel returned 422 "The source description field is required" even though the report had a description.
  • Root cause: The frontend HTTP middleware runs deepSnakeKeys() on every outbound payload, but AgentGenerateStoryRequest::rules() keys were camelCase (sourceDescription). Wire body shipped source_description; rule never matched. The earlier KD-0511 rename intended snake_case but wrote camelCase. Feature tests posted camelCase directly, bypassing the middleware, so CI stayed green while production was broken.
  • Fix: Renamed rule keys to snake_case; added arch test rejecting any camelCase top-level rule key; updated tests to post the real wire format.
  • Lesson: Feature tests that bypass the global request middleware can hide wire-format mismatches indefinitely — when middleware mutates payload shape, tests must post the post-middleware shape, not the pre-middleware shape.
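
A small sketch of why the rule never matched, and of the kind of check the arch test performs. Both helpers are hypothetical: toSnakeCase stands in for the per-key step of deepSnakeKeys(), and camelCaseRuleKeys approximates "reject any camelCase top-level rule key" by looking at the segment before the first dot.

```typescript
// Simplified per-key conversion; the real middleware may handle more cases.
function toSnakeCase(key: string): string {
  return key.replace(/[A-Z]/g, (letter) => `_${letter.toLowerCase()}`);
}

// Returns rule keys whose top-level segment contains an uppercase letter.
function camelCaseRuleKeys(ruleKeys: string[]): string[] {
  return ruleKeys.filter((key) => {
    const dot = key.indexOf(".");
    const topLevel = dot === -1 ? key : key.slice(0, dot);
    return /[A-Z]/.test(topLevel);
  });
}
```

With these, toSnakeCase("sourceDescription") yields the wire key source_description, which a rule keyed on sourceDescription can never see.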

KD-0605 — Stale checkedReportIds selection promotes the wrong report

  • Severity: high
  • Symptom: After dismissing a checked report and clicking a different one, pressing Promote generated an issue from the previously checked report. UI showed report B; API received report A's id.
  • Root cause: selectedReportId (detail pane) and checkedReportIds (multi-select for promote) were two independent pieces of state. The set was never pruned when a report transitioned out of pending — dismissed reports stayed checked and ReportDetail.promoteReports preferred the non-empty stale set over the visible report.
  • Fix: Self-heal in the checkedReports computed by filtering on getReportStatus(report) === pending, so non-pending reports drop out of the multi-select reactively.
  • Lesson: Two pieces of selection state that model the same intent will drift — prefer derived/filtered state over manually synchronised mirrors, or self-heal in the computed by gating on the source-of-truth status.
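
The self-healing filter can be sketched as a pure function. The Report shape and the status values other than pending are assumptions; in the real code this logic sits inside the checkedReports computed and uses getReportStatus(report).

```typescript
type ReportStatus = "pending" | "dismissed"; // other statuses omitted in this sketch

interface Report {
  id: number;
  status: ReportStatus;
}

// Derive the effective multi-select: checked AND still pending.
function checkedPendingReports(reports: Report[], checkedIds: number[]): Report[] {
  return reports.filter(
    (report) => report.status === "pending" && checkedIds.indexOf(report.id) !== -1
  );
}
```

A dismissed report drops out of the result even though its id is still in the checked set, so a stale id can no longer win over the visible report.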

KD-0604 — parseDuration silently drops decimals

  • Severity: medium
  • Symptom: Users entering "2.5h" saw it round-trip to "5h" (300 minutes) instead of 150 minutes. No error — silent corruption.
  • Root cause: DURATION_PATTERN = /(\d+)\s*(w|d|h|m)/gi was non-anchored and integer-only, used with matchAll. It silently skipped any character that didn't fit the pattern — decimals, commas, junk after a fragment. "2.5h" matched only 5h.
  • Fix: Split into VALIDATION_PATTERN (anchored, validates whole input) and EXTRACTION_PATTERN (extracts each chunk). Reject inputs that don't fully match instead of partial-summing.
  • Lesson: Non-anchored matchAll over user input is a silent-corruption pattern — when parsing structured input, validate the whole string against an anchored pattern before extracting parts.
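
A minimal sketch of the validate-then-extract split. The pattern bodies and the minutes-per-unit table are assumptions (the source does not say how long a "d" or "w" is), and this sketch stays integer-only, so "2.5h" is rejected rather than parsed.

```typescript
// Assumed conversion table for this sketch only.
const MINUTES_PER_UNIT: Record<string, number> = {
  w: 5 * 8 * 60,
  d: 8 * 60,
  h: 60,
  m: 1,
};

// Anchored: the whole input must be a sequence of "<integer><unit>" chunks.
const VALIDATION_PATTERN = /^\s*(?:\d+\s*[wdhm]\s*)+$/i;

function parseDuration(input: string): number | null {
  if (!VALIDATION_PATTERN.test(input)) return null; // reject instead of partial-summing

  // Fresh global regex per call so lastIndex state is never shared.
  const extraction = /(\d+)\s*([wdhm])/gi;
  let total = 0;
  let match: RegExpExecArray | null;
  while ((match = extraction.exec(input)) !== null) {
    total += Number(match[1]) * MINUTES_PER_UNIT[match[2].toLowerCase()];
  }
  return total; // minutes
}
```

Extraction only runs on input that already passed the anchored check, so characters that fit no chunk can no longer be skipped silently.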

KD-0601 — Dragging issue to sprint shows "unauthorized"

  • Severity: medium
  • Symptom: Users with "Own" update scope could change an issue's sprint via the edit modal but got 403 when dragging the same issue on the backlog.
  • Root cause: IssuePolicy::updateBoard() called CheckPermission::check() with no $ownerId, so the "Own" scope check evaluated null !== null → always false. The sibling update() method correctly passed $issue->user_id.
  • Fix: Pass $user->id as $ownerId in updateBoard(); additionally add per-issue Gate::authorize in UpdateIssueBoardAction for issues that actually moved (sprint/lane/epic changed).
  • Lesson: Policies that share a permission scope must share a calling convention — when scope semantics depend on a parameter (like $ownerId), every policy method that checks that scope must pass it the same way, or "Own" silently means "deny everyone".

KD-0600 — GitHub App install fails on webhook/redirect race

  • Severity: high
  • Symptom: Users completing GitHub App install were told "Installation failed — please close this tab and try again" while the recovery URL silently still worked. Reproduced on production for the emmie tenant.
  • Root cause: GitHub fires the installation webhook and the browser redirect concurrently with no ordering guarantee. The webhook controller queued ProcessInstallationWebhookJob and returned 200 immediately; the redirect's one-shot DB lookup hit before the job ran. Worse, the error copy told users to close the tab — abandoning the working recovery URL.
  • Fix: Process installation events inline in the webhook controller (200 only after row committed); add bounded retry ([200, 500, 1000, 1000]ms) in the lookup; rewrite Blade view to auto-reload on installation_missing instead of telling user to close the tab.
  • Lesson: When two external systems fire concurrent events about the same state, "process inline" + "bounded retry on read" beats "queue async + hope" — and error copy must direct users toward recovery, never away from it.
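
The bounded retry on the read side can be sketched as below. The delay schedule comes from the fix; everything else is an assumption, including the helper name, the injected blocking sleep callback, and the reading that there is one immediate attempt followed by one attempt after each delay.

```typescript
const RETRY_DELAYS_MS = [200, 500, 1000, 1000];

function lookupWithRetry<T>(
  lookup: () => T | null,
  sleep: (ms: number) => void,
  delaysMs: number[] = RETRY_DELAYS_MS
): T | null {
  const first = lookup();
  if (first !== null) return first;

  for (const delay of delaysMs) {
    sleep(delay); // wait for the webhook side to commit the row
    const found = lookup();
    if (found !== null) return found;
  }
  return null; // bounded: give up after the last delay
}
```

Under these assumptions the worst case is five lookups and 2.7 seconds of waiting before the caller falls through to its error path.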

KD-0596 — Reserved subdomain blocklist not enforced on admin CRUD

  • Severity: high
  • Symptom: A central operator could create a tenant with a reserved subdomain (e.g. central.kendo.dev — the central app's own host) via admin paths, bypassing the public signup blocklist.
  • Root cause: Three admin FormRequests (StoreTenantRequest, StoreDomainRequest, UpdateDomainRequest) used a weaker regex than StoreSignupRequest and lacked Rule::notIn(Tenant::RESERVED_SUBDOMAINS). Validation logic was duplicated across requests with no shared source of truth, so the drift was invisible.
  • Fix: Extracted shared SubdomainRule ValidationRule class; applied to all four FormRequests including signup.
  • Lesson: Validation logic that exists in more than one place will drift — extract shared rules into reusable ValidationRule classes the moment a second copy is needed.

KD-0591 — Validation errors silently fail on 11 forms

  • Severity: medium
  • Symptom: Users submitted forms; backend returned 422 with field-level errors; nothing rendered. Forms sat there with no feedback.
  • Root cause: 11 templates lacked <FormError name="…" /> next to inputs whose backend rules validated those fields. The response middleware populated the global errorBag correctly — there was just no live FormError instance subscribed to render it.
  • Fix: Inserted 27 missing <FormError> bindings. Two server-determined fields (laneId, order) intentionally skipped.
  • Lesson: Hand-authored form-error placement guarantees drift — forms should structurally couple inputs with their error display (a FormField wrapper, or an arch test that cross-references templates against backend rules).

KD-0589 — Duplicate inserts return 500 instead of 422

  • Severity: medium
  • Symptom: Three Store endpoints (tenant AI key, project AI key, project GitHub repo) returned HTTP 500 when a user submitted a duplicate — typically a double-clicked Save button.
  • Root cause: Migrations declared unique indexes but the corresponding FormRequests had no Rule::unique(...) and the Actions had no pre-check. Duplicates reached save(), surfaced SQLSTATE 23000, and Laravel rendered that as 500.
  • Fix: Added Rule::unique with the migration-matching where() scope to each FormRequest.
  • Lesson: Every DB-level unique index needs a matching Rule::unique (or Action-level guard) — the DB invariant is correct, but without a validation surface the user gets 500 instead of a polite 422. An arch test that cross-references migration unique indexes against FormRequest rules would catch this class.

KD-0588 — PasswordConfirmModal silent failure

  • Severity: high
  • Symptom: Wrong password (or any error) in the password-confirm modal closed the modal silently. User believed the destructive action they were confirming had succeeded.
  • Root cause: handleSubmit ordered emit('close') before await onConfirm. Parent unmounted the modal during the synchronous emit, destroying the inline <FormError name="password" /> before the response middleware could populate errorBag.password. The catch block intentionally swallowed the error, relying on a FormError that no longer existed.
  • Fix: Reorder so emit('close') runs only after a successful await onConfirm. Distinguish 422 (FormError surfaces inline) from non-422 (dangerToast) in the catch.
  • Lesson: Never close a modal until the awaited action it gates has resolved — and never rely on a global error bag if the component subscribed to it might already be unmounted.

KD-0587 — Cross-project attachment leak via unscoped attachment_ids

  • Severity: medium
  • Symptom: Reported as a cross-project leak: a user could attach attachments from another project. Investigation showed the leak didn't actually happen at runtime (the Action layer scoped the query), but the missing FormRequest validation was a real defense-in-depth gap.
  • Root cause: SaveIssueRequest validated attachment_ids.* as ['integer'] only — no Rule::exists('attachments', 'id')->where('project_id', $projectId). Inconsistent with lane_id, sprint_id, epic_id etc. on the same request. The arch test that should have caught this matched only the wrong-pattern ('exists:attachments,id') and missed omission entirely.
  • Fix: Added scoped Rule::exists. Added 'attachments' to the arch-test whitelist.
  • Lesson: Defense-in-depth scoping must live at the FormRequest layer, not just the Action — and arch tests that detect misuse must also detect omission, otherwise they're a false signal.

KD-0586 — Validation errors don't reach users on 14 forms (camelCase mismatch)

  • Severity: medium
  • Symptom: Server-side 422s arrived but never displayed on 14 forms. Wrong-password failures, missing-team errors, etc. all silently failed.
  • Root cause: The Axios response-error middleware ran camelCase(key) on every error key before populating errorBag. 25 <FormError name="snake_case"> bindings looked up keys the middleware never populated. Existing camelCase bindings worked; new authors didn't realise the middleware was transforming.
  • Fix: Renamed all 25 dead bindings to camelCase. Added arch test rejecting any <FormError name> containing _ or -.
  • Lesson: When a middleware silently transforms data shape, the convention has to be enforced by tooling — naming conventions across template/middleware boundaries are guaranteed to drift without an arch test.
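
The arch test's core check can be sketched as a scan over template source. The function name and the regex are assumptions; this version only inspects static name="…" attributes and deliberately skips dynamic :name bindings.

```typescript
// Returns every static <FormError name="…"> value containing "_" or "-".
function invalidFormErrorNames(template: string): string[] {
  const pattern = /<FormError\s+(?:[^>]*\s)?name="([^"]*)"/g;
  const offenders: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(template)) !== null) {
    if (/[_-]/.test(match[1])) offenders.push(match[1]);
  }
  return offenders;
}
```

Run across every template in the frontend, a non-empty result would fail the build before a dead binding reaches users.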

KD-0585 — projects.description column too short for validation rule

  • Severity: medium
  • Symptom: POST /api/projects returned HTTP 500 for any description longer than 255 chars (4 occurrences in 24h on prod).
  • Root cause: Migration created description as $table->string() (VARCHAR(255)). FormRequest later capped at max:5000. Validation passed, INSERT crashed with SQLSTATE 22001 Data too long. No arch test asserts that string|max:N rules don't exceed the underlying column length.
  • Fix: Changed column to TEXT.
  • Lesson: Validation rule length and column length must be cross-checked by tooling — drift between FormRequest max:N and schema length silently turns 422-able input into 500s.

KD-0581 — Billing seat-count mismatch between Kendo UI and Stripe

  • Severity: high
  • Symptom: Pro tenant displayed "Seats: 2 (€4/seat/month)" but Stripe's invoice was €4 — quantity stuck at 1. Customer silently under-billed.
  • Root cause: Two sources of truth with no reconciliation. BillingController::status computed seat count live from User::query()->count(). Stripe's quantity changed only when SyncSeatQuantityJob fired from three Actions (invite/delete/restore). Any membership change before the sync infrastructure existed left Stripe permanently stale. Both paths also failed silently on edge cases (no tenant context, no Cashier subscription, queue failures).
  • Fix (proposed): Make Stripe the single source of truth: read seat count from $subscription->quantity for active subscriptions, fall back to User::query()->count() only without a subscription.
  • Lesson: External systems holding billable state must be the single source of truth. Recomputing the same metric in two places (DB count + external API) without a reconciliation pass guarantees drift, and silent no-ops in sync code make the drift invisible.
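
The proposed read path can be sketched as a single function. The Subscription shape and the function name are hypothetical, and treating an inactive subscription the same as no subscription is an assumption of this sketch.

```typescript
interface Subscription {
  active: boolean;
  quantity: number; // seat quantity as held by the billing provider
}

function displayedSeatCount(subscription: Subscription | null, userCount: number): number {
  // Active subscription: show what is actually being billed.
  if (subscription !== null && subscription.active) return subscription.quantity;
  // No (active) subscription: fall back to the local user count.
  return userCount;
}
```

With this shape the UI can no longer show a seat count that differs from the invoice, although a stale quantity still needs its own reconciliation.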

KD-0574 — TenantAwareQueue captures stale scoped instances, every queued broadcast dropped

  • Severity: high
  • Symptom: Every realtime broadcast dispatched through queue workers silently dropped in production. Users saw no live updates anywhere until manual reload.
  • Root cause: TenantAwareQueue was constructed once at boot with TenantSwitcher and TenantContext injected as readonly properties. Both bindings were scoped. Laravel's queue worker calls forgetScopedInstances() before every job — removing the cached instance from the container's instances[] map but leaving TenantAwareQueue holding an orphan reference. The JobProcessing listener wrote tenant onto the orphan; broadcastOn() resolved a fresh TenantContext with no tenant set. KD-0556's lazy resolution surfaced the bug; eager constructor injection was the underlying defect. Mocked unit tests passed because Mockery has no notion of container scoping.
  • Fix: Replaced eager constructor injection with resolve(...) calls inside every closure registered by register(). Replaced unit tests (which gave false-green for ~7 weeks) with feature tests using real container bindings.
  • Lesson: When a service's dependencies are scoped, that service must NOT cache them in constructor properties — and tests for scoped-binding consumers must use real container bindings, because mocks bypass container scoping and ship false-green.
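
The orphan-reference mechanism is easier to see in a toy model. Everything below is a simplified stand-in written for illustration: Container mimics only the scoped-instance behaviour described above, and EagerQueue / LazyQueue are hypothetical names for the before and after shapes of TenantAwareQueue.

```typescript
class TenantContext {
  tenantId: number | null = null;
}

// Toy container with a single scoped binding.
class Container {
  private scoped: TenantContext | null = null;

  resolveTenantContext(): TenantContext {
    if (this.scoped === null) this.scoped = new TenantContext();
    return this.scoped;
  }

  forgetScopedInstances(): void {
    this.scoped = null; // the next resolve builds a fresh instance
  }
}

// Before: dependency captured once at construction.
class EagerQueue {
  private readonly context: TenantContext;
  constructor(container: Container) {
    this.context = container.resolveTenantContext();
  }
  onJobProcessing(tenantId: number): void {
    this.context.tenantId = tenantId; // writes to a possibly orphaned instance
  }
}

// After: dependency resolved at the moment it is used.
class LazyQueue {
  private readonly container: Container;
  constructor(container: Container) {
    this.container = container;
  }
  onJobProcessing(tenantId: number): void {
    this.container.resolveTenantContext().tenantId = tenantId;
  }
}

// What a broadcast resolving its own TenantContext would see for one job.
function tenantSeenByBroadcast(
  queue: { onJobProcessing(tenantId: number): void },
  container: Container,
  tenantId: number
): number | null {
  container.forgetScopedInstances(); // the worker does this before every job
  queue.onJobProcessing(tenantId);
  return container.resolveTenantContext().tenantId;
}
```

A mock that hands the same object to both sides would make the eager version pass, which is the false-green the entry describes.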

KD-0556 — PR-merge webhook does not broadcast issue lane change

  • Severity: medium
  • Symptom: Merging a PR moved the linked issue's lane server-side, but no updates broadcast reached connected boards in real time. Manual drag-drop broadcast correctly, so the websocket pipeline was healthy.
  • Root cause: 10 broadcast events derived their channel from TenantContext snapshotted in the constructor, and broadcastOn() returned [] silently when the snapshot was null — no log, no exception. Either tenant context wasn't bound at construction time (background job, console command) or the snapshot drifted across processes. The silent-drop was invisible to monitoring.
  • Fix: Shared ResolvesTenantBroadcastChannel trait that resolves TenantContext lazily at broadcast time and emits a structured error-level log when the resolved id is null.
  • Lesson: Code paths that drop work silently are unobservable bugs — every "return empty / skip / no-op" branch on infrastructure code must log something structured, otherwise the failure mode never surfaces.

KD-0553 — GitHub App self-serve install fails on cache-prefix mismatch

  • Severity: high
  • Symptom: GitHub App tenant install failed with "Invalid or expired OAuth state" 404 on the first try. Complete feature outage.
  • Root cause: Install state lived in cache. The install request ran under tenant context (IdentifyTenant mutated cache.prefix to tenant_{id}_); the setup-callback ran on the base domain without IdentifyTenant. Laravel's CacheManager caches resolved stores by name — each store reads cache.prefix at construction. When PHP-FPM resolved a fresh cache.store between requests, it read the default prefix and missed the tenant-prefixed key.
  • Fix: Moved install state to a dedicated github_app_install_states table on the central connection. Prefix-immune by construction.
  • Lesson: State that crosses a tenant-context boundary cannot live in a tenant-prefixed cache — when reads and writes happen under different prefix configs, use a connection-scoped table instead.
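
The prefix mismatch reduces to a write and a read that build different physical keys. The sketch below is a toy model, not Laravel's cache: the helper names, the install_state key, and the empty default prefix are assumptions, while the tenant_{id}_ prefix format comes from the entry above.

```typescript
type Backing = Record<string, string>;

function cachePut(backing: Backing, prefix: string, key: string, value: string): void {
  backing[prefix + key] = value; // the prefix becomes part of the stored key
}

function cacheGet(backing: Backing, prefix: string, key: string): string | null {
  const value = backing[prefix + key];
  return value === undefined ? null : value;
}
```

Writing under tenant_5_ and reading under the default prefix asks for two different keys, so the read misses even though the value is sitting in the same backing store.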

KD-0545 — My Issues badge does not update via realtime broadcast

  • Severity: medium
  • Symptom: The Navbar's My Issues badge and page didn't react to assignment changes, lane crossings into Done, or self-assignment via MCP. Users had to refresh.
  • Root cause: Two-part defect. (1) Backend never fired user-scoped issue broadcasts — only project-channel events, which the Navbar can't subscribe to. (2) Frontend updates/deleted listeners on the user channel were domain-blind — every payload routed unconditionally to notificationStore, even though UserDomainUpdateEvent already carried a domain field.
  • Fix: Added IssueBroadcaster::myIssuesChanged() for user-scoped fan-out (computing wasOnList / isOnList per affected user). Made Navbar's user-channel listeners domain-aware.
  • Lesson: Realtime channels must be cut along the same axis as the data they keep in sync — a "My X" view cannot rely on per-project channels, and a multiplexed user channel needs explicit domain dispatch on the frontend or stores can't share it.
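
Domain-aware dispatch on a multiplexed channel can be sketched as a lookup keyed on the event's domain field. The event shape, the helper name, and the domain strings used in examples are assumptions; only the existence of a domain field on UserDomainUpdateEvent comes from the entry above.

```typescript
interface UserDomainUpdate {
  domain: string;
  payload: unknown;
}

type DomainHandler = (payload: unknown) => void;

function createDomainDispatcher(handlers: Record<string, DomainHandler>) {
  return (event: UserDomainUpdate): boolean => {
    // Own-property check so names like "constructor" never resolve to a handler.
    if (!Object.prototype.hasOwnProperty.call(handlers, event.domain)) {
      return false; // unknown domain: ignore rather than misroute
    }
    handlers[event.domain](event.payload);
    return true;
  };
}
```

Each store registers only its own domain, so several stores can share one user channel without every payload landing in the notification store.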

KD-0537 — Activity-timeline backfill migration deadlocks production release

  • Severity: high
  • Symptom: Fly's release_command for prod v180 failed with MySQL deadlock during a tenant-migration backfill. Production stuck on v179; every dev→main merge would keep failing the release.
  • Root cause: Two interacting problems. (1) Fly's release_command runs in an ephemeral machine while v179 app machines keep serving live traffic — the migration's per-row SELECT ... FOR UPDATE + INSERT fought the live IssueAuditLogger for the next-key lock; InnoDB picked the migration as the deadlock victim. (2) The backfill itself was wrong-shaped — it would have written synthetic "Created today" audit-log entries with now() timestamps, polluting the append-only hash chain with fabricated history. Staging never hit it because traffic was lower at deploy time.
  • Fix: Deleted the migration. The activity timeline correctly returns [] for legacy issues with no audit history; the frontend already had an empty state for that case.
  • Lesson: release_command migrations run concurrently with the previous version's live writes — any backfill that contends with hot-path writes on the same index will deadlock. And: if the truthful response to "we have no data" is an empty array, don't fabricate data to make it look populated.

KD-0519 — Lane reorder chevrons silently no-op on Project Settings

  • Severity: medium
  • Symptom: Clicking the up/down chevron next to a lane appeared to do nothing: the order didn't change visually. (The DB was actually updated; the frontend just didn't refresh.)
  • Root cause: The earlier KD-0464 change was split into two commits. Frontend assumed broadcasts would keep the lane store in sync and removed laneStore.retrieveAll() from updateLaneOrder(). Backend explicitly excluded bulk/cascade lane mutations from broadcasting, and lane reorder happens inside UpdateProjectAction's loop, which falls in exactly that "bulk/cascade" bucket. Net: write happened, no broadcast, no refetch, store kept stale order values.
  • Fix: Restored the one-line laneStore.retrieveAll() after project.update().
  • Lesson: When two commits together replace a refetch with a broadcast, both halves must cover the same code paths — broadcaster scope decisions on the backend must be cross-checked against every refetch the frontend removed.
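
A minimal sketch of the restored shape, assuming the project adapter and the lane store expose the two async methods named in the fix. The interfaces are hypothetical stand-ins, not the real store types.

```typescript
// Hypothetical minimal interfaces standing in for the real adapter and store.
interface ProjectAdapter {
  update(): Promise<void>;
}

interface LaneStore {
  retrieveAll(): Promise<void>;
}

async function updateLaneOrder(project: ProjectAdapter, laneStore: LaneStore): Promise<void> {
  // Persist the new order first.
  await project.update();
  // Lane reorder is a bulk mutation the backend does not broadcast,
  // so the store has to be refetched explicitly.
  await laneStore.retrieveAll();
}
```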

KD-0518 — Sprint title update wrongly requires status

  • Severity: low
  • Symptom: Reporter claimed sprint edit modal returned 422 "status field required". Investigation showed real UI usage round-trips status correctly via mutable.value; only hand-crafted partial-payload clients (curl, MCP, external API) hit the 422.
  • Root cause: No defect in the real UI flow. The contract is "send the full sprint shape on update" — the adapter-store does that; clients that craft partial payloads correctly fail validation.
  • Fix: Kept status as required. Removed the regression tests added during investigation, because they encoded behaviour that contradicts the chosen contract.
  • Lesson: Reproduce the bug from the actual UI flow before changing the contract — a report describing partial-payload behaviour might be from a hand-crafted client and reflect the contract working as designed.

KD-0514 — "Added you to project" notification fires for existing members

  • Severity: medium
  • Symptom: Users already on a project got an "added you to project X" notification when a new team containing them was linked.
  • Root cause: UpdateProjectAction built recipients by array_unique(array_merge($newlyAddedDirectMemberIds, $newTeamMemberIds)). The team-side list was every member of every newly-attached team with no diff against existing project membership. array_unique only deduplicated between the two lists — it didn't subtract users who already had access.
  • Fix: Subtract $currentDirectMemberIds ∪ members-of-currently-attached-teams from $allNewMemberIds before notifying.
  • Lesson: Notifications about "newly added" must diff against the prior state — set-union dedup is not the same as diff. When access can be granted via multiple paths (direct + team), every path must be considered when computing "what changed".
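
The difference between dedup and diff is easy to see in a few lines. The real code is PHP inside UpdateProjectAction; this TypeScript sketch only mirrors the set logic, and the function name and the split of "current" ids into two parameters are assumptions.

```typescript
// Recipients = (new direct members ∪ members of newly attached teams)
//              minus everyone who already had access via any path.
function newlyAddedRecipients(
  newlyAddedDirectMemberIds: number[],
  newTeamMemberIds: number[],
  currentDirectMemberIds: number[],
  currentTeamMemberIds: number[],
): number[] {
  const alreadyHadAccess = new Set(currentDirectMemberIds.concat(currentTeamMemberIds));
  // The Set performs the dedup that array_unique did in the original code.
  const candidates = new Set(newlyAddedDirectMemberIds.concat(newTeamMemberIds));
  // The filter is the step that was missing: subtract the prior state.
  return Array.from(candidates).filter((id) => !alreadyHadAccess.has(id));
}

// User 2 is in the new team but already reaches the project through another team.
const recipients = newlyAddedRecipients([1], [2, 3], [9], [2]); // [1, 3]
```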

KD-0511 — AI validation error leaks into form fields

  • Severity: medium
  • Symptom: Clicking Generate on the AI story prompt with a short report description showed a 422 error attached to the IssueForm's description textarea — a field the user never edited.
  • Root cause: Wire-level field-name collision. The AI endpoint and the IssueForm shared two field names (description, title). The global error bag was keyed only by Laravel field name with no per-form scoping. A 422 keyed description from the AI endpoint rendered under any <FormError name="description">.
  • Fix: Renamed AI endpoint payload keys to sourceDescription/sourceTitle so no <FormError> watches them.
  • Lesson: A globally-scoped error bag means wire field names are a global namespace — two forms sharing a field name will leak errors across each other. Either scope error bags per form, or use distinct wire-field names for distinct forms.
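
The fix chosen here was renaming the wire fields. The other option named in the lesson, a per-form error bag, might be sketched as follows. The class and the scope names are hypothetical; the `field -> messages` shape follows Laravel's 422 response body.

```typescript
// Validation errors as Laravel returns them on a 422: field name -> messages.
type FieldErrors = Record<string, string[]>;

class ScopedErrorBag {
  private readonly byScope = new Map<string, FieldErrors>();

  // Store the errors of one response under the form that sent the request.
  set(scope: string, errors: FieldErrors): void {
    this.byScope.set(scope, errors);
  }

  // A form only ever reads errors stored under its own scope.
  get(scope: string, field: string): string[] {
    const errors = this.byScope.get(scope);
    return errors?.[field] ?? [];
  }
}
```

With this shape, a 422 keyed description from the AI endpoint stays under the AI prompt's scope and never reaches the IssueForm's textarea.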

KD-0510 — Newly created report not auto-selected in detail pane

  • Severity: low
  • Symptom: Submitting "+ Report" appended the new report to the list but the right-hand detail pane stayed on the placeholder. User had to find and click the new entry.
  • Root cause: handleCreate called await newReport.create() but discarded the return value. The adapter resolved to the persisted Report with its server-assigned id; nothing assigned that id to selectedReportId.
  • Fix: Capture the returned Report and assign its id to selectedReportId.
  • Lesson: Async create flows must thread the persisted entity's id back through the UI — otherwise the verify-and-edit loop is broken into two disconnected steps.
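
A minimal sketch of the corrected handler. The interfaces are hypothetical reductions of the real adapter and component state, assuming create() resolves to the persisted entity as described above.

```typescript
// Hypothetical minimal shapes for the persisted report, the adapter, and UI state.
interface PersistedReport {
  id: number;
}

interface NewReportAdapter {
  create(): Promise<PersistedReport>;
}

interface SelectionState {
  selectedReportId: number | null;
}

async function handleCreate(newReport: NewReportAdapter, state: SelectionState): Promise<void> {
  // Keep the resolved entity instead of discarding it...
  const created = await newReport.create();
  // ...and thread its server-assigned id into the selection.
  state.selectedReportId = created.id;
}
```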

KD-0508 — Modal close button not accessible

  • Severity: low
  • Symptom: The X close button in modals was invisible to keyboard tools (Vimium) and screen readers.
  • Root cause: The X was a bare <svg> with a @click handler. SVG is not in the default tab order, has no implicit ARIA role, and is invisible to browser-level focus management.
  • Fix: Wrapped the icon in <button type="button" aria-label="Close"> with focus-visible ring.
  • Lesson: Click handlers belong on semantic elements (<button>, <a>) — never on bare icons. Accessibility regressions of this class are best caught by a structural arch rule, not visual review.
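
One way a structural rule for this class could work is a plain scan of template source for `<svg>` tags that carry a click handler. This is a rough sketch, not the project's actual arch test: the function name is made up, and the regex assumes attribute values contain no `>` character.

```typescript
// Returns the opening tag of every <svg> element that carries a click handler.
// Limitation: attribute values containing ">" would cut the match short.
function findClickableSvgs(template: string): string[] {
  const openingSvgTag = /<svg\b[^>]*>/g;
  const clickHandler = /(?:@click|v-on:click)\b/;
  const tags: string[] = template.match(openingSvgTag) ?? [];
  return tags.filter((tag) => clickHandler.test(tag));
}
```

A test suite could run this over every component template and fail when the result is non-empty.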

KD-0500 — Epic name overflow on board cards

  • Severity: low
  • Symptom: Long epic titles overflowed the colored badge horizontally past the card's right edge into adjacent space.
  • Root cause: Three compounding causes. (1) Project doesn't import @unocss/reset, so elements default to box-sizing: content-box, where max-width: 100% constrained only the content and padding+border were added on top. (2) min-width: auto on inline-block elements with nowrap resolves to full unwrapped text width, beating max-width. (3) text-nowrap prevented wrapping but didn't add overflow:hidden or ellipsis.
  • Fix: Combined box-border + min-w-0 + max-w-full + truncate on SimpleBadge. Added min-w-0 on the parent flex container.
  • Lesson: Without a global box-sizing: border-box reset, every component that uses padding/border with percentage max-width is overflow-prone — and min-width: auto on flex/inline-block items is the silent killer that makes max-width constraints useless.
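
The box-sizing part of this is plain arithmetic, sketched below. The function and the 200px/16px/2px figures are illustrative assumptions, not measurements from the actual card.

```typescript
type BoxSizingMode = "content-box" | "border-box";

// Outer width of a box whose `max-width: 100%` cap is reached.
// Under content-box the cap applies to the content only, so padding and
// border are added on top; under border-box they are included in the cap.
function outerWidthAtMaxWidth(
  containerWidth: number,
  horizontalPadding: number,
  horizontalBorder: number,
  boxSizing: BoxSizingMode,
): number {
  return boxSizing === "content-box"
    ? containerWidth + horizontalPadding + horizontalBorder
    : containerWidth;
}

// 200px container, 8px padding and 1px border on each side.
const contentBoxWidth = outerWidthAtMaxWidth(200, 16, 2, "content-box"); // 218
const borderBoxWidth = outerWidthAtMaxWidth(200, 16, 2, "border-box"); // 200
```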

KD-0496 — Manual reports show 'Unknown' as author

  • Severity: low
  • Symptom: Manual reports created through the UI showed "Unknown" as author. API-created reports were fine.
  • Root cause: ReportForm.vue didn't send author_name; CreateReportAction stored null; frontend templates fell back to literal 'Unknown'. The single write site ($report->author_name = $data->authorName) was the bug — every read path correctly read the column, but the column was never populated for manual reports.
  • Fix: Write-time fallback in the Action: $data->authorName ?? "{first_name} {last_name}" when a creator is present.
  • Lesson: When many read paths share a single column, fix at the write site — anything else means hunting through every consumer (HTTP Resource, MCP tools, frontend templates) to patch the same fallback.
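
The real fallback lives in the PHP Action. The sketch below mirrors only its logic in TypeScript, with a hypothetical function name and a creator shape reduced to the two fields the fix mentions.

```typescript
// Hypothetical reduction of the creating user to the fields the fallback needs.
interface ReportCreator {
  first_name: string;
  last_name: string;
}

// Resolve the author once, at write time, so every read path sees a populated column.
function resolveAuthorName(authorName: string | null, creator: ReportCreator | null): string | null {
  if (authorName !== null) {
    return authorName;
  }
  // No explicit author: fall back to the creator's full name when one is present.
  return creator !== null ? `${creator.first_name} ${creator.last_name}` : null;
}
```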

KD-0484 — PaginationBar overflows narrow containers

  • Severity: low
  • Symptom: Pagination bar visibly overflowed the Reports Overview's 400px left column when there were 8+ pages.
  • Root cause: <nav> had flex with no flex-wrap and used 11 fixed-width w-10 buttons (~440px intrinsic). Flex items don't shrink below their content's intrinsic width without explicit min-w-0. Parent had flex-wrap so siblings could wrap relative to each other, but neither child could wrap internally.
  • Fix: Added flex-wrap as last-resort fallback plus a CSS container query that hides the redundant «/» shortcut buttons below 440px (the edge pages remain reachable via the existing first/last page-number buttons).
  • Lesson: Container queries are the right tool for "compact this UI when its container is narrow" — viewport media queries can't see how big the actual parent column is.

KD-0445 — Ticket-updated toast cluttered, auto-hides

  • Severity: low
  • Symptom: When user A updated an issue, user B got an infoToast reading "Alert: <title> updated by <actor> at <ISO timestamp>" that auto-hid after 5s.
  • Root cause: Three coupled gaps. (1) Show.vue (issue detail page) didn't subscribe to project-channel issue-update broadcasts like Board/Backlog/Overview did. (2) Backend papered over (1) with a global PrivateAnnouncement user-channel toast. (3) The toast variant had no persistence escape-hatch — auto-hid before the user could act.
  • Fix: Removed the entire PrivateAnnouncement → 'alerts' → infoToast plumbing. Realtime "your view is stale" UX should live on the page that goes stale, not as a global cross-page toast.
  • Lesson: When a global notification papers over a missing local realtime subscription, the right fix is to wire the local subscription — not to refine the global notification.

KD-0443 — Board layout glitches on phone in landscape

  • Severity: low
  • Symptom: On a phone in landscape, the sidebar took ~30% of the width and avatars overflowed issue cards.
  • Root cause: Two issues. (1) Breakpoint service used window.innerWidth < 768 for mobile detection, but a phone in landscape is often 800-900px wide, which crosses the threshold and renders the desktop sidebar. Viewport width is not a reliable proxy for device type. (2) DragElement.vue had no overflow-hidden constraint, so children escaped at narrow column widths.
  • Fix: Added isTouchDevice via the CSS media query (pointer: coarse) and (hover: none), which identifies phones and tablets that have no external input devices. Forced collapsed sidebar on touch devices. Added overflow-hidden + truncation to the card.
  • Lesson: For "is this a phone" questions, ask CSS about input modality (pointer: coarse, hover: none) — never use viewport width as a proxy. Phones in landscape break that proxy.
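
A minimal sketch of the input-modality check. The real breakpoint service's API is not shown here, so the injected `MediaMatcher` type is an assumption that keeps the function testable outside a browser.

```typescript
// Structural subset of window.matchMedia used by this sketch.
type MediaMatcher = (query: string) => { matches: boolean };

// True when the primary pointer is coarse and hovering is unavailable,
// which holds for phones and tablets regardless of orientation or width.
function isTouchDevice(matchMedia: MediaMatcher): boolean {
  return matchMedia("(pointer: coarse) and (hover: none)").matches;
}
```

In a browser the call would be `isTouchDevice((query) => window.matchMedia(query))`.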

KD-0406 — Toast notifications hidden behind modal overlay

  • Severity: low
  • Symptom: Toasts fired while a modal was open rendered beneath the modal's backdrop and were invisible to the user.
  • Root cause: Modals use native <dialog>.showModal(), which adds the dialog to the browser's top layer — a separate rendering stack that paints above every regular z-index context. The toast container was a fixed <div> at z-index: 1050 in the regular stacking context. Top layer always wins.
  • Fix: Upstream in @script-development/fs-toast@0.2.0 — added popover="manual" to the container <div> and showPopover() on mount. Re-enters the top layer on every new toast (last-in-wins ordering).
  • Lesson: The browser's top-layer is not a higher z-index — it's a parallel rendering stack. Anything that must paint above a native <dialog> must also live in the top layer (via Popover API), not just have a high z-index.
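
The top-layer re-entry can be sketched with the standard Popover API methods. This is not the fs-toast source: the function name is made up, and the structural `PopoverCapable` type stands in for HTMLElement so the sketch runs without a DOM.

```typescript
// Structural subset of HTMLElement used by this sketch.
interface PopoverCapable {
  setAttribute(name: string, value: string): void;
  matches(selector: string): boolean;
  showPopover(): void;
  hidePopover(): void;
}

// Puts the container at the end of the top layer so it paints above
// anything already there, including an open modal <dialog>.
function promoteToTopLayer(container: PopoverCapable): void {
  container.setAttribute("popover", "manual");
  // showPopover() throws if the popover is already open, and re-showing is
  // what moves the element to the end of the top layer, so hide first.
  if (container.matches(":popover-open")) {
    container.hidePopover();
  }
  container.showPopover();
}
```

Calling this on every new toast gives the last-in-wins ordering described in the fix.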

Recurring themes

  • Silent failures are the real bug. KD-0556, KD-0588, KD-0511, KD-0586, KD-0591, KD-0581, KD-0604 — every "no error, no toast, nothing rendered" symptom traces to a code path that swallows or drops without logging. Every "return empty / skip / no-op" branch on infrastructure code needs a structured log entry, or the bug is unobservable.

  • Form-error binding edge cases. KD-0586 (camelCase mismatch), KD-0591 (missing bindings), KD-0511 (cross-form key collision), KD-0588 (modal unmounted before error rendered), KD-0606 (snake/camel rule mismatch). The <FormError> + global error bag pattern keeps producing the same shape: any drift between the wire shape, the middleware transform, the rule key, or the template binding silently breaks the user feedback loop. Arch tests catch the structural class; component-level patterns (FormField wrappers, scoped error bags) prevent it.

  • Tenant-scoping leaks via the wrong infrastructure. KD-0553 (cache prefix crossing tenant boundary), KD-0574 (scoped binding captured at boot), KD-0556/KD-0537 (broadcast/migration assumes tenant context). State that crosses tenant boundaries must live somewhere prefix-immune (central connection table, lazy resolution at consumption time) — caching it under a tenant prefix or capturing scoped instances at boot is a guaranteed silent drop.

  • Validation drift between layers. KD-0585 (max:5000 vs VARCHAR(255)), KD-0589 (unique index vs no Rule::unique), KD-0596 (signup blocklist vs admin paths), KD-0587 (scoped exists missing on attachments). Whenever a constraint exists at one layer (DB, migration, central rule) but isn't mirrored at the layer the user hits first (FormRequest), the error surfaces as a 500 or a silent leak. Arch tests that cross-reference layers (column length vs rule max, migration unique vs FormRequest unique, project-owned tables vs scoped exists) close the entire class.

  • Single source of truth, or guaranteed drift. KD-0581 (UI count vs Stripe quantity), KD-0596 (4 copies of subdomain rule), KD-0586 (manual <FormError> placement vs middleware naming), KD-0510 (server response not threaded to UI state). Anywhere the same value is computed/stored/displayed in two places without a reconciliation pass, drift is a question of when, not if.

  • Broadcast/refetch coverage gaps. KD-0519 (frontend dropped refetch assuming broadcast covered it), KD-0545 (project channel doesn't cover My-X views), KD-0556 (silent broadcast drop), KD-0445 (page didn't subscribe at all). Realtime channels must be cut along the same axis as the views consuming them. A user-list view cannot rely on per-project channels; a per-page view cannot rely on a global toast as compensation for a missing subscription.

  • Misleading error UX directs users away from recovery. KD-0600 ("close this tab" while the URL silently still worked), KD-0588 (modal closed before error rendered), KD-0606 ("source description required" with no input by that name). Error copy must reference the user's actual recovery path — never tell users to abandon a working state, never reference field names the UI doesn't show.
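
As an illustration of the cross-layer arch tests mentioned under validation drift, the column-length versus rule-max comparison might reduce to something like this. How the two maps get extracted from migrations and FormRequests is left out, and all names are hypothetical.

```typescript
// Hypothetical inputs: column lengths read from migrations, `max:` values
// read from FormRequest rules, both keyed by field name.
function findMaxRuleDrift(
  columnLengths: Map<string, number>,
  ruleMaxes: Map<string, number>,
): string[] {
  const drifted: string[] = [];
  ruleMaxes.forEach((ruleMax, field) => {
    const columnLength = columnLengths.get(field);
    // A rule that admits more characters than the column stores is drift.
    if (columnLength !== undefined && ruleMax > columnLength) {
      drifted.push(`${field}: max:${ruleMax} exceeds column length ${columnLength}`);
    }
  });
  return drifted;
}
```

For a VARCHAR(255) column validated with max:5000, the function returns one entry naming the field; an arch test would fail on any non-empty result.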