# Architecture (/docs/architecture)

Heartbeat is a **TV cockpit** that aggregates data from every source
system Satchel runs on and surfaces what's broken without waiting for
someone to walk past a screen and notice. This page is the
architectural ground
truth for the project — pipeline, data contract, refresh path,
principles, repo layout. Sibling docs cover [sources](/docs/sources),
[metric lifecycle](/docs/metrics/lifecycle), [decisions](/docs/decisions),
and the [roadmap](/docs/roadmap).

## Business goals [#business-goals]

Heartbeat exists to give Satchel's CEO **a single cockpit that shows
the state of the business in real-ish time** — and, just as
importantly, raises an alert when something silently breaks. Today
the team runs on data that is "a month, if not months, late", and
serious problems surface by accident in corridor conversations. The
canonical example raised at the 2026-04-24 offsite: a client did not
pay for MasterCard collateral for 4 months → Satchel covered the
debt out of its own funds → CEO learned about it by chance, walking
past someone's desk. Heartbeat must catch events of that shape
*before* they cost money.

Four concrete jobs the product must do:

1. **TV cockpit / red-zones-first view.** A single screen — physically
   on a TV in the office, always on, glanceable from across the
   room — where the CEO and team see the state of the business.
   Direct quote from the offsite: «хочу видеть говно, а не то, что
   хорошо» ("I want to see the shit, not what's fine"). Green metrics
   fade; red ones surface.
2. **Silent-event alerting.** When a metric crosses a threshold that
   matters, Heartbeat raises a signal — it does not wait for someone
   to walk past a screen. This is the layer that prevents the
   MasterCard-collateral class of incident.
3. **OKR / KPI substrate (later).** Once Heartbeat reliably produces
   metrics, those metrics become the input for departmental OKRs.
   Heartbeat isn't an OKR system, but every other initiative that
   needs trustworthy metrics depends on it landing first.
4. **Conversational interface via MCP (later).** Expose the registry
   and history as an MCP server so an agent can answer "what's going
   on today?" / "what should I look at first?" Same data, different
   surface.

What Heartbeat is **not**: a general-purpose BI tool, a multi-tenant
analytics platform, or a replacement for Zoho / YouTrack / Slack.
The principle is **aggregate, not replace** — channels stay where they
are; the *metric* is what gets unified.

## Where we are now (2026-05-08) [#where-we-are-now-2026-05-08]

* **6 source DBs landed** in `marts-db` (\~15 GB across 9 schemas):
  `webbank`, `aml`, `pdvs`, `vss`, `card_management`, `currency_rate`.
  See [sources](/docs/sources) for landed/pending detail.
* **31 metrics** in `heartbeat.metric_registry`; cockpit reads
  `metric_history` with honest `computed_at` and `source_as_of` per
  row.
* **17 audit cuts** in `heartbeat.audit_cut_registry` driving 3 pages
  (`/audit`, `/audit/clients`, `/audit/journeys`) via the
  parameterised `AuditPage` + cut-renderer registry.
* All 4 surfaces share `AppShell` with primary nav and Docs/Feedback
  icons.
* Both registries carry `verification_state` + email `owner`;
  everything ships `unverified` until cross-checked
  ([why](/docs/metrics/lifecycle)).
* Ingest+compute behind one endpoint (`POST /api/metrics/recompute`)
  with UI progress and hourly cron.
* Compute phase parallelised (`ThreadPoolExecutor(max_workers=4)`);
  full refresh of 31 metrics now completes in \~2 min. Post-ingest
  analytics indexes on the 4 largest tables (see
  [Data quality and performance](#data-quality-and-performance)).

Next steps live in the [roadmap](/docs/roadmap): (d) metric ↔ cut
deep-links, (e) auto-generated `/docs` content from the registries,
(f) alerting, (g) automated metrics discovery, (h) starred metrics
and watchlists.

## Pipeline shape [#pipeline-shape]

Five stages, in order:

```
sources → dlt ingest → marts-db → python compute → registry + history → surfaces
```

```mermaid
flowchart LR
  S[Sources<br/>Postgres / REST / webhooks] --> I[Ingest<br/>api/ingest_lib.py — dlt]
  I --> M[(marts-db<br/>Postgres on the VM)]
  M --> C[Compute<br/>api/metrics_compute.py]
  C --> R[(metric_registry<br/>+ metric_history)]
  R --> Cock[Cockpit /]
  R --> Aud[Audit /audit/*]
  R -.future.-> Alerts[Alerts]
  R -.future.-> MCP[MCP server]
```

1. **Sources** — heterogeneous (Postgres replicas, REST APIs,
   webhooks, files).
2. **Ingest** — `dlt`, **one engine** in `api/ingest_lib.py` driven by
   a `SOURCES` table of `SourceSpec(db, schemas, pipeline_name)`. The
   same engine is invoked from the CLI (`scripts/ingest.py`) and from
   the recompute endpoint. There is **no** per-source `dlt_<name>.py`
   script — adding a source is a registry edit, not a new file (see
   the sketch at the end of this section).
3. **marts-db** — Postgres on the VM, owned by us. Landed copies live
   here, with whatever indexes we add (see [§ Indexes](#indexes)).
4. **Compute** — python module `api/metrics_compute.py`. For each row
   in `heartbeat.metric_registry` it executes `value_sql` against
   marts-db, captures `source_as_of` from the relevant
   `<schema>._dlt_loads`, resolves status, and UPSERTs a row into
   `heartbeat.metric_history`. Per-metric python modules are reserved
   for the day a metric needs source-API access or logic that cannot
   be expressed as `value_sql` — none exist today.
5. **Registry + history** — `heartbeat.metric_registry` (config) and
   `heartbeat.metric_history` (cached values, with `computed_at` and
   `source_as_of` per row). **Single read model.**

Three surfaces consume registry + history: the **cockpit** (TV,
today), the **alert dispatcher** (Slack/Telegram, later), and the
**MCP server** (agent, later). All three read the same two tables;
none of them touches sources or the compute layer directly.

The contract this shape commits us to:

* **Sources are heterogeneous** (Postgres / REST / webhook / file).
* **Land before transform.** We always land raw rows in marts-db
  before computing. No live-querying source systems on a cockpit
  refresh.
* **Registry + history is the single read model** for every consumer
  surface. New surfaces (alerts, MCP) plug into the same two tables;
  they don't grow their own data models.
* **Cockpit reads only from history** — never from source, never from
  raw marts, never blocking on compute. This is what makes the TV
  immune to a stalled sync or a slow API.
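
For concreteness, a minimal sketch of the stage-2 registry. The field
names follow the `SourceSpec(db, schemas, pipeline_name)` shape named
above, but the exact dataclass, aliases, and entries are assumptions;
the real table lives in `api/ingest_lib.py`.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SourceSpec:
    db: str                    # connection alias for the source database
    schemas: tuple[str, ...]   # source schemas to land into marts-db
    pipeline_name: str         # dlt pipeline name; drives <schema>._dlt_loads

# Adding a source means adding an entry here, not writing a new script.
SOURCES: dict[str, SourceSpec] = {
    "webbank": SourceSpec(db="webbank", schemas=("core",), pipeline_name="webbank"),
    "aml":     SourceSpec(db="aml",     schemas=("aml",),  pipeline_name="aml"),
}
```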

## Metric data contract [#metric-data-contract]

Every metric, regardless of source or compute path, lands in the
registry / history with the same six-component shape. This is the
**contract between the data layer and every consumer surface**
(cockpit, alerts, MCP, future OKR overlay). Surfaces never invent
their own threshold model — they read these values and decide what
to do.

| Field             | What it is                                                                                                                                                                                           | Where it comes from                                                |
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------ |
| **value**         | The current scalar for the metric.                                                                                                                                                                   | Compute layer; written into `metric_history` per refresh.          |
| **period**        | The time window the value covers (`snapshot`, `24h`, `7d`, `30d`, `funnel`). Without it the number is meaningless.                                                                                   | Config in `db/metrics/<id>.sql`.                                   |
| **typical**       | μ ± σ (rolling benchmark over the last N days). What "normal" looks like for this metric. Optional — only `rolling_stat` metrics carry it.                                                           | Recomputed daily by `refresh_benchmarks.py` from `metric_history`. |
| **norm\_value**   | Soft acceptable boundary — the **amber** line. Source: owner commitment.                                                                                                                             | Owner-set in registry; tunable live without redeploy.              |
| **alert\_value**  | Hard ceiling — the **red** line. Source: SLA, regulatory cap, capacity ceiling. **Never auto-derived.**                                                                                              | Owner-set in registry; tunable live without redeploy.              |
| **target\_value** | OKR stretch goal — better than `norm_value`, optional. **Display-only**: beating it does not change status colour, it lights a "target hit" accent on the green tile. Paired with `target_due_date`. | OKR / owner; tunable live.                                         |

Status (green/amber/red) is server-computed in `api/cockpit.py` with
a fixed priority:

1. `alert_value` crossed → **red**
2. else `norm_value` crossed → **amber**
3. else if μ ± σ set → z-score (|z| ≥ 2 → red, |z| ≥ 1 → amber)
4. else **green**

`target_value` does not enter the status calculation — it's a forward
commitment surfaced as visual encouragement, not a threshold.
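
That priority as a sketch (names are illustrative; it also assumes
"crossed" means exceeding an upper bound, whereas `api/cockpit.py`
may handle direction per metric):

```python
def resolve_status(value, alert_value, norm_value, mean, stddev) -> str:
    if alert_value is not None and value >= alert_value:
        return "red"                      # 1. hard ceiling crossed
    if norm_value is not None and value >= norm_value:
        return "amber"                    # 2. soft boundary crossed
    if mean is not None and stddev:
        z = abs(value - mean) / stddev    # 3. fall back to the rolling benchmark
        if z >= 2:
            return "red"
        if z >= 1:
            return "amber"
    return "green"                        # 4. nothing crossed
# target_value is deliberately absent: beating it never changes the colour.
```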

Why each consumer needs all six:

* **Cockpit** colours the tile per the priority above and lights a
  "target hit" accent when `value` beats `target_value` on a green
  tile.
* **Alerts** fire on the same priority. Reusing the contract means an
  alert can never disagree with what the screen shows.
* **OKR overlay** reads `target_value` + `target_due_date` directly
  as the KPI commitment, `norm_value` as the guard rail, `typical`
  as the baseline you're trying to move from.
* **MCP** answering "what's wrong today" returns metrics where
  `value` crossed `norm_value` or `alert_value`, ranked by severity
  / staleness.

Anything a surface needs beyond these six (per-team filter,
escalation chain, snooze state, OKR period boundary) is **surface
state**, not metric state, and lives in tables next to the surface —
never in the registry.

## Refresh path [#refresh-path]

There is **one path** for data to move forward: ingest **and** compute
behind a single endpoint, `POST /api/metrics/recompute`. A request
resolves the requested metrics → distinct dlt schemas → runs
`run_one()` per (db, schema) serially → recomputes every metric whose
schema is in that set. Cron uses the same path through
`scripts/refresh_all.py`. UI clicks, cron ticks, and any future
surface all speak this one language.
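
From a client's point of view the path looks like this sketch, which
is essentially what `scripts/refresh_all.py` does. Host, port, and the
exact status values are assumptions; only the endpoints and the
request payload come from this page.

```python
import time
import requests

API = "http://localhost:8000"  # assumed; the cron script POSTs to the local api

resp = requests.post(f"{API}/api/metrics/recompute",
                     json={"metric_ids": "all", "trigger": "cron"})
job_id = resp.json()["job_id"]  # a POST during an active job returns the
                                # existing job_id with coalesced=true

while True:                     # poll until the single-flight job finishes
    job = requests.get(f"{API}/api/metrics/recompute/{job_id}").json()
    if job.get("status") in ("done", "failed"):  # status values assumed
        break
    time.sleep(5)
```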

**Single-flight, globally coalesced.** Only one ingest+compute job
runs at a time. A POST during an active job returns the existing
`job_id` (`coalesced=true`); a cron tick that arrives mid-job is
silently absorbed. Job state is held in-process
(`api/recompute.py:_JOBS`); on api restart, in-flight jobs are
dropped — the next cron tick re-runs.

Three reasons single-flight is load-bearing:

* The VM tips over with parallel dlt extracts (it ran t3.small/2 GB
  before the t3.large upgrade; the ceiling is higher now but
  parallelism would still compete with `api`/`web`/`marts-db` for
  the same RAM).
* Multiple concurrent jobs racing to UPSERT the same
  `(metric_id, current_date)` history row would produce a value whose
  `source_as_of` and `computed_at` came from different jobs — a
  confusing audit trail.
* A persistent job ledger is overkill for an operator-facing console;
  we'll add it the day we need cross-restart resumption.

`GET /api/metrics/active-job` exposes the in-flight job to the UI;
the cockpit shows progress whether the trigger was a click or cron.

### Cron schedule [#cron-schedule]

| Cron         | What it does                                                               | Entry                                      |
| ------------ | -------------------------------------------------------------------------- | ------------------------------------------ |
| `17 * * * *` | Hourly ingest + compute (every metric).                                    | `scripts/refresh_all.py`                   |
| `23 4 * * *` | Daily μ/σ refresh for `rolling_stat` metrics.                              | `python -m api.scripts.refresh_benchmarks` |
| `30 4 * * *` | Daily client-segmentation snapshot.                                        | `python -m api.scripts.refresh_segments`   |
| `45 4 * * *` | Daily audit-cut registry sync + snapshot refresh for every (cut × window). | `python -m api.scripts.refresh_audit`      |

## Freshness signal [#freshness-signal]

Each `metric_history` row carries `source_as_of` — the maximum
`inserted_at` over completed loads in the metric's source schema
(`<schema>._dlt_loads.inserted_at WHERE status='completed'`). This
gives an automatic lower bound "data is at least this fresh" for
every metric without per-metric configuration.

**Staleness is a separate visual signal from status, not folded into
red/amber/green.** The status dot reflects `value` vs
`target`/`limit`/benchmark. The freshness timestamp colours
independently:

| period     | amber | red   |
| ---------- | ----- | ----- |
| `24h`      | > 36h | > 72h |
| `30d`      | > 7d  | > 14d |
| `snapshot` | > 24h | > 48h |
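
The table above, as a sketch of the freshness colouring with the
thresholds converted to hours (the real implementation lives with the
cockpit; function and constant names are illustrative):

```python
from datetime import datetime, timezone

# (amber_after, red_after) in hours, mirroring the table above.
STALENESS_HOURS = {
    "24h":      (36, 72),
    "30d":      (7 * 24, 14 * 24),
    "snapshot": (24, 48),
}

def freshness_colour(period: str, source_as_of: datetime) -> str:
    amber_h, red_h = STALENESS_HOURS[period]
    age_h = (datetime.now(timezone.utc) - source_as_of).total_seconds() / 3600
    if age_h > red_h:
        return "red"
    if age_h > amber_h:
        return "amber"
    return "green"
```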

Conflating "the value is bad" with "the pipeline is broken" hides
the actual problem. An operator looking at red needs to know which
one it is — wrong number, or stale source.

## Audit cuts [#audit-cuts]

`/audit/<page>` is built out of **cuts** — small charts with prose
between them. Each cut is a Python module under
`api/audit/cuts/<cut_id>.py` exporting two things:

```python
META = {
    "page": "clients",        # which /audit/<page> the cut renders on
    "title": "...",
    "window_default": "30d",  # default window_key for cached snapshots
}

def compute(conn, *, window: str) -> dict:
    # Runs against marts-db; the returned dict is cached verbatim in
    # heartbeat.audit_cut_snapshot, keyed by (cut_id, window_key).
    ...
```

The module **is** the source of truth.
`api.scripts.refresh_audit` auto-syncs `heartbeat.audit_cut_registry`
from disk — no per-file SQL, no DB-level write guard.
`heartbeat.audit_cut_snapshot` caches the payload per
`(cut_id, window_key)`. Each leaf page (`/audit/<page>/page.tsx`) is
\~30 lines and just calls `<AuditPage page="clients" />`; the
`cut_id → React component` registry in
`dashboard/app/audit/_shared/cuts/index.ts` maps each cut to its
renderer. See [audit cuts](/docs/audit-cuts) for the add-a-chart
flow.
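
For shape, a hypothetical cut module. The table, columns, window
handling, and payload keys are all invented for illustration (it also
assumes a DB-API style `conn`); a real cut defines its own payload
contract with its React renderer.

```python
# api/audit/cuts/new_clients.py  (hypothetical example, not a real cut)
META = {
    "page": "clients",
    "title": "New client profiles",
    "window_default": "30d",
}

def compute(conn, *, window: str) -> dict:
    days = {"7d": 7, "30d": 30}.get(window, 30)
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT date_trunc('day', create_time) AS day, count(*) AS n
            FROM webbank_core.client_profile          -- column names assumed
            WHERE create_time >= now() - %s * interval '1 day'
            GROUP BY 1 ORDER BY 1
            """,
            (days,),
        )
        rows = cur.fetchall()
    return {"series": [{"day": d.isoformat(), "count": n} for d, n in rows]}
```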

## Verification + ownership [#verification--ownership]

Both `metric_registry` and `audit_cut_registry` carry:

* `verification_state ∈ {unverified, verified}`, default `unverified`,
* `owner` — email validated against a loose RFC-5321 regex,
  default `NULL`.

Operators PATCH these via `/api/metrics/{id}/verification` and
`/api/audit/cuts/{id}/verification`. Both endpoints accept either
field independently (auto-save on owner blur). These two columns are
the only registry fields operators can change without a redeploy.
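
A sketch of the operator call (host and metric id are illustrative;
the endpoint shape and field names are the ones above):

```python
import requests

API = "http://localhost:8000"  # assumed

# Either field alone is accepted; this is what the owner-blur auto-save sends.
requests.patch(f"{API}/api/metrics/revenue_24h/verification",
               json={"owner": "cfo@satchel.example"})

# Flip verification separately, after the cross-check:
requests.patch(f"{API}/api/metrics/revenue_24h/verification",
               json={"verification_state": "verified"})
```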

The metric lifecycle — when a tile is unverified, when an owner gets
attached, when verification flips — is its own page:
[metric lifecycle](/docs/metrics/lifecycle).

## Principles [#principles]

These hold across every choice in this codebase and shouldn't be
re-litigated when we pick a tool or pattern.

* **Aggregate, not replace.** Heartbeat reads from existing systems;
  it does not ask the team to migrate off Zoho or to consolidate
  tools. Each source connector is a one-way ingest.
* **Smallest stack that does today's job.** Build what the cockpit
  needs *now*, but don't take decisions that paint us into a corner
  when the second source / the alert layer / the OKR overlay lands.
  No abstractions invented for hypothetical use cases.
* **Config-truth in the file, runtime-truth in the table.**
  `db/metrics/<id>.sql` is the **config source of truth** — name,
  SQL fragment, period, owner, description, initial target/limit.
  Versioned and reviewed in PRs. `heartbeat.metric_registry` is the
  **runtime table** that the API joins against and that holds values
  which legitimately change outside the file: operator-tuned
  `norm_value` / `alert_value`, daily-recomputed `benchmark_mean` /
  `benchmark_stddev`, `verification_state`, `owner`. A DB-level write
  guard (`heartbeat.metric_registry_guard`) makes the file the only
  path for config columns, while leaving the operator-tunable columns
  writable.
* **Status logic stays server-side** (`api/cockpit.py`). The frontend
  renders what the API computed; it never re-derives green/amber/red.
* **Cockpit always renders from a fast read.** A slow source or a
  failing sync must not slow down or break the screen the CEO opens.
* **TV-first surface.** The primary client is an always-on TV in the
  office, not a laptop. Large legible layout, auto-refresh on a fixed
  interval, no auth modals, no toasts that require dismissal,
  graceful degradation on transient API failures (last good value
  with staleness indicator, not a blank tile).
* **One read model, many surfaces.** Cockpit, alerts, and (later)
  MCP all consume the same `metric_registry` + `metric_history`.
* **AI is the primary developer; recurring workflows are codified as
  skills.** Heartbeat is built mostly by AI agents. Every recurring
  workflow ("add a metric", "onboard a new source", "investigate a
  stalled sync", "draft an alert rule") is written as a skill under
  `.claude/skills/`. The skill *is* the runbook — executable,
  tested-by-use, and the only sanctioned path for that workflow.
  Live skills today: `metrics-discovery`, `metric-validate`,
  `heartbeat-today`.
* **`/docs` is the single source of truth.** This site (fumadocs) owns
  product, lifecycle, sources, decisions, and operations. Any new
  feature, product decision, user scenario, or behavioural change
  updates the relevant page in `/docs` **in the same PR** as the
  code. CLAUDE.md and README are thin pointers into `/docs`; they do
  not duplicate architecture. See the
  [decision](/docs/decisions#docs-as-sot) for context.
* **English for all code, comments, commits, and project docs.**
  Russian only in user-facing prose where the audience is the
  Satchel team itself.

## Repo map [#repo-map]

| Path                                                | Role                                                                                                                                                                                                          |
| --------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `api/ingest_lib.py`                                 | dlt engine. Owns `SOURCES`, `SKIP_TABLES`, `COLUMN_DROP`, `INCREMENTAL_PRIORITY`, and `run_one(spec, schema)`. Both the CLI and the recompute endpoint import from here.                                      |
| `scripts/ingest.py`                                 | Thin CLI wrapper over `api.ingest_lib`.                                                                                                                                                                       |
| `api/metrics_compute.py`                            | Compute engine. `fetch_registry()`, `compute_one()` — execute `value_sql`, look up `source_as_of`, resolve status, UPSERT into `metric_history`.                                                              |
| `api/recompute.py`                                  | The on-demand endpoint. `POST /api/metrics/recompute`, `GET /api/metrics/recompute/{id}`, `GET /api/metrics/active-job`. Single-flight global job, two phases (ingest → compute), threaded background runner. |
| `api/cockpit.py`                                    | The metrics read endpoint. Reads registry + last `metric_history` row per metric, returns the row including `computed_at` / `source_as_of` for the freshness chip.                                            |
| `scripts/refresh_all.py`                            | Cron entrypoint. POSTs `{"metric_ids":"all","trigger":"cron"}` to the local recompute endpoint, polls until done.                                                                                             |
| `scripts/compute_metrics.py`                        | CLI for compute-only refresh (ad-hoc, no ingest).                                                                                                                                                             |
| `db/metrics/<metric_id>.sql`                        | One file per metric — comment header + INSERT + UPDATE for `code` + UPDATE for `how_to_read` / `methodology` / `sources`. **Config source of truth** for `metric_registry`.                                   |
| `db/metrics/_schema.sql` / `_pre.sql` / `_post.sql` | Shared scaffolding around the per-metric files — DDL, write guard, helpers, transaction frame, final NOT NULL.                                                                                                |
| `db/build_metrics.sh`                               | Concatenates `_schema.sql + _pre.sql + metrics/*.sql + auto-GC + _post.sql` into the assembled SQL stream for psql.                                                                                           |
| `api/audit/routes.py`                               | Generic audit endpoints. `GET /api/audit/cuts/{id}`, `GET /api/audit/pages/{page}`, `POST /api/audit/cuts/{id}/refresh`, `GET /api/audit/registry`, `PATCH /api/audit/cuts/{id}/verification`.                |
| `api/audit/cuts/<cut_id>.py`                        | One file per audit cut. Exports `META` + `compute(conn, *, window) -> dict`.                                                                                                                                  |
| `api/audit/cuts/__init__.py`                        | Discovers modules at import, exposes `CUTS`, `cuts_for_page`, `sync_registry(conn)`, `upsert_snapshot`.                                                                                                       |
| `api/scripts/refresh_audit.py`                      | Sync-from-disk + recompute every (cut, window).                                                                                                                                                               |
| `db/audit_schema.sql`                               | DDL for `audit_cut_registry` + `audit_cut_snapshot`. Applied by `bin/deploy.sh`.                                                                                                                              |
| `dashboard/app/page.tsx`                            | The cockpit screen.                                                                                                                                                                                           |
| `dashboard/app/_shell/AppShell.tsx`                 | Shared header for all 4 main pages — primary nav (`Cockpit \| Audit`), right-side icon links (`Docs`, `Feedback`), freshness chip, ThemeToggle.                                                               |
| `dashboard/app/audit/_shared/AuditPage.tsx`         | Parameterised audit page driven by `cut_id → React component` registry.                                                                                                                                       |
| `dashboard/app/audit/_shared/cuts/<Cut>.tsx`        | One renderer per cut. Registered in `dashboard/app/audit/_shared/cuts/index.ts`.                                                                                                                              |
| `dashboard/app/docs/**`                             | Fumadocs route — this site. `[[...slug]]/page.tsx` renders MDX. `/docs/api` is the Scalar OpenAPI viewer.                                                                                                     |
| `dashboard/content/docs/**`                         | The MDX content of `/docs`.                                                                                                                                                                                   |
| `dashboard/lib/hooks.ts`                            | React-Query hooks: `useCockpitMetrics`, `useActiveJob`, `useStartRecompute`, `useSetMetricVerification`, `useAuditPage`, `useRefreshAuditCut`, `useSetCutVerification`.                                       |
| `bin/deploy.sh`                                     | Manual VM deploy entrypoint.                                                                                                                                                                                  |
| `docker-compose.prod.yml`                           | VM stack: `marts-db`, `api`, `web`.                                                                                                                                                                           |
| `docker-compose.dev.yml`                            | Local dev: schema-exploration Postgres on port 5433.                                                                                                                                                          |
| `.claude/skills/metrics-discovery/SKILL.md`         | The discipline gate for adding metrics.                                                                                                                                                                       |
| `.claude/skills/metric-validate/SKILL.md`           | Deep audit of a single metric.                                                                                                                                                                                |
| `.claude/skills/heartbeat-today/SKILL.md`           | Daily triage of cockpit state.                                                                                                                                                                                |

## Data quality and performance [#data-quality-and-performance]

This section records the rules that were learned the hard way during
the 2026-05-08 refresh-pipeline hardening session. Apply them to
every new metric and every new source landing.

### Post-ingest indexes [#post-ingest-indexes]

dlt lands rows without metric-aware indexes — only `_dlt_id` unique
indexes exist by default. Every WHERE and JOIN column on a multi-GB
table will do a full seq scan until you add an index explicitly.

**The rule:** for any table over 200 MB referenced in a metric's
`value_sql`, verify in `pg_indexes` that the WHERE and JOIN columns
have an index. If not, add a `CREATE INDEX IF NOT EXISTS` entry to
`POST_INGEST_INDEXES` in `api/ingest_lib.py`. The dict is applied
idempotently after every `pipeline.run()`.
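
A sketch of an entry, assuming the dict maps an index name to its
DDL; check `api/ingest_lib.py` for the real key shape before copying:

```python
# Idempotent by construction: IF NOT EXISTS makes re-applying the dict
# after every pipeline.run() a no-op once the index exists.
POST_INGEST_INDEXES = {
    "webbank_core.payment_type_create_idx": (
        "CREATE INDEX IF NOT EXISTS payment_type_create_idx "
        "ON webbank_core.payment (type_id, create_time)"
    ),
}
```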

Current indexes (as of 2026-05-08):

| Schema         | Index                                                                   | Reason                                                   |
| -------------- | ----------------------------------------------------------------------- | -------------------------------------------------------- |
| `aml_aml`      | `aml_retrospective(create_at)`                                          | AML throughput — 2.7 GB table                            |
| `webbank_core` | `entry(entry_state_id, create_time)`                                    | Revenue, active-client, top-tier metrics                 |
| `webbank_core` | `payment(id)`                                                           | Join from `entry.payment_id` — no PK index in dlt tables |
| `webbank_core` | `payment(type_id, create_time)`                                         | Outgoing approval-rate metric                            |
| `webbank_core` | `client_profile(id)`                                                    | Join from `payment.sender_profile_id`                    |
| `webbank_core` | `payment_state_log(state_id, timestamp)`                                | Processing-time metrics                                  |
| `webbank_core` | `payment_state_log(payment_id)`                                         | Join from filtered payment set                           |
| `webbank_core` | `client_profile_state_log(client_profile_id, profile_state, timestamp)` | Approval-rate cohort join                                |

Before/after: slowest metrics dropped from 4–10s per query to under 0.4s.

### Replace-strategy tables — hidden ingest cost [#replace-strategy-tables--hidden-ingest-cost]

dlt falls back to `replace` (truncate + full reload) for any table
with no PK and no recognised incremental cursor column. On every
ingest tick, these tables are wiped and fully re-ingested from the
source — even if nothing changed.

**The problem:** replace tables burn ingest time proportional to row
count every hour. The cost is invisible in normal operation but
becomes the bottleneck when a replace table grows or a new metric
references one.

**What to do when landing a new source:**

1. After the first ingest, check `_dlt_loads` or run `EXPLAIN` on a
   metric query to see which tables went through `replace`.
2. For each replace table: does a PK or cursor column exist in the
   source that dlt didn't detect (different naming, composite key)?
   If yes, add an override to `INCREMENTAL_PRIORITY` in
   `api/ingest_lib.py` (sketched after this list).
3. If no cursor exists but the table is large and changes slowly,
   evaluate a daily schedule for that source (rather than hourly).
4. Document the outcome in [sources](/docs/sources): which tables are
   intentionally replace (small lookup tables — acceptable), which
   were fixed, which are blocked on a source-side schema change.
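
If `INCREMENTAL_PRIORITY` follows the shape its name suggests, an
override might look like the sketch below. Everything here is an
assumption; verify the real structure in `api/ingest_lib.py` before
copying.

```python
# Hypothetical: a per-table override naming the cursor column that dlt's
# auto-detection missed, e.g. a non-standard name like "modify_ts".
INCREMENTAL_PRIORITY = {
    "some_schema.some_table": "modify_ts",   # illustrative table and column
}
```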

### Metric SQL rules [#metric-sql-rules]

These apply before any `value_sql` or `history_sql` is committed.
The `metrics-discovery` skill enforces them as a mandatory gate.

**No correlated subqueries.** Any `(SELECT ... WHERE t.id = outer.id)`
pattern must be rewritten as a `JOIN` + conditional aggregation
(`MIN(CASE WHEN ...)` etc.) before merging. Correlated subqueries
execute once per outer row — on a 2M-row table that's 2M separate
lookups. The query will time out.
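
As an illustration, the same "first state-log timestamp per payment"
value written both ways; the tables are the ones indexed above, the
column semantics assumed:

```python
# Correlated form: the inner SELECT runs once per payment row.
correlated = """
SELECT p.id,
       (SELECT MIN(l.timestamp)
          FROM webbank_core.payment_state_log l
         WHERE l.payment_id = p.id) AS first_seen
FROM webbank_core.payment p
"""

# Rewritten: one join plus grouped aggregation, a single pass.
rewritten = """
SELECT p.id, MIN(l.timestamp) AS first_seen
FROM webbank_core.payment p
JOIN webbank_core.payment_state_log l ON l.payment_id = p.id
GROUP BY p.id
"""
```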

**Timing gate: value\_sql must complete in under 5s on live data.**
Over 5s is a structural problem — missing index or correlated
subquery — that must be fixed before the metric ships. Run the SQL
directly on marts-db before opening a PR.

**EXPLAIN check for slow queries.** If the query is slow, run
`EXPLAIN (ANALYZE, BUFFERS)`. Look for `Seq Scan` on large tables
(add an index) or `Nested Loop` over a large outer set (rewrite as
`Hash Join` by restructuring the query).

**Statement timeout.** `api/metrics_compute.py` sets
`SET LOCAL statement_timeout = '60s'` before every `value_sql`
execution (configurable via `METRIC_STATEMENT_TIMEOUT`). A metric
that hits the timeout has a structural problem — raise the timeout
only after you've confirmed the SQL is correct and the table simply
needs more time, never as a first resort.
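
In code, the guard looks roughly like this sketch (the real version is
in `api/metrics_compute.py`; cursor style and env handling here are
assumptions):

```python
import os

TIMEOUT = os.environ.get("METRIC_STATEMENT_TIMEOUT", "60s")

def run_value_sql(conn, value_sql: str):
    # SET LOCAL scopes the timeout to this transaction, so one slow
    # metric cannot change the session for the metrics after it.
    with conn:                                 # one transaction per metric
        with conn.cursor() as cur:
            cur.execute(f"SET LOCAL statement_timeout = '{TIMEOUT}'")  # env-trusted
            cur.execute(value_sql)
            return cur.fetchone()[0]
```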

### Alerting, deep-links, auto-generated docs content [#alerting-deep-links-auto-generated-docs-content]

All three are tracked in the [roadmap](/docs/roadmap) — (d) metric ↔
cut deep-links, (e) auto-generation of `/docs` content from the
registries, (f) alerts, (g) automated discovery, (h) watchlists.
None block landing new metrics today.

## Decision log [#decision-log]

Every settled architectural decision is recorded in
[decisions](/docs/decisions). When you hit a "why is it like this"
question, look there before re-litigating.
