# Metric lifecycle (/docs/metrics/lifecycle)

A Heartbeat metric travels through ten stages. The first six are
mostly mechanical (the `metrics-discovery` skill drives the first,
the deploy script and cron jobs wire up the rest). The last four are
**business** — a real human owns the number, validates it against
reality, and either trusts it or asks for it to change.

```
1. discovery       (metrics-discovery skill — JTBD + data check)
2. registration    (db/metrics/<id>.sql committed and deployed)
3. backfill        (compute_metrics + hourly cron fill metric_history)
4. benchmark       (≈30 days of history → μ/σ via refresh_benchmarks)
5. provisional     (norm_value / alert_value set from skill author's read)
6. unverified      (tile on cockpit, dashed border, "unverified" pill)
   ──────────────────────  business handover  ──────────────────────
7. owner assigned  (alert fires OR operator clicks tile → fills email)
8. owner review    (today: owner chats with Ivan)
9. verified        (owner agrees number is real; pill removed; eligible for alerts)
10. retirement     (file deleted from db/metrics/ — GC drops the row)
```

Stages 1–5 are the build path. Stage 6 is the **handover state** —
the metric is computing, the dashboard is showing it, but no one has
yet stood behind the number. Stages 7–9 are the path from "computed"
to "trusted". Stage 10 is the only sanctioned way to remove a metric.

## 1. Discovery [#1-discovery]

Driven by the [`metrics-discovery`](/docs/metrics/discovery) skill.

The skill enforces "money story first, SQL last":

1. Inventory the live registry (avoid duplicates).
2. Probe the live source / marts-db schema (does the data we'd build on
   actually exist and look populated?).
3. Write the JTBD + business rationale (what red on this tile would
   *mean*, in losses-avoided or revenue-gained terms).
4. Get alignment with Ivan / PM **before** any SQL is written.
5. Draft `value_sql` against marts-db, run it, sanity-check the number.

The output of this stage is a candidate write-up at
`docs/metric-candidates-YYYY-MM-DD.md` plus a draft of the metric SQL
file. Nothing is in the registry yet.
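The data probe in step 2 can be sketched as a function over a DB-API cursor. This is a sketch, not the skill's actual code — the schema/table names and the `min_rows` bar are illustrative:

```python
def probe_source(cur, schema: str, table: str, min_rows: int = 1) -> bool:
    """Return True if the candidate source table exists and looks populated.

    `cur` is any DB-API cursor into marts-db; the queries below are
    standard information_schema + count checks, nothing Heartbeat-specific.
    """
    cur.execute(
        "SELECT count(*) FROM information_schema.tables "
        "WHERE table_schema = %s AND table_name = %s",
        (schema, table),
    )
    if cur.fetchone()[0] == 0:
        return False  # table missing entirely — stop before writing any SQL
    cur.execute(f'SELECT count(*) FROM "{schema}"."{table}"')
    return cur.fetchone()[0] >= min_rows  # exists but empty is also a fail
```

A metric candidate that fails this probe goes back to step 2 of the skill, not forward to SQL drafting.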

## 2. Registration [#2-registration]

A new file appears at `db/metrics/<metric_id>.sql`. It contains:

* a comment header (one-line purpose, methodology notes),
* an `INSERT` into `heartbeat.metric_registry` with `name`, `period`,
  `value_sql`, `description`, `sources`,
* an `UPDATE` setting the 4-char `code`,
* an `UPDATE` setting `how_to_read` / `methodology` / `sources`,
* **provisional** `norm_value` and `alert_value` from the skill author's
  best read (these *will* be tuned later — see stage 5).
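Put together, a registration file might look roughly like this — the metric id, `value_sql`, and threshold values are all invented for illustration; only the column list mirrors the bullets above:

```sql
-- card_util: average share of credit-card limit drawn down (illustrative)
-- Methodology: drawn / limit over active cards, computed against marts-db.
INSERT INTO heartbeat.metric_registry (name, period, value_sql, description, sources)
VALUES ('Card utilisation', 'daily',
        'SELECT avg(drawn / nullif(credit_limit, 0)) FROM marts.cards WHERE is_active',
        'Average share of card limit in use.',
        'marts.cards');

UPDATE heartbeat.metric_registry SET code = 'CUTL'
WHERE name = 'Card utilisation';

UPDATE heartbeat.metric_registry
SET how_to_read = 'Higher means customers sit closer to their limits.',
    norm_value  = 0.6,   -- provisional amber, skill author's read
    alert_value = 0.8    -- provisional red, to be tuned in stage 5
WHERE name = 'Card utilisation';
```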

Deploy lands the row in the registry:

```bash
ssh ubuntu@13.62.60.156 'cd ~/heartbeat-dashboard && ./bin/deploy.sh'
```

`bin/deploy.sh` runs `db/build_metrics.sh` → psql, then `compute_metrics`
(writes the first `metric_history` row), then `refresh_benchmarks`
(populates μ/σ if there's already enough history — usually not on day 1).

## 3. Backfill / history accumulation [#3-backfill--history-accumulation]

The metric is now computed every hour by the cron job at `:17`
(`scripts/refresh_all.py`). Each tick UPSERTs a row into
`heartbeat.metric_history` with:

* `value` — today's number,
* `computed_at` — when compute ran,
* `source_as_of` — `MAX(_dlt_loads.inserted_at)` for the source schema
  (how fresh the underlying data is),
* `status` — server-side green/amber/red against current thresholds.

The cockpit reads only from `metric_history`; the tile starts rendering
within the hour of registration. Sparklines start filling in.
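The `status` column can be sketched as a pure threshold check. This assumes a higher-is-worse metric; `compute_metrics`' actual logic (direction handling, null values) is authoritative:

```python
def status(value: float, norm_value: float, alert_value: float) -> str:
    """Green/amber/red for a higher-is-worse metric.

    norm_value marks the amber line, alert_value the red line —
    mirroring the registry columns described above.
    """
    if value >= alert_value:
        return "red"
    if value >= norm_value:
        return "amber"
    return "green"
```

Because the status is computed server-side at write time, `metric_history` records what the tile *showed* each hour, even if thresholds are tuned later.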

## 4. Benchmark — typical band emerges [#4-benchmark--typical-band-emerges]

`refresh_benchmarks.py` runs daily at `04:23` and recomputes μ ± σ over
the rolling history window for every `rolling_stat` metric. After
roughly 30 days the typical band stabilises and the cockpit can show
"normal looks like X ± Y" alongside the raw value.
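The recompute itself is a small rolling-window calculation. A minimal sketch — the window length is from the text above, but whether the real script uses population or sample σ is an assumption:

```python
from statistics import mean, pstdev

def benchmark(history, window: int = 30):
    """Return (mu, sigma) over the most recent `window` values.

    Returns None until enough history has accumulated — matching the
    "usually not on day 1" behaviour of refresh_benchmarks.
    Uses population sigma (pstdev); the real script may use sample sigma.
    """
    if len(history) < window:
        return None
    tail = history[-window:]
    return mean(tail), pstdev(tail)
```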

The benchmark is **never** used to derive thresholds — it's an
informational band. See stage 5.

## 5. Provisional thresholds [#5-provisional-thresholds]

`norm_value` (amber) and `alert_value` (red) at this point are
the *skill author's* best-read commitments. They are good enough to
make the tile colourable but they have not yet been agreed with anyone
who would be paged on the red.

Operators (today: Ivan) can tune these on the cockpit UI without a
redeploy — the inline editor PATCHes through
`/api/metrics/{id}/verification` (the same endpoint that handles owner
and verification state). The DB-level write guard
(`heartbeat.metric_registry_guard`) blocks every other write path.
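The shape of that PATCH can be sketched as below. The payload field names are assumptions — the authoritative contract is the `/api/metrics/{id}/verification` endpoint itself:

```python
def build_threshold_patch(base_url: str, metric_id: str,
                          norm_value: float, alert_value: float):
    """Return the (url, json_payload) pair the inline editor would send.

    Hypothetical payload shape; the real endpoint may expect different
    field names or additional verification-state fields.
    """
    url = f"{base_url}/api/metrics/{metric_id}/verification"
    payload = {"norm_value": norm_value, "alert_value": alert_value}
    return url, payload
```

Separating the request-building from the HTTP call keeps the threshold semantics testable without a running cockpit.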

> **Thresholds never auto-derive from data.** A threshold that drifts
> with the number it's supposed to police hides the very signal it
> exists to surface. Always external: SLA, regulatory cap, owner
> commitment.

## 6. Unverified — the handover state [#6-unverified--the-handover-state]

The tile is now live on the cockpit. Visually:

* a dashed tile border,
* an `"unverified"` pill near the metric's 4-char code chip,
* the number renders normally; status colour applies normally.

`verification_state = 'unverified'` is the **default**. Adding a
number to the cockpit doesn't mean someone has cross-checked it
against the source-of-truth count yet. The dashed border makes that
gap legible at a glance and explicitly invites the next step.

`owner` is `NULL` at this stage.

> **Unverified metrics are first-class citizens.** Staying unverified
> is the *normal* state for a metric until something forces a closer
> look — typically the first alert firing. Until then the tile is
> doing real work: it's on the screen, the number is computing, the
> sparkline is filling in, and the operator can already feel whether
> the trend is right. Verification is not a quality gate the metric
> has to pass to be useful — it's a contract that gets signed when
> someone is about to be paged on it. We expect most tiles on the
> cockpit to live in `unverified` for weeks or months, and that is
> fine. The push to `verified` happens *naturally* the first time
> the number crosses red and an owner needs to act on it; only then
> does the cost of a wrong number actually bite, and only then is
> verification worth the owner's time.

## 7. Owner assigned [#7-owner-assigned]

This stage is **event-driven, not scheduled**. The expected trigger is
the first alert firing on the metric (post §8.4(f), see
[roadmap](/docs/roadmap)) — the number crosses red, somebody has to
act on it, and that act of taking responsibility is what puts an
email on the tile. Until that happens it's fine for the tile to sit
unverified and ownerless; we don't chase owners for metrics nothing
has gone wrong with yet.

The `owner` is an email validated against a loose RFC-5321 regex. Two
ways the field gets populated:

* **Alert fires** (the natural path). When a threshold is crossed and
  someone needs to take responsibility for the number, an owner email
  is filled in by whoever steps up.
* **Manual UI assignment** (the proactive path, used when we already
  know who'll be paged). An operator clicks the metric tile and fills
  the inline owner field. Auto-saves on blur or Enter when the email
  validates. No redeploy.

Either way: the email column is now populated, the tile remains
**unverified** (assigning ownership is not the same as verifying the
number — it's the start of stage 8).
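One possible shape of that loose validation — permissive on the local part, requiring only a dotted domain. The production regex may well differ:

```python
import re

# Illustrative "loose" email pattern: no whitespace, exactly one @,
# at least one dot in the domain. Deliberately far short of full RFC 5321.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def valid_owner(email: str) -> bool:
    return bool(EMAIL_RE.fullmatch(email.strip()))
```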

## 8. Owner review [#8-owner-review]

The owner cross-checks the number against their own source of
truth — typically a hand count out of webbank tables, an internal
report they trust, or domain knowledge ("this can't be 22%").

**Today this is informal: a chat request to Ivan.** The owner pings
Ivan, Ivan walks the SQL with them, they agree on whether the number
matches reality and whether the thresholds make sense.

The same review can be driven mechanically by the
[`/metric-validate`](/docs/metrics/validate) skill, which produces a
dated audit file under `docs/metric-audits/<date>-<metric_id>.md`
covering live source freshness, an independent cross-check rebuild
of the number, and a methodology check on what the red zone actually
means in business terms.

Outcomes of stage 8:

* **Number agrees with reality, thresholds match commitment.** Proceed
  to stage 9.
* **Number is wrong** (bad SQL, wrong source, wrong column). Owner
  files an issue / asks for the metric to be fixed. The fix is a new
  PR against `db/metrics/<id>.sql`. Lifecycle resets to stage 2 for
  this metric, but with the owner already in place.
* **Thresholds are wrong.** Owner tunes `norm_value` / `alert_value`
  inline on the cockpit. Number stays. Proceed to stage 9.
* **Metric isn't worth keeping.** Skip to stage 10 (retirement).

## 9. Verified [#9-verified]

The owner toggles `verification_state` to `verified` on the cockpit
(PATCH to `/api/metrics/{id}/verification`). Visually:

* the dashed border becomes solid,
* the `"unverified"` pill is removed,
* the tile reads as a normal cockpit metric.

Once verified the metric is **eligible for alert routing** (post
§8.4(f)) — only verified metrics with a non-null `owner` will route
to a notification channel. Unverified metrics stay silent regardless
of how red they go; we do not page humans on numbers nobody has
stood behind.
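The routing gate can be stated as a tiny predicate — field names are assumed to mirror the registry columns; this is a sketch, not the alerting code:

```python
def alert_eligible(verification_state, owner, status) -> bool:
    """Route to a notification channel only when all three hold:
    the number is red, someone has verified it, and an owner exists."""
    return (status == "red"
            and verification_state == "verified"
            and owner is not None)
```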

After verification, operators (owner included) can keep tuning
`norm_value` / `alert_value` / `target_due_date` / `owner`
indefinitely without redeploy. Verification state persists unless
explicitly flipped back — re-flipping to `unverified` is the right
action when an underlying source contract changes (a new dlt schema,
a renamed column) and the number needs a fresh cross-check.

## 10. Retirement [#10-retirement]

The metric is removed by deleting `db/metrics/<metric_id>.sql` and
running `./bin/deploy.sh`. The build script's auto-generated GC
(`DELETE … WHERE metric_id NOT IN (<files on disk>)`) drops the row
from `metric_registry`. History rows in `metric_history` are kept
indefinitely as audit trail.
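The GC step can be sketched as a pure function from the files on disk to the delete statement. The real build script's generated SQL and quoting may differ; this sketch assumes at least one metric file remains:

```python
def gc_statement(files_on_disk) -> str:
    """Build the registry GC DELETE from the db/metrics/ directory listing.

    Metric ids are derived from the .sql filenames; anything not ending
    in .sql is ignored. Assumes the list is non-empty (an empty NOT IN ()
    would be invalid SQL).
    """
    ids = [f.removesuffix(".sql") for f in files_on_disk if f.endswith(".sql")]
    quoted = ", ".join(f"'{i}'" for i in ids)
    return ("DELETE FROM heartbeat.metric_registry "
            f"WHERE metric_id NOT IN ({quoted});")
```

Note that only `metric_registry` is touched — `metric_history` rows survive, which is what makes retirement auditable.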

If the metric is being replaced rather than just retired, the
replacement enters at stage 1 (new id, new file) — never edited in
place under the old id.

## Tooling summary [#tooling-summary]

| Stage                    | Driven by                                                                |
| ------------------------ | ------------------------------------------------------------------------ |
| 1 Discovery              | [`metrics-discovery`](/docs/metrics/discovery) skill                     |
| 2 Registration           | `db/metrics/<id>.sql` + `bin/deploy.sh`                                  |
| 3 Backfill               | `scripts/refresh_all.py` (cron `:17`)                                    |
| 4 Benchmark              | `api.scripts.refresh_benchmarks` (cron `04:23`)                          |
| 5 Provisional thresholds | Skill author + operator UI                                               |
| 6 Unverified surface     | `verification_state='unverified'` (default)                              |
| 7 Owner assigned         | First alert firing; cockpit UI inline owner field                        |
| 8 Owner review           | Chat with Ivan today; [`/metric-validate`](/docs/metrics/validate) skill |
| 9 Verified               | Cockpit UI verification toggle                                           |
| 10 Retirement            | `rm db/metrics/<id>.sql` + `bin/deploy.sh`                               |
