Metric lifecycle

From "this should be on the cockpit" to a verified, owner-trusted tile — and eventual retirement.

A Heartbeat metric travels through ten stages. The first six are mostly mechanical (the metrics-discovery skill drives them, the deploy script wires the rest). The last four are business — a real human owns the number, validates it against reality, and either trusts it or asks for it to change.

1. discovery       (metrics-discovery skill — JTBD + data check)
2. registration    (db/metrics/<id>.sql committed and deployed)
3. backfill        (compute_metrics + hourly cron fill metric_history)
4. benchmark       (≈30 days of history → μ/σ via refresh_benchmarks)
5. provisional     (norm_value / alert_value set from skill author's read)
6. unverified      (tile on cockpit, dashed border, "unverified" pill)
   ──────────────────────  business handover  ──────────────────────
7. owner assigned  (alert fires OR operator clicks tile → fills email)
8. owner review    (today: owner chats with Ivan)
9. verified        (owner agrees number is real; pill removed; eligible for alerts)
10. retirement     (file deleted from db/metrics/ — GC drops the row)

Stages 1–5 are the build path. Stage 6 is the handover state — the metric is computing, the dashboard is showing it, but no one has yet stood behind the number. Stages 7–9 are the path from "computed" to "trusted". Stage 10 is the only sanctioned way to remove a metric.

1. Discovery

Driven by the metrics-discovery skill.

The skill enforces "money story first, SQL last":

  1. Inventory the live registry (avoid duplicates).
  2. Probe the live source / marts-db schema (does the data we'd build on actually exist and look populated?).
  3. Write the JTBD + business rationale (what red on this tile would mean, in losses-avoided or revenue-gained terms).
  4. Get alignment with Ivan / PM before any SQL is written.
  5. Draft value_sql against marts-db, run it, sanity-check the number.

The output of this stage is a candidate file under docs/metric-candidates-YYYY-MM-DD.md plus a draft of the metric SQL file. Nothing is in the registry yet.

2. Registration

A new file appears at db/metrics/<metric_id>.sql. It contains:

  • a comment header (one-line purpose, methodology notes),
  • an INSERT into heartbeat.metric_registry with name, period, value_sql, description, sources,
  • an UPDATE setting the 4-char code,
  • an UPDATE setting how_to_read / methodology / sources,
  • provisional norm_value and alert_value from the skill author's best read (these will be tuned later — see stage 5).

Deploy lands the row in the registry:

ssh ubuntu@13.62.60.156 'cd ~/heartbeat-dashboard && ./bin/deploy.sh'

bin/deploy.sh runs db/build_metrics.sh → psql, then compute_metrics (writes the first metric_history row), then refresh_benchmarks (populates μ/σ if there's already enough history — usually not on day 1).

3. Backfill / history accumulation

The metric is now computed every hour by the cron job at :17 (scripts/refresh_all.py). Each tick UPSERTs a row into heartbeat.metric_history with:

  • value — today's number,
  • computed_at — when compute ran,
  • source_as_of — MAX(_dlt_loads.inserted_at) for the source schema (how fresh the underlying data is),
  • status — server-side green/amber/red against current thresholds.
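
The tick's write can be sketched as an UPSERT plus a row builder. This is a minimal illustration, not the real refresh_all.py: the conflict target (metric_id, as_of_date) and the helper name history_row are assumptions about the schema, chosen to match the bullet list above.

```python
from datetime import datetime, timezone

# Hypothetical sketch of one hourly tick's write. Column names follow the
# bullets above; the (metric_id, as_of_date) conflict target is an assumption.
UPSERT_SQL = """
INSERT INTO heartbeat.metric_history
    (metric_id, as_of_date, value, computed_at, source_as_of, status)
VALUES (%s, %s, %s, %s, %s, %s)
ON CONFLICT (metric_id, as_of_date) DO UPDATE SET
    value        = EXCLUDED.value,
    computed_at  = EXCLUDED.computed_at,
    source_as_of = EXCLUDED.source_as_of,
    status       = EXCLUDED.status;
"""

def history_row(metric_id, value, source_as_of, status, now=None):
    """Build the parameter tuple for one UPSERT tick."""
    now = now or datetime.now(timezone.utc)
    return (metric_id, now.date(), value, now, source_as_of, status)
```

Because the statement is an UPSERT, re-running a tick for the same day overwrites that day's row instead of duplicating it.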

The cockpit reads only from metric_history; the tile starts rendering within an hour of registration. Sparklines start filling in.

4. Benchmark — typical band emerges

refresh_benchmarks.py runs daily at 04:23 and recomputes μ ± σ over the rolling history window for every rolling_stat metric. After roughly 30 days the typical band stabilises and the cockpit can show "normal looks like X ± Y" alongside the raw value.

The benchmark is never used to derive thresholds — it's an informational band. See stage 5.
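
The band itself is just a rolling mean and standard deviation. A minimal sketch, assuming refresh_benchmarks computes plain μ/σ over the most recent ~30 daily values (the function name and the minimum-history guard are illustrative, not the script's real logic):

```python
from statistics import mean, pstdev

def typical_band(history, window=30):
    """mu ± sigma over the most recent `window` daily values.
    Returns (mu, sigma), or None while history is too thin to be meaningful."""
    recent = history[-window:]
    if len(recent) < 2:  # sigma is undefined on a single point
        return None
    return mean(recent), pstdev(recent)
```

Note the band only describes what "normal" has looked like; per the paragraph above, it never feeds back into norm_value or alert_value.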

5. Provisional thresholds

norm_value (amber) and alert_value (red) at this point are the skill author's best-read commitments. They are good enough to make the tile colourable but they have not yet been agreed with anyone who would be paged on the red.

Operators (today: Ivan) can tune these on the cockpit UI without a redeploy — the inline editor PATCHes through /api/metrics/{id}/verification (the same endpoint that handles owner and verification state). The DB-level write guard (heartbeat.metric_registry_guard) blocks every other write path.

Thresholds never auto-derive from data. A threshold that drifts with the number it's supposed to police hides the very signal it exists to surface. Always external: SLA, regulatory cap, owner commitment.
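
Server-side colouring against those fixed thresholds reduces to a small pure function. A sketch under one loud assumption — it treats higher as worse; a real registry would carry a direction flag per metric:

```python
def tile_status(value, norm_value, alert_value):
    """green/amber/red against fixed, externally set thresholds.
    Assumes higher-is-worse; invert the comparisons for lower-is-worse metrics."""
    if value >= alert_value:
        return "red"
    if value >= norm_value:
        return "amber"
    return "green"
```

The point of the sketch: the thresholds are constants handed in from outside (SLA, regulatory cap, owner commitment), never derived from the value series itself.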

6. Unverified — the handover state

The tile is now live on the cockpit. Visually:

  • a dashed tile border,
  • an "unverified" pill near the metric's 4-char code chip,
  • the number renders normally; status colour applies as usual.

verification_state = 'unverified' is the default. Adding a number to the cockpit doesn't mean someone has cross-checked it against the source-of-truth count yet. The dashed border makes that gap legible at a glance and explicitly invites the next step.

owner is NULL at this stage.

Unverified metrics are first-class citizens. Staying unverified is the normal state for a metric until something forces a closer look — typically the first alert firing. Until then the tile is doing real work: it's on the screen, the number is computing, the sparkline is filling in, and the operator can already feel whether the trend is right. Verification is not a quality gate the metric has to pass to be useful — it's a contract that gets signed when someone is about to be paged on it. We expect most tiles on the cockpit to live in unverified for weeks or months, and that is fine. The push to verified happens naturally the first time the number crosses red and an owner needs to act on it; only then does the cost of a wrong number actually bite, and only then is verification worth the owner's time.

7. Owner assigned

This stage is event-driven, not scheduled. The expected trigger is the first alert firing on the metric (post §8.4(f), see roadmap) — the number crosses red, somebody has to act on it, and that act of taking responsibility is what puts an email on the tile. Until that happens it's fine for the tile to sit unverified and ownerless; we don't chase owners for metrics nothing has gone wrong with yet.

The owner is an email address validated against a loose RFC-5321 regex. The field gets populated in one of two ways:

  • Alert fires (the natural path). When a threshold is crossed and someone needs to take responsibility for the number, an owner email is filled in by whoever steps up.
  • Manual UI assignment (the proactive path, used when we already know who'll be paged). An operator clicks the metric tile and fills the inline owner field. Auto-saves on blur or Enter when the email validates. No redeploy.

Either way: the email column is now populated, the tile remains unverified (assigning ownership is not the same as verifying the number — it's the start of stage 8).
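
The "loose RFC-5321" check is deliberately permissive — it gates typos, not deliverability. A sketch in that spirit (the exact regex in the cockpit code may differ):

```python
import re

# Loose shape check: something@something.tld, no whitespace, exactly one "@".
# Illustrative, not the cockpit's actual pattern.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def valid_owner(email):
    """True if the string is plausibly an email address."""
    return bool(EMAIL_RE.match(email.strip()))
```

Full RFC-5321 validation is famously gnarly; for an inline field that auto-saves on blur, rejecting obvious non-addresses is enough.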

8. Owner review

The owner cross-checks the number against their own source of truth — typically a hand count out of webbank tables, an internal report they trust, or domain knowledge ("this can't be 22%").

Today this is informal: a chat request to Ivan. The owner pings Ivan, Ivan walks the SQL with them, they agree on whether the number matches reality and whether the thresholds make sense.

The same review can be driven mechanically by the /metric-validate skill, which produces a dated audit file under docs/metric-audits/<date>-<metric_id>.md covering live source freshness, an independent cross-check rebuild of the number, and a methodology check on what the red zone actually means in business terms.
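
The core of that cross-check rebuild is a tolerance comparison between the tile's number and an independently computed one. A minimal sketch — the 1% relative tolerance is illustrative; a real audit would state its own:

```python
def cross_check(dashboard_value, independent_value, rel_tol=0.01):
    """Does an independently rebuilt number agree with the tile?
    Compares relative error against a stated tolerance."""
    if independent_value == 0:
        return dashboard_value == 0
    return abs(dashboard_value - independent_value) / abs(independent_value) <= rel_tol
```

A failed cross-check is exactly the "number is wrong" outcome below: the fix goes back through a PR against the metric's SQL file, not an inline edit.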

Outcomes of stage 8:

  • Number agrees with reality, thresholds match commitment. Proceed to stage 9.
  • Number is wrong (bad SQL, wrong source, wrong column). Owner files an issue / asks for the metric to be fixed. The fix is a new PR against db/metrics/<id>.sql. Lifecycle resets to stage 2 for this metric, but with the owner already in place.
  • Thresholds are wrong. Owner tunes norm_value / alert_value inline on the cockpit. Number stays. Proceed to stage 9.
  • Metric isn't worth keeping. Skip to stage 10 (retirement).

9. Verified

The owner toggles verification_state to verified on the cockpit (PATCH to /api/metrics/{id}/verification). Visually:

  • the dashed border becomes solid,
  • the "unverified" pill is removed,
  • the tile reads as a normal cockpit metric.

Once verified the metric is eligible for alert routing (post §8.4(f)) — only verified metrics with a non-null owner will route to a notification channel. Unverified metrics stay silent regardless of how red they go; we do not page humans on numbers nobody has stood behind.
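
That routing rule is a two-condition predicate. A sketch of the intended gate (the dict shape is illustrative; the real registry row has more columns):

```python
def routable(metric):
    """Only verified metrics with a non-null owner may page a human.
    Redness alone is never sufficient to route an alert."""
    return (metric.get("verification_state") == "verified"
            and bool(metric.get("owner")))
```

Everything else — unverified, or verified but ownerless — stays on the cockpit silently, however red it goes.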

After verification, operators (owner included) can keep tuning norm_value / alert_value / target_due_date / owner indefinitely without redeploy. Verification state persists unless explicitly flipped back — re-flipping to unverified is the right action when an underlying source contract changes (a new dlt schema, a renamed column) and the number needs a fresh cross-check.

10. Retirement

The metric is removed by deleting db/metrics/<metric_id>.sql and running ./bin/deploy.sh. The build script's auto-generated GC (DELETE … WHERE metric_id NOT IN (<files on disk>)) drops the row from metric_registry. History rows in metric_history are kept indefinitely as audit trail.
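
The GC's set logic can be sketched in a few lines — this mirrors the shape of the auto-generated DELETE, not the build script's actual implementation:

```python
def gc_ids(registry_ids, files_on_disk):
    """Registry rows whose backing db/metrics/<id>.sql file is gone.
    These are the ids the auto-generated DELETE would drop."""
    on_disk = {f.removesuffix(".sql") for f in files_on_disk}
    return sorted(set(registry_ids) - on_disk)
```

Deleting the file is therefore the whole retirement interface: deploy recomputes this set difference and the registry follows the filesystem.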

If the metric is being replaced rather than just retired, the replacement enters at stage 1 (new id, new file) — never edited in place under the old id.

Tooling summary

Stage                       Driven by
 1  Discovery               metrics-discovery skill
 2  Registration            db/metrics/<id>.sql + bin/deploy.sh
 3  Backfill                scripts/refresh_all.py (cron :17)
 4  Benchmark               api.scripts.refresh_benchmarks (cron 04:23)
 5  Provisional thresholds  Skill author + operator UI
 6  Unverified surface      verification_state='unverified' (default)
 7  Owner assigned          Cockpit UI inline owner field
 8  Owner review            Chat with Ivan today; /metric-validate skill
 9  Verified                Cockpit UI verification toggle
10  Retirement              rm db/metrics/<id>.sql + bin/deploy.sh