Metric lifecycle
From "this should be on the cockpit" to a verified, owner-trusted tile — and eventual retirement.
A Heartbeat metric travels through ten stages. The first six are
mostly mechanical (the metrics-discovery skill drives them, the
deploy script wires the rest). The last four are business —
a real human owns the number, validates it against reality, and
either trusts it or asks for it to change.
1. discovery (metrics-discovery skill — JTBD + data check)
2. registration (db/metrics/<id>.sql committed and deployed)
3. backfill (compute_metrics + hourly cron fill metric_history)
4. benchmark (≈30 days of history → μ/σ via refresh_benchmarks)
5. provisional (norm_value / alert_value set from skill author's read)
6. unverified (tile on cockpit, dashed border, "unverified" pill)
────────────────────── business handover ──────────────────────
7. owner assigned (alert fires OR operator clicks tile → fills email)
8. owner review (today: owner chats with Ivan)
9. verified (owner agrees number is real; pill removed; eligible for alerts)
10. retirement (file deleted from db/metrics/ — GC drops the row)

Stages 1–5 are the build path. Stage 6 is the handover state — the metric is computing, the dashboard is showing it, but no one has yet stood behind the number. Stages 7–9 are the path from "computed" to "trusted". Stage 10 is the only sanctioned way to remove a metric.
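The ten stages can be thought of as a tiny state machine. A minimal Python sketch (the enum is illustrative; the dashboard's actual code is not shown in this document):

```python
from enum import Enum

class Stage(Enum):
    DISCOVERY = 1
    REGISTRATION = 2
    BACKFILL = 3
    BENCHMARK = 4
    PROVISIONAL = 5
    UNVERIFIED = 6        # handover state
    OWNER_ASSIGNED = 7
    OWNER_REVIEW = 8
    VERIFIED = 9
    RETIREMENT = 10

def is_mechanical(stage: Stage) -> bool:
    # Stages 1-6 are driven by the skill and the deploy script;
    # 7-10 need a real human behind the number.
    return stage.value <= 6
```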
1. Discovery
Driven by the metrics-discovery skill.
The skill enforces "money story first, SQL last":
- Inventory the live registry (avoid duplicates).
- Probe the live source / marts-db schema (does the data we'd build on actually exist and look populated?).
- Write the JTBD + business rationale (what red on this tile would mean, in losses-avoided or revenue-gained terms).
- Get alignment with Ivan / PM before any SQL is written.
- Draft `value_sql` against marts-db, run it, sanity-check the number.
The output of this stage is a candidate file under
docs/metric-candidates-YYYY-MM-DD.md plus a draft of the metric SQL
file. Nothing is in the registry yet.
2. Registration
A new file appears at db/metrics/<metric_id>.sql. It contains:
- a comment header (one-line purpose, methodology notes),
- an `INSERT` into `heartbeat.metric_registry` with `name`, `period`, `value_sql`, `description`, `sources`,
- an `UPDATE` setting the 4-char `code`,
- an `UPDATE` setting `how_to_read` / `methodology` / `sources`,
- provisional `norm_value` and `alert_value` from the skill author's best read (these will be tuned later — see stage 5).
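Putting those pieces together, a registration file might look like the sketch below. The column names come from the list above; the metric itself, the exact SQL clauses, and the conflict handling are invented for illustration:

```python
# Hypothetical contents of db/metrics/loan_approval_rate.sql, held here
# as a Python string so the shape can be inspected. Column names match
# the registry description above; everything else is an assumption.
REGISTRATION_SQL = """\
-- loan_approval_rate: share of applications approved in the period.
-- Methodology: approved / total over marts.applications, per day.
INSERT INTO heartbeat.metric_registry (name, period, value_sql, description, sources)
VALUES ('loan_approval_rate', 'day',
        $$SELECT round(100.0 * count(*) FILTER (WHERE approved) / count(*), 1)
          FROM marts.applications
          WHERE created_at::date = current_date$$,
        'Share of applications approved today, %',
        'marts.applications');

UPDATE heartbeat.metric_registry SET code = 'APPR'
 WHERE name = 'loan_approval_rate';

UPDATE heartbeat.metric_registry
   SET how_to_read = 'Red means the funnel is rejecting too much.',
       methodology = 'Approved / total, same-day window.',
       sources = 'marts.applications'
 WHERE name = 'loan_approval_rate';

-- Provisional thresholds (stage 5): the skill author's best read.
UPDATE heartbeat.metric_registry
   SET norm_value = 60, alert_value = 45
 WHERE name = 'loan_approval_rate';
"""
```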
Deploy lands the row in the registry:
```shell
ssh ubuntu@13.62.60.156 'cd ~/heartbeat-dashboard && ./bin/deploy.sh'
```

bin/deploy.sh runs db/build_metrics.sh → psql, then compute_metrics
(writes the first metric_history row), then refresh_benchmarks
(populates μ/σ if there's already enough history — usually not on day 1).
3. Backfill / history accumulation
The metric is now computed every hour by the cron job at :17
(scripts/refresh_all.py). Each tick UPSERTs a row into
heartbeat.metric_history with:
- `value` — today's number,
- `computed_at` — when compute ran,
- `source_as_of` — `MAX(_dlt_loads.inserted_at)` for the source schema (how fresh the underlying data is),
- `status` — server-side green/amber/red against current thresholds.
The cockpit reads only from metric_history; the tile starts rendering
within the hour of registration. Sparklines start filling in.
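The server-side `status` column is essentially a pure function of the value and the two thresholds. A sketch of that logic (the real refresh_all.py is not shown here, and the threshold direction is an assumption, so it's a parameter):

```python
def status(value: float, norm_value: float, alert_value: float,
           higher_is_worse: bool = True) -> str:
    """Green/amber/red against the current thresholds.

    The registry excerpt above doesn't say which direction is bad for a
    given metric, so this sketch takes it explicitly.
    """
    if higher_is_worse:
        if value >= alert_value:
            return "red"
        if value >= norm_value:
            return "amber"
        return "green"
    if value <= alert_value:
        return "red"
    if value <= norm_value:
        return "amber"
    return "green"
```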
4. Benchmark — typical band emerges
refresh_benchmarks.py runs daily at 04:23 and recomputes μ ± σ over
the rolling history window for every rolling_stat metric. After
roughly 30 days the typical band stabilises and the cockpit can show
"normal looks like X ± Y" alongside the raw value.
The benchmark is never used to derive thresholds — it's an informational band. See stage 5.
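The μ ± σ band itself is a simple rolling computation. A sketch, assuming a 30-day window and that short history simply yields no band (both assumptions; refresh_benchmarks.py is not shown in this document):

```python
from statistics import mean, stdev

def typical_band(history: list[float], window: int = 30):
    """mu +/- sigma over the rolling window, in the spirit of
    refresh_benchmarks. Informational only: never used for thresholds."""
    recent = history[-window:]
    if len(recent) < 2:
        return None  # not enough history yet; the band stays empty
    mu, sigma = mean(recent), stdev(recent)
    return (mu - sigma, mu + sigma)
```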
5. Provisional thresholds
norm_value (amber) and alert_value (red) at this point are
the skill author's best-read commitments. They are good enough to
make the tile colourable but they have not yet been agreed with anyone
who would be paged on the red.
Operators (today: Ivan) can tune these on the cockpit UI without a
redeploy — the inline editor PATCHes through
/api/metrics/{id}/verification (the same endpoint that handles owner
and verification state). The DB-level write guard
(heartbeat.metric_registry_guard) blocks every other write path.
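Since the inline editor is the only sanctioned write path, the PATCH body is effectively a small whitelist. A sketch of that guard as a client-side check (the endpoint's actual payload schema is an assumption; the field names are the ones this document mentions):

```python
# Fields the cockpit is described as tuning through
# /api/metrics/{id}/verification. Exact schema is assumed, not confirmed.
PATCHABLE = {"norm_value", "alert_value", "owner",
             "verification_state", "target_due_date"}

def validate_patch(body: dict) -> dict:
    """Reject any field this endpoint isn't described as handling."""
    unknown = set(body) - PATCHABLE
    if unknown:
        raise ValueError(f"not patchable via this endpoint: {sorted(unknown)}")
    return body
```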
Thresholds never auto-derive from data. A threshold that drifts with the number it's supposed to police hides the very signal it exists to surface. Always external: SLA, regulatory cap, owner commitment.
6. Unverified — the handover state
The tile is now live on the cockpit. Visually:
- a dashed tile border,
- an `"unverified"` pill near the metric's 4-char code chip,
- the number is read normally; status colour applies normally.
verification_state = 'unverified' is the default. Adding a
number to the cockpit doesn't mean someone has cross-checked it
against the source-of-truth count yet. The dashed border makes that
gap legible at a glance and explicitly invites the next step.
owner is NULL at this stage.
Unverified metrics are first-class citizens. Staying unverified is the normal state for a metric until something forces a closer look — typically the first alert firing. Until then the tile is doing real work: it's on the screen, the number is computing, the sparkline is filling in, and the operator can already feel whether the trend is right. Verification is not a quality gate the metric has to pass to be useful — it's a contract that gets signed when someone is about to be paged on it. We expect most tiles on the cockpit to live in `unverified` for weeks or months, and that is fine. The push to `verified` happens naturally the first time the number crosses red and an owner needs to act on it; only then does the cost of a wrong number actually bite, and only then is verification worth the owner's time.
7. Owner assigned
This stage is event-driven, not scheduled. The expected trigger is the first alert firing on the metric (post §8.4(f), see roadmap) — the number crosses red, somebody has to act on it, and that act of taking responsibility is what puts an email on the tile. Until that happens it's fine for the tile to sit unverified and ownerless; we don't chase owners for metrics nothing has gone wrong with yet.
The owner is an email validated against a loose RFC-5321 regex. Two
ways the field gets populated:
- Alert fires (the natural path). When a threshold is crossed and someone needs to take responsibility for the number, an owner email is filled in by whoever steps up.
- Manual UI assignment (the proactive path, used when we already know who'll be paged). An operator clicks the metric tile and fills the inline owner field. Auto-saves on blur or Enter when the email validates. No redeploy.
Either way: the email column is now populated, the tile remains unverified (assigning ownership is not the same as verifying the number — it's the start of stage 8).
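The "loose" email check can be as small as one pattern. A sketch in the same spirit (the dashboard's actual regex is not shown in this document, so this pattern is an assumption):

```python
import re

# Loose shape check: something@something.tld, no whitespace, single @.
# Deliberately permissive, as the text above describes.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def valid_owner(email: str) -> bool:
    return bool(EMAIL_RE.fullmatch(email))
```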
8. Owner review
The owner cross-checks the number against their own source of truth — typically a hand count out of webbank tables, an internal report they trust, or domain knowledge ("this can't be 22%").
Today this is informal: a chat request to Ivan. The owner pings Ivan, Ivan walks the SQL with them, they agree on whether the number matches reality and whether the thresholds make sense.
The same review can be driven mechanically by the
/metric-validate skill, which produces a
dated audit file under docs/metric-audits/<date>-<metric_id>.md
covering live source freshness, an independent cross-check rebuild
of the number, and a methodology check on what the red zone actually
means in business terms.
Outcomes of stage 8:
- Number agrees with reality, thresholds match commitment. Proceed to stage 9.
- Number is wrong (bad SQL, wrong source, wrong column). Owner files an issue / asks for the metric to be fixed. The fix is a new PR against `db/metrics/<id>.sql`. Lifecycle resets to stage 2 for this metric, but with the owner already in place.
- Thresholds are wrong. Owner tunes `norm_value` / `alert_value` inline on the cockpit. Number stays. Proceed to stage 9.
- Metric isn't worth keeping. Skip to stage 10 (retirement).
9. Verified
The owner toggles verification_state to verified on the cockpit
(PATCH to /api/metrics/{id}/verification). Visually:
- the dashed border becomes solid,
- the `"unverified"` pill is removed,
- the tile reads as a normal cockpit metric.
Once verified the metric is eligible for alert routing (post
§8.4(f)) — only verified metrics with a non-null owner will route
to a notification channel. Unverified metrics stay silent regardless
of how red they go; we do not page humans on numbers nobody has
stood behind.
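The routing rule above reduces to a two-clause predicate. A sketch, using a plain dict for the registry row (the real router, post §8.4(f), is not shown in this document):

```python
def routes_to_alerts(metric: dict) -> bool:
    """Only verified metrics with a non-null owner route to a channel.

    Unverified metrics stay silent however red they go: we don't page
    humans on numbers nobody has stood behind.
    """
    return (metric.get("verification_state") == "verified"
            and metric.get("owner") is not None)
```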
After verification, operators (owner included) can keep tuning
norm_value / alert_value / target_due_date / owner
indefinitely without redeploy. Verification state persists unless
explicitly flipped back — re-flipping to unverified is the right
action when an underlying source contract changes (a new dlt schema,
a renamed column) and the number needs a fresh cross-check.
10. Retirement
The metric is removed by deleting db/metrics/<metric_id>.sql and
running ./bin/deploy.sh. The build script's auto-generated GC
(DELETE … WHERE metric_id NOT IN (<files on disk>)) drops the row
from metric_registry. History rows in metric_history are kept
indefinitely as audit trail.
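The auto-generated GC statement can be sketched as a function of the files on disk. The `NOT IN` shape comes from the text above; quoting, sorting, and the refusal to emit a delete-everything statement are assumptions of this sketch, not build_metrics.sh's actual behaviour:

```python
def gc_sql(metric_files: list[str]) -> str:
    """Generate the registry GC: drop rows whose .sql file is gone."""
    ids = sorted(f.removesuffix(".sql")
                 for f in metric_files if f.endswith(".sql"))
    # An empty file list would make NOT IN () drop every metric;
    # refuse rather than generate a delete-everything statement.
    assert ids, "no metric files found; refusing to generate GC"
    in_list = ", ".join(f"'{i}'" for i in ids)
    return ("DELETE FROM heartbeat.metric_registry "
            f"WHERE metric_id NOT IN ({in_list});")
```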
If the metric is being replaced rather than just retired, the replacement enters at stage 1 (new id, new file) — never edited in place under the old id.
Tooling summary
| Stage | Driven by |
|---|---|
| 1 Discovery | metrics-discovery skill |
| 2 Registration | db/metrics/<id>.sql + bin/deploy.sh |
| 3 Backfill | scripts/refresh_all.py (cron :17) |
| 4 Benchmark | api.scripts.refresh_benchmarks (cron 04:23) |
| 5 Provisional thresholds | Skill author + operator UI |
| 6 Unverified surface | verification_state='unverified' (default) |
| 7 Owner assigned | Cockpit UI inline owner field |
| 8 Owner review | Chat with Ivan today; /metric-validate skill |
| 9 Verified | Cockpit UI verification toggle |
| 10 Retirement | rm db/metrics/<id>.sql + bin/deploy.sh |