# Discovery (/docs/metrics/discovery)
New metrics enter the registry through the
[`metrics-discovery`](#) skill, not via ad-hoc SQL. The skill is the
**discipline gate** between "this would be a useful number" and a
tile on the cockpit. It enforces a single ordering rule:

> **Money story first, SQL last.** A metric earns its place by being
> tied to revenue gained or losses avoided, not by being technically
> possible.

This page describes what the skill does, what its outputs look like,
and where ad-hoc shortcuts go wrong. The skill itself lives at
`.claude/skills/metrics-discovery/SKILL.md`; invoke it with
`/metrics-discovery` or any phrasing the skill description matches
("новая метрика" / "new metric", "что бы померить" / "what to
measure", "audit the cockpit").

## What the skill does [#what-the-skill-does]

A discovery session walks through five steps, in order. Skipping any
of them is what produces vanity counters.

### 1. Inventory the live registry [#1-inventory-the-live-registry]

Read `heartbeat.metric_registry` directly — what's already there,
what's `unverified`, what's been retired in the recent commit history.
The goal is to **avoid duplicates** before any new SQL is written.
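As a sketch, the inventory can start from a single query (the
`metric_id`, `name`, and `period` columns appear elsewhere on this
page; a `status` column holding values like `unverified` is an
assumption about the registry shape):

```sql
-- Hypothetical sketch: see what's already registered before drafting
-- anything new. Column names beyond metric_id/name/period/value_sql
-- are assumptions inferred from this page.
SELECT metric_id, name, period, status
FROM heartbeat.metric_registry
ORDER BY metric_id;
```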

### 2. Probe the live source [#2-probe-the-live-source]

For every candidate metric, the skill checks the dlt-landed schemas
on `marts-db` for the columns it would need:

* Is the relevant table populated?
* Does `<schema>._dlt_loads` show recent successful loads?
* Does the column the metric depends on actually exist with the
  semantics the candidate assumes?

If the data isn't landed — even if it's promised in
[pending sources](/docs/sources#pending) — the candidate is parked
there, not built off a wrong-but-landed surrogate. The
[webbank-first rule](/docs/sources#source-priority-—-webbank-first)
applies in full: fallback requires an explicit `notes` entry on the
metric.
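The probe itself is a handful of queries against `marts-db`. A sketch,
with `webbank.card_orders` as a placeholder schema/table (the
`_dlt_loads` column names follow dlt's convention but should be
verified against the deployed version):

```sql
-- 1. Is the table populated at all?
SELECT count(*) FROM webbank.card_orders;

-- 2. Does _dlt_loads show recent successful loads?
SELECT load_id, status, inserted_at
FROM webbank._dlt_loads
ORDER BY inserted_at DESC
LIMIT 5;

-- 3. Does the column the metric depends on actually exist,
--    with a plausible type?
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'webbank' AND table_name = 'card_orders';
```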

### 3. Write the JTBD + business rationale [#3-write-the-jtbd--business-rationale]

For each candidate, write the *money story* explicitly:

* **Job to be done** — the question this metric answers (e.g. "is
  card-order approval rate falling below the rate at which we sustain
  client onboarding?").
* **Red zone meaning** — what loss in EUR or what unit of operational
  pain a red status corresponds to.
* **Action** — what the owner does when this goes red. If the answer
  is "nothing" or "log a ticket", the metric does not belong on the
  cockpit.

This step is what separates "we could measure this" from "we should
measure this". Without it, the cockpit fills up with vanity counters
that no one acts on.

### 4. Get alignment **before** the SQL [#4-get-alignment-before-the-sql]

The skill writes a dated candidates file under
`docs/metric-candidates-<YYYY-MM-DD>.md` with the JTBD, source
inventory, and proposed thresholds — and **stops**. Approval from
Ivan / PM is required before any compute SQL is drafted.

This is the single most often-skipped step, and it is the reason the
artefact file exists: a candidate that nobody approved still leaves a
dated record sitting next to the candidates that did get approved, so
the difference stays visible.
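One entry in the candidates file might look like this (the metric,
numbers, and source are illustrative, not a real candidate):

```markdown
## Candidate: card_order_approval_rate

- **JTBD:** is card-order approval rate falling below the rate at
  which we sustain client onboarding?
- **Red zone meaning:** each point below the norm ≈ N lost
  activations per week (N to be quantified with the team).
- **Action on red:** owner escalates to the onboarding team same day.
- **Source inventory:** webbank.card_orders — landed, loads green.
- **Proposed thresholds:** norm 90, alert 85, target 93.
- **Status:** awaiting approval (Ivan / PM).
```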

### 5. Draft `value_sql`, run it, sanity-check the number [#5-draft-value_sql-run-it-sanity-check-the-number]

Only after alignment does SQL get written. The skill drafts the
`value_sql` against `marts-db`, runs it, and reports the number
back — not just "the SQL parses", but "this is the actual current
value".

The number is sanity-checked against:

* the team's hand count, where one exists (the
  [webbank-first rule](/docs/sources#source-priority-—-webbank-first)
  was born from a metric that diverged from one),
* a Jasper-rendered report covering the same concept (Jasper is a
  source of *definitions*, not of *data* — see
  [decisions](/docs/decisions)),
* the operator's domain expectation ("this can't be 22%").

Only after the number passes does the file land at
`db/metrics/<metric_id>.sql`.
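As a sketch, a `value_sql` for the approval-rate example from step 3
might look like this (table and column names are assumptions, not the
real schema):

```sql
-- Illustrative only: webbank.card_orders and its columns are
-- placeholders. Returns a single percentage value.
SELECT round(100.0 * count(*) FILTER (WHERE status = 'approved')
             / nullif(count(*), 0), 1)
FROM webbank.card_orders
WHERE created_at >= now() - interval '30 days';
```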

## Anatomy of a metric file [#anatomy-of-a-metric-file]

A metric file has a fixed shape:

```sql
-- comment header: one-line purpose, methodology notes, owner
INSERT INTO heartbeat.metric_registry (metric_id, name, period, value_sql, ...) VALUES (...);
UPDATE heartbeat.metric_registry SET code = 'APRC' WHERE metric_id = '...';
UPDATE heartbeat.metric_registry SET how_to_read = '...', methodology = '...', sources = '...' WHERE metric_id = '...';
-- one-shot threshold reset (norm_value / alert_value / target_value)
```

Shared scaffolding (`db/metrics/_schema.sql`, `_pre.sql`, `_post.sql`)
wraps the per-metric files; `db/build_metrics.sh` concatenates
everything, plus an auto-generated garbage-collection (GC) step, into
the SQL stream piped into psql. The GC is what makes the files the
single source of truth: deleting a metric file removes its registry
row, with no parallel manifest to maintain.
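Conceptually, the generated GC is a delete of every registry row whose
metric file no longer exists — something like the following (a sketch
of the idea with placeholder ids, not the script's literal output):

```sql
-- Auto-generated from the current contents of db/metrics/*.sql:
DELETE FROM heartbeat.metric_registry
WHERE metric_id NOT IN ('metric_a', 'metric_b', 'metric_c');
```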

## Deploy after the file lands [#deploy-after-the-file-lands]

Once the file is in `db/metrics/`, run the standard VM deploy:

```bash
ssh ubuntu@13.62.60.156 'cd ~/heartbeat-dashboard && ./bin/deploy.sh'
```

`bin/deploy.sh` runs `db/build_metrics.sh` → psql, then
`compute_metrics` (writes the first `metric_history` row), then
`refresh_benchmarks` (recomputes μ/σ for `rolling_stat` metrics). The
new tile surfaces on the cockpit immediately — `unverified`,
ownerless, awaiting the lifecycle's business handover.
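The sequence the deploy script runs can be sketched as follows (the
exact script internals and the `$DATABASE_URL` variable are
assumptions; only the three-step order is stated on this page):

```bash
#!/usr/bin/env bash
set -euo pipefail
# 1. Rebuild the registry from the metric files (includes the GC).
./db/build_metrics.sh | psql "$DATABASE_URL"
# 2. Compute the first metric_history row for each metric.
./bin/compute_metrics
# 3. Recompute μ/σ benchmarks for rolling_stat metrics.
./bin/refresh_benchmarks
```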

**Don't shortcut the deploy script** — applying the SQL by hand
without the compute + benchmark refresh leaves the new tile stuck on
"no data yet" until the next hourly cron tick.

## Common failure modes the skill prevents [#common-failure-modes-the-skill-prevents]

* **Metric without a money story** — the skill refuses to write SQL
  before step 3 lands.
* **Metric off a wrong-but-landed source** — the skill checks
  `pending-ingest-sources` and the webbank-first rule; surrogate
  metrics are flagged, not silently shipped.
* **Threshold derived from current data** — the skill won't write
  `target_value = (some percentile of metric_history)` even if asked.
  Thresholds come from outside the dashboard.
* **Duplicate of an existing metric** — caught at step 1 by
  inventorying the live registry.
* **Vanity counter** — the JTBD step requires a stated *action* on
  red. Counters with no action don't earn a tile.

## See also [#see-also]

* [Metric lifecycle](/docs/metrics/lifecycle) — what happens after a
  metric lands.
* [Validate](/docs/metrics/validate) — the deep-audit skill for a
  single metric whose number you don't trust.
* [Sources › pending](/docs/sources#pending) — the parking lot for
  metric concepts blocked on data that isn't landed.
