Architecture
Pipeline shape, the metric data contract, and the principles behind them.
Heartbeat is a TV cockpit that aggregates data from every source system Satchel runs on and surfaces what's broken before someone walks past the screen and notices. This page is the architectural ground truth for the project — pipeline, data contract, refresh path, principles, repo layout. Sibling docs cover sources, metric lifecycle, decisions, and the roadmap.
Business goals
Heartbeat exists to give Satchel's CEO a single cockpit that shows the state of the business in real-ish time — and, just as importantly, raises an alert when something silently breaks. Today the team runs on data that is "a month, if not months, late", and serious problems surface by accident in corridor conversations. The canonical example raised at the 2026-04-24 offsite: a client did not pay for MasterCard collateral for 4 months → Satchel covered the debt out of its own funds → the CEO learned about it by chance, walking past someone's desk. Heartbeat must catch events of that shape before they cost money.
Four concrete jobs the product must do:
- TV cockpit / red-zones-first view. A single screen — physically on a TV in the office, always on, glanceable from across the room — where the CEO and team see the state of the business. Direct quote from the offsite: "I want to see the shit, not what's going well." Green metrics fade; red ones surface.
- Silent-event alerting. When a metric crosses a threshold that matters, Heartbeat raises a signal — it does not wait for someone to walk past a screen. This is the layer that prevents the MasterCard-collateral class of incident.
- OKR / KPI substrate (later). Once Heartbeat reliably produces metrics, those metrics become the input for departmental OKRs. Heartbeat isn't an OKR system, but every other initiative that needs trustworthy metrics depends on it landing first.
- Conversational interface via MCP (later). Expose the registry and history as an MCP server so an agent can answer "what's going on today?" / "what should I look at first?" Same data, different surface.
What Heartbeat is not: a general-purpose BI tool, a multi-tenant analytics platform, or a replacement for Zoho / YouTrack / Slack. The principle is aggregate, not replace — channels stay where they are; the metric is what gets unified.
Where we are now (2026-05-08)
- 6 source DBs landed in `marts-db` (~15 GB across 9 schemas): `webbank`, `aml`, `pdvs`, `vss`, `card_management`, `currency_rate`. See sources for landed/pending detail.
- 31 metrics in `heartbeat.metric_registry`; the cockpit reads `metric_history` with honest `computed_at` and `source_as_of` per row.
- 17 audit cuts in `heartbeat.audit_cut_registry` driving 3 pages (`/audit`, `/audit/clients`, `/audit/journeys`) via the parameterised `AuditPage` + cut-renderer registry.
- All 4 surfaces share `AppShell` with primary nav and Docs/Feedback icons.
- Both registries carry `verification_state` + email `owner`; everything ships `unverified` until cross-checked (why).
- Ingest + compute behind one endpoint (`POST /api/metrics/recompute`) with UI progress and hourly cron.
- Compute phase parallelised (`ThreadPoolExecutor(max_workers=4)`); a full refresh of 31 metrics now completes in ~2 min. Post-ingest analytics indexes on the 4 largest tables (see Data quality and performance).

Next steps live in the roadmap: (d) metric ↔ cut deep-links, (e) auto-generated /docs content from the registries, (f) alerting, (g) automated metrics discovery, (h) starred metrics and watchlists.
Pipeline shape
Five stages, in order:
`sources → dlt ingest → marts-db → python compute → registry + history → surfaces`

```mermaid
flowchart LR
  S[Sources<br/>Postgres / REST / webhooks] --> I[Ingest<br/>api/ingest_lib.py — dlt]
  I --> M[(marts-db<br/>Postgres on the VM)]
  M --> C[Compute<br/>api/metrics_compute.py]
  C --> R[(metric_registry<br/>+ metric_history)]
  R --> Cock[Cockpit /]
  R --> Aud[Audit /audit/*]
  R -.future.-> Alerts[Alerts]
  R -.future.-> MCP[MCP server]
```

- Sources — heterogeneous (Postgres replicas, REST APIs, webhooks, files).
- Ingest — `dlt`, one engine in `api/ingest_lib.py` driven by a `SOURCES` table of `SourceSpec` (db, schemas, pipeline_name). The same engine is invoked from the CLI (`scripts/ingest.py`) and from the recompute endpoint. There is no per-source `dlt_<name>.py` script — adding a source is a registry edit, not a new file.
- marts-db — Postgres on the VM, owned by us. Landed copies live here, with whatever indexes we add (see § Indexes).
- Compute — python module `api/metrics_compute.py`. For each row in `heartbeat.metric_registry` it executes `value_sql` against marts-db, captures `source_as_of` from the relevant `<schema>._dlt_loads`, resolves status, and UPSERTs a row into `heartbeat.metric_history`. Per-metric python modules are reserved for the day a metric needs source-API access or logic that cannot be expressed as `value_sql` — none exist today.
- Registry + history — `heartbeat.metric_registry` (config) and `heartbeat.metric_history` (cached values, with `computed_at` and `source_as_of` per row). Single read model.
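Stripped to its essentials, one compute pass can be sketched like this — a sketch only: the history columns match the contract described above, but the function shape and the `metric` dict are illustrative, not the actual `api/metrics_compute.py` API:

```python
from datetime import datetime, timezone

def compute_one(conn, metric):
    """Illustrative sketch of one compute pass: run the metric's
    value_sql, stamp freshness from the dlt load ledger, and UPSERT
    today's history row. `metric` keys are assumed, not the real schema."""
    with conn.cursor() as cur:
        cur.execute(metric["value_sql"])
        value = cur.fetchone()[0]

        # Freshness lower bound: newest completed dlt load in the source schema.
        cur.execute(
            f"SELECT max(inserted_at) FROM {metric['schema']}._dlt_loads"
            " WHERE status = 'completed'"
        )
        source_as_of = cur.fetchone()[0]

        # One row per (metric, day); a re-run overwrites, never duplicates.
        cur.execute(
            """
            INSERT INTO heartbeat.metric_history
                (metric_id, as_of_date, value, computed_at, source_as_of)
            VALUES (%s, current_date, %s, %s, %s)
            ON CONFLICT (metric_id, as_of_date) DO UPDATE
                SET value = excluded.value,
                    computed_at = excluded.computed_at,
                    source_as_of = excluded.source_as_of
            """,
            (metric["id"], value, datetime.now(timezone.utc), source_as_of),
        )
```

The UPSERT is what makes hourly re-runs safe: recomputing a metric refreshes today's row in place rather than appending a duplicate.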
Three surfaces consume registry + history: the cockpit (TV, today), the alert dispatcher (Slack/Telegram, later), and the MCP server (agent, later). All three read the same two tables; none of them touches sources or the compute layer directly.
The contract this shape commits us to:
- Sources are heterogeneous (Postgres / REST / webhook / file).
- Land before transform. We always land raw rows in marts-db before computing. No live-querying source systems on a cockpit refresh.
- Registry + history is the single read model for every consumer surface. New surfaces (alerts, MCP) plug into the same two tables; they don't grow their own data models.
- Cockpit reads only from history — never from source, never from raw marts, never blocking on compute. This is what makes the TV immune to a stalled sync or a slow API.
Metric data contract
Every metric, regardless of source or compute path, lands in the registry / history with the same six-component shape. This is the contract between the data layer and every consumer surface (cockpit, alerts, MCP, future OKR overlay). Surfaces never invent their own threshold model — they read these values and decide what to do.
| Field | What it is | Where it comes from |
|---|---|---|
| `value` | The current scalar for the metric. | Compute layer; written into `metric_history` per refresh. |
| `period` | The time window the value covers (snapshot, 24h, 7d, 30d, funnel). Without it the number is meaningless. | Config in `db/metrics/<id>.sql`. |
| `typical` | μ ± σ (rolling benchmark over the last N days). What "normal" looks like for this metric. Optional — only rolling_stat metrics carry it. | Recomputed daily by `refresh_benchmarks.py` from `metric_history`. |
| `norm_value` | Soft acceptable boundary — the amber line. Source: owner commitment. | Owner-set in registry; tunable live without redeploy. |
| `alert_value` | Hard ceiling — the red line. Source: SLA, regulatory cap, capacity ceiling. Never auto-derived. | Owner-set in registry; tunable live without redeploy. |
| `target_value` | OKR stretch goal — better than `norm_value`, optional. Display-only: beating it does not change status colour, it lights a "target hit" accent on the green tile. Paired with `target_due_date`. | OKR / owner; tunable live. |
Status (green/amber/red) is server-computed in `api/cockpit.py` with a fixed priority:

1. `alert_value` crossed → red
2. else `norm_value` crossed → amber
3. else if μ ± σ set → z-score (|z| ≥ 2 → red, |z| ≥ 1 → amber)
4. else green

`target_value` does not enter the status calculation — it's a forward commitment surfaced as visual encouragement, not a threshold.
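As a sketch, the priority resolves to a few guarded returns. This is illustrative only — the real logic lives in `api/cockpit.py`, and the direction of "crossed" is assumed here to mean "value at or above the threshold" (metrics where lower is worse would invert the comparisons):

```python
def resolve_status(value, *, alert_value=None, norm_value=None,
                   mean=None, stddev=None):
    """Resolve green/amber/red per the fixed priority:
    alert_value → norm_value → z-score vs μ ± σ → green."""
    if alert_value is not None and value >= alert_value:
        return "red"
    if norm_value is not None and value >= norm_value:
        return "amber"
    if mean is not None and stddev:
        z = abs(value - mean) / stddev
        if z >= 2:
            return "red"
        if z >= 1:
            return "amber"
    return "green"
```

Note the ordering consequence: a metric past its amber line but under its red line stays amber even if its z-score is extreme — the owner-set thresholds always win over the statistical benchmark.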
Why each consumer needs all six:
- Cockpit colours the tile per the priority above and lights a "target hit" accent when `value` beats `target_value` on a green tile.
- Alerts fire on the same priority. Reusing the contract means an alert can never disagree with what the screen shows.
- OKR overlay reads `target_value` + `target_due_date` directly as the KPI commitment, `norm_value` as the guard rail, `typical` as the baseline you're trying to move from.
- MCP answering "what's wrong today" returns metrics where `value` crossed `norm_value` or `alert_value`, ranked by severity / staleness.
Anything a surface needs beyond these six (per-team filter, escalation chain, snooze state, OKR period boundary) is surface state, not metric state, and lives in tables next to the surface — never in the registry.
Refresh path
There is one path for data to move forward: ingest and compute behind a single endpoint, `POST /api/metrics/recompute`. A request resolves the requested metrics → distinct dlt schemas → runs `run_one()` per (db, schema) serially → recomputes every metric whose schema is in that set. Cron uses the same path through `scripts/refresh_all.py`. UI clicks, cron ticks, and any future surface all speak this one language.
Single-flight, globally coalesced. Only one ingest+compute job runs at a time. A POST during an active job returns the existing `job_id` (`coalesced=true`); a cron tick that arrives mid-job is silently absorbed. Job state is held in-process (`api/recompute.py:_JOBS`); on api restart, in-flight jobs are dropped — the next cron tick re-runs.
Three reasons single-flight is load-bearing:
- The VM tips over with parallel dlt extracts (it ran on t3.small/2 GB before the t3.large upgrade; the ceiling is higher now but parallelism would still compete with `api`/`web`/`marts-db` for the same RAM).
- Multiple concurrent jobs racing to UPSERT the same `(metric_id, current_date)` history row would produce a value whose `source_as_of` and `computed_at` came from different jobs — a confusing audit trail.
- A persistent job ledger is overkill for an operator-facing console; we'll add it the day we need cross-restart resumption.
`GET /api/metrics/active-job` exposes the in-flight job to the UI; the cockpit shows progress whether the trigger was a click or cron.
Cron schedule
| Cron | What it does | Entry |
|---|---|---|
| `17 * * * *` | Hourly ingest + compute (every metric). | `scripts/refresh_all.py` |
| `23 4 * * *` | Daily μ/σ refresh for rolling_stat metrics. | `python -m api.scripts.refresh_benchmarks` |
| `30 4 * * *` | Daily client-segmentation snapshot. | `python -m api.scripts.refresh_segments` |
| `45 4 * * *` | Daily audit-cut registry sync + snapshot refresh for every (cut × window). | `python -m api.scripts.refresh_audit` |
Freshness signal
Each `metric_history` row carries `source_as_of` — the maximum `inserted_at` over completed loads in the metric's source schema (`<schema>._dlt_loads.inserted_at WHERE status='completed'`). This gives an automatic lower bound — "data is at least this fresh" — for every metric without per-metric configuration.
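The rule itself is a pure aggregate over the schema's dlt load ledger; a minimal sketch, assuming rows of `(inserted_at, status)` read from `<schema>._dlt_loads`:

```python
def source_as_of(load_rows):
    """Freshness lower bound for a schema: max inserted_at over
    completed dlt loads, or None if nothing has completed yet.
    load_rows: iterable of (inserted_at, status) tuples."""
    completed = [ts for ts, status in load_rows if status == "completed"]
    return max(completed) if completed else None
```

Failed or in-flight loads are deliberately excluded — a load that didn't complete tells you nothing about how fresh the landed rows are.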
Staleness is a separate visual signal from status, not folded into
red/amber/green. The status dot reflects value vs
target/limit/benchmark. The freshness timestamp colours
independently:
| period | amber | red |
|---|---|---|
| 24h | > 36h | > 72h |
| 30d | > 7d | > 14d |
| snapshot | > 24h | > 48h |
Conflating "the value is bad" with "the pipeline is broken" hides the actual problem. An operator looking at red needs to know which one it is — wrong number, or stale source.
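The table above reduces to a small lookup plus two comparisons; a sketch — the "ok" label, the hour conversions, and the default for periods not listed in the table are assumptions, not the shipped behaviour:

```python
# Hours of staleness before the freshness timestamp turns amber / red.
STALENESS_HOURS = {
    "24h":      (36, 72),
    "30d":      (7 * 24, 14 * 24),
    "snapshot": (24, 48),
}

def freshness_colour(period: str, age_hours: float) -> str:
    """Colour the freshness timestamp independently of the status dot."""
    amber, red = STALENESS_HOURS.get(period, (36, 72))  # assumed fallback
    if age_hours > red:
        return "red"
    if age_hours > amber:
        return "amber"
    return "ok"
```

Keeping this separate from `resolve_status`-style logic is the point: a tile can be green-but-stale or red-but-fresh, and the operator can tell the two apart.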
Audit cuts
/audit/<page> is built out of cuts — small charts with prose between them. Each cut is a Python module under `api/audit/cuts/<cut_id>.py` exporting two things:

```python
META = {
    "page": "clients",
    "title": "...",
    "window_default": "30d",
}

def compute(conn, *, window: str) -> dict:
    ...
```

The module is the source of truth.
`api.scripts.refresh_audit` auto-syncs `heartbeat.audit_cut_registry` from disk — no per-file SQL, no DB-level write guard. `heartbeat.audit_cut_snapshot` caches the payload per `(cut_id, window_key)`. Each leaf page (`/audit/<page>/page.tsx`) is ~30 lines and just calls `<AuditPage page="clients" />`; the `cut_id` → React component registry in `dashboard/app/audit/_shared/cuts/index.ts` maps each cut to its renderer. See audit cuts for the add-a-chart flow.
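The sync-from-disk half — discovering cut modules and reading their `META` — can be sketched like this. Illustrative only: the real discovery lives in `api/audit/cuts/__init__.py`, and the returned shape here is an assumption:

```python
import importlib
import pkgutil

def discover_cuts(package):
    """Import every non-underscore module under the cuts package and
    index it by cut_id (the module name). Each module is expected to
    export META (dict) and compute(conn, *, window) -> dict."""
    cuts = {}
    for info in pkgutil.iter_modules(package.__path__):
        if info.name.startswith("_"):
            continue  # skip _shared-style helpers
        mod = importlib.import_module(f"{package.__name__}.{info.name}")
        cuts[info.name] = {"meta": mod.META, "compute": mod.compute}
    return cuts
```

With a mapping like this in hand, syncing the registry is just an upsert of each `META` row keyed by cut_id — which is why adding a chart never requires a DB migration.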
Verification + ownership
Both `metric_registry` and `audit_cut_registry` carry:

- `verification_state ∈ {unverified, verified}`, default `unverified`;
- `owner` — an email validated against a loose RFC-5321 regex, default `NULL`.

Operators PATCH these via `/api/metrics/{id}/verification` and `/api/audit/cuts/{id}/verification`. Both endpoints accept either field independently (auto-save on owner blur). These two columns are the only registry fields operators can change without a redeploy.
The metric lifecycle — when a tile is unverified, when an owner gets attached, when verification flips — is its own page: metric lifecycle.
Principles
These hold across every choice in this codebase and shouldn't be re-litigated when we pick a tool or pattern.
- Aggregate, not replace. Heartbeat reads from existing systems; it does not ask the team to migrate off Zoho or to consolidate tools. Each source connector is a one-way ingest.
- Smallest stack that does today's job. Build what the cockpit needs now, but don't take decisions that paint us into a corner when the second source / the alert layer / the OKR overlay lands. No abstractions invented for hypothetical use cases.
- Config-truth in the file, runtime-truth in the table. `db/metrics/<id>.sql` is the config source of truth — name, SQL fragment, period, owner, description, initial target/limit. Versioned and reviewed in PRs. `heartbeat.metric_registry` is the runtime table that the API joins against and that holds values which legitimately change outside the file: operator-tuned `norm_value`/`alert_value`, daily-recomputed `benchmark_mean`/`benchmark_stddev`, `verification_state`, `owner`. A DB-level write guard (`heartbeat.metric_registry_guard`) makes the file the only path for config columns, while leaving the operator-tunable columns writable.
- Status logic stays server-side (`api/cockpit.py`). The frontend renders what the API computed; it never re-derives green/amber/red.
- Cockpit always renders from a fast read. A slow source or a failing sync must not slow down or break the screen the CEO opens.
- TV-first surface. The primary client is an always-on TV in the office, not a laptop. Large legible layout, auto-refresh on a fixed interval, no auth modals, no toasts that require dismissal, graceful degradation on transient API failures (last good value with staleness indicator, not a blank tile).
- One read model, many surfaces. Cockpit, alerts, and (later) MCP all consume the same `metric_registry` + `metric_history`.
- AI is the primary developer; recurring workflows are codified as skills. Heartbeat is built mostly by AI agents. Every recurring workflow ("add a metric", "onboard a new source", "investigate a stalled sync", "draft an alert rule") is written as a skill under `.claude/skills/`. The skill is the runbook — executable, tested-by-use, and the only sanctioned path for that workflow. Live skills today: `metrics-discovery`, `metric-validate`, `heartbeat-today`.
- `/docs` is the single source of truth. This site (fumadocs) owns product, lifecycle, sources, decisions, and operations. Any new feature, product decision, user scenario, or behavioural change updates the relevant page in `/docs` in the same PR as the code. CLAUDE.md and README are thin pointers into `/docs`; they do not duplicate architecture. See the decision for context.
- English for all code, comments, commits, and project docs. Russian only in user-facing prose where the audience is the Satchel team itself.
Repo map
| Path | Role |
|---|---|
api/ingest_lib.py | dlt engine. Owns SOURCES, SKIP_TABLES, COLUMN_DROP, INCREMENTAL_PRIORITY, and run_one(spec, schema). Both the CLI and the recompute endpoint import from here. |
scripts/ingest.py | Thin CLI wrapper over api.ingest_lib. |
api/metrics_compute.py | Compute engine. fetch_registry(), compute_one() — execute value_sql, look up source_as_of, resolve status, UPSERT into metric_history. |
api/recompute.py | The on-demand endpoint. POST /api/metrics/recompute, GET /api/metrics/recompute/{id}, GET /api/metrics/active-job. Single-flight global job, two phases (ingest → compute), threaded background runner. |
api/cockpit.py | The metrics read endpoint. Reads registry + last metric_history row per metric, returns the row including computed_at / source_as_of for the freshness chip. |
scripts/refresh_all.py | Cron entrypoint. POSTs {"metric_ids":"all","trigger":"cron"} to the local recompute endpoint, polls until done. |
scripts/compute_metrics.py | CLI for compute-only refresh (ad-hoc, no ingest). |
db/metrics/<metric_id>.sql | One file per metric — comment header + INSERT + UPDATE for code + UPDATE for how_to_read / methodology / sources. Config source of truth for metric_registry. |
db/metrics/_schema.sql / _pre.sql / _post.sql | Shared scaffolding around the per-metric files — DDL, write guard, helpers, transaction frame, final NOT NULL. |
db/build_metrics.sh | Concatenates _schema.sql + _pre.sql + metrics/*.sql + auto-GC + _post.sql into the assembled SQL stream for psql. |
api/audit/routes.py | Generic audit endpoints. GET /api/audit/cuts/{id}, GET /api/audit/pages/{page}, POST /api/audit/cuts/{id}/refresh, GET /api/audit/registry, PATCH /api/audit/cuts/{id}/verification. |
api/audit/cuts/<cut_id>.py | One file per audit cut. Exports META + compute(conn, *, window) -> dict. |
api/audit/cuts/__init__.py | Discovers modules at import, exposes CUTS, cuts_for_page, sync_registry(conn), upsert_snapshot. |
api/scripts/refresh_audit.py | Sync-from-disk + recompute every (cut, window). |
db/audit_schema.sql | DDL for audit_cut_registry + audit_cut_snapshot. Applied by bin/deploy.sh. |
dashboard/app/page.tsx | The cockpit screen. |
dashboard/app/_shell/AppShell.tsx | Shared header for all 4 main pages — primary nav (Cockpit | Audit), right-side icon links (Docs, Feedback), freshness chip, ThemeToggle. |
dashboard/app/audit/_shared/AuditPage.tsx | Parameterised audit page driven by cut_id → React component registry. |
dashboard/app/audit/_shared/cuts/<Cut>.tsx | One renderer per cut. Registered in dashboard/app/audit/_shared/cuts/index.ts. |
dashboard/app/docs/** | Fumadocs route — this site. [[...slug]]/page.tsx renders MDX. /docs/api is the Scalar OpenAPI viewer. |
dashboard/content/docs/** | The MDX content of /docs. |
dashboard/lib/hooks.ts | React-Query hooks: useCockpitMetrics, useActiveJob, useStartRecompute, useSetMetricVerification, useAuditPage, useRefreshAuditCut, useSetCutVerification. |
bin/deploy.sh | Manual VM deploy entrypoint. |
docker-compose.prod.yml | VM stack: marts-db, api, web. |
docker-compose.dev.yml | Local dev: schema-exploration Postgres on port 5433. |
.claude/skills/metrics-discovery/SKILL.md | The discipline gate for adding metrics. |
.claude/skills/metric-validate/SKILL.md | Deep audit of a single metric. |
.claude/skills/heartbeat-today/SKILL.md | Daily triage of cockpit state. |
Data quality and performance
This section records the rules that were learned the hard way during the 2026-05-08 refresh-pipeline hardening session. Apply them to every new metric and every new source landing.
Post-ingest indexes
dlt lands rows without metric-aware indexes — only `_dlt_id` unique indexes exist by default. Every WHERE and JOIN column on a multi-GB table will do a full seq scan until you add an index explicitly.

The rule: for any table over 200 MB referenced in a metric's `value_sql`, verify in `pg_indexes` that the WHERE and JOIN columns have an index. If not, add a `CREATE INDEX IF NOT EXISTS` entry to `POST_INGEST_INDEXES` in `api/ingest_lib.py`. The dict is applied idempotently after every `pipeline.run()`.
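Applied after each load, the mechanism is just idempotent DDL in a loop. A sketch assuming a `schema → (index_name, table, columns)` shape for the dict — the real `POST_INGEST_INDEXES` in `api/ingest_lib.py` may be keyed differently:

```python
# Assumed shape: schema -> list of (index_name, table, columns) tuples.
POST_INGEST_INDEXES = {
    "webbank_core": [
        ("ix_entry_state_time", "entry", "entry_state_id, create_time"),
        ("ix_payment_id", "payment", "id"),
    ],
    "aml_aml": [
        ("ix_retro_create_at", "aml_retrospective", "create_at"),
    ],
}

def index_statements(schema: str) -> list[str]:
    """Render the idempotent DDL to run after pipeline.run().
    IF NOT EXISTS makes re-running after every ingest tick a no-op."""
    return [
        f"CREATE INDEX IF NOT EXISTS {name} ON {schema}.{table} ({cols})"
        for name, table, cols in POST_INGEST_INDEXES.get(schema, [])
    ]
```

`IF NOT EXISTS` is the load-bearing part: the same statements can run on every hourly tick without erroring or rebuilding anything.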
Current indexes (as of 2026-05-08):
| Schema | Index | Reason |
|---|---|---|
aml_aml | aml_retrospective(create_at) | AML throughput — 2.7 GB table |
webbank_core | entry(entry_state_id, create_time) | Revenue, active-client, top-tier metrics |
webbank_core | payment(id) | Join from entry.payment_id — no PK index in dlt tables |
webbank_core | payment(type_id, create_time) | Outgoing approval-rate metric |
webbank_core | client_profile(id) | Join from payment.sender_profile_id |
webbank_core | payment_state_log(state_id, timestamp) | Processing-time metrics |
webbank_core | payment_state_log(payment_id) | Join from filtered payment set |
webbank_core | client_profile_state_log(client_profile_id, profile_state, timestamp) | Approval-rate cohort join |
Before/after: slowest metrics dropped from 4–10s per query to under 0.4s.
Replace-strategy tables — hidden ingest cost
dlt falls back to replace (truncate + full reload) for any table
with no PK and no recognised incremental cursor column. On every
ingest tick, these tables are wiped and fully re-ingested from the
source — even if nothing changed.
The problem: replace tables burn ingest time proportional to row count every hour. The cost is invisible in normal operation but becomes the bottleneck when a replace table grows or a new metric references one.
What to do when landing a new source:
- After the first ingest, check `_dlt_loads` or run `EXPLAIN` on a metric query to see which tables went through `replace`.
- For each replace table: does a PK or cursor column exist in the source that dlt didn't detect (different naming, composite key)? If yes, add an override to `INCREMENTAL_PRIORITY` in `api/ingest_lib.py`.
- If no cursor exists but the table is large and changes slowly, evaluate a daily schedule for that source (rather than hourly).
- Document the outcome in sources: which tables are intentionally replace (small lookup tables — acceptable), which were fixed, which are blocked on a source-side schema change.
Metric SQL rules
These apply before any `value_sql` or `history_sql` is committed. The metrics-discovery skill enforces them as a mandatory gate.

No correlated subqueries. Any `(SELECT ... WHERE t.id = outer.id)` pattern must be rewritten as a JOIN + conditional aggregation (`MIN(CASE WHEN ...)` etc.) before merging. Correlated subqueries execute once per outer row — on a 2M-row table that's 2M separate lookups. The query will time out.
Timing gate: `value_sql` must complete in under 5s on live data. Over 5s is a structural problem — a missing index or a correlated subquery — that must be fixed before the metric ships. Run the SQL directly on marts-db before opening a PR.
EXPLAIN check for slow queries. If the query is slow, run `EXPLAIN (ANALYZE, BUFFERS)`. Look for a `Seq Scan` on large tables (add an index) or a `Nested Loop` over a large outer set (rewrite so the planner can use a `Hash Join`).
Statement timeout. `api/metrics_compute.py` sets `SET LOCAL statement_timeout = '60s'` before every `value_sql` execution (configurable via `METRIC_STATEMENT_TIMEOUT`). A metric that hits the timeout has a structural problem — raise the timeout only after you've confirmed the SQL is correct and the table simply needs more time, never as a first resort.
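A minimal sketch of the pattern, assuming a DB-API-style connection with an open transaction (function and variable names are illustrative, not the real `api/metrics_compute.py`). `SET LOCAL` scopes the limit to the enclosing transaction, so one slow metric can't loosen it for the next:

```python
import os

# Trusted deploy-time config; interpolated as a literal because
# SET LOCAL does not accept bind parameters in all drivers.
METRIC_STATEMENT_TIMEOUT = os.environ.get("METRIC_STATEMENT_TIMEOUT", "60s")

def run_value_sql(conn, value_sql: str):
    """Execute one metric's value_sql under a transaction-local timeout.
    Assumes the driver has already opened a transaction (psycopg2's
    default non-autocommit mode does this implicitly); SET LOCAL
    expires when that transaction ends."""
    with conn.cursor() as cur:
        cur.execute(f"SET LOCAL statement_timeout = '{METRIC_STATEMENT_TIMEOUT}'")
        cur.execute(value_sql)
        return cur.fetchone()[0]
```

If `value_sql` exceeds the limit, Postgres aborts that statement with a `query canceled` error, the transaction rolls back, and the compute loop can record the failure and move on to the next metric.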
Alerting, deep-links, auto-generated docs content
All three are tracked in the roadmap — (d) metric ↔
cut deep-links, (e) auto-generation of /docs content from the
registries, (f) alerts, (g) automated discovery, (h) watchlists.
None block landing new metrics today.
Decision log
Every settled architectural decision is recorded in decisions. When you hit a "why is it like this" question, look there before re-litigating.