Architecture

Pipeline shape, the metric data contract, and the principles behind them.

Heartbeat is a TV cockpit that aggregates data from every source system Satchel runs on and surfaces what's broken before someone walks past the screen and notices. This page is the architectural ground truth for the project — pipeline, data contract, refresh path, principles, repo layout. Sibling docs cover sources, metric lifecycle, decisions, and the roadmap.

Business goals

Heartbeat exists to give Satchel's CEO a single cockpit that shows the state of the business in real-ish time — and, just as importantly, raises an alert when something silently breaks. Today the team runs on data that is "a month, if not months, late", and serious problems surface by accident in corridor conversations. The canonical example, raised at the 2026-04-24 offsite: a client did not pay for MasterCard collateral for 4 months → Satchel covered the debt out of its own funds → the CEO learned about it by chance, walking past someone's desk. Heartbeat must catch events of that shape before they cost money.

Four concrete jobs the product must do:

  1. TV cockpit / red-zones-first view. A single screen — physically on a TV in the office, always on, glanceable from across the room — where the CEO and team see the state of the business. Direct quote from the offsite: «хочу видеть говно, а не то, что хорошо» ("I want to see the shit, not what's good"). Green metrics fade; red ones surface.
  2. Silent-event alerting. When a metric crosses a threshold that matters, Heartbeat raises a signal — it does not wait for someone to walk past a screen. This is the layer that prevents the MasterCard-collateral class of incident.
  3. OKR / KPI substrate (later). Once Heartbeat reliably produces metrics, those metrics become the input for departmental OKRs. Heartbeat isn't an OKR system, but every other initiative that needs trustworthy metrics depends on it landing first.
  4. Conversational interface via MCP (later). Expose the registry and history as an MCP server so an agent can answer "what's going on today?" / "what should I look at first?" Same data, different surface.

What Heartbeat is not: a general-purpose BI tool, a multi-tenant analytics platform, or a replacement for Zoho / YouTrack / Slack. The principle is aggregate, not replace — channels stay where they are; the metric is what gets unified.

Where we are now (2026-05-08)

  • 6 source DBs landed in marts-db (~15 GB across 9 schemas): webbank, aml, pdvs, vss, card_management, currency_rate. See sources for landed/pending detail.
  • 31 metrics in heartbeat.metric_registry; cockpit reads metric_history with honest computed_at and source_as_of per row.
  • 17 audit cuts in heartbeat.audit_cut_registry driving 3 pages (/audit, /audit/clients, /audit/journeys) via the parameterised AuditPage + cut-renderer registry.
  • All 4 surfaces share AppShell with primary nav and Docs/Feedback icons.
  • Both registries carry verification_state + email owner; everything ships unverified until cross-checked (why).
  • Ingest+compute behind one endpoint (POST /api/metrics/recompute) with UI progress and hourly cron.
  • Compute phase parallelised (ThreadPoolExecutor(max_workers=4)); full refresh of 31 metrics now completes in ~2 min. Post-ingest analytics indexes on the 4 largest tables (see Data quality and performance).

Next steps live in the roadmap: (d) metric ↔ cut deep-links, (e) auto-generated /docs content from the registries, (f) alerting, (g) automated metrics discovery, (h) starred metrics and watchlists.

Pipeline shape

Five stages, in order:

sources → dlt ingest → marts-db → python compute → registry + history → surfaces

```mermaid
flowchart LR
  S[Sources<br/>Postgres / REST / webhooks] --> I[Ingest<br/>api/ingest_lib.py — dlt]
  I --> M[(marts-db<br/>Postgres on the VM)]
  M --> C[Compute<br/>api/metrics_compute.py]
  C --> R[(metric_registry<br/>+ metric_history)]
  R --> Cock[Cockpit /]
  R --> Aud[Audit /audit/*]
  R -.future.-> Alerts[Alerts]
  R -.future.-> MCP[MCP server]
```
  1. Sources — heterogeneous (Postgres replicas, REST APIs, webhooks, files).
  2. Ingest — dlt, one engine in api/ingest_lib.py driven by a SOURCES table of SourceSpec(db, schemas, pipeline_name). The same engine is invoked from CLI (scripts/ingest.py) and from the recompute endpoint. There is no per-source dlt_<name>.py script — adding a source is a registry edit, not a new file (see the sketch after this list).
  3. marts-db — Postgres on the VM, owned by us. Landed copies live here, with whatever indexes we add (see § Indexes).
  4. Compute — python module api/metrics_compute.py. For each row in heartbeat.metric_registry it executes value_sql against marts-db, captures source_as_of from the relevant <schema>._dlt_loads, resolves status, and UPSERTs a row into heartbeat.metric_history. Per-metric python modules are reserved for the day a metric needs source-API access or logic that cannot be expressed as value_sql — none exist today.
  5. Registry + history — heartbeat.metric_registry (config) and heartbeat.metric_history (cached values, with computed_at and source_as_of per row). Single read model.
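
A minimal sketch of the registry-edit pattern from stage 2, assuming the SourceSpec field set named above (the real dataclass in api/ingest_lib.py may carry more):

```python
from dataclasses import dataclass

# Sketch only: field set taken from the prose above, everything else assumed.
@dataclass(frozen=True)
class SourceSpec:
    db: str                   # source database name
    schemas: tuple[str, ...]  # schemas to land into marts-db
    pipeline_name: str        # dlt pipeline identifier

SOURCES: dict[str, SourceSpec] = {
    # Onboarding a source is one new entry here, not a new script.
    "webbank": SourceSpec(db="webbank", schemas=("core",), pipeline_name="webbank"),
}
```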

Three surfaces consume registry + history: the cockpit (TV, today), the alert dispatcher (Slack/Telegram, later), and the MCP server (agent, later). All three read the same two tables; none of them touches sources or the compute layer directly.

The contract this shape commits us to:

  • Sources are heterogeneous (Postgres / REST / webhook / file).
  • Land before transform. We always land raw rows in marts-db before computing. No live-querying source systems on a cockpit refresh.
  • Registry + history is the single read model for every consumer surface. New surfaces (alerts, MCP) plug into the same two tables; they don't grow their own data models.
  • Cockpit reads only from history — never from source, never from raw marts, never blocking on compute. This is what makes the TV immune to a stalled sync or a slow API.

Metric data contract

Every metric, regardless of source or compute path, lands in the registry / history with the same six-component shape. This is the contract between the data layer and every consumer surface (cockpit, alerts, MCP, future OKR overlay). Surfaces never invent their own threshold model — they read these values and decide what to do.

| Field | What it is | Where it comes from |
| --- | --- | --- |
| value | The current scalar for the metric. | Compute layer; written into metric_history per refresh. |
| period | The time window the value covers (snapshot, 24h, 7d, 30d, funnel). Without it the number is meaningless. | Config in db/metrics/<id>.sql. |
| typical | μ ± σ (rolling benchmark over the last N days). What "normal" looks like for this metric. Optional — only rolling_stat metrics carry it. | Recomputed daily by refresh_benchmarks.py from metric_history. |
| norm_value | Soft acceptable boundary — the amber line. Source: owner commitment. | Owner-set in registry; tunable live without redeploy. |
| alert_value | Hard ceiling — the red line. Source: SLA, regulatory cap, capacity ceiling. Never auto-derived. | Owner-set in registry; tunable live without redeploy. |
| target_value | OKR stretch goal — better than norm_value, optional. Display-only: beating it does not change status colour, it lights a "target hit" accent on the green tile. Paired with target_due_date. | OKR / owner; tunable live. |

Status (green/amber/red) is server-computed in api/cockpit.py with a fixed priority:

  1. alert_value crossed → red
  2. else norm_value crossed → amber
  3. else if μ ± σ set → z-score (|z| ≥ 2 → red, |z| ≥ 1 → amber)
  4. else green

target_value does not enter the status calculation — it's a forward commitment surfaced as visual encouragement, not a threshold.
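
A minimal sketch of that priority, assuming higher-is-worse thresholds ("crossed" is not given a direction here; api/cockpit.py is the source of truth and may handle direction per metric):

```python
def resolve_status(value, norm_value=None, alert_value=None,
                   mean=None, stddev=None) -> str:
    """Sketch of the fixed status priority; assumes higher-is-worse.

    Note that target_value deliberately never appears here."""
    if alert_value is not None and value >= alert_value:
        return "red"                     # 1. hard ceiling crossed
    if norm_value is not None and value >= norm_value:
        return "amber"                   # 2. soft boundary crossed
    if mean is not None and stddev:      # 3. z-score against the rolling benchmark
        z = abs(value - mean) / stddev
        if z >= 2:
            return "red"
        if z >= 1:
            return "amber"
    return "green"                       # 4. nothing crossed
```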

Why each consumer needs all six:

  • Cockpit colours the tile per the priority above and lights a "target hit" accent when value beats target_value on a green tile.
  • Alerts fire on the same priority. Reusing the contract means an alert can never disagree with what the screen shows.
  • OKR overlay reads target_value + target_due_date directly as the KPI commitment, norm_value as the guard rail, typical as the baseline you're trying to move from.
  • MCP answering "what's wrong today" returns metrics where value crossed norm_value or alert_value, ranked by severity / staleness.

Anything a surface needs beyond these six (per-team filter, escalation chain, snooze state, OKR period boundary) is surface state, not metric state, and lives in tables next to the surface — never in the registry.

Refresh path

There is one path for data to move forward: ingest and compute behind a single endpoint, POST /api/metrics/recompute. A request resolves the requested metrics → distinct dlt schemas → runs run_one() per (db, schema) serially → recomputes every metric whose schema is in that set. Cron uses the same path through scripts/refresh_all.py. UI clicks, cron ticks, and any future surface all speak this one language.

Single-flight, globally coalesced. Only one ingest+compute job runs at a time. A POST during an active job returns the existing job_id (coalesced=true); a cron tick that arrives mid-job is silently absorbed. Job state is held in-process (api/recompute.py:_JOBS); on api restart, in-flight jobs are dropped — the next cron tick re-runs.

Three reasons single-flight is load-bearing:

  • The VM tips over with parallel dlt extracts (it ran t3.small/2 GB before the t3.large upgrade; the ceiling is higher now but parallelism would still compete with api/web/marts-db for the same RAM).
  • Multiple concurrent jobs racing to UPSERT the same (metric_id, current_date) history row would produce a value whose source_as_of and computed_at came from different jobs — a confusing audit trail.
  • A persistent job ledger is overkill for an operator-facing console; we'll add it the day we need cross-restart resumption.

GET /api/metrics/active-job exposes the in-flight job to the UI; the cockpit shows progress whether the trigger was a click or cron.
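
A minimal sketch of the coalescing behaviour; the job shape and names are illustrative, not copied from api/recompute.py:

```python
import threading
import uuid

_JOBS: dict[str, dict] = {}  # in-process; dropped on api restart, as noted above
_LOCK = threading.Lock()

def _run_job(job: dict, metric_ids) -> None:
    try:
        ...  # phase 1: ingest per (db, schema), serially; phase 2: compute
        job["state"] = "done"
    except Exception:
        job["state"] = "failed"

def start_or_join(metric_ids) -> dict:
    """POST /api/metrics/recompute semantics: start a job, or coalesce into one."""
    with _LOCK:
        active = next((j for j in _JOBS.values() if j["state"] == "running"), None)
        if active:
            return {"job_id": active["id"], "coalesced": True}
        job = {"id": str(uuid.uuid4()), "state": "running"}
        _JOBS[job["id"]] = job
    threading.Thread(target=_run_job, args=(job, metric_ids), daemon=True).start()
    return {"job_id": job["id"], "coalesced": False}
```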

Cron schedule

| Cron | What it does | Entry |
| --- | --- | --- |
| 17 * * * * | Hourly ingest + compute (every metric). | scripts/refresh_all.py |
| 23 4 * * * | Daily μ/σ refresh for rolling_stat metrics. | python -m api.scripts.refresh_benchmarks |
| 30 4 * * * | Daily client-segmentation snapshot. | python -m api.scripts.refresh_segments |
| 45 4 * * * | Daily audit-cut registry sync + snapshot refresh for every (cut × window). | python -m api.scripts.refresh_audit |

Freshness signal

Each metric_history row carries source_as_of — the maximum inserted_at over completed loads in the metric's source schema (<schema>._dlt_loads.inserted_at WHERE status='completed'). This gives an automatic lower bound "data is at least this fresh" for every metric without per-metric configuration.
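
As a sketch, the per-schema lower bound can be read with one query (column names from the prose above; the exact query in the compute layer is not shown in this doc):

```python
# Schema name comes from the metric registry, not from user input.
SOURCE_AS_OF_SQL = """
SELECT MAX(inserted_at)
FROM {schema}._dlt_loads
WHERE status = 'completed'
"""

def source_as_of(conn, schema: str):
    with conn.cursor() as cur:
        cur.execute(SOURCE_AS_OF_SQL.format(schema=schema))
        return cur.fetchone()[0]  # lower bound on how fresh the landed data is
```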

Staleness is a separate visual signal from status, not folded into red/amber/green. The status dot reflects value vs target/limit/benchmark. The freshness timestamp colours independently:

| period | amber | red |
| --- | --- | --- |
| 24h | > 36h | > 72h |
| 30d | > 7d | > 14d |
| snapshot | > 24h | > 48h |

Conflating "the value is bad" with "the pipeline is broken" hides the actual problem. An operator looking at red needs to know which one it is — wrong number, or stale source.

Audit cuts

/audit/<page> is built out of cuts — small charts with prose between them. Each cut is a Python module under api/audit/cuts/<cut_id>.py exporting two things:

```python
META = {
    "page": "clients",
    "title": "...",
    "window_default": "30d",
}

def compute(conn, *, window: str) -> dict:
    ...
```

The module is the source of truth. api.scripts.refresh_audit auto-syncs heartbeat.audit_cut_registry from disk — no per-file SQL, no DB-level write guard. heartbeat.audit_cut_snapshot caches the payload per (cut_id, window_key). Each leaf page (/audit/<page>/page.tsx) is ~30 lines and just calls <AuditPage page="clients" />; the cut_id → React component registry in dashboard/app/audit/_shared/cuts/index.ts maps each cut to its renderer. See audit cuts for the add-a-chart flow.
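
For illustration, a filled-in cut module might look like the sketch below; the table, columns, and payload shape are invented for the example, not taken from a real cut:

```python
# api/audit/cuts/clients_new_by_week.py (hypothetical)
META = {
    "page": "clients",
    "title": "New clients by week",
    "window_default": "30d",
}

_WINDOW_DAYS = {"7d": 7, "30d": 30, "90d": 90}

def compute(conn, *, window: str) -> dict:
    days = _WINDOW_DAYS.get(window, 30)
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT date_trunc('week', created_at)::date AS week, count(*) AS n
            FROM webbank_core.client_profile  -- column names assumed
            WHERE created_at >= now() - %s * interval '1 day'
            GROUP BY 1 ORDER BY 1
            """,
            (days,),
        )
        rows = cur.fetchall()
    # The returned dict is the cut's contract with its React renderer.
    return {"series": [{"week": str(week), "count": n} for week, n in rows]}
```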

Verification + ownership

Both metric_registry and audit_cut_registry carry:

  • verification_state ∈ {unverified, verified}, default unverified,
  • owner — email validated against a loose RFC-5321 regex, default NULL.

Operators PATCH these via /api/metrics/{id}/verification and /api/audit/cuts/{id}/verification. Both endpoints accept either field independently (auto-save on owner blur). These two columns are the only registry fields operators can change without a redeploy.
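
A hedged example of the operator call; the payload field names are assumed from the columns above, and the host and metric id are placeholders:

```python
import requests

# Either field may be sent on its own (auto-save on owner blur does exactly that).
requests.patch(
    "http://localhost:8000/api/metrics/42/verification",
    json={"verification_state": "verified", "owner": "ops@satchel.example"},
    timeout=10,
)
```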

The metric lifecycle — when a tile is unverified, when an owner gets attached, when verification flips — is its own page: metric lifecycle.

Principles

These hold across every choice in this codebase and shouldn't be re-litigated when we pick a tool or pattern.

  • Aggregate, not replace. Heartbeat reads from existing systems; it does not ask the team to migrate off Zoho or to consolidate tools. Each source connector is a one-way ingest.
  • Smallest stack that does today's job. Build what the cockpit needs now, but don't take decisions that paint us into a corner when the second source / the alert layer / the OKR overlay lands. No abstractions invented for hypothetical use cases.
  • Config-truth in the file, runtime-truth in the table. db/metrics/<id>.sql is the config source of truth — name, SQL fragment, period, owner, description, initial target/limit. Versioned and reviewed in PRs. heartbeat.metric_registry is the runtime table that the API joins against and that holds values which legitimately change outside the file: operator-tuned norm_value / alert_value, daily-recomputed benchmark_mean / benchmark_stddev, verification_state, owner. A DB-level write guard (heartbeat.metric_registry_guard) makes the file the only path for config columns, while leaving the operator-tunable columns writable.
  • Status logic stays server-side (api/cockpit.py). The frontend renders what the API computed; it never re-derives green/amber/red.
  • Cockpit always renders from a fast read. A slow source or a failing sync must not slow down or break the screen the CEO opens.
  • TV-first surface. The primary client is an always-on TV in the office, not a laptop. Large legible layout, auto-refresh on a fixed interval, no auth modals, no toasts that require dismissal, graceful degradation on transient API failures (last good value with staleness indicator, not a blank tile).
  • One read model, many surfaces. Cockpit, alerts, and (later) MCP all consume the same metric_registry + metric_history.
  • AI is the primary developer; recurring workflows are codified as skills. Heartbeat is built mostly by AI agents. Every recurring workflow ("add a metric", "onboard a new source", "investigate a stalled sync", "draft an alert rule") is written as a skill under .claude/skills/. The skill is the runbook — executable, tested-by-use, and the only sanctioned path for that workflow. Live skills today: metrics-discovery, metric-validate, heartbeat-today.
  • /docs is the single source of truth. This site (fumadocs) owns product, lifecycle, sources, decisions, and operations. Any new feature, product decision, user scenario, or behavioural change updates the relevant page in /docs in the same PR as the code. CLAUDE.md and README are thin pointers into /docs; they do not duplicate architecture. See the decision for context.
  • English for all code, comments, commits, and project docs. Russian only in user-facing prose where the audience is the Satchel team itself.

Repo map

| Path | Role |
| --- | --- |
| api/ingest_lib.py | dlt engine. Owns SOURCES, SKIP_TABLES, COLUMN_DROP, INCREMENTAL_PRIORITY, and run_one(spec, schema). Both the CLI and the recompute endpoint import from here. |
| scripts/ingest.py | Thin CLI wrapper over api.ingest_lib. |
| api/metrics_compute.py | Compute engine. fetch_registry(), compute_one() — execute value_sql, look up source_as_of, resolve status, UPSERT into metric_history. |
| api/recompute.py | The on-demand endpoint. POST /api/metrics/recompute, GET /api/metrics/recompute/{id}, GET /api/metrics/active-job. Single-flight global job, two phases (ingest → compute), threaded background runner. |
| api/cockpit.py | The metrics read endpoint. Reads registry + last metric_history row per metric, returns the row including computed_at / source_as_of for the freshness chip. |
| scripts/refresh_all.py | Cron entrypoint. POSTs {"metric_ids":"all","trigger":"cron"} to the local recompute endpoint, polls until done. |
| scripts/compute_metrics.py | CLI for compute-only refresh (ad-hoc, no ingest). |
| db/metrics/<metric_id>.sql | One file per metric — comment header + INSERT + UPDATE for code + UPDATE for how_to_read / methodology / sources. Config source of truth for metric_registry. |
| db/metrics/_schema.sql / _pre.sql / _post.sql | Shared scaffolding around the per-metric files — DDL, write guard, helpers, transaction frame, final NOT NULL. |
| db/build_metrics.sh | Concatenates _schema.sql + _pre.sql + metrics/*.sql + auto-GC + _post.sql into the assembled SQL stream for psql. |
| api/audit/routes.py | Generic audit endpoints. GET /api/audit/cuts/{id}, GET /api/audit/pages/{page}, POST /api/audit/cuts/{id}/refresh, GET /api/audit/registry, PATCH /api/audit/cuts/{id}/verification. |
| api/audit/cuts/<cut_id>.py | One file per audit cut. Exports META + compute(conn, *, window) -> dict. |
| api/audit/cuts/__init__.py | Discovers modules at import, exposes CUTS, cuts_for_page, sync_registry(conn), upsert_snapshot. |
| api/scripts/refresh_audit.py | Sync-from-disk + recompute every (cut, window). |
| db/audit_schema.sql | DDL for audit_cut_registry + audit_cut_snapshot. Applied by bin/deploy.sh. |
| dashboard/app/page.tsx | The cockpit screen. |
| dashboard/app/_shell/AppShell.tsx | Shared header for all 4 main pages — primary nav (Cockpit \| Audit), right-side icon links (Docs, Feedback), freshness chip, ThemeToggle. |
| dashboard/app/audit/_shared/AuditPage.tsx | Parameterised audit page driven by the cut_id → React component registry. |
| dashboard/app/audit/_shared/cuts/<Cut>.tsx | One renderer per cut. Registered in dashboard/app/audit/_shared/cuts/index.ts. |
| dashboard/app/docs/** | Fumadocs route — this site. [[...slug]]/page.tsx renders MDX. /docs/api is the Scalar OpenAPI viewer. |
| dashboard/content/docs/** | The MDX content of /docs. |
| dashboard/lib/hooks.ts | React-Query hooks: useCockpitMetrics, useActiveJob, useStartRecompute, useSetMetricVerification, useAuditPage, useRefreshAuditCut, useSetCutVerification. |
| bin/deploy.sh | Manual VM deploy entrypoint. |
| docker-compose.prod.yml | VM stack: marts-db, api, web. |
| docker-compose.dev.yml | Local dev: schema-exploration Postgres on port 5433. |
| .claude/skills/metrics-discovery/SKILL.md | The discipline gate for adding metrics. |
| .claude/skills/metric-validate/SKILL.md | Deep audit of a single metric. |
| .claude/skills/heartbeat-today/SKILL.md | Daily triage of cockpit state. |

Data quality and performance

This section records the rules that were learned the hard way during the 2026-05-08 refresh-pipeline hardening session. Apply them to every new metric and every new source landing.

Post-ingest indexes

dlt lands rows without metric-aware indexes — only _dlt_id unique indexes exist by default. Every WHERE and JOIN column on a multi-GB table will do a full seq scan until you add an index explicitly.

The rule: for any table over 200 MB referenced in a metric's value_sql, verify in pg_indexes that the WHERE and JOIN columns have an index. If not, add a CREATE INDEX IF NOT EXISTS entry to POST_INGEST_INDEXES in api/ingest_lib.py. The dict is applied idempotently after every pipeline.run().
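
A sketch of such an entry and the apply pass, assuming a simple schema-to-DDL dict (the index name is invented; the real POST_INGEST_INDEXES lives in api/ingest_lib.py):

```python
# Assumed shape: schema → list of idempotent DDL statements.
POST_INGEST_INDEXES = {
    "webbank_core": [
        "CREATE INDEX IF NOT EXISTS entry_state_time_idx "
        "ON webbank_core.entry (entry_state_id, create_time)",
    ],
}

def apply_post_ingest_indexes(conn, schema: str) -> None:
    """Run after every pipeline.run(); IF NOT EXISTS makes re-runs free."""
    with conn.cursor() as cur:
        for ddl in POST_INGEST_INDEXES.get(schema, []):
            cur.execute(ddl)
    conn.commit()
```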

Current indexes (as of 2026-05-08):

| Schema | Index | Reason |
| --- | --- | --- |
| aml_aml | aml_retrospective(create_at) | AML throughput — 2.7 GB table |
| webbank_core | entry(entry_state_id, create_time) | Revenue, active-client, top-tier metrics |
| webbank_core | payment(id) | Join from entry.payment_id — no PK index in dlt tables |
| webbank_core | payment(type_id, create_time) | Outgoing approval-rate metric |
| webbank_core | client_profile(id) | Join from payment.sender_profile_id |
| webbank_core | payment_state_log(state_id, timestamp) | Processing-time metrics |
| webbank_core | payment_state_log(payment_id) | Join from filtered payment set |
| webbank_core | client_profile_state_log(client_profile_id, profile_state, timestamp) | Approval-rate cohort join |

Before/after: slowest metrics dropped from 4–10s per query to under 0.4s.

Replace-strategy tables — hidden ingest cost

dlt falls back to replace (truncate + full reload) for any table with no PK and no recognised incremental cursor column. On every ingest tick, these tables are wiped and fully re-ingested from the source — even if nothing changed.

The problem: replace tables burn ingest time proportional to row count every hour. The cost is invisible in normal operation but becomes the bottleneck when a replace table grows or a new metric references one.

What to do when landing a new source:

  1. After the first ingest, check _dlt_loads or run EXPLAIN on a metric query to see which tables went through replace.
  2. For each replace table: does a PK or cursor column exist in the source that dlt didn't detect (different naming, composite key)? If yes, add an override to INCREMENTAL_PRIORITY in api/ingest_lib.py (a hedged sketch of the override shape follows this list).
  3. If no cursor exists but the table is large and changes slowly, evaluate a daily schedule for that source (rather than hourly).
  4. Document the outcome in sources: which tables are intentionally replace (small lookup tables — acceptable), which were fixed, which are blocked on a source-side schema change.
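
A hypothetical override entry for step 2; the real INCREMENTAL_PRIORITY structure is not documented here, so both the key format and the cursor name are illustrative:

```python
# Assumed shape: "schema.table" → cursor column dlt should use for incremental
# loads instead of falling back to replace.
INCREMENTAL_PRIORITY = {
    "webbank_core.payment_state_log": "timestamp",
}
```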

Metric SQL rules

These apply before any value_sql or history_sql is committed. The metrics-discovery skill enforces them as a mandatory gate.

No correlated subqueries. Any (SELECT ... WHERE t.id = outer.id) pattern must be rewritten as a JOIN + conditional aggregation (MIN(CASE WHEN ...) etc.) before merging. Correlated subqueries execute once per outer row — on a 2M-row table that's 2M separate lookups. The query will time out.
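
To make the rewrite concrete, a hypothetical before/after (table and column names invented; only the pattern matters):

```python
# BEFORE: correlated subquery, executed once per outer payment row.
correlated_sql = """
SELECT p.id,
       (SELECT MIN(l.timestamp)
        FROM payment_state_log l
        WHERE l.payment_id = p.id AND l.state_id = 1) AS first_approved
FROM payment p
"""

# AFTER: one JOIN + conditional aggregation, a single hash-join pass.
rewritten_sql = """
SELECT p.id,
       MIN(CASE WHEN l.state_id = 1 THEN l.timestamp END) AS first_approved
FROM payment p
LEFT JOIN payment_state_log l ON l.payment_id = p.id
GROUP BY p.id
"""
```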

Timing gate: value_sql must complete in under 5s on live data. Over 5s is a structural problem — missing index or correlated subquery — that must be fixed before the metric ships. Run the SQL directly on marts-db before opening a PR.

EXPLAIN check for slow queries. If the query is slow, run EXPLAIN (ANALYZE, BUFFERS). Look for Seq Scan on large tables (add an index) or Nested Loop over a large outer set (rewrite as Hash Join by restructuring the query).

Statement timeout. api/metrics_compute.py sets SET LOCAL statement_timeout = '60s' before every value_sql execution (configurable via METRIC_STATEMENT_TIMEOUT). A metric that hits the timeout has a structural problem — raise the timeout only after you've confirmed the SQL is correct and the table simply needs more time, never as a first resort.
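
A sketch of how the guard can be applied per execution; the env var name comes from the prose above, the surrounding code is assumed rather than quoted from api/metrics_compute.py:

```python
import os

def run_value_sql(conn, value_sql: str):
    timeout = os.environ.get("METRIC_STATEMENT_TIMEOUT", "60s")
    with conn:  # transaction scope; SET LOCAL expires on commit/rollback
        with conn.cursor() as cur:
            # set_config(..., is_local=true) behaves like SET LOCAL
            cur.execute("SELECT set_config('statement_timeout', %s, true)", (timeout,))
            cur.execute(value_sql)
            return cur.fetchone()[0]
```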

These longer-term items are tracked in the roadmap — (d) metric ↔ cut deep-links, (e) auto-generated /docs content from the registries, (f) alerting, (g) automated metrics discovery, (h) starred metrics and watchlists. None of them block landing new metrics today.

Decision log

Every settled architectural decision is recorded in decisions. When you hit a "why is it like this" question, look there before re-litigating.