Decision log

Every settled architectural decision, with the reason it was settled.

Decisions on this page are settled and not up for reopening unless the underlying constraint changes. Format: date, decision, reason. Append-only — when a decision is superseded, add a new entry that notes which prior entry it replaces; do not edit the historical entry.

2026-04-24 (offsite)
Decision: Heartbeat aggregates data from existing systems; we do not consolidate tools.
Reason: "Freeze" decision from the offsite — aggregate, not replace.

2026-04-24 (offsite)
Decision: OKR system depends on Heartbeat metrics; OKR work is deferred until Heartbeat produces stable data.
Reason: Without trustworthy metrics, OKRs hit the same data deficit; sequencing matters.

2026-04-28
Decision: No FDW (foreign data wrapper) for source access.
Reason: Couples us to the chosen Postgres cluster; doesn't survive a future BigQuery / SaaS source.

2026-04-28
Decision: Marts in a writable DB physically separate from source.
Reason: Source is read-only; ingest / refresh writes need somewhere to land.

2026-04-29
Decision: db/metrics.sql is the single source of truth for the registry, enforced by DB trigger.
Reason: Stops drift between dead seed files, prod INSERTs, and narrative docs.

2026-04-29
Decision: Old docs/satchel_analytics_pipeline_spec.md (MySQL/Airbyte/BigQuery training spec) is obsolete.
Reason: Nothing in it reflects what we built or where we're going. Deleted.

2026-04-29
Decision: Ingest stack: dlt (per-source Python pipelines invoked through one engine).
Reason: Surveyed the field (per-source Python scripts, dlt, Airbyte/Fivetran, Dagster, Databricks/Snowflake, Cube, ClickHouse). dlt is the only tool whose single job is "land 15 heterogeneous sources into a Postgres with retry/state/schema-drift handled" — at our scale (one VM, daily-ish refresh, banking data that should not leave the perimeter) the alternatives are either too heavy (managed ETL + warehouse) or too thin (per-script).

2026-04-29
Decision: Compute stack: Python-first with results cached in heartbeat.metric_history.
Reason: dbt was overengineering at ~50-metric scale and would have left half of MVP-4 (BO API, Zoho, GA) outside its reach anyway. Python compute pushes aggregation down via SQL or an API client per metric, treats Postgres and SaaS uniformly, and removes the cross-DB JOIN problem dbt couldn't solve. Replaces the previous value_sql-on-page-load pattern.

2026-04-29
Decision: Registry reset. All 18 prior metrics removed.
Reason: They were research-phase hypotheses written against synthetic data and a now-deleted dbt staging/marts layer. New metrics enter via metrics-discovery once dlt has landed real source data. The registry table, write guard, and methodology comments stay.

2026-04-29
Decision: Old dbt layer removed (dbt_satchel/, scripts/sync_webbank.py, dbt service in compose).
Reason: Direct consequence of the two decisions above.

2026-04-30
Decision: Heartbeat is not built on top of Jasper Reports Server (or any embedded-analytics platform). Jasper stays as the team's reporting tool; Heartbeat is a separate control plane. Pithy formulation: Jasper = reports for people who know what they want to see; Heartbeat = signals for cases where nobody knew to look.
Reason: Jasper renders pixel-perfect reports on pull (someone opens a report) — that's a different operational model from Heartbeat's push (silent-event alerting). Concrete blockers if we tried to host Heartbeat inside Jasper:

  • no native 5-tuple metric contract / rolling benchmark / red-amber-green status engine — we'd reinvent it on top;
  • Jasper is JDBC-first; the MVP source mix (Zoho OAuth, GA aggregate API, webhooks, Fireblocks, Ondato) is not its sweet spot;
  • iteration requires a tech-support ticket, which is exactly the latency problem Heartbeat exists to fix;
  • TV-cockpit constraints (no auth, auto-refresh, graceful degradation, readable at 3 m) fight an embed;
  • the later MCP surface needs the unified metric_registry + metric_history model, not parsed Jasper outputs;
  • commercial lock-in on a strategic CEO/OKR/alerting layer.

2026-04-30
Decision: Jasper (and other report tools) are sources of metric definitions, not sources of data. Heartbeat does not ingest Jasper outputs (PDF/CSV/rendered reports). Instead, we treat the SQL templates behind Jasper reports as vetted reference logic, and the rendered reports as a validation set against metric_history.
Reason: Jasper sits over the same Postgres clusters Heartbeat already reads; ingesting its rendered outputs would add a fragile layer (manual Excel post-processing) on top of data we already access at the source. What's actually valuable is the accumulated domain context in each SQL template (WHERE clauses, exclusions, joins) — that becomes input to metrics-discovery and a regression check against our own compute.

2026-04-30
Decision: Cockpit reads only from metric_history; a scheduled compute runner is the only writer. Live value_sql execution on API request is removed. scripts/compute_metrics.py runs on cron, writes (metric_id, d, v, computed_at, source_as_of, status, error) per metric. The API does a DISTINCT ON (metric_id) ... ORDER BY computed_at DESC lookup; if no row exists for a metric, it surfaces as "no data yet" (amber).
Reason: Two requirements forced the move: (a) the UI must show an honest "last update" per metric — without a separate compute step the only timestamp available is "now()"; (b) value_sql cost was leaking into every 30-second poll.
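
A minimal sketch of that lookup, assuming psycopg2; the column list comes from this entry, and anything beyond it (the helper name, the returned dict shape) is illustrative:

```python
import psycopg2

LATEST_ROWS_SQL = """
    SELECT DISTINCT ON (metric_id)
           metric_id, d, v, computed_at, source_as_of, status, error
    FROM heartbeat.metric_history
    ORDER BY metric_id, computed_at DESC
"""

def latest_rows(dsn: str) -> dict:
    # DISTINCT ON keeps exactly one row per metric_id: the first in
    # ORDER BY, i.e. the most recently computed. Metrics with no row
    # simply do not appear; the API renders those as "no data yet".
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(LATEST_ROWS_SQL)
        return {row[0]: row for row in cur.fetchall()}
```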

2026-04-30
Decision: Source freshness is taken from <schema>._dlt_loads.inserted_at (max where status='completed'), where <schema> is the first dotted segment of the metric's source_mart. Stored on each metric_history row as source_as_of.
Reason: dlt already records every successful landing in a per-schema _dlt_loads table — we don't need a parallel freshness ledger. This gives an automatic lower-bound "data is at least this fresh" for every metric without per-metric configuration.
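
The rule reduces to one query per schema. A sketch, again assuming psycopg2, following the predicate exactly as stated above; the helper name and the example source_mart in the comment are illustrative:

```python
def source_as_of(conn, source_mart: str):
    # e.g. "webbank.core_payment" -> schema "webbank", which owns _dlt_loads.
    schema = source_mart.split(".")[0]
    with conn.cursor() as cur:
        cur.execute(
            f'SELECT max(inserted_at) FROM "{schema}"._dlt_loads '
            "WHERE status = %s",
            ("completed",),
        )
        return cur.fetchone()[0]  # None if the schema has never loaded
```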

2026-04-30
Decision: Staleness is a separate visual signal from status, not folded into red/amber/green. The status dot reflects value vs target/limit/benchmark. The freshness timestamp is coloured independently — amber when source_as_of exceeds the period-derived expectation (24h: >36h; 30d: >7d; snapshot: >24h), red at 2× that.
Reason: Conflating "the value is bad" with "the pipeline is broken" hides the actual problem. An operator looking at red needs to know which one it is — wrong number, or stale source.

2026-04-30
Decision: Refresh is one path: ingest + compute behind a single endpoint (POST /api/metrics/recompute). A request resolves the requested metric_ids to distinct dlt schemas, runs the dlt pipeline serially per schema (incremental where possible), then recomputes every metric whose source schema is in that set. Cron uses the same endpoint via scripts/refresh_all.py.
Reason: Two earlier attempts split this. (a) If we kept compute as the only "refresh", source_as_of never moves — the operator's freshness signal becomes a lie. (b) If ingest were a separate cron with no UI hook, a click on a stale metric would give the user no way to actually pull fresh data.

2026-04-30
Decision: Single-flight refresh, globally coalesced. Only one ingest+compute job runs at a time. A POST during an active job returns the existing job_id (coalesced=true); a cron tick that arrives mid-job is silently absorbed. Job state is held in-process; on api restart, in-flight jobs are dropped.
Reason: (a) The VM tips over with parallel dlt extracts; single-flight is the cheapest correctness guarantee. (b) Multiple concurrent jobs racing to UPSERT the same (metric_id, current_date) history row would produce a value whose source_as_of and computed_at came from different jobs — a confusing audit trail.
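
The coalescing logic is small. A sketch of the in-process, single-flight shape this entry describes, assuming a run_job callable; every name here is illustrative rather than quoted from the codebase:

```python
import threading
import uuid

_lock = threading.Lock()
_active_job_id = None

def start_refresh(run_job) -> dict:
    """Start an ingest+compute job, or coalesce onto the one in flight."""
    global _active_job_id
    with _lock:
        if _active_job_id is not None:
            # Second POST (or a cron tick) while a job runs: same job_id back.
            return {"job_id": _active_job_id, "coalesced": True}
        _active_job_id = job_id = str(uuid.uuid4())

    def _run():
        global _active_job_id
        try:
            run_job(job_id)
        finally:
            with _lock:
                _active_job_id = None

    # In-process state only: a restart of the api container drops the job.
    threading.Thread(target=_run, daemon=True).start()
    return {"job_id": job_id, "coalesced": False}
```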

2026-04-30
Decision: api/ingest_lib.py is the dlt engine; scripts/ingest.py is a thin CLI wrapper. The same run_one(spec, schema) is called from cron, from the recompute endpoint, and from manual ad-hoc shell runs. dlt deps live in the api container; SOURCE_PG_* credentials are passed via docker-compose.prod.yml env.
Reason: A previous shape had separate scripts/dlt_<source>.py per source. With one engine driven by a SOURCES table of SourceSpec(db, schemas, pipeline_name), adding a source is a one-line registry edit, not a new file with copy-pasted boilerplate.
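
A sketch of the registry shape, using the SourceSpec fields named in this entry; the dict layout and the example row are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SourceSpec:
    db: str                    # source database name
    schemas: tuple[str, ...]   # schemas to land from that database
    pipeline_name: str         # dlt pipeline identifier

# Adding a source is one line here, not a new script.
SOURCES = {
    "webbank": SourceSpec(db="webbank", schemas=("core",), pipeline_name="webbank"),
}
```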

2026-04-30
Decision: Source freshness threshold is per-period, not per-metric. period=24h ⇒ amber > 36h, red > 72h. period=30d ⇒ amber > 7d, red > 14d. period=snapshot ⇒ amber > 24h, red > 48h.
Reason: The freshness expectation is a property of the metric's period, not of the individual metric. Encoding it on the period keeps the registry small; if a specific metric ever needs an override, it earns its own column then.
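
Spelled out as data, the whole policy is a three-entry table. A sketch; the names and the exact representation in the codebase are assumptions, but the thresholds are the ones listed above (red is consistently 2× amber):

```python
from datetime import timedelta

# (amber, red) per period.
FRESHNESS_THRESHOLDS = {
    "24h": (timedelta(hours=36), timedelta(hours=72)),
    "30d": (timedelta(days=7), timedelta(days=14)),
    "snapshot": (timedelta(hours=24), timedelta(hours=48)),
}

def freshness_colour(period: str, age: timedelta) -> str:
    amber, red = FRESHNESS_THRESHOLDS[period]
    if age > red:
        return "red"
    if age > amber:
        return "amber"
    return "ok"
```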

2026-04-30
Decision: webbank.core.lock_table is in SKIP_TABLES.
Reason: Liquibase advisory lock; no business data.

2026-04-30
Decision: Snapshot metrics get a history_sql that plots the worst moment per day, not end-of-day or last-observed. For staleness/age-style metrics each historical point is the maximum staleness reached during that day.
Reason: Empty sparklines on the cockpit erode trust faster than imperfect ones. End-of-day or instantaneous samples hide the actual operational pain (a 30-hour FX gap that closed at 16:01 is invisible at midnight); the worst-of-day rule keeps the alarming peaks on the chart.

2026-05-04
Decision: Long-running backfill ingest runs as a one-off container (docker compose run --rm --no-deps --name heartbeat_ingest api ...), not via docker exec heartbeat_api. The dlt working dir lives on the named volume dlt_data mounted at /var/dlt.
Reason: First webbank backfill (12 GB, ~2.5 h) was killed twice — once by docker compose up -d --build api from a parallel deploy session (writable layer wiped along with mid-flight extract/normalize state), once by a VM reboot. Decoupling ingest from the api container's lifecycle and persisting dlt state on a volume makes long backfills resumable. The hourly cron keeps using docker exec heartbeat_api because incremental refreshes are seconds, not hours.

2026-05-04
Decision: VM upgraded from t3.small (2 GB) to t3.large (8 GB), eu-north-1.
Reason: t3.small OOM-killed webbank's core.payment / core.entry / payment_state_log extracts on the first attempt. t3.large gives ~6 GB headroom during normalize+load, comfortable for the current 6 sources and adequate for the planned 10+. Single-flight ingest constraint (decision 2026-04-30) was kept — even with 8 GB, parallel pipelines would compete with api/web/marts-db for the same RAM. Cost delta: +€24/mo over t3.small.

2026-05-05
Decision: Registry split — one file per metric. The monolithic db/metrics.sql (2940 lines, 28 metrics) is replaced by db/metrics/<id>.sql per metric, plus shared scaffolding (_schema.sql / _pre.sql / _post.sql) and db/build_metrics.sh, which concatenates them plus an auto-generated GC into the assembled SQL stream. Adding a metric = drop in a file; removing one = delete a file. The DB-level write-guard semantics are unchanged.
Reason: The single file outgrew its review usefulness (a one-line edit drowned in 3000-line diffs; frequent merge conflicts in the metrics-discovery flow). Per-metric files let GitLab MR diffs surface only the metric in question. The build script's GC-from-filenames removes the dual-list maintenance the previous manifest required.

2026-05-07
Decision: Audit cuts are first-class entities — one Python module per cut under api/audit/cuts/<cut_id>.py with META + compute(conn, *, window). New tables heartbeat.audit_cut_registry (auto-synced from disk by refresh_audit) and heartbeat.audit_cut_snapshot (per (cut_id, window_key)) replace the previous monolithic audit_snapshot payload. Generic routes: GET /api/audit/cuts/{id}, GET /api/audit/pages/{page}, POST /api/audit/cuts/{id}/refresh, GET /api/audit/registry. The frontend gets a cut_id → React component registry; the three audit pages drop from ~900 lines each to ~30.
Reason: The previous shape (one <page>_anatomy.py module per page emitting one big payload, three hardcoded GET endpoints, ~900-line hand-rolled pages) was non-extensible: adding a chart meant editing four files in lockstep, the cuts had no individual ownership / verification / methodology fields, and there was no /api surface to drive a future /docs page or alert routing. Mirroring db/metrics/<id>.sql makes "add an audit cut" feel like "add a metric". The window column needed quoting in PostgreSQL, so it ships as window_key everywhere; the Python kwarg stays window.
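
A minimal sketch of one such module, following the META + compute(conn, *, window) contract named in this entry; the cut id, the extra META fields, and the placeholder query are illustrative:

```python
# api/audit/cuts/example_cut.py (hypothetical cut)

META = {
    "cut_id": "example_cut",
    "title": "Example cut",
    "methodology": "How the numbers below are derived, in one paragraph.",
}

def compute(conn, *, window):
    # The kwarg stays `window`; rows persist under window_key because
    # "window" is a reserved word in PostgreSQL.
    with conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM heartbeat.metric_history")
        (total,) = cur.fetchone()
    return {"window": window, "rows": total}
```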

2026-05-07
Decision: Shared AppShell.tsx for all 4 pages. Header, primary nav (Cockpit | Audit), right-side icon links (Docs, Feedback), freshness chip, and ThemeToggle live in dashboard/app/_shell/AppShell.tsx. The page-specific toolbar is passed via a prop. The "← Heartbeat" back-link is removed — primary nav makes it redundant.
Reason: Each of the 4 pages had ~50 lines of hand-rolled header that drifted out of sync. With Docs and Feedback being net-new affordances, fixing this once was cheaper than cargo-culting the divergence into two more places. Docs/Feedback live as right-side icons (not segmented items) because they're cross-cutting affordances, not sibling sections.

2026-05-07
Decision: verification_state + email owner on metric_registry and audit_cut_registry. Two new columns each, default verification_state='unverified', default owner=NULL. The metric_registry_guard trigger's allow-list is extended to include both, so operators can verify / re-assign without a redeploy. PATCH endpoints /api/metrics/{id}/verification and /api/audit/cuts/{id}/verification accept either field independently and validate owner against a loose RFC-5321 regex. UI: dashed border + small "unverified" pill on cockpit code chips and audit Section cards. The inline owner field auto-saves on blur/Enter when valid.
Reason: Heartbeat metrics and audit cuts both ship as "unverified by default" — adding a number to the cockpit doesn't mean someone has cross-checked it against the source-of-truth count yet. Without a visible state, drift between "the SQL runs" and "the team trusts the number" is invisible. The dashed-border treatment makes that gap legible at a glance. The email validator on owner reserves the right shape for §8.1 alert routing — a Telegram/Slack channel routes by owner.email, so it has to be an actual email.
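
The owner check is small enough to show. A sketch in the spirit of the loose regex this entry describes; the exact pattern in the codebase is not quoted here, so this one is a stand-in:

```python
import re

# Loose shape check: something@something.tld, no whitespace, one "@".
OWNER_EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_owner(owner):
    # NULL clears the owner; anything else must look like a deliverable
    # address, since alert routing later keys on owner.email.
    return owner is None or OWNER_EMAIL_RE.fullmatch(owner) is not None
```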

2026-05-07
Decision: /docs (fumadocs) is the single source of truth for product, lifecycle, sources, decisions, and operations. ARCHITECTURE.md is retired and its content migrated into /docs. cheatsheet.md and docs/pending-ingest-sources.md are folded into operations/access and sources respectively. CLAUDE.md and README become thin pointers to /docs. Any new feature, product decision, user scenario, or behavioural change updates the relevant page in /docs in the same PR as the code. [#docs-as-sot]
Reason: Three sources of architectural truth (ARCHITECTURE.md, CLAUDE.md, scattered notes) drifted independently. With /docs rendered in the same dashboard the user already opens, written in MDX rather than a single 450-line file, indexed for LLMs via llms.txt/llms.mdx, and discoverable via the Docs icon in AppShell, there is exactly one place a human or an agent goes to learn how Heartbeat works. The principle "config-truth in the file, runtime-truth in the table" already governs metrics; this decision applies the same shape to product knowledge — the file is the truth, every other surface mirrors it.

2026-05-07
Decision: Metric lifecycle has 10 named stages, with unverified treated as a first-class long-lived state rather than a quality gate. See metric lifecycle. Verification is event-driven (typically the first alert firing), not procedural.
Reason: We were treating unverified as embarrassing and verified as the goal, which would have pressured every new metric through a full owner cross-check before it could land — the opposite of what metrics-discovery was built for. Naming the lifecycle and making unverified the expected long-term state for most tiles aligns the social cost with the actual one: an unverified number that nobody is paged on costs nothing; an unverified number that fires a false alert costs an owner's time. Verification should happen exactly when that cost is about to be paid.

2026-05-08
Decision: Compute phase is parallelised with ThreadPoolExecutor(max_workers=4), each worker on its own DB connection. Ingest remains serial (single-flight).
Reason: A full refresh of 31 metrics took 8–12 min before this change. Parallelising only compute (not ingest) avoids the RAM ceiling and UPSERT races that make parallel ingest unsafe, while still getting ~4× throughput on the compute phase.
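
A sketch of that shape, with connect and compute_one passed in as callables; both names are illustrative, and the real signatures are not quoted here:

```python
from concurrent.futures import ThreadPoolExecutor

def compute_all(metrics, connect, compute_one, workers=4):
    # One DB connection per worker thread; a shared connection would
    # serialise the queries and reintroduce the bottleneck.
    def worker(metric):
        with connect() as conn:
            return compute_one(conn, metric)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(worker, metrics))
```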

2026-05-08
Decision: POST_INGEST_INDEXES dict in api/ingest_lib.py — B-tree indexes applied automatically (idempotent, IF NOT EXISTS) after every pipeline.run(). dlt only creates _dlt_id unique indexes by default; every date-range and join-column filter on multi-GB tables would otherwise do a full seq scan.
Reason: First discovered when application_approval_rate timed out at 60s. Root cause: entry (2.6 GB), payment (1.76 GB), payment_state_log (1.4 GB), aml_retrospective (2.7 GB) had no useful indexes. Adding them dropped per-query time from 4–10s to under 0.4s. Rule: when a new metric's value_sql touches a table over 200 MB, verify the WHERE/JOIN columns have indexes before shipping the metric.
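
A sketch of the index pass, assuming the dict maps "schema.table" to (index_name, column) pairs; the real layout in api/ingest_lib.py is not quoted here, and the example row is hypothetical:

```python
POST_INGEST_INDEXES = {
    "webbank_core.payment": [("ix_payment_created_at", "created_at")],
}

def apply_post_ingest_indexes(conn):
    with conn.cursor() as cur:
        for table, specs in POST_INGEST_INDEXES.items():
            for index_name, column in specs:
                # IF NOT EXISTS makes the pass idempotent across ticks.
                cur.execute(
                    f"CREATE INDEX IF NOT EXISTS {index_name} "
                    f"ON {table} ({column})"
                )
    conn.commit()
```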

2026-05-08
Decision: replace-strategy tables are a hidden ingest cost — they are truncated and fully reloaded on every ingest tick regardless of change volume. Audit all replace-strategy tables; switch to merge/append where a source PK or cursor column exists. Intentional replace tables (small lookup tables) are acceptable and must be documented.
Reason: dlt falls back to replace for any table with no PK and no recognised incremental cursor. The cost is invisible in normal operation but becomes the bottleneck when a replace table grows. Discovered during the 2026-05-08 hardening session; 4 tables were already fixed to merge.
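
In dlt terms the fix is a per-resource hint. A sketch using dlt's apply_hints on the sql_database source; the table, key, and cursor column names are illustrative, and connection credentials are assumed to come from dlt config:

```python
import dlt
from dlt.sources.sql_database import sql_database

source = sql_database(schema="core").with_resources("payment")

# Merge on the source PK and only pull rows past the incremental cursor,
# instead of truncate-and-reload on every tick.
source.payment.apply_hints(
    primary_key="id",
    write_disposition="merge",
    incremental=dlt.sources.incremental("updated_at"),
)
```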

2026-05-08
Decision: Per-metric statement_timeout of 60s (SET LOCAL statement_timeout = '60s' before each value_sql), configurable via the METRIC_STATEMENT_TIMEOUT env var.
Reason: Prevents one hanging query from blocking the entire compute queue. The value can be raised via the env var; if a metric legitimately exceeds 60s it is a structural problem (missing index or correlated subquery) that must be fixed before the timeout is relaxed.
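
A sketch of the guarded execution, assuming psycopg2, where each execute runs inside an open transaction (which is what scopes SET LOCAL); the helper name is illustrative:

```python
import os

TIMEOUT = os.environ.get("METRIC_STATEMENT_TIMEOUT", "60s")

def run_value_sql(conn, value_sql):
    with conn.cursor() as cur:
        # SET LOCAL lasts until the enclosing transaction ends, so the
        # timeout applies to this metric's query and nothing after it.
        cur.execute(f"SET LOCAL statement_timeout = '{TIMEOUT}'")
        cur.execute(value_sql)
        return cur.fetchone()
```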

2026-05-08
Decision: Targeted ingest for per-metric refresh: value_sql is regex-parsed to extract schema.table references; only those tables are ingested. A full webbank.core refresh went from loading all 186 tables to loading 4.
Reason: Reduces per-metric UI refresh latency from minutes to seconds. Full-schema ingest is still used for the hourly cron; targeted ingest is only for the on-demand per-metric path.
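
A sketch of the extraction step, assuming plain unquoted identifiers; the pattern is deliberately naive and the real parser may handle more:

```python
import re

# Matches dotted pairs like core.payment; it will also catch alias.column,
# so callers should intersect the result with known ingestable tables.
TABLE_RE = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)\.([A-Za-z_][A-Za-z0-9_]*)\b")

def referenced_tables(value_sql):
    return set(TABLE_RE.findall(value_sql))
```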

2026-05-08
Decision: Cockpit background poll is suppressed while a job is in flight. useCockpitMetrics checks active-job from the query cache at refetch time and skips the refetch if a job is running. A single atomic invalidation fires when the job finishes.
Reason: The 30s background poll was firing during an active refresh, which showed a mixed old/new state mid-job on the TV screen. The fix makes the cockpit update atomically rather than streaming partial updates.

How to add an entry

Decisions land in this log when:

  • a non-obvious choice was made (tooling, architectural shape, scope boundary), and
  • reverting it later would cost work (someone would re-litigate the same trade-offs from scratch).

What does not belong here: bug fixes, naming choices that are already obvious from the code, or "we used library X" when the library is the only sane option for the job.

Format: append an entry, with the date in YYYY-MM-DD, a short imperative decision, and a reason that names the constraint or incident behind it. If the decision supersedes an earlier one, name the prior date in the reason — do not edit the historical entry.