Decision log

Every settled architectural decision, with the reason it was settled.

Decisions on this page are settled and not up for reopening unless the underlying constraint changes. Format: date, decision, reason. Append-only — when a decision is superseded, add a new entry that notes which prior entry it replaces; do not edit the historical entry.

2026-04-24 (offsite)
Decision: Heartbeat aggregates data from existing systems; we do not consolidate tools.
Reason: "Freeze" decision from the offsite — aggregate, not replace.

2026-04-24 (offsite)
Decision: OKR system depends on Heartbeat metrics; OKR work is deferred until Heartbeat produces stable data.
Reason: Without trustworthy metrics, OKRs hit the same data deficit; sequencing matters.

2026-04-28
Decision: No FDW (foreign data wrapper) for source access.
Reason: Couples us to the chosen Postgres cluster; doesn't survive a future BigQuery / SaaS source.

2026-04-28
Decision: Marts in a writable DB physically separate from source.
Reason: Source is read-only; ingest / refresh writes need somewhere to land.

2026-04-29
Decision: db/metrics.sql is the single source of truth for the registry, enforced by DB trigger.
Reason: Stops drift between dead seed files, prod INSERTs, and narrative docs.

2026-04-29
Decision: Old docs/satchel_analytics_pipeline_spec.md (MySQL/Airbyte/BigQuery training spec) is obsolete.
Reason: Nothing in it reflects what we built or where we're going. Deleted.

2026-04-29
Decision: Ingest stack: dlt (per-source Python pipelines invoked through one engine).
Reason: Surveyed the field (per-source Python scripts, dlt, Airbyte/Fivetran, Dagster, Databricks/Snowflake, Cube, ClickHouse). dlt is the only tool whose single job is "land 15 heterogeneous sources into a Postgres with retry/state/schema-drift handled" — at our scale (one VM, daily-ish refresh, banking data that should not leave the perimeter) the alternatives are either too heavy (managed ETL + warehouse) or too thin (per-script).

2026-04-29
Decision: Compute stack: Python-first with results cached in heartbeat.metric_history.
Reason: dbt was overengineering at ~50-metric scale and would have left half of MVP-4 (BO API, Zoho, GA) outside its reach anyway. Python compute pushes aggregation down via SQL or an API client per metric, treats Postgres and SaaS uniformly, and removes the cross-DB JOIN problem dbt couldn't solve. Replaces the previous value_sql-on-page-load pattern.

2026-04-29
Decision: Registry reset. All 18 prior metrics removed.
Reason: They were research-phase hypotheses written against synthetic data and a now-deleted dbt staging/marts layer. New metrics enter via metrics-discovery once dlt has landed real source data. The registry table, write guard, and methodology comments stay.

2026-04-29
Decision: Old dbt layer removed (dbt_satchel/, scripts/sync_webbank.py, dbt service in compose).
Reason: Direct consequence of the two decisions above.

2026-04-30
Decision: Heartbeat is not built on top of Jasper Reports Server (or any embedded-analytics platform). Jasper stays as the team's reporting tool; Heartbeat is a separate control plane. Pithy formulation: Jasper = reports for people who know what they want to see; Heartbeat = signals for cases where nobody knew to look.
Reason: Jasper renders pixel-perfect reports on pull (someone opens a report) — that's a different operational model from Heartbeat's push (silent-event alerting). Concrete blockers if we tried to host Heartbeat inside Jasper:

  • no native 5-tuple metric contract / rolling benchmark / red-amber-green status engine — we'd reinvent it on top;
  • Jasper is JDBC-first; the MVP source mix (Zoho OAuth, GA aggregate API, webhooks, Fireblocks, Ondato) is not its sweet spot;
  • iteration requires a tech-support ticket, which is exactly the latency problem Heartbeat exists to fix;
  • TV-cockpit constraints (no auth, auto-refresh, graceful degradation, readable at 3 m) fight an embed;
  • the later MCP surface needs the unified metric_registry + metric_history model, not parsed Jasper outputs;
  • commercial lock-in on a strategic CEO/OKR/alerting layer.

2026-04-30
Decision: Jasper (and other report tools) are sources of metric definitions, not sources of data. Heartbeat does not ingest Jasper outputs (PDF/CSV/rendered reports). Instead, we treat the SQL templates behind Jasper reports as vetted reference logic, and the rendered reports as a validation set against metric_history.
Reason: Jasper sits over the same Postgres clusters Heartbeat already reads; ingesting its rendered outputs would add a fragile layer (manual Excel post-processing) on top of data we already access at the source. What's actually valuable is the accumulated domain context in each SQL template (WHERE clauses, exclusions, joins) — that becomes input to metrics-discovery and a regression check against our own compute.

2026-04-30
Decision: Cockpit reads only from metric_history; a scheduled compute runner is the only writer. Live value_sql execution on API request is removed. scripts/compute_metrics.py runs on cron, writes (metric_id, d, v, computed_at, source_as_of, status, error) per metric. The API does a DISTINCT ON (metric_id) ... ORDER BY computed_at DESC lookup; if no row exists for a metric, it surfaces as "no data yet" (amber).
Reason: Two requirements forced the move: (a) the UI must show an honest "last update" per metric — without a separate compute step the only timestamp available is "now()"; (b) value_sql cost was leaking into every 30-second poll.
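
A minimal sketch of that lookup, assuming psycopg2; the column list comes from this entry, and anything beyond it (the helper name, the returned dict shape) is illustrative:

```python
import psycopg2

LATEST_ROWS_SQL = """
    SELECT DISTINCT ON (metric_id)
           metric_id, d, v, computed_at, source_as_of, status, error
    FROM heartbeat.metric_history
    ORDER BY metric_id, computed_at DESC
"""

def latest_rows(dsn: str) -> dict:
    # DISTINCT ON keeps exactly one row per metric_id: the first in
    # ORDER BY, i.e. the most recently computed. Metrics with no row
    # simply do not appear; the API renders those as "no data yet".
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(LATEST_ROWS_SQL)
        return {row[0]: row for row in cur.fetchall()}
```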

2026-04-30
Decision: Source freshness is taken from <schema>._dlt_loads.inserted_at (max where status='completed'), where <schema> is the first dotted segment of the metric's source_mart. Stored on each metric_history row as source_as_of.
Reason: dlt already records every successful landing in a per-schema _dlt_loads table — we don't need a parallel freshness ledger. This gives an automatic lower-bound "data is at least this fresh" for every metric without per-metric configuration.
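
The rule reduces to one query per schema. A sketch, again assuming psycopg2, following the predicate exactly as stated above; the helper name and the example source_mart in the comment are illustrative:

```python
def source_as_of(conn, source_mart: str):
    # e.g. "webbank.core_payment" -> schema "webbank", which owns _dlt_loads.
    schema = source_mart.split(".")[0]
    with conn.cursor() as cur:
        cur.execute(
            f'SELECT max(inserted_at) FROM "{schema}"._dlt_loads '
            "WHERE status = %s",
            ("completed",),
        )
        return cur.fetchone()[0]  # None if the schema has never loaded
```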

2026-04-30
Decision: Staleness is a separate visual signal from status, not folded into red/amber/green. The status dot reflects value vs target/limit/benchmark. The freshness timestamp is coloured independently — amber when source_as_of exceeds the period-derived expectation (24h: >36h; 30d: >7d; snapshot: >24h), red at 2× that.
Reason: Conflating "the value is bad" with "the pipeline is broken" hides the actual problem. An operator looking at red needs to know which one it is — wrong number, or stale source.

2026-04-30
Decision: Refresh is one path: ingest + compute behind a single endpoint (POST /api/metrics/recompute). A request resolves the requested metric_ids to distinct dlt schemas, runs the dlt pipeline serially per schema (incremental where possible), then recomputes every metric whose source schema is in that set. Cron uses the same endpoint via scripts/refresh_all.py.
Reason: Two earlier attempts split this. (a) If we kept compute as the only "refresh", source_as_of never moves — the operator's freshness signal becomes a lie. (b) If ingest were a separate cron with no UI hook, a click on a stale metric would give the user no way to actually pull fresh data.

2026-04-30
Decision: Single-flight refresh, globally coalesced. Only one ingest+compute job runs at a time. A POST during an active job returns the existing job_id (coalesced=true); a cron tick that arrives mid-job is silently absorbed. Job state is held in-process; on api restart, in-flight jobs are dropped.
Reason: (a) The VM tips over with parallel dlt extracts; single-flight is the cheapest correctness guarantee. (b) Multiple concurrent jobs racing to UPSERT the same (metric_id, current_date) history row would produce a value whose source_as_of and computed_at came from different jobs — a confusing audit trail.
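
The coalescing logic is small. A sketch of the in-process, single-flight shape this entry describes, assuming a run_job callable; every name here is illustrative rather than quoted from the codebase:

```python
import threading
import uuid

_lock = threading.Lock()
_active_job_id = None

def start_refresh(run_job) -> dict:
    """Start an ingest+compute job, or coalesce onto the one in flight."""
    global _active_job_id
    with _lock:
        if _active_job_id is not None:
            # Second POST (or a cron tick) while a job runs: same job_id back.
            return {"job_id": _active_job_id, "coalesced": True}
        _active_job_id = job_id = str(uuid.uuid4())

    def _run():
        global _active_job_id
        try:
            run_job(job_id)
        finally:
            with _lock:
                _active_job_id = None

    # In-process state only: a restart of the api container drops the job.
    threading.Thread(target=_run, daemon=True).start()
    return {"job_id": job_id, "coalesced": False}
```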

2026-04-30
Decision: api/ingest_lib.py is the dlt engine; scripts/ingest.py is a thin CLI wrapper. The same run_one(spec, schema) is called from cron, from the recompute endpoint, and from manual ad-hoc shell runs. dlt deps live in the api container; SOURCE_PG_* credentials are passed via docker-compose.prod.yml env.
Reason: A previous shape had separate scripts/dlt_<source>.py per source. With one engine driven by a SOURCES table of SourceSpec(db, schemas, pipeline_name), adding a source is a one-line registry edit, not a new file with copy-pasted boilerplate.
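
A sketch of the registry shape, using the SourceSpec fields named in this entry; the dict layout and the example row are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SourceSpec:
    db: str                    # source database name
    schemas: tuple[str, ...]   # schemas to land from that database
    pipeline_name: str         # dlt pipeline identifier

# Adding a source is one line here, not a new script.
SOURCES = {
    "webbank": SourceSpec(db="webbank", schemas=("core",), pipeline_name="webbank"),
}
```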

2026-04-30
Decision: Source freshness threshold is per-period, not per-metric. period=24h ⇒ amber > 36h, red > 72h. period=30d ⇒ amber > 7d, red > 14d. period=snapshot ⇒ amber > 24h, red > 48h.
Reason: The freshness expectation is a property of the metric's period, not of the individual metric. Encoding it on the period keeps the registry small; if a specific metric ever needs an override, it earns its own column then.
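
Spelled out as data, the whole policy is a three-entry table. A sketch; the names and the exact representation in the codebase are assumptions, but the thresholds are the ones listed above (red is consistently 2× amber):

```python
from datetime import timedelta

# (amber, red) per period.
FRESHNESS_THRESHOLDS = {
    "24h": (timedelta(hours=36), timedelta(hours=72)),
    "30d": (timedelta(days=7), timedelta(days=14)),
    "snapshot": (timedelta(hours=24), timedelta(hours=48)),
}

def freshness_colour(period: str, age: timedelta) -> str:
    amber, red = FRESHNESS_THRESHOLDS[period]
    if age > red:
        return "red"
    if age > amber:
        return "amber"
    return "ok"
```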

2026-04-30
Decision: webbank.core.lock_table is in SKIP_TABLES.
Reason: Liquibase advisory lock; no business data.

2026-04-30
Decision: Snapshot metrics get a history_sql that plots the worst moment per day, not end-of-day or last-observed. For staleness/age-style metrics each historical point is the maximum staleness reached during that day.
Reason: Empty sparklines on the cockpit erode trust faster than imperfect ones. End-of-day or instantaneous samples hide the actual operational pain (a 30-hour FX gap that closed at 16:01 is invisible at midnight); the worst-of-day rule keeps the alarming peaks on the chart.

2026-05-04
Decision: Long-running backfill ingest runs as a one-off container (docker compose run --rm --no-deps --name heartbeat_ingest api ...), not via docker exec heartbeat_api. The dlt working dir lives on the named volume dlt_data mounted at /var/dlt.
Reason: First webbank backfill (12 GB, ~2.5 h) was killed twice — once by docker compose up -d --build api from a parallel deploy session (writable layer wiped along with mid-flight extract/normalize state), once by a VM reboot. Decoupling ingest from the api container's lifecycle and persisting dlt state on a volume makes long backfills resumable. The hourly cron keeps using docker exec heartbeat_api because incremental refreshes are seconds, not hours.

2026-05-04
Decision: VM upgraded from t3.small (2 GB) to t3.large (8 GB), eu-north-1.
Reason: t3.small OOM-killed webbank's core.payment / core.entry / payment_state_log extracts on the first attempt. t3.large gives ~6 GB headroom during normalize+load, comfortable for the current 6 sources and adequate for the planned 10+. Single-flight ingest constraint (decision 2026-04-30) was kept — even with 8 GB, parallel pipelines would compete with api/web/marts-db for the same RAM. Cost delta: +€24/mo over t3.small.

2026-05-05
Decision: Registry split — one file per metric. The monolithic db/metrics.sql (2940 lines, 28 metrics) is replaced by db/metrics/<id>.sql per metric, plus shared scaffolding (_schema.sql / _pre.sql / _post.sql) and db/build_metrics.sh, which concatenates them plus an auto-generated GC into the assembled SQL stream. Adding a metric = drop in a file; removing one = delete a file. The DB-level write-guard semantics are unchanged.
Reason: The single file outgrew its review usefulness (a one-line edit drowned in 3000-line diffs; frequent merge conflicts in the metrics-discovery flow). Per-metric files let GitLab MR diffs surface only the metric in question. The build script's GC-from-filenames removes the dual-list maintenance the previous manifest required.

2026-05-07
Decision: Audit cuts are first-class entities — one Python module per cut under api/audit/cuts/<cut_id>.py with META + compute(conn, *, window). New tables heartbeat.audit_cut_registry (auto-synced from disk by refresh_audit) and heartbeat.audit_cut_snapshot (per (cut_id, window_key)) replace the previous monolithic audit_snapshot payload. Generic routes: GET /api/audit/cuts/{id}, GET /api/audit/pages/{page}, POST /api/audit/cuts/{id}/refresh, GET /api/audit/registry. The frontend gets a cut_id → React component registry; the three audit pages drop from ~900 lines each to ~30.
Reason: The previous shape (one <page>_anatomy.py module per page emitting one big payload, three hardcoded GET endpoints, ~900-line hand-rolled pages) was non-extensible: adding a chart meant editing four files in lockstep, the cuts had no individual ownership / verification / methodology fields, and there was no /api surface to drive a future /docs page or alert routing. Mirroring db/metrics/<id>.sql makes "add an audit cut" feel like "add a metric". The window column needed quoting in PostgreSQL, so it ships as window_key everywhere; the Python kwarg stays window.
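
A minimal sketch of one such module, following the META + compute(conn, *, window) contract named in this entry; the cut id, the extra META fields, and the placeholder query are illustrative:

```python
# api/audit/cuts/example_cut.py (hypothetical cut)

META = {
    "cut_id": "example_cut",
    "title": "Example cut",
    "methodology": "How the numbers below are derived, in one paragraph.",
}

def compute(conn, *, window):
    # The kwarg stays `window`; rows persist under window_key because
    # "window" is a reserved word in PostgreSQL.
    with conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM heartbeat.metric_history")
        (total,) = cur.fetchone()
    return {"window": window, "rows": total}
```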

2026-05-07
Decision: Shared AppShell.tsx for all 4 pages. Header, primary nav (Cockpit | Audit), right-side icon links (Docs, Feedback), freshness chip, and ThemeToggle live in dashboard/app/_shell/AppShell.tsx. The page-specific toolbar is passed via a prop. The "← Heartbeat" back-link is removed — primary nav makes it redundant.
Reason: Each of the 4 pages had ~50 lines of hand-rolled header that drifted out of sync. With Docs and Feedback being net-new affordances, fixing this once was cheaper than cargo-culting the divergence into two more places. Docs/Feedback live as right-side icons (not segmented items) because they're cross-cutting affordances, not sibling sections.

2026-05-07
Decision: verification_state + email owner on metric_registry and audit_cut_registry. Two new columns each, default verification_state='unverified', default owner=NULL. The metric_registry_guard trigger's allow-list is extended to include both, so operators can verify / re-assign without a redeploy. PATCH endpoints /api/metrics/{id}/verification and /api/audit/cuts/{id}/verification accept either field independently and validate owner against a loose RFC-5321 regex. UI: dashed border + small "unverified" pill on cockpit code chips and audit Section cards. The inline owner field auto-saves on blur/Enter when valid.
Reason: Heartbeat metrics and audit cuts both ship as "unverified by default" — adding a number to the cockpit doesn't mean someone has cross-checked it against the source-of-truth count yet. Without a visible state, drift between "the SQL runs" and "the team trusts the number" is invisible. The dashed-border treatment makes that gap legible at a glance. The email validator on owner reserves the right shape for §8.1 alert routing — a Telegram/Slack channel routes by owner.email, so it has to be an actual email.
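
The owner check is small enough to show. A sketch in the spirit of the loose regex this entry describes; the exact pattern in the codebase is not quoted here, so this one is a stand-in:

```python
import re

# Loose shape check: something@something.tld, no whitespace, one "@".
OWNER_EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_owner(owner):
    # NULL clears the owner; anything else must look like a deliverable
    # address, since alert routing later keys on owner.email.
    return owner is None or OWNER_EMAIL_RE.fullmatch(owner) is not None
```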

2026-05-07
Decision: /docs (fumadocs) is the single source of truth for product, lifecycle, sources, decisions, and operations. ARCHITECTURE.md is retired and its content migrated into /docs. cheatsheet.md and docs/pending-ingest-sources.md are folded into operations/access and sources respectively. CLAUDE.md and README become thin pointers to /docs. Any new feature, product decision, user scenario, or behavioural change updates the relevant page in /docs in the same PR as the code. [#docs-as-sot]
Reason: Three sources of architectural truth (ARCHITECTURE.md, CLAUDE.md, scattered notes) drifted independently. With /docs rendered in the same dashboard the user already opens, written in MDX rather than a single 450-line file, indexed for LLMs via llms.txt/llms.mdx, and discoverable via the Docs icon in AppShell, there is exactly one place a human or an agent goes to learn how Heartbeat works. The principle "config-truth in the file, runtime-truth in the table" already governs metrics; this decision applies the same shape to product knowledge — the file is the truth, every other surface mirrors it.

2026-05-07
Decision: Metric lifecycle has 10 named stages, with unverified treated as a first-class long-lived state rather than a quality gate. See metric lifecycle. Verification is event-driven (typically the first alert firing), not procedural.
Reason: We were treating unverified as embarrassing and verified as the goal, which would have pressured every new metric through a full owner cross-check before it could land — the opposite of what metrics-discovery was built for. Naming the lifecycle and making unverified the expected long-term state for most tiles aligns the social cost with the actual one: an unverified number that nobody is paged on costs nothing; an unverified number that fires a false alert costs an owner's time. Verification should happen exactly when that cost is about to be paid.

2026-05-08
Decision: Compute phase is parallelised with ThreadPoolExecutor(max_workers=4), each worker on its own DB connection. Ingest remains serial (single-flight).
Reason: A full refresh of 31 metrics took 8–12 min before this change. Parallelising only compute (not ingest) avoids the RAM ceiling and UPSERT races that make parallel ingest unsafe, while still getting ~4× throughput on the compute phase.
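
A sketch of that shape, with connect and compute_one passed in as callables; both names are illustrative, and the real signatures are not quoted here:

```python
from concurrent.futures import ThreadPoolExecutor

def compute_all(metrics, connect, compute_one, workers=4):
    # One DB connection per worker thread; a shared connection would
    # serialise the queries and reintroduce the bottleneck.
    def worker(metric):
        with connect() as conn:
            return compute_one(conn, metric)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(worker, metrics))
```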

2026-05-08
Decision: POST_INGEST_INDEXES dict in api/ingest_lib.py — B-tree indexes applied automatically (idempotent, IF NOT EXISTS) after every pipeline.run(). dlt only creates _dlt_id unique indexes by default; every date-range and join-column filter on multi-GB tables would otherwise do a full seq scan.
Reason: First discovered when application_approval_rate timed out at 60s. Root cause: entry (2.6 GB), payment (1.76 GB), payment_state_log (1.4 GB), aml_retrospective (2.7 GB) had no useful indexes. Adding them dropped per-query time from 4–10s to under 0.4s. Rule: when a new metric's value_sql touches a table over 200 MB, verify the WHERE/JOIN columns have indexes before shipping the metric.
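
A sketch of the index pass, assuming the dict maps "schema.table" to (index_name, column) pairs; the real layout in api/ingest_lib.py is not quoted here, and the example row is hypothetical:

```python
POST_INGEST_INDEXES = {
    "webbank_core.payment": [("ix_payment_created_at", "created_at")],
}

def apply_post_ingest_indexes(conn):
    with conn.cursor() as cur:
        for table, specs in POST_INGEST_INDEXES.items():
            for index_name, column in specs:
                # IF NOT EXISTS makes the pass idempotent across ticks.
                cur.execute(
                    f"CREATE INDEX IF NOT EXISTS {index_name} "
                    f"ON {table} ({column})"
                )
    conn.commit()
```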

2026-05-08
Decision: replace-strategy tables are a hidden ingest cost — they are truncated and fully reloaded on every ingest tick regardless of change volume. Audit all replace-strategy tables; switch to merge/append where a source PK or cursor column exists. Intentional replace tables (small lookup tables) are acceptable and must be documented.
Reason: dlt falls back to replace for any table with no PK and no recognised incremental cursor. The cost is invisible in normal operation but becomes the bottleneck when a replace table grows. Discovered during the 2026-05-08 hardening session; 4 tables were already fixed to merge.
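
In dlt terms the fix is a per-resource hint. A sketch using dlt's apply_hints on the sql_database source; the table, key, and cursor column names are illustrative, and connection credentials are assumed to come from dlt config:

```python
import dlt
from dlt.sources.sql_database import sql_database

source = sql_database(schema="core").with_resources("payment")

# Merge on the source PK and only pull rows past the incremental cursor,
# instead of truncate-and-reload on every tick.
source.payment.apply_hints(
    primary_key="id",
    write_disposition="merge",
    incremental=dlt.sources.incremental("updated_at"),
)
```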

2026-05-08
Decision: Per-metric statement_timeout of 60s (SET LOCAL statement_timeout = '60s' before each value_sql), configurable via the METRIC_STATEMENT_TIMEOUT env var.
Reason: Prevents one hanging query from blocking the entire compute queue. The value can be raised via the env var; if a metric legitimately exceeds 60s it is a structural problem (missing index or correlated subquery) that must be fixed before the timeout is relaxed.
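
A sketch of the guarded execution, assuming psycopg2, where each execute runs inside an open transaction (which is what scopes SET LOCAL); the helper name is illustrative:

```python
import os

TIMEOUT = os.environ.get("METRIC_STATEMENT_TIMEOUT", "60s")

def run_value_sql(conn, value_sql):
    with conn.cursor() as cur:
        # SET LOCAL lasts until the enclosing transaction ends, so the
        # timeout applies to this metric's query and nothing after it.
        cur.execute(f"SET LOCAL statement_timeout = '{TIMEOUT}'")
        cur.execute(value_sql)
        return cur.fetchone()
```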

2026-05-08
Decision: Targeted ingest for per-metric refresh: value_sql is regex-parsed to extract schema.table references; only those tables are ingested. A full webbank.core refresh went from loading all 186 tables to loading 4.
Reason: Reduces per-metric UI refresh latency from minutes to seconds. Full-schema ingest is still used for the hourly cron; targeted ingest is only for the on-demand per-metric path.
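
A sketch of the extraction step, assuming plain unquoted identifiers; the pattern is deliberately naive and the real parser may handle more:

```python
import re

# Matches dotted pairs like core.payment; it will also catch alias.column,
# so callers should intersect the result with known ingestable tables.
TABLE_RE = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)\.([A-Za-z_][A-Za-z0-9_]*)\b")

def referenced_tables(value_sql):
    return set(TABLE_RE.findall(value_sql))
```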

2026-05-08
Decision: Cockpit background poll is suppressed while a job is in flight. useCockpitMetrics checks active-job from the query cache at refetch time and skips the refetch if a job is running. A single atomic invalidation fires when the job finishes.
Reason: The 30s background poll was firing during an active refresh, which showed a mixed old/new state mid-job on the TV screen. The fix makes the cockpit update atomically rather than streaming partial updates.

How to add an entry

Decisions land in this log when:

  • a non-obvious choice was made (tooling, architectural shape, scope boundary), and
  • reverting it later would cost work (someone would re-litigate the same trade-offs from scratch).

What does not belong here: bug fixes, naming choices that are already obvious from the code, or "we used library X" when the library is the only sane option for the job.

Format: append an entry, with the date in YYYY-MM-DD, a short imperative decision, and a reason that names the constraint or incident behind it. If the decision supersedes an earlier one, name the prior date in the reason — do not edit the historical entry.