Sources

What's landed in marts-db, what's pending, and the rules for choosing a source.

Heartbeat reads from every system Satchel runs on. Every source arrives in marts-db via dlt, and every metric is built against landed data — never against a live source on a cockpit refresh. This page is the inventory: what's in, what's blocked, and the source priority rule that keeps the cockpit honest.

Source priority — webbank first

webbank is the source of truth for every metric. Use other landed sources (vss, aml, pdvs, card_management, currency_rate) only when the data webbank exposes is genuinely not there.

This was confirmed by the onboarding team on 2026-04-30, after a metric built on vss_vss.applicant_log diverged materially from what the team counts by hand against webbank.core.client_profile / client_profile_state_log / dict.profile_state. Whatever number we show on the cockpit must match the team's manual count; otherwise the metric loses trust the moment they check it.

When a new metric is proposed via metrics-discovery:

  • webbank first — the skill checks webbank tables before any fallback.
  • fallback requires a notes entry on the metric explaining why webbank could not be used.
  • wrong source → no metric. If the right source is not landed, the source is added to pending sources (with the blocked metric listed under it) and the candidate is parked. We do not ship the metric off a wrong-but-landed source.

Landed sources

All 6 source DBs are landed in marts-db as of 2026-05-04. Total ~15 GB across 9 schemas.

| Source | Role | Notable schemas |
| --- | --- | --- |
| webbank | Source of truth for clients, payments, accounts, KYC. | core (193 t), dict (78 t), scb (20 t), integration (14 t), eapi (5 t) |
| aml | Sanctions / AML decisions (retrospective). | aml |
| pdvs | Payment dispatching / payee verification. | core |
| vss | KYC applicants / client verifications (legacy — being migrated to webbank). | vss, dict |
| card_management | Card issuing / lifecycle. | card, dict |
| currency_rate | FX rates. | rate |

Operationally, as of 2026-05-04:

  • webbank dlt ingest is landed in full: webbank_core 156 tables / 12 GB (entry, payment_state_log, account_balance_log all ~10M rows), webbank_dict 80 tables / 3 MB. Backfill ran 2026-04-30 .. 05-04 in a one-off container on the dlt_data volume.
  • Any onboarding / KYC / payment metric currently sourced from vss_vss.* is a migration target — re-derive against webbank.core.client_profile / client_profile_state_log / dict.profile_state (and the corresponding payment / state-log tables) and verify the number matches the team's hand count before flipping the metric over.
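
A minimal sketch of that pre-flip check, assuming marts-db is Postgres-compatible and reachable with psql; host/user/db below are placeholders in this page's usual <name> style, and the landed table names follow the <source>_<schema> pattern noted above. Real metric logic would also join client_profile_state_log and dict.profile_state; this only shows the shape of the comparison.

    # Placeholders: point psql at marts-db, then compare the legacy-derived count
    # with the webbank-derived count before flipping the metric.
    psql -h <marts-host> -U <marts-user> -d <marts-db> \
      -c "SELECT count(*) AS vss_applicants   FROM vss_vss.applicant_log;" \
      -c "SELECT count(*) AS webbank_profiles FROM webbank_core.client_profile;"
    # Flip the metric only when the webbank-derived number matches the team's hand count.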

Postgres topology

We've gotten the Postgres topology wrong twice already (treating "webbank" as a single DB; treating SEPA as part of the webbank cluster). The picture below is verified against pg_database on the live clusters and against the local schema-only mirror at seed/schemas/. Treat this as the source of truth — when in doubt, re-verify against the live cluster, not against memory.

Cluster A — webbank cluster

Host 16.170.155.54, Postgres 14.2. 9 sibling DBs on one cluster. The same Postgres user reaches all of them with one credential once GRANT SELECT is in place per DB.

| DB | What's in it | SELECT granted? |
| --- | --- | --- |
| webbank | Core banking — accounts, customers, payments, ledger, dictionaries | yes — dashboard_user |
| aml | AML alerts / retrospective | pending |
| pdvs | Payee verification | pending |
| vss | KYC applicants / client verifications | pending |
| card-management | Card issuing / lifecycle | pending |
| currency-rate | FX rates | pending |
| bpm | Workflow / BPM | pending — see pending sources |
| notification | Notification log | pending |
| vmi | (TBD — pulled in dump but role unclear) | pending |

Cluster A is one credential, many DBs. Adding aml/vss/pdvs/etc. is a GRANT request, not a new connection setup.
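
What that GRANT request amounts to, sketched with placeholders in the same <name> style as the rest of this page; the admin role is an assumption (whoever owns the cluster runs it), and dashboard_user is the existing Cluster A credential.

    # Run as a role that can grant on the target DB (cluster_admin is a placeholder).
    psql -h 16.170.155.54 -U cluster_admin -d <db> \
      -c "GRANT CONNECT ON DATABASE <db> TO dashboard_user;" \
      -c "GRANT USAGE ON SCHEMA <schema> TO dashboard_user;" \
      -c "GRANT SELECT ON ALL TABLES IN SCHEMA <schema> TO dashboard_user;" \
      -c "ALTER DEFAULT PRIVILEGES IN SCHEMA <schema> GRANT SELECT ON TABLES TO dashboard_user;"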

Cluster B — SEPA cluster

Separate Postgres, host TBD. 3 DBs, fully independent of cluster A — different credentials, different network reachability.

| DB | What's in it |
| --- | --- |
| mms | SEPA outgoing payments |
| mms_inst | SEPA Instant payments |
| mss_kart | (TBD) |

Cluster B is its own infrastructure ask. Different host, different creds — does not piggyback on the cluster A request.

Local mirror for exploration

DDL only, no data:

  • seed/schemas/prod/ — cluster A dumps (gitignored).
  • seed/schemas/prod-sepa/ — cluster B dumps (gitignored).
  • docker compose -f docker-compose.dev.yml up -d postgres-schemas — container at localhost:5433, all 12 DBs loaded, user explorer / explorer.
  • seed/schemas/erd/webbank-core/index.html — SchemaSpy ERD for webbank.core (186 tables with relationships).
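
Quick exploration against the mirror, using the connection details above:

    # List schemas in a mirrored DB, then the tables of one schema (user / password: explorer / explorer).
    PGPASSWORD=explorer psql -h localhost -p 5433 -U explorer -d webbank -c '\dn'
    PGPASSWORD=explorer psql -h localhost -p 5433 -U explorer -d webbank -c '\dt core.*'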

MVP source plan

The 2026-04-24 offsite produced a 15-source MVP shortlist. We deliberately narrow that to 4 sources for the first end-to-end build, chosen so that every ingest type we'll later need is exercised at least once.

| # | Source | Type | Why this one specifically | Owner | Access |
| --- | --- | --- | --- | --- | --- |
| 1 | webbank Postgres | Postgres replica | Core banking truth. Already wired. | Den | connected |
| 2 | BO API | Internal REST | Back-office state not in webbank tables (operator actions, manual interventions, queues). Exercises the "internal HTTP" pattern without SaaS quirks. | ? | unknown |
| 3 | Zoho CRM | SaaS REST (OAuth) | Funnel truth: leads, deals, pipeline. Exercises the SaaS-OAuth pattern. | ? | unknown |
| 4 | Google Analytics | SaaS REST (service account, query API) | Top-of-funnel: traffic, sources, conversions. Exercises the SaaS-aggregation-API pattern. | Дима (devs.ua) | unknown |

These 4 cover the 4 ingest archetypes we expect across the wider MVP-15 (own-Postgres / internal REST / SaaS OAuth REST / SaaS aggregate API). The remaining 11 (Tx monitoring, Bank providers, Collateral, Fireblocks, Ondato, Freshchat, Zendesk, YouTrack, registration/contact forms, accounting) come after the ingest pattern is proven on these 4.

Pending sources

Source databases that Heartbeat needs but has not yet landed. Each entry names the metrics blocked on it. Until landed, the affected metrics are either removed from the cockpit (file deleted from db/metrics/) or carry an explicit caveat in their notes.

BPM (Business Process Management)

Why we need it. BPM holds the AML compliance review state for outgoing payments. While BPM holds a case for manual compliance review (which can take days), webbank only sees the payment as authorized. Without BPM we cannot:

  • subtract AML hold-time from outgoing-payment processing-time metrics (so the metric currently reads "bank + compliance + random hold" instead of "bank only");
  • show the full AML screening picture — BPM has the breakdown across incoming / monetary / ongoing screening types.

Blocked metrics / metric concepts.

  • outgoing_sepa_credit_processing_time_p50_min_24h — methodology is documented as bank+compliance until BPM lands. PM review 2026-05-05 wants a sibling that is bank-only (AML wait subtracted).
  • outgoing_other_processing_time_p50_min_7d — same.
  • AML Screening Failures (deleted 2026-05-06) — previously aml_failures_24h; only saw incoming AML retrospective screenings. BPM is needed for incoming + monetary + ongoing in one metric.

Status. Not landed as of 2026-05-06.

QAC

Why we need it. QAC is the source of truth for KYC pass/fail according to PM (Alena, 2026-05-05). Webbank's webbank_dict.kyc_verification_status exposes a coarser KYC outcome (CLEAR / AML_FOUND / IN_PROGRESS / ...) that may be useful as a stop-gap, but the team's hand-count and reporting are against QAC.
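
As an illustration of the stop-gap only (not a substitute for QAC): the coarse outcomes can at least be inspected on the landed dict table. The marts-db connection values are placeholders as in the earlier sketches, and any pass rate built this way would still need the join to the client-verification side plus a check against the team's QAC-based hand count.

    # Inspect the coarse KYC outcomes webbank exposes (CLEAR / AML_FOUND / IN_PROGRESS / ...).
    psql -h <marts-host> -U <marts-user> -d <marts-db> \
      -c "SELECT * FROM webbank_dict.kyc_verification_status;"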

Blocked metrics / metric concepts.

  • KYC Pass Rate (deleted 2026-05-06) — previously kyc_pass_rate_30d; was reading the redundant Client Verification surface. Removed rather than left on a wrong source until QAC lands.

Status. Not landed as of 2026-05-06.

SEPA cluster (mms, mms_inst, mss_kart)

Why we need it. SEPA outgoing payments and SEPA Instant payments live on a completely separate Postgres cluster (Cluster B — mms, mms_inst, mss_kart). Any processing-time or volume metric for SEPA rails must read from here, not from webbank.

Blocked metrics / metric concepts.

  • SEPA outgoing payment volume (24h / 7d)
  • SEPA Instant processing time (p50, p95)
  • SEPA rejection rate

What's needed. Cluster B is a separate infrastructure ask — different host, different credentials, different network path from the VM. This is not a simple GRANT like the remaining Cluster A DBs; it requires a separate connection setup and .env entry.

Status. Not landed as of 2026-05-08. Host TBD; pending access credentials from ops.

BO API (internal REST)

Why we need it. The back-office API holds operator actions, manual interventions, and queue states that are not reflected in webbank tables. It exercises the internal-HTTP ingest pattern and gates the full MVP-4 ingest-archetype proof.

Blocked metrics / metric concepts.

  • Operator intervention queue length / age
  • Back-office case resolution time

What's needed. API base URL, auth scheme (token / mTLS), and SELECT-equivalent access on the endpoints covering queue and case state.

Status. Not landed as of 2026-05-08. Owner and access details unknown.

Zoho CRM (SaaS REST / OAuth)

Why we need it. Zoho CRM is the funnel truth — leads, deals, pipeline stage, conversion. Any sales-funnel or pipeline-health metric must read from here. Exercises the SaaS-OAuth ingest pattern.

Blocked metrics / metric concepts.

  • Lead → deal conversion rate (30d)
  • Pipeline value by stage
  • Time-in-stage per deal

What's needed. Zoho OAuth credentials (client ID / secret / refresh token), scope for CRM read access.

Status. Not landed as of 2026-05-08. Owner unknown.

Google Analytics (SaaS aggregate API)

Why we need it. GA is top-of-funnel truth — traffic, sources, conversions. Exercises the SaaS-aggregation-API ingest pattern (query API, not row-level data).

Blocked metrics / metric concepts.

  • Organic sessions (7d / 30d)
  • Conversion rate (visit → signup)
  • Traffic source breakdown

What's needed. Google Cloud service account with GA Data API access; GA4 property ID. Owner: Дима (devs.ua).

Status. Not landed as of 2026-05-08.

notification (Cluster A)

Why we need it. The notification DB on Cluster A holds the notification log — delivery status and channel breakdown for all system-generated messages. Useful for monitoring notification delivery health and debugging communication issues.

Blocked metrics / metric concepts.

  • Notification delivery rate (24h)
  • Failed / bounced notifications

What's needed. GRANT SELECT on notification DB for dashboard_user (same credential as webbank — Cluster A, one request covers it).

Status. Not landed as of 2026-05-08.

vmi (Cluster A)

Why we need it. Role unclear — present in the Cluster A dump but schema not yet inspected. Pending schema review to determine if it contains business-signal data relevant to a metric.
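
The vmi DDL is already loaded in the local schema-only mirror, so the review can start there without any new access:

    # List vmi's schemas on the mirror, then drill into a schema of interest.
    PGPASSWORD=explorer psql -h localhost -p 5433 -U explorer -d vmi -c '\dn'
    PGPASSWORD=explorer psql -h localhost -p 5433 -U explorer -d vmi -c '\dt <schema>.*'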

Status. Not landed as of 2026-05-08. Schema review needed before metrics can be proposed.

How to land a new source

Reach for the dlt engine, not a new script.

  1. Profile the upstream tables and pick the slice we actually need.

  2. Edit api/ingest_lib.py:SOURCES — add a SourceSpec(db=…, schemas=[…], pipeline_name=…).

  3. Add the source's connection env (SOURCE_PG_* for Postgres) to docker-compose.prod.yml and the VM .env (see the sketch after this list).

  4. Backfill in a one-off container on the dlt_data volume:

    ssh ubuntu@13.62.60.156
    cd ~/heartbeat-dashboard
    docker compose -f docker-compose.prod.yml run --rm --no-deps \
      --name heartbeat_ingest api \
      python -m scripts.ingest --source <name>

    Long-running backfills must not run via docker exec heartbeat_api — the api container's writable layer gets wiped on rebuild and a docker compose up -d from a parallel deploy session will kill mid-flight extract/normalize state. The one-off container with dlt state on /var/dlt (env DLT_PIPELINES_DIR=/var/dlt/pipelines) makes the backfill resumable across rebuilds and reboots.

  5. Once the backfill is done, the hourly refresh_all.py cron incremental-loads the source from then on (incremental refreshes take seconds, not hours, so they ride on docker exec heartbeat_api inside the cron path).

  6. Open the metrics-discovery skill against the new schema — candidate file → real number → owner alignment → metric file in db/metrics/.
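
For step 3, the env entries might look roughly like this; the actual SOURCE_PG_* key names are whatever api/ingest_lib.py reads, so every name and value below is a placeholder, not the real convention:

    # Hypothetical .env additions for a new Cluster A Postgres source (all keys/values are placeholders).
    SOURCE_PG_BPM_HOST=16.170.155.54
    SOURCE_PG_BPM_PORT=5432
    SOURCE_PG_BPM_DB=bpm
    SOURCE_PG_BPM_USER=dashboard_user
    SOURCE_PG_BPM_PASSWORD=<from the Cluster A credential>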

Single-flight ingest still required. Even with the t3.large upgrade (2026-05-04), parallel pipelines compete with api / web / marts-db for RAM. One backfill at a time.