# Sources (/docs/sources)



Heartbeat reads from every system Satchel runs on. Every source
arrives in `marts-db` via dlt, and every metric is built against
landed data — never against a live source on a cockpit refresh. This
page is the inventory: what's in, what's blocked, and the source
priority rule that keeps the cockpit honest.

## Source priority — webbank first [#source-priority--webbank-first]

`webbank` is the source of truth for every metric. Use other landed
sources (`vss`, `aml`, `pdvs`, `card_management`, `currency_rate`)
only when the data webbank exposes is genuinely not there.

This was confirmed by the onboarding team on 2026-04-30 after a
metric built on `vss_vss.applicant_log` diverged materially from what
the team counts by hand against `webbank.core.client_profile` /
`client_profile_state_log` / `dict.profile_state`. Whichever number
we show on the cockpit must match the team's manual count, otherwise
the metric loses trust the moment they look.

When a new metric is proposed via `metrics-discovery`:

* **webbank first** — the skill checks webbank tables before any
  fallback.
* **fallback requires a `notes` entry** on the metric explaining why
  webbank could not be used.
* **wrong source → no metric.** If the right source is not landed,
  the metric is added to [pending sources](#pending) and the candidate
  is parked. We do not ship the metric off a wrong-but-landed source.

## Landed sources [#landed-sources]

All 6 source DBs are landed in marts-db as of 2026-05-04. Total
\~15 GB across 9 schemas.

| Source               | Role                                                                        | Notable schemas                                                                 |
| -------------------- | --------------------------------------------------------------------------- | ------------------------------------------------------------------------------- |
| **webbank**          | Source of truth for clients, payments, accounts, KYC.                       | `core` (193 t), `dict` (78 t), `scb` (20 t), `integration` (14 t), `eapi` (5 t) |
| **aml**              | Sanctions / AML decisions (retrospective).                                  | `aml`                                                                           |
| **pdvs**             | Payment dispatching / payee verification.                                   | `core`                                                                          |
| **vss**              | KYC applicants / client verifications (legacy — being migrated to webbank). | `vss`, `dict`                                                                   |
| **card\_management** | Card issuing / lifecycle.                                                   | `card`, `dict`                                                                  |
| **currency\_rate**   | FX rates.                                                                   | `rate`                                                                          |

Operationally, as of 2026-05-04:

* `webbank` dlt ingest is **landed in full** — `webbank_core` 156
  tables / 12 GB (entry, payment\_state\_log, account\_balance\_log all
  \~10M rows), `webbank_dict` 80 tables / 3 MB. Backfill ran
  2026-04-30 .. 05-04 in a one-off container on the `dlt_data`
  volume.
* Any onboarding / KYC / payment metric currently sourced from
  `vss_vss.*` is a **migration target** — re-derive against
  `webbank.core.client_profile` / `client_profile_state_log` /
  `dict.profile_state` (and the corresponding payment / state-log
  tables) and verify the number matches the team's hand count
  before flipping the metric over.

## Postgres topology [#postgres-topology]

We've made the same wrong assumption about Postgres twice already
(treating "webbank" as a single DB; treating SEPA as part of the
webbank cluster). The picture below is verified against
`pg_database` on the live clusters and against the local schema-only
mirror at `seed/schemas/`. &#x2A;*Treat this as the source of truth — when
in doubt, re-verify against the live cluster, not against memory.**

### Cluster A — webbank cluster [#cluster-a--webbank-cluster]

Host `16.170.155.54`, postgres 14.2. &#x2A;*9 sibling DBs on one
cluster.** Same Postgres user reaches all of them with one
credential once `GRANT SELECT` is in place per DB.

| DB                | What's in it                                                       | SELECT granted?                           |
| ----------------- | ------------------------------------------------------------------ | ----------------------------------------- |
| `webbank`         | Core banking — accounts, customers, payments, ledger, dictionaries | yes — `dashboard_user`                    |
| `aml`             | AML alerts / retrospective                                         | pending                                   |
| `pdvs`            | Payee verification                                                 | pending                                   |
| `vss`             | KYC applicants / client verifications                              | pending                                   |
| `card-management` | Card issuing / lifecycle                                           | pending                                   |
| `currency-rate`   | FX rates                                                           | pending                                   |
| `bpm`             | Workflow / BPM                                                     | pending — see [pending sources](#pending) |
| `notification`    | Notification log                                                   | pending                                   |
| `vmi`             | (TBD — pulled in dump but role unclear)                            | pending                                   |

> **Cluster A is one credential, many DBs.** Adding aml/vss/pdvs/etc.
> is a `GRANT` request, not a new connection setup.

### Cluster B — SEPA cluster [#cluster-b--sepa-cluster]

Separate Postgres, host TBD. **3 DBs, fully independent of
cluster A** — different credentials, different network reachability.

| DB         | What's in it           |
| ---------- | ---------------------- |
| `mms`      | SEPA outgoing payments |
| `mms_inst` | SEPA Instant payments  |
| `mss_kart` | (TBD)                  |

> **Cluster B is its own infrastructure ask.** Different host,
> different creds — does not piggyback on the cluster A request.

### Local mirror for exploration [#local-mirror-for-exploration]

DDL only, no data:

* `seed/schemas/prod/` — cluster A dumps (gitignored).
* `seed/schemas/prod-sepa/` — cluster B dumps (gitignored).
* `docker compose -f docker-compose.dev.yml up -d postgres-schemas` — container at
  `localhost:5433`, all 12 DBs loaded, user `explorer` / `explorer`.
* `seed/schemas/erd/webbank-core/index.html` — SchemaSpy ERD for
  `webbank.core` (186 tables with relationships).

## MVP source plan [#mvp-source-plan]

The 24-04-2026 offsite produced a 15-source MVP shortlist. We
deliberately narrow that to **4 sources for the first end-to-end
build** — chosen so that *every ingest type that we'll later need is
exercised at least once*.

| # | Source               | Type                                   | Why this one specifically                                                                                                                            | Owner          | Access        |
| - | -------------------- | -------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- | -------------- | ------------- |
| 1 | **webbank Postgres** | Postgres replica                       | Core banking truth. Already wired.                                                                                                                   | Den            | **connected** |
| 2 | **BO API**           | Internal REST                          | Back-office state not in webbank tables (operator actions, manual interventions, queues). Exercises the "internal HTTP" pattern without SaaS quirks. | ?              | unknown       |
| 3 | **Zoho CRM**         | SaaS REST (OAuth)                      | Funnel truth: leads, deals, pipeline. Exercises the SaaS-OAuth pattern.                                                                              | ?              | unknown       |
| 4 | **Google Analytics** | SaaS REST (service account, query API) | Top-of-funnel: traffic, sources, conversions. Exercises the SaaS-aggregation-API pattern.                                                            | Дима (devs.ua) | unknown       |

These 4 cover the 4 ingest archetypes we expect across the wider
MVP-15 (own-Postgres / internal REST / SaaS OAuth REST / SaaS
aggregate API). The remaining 11 (Tx monitoring, Bank providers,
Collateral, Fireblocks, Ondato, Freshchat, Zendesk, YouTrack,
registration/contact forms, accounting) come **after** the ingest
pattern is proven on these 4.

## Pending sources [#pending]

Source databases that Heartbeat needs but has **not yet landed**.
Each entry names the metrics blocked on it. Until landed, the
affected metrics are either removed from the cockpit (file deleted
from `db/metrics/`) or carry an explicit caveat in their `notes`.

### BPM (Business Process Management) [#bpm-business-process-management]

**Why we need it.** BPM holds the AML compliance review state for
outgoing payments. Webbank only sees the payment as `authorized`
while BPM holds the case for manual compliance review (can be
days). Without BPM we cannot:

* subtract AML hold-time from outgoing-payment processing-time
  metrics (so the metric currently reads "bank + compliance + random
  hold" instead of "bank only");
* show the full AML screening picture — BPM has the breakdown across
  incoming / monetary / ongoing screening types.

**Blocked metrics / metric concepts.**

* `outgoing_sepa_credit_processing_time_p50_min_24h` — methodology is
  documented as bank+compliance until BPM lands. PM review
  2026-05-05 wants a sibling that is bank-only (AML wait subtracted).
* `outgoing_other_processing_time_p50_min_7d` — same.
* ML Screening Failures (deleted 2026-05-06) — previously
  `aml_failures_24h`; only saw incoming AML retrospective screenings.
  BPM is needed for incoming + monetary + ongoing in one metric.

**Status.** Not landed as of 2026-05-06.

### QAC [#qac]

**Why we need it.** QAC is the source of truth for KYC pass/fail
according to PM (Alena, 2026-05-05). Webbank's
`webbank_dict.kyc_verification_status` exposes a coarser KYC outcome
(`CLEAR / AML_FOUND / IN_PROGRESS / ...`) that may be useful as a
stop-gap, but the team's hand-count and reporting are against QAC.

**Blocked metrics / metric concepts.**

* KYC Pass Rate (deleted 2026-05-06) — previously `kyc_pass_rate_30d`;
  was reading the redundant `Client Verification` surface. Removed
  rather than left on a wrong source until QAC lands.

**Status.** Not landed as of 2026-05-06.

### SEPA cluster (mms, mms\_inst, mss\_kart) [#sepa-cluster-mms-mms_inst-mss_kart]

**Why we need it.** SEPA outgoing payments and SEPA Instant payments
live on a completely separate Postgres cluster (Cluster B —
`mms`, `mms_inst`, `mss_kart`). Any processing-time or volume metric
for SEPA rails must read from here, not from webbank.

**Blocked metrics / metric concepts.**

* SEPA outgoing payment volume (24h / 7d)
* SEPA Instant processing time (p50, p95)
* SEPA rejection rate

**What's needed.** Cluster B is a separate infrastructure ask —
different host, different credentials, different network path from
the VM. This is not a simple `GRANT` like the remaining Cluster A
DBs; it requires a separate connection setup and `.env` entry.

**Status.** Not landed as of 2026-05-08. Host TBD; pending
access credentials from ops.

### BO API (internal REST) [#bo-api-internal-rest]

**Why we need it.** The back-office API holds operator actions, manual
interventions, and queue states that are not reflected in webbank
tables. It exercises the internal-HTTP ingest pattern and gates the
full MVP-4 ingest-archetype proof.

**Blocked metrics / metric concepts.**

* Operator intervention queue length / age
* Back-office case resolution time

**What's needed.** API base URL, auth scheme (token / mTLS), and
`SELECT`-equivalent access on the endpoints covering queue and case
state.

**Status.** Not landed as of 2026-05-08. Owner and access details
unknown.

### Zoho CRM (SaaS REST / OAuth) [#zoho-crm-saas-rest--oauth]

**Why we need it.** Zoho CRM is the funnel truth — leads, deals,
pipeline stage, conversion. Any sales-funnel or pipeline-health
metric must read from here. Exercises the SaaS-OAuth ingest pattern.

**Blocked metrics / metric concepts.**

* Lead → deal conversion rate (30d)
* Pipeline value by stage
* Time-in-stage per deal

**What's needed.** Zoho OAuth credentials (client ID / secret /
refresh token), scope for CRM read access.

**Status.** Not landed as of 2026-05-08. Owner unknown.

### Google Analytics (SaaS aggregate API) [#google-analytics-saas-aggregate-api]

**Why we need it.** GA is top-of-funnel truth — traffic, sources,
conversions. Exercises the SaaS-aggregation-API ingest pattern
(query API, not row-level data).

**Blocked metrics / metric concepts.**

* Organic sessions (7d / 30d)
* Conversion rate (visit → signup)
* Traffic source breakdown

**What's needed.** Google Cloud service account with GA Data API
access; GA4 property ID. Owner: Дима (devs.ua).

**Status.** Not landed as of 2026-05-08.

### notification (Cluster A) [#notification-cluster-a]

**Why we need it.** The `notification` DB on Cluster A holds the
notification log — delivery status and channel breakdown for all
system-generated messages. Useful for monitoring notification
delivery health and debugging communication issues.

**Blocked metrics / metric concepts.**

* Notification delivery rate (24h)
* Failed / bounced notifications

**What's needed.** `GRANT SELECT` on `notification` DB for
`dashboard_user` (same credential as `webbank` — Cluster A, one
request covers it).

**Status.** Not landed as of 2026-05-08.

### vmi (Cluster A) [#vmi-cluster-a]

**Why we need it.** Role unclear — present in the Cluster A dump but
schema not yet inspected. Pending schema review to determine if it
contains business-signal data relevant to a metric.

**Status.** Not landed as of 2026-05-08. Schema review needed before
metrics can be proposed.

## How to land a new source [#how-to-land-a-new-source]

Reach for the dlt engine, not a new script.

1. Profile the upstream tables and pick the slice we actually need.
2. Edit `api/ingest_lib.py:SOURCES` — add a `SourceSpec(db=…, schemas=[…], pipeline_name=…)`.
3. Add the source's connection env (`SOURCE_PG_*` for Postgres) to
   `docker-compose.prod.yml` and the VM `.env`.
4. **Backfill in a one-off container on the `dlt_data` volume:**

   ```bash
   ssh ubuntu@13.62.60.156
   cd ~/heartbeat-dashboard
   docker compose -f docker-compose.prod.yml run --rm --no-deps \
     --name heartbeat_ingest api \
     python -m scripts.ingest --source <name>
   ```

   Long-running backfills must not run via `docker exec heartbeat_api`
   — the api container's writable layer gets wiped on rebuild and a
   `docker compose up -d` from a parallel deploy session will kill
   mid-flight extract/normalize state. The one-off container with dlt
   state on `/var/dlt` (env `DLT_PIPELINES_DIR=/var/dlt/pipelines`)
   makes the backfill resumable across rebuilds and reboots.
5. Once the backfill is done, the hourly cron `refresh_all.py`
   incremental-loads it from then on (incremental refreshes are
   seconds, not hours, so they ride on `docker exec heartbeat_api`
   inside the cron path).
6. Open the `metrics-discovery` skill against the new schema —
   candidate file → real number → owner alignment → metric file in
   `db/metrics/`.

> **Single-flight ingest still required.** Even with the t3.large
> upgrade (2026-05-04), parallel pipelines compete with `api`/`web`/
> `marts-db` for RAM. One backfill at a time.
