Automating Data Vault 2.0 — Hubs, Links, and Satellites from a Description
Data Vault 2.0 is powerful but tedious to model by hand. Here's how to generate a standards-compliant raw vault — hubs, links, satellites, hash keys, and hash diffs — from a plain-English description.
Data Vault 2.0 is the methodology teams reach for when a warehouse has to absorb many source systems, track full history, and stay auditable. It's also the methodology that's most tedious to model by hand: every business key becomes a hub, every relationship a link, every set of descriptive attributes a satellite — and each one needs consistent hash keys and change-detection logic.
This post walks through generating a Data Vault 2.0 model from a description, so you get the methodology's benefits without authoring every structure by hand.
The three core structures
Data Vault separates what from how it relates from what we know about it:
- Hubs store the immutable business keys (a customer number, an order number).
- Links store the relationships between business keys (this customer placed this order).
- Satellites store the descriptive, time-variant context (the customer's email and plan as of a load date).
Keeping these separate is what lets a Data Vault grow without breaking: a new source adds a satellite, a new relationship adds a link, and nothing existing has to change.
Why it's painful by hand
The pattern is mechanical, which is exactly why it's error-prone at scale:
- Every hub and link needs a hash key (
_HK) derived from its business key(s). - Every satellite needs a
HASH_DIFFso loads can detect changes. - Every table needs a load timestamp and a record source.
- The hashing has to be identical across platforms, or loads stop being deterministic.
Hand-writing this for dozens of tables is where mistakes creep in.
Generating the vault from a description
Pick the Data Vault paradigm and describe your domain in plain English:
We ingest customers and orders from our billing system. Customers have an email and a plan. Orders belong to a customer and have a total and a status.
OffDataAI identifies the business keys (customer, order), the relationship (customer → order), and the descriptive attributes (email, plan, total, status), then generates the structures:
-- Hub: the customer business key
CREATE TABLE raw_vault.hub_customer (
customer_hk BINARY(20) NOT NULL, -- hash key
customer_bk STRING NOT NULL, -- business key
load_dts TIMESTAMP_NTZ NOT NULL,
record_source STRING NOT NULL,
CONSTRAINT pk_hub_customer PRIMARY KEY (customer_hk)
);
-- Link: customer placed order
CREATE TABLE raw_vault.link_customer_order (
customer_order_hk BINARY(20) NOT NULL,
customer_hk BINARY(20) NOT NULL,
order_hk BINARY(20) NOT NULL,
load_dts TIMESTAMP_NTZ NOT NULL,
record_source STRING NOT NULL
);
-- Satellite: descriptive context, time-variant
CREATE TABLE raw_vault.sat_customer_details (
customer_hk BINARY(20) NOT NULL,
load_dts TIMESTAMP_NTZ NOT NULL,
hash_diff BINARY(20) NOT NULL, -- change detection
email STRING,
plan_id STRING,
record_source STRING NOT NULL,
CONSTRAINT pk_sat_customer PRIMARY KEY (customer_hk, load_dts)
);
The hash keys and HASH_DIFF columns are emitted by the DDL layer — you never hand-add them — so hashing is consistent whether you target Snowflake, BigQuery, Databricks, Redshift, or Postgres.
Loading it with dbt
A raw vault is only useful once it's loaded. OffDataAI emits a dbt project alongside the DDL: staging models that compute the hash keys and hash diffs, and insert-only loads for hubs, links, and satellites. It runs with dbt build from the first commit.
Evolving without breaking
When the billing system adds a region field, you don't alter sat_customer_details — you add a new satellite (or a new column on a versioned satellite) and OffDataAI generates the migration. That insert-only, additive evolution is the whole point of Data Vault, and it's preserved automatically.
Try it
Describe your sources and get a standards-compliant raw vault in minutes. Start with the Data Vault 2.0 generator, or compare approaches with dimensional modeling and the full AI data modeling tool.
Try OffDataAI
Ship a warehouse from a single conversation.
Describe your domain. Get ERDs, DDL, dbt projects, and seed data for every major cloud warehouse.
Start building — free