In Part-I, I explained the place of Data Vault (DV) in Enterprise Data Warehouse Architecture. Now let’s look at different DV entities, rules for each entity and why Dan Lindstedt calls DV a “hybrid” approach. This minimal understanding is necessary before diving into the differences between the various modeling techniques.

The main entities of Data Vault are Hub, Link and Satellite.

HUB Entity (HUB_): This is a defining entity. It contains a unique list of business keys. These are the keys that businesses utilize in everyday operations. For example, employee number, SSN number, Product Code. So the attributes of HUB are:

  • Surrogate Key – This is a Primary Key of hub and holds 1-to-1 relationship with the Business Keys.
  • Business Key – This is a Primary Key of the source system. This can be a composite key. ETL checks this key’s existence in the hub table and inserts one if it doesn’t exist.
  • Load Date Time – The datetime of the key / record when it was first loaded into the table.
  • Record Source – The name of the source the record originated from. This is useful for data traceability.
  • Record Begin Date Time – The datetime when the record became active in the source (if available) or the datetime when ETL has been run.
  • Record End Date Time – The datetime when the record is closed. This can only be detected if the logical deletes are supplied or derived in some manner.

LINK Entity (LINK_): LINKS are constructed once all the HUBS are identified. Links are relationship entities.  These are the physical representation of m-to-m 3NF relationship. It represents the relationship or transaction between hubs. The link table contains the unique list of relationships between hub keys. When a relationship arrives, it simply gets loaded into the table if doesn’t exist. Typically, the link tables translate into fact tables in the datamart access layer. For example, the link between employee number and the project number. The other attributes of LINK are:

  • Surrogate Key – This is a Primary Key of the table and is useful when a link contains more than two hub keys as composite key might cause performance problems. This is also
    useful when the granularity of the link changes (a hub key is added) or history needs to be maintained on the relationships.
  • Hub Key 1 to Hub Key N – The surrogate keys from the hub tables that are involved in the relationship.
  • Load Date Time- The datetime when the record was loaded into the table.
  • Record Source – The source system name from where the record or relationship was loaded from.

SAT Entity (SAT_): SATS holds descriptive information about the hub keys or the relationships. The satellite is most closely resembles Type 2 Dimension. When the data changes, a delta record is inserted into the table and if the certain columns changes faster than others then these can be split into two different tables to avoid data replication. For example, employee details such as employee name, address, phone number, email address in the satellite off of hub  or time spent by an employee on a certain project in satellite off of LINK that stores the relationship between an employees and projects. The other attributes of SAT are:

  • Hub or Link Surrogate Key from HUB or LINK table. This is part of the primary key.
  • Load Date Time – The datetime when the record was inserted into the table. This is part of the primary key.
  • Surrogate Key – This is optional. It is useful when satellites have multiple values such as multiple home addresses.
  • Record Source – The name of the source.
  • Record Begin Date Time – The datetime when the record became active in the source (if known) or the datetime when ETL has been run.
  • Record End Date Time – The datetime when the record is closed.

And stand-alone tables such as calendars, time, code and description tables may be used.

Modeling Rules for Each Part of the Entity:

FOR HUBS:

  • Hubs keys cannot migrate into other hubs (no parent/child like HUBS).
  • Hubs must be connected through links.
  • More than two hubs can be connected through links.
  • Surrogate keys may be used.
  • Business keys are 1 to 1 relationship with surrogate keys.
  • Hubs primary keys always migrate outward.
  • Hub business keys and primary keys never change.
  • If a hub has two or more satellites, then a point-in-time table can be built for ease of joins.
  • An ‘UNKNOWN’ business key record can be inserted into Hub that can be used to tie other data in links and sats that has no business keys in source. This kind of data is usually a bad/incomplete source data.

FOR LINKS:

  • Links can be connected to other links.
  • Links must have atleast two hubs associated with them in order to instantiated.
  • Surrogate keys may be used.
  • The combination of surrogate business keys made a unique key.
  • Does not contain descriptive data.
  • Does not contain begin and end dates.

FOR SATS:

  • Satellites may be connected to hubs or links.
  • Have 1 and only 1 parent table.
  • Satellites always contain either a load date-time stamp, or a numeric reference to a stand-alone load date-time sequence table.
  • Primary key is a combination of ‘surrogate key’ from either hub or link and the load datetime stamp.
  • Surrogate keys may not be used.
  • Must have a Load End Date to indicate when the CHANGE to the data set has occurred.
  • Satellites are always delta driven. Duplicate rows should not appear.
  • Data is separated into satellite structures based on 1) type of information 2) rate of change.

DV model utilize bits of both 3rd Normal Form and Dimension Modeling concepts.  This approach has made the model simple, flexible, expandable, adaptable and consistent.

  • Adapted many-to-many physical relationship structure from 3NF that became a LINK table.
  • The LINK table is also similar to factless fact in Start Schema.
  • Adapted the notion of 1 to 1 (business key to surrogate key) tracking from dimensional modeling (type 1 dimension).
  • Adapted the notion of “data over time in a separate table/structure” from dimensional modeling (type 2 dimension). This resulted in a SAT table however it is fundamentally
    different, in that it is a child dependent table, whereas the dimension is a parent table to the facts.

This is it for now. In next post(s) we will look into some examples which will show how Data Vault technique overcomes the limitations of 3NF and Dimensional Model structures when applied as an Enterprise Data Warehouse.


0 Comments

Leave a Reply

Your email address will not be published. Required fields are marked *