So my business sponsors and senior architects have decided to build a data vault. We have already recognized and considered the benefits of changing course for our enterprise. We spent a lot of time considering the business benefits that a different approach to business intelligence would provide. Some of these business related benefits that we identified are:

    • Supports functional areas of business
    • Integrates business keys that cross functional areas
    • Deep historical tracking of information as it changes over time
    • Need to load 100% of the data 100% of the time
    • Conceptual and logical models of the business are natural representations in data vault (DV)  structure

We also took the time to evaluate potential technical benefits. Some of the most important benefits to us were the following:

    • Apply business rules on the way out to data marts
    • Run ETL processes in parallel
    • Flexible and adaptable to change in business requirements over time
    • Auditable back to the source system
    • Compliance
    • Supports agile development approach
    • Simple ETL load patterns allow for code generation

Now that my company has made the decision to move forward with this new Data Vault Methodology approach to Business Intelligence, where do I begin? Well, let’s start with the basic building blocks of a data vault. A data vault can be as simple as a hub and a satellite, but in practice, there is generally a lot of each type.

Remember: a Hub is a collection of business keys. A link tracks the relationship between hubs, or potentially with other relationships (links). A satellite is the time sensitive collection of attributes related to either an only one hub or link.

Here is a sample data model with the end in mind. Notice the Hubs, Links, and Satellites are all here and are appropriately related to each other.

data vault model

So lets dig a little deeper into the purpose of each and how to model and load them effectively.

Hubs

Hubs are the containers for business keys. They are the most important facets of the data vault methodology. The more successfully one is able to identify business keys the less refining of the model will follow. Business keys can be identified using a multitude of strategies. Sometimes it is from interviewing business users, sometimes it is from reviewing data models (primary keys or unique keys), sometimes it is from metadata systems that have identified key important information, as well as other areas.

The basic structure and treatment of the Hub table is as follows:

Mandatory Columns

    • Hub Sequence Identifier (generally a number generated from a database)
    • “Business Key” Value (generally a string to handle any data type)
    • Load Date (generally a date and time)
    • Record Source (generally a string)

Loading Pattern

    • Select Distinct list of business Keys
    • Add timestamp and record source
    • Insert into Hub if the value does not exist

Code Sample

SELECT DISTINCT

stg.LOAD_TIMESTAMP

, stg.RECORD_SOURCE

, stg.CUSTOMER_ID

 FROM stage.Customer stg

WHERE NOT EXISTS (SELECT 1

 FROM dv.H_Customer dv

 WHERE (stg.CUSTOMER_ID = dv.CUSTOMER_ID)

);

Links

Links stores the intersection of business keys (HUBS). Links can be considered the glue that holds the data vault model together. These tables allow for the data model to elegantly change over time because they can come and go as required by the business. Links also allow for the model to be created quickly without worry about whether the relationship is one to many or many to many. In addition, the flexible nature of link tables provides the option to add or drop link tables as requirements change throughout the maintenance lifecycle of the data warehouse or as part of a data mining exercise.

The basic structure and treatment of the Link table is as follows:

Mandatory Columns

    • Link Sequence Identifier (a database number)
    • Load Date and Time (generally a date field)
    • Record Source  (generally a string)
    • At least two Sequence Identifiers (either from Hubs or other Links and are numbers)

Loading Pattern

    • Select Distinct list of business Key combinations from source
    • Add timestamp and record source
    • Lookup data vault identifier from either Hub or Link
    • Insert into Link if the value does not exist

Code Sample

SELECT

stg.LOAD_TIMESTAMP

, stg.RECORD_SOURCE

, stg.CUSTOMER_SQN

, stg.ORDER_SQN

  FROM ( SELECT LOAD_TIMESTAMP

, RECORD_SOURCE

, (SELECT CUSTOMER_SQN

 FROM dv.H_CUSTOMER dv

 WHERE (src.CUSTOMER_ID = dv.CUSTOMER_ID)

) as CUSTOMER_SQN

, (SELECT ORDER_SQN

 FROM dv.H_ORDER dv

 WHERE (src.ORDER_ID = dv.ORDER_ID)

) as ORDER_SQN

    FROM stage.”Order” src

) stg

WHERE NOT EXISTS (SELECT 1

 FROM DV.L_Customer_Order dv

 WHERE (stg.CUSTOMER_SQN = dv.CUSTOMER_SQN)

AND (stg.ORDER_SQN = dv.ORDER_SQN)

);

Satellites

Satellites add all the color and description to the business keys (hubs) and relationships (links) in the data vault environment.  Satellites contain all the descriptive information, tracking change by start and end dates over time, to let one know the information in effect at any point in time.  In the purest sense, satellites are time aware and therefore tracks change over time as its main function.  Satellites are always directly related and are subordinate to a hub or a link. They provide context and definition to business key(s).  A satellite record is added when a change is detected in the processing.  In some cases, there may be multiple satellites pointing to one hub or one link.   The reasons for doing this could be multiple sources, or rate of change, or by data type.

The basic structure and treatment of the Link table is as follows:

Mandatory Columns

    • Hub or Link Sequence Identifier
    • Load Date
    • Load Date End
    • Record Source

Optional Columns

    • Attributes (may be only one, but usually a lot more strings, numbers, or dates)

Loading Pattern

    • Select list of attributes from the source
    • Add timestamp and record source
    • Compare to the existing applicable set of satellite records and insert when a change has been detected
    • Lookup and use the applicable Hub identifier or the Link identifier

Note: a two-step process is generally employed when using a Load End Date to set the time effective properly for satellites

Code Sample

SELECT stg.PRODUCT_SQN

, stg.LOAD_TIMESTAMP

, stg.RECORD_SOURCE

, stg.PRODUCT_DESC

 FROM ( SELECT (SELECT dv.PRODUCT_SQN

 FROM DV.H_Product dv

  WHERE (src.PRODUCT_NAME = dv.PRODUCT_NAME)

) as PRODUCT_SQN

, src.LOAD_TIMESTAMP

, src.RECORD_SOURCE

, src.PRODUCT_DESC

 FROM stage.[Order] src

) stg

WHERE NOT EXISTS (SELECT 1

 FROM dv.S_PRODUCT dv

 WHERE (   stg.PRODUCT_SQN = dv.PRODUCT_SQN)

AND (   stg.PRODUCT_DESC = dv.PRODUCT_DESCRIPTION)

AND dv.LOAD_TIMESTAMP = (SELECT MAX(dv1.LOAD_TIMESTAMP)

FROM DV.S_PRODUCT dv1

WHERE dv1.PRODUCT_SQN = dv.PRODUCT_SQN)

);

With this quick overview of the basics of the data vault model, I hope you can see the simplicity in the design as well as the pattern based loading process.  As you can see, whether you have 1 on 10 hubs or links, they should all look structurally similar as well as load in a similar fashion.  This drives down overall development and support costs when the Enterprise Data Warehouse is supported by a data vault.  Also, designers and developers that are new to the concepts generally can be up and productive in short order.

So if you are…

  • Currently engaging a data warehouse environment that is becoming harder and harder to support and maintain over time
  • Needing to address performance problems
  • Hoping to get your data governance problems addressed
  • Wanting more of a rapid and agile development process
  • Concerned about the current ETL processes having become rigid and difficult to support
  • Suffering from the lack of Business Rules maintenance and management
  • Embarking on a new Business Intelligence endeavor and would like to increase likelihood of success

…then the data vault methodology may be the answer for you.

Along with the loading patterns and models outlined here, there are many other benefits to applying this architecture and process to your Business Intelligence needs.


0 Comments

Leave a Reply

Avatar placeholder

Your email address will not be published. Required fields are marked *