So my business sponsors and senior architects have decided to build a data vault. We have already recognized and considered the benefits of changing course for our enterprise. We spent a lot of time considering the business benefits that a different approach to business intelligence would provide. Some of these business related benefits that we identified are:
-
- Supports functional areas of business
- Integrates business keys that cross functional areas
- Deep historical tracking of information as it changes over time
- Need to load 100% of the data 100% of the time
- Conceptual and logical models of the business are natural representations in data vault (DV) structure
We also took the time to evaluate potential technical benefits. Some of the most important benefits to us were the following:
-
- Apply business rules on the way out to data marts
- Run ETL processes in parallel
- Flexible and adaptable to change in business requirements over time
- Auditable back to the source system
- Compliance
- Supports agile development approach
- Simple ETL load patterns allow for code generation
Now that my company has made the decision to move forward with this new Data Vault Methodology approach to Business Intelligence, where do I begin? Well, let’s start with the basic building blocks of a data vault. A data vault can be as simple as a hub and a satellite, but in practice, there is generally a lot of each type.
Remember: a Hub is a collection of business keys. A link tracks the relationship between hubs, or potentially with other relationships (links). A satellite is the time sensitive collection of attributes related to either an only one hub or link.
Here is a sample data model with the end in mind. Notice the Hubs, Links, and Satellites are all here and are appropriately related to each other.
So lets dig a little deeper into the purpose of each and how to model and load them effectively.
Hubs
Hubs are the containers for business keys. They are the most important facets of the data vault methodology. The more successfully one is able to identify business keys the less refining of the model will follow. Business keys can be identified using a multitude of strategies. Sometimes it is from interviewing business users, sometimes it is from reviewing data models (primary keys or unique keys), sometimes it is from metadata systems that have identified key important information, as well as other areas.
The basic structure and treatment of the Hub table is as follows:
Mandatory Columns
-
- Hub Sequence Identifier (generally a number generated from a database)
- “Business Key” Value (generally a string to handle any data type)
- Load Date (generally a date and time)
- Record Source (generally a string)
Loading Pattern
-
- Select Distinct list of business Keys
- Add timestamp and record source
- Insert into Hub if the value does not exist
Code Sample
SELECT DISTINCT
stg.LOAD_TIMESTAMP
, stg.RECORD_SOURCE
, stg.CUSTOMER_ID
FROM stage.Customer stg
WHERE NOT EXISTS (SELECT 1
FROM dv.H_Customer dv
WHERE (stg.CUSTOMER_ID = dv.CUSTOMER_ID)
);
Links
Links stores the intersection of business keys (HUBS). Links can be considered the glue that holds the data vault model together. These tables allow for the data model to elegantly change over time because they can come and go as required by the business. Links also allow for the model to be created quickly without worry about whether the relationship is one to many or many to many. In addition, the flexible nature of link tables provides the option to add or drop link tables as requirements change throughout the maintenance lifecycle of the data warehouse or as part of a data mining exercise.
The basic structure and treatment of the Link table is as follows:
Mandatory Columns
-
- Link Sequence Identifier (a database number)
- Load Date and Time (generally a date field)
- Record Source (generally a string)
- At least two Sequence Identifiers (either from Hubs or other Links and are numbers)
Loading Pattern
-
- Select Distinct list of business Key combinations from source
- Add timestamp and record source
- Lookup data vault identifier from either Hub or Link
- Insert into Link if the value does not exist
Code Sample
SELECT
stg.LOAD_TIMESTAMP
, stg.RECORD_SOURCE
, stg.CUSTOMER_SQN
, stg.ORDER_SQN
FROM ( SELECT LOAD_TIMESTAMP
, RECORD_SOURCE
, (SELECT CUSTOMER_SQN
FROM dv.H_CUSTOMER dv
WHERE (src.CUSTOMER_ID = dv.CUSTOMER_ID)
) as CUSTOMER_SQN
, (SELECT ORDER_SQN
FROM dv.H_ORDER dv
WHERE (src.ORDER_ID = dv.ORDER_ID)
) as ORDER_SQN
FROM stage.”Order” src
) stg
WHERE NOT EXISTS (SELECT 1
FROM DV.L_Customer_Order dv
WHERE (stg.CUSTOMER_SQN = dv.CUSTOMER_SQN)
AND (stg.ORDER_SQN = dv.ORDER_SQN)
);
Satellites
Satellites add all the color and description to the business keys (hubs) and relationships (links) in the data vault environment. Satellites contain all the descriptive information, tracking change by start and end dates over time, to let one know the information in effect at any point in time. In the purest sense, satellites are time aware and therefore tracks change over time as its main function. Satellites are always directly related and are subordinate to a hub or a link. They provide context and definition to business key(s). A satellite record is added when a change is detected in the processing. In some cases, there may be multiple satellites pointing to one hub or one link. The reasons for doing this could be multiple sources, or rate of change, or by data type.
The basic structure and treatment of the Link table is as follows:
Mandatory Columns
-
- Hub or Link Sequence Identifier
- Load Date
- Load Date End
- Record Source
Optional Columns
-
- Attributes (may be only one, but usually a lot more strings, numbers, or dates)
Loading Pattern
-
- Select list of attributes from the source
- Add timestamp and record source
- Compare to the existing applicable set of satellite records and insert when a change has been detected
- Lookup and use the applicable Hub identifier or the Link identifier
Note: a two-step process is generally employed when using a Load End Date to set the time effective properly for satellites
Code Sample
SELECT stg.PRODUCT_SQN
, stg.LOAD_TIMESTAMP
, stg.RECORD_SOURCE
, stg.PRODUCT_DESC
FROM ( SELECT (SELECT dv.PRODUCT_SQN
FROM DV.H_Product dv
WHERE (src.PRODUCT_NAME = dv.PRODUCT_NAME)
) as PRODUCT_SQN
, src.LOAD_TIMESTAMP
, src.RECORD_SOURCE
, src.PRODUCT_DESC
FROM stage.[Order] src
) stg
WHERE NOT EXISTS (SELECT 1
FROM dv.S_PRODUCT dv
WHERE ( stg.PRODUCT_SQN = dv.PRODUCT_SQN)
AND ( stg.PRODUCT_DESC = dv.PRODUCT_DESCRIPTION)
AND dv.LOAD_TIMESTAMP = (SELECT MAX(dv1.LOAD_TIMESTAMP)
FROM DV.S_PRODUCT dv1
WHERE dv1.PRODUCT_SQN = dv.PRODUCT_SQN)
);
With this quick overview of the basics of the data vault model, I hope you can see the simplicity in the design as well as the pattern based loading process. As you can see, whether you have 1 on 10 hubs or links, they should all look structurally similar as well as load in a similar fashion. This drives down overall development and support costs when the Enterprise Data Warehouse is supported by a data vault. Also, designers and developers that are new to the concepts generally can be up and productive in short order.
So if you are…
- Currently engaging a data warehouse environment that is becoming harder and harder to support and maintain over time
- Needing to address performance problems
- Hoping to get your data governance problems addressed
- Wanting more of a rapid and agile development process
- Concerned about the current ETL processes having become rigid and difficult to support
- Suffering from the lack of Business Rules maintenance and management
- Embarking on a new Business Intelligence endeavor and would like to increase likelihood of success
…then the data vault methodology may be the answer for you.
Along with the loading patterns and models outlined here, there are many other benefits to applying this architecture and process to your Business Intelligence needs.
0 Comments