Often in the business world, things come along that drive change. Sometimes, the change could be subtle. Sometimes the change is dramatic. In a lot of cases, there are those that are early adopters, as well as those that one can count on to wait until the change is deemed mature by the industry experts. [...]
Often in the business world, things come along that drive change. Sometimes, the change could be subtle. Sometimes the change is dramatic. In a lot of cases, there are those that are early adopters, as well as those that one can count on to wait until the change is deemed mature by the industry experts. It is important also to put forth a value proposition to adopt the change; to change for change’s sake generally doesn’t help the bottom line. In some cases, this change crosses industries and technology, but sometimes the change is among clusters of business that gives one a competitive advantage over another.
Consider some of the transformations that have happened over the past few decades:
- The computer
- The internet
- Mobile technology
- Laws (SOX, etc.)
- Hybrid vehicles
- Global economics
- Social networks
Companies that were able to see these changes coming were able to prepare. Companies that didn’t see it coming were forced to adapt. It is obvious to see that being prepared is arguably the best alternative, but in many cases and for many reasons, this may not be financially, socially, or logistically possible during the early adoption period. Eventually, over time, more and more companies adapt the change into the culture or technology as it becomes more and more pervasive and valuable. Those that take the lead and the initiative generally experience growing pains, but also gain advantages of working out the kinks of emerging technology or change. Those that wait may need to face consequences for the delay but prolong the disruption to critical process for as long as possible.
So what would we say are some changes that are on the way if not already here that will help shape the next decade of business activity? One would argue that “Big Data” is a great candidate for the next major change we are beginning to feel the effects of as its existence begins to transform the business landscape.
So what is “Big Data”? According to Gartner[1] Big Data is defined as:
Big data is the term adopted by the market to describe extreme information management and processing issues which exceed the capability of traditional information technology along one or multiple dimensions to support the use of the information assets. Throughout 2010 and into 2011, big data has focused primarily on the volume issues of extremely large datasets generated from technology practices such as social media, operational technology, Internet logging and streaming sources. A wide array of hardware and software solutions has emerged to address the partial issue of volume. At this point, big data, or extreme information processing and management, is essentially a practice that presents new business opportunities.
The definition leaves the reader with a lot questions. What is “extreme information management”? What are the “hardware and software solutions” that attempt to address the issues? What are the “new business opportunities”? Let’s take a look some of these questions in more detail.
Extreme Information Management
When we look at the sheer volume or magnitude of some of the datasets that would make up a Big Data solution, it is clear the traditional processes of data management will be challenged. The normal ways of loading, storing, processing, evaluating, and analyzing this information has to change in order to reap the benefits inside the datasets. This means the normal way a technologist would write a query needs to change. The usual backup and recovery process has to be re-considered. In addition, how this “extreme” information is combined with the “traditional” information is the new conundrum businesses that are going down this path are facing. In a practical sense, a company that has a traditional data warehouse in place will be faced with the challenge of aligning existing dimensions with the new “dimensions” from the big data arena. If this challenge can be solved for an enterprise, then the competitive advantage discussed earlier can be realized because the power of leveraging changes. If a company can at least begin evaluating the contents of big data as it relates to their enterprise, they will still gain valuable insights and correlations that are eluding the competitors that will wait out the change until it is mature.
Hardware and Software Solutions
When a company is considering a big data solution, immediately the existing infrastructure stack will be impacted. There are a lot of vendors that are stepping up into the technology gap to help customers move forward with initiatives. In a lot of cases, storage and processing power will become immediate needs. Deep in the internals of big data solutions is the idea of multiple worker servers that take on distributed work to break down the processing into smaller chunks. This can be achieved with potentially low cost servers, but a lot of them depending on the scale desired. The open source community has been a leader in this arena of solutions that help address the big data problem with low cost entry into public offerings. For the technologist, this means learning new ways of addressing data access using environments like Hadoop (http://hadoop.apache.org/) or HPCC (http://hpccsystems.com/). The traditional ETL tool vendors like Pentaho (http://www.pentaho.com/) are also attempting to get in the game by extending their products to work in a big data environment. In the long run, we expect the vendors to all address “Big Data” in some form or fashion, suggesting the change is here to stay. Therefore, it would be wise to do some research in the current vendor offering or begin to experiment with some of the solution options that are available, in order to begin to reap some of the benefits of a big data solution.
New Business Opportunities
When considering what new business opportunities will be presented as a result of engaging in a Big Data environment, the potential is alluring. When dealing with datasets in the terabyte and petabyte range, the scope and makeup of the results go from specifics to generalities in rapid order. It is currently feasible to look at things in a dimensional model like perspective. It is difficult and in some cases no longer possible to calculate results as you would normally do in traditional database environments because the sheer physics have changed. But there are opportunities for businesses to integrate this new landscape into their current Business Intelligence environments. Some companies are finding ways to use sources from public web sites to private applications to drive value and competitive advantage. The government is even seeing the opportunity and critical need to get involved[2]. With this much activity related to a single discipline, we would recommend actively seeking out the opportunities in your own business environment. More than likely, your competitors already are.
What are the challenges?
Some of the challenges will be related to identifying sources, because sometimes these are external to a business and require creative thought to see the big picture and the analytical opportunities. Additional challenges will be related to bringing a technical staff up to speed on new technology and new ways of thinking related to size and scale of Big Data environments. Furthermore, there will be changes to “normal” data acquisition and storage policies. In addition to all this, there still is the integration issue. This issue can be more complex for companies that already have a mature data warehouse environment, and wish to navigate from a specific dimensional model to a less specific Big Data analysis. The bridge between the two can be daunting, physically and philosophically.
What does this mean for you?
Although Big Data is still an emerging technology and concept, there is enough momentum in the marketplace that warrants serious consideration for customers and clients. There are still significant challenges in bringing it all together in a comprehensive Business Intelligence environment. The ability for systems to collect more and more data points going forward is inevitable, and systems that can evaluate the large data sets will become more and more ingrained in the normal technical environment. The traditional systems and big data systems will become more seamless over time. The more companies can leverage the information investment on all the data points as they come available or affects their business, the more effective they will become.
If you would like to learn more about Big Data and what impact it may have in your environment or are just curious about how to approach the challenges in this new arena, we would love to help work through it with you.
More information on BIG DATA can be found online at www.lucruminc.com. Or to see how LÛCRUM can help your business with its BIG DATA needs, contact us.
[1] (http://www.gartner.com/it-glossary/big-data/),
[2] (see article here: http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_press_release_final_2.pdf)
A great list of big data sources is found here: http://www.zdnet.com/top-10-categories-for-big-data-sources-and-mining-technologies-7000000926/
So my business sponsors and senior architects have decided to build a data vault. We have already recognized and considered the benefits of changing course for our enterprise. We spent a lot of time considering the business benefits that a different approach to business intelligence would provide. Some of these business related benefits that we [...]
So my business sponsors and senior architects have decided to build a data vault. We have already recognized and considered the benefits of changing course for our enterprise. We spent a lot of time considering the business benefits that a different approach to business intelligence would provide. Some of these business related benefits that we identified are:
- Supports functional areas of business
- Integrates business keys that cross functional areas
- Deep historical tracking of information as it changes over time
- Need to load 100% of the data 100% of the time
- Conceptual and logical models of the business are natural representations in data vault (DV) structure
We also took the time to evaluate potential technical benefits. Some of the most important benefits to us were the following:
- Apply business rules on the way out to data marts
- Run ETL processes in parallel
- Flexible and adaptable to change in business requirements over time
- Auditable back to the source system
- Compliance
- Supports agile development approach
- Simple ETL load patterns allow for code generation
Now that my company has made the decision to move forward with this new Data Vault Methodology approach to Business Intelligence, where do I begin? Well, let’s start with the basic building blocks of a data vault. A data vault can be as simple as a hub and a satellite, but in practice, there is generally a lot of each type.
Remember: a Hub is a collection of business keys. A link tracks the relationship between hubs, or potentially with other relationships (links). A satellite is the time sensitive collection of attributes related to either an only one hub or link.
Here is a sample data model with the end in mind. Notice the Hubs, Links, and Satellites are all here and are appropriately related to each other.
So lets dig a little deeper into the purpose of each and how to model and load them effectively.
Hubs
Hubs are the containers for business keys. They are the most important facets of the data vault methodology. The more successfully one is able to identify business keys the less refining of the model will follow. Business keys can be identified using a multitude of strategies. Sometimes it is from interviewing business users, sometimes it is from reviewing data models (primary keys or unique keys), sometimes it is from metadata systems that have identified key important information, as well as other areas.
The basic structure and treatment of the Hub table is as follows:
Mandatory Columns
- Hub Sequence Identifier (generally a number generated from a database)
- “Business Key” Value (generally a string to handle any data type)
- Load Date (generally a date and time)
- Record Source (generally a string)
Loading Pattern
- Select Distinct list of business Keys
- Add timestamp and record source
- Insert into Hub if the value does not exist
Code Sample
SELECT DISTINCT
stg.LOAD_TIMESTAMP
, stg.RECORD_SOURCE
, stg.CUSTOMER_ID
FROM stage.Customer stg
WHERE NOT EXISTS (SELECT 1
FROM dv.H_Customer dv
WHERE (stg.CUSTOMER_ID = dv.CUSTOMER_ID)
);
Links
Links stores the intersection of business keys (HUBS). Links can be considered the glue that holds the data vault model together. These tables allow for the data model to elegantly change over time because they can come and go as required by the business. Links also allow for the model to be created quickly without worry about whether the relationship is one to many or many to many. In addition, the flexible nature of link tables provides the option to add or drop link tables as requirements change throughout the maintenance lifecycle of the data warehouse or as part of a data mining exercise.
The basic structure and treatment of the Link table is as follows:
Mandatory Columns
- Link Sequence Identifier (a database number)
- Load Date and Time (generally a date field)
- Record Source (generally a string)
- At least two Sequence Identifiers (either from Hubs or other Links and are numbers)
Loading Pattern
- Select Distinct list of business Key combinations from source
- Add timestamp and record source
- Lookup data vault identifier from either Hub or Link
- Insert into Link if the value does not exist
Code Sample
SELECT
stg.LOAD_TIMESTAMP
, stg.RECORD_SOURCE
, stg.CUSTOMER_SQN
, stg.ORDER_SQN
FROM ( SELECT LOAD_TIMESTAMP
, RECORD_SOURCE
, (SELECT CUSTOMER_SQN
FROM dv.H_CUSTOMER dv
WHERE (src.CUSTOMER_ID = dv.CUSTOMER_ID)
) as CUSTOMER_SQN
, (SELECT ORDER_SQN
FROM dv.H_ORDER dv
WHERE (src.ORDER_ID = dv.ORDER_ID)
) as ORDER_SQN
FROM stage.”Order” src
) stg
WHERE NOT EXISTS (SELECT 1
FROM DV.L_Customer_Order dv
WHERE (stg.CUSTOMER_SQN = dv.CUSTOMER_SQN)
AND (stg.ORDER_SQN = dv.ORDER_SQN)
);
Satellites
Satellites add all the color and description to the business keys (hubs) and relationships (links) in the data vault environment. Satellites contain all the descriptive information, tracking change by start and end dates over time, to let one know the information in effect at any point in time. In the purest sense, satellites are time aware and therefore tracks change over time as its main function. Satellites are always directly related and are subordinate to a hub or a link. They provide context and definition to business key(s). A satellite record is added when a change is detected in the processing. In some cases, there may be multiple satellites pointing to one hub or one link. The reasons for doing this could be multiple sources, or rate of change, or by data type.
The basic structure and treatment of the Link table is as follows:
Mandatory Columns
- Hub or Link Sequence Identifier
- Load Date
- Load Date End
- Record Source
Optional Columns
- Attributes (may be only one, but usually a lot more strings, numbers, or dates)
Loading Pattern
- Select list of attributes from the source
- Add timestamp and record source
- Compare to the existing applicable set of satellite records and insert when a change has been detected
- Lookup and use the applicable Hub identifier or the Link identifier
Note: a two-step process is generally employed when using a Load End Date to set the time effective properly for satellites
Code Sample
SELECT stg.PRODUCT_SQN
, stg.LOAD_TIMESTAMP
, stg.RECORD_SOURCE
, stg.PRODUCT_DESC
FROM ( SELECT (SELECT dv.PRODUCT_SQN
FROM DV.H_Product dv
WHERE (src.PRODUCT_NAME = dv.PRODUCT_NAME)
) as PRODUCT_SQN
, src.LOAD_TIMESTAMP
, src.RECORD_SOURCE
, src.PRODUCT_DESC
FROM stage.[Order] src
) stg
WHERE NOT EXISTS (SELECT 1
FROM dv.S_PRODUCT dv
WHERE ( stg.PRODUCT_SQN = dv.PRODUCT_SQN)
AND ( stg.PRODUCT_DESC = dv.PRODUCT_DESCRIPTION)
AND dv.LOAD_TIMESTAMP = (SELECT MAX(dv1.LOAD_TIMESTAMP)
FROM DV.S_PRODUCT dv1
WHERE dv1.PRODUCT_SQN = dv.PRODUCT_SQN)
);
With this quick overview of the basics of the data vault model, I hope you can see the simplicity in the design as well as the pattern based loading process. As you can see, whether you have 1 on 10 hubs or links, they should all look structurally similar as well as load in a similar fashion. This drives down overall development and support costs when the Enterprise Data Warehouse is supported by a data vault. Also, designers and developers that are new to the concepts generally can be up and productive in short order.
So if you are…
- Currently engaging a data warehouse environment that is becoming harder and harder to support and maintain over time
- Needing to address performance problems
- Hoping to get your data governance problems addressed
- Wanting more of a rapid and agile development process
- Concerned about the current ETL processes having become rigid and difficult to support
- Suffering from the lack of Business Rules maintenance and management
- Embarking on a new Business Intelligence endeavor and would like to increase likelihood of success
…then the data vault methodology may be the answer for you.
Along with the loading patterns and models outlined here, there are many other benefits to applying this architecture and process to your Business Intelligence needs.
To find out more on Data Vaults or how LÛCRUM can help your business by making your data meaningful contact us.
MicroStrategy – Have you looked at THEM lately?
If you haven’t done so lately, it’s time to take another look at MicroStrategy. They have done a great job in updating their offerings to match what is currently hot in the marketplace. They couple eloquently form and function together to please customers and technicians. You are sure to find interesting and thought provoking content [...]
If you haven’t done so lately, it’s time to take another look at MicroStrategy. They have done a great job in updating their offerings to match what is currently hot in the marketplace. They couple eloquently form and function together to please customers and technicians. You are sure to find interesting and thought provoking content within their product offerings. And you might even find things that have personal application as well as business application.
Here are the latest focus points in their strategy:
MicroStrategy Cloud Intelligence
One of the pillars MicroStrategy is focusing on is Cloud Intelligence. The details about the structure and function of the cloud are easy to understand. It is a seamless fit into the Microstrategy BI environment. After an initial perusal of the offering, it is easy to see the power, flexibility, and security of cloud computing, and how one is able to utilize it to drive their business decisions and adoption within organizations. MicroStrategy has gone to great lengths to highlight the advantages, the steps necessary to setup and deploy, and as a result benefit from the MicroStrategy Cloud. I believe this is a special niche that those who can visualize how to ramp up business intelligence projects without a lot of the normal overhead of software/hardware procurement as part of projects. One is left to focus on what they do best, and leverage an optimized platform as part of the overall deliverable.
MicroStrategy Mobile Intelligence
Microstrategy has bet the business to emphasize the importance of mobile intelligence. They believe that it will overtake the traditional web browser based intelligence that is prevalent today. MicroStrategy focuses on educating the business and developer community about the value of the MicroStrategy Mobile platform. It is easy to gain access to learn how to use the MicroStrategy platform to design, build out, maintain, support, and customize visually enticing apps for multiple output devices (iPad, iPhone), while leveraging the enterprise-caliber features of the MicroStrategy BI platform. This is achieved by implementing the metadata layer that governs all content. Highlighting such functionality, it helps to show that MicroStrategy is clearly the market leader within this pillar and there were customer stories to back up this claim.
MicroStrategy Social Intelligence
This is the most unusually interesting pillar due to the cutting edge nature of it. MicroStrategy Social Intelligence solutions is designed for both commercial customers and the consumer in the marketplace. MicroStrategy has built a bridge between the two that is compelling and an opportunity for those that have the courage to leverage it. MicroStrategy latest offerings that enable in-depth analysis of the Facebook fan base. They also focused on how to apply the research in the wealth of information available at Facebook to deliver very effective marketing campaigns, which basically makes the older style CRM systems obsolete. MicroStrategy walked through the steps and their products that help make this happen, which turn the promise of social media content into real business opportunities. Once engaged, a loyal fan-base turns into revenue… and companies that understand this and who those customers are, they will achieve a competitive advantage.
MicroStrategy Big Data
Big Data is here now. MicroStrategy has methods and technology that help clients deal with the extreme data volumes. The point was made that companies need to have the ability to use very large databases and data sets to make intelligent business decisions to drive growth and gain competitive advantages. Often interesting information is lurking in the details and MicroStrategy provides a method to make sense of it. MicroStrategy also offer features such as improved self-service that reduces the reliance on IT, when it comes to navigating the business intelligence architecture. There is even the possibility to connect MicroStrategy to Hadoop and begin to analyze web logs in a very easy to consume fashion. In addition, MicroStrategy focused on high performance across the entire platform to eliminate latency issues and meet performance goals.
The time is now to take a fresh look at Microstrategy. They are a big time player in the tools space to enable Business Intelligence. You won’t regret it.
There were a few interesting posts this past week as it relates to the Data Vault Methodology and emerging trends in Business Intelligence (BI). These are must reads for BI practitioners who wish to stay current on where the industry is heading as well as how and why things should be done in a certain [...]
There were a few interesting posts this past week as it relates to the Data Vault Methodology and emerging trends in Business Intelligence (BI). These are must reads for BI practitioners who wish to stay current on where the industry is heading as well as how and why things should be done in a certain way or method. Again, as Dan Linstedt, the inventor of the methodology is known to say, the data vault was created to solve specifically the Enterprise Data Warehouse (EDW) problem.
The first entry is the slide deck from the recent Advanced Architecture Conference in Denver: http://danlinstedt.com/datavaultcat/datavault-advanced-architecture-conference-slides/
There is a wealth of information contained herein that outlines at a high level the benefits and reasons why the data vault method is enabling pervasive BI. It goes into details about comparing 3rd normal form and the star schema form to the value the data vault brings to the table. It outlines the components of a data vault, including the hubs, links, satellites, PIT tables, and bridge tables.
It was nice to see the connection with the methodology to emerging tools. There are now more than a handful of tools that are undertaking the task of automating the ETL and loading processes of an EDW. The method of the data vault allows for a lot of the technical innovation. Dan presents the case well that businesses today want answers faster and cheaper today as well as having the system flexible enough to evolve with the business over time. Most EDW systems today fall short of being pervasive, and often that is a methodology problem that appears to have an answer.
In addition to these topics, Dan pushed forth the proposition of an “Operational Data Vault” and began to lay out how this connects to Business Intelligence. This section was very interesting and helped to turn on a few light bulbs and progressive thoughts. I love the idea of pushing the intelligence as far as possible while still keeping the style and benefits. I agree with Dan that this is still so new that there are not yet any vendor applications or tools that can help with this…yet.
The presentation does leave a little bit of gap with the more advanced concepts, such as change management and how to leverage a data vault using agile principles. It is also a bit light on how to drive a project plan as well as what to watch out for while attempting to build a data vault style EDW. All in all, he presents a very compelling case as to why the data vault should be strongly considered when delivering Business Intelligence.
The second entry is a public rebuttal to a challenge about the validity of the data vault methodology:
http://danlinstedt.com/datavaultcat/datavault-new-response-to-frank-habers-part-1/
As Dan states, he felt the need to lay out the facts and correct misconceptions from the original author. After you get past the first few sections of disclaimers and background, the article does a great job of comparing and contrasting methodologies and technical styles for building the EDW.
I often see HDSA (persistent historical data storage area) implemented at client sites and they sometimes go by different names (ODS, staging, etc.). I have yet to see a better discussion of the merits of a data vault compared to the HDSA as he presents here. If you are in an environment that has an ODS or persistent history area, please take a moment to read this section, because the warnings and issues are real and one should not make decisions here lightly because of the long term ramifications.
Dan also goes into great detail to itemize the benefits and value proposition of why you take the steps to have a data vault at the core of the EDW.
I would say that the overall tone of the article still makes it a bit hard to read, but if you can overlook the defensive stance, you will find many gems that will help solidify and justify the purpose and value of the data vault compared to any other technique.
In order to deliver a high quality application, testing is a necessary component of the deliverable portfolio. Often this step is overlooked, underappreciated, or worse, rushed and hurried to meet a deadline. The best solution would be to integrate testing throughout the development process.
The way to approach testing a Business Intelligence (BI) system is [...]
In order to deliver a high quality application, testing is a necessary component of the deliverable portfolio. Often this step is overlooked, underappreciated, or worse, rushed and hurried to meet a deadline. The best solution would be to integrate testing throughout the development process.
The way to approach testing a Business Intelligence (BI) system is to get the business to have ownership and buy-in early and often. The business users should be writing test cases and be responsible for executing them from a business perspective as well as trains them on the content in the system. The technical people should be ready to assist with query development or whatever to help the testing to be completed.
There should be some validation that is part of the design of the Extract, Transform, and Load (ETL) process itself. Some of this is to make sure mechanically things happened as they should and that there are appropriate logs when they don’t. In addition, the ETL developers should perform some kind of UFI (Unit Function) testing prior to moving to a TEST environment as well as a code review or peer review. Depending on the complexity of the ETL process, one generally don’t test each component of the process due to the details involved, but focus more on the net result of the test (i.e., all rows were inserted with no errors and all columns contain values – what happened in between is not as important to test because the load was successful).
In addition, the technician should take the next step of developing quality controls that make sure what was in the final table structures is what was expected. For example, have a report from the Operational Data Store (ODS) area that groups and sums some business keys with some key metrics and compared them to the results from the new implementation area, while highlighting only variances. This should be sent to a data governance team every morning. As long as it was clean, the BI team is sure mechanically things were working pretty well.
Depending on some of the business rules implemented, one may need to have reports that highlight “Unknown” values and other things that need to be dealt with by the business. Some of these scenarios should become test cases. The business should be using the Ad-hoc environment to be writing reports and queries to test the results. Ultimately, these reports should be reviewed by the data stewards as part of the data governance initiative.
For the most thorough results as well as the highest quality BI environment, everywhere there was a business rule implemented, there should be a test case that verifies the rule was implemented correctly. Depending on the volume and complexity, one may need to prioritize them and tackle the most important ones first.
For the documentation, it can be as simple as keeping a spreadsheet with the following items:
• Test Case #
• Test Case Description
• Tester
• Date Tested
• Expected Outcome
• Actual Outcome
• Pass/Fail
It is critical, as originally stated, to get the business users involved in the testing of the deliverables. There have been cases where the business thought what they were using to compare balances with was correct, but were eventually convinced the BI application was correct and they had a broken business process instead. This is most difficult because it is on a case by case basis, but this usually becomes the biggest challenge and hurdle that needs to be overcome to be perceived as successful. Ultimately the business must provide you with the information to know whether “the values put into the Data Warehouse or BI dashboard are correct”. And you are completely dependent on the business rules they gave you are correct (a lot of times they aren’t in version 1)…and even more risky if there is no data governance process in place.
A word of caution, if you don’t get the business buy-in on testing, they will certainly blame you when things aren’t correct in production (especially if that happens for things that were overlooked in testing). It is wise to have a step where the business had to sign off on testing and that they were comfortable with what was moving into production and that was very helpful when issues arose. Because they were involved in the process and it was not mostly IT doing the testing, finger pointing was kept to a minimum. In addition, shared success and teamwork was fostered bridging the gap between business users and Information Technology (IT) groups that sadly exists in a lot of organizations.
There are lots of tips and tricks out there that help technicians develop successful Business Intelligence (BI) implementations. Some of these ideas come from real world experience, some from industry experts, and some from personal preference. I thought it would be a good idea to put together a list of some of the things to [...]
There are lots of tips and tricks out there that help technicians develop successful Business Intelligence (BI) implementations. Some of these ideas come from real world experience, some from industry experts, and some from personal preference. I thought it would be a good idea to put together a list of some of the things to consider when building out the architecture.
Therefore, if you want to jump start your BI implementation, I suggest considering the following steps to help you minimize rework and avoid common pitfalls that lead to higher costs and longer delivery time:
1. Change the name of the dimension TIME_DMN to DATE_DMN because there should be a date and a separate time dimension available in the EDW. It is not unusual to see this combined together, but there is much more flexibility with a separate physical date dimension and time dimension as well as much fewer rows.
2. Use a tool to load all the date value possibilities desired instead of coding this by hand. Perform a one time “seeding” of the dimension and be finished with it. (See http://www.kimballgroup.com/html/MDWT08Content/Ch07_Date_Dim_2000-2020.xlsx). Why reinvent the wheel when the experts have spelled it out for all to use?
3. Develop as much of the ETL code as close to the database as possible. The database will give you the best performance if you can get away with it. I generally use ETL tools to call a SQL statement (views) in the database and write into another object in the same database. The ETL tool is more of a job control mechanism rather than a comprehensive logic flowing mechanism. This does require a methodology that keeps the ETL pattern based and simple like a data-vault.
4. Use a “data vault” instead of the ODS as the core of the data warehouse to store all history. Loading should be from files to a staging area. Load from the staging to data-vault and from data-vault to data-mart. The ODS could be used as an online repository for the files or loads and should closely resemble the file layouts and no business logic applied. The ODS may serve other purposes as well and may need to be treated as another source system.
5. Use a tool or combination of tools to generate ETL code and database structures as well as document. These tools exist on the market today and obviously code generators will consistently outperform even the best coders (see http://thefuturevalueofbusiness.com/data-vault-applying-the-story-of-john-henry.htm).
6. Fact tables should all be numeric columns. In a perfect world and for performance reasons as well as logic, stick to numbers. Try to move all non-numeric columns to dimensions. The databases will thank you when the table has lots of rows but the width of the rows is as narrow as possible.
7. Constraints in a star schema should be “RELY” – the ETL process should guarantee the validity. This one relates to Oracle, but is useful to make a point that the downstream systems will still see the constraints but the database does not need to incur the overhead of maintaining the relationship. The ETL process will do a fine job of making sure this is correct.
8. Check constraints are unnecessary as well (check for NOT NULL) on tables. This is more overhead that can be avoided. This is more important in transaction systems and again the ETL process should be dealing with this if it is important.
9. Dimensions should have indexes on columns that are often used in GROUP BY clauses. Stars will perform better if there are indexes against the columns that are mostly used to group and filter on.
10. The attributes about the ETL job should be stored in a separate table with a FK in all the tables that point back to the run. This helps keep fact tables lean and aids in consistency of the ETL process itself. It may also help in the cleanup effort of errant ETL runs.
11. Data governance should be happening now to review, cleanse, establish business rules, and sign off on the master dimensional data. It’s never too early to start on this process. The sooner lists and value sets are presented to the business around key business elements the earlier the discussion can begin to resolve issues and assign ownership.
12. Information that is used for reporting should never be destroyed (Preliminary vs. Final) unless signed off by the business. Once the information is loaded into the BI environment, it is a wise idea to keep it around. At some time the question will be asked “Why did we make THAT decision?” Keeping this information off-line or on-line is a different discussion.
13. All character columns should be defined as VARCHAR(255) instead of variable smaller widths in order to handle most column size situations. I see this often and I am puzzled as to why it is still being done this way. Modern databases are pretty good at storing variable length fields and not wasting space in the process. You will have a problem once you come across a new data source that has a value that is wider than what you sized the existing column. This sometimes comes along after a new acquisition gives you a new data source for your business.
14. Development should not be locked down. It stifles productivity and creates a bottleneck. There still needs to be peer and code reviews but that should happen prior to promotion to the Test environment while things are being designed and developed. Source control helps here too. Prototyping requires quick response and the faster things can change while the design is being “vetted” the better.
15. Use database sequences or identity columns instead of using what the ETL tool provides. It is generally a bad idea to rely on the ETL tool to set the “surrogate” keys. Moving code between environments becomes a real challenge. Databases have built functionality for this purpose, so it’s a good idea to use it.
Even implementing one of these steps will be beneficial, you should see marked improvement if you can do the whole list.
I am sure there are many more steps. I would love to hear your feedback and experience. Have you tried the steps and had a different outcome? What steps would you add or remove?
John Henry is an American folk hero, notable for having raced against a steam powered hammer and won, only to die in victory with his hammer in his hand. He has been the subject of numerous songs, stories, plays, and novels. He had a 20 pound (9 kg) hammer that he thought was light.
Like [...]
John Henry is an American folk hero, notable for having raced against a steam powered hammer and won, only to die in victory with his hammer in his hand. He has been the subject of numerous songs, stories, plays, and novels. He had a 20 pound (9 kg) hammer that he thought was light.
Like other “Big Men” such as Paul Bunyan and Pecos Bill, John Henry also served as a mythical representation of a group within the melting pot of the 19th-century working class. He grows to become the greatest “steel-driver” in the mid-century push to erect the railroads across the mountains to the West. When the owner of the railroad buys a steam-powered hammer to do the work of his mostly black driving crew, to save his job and the jobs of his men, John Henry challenges the owner to a contest: himself alone versus the steam hammer. John Henry beats the machine, but exhausted, collapses and dies.
His story is usually seen as an archetypal illustration of the futility of fighting the technological progress that was evident in the 19th century upset of traditional physical labor roles. Some labor advocates interpret the legend as illustrating that even the most skilled workers of time-honored practices are marginalized when companies are more interested in efficiency and production. Although John Henry proved himself more powerful than the steam-drill, he worked himself to death and was replaced by the machine anyway.
Note: the above excerpt was taken from http://en.wikipedia.org/wiki/John_Henry_(folklore).
So how does this story apply to the Information Technology industry in the 21st century? I find in my travels a similar resistance to efficiency and production, although it would be due to different motives. In this day and age, the most opportune ways to drive down costs over the long term is to automate as much as possible. I still see and talk to technicians that are still trying to build and manage the entire Business Intelligence environment by hand, basically wearing themselves out trying to keep up with maintenance and the rate of change demanded by business requirements.
So what is the root cause of the resistance? I would suggest that there is a comfort level in the technologist in us that entire works hard to reach a level of competence and skills and then, once attaining that level, resists enabling technologies. This is akin to John Henry feeling that he was able or better than a machine in doing a particular job. We, myself included, are comfortable with what we know how to do. Fear holds us back as well, because if we are able to generate the ETL job process, we will be replacing ETL developers with machines. Change requires us to leave our comfort zone and see beyond our own fears, in order to deliver real value to our customers.
In the information technology industry, as it has evolved over time, components move more to a modular or commodity approach as efficiency and refinement processes are applied over time. You see this with software products on the market today that have “wizards” that do routine or repetitive tasks. The more repetitive tasks can be commoditized, the more efficient the design and development process becomes. This saves real time and real dollars on development projects.
So what is one component, among many, to target for this approach? So in the realm of Business Intelligence, there is a pattern based approach to data modeling called the Data Vault ( www.learndatavault.com ). This pattern is manifested in three basic types of structures: hubs, links, and satellites. The nature of these structures allow for a repetitive process to be involved in the loading of the Enterprise Data Warehouse (EDW). For the most part, information is inserted, not updated or deleted. It tracks all changes over time, is integrated, and is auditable back to the source system. From the EDW, the standard data marts can be generated (and in some cases, this can be done with views). Because the pattern is restricted to three basic types, it becomes simpler to generate the loading ETL process using a machine. It is still not a perfect science, and depends on the value of the source model – but the point is the more that can be generated will save time and money over the long term.
So how does the technologist address the “John Henry” in each of us in order to provide the best service to our customers? First, one has to admit that it is good and beneficial to have a machine generate code and processes that were once done by hand. Secondly, one has to be open-minded to realize the benefit. Thirdly, one has to have the courage to take on change, by realizing that inside the change component, there is opportunity for future growth. This is manifested in customers being pleased that high quality projects are produced within a predictable cost and timeframe. By products of this method also reduce support costs over time. It’s a win for the technologist as well as success for the customer.
It’s time to embrace the “machine”, and allow it to do what it does best. The technician needs to move on ahead and add the thought leadership to the projects that will harness the power of smart design with efficient development patterns. The knowledge and knowhow is here today. I know. I have seen it done. I will continue to lay out the design patterns over the next few entries of what it takes to move in this direction and leverage the power of the data vault to bring world class solutions to our clients.
In this day and age, it seems to be trendy to gravitate to the flash and splash of the latest and greatest user facing tools to address Business Intelligence issues. Some believe that if they just get a dashboard and a few nifty graphs, all of a sudden they will have “answers” flowing through their [...]
In this day and age, it seems to be trendy to gravitate to the flash and splash of the latest and greatest user facing tools to address Business Intelligence issues. Some believe that if they just get a dashboard and a few nifty graphs, all of a sudden they will have “answers” flowing through their systems and into the reports. …Almost like it was magic.
The true Business Intelligence practitioners know better. Most modern systems still suffer under the same age old issues because they are still doing things the same old way. Some of the issues that are still prevalent are ability to change over time as the business changes and the integration of the information problem.
So when you build architecture on the quick and easy solution in a “silo”, you will eventually hit the wall when it comes to adaptability and scalability. So where does one turn when there is a need for speed as well as the ability to support mission critical reporting and analysis needs that must be able to pass audits?
There is a methodology that attempts to bridge the gaps between the typical issues in the current Business Intelligence offerings. The inventor of the data vault is Dan Linstedt (www.danlinstedt.com) where the concepts and rules are specified for successful engagements.
The data vault is not a product. It is not a magic pill that makes all your IT ills go away. It is a comprehensive approach to addressing real world issues with existing implementations. It brings real flexibility and adaptability to the implementation and brings reliability and dependability to the business. And with a team that understands the power of the data vault, you are now able to take your Business Intelligence environment upon which the tools that do the flash and splash can be sourced from.
According to Dan Linstedt, the inventor of the data vault methodology, the challenges around data integration include some or all of the following:
- Definition, or understanding of the data
- Functions or transformations applied to translate the data
- Interpretation of the data
- “MASTER” determination of the data
- Best storage and architecture of the data
- Visualization of the data
- Accountability and auditability of the data
- Traceability of the data
- Overloading (multi-use of single columns and record types) of data
- Historical data with lost definitions
- Data too big
- No change data capture/no audit trail
- Bad indexes
- No control over source feeds, source timing
- Multi-system valuation dependencies
- Missing data
- Changed Passwords
- Mis-aligned access rights
- Overflowing data
- Out-of-range data
- Bad domain data (a date field contains a string…)
Unless there is a comprehensive plan in place to deal with data integration, then it will only be a matter of time before your implementation will begin to suffer under the weight of the problem. Short sighted solutions only mask this issue for a short time, where our customers need a quick solution that will also stand the test of time and change.
And because of the nature of a data vault, this can be done in rapid releases that bring value within a few weeks of embarking on the project. Because of the style, there are now tools on the market that can generate the table and transformation logic. Once you reach this level, then change is transparent and accessible to your user community as well as the IT staff can finally keep close to the change as it is happening in the business.
According to Dan Linstedt, one would expect the following results from pursuing an implementation that included a data vault:
- Manage and enforce compliance to Sarbanes-Oxley, HIPPA, and BASIL II in your enterprise data warehouse (EDW)
- Spot business problems that were never visible previously
- Rapidly reduce business cycle time for implementing changes
- Merge new business units into the organization rapidly
- Rapid ROI and delivery of information to new star schemas
- Consolidate disparate data stores (i.e., master data management)
- Implement and deploy SOA, fast
- Scale to hundreds of terabytes or petabytes
- SEI CMM Level 5 compliant (repeatable, consistent, redundant architecture)
- Trace all data back to the source systems
With the data vault at the core of your Business Intelligence implementation, you are enabling your enterprise to be as nimble as possible without ignoring the core critical issues around data integration and change over time. Your user community will have the chance to grow at the pace dictated by business opportunity unconstrained by the “normal” issues around the traditional approaches.
Over the next few months, I will be going deeper into the components of a data vault, where it fits into enterprise architecture, and the ways to take advantage of the “Power of the Data Vault”. Stay tuned…
Categories
- Analytics and Data Visualization (28)
- Big Data (22)
- Business Intelligence (216)
- BI Best Practices (39)
- BI Tools (73)
- Business & Leadership (57)
- Data Management (16)
- Data Architecture (13)
- Data Governance (1)
- Data Integration (2)
- Data Security (3)
- Information Alignment (3)
Archives
Search



