I am sure most IT professionals, especially the ones dealing with data, have heard of data quality: the idea of monitoring data to ensure it fits its intended use with a high level of accuracy. When creating a Business Intelligence (BI) solution, how can data quality be continually monitored through the entire solution? One method is called Data Auditing. The concept might not be new, but the formalization of the process is.
Data Auditing is the process of ensuring data quality, in a repeatable and measured way, from the beginning of the BI process to the final destination. This includes validating data as it moves from source to staging to star to cube (if one exists). Wherever business logic can be implemented, a data audit can be used to make sure the quality of the data is consistent.
One example of data auditing I have used was for a ‘technical glitch’ with SQL Server Analysis Services (SSAS). The decision was made not to change source data of bad quality, but to load it into the Kimball star schema as it was. (This decision can be debated, but one major reason for it was to easily expose data quality issues to the users, who believed there would be an inconceivably small number of them.) What we discovered was that SSAS would not handle dates with a year prior to around 1500. Some of the dates in the source system had the year 200 instead of 2000, and so on. A data audit routine was designed to look for these dates in the staging tables and change them to a pre-determined default date. This allowed the SSAS job to complete and the cube to process.
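The real routine ran against staging tables in SQL Server, but the logic can be sketched in a few lines of Python. The threshold year, default date, and column names below are illustrative assumptions, not the values from the actual project:

```python
from datetime import date

# Assumed values for illustration; the real audit used a threshold and
# default date agreed upon with the business.
MIN_VALID_YEAR = 1500          # SSAS choked on years before roughly 1500
DEFAULT_DATE = date(1900, 1, 1)

def audit_dates(rows):
    """Replace out-of-range dates with a default and count what was caught.

    `rows` is a list of dicts with an 'order_date' key, standing in for a
    staging table. Returns the cleaned rows and the number of corrections.
    """
    caught = 0
    cleaned = []
    for row in rows:
        d = row["order_date"]
        if d.year < MIN_VALID_YEAR:  # e.g. year 200 typed instead of 2000
            row = {**row, "order_date": DEFAULT_DATE}
            caught += 1
        cleaned.append(row)
    return cleaned, caught

staging = [
    {"id": 1, "order_date": date(200, 3, 15)},   # bad: year 200, not 2000
    {"id": 2, "order_date": date(2000, 3, 15)},  # good
]
cleaned, caught = audit_dates(staging)
# caught == 1; the first row now carries the default date
```

The key design point is that the bad value is replaced with a recognizable default rather than silently dropped, so the row still loads and the issue stays visible.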
This process looks much like any other data quality process. The true auditing came from how these instances (and the others) were reported. The code developed to catch the data quality issues also wrote data into “performance” star schemas designed to provide data on the nightly process. This data was then displayed on a dashboard used by the internal IT/BI staff. Every morning we could see how many rows of data were caught by each audit. This allowed us to make quick decisions on whether to correct the data in the source system or in the data warehouse itself (including the staging tables, so that correcting the original source data would not add an unwanted type 2 slowly changing dimension row).
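Each audit run recorded a row in the performance star so the morning dashboard had something to aggregate. A minimal sketch of that logging step, with a plain list standing in for the fact table and all names hypothetical:

```python
from datetime import date

# Stand-in for a fact table in the "performance" star schema.
audit_fact = []

def log_audit(audit_name, run_date, rows_caught):
    """Record one audit run so the morning dashboard can report on it."""
    audit_fact.append({
        "audit_name": audit_name,
        "run_date": run_date,
        "rows_caught": rows_caught,
    })

# Two nightly runs of the same (hypothetical) audit:
log_audit("stage_invalid_order_dates", date(2011, 6, 1), 7)
log_audit("stage_invalid_order_dates", date(2011, 6, 2), 3)

# The dashboard query is then a simple aggregation per audit:
total = sum(r["rows_caught"] for r in audit_fact
            if r["audit_name"] == "stage_invalid_order_dates")
# total == 10
```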
Other data audits summed specific row counts from the source system and made sure the counts were the same in the star and the cube, given how the star was loaded. Another data audit let us show that the measures in the fact tables and the cube were the same, based on the logic used. These types of data audits are not there to catch data quality issues, but to explicitly show that, on a day-to-day basis, the results are the same. This kind of audit was essential for building confidence in the data. Confidence in “the numbers” was extremely important to our clients, and this was the simplest way for us to convince our end users they were getting what they were supposed to get.
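In practice these tie-out audits were COUNT(*) and SUM(measure) queries run against both sides. A Python sketch of the comparison, with hypothetical row sets standing in for the source system and the star's fact table:

```python
# Hypothetical rows; the real audit compared query results from the
# source system and the star's fact table.
source_rows = [{"amount": 10.0}, {"amount": 25.5}, {"amount": 4.5}]
fact_rows = [{"amount": 10.0}, {"amount": 25.5}, {"amount": 4.5}]

def reconcile(source, fact, measure):
    """Return (counts_match, sums_match) for a day-to-day tie-out audit."""
    counts_match = len(source) == len(fact)
    sums_match = (sum(r[measure] for r in source)
                  == sum(r[measure] for r in fact))
    return counts_match, sums_match

counts_ok, sums_ok = reconcile(source_rows, fact_rows, "amount")
# Both True here: the star ties back to the source for this measure.
```

Either flag going false on a given morning tells the team exactly which layer stopped agreeing with the source.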
The best way to implement a data auditing solution is to use the existing BI tools to build the report, dashboard, or any other means of exposing the audit. Even simple Excel queries against the cube and star can be used. Anything that can be easily maintained is preferred. Of course, the data needs to be understood in order to make sure this is done correctly. If data is allocated in the star at a lower level of granularity, then it needs to be summed back to the original level and compared to what is in the source. This can introduce rounding errors, but those should amount to no more than about a penny of difference.
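That sum-back-and-compare check, with a penny of tolerance for allocation rounding, can be sketched as follows (the function name and tolerance handling are my own illustration, not the project's code):

```python
def allocated_sum_matches(source_total, allocated_values, tolerance=0.01):
    """Sum allocated rows back up and compare to the source within a penny."""
    diff = abs(source_total - sum(allocated_values))
    # Round the difference to cents before comparing, so binary floating
    # point noise does not push an exact penny over the threshold.
    return round(diff, 2) <= tolerance

# A $100.00 source amount split three ways at a lower grain rounds to
# 33.33 + 33.33 + 33.33 = 99.99 -- off by a penny, so still acceptable.
ok = allocated_sum_matches(100.00, [33.33, 33.33, 33.33])
# ok == True
```

In a production warehouse a fixed-point type (such as SQL `decimal` or Python's `decimal.Decimal`) would be the safer choice for money; plain floats are used here only to keep the sketch short.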
A problem with Data Auditing is the trade-off: time spent developing the auditing process is time taken away from developing the business-facing parts of the BI solution. One way to manage this is to build auditing into each project. The first project to use data auditing will take longer, due to the need to build the underlying data structures and processes. Once those are built, the next project can ‘plug in’ with far less effort, much like re-using conformed dimensions.
Data Auditing provides a repeatable way to show that data is correct throughout the entire BI process. The concept is part of a good data quality/data governance solution. The ability to ‘watch’ the data as it moves through the entire BI process, making sure it means what it is supposed to mean, provides a security blanket for the end users. They can say, “This data is from the data warehouse and I am positively sure it is correct.” How much value to the business, and reassurance to IT, does that grant?