Which is better: Faster or Slower?

I must admit I do enjoy Beck Bennett’s series of commercials for AT&T where he poses the question, “Which is better: faster or slower?” I find his deadpan approach to a variety of co-actors and situations very humorous. The question “Which is better: faster or slower?” has interesting application in today’s information and analytics environment. Faster has always been better, correct? The scenario holds true in every industry. If you can make better decisions at a faster pace than your competitor or adversary, then you will always hold an advantage over them. However, the key isn’t just faster, but better decisions faster!

An interesting event occurred last week that made the point that faster is not always better. A short-lived Twitter hoax briefly erased $200 billion of value from the US stock market. False reports of explosions at the White House triggered a set of algorithms monitoring news feeds into a two-minute selling spree. In this case, untethered analytics only increased the pace at which we can make mistakes and caused the Dow to drop 145 points. The error was quickly identified and the Dow bounced back, but who knows what losses were incurred by algorithms reacting to the news feed, and potentially by other algorithms reacting to those algorithms.

I am fortunate to be in the information and analytics industry and am continuously astounded by the algorithms and analytics that I see people put together. However, this event continues to remind me that even the best algorithms need good data and solid IT development principles such as building in a failsafe. Perhaps we need to teach these algorithms to check their sources before taking action.

The Background

Love it or hate it, the madness is upon us.  Every March, the country gets a healthy serving (or three) of College Basketball.  Each year, approximately 40 million people fill out brackets for the NCAA Men’s Basketball Tournament and each year, every single one of those people swears that they picked everything perfectly.  If you were about to Google, “What are the odds of completing a perfect bracket?” I will save you the trouble; it is 1 in 9.2 Quintillion.  If you were about to Google, “What on Earth is a Quintillion?” the answer is a 1 with 18 zeros behind it.  To put this in perspective, the odds of winning the Powerball are 1 in 175 Million. You have a better chance of winning the Powerball multiple times than picking that bracket correctly.
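For the curious, the 9.2 quintillion figure can be reproduced with a few lines of Python. This is only a sketch: it assumes a 64-team bracket with 63 games and treats every game as a coin flip, neither of which is stated above.

```python
# Assumption: a 64-team single-elimination bracket has 63 games,
# and every game is treated as a 50/50 coin flip.
GAMES = 63
perfect_bracket_outcomes = 2 ** GAMES

# Powerball odds as quoted in this post (1 in 175 million).
POWERBALL_ODDS = 175_000_000

print(f"1 in {perfect_bracket_outcomes:,}")

# Two independent Powerball wins are still likelier than a perfect bracket.
print(perfect_bracket_outcomes > POWERBALL_ODDS ** 2)
```

Under those assumptions the number of possible brackets is 2 to the 63rd power, which is where the "1 with 18 zeros" scale comes from.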

These, however, are just numbers. I began to wonder how I could slice and dice tournament history data.  Sure, I can find which teams have won the most or lost the most.  But can I dive even further and find out which states, cities, or teams have the most wins or championships?  Which teams constantly underperform and which teams exceed expectations?

The Research

Using a data dump of NCAA Tournament History from 1939 to 2012 I was able to dive in very quickly and start seeing results.  I first wanted to see which states produced the most tournament victories.  Using Tableau I was able to visualize what the Top Ten states were in terms of victories.

Visual of the Winningest States

Using a filled map, I was able to visualize the number of wins for the top ten states.  North Carolina and California are the top two states, no doubt fueled by the powerhouse schools of North Carolina, Duke, and UCLA. I wanted to go even further and see which cities brought the championships home for their respective states.  To create this visualization I used a dual axis map combining my filled map with a symbol map.

Visual of Winningest Cities within the Winningest States

Using this visualization you can see which cities allowed the states to appear on my first map.  Los Angeles and Lexington are homes to schools that have brought home the most national championships.  Instead of using strictly numbers and labels, I was able to represent their success using a “Circle” Symbol.  The bigger the symbol the more championships achieved.

I have a clear picture of which teams succeed, but how can I find out which teams succeed, or don’t, when they are supposed to?  To do this, I needed to find out how many upsets occurred over the years.  Using the teams’ designated seeds at the beginning of the tournament, I was able to determine every upset in tournament history.  I took this data and created visualizations for teams that get upset and teams that create the upset.

Underachieving Teams Visual

Overachieving Teams Visual

I used a stacked bar chart to show how often a team was upset when it was the higher seed and, conversely, how often it pulled off an upset when it was the lower seed.  The stacked bar also helped to show that while teams like Duke and North Carolina were upset the most, it was because they had the most opportunities to be upset.  The data above shows that Kansas is an overachieving team: in 34 of 49 opportunities, they upset their opponent in the tournament.
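The upset logic itself is simple enough to sketch in Python. The game records and team names below are made up for illustration (the real analysis was done in Tableau against the tournament history data), and the only rule assumed is that a larger seed number means a weaker seed.

```python
from collections import Counter

# Hypothetical game records: (winner, winner_seed, loser, loser_seed).
games = [
    ("Team A", 12, "Team B", 5),   # upset: a 12 seed beats a 5 seed
    ("Team C", 1, "Team D", 16),   # expected result
    ("Team A", 12, "Team E", 4),   # upset
    ("Team F", 2, "Team G", 7),    # expected result
]

def is_upset(winner_seed, loser_seed):
    """A larger seed number is the weaker seed, so its win is an upset."""
    return winner_seed > loser_seed

# Teams that create upsets, and teams that get upset.
upsets_created = Counter(w for w, ws, l, ls in games if is_upset(ws, ls))
times_upset = Counter(l for w, ws, l, ls in games if is_upset(ws, ls))

# The Kansas figure quoted above, expressed as a rate.
kansas_upset_rate = round(100 * 34 / 49, 1)
print(upsets_created, times_upset, kansas_upset_rate)
```

Counting the two sides separately is what feeds the two charts: one tally for teams that create upsets and one for teams that suffer them.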

The Analysis

History shows that our top performing states are North Carolina, California, and Kentucky.  The cities that make those states successful are Lexington, Los Angeles, Chapel Hill, and Durham.  We can also see that teams such as Brigham Young, Pennsylvania and Utah State have a habit of underperforming in the tournament, while teams such as Florida, Duke, and North Carolina tend to overperform when they are the underdog.

The Conclusion

March Madness is an event loved by many, and the benefits of visualization allow me to recognize these findings very quickly. Imagine this type of data at your fingertips when you are filling out your bracket.  I certainly wish I had used it to my advantage. Now, imagine these types of visualizations fueled by your company’s data. Replace the “wins” data with company revenue data.  You would be able to identify where you are successful, and then go further down to see which cities are producing that success. This allows a quick look at your business.  Use sales leads data to fuel your stacked bar charts.  See which of your offices are receiving or submitting leads and see how well they are closing them.  Data is powerful, but using visualization tools makes data meaningful.

Cloud Security

Do you remember these recent stories?  On July 31, 2012 Dropbox admitted it had been hacked. (Information Week, 8/1/2012).  Hackers had gained access to an employee’s account and from there were able to access LIVE usernames and passwords which could allow them to gain access to huge amounts of personal and corporate data.  Just four days later, Wired® writer Mat Honan’s Twitter account was hacked via his Apple and Amazon accounts (story in Wired and also reported by CBS, CNN, NPR and others).

Did you notice the common theme behind these reports?  Hackers didn’t get through the defenses of the Cloud by brute force.  Instead, they searched out weak points and then exploited the further vulnerabilities those entry points exposed.  In these examples, as in countless others, the weak points were processes and people.

The Dropbox hack was made possible by an employee using the same password to access multiple corporate resources, one of which happened to be a project site which contained a “test” file of real unencrypted usernames and passwords.  Either one could be considered a lapse in judgment – I mean, who thinks it is a good idea to store unencrypted user access information on a project site??? – but added together, these lapses made a result much more dangerous than the sum of their parts.

Mat Honan’s hack was made possible in part by process flaws at large and popular companies.  Again, each chink taken individually would likely not have been as damaging as the series of flaws building on each other.  Apple or Amazon individually didn’t provide enough information for hackers to take over Mr. Honan’s account, but taken together their processes and individual snippets of data provided the opportunity.

My purpose in writing this isn’t to scare anyone away from the Cloud or its legitimate providers.  The Cloud is cost-effective, portable, scalable, stable, and here to stay.  And it is as secure as technology will allow.  But as these stories illustrate, technology isn’t the risk.  Information wasn’t compromised by brute-force hacking or breaking encryption algorithms.  Data was put at risk by people and processes.

Have you ever worked with someone who messed up something royally by not following a documented process?  Or do you know someone who clicked a link in a bogus email and infected their laptop – or even the whole company – with a virus?  They might be working for your Cloud provider now.  Don’t rely on those folks to protect your data in the Cloud.  Instead, protect it yourself with Backups, Password Safety and Data Encryption before entrusting your precious data to the Cloud.  If a hacker gets into your Cloud, at least you won’t be the easiest target.

Big Data

In April 2012, VisibleTechnologies.com (a social media monitoring company) reported a 1,211% increase in use of the term “Big Data” from March 2011 to March 2012 in a survey of English-language social media channels. Big Data is certainly one of the key buzzwords of our time.

In a 2001 META Group article, Doug Laney presented the three “V”s of Big Data: Volume, Velocity and Variety. Others have proposed various fourth “V”s such as Vulnerability, Veracity and Value. None of these contributes to the fundamental definition; they are consequences of it.

When people think of Big Data they often focus on the first “V”, volume; after all, it is called Big Data. But large data volumes are nothing new. Data has always been “big” relative to the technology available to make use of it.

The original Big Data was the Library at Alexandria, which contained the combined experiences and learnings of ten centuries. In 1944, the concern was that American University libraries were doubling in size every 16 years and that the number of published volumes would outpace the ability to physically store them, let alone access and derive value from them.

Data has always been big, but never nearly as massive as it is today. For over a decade, we have heard about the early pioneers of this generation’s big data: Wal*Mart, Google, eBay, Amazon, the Human Genome Project, and the new trailblazers such as Internet giants Facebook, Twitter, eHarmony, and comScore. Additionally, there are ubiquitous sensor-based data generators: hospital intensive care units, radio frequency IDs tracking products and assets, GPS systems, smart meters, factory production lines, satellites and meteorology. The list continues to grow.

Market research firm IDC estimated that 1,800 exabytes of data would be generated in the year 2011. An exabyte is a unit of information equal to one quintillion (10^18) bytes, or one billion gigabytes. Estimates report that the world produced 14.7 exabytes of new data in 2008, triple the amount generated in 2003. Cisco Systems estimates that by 2016, annual Internet traffic will reach 1.3 zettabytes; a zettabyte is 10^21 bytes, or one trillion gigabytes. To put that in perspective, all the Internet traffic in the years 1984 to 2012 has generated a total of 1.2 zettabytes. We will soon be generating in one year what has taken 28 years to accumulate.
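The unit conversions above are easy to check for yourself. Here is a minimal sketch in Python, assuming decimal (SI) units where a gigabyte is 10^9 bytes; the constant names are mine.

```python
# Assumption: decimal (SI) units, where 1 GB = 10**9 bytes.
GIGABYTE = 10 ** 9
EXABYTE = 10 ** 18
ZETTABYTE = 10 ** 21

exabyte_in_gb = EXABYTE // GIGABYTE       # gigabytes in one exabyte
zettabyte_in_gb = ZETTABYTE // GIGABYTE   # gigabytes in one zettabyte

# IDC's 2011 estimate of 1,800 exabytes, expressed in zettabytes.
idc_2011_in_zb = 1_800 * EXABYTE / ZETTABYTE

print(f"{exabyte_in_gb:,} GB per exabyte")
print(f"{zettabyte_in_gb:,} GB per zettabyte")
print(f"{idc_2011_in_zb} ZB estimated for 2011")
```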

Data Volume Across the World

The focus on the size attribute of Big Data is understandable in the face of these statistics. It stems from limitations in the technology available at the time to acquire, process, and deliver these large volumes of data quickly enough to make that data meaningful to the decision makers in the business. Traditional relational database technologies and methods of loading, storing and retrieving data were incapable of keeping pace with the speed necessary to analyze and act on the data.

With the advent of new storage and query technologies such as Hadoop, MapR, Cloudera, Teradata Aster, IBM Netezza, NoSQL, NuoDB, MongoDB, CouchDB, HBase, etc., volume becomes the least important of the three “V”s.

Volume alone does not define Big Data. Big Data is more about the second and third “V”s, Velocity and Variety. Part two of this Big Data series will delve into the Velocity factor.

 

Want more information on how LÛCRUM can help provide you with Big Data solutions? Contact us today!

So my business sponsors and senior architects have decided to build a data vault. We have already recognized and considered the benefits of changing course for our enterprise. We spent a lot of time considering the business benefits that a different approach to business intelligence would provide. Some of these business related benefits that we identified are:

    • Supports functional areas of business
    • Integrates business keys that cross functional areas
    • Deep historical tracking of information as it changes over time
    • Loads 100% of the data 100% of the time
    • Conceptual and logical models of the business are natural representations in data vault (DV)  structure

We also took the time to evaluate potential technical benefits. Some of the most important benefits to us were the following:

    • Apply business rules on the way out to data marts
    • Run ETL processes in parallel
    • Flexible and adaptable to change in business requirements over time
    • Auditable back to the source system
    • Compliance
    • Supports agile development approach
    • Simple ETL load patterns allow for code generation

Now that my company has made the decision to move forward with this new Data Vault Methodology approach to Business Intelligence, where do I begin? Well, let’s start with the basic building blocks of a data vault. A data vault can be as simple as a hub and a satellite, but in practice, there is generally a lot of each type.

Remember: a Hub is a collection of business keys. A Link tracks the relationship between hubs, or potentially with other relationships (links). A Satellite is the time-sensitive collection of attributes related to exactly one hub or link.

Here is a sample data model with the end in mind. Notice the Hubs, Links, and Satellites are all here and are appropriately related to each other.

Sample Data Model

So let’s dig a little deeper into the purpose of each and how to model and load them effectively.

Hubs

Hubs are the containers for business keys. They are the most important facets of the data vault methodology. The more successfully one is able to identify business keys, the less refining of the model will follow. Business keys can be identified using a multitude of strategies: interviewing business users, reviewing data models (primary keys or unique keys), consulting metadata systems that have already identified key information, and other sources.

The basic structure and treatment of the Hub table is as follows:

Mandatory Columns

    • Hub Sequence Identifier (generally a number generated from a database)
    • “Business Key” Value (generally a string to handle any data type)
    • Load Date (generally a date and time)
    • Record Source (generally a string)

Loading Pattern

    • Select Distinct list of business Keys
    • Add timestamp and record source
    • Insert into Hub if the value does not exist

Code Sample

-- Candidate hub rows: distinct business keys from staging not yet in the hub.
SELECT DISTINCT
       stg.LOAD_TIMESTAMP
     , stg.RECORD_SOURCE
     , stg.CUSTOMER_ID
  FROM stage.Customer stg
 WHERE NOT EXISTS (SELECT 1
                     FROM dv.H_Customer dv
                    WHERE (stg.CUSTOMER_ID = dv.CUSTOMER_ID)
                  );
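To see the hub loading pattern run end to end, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in for the warehouse database, the two attached in-memory databases mimic the stage and dv schemas, and the sample rows and the CUSTOMER_SQN column name are assumptions made for illustration.

```python
import sqlite3

# Assumption: SQLite stands in for the warehouse; attached in-memory
# databases mimic the "stage" and "dv" schemas used in the sample above.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS stage")
conn.execute("ATTACH DATABASE ':memory:' AS dv")
conn.execute("""CREATE TABLE stage.Customer (
    CUSTOMER_ID TEXT, LOAD_TIMESTAMP TEXT, RECORD_SOURCE TEXT)""")
conn.execute("""CREATE TABLE dv.H_Customer (
    CUSTOMER_SQN INTEGER PRIMARY KEY,  -- hub sequence identifier
    CUSTOMER_ID TEXT,                  -- business key
    LOAD_TIMESTAMP TEXT, RECORD_SOURCE TEXT)""")

# Hypothetical staged rows, including an exact duplicate.
conn.executemany("INSERT INTO stage.Customer VALUES (?, ?, ?)", [
    ("C-1", "2013-05-01 00:00:00", "CRM"),
    ("C-1", "2013-05-01 00:00:00", "CRM"),
    ("C-2", "2013-05-01 00:00:00", "CRM"),
])

# The loading pattern: distinct keys, stamped, inserted only if missing.
HUB_LOAD = """
INSERT INTO dv.H_Customer (CUSTOMER_ID, LOAD_TIMESTAMP, RECORD_SOURCE)
SELECT DISTINCT stg.CUSTOMER_ID, stg.LOAD_TIMESTAMP, stg.RECORD_SOURCE
  FROM stage.Customer stg
 WHERE NOT EXISTS (SELECT 1
                     FROM dv.H_Customer hub
                    WHERE stg.CUSTOMER_ID = hub.CUSTOMER_ID)
"""
conn.execute(HUB_LOAD)
conn.execute(HUB_LOAD)  # a second run finds nothing new to insert

hub_keys = conn.execute(
    "SELECT CUSTOMER_ID FROM dv.H_Customer ORDER BY CUSTOMER_ID").fetchall()
print(hub_keys)
```

Running the load twice is deliberate: the NOT EXISTS check is what makes the pattern safe to re-run against the same staged data.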

Links

Links store the intersection of business keys (hubs). Links can be considered the glue that holds the data vault model together. These tables allow the data model to change elegantly over time because they can come and go as required by the business. Links also allow the model to be created quickly without worrying about whether the relationship is one-to-many or many-to-many. In addition, the flexible nature of link tables provides the option to add or drop link tables as requirements change throughout the maintenance lifecycle of the data warehouse or as part of a data mining exercise.

The basic structure and treatment of the Link table is as follows:

Mandatory Columns

    • Link Sequence Identifier (a database number)
    • Load Date and Time (generally a date field)
    • Record Source  (generally a string)
    • At least two Sequence Identifiers (either from Hubs or other Links and are numbers)

Loading Pattern

    • Select Distinct list of business Key combinations from source
    • Add timestamp and record source
    • Lookup data vault identifier from either Hub or Link
    • Insert into Link if the value does not exist

Code Sample

-- Candidate link rows: distinct customer/order key pairs not yet in the link.
SELECT DISTINCT
       stg.LOAD_TIMESTAMP
     , stg.RECORD_SOURCE
     , stg.CUSTOMER_SQN
     , stg.ORDER_SQN
  FROM ( SELECT LOAD_TIMESTAMP
              , RECORD_SOURCE
                -- Look up each hub's sequence identifier by business key.
              , (SELECT CUSTOMER_SQN
                   FROM dv.H_CUSTOMER dv
                  WHERE (src.CUSTOMER_ID = dv.CUSTOMER_ID)
                ) as CUSTOMER_SQN
              , (SELECT ORDER_SQN
                   FROM dv.H_ORDER dv
                  WHERE (src.ORDER_ID = dv.ORDER_ID)
                ) as ORDER_SQN
           FROM stage."Order" src
       ) stg
 WHERE NOT EXISTS (SELECT 1
                     FROM DV.L_Customer_Order dv
                    WHERE (stg.CUSTOMER_SQN = dv.CUSTOMER_SQN)
                      AND (stg.ORDER_SQN = dv.ORDER_SQN)
                  );
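The link pattern can be exercised the same way. This sketch again uses Python's sqlite3 module as a stand-in for the warehouse. Note that it swaps the scalar subquery lookups from the sample above for plain joins, which is a different but equivalent way to fetch the hub identifiers; the sample rows and the CUSTOMER_ORDER_SQN column name are assumptions.

```python
import sqlite3

# Assumption: SQLite stands in for the warehouse; attached in-memory
# databases mimic the "stage" and "dv" schemas.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS stage")
conn.execute("ATTACH DATABASE ':memory:' AS dv")
conn.executescript("""
CREATE TABLE dv.H_Customer (CUSTOMER_SQN INTEGER PRIMARY KEY, CUSTOMER_ID TEXT);
CREATE TABLE dv.H_Order (ORDER_SQN INTEGER PRIMARY KEY, ORDER_ID TEXT);
CREATE TABLE dv.L_Customer_Order (
    CUSTOMER_ORDER_SQN INTEGER PRIMARY KEY,  -- link sequence identifier
    CUSTOMER_SQN INTEGER, ORDER_SQN INTEGER,
    LOAD_TIMESTAMP TEXT, RECORD_SOURCE TEXT);
CREATE TABLE stage."Order" (
    ORDER_ID TEXT, CUSTOMER_ID TEXT, LOAD_TIMESTAMP TEXT, RECORD_SOURCE TEXT);

-- Hypothetical hub rows and staged orders (one exact duplicate).
INSERT INTO dv.H_Customer VALUES (1, 'C-1'), (2, 'C-2');
INSERT INTO dv.H_Order VALUES (10, 'O-1'), (11, 'O-2');
INSERT INTO stage."Order" VALUES
    ('O-1', 'C-1', '2013-05-01', 'ERP'),
    ('O-2', 'C-2', '2013-05-01', 'ERP'),
    ('O-2', 'C-2', '2013-05-01', 'ERP');
""")

# Distinct key pairs, hub identifiers looked up, inserted only if missing.
LINK_LOAD = """
INSERT INTO dv.L_Customer_Order
       (CUSTOMER_SQN, ORDER_SQN, LOAD_TIMESTAMP, RECORD_SOURCE)
SELECT DISTINCT hc.CUSTOMER_SQN, ho.ORDER_SQN,
       src.LOAD_TIMESTAMP, src.RECORD_SOURCE
  FROM stage."Order" src
  JOIN dv.H_Customer hc ON hc.CUSTOMER_ID = src.CUSTOMER_ID
  JOIN dv.H_Order ho ON ho.ORDER_ID = src.ORDER_ID
 WHERE NOT EXISTS (SELECT 1
                     FROM dv.L_Customer_Order lnk
                    WHERE lnk.CUSTOMER_SQN = hc.CUSTOMER_SQN
                      AND lnk.ORDER_SQN = ho.ORDER_SQN)
"""
conn.execute(LINK_LOAD)
conn.execute(LINK_LOAD)  # a second run finds nothing new to insert

link_rows = conn.execute("""SELECT CUSTOMER_SQN, ORDER_SQN
                              FROM dv.L_Customer_Order
                             ORDER BY CUSTOMER_SQN""").fetchall()
print(link_rows)
```

One design difference worth knowing: an inner join silently skips staged rows whose hub key has not been loaded yet, whereas the scalar subquery version returns a NULL identifier for them. Either way, hubs should be loaded before links.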

Satellites

Satellites add all the color and description to the business keys (hubs) and relationships (links) in the data vault environment.  Satellites contain all the descriptive information, tracking change by start and end dates over time, to let one know the information in effect at any point in time.  In the purest sense, satellites are time aware, and tracking change over time is their main function.  Satellites are always directly related and subordinate to a hub or a link. They provide context and definition to business key(s).  A satellite record is added when a change is detected in the processing.  In some cases, there may be multiple satellites pointing to one hub or one link.  The reasons for splitting them this way could be multiple sources, rate of change, or data type.

The basic structure and treatment of the Satellite table is as follows:

Mandatory Columns

    • Hub or Link Sequence Identifier
    • Load Date
    • Load Date End
    • Record Source

Optional Columns

    • Attributes (may be only one, but usually a lot more strings, numbers, or dates)

Loading Pattern

    • Select list of attributes from the source
    • Add timestamp and record source
    • Lookup and use the applicable Hub identifier or the Link identifier
    • Compare to the existing applicable set of satellite records and insert when a change has been detected

Note: a two-step process is generally employed when using a Load End Date, so that the effective time period of each satellite record is set properly

Code Sample

-- Candidate satellite rows: staged descriptions that differ from the
-- most recent satellite record for the same product.
SELECT stg.PRODUCT_SQN
     , stg.LOAD_TIMESTAMP
     , stg.RECORD_SOURCE
     , stg.PRODUCT_DESC
  FROM ( SELECT (SELECT dv.PRODUCT_SQN
                   FROM DV.H_Product dv
                  WHERE (src.PRODUCT_NAME = dv.PRODUCT_NAME)
                ) as PRODUCT_SQN
              , src.LOAD_TIMESTAMP
              , src.RECORD_SOURCE
              , src.PRODUCT_DESC
           FROM stage.[Order] src
       ) stg
 WHERE NOT EXISTS (SELECT 1
                     FROM dv.S_PRODUCT dv
                    WHERE (stg.PRODUCT_SQN = dv.PRODUCT_SQN)
                      AND (stg.PRODUCT_DESC = dv.PRODUCT_DESCRIPTION)
                      -- Compare against the latest satellite record only.
                      AND dv.LOAD_TIMESTAMP = (SELECT MAX(dv1.LOAD_TIMESTAMP)
                                                 FROM DV.S_PRODUCT dv1
                                                WHERE dv1.PRODUCT_SQN = dv.PRODUCT_SQN)
                  );
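The two-step satellite process mentioned in the note can be sketched as follows, again with Python's sqlite3 module standing in for the warehouse. Everything here is illustrative: the stage.Product table, the LOAD_END_TIMESTAMP column name, the load_batch helper, and the sample descriptions are all assumptions, and joins are used in place of the scalar subquery lookup.

```python
import sqlite3

# Assumption: SQLite stands in for the warehouse; table and column
# names not shown in the sample above are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS stage")
conn.execute("ATTACH DATABASE ':memory:' AS dv")
conn.executescript("""
CREATE TABLE dv.H_Product (PRODUCT_SQN INTEGER PRIMARY KEY, PRODUCT_NAME TEXT);
CREATE TABLE dv.S_Product (
    PRODUCT_SQN INTEGER, LOAD_TIMESTAMP TEXT, LOAD_END_TIMESTAMP TEXT,
    RECORD_SOURCE TEXT, PRODUCT_DESC TEXT);
CREATE TABLE stage.Product (
    PRODUCT_NAME TEXT, PRODUCT_DESC TEXT,
    LOAD_TIMESTAMP TEXT, RECORD_SOURCE TEXT);
INSERT INTO dv.H_Product VALUES (1, 'Widget');
""")

# Step 1: insert a row only when it differs from the latest satellite row.
INSERT_CHANGED = """
INSERT INTO dv.S_Product
       (PRODUCT_SQN, LOAD_TIMESTAMP, RECORD_SOURCE, PRODUCT_DESC)
SELECT hub.PRODUCT_SQN, src.LOAD_TIMESTAMP, src.RECORD_SOURCE, src.PRODUCT_DESC
  FROM stage.Product src
  JOIN dv.H_Product hub ON hub.PRODUCT_NAME = src.PRODUCT_NAME
 WHERE NOT EXISTS (
        SELECT 1
          FROM dv.S_Product sat
         WHERE sat.PRODUCT_SQN = hub.PRODUCT_SQN
           AND sat.PRODUCT_DESC = src.PRODUCT_DESC
           AND sat.LOAD_TIMESTAMP = (SELECT MAX(s2.LOAD_TIMESTAMP)
                                       FROM dv.S_Product s2
                                      WHERE s2.PRODUCT_SQN = sat.PRODUCT_SQN))
"""

# Step 2: close each open row with the load time of the row that follows it.
# The newest row has no successor, so its end date stays NULL.
END_DATE_PRIOR = """
UPDATE dv.S_Product
   SET LOAD_END_TIMESTAMP = (
        SELECT MIN(nxt.LOAD_TIMESTAMP)
          FROM dv.S_Product nxt
         WHERE nxt.PRODUCT_SQN = S_Product.PRODUCT_SQN
           AND nxt.LOAD_TIMESTAMP > S_Product.LOAD_TIMESTAMP)
 WHERE LOAD_END_TIMESTAMP IS NULL
"""

def load_batch(description, load_timestamp):
    """Stage one hypothetical product row, then run both steps."""
    conn.execute("DELETE FROM stage.Product")
    conn.execute("INSERT INTO stage.Product VALUES ('Widget', ?, ?, 'ERP')",
                 (description, load_timestamp))
    conn.execute(INSERT_CHANGED)
    conn.execute(END_DATE_PRIOR)

load_batch("Small widget", "2013-05-01")
load_batch("Small widget", "2013-05-02")  # unchanged, so no new row
load_batch("Large widget", "2013-05-03")  # changed, so a new row is added

history = conn.execute("""SELECT PRODUCT_DESC, LOAD_TIMESTAMP, LOAD_END_TIMESTAMP
                            FROM dv.S_Product
                           ORDER BY LOAD_TIMESTAMP""").fetchall()
print(history)
```

The unchanged second batch adds nothing, while the third batch adds a new row and end-dates the first one, which is the behavior the loading pattern describes.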

With this quick overview of the basics of the data vault model, I hope you can see the simplicity in the design as well as the pattern-based loading process.  As you can see, whether you have 1 or 10 hubs or links, they should all look structurally similar and load in a similar fashion.  This drives down overall development and support costs when the Enterprise Data Warehouse is supported by a data vault.  Also, designers and developers who are new to the concepts can generally be up and productive in short order.

So if you are…

  • Currently engaging a data warehouse environment that is becoming harder and harder to support and maintain over time
  • Needing to address performance problems
  • Hoping to get your data governance problems addressed
  • Wanting more of a rapid and agile development process
  • Concerned about the current ETL processes having become rigid and difficult to support
  • Suffering from the lack of Business Rules maintenance and management
  • Embarking on a new Business Intelligence endeavor and would like to increase likelihood of success

…then the data vault methodology may be the answer for you.

Along with the loading patterns and models outlined here, there are many other benefits to applying this architecture and process to your Business Intelligence needs.

To find out more about Data Vaults or how LÛCRUM can help your business by making your data meaningful, contact us.

As the evening winds down every night, I like to look back and reflect on the day’s events. Being an analytically minded person, I tend to look back at decisions that were made and how they had an effect on events that transpired throughout the day. As I delved into this nightly process, I started looking at the day in a more granular manner. Upon doing so, I came to the realization that we are all swimming in “Day to Day Data”.

Did I run out of coffee this morning? What time did I get in the car? Are these completely unrelated items of information? If you answered yes, then I would be inclined to say you are wrong. All these granular pieces of information are important to someone or some entity. Tell McDonald’s the answer to these two questions, and they have the perfect time to run an ad on the radio for their “premium roast” coffee.

Take time to think about your day to day data. Think of all the pieces of information that drive your day; you can use this information to accomplish a great number of things. Identify inefficiencies that are a constant annoyance.

The day to day data that we create is astounding. As we all know now, the data is out there. The data is important, but what is more important is identifying what information is vital and to whom. Data can be helpful, but it can also be stressful. Data is the figurative key that can open the door to future success, but what if someone handed you the keys to the entire building instead of the key to the room you want? How do you find the right key? It is important to have help to identify which (if any) of those keys is useful. That way you can make your Day to Day Data Meaningful.

In my previous posts, I explained what Data Vault is and where it fits in Enterprise Data Warehouse Architecture. I also covered its key components along with an example.

It is a known fact that business expectations from a BI project are not always met. There is always pressure to deliver more business value in less time. Below are some of the BI architecture pressures that can be overcome by following a Data Vault architecture. Data Vault offers a unique solution to business problems and technical problems alike.

In the next few posts, I will try to present how Data Vault overcomes the above issues or pressures when compared to traditional methods (3rd NF or Dimensional Model). In a nutshell, with the help of examples, I am trying to demonstrate what Raphael Klebanov has suggested. He concluded (see the pictures below) that:

  1. The 3rd NF model is better suited to Operational Data Stores (ODS)
  2. Dimensional Model is more suitable for data marts and access/presentation layers. The model can be used to create Data warehouse for small data volumes and stable business structures.

 Principal Data Flow (Simplified)

 

Conclusion on 3rd NF Model

 

 Conclusion on Dimensional Model

 

Conclusion on Data Vault Model

Big Data… what’s this talk all about?   I’m finding many technology articles espousing the “next big thing,” including reasons to choose Hadoop or when not to choose Hadoop.   There are a lot of stories about what appliances to use from the likes of IBM, HP, Oracle, and others.  Business leaders should want to know what kind of infrastructure it is going to take to do “Big Data,” and this time, the Brick and Mortar can take lessons from the eTailers, Google and such.   Typical Big Data systems are also migrating over to NoSQL ecosystems and other computing approaches that will be able to scale.   Other questions remain, such as what the real limitations of RDBMSs are and what the new design concepts are.

We are now seeing the pent-up demand for Data through the holiday season and let me assure you, “Big Data” is everywhere.    What’s Big Data?   Is there a good definition?   When in doubt, go to WIKIPEDIA?   Not the “center of truth” for sure but it is also a reasonable and acceptable starting point.   Here’s what they say:

    “Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualizing. This trend continues because of the benefits of working with larger and larger datasets allowing analysts to ‘spot business trends, prevent diseases, and combat crime.’”

They also go on to say:

    “One current feature of big data is the difficulty working with it using relational databases and desktop statistics/visualization packages, requiring instead ‘massively parallel software running on tens, hundreds, or even thousands of servers.’”

OK, that’s one definition of “Big Data.”   Let’s try to think of Big Data in the context of the business world and what it may mean to our clients and prospective clients.   This article doesn’t focus on the “industry term” Big Data; instead, it centers on what Big Data can mean for a company’s strategy and its future growth plans.

The larger picture is well-represented by the recent November 2011 McKinsey Quarterly.   In the publication, McKinsey’s Michael Chui interviews Bob McDonald, CEO, P&G.

In the interview, Mr. McDonald expertly describes P&G’s future digital strategy as initiating new business models that are supported by the new technologies of analytics, real time processing, and “Big Data.”   Key points that I got out of the article include:

  1. We believe digitization represents a source of competitive advantage.
  2. There’s new demand for a whole new approach to accounting:   Activity-Based Accounting.   It’s been around for years.   The traditional “double-entry bookkeeping,” founded by the Christians over 700 years ago, focuses on historical analysis and not “real time profitability.”   Think of what we could do if we were to give each operating unit real time data and processing?   Most businesses are handicapped to some extent by their internal ability to be agile.   Their supply-demand models limit their abilities to flex when they need to and contract when indicators are telling them to do so.    A simple example that we have all known about for years is in healthcare.   Let’s get personal:  have you or one of your loved ones ever waited for 2 hours for a doctor’s appointment?   Did the doctor do that purposely?  Certainly not!  The ability to match capacity with demand on a real time basis is where most businesses are going.
  3. “Data Partners” are referenced in the story.   They are the “suppliers” of critical data that is needed for real time analytics for decision-making.   Quality of data, periodicity, and timeliness are key attributes to this relationship.    Perhaps traditional billboard and TV ads will not be as strategic anymore in the future?
  4. “Have you ever done a Monte Carlo simulation?”  and he goes on to say “We wanted to find people who had true mastery in computer science…analytical thinking skills have become ever more important to this company…..those innovations are always informed by data.”    That is reclassifying whole finance, accounting, and IT personnel in one fell swoop!

OK, John, what does the future have in store?  We keep hearing that innovation is the key.    Is innovation rewarded in most companies?   Apple is probably the most cherished innovator in today’s economy and there’s a whole industry and supply chain right behind them!

My reaction to his comments on human capital:

  1. Business Professionals need to have renewed “personal development plans” to reach higher and challenge themselves to attain a higher mastery of their particular skills and even broader skills for the future.    They will need to balance both technology and business acumen together to reach a higher level of productivity.   Once again, the “Knowledge Worker” is all powerful and will be able to sustain themselves in a globally competitive marketplace where anything of potential commodity value is outsourced.
  2. Business Professionals will need to have well-rehearsed “business and domain-expertise” conversations that they can have with their decision-makers, customers, and suppliers.  Intelligent innovation requires risk taking and guts.   After the recent recession, how many businesses are really ready to get out of the proverbial “foxhole” and get their heads back to the commercial battlefield?   Are there significant enough rewards to encourage this behavior or is this more of a survival tactic?  The job is to identify ways to leverage data to LEAPFROG the competition with new business models.  Will Fortune 500 companies like P&G lead this behavior or leave it up to the mid-market?
  3. Attention all Finance, HR, and Accounting Professionals:   I love what Bob McDonald says in the article:  Ineffective systems and cultures are bigger barriers to achievement than the talents of people.    If there are ways and approaches to increase productivity and eliminate old processes, I believe McDonald indicates “now’s the time.”    If the opportunity is to give our own organizations a better line of sight to our “real time performance,” then we must make the investments to raise our game.   It’s important that all firms look for ways to reduce manual work and increase the automation of their financial and operational reporting.

There it is.  These are big ideas for you and they are inspired by one of the most successful companies on the planet.   I know that you will have to do more planning at all levels of your company in order to take advantage of the explosive growth that is occurring in our global market.    The irony is that the daily news seems to tell us that “gloom and doom” are just around the corner.   Interestingly enough, the current opportunity we have kinda reminds me of the late 90’s and the “go-go” years of the Clinton Era!

The most criticized tool in Data Visualization is the pie chart. There are many areas of debate in the world of Data Visualization, but there is little debate among the experts about the pie chart. The number one rule about pie charts is “Don’t Use Pie Charts”. Personally, I’m not offended by them. I understand that they have been the tool of choice and that they have become ingrained in society and business. However, I am in complete support of the expert opinions. Pie charts are deficient in displaying and comparing data. There are a few acceptable uses for them, but in most cases a simple bar chart would be a better tool overall and provide a much better visual comparison.

I have heard people argue that pie charts take up less space or that they are easier to understand, but even these arguments are not valid. There are just too many fundamental problems with pie charts and this is why I advocate that they should not be used. Let’s examine a very simple data set and compare. Here is a table of The Twelve Days of Christmas.

 

Below is a pie chart of the Twelve Days of Christmas, essentially the default view from Excel. To help this visual I’ve followed a common rule of pie charts, which is to start at noon and move clockwise from the largest slice to the smallest. The other common practice, as described by Dona Wong in The Wall Street Journal Guide to Information Graphics: The Dos and Don’ts of Presenting Data, Facts, and Figures, is to place the largest slice at noon and the second largest slice to the left of noon, and then continue clockwise with the remaining slices from largest to smallest. I find this practice to be even more confusing, unless the last category is “Other” or “Misc.” and therefore an aggregation of the remaining smaller categories. Also, I added the data to the legend and resized it as large as reasonably possible to make the text readable.

Note the following problems with the pie chart:
• To compare visually, the reader must go back and forth from the pie chart to the legend to determine which present matches which color. It would be impossible to list the labels within each slice because the text would be too long. Another popular option is to draw lines from the pie chart to labels placed around it. This creates a very busy chart and clutters it with extra lines.
• The use of many different colors is required to create a categorical comparison color scheme. This makes it difficult to tell apart the shades of blue, red and purple.
• The comparison between the categories is very difficult. The eye cannot easily distinguish between the sizes of the “Drummers Drumming” and the “Pipers Piping” slices. This is because the size of a pie slice is not easily estimated.
• The beginning of one category starts at the end of the previous category. This means that you cannot compare multiple categories from the same baseline, because the baseline shifts from one category to the next.
• Finally, to generate a pie chart it is necessary to calculate the percentage of each category. After all, a pie chart by nature shows 100% and not 78 total gifts. This may be done manually, but that is not necessary, as the software used to create the pie chart will do this automatically (these example charts were built in Microsoft Excel). In some cases a percentage might be the correct measure, but in other cases the values may be more appropriate. Below are the calculated fields for what the pie chart is actually showing.
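The percentage calculation that the charting software performs behind the scenes can be sketched in a few lines of Python. This is only an illustration: the original table did not survive on this page, so the gift counts below are an assumption based on the traditional song (with 9 Ladies, 10 Lords, 11 Pipers and 12 Drummers, consistent with the comparisons made in this post), and the variable names are hypothetical.

```python
# Gift counts per category. Assumed from the traditional song, since the
# original table is not reproduced here; the total matches the 78 gifts
# mentioned in the text.
gifts = {
    "Partridge in a Pear Tree": 1,
    "Turtle Doves": 2,
    "French Hens": 3,
    "Calling Birds": 4,
    "Gold Rings": 5,
    "Geese a Laying": 6,
    "Swans a Swimming": 7,
    "Maids a Milking": 8,
    "Ladies Dancing": 9,
    "Lords a Leaping": 10,
    "Pipers Piping": 11,
    "Drummers Drumming": 12,
}

total = sum(gifts.values())

# Share of the whole pie that each slice represents, in percent.
shares = {name: 100 * count / total for name, count in gifts.items()}

# Print largest to smallest, the same order as the pie chart slices.
for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<26}{gifts[name]:>3}{pct:>7.1f}%")
```

Under these assumed counts, Drummers Drumming works out to roughly 15.4% and Pipers Piping to roughly 14.1%, a gap of little more than one percentage point, which helps explain why those two slices are so hard to tell apart by eye.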

There is nothing wrong mathematically with the pie chart. There are twice as many Geese a Laying as there are French Hens and three times as many Ladies Dancing as French Hens. However, the comparison between these is exactly the point. The pie chart does not make it easy to see that comparison. It’s hard enough to tell which slice is bigger. It would be impossible to discern twice as much or three times as much.
Here is the same data graphed using a simple bar chart.

This chart solves all of the problems mentioned above.
• Comparisons are made easily from one category to the other because the baseline is now the same for each category. There are clearly twice as many Turtle Doves as there are Partridges in a Pear Tree. There is no question whether there are more Ladies Dancing or Maids a Milking.
• Color is easily managed. There is no color requirement to discern between categories. In fact, this graph could be done in grayscale and printed on a black and white printer or copy machine and it would still be usable.
• The axis labels are now adjacent to the data and the bar. This makes for a very compact chart that is easy to read.
• Finally, unless the pie chart is shrunk to a tiny graphic, for example as a data layer on top of a map, then there is no real space savings. In fact, the bar chart takes up less room on the page and is more readable than the pie chart.
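The common-baseline idea behind these points can be shown even without a charting tool. The following is a minimal sketch in plain Python, not the Excel chart used in this post: `text_bar_chart` is a hypothetical helper, and the four gift counts are an assumed subset of the data taken from the traditional song.

```python
def text_bar_chart(data, symbol="#"):
    """Return one line per category, largest first, all sharing one baseline."""
    width = max(len(label) for label in data)
    lines = []
    for label, value in sorted(data.items(), key=lambda kv: kv[1], reverse=True):
        # Labels are padded to the same width so every bar starts at the
        # same column, and the value is printed right next to its bar.
        lines.append(f"{label:<{width}} | {symbol * value} {value}")
    return lines


# Assumed subset of the gift counts, for illustration only.
gifts = {
    "Partridge in a Pear Tree": 1,
    "Turtle Doves": 2,
    "Maids a Milking": 8,
    "Ladies Dancing": 9,
}

chart = text_bar_chart(gifts)
print("\n".join(chart))
```

Because every bar starts in the same column, a one-unit difference such as Ladies Dancing versus Maids a Milking is visible at a glance, with no color and no legend required.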

Hopefully this holiday example illustrates the problems associated with using pie charts and the better alternatives. Best wishes for a safe and happy holiday season, and please keep checking back for more on Data Visualization.

The Bureau of Labor Statistics (BLS) has published some really bad graphs and maps over the years. Below is an example of a map they publish monthly for the “Unemployment rates by state”. In this map they are attempting to use a sequential color scheme, going from light to dark to represent low to high unemployment rates, but because of poor color choices it has unintentionally become categorical. Black, which represents the highest rate, seems muted against the other colors. The bright red, which represents a middle value of 5%-5.9% unemployment, seems to dominate the map more than the darker red or purple, which actually represent higher rates.

However, the highlight for today is a refreshingly well done graph on the unemployment rate and median earnings when compared to education attained.

This graph is very well done.  Notice the following characteristics.

  • Simple bar graph used for comparison.  The choice of the bar graph allows the reader to easily compare the categories.
  • Consolidated labeling and a diverging horizontal scale allow for combined axis labels in the middle.
  • There are no extra gridlines, no horizontal axis line, no axis scale and no border  around the chart (unfortunately the webpage coding added an unnecessary border on their website at http://www.bls.gov/emp/ep_chart_001.htm)
  • The data points are placed on the bars themselves, providing additional information to the story.
  • The addition of a very clean reference line (in this case the average) gives the story additional context and provides a benchmark for each bar to be compared against.
  • Formatting is very clean. A single decimal place is consistent for the unemployment rate and the median income is not cluttered with decimal places, but includes a comma for thousands.
  • The use of color is simple.  Someone who is color blind may not be able to distinguish between the red and the green easily, but since the color is not crucial to the story nothing will be lost.
  • Great care was taken to have the negative statistic, in this case the unemployment rate, increase horizontally to the left, while the positive indicator of median weekly earnings increases to the right.
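The label formatting praised in the list above (one consistent decimal place for rates, whole dollars with a thousands separator for earnings) is easy to reproduce. Here is a minimal Python sketch; the function names and the sample values are hypothetical and are not figures from the BLS chart.

```python
def format_rate(rate):
    """Format an unemployment rate with exactly one decimal place."""
    return f"{rate:.1f}%"


def format_earnings(dollars):
    """Format weekly earnings as whole dollars with a thousands separator."""
    return f"${dollars:,.0f}"


# Hypothetical sample values, for illustration only.
print(format_rate(7.0))        # keeps the trailing ".0" so labels stay consistent
print(format_earnings(1234))   # comma for thousands, no decimal clutter
```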

Notice that they utilized some of the same techniques that were discussed in the recent “Make Category Comparisons Much Easier with these Redesigns” post on the Making Data Meaningful blog. Now some may argue with the overall message of this graph, which is that higher education will lead directly to higher income. This may or may not be the case; however, the BLS has done an excellent job of presenting this data. Congratulations to the Bureau of Labor Statistics for creating an excellent graph.

As for the first unemployment map, simply changing the color scheme would solve the categorical color problem. Here is the same map using a color-blind-friendly blue-orange diverging color scheme. More importantly, notice how the emphasis shifts to the orange and dark orange states, with very little emphasis on Montana, Kansas, Louisiana and Virginia, which were bright red in the original version.

However, when using this diverging color scheme the blue still attracts attention to the low percentage states.  This kind of color contrast might work well for a political map, for example Republican vs. Democrat, but for a low to high scale this can be confusing.  A better version of this could be achieved by using a single color, light to dark, and removing a few of the bands, for example 5 or 6 bands instead of 8.  Here is the same map using the sequential color palette but only using orange and 6 bands.  This is similar to the original map, but avoids the purple and dark red being interpreted as categorical.
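A single-hue, light-to-dark palette like the one described above can be generated rather than picked by hand. The sketch below is one simple way to do it, assuming straight linear interpolation in RGB between two endpoint colors. The endpoint values and the name `sequential_palette` are assumptions for illustration, not the colors used in the redesigned map, and linear RGB steps are not guaranteed to look evenly spaced to the eye.

```python
def sequential_palette(light, dark, bands):
    """Return `bands` hex colors stepping evenly from `light` to `dark`."""
    if bands < 2:
        raise ValueError("a sequential palette needs at least two bands")
    palette = []
    for i in range(bands):
        t = i / (bands - 1)  # 0.0 at the lightest band, 1.0 at the darkest
        rgb = tuple(round(lo + (hi - lo) * t) for lo, hi in zip(light, dark))
        palette.append("#{:02x}{:02x}{:02x}".format(*rgb))
    return palette


# Assumed endpoint colors: a pale orange and a dark burnt orange.
LIGHT_ORANGE = (254, 237, 222)
DARK_ORANGE = (166, 54, 3)

palette = sequential_palette(LIGHT_ORANGE, DARK_ORANGE, 6)
print(palette)
```

Because every band is the same hue and only the lightness changes, no single band can read as a separate category the way the bright red did in the original map.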

Here is the same map using only gray scale as the sequential coloring.

Another major issue with these charts though is the difference in scales from month to month and what appears to be an arbitrary grouping of states. Here’s a comparison of the legend for the map in December 2008 and October 2011. Notice the different scale for the two legends as well as the groupings within each color.

Compare the maps side by side.


In the December 2008 version the groupings start with a band of 0-1.9% and then move in 1% increments until the purple band, which spans 3%. In the October 2011 version the bands are grouped differently. Below is a straight line band of 1% to outline the color difference.

Changing the color bands this way makes it impossible to have an apples-to-apples comparison from one time period to the next. This is a shame because this type of map would make an excellent trellis chart to compare month by month or year over year. Also, the color choice and band selection will have a dramatic impact on the visual story. This inconsistency allows the creator to manage the story. Hopefully, future graphs from the BLS will continue to follow the good example.
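One way to keep maps comparable from period to period is to define the band edges once and reuse them for every map. The sketch below illustrates that idea in Python under stated assumptions: the band edges, the function names and the sample rates are all hypothetical and are not taken from the BLS legends.

```python
from bisect import bisect_right

# Hypothetical fixed band edges (percent). Five edges give six bands, and
# the same edges would be reused for every month's map.
BAND_EDGES = [4.0, 5.0, 6.0, 7.0, 8.0]


def band_index(rate, edges=BAND_EDGES):
    """Return the 0-based band a rate falls into (lower edge inclusive)."""
    return bisect_right(edges, rate)


def band_label(index, edges=BAND_EDGES):
    """Return the legend text for a band, assuming rates have one decimal."""
    if index == 0:
        return f"below {edges[0]:.1f}%"
    if index == len(edges):
        return f"{edges[-1]:.1f}% and above"
    return f"{edges[index - 1]:.1f}% to {edges[index] - 0.1:.1f}%"


# Hypothetical rates for one state in two different periods.
rates_by_period = {"December 2008": 7.1, "October 2011": 9.0}
for period, rate in rates_by_period.items():
    print(period, "->", band_label(band_index(rate)))
```

With the edges fixed, a state that changes color between two maps has actually changed bands, rather than merely being regrouped by a new legend.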