Wednesday, November 9, 2011

Getting Data (and a bit of visualization)



The first step of data illumination is getting the data. Luckily, there is an abundance of interesting data sets that are freely available on the Internet. A few good sites that I found are:


There's more to getting the data than meets the eye. The data sites listed above are nice to start with because the provider has cleaned the data (for the most part) and formatted it in ways that major applications can read (Excel, CSV text formats, etc.), or, better yet, the data is self-contained in an environment where one can do some simple visualization and exploratory data analysis, as is the case with Google Public Data Explorer.

However, in many situations the data is dirty, formatted oddly, or just needs some kind of cleanup. The process of getting data to where it needs to be before any analysis and visualization are done is called data munging (among data geeks) and ETL (Extract, Transform, Load) among the business intelligence (BI) community. But that's a subject for another post.
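To give a rough flavor of what munging looks like in practice, here is a minimal Python sketch of the three ETL steps. Everything in it is an assumption made for illustration: the raw text, the column names, and the function names `extract`, `transform`, and `load` are hypothetical and do not come from any of the data sites mentioned here.

```python
import csv
import io

# Hypothetical raw export: stray whitespace, lowercase names, a missing value,
# and a thousands separator inside a quoted number.
RAW = """region , year, outbreaks
 western pacific ,2001, 3
Europe,2001,"1,204"
Africa,2001,
"""


def extract(text):
    """Extract: parse the raw CSV text into a list of dict rows."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return list(reader)


def transform(rows):
    """Transform: trim whitespace, normalize names, convert numbers,
    and drop rows that have no outbreak count."""
    cleaned = []
    for row in rows:
        row = {key.strip(): (value or "").strip() for key, value in row.items()}
        if not row["outbreaks"]:
            continue  # nothing to analyze without a count
        cleaned.append({
            "region": row["region"].title(),
            "year": int(row["year"]),
            "outbreaks": int(row["outbreaks"].replace(",", "")),
        })
    return cleaned


def load(rows):
    """Load: write the cleaned rows back out as tidy CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["region", "year", "outbreaks"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


clean_rows = transform(extract(RAW))
clean_csv = load(clean_rows)
print(clean_csv)
```

In a real project the "load" step would usually write to a file or a database rather than a string, and the cleanup rules would depend entirely on how the source data is broken.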

Let's illuminate some data using Google Public Data Explorer!


Infectious Disease Outbreaks Analysis:
For fun, I created the following animated chart using Infectious Disease Outbreaks data provided by Harvard Medical School:




(Note: Please push the play button to see the animated chart.)

This animated chart shows the number of outbreaks and the average time it took to discover that there was an outbreak (color coded). As an exercise for the reader, you can experiment with color coding by number of outbreaks instead, or by average time to public communication. If we look at the animation carefully, we see that for the most part health organizations discover an outbreak in less than 50 days, which is less than two months. So, don't PANIC the next time you hear the media raving about an epidemic outbreak.

There is an exception, though. In 2001 in the Western Pacific region, it took over 300 days to discover the outbreak, but fortunately it shows up as a small red bubble, which means only a few outbreaks. What happened in 2001? What was the story? Now, I'm not an epidemiologist, but the intent of Google Public Data Explorer is to open up data and make it accessible, so that issues of importance are easy to analyze. This could be a starting point to delve deeper and find other data, news stories, and other specific issues to get a deeper understanding. It's like detective work. Open data creates an open society.
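If you had the underlying records in hand rather than just the chart, spotting this kind of exception could be sketched in a few lines of Python. This is only an illustration: the records below are placeholder values I made up (they are NOT taken from the Harvard data set), and the 50-day threshold simply echoes the rule of thumb from the chart.

```python
from collections import defaultdict

# Placeholder records: (region, year, days until the outbreak was discovered).
# Invented for illustration only; not real data.
records = [
    ("Western Pacific", 2001, 310),
    ("Western Pacific", 2002, 20),
    ("Europe", 2001, 30),
    ("Europe", 2001, 40),
    ("Africa", 2001, 45),
]


def average_days(rows):
    """Average the days-to-discovery for each (region, year) pair."""
    totals = defaultdict(lambda: [0, 0])  # key -> [sum of days, count]
    for region, year, days in rows:
        bucket = totals[(region, year)]
        bucket[0] += days
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def slow_discoveries(averages, threshold_days=50):
    """Return the (region, year) pairs whose average exceeds the threshold."""
    return sorted(key for key, avg in averages.items() if avg > threshold_days)


averages = average_days(records)
flagged = slow_discoveries(averages)
print(flagged)
```

Anything that ends up in `flagged` is a candidate for the detective work described above: go find the news stories and other data for that region and year.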

PIGS Analysis:
In this next study I tried to shed light on what has been in the news a lot lately: the crisis in Greece and the entire EU mess. It's a debt and economic problem, which includes unemployment as one issue. I looked at the unemployment rate in Europe and highlighted the PIGS (Portugal, Italy, Greece, Spain). Then I zeroed in on the working-age demographic segment (25 to 74; the other choice is under 25).




(Note: Please push the play button to see the animated chart.)

And for comparison, I highlighted Germany, which is apparently one of the more (if not the most) economically successful European countries and which is bailing out Greece and much of Europe. For kicks, I also highlighted Finland and the U.K.

What we see is quite a story. In the 1990s, Finland had an unemployment rate bordering on 15%. Germany was hovering around 10%. But both countries turned themselves around in the 2000s, and this is when the PIGS started to deteriorate. As of the most recent time period available (Sept 2011), Spain is at the far upper right-hand corner and Greece is OFF THE CHART!

This is an easy and powerful tool to visualize the evolution of unemployment patterns across the EU. Now, the next question is: why did the PIGS falter while Germany not only improved but is now leading? Poor fiscal discipline? Bad monetary policies? Demographics? We don't know. Again, we need more data and analyses to answer that question. The cycle begins again.

Data illumination is a virtuous circle. We first get some data that interests us, then we do some exploratory data analysis (using some visualizations or statistical studies). Then we discover some basic trends and patterns that lead us to ask further questions. And to answer these questions, we need to find more data and do more analysis, which then leads to further questions.

Unfolding the deeper stories and seeing the connections...

Per the latest breaking news, here's a look at the size of government deficit/surplus as a percentage of GDP. (Note: You will need Adobe Flash Player to view the animated charts, which can be easily downloaded. Please push the play button to see the animated chart. Thanks.)

5 comments:

  1. Try Knoema.com

    One of the best data resources I came across.

  2. Thanks for suggesting Knoema.com. Looks like a good site.

  3. That was a cool graphic presentation of data...

  4. Thanks, Sunil. Glad you liked the animated charts. I'll try to use some other cool visualizations in future postings.

  5. http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen.html

    Hans takes the biscuit right off your plate.
