Friday, November 11, 2011

11-11-11 and Benford's Law





  









So today is 11-11-11! There's a lot of buzz on the internet about the significance of this date and of the number 11, including the extreme view that the apocalypse is at hand. Since this is a blog about data illumination, let's take a look at the numbers and supporting data.

First off, let's see where we see 11s:

  • An 11-sided polygon is called a hendecagon or undecagon. Tell that to your geometry teacher!
  • The Susan B. Anthony dollar has an 11-sided border inside its rim (the coin itself is round). It was minted from 1979 to 1981 and then again in 1999.
  • The Canadian one dollar coin, called the loonie, has 11 sides (technically it's an 11-sided curve of constant width).
  • 11 is the smallest two-digit prime number
  • Sodium has the atomic number 11 in the periodic table
  • Group 11 of the periodic table includes copper, silver, and gold, which have been used in coins and held in esteem since ancient times
  • Apollo 11 (first to land on the Moon with humans)
  • 11 players per team on the field at one time in American football
  • 11 Downing Street, where the British Second Lord of the Treasury lives. This is next to the more famous 10 Downing Street, where the Prime Minister resides
  • An 11-gun salute is one of the gun salutes rendered in the U.S. armed forces
  • There are endless examples of 11s all around us IF you look hard enough....

Numerologists assign mystical powers and meanings to the number 11. Because of this belief, they see 11s everywhere they go and attach meaning to each one. Psychologists call this confirmation bias: people tend to place more weight on information that confirms their preconceptions and beliefs, regardless of the validity of the information.

So, let's look at the data to see if this is true.

There's a law in statistics called Benford's Law that says that in a list of numbers from real-life data, the first digits of the numbers are distributed in a non-uniform way. Uniform just means that every outcome has an equal chance of occurring. So, for the digits 1 to 9, if they occurred uniformly, each digit would have an equal chance of being the first digit (1/9 or 11.1%).

However, Frank Benford discovered this is not the case. In fact, the leading digit 1 occurs MORE often than the other digits, about 30% of the time according to the theoretical distribution below:


(Note: Please see technical addendum below for details.)
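
For readers who want the numbers behind the theoretical distribution, here is a minimal sketch in Python (the analysis in this post was done in R and Excel, so this is only an illustration). It assumes the standard form of Benford's Law, where the chance of leading digit d is log10(1 + 1/d). The function name `benford_probability` is made up for this example.

```python
import math


def benford_probability(digit: int) -> float:
    """Expected share of numbers whose first digit is `digit` (1 to 9)."""
    return math.log10(1 + 1 / digit)


# Theoretical share for each possible leading digit.
expected = {d: benford_probability(d) for d in range(1, 10)}

for d, p in expected.items():
    print(f"{d}: {p:.1%}")
```

The digit 1 comes out at about 30.1% and the digit 9 at about 4.6%, compared with 11.1% each under a uniform distribution.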

But that's just the theory. Let's test Benford's Law out on some REAL data to see if these spooky 1s that make up the 11s do occur more often than not in real life.

I went to Factual.com and blindly picked a random data set. I downloaded the Census.gov Datasets - Agricultural Exports and Imports--Volume by Principal Commodities. You can also get the data directly from the Census.gov website.

The data consists of a listing of various commodities that we imported and exported from 1990 to 2007: beef, pork, lamb, poultry, coffee, feed grains, fruits, nuts, oilseeds, oilcake (what is that?), rubber and allied gums, rice, fresh and frozen vegetables, wheat, malt beverages, feeds and fodders, wine, etc. The list just goes on and on. It's about as random a data set as one can pick.

Analysis:
I looked at all these import and export numbers and extracted their first digits. After a bit of data cleaning and munging (see Technical Addendum for details), here's the summary table:



Digits (a dash marks a digit that did not occur as a first digit that year)

Year      1      2      3      4      5      6      7      8      9
1990  39.3%  21.4%   7.1%   7.1%  14.3%      -   3.6%   3.6%   3.6%
1991  32.1%  14.3%  17.9%   7.1%  10.7%   3.6%      -      -  14.3%
1992  35.7%  25.0%   7.1%  14.3%   3.6%   7.1%   3.6%      -   3.6%
1993  32.1%  17.9%  10.7%  10.7%      -   7.1%   7.1%   3.6%  10.7%
1994  32.1%  25.0%  10.7%  10.7%      -   7.1%   3.6%   7.1%   3.6%
1995  42.9%  10.7%  17.9%   7.1%  10.7%   3.6%   3.6%      -   3.6%
1996  42.9%  25.0%   7.1%  10.7%  10.7%      -   3.6%      -      -
1997  39.3%  17.9%  14.3%  10.7%   3.6%   7.1%   3.6%      -   3.6%
1998  32.1%  28.6%   7.1%   7.1%   7.1%   7.1%      -   3.6%   7.1%
1999  35.7%  25.0%  14.3%  10.7%   3.6%      -   3.6%   7.1%      -
2000  32.1%  32.1%   3.6%  10.7%   7.1%   3.6%   3.6%   3.6%   3.6%
2001  28.6%  28.6%  10.7%  10.7%   7.1%      -   7.1%   7.1%      -
2002  28.6%  25.0%  10.7%  17.9%   7.1%   3.6%      -   7.1%      -
2003  28.6%  25.0%  21.4%   3.6%   7.1%   3.6%   3.6%      -   7.1%
2004  28.6%  21.4%  21.4%   7.1%   7.1%      -   3.6%   3.6%   7.1%
2005  35.7%  21.4%  17.9%  10.7%   3.6%   3.6%   3.6%      -   3.6%
2006  39.3%   7.1%  25.0%  14.3%   7.1%   3.6%      -   3.6%      -
2007  39.3%  10.7%  25.0%   7.1%   7.1%   7.1%      -   3.6%      -

We see that, amazingly enough, 1s do occur as the first digit at least as often as any other digit in every year: between 28.6% and 42.9% of the time, in the neighborhood of the theoretical 30.1%. Here's a visualization of the same table above:



Again, we see the 1s dominate over the entire 18-year time period, and there are years where the number 8 does NOT occur as a leading digit in our import and export data.

Let's look at the summary over the entire time period by averaging the data:
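
As a rough check on the digit 1 column only, the yearly shares from the summary table can be averaged and set next to the Benford expectation. This is a small Python sketch, not the R and Excel workflow used for the post. The dictionary name `share_of_ones` is made up for this example, and the simple mean assumes every year should count equally.

```python
import math

# Share (%) of first digits equal to 1, by year, copied from the summary table.
share_of_ones = {
    1990: 39.3, 1991: 32.1, 1992: 35.7, 1993: 32.1, 1994: 32.1, 1995: 42.9,
    1996: 42.9, 1997: 39.3, 1998: 32.1, 1999: 35.7, 2000: 32.1, 2001: 28.6,
    2002: 28.6, 2003: 28.6, 2004: 28.6, 2005: 35.7, 2006: 39.3, 2007: 39.3,
}

# Simple (unweighted) mean across the 18 years.
average = sum(share_of_ones.values()) / len(share_of_ones)

# Benford's expected share for leading digit 1, as a percentage.
benford_one = 100 * math.log10(2)

print(f"Average share of leading 1s, 1990-2007: {average:.1f}%")
print(f"Benford expectation for digit 1: {benford_one:.1f}%")
```

The average works out to roughly 34.7%, a few points above the theoretical 30.1% but far from the 11.1% a uniform distribution would give.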




Potential Extras:
I'm trying to get a dataset of birthdates. I think I found a source, though if someone has such a data set, please kindly point the way. Thanks. :)


Updated: Yes, I found a dataset and did some analysis on birth months for selected years. Check it out in this blog post: Were Less Babies born in the summer in the late 1800s ?!

Conclusion:
This is perhaps why our mystically inclined friends see 11s everywhere! And they are getting spooked by today's TRIPLE 11 occurrence, 11-11-11. Or, conversely, they feel lucky, given all the weddings going on around the world today. It's just the statistical distribution of leading digits in real-life data according to Benford's Law. But then again, they will say, "Yep, this PROVES our beliefs!" Correlation does not necessarily imply causation: just because you see things associated together does not necessarily mean A causes B to happen.

But I'll throw a bone to the 11-11-11 crowd and end on this note. According to M-theory, an extension of superstring theory and a candidate grand unifying theory in physics, we are potentially living in an 11-dimensional universe... (Twilight Zone music playing in the background)..

 

HAPPY VETERANS DAY!  Thank You.


Technical Addendum:
The general reader can safely ignore this section; the interested reader can forge ahead. :)

Mathematical statement of Benford's Law (and how I was able to get the 30.1% with the Excel chart)
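
For reference, the standard textbook statement of the law gives the probability that the first digit of a number is d as:

```latex
P(d) = \log_{10}\!\left(1 + \frac{1}{d}\right), \qquad d \in \{1, 2, \dots, 9\}

% Worked case for the leading digit 1:
P(1) = \log_{10}(2) \approx 0.301 = 30.1\%
```

In a spreadsheet, one way to get the same values is a formula along the lines of =LOG10(1+1/A2), with the digits 1 to 9 in column A. That cell layout is an assumption for illustration, not necessarily how the chart in this post was built.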

Getting the Data, Data Cleaning/Data Munging, Data Analysis, and Data Visualization:
  1. Download the data from here Census.gov Datasets - Agricultural Exports and Imports--Volume by Principal Commodities
  2. Clean up the header in Excel or use Google Refine (a topic of another post)
  3. Start using the language R (http://www.r-project.org/)
  4. Download and install R if you don't have it. It's FREE (open source).
  5. Use the following R code to do data massaging and calculations (details will follow shortly)
  6. Export back into Excel
  7. Graph it in Excel (potentially one could graph it in R as well) 
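
The first-digit extraction and tabulation in step 5 could look something like the sketch below. It is written in Python rather than R and is not the code used for this post. The function names and the sample volumes are made up for illustration.

```python
from collections import Counter


def first_digit(value):
    """Return the first non-zero digit of a number, or None if there is none."""
    for ch in str(abs(value)):
        if ch in "123456789":
            return int(ch)
    return None


def first_digit_shares(values):
    """Map each digit 1 to 9 to its share of the leading digits in `values`."""
    digits = [d for d in map(first_digit, values) if d is not None]
    if not digits:
        return {}
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}


# Made-up volumes for illustration only; NOT taken from the Census data.
sample_volumes = [1250, 1874, 112, 23.5, 2960, 0.37, 415, 96, 1033, 7]

shares = first_digit_shares(sample_volumes)
for digit, share in shares.items():
    print(f"{digit}: {share:.1%}")
```

Running the same tabulation once per year on the real data would give one row of the summary table per year.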

2 comments:

  1. You're exporting to Excel to plot? Try install.packages("ggplot2") in R and you'll see what you've been missing.

  2. Mr. Gunn - Thank you for the suggestions of ggplot2. I'm aware of the graphics package. I need to get around to exploring it further. Perhaps, in a future posting.
