Wednesday, March 14, 2012

Predicting March Madness through Social Media Analytics & Sports Crowdsourcing

March Madness is upon us. Let's see what all the buzz is about using some
social media analytics.

First off, I'll do a word cloud of the buzz surrounding the hashtag marchmadness. Note: This is
a snapshot I took at a point in time yesterday.

We can observe a few things. Obviously, marchmadness has the largest representation, because
it is the most frequently occurring word. This is by design, since I used marchmadness as
the search keyword. Next, we see bracket and brackets prominently represented as well.
This too makes sense. Further down the list are words like tournament, tourney, and ncaa.
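
A word cloud is really just a picture of word counts, so the observations above can be reproduced with a simple frequency table. Here is a minimal sketch of that counting step, written in Python rather than the R used for this post. The sample tweets and the `word_frequencies` helper are made up for illustration; they are not the data or code behind the cloud.

```python
import re
from collections import Counter

# Hypothetical sample tweets (assumption); the real data came from a
# keyword search on marchmadness.
tweets = [
    "Filling out my bracket for #marchmadness",
    "#MarchMadness brackets are due Thursday!",
    "Syracuse all the way #marchmadness",
]

def word_frequencies(texts):
    """Lowercase each text, split it into words, and count occurrences."""
    counts = Counter()
    for text in texts:
        # Keep letters, digits, and apostrophes; '#' and punctuation drop out.
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    return counts

freqs = word_frequencies(tweets)
# The search keyword appears in every tweet, so it dominates the counts.
print(freqs.most_common(1))
```

Note that bracket and brackets are counted as separate words here, which is why both show up in the cloud. Stemming would merge them, if that is what you want.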

Inspecting it more closely and zeroing in on just teams, we see syracuse, kentucky, msu,
michigan, and pittsburg. (Note: If you can find any more teams in that word cloud, let me know.)
So, there is definitely some buzz around Syracuse. The word Thursday is also nicely represented. This also makes sense, given that the games start tomorrow, Thursday.

This is all fun, but it is also useful. It shows that mining unstructured data (text) in social media streams can give a general feel for what people are saying, and that the result matches our natural intuition.

To those who say a word cloud is not insight, I'm going a step further and doing some real sentiment analysis. We already have our word cloud, built with the text mining and word cloud libraries in R as detailed in my post Visualizing Strata Conference Tweets as Word Cloud using R.

Now we have to build some kind of sentiment scoring function, one that scores words expressing positive sentiments against words expressing negative sentiments. Fortunately, there are lists of positive and negative words used in
text analytics research (Hu and Liu). Alternatively, you can create your own lists of positive and negative words. With these two lists, we can match up the tweets generated about March Madness to score how the sports crowd feels about it.
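
To make the scoring idea concrete, here is a minimal sketch in Python (the analysis in this post was done in R, so this is not the code I ran). It assumes the simplest possible rule: a tweet's score is the number of positive-word matches minus the number of negative-word matches. The tiny word lists, the sample tweets, and the `score_sentiment` name are all made up for illustration; a real run would load a full published lexicon instead.

```python
import re

# Tiny illustrative word lists (assumptions), standing in for a full
# lexicon of positive and negative words.
POSITIVE = {"excited", "love", "great", "win", "fun"}
NEGATIVE = {"hate", "awful", "lose", "busted", "terrible"}

def score_sentiment(text, positive=POSITIVE, negative=NEGATIVE):
    """Return (# positive word matches) - (# negative word matches)."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(word in positive for word in words)
    neg = sum(word in negative for word in words)
    return pos - neg

# Hypothetical tweets, one positive, one negative, one neutral.
tweets = [
    "So excited for #marchmadness, love this time of year",
    "My bracket is already busted, awful picks",
    "Games start Thursday",
]
scores = [score_sentiment(t) for t in tweets]
print(scores)
```

A score above zero counts as a positive tweet, below zero as negative, and zero as neutral. This simple matching ignores negation ("not great") and sarcasm, so treat the scores as a rough signal.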

Here's the sentiment analysis, visualized:

Yes, people are excited about March Madness. There are overwhelmingly more positive sentiments than negative ones.

An extension of this is to track the sentiments tweeters have about certain teams, to see if their opinions are a good indicator to PREDICT a team's performance. :) This does not necessarily imply that positive scores (sentiments) equate to winning teams. In reality, sentiment could be a contrarian indicator. Likewise, a preponderance of negative sentiments could just be the opposing fan base trash talking.
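
One possible shape for that extension, sketched in Python under stated assumptions: group tweets by the team they mention, then average a simple positive-minus-negative score per team. The word lists, team list, sample tweets, and function names are all hypothetical, and I have not run this against real tournament data.

```python
import re
from collections import defaultdict

# Illustrative word lists and team names (assumptions).
POSITIVE = {"love", "great", "win", "unstoppable"}
NEGATIVE = {"hate", "lose", "overrated", "choke"}
TEAMS = ["syracuse", "kentucky", "msu"]

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())

def score_sentiment(text):
    """Positive-word matches minus negative-word matches."""
    words = tokenize(text)
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def team_sentiment(tweets, teams=TEAMS):
    """Average sentiment score of the tweets that mention each team."""
    scores = defaultdict(list)
    for tweet in tweets:
        words = set(tokenize(tweet))
        for team in teams:
            if team in words:
                scores[team].append(score_sentiment(tweet))
    # Teams with no mentions are simply left out of the result.
    return {team: sum(s) / len(s) for team, s in scores.items()}

tweets = [
    "Syracuse looks unstoppable, love this team",
    "Syracuse is overrated",
    "Kentucky will win it all",
]
result = team_sentiment(tweets)
print(result)
```

Comparing these averages against actual game results, round by round, would be the way to check whether the crowd is a leading or a contrarian indicator.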

I ran the same analysis a few minutes ago, and the word cloud has changed to reflect today's sentiments:

Now more schools are mentioned, like duke, carolina, vcu, vanderbilt, wku, louisville, lehigh, and my current alma mater (hahvard). Too bad my other alma mater, MIT (the Engineers), is in NCAA Division III.

Happy Pi Day! (3/14 !)

Note: A friend of mine noted that 2016 will be a really BIG Pi Day: 3.1416. I added
that 2015 is even bigger and more precise: 3.14159. 2016 would be rounding up.

Technical note: Plots were made with the ggplot2 library, and corpus generation was done with the text mining
library in R. Ditto for word cloud generation.
