In this series of posts, I'll demonstrate some examples of retail analytics. Though the data is from a retail setting, the methodology and analyses can be applied more broadly than just the retail industry.
The examples are from the book titled Data Analysis Using SQL and Excel. This book is actually very well written and has great examples. I'm recasting some of it using more modern data science tools & approach. In the prior blog posts, I've downloaded data from public sources and the data usually came in csv or Excel format. This time it will be from a standard SQL Server database, which is more typical. Perhaps, “open data” in the future will available in database format or more likely through some RESTful APIs.
Getting the data
You can get the accompanying data set here. The site has the scripts to load the data set into your favorite RDBMS system(i.e. SQL Server, MySQL, Oracle, etc) and it also has Excel files for each chapters of the book.I'll be working through some of the examples using R instead of Excel since R is such a powerful statistical, graphics, and analytics tool. Additionally, because the book uses Excel as its graphing engine, it had to contort the data to certain formats and other forms of acrobatics. I think that's unnatural.
WIth R, we can do it all in one shot with its extensive libraries for graphing and analysis. Additionally, R is significantlly more “expressive” as language than Excel/VBA. Hilary Mason posted on LinkedIn today an article by Donnie Berkholz on Programming languages ranked by expressiveness. R is considered one of the more expressive domain specific languages.
Let's get down to it. So, let's connect to a database. To do that, we'll need RODBC library. I've assumed you've run the script that installs the database and accompanyting tables into your favorite RDBMS. The database is called “sqlbook.”“
library(RODBC)
# set a connection to the database on SQL Server 2008
conn <- odbcDriverConnect("driver=SQL Server;database=sqlbook;server=PETERCHEN-PC")
The gist of any data science or analytics analysis is the questions we like to ask. We are not just crunching numbers and graphing for its own sake(though that in itself can be its own enjoyment). It's because we want to answer some questions to help us run our businesses better, generate more revenues, increase retention, decrease attrition, increase customers' loyalties, predict any number of things, etc.The book mentioned above has the data model which I'll not repeat here. It's instructive to understand the ER(entity-relationship) diagram so we can connect the various tables to form the data set that we'll be doing the analytics on. Fortunately, the ER diagram is rather simple and thus can be readily use for illustrative purposes of analytics.
Day of Week Analytics
The first question we'll ask is "Are there day of the week effects for the number of orders and average price per order ?”I used the handy plyr and lubridate library from Hadley Wickham. plyr for data massaging and lubridate to handle calendar dates math. It's unfortunate that calendar dates math is so complex in many programming languages. lubridate simplies some of that for us. To access data in any database, I used sqlQuery and pass SQL command strings with the connection instantiated above(conn variable).
I think one of the strengths of R is its ability to do “vectorized operations”, which is what makes the language expressive. One line of code using vectorized operations is equivalent to many lines in other languages. Just as “list comprehension” is considered one of the cool features of Python.
This brings up a larger issue. One has to know when to apply the right type of tools for the right kinds of problems. Using a relational database as the data store makes sense. It wouldn't make much sense for R to solve that problem. Though for unstructured data, there is currently a revolution with Hadoop and its variants that might complement or possibly supplant the relational database model.
The problems arise when people apply the wrong tools for the problem at hand. For example, some people use Excel as a data store(I've been guilty of that in the past where I was in circumstances where that was the case and just went along) as a kind of a mini-database. Though probably workable for very small scale data set, it becomes problematic as the data set gets bigger. Then all sorts of contortions such as using various Excel functions(index,match,vlookup,hlookup, cell offsets, etc.) is employed. It's unnatural and difficult to maintain.
So, the book above uses SQL databases as the backend and Excel as the general all-purpose graphics and analytics tool. This is all fine except in this era of data science, it's better to use a highly expressive all encompassing anlytics and statistical tool like R.
library(plyr)
library(lubridate)
ordersdata <- sqlQuery(conn, "SELECT orderdate, COUNT(*) as numorders,\nAVG(totalprice) as avgtotalprice\nFROM orders\nGROUP BY orderdate\nORDER BY 1")
# Day of the week Add the column wkday that shows the day of the
# week(label) to ordersdata
dayofwk <- ddply(ordersdata, "orderdate", mutate, wkdy = wday(as.Date(orderdate),
label = T))
# Group by Day of Week: Sum # of Orders and Avg Price Per Order
dayofwk <- ddply(dayofwk, "wkdy", summarize, N = sum(numorders), AvgPrice = mean(avgtotalprice))
Use R Base graphics to look at Day of Week Effects
# Using Base R graphics
par(mar = c(5.1, 4.1, 4.1, 6.1))
barplot(height = dayofwk$N, names.arg = dayofwk$wkdy, ylab = "Num of Orders",
xlab = "Day of Week", main = "Day of the Week Effect for Num of Orders & Avg Price Per Order",
axes = T)
par(new = T)
plot(dayofwk$wkdy, dayofwk$AvgPrice, xlab = "", ylab = "", type = "b", col = "red",
axes = F)
axis(4, col = "red", col.axis = "red")
mtext("Avg Total Price Per Order", side = 4, col = "red", line = 2.5)
We see that Monday has the largest number of orders. However, the items ordered are on average not of the highest price. That honor goes to Saturdays. Perhaps, on weekends, people order more expensive higher priced items than on weekdays. This is evident for both Saturdays and Sundays. On Wednesdays, people ordered the least expensive items on average. All of this data can have strategic implications on when to market certain items to your customer base. This type of information is useful beyond the retail space.
Use ggplot2 Library to plot
We can also use the ggplot2 library. However, dual Y-axes are NOT supported in ggplot2 so we have to plot two separate graphs.# Using ggplot2
library(ggplot2)
# Dual Y axis is not supported because it's against data viz best
# practices principles
qplot(wkdy, N, data = dayofwk, geom = "bar", fill = wkdy)
qplot(wkdy, AvgPrice, data = dayofwk, geom = "bar", fill = wkdy)
Convert data to Time Series
Now, we look at the orders data from a time series perspective. We convert the same data frame into a times series object of the class zoo. With a time series perspective, we can see how things evolve over time as well as bring in the entire analytics toolbox of time series analysis for forecast, trending, seasonality modeling, etc. This will be covered in a separate blog post.library(zoo)
# Convert ordersdata dataframe to a zoo object(time series object)
df.zoo <- with(ordersdata, zoo(ordersdata[, 2:3], order.by = as.Date(ordersdata$orderdate)))
plot(df.zoo, plot.type = "multiple", col = 2:4, main = "Num of Orders & Avg Total Per Order over Time")
We then take a monthly average of the data series. But first we have to convert the zoo object to xts object.
library(xts)
# Monthly average
avedata <- apply.monthly(as.xts(df.zoo), mean)
# Plot monthly average
plot(avedata$numorders, main = "Monthly Average Number of Orders", ylab = "Num of Orders")
plot(avedata$avgtotalprice, type = "o", main = "Monthly Average Price Per Order")
lines(avedata$avgtotalprice, col = "red")
prettypts <- pretty(avedata$avgtotalprice)
axis(2, at = prettypts, labels = sprintf("$ ", prettypts))
title(ylab = "Average Price of Order")
Just a one year slice:
# One year slice using window function
oneyr <- window(df.zoo, start = as.Date("2009-10-04"), end = as.Date("2010-10-04"))
plot(oneyr, plot.type = "multiple", col = 2:4, main = "One Year slice of the data")
Notice that though throughout the spring(Mar) to summer (July) that though the number of orders were low, the items bought were high priced. There's a season for the timing of purchases.
By using ggplot2, reshape2, and the melt function to graph these two time series together instead of converting the data frame to a time series using the zoo library
library(reshape2)
meltdf <- melt(ordersdata, id = "orderdate")
ggplot(meltdf, aes(x = orderdate, y = value, colour = variable, group = variable)) +
geom_line()
However, the problem is the y-axis scale is not quite correct. Remember, ggplot2 is against dual Y-axis as a best practice.
Most Popular Product Group
Another slice is to find out what is the most popular product group by month. We can see the most popular product group varies from month to month.mostPopularProd <- sqlQuery(conn, "SELECT prodmon.yr, prodmon.mon, prodmon.cnt, p.productgroupname\nFROM (SELECT YEAR(orderdate) as yr, MONTH(orderdate) as mon,\nproductid, COUNT(*) as cnt\nFROM orders o JOIN orderline ol ON o.orderid = ol.orderid\nGROUP BY YEAR(orderdate), MONTH(orderdate), productid)\nprodmon JOIN\n(SELECT yr, mon, MAX(cnt) as maxcnt\nFROM (SELECT YEAR(orderdate) as yr, MONTH(orderdate) as mon,\nproductid, COUNT(*) as cnt\nFROM orders o JOIN orderline ol ON o.orderid = ol.orderid\nGROUP BY YEAR(orderdate), MONTH(orderdate), productid) c\nGROUP BY yr, mon\n) prodmax\nON prodmon.yr = prodmax.yr AND prodmon.mon = prodmax.mon AND\nprodmon.cnt = prodmax.maxcnt JOIN\nproduct p\nON prodmon.productid = p.productid\nORDER BY 1, 2")
yrmon <- as.Date(as.yearmon(paste(mostPopularProd$yr, mostPopularProd$mon, sep = "-")))
mostPopularProd <- data.frame(yrmon, mostPopularProd[, -c(1, 2)])
qplot(yrmon, cnt, data = mostPopularProd, geom = "bar", fill = productgroupname,
stat = "identity", main = "Most Popular Product Group by Month", ylab = "Num of Orders")
Fulfilment Delays Analysis
The question we are trying to understand is the relationship between order to fulfillment with respect to order size. The larger the order the longer it takes to fulfill.One of the coolest features of R is its power of vectorized operations. So, I created a function called cumTotals that figures out the cumulative % based on cumsum function then pass it to colwise which then operate on the dataframe columnwise. With just a couple of lines of code, we can find the cumulative % for all of the columns of a data frame.
Just as SQL is good with sets since it's based on the concepts of relational algebra, R is good with functional concepts and thus very expressive and powerful with vectorized operations. (Note: Supposedly, Haskell is based on category theory. My friend thinks it's wicked cool language. I haven't gotten around to Haskell yet.)
Different languages and tools are designed to solve different types of problems.
delays <- sqlQuery(conn, "SELECT DATEDIFF(dd, orderdate, fulfilldate) as delay, COUNT(*) as cnt,\nSUM(CASE WHEN numunits = 1 THEN 1 ELSE 0 END) as un1,\nSUM(CASE WHEN numunits = 5 THEN 1 ELSE 0 END) as un5,\nSUM(CASE WHEN numunits >= 6 THEN 1 ELSE 0 END) as un6pl\nFROM orders o JOIN\n(SELECT orderid, MAX(shipdate) as fulfilldate\nFROM orderline\nGROUP BY orderid) ol\nON o.orderid = ol.orderid\nWHERE orderdate <= fulfilldate\nGROUP BY DATEDIFF(dd, orderdate, fulfilldate)\nORDER BY 1")
library(scales)
cumTotals <- function(x) {
cumsum(x)/cumsum(x)[length(cumsum(x))]
}
cumDelays <- colwise(cumTotals)(delays[, -1])
cumDelays <- data.frame(delays$delay[1:49], cumDelays[1:49, ])
names(cumDelays) <- c("Delay", "Total", "1 Unit", "5 Units", "6 Units+")
df <- melt(cumDelays, id = "Delay", variable_name = "Units Ordered")
qplot(Delay, value, data = df, color = variable, geom = "line", ylab = "Cumulative % Shipped",
main = "Cumulative Proportion of Orders Shipped for Different Numbers of Units",
xlab = "Delays in Days") + scale_y_continuous(labels = percent)
We can see that by about 7 days about 70% of the order with one unit is fulfilled. As the curve stretches out, we achieve higher and higher percentages of fulfillment.
This is part one of a series. It looks at the time dimension which can lead to future investigations of forecasting survival analysis(retention or churn). Another slice is location which leads to location analytics. Then there's market basket analysis sliced by cutomers, orders, and items.
-->