I’ve been interested in developing models and using data to drive business decisions, so I have recently been reading “Doing Data Science”, which is available at http://www.amazon.com/Doing-Data-Science-Straight-Frontline/dp/1449358659/. The book contains a fair bit of math, which might make it seem daunting, but I believe it’s worth the read because the authors offer interesting insights into how to incorporate data analysis and modelling into solving business problems.
There are two sections in particular that I found useful. The first is on exploratory data analysis, the process by which you start to construct a solution to your problem. As the authors state, “Exploratory data analysis (EDA) is often relegated to chapter 1 (by which we mean the ‘easiest’ and lowest level) of standard introductory statistics textbooks and then forgotten about for the rest of the book… But EDA is a critical part of the data science process…” One of the challenges for me, especially when facing a (messy) business problem, is figuring out what is relevant to the issue, so the framework laid out in this book for doing EDA gives me a good structure for approaching this step. This involves asking both what information might be available to help me find correlations with the desired business result and what strategies I can use to tease out those correlations.
The second, related to this, is the chapter on extracting meaning from data, where the authors make the point that asking more questions and gathering more information doesn’t necessarily lead to a better outcome or model if the data you are gathering is not relevant to the problem at hand.
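
To make the idea of looking for correlations with a business result a little more concrete, here is a minimal sketch of my own (not something taken from the book). It ranks candidate inputs by the strength of their Pearson correlation with a target. The feature names `ad_spend` and `office_temp` and all of the numbers are made up purely for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column carries no information about the target.
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_features(features, target):
    """Return (name, correlation) pairs, strongest absolute correlation first."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy data: monthly sales and two candidate inputs.
sales = [1, 2, 3, 4, 5]
candidates = {
    "ad_spend": [2, 4, 6, 8, 10],    # made up: moves in step with sales
    "office_temp": [3, 1, 3, 1, 3],  # made up: unrelated to sales
}

ranked = rank_features(candidates, sales)
for name, score in ranked:
    print(f"{name}: {score:+.2f}")
```

A ranking like this is only a starting point for EDA. A strong correlation still needs the kind of relevance check the book argues for, and Pearson correlation will miss relationships that are not linear.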
The book also includes a number of useful vignettes about the real-life application (and misapplication) of data-driven business decisions. For instance, here is an example from IBM where they wanted to find potential customers for their online business service:
At IBM, the target was to predict companies that would be willing to buy “websphere” solutions. The data was transaction data and crawled potential company websites. The winning model showed that if the term “websphere” appeared on the company’s website, then it was a great candidate for the product. What happened? Remember, when considering a potential customer, by definition that company wouldn’t have bought websphere yet (otherwise IBM wouldn’t be trying to sell to it); therefore no potential customer would have websphere on its site, so it’s not a predictor at all… Doing simple sanity checking to make sure things are what you think they are can sometimes get you much further in the end…
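
The sanity check described in that story can be sketched in a few lines. This is my own illustration, not IBM’s or the book’s method: it asks whether a feature takes more than one value among the companies you actually want to score (those that are not yet customers). The field names `is_customer`, `mentions_websphere` and `has_it_department`, and every row of data, are hypothetical.

```python
def values_among_prospects(rows, feature, label="is_customer"):
    """Collect the distinct values a feature takes among non-customers."""
    return {row[feature] for row in rows if not row[label]}


def varies_among_prospects(rows, feature, label="is_customer"):
    """A feature that never changes among prospects cannot help rank them."""
    return len(values_among_prospects(rows, feature, label)) > 1


# Hypothetical data: existing customers mention the product, prospects never do.
companies = [
    {"is_customer": True, "mentions_websphere": True, "has_it_department": True},
    {"is_customer": True, "mentions_websphere": True, "has_it_department": False},
    {"is_customer": False, "mentions_websphere": False, "has_it_department": True},
    {"is_customer": False, "mentions_websphere": False, "has_it_department": False},
]

for feature in ("mentions_websphere", "has_it_department"):
    print(feature, varies_among_prospects(companies, feature))
```

In this toy data, `mentions_websphere` separates customers from non-customers perfectly, which is exactly why a model would latch onto it, yet it is constant among prospects and so says nothing about which of them to approach.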