The following notes were live-blogged from the “Understanding Predictive Analytics” session given by Chuck Chakrapani (Leger Marketing) on June 10, 2014. Minimal editing has been done, so there may be typos. Below is a video interview with the presenter:
Chakrapani is interested in technology-enabled predictive analytics (as opposed to technology-driven).
What is Data Analysis:
- big data
- machine learning
- data mining
- predictive analytics
- text mining
- etc.
Everything is predictive:
- do we want to go to this session or another
- do I take this job offer
- will my stocks go up as well
Business
- will this new product succeed
- can I increase the price
- who will be my target audience
Steps:
What will happen: either A or B will happen, and each outcome carries its own consequences for the results
Google Fusion
- Enables you to pull information from the web
- This means we have access to a vast amount of secondary data
The New Science of Data Science
Data science is the study of the generalizable extraction of knowledge from data. It builds on techniques and theories from many fields:
- signal processing
- probability
- etc
What is big data?
- A large amount of data?
- More data than your desktop could handle?
- One zettabyte of data?
- No agreed upon definitions
- A tentative framework
- Drawn from the data universe, which is infinite and constantly in flux
Big Data and the Flu
- Google uses searches and conversations about the flu to predict infection rates. Big data is great when it works; the problem is that it gives you only correlations
Machine learning
- Example: Amazon tells me what I should read based on what I am reading now
- Machine learns and predicts
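The Amazon example above can be sketched in a few lines. This is only a hypothetical co-occurrence recommender, not Amazon's actual algorithm, and the book names and purchase histories are made up:

```python
from collections import Counter

# Hypothetical purchase histories: each set is the books one customer bought.
histories = [
    {"data_science_101", "stats_basics"},
    {"data_science_101", "stats_basics", "ml_intro"},
    {"data_science_101", "ml_intro"},
    {"cookbook"},
]

def recommend(current_book, histories, top_n=1):
    """Recommend the books that most often co-occur with the current one."""
    co_counts = Counter()
    for basket in histories:
        if current_book in basket:
            co_counts.update(basket - {current_book})
    return [book for book, _ in co_counts.most_common(top_n)]
```

Calling `recommend("data_science_101", histories)` surfaces the titles most often bought alongside that book; the "learning" is just counting what past customers did.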
What Happens When You Use Gmail
- Google ads based on emails
Two Functions of Predictive Analytics
- Classification
- Prediction
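A minimal sketch of the two functions, on made-up customer data: classification assigns a label (here with a 1-nearest-neighbour rule), while prediction estimates a numeric value (here with a least-squares line). The data and function names are hypothetical illustrations, not anything from the session:

```python
def classify_1nn(point, labeled_points):
    """Classification: copy the label of the closest known point."""
    nearest = min(labeled_points, key=lambda p: abs(p[0] - point))
    return nearest[1]

def fit_line(xs, ys):
    """Prediction: ordinary least squares for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# Classification: label customers as likely buyers by monthly spend (made-up data).
training = [(10, "no"), (20, "no"), (80, "yes"), (95, "yes")]
label = classify_1nn(70, training)  # nearest known customer is (80, "yes")

# Prediction: estimate sales from ad spend (made-up data that is exactly y = 1 + 2x).
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

Same data, two different questions: "which group does this case belong to?" versus "what value should we expect?"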
The objectives haven’t changed, but:
- Lower costs
- better predictability
- faster turn-around
Example
- 25 years ago, a single cluster analysis of 600 respondents on 30 variables would run for 24 hours on a PC
- Today you can run 100 cluster analyses of 1,000 respondents on 30 variables in one afternoon
How does that help?
Then:
- one respondent was chosen randomly to represent each segment
- everyone close to that respondent was assigned to the segment
- there was nothing to indicate whether the segmentation was reasonable
- there was no way of validating your segments
- a holdout sample was better than nothing, but not good enough
Now:
- We can use larger samples, which let us split the data into a training set and a test set
- We can run hundreds of cluster analyses on the same data
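The training/test idea can be sketched with a toy 1-D k-means. This is an illustrative assumption, not the presenter's method: fit segment centroids on a training set, then check how well those segments describe held-out respondents. The "respondent" values are made up and deliberately bimodal:

```python
def kmeans_1d(points, k=2, iters=20):
    """Tiny 1-D k-means: seed centroids at the extremes, then alternate
    between assigning points and recomputing centroid means."""
    centroids = [min(points), max(points)] if k == 2 else points[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

def mean_distance(points, centroids):
    """Average distance from each point to its nearest centroid:
    a simple score for how well segments fit unseen data."""
    return sum(min(abs(p - c) for c in centroids) for p in points) / len(points)

# Made-up respondent scores drawn from two obvious groups.
train = [1, 2, 3, 2, 1, 20, 21, 22, 19, 23]
test = [2, 21, 1, 22, 20]

centroids = kmeans_1d(train, k=2)
# If the segments are real, they should fit the held-out respondents too.
score = mean_distance(test, centroids)
```

A low test-set score suggests the segments generalize; a high one suggests they were an artifact of the training sample — exactly the validation the "then" workflow lacked.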
Message:
Do not think of big data as everything. Unless you combine data with analysis, the whole thing is useless. You need to have objectives.