What is insight?
For me, it’s always about cause and effect, although sometimes we don’t do a good enough job of making that explicit.
Let’s have a look at the process of going from data to insight.
When we analyse data, we start simply by describing it. We look at averages, percentages, etc. to understand what’s typical, and how much it varies. But no one could call that insight.
The next step is to look for patterns, exceptions, and differences. Analysis can get extremely complex, involving all sorts of fancy techniques, but fundamentally it comes down to three things:
- Comparisons (“let’s split by region”)
- Trends (“what’s happened to our Ease score over the past year?”)
- Association (“what correlates with NPS?”)
Hidden within all three of these approaches is a key causal question, “why?” Why does this region have higher retention than that region? Why is our Ease score trending down? Why are some people Promoters and others Passives?
We should take care to be clear when we move to this kind of causal reasoning, because it is a bit of a leap of faith, and in practice organisations often use data and statistical tools which are not really up to the job.
Correlation is not causation, as the saying goes, but frankly if we’re reporting correlation then it’s usually because we’re at least speculating about causation. That’s not necessarily a problem, as long as we’re clear on what we’re doing:
“…speculation is fine, provided it leads to testable predictions and so long as the author makes it clear when he’s ‘merely’ speculating.”
V.S. Ramachandran
Making up our own stories to explain correlation is not the answer (although as I’ve said elsewhere, telling stories is often the best way to communicate cause and effect arguments to other people).
What we’re really interested in is not asking “why”, but the related question “what if?” What if we take the account plans that the top-scoring regional manager has developed and use them in the other regions? What if we invest in a better customer portal? What if our score for reliability goes up by 1 point?
Asking “what if…?”
Focusing heavily on building machine learning models which predict some outcome of interest (perhaps customer defection) with a great deal of accuracy is a very “big data” approach to insight. These techniques can be extremely powerful—they are more robust than statistical approaches when data is very sparse, more open in the types of data they can deal with, and more flexible about the shape of the relationship between variables.
But prediction is still not causation.
That might seem a bit odd if you’re not used to thinking about this stuff, so let’s prove it with a classic thought experiment. We can look outside at the lawn to see if the grass is wet, and if it is then we can (accurately) predict that it’s been raining. But if we go outside with a watering can and make the grass wet, that doesn’t make it rain.
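The thought experiment can even be run as a small simulation. This is a sketch with made-up probabilities (30% chance of rain, plus an occasional sprinkler): observing wet grass makes rain much more likely, but intervening with the watering can leaves the chance of rain exactly where it was.

```python
# Simulating the wet-grass example. All probabilities are invented.
import random

random.seed(0)

def day(watering_can=False):
    """One day: rain 30% of the time, sprinkler 10%; either wets the grass."""
    rain = random.random() < 0.3
    sprinkler = random.random() < 0.1
    wet = rain or sprinkler or watering_can
    return rain, wet

observed = [day() for _ in range(100_000)]

# Observation: among days when the grass is wet, how often did it rain?
p_rain_given_wet = (sum(rain for rain, wet in observed if wet)
                    / sum(wet for _, wet in observed))

# Intervention: we go out with the watering can. The grass is always wet,
# but the underlying chance of rain is untouched.
intervened = [day(watering_can=True) for _ in range(100_000)]
p_rain_given_do_wet = sum(rain for rain, _ in intervened) / len(intervened)

print(p_rain_given_wet)     # well above the 30% base rate: wet grass predicts rain
print(p_rain_given_do_wet)  # still around the base rate: wetting grass doesn't cause rain
```

The gap between those two numbers is the gap between prediction and causation.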
Focusing on prediction without a theory about the underlying causal mechanism can lead us to make equally stupid mistakes in the real world.
With statistical modelling techniques we can build models to capture a theory of cause and effect, and to test it. But what we’re really, really interested in is not even asking “what if…”, it’s understanding what will happen if we take some action. What if we cut our delivery times in half?
Asking “what if we do…?”
How do we make this additional leap from prediction to causation?
The key is that we have a theory, because once you make your theory explicit you can test it. Techniques such as Structural Equation Modelling allow us to test how well our theory matches up to the data we’ve got, and that’s extremely useful.
But not all data suits these kinds of models. More generally, there’s a lot we can learn from the Bradford Hill criteria, which originated in epidemiology. Put simply, these are a set of 9 conditions for drawing causal conclusions from observational data, and include important ideas such as a dose-response relationship and a plausible mechanism of action.
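To make one criterion concrete, dose-response asks whether the outcome rises steadily with the exposure. Here’s a minimal check on invented data—hypothetical support-contact frequency as the “dose” and retention rate as the “response”:

```python
# One Bradford Hill criterion, dose-response, as a simple check:
# does the outcome rise monotonically with the exposure? Data is invented.
from statistics import mean

# Hypothetical retention rates by how often customers engage with support
groups = {
    "never":   [0.61, 0.58, 0.63],
    "yearly":  [0.66, 0.69, 0.64],
    "monthly": [0.74, 0.71, 0.76],
}

order = ["never", "yearly", "monthly"]
means = [mean(groups[g]) for g in order]

# True if each step up in "dose" brings a higher average "response"
dose_response = all(a < b for a, b in zip(means, means[1:]))
```

A monotonic pattern like this doesn’t prove causation on its own—it’s one of the nine conditions, to be weighed alongside the others.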
Judea Pearl is one of the leading thinkers on causal inference, so if you’re interested in this kind of stuff I’d highly recommend his work. The Book of Why is the most accessible starting point.
From theory to data
Even better, theory can guide the data we collect, whereas naive machine learning approaches assume we have all the relevant data. In practice that is very rarely the case.
Often we’re not measuring what really matters, or even what we think we’re measuring, like the online dating service that developed an algorithm to assess the quality of your profile picture. Except that you can’t measure “quality”, so they ended up measuring popularity instead. The result? The “best” photos, according to the algorithm, all belonged to attractive young women.
Not much help for me if I want to improve my profile (although fortunately I’ve been off the market for 20 years!).
That’s why, although they’re extremely powerful when it comes to prediction, I think machine learning approaches are not yet the final word. By focusing on prediction rather than explanation, they miss the role of theory in guiding what data to collect, the importance of understanding intervention rather than association, and the subtle errors that come from assuming you’ve measured what you’re really interested in.
Insight is about cause and effect, and that means developing and testing new theories about the world.