I have a fun one for us tonight. When we’re using data in analytics, the purpose of doing data analysis is obviously to come to some greater understanding so that we can make a change: improve something, make more money, and so on. We’ve had some other podcasts where we’ve talked about the purpose of data and why we do it, and part of the reason we go off on rants and criticize people so frequently is that they do data just for the sake of data.
What is correlation?
We’re going to talk about a fun one here, which is correlation. If the goal is to predict, we do so because we want to anticipate something and prepare for it, so that we can deal with it more effectively or possibly change the course of the outcome.
For example, greenhouse gases cause global climate change. This isn’t political commentary; that’s science, and I am a scientist. We want to know this so that we can change what happens, so that there are less severe hurricanes, for example, that destroy New Orleans or Florida or the Carolinas or even New York City.
It’s important to understand the relationship between those things so that we can do something about it, whether that means reducing greenhouse gases or doing something after the fact to clean it up. Those are all different discussions. We want to understand that relationship so that we can act and either prevent the outcome or reduce its severity.
There’s often confusion between correlation and causation. There was an Economist magazine article that talked about this subject and noted that as ice cream eating went up, reading test scores went up too. Is that possible? Does eating more ice cream cause you to become more intelligent, to improve your reading skills?
There’s a whole range of these things, and I’ll give you a few just because I love them. They are often called spurious correlations. One of those I like is that the number of people who have drowned after falling out of a fishing boat correlates heavily with Kentucky’s marriage rate. Let me get this straight. The more people drown after falling out of fishing boats, the fewer people get married in Kentucky. Maybe they’re trying to avoid marriage, and so they drown themselves. I’m not entirely sure. It seems unlikely that there’s causality there, but maybe.
It turns out that the fewer pirates there are, the more global warming there is. As the number of pirates has gone down, the rate of global warming has increased almost entirely in line with that.
Look at some interesting statistics
The rate of U.S. spending on science, space, and technology has gone up almost in lockstep with the number of suicides by hanging, strangulation, and suffocation.
Here’s one that’s near and dear to my heart because I was born and raised in Wisconsin. The per capita cheese eating correlates with the number of people who died getting tangled in bedsheets. I’ll repeat that. The per capita cheese consumption has gone up very closely with the number of people who have died by becoming tangled in their bedsheets.
Perhaps my all-time favorite is that the rate of backyard swimming pool deaths for children correlates with the number of Nicolas Cage films. The more Nicolas Cage films there are, the more children die in backyard swimming pools. Nicolas Cage makes some terrible films. Hopefully, the children haven’t seen them, and they’re just drowning themselves because they’re so upset by how bad these films are. If anybody’s upset about children dying, I apologize. I’m just trying to make light of spurious correlations.
What are the correlations?
Some of these, by the way, have correlations of 0.95 to 0.99. I mean, they are extraordinarily high correlations, meaning one series looks almost entirely predictive of the other because they move so perfectly in lockstep.
That takes us to the definition of correlation. Correlation is a relationship or connection between two things based on co-occurrence or pattern of change, or the tendency for two values or variables to change together, either in the same way or in the opposite way. In other words, how closely do these move in lockstep? Correlation is typically measured between -1 and +1. A value of +1 is a perfect correlation, meaning they move exactly together, -1 means they move exactly opposite to each other, and 0 means there is no linear relationship. Many of these examples are 0.93, 0.95, 0.99. It’s extraordinary.
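For anyone who wants to see the arithmetic behind that -1 to +1 number, here is a minimal sketch. It assumes the usual Pearson correlation coefficient; the episode doesn’t say which measure the examples above use. The function name `pearson_r` and the sample numbers are made up purely for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length number sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # How much the two series vary together around their means.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each series varies on its own.
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    if variation_x == 0 or variation_y == 0:
        raise ValueError("correlation is undefined when a series never changes")
    return co_variation / math.sqrt(variation_x * variation_y)

# Illustrative, invented numbers (not real statistics).
series_a = [1, 2, 3, 4, 5]
moves_together = [10, 20, 30, 40, 50]
moves_opposite = [50, 40, 30, 20, 10]

print(pearson_r(series_a, moves_together))  # perfect positive correlation
print(pearson_r(series_a, moves_opposite))  # perfect negative correlation
```

Notice that nothing in the calculation knows or cares why the two series move together. It only measures how closely they do, which is exactly why a high number on its own says nothing about cause.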
Merriam-Webster defines causation as the act or agency which produces an effect. The summary of this is that causation means cause and effect. It’s not a coincidence. Correlation can be a coincidence. Causation means “something makes something else happen.” Do Nicolas Cage films cause children to die in backyard swimming pools? I am sure that’s not the case, as much as I sometimes want to drown myself when I see one of his films. One is not causal of the other.
It’s difficult sometimes to find the distinction and to determine whether or not something is causal. We can look at data and see that things are related to each other, and then the question of causality has to be investigated once we’ve found that they’re correlated.
A vital part of this is that we have to be skeptical and not just assume that something causes something else because the two are highly correlated. We need to test. We need to see whether a change in one is followed in the future by a change in the other. If we can, we should even move one of those variables ourselves and see whether or not it causes a corresponding change in the other one.
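To make that test concrete, here is a small simulation sketch. Everything in it is an assumption for illustration: two made-up variables, A and B, are both driven by a hidden common factor, and B never depends on A. When we only observe them, they look strongly correlated. When we set A ourselves, the correlation should disappear, which is the evidence that A does not cause B. The names `simulate` and `pearson_r` and all the numbers are hypothetical.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length number sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(variation_x * variation_y)

def simulate(n, intervene, rng):
    """Return the correlation between A and B over n simulated observations.

    Assumed toy model: a hidden factor drives both A and B.
    B is never computed from A, so A has no causal effect on B.
    """
    a_values, b_values = [], []
    for _ in range(n):
        hidden = rng.gauss(0, 1)
        if intervene:
            # We move A ourselves, ignoring the hidden factor.
            a = rng.gauss(0, 1)
        else:
            # Left alone, A follows the hidden factor plus some noise.
            a = hidden + rng.gauss(0, 0.3)
        b = hidden + rng.gauss(0, 0.3)
        a_values.append(a)
        b_values.append(b)
    return pearson_r(a_values, b_values)

rng = random.Random(42)  # fixed seed so the run is repeatable
observed_r = simulate(2000, intervene=False, rng=rng)
intervened_r = simulate(2000, intervene=True, rng=rng)

print(f"just observing:   r = {observed_r:.2f}")
print(f"after moving A:   r = {intervened_r:.2f}")
```

In this toy setup the observed correlation should come out high, roughly 0.9, while the correlation after the intervention should sit near zero. If A really did cause B, moving A would have moved B too, and the correlation would have survived.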
That’s our little session on correlation and causation. Hopefully, it was entertaining and also a little bit educational. We’ll see you on the other side.