Correlation vs. Causation
A correlation test measures the relationship between two variables. However, seeing two variables move together does not tell us whether one variable causes the other. This is why we commonly say “correlation does not imply causation.”
A strong correlation might indicate causality, but there could easily be other explanations:
- It may be the result of random chance, where the variables appear to be related, but there is no true underlying relationship.
- There may be a third, lurking variable that makes the relationship appear stronger (or weaker) than it actually is.
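To get a feel for the first explanation, random chance, here is a minimal Python sketch. It repeatedly draws two completely unrelated variables with only 10 observations each and counts how often their correlation exceeds 0.632 in absolute value, which is roughly the two-sided 5% critical value for a sample of that size. The function names, sample size, number of trials, and seed are all illustrative assumptions.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def chance_hit_rate(n_trials=2000, sample_size=10, threshold=0.632, seed=0):
    """Share of independent variable pairs whose |r| exceeds the threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # The two variables are generated independently: no true relationship.
        xs = [rng.gauss(0, 1) for _ in range(sample_size)]
        ys = [rng.gauss(0, 1) for _ in range(sample_size)]
        if abs(pearson_r(xs, ys)) > threshold:
            hits += 1
    return hits / n_trials


rate = chance_hit_rate()
print(f"Share of unrelated pairs with |r| > 0.632: {rate:.3f}")
```

Under these assumptions we would expect something in the neighborhood of 1 in 20 unrelated pairs to look "significantly" correlated, which is why a single strong correlation in a small sample deserves caution.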
For observational data, correlations can’t confirm causation...
Correlations between variables show us that there is a pattern in the data: that the variables we are looking at tend to move together. However, correlations alone don’t show us whether or not the data are moving together because one variable causes the other.
It’s possible to find a statistically significant and reliable correlation for two variables that are actually not causally linked at all. In fact, such correlations are common! Often, this is the case because both variables are associated with a different causal variable, which tends to co-occur with the data that we’re measuring.
Example: Exercise and skin cancer
Let’s think about this with an example. Imagine that you’re looking at health data. You observe a statistically significant positive correlation between exercise and cases of skin cancer – that is, the people who exercise more tend to be the people who get skin cancer at higher rates. This correlation seems strong and reliable, and shows up across multiple populations of patients. Without exploring further, you might conclude that exercise somehow causes cancer! Based on these findings, you might even develop a hypothesis: perhaps the stress from exercise causes the body to lose some ability to protect against these types of cancers. However, exercise is generally thought to reduce cancer risk, so both the conclusion and the hypothesis are questionable.
Perhaps in reality, this correlation exists in your data set because people who live in places that get a lot of sunlight year-round have more opportunities for outdoor recreation than people who live in places that don’t. This situation shows up in their data as increased exercise. At the same time, increased daily sunlight exposure means that there are more cases of skin cancer. Both of the variables – rates of exercise and skin cancer – are affected by a third, causal variable – amount of sunlight – but they are not causally related to one another.
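The sunlight explanation can be illustrated with a small simulation. The sketch below is built on made-up data, not real health records: it assumes that sunlight drives both exercise and skin cancer risk, and that exercise has no effect of its own on cancer. All variable names, units, and noise levels are hypothetical. It then compares the raw correlation with a partial correlation that holds sunlight fixed.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_r(xs, ys, zs):
    """First-order partial correlation of xs and ys, controlling for zs."""
    r_xy = pearson_r(xs, ys)
    r_xz = pearson_r(xs, zs)
    r_yz = pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


rng = random.Random(42)
n = 5000

# Assumed data-generating process (standardized, arbitrary units):
# sunlight influences both variables; exercise never enters cancer_risk.
sunlight = [rng.gauss(0, 1) for _ in range(n)]
exercise = [s + rng.gauss(0, 1) for s in sunlight]
cancer_risk = [s + rng.gauss(0, 1) for s in sunlight]

r_raw = pearson_r(exercise, cancer_risk)
r_partial = partial_r(exercise, cancer_risk, sunlight)

print(f"Raw correlation (exercise vs. cancer risk): {r_raw:.2f}")
print(f"Partial correlation, controlling for sunlight: {r_partial:.2f}")
```

In this simulated setting the raw correlation is clearly positive, while the partial correlation should sit close to zero, because the only link between the two variables runs through sunlight. In real data we rarely know, or measure, every variable that would need to be held fixed.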
...but with well-designed empirical research, we can establish causation!
Distinguishing between what does or does not provide causal evidence is a key piece of data literacy. Determining causality is never perfect in the real world. However, there are a variety of experimental, statistical, and research design techniques for finding evidence of causal relationships: e.g., randomization, controlled experiments, and predictive models with multiple variables. Correlation tests have intrinsic limitations (e.g., correlations measure relationships between pairs of variables, and therefore cannot account for a potential underlying relationship with a third variable). Beyond those limitations, it’s important to understand that evidence for causation typically comes not from observational data but from careful experimental design.
Example: Heart disease, diet and exercise
For example, imagine again that we are health researchers, this time looking at a large data set of disease rates, diet, and other health behaviors. Suppose that we find that increased exercise is correlated with lower rates of heart disease (a negative correlation). This correlation is large, and we find it reliably. Surely this provides a clue to causation, right?
In the case of these health data, correlation might suggest an underlying causal relationship, but without further work, it does not establish it. Imagine that after finding this correlation, as a next step, we perform a biological study that investigates how physical activity affects the heart and circulatory system. Perhaps we find a physiological mechanism through which increased physical activity lowers blood pressure: exercise increases production of nitric oxide, causing the blood vessels to widen. Lower blood pressure reduces the risk of cardiovascular disease, among other health risks. We might then design a randomized, controlled experiment to study the effects of physical activity on levels of nitric oxide, and determine that there is a causal relationship between the two.
In this example, notice that our causal evidence was not provided by the correlation test itself, which simply quantified the relationship between variables from observational data (rates of heart disease and reported exercise). Instead, we used a controlled experiment to find evidence that physical activity can cause changes in levels of nitric oxide.
So how do we explore causation? With the right kind of investigation!
Understanding causation is a difficult problem. In the real world, we never have access to all the data we might need to map every possible relationship between variables. But there are some key strategies to help us isolate and explore the mechanisms between different variables. For example, in a controlled experiment we can try to carefully match two groups and randomly apply a treatment or intervention to only one of the groups. The principle of randomization is key in experimental design, because it allows us to make inferences about the direct effect of one variable on another without worrying that there is some unmeasured causal variable co-occurring with the variables we’re studying.
It’s not always realistic or even possible to perform a controlled experiment. But let’s return to the first example above that described the apparent relationship between exercise and skin cancer rates. What kind of data would we need to infer causality? Imagine that we’re somehow able to take a large, globally distributed sample of people and randomly assign them to exercise indoors at different levels every week for decades. At the end of that time, we record skin cancer rates for each group of exercisers. We will end up with a data set that has been experimentally designed to test the relationship between exercise and skin cancer! Because exercise was directly manipulated in the experiment via random assignment, it will not be systematically related to any other variables that could differ between the groups (assuming all other aspects of the study are valid). This means that in this case, because our data were derived via sound experimental design, a correlation (positive or negative!) between exercise and skin cancer would be meaningful evidence for causality.
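The thought experiment above can also be sketched as a simulation. The following Python example uses made-up data and assumes, purely for illustration, that skin cancer risk depends on sunlight alone and that exercise has no true effect. It contrasts observational data, where exercise levels follow sunlight, with an experiment in which exercise levels are assigned at random. The variable names and the four exercise levels are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(7)
n = 5000

# Assumption: cancer risk is driven by sunlight only (no exercise effect).
sunlight = [rng.gauss(0, 1) for _ in range(n)]
cancer_risk = [s + rng.gauss(0, 1) for s in sunlight]

# Observational data: people in sunnier places tend to exercise more.
observed_exercise = [s + rng.gauss(0, 1) for s in sunlight]

# Experiment: exercise level (0 to 3) is assigned at random,
# so it cannot depend on sunlight or on anything else about the person.
assigned_exercise = [rng.choice([0, 1, 2, 3]) for _ in range(n)]

r_observational = pearson_r(observed_exercise, cancer_risk)
r_randomized = pearson_r(assigned_exercise, cancer_risk)
r_assignment_vs_sun = pearson_r(assigned_exercise, sunlight)

print(f"Observational: exercise vs. cancer risk r = {r_observational:.2f}")
print(f"Randomized:    exercise vs. cancer risk r = {r_randomized:.2f}")
print(f"Randomized:    exercise vs. sunlight    r = {r_assignment_vs_sun:.2f}")
```

In the simulated observational data, exercise and cancer risk are clearly correlated even though neither affects the other. With random assignment, exercise is no longer tied to sunlight, so the misleading correlation should largely disappear. If exercise did have a real effect on cancer risk in such a design, a correlation in the randomized data would reflect that effect rather than a lurking variable.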