Correlation vs. Causation

Correlation tests for a relationship between two variables. However, seeing two variables moving together does not necessarily mean we know whether one variable causes the other to occur. This is why we commonly say “correlation does not imply causation.”

A strong correlation might indicate causality, but there could easily be other explanations:

  • It may be the result of random chance, where the variables appear to be related, but there is no true underlying relationship.
  • There may be a third, lurking variable that makes the relationship appear stronger (or weaker) than it actually is.
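To make the random-chance explanation concrete, here is a minimal simulation sketch. It repeatedly draws two variables that are independent by construction and counts how often their sample correlation still looks "significant." The sample size (10), the number of trials, the seed and the function names are illustrative assumptions; the cutoff of 0.632 is the approximate two-tailed 5% critical value of Pearson's r for a sample of 10.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def chance_hit_rate(trials=2000, n=10, threshold=0.632, seed=0):
    """Fraction of trials in which two unrelated variables exceed the cutoff."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Both variables are pure noise: there is no true relationship.
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]
        if abs(pearson_r(xs, ys)) > threshold:
            hits += 1
    return hits / trials


print(f"Share of 'significant' correlations from pure noise: {chance_hit_rate():.3f}")
```

The share should land somewhere near 5%, which is exactly what a 5% significance threshold allows: test enough unrelated pairs and some will appear correlated.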

For observational data, correlations can’t confirm causation...

Correlations between variables show us that there is a pattern in the data: that the variables we have tend to move together. However, correlations alone don’t show us whether or not the data are moving together because one variable causes the other.

It’s possible to find a statistically significant and reliable correlation for two variables that are actually not causally linked at all. In fact, such correlations are common! Often, this is because both variables are associated with a different causal variable, which tends to co-occur with the data that we’re measuring.

Example: Exercise and skin cancer

Let’s think about this with an example. Imagine that you’re looking at health data. You observe a statistically significant positive correlation between exercise and cases of skin cancer—that is, the people who exercise more tend to be the people who get skin cancer. This correlation seems strong and reliable, and shows up across multiple populations of patients. Without exploring further, you might conclude that exercise somehow causes cancer! Based on these findings, you might even develop a plausible hypothesis: perhaps the stress from exercise causes the body to lose some ability to protect against sun damage.

But imagine that in reality, this correlation exists in your dataset because people who live in places that get a lot of sunlight year-round are significantly more active in their daily lives than people who live in places that don’t. This shows up in their data as increased exercise. At the same time, increased daily sunlight exposure means that there are more cases of skin cancer. Both of the variables (rates of exercise and skin cancer) were affected by a third, causal variable (exposure to sunlight), but they were not causally related to each other.
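The scenario above can be sketched as a small simulation. Everything here is a made-up illustration, not real health data: sunlight exposure drives both exercise and skin cancer risk, and exercise has no effect on cancer at all. The coefficients, noise levels and function names are assumptions chosen only to make the pattern visible. The partial correlation uses the standard first-order formula for the correlation between two variables after removing the linear influence of a third.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_r(r_xy, r_xz, r_yz):
    """Correlation of x and y after controlling for z (first-order partial)."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def simulate_confounding(n=5000, seed=1):
    """Return (raw r, partial r) for exercise vs. cancer risk, confounded by sunlight."""
    rng = random.Random(seed)
    sunlight = [rng.random() for _ in range(n)]
    # Hypothetical mechanisms: sunlight raises both variables independently.
    exercise = [2.0 * s + rng.gauss(0, 0.5) for s in sunlight]
    cancer_risk = [3.0 * s + rng.gauss(0, 0.5) for s in sunlight]  # no exercise term
    r_ec = pearson_r(exercise, cancer_risk)
    r_es = pearson_r(exercise, sunlight)
    r_cs = pearson_r(cancer_risk, sunlight)
    return r_ec, partial_r(r_ec, r_es, r_cs)


raw, adjusted = simulate_confounding()
print(f"exercise vs. cancer risk, raw correlation:        {raw:.3f}")
print(f"exercise vs. cancer risk, controlling for sunlight: {adjusted:.3f}")
```

In this sketch the raw correlation is clearly positive, while the correlation after accounting for sunlight sits near zero. Of course, this adjustment only works because the simulation records the lurking variable; in real observational data it may be unmeasured.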

...but with well-designed empirical research, we can establish causation!

Distinguishing between what does or does not provide causal evidence is a key piece of data literacy. Determining causality is never perfect in the real world. However, there are a variety of experimental, statistical and research design techniques for finding evidence toward causal relationships: e.g., randomization, controlled experiments and predictive models with multiple variables. Beyond the intrinsic limitations of correlation tests (e.g., correlations cannot measure trivariate, potentially causal relationships), it’s important to understand that evidence for causation typically comes not from individual statistical tests but from careful experimental design.

Example: Heart disease, diet and exercise

For example, imagine again that we are health researchers, this time looking at a large dataset of disease rates, diet and other health behaviors. Suppose that we find two correlations: increased heart disease is correlated with higher fat diets (a positive correlation), and increased exercise is correlated with less heart disease (a negative correlation). Both of these correlations are large, and we find them reliably. Surely this provides a clue to causation, right?

In the case of this health data, correlation might suggest an underlying causal relationship, but without further work it does not establish it. Imagine that after finding these correlations, as a next step, we design a biological study which examines the ways that the body absorbs fat, and how this impacts the heart. Perhaps we find a mechanism through which fat from a higher fat diet is stored in a way that leads to a specific strain on the heart. We might also take a closer look at exercise, and design a randomized, controlled experiment which finds that exercise interrupts the storage of fat, thereby leading to less strain on the heart.

All of these pieces of evidence fit together into an explanation: higher fat diets can indeed cause heart disease. And the original correlations still stood as we dove deeper into the problem: high fat diets and heart disease are linked!

But in this example, notice that our causal evidence was not provided by the correlation test itself, which simply examines the relationship between observational data (such as rates of heart disease and reported diet and exercise). Instead, we used an empirical research investigation to find evidence that this association is causal.

So how do we explore causation? With the right kind of investigation!

Understanding causation is a difficult problem. In the real world, it’s never the case that we have access to all the data we might need to map every possible relationship between variables. But there are some key strategies to help us isolate and explore the mechanisms between different variables. For example, in a controlled experiment we can try to carefully match two groups, and randomly apply a treatment or intervention to only one of the groups.

The principle of randomization is key in experimental design, and understanding this context can change what we are able to infer from statistical tests.

Let’s think again about the first example above that examined the relationship between exercise and skin cancer rates. Imagine that we’re somehow able to take a large, globally distributed sample of people and randomly assign them to exercise at different levels every week for ten years. At the end of that time, we also gather skin cancer rates for this large group. We will end up with a dataset which has been experimentally designed to test the relationship between exercise and skin cancer! Because exercise was directly manipulated in the experiment via random assignment, it will not be systematically related to any other variables that could differ between the exercise groups (assuming all other aspects of the study are valid). This means that in this case, because our data were derived via sound experimental design, a positive correlation between exercise and skin cancer would be meaningful evidence for causality.
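As a rough illustration of why random assignment matters, the sketch below simulates this imagined experiment. All numbers are invented for the example: four exercise levels, a sunlight effect on cancer risk, and a hypothetical `effect` parameter for how strongly exercise itself changes risk. Because exercise is assigned at random, it ends up unrelated to sunlight, so any correlation that remains between exercise and cancer risk reflects the effect we built in.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_randomized_trial(effect, n=5000, seed=2):
    """Return (r of exercise vs. sunlight, r of exercise vs. cancer risk).

    `effect` is the hypothetical causal effect of one exercise level on risk.
    """
    rng = random.Random(seed)
    sunlight = [rng.random() for _ in range(n)]
    # Random assignment: exercise level is chosen without regard to sunlight.
    exercise = [rng.choice([0, 1, 2, 3]) for _ in range(n)]
    cancer_risk = [
        3.0 * s + effect * e + rng.gauss(0, 0.5)
        for s, e in zip(sunlight, exercise)
    ]
    return pearson_r(exercise, sunlight), pearson_r(exercise, cancer_risk)


for effect in (0.0, 0.5):
    r_sun, r_cancer = simulate_randomized_trial(effect)
    print(f"effect={effect}: r(exercise, sunlight)={r_sun:.3f}, "
          f"r(exercise, cancer risk)={r_cancer:.3f}")
```

With no built-in effect, both correlations stay near zero even though sunlight still strongly affects cancer risk. With a positive built-in effect, the exercise and cancer risk correlation becomes clearly positive while exercise remains unrelated to sunlight, which is the situation in which a correlation can be read as causal evidence.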