Statistics Knowledge Portal

A free online introduction to statistics

Correlation

What is correlation?

Correlation is a statistical measure that expresses the extent to which two variables are linearly related (meaning they change together at a constant rate). It’s a common tool for describing simple relationships without making a statement about cause and effect.

How is correlation measured?

The sample correlation coefficient, r, quantifies the strength and direction of the linear relationship. Correlations are also tested for statistical significance.

What are some limitations of correlation analysis?

Correlation analysis cannot account for the presence or effect of variables other than the two being explored. Importantly, correlation doesn’t tell us about cause and effect. Correlation also cannot accurately describe curvilinear relationships.
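To see why curvilinear relationships are a problem, here is a minimal Python sketch. The data are invented for illustration: y is completely determined by x, but along a curve rather than a line. The helper name pearson_r is our own and uses only the standard library.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented data: y depends on x exactly, but along a curve (y = x squared).
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curve = pearson_r(xs, ys)
print(f"r = {r_curve:.3f}")
```

Because the left and right halves of the curve cancel each other out, r comes out as zero here even though the two variables are perfectly related. A correlation near zero therefore rules out a linear relationship, not every relationship.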

Correlations describe data moving together

Correlations are useful for describing simple relationships among data. For example, imagine that you are looking at a dataset of campsites in a mountain park. You want to know whether there is a relationship between the elevation of the campsite (how high up the mountain it is), and the average high temperature in the summer.

For each individual campsite, you have two measures: elevation and temperature. When you compare these two variables across your sample with a correlation, you can find a linear relationship: as elevation increases, the temperature drops. They are negatively correlated.
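As a rough illustration of that comparison, the Python sketch below computes the sample correlation coefficient for six campsites. The elevation and temperature values are invented for this example, not real measurements, and the function name pearson_r is our own choice.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented illustration data, not real measurements.
elevation_m = [500, 800, 1100, 1400, 1700, 2000]   # metres
high_temp_c = [29.0, 26.0, 27.0, 23.5, 24.0, 20.5]  # degrees Celsius

r = pearson_r(elevation_m, high_temp_c)
print(f"r = {r:.3f}")
```

With these made-up numbers the result is negative and fairly close to -1, which matches the description above: higher campsites tend to have lower summer temperatures.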

What do correlation numbers mean?

We describe correlations with a unit-free measure called the correlation coefficient, which ranges from -1 to +1 and is denoted by r. Statistical significance is indicated with a p-value. Therefore, correlations are typically reported with two key numbers: the r value and the p-value.

• The closer r is to zero, the weaker the linear relationship.
• Positive r values indicate a positive correlation, where the values of both variables tend to increase together.
• Negative r values indicate a negative correlation, where the values of one variable tend to increase when the values of the other variable decrease.
• The p-value indicates whether the sample provides enough evidence to conclude that the population correlation coefficient is different from zero. The smaller the p-value, the stronger that evidence.
• "Unit-free measure" means that correlations exist on their own scale: in our example, the number given for r is not on the same scale as either elevation or temperature. This is different from other summary statistics. For instance, the mean of the elevation measurements is on the same scale as its variable.
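The "unit-free" point can be checked with a short Python sketch. Using invented campsite values, we compute r once in metres and degrees Celsius and once in feet and degrees Fahrenheit. The data and the helper name pearson_r are assumptions made for this illustration.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented illustration data, not real measurements.
elevation_m = [500, 800, 1100, 1400, 1700, 2000]
high_temp_c = [29.0, 26.0, 27.0, 23.5, 24.0, 20.5]

# The same campsites expressed in different units.
elevation_ft = [m * 3.28084 for m in elevation_m]
high_temp_f = [c * 9 / 5 + 32 for c in high_temp_c]

r_metric = pearson_r(elevation_m, high_temp_c)
r_imperial = pearson_r(elevation_ft, high_temp_f)

# The mean, by contrast, is on the scale of its variable and does change.
mean_m = sum(elevation_m) / len(elevation_m)
mean_ft = sum(elevation_ft) / len(elevation_ft)

print(f"r (m, deg C)  = {r_metric:.6f}")
print(f"r (ft, deg F) = {r_imperial:.6f}")
print(f"mean elevation: {mean_m:.1f} m vs {mean_ft:.1f} ft")
```

The two r values agree apart from floating-point rounding, while the mean elevation changes with the unit it is measured in.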

What is a p-value?

A p-value is a measure of probability used for hypothesis testing.

It is the probability of obtaining test results equal to or more extreme than what was observed, assuming that no effect is actually present (in other words, assuming that the null hypothesis is true). For our campsite data, the null hypothesis is that there is no linear relationship between elevation and temperature. A small p-value suggests that the observed data would be unlikely if the null hypothesis were true. When a result is described as statistically significant, its p-value falls below a pre-defined cutoff (e.g., p < .05 or p < .01). At that point we reject the null hypothesis in favor of an alternative hypothesis (for our campsite data, that there is a linear relationship between elevation and temperature).
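One way to make "equal to or more extreme than what was observed" concrete is a permutation test, sketched below in Python. This is a substitute technique chosen because it needs only the standard library. It is not the t-distribution test that statistical software usually reports for r, so its p-value will generally differ somewhat. The data are invented, and the names pearson_r and exact_permutation_p are our own.

```python
import itertools
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def exact_permutation_p(xs, ys):
    """Two-sided p-value: the share of all reorderings of ys whose |r| is at
    least as large as the observed |r|. Only practical for small samples."""
    observed = abs(pearson_r(xs, ys))
    total = 0
    as_extreme = 0
    for shuffled in itertools.permutations(ys):
        total += 1
        # Small tolerance so floating-point noise does not exclude ties.
        if abs(pearson_r(xs, shuffled)) >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / total

# Invented illustration data, not real measurements.
elevation_m = [500, 800, 1100, 1400, 1700, 2000]
high_temp_c = [29.0, 26.0, 27.0, 23.5, 24.0, 20.5]

p = exact_permutation_p(elevation_m, high_temp_c)
print(f"permutation p = {p:.4f}")
```

Shuffling the temperatures breaks any link with elevation, which is what the null hypothesis describes. If only a small share of the 720 possible shufflings produce a correlation as strong as the observed one, the observed data would be unusual under the null hypothesis.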

Once we’ve obtained a significant correlation, we can also look at its strength. A perfect positive correlation has a value of 1, and a perfect negative correlation has a value of -1. But in the real world, we would never expect to see a perfect correlation unless one variable is actually a proxy measure for the other. In fact, seeing a perfect correlation number can alert you to an error in your data! For example, if you accidentally recorded distance from sea level for each campsite instead of temperature, this would correlate perfectly with elevation.

Another useful piece of information is the N, or number of observations. As with most statistical tests, knowing the size of the sample helps us judge how much confidence to place in our result and how well the sample represents the population. For example, if we only measured elevation and temperature for five campsites, but the park has two thousand campsites, we’d want to add more campsites to our sample.
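The effect of sample size can be illustrated with a small simulation. The Python sketch below builds an imaginary park of 2,000 campsites, then repeatedly draws samples of 5 and of 100 campsites and records r each time. Every number in it (the elevation range, the temperature formula, the noise level, the seed) is an assumption made up for the illustration.

```python
import math
import random
import statistics

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Simulated park of 2,000 campsites; all values are invented.
population = []
for _ in range(2000):
    elevation = rng.uniform(500, 2500)                       # metres
    temperature = 32 - 0.006 * elevation + rng.gauss(0, 2)   # deg C plus noise
    population.append((elevation, temperature))

def r_for_random_sample(size):
    """Draw `size` campsites without replacement and return their r."""
    sample = rng.sample(population, size)
    xs = [e for e, _ in sample]
    ys = [t for _, t in sample]
    return pearson_r(xs, ys)

# Repeat the sampling 200 times for each sample size.
small = [r_for_random_sample(5) for _ in range(200)]
large = [r_for_random_sample(100) for _ in range(200)]

spread_small = statistics.stdev(small)
spread_large = statistics.stdev(large)
print(f"N = 5:   r ranges from {min(small):.2f} to {max(small):.2f}")
print(f"N = 100: r ranges from {min(large):.2f} to {max(large):.2f}")
```

In this simulation the r values from five-campsite samples scatter much more widely than those from hundred-campsite samples, which is why a correlation based on only five campsites deserves less confidence.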

Visualizing correlations with scatterplots

Back to our example from above: as campsite elevation increases, temperature drops. We can look at this directly with a scatterplot. Imagine that we’ve plotted our campsite data:

• Each point in the plot represents one campsite, which we can place on an x- and y-axis by its elevation and summertime high temperature.
• The correlation coefficient (r) summarizes our scatterplot. It tells us, in numerical terms, how close the points in the scatterplot come to a straight line. Stronger relationships, meaning r values closer to -1 or +1, are those where the points lie very close to the line that we’ve fit to the data.
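The link between r and closeness to the fitted line can be checked numerically. The Python sketch below fits an ordinary least-squares line to invented campsite data and compares the squared correlation with the share of the variation in temperature that the line accounts for. The data and the function names pearson_r and fit_line are our own assumptions.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented illustration data, not real measurements.
elevation_m = [500, 800, 1100, 1400, 1700, 2000]
high_temp_c = [29.0, 26.0, 27.0, 23.5, 24.0, 20.5]

slope, intercept = fit_line(elevation_m, high_temp_c)
mean_temp = sum(high_temp_c) / len(high_temp_c)

# Squared vertical distances from the fitted line, and from the plain mean.
ss_res = sum((y - (slope * x + intercept)) ** 2
             for x, y in zip(elevation_m, high_temp_c))
ss_tot = sum((y - mean_temp) ** 2 for y in high_temp_c)
explained_share = 1 - ss_res / ss_tot

r = pearson_r(elevation_m, high_temp_c)
print(f"slope = {slope:.5f} deg C per metre")
print(f"r squared = {r ** 2:.4f}, explained share = {explained_share:.4f}")
```

For a straight-line fit with an intercept, the two printed quantities agree apart from rounding: the smaller the distances from the points to the line, the closer r is to -1 or +1.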