Correlation Coefficient

What is the correlation coefficient?

The correlation coefficient is the specific measure that quantifies the strength and direction of the linear relationship between two variables in a correlation analysis. The coefficient is what we symbolize with the r in a correlation report.

How is the correlation coefficient used?

For two variables, the formula compares the distance of each datapoint from the variable mean and uses this to tell us how closely the relationship between the variables can be fit to an imaginary line drawn through the data. This is what we mean when we say that correlations look at linear relationships.

What are some limitations to consider?

Correlation looks only at the two variables at hand and won’t give insight into relationships beyond the bivariate data. The coefficient cannot detect outliers in the data, so it can be skewed by them, and it can’t properly capture curvilinear relationships.

Correlation coefficient variants

In this section, we’re focusing on the Pearson product-moment correlation. This is one of the most common types of correlation measures used in practice, but there are others. One closely related variant is the Spearman correlation, which is similar in usage but applicable to ranked data.
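
As a rough illustration of how the two variants differ, the Python sketch below computes a Spearman coefficient by ranking each variable and then applying the Pearson formula to the ranks. The helper names `pearson_r`, `ranks`, and `spearman_r` and the sample numbers are made up for this example, and the ranking helper assumes there are no tied values.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation for two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_of_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return sum_of_products / sqrt(ss_x * ss_y)

def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman_r(xs, ys):
    """Spearman correlation: Pearson applied to the ranks (no ties assumed)."""
    return pearson_r(ranks(xs), ranks(ys))

# Illustrative data: y always rises with x, but along a curve (y = x squared).
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

print(round(pearson_r(x, y), 3))  # below 1: the points do not fall on a line
print(spearman_r(x, y))           # 1.0: the rank orders match perfectly
```

Because Spearman only looks at rank order, any relationship that consistently rises (or consistently falls) scores as perfect, even when it is not a straight line.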

What do the values of the correlation coefficient mean?

The correlation coefficient r is a unit-free value between -1 and 1. Statistical significance is indicated with a p-value. Therefore, correlations are typically reported with two key numbers: the coefficient r and its p-value.

  • The closer r is to zero, the weaker the linear relationship.
  • Positive r values indicate a positive correlation, where the values of both variables tend to increase together.
  • Negative r values indicate a negative correlation, where the values of one variable tend to increase when the values of the other variable decrease.
  • The values 1 and -1 both represent "perfect" correlations, positive and negative respectively. Two perfectly correlated variables change together at a fixed rate. We say they have a linear relationship; when plotted on a scatterplot, all data points can be connected with a straight line.
  • The p-value helps us determine whether or not we can meaningfully conclude that the population correlation coefficient is different from zero, based on what we observe from the sample.

What is a p-value?

A p-value is a measure of probability used for hypothesis testing. The goal of hypothesis testing is to determine whether there is enough evidence to support a certain hypothesis about your data. Actually, we formulate two hypotheses: the null hypothesis and the alternative hypothesis. In the case of correlation analysis, the null hypothesis is typically that the observed relationship between the variables is the result of pure chance (i.e., the population correlation coefficient is really zero, so there is no linear relationship). The alternative hypothesis is that the correlation we’ve measured reflects a real relationship in the population (i.e., the population correlation coefficient is different from zero).

The p-value is the probability of observing a sample correlation coefficient at least as far from zero as the one we actually observed, assuming the null hypothesis is true. A low p-value would lead you to reject the null hypothesis. A typical threshold for rejection of the null hypothesis is a p-value of 0.05. That is, if you have a p-value less than 0.05, you would reject the null hypothesis in favor of the alternative hypothesis: that the correlation coefficient is different from zero.
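
To make the question "what would chance alone produce?" concrete, here is a minimal Python sketch of an exact permutation test. This is a swapped-in technique chosen for illustration, not the test that statistical software typically reports for Pearson's r (which is usually based on a t distribution). The p-value here is the share of all possible re-pairings of the data whose correlation is at least as far from zero as the observed one. The function names and the three-point dataset are assumptions made for the example.

```python
from itertools import permutations
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation for two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_of_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return sum_of_products / sqrt(ss_x * ss_y)

def permutation_p_value(xs, ys):
    """Two-sided exact permutation p-value for the correlation of xs and ys."""
    observed = abs(pearson_r(xs, ys))
    # Re-pair x with every possible ordering of y, as if pairing were random.
    shuffled = [abs(pearson_r(xs, perm)) for perm in permutations(ys)]
    # Small tolerance so floating-point rounding does not drop exact ties.
    as_extreme = sum(1 for r in shuffled if r >= observed - 1e-12)
    return as_extreme / len(shuffled)

x = [1, 2, 3]
y = [10, 20, 30]
p = permutation_p_value(x, y)
print(pearson_r(x, y))  # 1.0
print(p)                # 2 of the 6 orderings are as extreme: about 0.33
```

With only three points, even a perfect correlation is not unusual under random re-pairing, so the p-value of about 0.33 is well above the 0.05 threshold. Enumerating every ordering is only practical for very small samples.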

 

How do we actually calculate the correlation coefficient?

The sample correlation coefficient can be represented with a formula:

$$ r=\frac{\sum\left[\left(x_i-\overline{x}\right)\left(y_i-\overline{y}\right)\right]}{\sqrt{\mathrm{\Sigma}\left(x_i-\overline{x}\right)^2\ \ast\ \mathrm{\Sigma}(y_i\ -\overline{y})^2}} $$
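
The formula translates almost term for term into code. The Python sketch below is one possible rendering rather than a reference implementation. The function name `pearson_r` and the sample lists are hypothetical, and the function assumes both inputs have the same length and that neither variable is constant (otherwise the denominator is zero).

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient r for paired observations xs and ys."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    x_bar = sum(xs) / n  # sample mean of x
    y_bar = sum(ys) / n  # sample mean of y

    # Numerator: sum of the products of paired deviations from the means.
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    # Denominator: square root of the product of the two sums of squares.
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    denominator = sqrt(ss_x * ss_y)
    if denominator == 0:
        raise ValueError("r is undefined when a variable does not vary")

    return numerator / denominator

print(pearson_r([1, 2, 3], [2, 4, 6]))  # 1.0  (perfect positive)
print(pearson_r([1, 2, 3], [6, 4, 2]))  # -1.0 (perfect negative)
```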


Let’s step through how to calculate the correlation coefficient using an example with a small set of simple numbers, so that it’s easy to follow the operations.

Let’s imagine that we’re interested in whether we can expect there to be more ice cream sales in our city on hotter days. Ice cream shops start to open in the spring; perhaps people buy more ice cream on days when it’s hot outside. On the other hand, perhaps people simply buy ice cream at a steady rate because they like it so much.

We start to answer this question by gathering data on average daily ice cream sales and the highest daily temperature. Ice Cream Sales and Temperature are therefore the two variables which we’ll use to calculate the correlation coefficient. Sometimes data like these are called bivariate data, because each observation (or point in time at which we’ve measured both sales and temperature) has two pieces of information that we can use to describe it. In other words, we’re asking whether Ice Cream Sales and Temperature seem to move together.

As before, a useful way to take a first look is with a scatterplot:

[Figure: scatterplot of Ice Cream Sales and Temperature]

We can also look at these data in a table, which is handy for helping us follow the coefficient calculation for each datapoint. When talking about bivariate data, it’s typical to call one variable X and the other Y (these also help us orient ourselves on a visual plane, such as the axes of a plot). Let’s call Ice Cream Sales X, and Temperature Y.

Notice that each datapoint is paired. Remember, we are really looking at individual points in time, and each time has a value for both sales and temperature.

| Ice Cream Sales (X) | Temperature °F (Y) |
|---|---|
| 3 | 70 |
| 6 | 75 |
| 9 | 80 |

 

1. Start by finding the sample means

Now that we’re oriented to our data, we can start with two important subcalculations from the formula above: the sample mean, and the difference between each datapoint and this mean (in these steps, you can also see the initial building blocks of standard deviation).

The sample means are represented with the symbols $\overline{x}$ and $\overline{y}$, sometimes called “x bar” and “y bar.” The means for Ice Cream Sales ($\overline{x}$) and Temperature ($\overline{y}$) are easily calculated as follows:

$$ \overline{x} = \frac{3 + 6 + 9}{3} = 6 $$

$$ \overline{y} = \frac{70 + 75 + 80}{3} = 75 $$

2. Calculate the distance of each datapoint from its mean

With the mean in hand for each of our two variables, the next step is to subtract the mean of Ice Cream Sales (6) from each of our Sales data points ($x_i$ in the formula), and the mean of Temperature (75) from each of our Temperature data points ($y_i$ in the formula). Note that this operation sometimes results in a negative number or zero!

| Ice Cream Sales (X) | Temperature °F (Y) | $x_i-\overline{x}$ | $y_i-\overline{y}$ |
|---|---|---|---|
| $3$ | $70$ | $3 - 6 = -3$ | $70 - 75 = -5$ |
| $6$ | $75$ | $6 - 6 = 0$ | $75 - 75 = 0$ |
| $9$ | $80$ | $9 - 6 = 3$ | $80 - 75 = 5$ |
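
Steps 1 and 2 can be reproduced in a few lines of Python. This is a minimal sketch using the three data points from the table; the variable names are our own choice.

```python
ice_cream_sales = [3, 6, 9]    # X
temperature_f = [70, 75, 80]   # Y

# Step 1: sample means ("x bar" and "y bar").
x_bar = sum(ice_cream_sales) / len(ice_cream_sales)
y_bar = sum(temperature_f) / len(temperature_f)

# Step 2: distance of each data point from its mean.
x_dev = [x - x_bar for x in ice_cream_sales]
y_dev = [y - y_bar for y in temperature_f]

print(x_bar, y_bar)  # 6.0 75.0
print(x_dev)         # [-3.0, 0.0, 3.0]
print(y_dev)         # [-5.0, 0.0, 5.0]
```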

 

3. Complete the top of the coefficient equation

This piece of the equation is called the Sum of Products. A product is a number you get after multiplying, so this formula is just what it sounds like: the sum of numbers you multiply.

$$ \sum[(x_i-\overline{x})(y_i-\overline{y})] $$

We take the paired values from each row in the last two columns in the table above, multiply them (remember that multiplying two negative numbers makes a positive!), and sum those results:

$$ [(-3)(-5)] + [(0)(0)] + [(3)(5)] = 30 $$

INSIGHT:

How does the Sum of Products relate to the scatterplot?

[Figure: scatterplot regions and the Sum of Products]

The Sum of Products calculation and the location of the data points in our scatterplot are intrinsically related.

Notice that the Sum of Products is positive for our data. When the Sum of Products (the numerator of our correlation coefficient equation) is positive, the correlation coefficient r will be positive, since the denominator—a square root—will always be positive. We know that a positive correlation means that increases in one variable are associated with increases in the other (like our Ice Cream Sales and Temperature example), and on a scatterplot, the data points angle upwards from left to right. But how does the Sum of Products capture this?

  • The only way we will get a positive value for the Sum of Products is if the products we are summing tend to be positive.
  • The only way to get a positive value for each of the products is if both values are negative or both values are positive.
  • The only way to get a pair of two negative numbers is if both values are below their means (on the bottom left side of the scatter plot), and the only way to get a pair of two positive numbers is if both values are above their means (on the top right side of the scatter plot).

So, the Sum of Products tells us whether data tend to appear in the bottom left and top right of the scatter plot (a positive correlation), or alternatively, if the data tend to appear in the top left and bottom right of the scatter plot (a negative correlation).
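
The link between scatterplot regions and the Sum of Products can be sketched in Python by labeling each point with the region it falls in, relative to the two means, and printing the product it contributes. The `region` helper and its labels are hypothetical names introduced for this illustration; the data are the three ice cream points.

```python
x = [3, 6, 9]      # Ice Cream Sales
y = [70, 75, 80]   # Temperature (°F)

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

def region(dx, dy):
    """Name the scatterplot region for a point, given its two deviations."""
    if dx == 0 or dy == 0:
        return "on a mean line"   # contributes 0 to the Sum of Products
    if dx > 0 and dy > 0:
        return "top right"        # positive product
    if dx < 0 and dy < 0:
        return "bottom left"      # positive product
    if dx < 0 and dy > 0:
        return "top left"         # negative product
    return "bottom right"         # negative product

products = []
for xi, yi in zip(x, y):
    dx, dy = xi - x_bar, yi - y_bar
    products.append(dx * dy)
    print((xi, yi), region(dx, dy), dx * dy)

sum_of_products = sum(products)
print(sum_of_products)  # 30.0
```

Here one point sits in the bottom left, one sits exactly on the means, and one sits in the top right, so every nonzero product is positive.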

 

4. Complete the bottom of the coefficient equation

The denominator of our correlation coefficient equation looks like this:

$$ \sqrt{\mathrm{\Sigma}{(x_i\ -\ \overline{x})}^2\ \ast\ \mathrm{\Sigma}(y_i\ -\overline{y})^2} $$

Let's tackle the expressions in this equation separately and drop in the numbers from our Ice Cream Sales example:

$$ \mathrm{\Sigma}{(x_i\ -\ \overline{x})}^2=(-3)^2+0^2+3^2=9+0+9=18 $$

$$ \mathrm{\Sigma}{(y_i\ -\ \overline{y})}^2=(-5)^2+0^2+5^2=25+0+25=50 $$

When we multiply the result of the two expressions together, we get:

$$ 18\times50\ =\ 900 $$

This brings the bottom of the equation to:

$$ \sqrt{900}=30 $$
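
The same denominator arithmetic, as a short Python sketch (variable names are our own). Each deviation is squared as a whole, sign included, which is why the squared terms are never negative; the last two lines show the operator-precedence pitfall to avoid when typing this by hand.

```python
from math import sqrt

x_dev = [-3, 0, 3]   # deviations of Ice Cream Sales from its mean
y_dev = [-5, 0, 5]   # deviations of Temperature from its mean

ss_x = sum(d ** 2 for d in x_dev)   # 9 + 0 + 9 = 18
ss_y = sum(d ** 2 for d in y_dev)   # 25 + 0 + 25 = 50
denominator = sqrt(ss_x * ss_y)     # square root of 900

print(ss_x, ss_y, denominator)      # 18 50 30.0

# Precedence pitfall: the exponent binds more tightly than the minus sign.
print(-3 ** 2)     # -9
print((-3) ** 2)   # 9
```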

 

5. Finish the calculation, and compare our result with the scatterplot

Here's our full correlation coefficient equation once again:

$$ r=\frac{\sum\left[\left(x_i-\overline{x}\right)\left(y_i-\overline{y}\right)\right]}{\sqrt{\mathrm{\Sigma}\left(x_i-\overline{x}\right)^2\ \ast\ \mathrm{\Sigma}(y_i\ -\overline{y})^2}} $$

Let's pull in the numbers for the numerator and denominator that we calculated above:

$$ r=\frac{30}{30}=1 $$

A perfect correlation between ice cream sales and hot summer days! Of course, finding a perfect correlation is so unlikely in the real world that had we been working with real data, we’d assume we had done something wrong to obtain such a result.

But this result from the simplified data in our example should make intuitive sense based on simply looking at the data points. Let's look again at our scatterplot:

[Figure: scatterplot of Ice Cream Sales and Temperature]

Now imagine drawing a line through that scatterplot. Would it look like a perfect linear fit?

[Figure: scatterplot of Ice Cream Sales and Temperature with a line drawn through the points]

A picture can be worth 1,000 correlation coefficients!

Scatterplots, and other data visualizations, are useful tools throughout the whole statistical process, not just before we perform our hypothesis tests.

In fact, it’s important to remember that relying exclusively on the correlation coefficient can be misleading—particularly in situations involving curvilinear relationships or extreme outliers. In the scatterplots below, we are reminded that a correlation coefficient of zero or near zero does not necessarily mean that there is no relationship between the variables; it simply means that there is no linear relationship.

[Figure: scatterplots of a curved relationship and a cyclic relationship]
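
A small numeric case shows the same thing. In the Python sketch below, y is completely determined by x (y equals x squared), yet the correlation coefficient is zero because the relationship is not linear. The data and the `pearson_r` helper are made up for this illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation for two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_of_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return sum_of_products / sqrt(ss_x * ss_y)

# A symmetric U-shaped curve: y depends entirely on x, but not linearly.
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]   # [4, 1, 0, 1, 4]

r = pearson_r(x, y)
print(r)  # 0.0
```

The negative products on the left half of the curve exactly cancel the positive products on the right half, so the Sum of Products is zero.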

Similarly, looking at a scatterplot can provide insights on how outliers—unusual observations in our data—can skew the correlation coefficient. Let’s look at an example with one extreme outlier. The correlation coefficient indicates that there is a relatively strong positive relationship between X and Y. But when the outlier is removed, the correlation coefficient is near zero.

[Figure: scatterplots of the same data with and without the outlier]
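
The effect is easy to reproduce with invented numbers. In the Python sketch below, five hypothetical points have a correlation of zero, and adding a single extreme point at (20, 20) pushes r to roughly 0.97. These values are not the data behind the figure; they are assumptions chosen only to show the mechanism.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation for two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_of_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return sum_of_products / sqrt(ss_x * ss_y)

# Hypothetical data with no linear trend.
x = [1, 2, 3, 4, 5]
y = [3, 1, 2, 1, 3]
r_without = pearson_r(x, y)

# The same data plus one extreme observation far from the rest.
r_with = pearson_r(x + [20], y + [20])

print(r_without)          # 0.0
print(round(r_with, 2))   # 0.97
```

The single outlier lies far above both means, so its product dominates the Sum of Products and makes the whole dataset look strongly correlated.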
