Chi-Square Goodness of Fit Test

What is the Chi-square goodness of fit test?

The Chi-square goodness of fit test is a statistical hypothesis test used to determine whether a variable is likely to come from a specified distribution. It is often used to evaluate whether sample data is representative of the full population.

When can I use the test?

You can use the test when you have counts of values for a categorical variable.

Is this test the same as Pearson’s Chi-square test?

Yes.

Using the Chi-square goodness of fit test

The Chi-square goodness of fit test checks whether your sample data is likely to be from a specific theoretical distribution. We have a set of data values, and an idea about how the data values are distributed. The test gives us a way to decide if the data values have a “good enough” fit to our idea, or if our idea is questionable.

What do we need?

For the goodness of fit test, we need one variable. We also need an idea, or hypothesis, about how that variable is distributed. Here are a couple of examples:

  • We have bags of candy with five flavors in each bag. The bags should contain an equal number of pieces of each flavor. The idea we'd like to test is that the proportions of the five flavors in each bag are the same.
  • For a group of children’s sports teams, we want children with a lot of experience, some experience and no experience shared evenly across the teams. Suppose we know that 20 percent of the players in the league have a lot of experience, 65 percent have some experience and 15 percent are new players with no experience. The idea we'd like to test is that each team has the same proportion of children with a lot, some or no experience as the league as a whole.

To apply the goodness of fit test to a data set we need:

  • Data values that are a simple random sample from the full population.
  • Categorical or nominal data. The Chi-square goodness of fit test is not appropriate for continuous data.
  • A data set that is large enough so that at least five values are expected in each of the observed data categories. 

Chi-square goodness of fit test example

Let’s use the bags of candy as an example. We collect a random sample of ten bags. Each bag has 100 pieces of candy and five flavors. Our hypothesis is that the proportions of the five flavors in each bag are the same.

Let’s start by answering: Is the Chi-square goodness of fit test an appropriate method to evaluate the distribution of flavors in bags of candy?

  • We have a simple random sample of 10 bags of candy. We meet this requirement.
  • Our categorical variable is the flavors of candy. We have the count of each flavor in 10 bags of candy. We meet this requirement.
  • Each bag has 100 pieces of candy. Each bag has five flavors of candy. We expect to have equal numbers for each flavor. This means we expect 100 / 5 = 20 pieces of candy in each flavor from each bag. For 10 bags in our sample, we expect 10 x 20 = 200 pieces of candy in each flavor. This is more than the requirement of five expected values in each category.

Based on the answers above, yes, the Chi-square goodness of fit test is an appropriate method to evaluate the distribution of the flavors in bags of candy. 

Figure 1 below shows the combined flavor counts from all 10 bags of candy.

Figure 1: Bar chart of counts of candy flavors from all 10 bags

Without doing any statistics, we can see that the numbers of pieces for each flavor are not the same. Some flavors have fewer than the expected 200 pieces and some have more. But how different are the proportions of flavors? Are the counts “close enough” for us to conclude that across many bags there are the same number of pieces for each flavor? Or are the counts too different for us to draw this conclusion? Another way to phrase this is, do our data values give a “good enough” fit to the idea of equal numbers of pieces of candy for each flavor or not?

To decide, we find the difference between what we have and what we expect. Then, to give flavors with fewer pieces than expected the same importance as flavors with more pieces than expected, we square the difference. Next, we divide the square by the expected count, and sum those values. This gives us our test statistic.

These steps are much easier to understand using numbers from our example.

Let’s start by listing what we expect if each bag has the same number of pieces for each flavor.  Above, we calculated this as 200 for 10 bags of candy.

Table 1: Comparison of actual vs expected number of pieces of each flavor of candy

Flavor | Number of Pieces of Candy (10 bags) | Expected Number of Pieces of Candy
Apple  | 180 | 200
Lime   | 250 | 200
Cherry | 120 | 200
Orange | 225 | 200
Grape  | 225 | 200

Now, we find the difference between what we have observed in our data and what we expect. The last column in Table 2 below shows this difference:

Table 2: Difference between observed and expected pieces of candy by flavor

Flavor | Number of Pieces of Candy (10 bags) | Expected Number of Pieces of Candy | Observed - Expected
Apple  | 180 | 200 | 180 - 200 = -20
Lime   | 250 | 200 | 250 - 200 = 50
Cherry | 120 | 200 | 120 - 200 = -80
Orange | 225 | 200 | 225 - 200 = 25
Grape  | 225 | 200 | 225 - 200 = 25

Some of the differences are positive and some are negative. If we simply added them up, we would get zero. Instead, we square the differences. This gives equal importance to the flavors of candy that have fewer pieces than expected, and the flavors that have more pieces than expected.

Table 3: Calculation of the squared difference between Observed and Expected for each flavor of candy

Flavor | Number of Pieces of Candy (10 bags) | Expected Number of Pieces of Candy | Observed - Expected | Squared Difference
Apple  | 180 | 200 | 180 - 200 = -20 | 400
Lime   | 250 | 200 | 250 - 200 = 50  | 2500
Cherry | 120 | 200 | 120 - 200 = -80 | 6400
Orange | 225 | 200 | 225 - 200 = 25  | 625
Grape  | 225 | 200 | 225 - 200 = 25  | 625

Next, we divide the squared difference by the expected number:

Table 4: Calculation of the squared difference/expected number of pieces of candy per flavor

Flavor | Number of Pieces of Candy (10 bags) | Expected Number of Pieces of Candy | Observed - Expected | Squared Difference | Squared Difference / Expected Number
Apple  | 180 | 200 | 180 - 200 = -20 | 400  | 400 / 200 = 2
Lime   | 250 | 200 | 250 - 200 = 50  | 2500 | 2500 / 200 = 12.5
Cherry | 120 | 200 | 120 - 200 = -80 | 6400 | 6400 / 200 = 32
Orange | 225 | 200 | 225 - 200 = 25  | 625  | 625 / 200 = 3.125
Grape  | 225 | 200 | 225 - 200 = 25  | 625  | 625 / 200 = 3.125

Finally, we add the numbers in the final column to calculate our test statistic:

$ 2 + 12.5 + 32 + 3.125 + 3.125 = 52.75 $
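The hand calculation above can be checked with a short Python sketch. The observed counts come from Table 1, and each expected count is 200 (10 bags × 20 pieces per flavor):

```python
# Chi-square goodness-of-fit statistic for the candy example.
# Observed counts come from Table 1; each expected count is 200.
observed = {"Apple": 180, "Lime": 250, "Cherry": 120, "Orange": 225, "Grape": 225}
expected = 200

# Sum of (observed - expected)^2 / expected over the five flavors.
statistic = sum((o - expected) ** 2 / expected for o in observed.values())
print(statistic)  # 52.75
```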

To draw a conclusion, we compare the test statistic to a critical value from the Chi-square distribution. This activity involves four steps:

  1. We first decide on the risk we are willing to take of drawing an incorrect conclusion based on our sample observations. For the candy data, we decide prior to collecting data that we are willing to take a 5% risk of concluding that the flavor counts in each bag across the full population are not equal when they really are. In statistics-speak, we set the significance level, α, to 0.05.
  2. We calculate a test statistic. Our test statistic is 52.75.
  3. We find the theoretical value from the Chi-square distribution based on our significance level. The theoretical value is a cutoff: if the bags really do contain the same number of pieces of candy for each flavor, a test statistic larger than this value would be unlikely to occur by chance.

    In addition to the significance level, we also need the degrees of freedom to find this value. For the goodness of fit test, this is one fewer than the number of categories. We have five flavors of candy, so we have 5 – 1 = 4 degrees of freedom.

    The Chi-square value with α = 0.05 and 4 degrees of freedom is 9.488.
  4. We compare the value of our test statistic (52.75) to the Chi-square value. Since 52.75 > 9.488, we reject the null hypothesis that the proportions of flavors of candy are equal.
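The four steps above can be sketched in Python without a statistics library. For exactly 4 degrees of freedom, the Chi-square upper-tail probability has the closed form P(X > x) = e^(−x/2)(1 + x/2), so we can verify that 9.488 cuts off roughly 5% of the distribution and that our statistic lies far beyond it. This shortcut works only for 4 degrees of freedom; in general you would look the value up in a table or use a statistics library.

```python
import math

def chi2_sf_4df(x):
    """Upper-tail probability P(X > x) for a Chi-square
    distribution with exactly 4 degrees of freedom."""
    return math.exp(-x / 2) * (1 + x / 2)

alpha = 0.05
critical = 9.488    # Chi-square critical value for alpha = 0.05, 4 df
statistic = 52.75   # test statistic from the candy example

# The critical value cuts off about 5% of the distribution...
print(round(chi2_sf_4df(critical), 3))  # 0.05

# ...and the test statistic exceeds it, so we reject the null hypothesis.
print(statistic > critical)  # True
```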

We make a practical conclusion that bags of candy across the full population do not have an equal number of pieces for the five flavors. This makes sense if you look at the original data. If your favorite flavor is Lime, you are likely to have more of your favorite flavor than the other flavors. If your favorite flavor is Cherry, you are likely to be unhappy because there will be fewer pieces of Cherry candy than you expect.

Understanding results

Let’s use a few graphs to understand the test and the results.

A simple bar chart of the data shows the observed counts for the flavors of candy:


Figure 2: Bar chart of observed counts for flavors of candy

Another simple bar chart shows the expected counts of 200 per flavor. This is what our chart would look like if the bags of candy had an equal number of pieces of each flavor.

Figure 3: Bar chart of expected counts of each flavor

The side-by-side chart below shows the actual observed number of pieces of candy in blue. The orange bars show the expected number of pieces. You can see that some flavors have more pieces than we expect, and other flavors have fewer pieces. 

Figure 4: Bar chart comparing actual vs. expected counts of candy

The statistical test is a way to quantify the difference. Is the actual data from our sample “close enough” to what is expected to conclude that the flavor proportions in the full population of bags are equal? Or not? From the candy data above, most people would say the data is not “close enough” even without a statistical test.

What if your data looked like the example in Figure 5 below instead? The purple bars show the observed counts and the orange bars show the expected counts. Some people would say the data is “close enough” but others would say it is not. The statistical test gives a common way to make the decision, so that everyone makes the same decision on a set of data values. 

Figure 5: Bar chart comparing expected and actual values using another example data set

Statistical details

Let’s look at the candy data and the Chi-square test for goodness of fit using statistical terms. This test is also known as Pearson’s Chi-square test.

Our null hypothesis is that the proportion of flavors in each bag is the same. We have five flavors. The null hypothesis is written as:

$ H_0: p_1 = p_2 = p_3 = p_4 = p_5 $

The formula above uses p for the proportion of each flavor. If each 100-piece bag contains equal numbers of pieces of candy for each of the five flavors, then the bag contains 20 pieces of each flavor. The proportion of each flavor is 20 / 100 = 0.2.

The alternative hypothesis is that at least one of the proportions is different from the others. This is written as:

$ H_a: \text{at least one } p_i \text{ is not equal to the others} $

In some cases, we are not testing for equal proportions. Look again at the example of children's sports teams near the top of this page.  Using that as an example, our null and alternative hypotheses are:

$ H_0: p_1 = 0.2, p_2 = 0.65, p_3 = 0.15 $

$ H_a: \text{at least one } p_i \text{ is not equal to its expected value} $

Unlike hypotheses that involve a single population parameter, these cannot be written with a formula alone. We need to use words as well as symbols to describe our hypotheses.
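The unequal-proportions case can be made concrete with a sketch for one hypothetical team of 40 players. The observed counts 10, 24 and 6 are invented for illustration; the expected counts come from the league proportions 0.20, 0.65 and 0.15 stated earlier:

```python
# Hypothetical goodness-of-fit check for one team of 40 players.
# Observed counts (invented for illustration): a lot / some / no experience.
observed = [10, 24, 6]
proportions = [0.20, 0.65, 0.15]   # league-wide proportions from the text
n = sum(observed)                  # 40 players

expected = [p * n for p in proportions]   # [8.0, 26.0, 6.0]
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(statistic, 3))  # 0.654
```

With 3 − 1 = 2 degrees of freedom, the critical value at α = 0.05 is 5.991, so this hypothetical team's statistic of about 0.65 would not lead us to reject the null hypothesis.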

We calculate the test statistic using the formula below:

$ \chi^2 = \sum^n_{i=1} \frac{(O_i-E_i)^2}{E_i} $

In the formula above, we have n groups. The $ \sum $ symbol means to add up the calculations for each group. For each group, we do the same steps as in the candy example. The formula shows $ O_i $ as the Observed value and $ E_i $ as the Expected value for a group.

We then compare the test statistic to a Chi-square value with our chosen significance level (also called the alpha level) and the degrees of freedom for our data. Using the candy data as an example, we set α = 0.05 and have four degrees of freedom. For the candy data, the Chi-square value is written as:

$ \chi^2_{0.05,4} $

There are two possible results from our comparison:

  • The test statistic is lower than the Chi-square value. You fail to reject the hypothesis of equal proportions. You conclude that the bags of candy across the entire population have the same number of pieces of each flavor in them. The fit of equal proportions is “good enough.”
  • The test statistic is higher than the Chi-square value. You reject the hypothesis of equal proportions. You cannot conclude that the bags of candy have the same number of pieces of each flavor. The fit of equal proportions is “not good enough.”

Let’s use a graph of the Chi-square distribution to better understand the test results. You are checking to see if your test statistic is a more extreme value in the distribution than the critical value. The distribution below shows a Chi-square distribution with four degrees of freedom. It shows how the critical value of 9.488 “cuts off” 95% of the area under the curve. Only 5% of the area lies beyond 9.488.

Figure 6: Chi-square distribution for four degrees of freedom

The next distribution plot includes our results. You can see how far out “in the tail” our test statistic is, represented by the dotted line at 52.75. In fact, with this scale, it looks like the curve is at zero where it intersects with the dotted line. It isn’t, but it is very, very close to zero. We conclude that it is very unlikely for this situation to happen by chance. If the true population of bags of candy had equal flavor counts, we would be extremely unlikely to see the results that we collected from our random sample of 10 bags.

Figure 7: Chi-square distribution for four degrees of freedom with test statistic plotted

Most statistical software shows the p-value for a test. This is the likelihood of finding a more extreme value for the test statistic in a similar sample, assuming that the null hypothesis is correct. It’s difficult to calculate the p-value by hand. For the figure above, if the test statistic is exactly 9.488, then the p-value will be p=0.05. With the test statistic of 52.75, the p-value is very, very small. In this example, most statistical software will report the p-value as “p < 0.0001.” This means that the likelihood of another sample of 10 bags of candy resulting in a more extreme value for the test statistic is less than one chance in 10,000, assuming our null hypothesis of equal counts of flavors is true.
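Using the closed form for the upper-tail probability with 4 degrees of freedom, P(X > x) = e^(−x/2)(1 + x/2), we can confirm the reported p-value for the candy example. This shortcut is specific to 4 degrees of freedom; in practice, statistical software computes the p-value for any degrees of freedom.

```python
import math

# p-value for the candy test statistic: upper-tail probability
# of a Chi-square distribution with 4 degrees of freedom.
statistic = 52.75
p_value = math.exp(-statistic / 2) * (1 + statistic / 2)

# Most software would report this as "p < 0.0001".
print(p_value < 0.0001)  # True
```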