Histogram

What is a histogram?

A histogram shows the shape, or distribution, of the values of a continuous variable.

How are histograms used?

Histograms help you see the center, spread, and shape of a set of data. You can also use them as a visual tool to check for normality. Histograms are one of the seven basic tools in statistical quality control.

What are some issues to think about?

Histograms provide a great way to evaluate data. They can be used to check data for extreme values, or outliers, and to help understand the distribution of your data. The distribution of a variable is important to understand when selecting appropriate statistical analysis tools.

Histograms show the shape of data

Histograms show the shape of your data. The horizontal axis shows your data values, where each bar includes a range of values. The vertical axis shows how many points in your data have values in the specified range for the bar.

In the histogram in Figure 1, the bars show the count of values in each range. For example, the first bar shows the count of values that fall between 30 and 35.

The histogram shows that the center of the data is somewhere around 45 and the spread of the data is from about 30 to 65. It also shows the shape of the data as roughly mound-shaped. This shape is a visual clue that the data may come from a normal distribution.

Figure 1: Histogram

What is the difference between histograms and bar charts?

The key difference between histograms and bar charts is the type of data that is being plotted. Histograms are used with continuous data, while bar charts are used with categorical or nominal data.  

Histograms do not have gaps between bars. The bars represent the number of values occurring within a range specified on the horizontal axis. Bar charts can have gaps between bars. The bars represent the measured values for each category.

How do I create a histogram?

To generate a histogram, the range of data values for each bar must be determined. The ranges for the bars are called bins. Most of the time, the bins are of equal width. With equal-width bins, the height of the bars shows the frequency of data values in each bin. For example, to create a histogram for age in years, you might decide on bins by decade (0-9, 10-19, and so on), so that every age falls in exactly one bin. The bar height then shows the number of people in each decade.
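The binning step described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular statistics package does it: the function name bin_counts, its parameters, and the list of ages are all made up for the example. It assumes half-open bins (each bin includes its left edge but not its right edge), with the last bin also including its right edge.

```python
def bin_counts(values, start, width, n_bins):
    """Count how many values fall in each of n_bins equal-width bins.

    Bin i covers [start + i*width, start + (i+1)*width). The last bin
    also includes its right edge. Values outside the range are ignored.
    """
    counts = [0] * n_bins
    end = start + width * n_bins
    for v in values:
        if v < start or v > end:
            continue  # outside the plotted range
        i = int((v - start) // width)
        if i == n_bins:  # value equal to the right edge of the last bin
            i = n_bins - 1
        counts[i] += 1
    return counts


# Made-up ages in years, binned by decade from 0 to 60.
ages = [3, 9, 10, 19, 20, 25, 34, 41, 47, 58]
counts = bin_counts(ages, start=0, width=10, n_bins=6)

# Print a simple text histogram: one row per bin, one '#' per value.
for i, c in enumerate(counts):
    left = i * 10
    print(f"{left:>2}-{left + 9:<2} | {'#' * c} ({c})")
```

Because the bins are half-open, an age of 10 is counted in the second bin rather than the first, so no value is counted twice.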

With software, the bins are defined by the program. However, some software tools allow you to change the number of bins and bin starting points, which allows you to explore and better understand your data. 

Figure 2 shows the same data as in Figure 1 but with many more bars. You can still see the center, spread, and shape of the data. However, it is harder to see the overall shape than in the first figure.

Figure 2: Histogram from Figure 1 with more bars

Figure 3 shows the same data as Figure 1 but with only three bars, or bins. With so few bars, it is much harder to see the center, shape, and spread of the data.

Figure 3: Histogram from Figure 1 with fewer bars
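To experiment with the effect of the number of bins without interactive software, you can bin the same values several ways and compare the counts. The sketch below is only an illustration: the function name histogram is hypothetical, and the data is simulated with a fixed random seed rather than taken from the figures.

```python
import random


def histogram(values, n_bins):
    """Return (edges, counts) for n_bins equal-width bins spanning the data."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; cannot form bins")
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land one past the end, so clamp it
        # into the last bin.
        i = min(int((v - lo) / width), n_bins - 1)
        counts[i] += 1
    return edges, counts


random.seed(1)
# Simulated mound-shaped data (an assumption, not the article's data set).
data = [random.gauss(45, 6) for _ in range(200)]

# Too few bins hide the shape; too many make the counts noisy.
for n_bins in (3, 8, 30):
    edges, counts = histogram(data, n_bins)
    print(f"{n_bins:>2} bins: {counts}")
```

Every binning contains the same 200 values. Only the grouping changes, which is why the overall impression of shape can differ so much between the three versions.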

The animation below shows how to use JMP and its grabber tool to explore changing bin boundaries for the data shown in Figures 1-3.

Figure 4: Animation showing interactive bin adjustment tool available in JMP.

You might want to change axis values and axis increments to explore your data, even if your software does not let you explore interactively.

How extreme data values are observed in histograms

Histograms are affected by extreme values, or outliers. Figures 5 and 6 show a data set with an outlier excluded and included. 

Figure 5: Histogram displaying data with no outliers
Figure 6: Histogram displaying data with an outlier

In the figures above, both histograms have a horizontal axis scale of 20 to 90. Most software would show the histogram without the outlier on a smaller scale. Figure 5 uses the same scale as Figure 6 so that you can see how the outlier appears in a histogram: in this case, its value is higher than the rest of the data values. You may also have outliers lower than the rest of the data values, or outliers at both ends of your data.
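The effect of keeping the same axis scale can be reproduced with a small text histogram. In this sketch the data values, the function name counts_on_fixed_axis, and the bin width of 10 are assumptions chosen for illustration. Only the axis range of 20 to 90 comes from the figures.

```python
def counts_on_fixed_axis(values, lo, hi, width):
    """Count values in equal-width bins over a fixed axis range [lo, hi]."""
    n_bins = round((hi - lo) / width)
    counts = [0] * n_bins
    for v in values:
        if lo <= v <= hi:
            # A value equal to hi is placed in the last bin.
            i = min(int((v - lo) // width), n_bins - 1)
            counts[i] += 1
    return counts


# Made-up measurements, plus one made-up extreme value.
data = [31, 34, 36, 38, 39, 41, 42, 42, 44, 45, 47, 48, 52, 55]
with_outlier = data + [87]

for label, values in (("without outlier", data), ("with outlier", with_outlier)):
    print(label)
    counts = counts_on_fixed_axis(values, lo=20, hi=90, width=10)
    for i, c in enumerate(counts):
        left = 20 + i * 10
        print(f"  {left}-{left + 10}: {'#' * c}")
```

With the outlier included, the last bin holds a single value and is separated from the main body of the data by empty bins, which is the visual pattern to look for.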

How skewness is observed in histograms

Not all histograms are symmetrical. Histograms display the distribution of your data, and there are many common types of distributions. Data is often nonsymmetrical; in statistics, this is called skewed data. For example, the battery life for a phone is often skewed, with some phones having a much longer battery life than most.

Figure 7: Histogram displaying nearly symmetrical data
Figure 8: Histogram displaying data that is left (negative) skewed
Figure 9: Histogram displaying data that is right (positive) skewed

Figure 7 shows nearly symmetric data. If you think about folding the plot in half in the middle, the two sides will be about the same.

The histogram in Figure 8 shows data that is not symmetric. It is skewed to the left, with a longer left tail of values trailing off to the left. The skewness statistic is negative. 

The histogram in Figure 9 also shows data that is not symmetric. It is skewed to the right, with a longer right tail of values trailing off to the right. The skewness statistic is positive. 
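The sign of the skewness statistic can be checked with a short calculation. The sketch below uses the simple moment-based formula (the third central moment divided by the second central moment raised to the power 1.5). This is an assumption about which formula to show: some statistical packages apply a small-sample adjustment, so their reported values may differ slightly, although the sign is the same. The data lists are made up.

```python
def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]      # long tail to the right
left_skewed = [-v for v in right_skewed]      # mirror image: long tail to the left

for name, values in (
    ("symmetric", symmetric),
    ("right skewed", right_skewed),
    ("left skewed", left_skewed),
):
    print(f"{name}: skewness = {skewness(values):.3f}")
```

The symmetric list gives a skewness of zero, the list with a long right tail gives a positive value, and its mirror image gives a negative value of the same size.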

How are groups in data observed in histograms?

If you know that there are groups in your data, then building histograms for each group may be more meaningful than building a single histogram. However, if you are unsure whether there are groups, the histogram may reveal a pattern that leads you to discover groups in your data.

For example, the graph in Figure 10 contains data for men and women. We think there may be a difference in the data for men and women.

Figure 10: Histogram displaying data about various groups

Roughly mound-shaped, this graph shows data with the center near 22 and a spread from about 7 to about 32.

Figure 11 shows the data for men highlighted with the striped portion of each bar. The data for men looks roughly mound-shaped.

Figure 11: Histogram from Figure 10 highlighting the data for men

The graph in Figure 12 shows the data for women highlighted with striped bars. This data also looks roughly mound-shaped.

Figure 12: Histogram from Figure 10 highlighting the data for women

The graphs above show examples where the difference between groups has an impact, but the overall spread of values is the same for the two groups. When you compare the highlighted histograms for men and women, you see that the men are more likely to have lower values than the women. There is a lot of overlap, but the histograms support the idea that there is a difference between men and women. 

Figure 13 shows data where the two groups are very different. If you look at the overall histogram, the data is not mound-shaped. The graph shows the data for one group highlighted with striped bars. This group is roughly mound-shaped, with a spread from about 5 to 15 and a center of about 9. The graph shows the data for the second group with solid bars. It is not roughly mound-shaped, and has a spread from 20 to about 32 and a center of about 23.

Figure 13: Histogram displaying data where values for each group are noticeably different

These graphs help identify an important consideration: whenever you create a histogram, think about whether or not there are groups in your data. If there is a possibility of groups, you are likely to learn more about the data by creating separate histograms for each group. With some software, you can explore group differences in a single histogram, as is shown in the figures above. 
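If your software does not highlight groups within a histogram, you can count each group separately using the same bins, so that the per-group histograms are directly comparable. The following is a minimal sketch: the function name grouped_counts, the group labels "A" and "B", and the values are hypothetical and only loosely imitate the pattern of two well-separated groups.

```python
def grouped_counts(records, lo, width, n_bins):
    """Count values per group using one shared set of equal-width bins.

    records is an iterable of (group, value) pairs. Bin i covers
    [lo + i*width, lo + (i+1)*width). Values outside the bins are ignored.
    """
    result = {}
    for group, value in records:
        counts = result.setdefault(group, [0] * n_bins)
        i = int((value - lo) // width)
        if 0 <= i < n_bins:
            counts[i] += 1
    return result


# Made-up (group, value) records.
records = [
    ("A", 6), ("A", 8), ("A", 9), ("A", 9), ("A", 11), ("A", 14),
    ("B", 21), ("B", 22), ("B", 23), ("B", 24), ("B", 27), ("B", 31),
]

by_group = grouped_counts(records, lo=0, width=5, n_bins=7)
overall = [a + b for a, b in zip(by_group["A"], by_group["B"])]

print("overall:", overall)
for group, counts in by_group.items():
    print(f"group {group}:", counts)
```

The overall counts show two separate clusters with an empty bin between them. The per-group counts show that each cluster belongs to a single group.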

Histograms and types of data

Continuous data: appropriate for histograms

Histograms make sense for continuous data since it is measured on a scale with many possible values. Some examples of continuous data are:

  • Age
  • Blood pressure
  • Weight
  • Temperature
  • Speed

For all of these examples, a histogram is an appropriate graphical tool to explore the distribution of the data.

Categorical or nominal data: use bar charts

Histograms do not make sense for categorical or nominal data since it is measured on a scale with only a few possible values. Use bar charts instead of histograms.

With categorical data, the sample is often divided into groups and the responses have a specific ordering (ordered categories of this kind are also called ordinal data). For example, in a survey where you are asked to give your opinion on a scale from “Strongly Disagree” to “Strongly Agree,” your responses are categorical.

With nominal data, the sample is also divided into groups but without any particular ordering. Country of residence is an example of a nominal variable. You can use the country abbreviation, or you can use numbers to code the country name. Either way, you are simply naming the different groups for the data.
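For categorical or nominal data, the values behind a bar chart are simply counts per category, with no binning involved. The sketch below tallies survey responses with Python's collections.Counter. The response list, the intermediate labels on the scale, and the function name category_counts are made up for illustration.

```python
from collections import Counter


def category_counts(responses, order):
    """Return (label, count) pairs in the given category order."""
    tally = Counter(responses)
    # Counter returns 0 for labels that never occur, so every category
    # in the scale gets a bar, even an empty one.
    return [(label, tally[label]) for label in order]


# Hypothetical five-point scale and made-up responses.
scale = ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"]
responses = [
    "Agree", "Strongly Agree", "Disagree", "Agree",
    "Neutral", "Agree", "Strongly Disagree", "Neutral",
]

# Print a simple text bar chart: one row per category.
for label, count in category_counts(responses, scale):
    print(f"{label:<18} {'#' * count} ({count})")
```

For ordered categories, listing the bars in scale order keeps the chart readable. For nominal data such as country of residence, any order works, and sorting by count is a common choice.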