What is a box plot?
A box plot shows the distribution of data for a continuous variable.
How are box plots used?
Box plots help you see the center and spread of data. You can also use them as a visual tool to check for normality or to identify points that may be outliers.
Is a box plot the same as a box-and-whisker plot?
Yes. Box plots may also be called outlier box plots or quantile box plots. Each is a variation on how the box plot is drawn.
What are some issues to think about?
When using a box plot, check your data for extreme values. Be careful if you have a very small data set. If you have categorical or nominal variables, use a bar chart instead.
Box plots show the distribution of data
The term “box plot” refers to an outlier box plot; this plot is also called a box-and-whisker plot or a Tukey box plot. See the "Comparing outlier and quantile box plots" section below for another type of box plot.
Here are the basic parts of a box plot:
- The center line in the box shows the median for the data. Half of the data is above this value, and half is below. If the data are symmetrical, the median will be in the center of the box. If the data are skewed, the median will be closer to the top or to the bottom of the box.
- The bottom and top of the box show the 25th and 75th quantiles, or percentiles. These two quantiles are also called quartiles because each cuts off a quarter (25%) of the data. The length of the box is the difference between these two percentiles and is called the interquartile range (IQR).
- The lines that extend from the box are called whiskers. The whiskers represent the expected variation of the data. Each whisker can extend at most 1.5 times the IQR from the top or bottom of the box. If the data do not reach that far, the whiskers stop at the minimum and maximum data values. Values that fall above or below the end of the whiskers are plotted as dots. These points are often called outliers because they are more extreme than the expected variation. They are worth reviewing to determine whether they are genuine extreme values or errors; the whiskers do not include them.
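The box and whisker rules in the list above can be written compactly. This is a sketch in which the symbols are assumptions, not notation from the text: $Q_1$ and $Q_3$ stand for the 25th and 75th percentiles, and "lower limit" and "upper limit" are the farthest points the whiskers may reach.

```latex
\begin{aligned}
\mathrm{IQR} &= Q_3 - Q_1 \\
\text{lower limit} &= Q_1 - 1.5 \times \mathrm{IQR} \\
\text{upper limit} &= Q_3 + 1.5 \times \mathrm{IQR}
\end{aligned}
```

Values below the lower limit or above the upper limit are the ones plotted as individual dots.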
Figure 1 shows a box plot:
The median is near the middle of the box in the graph in Figure 1, which tells us that the data values are roughly symmetrical. See Figure 4 below for data where that is not the case.
Comparing outlier and quantile box plots
Both outlier and quantile box plots show the median, 25th and 75th percentiles. The 25th percentile is also the 25th quantile, which means that 25% of the data is lower than the 25th quantile. A quantile box plot adds the 2.5th, 10th, 90th and 97.5th quantiles to the outlier box plot. Figure 2 shows quantile and outlier box plots for the same data.
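The quantiles named above can be computed directly. The sketch below assumes a small helper, quantile, that uses linear interpolation between sorted values. This is only one of several common conventions, so statistical software may report slightly different numbers. The data are hypothetical and chosen so the results are easy to check by hand.

```python
def quantile(values, p):
    """Return the p-th quantile (0 <= p <= 1) using linear interpolation.

    This is one common convention; other software may use a different rule.
    """
    ordered = sorted(values)
    position = p * (len(ordered) - 1)      # fractional index into the sorted data
    lower = int(position)                  # index at or just below the position
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Hypothetical data: the integers 1 through 101.
data = list(range(1, 102))

# Quantiles shown by both kinds of box plot.
outlier_box = {p: quantile(data, p) for p in (0.25, 0.50, 0.75)}

# Additional quantiles that the quantile box plot adds.
extra = {p: quantile(data, p) for p in (0.025, 0.10, 0.90, 0.975)}

quantile_box = {**outlier_box, **extra}
for p in sorted(quantile_box):
    print(f"{p * 100:5.1f}th percentile: {quantile_box[p]}")
```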
Comparing box plots and histograms
Both box plots and histograms show the shape of your data. Both can be used to identify unusual points or outliers. Figure 3 shows an outlier box plot and a histogram for the same set of data. In this example, the histogram is vertical instead of horizontal.
You might find it helpful to use both types of graphs with your data. The box plot helps you see skewness, because the line for the median will not be near the center of the box if the data is skewed. The box plot helps identify the 25th and 75th percentiles better than the histogram, while the histogram helps you see the overall shape of your data better than the box plot.
How do I create box plots?
In the past, box plots were created manually. Today, most people use software, which avoids manual arithmetic and reduces errors. A box plot is based on the five-number summary: the minimum, 25th percentile, median, 75th percentile, and maximum values of a data set. An outlier box plot also needs the whisker limits so that outliers can be marked. For any data set, you can generate a box plot in five steps:
- Calculate the median, 25th, and 75th percentiles.
- Calculate the interquartile range (IQR) as the difference between the 75th and 25th percentiles.
- Calculate the maximum length of the whiskers by multiplying the IQR by 1.5.
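- Identify outliers: values that fall more than 1.5 times the IQR below the 25th percentile or above the 75th percentile.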
- Use the calculated statistics to plot the results and draw a box plot.
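The five steps above can be sketched in Python using only the standard library. The function name box_plot_stats and the sample values are hypothetical. The choice of the "inclusive" quantile method, and drawing each whisker to the most extreme value inside the limits, are assumptions; software packages differ in both conventions.

```python
import statistics


def box_plot_stats(values):
    """Compute the numbers needed to draw an outlier box plot."""
    ordered = sorted(values)

    # Step 1: median and the 25th and 75th percentiles.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")

    # Step 2: interquartile range.
    iqr = q3 - q1

    # Step 3: maximum whisker length and the resulting limits.
    max_whisker = 1.5 * iqr
    lower_limit = q1 - max_whisker
    upper_limit = q3 + max_whisker

    # Step 4: outliers are the values beyond the limits.
    outliers = [v for v in ordered if v < lower_limit or v > upper_limit]
    inside = [v for v in ordered if lower_limit <= v <= upper_limit]

    # Step 5: the statistics a plotting tool would use to draw the box plot.
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),   # most extreme values within the limits
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical data with one unusually large value.
stats = box_plot_stats([2, 4, 4, 5, 6, 7, 8, 9, 10, 30])
for name, value in stats.items():
    print(f"{name}: {value}")
```

In this made-up sample, the value 30 lies beyond the upper limit and would be plotted as a separate point, while the upper whisker stops at 10.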
Box plot example
The box plot in Figure 4 shows calories per serving for 76 types of cereal. The variable Calories is continuous, so a box plot makes sense.
This data is skewed, as the median of 102 is much closer to the 25th percentile of 101 than to the 75th percentile of 200.
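The reasoning above compares the two halves of the box. A minimal sketch of that comparison, using the three values quoted in the text, follows. The variable names are hypothetical, and this is a rough visual check rather than a formal test of skewness.

```python
# Values quoted in the text above.
q1, median, q3 = 101, 102, 200

lower_half = median - q1   # distance from the 25th percentile to the median
upper_half = q3 - median   # distance from the median to the 75th percentile

if upper_half > lower_half:
    direction = "skewed toward higher values"
elif upper_half < lower_half:
    direction = "skewed toward lower values"
else:
    direction = "roughly symmetrical within the box"

print(f"lower half: {lower_half}, upper half: {upper_half} -> {direction}")
```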
Adding the mean to a box plot
You can enhance the box plot depending on the software you use. JMP can add a means diamond, as shown in Figure 5. The top and bottom of the diamond are a 95% confidence interval for the mean. The middle of the diamond is the sample average, which is an estimate of the population mean.
For the cereal data, the mean is higher than the median. The difference between the mean and median tells you that these data are skewed and not likely to be from a normal distribution.
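The numbers behind a means diamond can be approximated with the Python standard library. The sketch below uses a hypothetical right-skewed sample, not the cereal data, and a normal approximation for the 95% confidence interval. Statistical software typically uses a t-based interval, which is somewhat wider for small samples, so treat the interval here as an approximation.

```python
import math
import statistics

# Hypothetical right-skewed sample (not the cereal data).
sample = [90, 100, 100, 110, 110, 110, 120, 130, 150, 200, 250]

mean = statistics.mean(sample)
median = statistics.median(sample)

# Standard error of the mean.
std_err = statistics.stdev(sample) / math.sqrt(len(sample))

# Normal approximation to the 95% interval (an assumption; a t-based
# interval would be wider for a sample this small).
z = statistics.NormalDist().inv_cdf(0.975)
ci_low = mean - z * std_err
ci_high = mean + z * std_err

print(f"median: {median}")
print(f"mean:   {mean:.1f}  (approx. 95% CI {ci_low:.1f} to {ci_high:.1f})")
print("mean above median:", mean > median)
```

A mean that sits well above the median, as in this made-up sample, is the pattern the text describes for data skewed toward higher values.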
With JMP, you can also add features to graphs. The box plot in Figure 6 shows a thick green line added to the middle of the means diamond, which helps show the difference between the mean and median.
JMP also provides annotation tools, as shown in Figure 7:
This graph summarizes basic statistics for calories and displays the distribution of the data, highlighting that the data are skewed and that the data are not from a normal distribution.
Box plots highlight outliers
Box plots help you identify interesting data points, or outliers. These values are plotted as data points and fall beyond the whiskers. Figure 8 shows a box plot that has three outliers, shown as red dots above the upper whisker. These three points are more than 1.5 times the IQR above the top of the box (the 75th percentile). Points that are more than 1.5 times the IQR beyond either end of the box are beyond the expected range of variation of the data.
The outliers affect the mean, median, and other percentiles. Because extreme points are highlighted in a box plot, you can easily identify the data points for investigation. You may find that the outliers are errors in your data or you may find that they are unusual for some other reason. For example, if the three outliers in Figure 8 are outside the expected range of values, you would need to determine if they are valid data points or not.
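To see how outliers affect summary statistics, compare the mean and median of a sample with and without its extreme values. The values below are hypothetical and are not the data shown in Figure 8.

```python
import statistics

# Hypothetical sample, then the same sample with three large values added.
base = [10, 12, 13, 14, 15, 16, 18]
with_outliers = base + [60, 65, 70]

mean_base = statistics.mean(base)
median_base = statistics.median(base)
mean_out = statistics.mean(with_outliers)
median_out = statistics.median(with_outliers)

print(f"without outliers: mean={mean_base}, median={median_base}")
print(f"with outliers:    mean={mean_out}, median={median_out}")
print(f"shift in mean:    {mean_out - mean_base:.1f}")
print(f"shift in median:  {median_out - median_base:.1f}")
```

In this made-up example both statistics move, but the mean shifts far more than the median, which is one reason extreme points deserve investigation before you summarize the data.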
Box plot example for groups
If your data have groups, you might learn more about the data by creating side-by-side box plots, which provide a simple yet powerful tool for comparing groups.
One way to measure a person’s fitness is to measure their body fat percentage. Most guidelines expect a difference between body fat for men and for women. (For more on this data, see the two-sample t-test page.) The variable Body Fat is continuous, so a box plot is an appropriate method for displaying the distribution of the data. Figure 9 shows separate side-by-side box plots for men and women.
From this graph, you can see that men have a lower median body fat than women. You can also see that the ranges for men and women overlap. The data for men has more skewness than the data for women. Neither group has outliers. With JMP, you could add means diamonds, a line for each mean, and annotations to these box plots.
Using separate side-by-side box plots for groups can help show group differences and identify outliers.
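Side-by-side box plots are built from one set of summary statistics per group. The sketch below groups records by a label and computes a five-number summary for each group. The values are made up for illustration and are not the body fat data in Figure 9; the helper name five_number_summary and the "inclusive" quantile method are assumptions.

```python
import statistics

# Made-up (group, value) records for illustration only.
records = [
    ("Men", 10), ("Men", 14), ("Men", 16), ("Men", 18), ("Men", 25),
    ("Women", 18), ("Women", 22), ("Women", 24), ("Women", 27), ("Women", 33),
]

# Collect the values for each group.
groups = {}
for group, value in records:
    groups.setdefault(group, []).append(value)


def five_number_summary(values):
    """Return (minimum, 25th percentile, median, 75th percentile, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


summaries = {group: five_number_summary(vals) for group, vals in groups.items()}

for group, (low, q1, median, q3, high) in summaries.items():
    print(f"{group:<6} min={low} Q1={q1} median={median} Q3={q3} max={high}")
```

Comparing the medians and the overlap of the ranges across rows gives the same kind of reading you would take from the side-by-side plots.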
Box plots and types of data
Continuous data: appropriate for box plots
Box plots make sense for continuous data, since they are measured on a scale with many possible values. Some examples of continuous data are:
- Blood pressure
For all of these examples, a box plot is an appropriate graphical tool to explore the distribution of the data.
Categorical or nominal data: use bar charts
Box plots do not make sense for categorical or nominal data, since they are measured on a scale with specific values. Use bar charts instead.
With categorical data, the sample is often divided into groups, and the responses might have a defined order. For example, in a survey where you are asked to give your opinion on a scale from “Strongly Disagree” to “Strongly Agree,” your responses are categorical.
With nominal data, the sample is also divided into groups but without any particular order. Country of residence is an example of a nominal variable. You can use the country abbreviation, or you can use numbers to code the country name. Either way, you are simply naming the different groups for the data.
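Because the codes for a nominal variable are only labels, the useful summary is a count per group, which is what a bar chart displays. The sketch below uses hypothetical country codes and responses, and prints a rough text version of a bar chart.

```python
from collections import Counter

# Hypothetical coding: the numbers are arbitrary labels, not measurements.
country_codes = {1: "US", 2: "CA", 3: "MX"}
responses = [1, 3, 1, 2, 1, 3, 2, 1]

# Count how many responses fall in each group.
counts = Counter(country_codes[code] for code in responses)

# Print one bar per group, longest first.
for country, count in counts.most_common():
    print(f"{country:<3} {'#' * count} ({count})")
```

Taking a median or quartiles of the numeric codes would give numbers with no meaning, which is why a box plot does not suit this kind of variable.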