Figure 1: The False Discovery Rate (FDR) p-value plot: The unadjusted p-values in red need to fall under the red ramp to be judged significant at an FDR of .05. This is equivalent to the FDR-adjusted p-values in blue falling under the blue .05 line. In this case, 60 percent of the values were significant, so the p-value adjustment does not cost much sensitivity.
Big statistics will force us to change our ways
by John Sall
If you only had one measurement to scrutinize, then it would be easy. You could fit a simple model for that response, perhaps a one-way analysis of variance across the engineering changes, testing to see whether the mean was different across the groups. But what if you have hundreds, thousands, even millions to examine? Process engineers have hundreds of sensors on their processes to watch. Semiconductor engineers have thousands of chips on each wafer to test. Genomics researchers have millions of SNPs to study.
What should you do differently when the problem is big? What should you look for in software when you’re doing big statistics? Here are 10 important ways that big is different:
- If there are a lot of relationships to screen, you want a fairly automatic tool to do it. You certainly don’t want to have to specify the details of the analysis for each relationship. You don’t even want to look at hundreds or thousands of pages of output. You want the software to show you quickly which effects are the most significant and to help characterize patterns of significance.
- The bigger the problem, the better it is to have your statistics in data tables instead of reports. There is a lot to look at, and you need to treat it like data exploration, doing the same kinds of things you do with regular data: cleaning, adding formulas, searching, subsetting, summarizing, graphing, doing statistics on statistics. Sure you need graphs, but instead of a graph for each test, you should have graphs that can show thousands of tests.
- Since there are many hypothesis tests, you want reasonable control over the multiple inferences, so you can judge whether your discoveries are real or merely coincidences selected out of random chance. The false discovery rate (FDR) approach is an effective way to manage this, as it controls the expected rate of falsely significant results. Benjamini and Hochberg devised a simple procedure to do this, and the straightforward graph in Figure 1 shows how it works.
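The Benjamini-Hochberg procedure itself is only a few lines. Here is a minimal Python sketch (the function name and the sample p-values are invented for illustration):

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg FDR adjustment: sort the p-values,
    multiply the i-th smallest by n/i, then enforce monotonicity."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)                            # ascending p-values
    scaled = p[order] * n / np.arange(1, n + 1)      # p_(i) * n / i
    adj = np.minimum.accumulate(scaled[::-1])[::-1]  # step-up monotonicity
    adj = np.clip(adj, 0.0, 1.0)
    out = np.empty(n)
    out[order] = adj                                 # back to input order
    return out

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
adj = bh_adjust(pvals)
discoveries = adj < 0.05         # tests judged significant at FDR .05
```

Adjusted p-values under .05 pick out exactly the tests whose raw p-values fall under the ramp in Figure 1.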
- When you have lots of small p-values, you can’t distinguish what is merely significant from what is very significant. The –log10 of the p-value is a better scale for seeing significance than the p-value itself. This is called the logworth, and it is already a tradition in data mining. Just remember that a logworth of 2 is a p-value of .01. On the p-value scale, it is easy to distinguish among the non-significant tests but hard to distinguish among the significant ones. Changing to the logworth fixes that.
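The logworth is just a change of scale, one line in Python (the example p-values are made up):

```python
import numpy as np

# logworth = -log10(p): the scale spreads out as p shrinks,
# so very significant tests stop piling up near zero
pvals = np.array([0.5, 0.05, 0.01, 1e-4, 1e-12])
logworth = -np.log10(pvals)      # .01 -> 2, 1e-4 -> 4, 1e-12 -> 12
```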
- Often data have outliers that swamp the estimates of variance, spoiling the sensitivity to changes. Robust methods compensate for outliers automatically, allowing them to remain in the data; the Huber method does this well. Semiconductor data is famous for huge outliers due to electrical shorts, but you don’t want those outliers to spoil your ability to see differences in your data.
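The Huber approach can be sketched as an iteratively reweighted estimate of location. This generic version (a textbook sketch, not any particular product's implementation) down-weights points beyond k robust-scale units instead of deleting them:

```python
import numpy as np

def huber_mean(x, k=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of location: outliers keep their place in the
    data but get weight k/|r| instead of 1 in the weighted mean."""
    x = np.asarray(x, dtype=float)
    mu = np.median(x)
    scale = np.median(np.abs(x - mu)) / 0.6745       # MAD estimate of sigma
    for _ in range(max_iter):
        r = (x - mu) / scale                         # scaled residuals
        w = np.minimum(1.0, k / np.maximum(np.abs(r), 1e-12))
        new_mu = np.sum(w * x) / np.sum(w)
        if abs(new_mu - mu) < tol:
            break
        mu = new_mu
    return mu

data = [9.8, 10.1, 10.0, 9.9, 10.2, 85.0]   # one electrical-short outlier
estimate = huber_mean(data)                 # near 10; the plain mean is ~20.8
```

The outlier stays in the data set, but its weight shrinks until it no longer swamps the estimate.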
- If you have a lot of data, even a small difference can be significant. Large n leads to very precise estimates of the means, allowing you to discriminate much smaller differences. But then everything looks significant. You don’t care about finding very small differences. You want to find the differences that are at least some amount, typically a fraction of the specification range. We call this practical significance. You can adapt the tests to detect only differences of practical magnitude.
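One way to adapt the test, sketched with invented data and an invented threshold: shift the null hypothesis by the practical difference delta, so only differences beyond delta can reject. Below, a mean shift of 0.02 with n = 100,000 per group is overwhelmingly significant against zero, yet not significant against a practical threshold of 0.1:

```python
import numpy as np
from scipy import stats

def practical_test(x, y, delta):
    """One-sided t-test of H0: |mean(x) - mean(y)| <= delta
    against the alternative that the difference exceeds delta."""
    d = np.mean(x) - np.mean(y)
    se = np.sqrt(np.var(x, ddof=1) / len(x) + np.var(y, ddof=1) / len(y))
    df = len(x) + len(y) - 2
    return stats.t.sf((abs(d) - delta) / se, df)     # p-value

base = np.linspace(-1.0, 1.0, 100_000)    # deterministic stand-in for noise
x, y = base + 0.02, base                  # true difference of 0.02
p_classical = stats.ttest_ind(x, y).pvalue
p_practical = practical_test(x, y, delta=0.1)
```

Here p_classical is tiny while p_practical is near 1: the difference is real but not of practical magnitude.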
- Sometimes you want to be reassured that you don’t have differences, but most tests are constructed the other way – to detect differences or be inconclusive. Equivalence tests provide a way to establish practical sameness rather than difference. They are easily done with two one-sided t-tests, each showing that the difference falls within a practical threshold. If you are changing suppliers, for example, the big question is whether the new supply will lead to the same quality product, relative to specification ranges (Figure 2).
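The two one-sided tests (TOST) can be sketched the same way; the data and the 0.1 threshold are again invented for illustration:

```python
import numpy as np
from scipy import stats

def tost(x, y, delta):
    """Two one-sided t-tests: conclude practical equivalence
    (|mean(x) - mean(y)| < delta) only if BOTH one-sided tests reject."""
    d = np.mean(x) - np.mean(y)
    se = np.sqrt(np.var(x, ddof=1) / len(x) + np.var(y, ddof=1) / len(y))
    df = len(x) + len(y) - 2
    p_lower = stats.t.sf((d + delta) / se, df)   # H0: d <= -delta
    p_upper = stats.t.cdf((d - delta) / se, df)  # H0: d >= +delta
    return max(p_lower, p_upper)                 # overall equivalence p-value

base = np.linspace(-1.0, 1.0, 100_000)
old_supplier, new_supplier = base, base + 0.02   # small, unimportant shift
p_equiv = tost(old_supplier, new_supplier, delta=0.1)   # well under .05
```

A small p_equiv is the reassurance: the supplier difference is significantly inside the practical threshold.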
Figure 2: With lots of data, everything may look significantly different from zero. But if you measure what’s different with respect to what matters to you – for example, 10 percent of the specification range – you see that many of the differences may not only be non-significant, but also significantly within the practical difference, i.e., practically equivalent. If you are making a supplier or process change and want to make sure that results don’t suffer from it, then equivalence tests can provide reassurance that results don’t change by enough to matter.
- With big data, you need the hardware to be adequate to the task. Make sure you have lots of memory so that you can do all the calculations in-memory. A year ago, we started seeing 16 gigabytes of memory for less than $100 – so there is no excuse for having less. But desktops and laptops with enough memory can handle pretty large problems; you don’t need a supercomputer or a grid.
- You also need software that is ready to scale up the analysis, as well as addressing all the issues above. The software should take advantage of multithreading to get the answers fast. This enables you to do millions of tests in seconds.
- When you are looking across more than just groups defined by one classification – when you are fitting a multi-term model – there is an important additional consideration: Some data sets are full of missing values. The standard methods must drop a whole observation if any of the X’s are missing. If you model with lots of variables, and one or another of the variables is missing, you may end up with no data to estimate with. Furthermore, even if you do end up with enough data, that remaining part may be a biased sample whenever the mechanism that creates the missing values is related to the response, so the results are biased. The way to ameliorate this is to use missingness informatively, as part of the model; a simple coding system does this easily. Not only are the results less biased, but the models often predict much better, and they make predictions for all the data, not just the non-missing part.
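One common coding system can be sketched simply (a generic informative-missing coding, not a specific product's method): fill each missing value with the column mean and add a 0/1 indicator column, so the missingness itself becomes a model term:

```python
import numpy as np

def code_missing(X):
    """For each column: fill NaNs with the column mean and, if the
    column has any NaNs, append a 0/1 'was missing' indicator column."""
    X = np.asarray(X, dtype=float)
    cols = []
    for j in range(X.shape[1]):
        col = X[:, j].copy()
        miss = np.isnan(col)
        col[miss] = np.nanmean(X[:, j])          # mean of observed values
        cols.append(col)
        if miss.any():
            cols.append(miss.astype(float))      # 1 = value was missing
    return np.column_stack(cols)

X = np.array([[1.0,    5.0],
              [np.nan, 7.0],
              [3.0,    np.nan],
              [4.0,    8.0]])
Xc = code_missing(X)   # columns: x1 filled, x1 missing, x2 filled, x2 missing
```

No rows are dropped, so the model fits and predicts on all the observations, and the indicator columns let a missingness mechanism related to the response show up in the fit instead of silently biasing it.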
Because of the need to analyze at scale, we have changed our attitudes, including some of the design rules that guided us previously.
- We used to think of interactivity as useful only for small data. With fast computers and multithreaded software, even large data can be analyzed interactively, getting results back without waiting.
- We used to think of statistics as being for testing and fitting. Now it is often more important to use it for screening, looking for big effects and finding features.
- We used to have the rule that a graph should accompany each test. That is impractical when there are hundreds or even millions of tests. Now the goal is to make fewer graphs, each showing many tests.
- We used to show small p-values as <.0001. This made it impossible to distinguish between moderately significant and very significant effects. Now we use logworths instead.
Welcome to the world of big statistics. We can learn a lot when we have a lot of data. We can put that learning to use to ensure we are making the best products, with an efficient process to do so.