Mixed models: The flexible solution for correlated data
by Russ Wolfinger
Why are mixed models at the center of so many analyses? The main reason is that data measurements tend to be correlated, violating the independence assumption behind usual regression and ANOVA models, and ignoring this correlation can severely bias statistical inference. Mixed models are one of the easiest ways to handle this situation and are flexible enough to accommodate a variety of factors, typically clustering effects that indicate repeated measurements at hierarchical levels of experimental units. Mixed model theory is a unifying theme throughout statistics, encompassing such methods as variance components, empirical Bayes, time series and smoothing splines.
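For readers who want the notation, the linear mixed model is usually written as below. This is the standard textbook form rather than anything specific to the case studies that follow, and the symbols are introduced here for illustration only.

```latex
\begin{aligned}
y &= X\beta + Zu + \varepsilon, \\
u &\sim N(0,\, G), \qquad \varepsilon \sim N(0,\, R), \qquad \operatorname{Cov}(u, \varepsilon) = 0, \\
\operatorname{Var}(y) &= Z G Z^{\top} + R .
\end{aligned}
```

Here X and Z are the design matrices for the fixed effects and the random effects. The correlation among measurements enters through the marginal covariance Z G Z' + R, which reduces to the usual independence assumption only when there are no random effects and R is a multiple of the identity.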
To illustrate the breadth of applications of mixed models, I thought it would be fun to describe a few modern-day case studies in which they apply. Each of these diverse case studies is followed by a brief description of how mixed models could be applied to address the challenges discussed.
You are a human genetics researcher and have just performed a genome-wide association study (GWAS) using next-generation sequencing on thousands of families with a predisposition to various forms of cancer. You have more than 1 million single-nucleotide polymorphisms (SNPs) as well as more than 50,000 gene expression measurements on each person. To properly analyze these data, you must account for correlation due to repeated measures on the same person, pedigree structure within each family and population structure among racial groups.
Mixed model solution: The size of the data makes it difficult to fit one gigantic mixed model, so a good place to start is to first estimate covariance matrices across individuals using the SNP data. These matrices can correspond to more local association (including family association) or more general population structure. Then use these matrices in a QK mixed model to test SNP associations one at a time. Use expression measurements one at a time to perform eQTL analyses. JMP® Genomics includes processes for these analyses.
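As a rough illustration of the first step, the sketch below estimates a relationship (kinship) matrix across individuals from SNP genotypes coded as minor-allele counts of 0, 1 or 2. It centers each SNP by twice its allele frequency and scales by the total expected variance, which is one common choice; it is an assumption made for this example and not necessarily what JMP Genomics computes. The function name and input layout are hypothetical.

```python
def kinship_matrix(genotypes):
    """Estimate a genomic relationship matrix from allele counts.

    genotypes: list of rows, one per individual; each row holds the
    minor-allele count (0, 1 or 2) for every SNP. Assumes no missing calls.
    Returns an n x n list of lists.
    """
    n = len(genotypes)
    m = len(genotypes[0])

    # Allele frequency of each SNP across the 2n chromosomes sampled.
    freqs = [sum(row[j] for row in genotypes) / (2.0 * n) for j in range(m)]

    # Scaling constant: sum over SNPs of the expected genotype variance.
    scale = 2.0 * sum(p * (1.0 - p) for p in freqs)
    if scale == 0.0:
        raise ValueError("all SNPs are monomorphic; nothing to estimate")

    # Center each genotype by its expected value, 2p.
    centered = [[row[j] - 2.0 * freqs[j] for j in range(m)] for row in genotypes]

    # Cross-products of centered genotypes, scaled.
    return [
        [sum(a * b for a, b in zip(centered[i], centered[k])) / scale
         for k in range(n)]
        for i in range(n)
    ]
```

A matrix like this would then be supplied as the covariance structure of a random effect when each SNP is tested in turn. With a million SNPs and thousands of people, a real analysis would use optimized matrix routines rather than nested Python loops.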
You are a medical reviewer for a high-profile clinical trial on a controversial cardiovascular drug that is suspected to have several serious adverse side effects. The trial has more than 15,000 individuals spread across different sites and countries. Although you are adept at spotting trends in adverse events and labs, you are concerned about the degree of statistical significance of the issues you find, and about missing something in a trial this large. You are also wondering how much statistical power this trial has to detect effects of a given size.
Mixed model solution: Use an incidence screen approach across every adverse event, modeling data for each event separately with a generalized linear mixed model that accounts for site and country effects along with any other covariates of interest. Use Double FDR to account for multiplicity from adverse event groupings. Fix variance components at reasonable values and perform a mixed model power analysis. JMP Clinical includes processes for these analyses.
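To make the multiplicity step concrete, suppose each adverse event's model has already produced a p-value for the treatment effect. The sketch below is not Double FDR itself, which works in two stages over groupings of events; it shows the simpler single-level Benjamini-Hochberg adjustment, swapped in here as an illustrative stand-in. The function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value is no
    # larger than the adjusted value of any bigger raw p-value.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

An event would be flagged when its adjusted p-value falls below the chosen false discovery rate, for example 0.05.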
You are quality manager at a manufacturing firm and have just received alarming news of increased failures of products in the field. You need to design and analyze a new experiment to determine sources of variability and poor quality. Your manufacturing process is somewhat complex and involves some factors that are easy to change, others that are hard to change and still others that are very hard to change. You also suspect that some operators may be better than others at a few key steps that involve human skill.
Mixed model solution: Use the DOE Custom Design in JMP to set up a split-split-plot design, including blocks as needed. Analyze results with a mixed model with appropriate variance components at each level.
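The idea of a variance component can be seen in a much smaller setting than a split-split-plot design. The sketch below assumes a balanced one-way layout, for example the same number of measurements from each operator, and uses the classical ANOVA (method-of-moments) estimators to split the variability into a between-operator part and a within-operator part. It is a deliberately simplified stand-in for the full mixed model analysis, and the function name and data layout are hypothetical.

```python
from statistics import mean


def variance_components(groups):
    """ANOVA estimates for a balanced one-way random-effects layout.

    groups: dict mapping a group label (e.g. operator) to a list of
    measurements; every list must have the same length.
    Returns (between_group_variance, within_group_variance).
    """
    sizes = {len(values) for values in groups.values()}
    if len(sizes) != 1:
        raise ValueError("this sketch handles balanced data only")
    n = sizes.pop()   # measurements per group
    k = len(groups)   # number of groups
    if k < 2 or n < 2:
        raise ValueError("need at least two groups with two measurements each")

    group_means = {g: mean(values) for g, values in groups.items()}
    grand_mean = mean(list(group_means.values()))

    # Mean square between groups and mean square within groups.
    msb = n * sum((gm - grand_mean) ** 2 for gm in group_means.values()) / (k - 1)
    msw = sum(
        (y - group_means[g]) ** 2 for g, values in groups.items() for y in values
    ) / (k * (n - 1))

    # E[MSB] = within + n * between, so solve for between; truncate at zero.
    between = max(0.0, (msb - msw) / n)
    return between, msw
```

In a split-split-plot experiment the same logic applies at each level of randomization (whole plot, subplot and sub-subplot), but the estimates are normally obtained by restricted maximum likelihood in mixed model software rather than by hand.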
You are a statistician tasked with analyzing a large social network data set consisting of dyadic (paired relational) measurements along with extra grouping variables corresponding to families and community organizations. You can draw some nice network diagrams but would like to complement them with quantitatively rigorous statistical modeling results.
Mixed model solution: Set up a social relations model with a covariance structure accommodating various sources of correlation in the data.
You are education commissioner of a large urban county and are receiving severe criticism over declining standardized test scores and presumably poor teaching. You have longitudinal score data from the past several years for nearly every student in the county along with his or her teacher assignments. You want to assess these data fairly and determine both teacher and student effects, while accounting for correlation due to classroom, school and district and appropriately handling factors such as socioeconomic status.
Mixed model solution: Create a large sparse mixed model to model all effects. The SAS EVAAS (Education Value-Added Assessment System) group members are experts in this kind of approach.
You are a leading analyst for a large retail chain that must set prices on more than 100,000 stock-keeping units (SKUs). The optimal price for each item should be neither too high nor too low in order to maximize profitability. You have legacy data to model and know that there are strong, covariance-inducing effects corresponding to product category, store, region and season of year.
Mixed model solution: Fit mixed models for each SKU, accommodating effects of interest. The SAS Revenue Optimization Suite incorporates this approach.
You are a plant breeder tasked with improving traits of an important new crop. You have randomized block field trial data from several locations along with several hundred genetic markers and progeny data.
Mixed model solution: Use genetic and progeny structure along with blocking effects in a mixed model for traits of interest.
As you can see, the proposed solutions in these cases are like spokes of a wheel emanating from a central mixed model methodology. Understanding how one analysis works can help with another once you identify what their mixed models have in common.