Structural Equation Modeling
What is structural equation modeling?
Structural equation modeling (SEM) is a statistical technique for analyzing multivariate relations among variables. With foundations in causal modeling, SEM combines aspects of factor analysis and regression analysis, making it possible to test theories and model associations among observed variables and unobserved (i.e., error-free latent) variables such as intelligence, customer satisfaction, or product quality.
What do I gain from using SEM?
SEM offers unique advantages, including the ability to specify latent variables, test causal theories, handle missing data with cutting-edge algorithms, relax constraints from traditional statistical models (e.g., homogeneity of variance), and test sequential associations among variables (i.e., direct and indirect effects) – all within a single, testable model.
What is the goal of using SEM?
SEM can be used for a variety of reasons. The goal can be as simple as applying cutting-edge missing-data algorithms seamlessly when fitting a multiple regression (or another traditional model). It can also be as complex as developing a survey with good measurement qualities, testing a causal theory about how variables relate to each other while accounting for measurement error, or studying how constructs change over time.
Reasons for using SEM
SEM is particularly useful when you need to do any of the following:
- Model variables that cannot be measured directly (i.e., latent variables).
- Model variables that have measurement error (which you need to account for).
- Specify a model in which variables are both predictors and outcomes.
- Test specific theories about the association of variables.
- Handle missing data with advanced methods without the hassle of multiple imputation.
- Rely on diagrams that describe your models intuitively.
Getting started with SEM requires understanding path diagrams
A central feature of SEM is that every model can be expressed as a path diagram. Consider the equations versus the diagram in Figure 1. If you needed to explain your statistical model to a six-year-old, which of these versions do you think would be most effective?
Naturally, a diagram allows you to convey complex statistical models to wide audiences and provides visual clarity that anyone, not only kids, can appreciate. Thus, we must start by understanding how to create path diagrams correctly. The building blocks used for drawing path diagrams are shown on the left of Figure 2, while two standard SEM path diagrams are depicted on the right.
Guidelines for creating and interpreting path diagrams in SEM
- Variables that you measure directly are called “manifest” variables and are drawn with squares.
- Unobserved variables (“latent” variables) are factors in the factor analytic sense. They represent the common variance across their manifest variables (aka indicators) and cause the variation we observe in their indicators. Latent variables are represented with circles.
- A triangle is used to represent a constant. It is used in SEM to estimate means and intercepts (e.g., regress a variable on a constant, and you get its mean). When models don’t place restrictions on variable means, the triangle is often omitted.
- One-headed arrows represent regression effects or loadings (loadings are regressions of a manifest variable on a latent variable). One-headed arrows connected to a triangle represent means or intercepts.
- Double-headed arrows represent variances (when they start and end on the same variable), or covariances (when they start and end on different variables).
Using these guidelines, you can draw path diagrams that specify highly complex models. The capacity to model relations among latent variables is why SEM is often thought of as a combination of factor analysis and regression.
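To make the triangle guideline concrete, the short sketch below regresses a variable on a constant and shows that the resulting coefficient is the variable's mean. The data values and the helper name `ols_no_intercept` are hypothetical and chosen only for illustration.

```python
def ols_no_intercept(x, y):
    """Least-squares coefficient b for the model y = b * x (no separate intercept)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

# Hypothetical observed scores on a manifest variable.
scores = [12.0, 15.0, 9.0, 14.0, 10.0]

# The "triangle" in a path diagram is a constant: every case has the value 1.
constant = [1.0] * len(scores)

# Regressing the variable on the constant yields its mean.
coef_on_constant = ols_no_intercept(constant, scores)
sample_mean = sum(scores) / len(scores)

print(coef_on_constant, sample_mean)
```

When other predictors point to the same variable, the path from the constant is interpreted as an intercept rather than a mean.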
The first model depicted on the right side of Figure 2 is a one-factor confirmatory factor analysis. Here, the latent variable has arrows pointing to W, X, and Y because the observed variation in these variables is caused by the latent variable. For example, if the latent variable is intelligence, then W, X, and Y might be measured scores on vocabulary, processing speed, and working memory, respectively.
Indeed, we would expect someone’s intelligence to determine scores on these variables. Note that both latent and observed variables have variances. To estimate such a model, one must set a scale for the latent variable, because a latent variable has no inherent unit of measurement. This is often done by fixing either one loading (see Figure 3) or the latent variable’s variance to one.
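The two scaling options can be compared with a small sketch. It assumes the standard one-factor structure, in which the covariance between two indicators equals the product of their loadings times the latent variance, and each indicator's variance adds its own unique (error) variance. All parameter values and the function name `implied_cov` are hypothetical; the point is only that both ways of setting the scale imply the same covariance matrix for W, X, and Y.

```python
def implied_cov(loadings, factor_var, unique_vars):
    """Model-implied covariance matrix of a one-factor model.

    Off-diagonal: loading_i * loading_j * factor_var.
    Diagonal: loading_i**2 * factor_var + unique_var_i.
    """
    k = len(loadings)
    return [
        [
            loadings[i] * loadings[j] * factor_var
            + (unique_vars[i] if i == j else 0.0)
            for j in range(k)
        ]
        for i in range(k)
    ]

# Hypothetical unique (error) variances for indicators W, X, Y.
unique_vars = [0.36, 0.64, 0.51]

# Option 1: fix the latent variance to 1 and estimate all three loadings.
sigma_variance_fixed = implied_cov([0.8, 0.6, 0.7], 1.0, unique_vars)

# Option 2: fix the first loading to 1 and estimate the latent variance.
# The remaining loadings are rescaled by the first loading, and the latent
# variance absorbs its square.
first = 0.8
sigma_loading_fixed = implied_cov(
    [1.0, 0.6 / first, 0.7 / first], first ** 2, unique_vars
)

for row_a, row_b in zip(sigma_variance_fixed, sigma_loading_fixed):
    print([round(v, 4) for v in row_a], [round(v, 4) for v in row_b])
```

Because the implied covariance matrices match, the choice of scaling changes how the parameters are expressed but not how well the model reproduces the data.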
The second model depicted on the right side of Figure 2 is a simple regression; for example, the amount of money spent on a marketing campaign (X) predicts sales (Y).
Note that X has an explicit variance parameter (i.e., a double-headed arrow) and is thus assumed to be normally distributed, an assumption not made in ordinary least squares regression. If data are missing, making this assumption enables us to retain all available data in the analysis, which is another major advantage of SEM.
Specifying, estimating, and assessing the adequacy of SEMs: A summary
- Specifying an SEM starts by defining the relations we expect among variables, often illustrated by a path diagram using the elements in Figure 2.
- The path diagram implies a specific structure on how variables covary, represented by a model-implied covariance matrix.
- The data to which the model is fitted is used to estimate a sample covariance matrix. Thus, we now have two matrices: one implied by the model and one reflecting the associations in the data.
- Estimation algorithms used in SEM attempt to recreate the values in the sample covariance matrix while respecting the constraints reflected in the model-implied covariance matrix.
- After fitting the model and estimating parameters, a matrix of residuals is computed by taking the difference between the sample and model-implied covariance matrices. The smaller the residuals, the better the fit of the model. Note that these residuals are unique to SEM, in that they're differences between sample and estimated covariances rather than between responses and predicted values, as in standard regression models.
- The adequacy of the model is formally tested by computing statistics summarizing the residuals.
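The steps summarized above can be sketched for a model small enough to estimate by hand. The example assumes a hypothetical three-variable chain, X → M → Y with no direct path from X to Y, and a made-up sample covariance matrix from 200 observations. For this kind of recursive model the maximum likelihood path estimates coincide with regression slopes, so no iterative optimizer is needed. The fit statistic uses the usual maximum likelihood discrepancy function multiplied by N − 1; some software multiplies by N instead.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def inv3(m):
    """Inverse of a 3x3 matrix via the adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = det3(m)
    adj = [
        [e * i - f * h, c * h - b * i, b * f - c * e],
        [f * g - d * i, a * i - c * g, c * d - a * f],
        [d * h - e * g, b * g - a * h, a * e - b * d],
    ]
    return [[v / det for v in row] for row in adj]

# Hypothetical sample covariance matrix for X, M, Y (in that order).
S = [[4.0, 2.0, 1.5],
     [2.0, 5.0, 2.5],
     [1.5, 2.5, 6.0]]
n_obs = 200  # hypothetical sample size

# Path estimates for X -> M and M -> Y.
a = S[0][1] / S[0][0]
b = S[1][2] / S[1][1]

# Model-implied covariance matrix: it matches S everywhere except cov(X, Y),
# which the model constrains to a * b * var(X) because X affects Y only via M.
sigma = [row[:] for row in S]
sigma[0][2] = sigma[2][0] = a * b * S[0][0]

# Residual matrix: sample minus model-implied covariances.
residuals = [[S[i][j] - sigma[i][j] for j in range(3)] for i in range(3)]

# Maximum likelihood discrepancy: ln|Sigma| - ln|S| + trace(S Sigma^-1) - p.
sigma_inv = inv3(sigma)
trace = sum(S[i][k] * sigma_inv[k][i] for i in range(3) for k in range(3))
f_ml = math.log(det3(sigma)) - math.log(det3(S)) + trace - 3

# Chi-square test of model fit: 6 unique covariance elements minus
# 5 free parameters (var(X), a, b, and two residual variances) gives 1 df.
chi_square = (n_obs - 1) * f_ml
df = 6 - 5
# For 1 degree of freedom, the chi-square upper-tail probability is erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))

print("residual cov(X, Y):", residuals[0][2])
print("chi-square:", round(chi_square, 3), "df:", df, "p:", round(p_value, 3))
```

The only nonzero residual sits in the X–Y cell, which is exactly the covariance the model constrains. A large residual there would suggest that a direct path from X to Y is missing.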
Take a deeper dive
Want to review these concepts and see applied SEM examples using JMP Pro? Watch this tutorial (47:06) to get you started!
Assumptions of SEM
SEM typically requires large sample sizes; a common rule of thumb is about 10 observations per parameter in the model. Additional assumptions depend on the chosen estimator. For example, maximum likelihood assumes multivariate normality, but this assumption can be relaxed by bootstrapping, applying sandwich adjustments to standard errors, or using other estimators, such as weighted least squares for categorical variables.
Variety of models in SEM
SEM can be used for a wide range of purposes, from fitting a simple linear regression to modeling a nonlinear process over time with factors that both predict and are outcomes of that process. Other applications might involve:
- Developing a test or survey for measuring one or many latent variables through confirmatory factor analysis (CFA). Watch a demo of CFA (1:25; video has no sound).
- Testing mechanisms by which a set of variables lead to other variables through path analysis. Watch a demo of path analysis with latent variables (2:06; video has no sound).
- Investigating the indirect effect that one or many variables have on others through mediation analysis. Watch a demo of mediation in JMP Pro (2:58).
- Characterizing individual and average trajectories of processes through latent growth curve analysis. Watch a tutorial for fitting latent growth curves in JMP Pro (42:55).
- Studying group differences through multiple group analysis. Watch a tutorial for testing group differences in JMP Pro (48:46).
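As a brief illustration of the mediation analysis mentioned in the list above, the sketch below splits the total effect of a predictor X on an outcome Y into a direct effect and an indirect effect that runs through a mediator M. The covariances are hypothetical, and the coefficients are computed with ordinary regression formulas rather than with any particular SEM software.

```python
# Hypothetical variances and covariances for X (predictor), M (mediator),
# and Y (outcome).
s_xx, s_mm, s_yy = 4.0, 5.0, 6.0
s_xm, s_xy, s_my = 2.0, 1.5, 2.5

# Path a: regression of M on X.
a = s_xm / s_xx

# Regression of Y on both M and X, solved from the covariances.
denom = s_xx * s_mm - s_xm ** 2
b = (s_my * s_xx - s_xm * s_xy) / denom       # path b: M -> Y, holding X fixed
direct = (s_xy * s_mm - s_xm * s_my) / denom  # direct effect: X -> Y, holding M fixed

# Indirect effect: product of the two paths that pass through the mediator.
indirect = a * b

# Total effect: regression of Y on X alone.
total = s_xy / s_xx

print("direct:", direct, "indirect:", indirect, "total:", total)
```

With linear regressions like these, the direct and indirect effects add up to the total effect, which is why the indirect effect is often described as the part of the association that is transmitted through the mediator.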