Analyzing spectral data: Modeling options

by Bill Worley and Jeremy Ash, JMP

In our previous two articles, Jeremy Ash (@JeremyAshJMP) and I showcased JMP spectral analysis and visualization capabilities by analyzing a popular spectroscopy data set from Martens et al.1 We showed different ways to visualize spectral data, along with a few pre-processing steps that dramatically improved the signal-to-noise ratio in the data for further analysis and modeling.

In this third segment of the series, we demonstrate several ways spectral data can be modeled in JMP. Partial Least Squares (PLS) is the most common tool for analyzing spectral data, and it is available in both JMP and JMP Pro. JMP Pro also offers Elastic Net penalized regression, neural net modeling, and advanced decision tree methods like Bootstrap Forest.

The Bootstrap Forest and Elastic Net techniques make it quick and easy to find the most important wavelengths of individual compounds, as well as the wavelengths that define components in mixtures. These modeling techniques can be used with raw or processed data. In this post, Jeremy and I model the raw data and the pre-processed data side by side to show the overall improvement in predictive capability that pre-processing delivers. To keep the comparison of methods fair, the same two-level validation column, holding out 20% of the data as the validation set, was used for all analyses.
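
For readers who want to experiment outside JMP, here is a minimal sketch of that fixed validation scheme in Python with scikit-learn. The synthetic spectra, response, and variable names are illustrative stand-ins for the Martens et al. data, not the data itself.

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for the gluten/starch NIR spectra:
    # 100 mixture samples measured at 500 wavelengths.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 500)).cumsum(axis=1)  # smooth, highly correlated columns
    y = rng.uniform(0.0, 1.0, size=100)             # response, e.g., gluten fraction

    # Fixed 20% hold-out, reused for every model so the comparison is fair.
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.20, random_state=42
    )

Fixing the random seed plays the same role as a saved validation column in JMP: every model sees exactly the same training and validation rows.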

Modeling Methods

Partial Least Squares

Partial Least Squares is the standard method for analyzing spectra. PLS handles both wide and tall data and is especially good with highly correlated factors. That suits spectra well: absorbances at neighboring wavelengths are highly correlated, so each successive wavelength depends strongly on the one immediately before it. Because of this dependency, PLS is often applied directly to raw spectral data with very good results. As stated above, though, pre-processing improves the resulting predictive model enough to warrant the time and effort. Recall that the sample data is a set of gluten and starch mixtures; PLS does a very good job of developing a model that predicts the gluten and starch ratios well for future samples.

The PLS model was built with two latent factors because the earlier principal component analysis used two principal components. Also, Q2 does not improve much beyond two factors, and adding more factors makes the model harder to interpret.
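
As a rough analogue outside JMP, here is a hedged scikit-learn sketch of a two-factor PLS fit on the same kind of synthetic data as above. It illustrates the technique, not the JMP platform's output; the two driving wavelengths are made up for the example.

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import train_test_split

    # Synthetic spectra with a response driven by two wavelength regions.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 500)).cumsum(axis=1)
    y = X[:, 120] - 0.5 * X[:, 300] + rng.normal(scale=0.1, size=100)
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.20, random_state=42
    )

    pls = PLSRegression(n_components=2)  # two latent factors, as in the post
    pls.fit(X_train, y_train)
    print("Validation R2:", pls.score(X_valid, y_valid))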

Figure 1. Partial Least Squares model set up and output.

Figure 2. Partial Least Squares model with Raw data.

Figure 3. Partial Least Squares model with Pre-Processed data.

Below is a comparison of the Actual by Predicted Plot for both PLS models. The one on the left is for the raw data, and the one on the right uses the pre-processed data. As you can see, there is a significant improvement in the model using the pre-processed data.

Figure 4. Partial Least Squares model comparison with Raw and Pre-Processed data.

Generalized Regression (Gen Reg)

Generalized Regression techniques are found in JMP Pro. Gen Reg gives you at least two things you will not get from ordinary least squares. First, your data does not have to be normally distributed; you can choose from many different response distributions. Second, Gen Reg includes penalized regression techniques, which handle highly correlated data very well: Ridge, Lasso, and Elastic Net. Elastic Net is a nice compromise for modeling spectral data because you can adjust the Elastic Net Alpha (ENA) penalty to keep from underfitting a model by rejecting too many highly correlated factors. The default ENA value of 0.99 was used to build an initial model, which appeared rather sparse and potentially underfit. An advanced option in Gen Reg was then used to scan a range of alpha values in search of a better model; the scan found a better fit at an ENA of 0.55. You can see the difference in fit for the two ENA values in the solution paths below.
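
For illustration, here is a hedged scikit-learn sketch of the same idea on synthetic data. In scikit-learn the elastic net mixing parameter is called l1_ratio; scanning it is the rough analogue of scanning the ENA in Gen Reg, though the two implementations are not identical.

    import numpy as np
    from sklearn.linear_model import ElasticNetCV
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 500)).cumsum(axis=1)
    y = X[:, 120] - 0.5 * X[:, 300] + rng.normal(scale=0.1, size=100)
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.20, random_state=42
    )

    # Scan the L1/L2 mixing parameter, analogous to scanning the ENA.
    enet = ElasticNetCV(l1_ratio=[0.55, 0.75, 0.90, 0.99], cv=5, max_iter=50000)
    enet.fit(X_train, y_train)
    print("Chosen l1_ratio:", enet.l1_ratio_)
    print("Validation R2:", enet.score(X_valid, y_valid))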

Figure 5. Elastic Net solution paths for Raw data.

Figure 6. Fit Statistics for Gen Reg with Raw data.

Figure 7. Elastic Net solution paths for Pre-Processed data.

Figure 8. Fit Statistics for Gen Reg with Pre-Processed data.

Without going into more detail, the fit for the raw data used 10 wavelengths when the ENA was 0.99 and 37 wavelengths when the ENA was 0.55.

The pre-processed data gave a noticeably different outcome. Using Elastic Net again, 22 significant wavelengths were found with an ENA of 0.99 and 53 with an ENA of 0.55. While 37 and 53 wavelengths may sound like a lot, both models use far fewer factors than the PLS models, which draw on every wavelength. Checking the fit statistics for both sets of models shows that the pre-processed data produced the better overall predictive model. Even with only raw data, though, you could still build a reasonably good predictive model.
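
Continuing the scikit-learn sketch above, the count of "significant wavelengths" corresponds to the nonzero Elastic Net coefficients (here the wavelengths are just column indices of the synthetic matrix):

    import numpy as np

    # The nonzero coefficients are the wavelengths the model kept.
    selected = np.flatnonzero(enet.coef_)
    print(selected.size, "wavelengths selected:", selected[:10], "...")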

Figure 9. Model comparison for Gen Reg with Raw and Pre-Processed data.

A comparison of PLS and Elastic Net (ELN) models using the processed data shows the ELN models are only slightly better based on R2. However, the ELN models are simpler and easier to interpret.

Figure 10. Comparison of PLS and ELN modeling techniques for Pre-Processed data.

Does that mean you should use Elastic Net models for all your spectral analyses? You should consider it a viable candidate, but the choice really comes down to which modeling platform(s) you have available and whether you want model interpretability in addition to prediction performance.

More Modeling Options

Neural Net Models

For demonstration purposes, you can also use neural net models, which in this case fit extremely well but are highly complex.

However, the Prediction Profiler in JMP enables users to interpret even the most complex, nonlinear models. The Profiler can be overwhelming when all wavelengths are used for prediction, so to keep the neural net model simpler, you can supply only the wavelengths selected by Gen Reg in the model dialog.

Figure 11. Comparison of Neural Net fits for Raw and Pre-Processed data.

In both cases, the data was fit with one hidden layer of three nodes with a hyperbolic tangent activation function, using the fixed 20% validation column for cross-validation.
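
Here is a loose scikit-learn analogue of that architecture on synthetic data: one hidden layer of three tanh nodes, trained on a few pre-selected wavelengths. The selected columns are hard-coded for the example; in the post they come from Gen Reg.

    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 500)).cumsum(axis=1)
    y = X[:, 120] - 0.5 * X[:, 300] + rng.normal(scale=0.1, size=100)

    # Keep the model simple: feed it only a few pre-selected wavelengths.
    selected = [120, 300]  # illustrative stand-in for Gen Reg's selection
    X_train, X_valid, y_train, y_valid = train_test_split(
        X[:, selected], y, test_size=0.20, random_state=42
    )

    nn = MLPRegressor(hidden_layer_sizes=(3,), activation="tanh",
                      max_iter=5000, random_state=0)
    nn.fit(X_train, y_train)
    print("Validation R2:", nn.score(X_valid, y_valid))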

Bootstrap Forest

The Bootstrap Forest approach identifies the most significant wavelengths for a given set of spectra, and its results again clearly show that pre-processing gives a much better model: the Validation R2 is 0.805 for the raw data and 0.987 for the processed data. The models below were fit using the same fixed 20% validation set rather than a random selection.
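
Bootstrap Forest is a random-forest ensemble, so a random forest regressor gives a rough analogue outside JMP, including a ranking of wavelengths by importance (synthetic data again; the numbers will not match the post's).

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 500)).cumsum(axis=1)
    y = X[:, 120] - 0.5 * X[:, 300] + rng.normal(scale=0.1, size=100)
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.20, random_state=42
    )

    forest = RandomForestRegressor(n_estimators=100, random_state=0)
    forest.fit(X_train, y_train)
    print("Validation R2:", forest.score(X_valid, y_valid))

    # Rank wavelengths by importance, loosely analogous to the column
    # contributions reported by Bootstrap Forest.
    top10 = np.argsort(forest.feature_importances_)[::-1][:10]
    print("Most important wavelength columns:", top10)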

Figure 12. Comparison of Bootstrap Forest fits for Raw and Pre-Processed data.

Support Vector Regression

Support Vector Regression (SVR) modeling is also available in JMP Pro and is another good alternative for analyzing spectral data. SVR is a powerful, flexible boundary-based machine learning algorithm for building predictive models of a continuous response. On our data set, SVR also shows a significant improvement when analyzing the processed data.
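
A hedged scikit-learn sketch of an SVR fit on the synthetic data follows. SVR is sensitive to feature scale, so the wavelengths are standardized first; the kernel and cost parameter are illustrative choices, not the settings used in the post.

    import numpy as np
    from sklearn.svm import SVR
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 500)).cumsum(axis=1)
    y = X[:, 120] - 0.5 * X[:, 300] + rng.normal(scale=0.1, size=100)
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.20, random_state=42
    )

    # Standardize the wavelengths, then fit an RBF-kernel SVR.
    svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
    svr.fit(X_train, y_train)
    print("Validation R2:", svr.score(X_valid, y_valid))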

Figure 13. Comparison of SVR model fits for Raw and Pre-Processed data.

There are more modeling techniques in JMP and JMP Pro that you can use to model spectral data. If you didn't see one you would like to use, or have used, please let us know, and we will make sure to cover it in a future blog post.

1) "Light Scattering and light absorbance separated by extended multiplicative signal correction. Application to near-infrared transmission analysis of powder mixturesAnalytical Chemistry 2003 Feb1;75(3):394-404



About the Author

Bill Worley is a Chemical Systems Engineer for JMP, a business unit of SAS that specializes in data visualization software. He supports sales and customer development as part of the Global Technical Enablement Team.

Before joining JMP, Worley spent six years as a Technology Leader at Procter & Gamble, where he oversaw the use of JMP for design of experiments and statistical data analysis. He also trained P&G engineers and scientists in all aspects of empirical modeling and optimization. Worley is an analytical chemist by training and has held research roles at P&G, BASF and Unilever. He holds a master's degree in Chemistry from the University of Cincinnati.
