Statistics, Predictive Modeling and Data Mining with JMP®
Statistics is the discipline of collecting, describing and analyzing data to quantify variation and uncover useful relationships. It allows you to solve problems, reveal opportunities and make informed decisions in the face of uncertainty. Through the effective application of statistics, you can gain insight, foresight and the means for continuous learning and improvement, no matter what context you work within.
Whether your goal is description, prediction or explanation, you will appreciate the statistical discovery paradigm of JMP, which exploits the intrinsic synergy between visualization and modeling. No matter what the shape and size of your data, so long as it fits in memory, JMP will allow you to get the most from it, whatever your current level of statistical expertise.
JMP provides comprehensive facilities for univariate linear and nonlinear regression, the more useful multivariate approaches for exploration, dimensionality reduction and modeling, and for the analysis of time series and categorical data. JMP and JMP Pro are intended to meet the statistical needs of most users most of the time, surfacing the various techniques and results in a way that you can easily grasp, but without compromising the depth of the analysis. JMP also has a set of modeling utilities that deal with common data issues upfront, while JMP Pro includes a rich set of sophisticated algorithms for building better models with messy data.
Through visual and interactive reports and profilers, JMP helps you communicate simple or complex findings to those who may not have an affinity with statistical methods, yet who need to understand and act upon your findings. Model results generated by JMP can also be dynamically profiled in a mobile or desktop web browser.
Finally, using an integrated facility, JMP Pro can easily perform sample size calculations for fitted models (simple or complex) via Monte Carlo simulation. This helps you to assess the power of the data you have collected to address the questions at hand.
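The idea of estimating power by Monte Carlo simulation can be illustrated outside JMP with a few lines of Python. This is only a minimal sketch under stated assumptions, not the JMP Pro facility: it assumes a two-sided one-sample z-test with a known standard deviation, and the function name `simulated_power` and its parameters are hypothetical.

```python
import random
from statistics import NormalDist


def simulated_power(n, effect, sigma=1.0, alpha=0.05, n_sims=2000, seed=1):
    """Estimate the power of a two-sided one-sample z-test by simulation.

    Assumption: observations are normal with known sigma and true mean `effect`;
    the null hypothesis is that the mean is zero.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        sample = [rng.gauss(effect, sigma) for _ in range(n)]
        z = (sum(sample) / n) / (sigma / n ** 0.5)
        if abs(z) > z_crit:
            rejections += 1
    # Power is the fraction of simulated data sets in which the null is rejected.
    return rejections / n_sims


for n in (10, 20, 50):
    print(n, simulated_power(n, effect=0.5))
```

Repeating the calculation over a range of sample sizes shows how many rows would be needed to reach a target power for a given effect size.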
The class of linear regression models is diverse and ubiquitous. JMP puts these powerful methods in the hands of practitioners of all skill levels, and in a form they can easily use.
Using Fit Y by X, you can test for and model dependencies between a single input and outcome. JMP unifies what is normally considered a disparate set of statistical approaches into a coherent, understandable whole and provides graphical output so you can understand results easily.
The Fit Model platform provides an environment for fitting simple or complex models with specified fixed and random effects and defined error terms. An Effect Summary report allows you to drag and drop terms to see their impact on the model.
Whatever your favored model-building approach, JMP provides a complete set of manual and automated methods, with appropriate diagnostics, to allow you to rapidly build most types of linear models. An “informative missing” approach allows the information in all your rows to contribute. Specific fitting options focus your attention appropriately; JMP Pro extends the repertoire by adding Mixed Models (to correctly handle repeated and spatial measurements) and Generalized Regression (with regularized or penalized regression techniques like the Elastic Net that help identify X's that may have explanatory power). JMP Pro also supports quantile regression.
JMP lets you easily compare competing models. Multiple responses are handled in an integrated way, and the Profiler makes it simple to compare and contrast the interpretability and results of various fits. The Profiler also allows you to find settings to optimize your Y's, and Monte Carlo simulations help you assess how variation in the X's will be transmitted into the Y's.
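To make the idea of transmitted variation concrete, the following Python sketch draws random X's, pushes them through a prediction formula and summarizes the resulting spread in Y. It is an illustration only: the linear `fitted_model`, the normal distributions for the X's and the function name `transmitted_variation` are all assumptions, not JMP output.

```python
import random
from statistics import mean, stdev


def transmitted_variation(model, x_means, x_sds, n_sims=20000, seed=1):
    """Simulate Y = model(X1, X2, ...) with independent normal X's.

    Returns the simulated mean and standard deviation of Y.
    """
    rng = random.Random(seed)
    ys = []
    for _ in range(n_sims):
        xs = [rng.gauss(m, s) for m, s in zip(x_means, x_sds)]
        ys.append(model(*xs))
    return mean(ys), stdev(ys)


def fitted_model(x1, x2):
    # Hypothetical prediction formula standing in for a fitted model.
    return 10 + 2 * x1 + 3 * x2


y_mean, y_sd = transmitted_variation(fitted_model, [1.0, 2.0], [0.1, 0.2])
print(round(y_mean, 2), round(y_sd, 2))
```

For this linear example the simulated standard deviation can be checked against the analytical value, the square root of (2 × 0.1)² + (3 × 0.2)², which is about 0.63. For nonlinear models, simulation is often the only practical route.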
The Nonlinear platform allows you to model nonlinear relationships. Nonlinear models use either standard least squares or a custom loss function. JMP provides a library of nonlinear model types needed for bioassay and pharmacokinetic studies, and does not require you to input starting values or auxiliary formulas. Grouping variables are supported, and you can quickly and easily isolate any subject effects using graphical displays. The custom loss function facility provides additional flexibility, allowing you to use, for example, iteratively reweighted least squares for robust regression.
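Iteratively reweighted least squares is easiest to see on a straight-line fit. The Python sketch below is not JMP's implementation. It assumes Huber weights with the conventional tuning constant 1.345 and a scale estimate based on the median absolute residual. The data are invented for illustration, with one deliberately planted outlier.

```python
from statistics import median


def weighted_line_fit(x, y, w):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return ybar - b * xbar, b


def irls_huber(x, y, k=1.345, iterations=50):
    """Robust line fit: refit repeatedly, down-weighting large residuals."""
    w = [1.0] * len(x)
    for _ in range(iterations):
        a, b = weighted_line_fit(x, y, w)
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # Robust scale from the median absolute residual (guarded against zero).
        scale = max(median([abs(r) for r in resid]) / 0.6745, 1e-12)
        # Huber weights: 1 inside the threshold, shrinking with |residual| outside.
        w = [1.0 if abs(r) <= k * scale else k * scale / abs(r) for r in resid]
    return weighted_line_fit(x, y, w)


# Invented data: y is roughly 1 + 2x, with one gross outlier at x = 5.
x = list(range(10))
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.0, 0.2, -0.15, 0.1, -0.05]
y = [1 + 2 * xi + e for xi, e in zip(x, noise)]
y[5] = 100.0

ols = weighted_line_fit(x, y, [1.0] * len(x))
robust = irls_huber(x, y)
print("ordinary least squares (a, b):", ols)
print("IRLS with Huber weights (a, b):", robust)
```

The ordinary fit is pulled toward the outlier, while the reweighted fit stays close to the line followed by the other points.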
The Categorical platform in JMP provides tables, summaries and statistical tests of response data and multiple response data when the measured responses indicate membership of a particular category. Such data arises in a variety of settings, including recording test results, classifying defects or side effects, and administering surveys.
Partly because of its diverse application, categorical data can be presented in a variety of formats. A particular strength of the Categorical platform is that it can handle this diversity without any need to reshape the data prior to exploration and analysis. One or more columns can be used to define the categories within and between which variation in the response is assessed, and the Categorical report contains the resulting charts of share and frequency, by category. Used in conjunction with the data filter in JMP, these charts provide quick and easy review of large-scale survey data. The report can also display the associated tabulations and cross-tabulations, which can be quickly transposed for easier viewing or printing if needed.
Depending on the nature of the responses, you can also statistically address questions like:
- Does the pattern of response vary with sample categories, and has it changed over time?
- For each response category, are the rates the same across sample categories?
- How closely do the raters agree?
- What is the relative risk of different treatments?
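Two of the questions above have simple textbook statistics behind them, sketched here in Python. These are the standard definitions of relative risk and Cohen's kappa, and may not match exactly what the Categorical platform reports. The function names and example counts are hypothetical.

```python
from collections import Counter


def relative_risk(events_a, total_a, events_b, total_b):
    """Risk in group A divided by risk in group B."""
    return (events_a / total_a) / (events_b / total_b)


def cohens_kappa(rater1, rater2):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    # Chance agreement: product of the two raters' marginal proportions.
    expected = sum(c1[c] * c2[c] for c in c1) / n ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical counts: 10 of 100 events under treatment A, 20 of 100 under B.
print(relative_risk(10, 100, 20, 100))
# Hypothetical ratings of four items by two raters.
print(cohens_kappa(["a", "a", "b", "b"], ["a", "a", "b", "a"]))
```

A relative risk below 1 indicates a lower event rate in the first group. A kappa of 1 indicates perfect agreement, and 0 indicates agreement no better than chance.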
The Partition platform in JMP enables you to find cuts or groupings within your inputs (X's) that can best predict the variation in an output (Y). X's and Y can both be either categorical or continuous. The process of splitting the data by finding an appropriate X and an appropriate grouping or cut-point for this X is recursive – you can continue it until you get a useful fit. The result is naturally represented as a tree, and you can also get important information about which X's contribute most to explaining the variation in Y.
Trees are robust to the presence of missing values, and accommodate any joint effects of X's directly. You can grow your tree using decision trees, bootstrap forests (JMP Pro only) or boosted trees (JMP Pro only). Note that simple decision trees are not likely to generalize well to new data, so if you need predictive power you should investigate JMP Pro.
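The recursive splitting idea can be sketched in Python for a single continuous X and a continuous Y. This is a simplified illustration, not the Partition platform's algorithm: it assumes the cut-point is chosen to minimize the within-group sum of squared errors, and the function names are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(x, y):
    """Return (cut_point, sse_after_split), or None if no split is possible."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between identical X values
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, total)
    return best


def grow_tree(x, y, depth):
    """Split recursively until `depth` is used up or no split is possible."""
    split = best_split(x, y) if depth > 0 else None
    if split is None:
        return {"mean": sum(y) / len(y), "n": len(y)}
    cut = split[0]
    left = [(xi, yi) for xi, yi in zip(x, y) if xi < cut]
    right = [(xi, yi) for xi, yi in zip(x, y) if xi >= cut]
    return {
        "cut": cut,
        "left": grow_tree([p[0] for p in left], [p[1] for p in left], depth - 1),
        "right": grow_tree([p[0] for p in right], [p[1] for p in right], depth - 1),
    }


print(grow_tree([1, 2, 3, 10, 11, 12], [1, 1, 1, 5, 5, 5], depth=1))
```

With several X's, the same search would be repeated for each candidate column at every step and the best column and cut-point kept.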
The Neural platform in JMP enables you to build fully connected neural networks with hidden nodes in one layer (JMP) or two layers (JMP Pro). In JMP, all nodes have the same activation function. In JMP Pro, each node can have one of three different activation functions. You can have any number of nodes in each layer.
JMP Pro also allows you to automatically handle missing data, transform X's within the platform, and use boosting to help your network to learn difficult cases by applying one of four penalty methods.
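For readers new to neural networks, the prediction step of a fully connected network with a single hidden layer can be written in a few lines of Python. This sketch shows only the forward pass, not the fitting. It assumes a tanh activation on every hidden node and a linear output for a continuous response. The function and argument names are hypothetical.

```python
import math


def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Predict one continuous response from the inputs x.

    hidden_weights holds one list of input weights per hidden node.
    """
    hidden = [
        math.tanh(b + sum(w * xi for w, xi in zip(ws, x)))
        for ws, b in zip(hidden_weights, hidden_biases)
    ]
    # The output is a linear combination of the hidden node values.
    return output_bias + sum(w * h for w, h in zip(output_weights, hidden))


# Two inputs, three hidden nodes, arbitrary illustrative weights.
prediction = forward(
    [0.5, -1.0],
    hidden_weights=[[0.2, 0.4], [-0.3, 0.1], [0.5, 0.5]],
    hidden_biases=[0.0, 0.1, -0.1],
    output_weights=[1.0, -2.0, 0.5],
    output_bias=0.3,
)
print(prediction)
```

Fitting the network amounts to choosing the weights and biases so that predictions like this one match the observed responses as closely as possible.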
Multivariate Interdependence Techniques
Multivariate analyses can focus either on observations (rows) or on variables (columns), and may either treat variables on an equal footing (interdependence techniques) or distinguish between effects, the X's, and responses, the Y's (dependence techniques). But whatever your analytical objective, JMP will work with you to get the job done. (See the Multivariate Dependence Techniques section for multivariate methods involving X's and Y's.)
In the multivariate context, it is vital to consider data quality, the identification and treatment of outliers, and the pattern of missing values. JMP provides utilities that take the drudgery out of addressing these issues. Typically, they need to be addressed iteratively as the analysis unfolds, and the interactivity of JMP is built for this way of working.
The Multivariate platform is often the entry point into any analysis with many columns. It allows you to quickly assess the associations and parametric and nonparametric correlations between all pairs of numeric variables, identify outliers and impute missing values.
For interdependence techniques, JMP provides Principal Components Analysis (PCA), factor analysis, clustering, latent class analysis, multi-dimensional scaling, association analysis (JMP Pro), normal mixtures and self-organizing maps. Each uses an unfolding style of analysis so that you can shape your approach according to what the data reveals to you.
PCA lets you reduce the dimensionality of your description when correlations are present, and the implementation in JMP can accommodate very wide data efficiently. When you have categorical rather than quantitative variables, you can use JMP to perform Multiple Correspondence Analysis rather than PCA to achieve a similar result. Factor analysis lets you model variability among observed variables in terms of a smaller number of unobserved factors. The Factor Analysis platform allows multiple fits and rotations in one report, and conditional formatting allows you to suppress small values.
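The way PCA compresses correlated variables can be illustrated with a short Python sketch that finds only the first principal component. It is not JMP's algorithm: it assumes PCA on the covariance matrix and uses simple power iteration rather than a full decomposition. The function names and the small data set are hypothetical.

```python
def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length observations."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def first_principal_component(rows, iterations=200):
    """Return (direction, share of total variance) via power iteration.

    Assumes the data are not constant, so the covariance matrix is non-zero.
    """
    cov = covariance_matrix(rows)
    p = len(cov)
    v = [1.0 / (j + 1) for j in range(p)]  # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(wi * wi for wi in w) ** 0.5
        v = [wi / norm for wi in w]
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)
    )
    share = eigenvalue / sum(cov[i][i] for i in range(p))
    return v, share


data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0)]
direction, share = first_principal_component(data)
print(direction, share)
```

When the variables are strongly correlated, the share of variance captured by the first component is close to 1, which is what makes a lower-dimensional description possible.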
Clustering, a key technique in unsupervised learning, forms subgroups so that cases in a particular subgroup are more alike than those in another subgroup. The Cluster platform in JMP lets you scale and transform variables before analysis, provides various distance measures, and includes hierarchical and k-means clustering. Hierarchical clustering produces a dendrogram you can manipulate interactively to decide on the most useful number of clusters using Cluster Summaries or other heuristics. You can also add spatial measures to stacked data to allow you to cluster specific defect patterns.
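As a rough illustration of k-means clustering, the Python sketch below alternates between assigning each point to its nearest center and moving each center to the mean of its points. It assumes squared Euclidean distance and starting centers supplied by the caller. The function name and data are hypothetical, and the Cluster platform offers considerably more than this.

```python
def kmeans(points, centers, iterations=20):
    """Basic k-means; returns (labels, centers)."""
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(
                range(len(centers)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])),
            )
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centers.append(centers[k])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels, centers


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, centers=[(0, 0), (10, 10)])
print(labels, centers)
```

Results can depend on the starting centers and on how the variables are scaled, which is why scaling and transformation before clustering matter.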
Latent class analysis provides an alternative to clustering, and association analysis (also known as market basket analysis) identifies connections between specific objects (such as items that are often purchased together).
Multivariate Dependence Techniques
For multivariate dependence techniques, JMP provides partial least squares regression (PLS), discriminant analysis, naïve Bayes and nearest neighbor classifiers, and the Gaussian Process.
PLS is a versatile technique that can consume data of any shape, and with any number of X's and Y's. It is often applied in situations where linear regression is not viable because there are more X's than rows, but it can also be seen as a technique useful within predictive modeling generally.
The PLS platform in JMP provides basic capabilities, but with JMP Pro there is also a PLS personality in the Fit Model platform that allows you to fit more complex models involving powers and interaction terms. With JMP Pro, you can also impute missing values and build PLS models using a choice of validation methods.
JMP provides both the NIPALS and SIMPLS algorithms for fitting and automated ways to find the most appropriate number of latent factors to include in the model. It provides all the usual diagnostics so you can check model adequacy. You can also quickly generate pruned PLS models with a reduced number of terms simply by making appropriate selections in graphical output or defining a VIP threshold value. If your response is categorical, you can use PLS-Discriminant Analysis in JMP Pro.
The Discriminant platform allows you to understand which combination of X's helps to explain category membership of a Y. It provides linear, quadratic or regularized methods for discrimination, stepwise selection of X's if needed, and allows you to easily inspect uncertain or misclassified rows to decide what follow-up or remedial action is required. Discriminant can efficiently tackle wide or very wide problems by using an optimally estimated covariance matrix obtained by shrinking the off-diagonal entries appropriately.
The Gaussian Process can be used to exactly interpolate Y values that are a function of any number of X's (to build surrogate models of deterministic systems), or as a more general modeling tool.
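Exact interpolation is the defining property here, and a small Python sketch can show it. This is a simplified illustration, not JMP's Gaussian Process platform. It assumes a zero-mean process, a Gaussian (squared-exponential) correlation with a fixed `theta`, and no noise term. The function names and data are hypothetical.

```python
import math


def kernel(a, b, theta=1.0):
    """Gaussian correlation between two input points (tuples of X values)."""
    return math.exp(-theta * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def solve(matrix, rhs):
    """Solve a linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_gp(xs, ys, theta=1.0):
    """Weights such that the predictor passes through every training point."""
    k = [[kernel(a, b, theta) for b in xs] for a in xs]
    return solve(k, ys)


def predict_gp(x_new, xs, weights, theta=1.0):
    return sum(w * kernel(x_new, xi, theta) for w, xi in zip(weights, xs))


xs = [(0.0,), (1.0,), (2.0,), (3.0,)]
ys = [1.0, 3.0, 2.0, 5.0]
weights = fit_gp(xs, ys)
print(predict_gp((1.0,), xs, weights))   # reproduces the training value at x = 1
print(predict_gp((1.5,), xs, weights))   # smooth prediction between points
```

Because the predictor reproduces the training responses, this kind of model suits deterministic computer experiments, where repeated runs at the same settings give the same answer.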
The Time Series platform in JMP allows you to explore, model and forecast univariate time series. Your statistical modeling approach can be informed by the usual diagnostics, including plots of autocorrelations and partial autocorrelations, variograms, AR coefficients and spectral density plots. You can easily decompose your time series to remove trend and seasonal effects, including use of the X11 method.
You can build several ARIMA models for a time series with a range of parameters with a single click, and select the best model using various figures of merit, such as AIC, SBC, MAPE and MAE. You can build transfer models to model an output time series in terms of one or more input series, applying pre-whitening to the inputs if required. You can also generate the equivalent PROC ARIMA code to run your model in SAS if needed.
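The figures of merit mentioned above follow standard definitions, sketched here in Python. The formulas are the usual textbook ones and may differ in detail from what JMP reports. The log-likelihood values in the example are invented purely to show how a comparison works; they are not results from any real fit.

```python
import math


def mae(actual, forecast):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mape(actual, forecast):
    """Mean absolute percentage error (requires non-zero actual values)."""
    return 100 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def aic(log_likelihood, n_params):
    """Akaike's information criterion; smaller is better."""
    return 2 * n_params - 2 * log_likelihood


def sbc(log_likelihood, n_params, n_obs):
    """Schwarz's Bayesian criterion; penalizes parameters more as n_obs grows."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical (log-likelihood, number of parameters) for three candidate models.
candidates = {
    "ARIMA(1,0,0)": (-52.0, 2),
    "ARIMA(1,0,1)": (-50.5, 3),
    "ARIMA(2,0,2)": (-50.2, 5),
}
best = min(candidates, key=lambda name: aic(*candidates[name]))
print(best)
```

Both criteria reward a better fit and penalize extra parameters, so a larger model is preferred only when it improves the likelihood enough to justify its complexity.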
The Time Series platform also contains a number of smoothing techniques for time series, including Holt exponential smoothing, seasonal exponential smoothing, and Winters' method.
In all cases you can produce interactive forecasts of the predicted future behavior, with confidence intervals.
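As an example of how a smoothing forecast is produced, the Python sketch below implements Holt's linear exponential smoothing, which tracks a level and a trend. It is a minimal illustration under stated assumptions: the smoothing weights `alpha` and `beta` are supplied rather than estimated, the starting level and trend come from the first two observations, and confidence intervals are not computed. The function name is hypothetical.

```python
def holt_forecast(series, alpha, beta, horizon):
    """Forecast `horizon` steps ahead with Holt's linear exponential smoothing."""
    level, trend = series[0], series[1] - series[0]
    for y in series[1:]:
        previous_level = level
        # Update the level toward the new observation.
        level = alpha * y + (1 - alpha) * (level + trend)
        # Update the trend toward the latest change in level.
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, horizon + 1)]


print(holt_forecast([1, 3, 5, 7, 9], alpha=0.5, beta=0.3, horizon=2))
```

On a perfectly linear series such as this one, the forecasts simply continue the line. On real data the weights control how quickly the level and trend adapt to recent observations.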