
Practice JMP using these webinar videos and resources. We hold live Mastering JMP Zoom webinars with Q&A most Fridays at 2 pm US Eastern Time. See the list and register. Local-language live Zoom webinars are held in the UK, Western Europe, and Asia; see your country's jmp.com/mastering site.

Understanding and Applying Tree-based Methods for Predictor Screening and Modeling

 

See how to:

  • Model using the Partition, Bootstrap Forest, and Boosted Tree platforms (a minimal code sketch of the same ideas appears after the note below)
  • Understand the pros and cons of decision trees
    • Pros: Uncover non-linear relationships, get results that are easy to understand, screen a large number of factors
    • Cons: Handle one response at a time, form if-then statements rather than a mathematical formula, and high variability can lead to major differences between models built on similar data
  • Build a JMP Partition classification tree that uses splitting to define the relationship between the predictors and a categorical response
  • Define the number of samples, the size of the trees (models), and the sample rate to build Bootstrap Forest models using a random-forest technique
  • Define the number of layers, splits per tree, learning rate, and sampling rates to build Boosted Tree models that combine smaller models into a final model
  • Interpret results and tune models
  • Specify and run Profit Matrix reports to identify misclassifications for categorical responses, and interpret results using the Confusion Matrix and Decision Matrix

Note: Q&A included at times 17:00 and 38:00.
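The platforms above are point-and-click in JMP, but if it helps to see the same knobs in code, here is a minimal scikit-learn sketch on a synthetic dataset. The parameter names are scikit-learn analogues of the JMP options (number of trees, sample rate, layers, splits per tree, learning rate), not the JMP options themselves.

```python
# Illustrative only: scikit-learn analogues of Partition, Bootstrap Forest, and Boosted Tree
# on synthetic data, plus a confusion matrix for the categorical response.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

models = {
    # Single classification tree: splits chosen to separate the categorical response
    "decision_tree": DecisionTreeClassifier(min_samples_split=10, random_state=1),
    # Random-forest analogue of Bootstrap Forest: many trees, each fit to a bootstrap sample
    "bootstrap_forest": RandomForestClassifier(
        n_estimators=100,        # number of trees in the forest
        max_samples=0.8,         # bootstrap sample rate
        random_state=1,
    ),
    # Gradient-boosting analogue of Boosted Tree: many small trees combined sequentially
    "boosted_tree": GradientBoostingClassifier(
        n_estimators=50,         # number of layers (small trees)
        max_depth=3,             # limits the splits per tree
        learning_rate=0.1,       # how much each layer contributes to the final model
        random_state=1,
    ),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(name, "accuracy:", round(accuracy_score(y_test, pred), 3))
    print(confusion_matrix(y_test, pred))   # rows = actual class, columns = predicted class
```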

Comments
Georg

@Peter_Hersh, thanks for posting this great video! In my case I have variable data, and using decision trees leads to different models for different runs, the "con" you mentioned. In your case you used a validation column, so the choice of training and testing rows is no longer random. Using a hold-out portion, as I do, is another source of "noise" from run to run. Is there any idea or strategy for getting a more stable model, or for defining/fixing the final model? I'm using Boosted Tree because it performed best in a model comparison. Thanks in advance!

Victor_G

@Georg I might have a couple of ideas concerning your question:

  • Try K-fold cross-validation (with the "Model Screening" platform, available in Analyze -> Predictive Modeling) to assess the robustness of your model. You'll get results for each fold, so it's easier to judge whether your model is robust no matter which runs fall in the training or validation set.
    You can save the formula for each model (trained on K-1 folds, validated on the remaining fold) and afterwards ensemble or average these individual models. Because each of those models has already seen your training and validation data, some data must be held out completely: to assess the combined model fairly, create an independent test set and keep it entirely out of model building, so there is no information/data leakage or overfitting. (A minimal sketch of this workflow follows the list.)
  • If you want reproducible validation for your model (or want to "fix" your final model), set the random seed to any number: reloading a model from a script with a fixed random seed will then always give the same results.
  • Finally, you could also build several Boosted Tree models (or other tree-based models) by optimizing their hyper-parameters or varying their values (depth, minimum size split, etc.), so that you have either optimized models or some diversity among the individual models, and then combine them by ensembling or averaging.
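For readers who want to try this outside JMP (where Model Screening handles the K-fold part directly), here is a minimal scikit-learn sketch of the same workflow on synthetic data: hold out an independent test set, build one boosted-tree model per fold with a fixed random seed, check the per-fold scores, and average the fold models' predictions on the untouched test set.

```python
# Sketch of K-fold validation + fold-model averaging with a fixed seed and an independent
# test set (scikit-learn on synthetic data, not JMP's platforms).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, KFold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Independent test set, completely held out from model building (avoids data leakage)
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

fold_models, fold_scores = [], []
kf = KFold(n_splits=5, shuffle=True, random_state=0)    # fixed seed -> reproducible folds
for train_idx, val_idx in kf.split(X_dev):
    model = GradientBoostingClassifier(random_state=0)  # fixed seed -> reproducible model
    model.fit(X_dev[train_idx], y_dev[train_idx])
    fold_scores.append(model.score(X_dev[val_idx], y_dev[val_idx]))
    fold_models.append(model)

print("per-fold validation accuracy:", np.round(fold_scores, 3))  # robustness check

# Ensemble/average the fold models; judge the combination only on the independent test set
avg_proba = np.mean([m.predict_proba(X_test) for m in fold_models], axis=0)
ensemble_pred = avg_proba.argmax(axis=1)
print("ensemble test accuracy:", round(accuracy_score(y_test, ensemble_pred), 3))
```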

 

Hope these few ideas may help you,

@Georg Great question. @Victor_G has a couple of nice solutions with model averaging or ensemble modeling of the k-folds. It sounds like your data contain a lot of noise, given that the model changes when additional runs are included. It might be that the model is not very useful at this point without some additional factors or runs to help you understand the variability in your data better.

 

A way to test this is to make multiple validation columns for the holdback validation. If the models built with different validation columns are dramatically different from each other, that indicates that a random subset of your data is not representative of the data as a whole, which means you will have trouble making an accurate predictive model without making some changes.
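One way to run that check outside JMP is sketched below with scikit-learn on synthetic data: repeat the holdback split with several different random seeds (the analogue of making several validation columns) and look at the spread of the validation scores. A large spread suggests that a random subset of the data is not representative.

```python
# Repeat the holdback split with different seeds and compare validation performance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=8, flip_y=0.1, random_state=0)

scores = []
for seed in range(5):                      # five different "validation columns"
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
    scores.append(model.score(X_val, y_val))

print("holdback accuracies:", np.round(scores, 3))
print("spread (max - min):", round(max(scores) - min(scores), 3))
```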

 

Tree-based models are very susceptible to outliers, so it might be a good idea to check for outliers before running the model.  

 

Hope this is helpful; let us know how things go.

Victor_G

@Peter_Hersh: I am not sure what you mean by "Tree-based models are very susceptible to outliers", since tree-based methods are actually quite robust to outliers, or at least outliers have a negligible effect. There are already many examples and publications on this point, and it is one reason this type of model is so popular: its "simplicity" and the small amount of pre-processing/cleaning required (no big work needed on outliers, handling of missing values, no assumptions on distributions, etc.). A comparison of the impact of outliers on regression models and random forests (and of the benefits of different techniques for dealing with outliers applied to both algorithms; spoiler: they show no big effect on the random forest) can be seen here: How to Make Your Machine Learning Models Robust to Outliers - KDnuggets
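A quick way to see this for yourself on made-up data (a scikit-learn sketch, not the article's example): inject one extreme response value and compare how much the predictions move away from the outlier's location for an ordinary linear regression versus a random forest. Typically the single outlier bends the whole regression line, while the forest's change stays concentrated in the small region around the outlier.

```python
# Compare the global influence of one extreme response value on a linear regression
# versus a random forest (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 2 * X.ravel() + rng.normal(scale=1.0, size=200)

y_out = y.copy()
y_out[10] = 200.0                                # one extreme outlier near x = 0.5

grid = np.linspace(2, 10, 50).reshape(-1, 1)     # evaluate away from the outlier's location
for name, model in [("linear regression", LinearRegression()),
                    ("random forest", RandomForestRegressor(n_estimators=200, random_state=0))]:
    clean = model.fit(X, y).predict(grid)        # fit on clean data
    dirty = model.fit(X, y_out).predict(grid)    # refit with the outlier included
    print(name, "max prediction shift away from the outlier:",
          round(np.abs(clean - dirty).max(), 2))
```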

 

@Peter_Hersh @Georg There are, however, other things to take into account when choosing among tree-based models; here are some:

  • Overfitting, which mainly affects Decision Tree and Boosted Tree (if no proper validation is done). Random Forests tend to be less prone to overfitting because of the bootstrapping process: the individual trees are trained independently, each fit to a bootstrap sample of the training data. With bootstrapping, some samples are not used in building a given tree (the "Out Of Bag" (OOB) samples) but can be used to estimate the model's error, and hence serve as "validation", an error estimate on data the tree has not seen. (A sketch of the OOB idea and of the feature-importance contrast below follows this list.)

 

  • Correlation/multicollinearity: the different tree-based models handle multicollinearity or correlation between variables differently, so if your dataset has strong correlation/multicollinearity you can expect some differences. In the Decision Tree (or Gradient Boosted Tree) feature-importance plot, only the "best" features are selected (the ones with the best split attributes); the others, correlated with the first ones, are not selected and have a feature importance of 0 (or close to 0). This is expected, since a Decision Tree considers all predictors and tries to find the best split at each level, so "worse" correlated features are dropped. Reloading a Decision/Boosted Tree may therefore lead to different models in the presence of correlation/multicollinearity, because it may not select the same features/variables each time. In the Random Forest feature-importance plot, by contrast, all features get a score, even correlated ones. This comes from feature bagging, the random selection of candidate features made for each split in each tree of the forest, which gives every feature the same "chance" of being selected in individual trees. Reloading this model should therefore give you very similar models.

 

  • Noise (or the bias/variance tradeoff): Boosted Trees can be more accurate than Random Forests. In a Boosted Tree, each individual tree is iteratively trained to correct the previous trees' errors, so the algorithm can capture complex patterns in the data. However, if the data are noisy, the Boosted Tree may overfit and start modeling the noise. So depending on your dataset size and noise level, there is a choice to make between Boosted Tree and Random Forest: the former is very accurate but not robust to noise (low bias, high variance, like a Decision Tree), whereas a Random Forest is less accurate/precise but more robust to noise (its bias is the same as that of each individual tree in the forest, but its variance is lower than that of a Decision Tree or Boosted Tree because of the bagging process).
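Here is a small scikit-learn sketch (synthetic data, not JMP) of two of the points above: the forest's out-of-bag samples provide a built-in error estimate, and with two strongly correlated predictors the forest's feature bagging gives both a share of the importance, whereas how a single tree divides the credit between the two near-copies is essentially arbitrary and can change from run to run.

```python
# OOB error estimate and feature importance under multicollinearity (illustrative sketch).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)       # x2 is nearly a copy of x1 (multicollinearity)
x3 = rng.normal(size=n)                        # independent predictor
X = np.column_stack([x1, x2, x3])
y = (x1 + x3 + rng.normal(scale=0.5, size=n) > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
forest = RandomForestClassifier(
    n_estimators=300,
    oob_score=True,          # score each tree on its out-of-bag samples
    max_features=1,          # feature bagging: one candidate feature per split
    random_state=0,
).fit(X, y)

print("decision tree importances (x1, x2, x3):", np.round(tree.feature_importances_, 2))
print("random forest importances (x1, x2, x3):", np.round(forest.feature_importances_, 2))
print("out-of-bag accuracy estimate:", round(forest.oob_score_, 3))
```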

 

And in addition to the previous points, you also have to consider your objective (predictive or explanatory?), the influence of the metrics used to evaluate the model's performance on this dataset, the dataset size and dimensions (number of variables vs. number of observations), the representativeness of your data, etc.


If you're using a small dataset (with possible noise), Random Forest may be a safer and more robust choice than Boosted Tree or Decision Tree.
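One quick way to check which family is safer for a particular small, noisy dataset is to cross-validate both and compare the mean and the spread of the fold scores. The sketch below uses scikit-learn analogues on synthetic, deliberately noisy data, not JMP's platforms.

```python
# Cross-validate a random forest and a boosted tree on a small, noisy dataset and compare
# the mean score and its variability across folds.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Small dataset with deliberately noisy labels (flip_y adds label noise)
X, y = make_classification(n_samples=120, n_features=10, n_informative=4,
                           flip_y=0.15, random_state=0)

for name, model in [("random forest", RandomForestClassifier(n_estimators=200, random_state=0)),
                    ("boosted tree", GradientBoostingClassifier(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}, std across folds = {scores.std():.3f}")
```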

In addition to what @Peter_Hersh suggests, if you see different performance across the folds/validation columns for your Boosted Tree, that is an indication that a subset of your data may contain unusual values; looking at distances (Jackknife/Mahalanobis) in the Multivariate platform could help you spot and identify the unusual observations in that subset.
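Outside JMP's Multivariate platform, Mahalanobis distances can be computed as in the sketch below (scikit-learn and scipy on synthetic data); rows with unusually large distances are worth a closer look. (JMP's Jackknife distance additionally leaves each row out of its own distance estimate.)

```python
# Flag multivariate outliers with Mahalanobis distances (illustrative sketch).
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import EmpiricalCovariance

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[0] = [6, 6, -6, 6]                         # plant one clearly unusual row

cov = EmpiricalCovariance().fit(X)
d2 = cov.mahalanobis(X)                      # squared Mahalanobis distances
cutoff = chi2.ppf(0.999, df=X.shape[1])      # strict chi-square reference cut-off (4 dims)

flagged = np.where(d2 > cutoff)[0]
print("rows with unusually large Mahalanobis distance:", flagged)
```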

Sorry for the long answer; I hope it will be helpful,

Georg

Thanks to you both for the very detailed analysis. I need some time to work through all your proposals and will respond later. I know there are many challenges in my dataset; one more is that it is unbalanced, since it is a mixture of a lot of historical data with some experiments added to it. And we developers want to get the information out of all the data, not throw anything away. These experiments may look like outliers, but they are not; they are single points with a meaningful result. As my dataset will be presented at Discovery Summit Europe, we can discuss further once it's online. BR
