Predictive modelling demystified: How to choose the right model for your data

Understanding model types, model validation, and why visualization matters.

Owen Jonathan
April 14, 2026
6 min. read


What is predictive modelling?

The word “model” appears in a wide range of disciplines: from mouse models in biological research to runway models in the fashion industry to model trains for hobbyists. A model is merely a representation of something, ranging from an abstract idea to a physical object.

In the sciences, we build and use models to enhance our understanding of the subject or process we are studying. A model is a simplified version of reality, and reality is almost always more complex, which is why it’s important to remember the adage of British statistician George Box: “All models are wrong, but some are useful.”

A model will have limitations, but the insight we can derive from it is what makes it worthwhile to build. Generally, models fall into one of two categories: physical models and surrogate models.

Physical models are built using first principles, from the ground up. I remember when I was first tasked with building a model for an undergraduate project. I went through the literature, studying differential equations for estimating cell growth so we could build a mathematical model of a cell culture in a reactor. It was a demanding process that required many hours of research. We sought to estimate the economic viability of the process, so our model was built on estimations of cellular growth and division only. With more time and resources, we could have gone deeper, adding details about a cellular metabolic pathway that was an active area of research. But for our purposes, that was largely unnecessary and would have been computationally demanding. Adding complexity would not have made our model more useful for its intended purpose.

Bioreactor

Surrogate models, on the other hand, are data-driven models built with a top-down approach: experiments are performed, and the model is built from the patterns and trends observed in the collected data. Surrogate models are the main topic of discussion here, and they are the models that have gained so much attention alongside the buzzwords of the last decade, such as “machine learning” and, more recently, “artificial intelligence.”

Machine learning vs. traditional models: What’s the difference?

The idea of machine learning has generated a lot of excitement in recent years, and to the average person, there’s a sense of mysticism often associated with it. In essence, however, the goal and principle remain the same. All these modelling techniques aim to build surrogate models that make sense of the data by spotting trends and patterns. These models can then be used to make predictions about the future, based on the data used to build them.

There’s a common perception that all surrogate models are complex and advanced. While those under the machine learning umbrella may be, surrogate models include simple ones, too. Recall a science project in school where you ran an experiment and plotted your data as a scatter plot. Your teacher may have asked you to draw a best fit line to explain the positive or negative trend in your data. Guess what? You had just built a surrogate model! A simple model, perhaps the simplest, but a model nonetheless.

Best fit line

When it comes to your own data, there’s nothing wrong with starting with a simple linear regression. But it is important to know that other models, such as neural networks and bootstrap forests, are also available to you.
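As a minimal sketch of that best-fit-line idea, here is a simple linear regression in NumPy. The temperature and yield numbers are made up purely for illustration:

```python
import numpy as np

# Hypothetical experiment: reaction yield (%) vs. temperature (made-up data)
temperature = np.array([20, 30, 40, 50, 60, 70], dtype=float)
yield_pct = np.array([12, 18, 25, 31, 36, 44], dtype=float)

# Fit a straight line y = m*x + b -- the simplest surrogate model
m, b = np.polyfit(temperature, yield_pct, deg=1)

# Use the model to predict the yield at a temperature we never measured
predicted = m * 55 + b
print(f"slope={m:.2f}, intercept={b:.2f}, prediction at 55 C={predicted:.1f}")
```

Two fitted numbers, a slope and an intercept, and you already have a model you can interrogate and make predictions with.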

How to choose the right model (and avoid overfitting)

There are a lot of modelling techniques available for building surrogate models from your data. The sheer number of choices can be daunting and often leads people to resist trying them out. There’s a fear of applying techniques that one is not deeply familiar with.

Let me try to help you.

We won’t delve deep into the intricacies and methods of each modelling technique. Instead, picture a large arrow pointing to the right. This is our arrow of complexity. On the far left, we have simple models like linear regression, and on the far right, we have more advanced models like neural networks.

When our models are too simple, they fail to capture the complete patterns and trends that describe the data. This is called underfitting, and we can move from the left side of the arrow to the right and use more advanced modelling techniques to capture the hidden patterns that were previously ignored.

However, using more advanced techniques and adding complexity to our model is not always the best approach. While they may capture more patterns, they are also computationally more demanding, and their predictions are harder for us to understand.

Not to mention, advanced techniques can be so good at capturing patterns that they start to capture the noise and variability of your data set. This is called overfitting, where our model is so complex that it makes accurate predictions only on the current data set and breaks down when applied to a second, similar one.
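This trade-off is easy to see in miniature. The sketch below (NumPy, with made-up noisy samples of a smooth curve) fits polynomials of increasing degree, a stand-in for moving right along the arrow of complexity, and scores each fit on the very data it was built from. The R² score only ever climbs as complexity grows, which is exactly why that score alone cannot reveal overfitting:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up noisy samples of a smooth underlying curve (a sine wave)
x = np.linspace(0, 3, 20)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 minus residual over total variance."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

# Sweep along the arrow of complexity: degree 1 (simple) to 9 (complex)
results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, deg=degree)
    results[degree] = r2(y, np.polyval(coeffs, x))
    print(f"degree {degree}: R^2 on the same data = {results[degree]:.3f}")
```

The degree-9 fit scores at least as well as the degree-3 fit on the training points, yet it is the more likely of the two to be chasing noise.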

[Figure: Model complexity, shown in three panels: too simple (underfit), good fit, and too complex (overfit)]

To assess overfitting, we can separate our data into different sets in a step called validation. As an example, let’s imagine you have 100 rows of data. Rather than building a model with all the rows you have, you can separate the data set in a 60/40 split, where 60% of the rows are used to build the model and the remaining 40% are used to assess it. It is crucial that this separation is done randomly to avoid bias. You can then compare the R² score, a measure of model performance, across the two splits.

If we assume that both sets have enough data points to capture the general pattern of the data, our model should fit both with a similar R² score. However, if it turns out that our model has a much higher R² score on the 60% split that was used to build it, then we can conclude that our model is overfit and is likely capturing noise in the data.
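Here is a minimal sketch of that validation step, using NumPy and 100 rows of synthetic data in place of yours:

```python
import numpy as np

rng = np.random.default_rng(42)

# 100 rows of synthetic data: a linear trend plus noise
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 5.0 + rng.normal(scale=2.0, size=100)

# Random 60/40 split -- shuffling the row indices first avoids ordering bias
idx = rng.permutation(100)
train, valid = idx[:60], idx[60:]

# Build the model on the 60% split only
coeffs = np.polyfit(x[train], y[train], deg=1)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 minus residual over total variance."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

train_r2 = r2(y[train], np.polyval(coeffs, x[train]))
valid_r2 = r2(y[valid], np.polyval(coeffs, x[valid]))
print(f"train R^2: {train_r2:.3f}, validation R^2: {valid_r2:.3f}")
```

Because this model is as simple as the trend it is fitting, the two scores come out close to each other; a large gap in favour of the training split would be the warning sign of overfitting.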

Comparing models in practice: Why visualization (and no-code tools) matter

To pick the right modelling technique, you have to build each candidate model with your data and then compare their performance (their R² scores) against each other. Traditionally, this required you to code each model separately, which takes considerable effort and time. Not to mention, comparing and visualizing model performance through code may not be straightforward.
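Under the hood, a model comparison boils down to a loop: fit each candidate on the training split, score it on the held-out split, and keep the best. A sketch with NumPy, using polynomial degree as a stand-in for different modelling techniques (the data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic nonlinear data: a quadratic trend plus noise
x = rng.uniform(-3, 3, size=100)
y = x**2 - 2 * x + rng.normal(scale=1.0, size=100)

# Random 60/40 split into training and validation rows
idx = rng.permutation(100)
train, valid = idx[:60], idx[60:]

def r2(y_true, y_pred):
    """Coefficient of determination: 1 minus residual over total variance."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

# Fit each candidate on the training rows, score it on the held-out rows
scores = {}
for degree in (1, 2, 8):
    coeffs = np.polyfit(x[train], y[train], deg=degree)
    scores[degree] = r2(y[valid], np.polyval(coeffs, x[valid]))
    print(f"degree {degree}: validation R^2 = {scores[degree]:.3f}")

best = max(scores, key=scores.get)
print("best model by validation R^2: degree", best)
```

Point-and-click tools automate exactly this loop across many model families at once, which is what makes the comparison fast.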

We now have software tools that have democratized modelling for a wider group of people. You no longer need to know how to code to build models, and you no longer have to wait for hours to compare models and pick the one that best suits your data. These tools also come with visual aids. After all, since models are all about insight, a visual allows us to quickly interpret and understand the effect of each variable on the model’s predictions.

Prediction Profiler

Advanced models may uncover hidden patterns or trends, but for some data sets, simple models may be more than sufficient to describe your data. It is important to remember the adage by George Box from earlier: All models are wrong, but some are useful.

The usefulness of a model can be different for every project. A model might reveal the influence of variables you had not considered, or it may lead to further questions to explore and investigate with more experiments.

Always go back to the science when assessing your models. Does the model align with your knowledge of the subject? If yes, does it strengthen your understanding of the process? If not, were there any interactions or effects that you did not anticipate or may have previously ignored?

Predictive modelling is about insight, not complexity

A good model should lead you down a path of process understanding. Despite the buzz and excitement around terms like machine learning and artificial intelligence, it is important not to be distracted by the mysticism often attributed to them in the media. These are all simply algorithms that capture patterns and trends from data and make predictions based on them.

Models are only as good as the data they are built from, and they do not have to be complicated; they just have to be useful. With software tools like JMP, you too can build anything from simple to advanced models in a few clicks. At the end of the day, modelling is merely a means to an end, not the final product in and of itself.

To see these ideas in action and learn how to apply them in real projects, watch our on-demand webinar on predictive modelling for faster, clearer decisions.