Dimensionality reduction in pharma
ANNE MILLEY: You seem to have a passion for helping pharmaceutical companies get critical new products to market as fast as possible, especially when patients have so few options. What are some of the challenges there, and how do you help them overcome those challenges?
JULIA O'NEILL: Two of the key challenges are setting specifications, which is an area that I work on quite extensively, and the whole issue of dimensionality in the development of biopharmaceuticals. This is a problem I first worked on about 10 years ago when I was working in vaccines. At that time, we had a relatively small number of manufactured vaccine lots. For each lot, we had about 1,500 parameters that had been measured, and then we had nine really key quality attributes measured on the final product of that process.
There were some issues going on in the final product, which we didn't have an explanation for. We had a team of people that included an expert chemometrician, a world-class programmer/coder and a number of statisticians. After several weeks of effort on the analysis as a team, we did solve the problem, and it had a very big impact on the supply of that vaccine.
Now I face a very similar problem almost every time I visit a client. The problem, in general terms, is the challenge of dimensionality. We typically have just a few subjects. Many of the new biotech products coming out are using starting materials that come from human beings or other animal subjects, and we only have a short list of qualified donors of those materials.
With all the advances in genomics, in characterization of the microbiome, and on and on, we can measure a very long list of properties on samples from each of those subjects. So we have a very large set of properties on a very small number of subjects, and then at the end, we typically have measurements of a few outcomes. This is the same kind of problem that I solved about 10 years ago with a team of people, and it took several weeks. What's exciting now is I can solve this type of problem in about a half hour using JMP.
ANNE MILLEY: That's amazing.
JULIA O'NEILL: It really is. This is a simple example, a relatively small one that I've worked on recently. There were nine subjects in this study. Each of them was measured for eight different outcomes, and of course, there's scientific knowledge about each of those outcomes and how they might relate to each other, but that has been made generic for the purpose of sharing this with a wide audience.
There are also 60 properties measured on the samples from each of these nine subjects. One of the first things we can see, simply using JMP's multivariate platform, is that these eight outcomes are not eight independent measures of results. When we look at the scatterplot of one versus the other, some of them are quite highly correlated, and there are also some interesting patterns.
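For readers without JMP, the correlation check described here can be roughly sketched in Python with NumPy. The data below is synthetic (nine subjects, eight outcomes, two made-up latent drivers) and the 0.8 cutoff is an arbitrary choice for illustration; none of it comes from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the study: 9 subjects x 8 outcomes (not the real data).
n_subjects = 9
latent_a = rng.normal(size=n_subjects)  # hypothetical driver of outcomes 1-6
latent_b = rng.normal(size=n_subjects)  # hypothetical driver of outcomes 7-8
noise = 0.3 * rng.normal(size=(n_subjects, 8))
outcomes = np.column_stack([latent_a] * 6 + [latent_b] * 2) + noise

# Pairwise Pearson correlations between the outcome columns (8 x 8 matrix).
corr = np.corrcoef(outcomes, rowvar=False)

# List the outcome pairs whose absolute correlation exceeds the cutoff.
cutoff = 0.8  # arbitrary threshold, chosen only for illustration
for i in range(8):
    for j in range(i + 1, 8):
        if abs(corr[i, j]) > cutoff:
            print(f"outcome {i + 1} vs outcome {j + 1}: r = {corr[i, j]:.2f}")
```

With only nine subjects, individual correlations are noisy, so a table like this is a prompt for questions rather than a conclusion.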
And I should say that one of the outcomes of an analysis like this is to initiate a larger study. I don't want anyone to think that big decisions or development choices are being made based on only nine subjects, but this small study can give us very interesting clues that support investment in a much larger study, where similar methods would be applied to understand the results.
ANNE MILLEY: Right, and help you spend that money more wisely.
JULIA O'NEILL: Yes, because we're all trying to bring down the cost of medicines, and statistics is a huge part of that.
So this is telling us we don't have eight independent outcomes. We really have a smaller number. Very simply, we can then use JMP's principal components analysis to see, how many different things do we really have going on in this data? And it boils down, mainly, to two.
Outcomes 7 and 8 are quite correlated, and outcomes 1 through 6 also move together. That first component of variation alone, based on those first six outcomes, explains over 70% of all the variability in what we've measured in the results. This is very helpful. It means instead of analyzing eight separate outcomes, which each have quite a bit of noise built into them, we can condense that to two that are the primary drivers of differences in results.
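A minimal sketch of this condensation step, assuming a principal components analysis on standardized (correlation-scale) outcomes and using NumPy's singular value decomposition. The data is synthetic and the variable names are hypothetical; the printed percentages will not match the 70% figure from the study.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in: 9 subjects x 8 outcomes (not the study data).
n = 9
latent_a = rng.normal(size=n)  # hypothetical driver of outcomes 1-6
latent_b = rng.normal(size=n)  # hypothetical driver of outcomes 7-8
outcomes = np.column_stack([latent_a] * 6 + [latent_b] * 2)
outcomes = outcomes + 0.3 * rng.normal(size=(n, 8))

# Standardize each column so the analysis is on the correlation scale.
z = (outcomes - outcomes.mean(axis=0)) / outcomes.std(axis=0, ddof=1)

# PCA via singular value decomposition of the standardized data.
u, s, vt = np.linalg.svd(z, full_matrices=False)
explained = s**2 / np.sum(s**2)  # fraction of total variance per component

# Scores on the first two components: two new columns, each a linear
# combination of the eight standardized outcomes.
scores = z @ vt[:2].T  # shape (9, 2)

print("variance explained:", np.round(explained[:3], 3))
print("PC1 loadings:", np.round(vt[0], 2))
```

The rows of `vt` are the loadings, so inspecting `vt[0]` and `vt[1]` shows which outcomes each component is mostly built from.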
Very easy to do that in JMP. I just say, save the principal components. It will give me two new columns that contain what are really linear combinations of the results from those eight. It would be nice if we could get the same thing to work on the 60 properties, but as it turns out, it doesn't work. So that's when we have to pull out the big guns, and this is the point where 10 years ago, I would have been bugging my good friend to do that for me.
But instead, today in JMP, there's a new menu item I just love. If we go to Analyze > Screening > Predictor Screening, that is designed for just this kind of scenario. What I do is, I take those 60 properties, select them all, and make those the X's or the candidate predictors. And then I can select both principal components, or I can look at one at a time. If we just look at that first principal component, which relates to those first six outcomes, I click OK, and wow!
There are two wonderful things about this. I like to explain this to people as a Pareto chart on steroids. This says, of all 60 properties in that candidate set, which of them have the best chance of predicting the outcomes that are described by those first six outcome measures? Right away, it ranks them, and you can see, whoa, we've got to find out what's going on with properties 54, 37, 35, and so forth.
That's very, very valuable. And of course, when we're doing this with a group of scientists, they know the context. They know what these properties represent, and they may see connections immediately of, oh, well, we didn't expect those three to go together. What can we learn from that? Or you know, these are four properties which actually represent almost the same thing, so condense those into one. So there's a lot of iteration once this gets rolling.
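The ranking described above can be approximated outside JMP with a random forest. This is a sketch under assumptions, not JMP's implementation: it assumes Predictor Screening is based on a bootstrap forest, and it swaps in scikit-learn's `RandomForestRegressor` as a stand-in. The data is synthetic, and the response is deliberately built from properties 54 and 37 only so there is something to find.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)

# Synthetic stand-in: 9 subjects x 60 properties (not the study data).
n_subjects, n_props = 9, 60
X = rng.normal(size=(n_subjects, n_props))

# Hypothetical response standing in for the saved first principal component,
# constructed from properties 54 and 37 purely for illustration.
pc1 = 2.0 * X[:, 53] + 1.5 * X[:, 36] + 0.2 * rng.normal(size=n_subjects)

# Fit many trees on bootstrap resamples; each split considers a random
# subset of the properties, so similar predictors all get a chance.
forest = RandomForestRegressor(
    n_estimators=500, max_features="sqrt", random_state=0
)
forest.fit(X, pc1)
importance = forest.feature_importances_

# Rank properties from most to least important, Pareto-style.
order = np.argsort(importance)[::-1]
for rank, idx in enumerate(order[:10], start=1):
    print(f"{rank:2d}. property {idx + 1:2d}  importance = {importance[idx]:.3f}")
```

With nine rows and 60 columns the ranking is unstable, so the planted properties may not land at the top on every seed, which is one more reason to treat the output as a list of leads for a larger study.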
What's even more valuable, especially for issues that have been outstanding and challenging for a long time, is that all the scientists and engineers have an idea about what causes what, what's the cause and effect. And some of them are right, but they're not usually all right. I have seen this plot be an incredible consensus builder because what people see is this ordered ranking, which includes all of their hypotheses, but puts them in priority. And they can see, if theirs showed up at number 15, well, maybe it's time to think again.
But on the other hand, in one of the most interesting problems I ever worked on, the three at the top of this Pareto chart turned out to be three competing hypotheses, which were all important. And when we saw them all together, the scientists said, oh, so you're right, and you're right, and there was actually a combination of things going on. One of the things that predictor screening overcomes is that in the past, if we tried doing regression trees or other predictions one at a time, it would pick the first predictor, and then it would miss others that were similar.
I love this one, because it gets them all. It puts them right there, and in some cases, you might say, well, for convenience, we can control the second one on the list. First one, we can't do much about. So you choose the one that you can actually control. And that's where the manufacturing part and the chemical engineering come in: using the analysis to say, which of these are actually practical for control?
ANNE MILLEY: Right. So it really helps inform your next steps.