Reliability Analysis
What is reliability analysis?
Reliability analysis is a collection of statistical methods used to assess the quality of a product over time. In product development and manufacturing, it involves collecting and analyzing data about how often a product fails or degrades under specific conditions. Reliability data typically measure the operating time until a unit fails or the degradation of a unit across points in time. The analysis might also include the impact of stress factors or product changes on reliability. For very reliable products that rarely fail, a reliability study can accelerate the failure process to observe enough failures to support the analysis.
What are censored data?
When collecting reliability data, some units might not fail during data collection. The failure time data recorded for these units thus represent the time at which the units were last observed. Because the failure did not occur, the actual failure time is greater than the observed time, and the unit’s life is longer than reported. In this case, the data are said to be censored or suspended. Reliability analysis can incorporate both censored and uncensored data.
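To make the idea concrete, here is a minimal sketch of how censored units enter an estimate. It assumes a simple exponential lifetime model, for which the maximum-likelihood failure rate has a closed form: the number of failures divided by the total time on test, where censored units still contribute their observed time. The data values are hypothetical.

```python
# Sketch: maximum-likelihood failure-rate estimate under an assumed
# exponential lifetime model, with right-censored data.
# All observation values are hypothetical.

# (observed time in hours, failed?) -- False means right-censored
observations = [(120, True), (340, True), (500, False), (500, False), (275, True)]

failures = sum(1 for _, failed in observations if failed)
total_time = sum(t for t, _ in observations)  # includes censored exposure

# For the exponential model, the MLE of the failure rate uses the
# total time on test, including time contributed by censored units.
rate = failures / total_time   # failures per hour
mean_life = 1 / rate           # estimated mean time to failure

print(round(mean_life, 1))
```

Note that the two censored units contribute 1,000 hours of failure-free operation to the denominator; discarding them would make the product look far less reliable than the data support.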
Using reliability analysis
Let’s say that you would like to measure the reliability of a new computer chip package your company produces. You put 100 packages on a board for testing. Initially, all packages perform as expected. You can’t stay to observe the testing all day, so you leave and return in one week, 168 hours later.
Reliability analysis assesses how product quality changes over time. Product quality is the stated acceptable level of product characteristics or performance, often quantified by upper and lower specification limits. A product might have many quality characteristics that are measured.
In our example, the package on the testing board will fail when the variables we are measuring fall outside of their specification limits.
Reliability data are a measure of the life of a unit, and more generally, reliability data are often called time-to-event data. Life doesn’t need to be measured in time, though; you can use other measures, such as cycles, actuations, or distance.
Examples of life and how to measure it:
| What are you measuring? | How are you measuring it? |
| --- | --- |
| Electronic components | Hours in service |
| Book bindings | Number of times a book is opened |
| Drop test | Height above ground |
| Car engine | Distance traveled |
Reliability can be stated in various ways: as the proportion of units expected to fail by a given time (for example, 9% of units fail by 5,000 hours) or as the time by which a given proportion should have failed (for example, 1,205 hours until 10% of the units have failed).
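These two statements are just two directions through the same fitted life distribution. The sketch below assumes a Weibull model with hypothetical shape and scale values (not fitted to any real data) and converts between "proportion failed by a time" and "time by which a proportion fails" using the Weibull CDF and quantile function.

```python
import math

# Sketch: two equivalent statements of reliability under an assumed
# Weibull life distribution. Shape and scale are hypothetical.
shape, scale = 1.5, 10_000.0   # beta, eta (hours)

def fraction_failed(t):
    """Weibull CDF: F(t) = 1 - exp(-(t/eta)^beta)."""
    return 1.0 - math.exp(-((t / scale) ** shape))

def time_to_fraction(p):
    """Weibull quantile: t_p = eta * (-ln(1 - p))^(1/beta)."""
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)

print(fraction_failed(5000))    # proportion failed by 5,000 hours
print(time_to_fraction(0.10))   # hours until 10% have failed
```

The two functions are inverses, so either statement can be recovered from the other once a distribution has been fitted.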
Using censored or suspended data
You return to your testing board after one week, or 168 hours, to find that 20 units have failed, and 80 units still meet specification. The board records the exact failure time for the 20 units, but what about the 80 units? What is their failure time?
For right-censored data, the actual life is at least as long as the observed life. In our example, let's say 80 units have not failed in the first week of observations. The lifetimes for those 80 units are right censored. Other censoring mechanisms exist. Data for a unit that failed before the first observed time are left-censored, as are data that fall below the limit of detection of an assay, for example. Data for a unit that failed with a lower and upper bound are interval-censored.
In our example, we observed the 100 computer chips on the testing board once every week. While we don’t know exact failure times, we do know the left and right endpoints: 0-168 hours, 168-336 hours, 336-504 hours, etc.
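The weekly inspection scheme above can be sketched as a small helper that maps an inspection number to the interval-censoring bounds. The 168-hour inspection period comes from the example; the helper name is just for illustration.

```python
# Sketch: turning weekly inspection results into interval-censored data.
# Units are checked every 168 hours, so a failure first seen at
# inspection k is known only to lie between hours 168*(k-1) and 168*k.
INSPECT_EVERY = 168  # hours, per the example

def censoring_interval(k):
    """Return (lower, upper) bounds for a unit found failed at inspection k."""
    return (INSPECT_EVERY * (k - 1), INSPECT_EVERY * k)

print(censoring_interval(1))  # failed some time in the first week: (0, 168)
print(censoring_interval(3))  # failed between hours 336 and 504
```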
It is important that you include the data for censored units so that the reliability estimates are not biased. The censored data contain valuable information for the analysis. Though censored data cannot inform you about the exact failure time, they can tell you a lower bound on the lifetime. Without the censored data, the estimated probability of failure would be biased upward.
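One standard way to use censored units without assuming a parametric distribution is the Kaplan-Meier estimator, which at each failure time multiplies in the fraction of at-risk units that survived. The following is a minimal sketch with hypothetical data, not a production implementation.

```python
# Sketch: a minimal Kaplan-Meier survival estimate that incorporates
# right-censored units. Times and censoring flags are hypothetical.

def kaplan_meier(times, failed):
    """Return [(event_time, survival)] from right-censored data.

    times:  observed times (failure or censoring)
    failed: True if the unit failed at that time, False if censored
    """
    data = sorted(zip(times, failed))
    at_risk = len(data)
    survival, curve = 1.0, []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = censored = 0
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                deaths += 1
            else:
                censored += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk  # fraction surviving this time
            curve.append((t, survival))
        at_risk -= deaths + censored          # censored units leave the risk set
    return curve

# Two exact failures (at 2 and 4) and two censored units (at 3 and 5)
print(kaplan_meier([2, 3, 4, 5], [True, False, True, False]))
# → [(2, 0.75), (4, 0.375)]
```

The censored unit at time 3 never counts as a failure, but it does shrink the risk set before the failure at time 4, which is exactly the partial information the paragraph above describes.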
Here’s a different example of 100 simulated failure times in days. You’ll notice that all but eight observations are right-censored. Analyzing all the data, including the partial information from the right-censored observations, the predicted proportion of parts that will fail by 1,825 days is about 2%, and the predicted median life is about 4,200 days. Using only the information from the eight exact failure times, the predicted proportion of parts that will fail by 1,825 days is much higher at 10%, and the predicted median life is much lower at 3,346 days. The estimate that uses all of the data is closer to reality.
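The direction of this bias can be demonstrated directly. The sketch below uses a deterministic stand-in for a simulation (100 evenly spaced quantiles of a Weibull distribution with made-up parameters), censors the study at 1,825 days, and fits an exponential failure rate with and without the censored units.

```python
import math

# Sketch: why discarding censored units biases estimates upward.
# The "sample" is 100 evenly spaced quantiles of a Weibull distribution
# with hypothetical parameters, so the result is reproducible.
shape, scale, study_end = 1.5, 6000.0, 1825.0   # days

def weibull_q(p):
    """Weibull quantile function."""
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)

lifetimes = [weibull_q((i + 0.5) / 100) for i in range(100)]
observed = [(min(t, study_end), t <= study_end) for t in lifetimes]

n_fail = sum(1 for _, failed in observed if failed)
fail_time = sum(t for t, failed in observed if failed)
cens_time = sum(t for t, failed in observed if not failed)

rate_all = n_fail / (fail_time + cens_time)   # keeps censored exposure
rate_naive = n_fail / fail_time               # discards censored units

# Censored units add exposure time but no failures, so dropping them
# always inflates the estimated failure rate and shortens the
# predicted median life.
print(n_fail, rate_naive > rate_all)
```

Because the censored units contribute only time (never failures) to the denominator, `rate_naive > rate_all` holds whenever any unit is censored, matching the upward bias described above.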
Reliability factors
You’ve seen how reliability measures quality over time. Just like other statistical analyses, reliability can depend on the values of process variables. Here are some examples:
- Multiple failure modes: Suppose a book binding can fail because the binding wears out and lets go of the pages; this failure mode is responsible for end-of-life failures. But if the binding was not made correctly in the first place (perhaps the adhesive was unevenly applied, misplaced, or insufficient), books can fail long before they wear out. If the adhesive process is corrected, these early failures disappear and other failure modes dominate later in life. In an analysis, it’s important to realize that when one failure mode is eliminated, another can appear. An analysis that predicts failure rates after specific failure modes are eliminated is known as a competing cause analysis.
- Different groups: It can often be instructive to compare reliability across different groups, such as different suppliers, manufacturing lots, or equipment. The analogue in ordinary statistical methods is one-way analysis of variance. With reliability data, you can perform a Wilcoxon group homogeneity test of distribution equality.
- Mixture models for life data: Some cases of life data include a mixture of failures, but the actual failure modes are not known. These mixture models are useful when you believe that the life data are affected by one or more uncontrollable factors.
- Covariates: A covariate is a variable that changes over time. A covariate might be controlled or uncontrolled, and there might be more than one covariate to consider. Some common covariates in reliability studies are temperature, pressure, humidity, pH, current, voltage, and irradiation. The characteristics of the life distribution (for example, scale and shape) become functions of the covariate.
- Parametric survival analysis: Just as you can build a linear model to understand variation in a continuous response variable, you can build a linear model to understand variability in lifetime due to factors and covariates. Data might come from field failures, lab failures, or a designed experiment. The analysis is similar to ordinary linear modeling but takes the censored observations, nonnormal lifetime distributions, and other time-to-event aspects of reliability data into account.
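The competing-modes bullet above can be illustrated with a small sketch. Each unit's life is the minimum of the lives under its failure modes, so eliminating a mode can only lengthen life. Both modes below use made-up Weibull parameters, and the pairing of mode lifetimes is a deterministic shuffle rather than a random draw.

```python
import math

# Sketch: competing failure modes. A unit fails at the earlier of its
# two mode lifetimes; removing a mode can only lengthen life.
# All distribution parameters are hypothetical.

def weibull_q(p, shape, scale):
    """Weibull quantile function."""
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)

n = 1000
# Mode A: early failures (e.g., bad adhesive), shape < 1
mode_a = [weibull_q((i + 0.5) / n, 0.8, 9000) for i in range(n)]
# Mode B: wear-out failures, shape > 1; deterministic shuffle for pairing
mode_b = [weibull_q(((i * 37) % n + 0.5) / n, 3.0, 4000) for i in range(n)]

both = sorted(min(a, b) for a, b in zip(mode_a, mode_b))
fixed = sorted(mode_b)   # adhesive mode eliminated

median_both = both[n // 2]
median_fixed = fixed[n // 2]
print(median_fixed >= median_both)   # removing a mode never shortens life
```

A competing cause analysis formalizes this: it fits each mode separately and predicts the failure distribution that remains after a given mode is eliminated.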
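For the covariates bullet, a common way a covariate enters a life distribution is through the scale parameter. The sketch below assumes an Arrhenius relationship between the Weibull characteristic life and absolute temperature; the activation energy and pre-exponential constant are invented for illustration, not taken from any real product.

```python
import math

# Sketch: a covariate (temperature) entering a life distribution.
# Here the Weibull scale follows a hypothetical Arrhenius relationship
# with absolute temperature; the shape is held fixed.
BOLTZMANN_EV = 8.617e-5   # Boltzmann constant, eV per kelvin
ACTIVATION_EV = 0.7       # hypothetical activation energy, eV
A = 1e-3                  # hypothetical pre-exponential factor, hours

def scale_at(temp_kelvin):
    """Weibull scale (characteristic life) as a function of temperature."""
    return A * math.exp(ACTIVATION_EV / (BOLTZMANN_EV * temp_kelvin))

use_temp, test_temp = 298.0, 358.0   # 25 C field use vs 85 C stress test
acceleration = scale_at(use_temp) / scale_at(test_temp)
print(acceleration > 1)   # hotter testing shortens characteristic life
```

This is also the mechanism behind accelerated testing mentioned earlier: running units hot compresses the same failure process into a shorter study, and the fitted covariate relationship extrapolates back to use conditions.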