Reliability Analysis

What is reliability analysis?

Reliability analysis is a collection of statistical methods used to assess the quality of a product over time. In product development and manufacturing, it involves collecting and analyzing data about how often a product fails or degrades under specific conditions. Reliability data typically measure the operation time until a unit failure or the degradation of a unit across points in time. The impact of stress factors or product changes on product reliability might also be included in this analysis. For very reliable products that rarely fail, a reliability study can accelerate the failure process to observe enough failures to support the analysis.

What are censored data?

When collecting reliability data, some units might not fail during data collection. The failure time data recorded for these units thus represent the time at which the units were last observed. Because the failure did not occur, the actual failure time is greater than the observed time, and the unit’s life is longer than reported. In this case, the data are said to be censored or suspended. Reliability analysis can incorporate both censored and uncensored data.

Using reliability analysis

Let’s say that you would like to measure the reliability of a new computer chip package your company produces. You put 100 packages on a board for testing. Initially, all packages perform as expected. You can’t stay to observe the testing all day, so you leave and return in one week, 168 hours later.

Reliability analysis assesses how product quality changes over time. Product quality is the stated acceptable level of product characteristics or performance, often quantified by upper and lower specification limits. Products might have many qualities that are measured.

In our example, a package on the testing board fails when any of the variables we are measuring falls outside its specification limits.

Reliability data are a measure of the life of a unit, and more generally, reliability data are often called time-to-event data. Life doesn’t need to be measured in time, though; you can use other measures such as cycles, actuations, or distance.

Examples of life and how to measure it:

What are you measuring?    How are you measuring it?
Electronic components      Hours in service
Book bindings              Number of times a book is opened
Drop test                  Height above ground
Car engine                 Distance traveled

Reliability can be stated in various ways: as the proportion of units that should fail by a given time (for example, 9% of units fail by 5,000 hours) or as the time at which a given proportion should have failed (for example, 1,205 hours until 10% of the units have failed).
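The two ways of stating reliability are inverses of each other: one evaluates a failure distribution at a time, and the other finds the time for a given proportion. The sketch below illustrates this with a Weibull life distribution. The Weibull model and the shape and scale values are assumptions chosen only for illustration; they are not taken from this article and will not reproduce its example figures.

```python
import math


def weibull_failure_probability(t, shape, scale):
    """Proportion of units expected to fail by time t: F(t) = 1 - exp(-(t/scale)^shape)."""
    return 1.0 - math.exp(-((t / scale) ** shape))


def weibull_time_to_fraction(p, shape, scale):
    """Time by which a fraction p of units is expected to fail (the inverse of F)."""
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)


# Hypothetical parameters, for illustration only.
SHAPE = 1.5
SCALE = 20000.0  # hours

p_5000 = weibull_failure_probability(5000.0, SHAPE, SCALE)
t_10 = weibull_time_to_fraction(0.10, SHAPE, SCALE)

print(f"Proportion failed by 5,000 hours: {p_5000:.1%}")
print(f"Hours until 10% have failed: {t_10:,.0f}")
```

Feeding the second result back into the first function returns the original proportion, which is a quick check that the two statements describe the same distribution.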

Using censored or suspended data

You return to your testing board after one week, or 168 hours, to find that 20 units have failed, and 80 units still meet specification. The board records the exact failure time for the 20 units, but what about the 80 units? What is their failure time?

For right-censored data, the actual life is at least as long as the observed life. In our example, 80 units have not failed in the first week of observation, so the lifetimes for those 80 units are right-censored at 168 hours. Other censoring mechanisms exist. Data for a unit that failed before the first observation time are left-censored, as are, for example, data that fall below the limit of detection of an assay. Data for a unit whose failure is known only to lie between a lower and an upper bound are interval-censored.
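One way to keep the censoring types straight is to record each unit as a pair of bounds on its failure time. The sketch below assumes such a convention; the `Observation` class and `censoring_type` function are hypothetical names introduced here, not part of any particular software package.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    """One unit's life record; a bound of None means that side is unknown."""
    lower: Optional[float]  # latest time the unit was known to be working
    upper: Optional[float]  # earliest time the unit was known to have failed


def censoring_type(obs: Observation) -> str:
    """Classify an observation from which of its bounds are known."""
    if obs.lower is not None and obs.upper is not None:
        # Equal bounds mean the failure time was observed exactly.
        return "exact" if obs.lower == obs.upper else "interval-censored"
    if obs.lower is not None:
        return "right-censored"  # still working when last observed
    if obs.upper is not None:
        return "left-censored"   # already failed when first observed
    raise ValueError("an observation needs at least one bound")


print(censoring_type(Observation(lower=168.0, upper=None)))   # unit survived the first week
print(censoring_type(Observation(lower=168.0, upper=336.0)))  # failed during the second week
```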

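Now suppose that the board did not record exact failure times and that we instead inspected the 100 computer chips once every week. We would not know the exact failure times, but we would know the left and right endpoints of the interval in which each failure occurred: 0 to 168 hours, 168 to 336 hours, 336 to 504 hours, and so on.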
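With inspections every 168 hours, the interval endpoints follow directly from the inspection schedule. The helper below is a minimal sketch with a hypothetical function name. It assumes that a failure occurring exactly at an inspection time is detected at that inspection, so each interval is open on the left and closed on the right.

```python
import math

INSPECTION_INTERVAL_HOURS = 168.0  # one week, as in the example


def inspection_interval(failure_time, interval=INSPECTION_INTERVAL_HOURS):
    """Return the (left, right] inspection window that contains failure_time."""
    if failure_time <= 0:
        raise ValueError("failure_time must be positive")
    k = math.ceil(failure_time / interval)  # index of the inspection that finds the failure
    return ((k - 1) * interval, k * interval)


print(inspection_interval(200.0))  # a failure at 200 hours is found at the second inspection
```

In a real study the true failure time is not available, of course; the function only shows which interval would be recorded for a given unobserved failure time.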

Figure 1: Event plot showing exact data (marked with dots) and censored data (marked with triangles).

It is important that you include the data for censored units so that the reliability estimates are not biased. The censored data contain valuable information for the analysis. Though censored data cannot inform you about the exact failure time, they can tell you a lower bound on the lifetime. Without the censored data, the estimated probability of failure would be biased upward.
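The upward bias is easy to see with the counts from the chip example, where 20 units failed and 80 were still working at 168 hours. The short sketch below compares the estimated fraction failed by 168 hours with and without the censored units. The simple proportion works here only because every surviving unit was censored at the same time.

```python
n_failed = 20     # units that failed within 168 hours
n_censored = 80   # units still working at 168 hours (right-censored)

# Using all units: the censored units count as "at risk but not failed".
with_censored = n_failed / (n_failed + n_censored)

# Dropping the censored units: every remaining unit is a failure.
without_censored = n_failed / n_failed

print(f"Estimated fraction failed by 168 hours, all units:     {with_censored:.0%}")
print(f"Estimated fraction failed by 168 hours, failures only: {without_censored:.0%}")
```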

Here’s a different example of 100 simulated failure times in days. You’ll notice that all but eight observations are right-censored. Analyzing all the data, including partial information from the right-censored observations, the predicted proportion of parts that will fail by 1,825 days is about 2%, and the predicted median life is about 4,200 days. Using only the information from the eight exact failure times, the predicted proportion of parts that will fail by 1,825 days is much higher at 10%, and the predicted median life is much lower at 3,346 days. The former is closer to reality.

Figure 2: Event plot, estimated failure probability, and estimated median failure time with and without using censored data.
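A comparison of this kind can be sketched in a few lines. The example below assumes an exponential life distribution, whose maximum likelihood rate estimate with right-censored data is the number of failures divided by the total time on test. The distribution, the true mean life, the study length, and the random seed are all assumptions made for illustration, so the numbers will not match those quoted above; only the direction of the bias carries over.

```python
import math
import random


def exponential_rate_mle(times, failed):
    """Exponential rate MLE with right censoring: failures / total time on test."""
    n_failures = sum(failed)
    if n_failures == 0:
        raise ValueError("need at least one observed failure")
    return n_failures / sum(times)


def failure_probability(rate, t):
    """Probability that a unit fails by time t under the exponential model."""
    return 1.0 - math.exp(-rate * t)


def median_life(rate):
    """Median life under the exponential model."""
    return math.log(2.0) / rate


rng = random.Random(0)
TRUE_MEAN_LIFE = 5000.0  # days, hypothetical
STUDY_LENGTH = 1000.0    # days; units still working at this time are right-censored

true_lives = [rng.expovariate(1.0 / TRUE_MEAN_LIFE) for _ in range(100)]
times = [min(life, STUDY_LENGTH) for life in true_lives]
failed = [life <= STUDY_LENGTH for life in true_lives]

# Estimate 1: all units, censored ones contributing their time on test.
rate_all = exponential_rate_mle(times, failed)

# Estimate 2: exact failure times only, censored units discarded.
failure_times = [t for t, f in zip(times, failed) if f]
rate_failures_only = exponential_rate_mle(failure_times, [True] * len(failure_times))

for label, rate in [("all data", rate_all), ("failures only", rate_failures_only)]:
    print(f"{label}: P(fail by 1,825 days) = {failure_probability(rate, 1825.0):.1%}, "
          f"median life = {median_life(rate):,.0f} days")
```

Discarding the censored units removes their time on test from the denominator, so the estimated failure rate can only go up and the estimated median life can only go down.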

Reliability factors

You’ve seen how reliability measures quality over time. Just like other statistical analyses, reliability can depend on the values of process variables. Here are some examples: