# How to Plan Life Tests for Optimum Cost-Benefit


February 1, 2002

Originally Published MDDI February 2002

LIFE TESTING

Understanding the basic principles behind all the statistical wizardry and jargon of life testing can lead to better results.

Tom Clifford and Vanessa Koutroupas

Product development often requires accelerated-aging or fatigue-life testing. In a typical test, samples are put into a test chamber under carefully controlled conditions and inspected periodically for failures. The pattern of failures over time establishes the life of the population under those conditions. This finding is then used to evaluate the fitness of the product for a particular application.

Managing the test process can be highly stressful. Test specimens are typically rare and costly prototypes, testing facilities and inspections are expensive, and the tests are time-consuming. You usually get only one shot at the test, and the results can determine the fate of the product line, or the company. Your selection of the test variables is always an uncomfortable balance of cost, time, testing resources, availability of samples, and the "goodness" of the final result.

The deliverable will be some metric describing the failure distribution of the population. You are asked to get the most accurate metric (i.e., decision information) as rapidly as possible for your testing dollar. You must select the number of samples, the inspection frequency, and the method of analysis, and decide whether the test can be stopped before all samples have failed.

Typical tests reported in the literature are not much help. They turn out to be sadly imprecise in terms of statistical credibility. This paper describes the effects of the testing variables on the quality of the resulting metrics. It shows how to use Monte Carlo analyses to determine these effects with proper statistical confidence. Costs are also considered, and guidelines are provided to arrive at the most cost-effective choice of testing variables.

The goal is to encourage you to evaluate your testing options as well as the reported data, using available software and the methods and reasoning described herein. Examples are drawn from solder-joint accelerated-aging tests, but the statistics and results will apply to most accelerated-fatigue-life tests, including, in our experience, aging of plastic packaging.^{1,2}

WHAT'S A USEFUL METRIC?

Two types of tests are common.^{3} One is the pass-fail test. This is the simpler type, asking only whether the samples passed a specific threshold. Pass-fail testing will be discussed briefly towards the end of this article. The primary focus here is on the more-complicated second type, which seeks to define the distribution of failure times of the population being sampled and to show the level of confidence in this finding. The test engineer uses this information to compare one product with another, to check process stability or product uniformity, to quantify a material change, or to provide reliability numbers (e.g., what percent of the population will fail at X cycles).

What's a good metric for this second type of test? Certainly not great swooping full-color Weibull curves of asserted failure data points on some sort of probability axis. That's pretentious, probably misleading, and impractical for making decisions. Similarly, because lifetimes are never linear, don't plot anything, straight-line or curved, on linear paper, and don't report averages.

In most cases, you need only to determine and report a couple of metrics: one showing where most samples fail, and another indicating the breadth of the distribution. For thermal-cycle life tests, the median, or F50, is a good central metric. That's when 50% of the population will have failed. Other central metrics are available: Weibull fans, for instance, favor the "eta," which is when 63% fail. We prefer rounder numbers, however, especially a simple one like F50. (Note that some texts call F50 something different, like N50. However, we will reserve N to indicate sample size.)

Now for the second necessary metric: how to indicate the breadth? Weibull-curve fans would suggest that we use the "beta," an esoteric parameter indicating the slope of a straight two-parameter Weibull-log line. That looks very impressive, but it is not very useful.

For the second metric, we instead suggest F1, which is when 1% of the population will have failed. That's understandable to anyone. And the ratio of F50:F1 defines the slope of the Weibull curve as accurately as beta does. Moreover, anyone can plot a straight line on a Weibull plot, using those two points, and can read off F0.1 or F90, or any other value. (Note that projecting to the parts-per-million level below F0.1 requires some expert help, possibly using a three-parameter Weibull and thoughtful exploration of types and tests of distributions.)^{3}

Figure 1. Typical Weibull plot, showing 95% confidence intervals.

Figure 1 shows a typical Weibull plot of eight samples drawn from a population where eta = 1000 and beta = 6. The plot is impressive, but all you really need to do is report the metrics. The basic metrics can be read off this plot: the F50 is ~870, and the F1 is ~310. If other failure-probability points are needed, anyone can plot these two points on blank log-Weibull paper, draw a straight line, and read the other metrics directly from the graph. In this case, for example, F90 is ~1200 and F10 is ~550. (Note that this plot also shows the confidence intervals, discussed below.)
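The two-point reading described above can also be done numerically. The following is a minimal sketch, assuming a two-parameter Weibull distribution; the function names are hypothetical and are not taken from any particular software package. It fits the straight line through F50 and F1 and reads other failure-probability points from it.

```python
import math

def weibull_line_from_two_points(f50, f1):
    """Return (eta, beta) of the two-parameter Weibull line through F50 and F1."""
    # On Weibull paper, ln(-ln(1 - p)) is linear in ln(t), with slope beta.
    y50 = math.log(-math.log(1.0 - 0.50))
    y1 = math.log(-math.log(1.0 - 0.01))
    beta = (y50 - y1) / (math.log(f50) - math.log(f1))
    # eta is the time at which 63.2% of the population has failed.
    eta = f50 / (-math.log(0.5)) ** (1.0 / beta)
    return eta, beta

def failure_time(p, eta, beta):
    """Time by which a fraction p of the population has failed."""
    return eta * (-math.log(1.0 - p)) ** (1.0 / beta)

# Metrics read off Figure 1: F50 ~870 and F1 ~310.
eta, beta = weibull_line_from_two_points(870.0, 310.0)
f10 = failure_time(0.10, eta, beta)
f90 = failure_time(0.90, eta, beta)
print(f"F10 = {f10:.0f}, F90 = {f90:.0f}")
```

With the Figure 1 values as input, the computed F10 and F90 land close to the ~550 and ~1200 read from the graph.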

Distributions, of course, can be narrow or broad. When failure times clump together, we say the distribution is narrow. When failure times are spread way out (some early, some very late), we say the distribution is broad. The slope of a straight line on these cumulative probability curves is an indicator of the breadth of the distribution. Because of the way these curves are constructed, a steep slope indicates a narrow distribution, and a shallow slope indicates a broad distribution.

For Weibull distributions, the parameter beta describes the slope of the curve. A higher beta means a steep slope and narrow distribution. Narrow distributions from the literature typically are beta 8 to 12. These are a treat to deal with. Broader distributions, which show very early failures as well as samples that keep hanging on, are typically beta 2 to 6. These present special challenges.

WHAT MAKES A GOOD METRIC?

The quality of your results is determined long before the data start rolling in. Quality is built into the test plan: it is not determined or discovered after the test. The term quality does not mean how well the data satisfy your boss's expectations, or how smoothly the plotted points line up. Quality means how well the metric describes the underlying population.

Assuming nothing goes awry during testing, your initial selection of testing factors will determine the quality of your resulting metrics. And more importantly, you can know, long before you start getting data, what that quality will be. You can set up a quick-and-dirty test to generate ballpark values, or a more elaborate test to discriminate between populations that may be very similar. You are in control. Solid numerical measures of goodness are available for making these decisions. And just what is a proper measure of goodness, i.e., what measure shows how well the metric describes the population?

Classic statistics offers a useful measure: the confidence interval (CI). This is a calculated function of the breadth of the population and the number of samples. Commercially available software, such as Reliasoft Weibull5++, will automatically provide the CI.

A small CI says you are confident the population metric is somewhere within a tight interval; that is, you know your distribution pretty well. A large CI means that you know only that the population metric is somewhere within a large interval; that is, you are confident that the true population is somewhere in that ballpark, but you don't know the underlying population very well.

If you want to know a population precisely, run the test so that the CI will be small. Accordingly, our measure of goodness will be the CI. Each metric gets its own software-calculated CI.

Report the F50 and its CI, and the F1 and its CI. This is essentially all you need. These data describe the best estimate of the underlying population, as well as a measure of the uncertainty about those metrics.

By the way, mathematicians have provided a whole range of CIs. We can select 99, 95, or 50% CI, or whatever we want. Most engineers use 95% CI, which is what we'll use. That means we are 95% confident that the true population metric is within that interval. High-risk products, of course, might demand a 99% CI.

In Figure 1, where the F50 is ~870, the 95% CI around that point is 640 to 1050. This means that we can be 95% confident that the true F50 of the underlying population is somewhere between 640 and 1050. The 95% CI around F1 is ~60 to ~500.

CI is strongly affected by the selection of test variables. For example, testing many samples will tighten the CI around any metric, and permit us to maintain that we are confident that the population metric is within a very tight interval of uncertainty. To permit relative comparisons, we'll use a ratio of the CI to the metric, and we'll call that the uncertainty, or 100(CI/F50), in percent. For example, if the metric is F50 = 870, and the CI is 840 to 900, you could be confident the F50 of the population would be between 840 and 900. That's a small interval. The calculation is 100(60/870) = 7%.

By contrast, if your metric is F50 = 870, and the CI is 640 to 1050 (410 cycles), as shown in the figure, that means you are confident only that the F50 falls somewhere between 640 and 1050. That's 100(410/870) = 47%. That's not so good. The CI provides this measure of goodness, and is also useful in deciding whether one population is really different from another. If the CIs do not overlap, you can be confident they are different. If the CIs overlap, the test has not shown a difference, and the samples may well come from the same population.
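The uncertainty arithmetic above is simple enough to script when comparing many candidate test plans. This is a small sketch with hypothetical helper names; the overlap check encodes only the rule of thumb stated in the text, not a formal hypothesis test.

```python
def uncertainty_pct(ci_low, ci_high, metric):
    """Width of the confidence interval as a percentage of the metric."""
    return 100.0 * (ci_high - ci_low) / metric

def intervals_overlap(ci_a, ci_b):
    """True if two (low, high) confidence intervals share any value."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

# The two cases worked in the text, both with F50 = 870.
tight = uncertainty_pct(840, 900, 870)
loose = uncertainty_pct(640, 1050, 870)
print(f"tight CI: {tight:.0f}%, loose CI: {loose:.0f}%")
```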

TEST DESIGN

What elements will affect uncertainty? All of them—some more than others. The more samples, the better your knowledge of the population. Frequent inspection is good, as is letting all samples fail. If your distribution happens to be narrow, better yet. What if there are several components on the same board, all failing at different rates? What test plan will allow you to resolve small differences between populations A and B? You need facts. You need to know how each testing element affects the quality of the resulting metrics.

How do we learn about something? Try it and see. In statistics, this approach is called the Monte Carlo method. To study a distribution, take random samples from a population and analyze them. Do this several times and see what happens. This approach is based on two fundamental concepts: you can learn about a population by drawing random samples from it, and every random sample is just as valid as any other. The Monte Carlo method lets you look at the quality of the metrics you will encounter under different test schemes: how many to test, how often to inspect, how long to let them run before stopping, and so forth.

For example, suppose you want to know what sample size is needed to describe the F50 of a population. First select the distribution you're interested in. Typically, you know something about what you're testing: approximately when samples will start failing and when most will fail. From that rough idea, you can select the distribution you'll use for the Monte Carlo run. Draw a sample set N = 5. Calculate F50. Draw another five; calculate F50. Do this many times. You'll discover that the F50s will vary widely. Now draw several sets of N = 20. Calculate F50s. The resulting F50s will cluster much more tightly. Using Monte Carlo analysis, you have thus demonstrated that a larger sample size provides a more accurate measure of the population. Pick another distribution (for failures that happen sooner or happen later), and do the exercise again. You'll soon start seeing the effect of sample size.
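The draw-and-recalculate exercise just described can be sketched in a few lines of Python using only the standard library. Two simplifying assumptions are made here and are not the article's method: the F50 of each sample set is estimated by the plain sample median instead of a fitted Weibull line, and the spread reported is the central 95% range of those estimates across trials, not a software-calculated CI. The function name, trial count, and seed are hypothetical.

```python
import math
import random
import statistics

def simulate_f50_spread(n, eta=2000.0, beta=8.0, trials=2000, seed=1):
    """Central 95% range of sample-median F50 estimates, as % of the true F50."""
    rng = random.Random(seed)
    # random.weibullvariate takes (scale, shape), i.e. (eta, beta).
    estimates = sorted(
        statistics.median(rng.weibullvariate(eta, beta) for _ in range(n))
        for _ in range(trials)
    )
    low = estimates[int(0.025 * trials)]
    high = estimates[int(0.975 * trials) - 1]
    true_f50 = eta * math.log(2.0) ** (1.0 / beta)
    return 100.0 * (high - low) / true_f50

spread_n5 = simulate_f50_spread(5)
spread_n20 = simulate_f50_spread(20)
print(f"N = 5:  F50 estimates span {spread_n5:.0f}% of the true F50")
print(f"N = 20: F50 estimates span {spread_n20:.0f}% of the true F50")
```

Changing `eta` and `beta` repeats the exercise for a different assumed population, which is the "pick another distribution" step.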

We'll use this Monte Carlo method to look at all the important testing variables, exploring populations with different distributions. We'll base our analysis on Weibull distributions, and will use F50 as the metric. Note that log-normal distributions behave similarly. The measure of quality is the uncertainty, as calculated above. A large uncertainty means the metric is not very indicative of the underlying population. A small uncertainty means the metric pins down the population more closely, i.e., it is a better metric.

Table I is a very simple example of Monte Carlo runs. One dramatic and telling observation is that perfectly valid samples can look very different from their underlying population. This is especially true for small sample sizes. The N = 5 samples, from a population of beta = 8, show betas ranging from 6 to 14. You can see that randomly chosen small handfuls of data will produce widely varying and equally valid metrics. In contrast, choosing larger handfuls of data will get you closer every time.

| | N = 5, Trial 1 | N = 5, Trial 2 | N = 5, Trial 3 | N = 20, Trial 1 | N = 20, Trial 2 | N = 20, Trial 3 |
|---|---|---|---|---|---|---|
| Failure times (cycles) | 1412, 1493, 1509, 1930, 2300 | 1612, 1705, 1789, 1883, 1929 | 1514, 1777, 2041, 2098, 2230 | 1518, 1548, 1642, 1648, 1681, 1714, 1852, 1868, 1879, 1894, 1942, 1958, 1964, 2002, 2079, 2115, 2116, 2256, 2308, 2325 | 1088, 1432, 1603, 1678, 1702, 1808, 1833, 1851, 1883, 1903, 1960, 1982, 1995, 2053, 2069, 2131, 2133, 2145, 2434, 2445 | 1335, 1435, 1539, 1554, 1559, 1650, 1689, 1693, 1694, 1705, 1722, 1762, 1793, 1850, 1883, 1934, 1975, 1985, 2002, 2325 |
| Sample eta | 1850 | 1842 | 2058 | 2013 | 2043 | 1844 |
| Sample beta | 5.6 | 14.3 | 6.8 | 9.5 | 6.6 | 9.4 |
| F50 (cycles) | 1595 | 1895 | 2020 | 1950 | 1990 | 1820 |
| CI around F50 (cycles) | 810 | 220 | 1050 | 195 | 230 | 250 |
| CI/F50, as % | 51 | 12 | 52 | 10 | 12 | 14 |
| F1 (cycles) | 805 | 1350 | 1050 | 1295 | 1020 | 1105 |
| CI around F1 (cycles) | 1280 | 950 | 1445 | 520 | 590 | 350 |
| CI/F1, as % | 159 | 70 | 138 | 40 | 58 | 32 |

Table I. Example of a simple Monte Carlo run: beta = 8, eta = 2000, continuous inspection, test not suspended. Data points were selected randomly from the population by Reliasoft software.

You now have a mathematically valid tool for your comparison decisions: a certain test scenario will result in a knowable metric quality. In this one example, if you decide to test 20 samples, your uncertainty around F50 will be ~12%. If you choose to test only five samples, your uncertainty will be ~40%. Similarly, uncertainties around F1 will be ~40% and ~125%, respectively. This simulation can reveal much about the relationship between sample and underlying population. It also generates valid data sets, which give you experience with expected variations and outriders.

Figure 2. Effect of sample size on uncertainty.

Effect of Sample Size. A basic principle of statistics is the need to test many samples to properly characterize your population. Figure 2 confirms this for a simple test where eta = 2000, with broad (beta = 4) and narrow (beta = 16) distributions and where samples are continuously inspected, and the test is not suspended. It's clear that a sample size of five (N = 5) results in two to three times greater uncertainty compared to N = 20. A sample size of 20 is almost as good as 100, which is as good as you'll get. Note that in every case F1s are more uncertain than are F50s.

Note also that narrow distributions are much more desirable. The uncertainty of any metric, at any sample size, will always be better in a narrow distribution (beta = 12 to 16) than in a broad one (beta = 4). This makes sense: from a tight distribution, any specimen is going to behave much like any other specimen, so it matters less how many you test. Tailor your test plan to fit your expected distribution, but start with beta = 6 to 10 and use it until you know better.

Effect of Inspection Frequency. Typically, samples are inspected periodically. You will know that the sample failed between X and Y cycles, but you won't know exactly when. Reassuringly, when properly treated, this sort of data can be quite good, even if you have as few as 4 to 6 inspections during the population's failure span. Also, inspection intervals need not be the same throughout the test. Good software or manual methods can account for variations.

Figure 3. Effect of inspection frequency.

If you expect the span of failures to be ~500 to ~2000 cycles, check every 100 cycles at first, then every 200 to 400 cycles. Don't worry if a particular inspection shows many failures having occurred within one interval. That can be dealt with using the right software and knowledge. Note that there are devices that can tell you exactly when the failure occurs (two of these in the authors' experience include circuit analyzers to continuously monitor fatigue-induced electrical opens, and pressure-decay switches to monitor leaks in plastics packaging). These devices provide a continuous record, such that you know exactly when a specimen fails. However, you do not need continuous data to get good results. Figure 3 shows some representative case results.

Effect of Test Completion. By stopping the test before the last sample fails, you can save lots of valuable time and make good use of the data that are available. Suspending a test can affect its cost and feasibility.

Figure 4. Effect of suspension and sample size for two distributions.

Suspending a test when 60 to 80% of samples have failed can save time and money without seriously affecting the quality of the results. Sample size is important, and good software or manual methods are necessary. Figure 4 shows some examples of the quality penalty incurred by suspending a test. It doesn't hurt as much as you might think.

While the quality impact can be relatively slight, the time savings benefit can be substantial. In a test with an eta = 2000, beta = 8, you would expect completion at about 4000 cycles. At 15 cycles per day, that's 8 months. Suspending at 60% saves about 2000 cycles, or 4 months. Most importantly, you can see and evaluate any quality penalty and time-to-market benefit long before finishing the test.

Effect of the Data Distribution. Failure distributions can be broad (some samples fail early, others very late) or narrow (all samples fail around the same time). Most are Weibull shaped, others are log-normal shaped.^{3}

For example, the most uniform solder-joint populations, under tightly controlled test conditions, typically fail in a Weibull shape, beta = 10 to 12. Sloppier workmanship control or several overlapping failure modes (such as those typically encountered with the newest microelectronics prototypes) will result in a broad distribution, beta = 2 to 4.

For complex situations, the distributions will be log-normal, or even bimodal. Note that with a sample size of 50 or less, you cannot tell for sure whether the population is Weibull or log-normal just by looking at how straight the plotted points are.

For most useful sample sizes, i.e., N = ~20 or less, plots on either type of paper will look ragged. That's to be expected. An apparently ragged line doesn't usually mean that the underlying distribution is not consistent with the type of graph paper you are using, that you have a bimodal distribution, that you have outriders, or that anything is wrong with the data. Monte Carlo experience teaches that perfectly valid random-sample sets will often plot raggedly.

If you are concerned that you do not know whether you have log-normal or Weibull data, be reassured that it doesn't much matter. You'll get about the same central metrics, whether you treat the failure points as Weibull or as log-normal. Good software can easily analyze your data either way. F50 and F1 are relatively insensitive to which shape you think you have.

In one comparison, Weibull data treated as Weibull give 1400 and 700 cycles for F50 and F1, respectively. The same data points treated as if they were log-normal give 1450 and 830. Log-normal data treated as log-normal give 1500 and 820. The same data points treated as Weibull give 1600 and 680 cycles.

Selection of the most appropriate type of distribution can be an important exercise, but it may be unnecessary in an industry with a mature database of straight lines on Weibull paper. As always, data quality is better when the distribution happens to be tight (large Weibull beta). Again, metrics at the lower end of the curve, at F0.01 and below, warrant special attention.

DATA REDUCTION METHODS

Manual methods of data reduction can be effective. Use Weibull-log paper unless a mass of test data plot straighter on log-normal paper. Plot each continuous-data failure point, run a best-fit straight line, and read off the F50 and F1. For N = ~10, your metrics will be within 10% of what you'd get from good software. For N > 20, any difference in F50 estimates is trivial. However, for N = 5, the F50 will be 15 to 25% off and the F1 will be even worse. The real problem is the small sample size, not the method.
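The plot-and-draw-a-line procedure has a direct numerical counterpart, often called median rank regression. The sketch below is one common variant and is not necessarily what any commercial package does: it assumes Bernard's approximation for the plotting positions and an ordinary least-squares fit of the Weibull-paper ordinate on ln(time). The function name and the `n_total` parameter are hypothetical. Passing `n_total` larger than the number of failures treats the remaining samples as suspended after the last observed failure.

```python
import math

def fit_weibull_rank_regression(failure_times, n_total=None):
    """Fit a straight line on Weibull paper; return (eta, beta).

    failure_times: observed failure times, in any order.
    n_total: samples on test; extras are assumed suspended after the last failure.
    """
    times = sorted(failure_times)
    n = n_total if n_total is not None else len(times)
    xs, ys = [], []
    for i, t in enumerate(times, start=1):
        rank = (i - 0.3) / (n + 0.4)  # Bernard's median-rank approximation
        xs.append(math.log(t))
        ys.append(math.log(-math.log(1.0 - rank)))
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    beta = sxy / sxx                      # slope of the line
    intercept = y_mean - beta * x_mean    # equals -beta * ln(eta)
    eta = math.exp(-intercept / beta)
    return eta, beta

# Five failure times taken from Table I (one N = 5 trial).
eta_hat, beta_hat = fit_weibull_rank_regression([1412, 1493, 1509, 1930, 2300])
f50_hat = eta_hat * math.log(2.0) ** (1.0 / beta_hat)
f1_hat = eta_hat * (-math.log(0.99)) ** (1.0 / beta_hat)
print(f"eta = {eta_hat:.0f}, beta = {beta_hat:.1f}, F50 = {f50_hat:.0f}, F1 = {f1_hat:.0f}")
```

Because the fitting method differs from package to package, the numbers from this sketch should not be expected to match the software-generated values in Table I exactly.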

Note also that suspended data, plotted properly by hand, will very closely match the software's straight-line curve fit. Intervalized data can also be manually plotted, with an excellent match to software results. The trick, and this is essential, is to plot the failure point between the time the sample was OK and the time it was found to have failed. If several samples fail within that interval, spread out the failures uniformly within the interval. This is a bias-free method. It is mathematically and logically incorrect to graph points at the end of the interval.
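The interval rule above can be written as a small helper. The text says only to spread the failures uniformly; the exact placement used here (the midpoints of equal sub-intervals, which reduces to the interval midpoint for a single failure) is an assumption, and the function name is hypothetical.

```python
def spread_failures(last_ok, found_failed, n_failures):
    """Plotting times for failures found between two inspections.

    last_ok: cycle count at the last inspection where the samples were OK.
    found_failed: cycle count at the inspection where the failures were found.
    n_failures: number of samples found failed at that inspection.
    """
    width = (found_failed - last_ok) / n_failures
    # One point at the middle of each of n_failures equal sub-intervals.
    return [last_ok + (j + 0.5) * width for j in range(n_failures)]

print(spread_failures(100, 200, 1))   # single failure: midpoint of the interval
print(spread_failures(800, 1000, 4))  # four failures spread across the interval
```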

Note also that attempting to calculate the CI manually will probably be futile. This conclusion about manual-graphing adequacy can be readily confirmed by a few Monte Carlo trials comparing manual results with software results. Any difference is overshadowed by sample-size effects.

PASS-FAIL TESTS

Pass-fail tests are not as simple as one might think, and they certainly don't describe the underlying population. For borderline populations, sample size can be a very important factor; in fact, it can be used as a tool to bias the result.

If you are devious and want to prove that your borderline population passes, test only a few samples, hoping that your sample set happens to not contain any that fail at the early end of the distribution. You might fool the customer. Worse yet, you will fool yourself.

Figure 5. Effect of sample size on the likelihood of detecting a "<1000 failure" in a pass/fail test.

Figure 5, a Monte Carlo run on a typical (beta = 8, eta = 1500) distribution, is revealing. Test five samples, and there is only a 20% chance of detecting a failure if you stop the test at your threshold of 1000 cycles. However, from that same population, if you had tested 20 samples, you would probably have encountered at least one failure.
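The detection probabilities quoted above can be cross-checked without simulation. The sketch below uses the closed-form Weibull failure probability in place of the Monte Carlo run behind Figure 5, assuming independent samples from a two-parameter Weibull population; the function name is hypothetical.

```python
import math

def prob_at_least_one_failure(n, threshold, eta, beta):
    """Chance that at least one of n samples fails before the threshold."""
    # Probability that a single sample fails before the threshold.
    p_single = 1.0 - math.exp(-((threshold / eta) ** beta))
    # Complement of "all n samples survive past the threshold".
    return 1.0 - (1.0 - p_single) ** n

p5 = prob_at_least_one_failure(5, 1000.0, eta=1500.0, beta=8.0)
p20 = prob_at_least_one_failure(20, 1000.0, eta=1500.0, beta=8.0)
print(f"N = 5: {100 * p5:.0f}%   N = 20: {100 * p20:.0f}%")
```

For this population the closed form gives a little under 20% at N = 5 and a little over 50% at N = 20, consistent with the text.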

Note that in all the above discussions, we never use the phrase "good enough." Proper selection of testing variables depends on such things as the available samples, time, test racks, inspection resources, and particularly the required level of certainty. If all you need is ballpark accuracy, do the test fast and cheap. Conversely, to demonstrate the reliability of a critical device, you should run the test with more samples, more-frequent inspections, no suspension, and so forth. You should base your decision on data from Monte Carlo runs.

HOW GOOD ARE REPORTED METRICS?

It's possible to estimate the confidence intervals around reported metrics, but only if the actual data are presented. Unfortunately—and perhaps revealingly—they are usually not reported.

In a hypothetical case with a sample size of 10 (beta 6, eta 1000, inspected every 100 cycles) using Monte Carlo analysis, the F50 uncertainty will be 47% (between ~450 and ~950 cycles), and the F10 uncertainty will be 82%. Even worse, when you see a sample size of five, assume that any mid-range metric will be at least 200% off, either way. Conversely, if you see a sample size of 30 or more, a tight beta (8–12), some persuasive evidence of frequent or continuous inspection, no suspension, and proper data reduction, you can be 95% confident that the true population F50 will be within perhaps 5% of the reported sample F50.

If there are no data points, assume there's a reason. A colorful line on a pretty graph with no points and no backup data is a clear invitation to purchase snake oil. Fully reported data can be readily analyzed to determine their credibility.

COST IMPACT

Figure 6. Effect of sample size, suspension, and inspection interval on the quality of the F50 metric.

All this statistical wizardry is not enough. Money is the name of the game. Samples and inspections cost money. But you really need a certain level of quality in your resulting metrics. What should you do? Use lots of samples but suspend the test sooner? Save on samples but inspect more frequently? Inspect less often but let the test go to completion? Use lots of samples and suspend half-way through?

Table II and Figure 6 describe several hypothetical test plans in which costs have been assigned to some of the major testing elements. The costs are straightforward; the statistics come from Monte Carlo runs.

| Test Number |
|---|
| N = 8: 1, 2, 3, 4, 5, 6 |
| N = 24: 10, 11, 12, 13, 14, 15 |

Table II. Costed examples of test plans. Elapsed time is 100 cycles per week; test is 100% complete in 30 weeks.

For this exercise we assumed a common distribution (eta = 2000, beta = 8). Assume that samples cost $500 each, inspections cost $20 each, and testing costs $250 per week. You can see the quantitative effect of backing off on your inspection frequency (save money, lose a bit of accuracy), letting the test go to completion (it'll cost time and money, but you'll gain some accuracy), or starting with more samples (more initial expense, but more testing options, much better quality, and time savings).
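The cost side of such a comparison is plain arithmetic. The sketch below uses the unit costs stated above, with one loudly labeled assumption: the $20 inspection cost is taken as the cost of one inspection event for the whole batch, which the text does not specify. The two plans shown are hypothetical illustrations and are not rows of Table II; the function name is also hypothetical.

```python
def plan_cost(n_samples, n_inspections, weeks,
              sample_cost=500.0, inspection_cost=20.0, weekly_cost=250.0):
    """Total cost of a test plan in dollars.

    Assumes inspection_cost is charged once per inspection event (an assumption).
    """
    return (n_samples * sample_cost
            + n_inspections * inspection_cost
            + weeks * weekly_cost)

# Hypothetical plan A: 8 samples, run to completion at 30 weeks,
# inspected every 200 cycles at 100 cycles per week (15 inspections).
cost_a = plan_cost(n_samples=8, n_inspections=15, weeks=30)

# Hypothetical plan B: 24 samples, suspended at 20 weeks,
# inspected every 200 cycles (10 inspections).
cost_b = plan_cost(n_samples=24, n_inspections=10, weeks=20)

print(f"Plan A: ${cost_a:,.0f}   Plan B: ${cost_b:,.0f}")
```

Pairing each plan's cost with its Monte Carlo uncertainty gives the kind of cost-versus-quality comparison plotted in Figure 6.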

Plotting these data in Figure 6 provides a visually intuitive sense of the sort of trends and trades that can be prepared and discussed using this technique. Using this approach, you can make sweeping plans or explore subtle details, made more rational and accurate by the effort you put into estimating the costs of the test elements and the statistical cases chosen. If samples are rare or very expensive, you can show how to compensate (more frequent inspections and less suspension). If time is all-important, you can determine how many samples you need and how soon you can suspend the test. If ballpark numbers are OK, you can show how inexpensively and quickly you can obtain that. For any test setup, you can know what you'll lose by shutting down early to save time.

CONCLUSION

Life tests can be treated like any other task, using comfortable principles and tools. Good software and Monte Carlo analysis can help you plan an optimum test, and can help you evaluate the life-test data that are reported in the literature.

ACKNOWLEDGMENTS

The authors acknowledge the support of the life-test personnel at Lockheed Martin Sunnyvale, particularly Bob Haugh, software; Robert Hill, design; and Grant Inouye, reliability. Technical inputs from Dave Groebel of Reliasoft are also gratefully acknowledged.

REFERENCES

1. "Life Tests: Get the Biggest Bang for the Buck," Printed Circuits Europe, 22, Fall 2000.

2. "Thermal-Cycle Life Tests: Help!" Circuits Assembly, September 2000.

3. NIST/Sematech Engineering Statistics Handbook, sect. 8.2 (Austin, TX: Sematech).

Copyright ©2002 Medical Device & Diagnostic Industry
