An MD&DI May 1997 Column
When doing its own product testing, how can a device company be sure that test results will lead to both a safe product and FDA approval?
Unlike some other forms of performance testing, strength testing usually results in the destruction of the element being tested. Testing an endoscope joint, for example, might involve pulling on it until it breaks to determine the degree of stress it will be able to withstand in use in the body. For this reason, strength testing cannot be performed on parts actually intended for sale. Rather, the device maker must select a sample from the total produced, test that sample, and extrapolate from the data obtained an estimation of the reliability of the (untested) parts that will actually be sold and used, often in life-critical situations.
How large should that sample be? Large enough to give a statistically valid picture of the safety of the parts to be sold. Too small a sample can lead to FDA scrutiny and with it the necessity of expensive retesting, or worse, to the sale of unsafe devices. On the other hand, particularly if the parts in question are expensive, the manufacturer does not want to throw money away by destroying a larger number of them than necessary. If sample size is allowed to increase until it destructively tests all the parts, the manufacturer will be certain about their strength but have nothing left to sell.
The Bionix 200 (MTS Systems Corp., Eden Prairie, MN) performs material testing. Photo courtesy Annex Medical.
Common sense tells us that the larger the sample size, the greater the certainty about the extrapolation of sample results to all parts. And the greater the cost. So we must determine the minimum sample size that satisfies both regulatory standards and our resolve to make and sell the safest devices at the highest possible level of confidence.
This leads to other questions. How high can our level of confidence be that the part is safe? And how confident can we be that all parts sold will be as safe as the ones we have tested?
These questions recently became matters of practical concern for Annex Medical, Inc. (Eden Prairie, MN), which had developed a new ureteral stone basket and needed to make a statistical analysis of product validation test results to ensure that the product would meet the minimum strength specifications set down in the company's product validation protocol. The analysis included selection of sample sizes and determination of confidence intervals and population percentages (all defined below). In trying to conform to industry standards and practices, however, Annex discovered that guidelines were difficult to locate.
Individuals contacted at FDA declined to provide specific guidance, explaining that the manufacturer, who is intimately acquainted with the details of a product and its use, is the one best qualified to determine levels of confidence, population percentages, and required sample sizes for validation testing. In Annex's case, however, these confidence levels and percentages were precisely what the company needed help with. The only advice FDA was able to offer Annex was that it hire a consultant.
Most small medical manufacturers lack staff statisticians but would like to do at least some of their own analysis, both to learn more about their business and to save money. The result of Annex's call to FDA was the development of a step-by-step procedure for setting up and analyzing minimum-strength tests.
The procedure met Annex's needs and could be applied in full or in part to other companies' situations. The following article explains what that procedure entails.
Several statistical concepts are involved in making good extrapolations. Initial decisions include determination of confidence intervals, population percentages, and sample size. The meaning of these terms and how they are determined are discussed below.
Confidence Intervals. To determine confidence intervals, measurements are first taken from a sample group of a few parts. The average (or sample mean) and sample standard deviation are two statistical measures calculated from the sample data. (The average is the central value of a data set, derived by dividing the sum of the values of the set by the number of terms in it. The standard deviation is the common measure of the dispersion of the data set around the average.)
These measures are then used to estimate the mean and standard deviation of all the parts, including those that are untested. In simple terms, a confidence interval is an upper and lower limit for how far the mean of all the parts might vary from the sample mean. For example, at a 90% confidence level, a company would be 90% sure that the mean of all the parts would fall between the upper and lower limits of the interval.
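As an illustration of the idea, a confidence interval for the mean can be approximated from sample data. The following Python sketch uses the normal (z) approximation; for samples as small as those discussed here, a t-based interval would be somewhat wider, and the pull-test numbers below are invented for illustration only.

```python
from statistics import NormalDist, mean, stdev

def confidence_interval(data, confidence=0.90):
    """Approximate two-sided confidence interval for the population mean.

    Uses the normal (z) approximation; a t-based interval would be
    wider for small samples. Illustrative only.
    """
    n = len(data)
    m = mean(data)
    s = stdev(data)                       # sample standard deviation
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * s / n ** 0.5         # z times the standard error
    return m - half_width, m + half_width

# Hypothetical pull-test results (lb) for ten sample parts:
sample = [14.1, 16.2, 15.0, 13.8, 17.5, 15.9, 14.6, 16.8, 15.3, 14.9]
low, high = confidence_interval(sample, 0.90)
```

At a 90% confidence level, the company would be 90% sure that the mean strength of all the parts falls between `low` and `high`.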
Traditionally, statistical analysis has been carried out at 95% confidence, with many peer-reviewed journals refusing to publish studies that use confidence intervals below that level. The 95% figure, however, is based more on the availability of published tables than on any solid rationale. Many large medical manufacturers use a sliding scale of confidence levels, usually ranging from 95 to 99.9%. Higher confidence levels are used for more-critical measurements or tests.
Population Percentage. Population percentage is the proportion of the product expected to exceed the minimum strength. Ideally, this percentage would be 100%, but such certainty could only be ensured by 100% destructive testing. Since that is not feasible, analysis is performed based on a high, but realistic, population percentage. Like confidence levels, more-critical measurements and tests are usually assigned higher population percentages than less-critical ones. For a critical joint, for example, 99.9% of the product might need to be expected to exceed a safe minimum, while for a less-critical item 95% might be acceptable. For a given data set, then, increasing the confidence level or the population percentage increases our assurance about the true minimum strength of a product but paradoxically--since we are in effect upping the ante--makes it less likely to be found acceptable.
Sample Size. Sample size should be as large as necessary to provide statistically valid information about the total population of (untested) parts, but no larger. How the appropriate sample size is determined for a particular test protocol will be further discussed below.
Endoscopic instruments from Annex Medical (left). Tensile testing of a urological endoscopic instrument (right). Photos courtesy Annex Medical.
K values are a convenient way of expressing the statistical interaction of confidence intervals, population percentages, and sample size.
Suppose the strengths of a sample group of parts have been tested, with the results providing an average strength and a standard deviation. Extrapolation from this test-sample information provides a range of strengths expected for all the parts. K value can be defined as the number of standard deviations that the upper and lower limits of this strength range are away from the sample mean. A larger K value corresponds to a wider strength range. Reversing this procedure, if we start by knowing the K value--which we do because past statistical analysis has provided K-value ranges (see Table I)--we can find a lower or minimum limit for a strength measurement by simply multiplying the standard deviation of the sample by the appropriate K value and subtracting the result from the sample mean. (Later in the article this procedure is illustrated with an example.)
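The lower-limit calculation described above amounts to one line of arithmetic. In this Python sketch, the sample mean, standard deviation, and K value are hypothetical; in practice the K value would be read from a one-tailed table such as Table I for the chosen confidence level, population percentage, and sample size.

```python
def minimum_strength_limit(sample_mean, sample_std, k):
    """Lower limit for strength: sample mean minus K standard deviations.

    The K value comes from a one-tailed K-value table (like Table I)
    for the chosen confidence level, population percentage, and
    sample size.
    """
    return sample_mean - k * sample_std

# Hypothetical sample results: mean 20.0 lb, std dev 2.0 lb, K = 3.0
limit = minimum_strength_limit(20.0, 2.0, 3.0)   # 14.0 lb
```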
The more critical the part, the more stringent the test criteria should be. Higher K values correspond to more-stringent test criteria. K values can be made higher by increasing the confidence level, the population percentage, or some mix of the two. K values become lower as sample size increases.
Table I. One-tailed K values for calculating the probability of exceeding the minimum. Highlighting and outlining refer to the example discussed in the text.
The top of each Table I column shows the one-tailed confidence level and population percentage used to develop it. (The table used here is one-tailed because we are interested in only one end, or tail, of the strength distribution curve--namely, minimum strength. If we were interested in both minimum and maximum strength we would use a two-tailed test.) Not all the confidence level/population percentage combinations listed in the table would have been specified in past (precomputer) statistical procedures, largely because intermediate values of confidence and population percent are not widely published.
The columns of Table I represent an increasing progression of both confidence level and portion of population from left to right, that is, from 95/95 to 99/99.9. Arranging the table in this fashion results in an orderly increase in K values as a product or part becomes more critical, while giving appropriate credit to the increased certainty obtained from testing larger sample sizes. Population percentages in this table are for one-tailed testing, and are specifically arranged to find the probabilities of exceeding a minimum (for example, a minimum strength).
The method presented in this article produces results similar to those resulting from more complicated statistical specifications.
Normally Distributed Data. Table I is based on the assumption that the measurements are normally distributed. This assumption can be tested without a computer using normal-probability graph paper. Data that are normally distributed will plot as a straight line.
Plot each data point (y-axis) against its median rank (x-axis). In the present case, each data point represents a break point in a sample subjected to strength testing. The median ranks are assigned by arranging the data in ascending order. The smallest value becomes rank order 1, the next smallest becomes rank order 2, and so on. The median rank for each data point is then found in the column of a median-rank table--such tables are readily available in reliability texts--headed with the appropriate sample size and the row associated with the rank order.
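When a median-rank table is not at hand, the median ranks can be approximated numerically. The Python sketch below uses Benard's approximation, (i - 0.3)/(n + 0.4), a common stand-in for published median-rank tables; the break strengths shown are invented for illustration.

```python
def median_ranks(n):
    """Approximate median ranks for a sample of size n using Benard's
    approximation (i - 0.3) / (n + 0.4), a common substitute for
    published median-rank tables."""
    return [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]

# Pair each sorted break strength (lb, illustrative values) with its
# median rank for plotting on normal-probability paper:
breaks = sorted([12.1, 15.4, 13.9, 16.2, 14.7])
points = list(zip(breaks, median_ranks(len(breaks))))
```

If the plotted points fall close to a straight line, the normality assumption behind Table I is reasonable.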
For sample sizes over 100 a common numerical method of determining normality of data distribution is the chi-square test. Examples of this test are found in most statistics books.
Data that, when graphed, form two distinct lines on normal-probability graph paper or two humps in a frequency histogram are said to be bimodal. If this pattern is observed the data are not normally distributed. (See the discussion on bimodality under Common Problems, below.)
FINDING MINIMUM STRENGTH VALUES
The procedure for evaluating a part's minimum strength values can be broken down into the following series of steps. Steps A through G set up the analysis by determining sample size and critical test values. Steps H and I cover analysis of the results. Annex Medical's testing of an endoscopic instrument joint is used as an ongoing concrete example of the procedure.
Table II. Severity-of-failure categories and their respective safety factors.
Step A. Use failure mode and effects analysis (FMEA) or simply brainstorm to identify the different ways the product to be tested might fail. Determine the severity of the failure for each failure mode. The following severity categories, which are listed in Table II, have proven useful for products such as medical instruments:
1. Nuisance: A failure the user (physician or patient) will become aware of but can tolerate without the patient being affected.
2. Decreased Device Performance: The device is functional but not at the intended level.
3. Lengthened or More Complex Procedure: The failure of the device creates the need for additional procedural steps. These may include simple recoveries of broken product components from introducer instruments or from the patient.
4. Surgical Intervention: Surgery and/or other serious steps must be taken to correct a problem caused by the device failure.
5. Serious Injury: A device failure in this category is likely to cause injury to the patient. The injury may result directly from the device or from subsequent intervention.
6. Death: The device failure could cause death either directly or as a result of subsequent intervention.
If the endoscopic joint being used in our example were to fail, a smooth, easily grasped piece of the instrument would remain in the patient, and an additional endoscopic procedure would be required to retrieve it. Therefore, the Lengthened or More Complex Procedure column of Table I corresponds to the joint's severity of failure.
Step B. Devise a mechanical test to evaluate each failure mode for the part under consideration. The endoscopic joint, for example, was evaluated using a pull test to measure its strength.
Step C. Select a minimum acceptable result for the test performed in step B. Consider the conditions of actual use and answer the question, How strong must this part be to perform its function?
In the endoscopic instrument example, a small amount of experimental measurement showed that only 1 lb of force was required to actually operate the instrument.
Step D. Typically, the minimum result from step C is multiplied by a safety factor, normally ranging from 2 to 10, to arrive at the minimum acceptable value. Product knowledge and experience are the best guides in selecting this factor. Alternatively, the consequence of failure determined in step A can be used as a guideline, as shown in Table II.
Some products or parts of products are much stronger than needed for their intended function. In such cases, the minimum acceptable value may simply be set so it exceeds the maximum load that can possibly be applied to a part in use or abuse. This approach to assigning the minimum acceptable value is safest if the maximum load on one part is limited by the strength of some weaker part. For instance, if a steel screw is threaded into a plastic handle, the maximum load on the screw's threads will not equal the force required to break the threads themselves, but rather the force required to strip them out of the plastic.
In the example, as mentioned above, the endoscopic joint failure had been given a severity-of-failure described as a "lengthened or more complex procedure." The corresponding safety factor of six (per Table II) was multiplied by the operating force of 1 lb to establish a minimum acceptable value of 6 lb.
Step E. Test 10 samples and determine the sample mean and standard deviation of the test results. In the sample, include parts made by several different operators using different fixtures and machines; try to introduce as much variation as possible at this step. In addition, write a short explanation of how each sample failed, that is, the failure mode. Examples are "joint broke" or "wire broke."
In the endoscopic joint test procedure, 10 sample joints were pull tested. Their average strength was found to be 15.3 lb, with a standard deviation of 3.4 lb.
Step F. Determine the number of standard deviations (NSD) that the minimum acceptable value from step D is away from the sample mean by using this equation:

NSD = (mean - minimum acceptable value) / standard deviation

In the example:

NSD = (15.3 - 6) / 3.4 = 2.735
Step G. Using the K-value table (Table I), find the column associated with the severity of failure determined in step A. Trace downward in the column until you find a row with a K value equal to or just smaller than the NSD value found in step F, in this case, 2.706. Follow this row to the left to find the minimum sample size (nmin), in this case 22, that should be used to validate the product on this test.
A minimum of three separate validation test batches (or runs) should be made to conform to good statistical practice (and minimize the possibility that a single run was a lucky result of an otherwise poor process). The sample size number, nmin, can be used in two ways: For each batch to be statistically valid on its own, run at least nmin pieces in each. This is the soundest procedure statistically and should be used if the consequence of failure is surgical intervention, serious injury, or death. Alternatively, for a high-cost product with a lower severity of failure, use a minimum of three separate batches to test a total of at least nmin samples. For instance, if nmin is 22, as in the example above, then three batches of 8 each for a total (nactual) of 24 would be a good choice. Be aware that in this case the required confidence level will not be achieved until all three batches have been tested and the results pooled.

If nactual is large enough, it is good practice to spread the testing over as many batches as possible. For example, if nactual is 100, 5 batches of 20 or 4 batches of 25 would be preferable to 3 larger batches. It can be advantageous (but not necessary) to select a batch size of at least 6, since this will allow some separate analysis of the runs using the information in the K-value table.
This experiment has now been set up. The next sequence of steps covers analysis of the results.
Step H. Test nactual samples. Find the mean and standard deviation for the results. If you are using a batch size of nactual, repeat this for each batch. As in step E, write a short explanation of how each sample failed. In the example, after testing 24 sample joints (three batches of 8), the overall average was found to be 15.21 lb and the standard deviation 2.67 lb.
Step I. Multiply the standard deviation from step H by the K value corresponding to the nactual used. Subtract the resulting number from the average found in step H. This is the critical value.
Critical value = mean - (K value x standard deviation)
If the critical value exceeds the minimum acceptable value found in step D, the validation is complete. Write the report. If the critical value is less than the minimum acceptable value, see the following section on common problems. In the example, nactual was 24, so the K value (from the Lengthened or More Complex Procedure column) was 2.654.
Critical value = 15.21 - (2.654 x 2.67) = 8.124
Since the critical value (8.124) exceeds the minimum acceptable value (6 lb from step D), joint strength is adequate.
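The step I calculation, with the validation-run figures from the article's example (24 joints, mean 15.21 lb, standard deviation 2.67 lb, K value 2.654), can be sketched in Python as:

```python
def critical_value(sample_mean, sample_std, k):
    """Step I: critical value = mean - (K value x standard deviation)."""
    return sample_mean - k * sample_std

# Validation-run figures from the endoscopic-joint example:
cv = critical_value(15.21, 2.67, 2.654)   # about 8.12 lb

# Validation passes if the critical value meets or exceeds the
# minimum acceptable value from step D (6 lb in the example):
passes = cv >= 6.0
```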
COMMON PROBLEMS

If the value of NSD found at step F is smaller than anything in the table, or extremely large sample sizes are indicated, there are two possible, related causes.
In the first case, the variability of pieces from the process is large compared to the mean strength of the piece. If the standard deviation is more than about 30% of the mean, it would probably be a good idea to look at the process to see if some of the variables can be better controlled.
Operator-controlled variability is particularly suspect, especially if, as recommended, more than one operator made the samples. Study the operators' work processes to determine and correct the differences between their techniques. (If this is not done, the variability will occur when actual production begins and will be much more expensive to correct.) Some other sources of variation include the tolerances of incoming material, amount of adhesive or solder applied, times, and temperatures.
Another way to reduce variability is to limit the combinations of fixtures, machines, and operators used. For instance, always run fixture A on machine A and fixture B on machine B. Keep in mind, however, that if this is done only the combinations that have been tested will be valid for production, and subsequent QC records will need to confirm that only these validated combinations are being used for production.
The second possibility is that the average strength of the product is too close to the minimum acceptable value. The best fix here is to rethink the design or materials in the piece. Otherwise, the product will always be running on the ragged edge of not working.
If the value of NSD found at step F is larger than anything in the table, or step G indicates a sample size of less than 10, 10 will serve as an acceptable validation sample size for the initial test run. Running two more small batches (6 parts each, for example) to meet the three-batch minimum should be sufficient to validate the product when the 12 pieces are lumped with the first 10.
If the critical value found at step I is smaller than the minimum acceptable value, the most probable cause is that something in the process or materials has changed between the time of the pilot run of 10 and the time the three validation batches were run. Review everything! If the cause of this variation is not found, it could surface again unexpectedly, resulting in problems with production parts. Correct the changes and rerun this procedure starting at step E, taking 10 new samples. Continue sequentially through the steps. Do not use the old results in the rerun. If no changes can be identified and corrected, use the mean and standard deviation of the validation batches to find a new NSD. Go to step G with this NSD and continue sequentially determining a new sample size, n. In this case, the data that have already been collected can be included in the analysis, necessitating only a run of enough additional pieces to bring the total up to the new n.
Bimodality. If bimodality is evident, look at the failure mode for each individual sample (as in step E, above). Group the data so that similar failure modes are together. Does one failure mode seem to have a different mean and standard deviation than the other(s)? This could identify the cause of the bimodality. Try to understand why the bimodality is occurring. Perhaps one of the failure modes can be reduced or even eliminated.
Alternatively, divide the data into subgroups by failure mode and analyze each as if it were a separate test, starting at step H. The size of each subgroup becomes nactual for the analysis. It may become necessary to run more samples to accumulate enough of each subgroup to complete the analysis. Frequently each failure mode subgroup will meet the minimum requirements if it is analyzed separately.