Sterile Packaging: Sample Sizes and Statistics

Originally Published MDDI October 2004

October 1, 2004

Cover Story - Packaging

Determining appropriate sample sizes for operational qualifications can help manufacturers ensure sterility of medical device packaging.

Dennis Gilliland, Laura Bix, Hugh Lockhart, and Nick Fotis

Packaging products courtesy of B. Braun OEM/Industrial Div.; Kimberly Clark; Rollprint Packaging Products Inc.; Perfecseal; Sherwood, Davis and Geck; and Technipaq Inc.

It seems so simple; sterile medical devices must be delivered to hospitals in a sterile state. But, in reality, the issue is not simple. Device industry professionals must demonstrate with a high degree of confidence that medical device package integrity will be maintained during storage, handling, and distribution. Moreover, this must be done in an economy where cost-competitiveness is increasingly important. In addition, reducing the ballooning costs of healthcare is currently a national concern.

No process is defect-free, and the added unknowns of distribution and handling indicate that problems may occur. It is impossible to know the outcome of a future event, such as the sterility of a device at time of use. The medical packaging industry uses research, experience, accumulated scientific knowledge, controlled manufacturing processes, and FDA guidance to minimize the risk of nonsterility at the time of use.

This article describes a system for ensuring sterile integrity of medical device packages, as well as the goal of the system and the importance of cooperation among the system's components. A good understanding of the operating characteristics of various sampling plans and the limitations of statistics should enable manufacturers to better balance the risks and costs of achieving sterile device packaging.

Quantifying a Small Rate of Failure

It makes sense to score a football game. But does it make sense to score the healthcare industry for the sterility assurance of packaged medical devices? If so, how should such a score be determined? How reliable would the measurement of that score be? For this discussion, the score is defined as the percentage of devices that are sterile upon opening the packages at the time of use. A medical device that is nonsterile at the time of its application to the patient, or its endpoint, is a failure. It is accepted practice to demonstrate that the package has integrity and was produced and sterilized by validated, documented processes. However, with this practice, no endpoint score is ever recorded.

This article pertains to statistical treatment of sterile package failures that may result in nonsterile product. What percentage of failures is acceptable? 0.1% (1 nonsterile device in 1000)? FDA states, “The presence of viable microorganisms on medical devices poses a significant risk to patients; particularly to those patients with diminished resistance.”1

It is universally agreed that a zero-defect level is desired. However, in practice, sampling plans and criteria for passing tests must be determined with feasible sample sizes.

An early version of the “FDA Compliance Program Guidance Manual” contained objectives for an acceptable quality level (AQL). AQL is the maximum percentage of nonconformities that, for purposes of sampling inspection, can be considered satisfactory as a process average. The agency determined an AQL of 0.25% for invasive devices and an AQL of 0.65% for noninvasive devices.2

The American Society for Quality further defines AQL in its “Note on the Meaning of AQL” in ANSI/ASQ Z1.4, as follows:

When a consumer designates some specific value of AQL for a certain nonconformity or group of nonconformities, it indicates to the supplier that the consumer's acceptance sampling plan will accept the great majority of the lots or batches that the supplier submits, provided the process average level of percent nonconforming (or nonconformities per hundred units) in these lots or batches be no greater than the designated value of AQL.3

This definition includes an acceptance sampling plan at the consumer end. No specified plans are mandated for accepting sterilized medical device packages; therefore, each manufacturer is responsible for setting its own risk levels.

On its face, AQL does not directly apply in this context. However, one might interpret “no greater than 0.25% failures for invasive devices” and “no greater than 0.65% for noninvasive devices” as objectives. These values also come up later in this article when discussing a manufacturer's operating characteristics of sampling plans for operational qualification (OQ).

Though an interesting concept, testing for endpoint failures is difficult and not very useful. The usefulness of statistics for quantifying rates of rare events is limited. Consider as an example a stable system for the sterility assurance of packaged medical devices. Suppose that randomly selected packages from the system's output are opened just prior to use and that the sterility of the packages is determined. Table I shows the upper 90% confidence limit for the rate of failures for the process given various sample findings. In this case, failures might be indicated by holes that may allow penetration of viable microorganisms. Finding no failures in a sample of 100 does not rule out the possibility of an unacceptable system rate of up to 23 failures per 1000, at 90% confidence. In other words, if there are zero failures in the sample of 100, as many as 23 failures could be found in the next 1000.
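The upper confidence limits described above can be reproduced with a short calculation. The following is a minimal sketch, assuming the exact binomial (left-tailed) method named in the Table I caption; the function names are illustrative and not from the article. For zero failures in a sample of 100, it returns about 0.0228, or roughly 23 failures per 1000.

```python
from math import comb


def binom_cdf(x: int, n: int, p: float) -> float:
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(x + 1))


def upper_limit(x: int, n: int, confidence: float = 0.90) -> float:
    """Exact one-sided upper confidence limit for the process failure rate.

    This is the failure rate p at which seeing x or fewer failures in n
    trials has probability exactly 1 - confidence.
    """
    alpha = 1.0 - confidence
    if x >= n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(100):  # bisection: the CDF decreases as p increases
        mid = (lo + hi) / 2.0
        if binom_cdf(x, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Zero failures observed in samples of 30 and 100 packages.
for n in (30, 100):
    print(n, round(1000 * upper_limit(0, n), 1), "failures per 1000")
```

With zero failures the limit has the closed form 1 − (1 − confidence)^(1/n), which is a convenient check on the bisection.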

FDA recognizes the limitations of endpoint testing. In a recent document, the agency stated, “It has been determined by experts in sterilization that finished-product testing alone is inadequate to assure total product integrity. Therefore, to reduce the risk of distributing nonsterile devices to an acceptable minimum, it is necessary for every lot of devices labeled as sterile to be subjected to well-controlled sterilization processes of proven effectiveness.”1 It is important to note that FDA emphasizes judgment and process over endpoint testing. Such emphasis is not only sensible, but it is also compatible with the philosophy that statistical testing should not be used in lieu of engineering judgment in package and process design.

Medical Device Packaging Validation

Table I. The upper 90% confidence interval for process rate of failures is based on the binomial distribution. The 90% confidence interval consists of all probabilities π not rejected at level 0.10 by the exact left-tailed test of π based on the observation on B(n, π), the binomial distribution with n trials and probability of failure π.

In a medical packaging system, a packager has a major quality assurance responsibility to provide sterile medical devices. FDA states:

Emphasize validation in your review of the packaging process. The process should have been validated to assure integrity of the seal. The validation study should include verification that the sterilization process will not have an adverse effect on packaging integrity. *If resterilization is anticipated, the validation study should confirm that package integrity will not be adversely affected for a specified number of cycles.* Also, the study should include reasonable expectations of handling and storage conditions to which the packaging would normally be subjected. The packaging material and procedure used must correspond to that described in the validated procedure. Significant changes that would affect package integrity require revalidation. Compare the packaging procedures used by the firm with the instructions in the operations manual provided with the equipment. The device packager should be able to explain the reason for any variation between his procedure and the procedures described in the manual.4

A packager must validate, or qualify, its processes. FDA provides general guidance and some requirements, but no details. Therefore, a packager must use the specifics of its validation process to create its own guidance and requirements. Moreover, manufacturers must do so based on engineering knowledge, theory, and empirical information derived from experimentation and statistical analyses. So what is reasonable? What role does statistics play?

Two competing concerns bear on the choice of the sample size n, and they should be recognized before exploring its effect on the probability of qualification (Pq). One suggests that n be large; the other suggests that n be small. For example, based on the concern that nonsterile medical devices will be used on patients, some think n should be very large. But others think n should be small to avoid any delay in getting the product to market and to minimize the cost of sampling and testing.

The principles of OQ can be described by using a heat-sealing process as an example. The process demonstrates the effect statistics can have on quality and cost. During OQ, “process parameters should be challenged to assure that they will result in a product that meets all defined requirements under all anticipated conditions of manufacturing, i.e., worst-case testing.”5

An OQ confirmatory process consists of numerous inputs, so to truly challenge the process, validation should consider more than the equipment itself. Process-control parameters, such as dwell time, temperature, pressure, and line speed, should be taken to their extremes. But validation engineers should not neglect other parts of the process, such as varying shifts and differing material lots, that also can affect the validation outcome. This experiment can be expanded to include components of the system when n packages are sealed, put in shipping cases, sterilized, and put through distribution-testing procedures. These procedures subject packages to tougher conditions than they would encounter in the actual distribution environment. This confirmatory process, then, tests multiple aspects of the process and system. By challenging multiple aspects, the requirements for worst-case testing are met.

Figure 1. Operating characteristics of qualification plans (n, 0).

An OQ confirmatory test might require, for example, that all n packages pass the distribution testing. In this case, the OQ confirmation test is (n, 0), which denotes that qualification occurs if there are no failures in the sample of n packages.

But what sample size n is appropriate? A large sample size makes it difficult to qualify a process. This is true even if the process rate of failure is acceptable, because a large n produces a greater likelihood of finding a failure. By contrast, a small n allows a process to qualify easily, even if the process rate of failure is unacceptable, because a small n by nature makes it less likely that a failure would appear. There is no magic formula that balances risks and costs to achieve an ideal sample size. However, examining the operating characteristics of the plan (n, 0) for various n can help in making the decision.

For example, in Figure 1, one assumes random sampling from the Bernoulli distribution B(1, π), where π is the process failure rate. (The Bernoulli distribution is a discrete distribution with two possible outcomes.) With the plan (n, 0), the Pq is (1 − π)^n, which is plotted in Figure 1 for various n. One may prefer to plot ln(Pq) = n ln(1 − π) ≈ −nπ, where ln denotes the natural logarithm; the linear approximation ln(1 − π) ≈ −π is quite good for small π.
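The operating characteristic of a zero-failure plan is simple enough to compute directly. The sketch below evaluates the qualification probability and compares the exact logarithm with its linear approximation; the function names are illustrative assumptions, not part of the article.

```python
from math import log


def prob_qualify(n: int, rate: float) -> float:
    """Probability that plan (n, 0) sees zero failures at the given failure rate."""
    return (1.0 - rate) ** n


def log_pq_exact(n: int, rate: float) -> float:
    """Exact natural log of the qualification probability."""
    return n * log(1.0 - rate)


def log_pq_linear(n: int, rate: float) -> float:
    """Linear approximation to ln(Pq), adequate for small failure rates."""
    return -n * rate


# Compare exact and approximate ln(Pq) for a plan with n = 100.
for rate in (0.0025, 0.0065, 0.05):
    print(rate, round(log_pq_exact(100, rate), 4), round(log_pq_linear(100, rate), 4))
```

The gap between the two logarithms grows with the failure rate, so the linear form is best reserved for rates of a few percent or less.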

Whatever the process failure rate, the chance of qualification decreases as sample size n increases. Table II shows that with a process failure rate of π = 0.65%, there is an 82.2% chance of qualification if n = 30, a 27.1% chance if n = 200, and a 7.4% chance if n = 400.

With a process rate of π = 0.25%, there is a 92.8% chance of qualification if n = 30, a 60.6% chance if n = 200, and a 36.7% chance if n = 400. Note that the chance of qualification is 1 only when the process failure rate is 0. The values 0.25% and 0.65% are the FDA AQL levels previously cited. Table II also includes n = 106, since in this case Pq = 0.50, or a 50% chance of qualifying and 50% chance of not qualifying, at π = 0.65%. It also shows n = 277, since in this case Pq = 0.50 at π = 0.25%.
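The qualification probabilities quoted above, and the sample sizes that give an even chance of qualifying, can be checked with a few lines of code. This is a sketch under the same random-sampling assumption; `nearest_n_for_target` is a hypothetical helper that rounds to the nearest integer, which appears to be how the values 106 and 277 were obtained.

```python
from math import log


def prob_qualify(n: int, rate: float) -> float:
    """Probability that plan (n, 0) sees zero failures at the given failure rate."""
    return (1.0 - rate) ** n


def nearest_n_for_target(target_pq: float, rate: float) -> int:
    """Sample size whose qualification probability is closest to target_pq.

    Solves (1 - rate)^n = target_pq for n and rounds to the nearest integer.
    """
    return round(log(target_pq) / log(1.0 - rate))


# Qualification probability (%) at the two AQL values cited in the text.
print("   n   0.25%   0.65%")
for n in (30, 106, 200, 277, 400):
    print(f"{n:>4}  {100 * prob_qualify(n, 0.0025):5.1f}%  {100 * prob_qualify(n, 0.0065):5.1f}%")

print(nearest_n_for_target(0.50, 0.0065), nearest_n_for_target(0.50, 0.0025))
```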

Sample Size, Binary Data, and Variable Data

Table II. Qualification probability for plans (n, 0) for various n and percent defective, π. The data for n = 106 and n = 277 were included to show Pq = 0.50 at 0.65% and 0.25%, respectively.

Some practitioners have the false impression that there is something magical about the sample size n = 30 as a basis for inference from a test result. This may be due to the fact that sample sizes of 30 or more produce good normal approximations to the sampling distributions of certain sample statistics. Advances in computing have minimized the need for approximation. Moreover, the context of the situation in which the inference is to be made should dictate the necessary precision and control of risks and, therefore, the necessary sample sizes. There is, in fact, nothing magical about the sample size 30.

Random sampling has limited power in quantifying rates of rare events. Table I shows that zero failures in n = 30 randomly selected units from a stable process or population with Bernoulli distribution B(1, π) result in the one-sided 90% confidence interval 0 ≤ π ≤ 7.4%, where π is the probability of failure expressed as a percentage. The upper limit, 7.4% for failures, is hardly reassuring if the concern is with the rate of nonsterile devices.
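To see how quickly the required sample grows, one can invert the zero-failure calculation and ask how many failure-free units are needed before the upper confidence limit drops to a chosen target. The sketch below is an illustration only; the resulting sample sizes are not taken from the article, and the function names are assumptions.

```python
from math import ceil, log


def upper_limit_zero_failures(n: int, confidence: float = 0.90) -> float:
    """One-sided upper confidence limit on the failure rate after 0 failures in n units."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


def zero_failure_sample_size(target_rate: float, confidence: float = 0.90) -> int:
    """Smallest n for which 0 failures puts the upper limit at or below target_rate."""
    return ceil(log(1.0 - confidence) / log(1.0 - target_rate))


print(round(100 * upper_limit_zero_failures(30), 1), "% upper limit for n = 30")
for target in (0.0065, 0.0025):
    print(target, zero_failure_sample_size(target))
```

Under these assumptions, demonstrating a failure rate below 0.25% at 90% confidence takes several hundred failure-free units, which illustrates why endpoint sampling alone is a weak tool for rare events.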

Although binary data are difficult to deal with, they arise naturally in regard to the outcomes of endpoint testing. Binary data are the result of what is often called attribute testing, because attributes are inspected and reported on a pass or fail basis. One example of attribute binary data that is common to package testing is to rate a package's integrity as a success or a failure. Binary data may also result from truncating or censoring variable data, which sometimes arises from the measurement process itself. For example, microbes might be reported as either detected or not detected. And although a carton responds to a vibration and shock test with varying degrees, the outcome might be reported simply as pass or fail.

In some cases, an engineer can use statistics, accelerated testing conditions, and engineering knowledge to make inferences about a small probability. By testing under conditions more severe than the field conditions, failure rates can be determined for both standard and experimental processes and products. Such conditions enable testers to differentiate among the larger probabilities of failure using smaller, more-manageable sample sizes. Engineering judgment and theory must answer the question: Does an improvement under extreme conditions translate into an improvement under field conditions?

Variable data arise from measurements on a quantitative or ordinal scale. These data often provide more-powerful information than binary data. Engineering practice, knowledge, and modeling sometimes lead to the study and analysis of variable data. For example, engineers who understand the relationship of seal failure to seal strength can focus their analyses on the latter. In bubble tests, the relationship between the size of the hole and the pressure provides insight into package integrity.

Testing until failure is another way binary data are converted to variable data. Testing until failure is common in packaging. Cartons are tested to failure by dropping them, and the variable height data are analyzed to assess effects and differences. Seal strength can be defined by a continuous variable, such as force to separation, while a pressure decay test may use pressure to measure the integrity of a sealed pouch.

It is important to understand the joint and marginal effects of predictor variables on failures. In logistic regression, a two-value outcome is predicted by one or more variables. Logistic regression of binary data on continuous variables may provide useful models for understanding these effects. Sometimes, small failure rates remain. In these cases, it may be necessary to impose conditions that are more extreme than would be expected in the application to identify these failures. The idea behind this is to produce failures and allow for estimation of effects. Such models can add information and insights that enhance engineering judgment with regard to process improvement.
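As a rough illustration of logistic regression on binary test outcomes, the sketch below fits a one-predictor model by plain gradient ascent. The data are invented for illustration only (seal strength in arbitrary units against a pass or fail result), and the fitting routine is a simple stand-in for the statistical software one would normally use.

```python
from math import exp

# Hypothetical, made-up observations for illustration only:
# (seal strength in arbitrary units, 1 = package failed, 0 = package passed)
DATA = [(1.0, 1), (1.5, 1), (2.0, 1), (2.5, 0), (3.0, 1), (3.5, 0),
        (4.0, 0), (4.5, 1), (5.0, 0), (5.5, 0), (6.0, 0), (6.5, 0)]


def sigmoid(z: float) -> float:
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + exp(-z))


def fit_logistic(data, steps: int = 5000, rate: float = 0.1):
    """Fit P(fail) = sigmoid(b0 + b1 * x) by gradient ascent on the log-likelihood."""
    b0 = b1 = 0.0
    n = len(data)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in data:
            err = y - sigmoid(b0 + b1 * x)  # observed minus predicted
            g0 += err
            g1 += err * x
        b0 += rate * g0 / n
        b1 += rate * g1 / n
    return b0, b1


b0, b1 = fit_logistic(DATA)
for strength in (1.0, 3.5, 6.0):
    print(strength, round(sigmoid(b0 + b1 * strength), 3))
```

A negative slope here would indicate that the modeled failure probability falls as seal strength rises; with real data, the fitted curve is one input to engineering judgment rather than a substitute for it.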

In reaching a decision about n, it is important to keep in mind that OQ has stacked the odds against qualification. Factors fighting against qualification include the packages produced at extremes within the process window and put through the extreme conditions simulated by the distribution test. Therefore, when deciding on the value of n, it is important to use both sound engineering judgment and knowledge of the operational characteristics of statistical tests from a random sampling model.

System for the Sterility Assurance of Packaged Medical Devices

W. Edwards Deming states, “A system is a network of interdependent components that work together to try to accomplish the goal of the system.” He goes on to say, “The secret is cooperation among the components toward the goal of the organization.”6 At the most basic level, any system for sterility assurance of packaged medical devices is implemented by a group. They must combine their work ethic, accumulated knowledge, potential to gain new understandings, and commitment to continuous improvement. They then apply these characteristics to developing equipment, methods, materials, processes, and procedures. The goal of the system is to have sterile medical devices available when needed.

Such a system is very complex; many of its components are made up of individuals, equipment, materials, and organizations. The components, or processes, may be defined by function or employment. Individuals may be grouped into companies, suppliers, engineers, scientists, consultants, government agencies, distribution systems, hospitals, nurses, and physicians. For a broad overview, one might take FDA, suppliers, medical device manufacturers, distributors, and hospitals as the major groupings in the system that ensures that sterile devices are available for medical procedures. Figure 2 shows FDA supporting the other components of the system through regulations, guidance, and inspections, which apply to the system at all stages.

The goal of this system is to deliver sterile medical devices when they are needed. No individual, no system, and no amount of money will be able to meet the goal with 100% certainty. As mentioned earlier, it is impossible to know the outcome of a future event. Through the use of its resources and resourcefulness, the system can only make delivering sterile devices as likely as possible. The system can increase the likelihood with factors such as its members' scientific knowledge, good engineering judgment, and good business practice. The focus should be on creating, maintaining, and continually improving the system to maximize its chances of attaining the desired result. Harriet B. Braiker, MD, a clinical psychologist and management consultant, said, “Striving for excellence motivates you; striving for perfection is demoralizing.” This could be taken as a basic tenet for the sterility assurance system.

A medical device manufacturer depends on a supplier to provide materials produced by stable and capable processes. If the supplier were to purposely produce at the low end of the strength specifications to increase profit, it would not be operating toward the goal of the system. Similarly, if a medical device manufacturer failed to validate a major change in a packaging process, it would not be operating toward the goal of the system. If a distributor were to destroy the integrity of a carton through rough handling and fail to report it, it too would not be operating toward the goal of the system. If a hospital improperly stored or handled packaged medical devices, it would not be operating toward the goal of the system. These examples illustrate that all interdependent components must work openly and cooperatively toward the goal of the system, not suboptimally for their own separate interests. FDA promulgates rules and monitors the components of the system to promote this outcome.

A product or package is developed using research and design processes. The product and the package are created and united using production and packaging processes. Many factors, such as materials, methods, machines, milieu, man, and management, affect the outcome of a given process. The large number of variables within each input renders the outcome unknown. As a result of this uncertainty, FDA requires, by way of Quality System Regulation (QSR) 820.75(a), that processes be either fully verifiable or validatable. The Global Harmonization Task Force says, “Validation of a process entails demonstrating that, when a process is operated within specific limits, it will consistently produce product complying with predetermined (design) requirements.”7

Figure 2. System for the sterility assurance of packaged medical devices.

Three components of validation attempt to ensure that production generates a predictable outcome. They are OQ, installation qualification (IQ), and performance qualification (PQ). OQ is the phase of validation that establishes “by objective evidence, process control limits and action levels which result in product that meets all predetermined requirements.”8 Simply put, OQ involves pushing a process to its limits to determine the point at which the result is no longer acceptable. IQ verifies that the equipment is installed, maintained, and used as the equipment manufacturer intended. The production process is maintained and monitored using PQ, typically by employing an established sampling plan such as those published in ANSI/ASQC Z1.4-1993. The process uses objective evidence to establish “that the process, under anticipated conditions, consistently produces a product meeting predetermined requirements.”9

Developing a validation plan can be challenging for device manufacturers. FDA tends to define terms broadly and leaves interpretation to the manufacturers. How does a manufacturer know when it has reached a “high degree of assurance” that an entire process, and all that entails, will produce consistently?

A concern for patient safety drives this process to achieve a high degree of assurance. The effect of a nonsterile medical device depends on many factors. Such factors include the types of microbes present, the use of the device, the health of the patient, and the difficult-to-quantify contribution that the nonsterile device would add to a negative outcome. Such contemplated losses explain the need for the validation and testing of products, but they do not dictate the details for validation plans and testing. Redundancy in validation and testing, however, takes resources that might otherwise be used to bring new and better products to patients.


Engineering judgment, and the costs and risks for validating an unacceptable process or for failing to validate an acceptable process, are part of the qualification process. Statistical analysis can help testers to gather data efficiently through planned experiments and surveys. This analysis helps assess the significance of any deviations in outcomes from those predicted through theory and judgment. Calculations of operating characteristics of qualification plans can provide useful benchmarks. However, to provide sterile medical devices, excellence and improvement in the system depend mostly on the theory that connects engineering judgment and experience to decisions in the qualification process.

Clearly, very difficult decisions must be made. Sometimes statistics can provide insights necessary to support the engineering and business judgments that are crucial to the decisions. A manufacturer should weigh risks and costs in the context of its application. For example, a manufacturer may decide that a sample size of at least 277, allowing no failures, is required for OQ for packaging an invasive device, and that at least 106, allowing no failures, is required for a noninvasive device. It is ultimately the manufacturer's responsibility to weigh the system's risks, limitations, and costs when deciding how to allocate resources to provide sterile medical devices.

Authors' Note

The substance, recommendations, and views set forth in this article are not intended as specific advice or direction to medical device manufacturers and packagers, but rather, this information is presented for discussion purposes only. Medical device manufacturers and packagers should address any questions to their own packaging experts and have an independent obligation to ascertain and ensure compliance with all applicable laws, regulations, industry standards, and requirements, as well as their own internal requirements.

1. Compliance Program Guidance Manual 7382.830A, Part I, Paragraph 5.4, FDA.
2. Compliance Program Guidance Manual 7382.830A, Part III, Paragraph 8a, iii, Bullet 3, FDA.
3. ANSI/ASQC. Z1.4-1993, “Sampling Procedures and Tables for Inspection By Attributes,” Paragraph 4.3 (1993).
4. Compliance Program Guidance Manual 7382.830A, Part III, Paragraph 8a, FDA.
5. Process Validation Guidance, Paragraph 5.4, Global Harmonization Task Force, 1999.
6. WE Deming. The New Economics for Industry, Government and Education, (Massachusetts Institute of Technology, Center for Advanced Engineering Studies, 1993).
7. Process Validation Guidance, Introduction.
8. Process Validation Guidance, Paragraph 2.2.
9. Process Validation Guidance, Paragraph 2.3.

Copyright ©2004 Medical Device & Diagnostic Industry
