Originally Published MDDI March 2004
Failure modes and effects analysis can be a helpful tool in risk management for medical devices, but it has several inherent traps that should be recognized and avoided.
Mike W. Schmidt
In 2000, ISO published the first standard for medical devices that takes a broad approach to identifying, evaluating, and mitigating risk: ISO 14971. In its class, this standard is unique. Unlike its predecessors (such as EN 1441), it does not look only at the identification, analysis, and control of the risks associated with a medical device. Rather, it adds significant detail to that process and extends it to the full life cycle of the device. In other words, ISO 14971 provides a comprehensive approach to reducing risk to the lowest reasonable level.
In the United States, the standard has been recognized by FDA; in Europe, it will replace EN 1441 in April of this year, when the older standard is withdrawn. Compliance with ISO 14971 will therefore be crucial not only in assuring the safety of medical equipment, but in meeting regulatory requirements as well.
While the new standard is much broader, many of its requirements are similar to those in standards such as EN 1441. The most fundamental of these are to analyze, evaluate, and control each risk. Within the medical device industry, by far the most common tool for documenting these processes is an adaptation of failure modes and effects analysis (FMEA) or its close variant, failure modes, effects, and criticality analysis (FMECA). For the purposes of this article, the term FMEA encompasses both.
It has been estimated that roughly 80% of manufacturers use some form of FMEA for risk analysis, evaluation, and control. While this approach can be effective, there are several inherent traps that can reduce the effectiveness of the risk management process. This article will attempt to identify those traps and offer ways to overcome them.
Risk Management Basics
Before going into the specifics of using FMEA, a brief review of the risk analysis phase of risk management is in order.
In analyzing risk, the first step is to identify all hazards and harms associated with the device based on its characteristics and intended use. Why distinguish between hazard and harm? Because while a hazard is a potential source of harm, many hazards (such as electrical, mechanical, or thermal energy) result in multiple forms of harm. It is in fact the harm that we are addressing in the risk analysis process. Sometimes, of course, a given hazard may be linked with a single harm. In this case, the two terms can be (and frequently are) used interchangeably.
Once all hazards and harms have been identified, the analysis process is completed by estimating the likelihood that the harm will occur and, in the event that it does, the severity of the resulting damage. Combining likelihood and severity (either graphically or mathematically) results in an expression of the risk associated with the hazard.
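The graphical form of this combination is often a risk matrix that maps severity and likelihood categories to a qualitative risk level. The sketch below illustrates the idea; the 3x3 matrix, category names, and risk labels are illustrative assumptions of mine, since ISO 14971 leaves the choice of scales to the manufacturer.

```python
# Illustrative sketch only: the categories and matrix entries below are
# assumptions, not values taken from ISO 14971, which leaves scale
# definitions to the manufacturer.

SEVERITY = {"negligible": 0, "moderate": 1, "critical": 2}
LIKELIHOOD = {"improbable": 0, "occasional": 1, "frequent": 2}

# Rows indexed by likelihood, columns by severity.
RISK_MATRIX = [
    ["acceptable", "acceptable", "investigate"],
    ["acceptable", "investigate", "unacceptable"],
    ["investigate", "unacceptable", "unacceptable"],
]

def risk_level(severity: str, likelihood: str) -> str:
    """Look up the qualitative risk for one hazard/harm pair."""
    return RISK_MATRIX[LIKELIHOOD[likelihood]][SEVERITY[severity]]

# A severe harm with very low likelihood still warrants evaluation,
# but is not automatically unacceptable.
print(risk_level("critical", "improbable"))
```

The mathematical alternative, multiplying numeric severity and likelihood values, is the basis of the RPN approach discussed later in this article.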
Following this analysis, the risk is evaluated. Is it necessary to reduce the risk? Or is it inherently acceptable? Where the risk is not considered acceptable, specific actions, or mitigations, are identified to reduce, or control, the risk.
After putting these controls in place, a new value for risk is established for the hazard or harm. The mitigation is then evaluated to determine whether any new hazards or harms have been created. Then the evaluation and, if necessary, control processes are repeated until the risk is found to be acceptable.
While the description above is only a brief overview of the process, it does establish a context for the following discussion of the use of FMEA.
FMEA and Risk
Where should one look for guidance on using FMEA and FMECA to manage medical device risk? Among the first sources one should consider are ISO and IEC standards. These standards frequently carry a presumption of compliance with device safety regulations in most developed countries.
In the ISO and IEC catalogs, only one standard, IEC 60812, addresses the subject. Titled Analysis techniques for system reliability—Procedure for failure modes and effects analysis (FMEA), it was published in 1985.
As its title indicates, this standard does not directly address the issue of using FMEA as a tool for managing risk. It does, however, provide insight into the general use of FMEA.
The first characteristic of traditional FMEA that complicates its use in risk management is right in the title: failure modes. It is certainly true that many risks associated with medical devices are in fact created by failures (such as the “single faults” identified in IEC 60601-1). But medical devices have many risks associated with their use under normal conditions and as intended by the manufacturer.
Many medical devices derive their clinical benefit precisely by doing controlled harm. A scalpel that cannot cut tissue might be considered extremely safe—but is useless for surgery. This is a crucial point, since both ISO 14971 and EN 1441 require that these inherent risks be analyzed, evaluated, and reduced as far as is reasonably possible. It is not uncommon for risk management processes based on FMEA to lose sight of this fact, and to focus only on failures of the equipment or those using it. Such implementations of risk management are incomplete and do not comply with either standard.
Another characteristic of FMEA that must be carefully scrutinized is found in clause 2.2.4 of IEC 60812:
FMEA is extremely efficient when it is applied to the analysis of elements which cause a failure of the entire system.
However, FMEA may be very difficult and tedious for the case of complex systems which have multiple functions consisting of a number of components. This is because of the quantity of detailed system information which must be considered. This difficulty can be increased by the number of possible operating modes, as well as by including consideration of repair and maintenance policies.
In the medical device industry, not just devices but also the environment in which they are used have become extremely complex. Moreover, the circumstances in which they are used have nearly unlimited permutations and combinations. To properly perform risk analysis per EN 1441 or risk management per ISO 14971, all of these combinations must be evaluated. Doing so correctly using FMEA techniques as defined in the IEC standard can be daunting and, in the end, inefficient.
Fault Tree Analysis
One way to overcome these difficulties is to use fault tree analysis to focus the FMEA on the components and subassemblies that can actually result in hazards. A true FMEA would evaluate each component's failure modes to determine whether they would result in a hazard.
By contrast, fault tree analysis begins by looking at the equipment and its interface with its expected operating environment to determine what harm can occur. It then traces those harms back to all possible sources, including component or subsystem failures and harms that arise from the use of the device or environmental effects. FMEA is then applied only to those elements of the design that could result in hazards.
The ideal application of these two techniques would involve evaluating all components using FMEA and fault tree analysis to trace all hazards back to the component level, thereby validating the outcome of each against the other. But doing so can be time- and resource-consuming. By using fault tree analysis to direct FMEA efforts, those resources are applied most efficiently.
Detectability and Risk
In applying FMEA to risk management, some manufacturers use the concept of detectability to generate an initial risk priority number (RPN). This troubling practice is not found in IEC 60812. It comes not from design FMEA techniques but from the use of FMEA to evaluate manufacturing processes.
As defined in ISO 14971, RPN involves numeric techniques to represent the relative severity of risk. The value to be given to the severity of each risk is determined by assigning a value indicating the significance of the harm that would occur. This number is multiplied by a value assigned to the probability that the harm will occur. (Risk as defined in the standard is the product of severity and likelihood of occurrence.) This process is virtually identical to the one described for device FMEA in IEC 60812.
However, process FMEA introduces a third term into the calculation. During manufacture, when a defect that could result in harm is detected, action can be taken to either repair the defect immediately or impound the product until it is repaired. In these circumstances, the use of detectability to calculate the RPN is completely appropriate: the time lag between detection during manufacture and actual use, where the harm typically occurs, is substantial.
However, detection of a hazard during use of the device may not assure that the harm will be avoided. An example of how detection can be virtually irrelevant to preventing harm would be as follows: The pin is pulled from a hand grenade with a 10-second fuse. After waiting eight seconds, the grenade is tossed into the room. It is detected, and two seconds later everyone in the room is dead. Detection was, in fact, irrelevant to preventing the harm.
While the example is extreme, it shows that considering detectability as equivalent to severity and probability in determining the base RPN value is inappropriate when use is involved.
Detection is in fact a mitigation of risk. It reduces the likelihood that the harm will occur. Therefore, its value in preventing the harm must reflect several aspects of the circumstances under which the hazard is detected. The first is the amount of time available to take action. The second is whether those present will have the presence of mind to recognize what is happening and take appropriate action. Finally, the knowledge and training of those present will determine whether they know what action must be taken to avoid the harm.
These significant factors (and there may be others) may certainly be considered during the determination of a value for detectability. But without specific instructions on how these factors are to be evaluated in determining that value, consistency will suffer.
In addition, the evaluation of each factor and the underlying assumptions must be documented for each hazard. Otherwise, the value will be virtually meaningless when the risk analysis is reviewed and updated throughout the product's life cycle (a critical element of risk management as defined in ISO 14971). How, then, can detectability be built into the evaluation of risks without compromising the analysis?
Ideally, detectability becomes a mitigation that reduces the RPN (generated by severity and likelihood only), just like inherently safe design, guarding, or warnings. By identifying detection and the necessary action to avoid the harm as one mitigating factor, the elements of time, presence of mind, and knowledge will be evaluated and the assumptions validated.
This ideal approach would ensure that the evaluations are consistent and that the results and validations are documented. The documentation will then be available when design changes are made, so that the changes do not inadvertently negate the effects of detection. It also allows the assumptions made to be reviewed, should field data cast doubt on the original results of the risk analysis. Unfortunately, the ideal is not always practical. In an organization that has been using detectability in calculating the RPN for risks, resistance caused by the perception that detectability is being taken away can be formidable.
I was working with a device manufacturer recently in an attempt to bring its risk management process into full compliance with ISO 14971. While meeting with design engineering personnel to understand their current process (which used severity, likelihood, and detectability to calculate the RPN for each risk), I was told of a major disadvantage to using detectability: they often encountered hazards that were in fact undetectable.
For purposes of this example, we will look at a shock hazard presented by an unearthed piece of metal on the outside of the device with insulated wiring behind it (carrying a hazardous voltage). We will say that the severity scale used is 1 to 10, with 10 being death. The likelihood scale is the same, with 10 being a certainty of occurrence (probability = 1). Finally, detectability will be assigned a scale of 1 to 4, with 1 being completely detectable and 4 being undetectable.
The potential severity of the electric shock in our example is a 10, because the voltage could result in fibrillation. However, because robust insulation is used (double insulation as defined in IEC 60601-1), the likelihood is extremely low, so we will give the likelihood a 1. But if the insulation is broken and the unearthed metal is energized, there is no way to detect the condition until someone touches it and is injured. Therefore detectability is set at 4. The resulting RPN (10 × 1 × 4) is 40.
Unfortunately, the threshold number for mitigation is 30. This means that mitigating action must be taken, even though we have already established that the likelihood is so low that no action should be required. And if detectability had not been included in the calculation, no action would have been required. When we suggested eliminating detectability from the equation, the designers were relieved.
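The shock-hazard example can be worked through numerically. The scales (severity 1–10, likelihood 1–10, detectability 1–4) and the mitigation threshold of 30 are the ones described above; this minimal sketch simply shows how including the undetectability term pushes an otherwise acceptable risk over the threshold.

```python
# Values from the shock-hazard example in the text.
SEVERITY = 10       # electric shock could cause fibrillation (death)
LIKELIHOOD = 1      # double insulation per IEC 60601-1 makes the fault very unlikely
DETECTABILITY = 4   # an energized, unearthed panel is undetectable until touched
THRESHOLD = 30      # an RPN above this value forces mitigating action

rpn_with_detectability = SEVERITY * LIKELIHOOD * DETECTABILITY
rpn_without = SEVERITY * LIKELIHOOD

# With detectability: 40, which exceeds the threshold of 30.
print(rpn_with_detectability, rpn_with_detectability > THRESHOLD)
# Without detectability: 10, well below the threshold.
print(rpn_without, rpn_without > THRESHOLD)
```

The arithmetic makes the designers' complaint concrete: the undetectability factor alone quadruples the RPN of a hazard whose likelihood has already been driven to the minimum.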
For organizations with cultural resistance to eliminating detectability, there are alternative ways to address concerns about detectability while allowing it to be used in calculating the RPN. The first way is to require that the assumptions behind the value assigned to detectability be documented in writing. The assumptions are then referenced adjacent to the detectability value. To save time, it is reasonable to require the documentation only in those cases where the value assigned to detectability reduces or eliminates the need to further mitigate the risk.
The second way is to combine detectability and probability into a single number. The effect of detectability on risk levels is to reduce the likelihood that harm will occur. Therefore, it makes some sense to simply combine the two.
This was the approach ultimately taken by the manufacturer I mentioned earlier. We included the concept of detection in the scale for likelihood, resulting in a scale of 1 to 40 for the numbers used in the example.
To acknowledge the role of presence of mind in detection, the impact of detection on the likelihood value was made variable. In short, detection is given no credit at the lowest likelihood values. The reasoning is that users of the equipment will be unfamiliar with infrequent events and therefore unlikely to remember what action to take. Even if they did remember, they may well be confused enough to lack the presence of mind to act.
As the likelihood of events increases, detectability may be considered as a factor in adjusting the assigned likelihood value. In this case, detectability will be a significant factor for events likely to occur frequently. Effectively, this approach puts detectability onto a sliding scale relative to likelihood.
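The sliding-scale idea can be sketched as a simple adjustment function. The breakpoints and reduction amounts below are hypothetical; the article gives only the qualitative rule that detection earns no credit for rare events and significant credit for frequent ones.

```python
# Hypothetical sketch of a sliding-scale likelihood adjustment. The
# breakpoints (3, 6) and reductions (0, 1, 3) are illustrative
# assumptions, not values from the manufacturer described in the text.

def adjusted_likelihood(likelihood: int, detectable: bool) -> int:
    """Reduce a raw likelihood value (1-10) to credit detection, but
    only when the event is frequent enough for users to recognize it."""
    if not detectable or likelihood <= 3:
        # Rare events: users won't recognize them, so detection gets no credit.
        return likelihood
    if likelihood <= 6:
        # Occasional events: modest credit for detection.
        return likelihood - 1
    # Frequent events: users know the event and the response; full credit.
    return likelihood - 3

print(adjusted_likelihood(2, True))   # rare event: no credit for detection
print(adjusted_likelihood(9, True))   # frequent event: significant credit
```

The adjusted likelihood then multiplies severity in the usual way, so detection never appears as an independent third factor in the RPN.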
There is nothing inappropriate about factoring the detectability of an event that could result in harm into the estimation of risk associated with the hazard. In fact, detectability can be a significant factor as long as the three cardinal factors of detectability are considered and documented:
• Is there enough time to react after detection?
• Is information provided to the user to indicate specific actions and their sequence to avoid the harm?
• Will the user have the presence of mind to remember what is to be done and take action?
If these factors are considered each time detectability is used, and the results of those considerations are documented, compliance with the intent of ISO 14971 is assured. Thus, should the risk analysis ever become part of litigation, there should be no embarrassing moments on the witness stand for those who performed the analysis when explaining it.
Copyright ©2004 Medical Device & Diagnostic Industry