Software Risk Management for Medical Devices

January 1, 1999


Medical Device & Diagnostic Industry Magazine

An MD&DI January 1999 Column

SOFTWARE RISK MANAGEMENT


As more devices integrate software, early risk management is critical to ensure that the devices are trustworthy.

Medical devices combine many engineered technologies to deliver a diagnostic, monitoring, or therapeutic function. The number of device functions that depend on correctly operating software continues to increase. Project managers are now making software development and quality assurance the predominant portion of many development budgets. Even for a product with numerous mechanical or electronic elements, software can consume as much as 70% of a multimillion-dollar development budget. Even projects involving simple devices that have basic user interfaces and provide only straightforward therapy—such as the delivery of energy to the body—may allocate 40 to 50% of their budgets to software and software-related activities.

The growth of software in medical systems can be traced indirectly to the increased use of commercial off-the-shelf (COTS) software. Consistent with trends in other markets, this growth encompasses both the amount of software contained in a device and the key functions to which it is applied. As software becomes a more critical component in many devices, software risk management is becoming more important. Risk-management expectations now include application-specific software embedded in a device, COTS software used in the computing environment, and software-development engineering tools.

The basic principles of risk management are based on good engineering, common sense, and the ethic of safety. Standard, judgment-based techniques yield work products that are accepted by the engineering and regulatory communities. Patterns now exist that define common responses for certain types of software failures. This article considers the key concepts, the work products that result from analysis, and the management aspects that are necessary to achieve safe software.

BASIC RISK MANAGEMENT

Effective software risk management consists of three activities. First, developers must acknowledge that certain device risks can result from software faults. Second, developers must take appropriate actions to minimize the risks. Third, developers must demonstrate that the means taken to minimize the risks work as intended. Throughout these activities, the focus is on the potential for harm to the patient, the care provider, and the treatment environment.

Decision making about software failure risks centers on various forms of analysis. This work links a specific hazard to an envisioned software failure. If a significant hazard exists, the developer must minimize it by applying software or hardware technology or by making other modifications during the development process. Formal tests, which measure device performance when software failures are artificially forced, demonstrate that the mitigation works as intended. Analysis, safe design and implementation, and testing must all be applied fully to software to demonstrate that best practices have been followed. It is important that these activities be seen as linked, so that once risks are understood the development team is committed to a remedial process.

Software risk assessment as described in this article is directed toward the software contained within a medical device. Product risk is usually analyzed separately from the processes needed to understand and respond to the development risks inherent in software-based projects.1 However, project risk stemming from a flawed development process can lead to the introduction of flaws that reduce software safety. Project risk assessment is popular in the engineering press as a means of understanding threats to meeting software delivery goals. It is structured around the subjective evaluation of many parameters relating to the development process, the available tools, and the capabilities of the team; an honest acknowledgment of weaknesses in these areas can point to potential risks within the product software. Because process flaws and team weaknesses can lead to software faults, project risk analysis is strongly recommended as a way to minimize their effects.

Figure 1. Risk-exposure mapping.

A cornerstone to risk management is the notion of risk exposure. Exposure is defined as a function of the potential for loss and the size of the loss. The highest possible exposure arises when the loss potential and the size of the loss that might occur are both judged as high. Different risk-exposure levels arise with different values for loss probability and size. A two-dimensional plane that diagrams exposure is shown in Figure 1.

For example, assume that a design for a bottle to hold a volatile fluid has a high potential for leaking. Further, assume that should the fluid leak, a fire will start. Given this situation, the exposure is high, as illustrated by point A in Figure 1. The developer might choose to add a gasket to reduce the possibility of leaks, which would shift the risk exposure to point B. Alternatively, adding another compound might reduce the volatility of the fluid if it does leak; this would shift the exposure to point C. A combination of the two actions would shift the risk to point D, providing lower risk than either single solution.

This abstract example embodies two key points about applying risk management to software. First, the assessments of the factors that contribute to risk exposure are informed judgments from individuals who understand the failure mechanisms. Second, a variety of options implemented alone could reduce the risk exposure, but a combination of approaches often yields the best result. This exercise allows the developer to define the mitigation steps. Response patterns for certain software faults are becoming common—much as a gasket is the accepted solution for reducing the potential for a leak.

KEY CONCEPTS

The general concepts of hazard and risk analysis have been presented in previous articles.2,3 Applying general risk-management concepts to software requires adapting approaches originally developed for analyzing systems dominated by mechanical and electrical subsystems. As with many engineering areas, risk management is easier to enact if a foundation has been built on key concepts—in this case, concepts particular to software. These concepts are discussed below, along with a means of applying them together. Understanding them makes it possible to tailor risk-management techniques to a particular organization and enables a product manager to present the foundation for the risk-management plan before presenting the means of implementing it. It also implies that although the techniques presented here can be changed later, the end result must stand up to any challenge based on the fundamentals.

Safety Requirements. All medical devices must fulfill a set of operational requirements. These requirements include a subset focused on patient and provider safety. Most of these requirements are derived as a part of the initial engineering, including functional requirement needs analysis, architecture specification, initial risk analysis (IRA), and other processes used by the development team to define the initial concepts and operational requirements for the device.

Software risk analysis typically involves several processes that clarify the role of software in meeting the system safety requirements. Properly conducted, software risk analysis identifies how software failure can lead to compromised safety requirements and ultimately to patient or user hazards. Software risk analysis is applied at different levels of detail throughout product development. Therefore, this analysis supports the formulation of a systemwide risk analysis to understand how all aspects of the system support the safety specification.

Software risk analysis can identify the need for specific hardware or software devoted to supporting safety requirements. Such analysis can also pinpoint the need to modify the design or to reconfigure the software environment. Risk analysis is almost always applied to embedded software to understand its function as the primary safety-significant software. It can also be applied to design tools, compilers, automatic test software, and other supporting software that could indirectly affect system safety.

Software risk analysis assumes that the product software is organized into a hierarchical interconnection of functional building blocks. The execution of the code within a building block provides some function in support of the device requirements. A building block can be a subroutine, a function, or an object; a collection of functions, often called a module; or even a full subsystem, such as the operating system. The relationship of the building blocks—based on the way they interface and depend on one another—is also important. Although the concept of building blocks is an abstraction, this idea provides the structure needed to develop and understand the role of the software.

Trustworthiness. Software is expected to reliably perform a function. However, highly reliable software may not necessarily provide for the safe operation of the device. More importantly, software must be absolutely trustworthy. Trustworthiness hinges on the tenet that when the software fails, the device behavior must be predictable. Typically, when device software fails, the unit operation shifts so that the system is in a safe state. This enables the system to operate with the lowest risk to the patient, the operator, and the environment. Usually, but not always, a device is considered in a safe state when all electromechanical operation is stopped and an alarm system is activated. This standard may not be high enough for a device such as a pacemaker; for this type of device, it might make sense to diminish software control of the device and provide only electronic control. Another alternative is to transfer control to an independent control subsystem that has a separate processor and software. This can become more complicated if the safe state of the device is related to sequential or cyclic medical therapy. For example, a safe state for an intraaortic balloon pump depends on the balloon's inflation. Stopping the system and activating an alarm while the balloon is inflated in the vessel would not be a safe state, whereas if failure occurs when the balloon is deflated, then stopping and alarming would be the safe state.
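To make the idea of a predictable response concrete, the following minimal C sketch, not taken from the article, shows a routine that drives a device into the "stop and alarm" safe state described above. The fault labels and helper names (stop_all_actuators, activate_alarm) are illustrative assumptions, and, as noted above, stopping is not the correct safe state for every device.

```c
/* Minimal sketch (assumed design): drive the device to a "stop and alarm"
 * safe state when a software failure is detected. */
typedef enum { FAULT_WATCHDOG, FAULT_DATA_CORRUPT, FAULT_SEQUENCE } fault_t;

static volatile fault_t last_fault;            /* retained for later analysis */

static void stop_all_actuators(void) { /* halt pumps, motors, energy delivery */ }
static void activate_alarm(void)     { /* drive audible and visual alarms     */ }

/* Called from any point at which a software failure is detected. */
void enter_safe_state(fault_t cause)
{
    stop_all_actuators();   /* predictable behavior: therapy halts first      */
    activate_alarm();       /* the care provider is notified                  */
    last_fault = cause;

    for (;;) {              /* hold the safe state until a power cycle or an  */
        /* wait */          /* independent hardware reset intervenes          */
    }
}
```

In practice such a routine would be reachable from every failure-detection point in the software, including the mitigation mechanisms discussed later in this article.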

Because software development is a person-intensive, knowledge-based activity, highly reliable software is generally associated with increased attention and effort per line of code delivered for a particular use. Some correlation also exists between the maturity of the development process—including formal verification and validation—and the number of defects found in the resulting code. Highly reliable software, such as that being developed for the Federal Aviation Administration's new air-traffic control system, is estimated to cost $500 per delivered source line.4 Starting with a base of $100 per engineer-hour, development costs for medical device software written in C seldom exceed $90 per delivered source line, even in highly critical, life-sustaining devices. Medical device industry norms do not provide the level of funding necessary to develop and formally ensure highly reliable software, yet a device in which a software failure places the patient in jeopardy is simply considered poorly engineered. Given the economic realities of the medical device business, designers therefore usually apply their efforts to achieving trustworthiness rather than NASA-level reliability.

ESTABLISHING RISK INDEXES

An important part of risk analysis is understanding how critical an unsafe condition might be. A risk index is a derived value that depends on the probability and the severity of the hazard. In traditional risk analysis, values for key parameters are multiplied to yield a numeric risk index called criticality. This method, based on the military standard MIL-STD-1629A, is typically not used for analyzing software. Instead, using guidance from the more recent system safety standard, MIL-STD-882C, a table can be constructed that provides the risk index for each combination of qualitative assignments for occurrence probability and loss or hazard severity. A simple version is shown in Table I. Note that this table is similar to the two-dimensional risk illustration shown in Figure 1.

| Probability of Occurrence \ Hazard Severity/Loss | Minor | Moderate | Major |
| --- | --- | --- | --- |
| Improbable | Low | Low | Moderate |
| Remote | Low | Moderate | High |
| Occasional | Moderate | High | High |
| Reasonable | High | High | Very high |
| Frequent | High | Very high | Very high |

Table I. Example of a simple risk index.

The use of a risk-index table to look up an identified risk combination has proven to be quite useful. It is important to remember the following points when applying this method.

  • The risk-index table should be formally documented, including a description of the qualitative parameters for each occurrence and severity.

  • A development team or quality group may define its own table with different labels and values.

The identified risks will be important information for many on the development team. Since some individuals may join the team after the risk analyses have been completed, enough detail must be provided so they can easily understand the context for the risk judgments. The values and the appropriate actions can be developed so that management shares the decision responsibility for high-risk items. A separate table should describe the level of acceptability of each risk-index value (Table II). For example, each risk index can be tied to a specific hazard or loss of safety and to a cause. Because the cause might be linked to mitigation, documents that contain risk indexes must indicate whether the indexes were assigned before or after specification of the hazard mitigation. If assigned before mitigation, the risk index can be used to indicate the need for mitigation mechanisms. If assigned after mitigation, the risk index should show how well the cause-and-mitigation pairing reduces the loss.

| Risk-Index Value | Action |
| --- | --- |
| Very high | Unacceptable—requires (further) mitigation |
| High | Acceptable only with engineering and quality executive signoff |
| Moderate | Acceptable with project manager signoff |
| Low | Acceptable with no review |

Table II. Example of a risk-index value and action assignment table.
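Tables I and II lend themselves to direct encoding so that risk indexes and required actions are assigned consistently. The following C sketch is one possible encoding, assuming the qualitative labels used above; a team that documents its own table, as recommended, would substitute its own labels and values.

```c
/* Sketch of Tables I and II as lookup tables (labels are the example
 * categories above, not a prescribed set). */
typedef enum { SEV_MINOR, SEV_MODERATE, SEV_MAJOR } severity_t;
typedef enum { P_IMPROBABLE, P_REMOTE, P_OCCASIONAL, P_REASONABLE, P_FREQUENT } probability_t;
typedef enum { RISK_LOW, RISK_MODERATE, RISK_HIGH, RISK_VERY_HIGH } risk_index_t;

/* Table I: risk index for each probability/severity combination. */
static const risk_index_t risk_table[5][3] = {
    /*               Minor           Moderate        Major          */
    /* Improbable */ { RISK_LOW,      RISK_LOW,       RISK_MODERATE  },
    /* Remote     */ { RISK_LOW,      RISK_MODERATE,  RISK_HIGH      },
    /* Occasional */ { RISK_MODERATE, RISK_HIGH,      RISK_HIGH      },
    /* Reasonable */ { RISK_HIGH,     RISK_HIGH,      RISK_VERY_HIGH },
    /* Frequent   */ { RISK_HIGH,     RISK_VERY_HIGH, RISK_VERY_HIGH },
};

/* Table II: required action for each risk-index value. */
static const char *const action_table[] = {
    [RISK_LOW]       = "Acceptable with no review",
    [RISK_MODERATE]  = "Acceptable with project manager signoff",
    [RISK_HIGH]      = "Acceptable only with engineering and quality executive signoff",
    [RISK_VERY_HIGH] = "Unacceptable -- requires (further) mitigation",
};

risk_index_t risk_index(probability_t p, severity_t s) { return risk_table[p][s]; }
const char  *required_action(risk_index_t r)           { return action_table[r]; }
```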

SAFETY-SIGNIFICANT VARIABLES

Program execution typically involves setting and altering values. Many values have little effect on the device's meeting the system safety requirements. Some variables, such as the dosage rate for an infusion pump or how much energy a defibrillator should discharge, do relate directly to device safety. An operator can input such values directly through a device's front panel. Computed variables containing crucial control values also play a role in device safety. One example is the stepper motor rate for achieving a given pump dosage. Variables whose values affect device safety are termed safety-significant variables.

Traditional risk analysis includes determining the probability that the system will threaten humans. Analysis performed according to MIL-STD-1629A multiplies numeric ratings for occurrence, severity, and detectability. This process can confuse engineers new to software risk analysis, because software risk analysis as currently practiced for medical device development does not reliably support quantification at this level.

SAFETY-SPECIFIC SOFTWARE

Software risk analysis hinges on the idea that not all software is directly involved in meeting the device's safety requirements. Support for the safety requirements is spread unevenly among the software's building blocks. Modules that fulfill the safety requirements are typically termed safety-critical or safety-significant. For example, a module that contains an algorithm for controlling the energy level applied to a patient is much more safety-critical than one that performs a background housekeeping task. The engineering literature also describes safety-related software that forms a safety net to ensure safety when safety-critical software fails.

The concept that not all software within a device is safety-critical might be difficult to understand because, for simple devices, the source code is compiled into a block of executable machine instructions. Abstract boundaries do not apply to the monolithic block of machine instructions. It is easy to see that any software failure can eventually result in the failure of software responsible for safety requirements. This threat usually boils down to three points:

  • Other software can corrupt variables affecting safety performance, which means that safety-critical information must be maintained so that corruption can be detected.

  • Other software can cause execution threads to fail, resulting in the execution of code out of normal sequence. Well-engineered, safety-critical software ensures the proper sequence of critical code segments.

  • Other poorly engineered software can consume or mismanage computing resources such as processor bandwidth and working memory, rendering safety-critical software nonfunctional. As C++ has become more popular, engineers must address so-called memory leaks, which occur when execution threads exit without freeing memory resources. Safety-related software must be protected against memory leaks. One solution, sketched after this list, is to separate localized resource control from system control.
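The following C sketch illustrates the localized resource control mentioned in the last point. It is an assumed design, not one prescribed by the article: safety-critical functions draw working memory from a small, statically reserved pool, so a leak in general-purpose code cannot starve them.

```c
/* Sketch (assumed design): a private, statically allocated memory pool for
 * safety-critical functions.  Alignment handling is omitted for brevity. */
#include <stddef.h>
#include <stdint.h>

#define SAFETY_POOL_BYTES 1024u

static uint8_t safety_pool[SAFETY_POOL_BYTES];  /* reserved at link time  */
static size_t  safety_pool_used = 0;            /* simple bump allocator  */

/* Returns NULL rather than touching the general heap when the pool is full. */
void *safety_alloc(size_t bytes)
{
    if (bytes == 0 || bytes > SAFETY_POOL_BYTES - safety_pool_used)
        return NULL;
    void *p = &safety_pool[safety_pool_used];
    safety_pool_used += bytes;
    return p;
}

/* The pool is released as a whole at well-defined points (e.g., at the end
 * of a therapy cycle), removing the per-allocation bookkeeping in which
 * leaks typically originate. */
void safety_pool_reset(void) { safety_pool_used = 0; }
```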

MITIGATION

Risk management depends on the premise that software failure can result in increased risk, so developers must define approaches for reducing, or mitigating, that risk. This requires designers to couple potential software failures to mitigations. Such pairings are described below, ordered roughly from the least dependable to the most dependable means of reducing the risk posed by a software failure.

Inform the User. A potential risk might be paired with nothing more than informing the user. For example, "Failure: Information written to wrong area of screen buffer. Mitigation: Provide user documentation relating to expected display." Obviously, this particular mitigation is weak because it relies on activities outside the control of the development and quality teams. Developers must review the instructions to ensure that the screen layout provides a key to the information found in the different screen areas.

Hazards related to the display of information are particularly difficult to mitigate. Development teams often simply indicate that a trained care provider should be able to detect presentations that do not make sense or that contain distorted values. The quality of this mitigation depends both on the value of the screen information to the care provider and on how well the care provider is trained. When information is critical to therapy, some designs provide a means of reading the display buffer back to ensure the validity of the information. More dangerous is dead facing, in which the display is commanded blank while the device continues to administer therapy.
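A read-back check of the kind mentioned above might look like the following C sketch. The memory-mapped buffer address, its size, and the fault handler are hypothetical; the point is simply that the frame the software intended to display is compared with what the display hardware is actually presenting.

```c
/* Sketch (assumed addresses): compare the intended frame with the contents
 * of the display hardware's buffer. */
#include <stddef.h>
#include <stdint.h>

#define DISPLAY_BUFFER_ADDR 0x40000000u   /* hypothetical memory-mapped buffer */
#define DISPLAY_BUFFER_SIZE 1024u

static void display_fault(void) { /* stop therapy and alarm, omitted */ }

/* Any mismatch, including a dead-faced (blanked) display, is treated as a
 * fault. */
void verify_display(const uint8_t intended[DISPLAY_BUFFER_SIZE])
{
    const volatile uint8_t *shown = (const volatile uint8_t *)DISPLAY_BUFFER_ADDR;

    for (size_t i = 0; i < DISPLAY_BUFFER_SIZE; ++i) {
        if (shown[i] != intended[i]) {
            display_fault();
            return;
        }
    }
}
```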

Development Process. A typical pairing here would be "Failure: Flawed breathing-algorithm implementation. Mitigation: Independent review." This indicates that something exceptional will be done within the development process. To complete a process mitigation, it must be described in the software development plan and audited to ensure that the independent review did occur and, equally important, that any findings were subsequently acted on.

Software Mechanisms. A pairing that expresses a software problem might be presented as "Failure: Overwritten pump-speed variable. Mitigation: Variable redundantly stored, accessed, and changed by a single function." Adding such special software is common and is considered good practice because it enforces structured access and enables corruption detection on every access. These mechanisms can be weakened, however, if sloppy use—such as copying a critical variable into a locally scoped variable—is allowed. This reinforces the fact that software mechanisms might require support from the development process, such as code inspection, to detect violations and enforce usage rules.
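A minimal C sketch of this cause-and-mitigation pair follows. The variable names are illustrative; the essential features are the redundant (complemented) copy and the single pair of access functions, which together allow corruption by other software to be detected at the point of use.

```c
/* Sketch (illustrative names): the pump-speed variable is stored redundantly
 * and is read and written only through these two functions. */
#include <stdbool.h>
#include <stdint.h>

static uint16_t pump_speed;        /* primary copy                        */
static uint16_t pump_speed_check;  /* redundant copy, stored complemented */

void set_pump_speed(uint16_t rpm)
{
    pump_speed       = rpm;
    pump_speed_check = (uint16_t)~rpm;
}

/* Every fetch verifies that the two copies still agree; corruption by any
 * other software is detected at the point of use. */
bool get_pump_speed(uint16_t *rpm)
{
    if (pump_speed != (uint16_t)~pump_speed_check)
        return false;              /* caller must drive the device to a safe state */
    *rpm = pump_speed;
    return true;
}
```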

Hardware Mechanisms. An example of a failure that calls for hardware mitigation would be "Failure: Runaway execution thread. Mitigation: Hardware watchdog timer." Installing a separate hardware safety mechanism is considered good practice because the hardware relies on an independent technology to provide the device safety net. However, if the software fails to interface properly with the watchdog circuitry, and the start-up test fails to detect the malfunction, this particular hardware mitigation could be ineffective.
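For reference, servicing a watchdog usually amounts to a periodic write of a defined value to the timer's service register, as in the C sketch below. The register address and service value are hypothetical placeholders for whatever the chosen watchdog circuit's data sheet specifies.

```c
/* Sketch (hypothetical register and value): periodic watchdog service. */
#include <stdint.h>

#define WDT_SERVICE_REG (*(volatile uint32_t *)0x40010000u)  /* hypothetical */
#define WDT_SERVICE_KEY 0xA5A5A5A5u                          /* hypothetical */

/* Serviced once per main-loop pass.  If runaway software stops calling this,
 * the independent hardware timer expires and forces the hardware-defined
 * safe state.  A start-up test should deliberately let the timer expire once
 * to prove that the circuit and its reset path actually work. */
void service_watchdog(void)
{
    WDT_SERVICE_REG = WDT_SERVICE_KEY;
}
```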

Execution Diversity. This type of mitigation depends on the system's architecture having a safety supervisor. A safety supervisor employs a separate processor with its own software to ensure that the primary processor and software stream operate according to device safety requirements. It is common for this type of architecture, described in the literature for process control systems, to be found in off-the-shelf processor and software packages.5

In the European Union, the response speed of the mitigation mechanism is considered important for certain devices. For example, specifications for a syringe infusion pump require that the dosage rate be proportional to the motor speed; overdose is a common hazard for this type of device and can be caused by software runaway driving the motor at a high rate. A common form of mitigation is a watchdog timer. Depending on how the timer is implemented, the time elapsed from a software fault to an error, to detection, and finally to a safe state with the pump motor stopped could be too long to prevent a dangerous dose of a drug from being administered. Although a mitigation exists, it may not reduce the risk exposure if it is slower than the speed of the therapy. Lock-and-key software around the function that commands the motor speed would provide a faster mitigation and would shift the risk exposure to a safer region.

Lock-and-key software allows safety-critical functions to execute only when the calling software presents the proper key. Properly implemented, lock-and-key software also detects a jump into the middle of a function. An illegal entry at the start of the function is detected before the motor command is issued; should entry occur farther down the function, the exit check detects the illegal command within microseconds after the motor is commanded. Combining the watchdog and lock-and-key solutions provides the most protection.
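The following C sketch shows one way a lock-and-key guard on a motor-command function might be arranged, following the description above. The key values, helper names, and placement of the checks are illustrative assumptions: an illegal entry at the start of the function is caught before the motor is commanded, while a jump into the middle of the function is caught by the exit check immediately after the command, and the motor is stopped.

```c
/* Sketch (assumed keys and helpers): lock-and-key protection of a
 * motor-command function. */
#include <stdbool.h>
#include <stdint.h>

#define MOTOR_ENTRY_KEY 0x1357u
#define MOTOR_EXIT_KEY  0x2468u

static void write_motor_rate(uint16_t rate) { (void)rate; /* hardware access, omitted */ }
static void safe_stop(void)                 { /* stop and alarm, omitted */ }

bool command_motor(uint16_t entry_key, uint16_t rate)
{
    uint16_t lock = 0;

    if (entry_key != MOTOR_ENTRY_KEY) {  /* illegal entry at the start of the
                                            function: caught before any motor
                                            command is issued                 */
        safe_stop();
        return false;
    }
    lock = MOTOR_EXIT_KEY;               /* set only along the legal path     */

    /* ... rate limiting and reasonableness checks would go here ... */

    write_motor_rate(rate);              /* motor is commanded                */

    if (lock != MOTOR_EXIT_KEY) {        /* a jump into the middle of the
                                            function skipped the line above;
                                            the illegal command is caught
                                            within microseconds               */
        write_motor_rate(0);
        safe_stop();
        return false;
    }
    return true;
}
```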

CAUSE-AND-MITIGATION PATTERNS

One of the most recent paradigms to appear for software developers is the idea of patterns. Patterns are based on the observation that collections of software objects can be connected in certain ways to solve common problems. Although the idea is abstract, it reinforces the point that, at some level, the solutions to many problems take on a similar form.

The paradigm of patterns can be applied to streamline cause and mitigation pairing. Patterns for a limited set of software failures are listed in Table III. The list of mitigation mechanisms in the table is loosely ordered from more common to less common approaches. All patterns assume a solid development process that includes code inspection and verification testing as a baseline. The patterns represent only a starting point. The challenge is to specify additional patterns that might be unique to the product's architecture or implementation environment.

| Failure | Mitigation Mechanisms |
| --- | --- |
| Data/variable corruption | Redundant copies with validity checking and controlled access; CRC or checksum of storage space; reasonableness checks on fetch |
| Hardware-induced problems | Rigorous built-in self-test (BIST) at start-up; reasonableness checks; interleaved diagnostic software (see the software runaway and data corruption entries) |
| Software runaway; illegal function entry | Watchdog hardware; lock-and-key on entry and exit; bounds/reasonableness checking; execution-thread logging with independent checking |
| Memory leakage starves execution stream | Explicit code-inspection checklist and coding rules; memory-usage analysis; instrumented code under usage-stress analysis; local memory control for safety-critical functions |
| Flawed control value submitted to hardware | Independent read back with reasonableness check; hardware mechanism provides independent control/safe state; safety supervisor computer must agree to value |
| Flawed display of information | BIST with user review direction in user manual; read back with independent software check; separate display processor checks reasonableness |
| Overlapped illegal use of memory | Explicit inspection checklist item; coding rules on allocation and deallocation; special pointer-assignment rules |

Table III. Failure patterns and mitigation mechanisms.
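As an illustration of one pattern from Table III, the C sketch below applies the "CRC or checksum of storage space" mitigation to a block of safety-significant parameters. The block layout, field names, and choice of CRC-16/CCITT are assumptions made for the example.

```c
/* Sketch (assumed block layout): a safety-critical parameter block carries a
 * CRC that is recomputed and compared on every fetch. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t dose_rate_ml_per_hr;   /* example safety-significant values */
    uint16_t bolus_limit_ml;
    uint16_t crc;                   /* CRC-16/CCITT over the fields above */
} safety_params_t;

static uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; ++i) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int b = 0; b < 8; ++b)
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
    }
    return crc;
}

void seal_params(safety_params_t *p)
{
    p->crc = crc16_ccitt((const uint8_t *)p, offsetof(safety_params_t, crc));
}

/* Returns false if the storage space has been corrupted since seal_params(). */
bool params_intact(const safety_params_t *p)
{
    return p->crc == crc16_ccitt((const uint8_t *)p, offsetof(safety_params_t, crc));
}
```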

LINKING MITIGATION SOLUTIONS

In large projects involving a number of developers, there is great potential for some safety-related software not to be implemented and, as a result, for the mitigation functions not to be properly supported. Applying traces to the software is therefore becoming more important. Typically, a trace links a downstream activity to something that was determined earlier in the product development life cycle.

Traces are a critical part of software risk management. At a minimum, the demonstration of each mitigation must be linked to the tested cause-and-mitigation pair. A more conservative approach—as might be applied to software found in blood-bank devices and systems—is to link each cause-and-mitigation control and safety requirement to specific requirements in the product's software requirements specification.6 Linking safety requirements to the specific logic routines that accomplish them also ties the acknowledgment of a risk to its mitigation and to its demonstration.
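Even a simple data structure can make such traces explicit. The C sketch below is a hypothetical trace record linking a hazard, its envisioned software cause, the chosen mitigation, the corresponding software requirement, and the verification test that demonstrates the mitigation; all identifiers shown are invented for illustration.

```c
/* Sketch (invented identifiers): one record per cause-and-mitigation pair,
 * traced to the SRS and to the demonstrating test. */
typedef struct {
    const char *hazard;        /* loss of safety being addressed           */
    const char *cause;         /* envisioned software failure              */
    const char *mitigation;    /* mechanism or process response            */
    const char *srs_req;       /* requirement in the SRS                   */
    const char *verif_test;    /* formal test demonstrating the mitigation */
} trace_record_t;

static const trace_record_t trace_table[] = {
    { "Overinfusion",
      "Overwritten pump-speed variable",
      "Redundant storage; single access function",
      "SRS-SAFE-012",          /* hypothetical identifiers */
      "VT-112" },
};
```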
