Developing Safety-Conscious Software for Medical Devices
January 1, 2003
Originally Published MDDI January 2003
SOFTWARE SAFETY
To protect both the user and patient, medical device developers must pay strict attention to the safety of a device's software. Risk-mitigated software design is crucial.
Timothy Cuff and Steven Nelson
In the medical device industry, the software used to control a device takes on an additional role: it must help ensure the safety of the user and patient. This important requirement is not particularly easy to meet, however. Challenges to safe software implementation include microprocessor-oriented controls architecture, limited higher-order language support for the microprocessor (often C language only, or limited C implementation), the limited functionality of the medical device, and pressure to keep product costs down.
Consequently, the software design phase must include a deliberate and rigorous risk-mitigation process. In this context, the software's design should be a natural by-product of risk analysis and mitigation. Software design must incorporate risk-mitigation strategies from the outset of the project, while simultaneously addressing potential device failures introduced by the software itself.
Risk Mitigation: The Basics
Risk-mitigation techniques should be incorporated into the development of any engineered product. A design team has an ethical responsibility to make sure that any potential safety hazards are balanced by expected product benefits. The good news is that properly applying risk-mitigation techniques is more than just another burden borne by the project team. When employed intelligently, risk mitigation helps the team design a better product using fewer development resources. By identifying potential risk early in the design process, the team will spend less time and money solving problems than if it had waited.
Tailored Risk Mitigation. A preferred risk-mitigation approach is one that recognizes the different levels of evaluation necessary for each product. The overall process is the same for most projects, but the degree to which the analysis is conducted varies with the degree of potential danger the product presents. The required degree of analysis is determined through an agreement between the client and the project team, and is ultimately driven by the results of the risk-mitigation process itself. The tailored risk-mitigation approach is a process-driven, ongoing effort, updated at appropriate points in the project to ensure the design team is following the correct course.
Product Hazard Analysis. The first line of risk mitigation is product hazard analysis (PHA). Starting with such general potential hazards as user injury or product failure, PHA identifies failures that could cause a hazard and attaches a likelihood value to each. The goal is to identify and rank specific failures based on their likelihood values, so that the design team can appropriately focus on those requiring attention. PHA starts early in the design life cycle, roughly concurrent with the first attempt at product specification. Any significant risks identified by PHA are used to generate product requirements and specifications used by the design team to develop the product. The PHA document is updated as necessary throughout the early phases of the development process.
Failure Mode and Effects Analysis. Once the design has reached the point where key systems are defined and the product is ready for more-detailed design, a second risk-analysis technique comes into play: failure mode and effects analysis (FMEA). Because sufficient information about how the product works is available at this point, the team can examine each design element to define what might occur should it fail. This bottom-up approach complements the top-down PHA to provide the project team with a comprehensive examination of the product's potential safety-related issues. As with PHA, FMEA can generate product specifications to guide the development team.
Software Safety Assessment. The software safety assessment (SSA) is a subset of the product risk analysis. It can be included as part of the PHA and FMEA or isolated as a separate document. Generally, a product with a significant software component will have a separate SSA and associated software requirements specification. The SSA addresses those aspects of the product in which software can pose a potential hazard or mitigate the effects of such a hazard. Often, the SSA affects decisions about hardware and software architecture.
Testing. After the results of the risk-mitigation analyses have been used to define the product specifications, the resulting product must be tested to determine how well the design team has met those specifications. Qualification tests are linked to each of the design specifications to demonstrate the ability of the product to meet them. For complex software, the testing can extend beyond simple tests that define a series of steps and an expected outcome. More-subjective examinations, such as code reviews, are often necessary to adequately determine the ability of the product to meet its software specifications. In such cases, it is important to make sure that staff members who were not on the development team are brought into this part of the review. These new team members might come from internal staff working on other teams, the client company, or other outside sources.
Risk assessment and mitigation are a significant part of the development of any medical device and its software. Only by adopting a comprehensive approach and adhering to its requirements can a safe product be developed.
Software Design Guidance
As the system architecture develops, the proper instrumentation and sensor suite is needed so that the hardware and software complement one another. This approach reduces risk. Without proper instrumentation, the software has no mechanism for detecting system behavior; the program proceeds with control based only on intent and inference.
Software produces commands for hardware. After hardware has been commanded, the software infers that the hardware is behaving as expected. If a hardware element can contribute to a hazard, then feedback is necessary to explicitly monitor the commanded behavior. In the absence of so-called real-world feedback, control software must infer the status of hardware—it cannot do otherwise. This inference of operation tends to increase the risk of operational hazards.
Sensors overcome this shortcoming. They allow a device's software to tap into the "real world" and receive information and feedback. Through the use of sensors, safe software detects whether its operation conforms to reality, instead of being deceived by spurious sensor inputs or simply "dreaming" (inferring) that correct operation is occurring. These circumstances can be likened to doctors performing surgery: one is fully informed of all conditions, while the others are misled or simply dreaming. If the software detects deception or dreaming, it should always indicate a fault. One option is to cease operation upon fault detection, but this is not feasible in applications that require degraded but continuing operation.
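As an illustration of explicit feedback, the following minimal sketch (in C) commands a hypothetical pump and then confirms the result against a flow sensor rather than inferring it. The function names, tolerance, and fault handler are assumptions for the sketch, not part of any particular device.

```c
/* Minimal sketch: verify commanded hardware behavior against sensor
 * feedback instead of inferring it. Names and thresholds are
 * hypothetical. */
#include <stdlib.h>

#define FLOW_TOLERANCE_ML_MIN 5   /* allowable error, hypothetical */

/* Hardware-access stubs; real implementations are device-specific. */
extern void pump_set_flow_ml_min(int commanded_flow);
extern int  flow_sensor_read_ml_min(void);
extern void enter_safe_state(void);

void command_and_verify_flow(int commanded_flow_ml_min)
{
    pump_set_flow_ml_min(commanded_flow_ml_min);

    /* Read back the real-world result through the sensor suite. */
    int measured = flow_sensor_read_ml_min();

    /* If reality disagrees with the command, indicate a fault rather
     * than "dreaming" that the pump is behaving as intended. */
    if (abs(measured - commanded_flow_ml_min) > FLOW_TOLERANCE_ML_MIN) {
        enter_safe_state();
    }
}
```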
An insidious example of operation inference occurs when the application must measure time. Hardware (and people) always experience time, but software knows about the passage of time only when some "tick" event occurs, say, every millisecond. If the hardware or software hook responsible for producing that tick is absent, late, or of an incorrect interval, then the software does not correspond with reality; the logic does not realize that too much time, too little time, or even any time has passed. Consequently, any time-based calculations would be incorrect and, therefore, a potential cause of user or patient harm. Two possible approaches to avoiding this type of problem are deliberate comparison of clock values within the code and the use of multiple independently operated clocks, since it is unlikely that both clocks will fail at once.
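A minimal sketch of the clock-comparison approach follows. It assumes a millisecond tick counter maintained by a timer interrupt and an independent real-time clock; the names and the drift limit are hypothetical.

```c
/* Minimal sketch: cross-check two independently driven time sources.
 * tick_count_ms is assumed to be incremented by a 1 ms timer interrupt;
 * rtc_read_milliseconds() is assumed to read an independent clock.
 * Both names and the drift limit are hypothetical. */
#include <stdint.h>

#define MAX_DRIFT_MS 50u   /* allowed disagreement, hypothetical */

extern volatile uint32_t tick_count_ms;       /* from timer ISR */
extern uint32_t rtc_read_milliseconds(void);  /* independent clock */
extern void enter_safe_state(void);

void check_time_base(void)
{
    uint32_t isr_time = tick_count_ms;
    uint32_t rtc_time = rtc_read_milliseconds();
    uint32_t drift = (isr_time > rtc_time) ? (isr_time - rtc_time)
                                           : (rtc_time - isr_time);

    /* If the tick source is absent, late, or running at the wrong
     * interval, the two clocks diverge and a fault is declared. */
    if (drift > MAX_DRIFT_MS) {
        enter_safe_state();
    }
}
```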
Installing a Watchdog Timer. A watchdog timer is a device that directs the microprocessor and hardware operation to a known safe state in the event of an outright software failure. Depending on the specific microprocessor operation, software failure manifests itself in different ways. For example, it might appear as runaway tasks, as interrupts enabled with no foreground task running, or as a total halt. Regardless of the software failure mode, real-time hardware control ceases in the event of a software failure.
A watchdog timer has two basic elements: a reset mechanism and a reset block. Typically, a time-out value is specified for the watchdog, and specific hardware circuitry counts down this value unless it is reset. Should the watchdog detect a timeout, the processor is reset or directed to a fail-safe state. Application code periodically "pets" the watchdog to restart the countdown and block the timeout.
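The following is a minimal sketch of petting a watchdog from a main loop. The register addresses, reset key, and time-out value are hypothetical placeholders; real watchdog hardware is device-specific.

```c
/* Minimal sketch of watchdog use, assuming memory-mapped registers at
 * hypothetical addresses; real register names, addresses, and the
 * "pet" key value vary by device. */
#include <stdint.h>

#define WDT_TIMEOUT_REG (*(volatile uint16_t *)0x4000u) /* hypothetical */
#define WDT_RESET_REG   (*(volatile uint16_t *)0x4002u) /* hypothetical */
#define WDT_RESET_KEY   0xA55Au                         /* hypothetical */

extern void run_control_cycle(void);

void main_loop(void)
{
    WDT_TIMEOUT_REG = 100u;  /* time-out in ms, counted down in hardware */

    for (;;) {
        run_control_cycle();

        /* "Pet" the watchdog each pass through the main loop; if the
         * software hangs or runs away, the count expires and the
         * hardware forces the processor to its fail-safe state. */
        WDT_RESET_REG = WDT_RESET_KEY;
    }
}
```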
Planning for Deliberate Programming Practices. Although not driven directly by the hazard analysis, some software disciplines, such as object-oriented techniques, can improve the implementation.
Whether fully supported by a programming language or not, the use of "classes" or "objects" will aid software implementation. A class is a software component that functions as a black box. The class performs a specific set of functions, and other software elements (clients) make use of these functions. The implementation details of the class are solely within the class itself. Clients of the class may only employ the functionality by using the interface structure of the class.
Data hiding and private functions are two techniques that implement these features. An object-oriented implementation will force a software developer to create well-defined and strongly typed interfaces between software components. Data hiding is a technique that limits access to data, and private functions hide the implementation details within the object itself. Objects can be altered only through their public functions and data. These interfaces limit what effect objects are allowed to have on each other.
The design intention is to limit code interdependencies. When software modules are overly interdependent, the design intent can be difficult to understand. When code is difficult to understand, it becomes difficult to develop and maintain—with a potential side effect of the code not operating as intended.
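The sketch below shows one common way to approximate a class in plain C through an opaque type: clients see only the public interface, while the data layout and implementation details stay private to the source file. The names are hypothetical.

```c
/* Minimal sketch of data hiding in plain C, in the spirit of a class:
 * clients use only the public interface; the structure layout stays
 * private to the implementation file. Names are hypothetical. */

/* ---- dose_counter.h (public interface) ---- */
typedef struct dose_counter dose_counter_t;   /* opaque to clients */

dose_counter_t *dose_counter_create(void);
int  dose_counter_add(dose_counter_t *dc, int doses);
int  dose_counter_total(const dose_counter_t *dc);

/* ---- dose_counter.c (private implementation) ---- */
#include <stdlib.h>

struct dose_counter {
    int total;        /* hidden; clients cannot touch it directly */
};

dose_counter_t *dose_counter_create(void)
{
    return calloc(1, sizeof(dose_counter_t));   /* NULL on failure */
}

int dose_counter_add(dose_counter_t *dc, int doses)
{
    if (dc == NULL || doses < 0) {
        return -1;    /* the interface enforces valid use */
    }
    dc->total += doses;
    return 0;
}

int dose_counter_total(const dose_counter_t *dc)
{
    return (dc != NULL) ? dc->total : 0;
}
```

Because clients hold only a pointer to the opaque type, the counter can be altered solely through its public functions, limiting the effect modules can have on one another.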
Establishing Low Cyclomatic Complexity and Modularity. Cyclomatic complexity measures the level of difficulty inherent in understanding and verifying a software component's design or implementation. The degree of complication is determined by such factors as the number and intricacy of interfaces, the number and intricacy of conditional branches, the degree of nesting, and the types of data structures present.
Overly complex software components cannot be fully tested. These modules are prone to unintended operation, since the full suite of functionality cannot be verified. Future maintenance of such software components carries a high degree of risk.
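As a small illustration of keeping cyclomatic complexity low, the sketch below splits a readiness check into single-purpose helpers, each with one decision; the conditions and limits are hypothetical.

```c
/* Minimal sketch: reducing cyclomatic complexity by splitting one
 * decision-heavy routine into small, individually testable checks.
 * The conditions and limits are hypothetical. */
#include <stdbool.h>
#include <stdint.h>

#define MAX_PRESSURE_KPA  300u
#define MIN_BATTERY_MV   3100u

static bool pressure_ok(uint16_t pressure_kpa)
{
    return pressure_kpa <= MAX_PRESSURE_KPA;
}

static bool battery_ok(uint16_t battery_mv)
{
    return battery_mv >= MIN_BATTERY_MV;
}

/* Each helper has a single decision, so every path is easy to verify;
 * the caller reads as a plain list of conditions rather than a web of
 * nested if statements. */
bool system_ready(uint16_t pressure_kpa, uint16_t battery_mv)
{
    return pressure_ok(pressure_kpa) && battery_ok(battery_mv);
}
```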
Ensuring Data Integrity. Data integrity refers to whether numbers stored in variables are altered only as intended by the programmer.
Variables can have an effect on patient safety, and for that reason must be guarded against unintended changes. State variables are an example of one type of safety-critical data. Variables can be unintentionally altered through the physical medium (either RAM or EEPROM, electrically erasable programmable read-only memory), stuck bits (which occur when a binary digit is stuck on or off), or by other means, such as overstepping array bounds or a wandering stack pointer. The software designer needs to distinguish between the improbable and the impossible. Many of these occurrences seem unlikely, perhaps even highly improbable, but they are not impossible.
The software architecture bears the burden of explicitly monitoring safety-critical data to ensure that the data has not been corrupted. To this end, a deliberate read-and-write strategy is required. The read component of the strategy must provide an explicit means for detecting data corruption. The write component must provide a repeatable and consistent method that complements the corruption-detection scheme. In the interest of modularity, these features can be encapsulated into a single class.
Several strategies can be called upon to monitor data corruption:
Storing a backup or complementary form of each safety-critical variable and checking the data against it on every pass through the application's main loop.
Making an inverted backup copy, which is similar to a standard backup copy except that a 1's complement is stored as the backup (a sketch of this approach follows the list).
Using a cyclic redundancy code (CRC) or checksum for stable variable sets. (In EEPROM/Flash, safety-critical variables are also compared to a checksum stored in EEPROM.) The CRC is by far the highest-fidelity method; however, CRC usage is not without risk. To lessen risk, the CRC must be sized appropriately for the data it will monitor. The CRC also carries a significant computational overhead that may push the limits of processing in the microprocessor environment.
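The following minimal sketch illustrates the inverted-backup-copy strategy with an explicit read-and-write pair: writes maintain a 1's-complement shadow copy, and reads verify it before the value is used. The type name and fault handler are hypothetical.

```c
/* Minimal sketch of the inverted-backup-copy strategy: each
 * safety-critical variable is stored alongside its 1's complement,
 * and every read verifies the pair. Names are hypothetical. */
#include <stdint.h>

typedef struct {
    uint16_t value;
    uint16_t inverted;   /* always maintained as ~value */
} safe_u16_t;

extern void enter_safe_state(void);

/* Write strategy: update both copies together. */
void safe_u16_write(safe_u16_t *var, uint16_t new_value)
{
    var->value    = new_value;
    var->inverted = (uint16_t)~new_value;
}

/* Read strategy: detect corruption before the value is used. */
uint16_t safe_u16_read(const safe_u16_t *var)
{
    if (var->value != (uint16_t)~var->inverted) {
        enter_safe_state();   /* stuck bit or overwrite detected */
    }
    return var->value;
}
```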
Ensuring Calculation and Algorithm Integrity. In this context, algorithms and calculations are used to convert a real-world representation to a machine-oriented representation. For example, if a stepper motor will cause a movement of some distance, the distance will be converted to a number of step pulses and some number of feedback pulses via an encoder.
Ultimately, an algorithm yields one or more calculations. The mechanization of the calculation must be analyzed, especially when multiple steps are involved. As the compiler works through the source code, it may use intermediate "pseudovariables." Unintended rounding or truncation may result.
The software designer must perform rigorous developmental testing to ensure that under- or overflow and nominal values in the various input terms yield the expected results. When a variable overflows, its value wraps around, like the odometer on a car driven many thousands of miles.
It is possible that the codified calculation may need additional explicit steps to avoid unintentional compiler influences. Never sacrifice proper functionality in the name of code elegance: the code must work properly, not look nice.
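As an example of deliberate calculation mechanization, the sketch below converts a distance to stepper pulses using explicit intermediate types and a range check, so truncation and overflow cannot occur silently; the scale factor and limits are hypothetical.

```c
/* Minimal sketch: converting a real-world distance to stepper pulses
 * with explicit intermediate types, so rounding, truncation, and
 * overflow are deliberate rather than left to the compiler. The scale
 * factor and travel limit are hypothetical. */
#include <stdint.h>

#define STEPS_PER_MM   80u    /* hypothetical lead-screw scaling */
#define MAX_TRAVEL_MM 500u    /* hypothetical mechanical limit */

extern void enter_safe_state(void);

uint32_t distance_mm_to_steps(uint16_t distance_mm)
{
    /* Range-check the input before calculating. */
    if (distance_mm > MAX_TRAVEL_MM) {
        enter_safe_state();
        return 0u;
    }

    /* Widen to 32 bits before multiplying so the intermediate product
     * cannot overflow a 16-bit pseudovariable. */
    uint32_t steps = (uint32_t)distance_mm * STEPS_PER_MM;
    return steps;
}
```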
Software Implementation Guidance
Naming Variables. A naming convention provides a by-inspection check that numbers are treated in logical ways. The preferred convention specifies both variable type and physical units.
Specifying a variable type (for example, in C, using "int," "unsigned int," "long," "char," etc.) provides a safeguard against unintentionally altering a value through variable truncation (downcasting) or assigning a signed type to an unsigned type. (Negative numbers will not translate as expected.)
Similarly, the physical unit shows where units might be mixed in a calculation. The equation Density_g/cm3 = Mass_g/Volume_cm3, which specifies units of measurement, is safer than Density = Mass/Volume, which does not. In the latter equation, the programmer would need to look up the units of measurement for each variable and ensure consistent variable usage.
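A brief sketch of such a convention in C follows; the prefixes and unit suffixes are illustrative rather than drawn from any published standard.

```c
/* Minimal sketch of a naming convention that carries both type and
 * physical units; the prefixes and names are illustrative only. */
#include <stdint.h>

static uint16_t u16_volume_ul;    /* unsigned 16-bit, microliters */
static int32_t  i32_flow_ul_s;    /* signed 32-bit, microliters/second */

void compute_flow(void)
{
    /* Units visible in the names make mixed-unit mistakes easier to
     * catch by inspection: 1000 ul delivered over 10 s is 100 ul/s. */
    u16_volume_ul = 1000u;
    i32_flow_ul_s = (int32_t)u16_volume_ul / 10;
}
```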
Initialization. In most programming languages, allocating space for a variable means that the variable is assigned to a piece of memory, which has a random value. To prevent random behavior as a result of acting on uninitialized variables, the variables should always be set to a specific value before their initial use. This programming practice becomes mandatory when the goal is to make software safe.
Range Checking. Most languages do not perform range checking on arrays. This means that no warning is raised if, for example, the code accesses the sixth element of an array that has only five elements. If read, that sixth element will essentially return a random value; if written, that element will likely corrupt at least one other variable, resulting in unexpected behavior. Like initialization, range checking is a good programming practice that should be elevated to a mandatory practice.
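The sketch below combines the two practices just described, initializing variables before use and checking array bounds explicitly; the array, its size, and the fault handler are hypothetical.

```c
/* Minimal sketch: initialize before use and bounds-check array access,
 * because the language will not do either for us. Names are
 * hypothetical. */
#include <stdint.h>
#include <stddef.h>

#define NUM_CHANNELS 5u

static uint16_t channel_readings[NUM_CHANNELS] = {0};  /* initialized */

extern void enter_safe_state(void);

uint16_t read_channel(size_t index)
{
    /* C would happily read past the fifth element and return a random
     * value; check the range instead. */
    if (index >= NUM_CHANNELS) {
        enter_safe_state();
        return 0u;
    }
    return channel_readings[index];
}
```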
Filling Unused ROM. Finally, it is important to note that unused memory should be filled with an instruction that causes a transition to a known safe state. Unused ROM without specific and deliberate values may contain random numbers. If these memory locations are used (however improbable that is), unpredictable behavior ensues.
Safety from the Start
The medical device industry can settle for no less than the highest levels of safety and quality, and the software design should reflect the same standards. Deliberate engineering practices must be applied from the outset of the software design process to reduce software complexity for a safer overall system.
Risk mitigation must be an active and continuous engineering activity performed throughout the development of the device. It must influence all aspects of design, including preliminary design (top-down), detailed design (bottom-up), and the formulation of testing strategy as applied to the software product. A proper hardware sensor suite should always be in place to accurately measure the behaviors of the device. The software design should mandate that data and algorithms be monitored at all times for computational integrity and corruption detection.
By ensuring that each of these precautions is taken, safety-conscious software can be successfully achieved.
Timothy Cuff and Steven Nelson are senior research scientists in product development at Battelle Memorial Institute (Columbus, OH). The authors acknowledge the contributions of Clark Fortney and Jeffrey Keip, both of Battelle Memorial Institute.
Copyright ©2003 Medical Device & Diagnostic Industry