Originally published September 1996
As anyone who has been through the development of a microprocessor-controlled medical device knows, the myriad ways in which the device might fail create significant challenges in meeting device safety requirements. And while software design and coding are but two of the practices in the development process, they provide key points for the introduction of defects that can cause device failures.
Two basic reasons medical device software fails are inadequacies in the functionality of the software and errors in its implementation. Omissions of functionality derive from oversights during all development phases, including requirements and hazard analysis, system design, test plan design, coding, and validation. Oversights can result from failure to anticipate circumstances relating to the vast variations in human biology (for example, optical measurements on blood plasma may be affected by the fat content of the donor's breakfast), or from unexpected use (for example, the door is opened while a sample is being processed and the user is in a context-sensitive help screen). Additionally, the root of such oversights is often pressure put on the developers by management to deliver the device on time and within budget.
USING TEST PLANS
Validation testing is usually the last formal phase of the software process in which a defect can be detected. However, it may not be possible to exercise the software with sufficient rigor to observe the defect under the variety of conditions in which it will be used by customers. The risk of placing a device on the market before adequate testing has been completed can be minimized by developing an extensive test plan. Described by the Institute of Electrical and Electronics Engineers (IEEE) in the Standard for Software Verification and Validation Plans (IEEE Std 1012-1986), test plans help ensure that comprehensive testing will be performed.1 The test plan must be accepted by upper management and include unit testing, verification and validation procedures, and clinical test protocols. Although this test plan must be agreed upon before testing begins, it must be adaptable. It should be adjusted and augmented as development proceeds and as new issues are uncovered.
Using traceability matrices, a test plan can point back to the requirements specification and the hazard analysis to show proper coverage.2 The individual validation procedures referenced in the plan should enable a tester to carry out a prescribed sequence of operations and should specify precise, measurable system outputs, acceptable ranges for all critical values, and the devices to be used for their measurement. As development proceeds, new procedures should be written to reflect new functionality in the software. Existing procedures should be executed as regression tests after major updates to the software to ensure that functional behavior has not changed.
For developers of general-purpose computer software, many tools are available that facilitate the automation of regression testing, ensuring that changes to software have not had inadvertent adverse effects. Such tools are becoming available for embedded systems (such as microprocessor-based instrumentation) as well, but for various reasons have not yet reached critical mass in the marketplace. One reason is the unique and proprietary nature of most medical device software (by contrast, PC software has benefited greatly by standardizing under Microsoft Windows). Additionally, tools for embedded systems tend to carry a high cost in terms of both purchase price and learning curve.
Cycle testing is a technique that has been successfully used to improve this situation. This process requires a special version of the device software, identical in all respects to the deliverable software except in its ability to operate continuously without operator intervention. By simulating user input and using appropriate data collection--including RS-232 output and analog signal capture with data logging--software and hardware defects that present themselves infrequently can often be exposed after many hours of uninterrupted execution. Implementing test scripts, which requires additional development effort, can improve testability over standard cycling by facilitating the development of multiple, reusable test sequences.
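The scripted cycling described above can be sketched as a small harness that replays a reusable test sequence for many cycles and logs every response for later analysis. The device interface (run_step) and the log sink below are hypothetical stand-ins, not the API of any real product:

```cpp
#include <string>
#include <vector>

// Hypothetical log sink standing in for the RS-232 capture with
// data logging described in the text.
struct LogSink { std::vector<std::string> lines; };

// Placeholder for the instrument's response to one simulated
// user input; a real harness would drive the device software.
std::string run_step(const std::string &input) {
    return "OK:" + input;
}

// Replays one reusable test sequence for the requested number of
// cycles, logging every response; returns total steps executed.
int run_cycles(const std::vector<std::string> &script,
               int cycles, LogSink &log) {
    int steps = 0;
    for (int c = 0; c < cycles; ++c)
        for (const auto &input : script) {
            log.lines.push_back(run_step(input));
            ++steps;
        }
    return steps;
}
```

Because the script and the logging are separated from the device interface, new sequences can be added without touching the harness itself.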
The second reason for software failure--implementation errors in the software--is far more dangerous and prevalent. The term bug, as defined by Watts Humphrey, refers to a program defect that is encountered in operation, either under test or in use.3 Bugs are introduced by engineers during the design and implementation phases and can result in the infrequent, bizarre, and potentially dangerous outcomes that are often reported in the press and that ultimately result in increased pressure on unrelated activities such as validation.
Unfortunately, design and code reviews tend to detect only the most obvious errors and omissions. The weakness of such reviews stems from the complexity inherent in software and the profound familiarity with it that a reviewer must attain in order to make a meaningful contribution. Also, if the material for review is not carefully selected, the really diabolical problems will not be detected: the author will naturally present together those software components that are intended to work together, while unexpected interactions and conflicts with other modules remain unseen.
The most insidious software errors are often caused by the poorly handled sharing of data and other resources by competing processes. In these cases, one process is in the middle of modifying data when it is interrupted by another, higher-priority process that then uses the partially modified, and hence unreliable, data. Such an error is even more likely in the real-time software found in medical devices, since a block of data might have to remain protected and unchanged over time. These problems generally produce conditions that do not cause a system to crash (failing completely, which is often a fail-safe condition), but rather allow continued, unpredictable (nondeterministic) behavior.
The two general types of resource-sharing errors involve reentrancy and critical code. Reentrancy is the ability of a software module (generally a utility found in run-time libraries) to be safely interrupted and called by the interrupting process. A simple example is given in Figure 1 to illustrate a nonreentrant routine. Process A and process B both use subroutine C to calculate a sum. A problem arises if process A is interrupted during its execution of subroutine C after setting a value for y, and process B is allowed to run. When process B is finished, the values for (x,y,z) will be (10,11,12). When process A resumes where it left off, it will set z = 6, making the data block (10,11,6). This unexpected result will cause the calculation to fail. This example highlights the importance of understanding how unrelated parts of a software program can interact even when no intention of sharing data exists. For the processes to run independently and to avoid interference, each must use its own data space for the calculation.
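The failure mode of Figure 1 can be sketched in C++. The shared variables follow the figure's naming, but the arithmetic and the interleaving function are illustrative only; on a real target the "interrupt" would be an actual hardware event rather than a direct call:

```cpp
// Hypothetical nonreentrant routine: its temporaries live in
// file-scope (shared) storage, as in hand-optimized older C code.
static int x, y, z;

int add_nonreentrant(int a, int b) {
    x = a;
    y = b;
    // If an interrupt fires here and the interrupting process
    // calls this same routine, x and y are silently overwritten.
    z = x + y;
    return z;
}

// Reentrant version: every invocation gets its own stack frame,
// so concurrent callers cannot disturb each other's temporaries.
int add_reentrant(int a, int b) {
    int local_x = a, local_y = b;
    return local_x + local_y;
}

// Simulates the interleaving in the text: process A is preempted
// after setting its operands, process B runs to completion, then
// A resumes and unknowingly computes with B's data.
int interleaved_nonreentrant(int a1, int a2, int b1, int b2) {
    x = a1; y = a2;           // process A, interrupted here...
    add_nonreentrant(b1, b2); // ...process B runs to completion
    z = x + y;                // process A resumes: uses B's x, y
    return z;
}
```

With the reentrant version, the same interleaving is harmless because each process computes entirely within its own data space.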
A real-world case in which this might happen is during the use of a nonreentrant floating point library. If a routine is executing a floating point calculation and an interrupt or task switch occurs that also requires floating point calculations, the library's data structure will be corrupted on return from the interrupting calculation. When the first calculation resumes, it will give a random result. Floating point emulation libraries delivered with the industry-leading C/C++ compilers (from Borland and Microsoft) are not reentrant. The use of floating point calculations in interrupt service routines or in multitasking environments built with these libraries should be avoided. Reentrant libraries are available from third-party vendors, including US Software Corp. (Portland, OR).
A similar situation exists in device control, in which a system requires multiple operations to change state. Consider a pneumatics system with a tank, pump, and valve controlled by the processes charted in Figure 2. During a normal tank drain operation, the pump is turned off, the valve is opened, and the status variable is set to indicate draining. A problem occurs if, after opening the valve, an interrupting process needs to disable pneumatics updating. After the disabling routine closes the valve and sets the status variable to DISABLED, the original routine is allowed to finish executing. The status variable is set to DRAIN, yet the valve is closed, preventing draining. If the higher-level software is waiting for the drain operation to adjust the tank pressure, the system will hang. The three steps in setting the pneumatic system state are inextricably linked and serve as an example of a critical section of code that cannot be interrupted.
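The fix for the Figure 2 scenario is to make the three steps uninterruptible. The sketch below models this with counters standing in for the processor's interrupt mask; the names (tank, start_drain, DRAIN) are illustrative, not drawn from any real device API:

```cpp
// Hypothetical model of the Figure 2 pneumatics system.
enum Status { IDLE, DRAIN, DISABLED };

struct Pneumatics {
    bool pump_on;
    bool valve_open;
    Status status;
};

Pneumatics tank = { true, false, IDLE };

// Stand-ins for interrupt masking; on a real target these would
// map to CPU intrinsics or real-time kernel calls.
static int mask_depth = 0;
void disable_interrupts() { ++mask_depth; }
void enable_interrupts()  { --mask_depth; }

// All three steps of the drain form one critical section: no
// interrupting process can run between them, so the status
// variable can never disagree with the pump and valve outputs.
void start_drain() {
    disable_interrupts();
    tank.pump_on = false;    // step 1: pump off
    tank.valve_open = true;  // step 2: valve open
    tank.status = DRAIN;     // step 3: publish the new state
    enable_interrupts();
}
```

The disabling routine described in the text would use the same pattern, so whichever process runs first completes its state change atomically.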
Since the behavior of the system following these types of coding errors is difficult to predict, the errors can take weeks or months to reproduce, isolate, and resolve. And as our young profession continues to mature, these topics remain noticeably absent from major references. For example, Code Complete and Writing Solid Code, both well-received books published recently by Microsoft Press, fail to include sections on interrupt service handling, multitasking, reentrancy, or critical code.4,5 Less well-known references dealing specifically with these issues are Principles of Concurrent Programming and Software Design Methods for Concurrent and Real-Time Systems.6,7
SOFTWARE SAFETY MECHANISMS
Preventing the introduction of such problems into a product, however, can be quite straightforward. The developer needs to be aware of the data that can be shared and to use the mechanisms supported by the processor or the operating system to surround the critical sections of the code (in which data are read or modified) with interrupt disabling or task locks to prevent interruption. Simple implementation of a semaphore, which is a software variable acting as a gatekeeper, can be used if no system support is available. Here, a feature of object-oriented languages known as operator overloading can help by allowing engineers to create smart data types that restrict access and know how to protect themselves when accessed. Alternatively, the shared data can be treated as a resource (no different from a disk drive or A/D converter) with its own mediator software and API (application programmer's interface), allowing controlled access to the data. Another feature of object orientation, encapsulation, strictly enforces the hiding of private data by preventing uncontrolled access. Such software mechanisms provide the added benefit of allowing the software engineer to create consistent interfaces to shared resources, improving the structure and readability of the code.
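The "smart data type" idea can be sketched with operator overloading as follows. The lock and unlock functions are placeholders for whatever primitive the target supports (interrupt masking, a task lock, or a semaphore); the class name and counter are purely illustrative:

```cpp
// Placeholder protection primitives; on a real system these would
// disable/restore interrupts or take/release a semaphore.
static int lock_depth = 0;
void lock()   { ++lock_depth; }
void unlock() { --lock_depth; }

// A "smart" shared integer: every access passes through the
// guard, so callers cannot forget to protect the critical section.
class SharedInt {
    int value;
public:
    SharedInt() : value(0) {}

    // Writes are guarded by overloading assignment.
    SharedInt &operator=(int v) {
        lock();
        value = v;
        unlock();
        return *this;
    }

    // Reads are guarded the same way via the conversion operator.
    operator int() const {
        lock();
        int v = value;
        unlock();
        return v;
    }
};
```

Because the protection lives inside the type rather than at every call site, a review only needs to verify the class once instead of auditing each use of the data.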
Following coding, reviews should be held in which any global (generally accessible) data and systems with multivariable state descriptions are considered as possible sources of error. Modules are to be reviewed not only along functional lines but with regard to data and resource sharing as well. The review should include a cross-reference of data and procedures, generated automatically by the compiler/linker, to ensure that all users of a resource are known. An overly complex cross-reference can be an indication of poor overall program structure.
REDUNDANT SAFETY MECHANISMS
Even with the best of development methodologies and intentions, mistakes in the software will occur. Safety can be significantly enhanced by designing with redundant safety mechanisms. The process begins with a hazard analysis. Although no specific IEEE standard exists for such analysis, the Standard for Software Safety Plans (IEEE Std 1228-1994) addresses the issue.1 After identifying the major hazards and their severity and potential causes, electromechanical systems for detecting these conditions and overriding the instrument's processor can be devised. If these systems are developed to be orthogonal to the software (that is, the only point of common functionality is in the detection of the fault), then true redundancy will be achieved and the safe response to hazardous conditions improved.
For example, a blood infusion device included an electronic component linking the device's pumps and air detectors with logic independent of the central processor. If this safety board detected air being pumped in the direction of the donor (a condition that could only occur through failure of the instrument's software), it would remove power to the pumps and other critical devices, all outside of software control. The processor would be informed of the event, and the safety board would be queried by the software to determine the cause of the shutdown. The operator would then be required to intervene to continue the procedure (with the additional benefit that the operator is more likely to report the device's failure to the manufacturer). Finally, the safety board would not allow the instrument to continue until receiving a software-initiated reset command. With this board in place, complete system-level testing then includes single-fault introduction to ensure proper operation of the hardware and correct error detection, reporting, and recovery by the software.
If the development of additional hardware is not justified, either because the severity of the hazards is low or because the system has strict economic constraints, redundant safety checks can be implemented in software as background processes to ensure the absence of hazardous behavior. Use of a real-time, multitasking kernel facilitates the implementation of background, high-priority tasks that monitor system performance. These tasks then interrupt processing when improper and unexpected conditions occur.
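A software-only monitor of this kind can be sketched as a routine run from a high-priority task or periodic tick. The pressure limits and the shutdown action below are hypothetical; the point is that the monitor shares only the fault detection with the control code, mirroring the orthogonality goal above:

```cpp
// Illustrative safe operating band for a hypothetical tank (psi);
// a real device would derive these limits from the hazard analysis.
const double PRESSURE_MAX = 30.0;
const double PRESSURE_MIN = 5.0;

static bool fault_latched = false;

// Stand-in for the hazardous-state shutdown (e.g., pumps off,
// valves to a safe position, operator alerted).
void enter_safe_state() { fault_latched = true; }

// Invoked every tick by a high-priority background task: it is
// independent of the control logic and only checks the invariant.
void safety_monitor(double tank_pressure) {
    if (tank_pressure > PRESSURE_MAX || tank_pressure < PRESSURE_MIN)
        enter_safe_state();
}
```

Because the monitor runs at higher priority than the control tasks, it can interrupt processing the moment the invariant is violated, even if the control code itself is the source of the fault.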
The concepts of data and resource sharing and hardware redundancy are not new. The vast majority of their associated issues have been resolved by other branches of computer science. In database software, maintaining the integrity of data shared by multiple users is one of the fundamental problems. Here, the popular mechanisms for protecting against corrupted data involve record, page, and file locking. In operating-system software, multiple processes are allowed to share system hardware through the use of virtual device drivers. This software regulates the queuing and completion of tasks by the hardware device. Finally, many computer systems running mission-critical software contain multiple mirrored disk drives for redundant data storage, using background tasks that monitor and ensure data integrity.
However, little cross-pollination among the branches of software engineering exists. Those trained in real-time process control may have little or no formal database training. Additionally, since software engineering for embedded systems as a discipline has existed for such a relatively short time, specific training at the college level and beyond is limited. It is therefore the responsibility of the entire design team to ensure, through a well-defined hazard analysis and test plan, and through good coding and review practices, that fault-handling and safety mechanisms are a focal point of instrument development.
1. IEEE Software Engineering Standards Collection, New York, Institute of Electrical and Electronics Engineers, 1994.
2. Vicens CF, "Implementing an Automated Traceability Mechanism for Medical Device Software," Med Dev Diag Indust, 18(2):98-106, 1996.
3. Humphrey WS, Managing the Software Process, Reading, MA, Addison-Wesley, 1990.
4. McConnell S, Code Complete, Redmond, WA, Microsoft Press, 1993.
5. Maguire S, Writing Solid Code, Redmond, WA, Microsoft Press, 1993.
6. Ben-Ari M, Principles of Concurrent Programming, Englewood Cliffs, NJ, Prentice Hall, 1982.
7. Gomaa H, Software Design Methods for Concurrent and Real-Time Systems, Reading, MA, Addison-Wesley, 1993.
Bruce Levkoff is senior software engineer at Cytyc Corp. (Boxborough, MA), a medical device manufacturer that provides instruments for cytology laboratories.