In the 1980s, the Therac-25 radiation therapy machine delivered massive overdoses — in some cases roughly 100 times the prescribed dose — in six known accidents. The root cause was not operator carelessness but software: race conditions that surfaced only when an experienced operator entered and edited treatment data quickly. The errors resulted in at least three deaths. The incident demonstrated the extremely high stakes for medical device manufacturers: missteps are measured in people injured or killed, and they often end in protracted litigation. The now infamous Therac-25 case highlights the care that must be taken when designing for the medical industry and the importance of approaching the process with a safety-centric perspective that extends beyond basic compliance. This article will outline several keys to designing with safety in mind that address some of the most common blunders that threaten the development process.
1. Go Beyond the Standards
Regulators in the United States and elsewhere mandate compliance with standards for the software development lifecycle (SDLC) of safety-critical projects. For medical device software, that standard is IEC 62304. It lays out detailed procedures for documenting and monitoring nearly every aspect of software specification, design, coding, and testing, as well as rigorous standards for oversight, compliance, and certification.
To build safer software, designers need to reach beyond requirement validation and instead rely on the careful application of systematic, incremental steps at every level of development. Safe complex systems are only built on top of safe, simpler ones. Relying on baseline compliance standards to ensure safety invites trouble as complexity grows and systems spiral out of control.
2. Reduce Design Complexity
Companies often reward managers who can produce systems under tight schedules and below budget. An emphasis on speed and reduced cost provides incentives to cut corners, fudge data, and ignore small red flags. Turnaround should matter far less than quality.
While the focus should never be solely on timing and cost, there are opportunities to reduce spend while improving safety. The greatest improvements usually come from reducing design complexity. This involves selective use of higher-level tools such as functional and type-safe programming languages and model-driven development (MDD). C and C++ are both popular, and terrible, for safety-critical systems. Their popularity stems from the deep well of tooling and the availability of experienced developers. They are terrible because languages with unsafe pointers and direct memory management present a nearly impossible test design challenge.
The size of a programmer's head is limited, so the level of abstraction she works at is key to the scale of the problem that can be solved in a given sitting. Greater tool functionality and more immediate feedback not only produce systems faster, but they also improve the kinds of systems that can be built.
3. Requirements Must Be Testable
The process of setting the requirements for the system is, in and of itself, a huge source of error. As systems become more complex, boilerplate text gets pulled or "adapted" from older designs. When things go well, these errors are discovered early and reworked or removed; when the process fails, nonsense code gets implemented merely to trace to inapplicable requirements.
Stakeholders need to be identified early. And just as code should be traceable to requirements, requirements must be traceable to the individuals who understand their needs in depth. Dynamic, flexible development, adoption, review, and, especially, rejection processes for requirements are likely to have a greater impact on systems safety than anything else described here.
4. Design to Pass Every Test
Comprehensive, integrated, internal test design is as critical as valid requirements to high-quality, reliable software. It is also the easiest element to ignore in the early stages of development, and it cannot be added to an existing codebase without effectively rewriting the code from scratch.
Effective test design has to be incorporated into code design. Before the first line is written, languages, frameworks, simulators, and other tool suites need to be evaluated with the test paradigm at the forefront.
5. Ignore Mean Time Between Failure
Finally, let’s address the fallacy of mean time between failure calculations. To put it directly, they should be banned from any discussion of software. The assumptions that underlie the math of these calculations are so subjective and prone to error that we should assume all those carefully constructed charts and graphs are meaningless.
Start by assuming that any process and any system can fail. Just as building complete test suites must not rely on one team, failure remediation must include independent teams who bring different ideas to the table. Independent watchdog processes, especially when coupled with redundant and independent hardware sensor arrays, provide excellent techniques for reducing failure modes. Avoid temptations to remove mechanical and electronic safeguards just because “the software can do it.”
It is almost never one massive engineering flaw that causes disasters like the injuries and deaths in the Therac-25 case, but several smaller missteps throughout the design process that accumulate into a safety failure. Overconfidence in software reused from earlier models, a lack of testing under real-world conditions, and an unwillingness to believe reports of system failures combined to produce catastrophic malfunctions of the machine in clinical use. The development team's inability to plan for and prevent these errors serves as a startling reminder of how important even the smallest step can be when designing safety-critical software. With a thoughtful eye on every milepost of the design process and a testing protocol that goes beyond the baseline standards, tragedies like this can be mitigated at the starting gate instead of discovered past the finish line.