Learning from Failure

Originally Published MDDI September 2002

COVER STORY

Failures in the prototype lab can provide engineers with valuable information for preventing new product disasters.

September 1, 2002


Failure investigations can be used to verify the FMEA process, but even the most thorough analysis will not prevent product defects if the results do not reach the right people.

David Warburton

Few catastrophic failures of complex systems occur without warning. So says James R. Chiles, author of Inviting Disaster: Lessons from the Edge of Technology.¹ Chiles explains that while warning signs seem obvious in retrospect, product failure ultimately occurs because managers either ignore those signs early on or fail to take action upon noticing them.

New product launches have their fair share of disasters: manufacturing yield fails to meet projections, or field failures anger customers or even provoke a product recall. Fortunately, few of these events make headlines.

Like larger failures in complex systems, the problems inherent in a product's design or manufacturing process seldom appear without one or more telltale failures in the preproduction units. The key to avoiding new product disaster is twofold: First, develop a rigorous procedure to trap, investigate, and document prototype failures in a manner that ensures the integrity and veracity of the data. Second, ensure that management receives the information and reviews it impartially.

TRAPPING AND TRACKING FAILURES

The prerequisite to rigorous failure analysis is a comprehensive strategy for capturing prototype failures. Such a strategy includes the following:

  • A method for identifying and tracking every prototype produced. Product developers should consider serializing prototypes and creating a database to track the configuration of each unit.

  • A definition of what constitutes a failure event.

  • Instructions for responding to failures of different kinds, including what actions to take, whom to call, and what information to record.

  • Assignment of a chief investigator—a senior engineer with deep knowledge of the product—who will manage the failure reporting and investigation processes.
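The first item in the list above, a database of serialized prototypes and their configurations, could be sketched roughly as follows. This is a minimal in-memory illustration, not a description of any particular tracking system; the class names, field names, serial numbers, and part revisions are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Prototype:
    serial: str    # unique serial number assigned to the unit (hypothetical format)
    purpose: str   # e.g. "test unit" or "marketing sample"
    configuration: dict = field(default_factory=dict)  # part name -> revision


class PrototypeRegistry:
    """In-memory stand-in for a prototype tracking database."""

    def __init__(self):
        self._units = {}

    def register(self, unit):
        # Refuse duplicate serial numbers so every unit stays identifiable.
        if unit.serial in self._units:
            raise ValueError(f"duplicate serial number: {unit.serial}")
        self._units[unit.serial] = unit

    def update_part(self, serial, part, revision):
        # Record a configuration change, such as a swapped board or bearing.
        self._units[serial].configuration[part] = revision

    def units_with(self, part, revision):
        # List the serial numbers of every unit built with a given part revision.
        return sorted(
            serial
            for serial, unit in self._units.items()
            if unit.configuration.get(part) == revision
        )


registry = PrototypeRegistry()
registry.register(Prototype("P-001", "test unit", {"main_board": "A"}))
registry.register(Prototype("P-002", "marketing sample", {"main_board": "A"}))
registry.update_part("P-002", "main_board", "B")
```

With a record like this, a failure traced to one part revision can be matched quickly against every other unit that carries the same revision, whether or not that unit was designated for testing.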

Tracking and using the information from every unit produced, instead of tracking only those units designated "test units," can greatly increase both the sample size and the opportunities for interesting failures. Marketing samples, for example, can be rich lodes of new failure types because these units are often the first to leave the engineering lab.

In one such case, I lent four prototype handheld glucose monitors to a photographer for marketing materials. The instant he shot the first photo, the RAM on each of the four units was mysteriously erased. These four failures were dismissed as astonishing coincidences until another marketing person exposed several of the same devices to direct sunlight. Only after these latter units failed did the engineering team investigate and learn that bright light could penetrate the epoxy coating on the microprocessor chip and erase RAM that had been assumed to be insensitive to light.

Such unexpected failure modes are only found by accident, because formal product testing covers only expected failure modes. That is why it is so important both to broadly define what constitutes a reportable failure and to instruct everyone who comes in contact with a prototype what to do when a failure occurs. On larger devices, these instructions might even be printed on a label.

The particulars of the failure-reporting procedure will vary, but every procedure should contain the following points:

  • Whom to call when a failure occurs.

  • How to begin a failure report.

  • What information should be recorded.

  • What to do next.

  • Most importantly, what not to do (such as a system reset).

Reportable failures should be called in to the chief investigator. It is best to have only one person responsible for the failure investigations. With one person in charge, everyone knows whom to call when there is a problem with a prototype. It is also easier to have only one person responsible for ensuring that the failure investigation procedure is followed. When a single, experienced individual performs each and every investigation, he or she is better able to spot trends and to draw inferences.
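One way to make the reporting points above concrete is a fixed record that every failure report must fill in. The sketch below is only an illustration: the field names, the "open" status value, and the sample entry are assumptions, not a format prescribed by any standard.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class FailureReport:
    report_id: str
    unit_serial: str      # which prototype failed
    reported_by: str      # who called the chief investigator
    occurred_at: datetime
    description: str      # what the user observed
    actions_taken: list = field(default_factory=list)  # everything done to the unit since the failure
    unit_was_reset: bool = False  # a reset may have destroyed evidence
    status: str = "open"

    def missing_fields(self):
        """Return the names of required text fields that were left blank."""
        required = ("report_id", "unit_serial", "reported_by", "description")
        return [name for name in required if not getattr(self, name).strip()]


# Hypothetical entry: the description has not been filled in yet.
report = FailureReport(
    report_id="FR-0001",
    unit_serial="P-002",
    reported_by="marketing photographer",
    occurred_at=datetime(2002, 9, 1, 14, 30),
    description="",
    actions_taken=["took flash photograph of unit"],
)
```

Recording whether the unit was reset, and what was done to it before the investigator arrived, keeps the "what not to do" instruction visible in the report itself.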

INVESTIGATING THE FAILURE

Once a prototype failure occurs, the investigator should begin a failure report. The report should follow a standardized format to ensure key information is captured. When performing the investigation, the investigator must relinquish any preconceived ideas about the root cause and be careful not to inadvertently destroy crucial evidence. Some suggestions for performing an investigation are

  • Disassemble the unit deliberately, and collect as much noninvasive data as possible before each disassembly step.

  • Take photographs and measure the product's critical parameters.

  • Determine what additional tests can be run before each level of disassembly.

  • Download, save, or print all error logs.

Finding these log files later will be easier if they are stored on a server rather than in someone's e-mail. It also helps to have a naming convention for the log files and directories, and to note the appropriate name and directory in the failure report.
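A naming convention of this kind can be as simple as a function that builds the storage path from the report number, the unit serial number, and the capture date. The layout below is one hypothetical convention, shown only to illustrate the idea; the root directory name and identifier formats are assumptions.

```python
from datetime import date
from pathlib import PurePosixPath


def log_path(report_id, unit_serial, captured_on, log_name, root="failure_logs"):
    """Build a predictable server path for a log file attached to a failure report."""
    stamp = captured_on.strftime("%Y%m%d")  # sortable date stamp
    return PurePosixPath(root) / report_id / f"{unit_serial}_{stamp}_{log_name}"


# Hypothetical report number, serial number, and file name.
path = log_path("FR-0042", "P-002", date(2002, 9, 1), "error.log")
```

Because the path is derived from fields already in the failure report, anyone reading the report can locate the logs without asking the original investigator.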

While investigating a primary failure, look for other failures or potential failures as well. Most prototype equipment contains many defects; however, the so-called first-failure defect usually receives most of the attention—even if it is ultimately not the most serious defect in the product.

Looking out for those future failures during the course of a failure investigation can dramatically increase the defect discovery rate. The investigation should include looking for particulates around moving parts, which indicate abnormal wear; cracks in stressed parts; discoloration on electronic components (a symptom of overheating); and stains, puddles, or crystallization, all three of which point to leaks.

A few years ago, I led an engineering team in designing large and enormously expensive machines. Consequently, only a few prototypes were available to test during development. Every time the prototype testing was halted for a repair or for upgrading, the engineering team would thoroughly examine the disassembled portion of the machine for indications of wear. These examinations revealed such problems as improper drive-belt tracking, undersized bearings, and interference between moving parts and their cover panels. These were, in some instances, the only clues that a particular mechanism would not reach its designated seven-year life expectancy.

During the failure investigation, another thing to look for is how the whole system responded to the initial root failure. Were there adequate self-checks in place to prevent a small failure from propagating? Or, did the failure go undetected until equipment damage occurred or until users were put at risk? For example, a heater unit not protected by an overtemperature sensor can do a considerable amount of damage within an instrument if the primary temperature sensor fails. Failure investigations provide excellent experimental verification of the failure mode and effects analysis (FMEA) and should be used to update and revise the FMEA document.
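The heater example can be illustrated with a short sketch of an independent overtemperature check. It assumes a hypothetical controller with a primary sensor used for regulation and a separate backup sensor used only for the cutoff; the function name, parameters, and temperatures are illustrative and not taken from any real instrument.

```python
def heater_command(primary_temp_c, backup_temp_c, setpoint_c, cutoff_c):
    """Return True to energize the heater, False to keep it off."""
    # Independent self-check: the backup sensor trips the cutoff
    # regardless of what the primary sensor reports.
    if backup_temp_c >= cutoff_c:
        return False
    # A missing primary reading is treated as a fault, not as "cold".
    if primary_temp_c is None:
        return False
    # Normal regulation on the primary sensor.
    return primary_temp_c < setpoint_c
```

If the primary sensor fails in a way that reads low, regulation alone would keep the heater on indefinitely; the backup check limits the damage to a detectable fault.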

SAVING THE EVIDENCE

Developing a rigorous procedure to identify, investigate, and document prototype failures is the first step in avoiding product launch disasters.

Once the initial failure investigation is complete, the investigator is usually left with a collection of parts. He or she must avoid the temptation to either discard the parts or heap them in indiscriminate piles in an office. Parts should never be discarded; rather, they should be stored in labeled boxes or bags in a secure location that will not become an alternate spare-parts stockroom.

There are two reasons to retain the failed parts. First, failure investigations rarely progress linearly from initial investigation to final identification of the root cause. The investigator might have to reexamine the failed parts several times over a period of weeks as the investigation proceeds.

Second, the parts often become valuable in the investigation of other failures. If a weld cracks during testing of a late-stage prototype, for example, the investigator can go back and examine a large population of earlier parts to better understand the scope of the problem. Because the earlier parts have been carefully identified, their manufacturing and use history can be traced. Consequently, the information gleaned from those parts can be trusted.

Obviously, some prototypes cannot be archived after failure and must be returned to testing after the initial failure investigation is complete. In that case, the investigator should document all work that is done to recondition the unit before it is returned to service.

The importance of careful documentation was brought home to me one late night in the test lab, when the engineering team was lamenting a noisy, embarrassing, premature bearing failure in a beta unit. No one on the design team, save one sheepish technician, had any inkling that a design problem with the bearings existed. The technician later confessed that while making emergency repairs to the prototype before a customer demonstration several months prior, he had noticed the bearings were about to fail. He popped in a new set but forgot to mention it to anyone.

Later, thinking that he had already notified the engineering department of the problem, he began to replace the bearings regularly during his periodic inspections of the machine. He had a whole pile of failed bearings in an unlabeled box in the lab.

MANAGING THE DATA

Even the most thorough failure analysis cannot prevent defects in the final product if the results never reach the right people. For this reason, it is important to maintain a system for both tracking failure reports and generating summary reports. Some useful methods for creating summary reports include

  • Performing Pareto analysis of failure types.

  • Trending incidents over time to demonstrate improvements in product reliability.

  • Comparing open investigations with those that have been closed.
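As a rough illustration of the first and third methods, the sketch below ranks failure types by frequency with a running cumulative percentage, and counts open versus closed investigations. The failure types and statuses are made-up sample data, and the function names are hypothetical.

```python
from collections import Counter


def pareto(failure_types):
    """Rank failure types by count, with cumulative percentage of all failures."""
    counts = Counter(failure_types).most_common()
    total = sum(n for _, n in counts)
    rows, running = [], 0
    for name, n in counts:
        running += n
        rows.append((name, n, round(100 * running / total, 1)))
    return rows


def status_counts(statuses):
    """Count investigations by status, e.g. open versus closed."""
    return dict(Counter(statuses))


# Made-up sample data for illustration only.
sample_types = ["bearing", "leak", "bearing", "firmware", "bearing", "leak"]
sample_statuses = ["open", "closed", "closed", "open", "closed", "closed"]
table = pareto(sample_types)
summary = status_counts(sample_statuses)
```

In this sample, the top two failure types account for most of the reports, which is the kind of concentration a Pareto summary is meant to expose to reviewers.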

Many device companies use their field-failure reporting database to record internal failures during product development as well. Doing so provides the advantage of leveraging a mature, well-established system to perform an additional, related task. Because a large number of people at any given company are already trained in the use of the field-failure reporting database, asking them to extend the use of that database to include premarket failures is reasonable.

On the other hand, the disadvantage of using the field-failure reporting database for this purpose is that the database—and the procedures for reporting failures, investigating them, and following through with corrective action—might not be flexible enough to be used efficiently by the product developers.

Regardless of how summary reports are generated, management should formally review them as part of the overall design review process. Such a third-party audit of the failure data can provide a crucial outside perspective on the maturity of the product under development, which can help the engineering team avoid two common impediments to decision making: collective rationalization and confirmation bias.

Collective rationalization is the effort of a group to rationalize decisions in order to discount warnings or other information that might lead the group members to reconsider their assumptions.² Confirmation bias is the name given to the tendency of an individual holding a hypothesis to look for and accept evidence confirming the hypothesis, and ignore or discount evidence that negates the hypothesis.³ Both of these tendencies have been demonstrated to be major reasons why managers fail to act on crucial failure information.

TAKING ACTION

A component-level traceability strategy is a critical part of investigating prototype failures.

Collective rationalization and confirmation bias often lead engineering teams to dismiss clear evidence of defects in product design. An engineering team is usually under considerable time pressure by the end of a project, and occasionally, because of this pressure, the team will ignore isolated and sometimes inconsistent data coming from prototype testing.

One strategy for counteracting this temptation is to perform failure investigations in a rigorous and consistent manner. This will give decision makers confidence that the failure data can be trusted.

Another approach is to be cautious about dismissing any failure as a fluke. Although it might be only a single failure, it is a single failure in a small sample set, so unless corrective action is taken, there is a substantial chance that the defect will appear in the final product. For example, if a single defect is found in a group of five prototypes, the true failure rate in the subsequent production population could be anywhere from 0.5% to 72%, using a binomial distribution and a 95% confidence interval.
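The interval quoted above can be checked with an exact binomial (Clopper-Pearson) calculation. The article does not say which method was used, so treating it as Clopper-Pearson is an assumption; the sketch below uses only the standard library and solves for the bounds by bisection.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve(j, n, target, iters=100):
    """Find p in [0, 1] where binom_cdf(j, n, p) == target (the CDF decreases in p)."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if binom_cdf(j, n, mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(failures, n, confidence=0.95):
    """Exact two-sided confidence interval for a binomial failure rate."""
    alpha = 1 - confidence
    # Lower bound: P(X >= failures) = alpha / 2, or 0 when no failures were seen.
    lower = 0.0 if failures == 0 else _solve(failures - 1, n, 1 - alpha / 2)
    # Upper bound: P(X <= failures) = alpha / 2, or 1 when every unit failed.
    upper = 1.0 if failures == n else _solve(failures, n, alpha / 2)
    return lower, upper


low, high = clopper_pearson(1, 5)
```

For one failure in five units, the bounds come out at roughly 0.005 and 0.716, consistent with the 0.5% to 72% range in the text.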

Finally, investigators should be sure that failure investigation reports and summaries are subjected to third-party review. This provides a necessary second look at the data and can offer fresh insight into problems and possible corrective actions.

One of the most difficult decisions a manager has to make is whether to undertake a major product redesign based on inconclusive information coming from a single prototype failure. The decision will be easier, however, if the manager has confidence in the failure analysis data and takes steps to account for the biases in the decision-making process.

REFERENCES

1. James R. Chiles, Inviting Disaster: Lessons from the Edge of Technology (New York: HarperBusiness, 2001).

2. Irving L. Janis, Groupthink (Boston: Houghton Mifflin, 1982).

3. Kari Edwards and Edward E. Smith, "A Disconfirmation Bias in the Evaluation of Arguments," Journal of Personality and Social Psychology 71, no. 1 (1996): 5–24.

David Warburton is an engineering manager and new-product development consultant based in Lexington, MA.

Copyright ©2002 Medical Device & Diagnostic Industry
