In early January 2020, the World Health Organization (WHO) released information about a special case of a flu-like disease in Wuhan, China. However, a Canadian company specializing in artificial intelligence (AI)-based monitoring of the spread of infectious diseases had already warned its customers of the risk of an epidemic in China as early as December 2019.1 The warning had been derived from AI-based analyses of news reports and articles in online networks for animal and plant diseases. Access to global flight ticket data enabled AI to correctly forecast the spread of the virus in the days after it emerged.
Lack of Regulatory Framework
The example reveals the capabilities of AI and machine learning (ML). Both are also used in an increasing number of medical devices, for example, in the form of integrated circuits. Although the use of AI also carries risks, common standards and regulations do not yet include specific requirements addressing these innovative technologies. The European Union’s Medical Device Regulation (MDR), for example, only sets forth general software requirements. According to the regulation, software must be developed and manufactured in line with the state of the art and designed for its intended use.
This implies that AI, too, must ensure predictable and reproducible performance, which in turn requires a verified and validated AI model. The requirements for validation and verification are described in the software standards IEC 62304 and IEC 82304-1. However, there are still fundamental differences between conventional software and artificial intelligence with machine learning. Machine learning is based on using data to train a model, without explicitly programming the processes. As training progresses, the model’s parameters are continually optimized, and the model can be tuned further through changes in “hyperparameters.”
Testing AI Training Data and Defining the Scope
Data quality is crucial for the forecasts delivered by AI. Frequent problems include bias, over-fitting or under-fitting of the model, and labeling errors in supervised machine-learning models. Thorough testing reveals some of these problems.
Testing shows that bias and labeling errors are often caused unintentionally by training data that lack diversity. Take the example of an AI model that is trained to recognize apples. If the data used to train the model include predominantly green apples of different shapes and sizes, the model might identify a green pear as an apple but fail to recognize a red apple. Under certain circumstances, features that the training samples happen to share by accident might be rated as significant by AI even though they are irrelevant. The statistical distribution of data must be justified and correspond to the real environment. The existence of two legs, for example, must not be applied as a critical factor for AI classification as a human being.
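The requirement that the training distribution correspond to the real environment can be checked mechanically before training. The following is a minimal sketch, assuming the apple example above: the function name `distribution_gaps`, the 10 percent tolerance, and the expected color shares are illustrative assumptions, not values taken from any standard or dataset.

```python
from collections import Counter


def distribution_gaps(samples, expected, tolerance=0.10):
    """Return attribute values whose share in `samples` deviates from the
    expected real-world share by more than `tolerance` (absolute difference).

    The result maps each flagged value to (observed_share, expected_share).
    """
    counts = Counter(samples)
    total = len(samples)
    gaps = {}
    for value, expected_share in expected.items():
        observed_share = counts.get(value, 0) / total
        if abs(observed_share - expected_share) > tolerance:
            gaps[value] = (observed_share, expected_share)
    return gaps


# Hypothetical training set: 90 green apples and 10 red apples.
training_colors = ["green"] * 90 + ["red"] * 10
# Assumed real-world shares (illustrative numbers, not measured data).
expected_shares = {"green": 0.5, "red": 0.5}

gaps = distribution_gaps(training_colors, expected_shares)
for color, (observed, expected) in sorted(gaps.items()):
    print(f"{color}: observed {observed:.0%}, expected {expected:.0%}")
```

A check of this kind only covers attributes that have been recorded. It would not catch an accidental common feature that nobody thought to annotate.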
Labeling errors are also caused by subjectivity (“severity of disease”) or identifiers that are unsuitable for the purpose of the model. Labeling large data volumes and selecting suitable identifiers is a time- and cost-intensive process. In some cases, only a small share of the data is labeled manually. These data are used to train AI, which is then instructed to label the remaining data. This process is not always error-free, which in turn means that errors will be reproduced.
Key factors of success are data quality and the volume of data used. So far, empirical estimates of the amount of data required for an algorithm are few and far between. While it is basically true that even a weak algorithm functions well if the quality of the data is high enough and their volume large enough, in most cases capabilities will be limited by the availability of (labeled) data and computing power. The minimum volume of data required depends on the complexity of both the problem and the AI algorithm, with non-linear algorithms generally requiring more data than linear algorithms.
Normally, 70 to 80 percent of the available data are used to train the model, while the rest is held back to verify its predictions. The data used for AI training should cover the widest possible range of attributes.
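A split of this kind can be sketched in a few lines. The example below is a minimal illustration using only the Python standard library; the function name `split_dataset`, the fixed seed, and the stand-in records are assumptions made for the example.

```python
import random


def split_dataset(records, train_fraction=0.8, seed=0):
    """Shuffle a copy of `records` and split it into (train, test)."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


records = list(range(100))  # stand-in for 100 labeled samples
train, test = split_dataset(records, train_fraction=0.8)
print(len(train), len(test))
```

Shuffling before the cut matters when records are stored in a meaningful order, for example by date or by clinic. Otherwise the held-back part may not resemble the training part.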
Example: Identification of Osteoarthritis of the Knee
According to a black-box AI model, one of the two patients shown in the following images will develop osteoarthritis of the knee in the next three years. This is invisible to the human eye and the diagnosis cannot be verified. Would a patient still choose to undergo surgery? (The images are taken from "Making Medical AI Trustworthy," IEEE Spectrum, August 2018 [https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8423571], originally from The Osteoarthritis Initiative [https://nda.nih.gov/]. This article reflects the views of the author and may not reflect the opinions or views of the NIH or of the researchers who submitted the original data to The Osteoarthritis Initiative.)
Above: Figure 1. This patient will not suffer from osteoarthritis in the next 3 years.
Above: Figure 2. This patient will suffer from osteoarthritis in the next 3 years.
AI–Beware of the Black-Box Problem
The transparency of the AI algorithm used in a medical device is clinically relevant. As AI models have very convoluted and non-linear structures, they often operate as a “black box,” i.e., it can be very difficult if not impossible to understand how they make their decisions. In this case, for example, experts can no longer determine which part of the data input into the model (e.g., diagnostic images) triggers the decision made by AI (e.g., cancer tissue detected in an image).
AI methods used in the reconstruction of MRI and CT images have also proven to be unstable in some cases. Even minor changes in the input images may lead to completely different results. One reason is that algorithms are in some cases developed with accuracy but not stability in mind.
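One simple way to probe this kind of instability is to compare how much the output changes with how much the input changed. The sketch below is a toy illustration only: `smooth_model` and `brittle_model` are hypothetical stand-ins, not real reconstruction algorithms, and the four-value "image" is invented for the example.

```python
import math


def sensitivity(model, inputs, perturbed_inputs):
    """Ratio of output change to input change (Euclidean distances)."""
    input_change = math.dist(inputs, perturbed_inputs)
    output_change = math.dist(model(inputs), model(perturbed_inputs))
    return output_change / input_change


def smooth_model(pixels):
    # Hypothetical stable model: the output is the mean pixel value.
    return [sum(pixels) / len(pixels)]


def brittle_model(pixels):
    # Hypothetical unstable model: the output flips at a hard threshold.
    return [1.0 if sum(pixels) > 1.0 else 0.0]


image = [0.25, 0.25, 0.25, 0.25]
perturbed = [0.25, 0.25, 0.25, 0.26]  # one pixel changed by 0.01

print("smooth model :", sensitivity(smooth_model, image, perturbed))
print("brittle model:", sensitivity(brittle_model, image, perturbed))
```

A ratio far above 1 for a small perturbation indicates that the model amplifies minor input changes. A real stability assessment would repeat such a probe over many inputs and perturbation types.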
Without transparent and explainable AI forecasts, the medical validity of a decision might be doubted. Errors recently observed in pre-clinical AI applications add to these doubts. To ensure safe use in patients, experts must be able to explain the decisions made by AI. This is the only way to inspire and maintain trust.
The following figures demonstrate the differences between black-box and white-box AI.
Above: Figure 3. Black-box AI.
Above: Figure 4. Opening black-box AI.
The figures below demonstrate the effects of training AI using low-quality data. Examples include:
- Biased data (bias in assigning entries to one category of results).
- Over-fitting of data (see Figure 6): Inclusion and excessive weighting of characteristics of little or no relevance.
- Under-fitting of data: The model does not represent the training example with sufficient accuracy.
Above: Figure 5. Effect of training using data of low quality.
Above: Figure 6. Over-fitting (red line) of data (points). Inclusion and excessive weighting of characteristics of little or no relevance.
Above: Figure 7. Under-fitting (red line) of data (points). The model does not represent the training example with sufficient accuracy.
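Over-fitting and under-fitting can be made measurable by comparing the error on the training data with the error on held-back data. The following is a minimal sketch on synthetic data, assuming a noisy linear relationship: the three models (a constant mean as the under-fit, a least-squares line, and a nearest-neighbor "memorizer" as the over-fit) are illustrative choices and not the curves shown in the figures.

```python
import random


def mse(model, points):
    """Mean squared error of `model` over (x, y) points."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)


def fit_mean(train):
    # Under-fit: ignores x and always predicts the average y.
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_line(train):
    # Least-squares straight line, which matches how the data are generated.
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memorizer(train):
    # Over-fit: returns the y of the nearest training point, noise included.
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]


def sample(rng, n):
    # Synthetic data: y = 2x plus Gaussian noise (assumed for illustration).
    points = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        points.append((x, 2 * x + rng.gauss(0, 1)))
    return points


rng = random.Random(42)
train = sample(rng, 30)
test = sample(rng, 200)

results = {}
for name, fit in [("under-fit", fit_mean), ("line", fit_line),
                  ("over-fit", fit_memorizer)]:
    model = fit(train)
    results[name] = (mse(model, train), mse(model, test))
    print(f"{name:9s} train MSE {results[name][0]:.2f}  "
          f"test MSE {results[name][1]:.2f}")
```

The memorizer reproduces the training data perfectly, so its training error is zero, yet it makes errors on unseen data. The constant model shows a large error on both sets. A gap between training error and test error is the usual warning sign of over-fitting.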
Free Guidance for Developers and Manufacturers
A free checklist published by the Interest Group of the Notified Bodies in Germany (IG-NB) lists about 150 requirements for the development and post-market surveillance of medical devices (see info box below). Until standards governing the safety of AI-based medical devices are published, this guidance can be used to minimize risks in the lifecycle of medical AI. This facilitates the placement on the market of new technologies in an environment which is highly regulated by nature.
Checklist for Medical Devices with AI
Safety of AI-based medical devices requires a process-focused approach across all phases of the product lifecycle. The checklist published by IG-NB covers the following three areas:
General requirements
These include the certifiability of AI, the pertinent processes and the competencies required in development, as well as thorough documentation.
Requirements for product development
Tasks include identifying users, gathering software requirements, and developing and evaluating models.
Requirements for downstream phases
After development, the focus must be directed to production, distribution, and installation. This is as important as continual post-market surveillance.