Richard P. Chiacchierini
Analysis of the data from a medical device clinical trial or study is one of many critical steps along the path to FDA approval and, ultimately, to the marketplace. It is the culmination of all prior planning and execution of the study protocol. In the course of a proper analysis, underlying assumptions are verified, study populations and sites are checked for comparability, and all primary and secondary study variables are evaluated.
Clinical data can arise from a controlled clinical trial or from other clinical studies that reveal information about the performance of a medical device. The term clinical study encompasses a broad spectrum of situations in which data are gathered in a clinical setting. A clinical trial is a very specific type of clinical study.
Depending on the way a study is conducted, statistical analysis of its data can be variously affected by such design considerations as sample size, comparison groups, masking, or randomization. The particular type of analysis conducted on clinical study data is dictated by the way the study was actually conducted--which may or may not be the same as originally designed. Changes in the protocol during the course of a study will also require changes in the methods of analysis to be used. This article presents the basic framework for a proper statistical analysis of data arising from the conduct of a medical device clinical trial or study. The methods discussed here are extremely powerful, but their effectiveness depends critically on the quality of the data to which they are applied. No statistical method, regardless of its sophistication, can overcome major data weaknesses that arise from seriously flawed study design or conduct.
Starting Points. The device manufacturer should recognize at the outset that analyzing the data from a clinical study is a painstaking and expensive proposition. Despite the common misconception that data analysis is "simple and straightforward, requiring little time, effort, or expense," statisticians know that "careful analysis requires a major investment of all three" (Friedman, et al., p. 241; see bibliography, p. 56). In recent years, the common misconception has been amplified by the growing number of user-friendly computer software packages that seemingly promise to make data analysis effortless. But giving the analysis of clinical data less effort than it requires often leads to incorrect or inappropriate analyses that cause major delays in FDA's product review process. Agency reviewers are skeptical of statements made by a sponsor that are not supported by a proper and appropriate analysis.
A good analysis should start with an analytical strategy. The strategy should be crudely developed at the time the protocol is written and refined as the study or trial goes to completion. It should describe in general terms:
- The anticipated analysis procedures.
- The basis for the sample size.
- The primary and secondary variables.
- The subgroups, if any, that will be investigated by hypothesis tests.
- The influencing variables (covariates) that are important, and why they are important.
Although refinement of the analytical strategy should not be taken to include wholesale changes that drastically alter the intention of the original study, it may include the addition of greater detail that moves the initial strategy from generality to specificity. The original strategy document should provide a skeleton for the analytical scheme and the refinements should provide the meat.
At first glance, many analytical methods may appear suited to the data, but only a few are likely to have underlying assumptions that are truly consistent with the data. To determine the correct analytical technique to be used, the manufacturer needs to know the answers to a number of critical questions:
- Why were the data gathered?
- How were the data gathered?
- From whom were the data gathered?
- When and for how long were the data gathered?
- Where were the data gathered?
A database with rows representing patients and columns representing variables can yield summary data tables that might appear capable of analysis by a number of different methods. In actuality, however, there are likely to be a very limited number of methods (possibly one or two) for which the analytical assumptions are satisfied. Use of other methods that do not satisfy the analytical assumptions is inappropriate, and their results are considered unreliable.
Although the term statistical analysis embraces an ever-increasing number of methods that might be used by a medical device sponsor, all such analytical methods can be classified into two main groups: hypothesis testing and estimation. In hypothesis testing, the researcher usually compares the occurrence of one or more features of interest in two or more groups of patients. Most hypothesis testing in medical device clinical trials compares the mean, proportion, or other features of the device-treated group to the same features in the control group. Features could involve such measures as the mean time to healing or hemostasis, or the proportion of patients who showed a preselected degree of improvement.
In estimation, the researcher's interest is to determine the relative value of a characteristic of interest in a group under study. The estimated value is usually accompanied by a statement about its certainty, in the form of a confidence interval whose confidence level is expressed as a percentage. Estimation is a necessary part of hypothesis testing, but it is not the culmination of the method. Estimation is also important in the analysis of safety variables. For example, in a clinical study of a "me-too" device, where effectiveness is not an issue, FDA and the sponsor may be interested in estimating the proportion of patients who might experience a particular complication. To show how precisely that proportion has been estimated, the researchers would also need to determine the confidence interval for it.
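As an illustration of estimation for a safety variable, the sketch below computes a complication proportion with a 95% confidence interval. The Wilson score interval is one common choice and is an assumption here, not a method named in this article; the function name and the patient counts are hypothetical.

```python
import math
from statistics import NormalDist

def wilson_interval(events, n, confidence=0.95):
    """Return (estimate, lower, upper) for a binomial proportion (Wilson score)."""
    if n <= 0 or not 0 <= events <= n:
        raise ValueError("need n > 0 and 0 <= events <= n")
    # Two-sided normal critical value, about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    p_hat = events / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)
    )
    return p_hat, max(0.0, center - half_width), min(1.0, center + half_width)

# Hypothetical data: 6 of 120 patients experienced the complication
estimate, lower, upper = wilson_interval(6, 120)
print(f"estimate={estimate:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

A narrow interval indicates a precise estimate; whether the upper limit is clinically tolerable remains a clinical judgment.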
No single presentation on the statistical analysis of medical device clinical data can be sufficiently comprehensive to cover all aspects of this complicated and diverse methodology. Although this article is not intended to provide new or provocative material, it will cover the basic tenets that form the foundation for a proper analysis of clinical study data. These tenets are divided into three main sections: preliminary analysis, comprehensive analysis, and analytical interpretation.
Authors of textbooks about statistical data analysis rarely discuss the need to match the analytical method to the character of the data. Often, they simply assume that the reader is sophisticated enough to investigate whether the variance of the groups being compared is sufficiently similar, or whether the distribution of the data is suitable for the analytical method being proposed. This is clearly a leap of faith that is not supported by experience.
In the evaluation of any set of data, from whatever source, it is essential to begin with an investigation of the data's basic character.
- What is the nature of the distribution of the primary, secondary, and influencing variables?
- Is the distribution of variables consistent with normal (Gaussian) or another well-known distribution?
- If the data are not normally distributed, can they be changed by a function (a transformation) that preserves their order, but brings them into conformity with well-known assumptions about their distribution?
- Is the sample of adequate size such that normality of the means can be assumed even if the data are not normally distributed?
- Are the variances of the subgroups to be compared equal?
These questions are the realm of descriptive statistics. They can be answered by applying simple, well-known tests or by inspecting rudimentary data plots such as histograms or box plots. Such questions are essential for enabling the statistician to validate the assumptions that underlie the data, and to select the most appropriate analytical method consistent with the data.
Basic Character of the Data. Clinical data are similar to other forms of data in that there are two types of variables, quantitative and qualitative. Quantitative variables are numbers that can have any value within some acceptable range. For example, a person's weight in pounds could be 125.73. Qualitative variables, however, must conform to discrete classes, and are usually characterized numerically by whole numbers. For instance, a patient who is disease-free could be characterized by a zero, and a patient who has the disease could be classified as a one. The analytical procedures appropriate for these two types of variables are diverse. While there have recently been tremendous advances in the analysis of qualitative data, the techniques for analyzing quantitative variables remain more powerful because there is more numerical information in a number like 125.73 than there is in a zero or a one.
The distribution of variables in a sample is a critical factor in determining what method of analysis can be used. The normal, or Gaussian, distribution resembles the symmetrical bell-shaped curve by which most students are graded throughout their scholastic careers. It is fully characterized by two features: the mean, a measure of the location of the distribution, and the variance, a measure of the spread of the distribution. Many well-known statistical methods for analyzing means or averages--such as the t-test or the paired t-test--are based on the normal distribution. Such methods rely on normality to ensure that the mean represents a measure of the center of the distribution.
Because statistical theory holds that the means of large samples are approximately normally distributed, an assumption of normality becomes less important as sample sizes increase. However, when sample sizes are small, as they are likely to be in most medical device clinical studies, it is crucial to determine whether the data to be analyzed are consistent with a normal distribution or with another well-characterized distribution.
Most common statistical tests of quantitative variables, including the t-tests and analysis of variance (ANOVA), are tests of the equality of the measures of location belonging to two or more subgroups that are assumed to have equal variance. A measure of location, such as a mean or median, is a single number that best describes the placement of the distribution (usually its center) on a number line. Because equal variance provides the basis of nearly all tests that involve measures of location, in such cases an assumption of equal variance is more critical than an assumption of normality--even when the tests do not rely on any specific distribution of the data (called nonparametric tests). If the variances are not equal among the subgroups being compared, it is frequently possible to find a formula or function (a transformation) that preserves order and results in variables that do have equal variance.
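The effect of an order-preserving transformation on unequal variances can be sketched as follows. The data are hypothetical and deliberately constructed so that spread grows in proportion to the values, which is the situation in which a logarithm tends to help; the variance ratio shown is only a descriptive check, not a formal test of equal variance.

```python
import math
from statistics import variance

def variance_ratio(group_a, group_b):
    """Ratio of the larger sample variance to the smaller one (always >= 1)."""
    var_a, var_b = variance(group_a), variance(group_b)
    return max(var_a, var_b) / min(var_a, var_b)

# Hypothetical healing times in days; the second group is three times the first
control = [10, 12, 15, 11, 14, 13]
treated = [30, 36, 45, 33, 42, 39]

raw_ratio = variance_ratio(control, treated)
# The logarithm preserves order and turns proportional spread into equal spread
log_ratio = variance_ratio(
    [math.log(x) for x in control],
    [math.log(x) for x in treated],
)
print(f"variance ratio before: {raw_ratio:.2f}, after log: {log_ratio:.2f}")
```

Real data will rarely behave this cleanly, so the transformed values should still be inspected before an analysis is chosen.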
When considering the distribution of data, it is also important to look at a picture of them. Data can be plotted for each group under consideration to determine whether the distribution is shifted toward higher or lower values (skewed). The presence of one or more values that are much higher or lower than the main body of data indicates possible outliers. Data plots can also help to locate other data peculiarities. Common, statistically sound adjustment methods can be used to correct for many types of data problems.
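Alongside a plot, a simple numerical screen can point to the same features. The sketch below flags possible outliers with the box-plot fence rule (values more than 1.5 interquartile ranges beyond the quartiles) and uses the gap between mean and median as a rough hint of skew. The rule, the multiplier, and the data are assumptions chosen for illustration.

```python
from statistics import mean, median, quantiles

def flag_outliers(values, k=1.5):
    """Return values outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low_fence or v > high_fence]

# Hypothetical days to healing for one study group
healing_days = [12, 14, 15, 15, 16, 17, 18, 19, 48]

flagged = flag_outliers(healing_days)
# A mean well above the median suggests a tail toward higher values
skew_hint = mean(healing_days) - median(healing_days)
print("possible outliers:", flagged)
print("mean minus median:", round(skew_hint, 2))
```

A flagged value is a prompt to check the case record, not a reason to discard the observation.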
Baseline Variable Evaluation. Once the character of the variables of interest has been established, the analysis can test for comparability between the treatment and control groups. Comparability is established by performing statistical tests to compare demographic factors (such as age at the time of the study, age at the time of disease onset, or gender) or prognostic factors measured at baseline (such as disease severity, concomitant medication, or prior therapies). Biased results can occur when the comparison groups show discrepancies or imbalances in variables that are known or suspected to affect primary or secondary outcome measures. For instance, when a group includes a large proportion of patients whose disease is less advanced than that of patients in the comparison group, the final analysis will usually favor the outcomes for the former group, even without an effect that is due to the device.
About 30 years ago, another example of this effect occurred in a study that was comparing the effectiveness of surgery and iodine-131 for treatment of hyperthyroidism. The investigators found the seemingly inconsistent result that patients who received the supposedly less-traumatic radiation therapy had a much higher frequency of illness and death than those who underwent surgery. An investigation of the baseline characteristics of the two groups revealed that the patients selected for the surgery group were younger and in better general health than those selected for the iodine treatment. The inclusion criteria for the surgery group were more stringent than those for the iodine group because the patients had to be able to survive the surgery. In this example, noncomparability resulted in an inconsistent finding that was resolved only through investigation.
It is desirable to perform comparability tests using as many demographic or prognostic variables simultaneously as the method of analysis will allow. The reason for using this approach is that the influence of a single demographic or prognostic characteristic on the outcome variable may be strongly amplified or diminished by the simultaneous consideration of a second characteristic. However, the size of most medical device clinical studies is rarely sufficient to allow the simultaneous consideration of more than two variables. More commonly, the sample size of the trial will allow the investigator to consider only one variable at a time.
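For a single qualitative baseline variable, a one-variable-at-a-time comparability check might look like the sketch below. Fisher's exact test is used here as an assumption because it does not depend on large samples; the article does not prescribe a particular test, and the counts are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    total = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= observed * (1 + 1e-9)
    )
    return min(1.0, total)

# Hypothetical baseline severity: rows are treated and control,
# columns are severe and not severe
p_value = fisher_exact_two_sided(12, 18, 5, 25)
print(f"baseline severity imbalance, p = {p_value:.3f}")
```

A small p-value here signals an imbalance that the later analysis would need to account for; it says nothing about the device itself.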
As part of their comparability testing, one characteristic that manufacturers must always evaluate is the study site. Such an analysis should include not only the demographic and prognostic factors, but also the outcome variables. This evaluation is important because it provides the major basis for pooling the data from various clinical sites, which is very often essential to meeting the study sample size requirement.
Imbalances detected in comparability testing do not necessarily invalidate study results. By knowing that such differences exist, however, the analyst can account for their presence when comparing the outcomes data from the treatment and control groups. Many statistical procedures can be used to adjust for imbalances either before or during the comprehensive analysis, but such adjustments are usually restricted to instances where the extent of the difference is not great. Large differences in variables that affect data outcomes among comparison groups can rarely be adjusted adequately to make the comparison groups comparable.
The methods used for comprehensive analysis of clinical data vary according to the nature of the data, but also according to whether the analysis focuses on the effectiveness or the safety of the device. Selection of an appropriate method must also take into account the nature of the device under study. The following sections outline some of the statistical methods available for comprehensive analysis of effectiveness data for in vitro diagnostic products and therapeutic devices, and for assessing safety-related data.
Effectiveness Analyses for Diagnostic Devices. In vitro diagnostic devices require statistical techniques that are quite specialized. Usually the analysis is based on a specimen, such as a vial of blood, collected from a patient. The same specimen is analyzed by two or more laboratory methods to detect an analyte that is related to the presence of a condition or disease. Thus, each specimen results in a pair of measurements that are related to one another. In the case of a new method devised to detect the amount of serum cholesterol, for example, each blood sample would be used to produce two measures of serum cholesterol, one from the conventional method and one from the new method.
The statistical treatment of such related (or correlated) data is very different from that of unrelated (or uncorrelated) data because both measurements are attempting to measure exactly the same thing in the same individual. Generally, if both laboratory measurements result in a quantitative variable, the first analysis attempts to measure the degree of relationship between the measurements. The usual practice is to perform a simple linear regression analysis that assumes that the pairs of values resulting from the laboratory tests are related in a linear way.
In linear regression analysis, a best-fit line through the data is found statistically, and the slope is tested to determine whether it is statistically different from zero. A finding that the slope differs from zero indicates that the two variables are related, and careful attention should be paid to the correlation coefficient, a measure of the closeness of the points to the best-fit line. A correlation coefficient with a high value, either positive or negative, indicates a strong linear relationship between the two variables being compared. However, this correlation is an imperfect measure of the degree of relationship between the two measurements (i.e., although a good correlation with a coefficient near one may not indicate good agreement between the two measurements, a low correlation is almost surely indicative of poor agreement).
Although correlation can indicate whether there is a linear relationship between two laboratory measurements, it does not provide good information concerning their degree of equivalence. Perfect equivalence would be shown if the correlation were very near one, the slope very near one, and the intercept very near zero. It is possible to have a very good relationship between the two measures, but still have a slope that is statistically very different from one and an intercept that is very different from zero. Such a situation usually suggests that one of the two measurements is biased relative to the other.
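The distinction between relationship and equivalence can be seen in a small sketch. The cholesterol values are hypothetical and constructed so that the new method reads 20% high plus a constant offset; the correlation is essentially perfect even though the slope is not one and the intercept is not zero. Formal tests of the slope and intercept would need their standard errors, which are not computed here.

```python
import math

def simple_linear_regression(x, y):
    """Return (slope, intercept, correlation) for paired measurements."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    correlation = sxy / math.sqrt(sxx * syy)
    return slope, intercept, correlation

# Hypothetical serum cholesterol (mg/dL) from the conventional method
conventional = [150, 170, 190, 210, 230, 250]
# Hypothetical new method with a proportional and a constant bias
new_method = [1.2 * v + 5 for v in conventional]

slope, intercept, r = simple_linear_regression(conventional, new_method)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.3f}")
```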
If the conventional method used in the testing is a true "gold standard" or reference method, it may be possible to adjust the chemical or electronic measurement system of the device being evaluated to make the slope one and the intercept zero. If the conventional method is not a reference method or gold standard, then the sponsor is faced with the possibility that the new method under test may be better than the one to which it is being compared. In such a situation, tinkering with the device to force equivalence may be inadvisable.
When the conventional method is not a reference method or gold standard, the degree of agreement can be assessed by another method that goes beyond regression analysis. Recognizing that the absence of a gold standard means that the conventional method is imperfect, Bland and Altman devised a technique that compares the difference between the two measurements plotted against their mean (see bibliography, p. 56). The analyst establishes a confidence interval for the difference between the two measurements and assesses the number of differences falling within the interval. If the number is similar to that predicted by theory, and the width of the interval is small enough to be clinically acceptable, then the new measurement system is considered to be in good agreement with the conventional method. However, the determination that an interval's width is clinically acceptable cannot be established by statistical techniques and must involve the judgment of a health professional.
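A minimal sketch of this kind of agreement analysis follows. It assumes the interval is taken as the mean difference plus or minus 1.96 standard deviations of the differences (often called limits of agreement); the data and function name are hypothetical, and the plot itself is omitted.

```python
from statistics import mean, stdev

def bland_altman(new_values, conventional_values, z=1.96):
    """Summarize paired differences: bias, interval limits, and count inside."""
    pairs = list(zip(new_values, conventional_values))
    diffs = [a - b for a, b in pairs]
    # Pair means are what the differences would be plotted against
    pair_means = [(a + b) / 2 for a, b in pairs]
    bias = mean(diffs)
    spread = stdev(diffs)
    lower, upper = bias - z * spread, bias + z * spread
    inside = sum(lower <= d <= upper for d in diffs)
    return {
        "bias": bias, "lower": lower, "upper": upper,
        "inside": inside, "n": len(diffs), "pair_means": pair_means,
    }

# Hypothetical paired serum cholesterol readings (mg/dL)
new_readings = [152, 168, 193, 207, 234, 249, 181, 222]
conventional_readings = [150, 170, 190, 210, 230, 250, 180, 220]

result = bland_altman(new_readings, conventional_readings)
print(f"bias={result['bias']:.2f}, "
      f"interval=({result['lower']:.2f}, {result['upper']:.2f}), "
      f"{result['inside']} of {result['n']} differences inside")
```

Whether an interval of this width is acceptable is the clinical question the paragraph above leaves to a health professional.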
Establishing agreement between the quantitative measures is only the first step in the analysis of an in vitro diagnostic device. Since these devices and those that are designed to give qualitative results are diagnostic, the analyst must also assess the ability of the device to detect the condition. Such an assessment requires that a value (a cutoff value) that specifies the disease state or condition has been identified for each measurement system. It is critical that this value be established on a different set of data from the measurements currently under analysis; it is unacceptable to use a value that characterizes a disease state by reference to its own data set.
The next step is to classify the patients into two groups, those with the condition and those without it. This is performed for both the new method and the conventional method by reference to a qualitative outcome or by use of the cutoff value. The result is a two-by-two table in which the four cells represent the number of patients found negative for the disease or condition by both measurement methods, the number found positive by the conventional method but negative by the new method, the number found negative by the conventional method but positive by the new method, and the number found positive by both methods. From this table it is possible to estimate the sensitivity, specificity, predictive value positive, and predictive value negative, along with their respective confidence intervals. These values are usually compared with those for other classification systems for the disease or condition under test to determine whether they are close to those known values.
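The four summary measures follow directly from the cells of the two-by-two table. The sketch below assumes the conventional method is treated as the reference classification; the counts and names are hypothetical, and confidence intervals for each proportion would be computed separately.

```python
def diagnostic_summary(both_pos, conv_pos_new_neg, conv_neg_new_pos, both_neg):
    """Accuracy measures for the new method, with the conventional method as reference."""
    return {
        # Of those positive by the conventional method, fraction the new method finds
        "sensitivity": both_pos / (both_pos + conv_pos_new_neg),
        # Of those negative by the conventional method, fraction the new method clears
        "specificity": both_neg / (both_neg + conv_neg_new_pos),
        # Of the new method's positives, fraction confirmed by the conventional method
        "predictive_value_positive": both_pos / (both_pos + conv_neg_new_pos),
        # Of the new method's negatives, fraction confirmed by the conventional method
        "predictive_value_negative": both_neg / (both_neg + conv_pos_new_neg),
    }

# Hypothetical counts from 200 specimens
summary = diagnostic_summary(
    both_pos=45, conv_pos_new_neg=5, conv_neg_new_pos=10, both_neg=140
)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The predictive values depend on how common the condition is in the study sample, so they carry over to other populations only with care.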
The next step in the analysis of diagnostic devices involves either a relative risk assessment or a receiver operating characteristic (ROC) analysis. There is software available to perform either of these analyses. The relative risk is a ratio of the risk of the disease among patients with a positive test value to the risk of disease among patients with a negative test value. The relative risk analysis is particularly effective and can be done by use of either a logistic regression or a Cox regression depending on whether the patients have constant or variable follow-up, respectively. ROC analysis provides a measure of the robustness of the cutoff value as a function of sensitivity and specificity.
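The idea behind ROC analysis can be sketched by sweeping candidate cutoff values and recording sensitivity and specificity at each one. The sketch assumes that higher test values indicate disease and that true disease status is known; the analyte values are hypothetical. The area under the curve is computed with the equivalent pairwise-comparison formula rather than by integrating the plotted curve.

```python
def roc_points(diseased, healthy):
    """Return (cutoff, sensitivity, specificity) for every observed cutoff."""
    points = []
    for cutoff in sorted(set(diseased) | set(healthy)):
        # A specimen is called positive when its value is at or above the cutoff
        sensitivity = sum(v >= cutoff for v in diseased) / len(diseased)
        specificity = sum(v < cutoff for v in healthy) / len(healthy)
        points.append((cutoff, sensitivity, specificity))
    return points

def area_under_curve(diseased, healthy):
    """Probability that a diseased value exceeds a healthy one (ties count half)."""
    wins = sum(
        1.0 if d > h else 0.5 if d == h else 0.0
        for d in diseased
        for h in healthy
    )
    return wins / (len(diseased) * len(healthy))

# Hypothetical analyte values
diseased_values = [6.1, 7.4, 5.2, 8.0, 6.8]
healthy_values = [4.0, 5.5, 3.9, 4.8, 6.0]

curve = roc_points(diseased_values, healthy_values)
auc = area_under_curve(diseased_values, healthy_values)
for cutoff, sens, spec in curve:
    print(f"cutoff {cutoff}: sensitivity={sens:.2f}, specificity={spec:.2f}")
print(f"area under curve = {auc:.2f}")
```

A cutoff is robust in the sense used above when nearby cutoffs give similar sensitivity and specificity.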
These techniques, described more fully below, allow the analysis of the measurement method along with any potential influencing variable. If the final model, fit to the data, contains a statistically significant contribution that is attributable to the sponsor's measurement system--whether or not there are significant effects attributable to other covariates--the test method provides an independent means of assessing the disease or condition. The reason for this powerful interpretation is that the test resulting from these methods is based on a statistic that has been adjusted for the presence of other significant covariates.
Finally, if the device is diagnostic for a condition that takes a relatively long time to develop (such as cancer), the analyst should evaluate the lead time afforded by the device. Sometimes this evaluation is a simple mean with a corresponding confidence interval. For these types of devices to be effective, the interval should not include zero. In addition, the farther away the lower limit of the interval is from zero, the better.
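A mean lead time with its confidence interval might be computed as in the sketch below. The lead times are hypothetical. Because the sample is small, the example passes the two-sided 95% critical value of the t distribution for 9 degrees of freedom (2.262) rather than relying on the normal approximation that the function falls back on by default.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_with_interval(values, critical_value=None, confidence=0.95):
    """Return (mean, lower, upper); supply a t critical value for small samples."""
    if critical_value is None:
        # Normal approximation, reasonable only for large samples
        critical_value = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    center = mean(values)
    half_width = critical_value * stdev(values) / math.sqrt(len(values))
    return center, center - half_width, center + half_width

# Hypothetical lead times in months for 10 patients
lead_time_months = [4, 7, 3, 9, 6, 5, 8, 2, 6, 10]

avg, lower, upper = mean_with_interval(lead_time_months, critical_value=2.262)
excludes_zero = lower > 0 or upper < 0
print(f"mean lead time={avg:.1f} months, 95% CI=({lower:.2f}, {upper:.2f})")
print("interval excludes zero:", excludes_zero)
```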
Effectiveness Analysis for Therapeutic Devices. In-depth analysis of a therapeutic device usually involves hypothesis testing to determine whether the device maintains or improves the health of patients. In some cases, FDA may permit a sponsor to compare a test treatment to a particular device operating performance characteristic (OPC). Even in such cases, however, the result will be a test of the hypothesis that the treatment is better than or equal to a constant, the OPC. Selection of an appropriate method for in-depth analysis of data from such trials or studies depends on many factors, such as:
- Is the primary variable quantitative or qualitative?
- Was the primary variable measured only once or on several occasions?
- What other variables could affect the measurement under evaluation?
- Are those other variables qualitative (ordered or not) or quantitative?
Quantitative Primary Variables. If the primary variable under evaluation is quantitative, selection of an appropriate method of analysis will depend on how many times that variable was measured and on the nature of any other variables that need to be considered. If there is only a single measurement for each variable, and there are no differences among the potential covariates belonging to the treated and control groups, the appropriate method of analysis may be a parametric or nonparametric ANOVA or t-test. For example, a study of a new cardiovascular stent that is expected to offer better protection against restenosis, with all other things being equal, could compare the six-month luminal diameter by this method.
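For the single-measurement case, a parametric two-group comparison can be sketched as a pooled-variance t statistic. The luminal diameters are hypothetical, and the sketch assumes roughly normal data with equal variances in the two groups. Only the statistic and its degrees of freedom are computed; the p-value would come from a t table or statistical software.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(group_a, group_b):
    """Return (t, degrees_of_freedom) for a two-sample equal-variance t-test."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pool the two sample variances, weighting each by its degrees of freedom
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / df
    std_error = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean(group_a) - mean(group_b)) / std_error, df

# Hypothetical six-month luminal diameters in millimeters
stent_group = [3.1, 2.9, 3.4, 3.0, 3.3, 3.2]
control_group = [2.6, 2.8, 2.5, 2.9, 2.7, 2.4]

t_value, dof = pooled_t_statistic(stent_group, control_group)
print(f"t = {t_value:.2f} on {dof} degrees of freedom")
```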
The choice of an appropriate analytical method changes if the covariates belonging to the two comparison groups differ and are measured qualitatively. Such cases may require use of a more complex analysis of variance or an analysis of covariance (ANCOVA). The ANCOVA method is particularly suited to analyzing variables that are measured before and after treatment, assuming that the two measurements are related in a linear or approximately linear manner. Using ANCOVA, the statistician first adjusts the posttreatment measure for its relationship with the pretreatment measure, and then performs an analysis of variance. Using the example of the cardiovascular stent, ANCOVA would be a suitable method of analysis if the amount of improvement in the six-month luminal diameter of the artery treated by the stent depended on the original luminal diameter of the artery.
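The adjustment step in ANCOVA can be illustrated with point estimates alone. The sketch below estimates a common within-group slope relating the posttreatment measure to the pretreatment measure, then removes the part of the group difference explained by unequal baselines. The data are hypothetical and constructed to be exactly linear, and the significance test that a full ANCOVA would supply is not included.

```python
def ancova_adjusted_difference(pre_treated, post_treated, pre_control, post_control):
    """Return (pooled_slope, raw_difference, adjusted_difference) in post means."""
    def summarize(pre, post):
        mean_pre, mean_post = sum(pre) / len(pre), sum(post) / len(post)
        sxx = sum((x - mean_pre) ** 2 for x in pre)
        sxy = sum((x - mean_pre) * (y - mean_post) for x, y in zip(pre, post))
        return mean_pre, mean_post, sxx, sxy

    pre_t, post_t, sxx_t, sxy_t = summarize(pre_treated, post_treated)
    pre_c, post_c, sxx_c, sxy_c = summarize(pre_control, post_control)
    # Common slope estimated from within-group variation only
    slope = (sxy_t + sxy_c) / (sxx_t + sxx_c)
    raw = post_t - post_c
    # Subtract the difference that baseline imbalance alone would produce
    adjusted = raw - slope * (pre_t - pre_c)
    return slope, raw, adjusted

# Hypothetical luminal diameters (mm): baseline and six months
control_pre = [2.0, 2.2, 2.4, 2.6]
control_post = [2.0, 2.1, 2.2, 2.3]
stent_pre = [2.4, 2.6, 2.8, 3.0]
stent_post = [2.5, 2.6, 2.7, 2.8]

slope, raw_diff, adjusted_diff = ancova_adjusted_difference(
    stent_pre, stent_post, control_pre, control_post
)
print(f"slope={slope:.2f}, raw difference={raw_diff:.2f}, "
      f"adjusted difference={adjusted_diff:.2f}")
```

In this constructed example the stent group started with larger arteries, so part of the raw difference disappears once baseline diameter is taken into account.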
In medical device studies, outcome variables are often measured more than once for each study subject. Although there are very powerful methods of statistical analysis that can be applied to such situations, they require what statisticians call balance; that is, every time a variable is measured it must be measured for every patient. A balanced repeated measures ANOVA can be performed with or without covariates. With covariates, this method reveals the effect of each patient's covariate value on the outcome variable, the effect of time for each patient, and whether the effect of time for each patient is changed by different values of the covariate. Continuing with the stent example, a repeated measures ANOVA could be applied to evaluate measurements of luminal diameter before implantation and at 3, 6, 9, and 12 months after implantation, and of the location of coronary lesions. In this case, the primary outcome variable is luminal diameter, and the covariate is the location of the lesions.
A repeated measures ANOVA also can be used if a few patients missed one or possibly two measurements. However, doing so requires the statistician to use sophisticated statistical algorithms in order to estimate the missing outcome measures, and these can present problems. To find solutions, it is sometimes necessary to restrict the data or make other assumptions that may weaken the resulting statistical conclusions.
Some studies result in a quantitative outcome variable and one or more quantitative covariates. In this situation, multiple regression methods are useful in evaluating outcome variables (called dependent variables), especially if the study involves several levels or doses of treatment as well as other factors (independent variables). Regression is a powerful analytical technique that enables the statistician to simultaneously assess the primary variables as well as any covariates.
The regression model is an equation in which the primary outcome variable is represented as a function of the covariates and other independent variables. The importance of each independent variable is assessed by determining whether its corresponding coefficient is significantly different from zero. If the coefficient is statistically different from zero, then that independent variable is considered to have an effect on the dependent variable and is kept in the model; otherwise, it is discarded. The final model includes only those variables found to be statistically related to the dependent variable. The model enables the statistician to determine the strength of each independent variable relative to the others as well as to the device treatment. In the stent example, a multiple regression analysis would be appropriate for data where the luminal diameter was measured twice (say, at baseline and at 6 months), and the length of patient lesions was measured as an independent variable.
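The fitting step of such a model can be sketched with ordinary least squares solved through the normal equations. The data are hypothetical and generated from a known equation so that the recovered coefficients can be checked; the variable names are illustrative. The standard errors needed to test each coefficient against zero are not computed here and would normally come from statistical software.

```python
def fit_least_squares(rows, outcomes):
    """Return [intercept, coefficient_1, ...] minimizing squared error."""
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    # Build the normal equations (X'X) b = X'y
    a = [[sum(r[p] * r[q] for r in design) for q in range(k)] for p in range(k)]
    c = [sum(r[p] * y for r, y in zip(design, outcomes)) for p in range(k)]
    # Gaussian elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda i: abs(a[i][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("independent variables are collinear")
        a[col], a[pivot] = a[pivot], a[col]
        c[col], c[pivot] = c[pivot], c[col]
        for i in range(col + 1, k):
            factor = a[i][col] / a[col][col]
            for q in range(col, k):
                a[i][q] -= factor * a[col][q]
            c[i] -= factor * c[col]
    # Back substitution
    coefficients = [0.0] * k
    for i in range(k - 1, -1, -1):
        known = sum(a[i][q] * coefficients[q] for q in range(i + 1, k))
        coefficients[i] = (c[i] - known) / a[i][i]
    return coefficients

# Hypothetical patients: (baseline diameter mm, lesion length mm, treated 0/1)
patients = [
    (2.0, 10, 0), (2.5, 15, 0), (3.0, 12, 0), (2.2, 20, 0),
    (2.1, 11, 1), (2.6, 18, 1), (3.1, 14, 1), (2.4, 9, 1),
]
# Six-month diameters generated from a known equation, with no noise added
six_month = [0.5 + 0.8 * b - 0.02 * length + 0.4 * t for b, length, t in patients]

intercept, b_baseline, b_length, b_treated = fit_least_squares(patients, six_month)
print(f"intercept={intercept:.2f}, baseline={b_baseline:.2f}, "
      f"lesion length={b_length:.3f}, treatment={b_treated:.2f}")
```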