## Methods for detecting problems

In some senses the methods for detecting problems are simply an extension of the usual methods for checking data, combined with an awareness of the characteristics shown by altered and fabricated data. There are some purely statistical methods, and some graphical methods that supplement the numerical ones. Familiarity with the research subject is also important; the medical investigator and referee will have it, but the statistician analysing the data may not. This knowledge will include a sensitivity to look for relationships that should exist and for the absence of relationships that should not.

## Examining data: one variable at a time

The features of fraudulent data such as excessive digit preference or alteration of record forms will not usually be visible to the reader, referee, or editor, but may be detected if looked for by a colleague or statistician processing the raw invented data. Digit preference itself is neither a sensitive nor a specific test of fraud.

### Statistical methods

The methods for data checking described in textbooks aimed at medical research such as Altman2 and Altman et al.4 are entirely appropriate. However, they have an implicit assumption that the data are genuine; using the presentation methods that they suggest will be a first step in preventing fraud. Checking will usually be needed by centre or investigator, or time period, when the main analysis would not be done in this way.

The guidelines that were previously published in the BMJ and that have been updated in Chapters 9 and 10 of Altman et al.4 are particularly pertinent. The use of standard deviations and percentile ranges (such as 25th-75th or 10th-90th) may be helpful. The range itself is focused on the extremes (which are the observations most likely to be in error), and always increases with increasing sample size. It may be helpful in detecting outliers (or their absence), but should not be a substitute for measures such as the standard deviation that can be used for comparative purposes across different studies. If the range of observations is clearly determined, such as lying within 0-100 as a percentage of another measurement that forms a total, it is unlikely to be helpful in itself. However, the percentiles can still be quite different in real as opposed to invented data. The range is more popular in the medical literature than it should be, and is often the only measure of variability quoted in published reports.

Original data should be summarised wherever possible rather than only derived variables. A confidence interval or a standard error may allow for the derivation of the standard deviation provided the sample size involved is clear to the reader. Provided this is possible, then constraints on space may allow standard deviations to be omitted, although they can often be included with very little space penalty.
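The back-calculation mentioned above can be sketched as follows. This is a minimal illustration, assuming the reported interval is a 95% confidence interval for a single mean of the form mean ± 1.96 × SE; the function names and the example figures are hypothetical.

```python
import math

def sd_from_se(se: float, n: int) -> float:
    """Standard deviation implied by the standard error of a mean of n observations."""
    return se * math.sqrt(n)

def sd_from_ci(lower: float, upper: float, n: int, z: float = 1.96) -> float:
    """Standard deviation implied by a confidence interval for a single mean.

    Assumes a symmetric interval, mean +/- z * SE. For small samples a
    t multiplier would be more appropriate than 1.96.
    """
    se = (upper - lower) / (2 * z)
    return sd_from_se(se, n)

# Hypothetical report: 95% CI 88.04 to 91.96 mmHg from 100 patients
print(round(sd_from_ci(88.04, 91.96, 100), 1))
```

Neither calculation is possible unless the sample size behind the interval is stated, which is the point made in the text.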

For those with access to the raw data there are a variety of techniques that can be used to detect problems. The variation in the data is a vital component to examine. The kurtosis of a distribution is not often used in ordinary statistical analysis, but can be very helpful, both in scanning for outliers in many variables, and also in looking for data that have too few outliers. Data that are from a uniform distribution can be detected by examination of the kurtosis. It is especially helpful within the context of a large and complex trial in scanning many variables in a large number of centres, separately by treatment group.
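As a rough sketch of why kurtosis is useful here: the moment-based excess kurtosis is close to 0 for normally distributed data and close to -1.2 for uniformly distributed data, so a strongly negative value suggests data with too few extreme values. The simulated readings below are purely illustrative, and the function name is an assumption rather than any package's API.

```python
import random
import statistics

def excess_kurtosis(values):
    """Moment-based excess kurtosis (fourth central moment / variance squared, minus 3)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0

rng = random.Random(1)  # fixed seed so the sketch is reproducible
normal_like = [rng.gauss(85, 10) for _ in range(5000)]      # simulated bell-shaped readings
uniform_like = [rng.uniform(70, 100) for _ in range(5000)]  # simulated flat, outlier-free readings

print(round(excess_kurtosis(normal_like), 2))
print(round(excess_kurtosis(uniform_like), 2))
```

In a multicentre trial the same function could be applied to each variable, centre, and treatment group in turn, looking for values that are unusually high (outliers) or unusually low (too few outliers).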

Dates are helpful in checking veracity. In some instances, fraudulent data have involved supposed visits to a general practitioner or hospital outpatient clinic on bank holidays or weekends. While not impossible, these are less likely, and where any number of visits are recorded on such dates this constitutes a "signal" of an issue that merits investigation. As with other data, reduced variability in times between visits is a marker of possible problems. Buyse et al.5 give a description of this type of problem in more detail, with an example from Scherrer.6
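A date check of this kind is simple to automate. The sketch below assumes visit dates are available as `date` objects and that a list of public holidays is supplied by the analyst; the dates and the holiday list shown are hypothetical.

```python
from datetime import date

def suspicious_visit_dates(visit_dates, holidays=frozenset()):
    """Return visits that fall on a Saturday, a Sunday, or a listed public holiday."""
    return [d for d in visit_dates if d.weekday() >= 5 or d in holidays]

# Hypothetical visit dates and holiday list, for illustration only
visits = [date(2001, 12, 24), date(2001, 12, 25), date(2002, 1, 5), date(2002, 1, 7)]
bank_holidays = frozenset({date(2001, 12, 25), date(2001, 12, 26), date(2002, 1, 1)})

flagged = suspicious_visit_dates(visits, bank_holidays)
print(flagged)  # the 25 December (holiday) and 5 January (Saturday) visits
```

A flagged date is only a signal for further enquiry; some clinics do see patients at weekends.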

Authors (and editors) should be encouraged to present raw data if possible rather than just summary values, and, where practicable, diagrams that show all the data should also be presented. Bad data tend to lie too close to the centre of the data for their own group. All authors should be ready to send their data, if requested, to an editor so that independent checks may be made on it if necessary.

### Graphical methods

These are part of statistical science and require careful thought for scientific presentation. The advent of business presentation graphics on personal computers has led to a decline in the quality of published graphs. Inappropriate use of bar graphs for presenting means is a typical example. The use of good graphics is particularly useful when patterns are being sought in data under suspicion.

Some techniques may be used for exploration of data but may not be the best for final communication of the results. An example is the use of the "stem and leaf" plot. This is like a histogram on its side, with the "stem" being the most significant digits, and the "leaves" being the least significant digits. This can be constructed by hand very easily, and many statistical computer programs can produce it. Because it retains all the data, unlike the histogram, which groups the data, the last digit can be seen, and instances of digit preference can be seen clearly. Such a technique showed that doctors did not always use the Hawksley random zero sphygmomanometer in the correct manner.7 This example itself illustrates that, as with the title of that paper, it is always easier to blame a machine than human failing.
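The construction described above can be sketched in a few lines. This is a simplified version, assuming whole-number readings, one line per stem of ten units, and no display of empty stems; the readings are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group integer values by stem (tens and above), keeping final digits as leaves."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(stems)

def format_stem_and_leaf(values):
    """Render one text line per stem; stems with no values are omitted in this sketch."""
    rows = stem_and_leaf(values)
    return "\n".join(
        f"{stem:>3} | {''.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(rows.items())
    )

# Hypothetical diastolic pressures (mmHg), for illustration only
readings = [56, 58, 60, 62, 62, 64, 70, 70, 72, 76, 78, 80, 80, 84, 90]
print(format_stem_and_leaf(readings))
```

Because every final digit is printed, an excess of zeros or a complete absence of odd digits is visible at a glance.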

Figure 14.2 shows a stem and leaf plot of blood pressure recorded in another study where, although the measurements were theoretically made to the nearest 2 mmHg, some minor digit preference may be seen.

| Depth | Stem | Leaves (plot of DBP) |
|---|---|---|
| 2 | 5 | 68 |
| 12 | 6 | 0222224444 |
| 16 | • | 6666 |
| 33 | 7 | 00000022222222444 |
| (8) | • | 66668888 |
| 37 | 8 | 00000022244444 |
| 23 | • | 666888 |
| 17 | 9* | 000024 |
| 11 | • | 666 |
| 5 | 10* | 000 |
| 3 | • | 88 |
|  | 11 | 4 |
|  | HIGH | 118, 140 |

Figure 14.2 Stem and leaf plot of diastolic blood pressure. The first two values are 56, 58. Note that there are 21 values ending in zero, while 11 values end in 8. This is only slight digit preference. (The values are not expected to be measured to better than 2 mmHg.)

Stem and leaf plots could be used in publications more than they are; for further details of their construction see Altman2 or Bland.8 If digit preference is suspected, then a histogram of the final digit, or a separate one of the penultimate digit, can be helpful. A chi-square test for the goodness of fit to a uniform distribution offers a guide as to whether the digit preference is more than random variation. When measurements are made by human reading, such as height, then digit preference will be expected. If digit preference is found in data that should have been machine-generated, such as electronic blood pressure readings or multichannel analyser results for biochemical tests, then this becomes good evidence that some subversion of the data has taken place. It may also be helpful to examine the pattern of digit preference by investigator in a multicentre trial and by treatment group. In one instance, so far unpublished, there was a clear difference in digit preference between the baseline measurements for several variables by treatment group, when random allocation had supposedly been performed. This constitutes evidence for some form of interference with the data.
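As an illustration of the chi-square check on final digits, the sketch below uses counts tallied here from the leaves of Figure 14.2 (the caption confirms the 21 zeros and 11 eights; the counts for 2, 4, and 6 are this tally's reading of the plot). Only even digits are possible because readings were taken to the nearest 2 mmHg, so there are 4 degrees of freedom. The function name is an assumption, and the critical value is the tabulated upper 5% point.

```python
def digit_chi_square(counts):
    """Chi-square goodness-of-fit statistic against equal expected counts.

    counts maps each possible final digit to its observed frequency.
    Returns (statistic, degrees_of_freedom).
    """
    total = sum(counts.values())
    expected = total / len(counts)
    statistic = sum((obs - expected) ** 2 / expected for obs in counts.values())
    return statistic, len(counts) - 1

# Final-digit counts tallied from Figure 14.2 (78 readings, even digits only)
final_digits = {0: 21, 2: 17, 4: 14, 6: 15, 8: 11}

statistic, df = digit_chi_square(final_digits)
CRITICAL_5_PERCENT_DF4 = 9.488  # tabulated upper 5% point of chi-square with 4 df
print(round(statistic, 2), df, statistic > CRITICAL_5_PERCENT_DF4)  # prints: 3.54 4 False
```

The statistic is well below the 5% critical value, which agrees with the caption's description of the digit preference as only slight.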

## Examining data: two measurements of the same variable

As an example, consider a trial comparing two treatments (A and B) for hypertension. A way of analysing the data is to look at the individual changes in blood pressure and compare these changes between the two groups.

The basic data consist of the values of blood pressure for each individual at randomisation (r) and at final assessment (f). A t test or confidence interval is calculated using the changes (f - r), comparing groups A and B.

### Statistical methods

Ordinary data checking will look for outlying values (values a long way from the mean) of blood pressure at r or f, and in the changes (f - r). Fraudulent data will not usually have outlying values, rather the reverse. Outliers increase the variability of the data more than they affect the mean, so statistical significance using a t test will be reduced. When data have been manipulated by either removing or changing values that are inconvenient from the fraudster's viewpoint, or when data are completely invented, the range of data will not be extreme. The data will have the outliers removed or "shrunk" towards the mean, and some values may have small changes to increase the differences between the groups.

A sensitive test of fraud will be to find markedly reduced variability in the changes in the blood pressures. For blood pressure, and for many other variables that are measured in research, there is good knowledge of this variability. It tends not to be examined as carefully as the mean or median values that are reported. Extreme values of the mean or median will be noticed easily, but the usual reader, and even the referee and editor of a paper, will be less likely to examine the variability. For blood pressure, the between-person variability in most studies has a standard deviation of close to 10 mmHg. This variation increases with increasing mean value so that, in studies of hypertensive patients, the variability will be rather larger.

The within-person standard deviation varies with the length of time between the making of the measurements concerned, tending to increase as the time between measurements increases. An alternative way of looking at this is to state that the correlation between the two measurements tends to decrease the further apart they are in time. This will happen without treatment, but will also happen in the presence of treatment. Two measurements repeated within a few minutes tend to have a correlation that may be as high as 0.8, while measurements a week or so apart tend to have a correlation of about 0.6-0.7. Values several years apart tend to have lower correlations, falling to about 0.3 at 10 years. These are very approximate values, but those working in a field can obtain the relevant values from their own data that have been shown to be genuine. The within-person standard deviation then tends to be about 7 mmHg for values from one week to a few months apart, which is the typical range of time encountered in much research.

The reports of studies, whether genuine or not, often tend to neglect the reporting of the variability of differences. A summary P value may be the only statistic that is given. If this P value is given exactly (rather than P < 0.05), then it is possible to work back to obtain an approximate original standard deviation of the differences. Hence it is possible to see if there is a hint that the data do not have the expected variability.
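Working back from an exact P value might look like the sketch below. It assumes a two-group comparison of mean changes with a pooled standard deviation, and it uses the normal approximation to the t distribution, so it is only a rough guide for small groups. The function name and the example report are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def implied_sd_of_changes(mean_difference, n_a, n_b, p_two_sided):
    """Approximate pooled SD of changes implied by an exact two-sided P value.

    Steps: P value -> z statistic -> standard error of the difference -> SD.
    """
    z = NormalDist().inv_cdf(1 - p_two_sided / 2)
    standard_error = abs(mean_difference) / z
    return standard_error / sqrt(1 / n_a + 1 / n_b)

# Hypothetical report: 30 patients per group, difference in mean change
# of 5 mmHg between groups, quoted as P = 0.00001
print(round(implied_sd_of_changes(5.0, 30, 30, 0.00001), 1))
```

The implied value can then be set against the variability usually seen in the field; a figure well below that expected from genuine data is a hint that merits a closer look, not proof of anything.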

When changes are examined, which will always be the case when any paired statistical significance tests are done as well as when changes are compared between groups, then the variability of the changes should also be given. It is well known that comparisons in a table are made more easily by going down a column than across a row. This means that the same values in different groups should be given in columns so that comparisons may be made more easily.

The issue of variability of changes is not examined carefully enough. All too often, the baseline and final means and standard deviations are presented with just a P value for the comparison of the changes. Firstly, this makes detection of bad data more difficult; secondly, in order to plan further studies using those changes, especially when calculating sample size, the essential information is not available in the publication. This must be one of the most frequent problems encountered by the consulting statistician helping to plan sample sizes for new research.

### Graphical methods

Graphical methods tend not to be used for pairs of observations although, when the pairs of points are shown joined by lines, it is possible to see when variability is too little by noting that all the lines are parallel. When the same variable is repeatedly measured, this type of graph can be used, but it is rarely done. The usual graphs do not indicate anything of the within-person variability. With modern statistical graphics it is easy to identify different centres or investigators with different plotting symbols. These can be helpful in exploratory analysis of the data rather than in graphs intended for publication.

## Examining data: two or more variables at a time

### Statistical methods

When data are invented to manipulate or show an effect that is not present or not present so clearly in the genuine data, then a skilled manipulator will perhaps be able to produce convincing data when viewed in one dimension. It is very much more difficult to retain the nature of real data when viewed in two dimensions. The relationship between variables tends to disappear. In a well-documented example,9 a laboratory study on animal models of myocardial infarction involved a number of variables. The simplest example of this problem was the data relating weight of the dogs versus the weight of the left ventricle. In this example of very elaborate forgery, the range and variability of left ventricle weight was high, in fact higher than in the genuine data, with a similar range for the weights of the dogs. The correlation between these two measurements was very much weaker. The situation with infarct size versus collateral blood flow was even worse, where the variability in collateral blood flow was very much less than expected and the relationship that should have existed was absent.

This type of problem is not easy to detect by simply reading a paper, but ought to be detected by a statistician with access to the raw data and familiar with the science of the study. In some cases, a correlation matrix may be presented, and careful examination of this may show unexpected findings that raise the index of suspicion.

In the example quoted,9 the study was being carried out in several laboratories simultaneously so that the differences between the laboratories could be studied very easily. In fact, the study itself was set up because of previous inconsistencies in the results from different laboratories.

In many situations, there are no data available from multicentre studies and considerable experience in the field may be necessary to detect the problem.

The situation with regard to several variables is an extension of that seen with two. The variables on their own tend to show reduced variability, but even when this is not so, the relationships among many variables become much weaker than they should be.

As has been noted above, the examination of the correlation matrix may also show where relationships are too weak (or, on occasions, too strong) for genuine data. This approach essentially examines the relationships between pairs of variables. True multivariate methods are able to look at the effect of many variables simultaneously. These can be of use in sophisticated data checking.
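A pairwise check of this kind can be run separately for each centre or laboratory. The sketch below computes the Pearson correlation between two variables within each centre; the records are hypothetical (they are not the data from the study cited above), with one laboratory showing the expected strong relationship and the other almost none.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def correlation_by_centre(records):
    """records: iterable of (centre, x, y). Returns {centre: correlation}."""
    grouped = {}
    for centre, x, y in records:
        xs, ys = grouped.setdefault(centre, ([], []))
        xs.append(x)
        ys.append(y)
    return {centre: pearson_r(xs, ys) for centre, (xs, ys) in grouped.items()}

# Hypothetical body weight (kg) and left ventricle weight (g) from two laboratories
records = [
    ("lab_1", 18, 82), ("lab_1", 20, 90), ("lab_1", 22, 101), ("lab_1", 25, 112), ("lab_1", 27, 120),
    ("lab_2", 18, 105), ("lab_2", 20, 84), ("lab_2", 22, 118), ("lab_2", 25, 88), ("lab_2", 27, 102),
]
for centre, r in correlation_by_centre(records).items():
    print(centre, round(r, 2))
```

With real data the groups would be far larger, and a weak correlation in one centre would prompt investigation rather than a conclusion.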

The first, well-known, multivariable method examines the "influence" of individual observations. It is of most help where data errors have been made and for ensuring that single observations do not distort the results of an analysis too much.

The basic idea is to have a single outcome variable that is the measurement of greatest importance. This is used as the response (dependent) variable in a multiple regression analysis, with a number of possible explanatory (independent) variables, including one for the treatment group if a comparative study is being analysed. The first step is to use standard methods of multiple regression. This entails obtaining as good a fit to the data as possible, which also makes biological sense. For these purposes, it may be reasonable to obtain the best fitting equation (also called a "model"), regardless of how sensible it is in biological terms. The inclusion of variables that are not thought to be medically relevant may indicate that there are problems with the data. The relationships with such variables may merit further investigation.

There are several measures of "influence" available; probably the best of them is "Cook's distance". This is like a residual in multiple regression: the distance between an observed point and the value predicted for that point by the regression equation. It measures how far off a point is in both the X and Y directions. An ordinary residual may not be very informative, since outliers may have small residuals in that they "attract" the regression line towards them. An alternative is a "deleted" residual, which involves calculating the equation for the regression line excluding that point, and obtaining the residual from the value predicted by this regression equation. This will be very effective when a single outlying point is present in the data. An outlier can influence the regression equation in two ways. It can influence the "height" when it is in the centre of the data, but it influences the slope when it is also an outlier in at least one explanatory variable. This effect is known as "leverage", and is illustrated in Figure 14.3, where only two dimensions are shown. The usual measures of leverage effectively relate to how far points are away from the centre in the X direction. Cook's distance for the outlier in Figure 14.3b is very large because it has leverage as well as being an outlier. In Figure 14.3a it is an outlier but does not have leverage and has a smaller Cook's distance. The slope of the line in Figure 14.3b has been notably altered by a single observation. Single outliers are probably more likely to be data errors than invented data, but investigation of the reasons for the outlier may be important in the data analysis for genuine as well as fraudulent data. Statistical tests for multiple outliers do exist; they are beyond the scope of this introduction, but the Hadi statistic is implemented in the statistical packages Stata (College Station, Texas) and DataDesk (Ithaca, New York). Statistical mathematics will not usually be as helpful as graphics in detection of problems.
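For the two-dimensional case, leverage and Cook's distance can be computed directly. The sketch below assumes a simple linear regression of y on x with the standard formulas (leverage h = 1/n + (x - mean)² / Sxx, and Cook's distance = e² / (p s²) × h / (1 - h)² with p = 2 fitted parameters). The two small data sets are hypothetical: in the first the outlier sits at the centre of the x values, in the second at the extreme.

```python
def cooks_distances(xs, ys):
    """Leverage and Cook's distance for each point in a simple regression of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    p = 2  # fitted parameters: intercept and slope
    residual_variance = sum(e ** 2 for e in residuals) / (n - p)
    leverages = [1 / n + (x - mean_x) ** 2 / sxx for x in xs]
    cooks = [
        (e ** 2 / (p * residual_variance)) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverages)
    ]
    return leverages, cooks

# Hypothetical haemoglobin (g/dL) and haematocrit (%) values
# (a) outlier in y at the centre of the x values: little leverage
lev_a, cook_a = cooks_distances([11, 12, 13, 14, 15, 16, 17], [33, 36, 39, 50, 45, 48, 51])
# (b) outlier that is also extreme in x: high leverage
lev_b, cook_b = cooks_distances([11, 12, 13, 14, 15, 16, 20], [33, 36, 39, 42, 45, 48, 45])

print(round(max(cook_a), 2), round(max(cook_b), 2))
```

The largest Cook's distance in set (b) is many times that in set (a), mirroring the contrast the text draws between the two panels of Figure 14.3.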

Figure 14.3 Use of Cook's distance to measure the influence of an outlying point. (Panels a and b each plot haematocrit vs haemoglobin.)

The problem with an invented data point is that it is unlikely to be an outlier in any dimension; in fact, the exact opposite is true. Invented data are likely to have values that lie close to the mean for each variable that has been measured. In one instance of invented data of which I had experience, the perpetrator used the results of an interim analysis of means of all the measured variables to generate four extra cases. For these cases, either the original data were lost, or the results did not fit the desired pattern, and the records were destroyed by the perpetrator. The means of the two "treatment" groups had been provided for all haemodynamic and biochemical measurements. The perpetrator used these means as a guide so that the invented data consisted of the nearest feasible numbers close to the relevant group mean with minor changes. This meant that the data for the two individuals in each "treatment" group were similar but not absolutely identical.

These data have the effect of increasing sample size and reducing the standard deviation of every measured variable. This can have a noticeable effect on the P values - it can change from P = 0.07 to P = 0.01.
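The mechanism can be shown with a small numerical sketch. The values below are hypothetical, not the data from the case described: two extra cases close to each group mean are appended, and the pooled-variance t statistic is compared before and after.

```python
import statistics
from math import sqrt

def t_statistic(group_a, group_b):
    """Pooled-variance two-sample t statistic."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * statistics.variance(group_a) + (n_b - 1) * statistics.variance(group_b)
    ) / (n_a + n_b - 2)
    diff = statistics.fmean(group_a) - statistics.fmean(group_b)
    return diff / sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Hypothetical genuine measurements in two "treatment" groups
genuine_a = [88, 95, 79, 102, 91, 84, 97, 90]
genuine_b = [92, 99, 85, 108, 96, 89, 103, 98]

# Two invented cases per group, placed close to each group's mean
padded_a = genuine_a + [91, 90]
padded_b = genuine_b + [96, 97]

print(round(statistics.stdev(genuine_a), 1), round(statistics.stdev(padded_a), 1))
print(round(t_statistic(genuine_a, genuine_b), 2), round(t_statistic(padded_a, padded_b), 2))
```

Both effects work in the same direction: the standard deviation falls and the sample size rises, so the t statistic moves further from zero although the group means have barely changed.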

Such invented data cannot be detected by any of the usual checks. However, one method for looking for outliers can also be used to detect "inliers". It is not unusual for a value of one variable for any case to be close to the mean. It is less likely that it will be close to the mean of an entirely unrelated variable. The probability of it being close to the mean of each of a large number of variables in any individual case is then very low. The distance of a value from the mean for one variable can be expressed in units of standard deviation, a "Z score". This distance can be generalised to two dimensions, when the distance from a bivariate mean (for example, diastolic blood pressure and sodium concentration each expressed as Z scores) can be calculated using Pythagoras' theorem. A measure of the distance from a multivariate mean, equivalent to the square of a Z score, is called the Mahalanobis distance. The distribution of this distance should follow a chi-square distribution, approximately. Very large values (outliers) can be detected in this way. Although not mentioned in textbooks, this can also be used to detect "inliers", looking for much smaller values of Mahalanobis distance. Figure 14.4 shows the Mahalanobis distances, on a logarithmic scale, for a set of data to which two "inliers" have been added. It is possible to use formal statistical tests to indicate the need for further investigation, but on their own, they cannot prove the existence of fabrication.
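A minimal sketch of the "inlier" search for two variables is given below. It assumes the squared Mahalanobis distance is computed from the sample means and the sample covariance matrix of the data being checked, with the 2 × 2 matrix inverse written out by hand; a real application would use many more variables and a statistical package. The patient values are hypothetical, with the last two constructed to lie close to the mean on both variables.

```python
import statistics

def mahalanobis_squared_2d(points):
    """Squared Mahalanobis distance of each (x, y) point from the bivariate mean."""
    xs, ys = zip(*points)
    n = len(points)
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    var_x, var_y = statistics.variance(xs), statistics.variance(ys)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    det = var_x * var_y - cov_xy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # Quadratic form d' S^-1 d written out for the 2 x 2 covariance matrix S
        distances.append((var_y * dx * dx - 2 * cov_xy * dx * dy + var_x * dy * dy) / det)
    return distances

# Hypothetical (diastolic pressure in mmHg, sodium in mmol/L) pairs;
# the last two cases were made up to sit near the mean of both variables
patients = [
    (78, 138), (92, 143), (85, 136), (101, 141), (70, 140),
    (88, 145), (95, 137), (82, 142), (86, 140), (87, 140.5),
]
d2 = mahalanobis_squared_2d(patients)
smallest_two = sorted(range(len(d2)), key=d2.__getitem__)[:2]
print(sorted(smallest_two))  # positions of the two constructed cases: [8, 9]
```

Plotting the distances on a logarithmic scale, as in Figure 14.4, makes very small values stand out as clearly as very large ones.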

A similar approach for categorical data was used by RA Fisher in his examination of the results of Mendel's genetic experiments on garden peas. Several experiments had observed frequencies that were too close to the expected ones. The usual statistical test for comparing observed and expected frequencies is the chi-square test, which ordinarily looks for small P values. Here the P values were instead very close indeed to 1, for example, 0.99996. The probability of observing a chi-square value this small or smaller is then 1 - 0.99996 = 0.00004, which is strong evidence that the usual chance processes are not at work. Some geneticists have doubted Fisher's suggestion that a gardening assistant was responsible for producing Mendel's data, and there is some doubt that these data are reliable, but as Fisher concludes, "There is no easy way out of the difficulty."10
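The "too good a fit" calculation can be sketched for a single experiment with a 3:1 expected ratio. The counts below are hypothetical, not Mendel's. For one degree of freedom the lower-tail probability has the closed form erf(√(x/2)), which is used here; the function names are assumptions.

```python
from math import erf, sqrt

def chi_square_3_to_1(dominant, recessive):
    """Chi-square statistic (1 df) for observed counts against a 3:1 expectation."""
    total = dominant + recessive
    expected = (0.75 * total, 0.25 * total)
    return sum((o - e) ** 2 / e for o, e in zip((dominant, recessive), expected))

def lower_tail_probability_1df(statistic):
    """P(chi-square with 1 df <= statistic); a small value means a suspiciously close fit."""
    return erf(sqrt(statistic / 2))

# Hypothetical counts, not Mendel's: 751 dominant and 249 recessive out of 1000
stat = chi_square_3_to_1(751, 249)
print(round(stat, 4), round(lower_tail_probability_1df(stat), 3))
```

One close fit is unremarkable; it is the accumulation of close fits across many experiments that produces probabilities as extreme as the one quoted above.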

• There are statistically significant effects in small studies where most investigators need larger ones.

• Measures of variability or standard errors are absent.

However, it is also important to realise that with any diagnostic procedure there are false positives and false negatives. If we regard a false negative as failing to recognise fraud or manipulation when it is present, then there is also the possibility of the false positive - accusing someone of fraud when it is absent. Inexperienced or ignorant investigators are particularly prone to problems with the third and fifth elements in the list above (P values without data and absence of measures of variability). This is not necessarily fraud. The guidelines on presentation of statistical evidence in Altman et al.4 make it clear that these practices should be avoided. Obtaining extra evidence is the only way of reducing the rate of both these errors simultaneously. When the whiff becomes somewhat stronger, it becomes a

Figure 14.4 Distribution of Mahalanobis distance for a set of data to which two "inliers" have been added. (Axes: patient; Mahalanobis distance, with tick marks at 10 and 100.)

FRAUD AND MISCONDUCT IN BIOMEDICAL RESEARCH

Figure 14.5 Residual risk region by sequence (horizontal axis: sequence number, 1 to 51). Reprinted by permission of the publisher from "Detecting fabrication of data in a multicenter collaborative animal study" by KR Bailey (Controlled Clinical Trials 1991; 12: 741-52). Copyright 1991 Elsevier Science Publishing Co, Inc.

### Routine scanning of data

Most data are not fraudulent and it is simply not sensible, or practically possible, to scrutinise all data with a complete battery of tests for possible fraudulent data. What is required are some simple checks that go a little beyond the ordinary data checking that should be done prior to analysis of any set of data.

The first step is to examine variability. Comparisons may be made between centres, between treatment groups or other "interesting" groups, or with previous experience. The usual object is to look for increased variation, but it is important to also look for reduced variation. It may also be reasonable to check for baseline imbalance in the outcome variable if this is measured at baseline. Such imbalances will occur by chance on occasions, but they are particularly important in the outcome or a variable strongly correlated with the outcome. It is an indicator of misconduct when such differences are found consistently with one investigator, but are absent in data from all other investigators.

Box 14.1 Routine checking

• Variability and kurtosis

• Baseline imbalance in outcome variable

• Scatter-plot matrix by investigator/centre

Standard statistical computer programs calculate standard deviation and usually also calculate kurtosis. It is a reasonable minor extra effort to check the kurtosis as well as the standard deviation.
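A routine scan for reduced variation might be sketched as follows. The screening rule used here (a standard deviation below half the median across centres) is an arbitrary assumption chosen for illustration, not a formal test, and the centre data are hypothetical.

```python
import statistics

def sd_by_centre(records):
    """records: iterable of (centre, value). Returns {centre: sample SD}."""
    grouped = {}
    for centre, value in records:
        grouped.setdefault(centre, []).append(value)
    return {centre: statistics.stdev(values) for centre, values in grouped.items()}

def flag_low_variability(sds, ratio=0.5):
    """Centres whose SD is below `ratio` times the median SD across centres."""
    median_sd = statistics.median(sds.values())
    return [centre for centre, sd in sds.items() if sd < ratio * median_sd]

# Hypothetical diastolic pressures (mmHg) from three centres
records = (
    [("A", v) for v in (78, 92, 85, 101, 70, 88)]
    + [("B", v) for v in (95, 82, 76, 104, 89, 91)]
    + [("C", v) for v in (84, 86, 85, 87, 85, 86)]
)
sds = sd_by_centre(records)
print({centre: round(sd, 1) for centre, sd in sds.items()})
print(flag_low_variability(sds))  # prints: ['C']
```

The same grouping could be repeated by treatment group or time period, and extended to kurtosis, at very little extra cost.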

When the obvious errors in the data have been corrected, then it is reasonable to produce a scatter plot matrix of the variables that might be expected to relate to one another. The extra work required in checking such a scatter plot, even if there are separate plots for a number of centres, will take only a few minutes. Regular use of these techniques will enable the data analyst to become familiar with the pattern shown by genuine data.

It is not difficult with computer programs to be able to obtain histograms of final digits. In some circumstances, it may be reasonable to carry out regular checking for digit preference. However, this is probably unnecessary in routine use (Box 14.1).

### More extensive checking

Digit preference is clearly an area that is useful for finding definite problems. This is especially true where automatic measurements are usually made, but differences in digit preference within the same study may show that investigation is warranted.

In many trials, there are repeated observations on the same individuals. When these are summarised, the within-person variation is often not studied or, if it is, then it is examined for large values that can indicate possible data errors. Again, reduced variation is an indicator of possible misconduct. Occasionally zero variation is seen in invented data for the same person. In one instance, not published, an investigator suggested that measurement of secondary or safety variables in a trial was of no interest, so it did not matter very much if they were not measured properly. Graphical display of the individual data over time with different visits can be very helpful to show different patterns for different centres or investigators.
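Checking for implausibly small within-person variation is straightforward once the repeated values are grouped by individual. The sketch below assumes the data are held as a mapping from a patient identifier to that patient's repeated measurements; the identifiers, values, and default threshold are hypothetical.

```python
import statistics

def within_person_sd(measurements):
    """measurements: {patient_id: [repeated values]}. Returns {patient_id: SD}."""
    return {
        pid: statistics.stdev(values)
        for pid, values in measurements.items()
        if len(values) > 1  # an SD needs at least two measurements
    }

def flag_low_within_person_variation(measurements, threshold=0.0):
    """Patients whose within-person SD is at or below the threshold (default: exactly zero)."""
    return [pid for pid, sd in within_person_sd(measurements).items() if sd <= threshold]

# Hypothetical repeated diastolic pressures (mmHg) over four visits
visits = {
    "p01": [88, 92, 85, 90],
    "p02": [80, 80, 80, 80],
    "p03": [95, 101, 97, 94],
}
print(flag_low_within_person_variation(visits))  # prints: ['p02']
```

Raising the threshold above zero would also catch series that vary far less than is usual for the measurement concerned, at the cost of more false alarms.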

Cluster analysis can be employed where there is a possibility that test results have been obtained by splitting a blood sample from a single individual into two or more aliquots, and then sent for testing as if they came from two or more individuals. This technique and the use of Mahalanobis distances can show observations that are too similar to one another. It is helpful if genuine duplicate samples are also available, so that it can be seen that they differ by at least as much as those that purport to come from different individuals. The use of Mahalanobis distances is not justified on a routine basis, but they are very simple to obtain with modern statistical software, and plotting them by treatment group is not arduous (Box 14.2).

Box 14.2 More extensive checking

• Digit preference

• Within-individual variation - look for reduced variation

• Cluster analysis

• Mahalanobis distances
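As a simpler stand-in for a full cluster analysis, the sketch below lists the pairs of samples whose test results are closest to one another, using Euclidean distance after scaling each test to unit standard deviation. This is an assumed simplification, not the method of any particular package, and the sample profiles are hypothetical.

```python
import statistics
from itertools import combinations
from math import sqrt

def closest_pairs(profiles, top=3):
    """profiles: {sample_id: [test results]}. Returns the `top` closest pairs
    as (distance, id_1, id_2), after scaling each test to unit SD."""
    ids = list(profiles)
    columns = list(zip(*(profiles[i] for i in ids)))
    sds = [statistics.stdev(col) for col in columns]
    scaled = {i: [v / s for v, s in zip(profiles[i], sds)] for i in ids}
    pairs = [
        (sqrt(sum((a - b) ** 2 for a, b in zip(scaled[i], scaled[j]))), i, j)
        for i, j in combinations(ids, 2)
    ]
    return sorted(pairs)[:top]

# Hypothetical sodium (mmol/L), potassium (mmol/L), creatinine (umol/L) results
profiles = {
    "s1": [140, 4.1, 88],
    "s2": [136, 4.8, 102],
    "s3": [140, 4.1, 89],
    "s4": [143, 3.7, 75],
    "s5": [138, 4.4, 95],
}
for distance, first, second in closest_pairs(profiles):
    print(first, second, round(distance, 2))
```

The distances between suspect pairs can then be compared with those between genuine duplicate samples, as suggested above.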

### Corroborative evidence

It is important to realise that unusual patterns appear in genuine data. It is easy for someone whose eyes are opened to the possibility of misconduct to see it everywhere when they have not studied a great deal of genuine data. When investigating misconduct, it is good practice to set out a protocol for the analyses, with specific null hypotheses to be tested. (A Bayesian analysis will of course also be possible and may be regarded as philosophically preferable.) This protocol will usually be able to state that a particular centre or investigator has data that differ from the others in specified ways. It is advisable for the statistician carrying out such an investigation to be supplied with data from all investigators, or as many as possible. They should also not know, if possible, the identity of the suspicious centre. They can then treat each centre in a symmetrical way and look for a centre for which there is strong evidence of divergence. Finding strong evidence, from a formal statistical analysis, is then independent evidence of misconduct. It would rarely be taken as sufficient evidence on its own, but with other evidence may prove that there really was misconduct. Fraud implies intention, and statistical analysis would not impinge on this directly.

All of the methods listed above may be used in a search for evidence, and the totality of the evidence must be used to evaluate the possible misconduct. Innocent explanations may be correct. For example, a trial found one centre with very divergent results in the treatment effect observed. It turned out on further simple investigation that this investigator was from a retirement area of the South Coast of England and the ages of all the patients were very much higher than in the other centres. There was an effect of age on the treatment and this was accepted as an explanation for divergence (Box 14.3).

Box 14.3 Corroborative evidence

• Predefine a protocol

• Use as simple methods as possible

• Consider innocent explanations