Whilst it may seem simple to apply the three criteria of randomisation generation/concealment, blinding and ITT to judge the quality of RCTs, it is still uncertain how far these factors can reliably discriminate between "good" and "bad" RCTs in dermatology. Other factors that are disease specific and rely on content knowledge/expertise are likely to be equally important in determining the quality of some dermatology trials. The influence of such disease-specific factors in dermatology is an area that requires further systematic research.
Therefore, as someone with an interest in atopic eczema, I would not trust a study that claimed a beneficial effect for a new treatment if the study included both children and adults with diverse eczematous dermatoses,14 as people with such conditions might respond differently.15 Similarly, the definitions of disease used may be an important quality criterion. For example, if I were reading the report of an RCT of an intervention for bullous pemphigoid, I would want to know that the diagnosis in study participants was confirmed by immunofluorescence in order to distinguish it from other bullous disorders of diverse aetiologies and with differing treatment responsiveness.
In evaluating a clinical trial, look for clinical outcome measures that are clear cut and clinically meaningful to you and your patients.16 For example, in a study of a systemic treatment for warts, complete disappearance of warts is a meaningful outcome, whereas a decrease in the volume of warts is not. The development of scales and indices for cutaneous diseases and testing their validity, reproducibility and responsiveness has been inadequate.16,17 A lack of clearly defined and useful outcome variables remains a major problem in interpreting clinical trials in dermatology.
Until better scales are developed, trials with the simplest and most objective outcome variables are the best. Categorical outcomes lead to the least amount of confusion and have the strongest conclusions. Thus, trials in which a comparison is made between death and survival, patients with recurrence of disease and those without recurrence, or patients who are cured and those who are not cured are studies whose outcome variables are easily understood and verified. For trials in which the outcomes are less clear cut and more subjective, a simple ordinal scale is probably the best choice. The best ordinal scales involve a minimum of human judgement, have a precision that is much smaller than the differences being sought, and are sufficiently standardised to enable others to use them and produce similar results.16,17
In addition to helping to balance known predictors of treatment response such as baseline disease severity (which could act as confounders when comparing treatment efficacy between groups), randomisation has also been suggested to balance out unknown confounders.3 This claim is superficially appealing, but it is difficult to verify if these confounders are indeed unknown. Moreover, randomisation, especially with small sample sizes, may still result in imbalances in possible cofactors that affect treatment response. In other words, randomisation is not a guarantee against imbalance, although more sophisticated methods of randomisation such as blocking and stratification can help to minimise it.7
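As an illustration (the function name and block size here are hypothetical, not drawn from any trial protocol), the idea behind block randomisation can be sketched as follows:

```python
import random

def block_randomise(n_participants, block_size=4):
    """Assign participants to arms A/B in balanced blocks.

    Within each block of `block_size`, exactly half the
    allocations are A and half B, so the group sizes can never
    drift far apart, even in a small trial.
    """
    assignments = []
    while len(assignments) < n_participants:
        block = ["A"] * (block_size // 2) + ["B"] * (block_size // 2)
        random.shuffle(block)  # order within each block is random
        assignments.extend(block)
    return assignments[:n_participants]

allocation = block_randomise(10)
# Group sizes can differ by at most block_size // 2
print(allocation.count("A"), allocation.count("B"))
```

Stratification applies the same idea separately within levels of an important prognostic factor, such as baseline disease severity, so that each stratum stays balanced as well as the trial overall.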
It is quite common to see as the first table in the results section of an RCT report a long list of demographic characteristics of the participants in the different treatment groups and a statement to the effect that "the two groups did not differ statistically at baseline". This statement is problematic for two reasons.
• It is inappropriate to perform such multiple statistical tests without prior hypotheses; indeed, many of the variables recorded may be totally irrelevant to predicting treatment response.
• Gross imbalances between the treatment groups may still fail to reach the arbitrary 5% level of statistical significance simply because the groups are so small.
Before reading such tables, the most important thing to do is to ask oneself, "What are the most important factors that may predict treatment response?" and then to "eyeball" these in the table of baseline characteristics, if they have been recorded. If there are major imbalances in factors such as baseline severity score, these can and should be allowed for in a number of ways during analysis, for example by a multivariate analysis adjusting for baseline severity as a covariate.7
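A minimal sketch with invented severity scores (not data from any real trial) shows how a crude comparison of final scores can mislead when baseline severity is imbalanced, and how analysing change from baseline allows for it:

```python
import statistics

# Hypothetical severity scores (0-100); group A happened to be
# randomised with milder disease at baseline.
baseline_a = [30, 35, 32, 28, 34]
final_a    = [20, 25, 22, 18, 24]
baseline_b = [50, 55, 52, 48, 54]
final_b    = [38, 43, 40, 36, 42]

# A naive comparison of final scores ignores the baseline imbalance...
naive_diff = statistics.mean(final_b) - statistics.mean(final_a)

# ...whereas comparing change from baseline allows for it.
change_a = [f - b for f, b in zip(final_a, baseline_a)]
change_b = [f - b for f, b in zip(final_b, baseline_b)]
adjusted_diff = statistics.mean(change_b) - statistics.mean(change_a)

print(naive_diff)     # large, driven mostly by the baseline imbalance
print(adjusted_diff)  # small: both treatments reduce severity similarly
```

A multivariate analysis with baseline severity entered as a covariate (analysis of covariance) is the more rigorous version of the same adjustment.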
Many dermatology trials report as many as 10 different outcome measures recorded at several different time points. Even by chance, at least 1 in 20 of such outcomes will be "significant" at the 5% level. Therefore, it is important in studies that use multiple outcomes to ensure that the trialists are not data dredging, that is, performing repeated statistical tests on a range of outcome measures and then emphasising only the one that is "significant" at the "magic" 5% level. Such practice is akin to throwing a dart and drawing a dartboard around it. Instead, trialists should declare up front what they would regard as a single "success criterion" for a particular trial. That way, a claim of benefit is more credible if the main success criterion is indeed fulfilled, as opposed to some secondary or tertiary outcome measure that happens to turn out "significant". Sometimes, trialists will try to save face by emphasising a range of less clinically significant biological markers of success when in fact the main clinical comparisons look disappointing.
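The arithmetic behind the "1 in 20" figure can be checked directly. This quick calculation assumes, for simplicity, that the outcomes are independent and that no treatment effect exists at all:

```python
# Chance of at least one spuriously "significant" result when
# testing k independent outcomes, each at the 5% level, in a
# trial where the treatment truly has no effect.
for k in (1, 5, 10, 20):
    p_any = 1 - 0.95 ** k
    print(f"{k:2d} outcomes -> P(at least one p < 0.05) = {p_any:.2f}")
```

With 10 outcomes the chance of at least one spurious "significant" result is roughly 40%, which is why a single pre-declared success criterion matters.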
It is quite common for continuous data such as acne spot counts to have a skewed frequency distribution. It may then be inappropriate to use parametric tests such as Student's t-test without first transforming the data. Alternatively, non-parametric tests that do not rely on the assumption of a normal distribution can be used. A quick check, for a continuous variable that cannot take negative values, is to determine whether the mean minus two standard deviations is less than zero. If it is, the data are likely to be skewed.
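A minimal sketch with invented spot counts shows the mean-minus-two-standard-deviations check in action, and how a log transformation can bring such data closer to normality before a parametric test is applied:

```python
import math
import statistics

# Hypothetical acne spot counts: most patients have few spots,
# one has very many (a right-skewed distribution).
counts = [2, 3, 3, 4, 5, 5, 6, 8, 12, 40]

mean = statistics.mean(counts)
sd = statistics.stdev(counts)

# Quick plausibility check for a non-negative variable: if
# mean - 2*SD falls below zero, the data cannot be near-normal.
print(mean - 2 * sd)  # negative, so the raw counts are skewed

# A log transform often makes such counts more symmetric.
logged = [math.log(c) for c in counts]
print(statistics.mean(logged) - 2 * statistics.stdev(logged))
```

The same data could instead be compared between groups with a rank-based (non-parametric) test, which makes no normality assumption at all.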
Performing a statistical test on something other than the main outcome of interest is a subtle but not uncommon error in dermatology trials.18,19 When comparing a continuous outcome measure such as decrease in acne spots between treatment A and treatment B, the correct way to test the null hypothesis of no difference between the treatments is to examine the difference between the two groups in the change of spot count from baseline. Sometimes the investigators simply test whether the acne lesion count falls from baseline within each of the two groups independently. If the fall in spot count reaches the 5% level in one group but not in the other, the authors may conclude that "therefore treatment A is more effective than treatment B". Perhaps the P value for change in spot count from baseline is 0.04 in one group (i.e. significant) and 0.06 in the other (i.e. conventionally non-significant). This practice is clearly inappropriate, since the difference between the two treatments has not itself been tested.
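The fallacy can be made concrete with invented spot-count changes: each group may improve convincingly from its own baseline while the between-group comparison, the test that actually addresses the question, is unremarkable. The t statistics below are computed from scratch on hypothetical data:

```python
import math
import statistics

def t_one_sample(changes):
    """One-sample t statistic for mean change differing from zero."""
    n = len(changes)
    return statistics.mean(changes) / (statistics.stdev(changes) / math.sqrt(n))

def t_two_sample(x, y):
    """Welch two-sample t statistic for the difference in mean change."""
    se = math.sqrt(statistics.variance(x) / len(x)
                   + statistics.variance(y) / len(y))
    return (statistics.mean(x) - statistics.mean(y)) / se

# Hypothetical changes in spot count from baseline (negative = improvement).
change_a = [-6, -4, -7, -5, -3, -6, -5, -4]
change_b = [-5, -3, -6, -4, -2, -5, -4, -3]

# Each group improves markedly from its own baseline (large |t|)...
print(t_one_sample(change_a), t_one_sample(change_b))
# ...but the between-group statistic, the one that actually tests
# treatment A against treatment B, is small.
print(t_two_sample(change_a, change_b))
```

Only the second statistic addresses the null hypothesis of no difference between the treatments; the two within-group tests answer a different, and much easier, question.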
Misinterpreting trials with negative results is a common error in dermatology clinical trials.20 Failure to find a statistically significant difference between treatments should not be interpreted as meaning that the "treatment is ineffective". Put another way, no evidence of effect is not the same as evidence of no effect.21 In many dermatology trials the sample sizes are too small to detect clinically important differences. Providing 95% confidence intervals around the main response estimates allows readers to see what kind of effects might have been missed. For example, in an RCT of famotidine versus diphenhydramine for acute urticaria, itch as measured on a 100 mm visual analogue scale decreased by 36 mm in the famotidine group and by 54 mm in the diphenhydramine group, a difference of 18 mm (54 - 36) in favour of diphenhydramine. Although the statistical test for this 18 mm difference between the two treatment groups was not significant at the 5% level, there was a trend towards a greater reduction in itch in the diphenhydramine group. The 95% confidence interval around the 18 mm difference was from -3 mm to 38 mm. In other words, the results were compatible with a difference of as much as 3 mm in favour of famotidine or as much as 38 mm in favour of diphenhydramine.22
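The interval in the urticaria example can be reconstructed approximately. The standard error below is an assumed value chosen to reproduce roughly the reported interval, not a figure taken from the original report:

```python
# Reconstructing the urticaria example: an 18 mm difference on a
# 100 mm itch scale. The standard error of about 10.5 mm is an
# assumed value, back-calculated to match the reported interval.
diff = 18.0
se = 10.5
lower = diff - 1.96 * se
upper = diff + 1.96 * se
print(f"95% CI: {lower:.1f} to {upper:.1f} mm")
# close to the reported interval of -3 to 38 mm
```

Because the interval crosses zero the result is "non-significant", yet its upper end (a 38 mm greater reduction in itch) would be clinically important, which is exactly why a wide confidence interval should not be read as evidence of no effect.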
Once randomised, it is important that the two intervention groups are followed up in similar ways. Previous studies have shown the nonspecific benefits of being included in a clinical trial, even in placebo groups.23 Part of the benefit might be the result of better ancillary care prompted by frequent follow ups and being "fussed over" by study assessors.7 It is important therefore to scrutinise whether the treatment groups have been treated equally in terms of frequency and duration of follow up and whether they have been afforded identical privileges except for the treatment under investigation.
It is natural to assume that a clinical trial of a drug that has taken years of investment by a drug company and that is sponsored by that same company will strive to demonstrate that the drug is successful. Indeed, millions of dollars of profit may rely on convincing opinion leaders in dermatology of a new drug's worth. Yet the influence of sponsorship on efficacy claims has not been tested in dermatology RCTs. Drug companies and trialists have many opportunities to influence journal readers when the results of their trial are published (Box 9.3).
It should not be assumed that biases in relation to sponsorship are confined to the pharmaceutical industry. Those conducting trials for government agencies might hope to show that a new drug is less cost-effective than standard therapy. Some independent clinicians with preformed