## Consider fudge factors and be robust against outliers

Often, the t-score gives top ranks to genes with very small mean intensity differences if the expression values are almost constant in each condition. In terms of the applicability of the t-test this is no problem. As a biologist, however, you might not like it. To block these genes from top ranks, you can artificially enlarge low variances. First attempts to correct for low variances were made by Efron et al. (7) and Tusher et al. (8) by introducing a fudge factor that is added to each gene's variance estimate. Typically, the fudge factor is chosen from the set of variances of all genes. Like Efron et al. (7) you can choose this value manually. Choosing a very large fudge factor approximates differences of means, a zero factor results in the classical t-score, and probably an intermediate value will work best for you. An automatic choice is implemented in the software SAM (Significance Analysis of Microarrays), see Tusher et al. (8). The implementation in package twilight selects the median of all standard deviations as fudge factor. Type:

score3 <- twilight.pval(X,y,method="z",paired=FALSE)

Genes with small variances are one type of gene with high t-scores but little biological relevance; the other type are genes with outlying values. The Wilcoxon rank-sum score works on ranks instead of numerical values, and is less sensitive to outliers. The price for robustness is loss of information that was contained in the numerical values. In R, you can define a new function wilc that incorporates the base function wilcox.test. Note that we changed the Wilcoxon score such that it is distributed around 0. Assume that your condition labels are coded as 1 for the first condition and 0 for the second condition such that the label vector y is binary. Function wilc is then applied to each gene.

wilc <- function(expression,labels){

result <- wilcox.test(expression~labels,paired=FALSE)$statistic return( result - sum(labels)*sum(1-labels)/2 ) }

A score that protects against both outliers and constant genes was introduced by Smyth (9) as moderated t-score. The idea is 'to borrow information across genes' with the intention to raise small variances and to shrink larger variances to an overall variance. The approach is implemented in the R package limma. The computation of limma scores in R needs three steps. First, fit a linear model for two conditions to each gene. Second, define the contrast you are interested in. That is the difference between the two condition averages. Third, compute moderated t-scores with function ebayes.

library(limma)

a <- lmFit(X,design=cbind(y,1-y)) b <- contrasts.fit(a,contrasts=matrix(c(1,-1),ncol=1)) score5 <- ebayes(b)$t

## Post a comment