## Empirical Bayes

The existence of thousands of genes in the microarray time course context brings to mind the empirical Bayes (EB) approach to inference. This is a model-based way of introducing moderation into the analysis.

In Tai and Speed (34), a multivariate hierarchical normal model with conjugate priors is proposed to derive the posterior odds for differential expression in the one- and two-sample problems. The model is designed for longitudinal data and takes into account correlations across time points. When all genes have the same number of replicates, the MB-statistic for the one-sample problem (null hypothesis: the expected profile equals 0) or for the paired two-sample problem (null hypothesis: the two expected profiles are the same) is a monotonic increasing function of the statistic $\tilde{T}^2 = \tilde{t}'\tilde{t}$, where

$$\tilde{t} = n^{1/2}\,\tilde{S}^{-1/2}\,\bar{X}$$

is just the traditional multivariate t-statistic with the denominator replaced by a moderated covariance matrix $\tilde{S}$. This matrix incorporates the gene-specific covariances but also shares covariance information across genes:

$$\tilde{S} = \frac{(n-1)\,S + \nu\Lambda}{n-1+\nu}$$

Here $S$ is the gene-specific sample variance-covariance matrix; $\bar{X}$ is the gene-specific sample average time course vector; $n$ is the number of replicates; and $\nu$ and $\Lambda$ are hyperparameters estimated from the whole gene set (see 34 for details). The MB-statistic or $\tilde{T}^2$ statistic for the independent two-sample problem is also derived in Tai and Speed (34), where differential expression now means that the expected profiles differ between the two biological conditions. It is shown there that the MB-statistic achieves the lowest numbers of false positives and false negatives, and performs about as well as the moderated Hotelling $T^2$ statistic. One advantage of the multivariate empirical Bayes approach is that it provides a natural way to estimate the gene-specific moderated sample covariance matrix, while the likelihood-ratio-based approach (the moderated Hotelling $T^2$ statistic) does not.
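The one-sample calculation can be illustrated with a minimal NumPy sketch. It assumes the moderated covariance is the weighted average $((n-1)S + \nu\Lambda)/(n-1+\nu)$ and that the hyperparameters $\nu$ and $\Lambda$ have already been estimated from the whole gene set; the function name `moderated_t2` and its arguments are hypothetical, and the estimation of the hyperparameters described in (34) is not shown.

```python
import numpy as np

def moderated_t2(replicates, nu, Lambda):
    """Moderated T^2 for a single gene (illustrative sketch only).

    replicates : (n, k) array, n >= 2 replicate time courses of length k >= 2
    nu, Lambda : hyperparameters, assumed to be estimated elsewhere
    """
    Y = np.asarray(replicates, dtype=float)
    n = Y.shape[0]
    xbar = Y.mean(axis=0)                       # sample average time course
    S = np.atleast_2d(np.cov(Y, rowvar=False))  # sample covariance, divisor n - 1
    # Shrink the gene-specific covariance towards the common matrix Lambda.
    S_tilde = ((n - 1) * S + nu * np.asarray(Lambda, dtype=float)) / (n - 1 + nu)
    # t't = n * xbar' S_tilde^{-1} xbar, so no matrix square root is needed.
    t2 = n * xbar @ np.linalg.solve(S_tilde, xbar)
    return t2, S_tilde

# Toy gene: 3 replicates, 2 time points, hypothetical nu = 2 and Lambda = I.
t2, S_tilde = moderated_t2([[1, 2], [3, 2], [2, 5]], nu=2, Lambda=np.eye(2))
```

Because $\tilde{T}^2 = n\,\bar{X}'\tilde{S}^{-1}\bar{X}$, solving a linear system avoids forming $\tilde{S}^{-1/2}$ explicitly. With small $n$, the term $\nu\Lambda$ keeps $\tilde{S}$ invertible even when $S$ itself is singular.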

Let $Y_{di}$ be the $i$-th replicate for the $d$-th condition, and let $\bar{Y}$ and $\bar{Y}_d$ be the overall sample average and the average for the $d$-th condition only, respectively. When there are more than two biological conditions and all genes have the same number $n$ of replicates within each condition, the posterior odds for a difference between conditions under a conjugate normal model are proportional to

where

$$\mathrm{TSSP} = \sum_{d=1}^{D}\sum_{i=1}^{n} (Y_{di} - \bar{Y})(Y_{di} - \bar{Y})' \quad\text{and}\quad \mathrm{WSSP}_d = \sum_{i=1}^{n} (Y_{di} - \bar{Y}_d)(Y_{di} - \bar{Y}_d)'$$

are the total and within-condition sums of squares and products, and $M_d$, $M$, and $\nu\Lambda$ are matrices involving the (condition-specific) prior means and variance-covariance matrices, respectively. This is our EB analogue of Wilks' likelihood-based $\Lambda$ from MANOVA (see 36 for details).
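The total and within-condition sums of squares and products can be computed directly from the replicate profiles. The sketch below assumes a balanced design stored as an array of shape `(D, n, k)` (conditions, replicates, time points); the function name and array layout are illustrative choices, not taken from (36), and the prior-dependent terms in the posterior odds are not included.

```python
import numpy as np

def sums_of_squares_and_products(Y):
    """Return (TSSP, WSSP) for one gene in a balanced design.

    Y    : (D, n, k) array of D conditions, n replicates, k time points
    TSSP : (k, k) total sums of squares and products
    WSSP : (D, k, k) within-condition sums of squares and products
    """
    Y = np.asarray(Y, dtype=float)
    D, n, k = Y.shape
    stacked = Y.reshape(D * n, k)
    dev_total = stacked - stacked.mean(axis=0)        # Y_di - overall average
    tssp = dev_total.T @ dev_total
    dev_within = Y - Y.mean(axis=1, keepdims=True)    # Y_di - condition average
    wssp = np.einsum("dni,dnj->dij", dev_within, dev_within)
    return tssp, wssp

# Toy gene: 2 conditions, 2 replicates, 2 time points.
Y = [[[1, 0], [3, 0]],
     [[5, 2], [7, 2]]]
tssp, wssp = sums_of_squares_and_products(Y)
```

In a balanced design, TSSP equals the sum of the WSSP matrices plus a between-condition term, which is the same decomposition that underlies Wilks' $\Lambda$ in classical MANOVA.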

For cross-sectional data across $D > 2$ biological conditions, Tai and Speed (37) derive the posterior odds that the expected temporal profiles differ among biological conditions versus being the same. Let $Y_{dji}$ be the log2 intensity value or log2 ratio of a given gene for the $d$-th biological condition, $j$-th time point, and $i$-th replicate. The $Y_{dji}$ are independent across times ($j = 1, \ldots, k$), biological replicates within conditions ($i = 1, \ldots, n_{dj}$), and biological conditions ($d = 1, \ldots, D$). The sampling times need not be the same within and across biological conditions. For the simplest case, we assume that they are, and that all genes have the same number of replicates $n$ for all conditions and times. Again, under a conjugate normal model with unstructured means, the posterior odds are proportional to

where $\bar{Y}_{dj} = n^{-1}\sum_{i=1}^{n} Y_{dji}$ and $\bar{Y}_j = D^{-1}\sum_{d=1}^{D} \bar{Y}_{dj}$ denote the average log2 (relative) expression level at the $j$-th time point for the $d$-th condition only and for all conditions, respectively; $\mathrm{TSS} = \sum_{d=1}^{D}\sum_{j=1}^{k}\sum_{i=1}^{n} (Y_{dji} - \bar{Y}_j)^2$ and $\mathrm{WSS}_d = \sum_{j=1}^{k}\sum_{i=1}^{n} (Y_{dji} - \bar{Y}_{dj})^2$ are the total and within sums of squares, respectively; and $m_d$, $m$, and $\nu\lambda^2$ are quantities involving (condition-specific) prior means and variances. This is a special case of our fully moderated F-statistic, the EB analogue of the traditional F-statistic. Gene selection using either the MB-statistic or the $\tilde{T}^2$ statistic can be based on rankings.
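To make the link with the traditional F-statistic concrete, the sketch below computes the total and within sums of squares for one gene and combines them into the textbook unmoderated F-statistic for equality of condition means at every time point. It assumes a balanced design stored as a `(D, k, n)` array, a variance common to all times and conditions, and unstructured means; this is not necessarily the exact form used in (29) or (37), and the function names are hypothetical. The moderated version, which also involves the prior quantities, is not shown.

```python
import numpy as np

def cross_sectional_ss(Y):
    """Return (TSS, WSS_d) for one gene.

    Y : (D, k, n) array of D conditions, k time points, n replicates
    """
    Y = np.asarray(Y, dtype=float)
    ybar_dj = Y.mean(axis=2)          # (D, k) condition-by-time averages
    ybar_j = ybar_dj.mean(axis=0)     # (k,) averages over conditions
    tss = ((Y - ybar_j[None, :, None]) ** 2).sum()
    wss_d = ((Y - ybar_dj[:, :, None]) ** 2).sum(axis=(1, 2))  # one value per condition
    return tss, wss_d

def traditional_f(Y):
    """Unmoderated F: same mean at each time for all conditions vs. unstructured means."""
    D, k, n = np.asarray(Y).shape
    tss, wss_d = cross_sectional_ss(Y)
    wss = wss_d.sum()
    df_between = (D - 1) * k          # extra mean parameters in the full model
    df_within = D * k * (n - 1)       # residual degrees of freedom
    return ((tss - wss) / df_between) / (wss / df_within)

# Toy gene: 2 conditions, 2 time points, 2 replicates.
Y = [[[1, 3], [0, 2]],
     [[5, 7], [2, 4]]]
tss, wss_d = cross_sectional_ss(Y)
f_stat = traditional_f(Y)
```

With only two or three replicates, the within sum of squares in the denominator is estimated from very few degrees of freedom, which is the situation the moderated statistic is designed to stabilise.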

The multivariate EB procedure in Tai and Speed (34) focuses on moderating the denominator of the multivariate t-statistic and ranks genes according to the moderated statistic $\tilde{T}^2$, to reduce the number of false positives and false negatives resulting from very small or very large replicate variances or covariances. Alternatively, one could replace the numerator of the multivariate t-statistic with a robust estimate, to avoid very large $\tilde{T}^2$ values caused by outliers. Outliers can be common in the microarray time course context, where sample sizes are typically very small (two or three). Incorporating robust methods into the analysis of microarray time course data is a research topic of interest here. Figures 20.3-20.5 give the profiles of the top-ranked genes from Tomancak et al. (8) using the one-sample longitudinal MB-statistic (34), the one-sample cross-sectional MB-statistic with a fifth-degree polynomial model for the means (37), the moderated F-statistic (29), and the usual F-statistic with unstructured means.
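Comparing how one gene ranks under different statistics only requires converting each vector of per-gene scores into ranks. A minimal sketch, assuming that larger scores indicate stronger evidence and that rank 1 is the top gene; the function name and the toy scores are hypothetical.

```python
import numpy as np

def rank_descending(scores):
    """Rank genes so that the largest score gets rank 1 (ties keep input order)."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")   # indices from largest to smallest
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks

# Toy scores for four genes under one statistic.
ranks = rank_descending([2.0, 9.5, 0.3, 4.1])
```

Applying the same function to the scores from each statistic gives, for every gene, one rank per method, which is the kind of comparison reported for the top-ranked genes.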

Figure 20.3. The top gene by the one-sample longitudinal MB-statistic. Panel title: longitudinal MB rank = 1, cross-sectional MB rank = 60, moderated F rank = 32, F rank = 39. Horizontal axis: Hour.

Figure 20.4. The top gene by the one-sample cross-sectional MB-statistic. Panel title: longitudinal MB rank = 52, cross-sectional MB rank = 1, moderated F rank = 17, F rank = 22. Horizontal axis: Hour.

Figure 20.5. The top gene by both the moderated F-statistic and the F-statistic. Panel title: longitudinal MB rank = 5, cross-sectional MB rank = 297, moderated F rank = 1, F rank = 1. Horizontal axis: Hour.
