## SE a

For each chip, indexed by j, we thus get both an expression value and a standard error estimate for every probe set. These vectors can be summarized at the chip level to obtain an index of quality for each chip.

The standard errors of estimated expression within a chip form a heterogeneous set because the value of σ varies from probe set to probe set. We can remove this source of heterogeneity by using unscaled standard errors to assess the precision of the estimated expressions. Removing the σ factor from the standard error does not affect the assessed relative precision of estimated expressions across chips, which is our main interest.

There still remains some heterogeneity in the unscaled standard errors across probe sets, because the effective number of probes used in estimating the expression for chip j may vary from probe set to probe set. To remove this source of heterogeneity, we can normalize the unscaled standard errors by dividing by the average or median standard error across a set of chips.

The bottom panel of Fig. 8 shows box plots of normalized unscaled standard errors (NUSE) of probe set summaries for each chip. In these, we see that the NUSE of probe set summaries are quite sensitive to deviations in assessed expression variability, with experiments A and P clearly standing out from the rest.

We can summarize the batch of NUSE values for each chip by the median, for example, and use this value as a chip data quality index. NUSE values fluctuate around 1.0, so a median value of 1.05 for a chip may be interpreted as a 5% average loss in precision.

The question that naturally arises is: what is a good range for this quality index? This is a difficult question and may not have a single answer, because it depends on the specific application and the various costs involved. For a specific application, one could judge at what level of quality including a chip in an analysis becomes disruptive by a "leave-one-out" comparison, for example. For a carefully performed analysis, such as one that combines expression measures robustly, the answer may be that including a small number of lower quality chips will be harmless in most cases. This does not mean that it is useless to detect departures from quality standards that have only small effects on downstream analysis. For example, in a large-scale production environment, a sensitive tool for monitoring quality may help detect and correct problems before they have an impact on expression measures and on critical results of downstream analysis.
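The normalization and the chip-level summary described above can be sketched in a few lines. This is a minimal illustration, not the affyPLM implementation: the function names `nuse` and `chip_quality_index` are hypothetical, the input is assumed to be a table of unscaled standard errors with one row per probe set and one column per chip, and the median across chips is used as the normalizing value.

```python
from statistics import median


def nuse(unscaled_se):
    """Normalize unscaled standard errors probe set by probe set.

    unscaled_se[g][j] is the unscaled standard error of probe set g on chip j.
    Each row is divided by its median across chips, so a typical chip gets
    values near 1.0 for every probe set.
    """
    normalized = []
    for row in unscaled_se:
        reference = median(row)  # median across chips for this probe set
        normalized.append([se / reference for se in row])
    return normalized


def chip_quality_index(nuse_values):
    """Median NUSE per chip (one value per column)."""
    n_chips = len(nuse_values[0])
    return [median(row[j] for row in nuse_values) for j in range(n_chips)]


# Toy data: three probe sets on three chips; the third chip is noisier.
example_se = [
    [0.10, 0.10, 0.12],
    [0.20, 0.22, 0.30],
    [0.05, 0.05, 0.06],
]
example_index = chip_quality_index(nuse(example_se))
```

In this toy example the third chip's index lands well above 1.0, which is the pattern the box plots are meant to reveal.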

Affymetrix recommends a number of quality checks to be performed after the analysis of the raw data by the Affymetrix MAS 5.0 software (Affymetrix, 2001). Some are qualitative and involve judging the overall quality of the chip image by visual inspection, whereas others are quantitative. Of the quantitative assessments, some involve examining the expression level of special-purpose probe sets: the hybridization controls, poly(A) controls, and housekeeping genes. Other quantitative assessments are based on a more comprehensive summary of expression and signal level on a chip. Figure 9 examines the relationship between the chip data quality index derived from the residuals (the median NUSE) and three of the quantitative quality assessment measures recommended by Affymetrix: Scaling Factor (target = 500), RawQ, and Percent Present calls.

Other recommended measures that summarize the probe intensities are highly correlated with these and do not provide much additional information.

Figure 9 demonstrates that the median NUSE is highly sensitive to departures from quality standards. In sets of chips varying over a wide range of quality levels, we find that the index of quality based on assessed variability of expression, the median NUSE, is highly correlated with some of the recommended quality assessment measures. We believe that assessed variability provides a better basis for making decisions to rerun an experiment or exclude a chip from an analysis set, whereas other measures are potentially more useful at identifying the source of a problem.

Analyzing expression levels of specialized spike-ins or housekeeping genes for quality assessment purposes poses a special challenge. Because there are only a few spike-in probe sets, measures derived from them tend to be noisy, requiring substantial departures from quality standards for a problem to be detectible. These measures may nonetheless be useful for tracking the source of departures from quality standards that are more easily detectible by other means.

### 2. Spatial Analysis of Residuals

Residuals can be imaged in a manner similar to the way probe cell intensities are typically imaged. It is common practice to assess chip quality by visually inspecting probe-intensity images. Artifacts like bright or dim spots, scratches, or uneven brightness can be identified this way. Because cell intensities within a chip vary over a wide range and most of this variation comes from the fixed part of the model (Eq. 1), the imaged residuals are expected to provide increased resolution for visually detecting image artifacts.

Spatial patterns of residuals can be profitably examined when seeking an explanation for elevated standard errors of expression estimates on a chip. Spatial patterns may provide evidence of SAPE residue caused by poor wash, uneven hybridization, bubbles, or other local artifacts. A uniform distribution of elevated residuals is another possibility, indicating a different kind of problem with the assay. Note that spatial patterns of residuals may sometimes detect artifacts that are not detectible at the level of gene expression variability. Such artifacts would probably not play a role in accepting or rejecting a chip for analysis but may be valuable in monitoring a chip production process.

Spatial patterns of residuals themselves have proven difficult to visualize. The challenge is to capture spatial patterns of a dense scatter of numbers having both sign and amplitude. Each of these features, sign and amplitude, is readily visualized separately. The weights used in the IRLS fit can be imaged to capture the magnitude of the residuals, highlighting residuals that deviate substantially from an overall estimated scale. The sign of the residuals can also be imaged, and such images add to the pseudo-images of the weights by telling us whether a region of outlying residuals corresponds to a bright or a dim region on the chip. In addition, the image of the sign of residuals will capture small effects that are not detectible in the weights, which are insensitive to small deviations of the residuals from their expected value of zero.
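As a rough sketch of how the two images could be built, the code below turns a list of residuals into weights and signs and places them on a grid. Several things here are assumptions rather than statements about the software used in the text: the weights are Huber weights with tuning constant 1.345, the scale is a median absolute deviation estimate, and the helper names `huber_weights`, `residual_signs`, and `to_grid` are hypothetical.

```python
from statistics import median


def huber_weights(residuals, k=1.345):
    """Huber-type weights: 1 for small residuals, k*scale/|r| for large ones.

    The scale is estimated by the median absolute residual divided by 0.6745
    (an assumption made for this sketch).
    """
    scale = median(abs(r) for r in residuals) / 0.6745
    if scale == 0.0:
        return [1.0] * len(residuals)
    cutoff = k * scale
    return [1.0 if abs(r) <= cutoff else cutoff / abs(r) for r in residuals]


def residual_signs(residuals):
    """Return +1, 0, or -1 for each residual."""
    return [(r > 0) - (r < 0) for r in residuals]


def to_grid(xs, ys, values, n_rows, n_cols, fill=0.0):
    """Place one value per probe cell at its (x, y) chip coordinate."""
    grid = [[fill] * n_cols for _ in range(n_rows)]
    for x, y, value in zip(xs, ys, values):
        grid[y][x] = value
    return grid


# Five probe cells on a tiny 2 x 3 "chip"; the last residual is an outlier.
residuals = [0.1, -0.1, 0.2, -0.2, 5.0]
xs = [0, 1, 2, 0, 1]
ys = [0, 0, 0, 1, 1]
weight_image = to_grid(xs, ys, huber_weights(residuals), 2, 3, fill=1.0)
sign_image = to_grid(xs, ys, residual_signs(residuals), 2, 3, fill=0)
```

The weight image flags only the outlying cell, while the sign image records direction for every cell, including the small residuals that leave the weights at 1.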

In Fig. 10 the log intensities (top row), probe weights (middle row), and the residuals (bottom row) are imaged for three chips: two with elevated assessed variability of expression, A and P, and one with average assessed variability of expression, H. Low probe weights, corresponding to residuals with high absolute values, appear as the intense green spots on the chip pseudo-image of the weights (middle row). Clusters of probes with high absolute residuals are clearly visible for chips A and P. The patterns are also discernible in the pseudo-images of log intensities, but not nearly as clearly. Clusters of positive residuals corresponding to bright areas on the chip are clearly visible in the images of the sign of the residuals (bottom row). Determining the source of variability accounting for specific patterns of high absolute residuals (local versus global, and possible trends) is an open question. Software for producing these images is available as part of the affyPLM library from www.bioconductor.org.

### 3. Quality Assessment Based on Relative Expression

The standard error estimates provide a measure of expression summary variability that is independent of expression level. We can also gauge variability of expression measures by summarizing the distribution of relative log expressions. To compute relative log expression values, we use a virtual median chip constructed by taking, for each probe set, the median log expression from a reference set of chips. We can summarize a vector of relative expression by a measure of bias: median(RE), a measure of variability: IQR(RE), or total error: IQR + |Bias|. These summaries are sensitive to technical sources of variability that are large compared to biological variation. This assessment will be highly correlated with an assessment based on estimated standard errors of probe set expressions, but it has the advantage of being derived from the expression estimates alone (as opposed to probe-level residuals).

Fig. 10. (Top row) Image of log probe intensities. (Middle row) Image of probe weights. Intense green areas denote high concentration of large absolute residuals. (Bottom row) Image of residuals. Red represents areas of highly positive residuals, and blue represents areas of highly negative residuals. (See Color Insert.)

Figure 11 shows box plots of relative log expressions for the 2353 series. We can readily see the elevated variability in chips A and P, as was assessed by the residual analysis. In addition, we note a downward bias in the expressions for chip P. As the chips being compared here were hybridized with a common source of RNA, the relative log expression should be zero for all non spike-in probe sets, and the differences in variability between chips can therefore be attributed to technical or processing variability. When comparing chips with different sources of RNA, the variability in relative log expression will be inflated by real biological variability. This is not seen as a serious handicap in the use of relative log expression to assess data quality, because the technical variability that we are interested in is typically greater than the biological variability.
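The relative log expression summaries can be sketched as follows. This is an illustration under stated assumptions: the input is a table of log expression values with one row per probe set and one column per chip, the reference set is taken to be all chips in the table, and the quartiles use the "inclusive" interpolation rule of Python's `statistics.quantiles`, which may differ slightly from the quartile rule behind any particular box plot. The function names are hypothetical.

```python
from statistics import median, quantiles


def relative_log_expression(log_expr):
    """Subtract the virtual median chip from every chip.

    log_expr[g][j] is the log expression of probe set g on chip j.
    """
    rle = []
    for row in log_expr:
        reference = median(row)  # median chip value for this probe set
        rle.append([value - reference for value in row])
    return rle


def chip_summaries(rle):
    """Bias (median), variability (IQR), and total error per chip."""
    summaries = []
    for j in range(len(rle[0])):
        column = [row[j] for row in rle]
        q1, _, q3 = quantiles(column, n=4, method="inclusive")
        bias = median(column)
        iqr = q3 - q1
        summaries.append({"bias": bias, "iqr": iqr, "total": iqr + abs(bias)})
    return summaries


# Toy data: four probe sets on three chips; the third chip reads low and noisy.
log_expr = [
    [8.0, 8.0, 7.0],
    [10.0, 10.0, 9.5],
    [6.0, 6.0, 6.0],
    [12.0, 12.0, 10.0],
]
summaries = chip_summaries(relative_log_expression(log_expr))
```

For the toy data the first two chips have zero bias and zero spread, while the third shows both a negative bias and a nonzero IQR, mirroring the kind of pattern described for chip P.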

### B. Quality Assessment for cDNA Microarray Experiments

The quality of the expression data derived from cDNA microarray experiments depends on experimental and production factors similar to those affecting oligonucleotide microarrays. The extraction of gene expression information from a scanned array requires a complicated image analysis process. This process is an additional source of potential variability. Yang et al. (2002a) discuss image analysis for spotted arrays in detail. As a by-product of the image analysis step, a number of spot characteristics are generated: spot size and shape, spot intensity, and background intensity. These can be used as quality indicators (Wang et al., 2001). When some clones are spotted at several locations on the array, the repeated measurements for clones can be combined to obtain some assessment of the reproducibility of the measurements, just as probes within probe sets are used to measure reproducibility with the oligonucleotide microarrays. Jenssen et al. (2002) and Tseng et al. (2001) discuss the use of multiply spotted clones in quality assessment. Ritchie et al. (2003) demonstrate that spot-quality measures are correlated to spot reproducibility for the multiply spotted clones and suggest that this relationship could be exploited to derive spot weights to be used in gene-wise regressions.

### VII. Detection of Absolute Gene Expression

The problem of classifying genes as present or absent in a given sample has been largely overlooked in the literature. The only widely used detection call for oligonucleotide microarrays is the one implemented in the MAS software developed by Affymetrix (2001). Although the detection of absolute expression is not generally regarded as being as important as that of differential expression, it has definite biological relevance in some circumstances. For example, a biologist studying gene expression in neural stem cells may want to know which genes go from being absent to being present at a particular time, and vice versa.

### A. The Affymetrix Presence/Absence Algorithm

The Affymetrix MAS 5.0 software makes a detection call for each probe set by defining a discrimination score

$$R_i = \frac{PM_i - MM_i}{PM_i + MM_i},$$

where $PM_i$ is the perfect match intensity of the $i$th probe in the probe set, and $MM_i$ is the corresponding mismatch intensity. This is done for the nonsaturated probe pairs. A one-sided Wilcoxon signed-rank test is then used to test

$$H_0: \operatorname{median}(R_i - \tau) = 0 \quad \text{versus} \quad H_1: \operatorname{median}(R_i - \tau) > 0,$$

where $H_0$ is the null hypothesis and $H_1$ is the alternative. Here $\tau$ is a small positive number, tunable by the user, and set to a default of 0.015. Affymetrix has determined this value as being one that minimizes the number of incorrect calls without sacrificing sensitivity.

The p value from the signed-rank test is used as a determinant of gene presence or absence. MAS 5.0 actually uses two user-configurable significance levels $\alpha_1$ and $\alpha_2$, such that $0 < \alpha_1 < \alpha_2 < 0.5$. Probe sets are called present if $p < \alpha_1$, absent if $p > \alpha_2$, and marginal (no call) if $\alpha_1 < p < \alpha_2$. The defaults in MAS 5.0 are $\alpha_1 = 0.04$ and $\alpha_2 = 0.06$. These are found to be optimal (based on analyses of spike-in data) for the default value of $\tau$. More details about the Affymetrix presence/absence methodology can be found in Liu et al. (2001) and Liu et al. (2002).
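The call rule can be illustrated with a small, self-contained sketch. It assumes the discrimination score is (PM − MM)/(PM + MM) for each probe pair, shifts the scores by the threshold (0.015 by default), and computes an exact one-sided signed-rank p value by counting rank subsets. Handling of saturated probe pairs is omitted, p values exactly equal to a cutoff are treated as marginal, and the function names are hypothetical; this is not the MAS 5.0 code.

```python
def signed_rank_p_value(diffs):
    """Exact one-sided p value P(W+ >= observed) for the signed-rank test.

    Zero differences are dropped and ties get midranks. Ranks are doubled so
    that midranks stay integers for the subset-sum count.
    """
    d = [x for x in diffs if x != 0.0]
    n = len(d)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(d[i]))
    doubled_rank = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            doubled_rank[order[k]] = i + j + 2  # twice the average 1-based rank
        i = j + 1
    observed = sum(r for r, x in zip(doubled_rank, d) if x > 0)
    total = sum(doubled_rank)
    # counts[s] = number of sign assignments whose positive ranks sum to s
    counts = [0] * (total + 1)
    counts[0] = 1
    for r in doubled_rank:
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    return sum(counts[observed:]) / 2 ** n


def detection_call(pm, mm, tau=0.015, alpha1=0.04, alpha2=0.06):
    """Return ('present' | 'marginal' | 'absent', p value) for one probe set."""
    scores = [(p - m) / (p + m) for p, m in zip(pm, mm)]
    p_value = signed_rank_p_value([r - tau for r in scores])
    if p_value < alpha1:
        return "present", p_value
    if p_value > alpha2:
        return "absent", p_value
    return "marginal", p_value


# Eleven probe pairs whose PM signal is well above MM.
pm = [1000.0 + 100.0 * i for i in range(11)]
mm = [100.0] * 11
call, p_value = detection_call(pm, mm)
```

With every shifted score positive, only one of the 2^11 sign assignments is as extreme as the observed one, so the p value is 1/2048 and the probe set is called present.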

### B. Alternative Methods

Zhou and Abagyan (2002) have developed an algorithm to calculate expression summaries that use only the PM intensities. As a side effect of their procedure, they perform a detection call. The 5% lowest intensity probe sets are designated as background, and the empirical cumulative distribution of the background intensities on the linear scale, $B(I)$, is then calculated. For each probe set $k$, they calculate the empirical cumulative distribution of the probe signals, $S_k(I)$, and compare this to $B(I)$. The authors' claim is that genes that are absent will tend to have integral distributions that are close to the background distribution. They therefore compare each $S_k$ to $B$ using a Kolmogorov-Smirnov test. There is no recommendation for an appropriate threshold on the p value from the K-S test for calling presence or absence. Instead, the authors state that "those signal sets that can be easily explained by noise are assigned a log10 p value closer to zero."
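The comparison of a probe set's signal distribution with the background distribution can be sketched with a two-sample Kolmogorov-Smirnov statistic. The code below is an illustration only, not the MOID implementation: the function names are hypothetical, and the p value uses the classical asymptotic series, which is only a rough approximation for probe sets with a dozen or so probes.

```python
import math
from bisect import bisect_right


def ks_statistic(signal, background):
    """Largest gap between the two empirical cumulative distributions."""
    a, b = sorted(signal), sorted(background)
    largest = 0.0
    for x in a + b:
        f_signal = bisect_right(a, x) / len(a)
        f_background = bisect_right(b, x) / len(b)
        largest = max(largest, abs(f_signal - f_background))
    return largest


def ks_asymptotic_p(d, n, m, terms=100):
    """Asymptotic two-sided p value for a two-sample K-S statistic d."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.3:
        # The series converges slowly here and the p value is essentially 1.
        return 1.0
    series = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * series))


# A probe set whose intensities sit entirely above the background sample.
background = [float(v) for v in range(20)]
signal = [float(v) for v in range(100, 120)]
d = ks_statistic(signal, background)
p = ks_asymptotic_p(d, len(signal), len(background))
```

A probe set that is well explained by the background gives a p value near 1 (log10 p near zero), whereas a clearly expressed probe set gives a very small p value.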

Rubinstein and Speed (2003) have approached the problem of transcript detection using several novel methods. They define three broad classes of detection algorithms: thresholding rank sums of probe-pair summaries, thresholding robust averages of probe-pair summaries, and thresholding expression-level estimates. Possible probe-pair summaries include $\log(PM_i/MM_i)$ and $(PM_i - MM_i)/(PM_i + MM_i)$. The latter is the summary used by the MAS software. The PM and MM values may or may not be background corrected and normalized across chips. The authors have developed a framework for evaluating different detection algorithms, using the ROC (Receiver Operating Characteristic) Convex Hull method. Under this scheme, the cost of misclassification is defined as follows:

$$\text{cost} = P(p)\,(1 - \mathrm{TPR})\,C(N, p) + P(n)\,\mathrm{FPR}\,C(P, n),$$

where $P(p)$ is the prior probability of an example being positive and $P(n)$ of being negative, TPR and FPR are the true and false positive rates, and $C(N, p)$ and $C(P, n)$ are the costs of false negatives and false positives, respectively. Requiring that this cost be minimized allows one to evaluate the optimality of detection algorithms over a particular range of false-negative and false-positive costs, given a set of ROC curves for those algorithms. Rubinstein and Speed find that their normalized robust average of log ratios (NRALR) and normalized rank sum of log ratios (NRSLR) outperform the MAS 5.0 algorithm for a wide range of costs, while thresholding on either the expression level derived from RMA (Irizarry et al., 2003a) or the MAS 5.0 signal estimate does not perform as well. ROC curves for these five representative algorithms are shown in Figure 12.

Fig. 12. Receiver Operating Characteristic (ROC) curves for some of the detection algorithms discussed in the text. NRALR is the normalized robust average of log ratios, the NRSLR is the normalized rank sum of log ratios, and the RMA is the expression-level estimate obtained using RMA. The gray polygon is the convex hull of the ROC curves and represents the best possible classifier.
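To make the cost-based comparison concrete, the sketch below computes the upper convex hull of a set of ROC operating points and picks the point with the lowest expected cost. It assumes the usual expected-cost form, prior of positives times miss rate times false-negative cost plus prior of negatives times false positive rate times false-positive cost. The argument names `cost_fn` and `cost_fp` and all function names are hypothetical, and the operating points are invented for illustration.

```python
def expected_cost(fpr, tpr, prior_pos, cost_fn, cost_fp):
    """Expected misclassification cost at one ROC operating point."""
    return prior_pos * (1.0 - tpr) * cost_fn + (1.0 - prior_pos) * fpr * cost_fp


def _cross(o, a, b):
    """Cross product of vectors o->a and o->b; negative means a right turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def roc_convex_hull(points):
    """Upper convex hull of (FPR, TPR) points, including (0, 0) and (1, 1)."""
    candidates = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for point in candidates:
        # Drop earlier points that fall on or below the line to the new point.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], point) >= 0:
            hull.pop()
        hull.append(point)
    return hull


def cheapest_point(points, prior_pos, cost_fn, cost_fp):
    """Operating point with the smallest expected cost."""
    return min(
        points,
        key=lambda pt: expected_cost(pt[0], pt[1], prior_pos, cost_fn, cost_fp),
    )


# Invented operating points (FPR, TPR) from three hypothetical detectors.
points = [(0.1, 0.6), (0.3, 0.9), (0.5, 0.5)]
hull = roc_convex_hull(points)
best_equal_costs = cheapest_point(hull, prior_pos=0.5, cost_fn=1.0, cost_fp=1.0)
best_costly_fp = cheapest_point(hull, prior_pos=0.5, cost_fn=1.0, cost_fp=3.0)
```

Only points on the hull can be optimal for some combination of priors and costs, which is why a detector lying below the hull, such as the third invented point, is never selected; changing the cost ratio moves the optimum along the hull.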

References

Affymetrix (2001). Affymetrix Microarray Suite Users Guide, Version 5. Santa Clara, Calif.

Astrand, M. (2003). Contrast normalization of oligonucleotide arrays. J. Comput. Biol. 10, 95-102.

Bolstad, B. M., Irizarry, R. A., Astrand, M., and Speed, T. P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185-193.

Box, G. E. P., Hunter, W. G., and Hunter, J. S. (1978). Statistics for experimenters: An introduction to design, data analysis, and model building. Wiley, New York.

Chu, T. M., Weir, B., and Wolfinger, R. (2002). A systematic statistical linear modeling approach to oligonucleotide array experiments. Mathematical Biosci. 176, 35-51.

Cleveland, W. S., and Devlin, S.J. (1988). Locally-weighted regression: An approach to regression analysis by local fitting. J. Am. Statistical Assoc. 83, 596-610.

Cobb, G. W. (1998). Introduction to design and analysis of experiments. Springer, New York.

Cox, D. R. (1992). Planning of experiments. Wiley, New York.

Dudoit, S., Yang, Y. H., Speed, T. P., and Callow, M. J. (2002a). Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 12, 111-139.

Dudoit, S., Yang, Y. H., and Bolstad, B. M. (2002b). Using R for the analysis of DNA microarray data. R News 2, 24-32.

Gautier, L., Cope, L. M., Bolstad, B. M., and Irizarry, R. A. (2003). Analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 20(3), 307-315.

Ge, Y., Dudoit, S., and Speed, T. P. (2003). Resampling-based multiple testing for microarray data analysis [with Discussion]. Test 12, 1-77.

Gene Logic (2001). Dilution series data available at: www.genelogic.com/media/studies/dilution.cfm.

Holder, D., Pikounis, V., Raubertas, R., Svetnik, V., and Soper, K. (2002). Statistical analysis of high density oligonucleotide arrays: A SAFER approach: Proceedings of the American Statistical Association. Atlanta, Georgia.

Holland, P. W., and Welsch, R. E. (1977). Robust regression using iteratively reweighted least-squares. Comm. Stat. Theory Methods A6(9), 813-827.

Huber, P.J. (1972). Robust statistics: A review. Ann. Mathematical Stat. 43, 1041-1067.

Huber, P.J. (1981). Robust statistics. John Wiley & Sons, New York.

Irizarry, R. A., Bolstad, B. M., Collin, F., Cope, L. M., Hobbs, B, and Speed, T. P. (2003a). Summaries of Affymetrix GeneChip probe level data. Nucl. Acids Res. 31, e15.

Irizarry, R. A., Hobbs, B., Collin, F., Beazer-Barclay, Y. D., Antonellis, K. J., Scherf, U., and Speed, T. P. (2003b). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249-264.

Jenssen, T., Langaas, M., Kuo, W. P., Smith-Sorensen, B., Myklebost, O., and Hovig, E. (2002). Analysis of repeatability in spotted cDNA microarrays. Nucl. Acids Res. 30, 3235-3244.

Lange, K. L., Little, R.J. A., and Taylor, J. M. G. (1989). Robust statistical modeling using the t distribution. J. Am. Stat. Assoc. 84, 881-896.

Li, C., and Wong, W. H. (2001a). Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2, 0032.1-0032.11.

Li, C., and Wong, W. H. (2001b). Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. Proc. Natl. Acad. Sci. USA 98, 31-36.

Liu, W. M., Mei, R., Bartell, D. M., Di, X., Webster, T. A., and Ryder, T. (2001). Rank-based algorithms for analysis of microarrays. Proc. SPIE 4266, 56-67.

Liu, W. M., Mei, R., Di, X., Ryder, T. B., Hubbell, E., Dee, S., Webster, T. A., Harrington, C. A., Ho, M. H., Baid, J., and Smeekens, S. P. (2002). Analysis of high density expression microarrays with signed-rank call algorithms. Bioinformatics 18, 1593-1599.

Lonnstedt, I., and Speed, T. P. (2002). Replicated microarray data. Statistica Sinica 12, 31-46.

Montgomery, D. C. (2000). Design and analysis of experiments, 5th ed. Wiley, New York.

Mootha, V. K., Lindgren, C. M., Eriksson, K. F., Subramanian, A., Sihag, S., Lehar, J., Puigserver, P., Carlsson, E., Ridderstrale, M., Laurila, E., Houstis, N., Daly, M. J., Patterson, N., Mesirov, J. P., Golub, T. R., Tamayo, P., Spiegelman, B., Lander, E. S., Hirschhorn, J. N., Altshuler, D., and Groop, L. C. (2003). PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet. 34, 267-273.

Pearson, E. S., and Hartley, H. O. (1962). Biometrika Tables for Statisticians. Cambridge University Press, Cambridge.

Ritchie, M., Smyth, G. K., Diyagama, D., van Laar, R., Holloway, A., and Speed, T. P. (2003). Quality measures for cDNA microarray experiments. Royal Statistical Society poster presentation.

Rubinstein, B. I. P., and Speed, T. P. (2005). Detecting gene expression with oligonucleotide microarrays. (Unpublished manuscript).

Schadt, E. E., Li, C., Ellis, B., and Wong, W. H. (2001). Feature extraction and normalization algorithms for high-density oligonucleotide gene expression array data. J. Cell. Biochem. Suppl. 37, 120-125.

Speed, T., Ed. (2003). Statistical analysis of gene expression microarray data. Boca Raton, FL. Chapman and Hall CRC Press.

Tseng, G. C., Oh, M., Rohlin, L., Liao, J. C., and Wong, W. H. (2001). Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variations and assessment of gene effects. Nucl. Acids Res. 29, 2549-2557.

Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley, Reading, Mass.

Wang, X., Soumitra, G., and Guo, S. (2001). Quantitative quality control in microarray image processing and data acquisition. Nucl. Acids Res. 29, 2549-2557.

Yang, Y. H., Buckley, M. J., Dudoit, S., and Speed, T. P. (2002a). Comparison of methods for image analysis on cDNA microarray data. J. Comput. Graph. Stat. 11, 108-136.

Yang, Y. H., Dudoit, S., Luu, P., Lin, D. M., Peng, V., Ngai, J., and Speed, T. P. (2002b). Normalization for cDNA microarray data: A robust composite method addressing single and multiple slide systematic variation. Nucl. Acids Res. 30, e1.

Yang, Y. H., and Speed, T. P. (2002). Design issues for cDNA microarray experiments. Nat. Rev. Genet. 3, 579-588.

Zhou, Y., and Abagyan, R. (2002). Match-only Integral Distribution (MOID) algorithm for high-density oligonucleotide array analysis. BMC Bioinform. 3, 3.
