$$c_i(A) = w(A)\,g(A) + \{1 - w(A)\}\,g_i(A)$$

where $g(A)$ is a loess curve fitted to spots from the MSP series, $g_i(A)$ is a loess curve for print-tip group $i$, and $w(A)$ is usually defined as the proportion of spots with intensity less than $A$. The adjustment is then done as before. The idea is to rely increasingly on the MSP curve at higher intensities, where there are fewer spots and the print-tip-specific curves may be less reliable.
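The weighting can be illustrated with a minimal Python sketch. It assumes the composite correction is the weighted average of the MSP curve and the print-tip curve, with weight w(A) on the MSP curve, as described above. The function names (`make_weight`, `composite_correction`) and the constant curves standing in for fitted loess curves are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left


def make_weight(all_intensities):
    """Return w(A): the proportion of spots with intensity strictly less than A."""
    ordered = sorted(all_intensities)

    def w(a):
        return bisect_left(ordered, a) / len(ordered)

    return w


def composite_correction(a, w, g_msp, g_tip):
    """Weighted average of the MSP curve and a print-tip curve at intensity a."""
    weight = w(a)
    return weight * g_msp(a) + (1 - weight) * g_tip(a)


# Hypothetical inputs: four spot intensities and constant stand-ins for loess fits.
w = make_weight([6.0, 8.0, 10.0, 12.0])
g_msp = lambda a: 0.2   # stand-in for the MSP loess curve
g_tip = lambda a: 0.6   # stand-in for the loess curve of one print-tip group

low = composite_correction(6.0, w, g_msp, g_tip)    # w = 0: print-tip curve only
mid = composite_correction(10.0, w, g_msp, g_tip)   # w = 0.5: equal mix
high = composite_correction(13.0, w, g_msp, g_tip)  # w = 1: MSP curve only
```

The normalized log-ratio for a spot would then be its M value minus the correction evaluated at that spot's A value.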

We have discussed normalization within slides, but sometimes there are large differences in scale when comparing data between slides. The advised procedure is to first normalize within slides using the methods previously discussed, and then consider scaling of M between slides, as described previously. This adjustment is needed so that the relative expression levels from one slide do not dominate the expression levels from others when averaging across replicate slides. It should be noted that there is a tradeoff between the gains achieved by scale normalization and any variability that may be introduced. Often this normalization will not be required.

Software implementing these normalization methods for cDNA data may be found in the SMA package (Dudoit et al., 2002b) and downloaded from CRAN (http://cran.r-project.org/).

### B. Normalization for Affymetrix Arrays

There are two main approaches to normalization of Affymetrix GeneChip data. A recent paper (Bolstad et al., 2003) categorizes these into methods that use a baseline array and methods that are complete data methods. A complete data method does not use a baseline array; instead it uses data from all the chips to form the normalization.

Examining box plots of raw probe intensities by array can often show the need for normalization. Such a plot is shown in Fig. 4A, for five arrays from part of a dilution series dataset (Gene Logic, 2001). The only difference between the arrays is the scanner that was used, yet the box plot shows quite different levels of expression for each array.

A number of normalization methods have been proposed. The simplest approach, scaling, is to scale each array so that all arrays in a dataset have the same mean intensity. Trimmed means are often used instead of means, and this is the method used by Affymetrix in the MAS 5.0 software (Affymetrix, 2001). If $X_i$ is the (trimmed) mean intensity for array $i$ and $K$ is the target mean intensity, then array $i$ is normalized by multiplying its intensities by $K/X_i$. The target intensity is often chosen to be the mean of one of the arrays. We would, thus, classify this method as a baseline method. Figure 4B shows the five arrays after scaling normalization. The scaling approach can be applied in a time-efficient manner, but it does not adequately deal with possible nonlinear trends between arrays, as shown in Fig. 5. The Affymetrix HG-U133A chip has 100 normalization control probe sets that may be used for normalization in this context. These probe sets have been chosen because of their stability of expression across a wide range of tissues.
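The scaling rule above can be sketched in a few lines of Python. This is a minimal illustration, not the MAS 5.0 implementation: the function names, the default trim fraction, and the toy intensities are assumptions made for the example, and the target is taken to be the trimmed mean of a chosen baseline array.

```python
from statistics import mean


def trimmed_mean(values, trim=0.02):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)  # number of values dropped at each end
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return mean(kept)


def scale_normalize(arrays, baseline=0, trim=0.02):
    """Multiply each array by K / X_i, where K is the baseline's trimmed mean."""
    target = trimmed_mean(arrays[baseline], trim)  # K
    normalized = []
    for intensities in arrays:
        factor = target / trimmed_mean(intensities, trim)  # K / X_i
        normalized.append([x * factor for x in intensities])
    return normalized


# Toy data (hypothetical): the second array is uniformly 1.5 times brighter.
arrays = [
    [100.0, 200.0, 300.0, 400.0],
    [150.0, 300.0, 450.0, 600.0],
]
scaled = scale_normalize(arrays, baseline=0, trim=0.0)
```

Because every intensity on an array is multiplied by the same constant, the correction is linear, which is why it cannot remove intensity-dependent (nonlinear) differences between arrays.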

Another approach is to choose a baseline array, then fit nonlinear relationships between the baseline array and each of the other arrays (in this context, we call these the treatment arrays). Such an approach fitting splines was suggested by Schadt et al. (2001) and used with a running median line (Li and Wong, 2001a,b). A rank invariant set of probes is chosen between the baseline and the treatment array. These probes are then used to fit the nonlinear relation. The curve is then used to map from the treatment array to the baseline array and defines the normalization.

Several complete data adaptations of the MA-plot loess method for cDNA arrays have been proposed for normalizing Affymetrix arrays. The first is the cyclic loess method, in which arrays are normalized against each other in a pairwise fashion using a loess fit to an MA-plot. Unfortunately, for $N$ arrays this requires $O(N^2)$ MA-plot normalizations, so it is quite time consuming.
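The pairwise structure, and the reason the cost grows quadratically, can be seen in the following simplified Python sketch. Two assumptions are made loudly here: the smoother is a constant median line used as a stand-in for a real loess fit, and adjustments are applied to each pair immediately rather than combined across pairs. All function names are hypothetical.

```python
from itertools import combinations
from math import log2
from statistics import median


def ma_values(x, y):
    """M = log2(x) - log2(y) and A = mean of the two log2 intensities, per probe."""
    m = [log2(p) - log2(q) for p, q in zip(x, y)]
    a = [(log2(p) + log2(q)) / 2 for p, q in zip(x, y)]
    return m, a


def median_smoother(a_vals, m_vals):
    """Stand-in for loess (assumption): a flat curve at the median of M."""
    centre = median(m_vals)
    return [centre] * len(m_vals)


def cyclic_pass(arrays, smoother=median_smoother):
    """One pass over all distinct array pairs; returns adjusted arrays and pair count."""
    arrays = [list(arr) for arr in arrays]  # work on copies
    pairs = list(combinations(range(len(arrays)), 2))  # N(N-1)/2 pairs
    for i, j in pairs:
        m, a = ma_values(arrays[i], arrays[j])
        fitted = smoother(a, m)
        for k, (m_k, a_k, f_k) in enumerate(zip(m, a, fitted)):
            m_adj = m_k - f_k  # remove the fitted trend from M
            arrays[i][k] = 2 ** (a_k + m_adj / 2)
            arrays[j][k] = 2 ** (a_k - m_adj / 2)
    return arrays, len(pairs)


# Two toy arrays that differ by a constant factor of 2.
pair, _ = cyclic_pass([[2.0, 4.0, 8.0], [4.0, 8.0, 16.0]])

# Five toy arrays give 5 * 4 / 2 = 10 pairwise fits per pass.
five_arrays = [[2.0 ** (k + i) for k in range(1, 4)] for i in range(5)]
_, n_pairs = cyclic_pass(five_arrays)
```

In practice each pairwise fit would be a loess curve over many thousands of probes, and the whole pass may be repeated, which is what makes the method slow.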

A second adaptation of the MA-plot loess method is to transform the data using an orthonormal basis to give a set of contrasts (Åstrand, 2003). The normalization is applied to the transformed data. The data are then transformed back to the original basis. This method requires only $O(N)$ MA-plot normalizations and is, therefore, faster than the cyclic loess method. However, loess normalizations are slow for probe-intensity data. Typical implementations use only a subset of the probes to improve the processing time.

Another complete data method is the quantile normalization method, in which the goal is to normalize arrays so that each array has a common intensity distribution. This method uses a simple non-parametric algorithm to quickly normalize a batch of arrays. In particular, averaging the quantiles of all the arrays in the set forms the reference distribution. Each array is then assigned the reference intensity distribution. The quantile normalization method is a specific case of the transformation $x_i' = F^{-1}(G(x_i))$, where we estimate $G$ by the empirical distribution of each array and $F$ using the empirical distribution of the averaged sample quantiles. Extensions of the method could be implemented where $F^{-1}$ and $G$ are more smoothly estimated. However, we have found the current method to perform satisfactorily in practice.
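The algorithm just described (sort each array, average across arrays at each rank, then give each probe the reference value for its rank) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: arrays have equal length, ties are broken by position rather than averaged, and the function name and toy values are hypothetical.

```python
def quantile_normalize(arrays):
    """Give every array the same empirical intensity distribution."""
    n_probes = len(arrays[0])
    if any(len(arr) != n_probes for arr in arrays):
        raise ValueError("all arrays must have the same number of probes")

    # Reference distribution: mean across arrays of the sorted intensities.
    sorted_arrays = [sorted(arr) for arr in arrays]
    reference = [sum(column) / len(arrays) for column in zip(*sorted_arrays)]

    normalized = []
    for arr in arrays:
        # Probe indices ordered from lowest to highest intensity on this array.
        order = sorted(range(n_probes), key=lambda j: arr[j])
        out = [0.0] * n_probes
        for rank, j in enumerate(order):
            out[j] = reference[rank]  # probe j receives the value for its rank
        normalized.append(out)
    return normalized


# Toy data (hypothetical): two arrays with three probes each.
arrays = [
    [2.0, 4.0, 6.0],
    [9.0, 3.0, 6.0],
]
normalized = quantile_normalize(arrays)
# normalized == [[2.5, 5.0, 7.5], [7.5, 2.5, 5.0]]
```

After normalization every array contains exactly the same set of values, only in a different probe order, so box plots of the arrays coincide. The cost is dominated by sorting each array once, which is why the method is fast.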

Figure 4C shows the arrays after quantile normalization. Figure 5 demonstrates that the quantile normalization deals adequately with nonlinear relationships in the data. In practice, this normalization can be carried out in a very time-efficient manner.

These methods were compared in a recent paper (Bolstad et al., 2003), where it was demonstrated that the scaling method was least effective at reducing variability. Figure 6 illustrates this result using the RMA expression measure. These graphs show the ratios of the variance of the expression measure across five arrays plotted against mean expression for two different normalization methods. We see that scaling and quantile normalization both reduce the variability, with the greater reduction achieved by the quantile normalization method. The complete data methods were favored in Bolstad et al. (2003) because baseline methods can introduce peculiarities of the baseline array into the data for the treatment arrays. It was found that the quantile method was the fastest, with acceptable reductions in variance and little change in bias. Software implementing these normalization methods may be found in the affy package (Gautier et al., 2003), which is part of the Bioconductor project (see www.bioconductor.org).
