A basic requirement of a gene expression microarray experiment is that the measured intensity of each spot can be interpreted as reflecting the number of corresponding mRNA molecules in the sample under consideration. However, it is well known that raw intensity measurements are strongly influenced by a number of external factors, for example effects due to pins (1-3), PCR plates, sample preparation, array coating, spotting, labeling efficiencies, nonlinearity of dye-labeling, scanning, and so on (for an overview see 3, 4). Raw intensity measurements of microarray spots therefore typically do not directly reflect the respective mRNA levels.

To achieve a biologically meaningful interpretation of the experiment, these influences have to be described statistically. It is imperative not only to bring the samples that are compared in the course of the analysis to a common scale (scaling methods), but above all to remove, as thoroughly as possible, effects that are not meant to be part of the biological interpretation. Normalization should correct those effects that are due to variations in the experimental procedure. Furthermore, both the dynamic range of the data and the distribution of intensities may differ between arrays within one series of experiments. To recapitulate, the goal of applying any normalization method is to adjust for influences other than those due to biological differences between the RNA samples.

Over the last few years, a number of so-called normalization methods have been published that aim to remove these effects and to yield a dataset suitable for further statistical analysis (for reviews see 3-5).

The basic question is: which mathematical description explains the underlying data best, or, put differently, what kind of data description properly captures the biological nature of a microarray experiment? While it is impossible to rule out all influencing factors and to describe the underlying biology exactly, it is nevertheless crucial to find out whether a data description of higher complexity is more appropriate in biological terms.

We will review frequently used normalization methods and demonstrate their application to biological datasets.

Overall, the normalization methods published during the last few years can be divided into two groups: procedures based on the assumption that the majority of genes detected by the array change in expression, and procedures based on the assumption that the majority of genes remain unchanged in the experiment. In this article, we focus on the second group, namely experiments in which the majority of genes remain unchanged. Normalization methods of the second group can in turn be divided into (i) scaling or standardization methods and (ii) normalization methods that use a normalizing transformation of the data (see Figure 17.1). Note that scaling methods can in fact only correct for globally multiplicative effects by appropriate scaling of the data. Nevertheless, they are often called normalization methods as well.

[Figure 17.1: tree diagram. Normalization strategies under the assumption "Most genes remain unchanged" are split into scaling methods (mean, median, shorth, Z-score) and transformation-based methods (quantile normalization, ANOVA, variance stabilization, local regression (lowess, loess), QSpline).]

Figure 17.1. Overview of normalization strategies used for microarray data analysis.
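The point that scaling methods can only correct for globally multiplicative effects can be illustrated with a small simulation. The following is a minimal sketch, not a method taken from the cited literature: the function name `median_scale`, the simulated log-normal intensities, the factor 2.5, and the exponent 1.2 are all assumptions chosen purely for illustration.

```python
import numpy as np


def median_scale(intensities, target=1.0):
    """Rescale one array so that its median intensity equals `target`.

    Hypothetical helper: a single global factor is applied to all spots.
    """
    intensities = np.asarray(intensities, dtype=float)
    return intensities * (target / np.median(intensities))


# Simulated "true" spot intensities (assumed log-normal, for illustration only).
rng = np.random.default_rng(0)
true_signal = rng.lognormal(mean=7.0, sigma=1.0, size=1000)

array_a = true_signal                # reference array
array_b = 2.5 * true_signal          # globally multiplicative effect
array_c = 2.5 * true_signal ** 1.2   # intensity-dependent (nonlinear) effect

scaled_a = median_scale(array_a)
scaled_b = median_scale(array_b)
scaled_c = median_scale(array_c)

# A global factor is removed completely by scaling ...
multiplicative_removed = np.allclose(scaled_a, scaled_b)
# ... whereas an intensity-dependent distortion is not.
nonlinear_removed = np.allclose(scaled_a, scaled_c)

print("multiplicative effect removed:", multiplicative_removed)
print("nonlinear effect removed:", nonlinear_removed)
```

In this sketch the median is used as the scaling statistic; the mean or the shorth could be substituted in the same way. The remaining discrepancy for the third array is the kind of effect that the transformation-based methods are intended to address.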

In this article we outline mathematical procedures to describe and remove various kinds of effects in microarray data. Some of these variations are systematic, for example pin effects, and in many cases can be estimated from the measured data. Others are random effects, and appropriate error models for these will be discussed. In the following sections we first introduce the experimental data we are using. We then explain examples of scaling methods and discuss the problem that these methods can only correct for globally multiplicative errors. Subsequently, we describe some of the most frequently applied normalization methods that are based on data transformation. We demonstrate the application of the presented normalization methods using two published biological microarray datasets. Note that we describe the normalization methods for cDNA array technology; however, they can also be applied to Affymetrix datasets with only minor changes.
