## Data and standard statistics

For simplicity, we use a BAC array data set to illustrate our methods (13). Our methods also apply to more complicated designs with dyes and arrays as factors (11). We analyze one data set presented by Snijders et al. (13), GM01524, which can be downloaded from the website http://genetics.nature.com/supplementary_info/. The data result from an experiment aimed at measuring copy number changes for the cell strain GM01524 (test sample) against a normal male reference DNA (reference). The two samples were co-hybridized on a CGH array containing 2460 BAC and P1 clones in triplicate (7380 spots), with an average resolution of ~1.4 Mb (13). For ease of exposition, we focus only on chromosome 6.

Array CGH data often have systematic biases as do cDNA microarray data (11). Therefore, the first step in analysis is to remove these biases using a normalization procedure such as lowess (see also Chapter 17). Details can be found in Yang et al. (14) or Wang and Guo (11).

Our methods apply to each chromosome separately to detect copy number changes at clones on the chromosome. For simplicity, we assume, in the following discussion, that all clones to be considered are on the same chromosome. After normalization, let $y_{ijk}$ be the $k$th replication of the logarithm of the dye intensity of clone $i$ of sample $j$, where $i = 1, \ldots, I$ represents observed clones in a chromosome, $j = 1, 2$ represents the two samples (test and reference), $k = 1, \ldots, n_j$, and $n_j$ represents the number of replications of sample $j$.

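Assume that $y_{ijk} \sim N(\mu_{ij}, \sigma_{ij}^2)$. Then the question of whether there are any significant copy number differences between the two samples at clone $i$ can be formalized by the hypothesis $H_0: \mu_{i1} = \mu_{i2}$ vs $H_1: \mu_{i1} \neq \mu_{i2}$. Let

$$\bar{y}_{ij\cdot} = \sum_{k=1}^{n_j} y_{ijk}/n_j, \qquad s_{ij}^2 = \sum_{k=1}^{n_j} (y_{ijk} - \bar{y}_{ij\cdot})^2/(n_j - 1), \qquad i = 1, \ldots, I;\ j = 1, 2.$$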
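As a minimal sketch of these per-clone summaries, the following Python snippet computes the sample mean and the unbiased sample variance (denominator $n_j - 1$) for one clone. The intensity values and the names `y_test`, `y_ref` and `summarize` are hypothetical and chosen only for illustration; they are not taken from the GM01524 data.

```python
from statistics import mean, variance

# Hypothetical normalized log intensities for a single clone i,
# three replicate spots per sample (n_1 = n_2 = 3).
y_test = [9.8, 10.1, 10.0]  # sample j = 1 (test)
y_ref = [9.2, 9.3, 9.1]     # sample j = 2 (reference)


def summarize(y):
    """Return (sample mean, sample variance with n - 1 in the denominator)."""
    return mean(y), variance(y)


ybar_1, s2_1 = summarize(y_test)
ybar_2, s2_2 = summarize(y_ref)

print(f"test:      mean = {ybar_1:.4f}, variance = {s2_1:.4f}")
print(f"reference: mean = {ybar_2:.4f}, variance = {s2_2:.4f}")
```

Note that `statistics.variance` requires at least two observations, which matches the requirement $n_j \geq 2$ implicit in the definition of $s_{ij}^2$.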

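The standard z-statistic is

$$z_i = \bar{y}_{i1\cdot} - \bar{y}_{i2\cdot} \qquad (21.1)$$

and the standard t-statistic is

$$t_i = (\bar{y}_{i1\cdot} - \bar{y}_{i2\cdot})\big/\sqrt{s_{i1}^2/n_1 + s_{i2}^2/n_2}. \qquad (21.2)$$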
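The two statistics can be sketched in a few lines of Python. This is an illustration only: it assumes that the z-statistic is the difference of the two sample means (the numerator of the t-statistic), and the function name `z_and_t` and the input values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def z_and_t(y1, y2):
    """Return (z, t) for one clone from replicate log intensities.

    Assumption: z is the difference of sample means; t divides z by the
    unpooled standard error sqrt(s1^2/n1 + s2^2/n2).
    """
    n1, n2 = len(y1), len(y2)
    z = mean(y1) - mean(y2)
    se = sqrt(variance(y1) / n1 + variance(y2) / n2)
    if se == 0.0:
        raise ValueError("both sample variances are zero; t is undefined")
    return z, z / se


# Hypothetical replicate values for one clone (test vs reference).
z, t = z_and_t([9.8, 10.1, 10.0], [9.2, 9.3, 9.1])
print(f"z = {z:.4f}, t = {t:.4f}")
```

The explicit check on the standard error matters in practice: with only three replicates per sample, identical replicate values (and hence zero estimated variance) are not impossible.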

Clone $i$ is declared to have a significantly different intensity ratio, and thus copy number, between the two samples when the absolute value of the z-statistic or of the t-statistic is large. It should be noted that the z-test ignores the heterogeneous nature of variances associated with intensity ratios. The standard t-test accounts for the variation in the z-statistic and is easy to use since $t_i$ approximately follows a Student t distribution with degrees of freedom

$$(s_{i1}^2/n_1 + s_{i2}^2/n_2)^2 \big/ \left( (s_{i1}^2/n_1)^2/(n_1 - 1) + (s_{i2}^2/n_2)^2/(n_2 - 1) \right).$$

However, it has two fundamental problems: (i) the repetition numbers $n_1$ and $n_2$ are usually small (e.g. $n_1 = n_2 = 3$ in the example) because repetitive printing of the same clone on the slide limits the total number of clones that can be printed on the slide. Even if multiple slides are used, the amount of DNA extracted from the test sample is often limited and thus only a few slides can be used for hybridization. Estimates of the variances $s_{i1}^2$ and $s_{i2}^2$ are unreliable when sample sizes are small (11, 15); (ii) spatial correlations between neighboring clones are ignored, rendering the methods less efficient. Our methods aim to overcome these two problems by pooling information in neighboring clones to yield more stable estimates of the variances and to detect clones with copy number changes.
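To make the small-sample issue concrete, the sketch below evaluates the approximate degrees of freedom of the t-statistic using the usual Welch-Satterthwaite form, in which the numerator is squared. The function name `welch_df` and the variance values are hypothetical.

```python
def welch_df(s2_1, n1, s2_2, n2):
    """Approximate degrees of freedom (Welch-Satterthwaite form).

    s2_1, s2_2: sample variances of the two samples for one clone.
    n1, n2: numbers of replications (each must be at least 2).
    """
    a = s2_1 / n1
    b = s2_2 / n2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))


# With n1 = n2 = 3, the result always lies between 2 and 4.
print(welch_df(1.0, 3, 1.0, 3))     # equal variances: the upper end of the range
print(welch_df(1.0, 3, 1e-6, 3))    # one variance dominates: close to the lower end
print(welch_df(0.0233, 3, 0.01, 3)) # hypothetical variances for one clone
```

With three replicates per sample the reference distribution therefore has at most 4 degrees of freedom, so its tails are heavy and the test has little power, which is one way of seeing why the variance estimates need to be stabilized.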
