## The HAS procedure

Let $\phi_1(x) = 1$, $\phi_2(x) = x - 0.5$, and $\phi_t(x) = \int_0^1 (t - u)_+ (x - u)_+ \, du$.

1. Initialization: set the maximum number of bases $q$ ($q > 2$) and the inflated degrees of freedom (IDF). Start with $k = 2$ and the two bases $\{\phi_1(x), \phi_2(x)\}$.

2. Forward stepwise selection: for $k = 3, \ldots, q$, choose the $k$th basis $\phi_{t_k}(x)$ to maximize the reduction in the residual sum of squares (RSS).

3. Optimal number of bases: choose $k > 2$ as the minimizer of the generalized cross-validation (GCV) score.

4. Backward elimination: perform backward elimination on the selected bases. Decide the final number of bases by the Akaike Information Criterion (AIC).

5. Fit: fit a standard or ridge regression model to the final selected bases.
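The forward-selection and GCV steps above can be sketched as follows. This is a minimal illustration, not the HAS software: we assume $x$ is scaled to $[0, 1]$, candidate knots are placed at the observed points, and the GCV score inflates the model size by the IDF as $\mathrm{GCV}(k) = (\mathrm{RSS}_k/n)\,/\,(1 - \mathrm{IDF}\cdot k/n)^2$; the function names are ours.

```python
import numpy as np

def phi_t(t, x):
    """Candidate basis phi_t(x) = int_0^1 (t-u)_+ (x-u)_+ du, in closed form:
    with m = min(t, x), the integral is t*x*m - (t+x)*m^2/2 + m^3/3."""
    m = np.minimum(t, x)
    return t * x * m - (t + x) * m**2 / 2 + m**3 / 3

def forward_select(x, y, q=10, idf=1.1):
    """Greedy forward selection of bases phi_{t_k}: at each step add the
    candidate basis giving the largest RSS reduction; the number of bases
    is then chosen by GCV with inflated degrees of freedom (IDF)."""
    n = len(y)
    B = np.column_stack([np.ones(n), x - 0.5])   # phi_1 and phi_2
    candidates = list(x)                          # knots at observed points
    gcv_path, chosen = [], []
    for k in range(2, q + 1):
        beta = np.linalg.lstsq(B, y, rcond=None)[0]
        rss = float(np.sum((y - B @ beta) ** 2))
        gcv_path.append(rss / n / (1 - idf * k / n) ** 2)
        if k == q:
            break
        # find the candidate basis that maximizes the RSS reduction
        best_rss, best_col, best_t = np.inf, None, None
        for t in candidates:
            col = phi_t(t, x)
            Bt = np.column_stack([B, col])
            b2 = np.linalg.lstsq(Bt, y, rcond=None)[0]
            r = float(np.sum((y - Bt @ b2) ** 2))
            if r < best_rss:
                best_rss, best_col, best_t = r, col, t
        B = np.column_stack([B, best_col])
        candidates.remove(best_t)
        chosen.append(best_t)
    k_opt = 2 + int(np.argmin(gcv_path))          # GCV-optimal model size
    return chosen, k_opt
```

Backward elimination by AIC and the final (standard) regression fit would follow the same pattern, dropping one basis at a time from the selected set.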

The key to spatial adaptiveness is selecting the bases adaptively based on the data. The IDF accounts for the added flexibility of adaptively selected bases. Luo and Wahba (20) suggested the use of IDF = 1.2. For array CGH data, we found that this choice of IDF sometimes under- or over-estimates the number of bases. Our experience suggests that combining a smaller IDF (1 or 1.1) with the backward elimination step provides better fits. We also found that the ridge regression step in the original HAS procedure can lead to over-smoothing for array CGH data. We therefore recommend standard regression, computed with a numerically stable procedure.

We use the following bootstrap procedure to calculate p-values. Denote the HAS estimates of $f$ and $\sigma$ as $\hat f$ and $\hat\sigma$, respectively. We first generate a bootstrap sample $y_i^* = \hat f(x_i) + \epsilon_i^*$, $i = 1, \ldots, I$, where the $\epsilon_i^*$ are sampled with replacement from the residuals. Denote the HAS estimates of $f$ and $\sigma$ based on the bootstrap sample as $\hat f^*$ and $\hat\sigma^*$, respectively. Let $D_i^* = (\hat f^*(x_i) - \hat f(x_i))/\hat\sigma^*$. Repeat this process $B$ times and denote by $D_i^{*(b)}$ the $D_i^*$ statistic based on the $b$th bootstrap sample. We then calculate the p-values as $p_i = \#\{b : |D_i^{*(b)}| \ge |\hat f(x_i)|/\hat\sigma\}/B$.
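The residual bootstrap above can be sketched as follows. A simple moving-average smoother stands in for HAS (which is what the procedure would actually call); `smooth` and `bootstrap_pvalues` are our illustrative names.

```python
import numpy as np

def smooth(x, y, w=7):
    """Stand-in smoother (moving average); HAS would be used here.
    Returns the fitted curve and a residual-based sigma estimate."""
    fhat = np.convolve(y, np.ones(w) / w, mode="same")
    return fhat, float(np.std(y - fhat))

def bootstrap_pvalues(x, y, fit=smooth, B=200, seed=0):
    """Residual-bootstrap p-values following the text:
    p_i = #{b : |D*_i(b)| >= |fhat(x_i)|/sighat} / B."""
    rng = np.random.default_rng(seed)
    fhat, sig = fit(x, y)
    resid = y - fhat
    t_obs = np.abs(fhat) / sig                    # observed t-like statistic
    exceed = np.zeros_like(y)
    for _ in range(B):
        # resample residuals with replacement around the fitted curve
        ystar = fhat + rng.choice(resid, size=len(y), replace=True)
        fstar, sstar = fit(x, ystar)
        # centered bootstrap statistic D*_i = (f*(x_i) - fhat(x_i)) / sigma*
        exceed += (np.abs((fstar - fhat) / sstar) >= t_obs)
    return exceed / B
```

Because $D_i^*$ is centered at $\hat f(x_i)$, its bootstrap distribution approximates the null distribution of the t-like statistic at each clone.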

Clones with $p_i < \alpha$ are then significant at level $\alpha$. The false discovery rate (FDR) can be used to circumvent the problem of multiple comparisons (11).
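One common way to control the FDR is the Benjamini-Hochberg step-up procedure; the source does not specify which FDR method is used, so the following is only an illustrative choice.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure at FDR level alpha.
    Returns a boolean mask marking the rejected (significant) clones."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    # compare sorted p-values to the step-up thresholds alpha * i / n
    thresh = alpha * np.arange(1, n + 1) / n
    passed = p[order] <= thresh
    reject = np.zeros(n, dtype=bool)
    if passed.any():
        kmax = int(np.max(np.nonzero(passed)[0]))
        reject[order[: kmax + 1]] = True          # reject all up to the largest k
    return reject
```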

Simulations in Wang and Guo (11) indicated that the modified t-like statistic based on the smoothed variance consistently improves performance. The HAS procedure is more powerful for detecting clustered aberrant locations, while a separate t-test is more powerful for detecting isolated ones.
