Data analysis: allele calling and genotype assignment

The raw signals from the image quantification process are used to derive the allele calls and, finally, to assign the genotypes. The quantification data are similar to the numerical data typically collected from gene expression arrays and contain noise from several sources that affect the image, such as hybridization specificity, ASO printing anomalies and chemical residues. The results of the data analysis are depicted in Figure 8.3.

The data analysis starts with data normalization, for which we use a standard log transformation of the background-subtracted signal intensities. Next, for a given marker, we calculate the mean of the summed intensities of the signals obtained for both ASOs across all samples, and exclude outlier ASO pairs whose summed intensity differs from the mean by more than a chosen multiple of the standard deviation, for example more than 2 S.D. This procedure filters out non-amplified samples as well as extreme signal intensities, which are usually due to non-specific fluorescence signals.
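The filtering step above can be sketched in a few lines of Python. This is a minimal illustration, not the SNPSnapper implementation: the function name `filter_spot_pairs`, the use of log10, and the choice to compute the mean and S.D. on the log of the summed intensities are assumptions made for the example.

```python
import math
from statistics import mean, stdev


def filter_spot_pairs(pairs, n_sd=2.0):
    """Return the indices of spot pairs kept for one marker.

    pairs: list of (aso1, aso2) background-subtracted intensities,
           one tuple per sample (hypothetical input layout).
    n_sd:  exclusion threshold in standard deviations (2 S.D. in the text).
    """
    # Log-transform the summed intensity of each ASO pair.
    # Pairs with a non-positive sum cannot be log-transformed and are dropped.
    log_sums = {}
    for i, (aso1, aso2) in enumerate(pairs):
        total = aso1 + aso2
        if total > 0:
            log_sums[i] = math.log10(total)

    if len(log_sums) < 2:
        return sorted(log_sums)  # too few values to estimate a spread

    m = mean(log_sums.values())
    sd = stdev(log_sums.values())

    # Keep only pairs within n_sd standard deviations of the mean.
    return [i for i, v in log_sums.items() if abs(v - m) <= n_sd * sd]
```

For example, nine samples with summed intensities around 2000 and one water control with a summed intensity of 10 would leave only the nine amplified samples.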

[Figure 8.3 image not reproduced. Panel labels: scanned array (3 x 5 subarrays shown); enlarged subarrays for Sample 1, Sample 2 and Sample 3; scatter plot with an x-axis running from 0.0 to 1.0 and a logarithmic y-axis (ticks up to 1,000,000), with the low-intensity points labeled "Excluded due to low intensity (water controls)".]

Figure 8.3.

Analysis of the image data, allele calling and genotype assignment. The data shown are from a multiplex genotyping assay of 20 SNPs. Clockwise from the left: the scanned image of a fraction of the genotyping array, showing the array-of-arrays layout. Each subarray represents an independent sample, and each SNP locus is represented by two spots, one containing the ASO1 and the other the ASO2 oligonucleotide. In the three enlarged subarrays, duplicate spots per SNP allele are used to increase the reliability of genotyping. The spots are quantitated and, using the intensity fractions, the spot pairs are clustered into three distinct genotypes (the two homozygotes and the heterozygote). The clustering is confirmed by checking, for example, that duplicate spots fall within the same cluster. In the graph, the x-axis is the fraction of the spot pairs' background-subtracted intensities used for clustering, and the y-axis is the logarithm of the summed intensities of the spot pairs. Water controls are routinely included and are distinguished by their low summed intensity values. Possible outlier spot pairs, with summed intensity values that are too high or too low or with intensity fraction values that fall between the clusters, are excluded and not genotyped.

Optimally, the validated data should fall into three distinct classes, representing the two homozygous and the heterozygous sample groups. We typically use clustering methods, such as a modified version of k-means clustering applied to the one-dimensional signal intensity fraction data. We set k = 3 and pre-assign the cluster centroids to fraction values of 0.2, 0.5 and 0.8. We also optimize the clustering so that replicates of the same sample are assigned to the same cluster, if possible. Replicate samples with discrepant cluster assignments are not assigned a genotype unless the researcher decides to manually exclude the conflicting replicates. The clusters usually converge easily, even when the fraction values of the cluster centroids are heavily skewed toward either end of the fraction scale. This makes the clustering approach superior to static genotype assignment based only on the intensity fraction values. The clustering can be further guided by reference samples, for which the genotypes are already known, as well as by no-template controls, which reduce the error due to non-specific fluorescence emission.
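As an illustration of the clustering described above, the following sketch runs plain one-dimensional k-means with k = 3, starting from the pre-assigned centroids 0.2, 0.5 and 0.8. It is the textbook algorithm only; the modifications mentioned in the text (replicate constraints, reference samples, no-template controls) are not reproduced, and the function name `kmeans_1d` is hypothetical.

```python
def kmeans_1d(fractions, centroids=(0.2, 0.5, 0.8), max_iter=100):
    """Cluster intensity fractions (values in [0, 1]) into len(centroids) groups.

    Returns (labels, centroids), where labels[i] is the index of the
    cluster assigned to fractions[i].
    """
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each fraction goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))
            for x in fractions
        ]
        # Update step: move each centroid to the mean of its members.
        # An empty cluster keeps its previous centroid.
        updated = []
        for k, c in enumerate(centroids):
            members = [x for x, lab in zip(fractions, labels) if lab == k]
            updated.append(sum(members) / len(members) if members else c)
        if updated == centroids:
            break  # converged: centroids no longer move
        centroids = updated
    return labels, centroids
```

Because the centroids are re-estimated from the data, a marker whose homozygote clusters sit near, say, 0.08 and 0.92 is still separated cleanly, which a fixed fraction threshold would not guarantee.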

To reduce false genotype assignments, we next calculate the distances between the cluster centroids as well as the standard deviation of the samples around each cluster centroid. We use this information to set uncertainty areas between the cluster centroids. All samples falling in these regions are excluded from allele calling, because the probability of correct cluster assignment is reduced there and the risk of a genotyping error is correspondingly higher.
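The text does not give the exact rule used to set the uncertainty areas, so the sketch below shows one plausible reading, stated here as an assumption: for each pair of adjacent clusters, a bound is placed `n_sd` standard deviations inward from each centroid, and the interval between the two bounds is treated as uncertain. The function names and the default `n_sd = 2.0` are hypothetical.

```python
from statistics import pstdev


def uncertainty_zones(fractions, labels, centroids, n_sd=2.0):
    """Return one (low, high) uncertainty interval per pair of adjacent clusters."""
    # Sort cluster indices by centroid position along the fraction axis.
    order = sorted(range(len(centroids)), key=lambda k: centroids[k])

    # Spread of each cluster: standard deviation around its centroid.
    spread = {}
    for k in order:
        members = [x for x, lab in zip(fractions, labels) if lab == k]
        spread[k] = pstdev(members, mu=centroids[k]) if members else 0.0

    zones = []
    for lo, hi in zip(order, order[1:]):
        a = centroids[lo] + n_sd * spread[lo]  # inner bound of the lower cluster
        b = centroids[hi] - n_sd * spread[hi]  # inner bound of the upper cluster
        # Assumption: the interval between the two bounds is uncertain,
        # whether it is a gap between clusters or an overlap of them.
        zones.append((min(a, b), max(a, b)))
    return zones


def call_or_exclude(fractions, labels, zones):
    """Keep the cluster label, or return None for samples inside a zone."""
    return [
        None if any(lo < x < hi for lo, hi in zones) else lab
        for x, lab in zip(fractions, labels)
    ]
```

Samples returned as `None` correspond to the spot pairs that are excluded and not genotyped.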

As the next quality control step, we calculate the genotype distribution expected under Hardy-Weinberg equilibrium and use a chi-square test to evaluate the likelihood of the observed genotype assignments. Finally, we enter all genotyping data into a database, where we check the samples for consistency with Mendelian inheritance, if the relevant family information is available.
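The Hardy-Weinberg check can be illustrated with the standard chi-square goodness-of-fit test for a biallelic marker. This is a generic sketch rather than the SNPSnapper code; the function name is hypothetical, and the p-value uses one degree of freedom, as is conventional for this test.

```python
import math


def hardy_weinberg_chi2(n_aa, n_ab, n_bb):
    """Chi-square test of observed genotype counts against Hardy-Weinberg proportions.

    n_aa, n_ab, n_bb: observed counts of the two homozygotes and the heterozygote.
    Returns (chi2, p_value).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotyped sample is required")

    # Allele frequencies estimated from the observed genotypes.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p

    # Expected counts under Hardy-Weinberg equilibrium: p^2, 2pq, q^2.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)

    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Upper-tail probability of a chi-square distribution with 1 degree
    # of freedom, written in closed form with the complementary error function.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

A marker with a very small p-value deviates from the expected proportions and is a candidate for closer inspection of its clustering.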

All data analysis steps described here are implemented in SNPSnapper, software designed specifically for both allele-specific primer extension and minisequencing in our laboratory (Saharinen et al., manuscript in preparation). SNPSnapper also displays all the data in various dynamic graphs, allows manual intervention at each step and provides the original scanned array image, for example to support the rejection of conflicting sample replicates. Finally, the data are stored in a relational database and can be exported, for example as linkage files for downstream analysis programs.
