## Distance and similarity

To determine which objects cluster together, we must have a way of measuring how similar or dissimilar any two of them are. Most clustering approaches accept as input a matrix whose entries measure the similarity or dissimilarity between each pair of objects. Choosing this measure is one of the most critical, yet often underappreciated, aspects of a cluster analysis. Different measures reflect different goals and can therefore have a strong influence on the resulting clusters. Here we discuss three in detail: the correlation coefficient, which will bring together objects whose patterns of change are similar; the Euclidean distance, which will bring together objects whose absolute expressions are similar; and the uncentered correlation, which achieves a compromise between the previous two.

The Pearson correlation coefficient measures the strength of the linear association between the expression levels of two objects. In the case of genes $j$ and $k$, it is defined by

$$r_{jk} = \frac{\sum_{s} (x_{sj} - \bar{x}_j)(x_{sk} - \bar{x}_k)}{\sqrt{\sum_{s} (x_{sj} - \bar{x}_j)^2 \, \sum_{s} (x_{sk} - \bar{x}_k)^2}}, \tag{19.1}$$

where $x_{sj}$ is the gene expression for gene $j$ in sample $s$, $\bar{x}_j$ is the average gene expression of gene $j$ across all samples, and the sums run over all samples. An analogous definition applies to the correlation between samples. The correlation takes values ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation). A correlation of 0 means that there is no linear relationship between the two genes. For analyses that require positive similarity matrices, it is common to use the absolute value of the correlation, with the rationale that high negative and high positive correlations may both imply an underlying common mechanism. The correlation coefficient is unitless, but it is sensitive to nonlinear transformations of the data, such as the logarithm. For nonlinear relationships, the correlation coefficient may not adequately describe similarity. Another drawback is that it may be sensitive to noise.
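As a rough illustration of the properties described above, the following sketch computes the Pearson correlation directly from its definition using only the Python standard library. The function name `pearson` and the expression values are hypothetical and chosen purely for illustration; they do not come from any dataset discussed in the text.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of the centered values (numerator of the definition).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared centered values (under the square root in the denominator).
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical expression values for two genes across four samples.
gene_j = [1.0, 2.0, 4.0, 8.0]
gene_k = [10.0, 20.0, 40.0, 80.0]  # same pattern, ten times the magnitude

# Rescaling or shifting one gene leaves the correlation at (approximately) 1.
r_raw = pearson(gene_j, gene_k)
r_shifted = pearson(gene_j, [v + 100.0 for v in gene_k])

# A nonlinear transformation (here log2) of one gene changes the correlation.
r_log = pearson(gene_j, [math.log2(v) for v in gene_k])

print(r_raw, r_shifted, r_log)
```

In this toy example the correlation is unaffected by a change of units or baseline, but it drops below 1 once one of the profiles is log-transformed, which is the sensitivity to nonlinear transformations noted above.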

The Euclidean distance measures the geometric distance between two objects. In the case of genes $j$ and $k$, it is defined by

$$d_{jk} = \sqrt{\sum_{s} (x_{sj} - x_{sk})^2}, \tag{19.2}$$

with the sum taken over all samples.

An analogous definition applies to the distance between samples. The Euclidean distance takes values from 0 to $\infty$ and retains the units of the input gene expression measurements. It also tends to grow with the number of samples included in the dataset, because each additional sample contributes a nonnegative term to the sum.
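The dependence of the Euclidean distance on units and on the number of samples can be checked with a small sketch. The helper name `euclidean` and the values below are hypothetical, and the square-root form of the distance is assumed.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical expression values: gene_k is gene_j offset by one unit per sample.
gene_j = [1.0, 2.0, 4.0, 8.0]
gene_k = [2.0, 3.0, 5.0, 9.0]

# Distance using all four samples, and using only the first two.
d_four = euclidean(gene_j, gene_k)
d_two = euclidean(gene_j[:2], gene_k[:2])

# Expressing both genes in units ten times smaller multiplies the distance by ten.
d_rescaled = euclidean([10.0 * v for v in gene_j], [10.0 * v for v in gene_k])

print(d_two, d_four, d_rescaled)
```

Here the distance over four samples exceeds the distance over two, and a change of measurement units rescales the distance by the same factor, so distances computed on datasets with different numbers of samples or different units are not directly comparable.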

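The uncentered correlation (14) is similar to the Pearson correlation but is evaluated without centering:

$$r^{u}_{jk} = \frac{\sum_{s} x_{sj} \, x_{sk}}{\sqrt{\sum_{s} x_{sj}^2 \, \sum_{s} x_{sk}^2}}. \tag{19.3}$$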

Like the Pearson correlation, this measure is unitless, but like the Euclidean distance, it is sensitive to absolute magnitudes. As a result, it is less likely to be influenced by genes whose variation is mostly noise. For a summary of other distance and similarity metrics, see (15).
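A minimal sketch of the uncentered correlation, assuming the usual form in which the means are simply not subtracted, shows both properties mentioned above. The function name `uncentered_correlation` and the values are hypothetical.

```python
import math

def uncentered_correlation(x, y):
    """Correlation computed without subtracting the means of x and y."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den

# Hypothetical expression values for one gene across four samples.
gene_j = [1.0, 2.0, 4.0, 8.0]

# Same pattern in different units: the measure is unitless, so it stays near 1.
u_scaled = uncentered_correlation(gene_j, [10.0 * v for v in gene_j])

# Same pattern on a much higher baseline: the measure drops below 1,
# whereas the Pearson correlation of these two profiles would still be 1.
u_shifted = uncentered_correlation(gene_j, [v + 100.0 for v in gene_j])

print(u_scaled, u_shifted)
```

In this example a pure change of scale leaves the uncentered correlation unchanged, while a shift in baseline lowers it, so two genes with the same pattern of change but different absolute levels are treated as less similar than under the Pearson correlation.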
