Hierarchical clustering

Hierarchical clustering is used to partition objects into a series of nested clusters (5, 6), by contrast with approaches that find a single partition (16). To illustrate, a hierarchical clustering analysis of both genes and samples in the Hedenfalk data is shown in Figure 19.1, along with a gray scale image of gene expression levels. The similarity used is the uncentered correlation. The hierarchy of clusters of samples is displayed using a tree-like structure called a dendrogram. Dendrograms join objects, or clusters of objects, to form increasingly large clusters. The height at which two clusters are joined represents how similar they are, with low heights representing high similarity. Samples in Figure 19.1 are labeled by their type (BRCA1, BRCA2, or sporadic), though these types are not used in constructing the dendrogram.
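As a minimal sketch of the similarity measure named above, the uncentered correlation of two expression profiles can be computed like a Pearson correlation in which the means are not subtracted. The function names and the conversion to a dissimilarity (one minus the similarity) are assumptions made for illustration, not details taken from the analysis behind Figure 19.1.

```python
import math


def uncentered_correlation(x, y):
    """Correlation-like similarity computed without subtracting the means."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("similarity is undefined for an all-zero profile")
    return dot / (norm_x * norm_y)


def uncentered_distance(x, y):
    """Assumed conversion to a dissimilarity: 0 means identical direction."""
    return 1.0 - uncentered_correlation(x, y)
```

Two profiles that are proportional to each other get a similarity of 1, and profiles with no shared signal get a similarity of 0.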

Figure 19.1. Hierarchical cluster analysis of the Hedenfalk breast cancer data. The gray scale image represents gene expression levels, with levels lower than the reference represented by white to light gray and levels higher than the reference represented by medium gray to black. The left panel includes all samples and genes. The right panel includes all samples and the top 25% of genes most strongly associated with the presence of BRCA1 and BRCA2 mutations. The dendrograms for genes have been omitted.

There are two kinds of hierarchical clustering approaches: agglomerative and divisive. The agglomerative approach begins by assuming that each object belongs to its own separate cluster. At the first step, the two most similar objects are combined to form a new cluster. Then the next most similar clusters or objects are combined, and so forth. This is a bottom-up approach in the sense that the clustering starts at the bottom of the dendrogram of Figure 19.1 and works its way up until all objects belong to one cluster. As part of the agglomerative approach, it is necessary to specify a linkage method, that is, a way of defining the similarity of clusters based on the similarities of their members. Commonly used linkage methods are single, average, and complete linkage, in which clusters are linked based on the similarity of their closest members, the average similarity of their members, and the similarity of their furthest members, respectively.
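The agglomerative procedure and the three linkage methods can be sketched as follows. This is a deliberately naive illustration, not the implementation used for Figure 19.1: it assumes the input is a symmetric matrix of dissimilarities (for example, one minus a similarity), so the "most similar" pair is the one with the smallest value, and the function name `agglomerative` is hypothetical.

```python
def agglomerative(dist, linkage="average"):
    """Naive agglomerative clustering on a symmetric dissimilarity matrix.

    Returns the merges in the order performed, each as
    (members_of_first_cluster, members_of_second_cluster, height).
    """
    combine = {
        "single": min,                            # closest pair of members
        "complete": max,                          # furthest pair of members
        "average": lambda ds: sum(ds) / len(ds),  # mean over all cross pairs
    }[linkage]

    clusters = [(i,) for i in range(len(dist))]  # each object starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                cross = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                d = combine(cross)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


# Toy data: four objects placed on a line, dissimilarity = absolute difference.
positions = [0, 1, 4, 6]
dist = [[abs(p - q) for q in positions] for p in positions]
for method in ("single", "average", "complete"):
    heights = [h for _, _, h in agglomerative(dist, method)]
    print(method, heights)
```

On this toy input all three methods merge the same pairs first and differ only in the height of the final join (3 for single, 4.5 for average, 6 for complete), which shows how the linkage choice changes the heights drawn in a dendrogram.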

The divisive approach works from the top of the dendrogram, where all objects belong to one cluster. At the first step, it finds the best division of the objects, so that similarity among objects within clusters is highest and dissimilarity between clusters is greatest. This process continues, with the best partition of a cluster chosen at each step, until every object is in its own cluster. Details of hierarchical clustering can be found in (4).
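A top-down counterpart can be sketched in the same spirit. The text does not fix a specific criterion for the "best division", so the scoring rule here (maximize the mean dissimilarity between the two sides) is an assumption, as are the function names. The exhaustive search over all two-way splits is only practical for a handful of objects.

```python
from itertools import combinations


def best_split(members, dist):
    """Try every two-way split; keep the one with the largest mean
    between-side dissimilarity (an assumed criterion)."""
    first, rest = members[0], members[1:]
    best = None
    # `first` always goes left, so each split is considered exactly once;
    # k < len(rest) guarantees the right side is never empty.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left = (first,) + extra
            right = tuple(m for m in rest if m not in extra)
            between = [dist[i][j] for i in left for j in right]
            score = sum(between) / len(between)
            if best is None or score > best[0]:
                best = (score, left, right)
    return best


def divisive(members, dist):
    """Recursively split until every object is alone; returns a nested tuple."""
    if len(members) == 1:
        return members[0]
    _, left, right = best_split(members, dist)
    return (divisive(left, dist), divisive(right, dist))


# Toy data: four objects placed on a line, dissimilarity = absolute difference.
positions = [0, 1, 4, 6]
dist = [[abs(p - q) for q in positions] for p in positions]
print(divisive((0, 1, 2, 3), dist))
```

For these four objects the first split separates objects 0 and 1 from objects 2 and 3, giving the nested result `((0, 1), (2, 3))`.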

An important consideration when applying or interpreting hierarchical clustering results is that there is not a unique dendrogram for a given hierarchical clustering result. For each split in a dendrogram, it is arbitrary which branch is drawn to the right or left, and users need to specify criteria for this choice. As such, many dendrograms can be drawn for a given hierarchical clustering result and closeness of objects should be judged based on the height at which they are joined, rather than their ordering in the dendrogram.
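To make the point about branch ordering concrete, the sketch below enumerates every left-to-right leaf order that can be drawn for one fixed hierarchy by flipping branches. The nested-tuple representation of the tree and the function name are assumptions chosen for illustration. A binary hierarchy on n objects has n - 1 joins, each of which can be flipped, giving 2^(n-1) drawings of the same result.

```python
def leaf_orders(tree):
    """Yield every leaf order obtainable by flipping branches.

    A tree is either a leaf label (any non-tuple value) or a pair
    (left_subtree, right_subtree).
    """
    if not isinstance(tree, tuple):
        yield [tree]
        return
    left, right = tree
    for l in leaf_orders(left):
        for r in leaf_orders(right):
            yield l + r  # branch drawn as given
            yield r + l  # same join, branches flipped


tree = (("A", "B"), ("C", "D"))
orders = list(leaf_orders(tree))
print(len(orders))  # 2 ** (4 - 1) = 8 drawings of one hierarchy
```

One of the eight orders is B, A, C, D, in which A and C sit side by side even though they are joined only at the root. This is why adjacency in the drawing should not be read as closeness.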

Preselection of genes can significantly affect clustering of samples, and vice versa. Selecting genes that show at least a certain amount of variation across samples is useful for reducing the sensitivity of clustering results to noise. Selecting genes whose variation is associated with a phenotype of interest is also common, though when that is done the correspondence of clusters to phenotype cannot be invoked as validation of the clustering results, because the correspondence will be inflated by the preselection. To illustrate, compare the left panel of Figure 19.1, which includes all genes in the experiment, to the right panel, where only the top 25% of genes associated with the BRCA types are included. The dendrogram on the left has short branch links and cascading patterns, both of which weaken the case for the existence of clusters. None of the main partitions has any relation to the BRCA type. On the right, the branch links at the top are longer and there is some evidence of two major clusters, which separate the BRCA1 cases well from the BRCA2 cases. In general, a correspondence between clusters found by unsupervised analyses and sample phenotypes can be taken as independent supporting evidence for the existence of biologically significant clusters. In this case, however, the argument would be circular, because the sample phenotypes were used in selecting the genes for clustering.
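The first kind of preselection, keeping genes that vary enough across samples, can be sketched as below. Because it never looks at the sample labels, it does not create the circularity described above. The function name, the gene names, and the threshold are hypothetical, and the use of population variance as the measure of variation is an assumption.

```python
from statistics import pvariance


def filter_by_variance(expression, min_variance):
    """Keep genes whose variance across samples is at least min_variance.

    `expression` maps a gene name to its list of expression values,
    one value per sample. Sample phenotypes are never consulted.
    """
    return {
        gene: values
        for gene, values in expression.items()
        if pvariance(values) >= min_variance
    }


# Hypothetical toy data: one flat gene and one variable gene.
expression = {
    "gene_flat": [1.0, 1.0, 1.0],
    "gene_variable": [0.0, 2.0, 4.0],
}
kept = filter_by_variance(expression, min_variance=1.0)
print(sorted(kept))
```

Only the variable gene passes the filter here; the flat gene has zero variance and is dropped before clustering.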
