Limitations of cluster analysis

Clustering techniques for high-dimensional data are exploratory. Their strength lies in providing rough maps and suggesting directions for further study. Substantial additional work is needed to give context and meaning to the groups found by automated algorithms. This includes cross-referencing existing knowledge about genes and samples, as well as additional biological validation.

Clustering results are sensitive to a variety of user-specified inputs. Like arranging the books in a collection, clustering a large and complex set of objects can be organized in different ways depending on the goals. From this perspective, good clustering tools are responsive to users' choices, not insensitive to them, and sensitivity to input is a necessity of cluster analysis rather than a weakness. This also means, however, that using a clustering algorithm without understanding how it works, what its inputs mean, and how they relate to the biological questions of interest is likely to yield misleading results.
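
As a small illustration of this sensitivity, the sketch below compares two common choices of distance on three made-up expression profiles. The gene names, the values, and the helper `nearest` are hypothetical and chosen only for the example; the point is that the profile judged "closest" to a given gene depends on the distance the user selects.

```python
from math import sqrt

def euclidean(a, b):
    """Euclidean distance: sensitive to the magnitude of expression."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def correlation_distance(a, b):
    """1 minus the Pearson correlation: sensitive to the shape of the profile only."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1 - cov / sqrt(var_a * var_b)

def nearest(name, profiles, distance):
    """Return the name of the profile closest to `name` under `distance`."""
    others = (other for other in profiles if other != name)
    return min(others, key=lambda other: distance(profiles[name], profiles[other]))

# Hypothetical profiles: gene_b has the same shape as gene_a at ten times the
# magnitude; gene_c has similar magnitude to gene_a but the opposite trend.
profiles = {
    "gene_a": [1, 2, 3, 4],
    "gene_b": [10, 20, 30, 40],
    "gene_c": [4, 3, 2, 1],
}

print(nearest("gene_a", profiles, euclidean))             # gene_c
print(nearest("gene_a", profiles, correlation_distance))  # gene_b
```

Neither answer is wrong. Euclidean distance groups genes with similar expression levels, whereas correlation distance groups genes with similar patterns across samples, so the choice should follow from the biological question.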

Clustering results are generally sensitive to small variations in the samples and genes chosen, and to outlying observations. As a consequence, many of the data-analytic decisions made during normalization, filtering, data transformation, and so forth will affect the results. When conclusions drawn from clustering go beyond simple data visualization, it is important to provide accurate assessments of the uncertainty associated with the clusters found. Uncertainty from sampling and outliers can be addressed within model-based approaches (31) or, alternatively, using resampling techniques (32-34). The consequences of choosing among plausible alternative transformations, normalizations, and filtering rules should be addressed by sensitivity analysis, that is, by repeating the analysis under each alternative and reporting the conclusions that are consistent across analyses.
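
The resampling idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of the cited references (32-34): it resamples genes with replacement, reclusters the samples with a simple average-linkage routine, and records how often each pair of samples lands in the same cluster. The function names, the toy data, and the choice of average linkage are all hypothetical.

```python
import random
from itertools import combinations

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def average_linkage(points, k):
    """Agglomerative clustering with average linkage; returns one label per point."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Average pairwise distance between the members of the two clusters.
            d = sum(euclidean(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            d /= len(clusters[a]) * len(clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]  # merge the closest pair (a < b)
        del clusters[b]
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

def coclustering_frequency(samples, k, n_boot=100, seed=0):
    """Resample genes (columns) with replacement, recluster the samples, and
    return the fraction of replicates in which each pair of samples co-clusters."""
    rng = random.Random(seed)
    n_genes = len(samples[0])
    counts = {pair: 0 for pair in combinations(range(len(samples)), 2)}
    for _ in range(n_boot):
        cols = [rng.randrange(n_genes) for _ in range(n_genes)]
        resampled = [[row[c] for c in cols] for row in samples]
        labels = average_linkage(resampled, k)
        for i, j in counts:
            if labels[i] == labels[j]:
                counts[(i, j)] += 1
    return {pair: c / n_boot for pair, c in counts.items()}

# Toy data (made up): rows are samples, columns are genes.
# Samples 0-2 and samples 3-5 form two well-separated groups.
toy_samples = [
    [0.1, 0.3, 0.2, 0.0, 0.4],
    [0.2, 0.1, 0.4, 0.3, 0.0],
    [0.0, 0.2, 0.1, 0.4, 0.3],
    [5.1, 5.3, 4.9, 5.2, 5.0],
    [5.0, 4.8, 5.2, 5.1, 5.3],
    [4.9, 5.1, 5.0, 5.3, 4.8],
]

freq = coclustering_frequency(toy_samples, k=2)
for pair, f in sorted(freq.items()):
    print(pair, f)
```

Pairs with frequencies near 1 or near 0 are grouped or separated stably, while intermediate values flag assignments that depend on which genes happen to be included. The same loop can serve a sensitivity analysis by iterating over alternative transformations or filtering rules instead of bootstrap replicates and keeping only the groupings that persist across them.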
