The identification of differentially expressed genes narrows down the number of genes for further analysis. Clustering genes with similar temporal profiles is commonly the next phase of the initial analysis. This is done in the belief that genes with similar temporal profiles may well be involved in similar biological processes, for example in the same aspect of response to a treatment. Frequently there is a further hope that genes in the same cluster share common sequence motifs in their regulatory region. In clustering, the focus might be on grouping genes with particular temporal profiles, say early induced, monotonic increasing, up first and then down and so on, or it may simply be a way of partitioning all genes into automatically defined groups.

Below we briefly summarize the literature on clustering, referring to the review of Moller-Levet et al. (54) for a comprehensive treatment of the issues. One of the earliest examples was hierarchical clustering of the yeast cell cycle data by Spellman et al. (5) and Eisen et al. (55). Shortly afterwards, self-organizing maps (SOM) were applied by Tamayo et al. (56) to the same data, as well as to a human dataset concerning hematopoietic differentiation, while the k-means algorithm was used in Tavazoie et al. (57). This early literature used rather arbitrary criteria for reducing the number of genes prior to clustering. In the GENECLUSTER package, genes are filtered by a simple variation criterion; see (58) and the GENECLUSTER 2 Reference Guide for details (http://www.broad.mit.edu/cancer/software/genecluster2/gc_ref.html). In producing a 6 x 4 grid SOM, Saban et al. (59) started with 588 genes which had to be induced at least three-fold over the initial time point in one replicate at some time point, and induced at least two-fold over the initial time point in the second replicate at that time point. Such filtering rules are not uncommon in the literature (see 60). A more recent example was the hierarchical clustering of 906 genes into six main groups representing three major patterns in Himanen et al. (9). A significance test within a mixed-model analysis was used to select the 906 genes.
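
A fold-change rule of the kind described for Saban et al. (59) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the function name, the gene names and the intensity values are hypothetical, intensities are assumed to be positive and on the linear scale, and the rule is assumed to be symmetric in the two replicates.

```python
def passes_fold_filter(rep_a, rep_b, high=3.0, low=2.0):
    """Return True if, at some time point after the first, one replicate is
    induced at least `high`-fold over its initial value and the other
    replicate at least `low`-fold over its initial value."""
    if len(rep_a) != len(rep_b):
        raise ValueError("replicates must cover the same time points")
    for t in range(1, len(rep_a)):
        fold_a = rep_a[t] / rep_a[0]
        fold_b = rep_b[t] / rep_b[0]
        if max(fold_a, fold_b) >= high and min(fold_a, fold_b) >= low:
            return True
    return False


# Hypothetical intensities; the first entry is the initial time point.
genes = {
    "gene_A": ([100, 350, 120], [100, 210, 110]),  # 3.5x and 2.1x at t=1
    "gene_B": ([100, 350, 120], [100, 150, 260]),  # folds never coincide in time
    "gene_C": ([100, 110, 90], [100, 95, 105]),    # essentially flat
}

selected = [name for name, (a, b) in genes.items() if passes_fold_filter(a, b)]
print(selected)
```

Only gene_A satisfies both thresholds at the same time point. The thresholds are arbitrary cut-offs, which is the concern raised above about such rules.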

Different clustering algorithms and distance measures can lead to very different results. A perennial challenge with cluster analysis is the determination of the number of clusters. In recent years methods have been developed to deal with this issue, e.g. the gap statistic in Hastie et al. (61); see also (62).
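
The sensitivity to the distance measure can be seen in a toy example. The sketch below uses three hypothetical profiles and compares Euclidean distance with a correlation-based distance (one minus the Pearson correlation). The profile names and values are invented for illustration and do not come from any of the cited datasets.

```python
import math


def euclidean(x, y):
    """Ordinary Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def correlation_distance(x, y):
    """One minus the Pearson correlation: near 0 for profiles with the same
    shape, near 2 for mirror-image profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


# Hypothetical temporal profiles over four time points.
profiles = {
    "rising": [1, 2, 3, 4],
    "rising_strong": [10, 20, 30, 40],  # same shape, larger magnitude
    "falling": [4, 3, 2, 1],            # similar magnitude, opposite shape
}


def nearest(name, dist):
    """Name of the profile closest to `name` under the distance `dist`."""
    others = (k for k in profiles if k != name)
    return min(others, key=lambda k: dist(profiles[name], profiles[k]))


print(nearest("rising", euclidean))
print(nearest("rising", correlation_distance))
```

Under Euclidean distance the rising profile is closest to the falling one, because their magnitudes are similar. Under the correlation-based distance it is closest to the strongly rising profile, because their shapes agree. A clustering built on either measure would therefore group these genes differently.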

As well as these classical clustering approaches, a number of model-based clustering algorithms have been proposed (e.g. 63). Ramoni et al. (64) gave a Bayesian model-based clustering algorithm, which represented temporal profiles by autoregressive models and used an agglomerative procedure to determine the number of clusters. Yeung et al. (65) used Gaussian mixture models in which each component corresponds to a cluster, the number of clusters being determined by the Bayesian Information Criterion (BIC). HMM clustering can be found in Schliep et al. (66) and Schliep et al. (67). There, each cluster was represented as one HMM. The method started with a collection of HMMs with typical qualitative behavior, and an iterative algorithm was used to fit these models and assign genes to clusters in such a way as to maximize the joint likelihood. This method also dealt with missing data, and was illustrated on the yeast cell cycle data of Spellman et al. (5), and on the fibroblast serum response data of Iyer et al. (68). Similarly, Ji et al. (69) and Zeng and Garcia-Frias (70) used HMM approaches to cluster microarray time course data. Bar-Joseph et al. (43) and Luan and Li (71) did likewise, but first represented the profile for each gene by a continuous curve fitted by B-splines with gene-specific and class-specific parameters. Both papers illustrated their methods on the yeast cell cycle and fibroblast serum response datasets. Zhang et al. (72) proposed a biclustering algorithm to discover genes which are co-regulated in only part of the time course. They illustrated their algorithm on the yeast cell cycle data of Cho et al. (4). Other noteworthy approaches were outlined in Peddada et al. (73) and Wakefield et al. (74).
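
For reference, the BIC used to choose the number of mixture components is usually written as below. This is the generic textbook form and not necessarily the exact parameterization of Yeung et al. (65). The parameter count shown assumes unrestricted covariance matrices, and some software reports the criterion with the opposite sign, so that larger values are better.

```latex
\mathrm{BIC}(G) = -2 \log \hat{L}_G + p_G \log n,
\qquad
p_G = (G - 1) + G\,d + G\,\frac{d(d+1)}{2}
```

Here $\hat{L}_G$ is the maximized likelihood of a mixture with $G$ components, $n$ is the number of genes, $d$ is the number of time points, and $p_G$ is the number of free parameters (mixing proportions, means and covariances). In this convention the number of clusters is taken to be the $G$ that minimizes $\mathrm{BIC}(G)$.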

Based on the above examples and our experience, we note that most model-based clustering algorithms have been effective for periodic time courses, but their satisfactory performance on short time-course experiments is not so clear. For developmental time-course data, traditional algorithms such as hierarchical clustering, SOM, or their variants based on distance measures have been more popular, probably because there are usually too few time points to allow the fitting of models. However, these approaches have the drawbacks of ignoring possible dependence across time points in longitudinal studies, and generally ignoring the ordered nature of the time index. We feel that clustering methods which combine features from both the traditional and model-based approaches are urgently needed: ones which recognize the time ordering and can deal with few time points, as well as with temporal dependence and replicates where appropriate.
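
The remark that distance-based methods ignore the ordering of the time index can be checked directly. In the sketch below, the profile values and the permutation are hypothetical. The same reordering is applied to the time points of every gene, and all pairwise Euclidean distances are unchanged, so any clustering built only on those distances cannot distinguish the original time order from a scrambled one.

```python
import math


def euclidean(x, y):
    """Ordinary Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pairwise_distances(profiles):
    """Distances for all unordered pairs of profiles, in a fixed order."""
    n = len(profiles)
    return [euclidean(profiles[i], profiles[j])
            for i in range(n) for j in range(i + 1, n)]


# Hypothetical log-ratio profiles for three genes over five time points.
profiles = [
    [0.1, 0.8, 1.5, 0.9, 0.2],  # up first and then down
    [0.0, 0.2, 0.9, 1.6, 1.8],  # monotonic increasing
    [1.4, 1.0, 0.5, 0.2, 0.1],  # monotonic decreasing
]

# Apply the same arbitrary reordering of the time points to every gene.
order = [3, 0, 4, 1, 2]
scrambled = [[p[t] for t in order] for p in profiles]

# Compare with a tolerance, since summation order affects rounding.
unchanged = all(
    math.isclose(d_orig, d_scr)
    for d_orig, d_scr in zip(pairwise_distances(profiles),
                             pairwise_distances(scrambled))
)
print(unchanged)
```

The same invariance holds for correlation-based distances. A method that models dependence between successive time points, or that fits a curve over time, would by contrast give different results for the scrambled data.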

We end this clustering section by briefly mentioning a couple of other exploratory approaches, similar in spirit to clustering, that have been used to analyze microarray time course data. These are correspondence analysis (75, 76) and singular value decomposition (SVD) (77, 78). Such graphical methods can be quite powerful.
