Dimension reduction

Because so many genes are available as potential predictors, it is useful to preselect a subset of genes, or composite variables, that are likely to be predictive, and then to investigate in depth the relationship between these and the phenotype of interest. For example, genes with nearly constant expression across all samples can be eliminated. Additional screening can be based on measures of marginal association, such as the ratio of within-group variation to between-group variation, or the measure used in Slonim et al. (42). Such screens can, however, miss important genes that act in concert with others but have no strong marginal effects.
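The two screening steps described above can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the expression matrix `X` is taken to have samples in rows and genes in columns, `y` holds group labels, and the function names and the `min_var` threshold are hypothetical choices, not taken from the cited work. The score computed is between-group over within-group variation, the inverse of the ratio named above, so that larger values indicate stronger marginal association.

```python
import numpy as np

def filter_low_variance(X, min_var=1e-3):
    """Return column indices of genes whose variance across samples exceeds min_var.

    min_var is an arbitrary illustrative threshold, not a recommended value.
    """
    variances = X.var(axis=0)
    return np.flatnonzero(variances > min_var)

def between_within_ratio(X, y):
    """Per-gene ratio of between-group to within-group sum of squares."""
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for label in np.unique(y):
        group = X[y == label]
        group_mean = group.mean(axis=0)
        # Spread of the group mean around the overall mean, weighted by group size.
        between += group.shape[0] * (group_mean - overall_mean) ** 2
        # Spread of individual samples around their own group mean.
        within += ((group - group_mean) ** 2).sum(axis=0)
    return between / within

# Toy data: 4 samples (rows) x 3 genes (columns), two groups of two samples.
X = np.array([[1.0, 5.0, 0.0],
              [1.0, 5.2, 1.0],
              [1.0, 9.0, 0.0],
              [1.0, 9.2, 1.0]])
y = np.array([0, 0, 1, 1])

kept = filter_low_variance(X)               # drops the constant first gene
scores = between_within_ratio(X[:, kept], y)
ranked = kept[np.argsort(scores)[::-1]]     # gene indices, strongest association first
```

Filtering near-constant genes first also avoids most divisions by zero in the ratio, although a gene that is constant within every group but differs between groups would still give an infinite score. Because each gene is scored on its own, a ranking of this kind is subject to the limitation noted above: genes that matter only in combination will score poorly.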

When important pathways are known, they can be used to construct new, more highly explanatory variables by hand, giving a parsimonious representation of the data. When such knowledge is not available, we need to apply discovery techniques such as those described earlier; for example, the centroids of clusters or the variables identified by PCA can be used as predictors. Composite variables that are easily measurable and interpretable in terms of the original gene expression are generally preferable. Automatic approaches for preclustering variables before classification are also useful (43).
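As a rough illustration of the two kinds of composite variable mentioned above, the following Python sketch builds predictors from principal components and from gene-cluster averages. It assumes an expression matrix `X` with samples in rows and genes in columns, and a vector `gene_labels` assigning each gene to a cluster produced by some earlier clustering step; both function names and the toy data are hypothetical.

```python
import numpy as np

def pca_scores(X, n_components):
    """Project samples onto the leading principal components, computed via SVD.

    Returns the sample scores (samples x components) and the gene loadings
    (genes x components) that define each component.
    """
    Xc = X - X.mean(axis=0)                       # center each gene
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components].T
    return Xc @ loadings, loadings

def cluster_average_features(X, gene_labels):
    """One composite variable per gene cluster: the mean expression of its genes."""
    clusters = np.unique(gene_labels)
    return np.column_stack(
        [X[:, gene_labels == c].mean(axis=1) for c in clusters]
    )

# Toy data: 5 samples x 4 genes, with genes 0-1 and genes 2-3 forming two clusters.
X = np.array([[1.0, 1.2, 5.0, 5.1],
              [2.0, 2.1, 4.0, 4.2],
              [3.0, 3.3, 3.0, 2.9],
              [4.0, 3.9, 2.0, 2.2],
              [5.0, 5.1, 1.0, 0.8]])
gene_labels = np.array([0, 0, 1, 1])

pc_features, loadings = pca_scores(X, n_components=2)
centroid_features = cluster_average_features(X, gene_labels)
```

The cluster averages are plain means of named genes, so they are straightforward to measure and to interpret. Each principal component mixes all genes through its loadings, which is one reason the simpler composites may be preferred when both predict comparably well.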
