Molecular classification of kidney tumors

For analysis, we chose a broad range of classification approaches to identify and address potential dependencies between the classification approach and its prediction ability (see also Chapter 19). To this end, we selected three frequently used algorithms: prediction analysis for microarrays (PAM) (21), support vector machines (SVM) (22), and random forest (23). The rationale of a classification is as follows: A set of microarray data (the training set) is divided into two or more classes (here, the three RCC types). The goal is to build a classifier, that is, a method that is able to predict the classes in an independent data set. Ideally, the performance of the classifier is evaluated by testing its prediction ability on an independent test set for which the classes are also known. In practice, however, the number of available samples is usually too small for this, so cross-validation is performed instead: the samples are randomly divided into equally sized subsets. In each step, one subset is set aside, the classifier is built on the remaining samples, and the classes of the left-out samples are predicted and compared with the actual classes. In our data set, the prediction ability was assessed by 10-fold cross-validation, that is, in each step 90% of the samples were used as a training set and 10% as a test set. The whole procedure was repeated 20 times.
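
The repeated cross-validation scheme described above can be sketched as follows. This is a minimal illustration in plain Python, not the analysis code used for this study: the nearest-centroid classifier is only a simple stand-in for PAM, SVM, or random forest, and all function names, the class labels "A", "B", "C", and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def kfold_indices(n_samples, k, rng):
    """Shuffle sample indices and deal them into k folds of near-equal size."""
    order = list(range(n_samples))
    rng.shuffle(order)
    return [order[i::k] for i in range(k)]


def fit_centroids(X, y):
    """Stand-in classifier: mean expression profile (centroid) per class."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}


def predict(centroids, row):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(row, centroids[label]))
    return min(centroids, key=dist)


def repeated_cv(X, y, k=10, repeats=20, seed=0):
    """Per sample, the fraction of repetitions in which it was predicted correctly."""
    rng = random.Random(seed)
    correct = [0] * len(X)
    for _ in range(repeats):
        # Each repetition uses a fresh random partition into k folds.
        for test_idx in kfold_indices(len(X), k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(X)) if i not in held_out]
            model = fit_centroids([X[i] for i in train_idx],
                                  [y[i] for i in train_idx])
            for i in test_idx:
                correct[i] += predict(model, X[i]) == y[i]
    return [c / repeats for c in correct]


# Toy data (hypothetical): 3 well-separated classes, 12 samples each, 2 "genes".
toy_rng = random.Random(1)
centers = {"A": (0.0, 0.0), "B": (5.0, 0.0), "C": (0.0, 5.0)}
X, y = [], []
for label, (cx, cy) in centers.items():
    for _ in range(12):
        X.append([cx + toy_rng.gauss(0, 0.3), cy + toy_rng.gauss(0, 0.3)])
        y.append(label)

per_sample = repeated_cv(X, y, k=10, repeats=20)
```

Because every sample is left out exactly once per repetition, `per_sample` gives the share of the 20 repetitions in which each sample was classified correctly, which is the kind of per-sample summary reported in the results.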

Microarrays can detect small fold changes in pairwise comparisons. However, small fold changes are often not useful for diagnostic routines. Hence, we applied a filter to select genes whose expression levels changed considerably between two tumor types. To avoid overfitting, the gene filter was applied in each cross-validation step to the training set only. For this purpose, we calculated the mean expression value of every gene/EST for every tumor type and selected all genes whose group means differed by a fold change of >2 in any pairwise group comparison. Depending on the samples in the training set, approximately 440 of the 4207 genes on the microarrays fulfilled this criterion.
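
The fold-change filter can be illustrated with a short sketch. It assumes positive expression values on a linear scale (for log2-transformed data, the same criterion would be a difference of group means greater than 1); the function names and the toy values are hypothetical and are not taken from the study. Within cross-validation, the function would be called on the training samples of each step only.

```python
from itertools import combinations


def group_means(X, y):
    """Mean expression of every gene within every class (tumor type)."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(row)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def fold_change_filter(X_train, y_train, threshold=2.0):
    """Indices of genes whose group means differ by more than `threshold`-fold
    in at least one pairwise class comparison (linear-scale values assumed)."""
    means = group_means(X_train, y_train)
    keep = []
    for g in range(len(X_train[0])):
        for a, b in combinations(means, 2):
            hi = max(means[a][g], means[b][g])
            lo = min(means[a][g], means[b][g])
            if hi > threshold * lo:  # avoids dividing by a zero mean
                keep.append(g)
                break
    return keep


# Toy training set (hypothetical): 6 samples, 3 genes, 3 classes.
X_train = [[10, 5, 8], [12, 5, 8], [30, 5, 8], [28, 5, 8], [11, 5, 3], [9, 5, 3]]
y_train = ["A", "A", "B", "B", "C", "C"]
selected = fold_change_filter(X_train, y_train)  # genes 0 and 2 pass the filter
```

Selecting genes inside each cross-validation step, rather than once on the full data set, keeps the left-out samples from influencing which genes the classifier sees.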

The microarray data of 35 kidney tumor samples were used for the molecular classification of the tumor subgroups. All classification methods (21-23) showed a high prediction ability and gave similar results: Each method correctly classified at least 32 of the 35 samples, 31 of them in at least 95% and one in at least 80% of the repetitions (not shown). One ccRCC sample (no. 14) was consistently misclassified by all methods; two other samples were misclassified by random forest and SVM in at least 60% of the repetitions (not shown). In PAM (Plate 1), only sample 14 was "incorrectly" classified, whereas the classification of all other samples corresponded to the histopathological diagnosis. Thus, the misclassification rate for PAM was 1 in 35, that is, less than 3%. The tumor samples that did not match the histopathological classification were reanalyzed by pathology. The initial diagnosis was confirmed, suggesting that there was no error in the clinical diagnosis of these tumors.
