Twin Curses Plague Biomedical Data Analysis
How to deal with too many dimensions and too few samples.
Noninvasive experimental techniques, such as magnetic resonance (MR), infrared, Raman and fluorescence spectroscopy, and more recently, mass spectrometry (proteomics) and microarrays (genomics), have helped us better understand, diagnose and treat disease. These methods create huge numbers of features, on the order of 1,000 to 10,000, resulting in Bellman’s curse of dimensionality: too many features (i.e., dimensions). However, clinical reality frequently limits the number of available samples to the order of 10 to 100. This leads to the curse of dataset sparsity: too few samples. Thus, on the one hand, we have a wealth of information available for data analysis; on the other hand, statistically meaningful analysis is hampered by sample scarcity.
Robust, reliable data classification (e.g., distinguishing between diseased and healthy conditions) requires a sample-to-feature ratio on the order of 5 to 10, instead of the initial 1/10 to 1/1000. What can be done?
To lift the curse of dimensionality and reduce the number of features to a manageable size, we use feature extraction/selection (FES). FES reduces dimensionality by identifying and eliminating redundant or irrelevant information. For microarray data this is accomplished by first identifying groups of correlated genes and defining group averages as new features. For spectra, neighboring features are strongly correlated, and therefore the majority of features are redundant. In addition, many features are “noise,” or are irrelevant for the desired classification. Eliminating these yields a much lower-dimensional feature space that suffices for accurate spectral characterization. To identify the spectral features to be eliminated, we have developed an algorithm that selects the optimal sub-regions most relevant for accurate classification. Averaging adjacent spectral intensities leads to further reduction, while retaining spectral identity, which is important for interpretability of the resulting features (e.g., MR peaks, essentially averages of adjacent spectral intensities, are manifestations of the presence of specific chemical compounds).
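The two-step reduction described above can be sketched in a few lines. This is only an illustrative outline, not the actual algorithm: the sub-region boundaries and bin width below are invented for the example, whereas in practice the algorithm searches for the optimal sub-regions.

```python
import numpy as np

# Illustrative sketch of spectral feature reduction: (1) keep only selected
# sub-regions assumed relevant for classification, then (2) average adjacent
# intensities within each region. Region boundaries and bin width are made up.
rng = np.random.default_rng(0)
spectrum = rng.random(1500)                     # one sample's spectral intensities

regions = [(100, 160), (400, 440), (900, 960)]  # hypothetical "optimal sub-regions"
bin_width = 4                                   # average this many adjacent channels

features = []
for lo, hi in regions:
    sub = spectrum[lo:hi]
    # average each block of bin_width adjacent intensities (retains peak identity)
    features.extend(sub.reshape(-1, bin_width).mean(axis=1))

features = np.array(features)
print(len(features))   # 1,500 channels reduced to 40 features
```

Because each new feature is an average of neighboring intensities from a named spectral region, it can still be traced back to specific peaks, and hence to specific chemical compounds.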
Dataset sparsity has more subtle consequences, and lifting this curse is more problematic. The ideal solution—acquiring more samples—is frequently too expensive or even impracticable. Yet, limited sample size may create classifiers that give overoptimistic accuracies, even after feature space reduction. Robust classifier creation requires enough samples to meaningfully partition the data into training, validation and independent test sets. The training set is used for both FES and optimal classifier development. The validation set helps prevent the classifier from adapting to the peculiarities of a finite training set (overfitting) by monitoring the progress of the FES/classifier. The independent test set is used for external cross-validation, but only after completion of the FES and identification of the final classifier. With small datasets, even partitioning into training and test sets is statistically suspect, and k-fold cross-validation is used: the dataset is split into k equal parts (typically k = 5 to 10), trained on k − 1 parts and tested on the remaining portion. One then cycles through k times and averages the test results. For small sample sizes, the variance of the averaged test accuracies tends to be unacceptably large, while overtraining is still a threat.
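The k-fold procedure can be sketched as follows. A simple nearest-centroid rule stands in for the real FES/classifier pipeline, and the data are synthetic; both are assumptions made only so the example is self-contained.

```python
import numpy as np

# Minimal k-fold cross-validation sketch (k = 5): split the data into k parts,
# train on k - 1 parts, test on the held-out part, cycle k times, and average.
rng = np.random.default_rng(1)
n, d, k = 60, 8, 5
X = np.vstack([rng.normal(0.0, 1.0, (n // 2, d)),   # synthetic "healthy" class
               rng.normal(1.5, 1.0, (n // 2, d))])  # synthetic "disease" class
y = np.repeat([0, 1], n // 2)

idx = rng.permutation(n)
folds = np.array_split(idx, k)

accs = []
for i in range(k):
    test = folds[i]
    train = np.concatenate([folds[j] for j in range(k) if j != i])
    # "training": class centroids computed from the k - 1 training folds
    c0 = X[train][y[train] == 0].mean(axis=0)
    c1 = X[train][y[train] == 1].mean(axis=0)
    # "testing": assign each held-out sample to the nearer centroid
    pred = (np.linalg.norm(X[test] - c1, axis=1)
            < np.linalg.norm(X[test] - c0, axis=1)).astype(int)
    accs.append((pred == y[test]).mean())

print(np.mean(accs), np.std(accs))  # averaged test accuracy and its spread
```

The standard deviation printed alongside the mean is exactly the quantity that becomes unacceptably large for small samples: with few items per fold, individual fold accuracies fluctuate wildly.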
For highly imbalanced classes (e.g., rare disease vs. healthy), overall classification accuracy can be misleading. For example, consider 90 samples in the healthy class, but only 10 in the disease class. Misclassifying all 10 still gives 90 percent overall accuracy. Hence, balanced sensitivity and specificity (i.e., comparable accuracies for both classes) is more appropriate, and can be achieved by undersampling the majority class, oversampling the minority class, or by assigning different misclassification costs to the two classes.
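The 90/10 example above is easily made concrete: a degenerate classifier that labels every sample "healthy" scores 90 percent overall accuracy while detecting no disease at all, and only the per-class accuracies expose this.

```python
import numpy as np

# 90 healthy samples, 10 disease samples; the classifier always says "healthy".
y_true = np.array([0] * 90 + [1] * 10)   # 0 = healthy, 1 = disease
y_pred = np.zeros(100, dtype=int)        # degenerate classifier

accuracy = (y_pred == y_true).mean()                 # 0.90 -- looks good
sensitivity = (y_pred[y_true == 1] == 1).mean()      # 0.0  -- disease class accuracy
specificity = (y_pred[y_true == 0] == 0).mean()      # 1.0  -- healthy class accuracy
balanced = 0.5 * (sensitivity + specificity)         # 0.5  -- exposes the failure
print(accuracy, sensitivity, specificity, balanced)
```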
For each sample, we compute class probabilities. This is relevant clinically (e.g., additional tests would be suggested if a classifier assigned a patient to the disease class with 55 percent probability; immediate treatment would commence if this probability were 90 percent).
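Such probability-based recommendations amount to a simple thresholded decision rule. The cutoffs below are illustrative assumptions, not values from our work:

```python
# Hypothetical decision rule driven by the classifier's disease-class
# probability; the two thresholds are chosen for illustration only.
def recommend(p_disease):
    if p_disease >= 0.90:
        return "immediate treatment"
    elif p_disease >= 0.50:
        return "additional tests"
    return "routine follow-up"

print(recommend(0.55))  # additional tests
print(recommend(0.90))  # immediate treatment
```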
In the biomedical field, the twin curses are generally active. Both must be dealt with in concert; otherwise, overly optimistic and frequently wrong conclusions will result.
Ray Somorjai is Head of Biomedical Informatics at the Institute for Biodiagnostics, National Research Council Canada. The three major thrusts in his Group are supervised classification, with special emphasis on handling the peculiarities of biomedical data, unsupervised classification (e.g., EvIdent, a powerful fuzzy clustering software) and the mathematical modeling of the spread of infectious diseases (AIDS, SARS, etc.).