Often, particularly in health care, datasets have a large number of features and only a small number of samples. Having too many features is referred to as "the curse of dimensionality": the more features (dimensions) a model works in, the larger the volume of its feature space, and the more widely spread out (sparse) the observations become. Because supervised machine learning is largely about finding patterns in observed data, when the data points are spread too thinly over a large feature space it is harder to pull out important patterns.
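A small simulation makes this concrete. The snippet below is a minimal sketch (not part of the analysis that follows): it draws a fixed number of random points in a unit hypercube and measures the average distance from each point to its nearest neighbour. As the number of dimensions grows, that distance grows too, meaning the same amount of data covers the space more and more thinly.

# illustrate sparsity: average nearest-neighbour distance vs. dimension
set.seed(42)
avg_nn_dist <- function(d, n = 100) {
  x <- matrix(runif(n * d), ncol = d)  # n random points in a d-dimensional unit cube
  dists <- as.matrix(dist(x))          # pairwise Euclidean distances
  diag(dists) <- Inf                   # ignore each point's distance to itself
  mean(apply(dists, 1, min))           # mean distance to the nearest neighbour
}
sapply(c(2, 5, 10, 30), avg_nn_dist)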
As a result, a number of methods exist to reduce the dimensionality of a dataset. One approach is initial univariate analysis, i.e. retaining only those features that are individually associated with the response (a sketch of this follows below). Another is to combine features such that the number of final combinations is smaller than the number of initial features, but the combinations retain most of the information present in the original features. An example of this is principal component analysis (PCA).
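As an illustration of the univariate approach, the sketch below defines a hypothetical helper, screen_features, where X is assumed to be a numeric feature matrix and y a two-level outcome; it keeps only the features whose individual t-test against the outcome is significant.

# univariate screening: keep features individually associated with the outcome
screen_features <- function(X, y, alpha = 0.05) {
  p_values <- apply(X, 2, function(feature) t.test(feature ~ y)$p.value)
  X[, p_values < alpha, drop = FALSE]  # retain only the significant features
}

In practice the p-values should also be corrected for multiple testing (e.g. with p.adjust) before thresholding.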
# read in dataset
dataset <- read.csv('data.csv')
dataset <- dataset[, -ncol(dataset)]  # drop the last column
# the first two columns (id and diagnosis) are not features
print(paste("This dataset has", ncol(dataset) - 2, "features"))
 "This dataset has 30 features"
# find principal components
pca <- prcomp(dataset[, -c(1, 2)], center = TRUE, scale. = TRUE)
print(pca)
# plot amount of variance explained by each PC
plot(pca, type = "l")
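The same information can be read off numerically. summary(pca) reports it directly, or the proportion of variance explained can be computed from the standard deviations that prcomp returns:

# proportion (and cumulative proportion) of variance explained
pve <- pca$sdev^2 / sum(pca$sdev^2)
round(pve, 3)
round(cumsum(pve), 3)  # cumulative proportion helps decide how many PCs to keep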
PCA is also very useful for visualization. The human brain cannot visualize a 30-dimensional feature space, but by reducing the feature space to the first two principal components we can project the data down onto a 2-D plane.
library(ggplot2)

# create plot data
plot_data <- as.data.frame(pca$x)
plot_data$group <- dataset$diagnosis

# plot principal components
p <- ggplot(plot_data, aes(x = PC1, y = PC2, color = group))
p <- p + geom_point()
print(p)
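Under the hood this projection is just a matrix multiplication: the scores in pca$x are the centred and scaled data multiplied by the loading matrix pca$rotation. A quick sanity check, assuming the same dataset object as above:

# verify that the plotted coordinates are the scaled data times the loadings
X_scaled <- scale(dataset[, -c(1, 2)], center = TRUE, scale = TRUE)
proj <- X_scaled %*% pca$rotation[, 1:2]
all.equal(unname(proj), unname(pca$x[, 1:2]))  # should print TRUE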