
Principal Component Analysis:

Often, particularly in health care, datasets have a large number of features and only a small number of samples. Having too many features is referred to as "the curse of dimensionality": the more features (dimensions) a model works in, the larger the volume of its feature space, and the more widely spread out (sparse) the observations become. Because supervised machine learning is largely about finding patterns in observed data, when data points are spread too thin over a large feature space it is harder to pull out important patterns.

As a result, a number of methods exist to reduce the dimensionality of a dataset. One approach is initial univariate analysis, i.e. retaining only those features that by themselves are associated with the response. Another approach is to combine features in such a way that the number of final combinations is smaller than the number of initial features, while the combinations retain most of the information that was present in the original features. An example of this is principal component analysis (PCA).
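As a quick illustration of this idea (using R's built-in `iris` data, not the dataset analysed below), `prcomp()` can compress four correlated measurements into a couple of components that retain most of the variance:

```r
# PCA on the four numeric columns of the built-in iris data
pca_iris <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# proportion of total variance captured by each component
var_explained <- pca_iris$sdev^2 / sum(pca_iris$sdev^2)
round(cumsum(var_explained), 3)
# the first two components retain roughly 96% of the total variance,
# so four features can be summarised by two with little information loss
```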

library(ggplot2)

# read in dataset (file name is a placeholder; substitute your own CSV)
dataset <- read.csv("dataset.csv")

# drop the last (unused) column
dataset <- dataset[,-ncol(dataset)]
print(paste("This dataset has", ncol(dataset)-2, "features"))

[1] "This dataset has 30 features"

# find principal components
pca <- prcomp(dataset[,-c(1,2)], center = TRUE, scale. = TRUE)
print(pca)

# plot amount of variance explained by each PC

# create plot data
plot_data <- as.data.frame(pca$x)
plot_data$group <- dataset$diagnosis
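The plotting code is cut off here; a sketch of how it might continue with ggplot2, assuming the `pca` and `plot_data` objects created above (a scree plot of the variance explained by each component, then a scatter of the first two components coloured by diagnosis):

```r
# scree plot: proportion of variance explained by each PC
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
scree_data <- data.frame(PC = seq_along(var_explained),
                         variance = var_explained)
ggplot(scree_data, aes(x = PC, y = variance)) +
  geom_col() +
  labs(x = "Principal component", y = "Proportion of variance explained")

# scatter of the first two PCs, coloured by diagnosis
ggplot(plot_data, aes(x = PC1, y = PC2, colour = group)) +
  geom_point() +
  labs(colour = "Diagnosis")
```

If the first few components capture most of the variance, the scree plot will drop off sharply, and the PC1/PC2 scatter often already separates the diagnosis groups reasonably well.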