Principal Component Analysis
Names: Rob Sanchez & Austin Martin
Due: Wednesday, March 20
We are going to look at a classic data set consisting of physical measurements of 150 irises. There are three species of irises in this set—setosa, versicolor, and virginica—and there are 50 samples of each species. Each sample has four measurements, all of which are in centimeters:
sepal length
sepal width
petal length
petal width
In the image below, the sepals are labeled as falls and the petals as standards.
The basic problem is to use the four physical measurements to predict which species a given sample belongs to. This is a standard data set that is used for testing machine learning techniques.
Since we have 150 samples, each of which has 4 measurements, we are looking at 150 data points in $\mathbb{R}^4$. That makes the data difficult to visualize. Of course, we could look at scatter plots formed by considering just two measurements at a time, but we'd like to find the best two-dimensional picture of the data. Principal component analysis is the right tool for doing that.
Evaluate the cell below to read in the data set. In addition, you will have two familiar functions: findmean(data), which returns the mean of the data, and demean(data), which returns the de-meaned data matrix. Remember that you have two other useful functions: B.matrix_from_columns( list ) and B.matrix_from_rows( list ).
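For readers working outside the notebook, here is a minimal NumPy sketch of what the two helpers might do. It assumes each column of the data matrix is one sample; the notebook's own findmean and demean may be implemented differently, and the small data array is made up for illustration.

```python
import numpy as np

def findmean(data):
    """Mean of the samples, assuming each COLUMN of `data` is one sample."""
    return data.mean(axis=1)

def demean(data):
    """Subtract the mean from every column so the data is centered at the origin."""
    return data - findmean(data)[:, np.newaxis]

# Made-up data: 2 measurements (rows) for 3 samples (columns).
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 6.0, 8.0]])

mean = findmean(data)   # one mean per measurement
A = demean(data)        # each row of A now sums to zero
print(mean)
print(A)
```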
What is the average petal length in centimeters?
3.76 cm, the 3rd item in the list below.
Construct the covariance matrix and display it below.
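As a rough sketch of this step in NumPy: if the de-meaned data points are the columns of a matrix A, the covariance matrix can be formed as A times its transpose divided by the number of samples N. Dividing by N rather than N - 1 is an assumption here; use whichever convention the course uses. The matrix A below is made up for illustration.

```python
import numpy as np

def covariance(A):
    """Covariance matrix of de-meaned data whose columns are the samples.

    Assumes the 1/N convention (not 1/(N-1)).
    """
    N = A.shape[1]
    return (A @ A.T) / N

# Made-up de-meaned data: 2 measurements (rows), 3 samples (columns).
A = np.array([[-1.0, 0.0, 1.0],
              [-2.0, 0.0, 2.0]])

C = covariance(A)
print(C)   # a symmetric 2 x 2 matrix
```

The diagonal entries are the variances of the individual measurements, and the off-diagonal entries are the covariances between pairs of measurements.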
Find the eigenvalues of the covariance matrix.
{4.197, 0.241, 0.078, 0.024}
For what percentage of the total variance do the first two eigenvalues account?
The first two eigenvalues account for about 97.8% of the total variance: (4.197 + 0.241) / (4.197 + 0.241 + 0.078 + 0.024) = 4.438 / 4.540 ≈ 0.978.
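The arithmetic can be checked with a few lines of plain Python, using the four eigenvalues reported above. The total variance is the sum of the eigenvalues, and each eigenvalue's share is its fraction of that sum.

```python
# Eigenvalues of the covariance matrix, as reported above.
eigenvalues = [4.197, 0.241, 0.078, 0.024]

total = sum(eigenvalues)                       # total variance
fractions = [ev / total for ev in eigenvalues]
first_two = fractions[0] + fractions[1]

print(f"first: {fractions[0]:.1%}")            # first: 92.4%
print(f"second: {fractions[1]:.1%}")           # second: 5.3%
print(f"first two: {first_two:.1%}")           # first two: 97.8%
```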
Find matrices $Q$ and $D$ that orthogonally diagonalize the covariance matrix $C$, so that $C = QDQ^T$.
Verify that the columns of $Q$ are orthonormal.
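Outside of Sage, one way to sketch this step is with NumPy's eigh routine for symmetric matrices. The 2 x 2 matrix below is a made-up stand-in for the covariance matrix, and the names Q and D are chosen for illustration. Since eigh returns eigenvalues in ascending order, the sketch reorders them so the largest comes first.

```python
import numpy as np

# Made-up symmetric matrix standing in for the covariance matrix.
C = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is intended for symmetric matrices; eigenvalues come back ascending.
eigenvalues, Q = np.linalg.eigh(C)

# Reorder so the largest eigenvalue (and its eigenvector) comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
Q = Q[:, order]
D = np.diag(eigenvalues)

# Columns of Q are orthonormal exactly when Q^T Q is the identity.
columns_orthonormal = np.allclose(Q.T @ Q, np.eye(2))
# Orthogonal diagonalization: C = Q D Q^T.
reconstructs_C = np.allclose(Q @ D @ Q.T, C)
print(columns_orthonormal, reconstructs_C)
```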
Suppose that we would like to create a two-dimensional plot of the de-meaned data set by projecting the data onto the two-dimensional subspace formed by eigenvectors of $C$ corresponding to the two largest eigenvalues. That is, if $\mathbf{u}_1$ and $\mathbf{u}_2$ are the eigenvectors, we would like to represent a de-meaned data point $\mathbf{x}$ as $(c_1, c_2)$ where the projection of $\mathbf{x}$ onto this subspace is $c_1\mathbf{u}_1 + c_2\mathbf{u}_2$. Find the matrix $P$ such that $P\mathbf{x} = (c_1, c_2)$.
The product $PA$, where $A$ is the de-meaned data matrix and $P$ is the projection matrix from the previous step, will give a matrix whose columns consist of the de-meaned data points projected onto the plane. Construct this product and use the plot2d function to display these projected points. The red points are samples from the setosa species, green are versicolor, and blue are virginica.
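A minimal NumPy sketch of the projection step follows. It uses randomly generated stand-in data rather than the iris measurements, and it assumes the samples are the columns of the data matrix. Because the eigenvectors are orthonormal, the matrix that sends a de-meaned point to its two coordinates is simply the transpose of the matrix holding the top two eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-in data: 4 measurements (rows) for 150 samples (columns),
# scaled so that the measurements have very different spreads.
data = rng.normal(size=(4, 150)) * np.array([[2.0], [0.5], [0.3], [0.1]])

A = data - data.mean(axis=1, keepdims=True)   # de-meaned data
N = A.shape[1]
C = (A @ A.T) / N                             # covariance (1/N convention assumed)

eigenvalues, Q = np.linalg.eigh(C)            # ascending order
order = np.argsort(eigenvalues)[::-1]         # largest first
eigenvalues, Q = eigenvalues[order], Q[:, order]

P = Q[:, :2].T          # 2 x 4: rows are the top two eigenvectors
projected = P @ A       # 2 x 150: coordinates of each sample in the plane

# The variance of each row of the projected data equals the matching eigenvalue.
variances = (projected ** 2).mean(axis=1)
print(variances, eigenvalues[:2])
```

The two rows of the projected matrix are the horizontal and vertical coordinates to pass to a plotting function. Their variances match the two largest eigenvalues, which is why the spread along the first axis is at least as large as the spread along the second.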
Explain why this plot is wider than it is tall:
The plot is wider than it is tall because the variance along the first principal component (the largest eigenvalue, 4.197) accounts for about 92.4% of the total variance in the data set, while the variance along the second component (0.241) accounts for only about 5.3%. In this plot, the first component is on the horizontal axis and the second component is on the vertical axis. Since the variance of the first component is about 17 times that of the second, the points spread roughly four times as far horizontally as vertically (comparing standard deviations). Thus, the plot is wider than it is tall.
Suppose that you discover a new sample but that you only know two measurements: the sepal length is 5.65 cm and the sepal width is 2.75 cm. In this case, you don't know some of the data for this sample. However, let's make the reasonable assumption that the de-meaned data point lies in the two-dimensional subspace spanned by the eigenvectors for the two largest eigenvalues. Find the coordinates of this sample in that subspace. You will probably need to think about this task for a little bit to determine a linear system for the coordinates.
Estimate the other two measurements, the petal length and the petal width:
petal length: 3.99 cm
petal width: 1.29 cm
To which species does this sample most likely belong:
Versicolor (green cluster)
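One way to set up the linear system is sketched below in NumPy. If the de-meaned point is assumed to equal a combination of the two eigenvectors, then the rows of that equation for the known measurements give two equations in the two unknown coordinates. The function name estimate_missing and the numbers in the check are hypothetical and are not the iris values; with the real data, pass in the mean vector and the two leading eigenvectors found earlier.

```python
import numpy as np

def estimate_missing(mean, U, known_idx, known_values):
    """Estimate a full sample, assuming its de-meaned version lies in the
    column space of U.

    mean         : length-n mean vector of the data
    U            : n x k matrix whose columns span the subspace
    known_idx    : indices of the k known measurements
    known_values : the known measurements, in the same order
    """
    mean = np.asarray(mean, dtype=float)
    known_idx = list(known_idx)
    # De-mean the known entries, then solve the k x k system for the coordinates.
    rhs = np.asarray(known_values, dtype=float) - mean[known_idx]
    coords = np.linalg.solve(U[known_idx, :], rhs)
    # Rebuild the full sample from the coordinates and add the mean back.
    estimate = mean + U @ coords
    return coords, estimate

# Made-up check: build a point that really lies in the subspace,
# hide its last two entries, and recover them.
mean = np.array([5.0, 3.0, 4.0, 1.0])
U, _ = np.linalg.qr(np.array([[1.0, 0.0],
                              [0.0, 1.0],
                              [2.0, 1.0],
                              [1.0, -1.0]]))   # 4 x 2 with orthonormal columns
true_coords = np.array([0.5, -0.25])
full_sample = mean + U @ true_coords

coords, estimate = estimate_missing(mean, U, [0, 1], full_sample[:2])
print(coords)
print(estimate)
```

The same function handles the case where the petal measurements are known instead by passing indices 2 and 3. The 2 x 2 system can be close to singular for some choices of known measurements, in which case the estimates are sensitive to small changes in the inputs.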
Suppose you find another sample whose petal length is 1.5 cm and whose petal width is 0.25 cm. Estimate the other two measurements.
sepal length: 6.55 cm
sepal width: 4.90 cm
To which species does this sample most likely belong:
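Setosa (red cluster). A petal length of 1.5 cm and a petal width of 0.25 cm are typical of setosa, whose petals are much smaller than those of versicolor and virginica.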