Getting Started
This notebook shows you how to run Principal Component Analysis (PCA) using Python and Numpy, and how to graph your results using Matplotlib. Each code box demonstrates one step. To run a step, click on that code box and then click the Play button above (just left of the STOP button).
Step 1: Initial Setup
We first need to import Numpy and Matplotlib, and define a few convenience functions to make PCA easier. Click on the following code box, and click the Play button above; you should then see a simple graph:
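The notebook's actual setup cell isn't reproduced here, so the following is a minimal sketch of what it might contain; the `plot_data` helper name is an assumption, not necessarily the notebook's real function:

```python
# Minimal setup sketch (the plot_data helper name is hypothetical)
import numpy as np
import matplotlib
matplotlib.use("Agg")   # headless backend; unnecessary inside a notebook
import matplotlib.pyplot as plt

def plot_data(data, style="o", label=None):
    """Plot a 2 x N data array: row 0 on the x-axis, row 1 on the y-axis."""
    plt.plot(data[0], data[1], style, label=label)

# A simple test graph: three points
plot_data(np.array([[0., 1., 2.], [0., 1., 4.]]))
# plt.show()   # in the notebook, the graph appears inline
```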
Step 2: Human Height vs. Weight Dataset
The following code box shows how to enter a simple dataset of (height, weight) pairs, and how to convert it from row format to column format (i.e. each datapoint is represented by one column of the data array), which is convenient for numpy calculations and graphing. Click on the code box and then click the Play button above to graph these data:
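The original data cell isn't shown here; the sketch below uses made-up (height, weight) values (inches and pounds are assumed units) to illustrate the row-to-column conversion:

```python
import numpy as np

# Hypothetical (height, weight) pairs, one datapoint per ROW (made-up values)
hw_rows = np.array([
    [62., 120.], [64., 135.], [66., 140.], [68., 155.],
    [70., 160.], [72., 175.], [74., 190.], [76., 200.],
])
# Transpose to column format: each datapoint becomes one COLUMN
hw = hw_rows.T
print(hw.shape)   # (2, 8): row 0 = heights, row 1 = weights
```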
Step 3: What is a Rotation Matrix?
A rotation matrix represents a rotated coordinate system as a table whose columns are the coordinates of its axes. The following code example shows how to get a 2D rotation matrix for a 30 degree (counter-clockwise) rotation, print it, and graph its axes (column vectors):
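A sketch of what that code cell might look like, using the standard 2D rotation-matrix formula:

```python
import numpy as np

theta = np.radians(30.)   # 30 degrees, counter-clockwise
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(R)
# Each COLUMN of R is one rotated axis: R[:, 0] is the rotated x-axis,
# R[:, 1] is the rotated y-axis.
```

A rotation matrix is orthogonal (its columns are perpendicular unit vectors), which is why its inverse is simply its transpose.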
Step 4: Rotating a Dataset
Multiplying a dataset by a rotation matrix has the effect of rotating it by that amount (around the origin (0,0)). Here we rotate the dataset (centered at (0,0)) by our rotation matrix and graph the result:
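A sketch of this step, reusing the hypothetical dataset values from above (the data must be centered at the origin before rotating):

```python
import numpy as np

theta = np.radians(30.)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Hypothetical 2 x N (height, weight) data in column format
hw = np.array([[62., 64., 66., 68., 70., 72., 74., 76.],
               [120., 135., 140., 155., 160., 175., 190., 200.]])
centered = hw - hw.mean(axis=1, keepdims=True)   # center at (0, 0)
rotated = R @ centered   # one matrix product rotates every column at once
```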
Step 5: What is a Covariance Matrix?
A covariance matrix is a square table whose entries (i,j) show the covariance of coordinate i vs. coordinate j, and whose diagonal entries (i,i) show the variance of coordinate i. A positive covariance implies a positive correlation between coordinates i and j. It can be computed on a dataset using the numpy.cov() function:
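For example, on the hypothetical height-weight data used above:

```python
import numpy as np

hw = np.array([[62., 64., 66., 68., 70., 72., 74., 76.],
               [120., 135., 140., 155., 160., 175., 190., 200.]])
C = np.cov(hw)   # in column format, each row of hw is one variable
print(C)
# C[0, 0] = variance of height, C[1, 1] = variance of weight,
# C[0, 1] = C[1, 0] = covariance of height vs. weight
```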
Step 6: Find Rotated Axes with Zero Covariance
For this simple two-dimensional dataset, it's easy to search all possible rotations (around the center of the data) for its "natural" coordinate system where the covariance goes to zero. For each possible rotation, we calculate the rotated data's covariance matrix C, whose element C[0,0] is the variance of X', C[1,1] is the variance of Y', and C[0,1] is the covariance of X', Y':
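The notebook's search cell isn't shown; a brute-force sketch of the idea (grid over all rotation angles, keeping the one with the smallest covariance magnitude) might look like this:

```python
import numpy as np

def rotation(theta):
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

# Hypothetical height-weight data, centered at the origin
hw = np.array([[62., 64., 66., 68., 70., 72., 74., 76.],
               [120., 135., 140., 155., 160., 175., 190., 200.]])
centered = hw - hw.mean(axis=1, keepdims=True)

best_theta, best_cov = 0., np.inf
for theta in np.linspace(0., np.pi, 1000):
    C = np.cov(rotation(theta) @ centered)
    if abs(C[0, 1]) < abs(best_cov):   # keep the rotation with least covariance
        best_theta, best_cov = theta, C[0, 1]
print(np.degrees(best_theta), best_cov)
```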
Step 7: General PCA Method for Finding the Principal Component Axes
We can find the principal component axes for any dataset, no matter how many dimensions, in a single step, by simply computing the so-called eigenvectors of the covariance matrix. Numpy makes this easy. The following example displays the PCA axes superimposed on the dataset; intuitively, you can see they provide the "natural coordinate system" that best fits the dataset:
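A sketch of this step; `numpy.linalg.eigh()` is appropriate here because a covariance matrix is always symmetric:

```python
import numpy as np

hw = np.array([[62., 64., 66., 68., 70., 72., 74., 76.],
               [120., 135., 140., 155., 160., 175., 190., 200.]])
C = np.cov(hw)
# eigh() returns the eigenvalues (the variances along each PCA axis)
# and the eigenvectors (the PCA axes themselves, one per COLUMN)
eigvals, pca_axes = np.linalg.eigh(C)
print(eigvals)
print(pca_axes)
```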
Step 8: Reversing the PCA Rotation
Our pca_axes matrix transforms points in PCA coordinates to "real-world" coordinates (in this case, height and weight). For the opposite conversion from real-world coordinates to PCA coordinates, we need the reverse rotation, which is called the inverse of the rotation matrix. The following code example shows how to convert our height-weight dataset to PCA coordinates, graph them, and assess whether PCA actually eliminated all covariance (note that it is printed in scientific notation where "e-14" means "times 10 to the -14th power"):
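A sketch of this conversion on the hypothetical data; note that for a rotation matrix the inverse is simply the transpose, so `np.linalg.inv(pca_axes)` and `pca_axes.T` give the same result:

```python
import numpy as np

hw = np.array([[62., 64., 66., 68., 70., 72., 74., 76.],
               [120., 135., 140., 155., 160., 175., 190., 200.]])
centered = hw - hw.mean(axis=1, keepdims=True)
eigvals, pca_axes = np.linalg.eigh(np.cov(hw))

# Inverse rotation: real-world coordinates -> PCA coordinates
hw_pca = np.linalg.inv(pca_axes) @ centered   # same as pca_axes.T @ centered
C_pca = np.cov(hw_pca)
print(C_pca)   # off-diagonal entries are ~0, printed in scientific notation
```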
Step 9: Sorting the PCA axes by Importance
As you can see in the example above, some PCA axes capture more of the inherent variance in the data than others. We therefore sort the PCA axes from largest to smallest variance, as follows:
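A sketch of the sort, using `np.argsort` on the eigenvalues (note that `eigh()` returns them in ascending order, so we reverse it):

```python
import numpy as np

hw = np.array([[62., 64., 66., 68., 70., 72., 74., 76.],
               [120., 135., 140., 155., 160., 175., 190., 200.]])
eigvals, pca_axes = np.linalg.eigh(np.cov(hw))

order = np.argsort(eigvals)[::-1]   # indices from largest to smallest variance
eigvals = eigvals[order]
pca_axes = pca_axes[:, order]       # reorder the axis COLUMNS to match
print(eigvals)
```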
Step 10: How Well Can We Model Height vs. Weight Using Just One Variable?
The value of sorting the PCA axes this way is that it provides a natural ordering for "compressing" the data as accurately as possible. The following code example shows how accurately we can model the height vs. weight dataset using ONLY the first principal component (by just setting the second component to its mean value, zero), and then predicting both the height and weight from that value:
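A sketch of this one-component model on the hypothetical data (the `hwr` name follows the questions below, though the line layout here may differ from the original cell):

```python
import numpy as np

hw = np.array([[62., 64., 66., 68., 70., 72., 74., 76.],
               [120., 135., 140., 155., 160., 175., 190., 200.]])
mean = hw.mean(axis=1, keepdims=True)
eigvals, pca_axes = np.linalg.eigh(np.cov(hw))
order = np.argsort(eigvals)[::-1]
eigvals, pca_axes = eigvals[order], pca_axes[:, order]

hwr = pca_axes.T @ (hw - mean)   # data in (sorted) PCA coordinates
hwr[1] = 0.                      # discard the second component (its mean is zero)
hw_pred = pca_axes @ hwr + mean  # rotate back to height-weight coordinates
```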
Questions:
Why do the predictions lie on a straight line?
What do you expect to be the average error of these predictions?
How much better are these predictions than the "naive prediction" (simply predicting the mean height and weight for every person)?
What would happen if you changed the third line of code above to be "hwr[0]=0" (i.e. setting the first principal component to zero)? Test your prediction by modifying the code and rerunning it.
What would happen if you deleted the third line of code (i.e. leave both principal components unchanged)? Test your prediction by modifying the code and rerunning it.
Answers:
We used only a single dimension (first principal component) to model the data, so all its predictions will lie on a (one dimensional) line.
The average squared error of our predictions is given by the variance of the second principal component (which we always set to its mean value, zero), which was 8.03.
If we used the naive prediction, the error variance would increase by 559 (the variance of the first principal component, which the naive prediction discards).
The predictions will all lie on the second principal component axis (rather than the first), and the resulting error will be a lot higher (variance=559, instead of variance=8).
The predictions will perfectly match the original data. All we have done is apply rotation matrix R followed by its inverse, which just gives us back our original dataset unchanged.
Multidimensional Data Example: Obesity Dataset
Step 1: Accuracy of Modeling These Data by PCA?
The power of PCA is that it works the same no matter how many dimensions (variables) your dataset contains. Here's a three-dimensional dataset giving a sample of people's waist measurements (inches), weights (pounds), and body fat (percent). First let's examine how well PCA can model these data, using the same approach as we used above:
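The actual dataset cell isn't reproduced here; the sketch below uses made-up (waist, weight, %bodyfat) values, so the printed variance fractions will differ from the 94% / 99.8% figures discussed in the answers:

```python
import numpy as np

# Hypothetical (waist inches, weight pounds, %bodyfat) samples (made-up values)
data = np.array([
    [32., 150., 18.], [36., 180., 24.], [34., 165., 21.], [40., 210., 30.],
    [30., 140., 15.], [38., 195., 27.], [33., 158., 20.], [42., 225., 33.],
]).T                                  # column format: one person per column

eigvals, pca_axes = np.linalg.eigh(np.cov(data))
order = np.argsort(eigvals)[::-1]
eigvals, pca_axes = eigvals[order], pca_axes[:, order]

explained = eigvals / eigvals.sum()   # fraction of total variance per PCA axis
print(explained)
```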
Questions:
Interpret the meaning of this graph.
Interpret the meaning of the first principal component (first column of pca_axes).
Interpret the meaning of the second principal component (second column of pca_axes).
Answers:
The graph shows that the first principal component, by itself, can model all three data variables (waist, weight, bodyfat) while capturing 94% of the total variance. This indicates that all three data variables are highly correlated. Including the second principal component as well raises the accuracy to 99.8% of the variance.
The first PCA axis (first column of pca_axes) has the same sign for all three variables (waist, weight, bodyfat, in that order), indicating that the main trend in the data is that all three are positively correlated with each other. I.e. people with a bigger waist measurement also tend to have higher weight and %bodyfat.
The second PCA axis (second column of pca_axes) has the same sign for waist and bodyfat, but the opposite sign for weight. I.e. when waist and bodyfat vary independently of weight, they still correlate positively with each other (and very strongly -- only 0.2% of the total variance is left unexplained by these two PCA axes).
Step 2: Predicting Waist and Weight from %bodyfat
We can use PCA as a prediction method, by simply computing the first principal component value for each %bodyfat value, and then using it to predict the waist and weight measurements for that person. All we have to do is get the submatrix that maps the first principal component to %bodyfat. Then to convert %bodyfat to first principal component, we just invert that matrix, and apply it to our %bodyfat data:
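A sketch of this prediction step on the hypothetical data above. The "submatrix" here is the single entry of pca_axes that maps the first principal component to %bodyfat (variable index 2):

```python
import numpy as np

data = np.array([
    [32., 150., 18.], [36., 180., 24.], [34., 165., 21.], [40., 210., 30.],
    [30., 140., 15.], [38., 195., 27.], [33., 158., 20.], [42., 225., 33.],
]).T
mean = data.mean(axis=1, keepdims=True)
eigvals, pca_axes = np.linalg.eigh(np.cov(data))
order = np.argsort(eigvals)[::-1]
pca_axes = pca_axes[:, order]

# Submatrix mapping the first principal component to %bodyfat (row 2)
sub = pca_axes[2:3, 0:1]                          # 1 x 1 matrix
pc1 = np.linalg.inv(sub) @ (data[2:3] - mean[2])  # %bodyfat -> first PC value
pred = pca_axes[:, 0:1] @ pc1 + mean              # first PC -> all 3 variables
print(pred[0])   # predicted waist
print(pred[1])   # predicted weight
```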
Step 3: Predicting %bodyfat from waist, weight
The same method can be used to predict all the variables from any subset of variables. For example, waist and weight are trivial to measure (with just a measuring tape and scale), but %bodyfat is much harder to measure. So it might be helpful to be able to predict %bodyfat from waist and weight measurements. We reuse exactly the same procedure as above:
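A sketch of the same procedure in the other direction, again on the hypothetical data. Because we start from two known variables, the submatrix maps the first TWO principal components to (waist, weight), giving a 2 x 2 matrix to invert:

```python
import numpy as np

data = np.array([
    [32., 150., 18.], [36., 180., 24.], [34., 165., 21.], [40., 210., 30.],
    [30., 140., 15.], [38., 195., 27.], [33., 158., 20.], [42., 225., 33.],
]).T
mean = data.mean(axis=1, keepdims=True)
eigvals, pca_axes = np.linalg.eigh(np.cov(data))
order = np.argsort(eigvals)[::-1]
pca_axes = pca_axes[:, order]

# Submatrix mapping the first two principal components to (waist, weight)
sub = pca_axes[0:2, 0:2]                            # 2 x 2 matrix
pcs = np.linalg.inv(sub) @ (data[0:2] - mean[0:2])  # (waist, weight) -> PC values
pred = pca_axes[:, 0:2] @ pcs + mean                # PCs -> all 3 variables
print(pred[2])   # predicted %bodyfat
```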