This notebook shows you how to run Principal Component Analysis (PCA) using Python and Numpy, and graph your results using Matplotlib. Each box shows you a code example of how to do a step. To run a step, click on that code box and then click the Play button above (just left of the STOP button).

We first need to import Numpy and Matplotlib, and define a few convenience functions to make PCA easier. Click on the following code box, and click the Play button above; you should then see a simple graph:

In [1]:

# import the libraries we need
import numpy
from matplotlib import pyplot
from math import cos, sin, pi

# make this notebook display graphs automatically
%matplotlib inline

# define convenience functions
def get_rotation_matrix(theta=0.5, isDegrees=True):
    if isDegrees:
        theta *= pi / 180.
    r = numpy.matrix(((cos(theta), -sin(theta)),
                      (sin(theta), cos(theta))))
    return r

def sort_pca(pca_vars, pca_axes):
    s = list(zip(pca_vars, pca_axes.T))
    s.sort(key=lambda t: t[0], reverse=True)  # sort on variance only
    eig_vals2 = numpy.array([t[0] for t in s])
    eig_vecs2 = numpy.matrix([t[1] for t in s]).T
    return eig_vals2, eig_vecs2

def recenter(vecs):
    m = numpy.matrix(vecs).mean(axis=1)
    return numpy.array(vecs - m), numpy.array(m)

# display a graph of sin(x)
rads = numpy.arange(0, 2*pi, 0.1)
pyplot.plot(rads, numpy.sin(rads))

[<matplotlib.lines.Line2D at 0x7fc9a843ea20>]

The following code box shows how to enter a simple dataset of (height, weight) pairs and convert it from row format to column format (i.e. each datapoint is represented by one column in the data array), which is convenient for numpy calculations and graphing. Click on the code box and click the Play button above to graph these data:

In [2]:

rows = [[64.5, 118.0], [73.3, 143.0], [68.8, 172.0], [65.0, 147.0],
        [69.0, 146.0], [64.5, 138.0], [66.0, 175.0], [66.3, 134.0],
        [68.8, 172.0], [64.5, 118.0], [70.0, 151.0], [69.0, 155.0],
        [70.5, 155.0], [66.0, 146.0], [68.0, 135.0], [68.5, 127.0],
        [73.5, 178.0], [66.3, 136.0], [70.0, 180.0], [76.5, 186.0],
        [62.0, 122.0], [68.0, 132.0], [63.0, 114.0], [72.0, 171.0],
        [68.0, 140.0], [77.0, 187.0], [63.0, 106.0], [66.5, 159.0],
        [62.5, 127.0], [67.0, 191.0], [75.5, 192.0], [69.0, 181.0],
        [66.5, 143.0], [66.5, 153.0], [70.5, 144.0], [64.5, 139.0],
        [74.0, 148.0], [75.5, 179.0]]
hw = numpy.array(rows).T  # transpose to column format
hwc, m = recenter(hw)  # recentered on (0,0)
pyplot.plot(hw[0], hw[1], marker='.', linestyle='')
pyplot.xlabel('height (inches)')
pyplot.ylabel('weight (pounds)')
pyplot.xlim(xmin=10, xmax=130)

(10, 130)

A rotation matrix represents a rotated coordinate system as a table whose columns are the coordinates of its axes. The following code example shows how to get a 2D rotation matrix for a 30 degree (counter-clockwise) rotation, print it, and graph its axes (column vectors):

In [3]:

r = get_rotation_matrix(30)
print(r)
pyplot.plot((0, r[0,0]), (0, r[1,0]))
pyplot.plot((0, r[0,1]), (0, r[1,1]))

[[ 0.8660254 -0.5 ]
[ 0.5 0.8660254]]

[<matplotlib.lines.Line2D at 0x7fc9a7b9f240>]
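A useful property of rotation matrices (which we will rely on later when we invert one) is that they are orthogonal: the transpose equals the inverse, and the determinant is 1. A minimal sketch checking this for the 30-degree matrix, rebuilt here from the same formula so it runs on its own:

```python
import numpy
from math import cos, sin, pi

theta = 30 * pi / 180.
r = numpy.matrix(((cos(theta), -sin(theta)),
                  (sin(theta), cos(theta))))
print(numpy.allclose(r.T, r.I))                  # transpose equals inverse
print(numpy.allclose(numpy.linalg.det(r), 1.0))  # determinant is 1
```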

Multiplying a dataset by a rotation matrix has the effect of rotating it by that amount (around the origin (0,0)). Here we rotate the dataset (centered at (0,0)) by our rotation matrix and graph the result:

In [4]:

hwr = numpy.array(r * hwc)
pyplot.plot(hwr[0], hwr[1], marker='.', linestyle='')
pyplot.xlim(xmin=-50, xmax=70)

(-50, 70)

A covariance matrix is a square table whose entries (i,j) show the covariance of coordinate i vs. coordinate j, and whose diagonal entries (i,i) show the variance of coordinate i. A positive covariance implies a positive correlation between coordinates i and j. It can be computed on a dataset using the numpy.cov() function:

In [5]:

print(numpy.cov(hw))

[[ 15.95035562 65.60156472]
[ 65.60156472 551.24039829]]
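As a sanity check, each entry of the covariance matrix can be computed by hand from its definition, cov(x, y) = sum((x - mean(x)) * (y - mean(y))) / (n - 1). A minimal sketch on a tiny made-up dataset (not the height-weight data), comparing against numpy.cov():

```python
import numpy

x = numpy.array([64.5, 73.3, 68.8])
y = numpy.array([118.0, 143.0, 172.0])
n = len(x)
# sample covariance, computed by hand
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)
print(numpy.allclose(cov_xy, numpy.cov(x, y)[0, 1]))  # True
```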

For this simple two-dimensional dataset, it's easy to search all possible rotations (around the center of the data) for its "natural" coordinate system where the covariance goes to zero. For each possible rotation, we calculate the rotated data's covariance matrix C, whose element C[0,0] is the variance of X', C[1,1] is the variance of Y', and C[0,1] is the covariance of X', Y':

In [6]:

rots = numpy.arange(0, 180, 5)  # from 0 to 180 degrees in 5 degree steps
cs = [numpy.cov(get_rotation_matrix(r) * hwc) for r in rots]
pyplot.plot(rots, [c[0,1] for c in cs], label='Cov(x,y)')
pyplot.plot(rots, [c[0,0] for c in cs], label='Var(x)')
pyplot.plot(rots, [c[1,1] for c in cs], label='Var(y)')
pyplot.plot(rots, [c[0,0] + c[1,1] for c in cs], label='total variance')
pyplot.axhline(0, color='k')  # draw a line at y=0
pyplot.legend(loc='lower right')

<matplotlib.legend.Legend at 0x7fc9a7ba86a0>
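The zero crossing of Cov(x,y) can also be found in closed form rather than by search: rotating the data by R(theta) transforms the covariance matrix to C' = R C Rᵀ, and setting the off-diagonal entry of C' to zero gives tan(2·theta) = 2·C[0,1] / (C[1,1] - C[0,0]). A minimal sketch verifying this (on synthetic correlated data, not the height-weight dataset; theta here is in radians):

```python
import numpy
from math import atan2, cos, sin

# synthetic correlated 2D data, centered on (0,0)
rng = numpy.random.RandomState(0)
x = rng.randn(200)
y = 2 * x + rng.randn(200)
data = numpy.vstack((x - x.mean(), y - y.mean()))

c = numpy.cov(data)
theta = 0.5 * atan2(2 * c[0, 1], c[1, 1] - c[0, 0])  # zero-covariance angle
r = numpy.array(((cos(theta), -sin(theta)),
                 (sin(theta), cos(theta))))
c2 = numpy.cov(r.dot(data))
print(abs(c2[0, 1]) < 1e-10)  # True: covariance vanishes at this rotation
```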

We can find the principal component axes for any dataset, no matter how many dimensions, in a single step, by simply computing the so-called eigenvectors of the covariance matrix. Numpy makes this easy. The following example displays the PCA axes superimposed on the dataset; intuitively, you can see they provide the "natural coordinate system" that best fits the dataset:

In [7]:

c = numpy.cov(hwc)
pca_variances, pca_axes = numpy.linalg.eig(c)  # compute eigenvectors
a = numpy.array(pca_axes) * 2 * numpy.sqrt(pca_variances)  # scale to 2 std. dev.
pyplot.plot(hw[0], hw[1], marker='.', linestyle='')
pyplot.plot([m[0]-a[0,0], m[0]+a[0,0]], [m[1]-a[1,0], m[1]+a[1,0]])
pyplot.plot([m[0]-a[0,1], m[0]+a[0,1]], [m[1]-a[1,1], m[1]+a[1,1]])
pyplot.xlim(xmin=10, xmax=130)

(10, 130)

Our pca_axes matrix transforms points in PCA coordinates to "real-world" coordinates (in this case, height and weight). For the opposite conversion from real-world coordinates to PCA coordinates, we need the reverse rotation, which is called the inverse of the rotation matrix $R^{-1}$. The following code example shows how to convert our height-weight dataset to PCA coordinates, graph them, and assess whether PCA actually eliminated all covariance (note that it is printed in scientific notation where "e-14" means "times 10 to the -14th power"):

In [8]:

rinv = numpy.matrix(pca_axes).I  # get the inverse rotation matrix
hwr = numpy.array(rinv * hwc)  # transform to PCA coordinates
pyplot.plot(hwr[0], hwr[1], marker='.', linestyle='')
print(numpy.cov(hwr))

[[ 8.02792108e+00 -2.13962208e-14]
[-2.13962208e-14 5.59162833e+02]]
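Incidentally, because a covariance matrix is symmetric, its eigenvector matrix is orthogonal, so the inverse is (up to numerical precision) just the transpose. A minimal sketch, using an arbitrary covariance matrix rather than the height-weight one:

```python
import numpy

rng = numpy.random.RandomState(1)
c = numpy.cov(rng.randn(2, 100))  # any covariance matrix is symmetric
vals, vecs = numpy.linalg.eig(c)
v = numpy.matrix(vecs)
print(numpy.allclose(v.I, v.T))  # True: inverse equals transpose
```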

As you can see in the example above, some PCA axes capture more of the inherent variance in the data than others. We therefore sort the PCA axes from largest to smallest variance, as follows:

In [9]:

print(pca_variances)
print(pca_axes)
pca_variances, pca_axes = sort_pca(pca_variances, pca_axes)
print('Reordered by variance:')
print(pca_variances)
print(pca_axes)

[ 8.02792108 559.16283284]
[[-0.9927866 -0.11989481]
[ 0.11989481 -0.9927866 ]]
Reordered by variance:
[559.16283284 8.02792108]
[[-0.11989481 -0.9927866 ]
[-0.9927866 0.11989481]]

The value of sorting the PCA axes this way is that it provides a natural ordering for "compressing" the data as accurately as possible. The following code example shows how accurately we can model the height vs. weight dataset using ONLY the first principal component (by just setting the second component to its mean value, zero), and then predicting both the height and weight from that value:

In [10]:

rinv = pca_axes.I  # get the inverse rotation matrix
hwr = numpy.array(rinv * hwc)  # transform to PCA coordinates
hwr[1] = 0  # set second PCA coordinate to zero
print(hwr)
hw2 = numpy.array(pca_axes * hwr + m)  # predict height, weight
pyplot.plot(hw[0], hw[1], marker='.', linestyle='', label='actual')
pyplot.plot(hw2[0], hw2[1], marker='+', linestyle='', label='predicted')
pyplot.xlabel('height (inches)')
pyplot.ylabel('weight (pounds)')
pyplot.xlim(xmin=10, xmax=130)
pyplot.legend(loc='upper left')

[[ 33.28432361 7.40958429 -20.84170052 4.43356477 4.94677215
13.42859159 -23.48435487 17.18392734 -20.84170052 33.28432361
-0.13705566 -3.98830726 -4.16814947 5.30645657 15.98731957
23.86966498 -27.36192571 15.19835414 -28.9278671 -35.66390294
29.61291422 18.96567937 37.43531222 -20.2325773 11.02338656
-36.71663695 45.37760503 -7.65971665 24.58903381 -39.48883529
-41.50072774 -29.80075889 8.22486897 -1.70299704 6.75250314
12.43580498 2.36172492 -28.59450193]
[ 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. 0. 0.
0. 0. 0. ]]

<matplotlib.legend.Legend at 0x7fc9a77a5630>

Questions:

- Why do the predictions lie on a straight line?
- What do you expect to be the average error of these predictions?
- How much better are these predictions than the "naive prediction" $(\overline h, \overline w)$?
- What would happen if you changed the third line of code above to be "hwr[0] = 0" (i.e. setting the first principal component to zero)? Test your prediction by modifying the code and rerunning it.
- What would happen if you deleted the third line of code (i.e. leave both principal components unchanged)? Test your prediction by modifying the code and rerunning it.

Answers:

- We used only a single dimension (first principal component) to model the data, so all its predictions will lie on a (one dimensional) line.
- The mean-squared error of our predictions is just the variance of the second principal component (which we set to its mean value, zero), which was 8.03.
- Compared with our predictions, the naive prediction's error would be increased by 559 (the variance of the first principal component), since it models neither principal component.
- The predictions will all lie on the second principal component axis (rather than the first), and the resulting error will be a lot higher (variance=559, instead of variance=8).
- The predictions will perfectly match the original data. All we have done is apply rotation matrix R followed by its inverse $R^{-1}$, which just gives us back our original dataset unchanged.
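The claims in these answers can be checked numerically. The sketch below (on a small synthetic dataset, not the height-weight data) confirms that the mean-squared error of a one-component reconstruction equals the variance of the dropped component, and that keeping both components reproduces the data exactly:

```python
import numpy

# synthetic correlated 2D data, centered on (0,0)
rng = numpy.random.RandomState(2)
x = rng.randn(100)
data = numpy.vstack((x + 0.3 * rng.randn(100),
                     2 * x + 0.3 * rng.randn(100)))
data = data - data.mean(axis=1, keepdims=True)

c = numpy.cov(data)
vals, vecs = numpy.linalg.eig(c)
order = vals.argsort()[::-1]           # sort largest variance first
vals, vecs = vals[order], vecs[:, order]

pcs = vecs.T.dot(data)                 # transform to PCA coordinates
pcs1 = pcs.copy()
pcs1[1] = 0                            # drop the second component
recon1 = vecs.dot(pcs1)                # one-component reconstruction

mse = ((data - recon1) ** 2).sum() / (data.shape[1] - 1)
print(numpy.allclose(mse, vals[1]))          # MSE = dropped component's variance
print(numpy.allclose(vecs.dot(pcs), data))   # both components: exact recovery
```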

The power of PCA is that it works the same no matter how many dimensions (variables) your dataset contains. Here's a three dimensional dataset giving a sample of people's waist measurements (inches), weight (pounds), and body fat (percent). First let's examine how well PCA can model these data, using the same approach as we used above:

In [11]:

rows = [[32, 175, 6], [36, 181, 21], [38, 200, 15], [33, 159, 6],
        [39, 196, 22], [40, 192, 31], [41, 205, 32], [35, 173, 21],
        [38, 187, 25], [38, 188, 30], [33, 188, 10], [40, 240, 20],
        [36, 175, 22], [32, 168, 9], [44, 246, 38], [33, 160, 10],
        [41, 215, 27], [34, 159, 12], [34, 146, 10], [44, 219, 28]]
wwb = numpy.array(rows).T  # convert to column format
wwbc, wwbm = recenter(wwb)  # center on (0,0)
c = numpy.cov(wwbc)  # compute covariance matrix
pca_variances, pca_axes = numpy.linalg.eig(c)  # compute PCA
pca_variances, pca_axes = sort_pca(pca_variances, pca_axes)
print(pca_variances)
print(pca_axes)
f = pca_variances.cumsum() / sum(pca_variances)  # cumulative sum
pyplot.plot((1,2,3), f, marker='o')
pyplot.xlabel('Principal Component')
pyplot.ylabel('Fraction of Total Variance')

Questions:

- Interpret the meaning of this graph.
- Interpret the meaning of the first principal component (first column of pca_axes).
- Interpret the meaning of the second principal component (second column of pca_axes).

Answers:

- The graph shows that the first principal component, by itself, can model all three data variables (waist, weight, bodyfat) with an accuracy of 94% of the total variance. This indicates that all three data variables are highly correlated. Including the second principal component as well raises the accuracy to 99.8% of the variance.
- The first PCA axis (first column of pca_axes) has the same sign for all three variables (waist, weight, bodyfat, in that order), indicating that the main trend in the data is that all three are *positively* correlated with each other. I.e. people with bigger waist measurements also tend to have higher weight and %bodyfat.
- The second PCA axis (second column of pca_axes) has the same sign for waist and bodyfat, but the opposite sign for weight. I.e. when waist and bodyfat vary independently of weight, they still correlate positively with each other (and very strongly -- only 0.2% of the total variance is left unexplained by these two PCA axes).
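The sign-based reading of the first PCA axis can be cross-checked directly with a correlation matrix: if all three variables are positively correlated, every off-diagonal entry of numpy.corrcoef() should be positive. A sketch using the same waist/weight/%bodyfat rows as above:

```python
import numpy

rows = [[32, 175, 6], [36, 181, 21], [38, 200, 15], [33, 159, 6],
        [39, 196, 22], [40, 192, 31], [41, 205, 32], [35, 173, 21],
        [38, 187, 25], [38, 188, 30], [33, 188, 10], [40, 240, 20],
        [36, 175, 22], [32, 168, 9], [44, 246, 38], [33, 160, 10],
        [41, 215, 27], [34, 159, 12], [34, 146, 10], [44, 219, 28]]
wwb = numpy.array(rows).T
corr = numpy.corrcoef(wwb)  # 3x3 correlation matrix
print(corr)
print((corr > 0).all())  # True: all pairs positively correlated
```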

We can use PCA as a prediction method, by simply computing the first principal component value for each %bodyfat value, and then using it to predict the waist and weight measurements for that person. All we have to do is get the submatrix that maps the first principal component to %bodyfat. Then to convert %bodyfat to first principal component, we just invert that matrix, and apply it to our %bodyfat data:

In [12]:

axes = (2,)  # just use %bodyfat (data row 2)
pca_b = pca_axes[axes, :1]  # get PCA column 0 vs. data row 2
print(pca_b)
binv = pca_b.I  # inverse rotation matrix for converting %bodyfat to PCA
print(binv)
pc1 = binv * wwbc[axes,]  # compute first principal component
print(pc1)
z = numpy.zeros(wwb.shape)  # empty dataset (all values zero)
z[:1] = pc1  # insert first principal component values
wwb2 = numpy.array(pca_axes * z + wwbm)  # predict waist and weight
pyplot.plot(wwb[0], wwb[1], marker='.', linestyle='', label='actual')
pyplot.plot(wwb2[0], wwb2[1], marker='+', linestyle='', label='predicted')
pyplot.xlabel('waist (inches)')
pyplot.ylabel('weight (pounds)')
pyplot.legend(loc='upper left')

The same method can be used to predict all the variables from any subset of variables. For example, waist and weight are trivial to measure (with just a measuring tape and scale), but %bodyfat is much harder to measure. So it might be helpful to be able to predict %bodyfat from waist and weight measurements. We reuse exactly the same procedure as above:

In [13]:

axes = (0,1)  # use waist, weight (data rows 0 and 1)
pca_ww = pca_axes[axes, :len(axes)]  # get PCA columns 0-1 vs. rows 0-1
print(pca_ww)
wwinv = pca_ww.I  # inverse rotation matrix for converting ww to PCA
print(wwinv)
pc2 = wwinv * wwbc[axes,]  # compute first two principal components
print(pc2)
z = numpy.zeros(wwb.shape)  # empty dataset (all values zero)
z[:len(axes)] = pc2  # insert first two principal component values
wwb3 = numpy.array(pca_axes * z + wwbm)  # predict %bodyfat
pyplot.plot(wwb[2], wwb3[2], marker='o', linestyle='')
pyplot.plot([5,40], [5,40])  # diagonal line from (5,5) to (40,40)
pyplot.xlabel('%bodyfat')
pyplot.ylabel('predicted %bodyfat')
