In this homework we will implement a basic version of gradient descent.
Below, look for unfinished functions that contain a line like
## You complete me
Where those occur, fill in the missing code.
When your code is correct the output should be similar to the output provided (see the HTML version of this document for backup output images).
As in all our exercises, please be sure your function works for input that has any dimensionality (changing $d$ and $N$ should not break the code).
Reminder: It's cheating to google the solution and use the code you find. Don't pollute your brain by googling. Just try to solve the problem.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
### Here is the function you must complete
### Below you can see some tests which you can use to determine if you've succeeded.
### In your implementation simply run gradient descent for max_iter iterations.
### In other words you don't need to implement fancy stopping conditions.
def grad_descent(w,X,y,gradient,eta=0.1,max_iter=1000):
    """parameters: w (initial weights), X (data matrix), y (targets),
                   gradient (function of w, X, y)
    optional: eta (default 0.1), max_iter (default 1000)
    """
    history=[]        ## Each time a new w is considered, add it to this list
    history.append(w) ## Just like this
    for i in range(max_iter):
        ### You complete me
        history.append(w)
    return w,np.array(history)
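If you are unsure what the loop body should look like, remember that plain gradient descent is just the repeated update $w \leftarrow w - \eta\,\nabla E(w)$. Here is a minimal, self-contained sketch of that idea on a made-up 1-d least-squares problem (all the `_toy` names and `grad_descent_sketch` are mine, not part of the assignment); your version inside `grad_descent` should have the same shape.

```python
import numpy as np

def grad_descent_sketch(w, X, y, gradient, eta=0.1, max_iter=1000):
    """Minimal gradient descent: step against the gradient max_iter times."""
    history = [w]
    for _ in range(max_iter):
        w = w - eta * gradient(w, X, y)   # the whole algorithm is this one line
        history.append(w)
    return w, np.array(history)

# Toy 1-d least-squares problem whose minimizer is w = 2.
X_toy = np.array([[1.0], [2.0], [3.0]])
y_toy = np.array([2.0, 4.0, 6.0])
toy_grad = lambda w, X, y: 2 / X.shape[0] * (X.T @ X @ w - X.T @ y)
w_toy, path_toy = grad_descent_sketch(np.zeros(1), X_toy, y_toy, toy_grad,
                                      eta=0.05, max_iter=500)
```

Note that the history list starts with the initial w, so after `max_iter` iterations it holds `max_iter + 1` vectors.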
### First we'll try the function on a made up regression problem.
from sklearn import datasets
X_no_bias,y = datasets.make_regression(n_features=1,noise=10,random_state=100)
plt.scatter(X_no_bias,y,alpha=0.6)
plt.show()
## Add a bias column
X = np.c_[np.ones(X_no_bias.shape[0]),X_no_bias]
X[:5], X.shape,y.shape
## Get a train test split of the data
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y)
As we've said in class, linear regression with the normal equations is a one-liner.
Please write that line in the cell below.
You want
$$\bar{w} = X_{train}^\dagger y_{train}$$
where $X_{train}^\dagger$ is the pseudo-inverse of $X_{train}$.
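As a reminder of what the pseudo-inverse looks like in NumPy, here is the same computation on tiny made-up data (the `_toy` names are mine, not the variables in this notebook):

```python
import numpy as np

# Toy data: 5 points on the line y = 3 + 2x, with a bias column prepended.
X_toy = np.c_[np.ones(5), np.arange(5.0)]
y_toy = 3 + 2 * np.arange(5.0)

# np.linalg.pinv computes the Moore-Penrose pseudo-inverse X^dagger.
w_hat_toy = np.linalg.pinv(X_toy) @ y_toy   # recovers [3., 2.]
```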
w = ### You complete me
y_train_hat = X_train.dot(w)
plt.scatter(X_train[:,1],y_train)
plt.plot(X_train[:,1],y_train_hat,c='r')
plt.title("The regression line learned by the normal eqns")
plt.show()
Below we use $R^2$ to assess the performance of the model.
Please help us out by implementing the R2 function.
You can remind yourself of the definition here:
https://en.wikipedia.org/wiki/Coefficient_of_determination#Definitions
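Per that definition, $R^2 = 1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the total sum of squares around the mean of $y$. A minimal sketch under that definition (the name `R2_sketch` is mine, to avoid clobbering your implementation):

```python
import numpy as np

def R2_sketch(y, yhat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - yhat) ** 2)          # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)    # total sum of squares
    return 1 - ss_res / ss_tot

y_toy = np.array([1.0, 2.0, 3.0, 4.0])
```

Two sanity checks worth running on your own version: perfect predictions give 1.0, and always predicting the mean of $y$ gives 0.0.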
def R2(y,yhat):
    """parameters: y, yhat (true targets and predictions)"""
    ## You complete me
y_test_hat = X_test.dot(w)
print("w = {}".format(w))
print("R2 = {}".format(R2(y_test,y_test_hat)))
Now let's try to find the solution to this same problem using gradient descent.
You'll need the gradient of the mean squared error function which is on page 85 of the textbook.
For your convenience:
$$ \nabla E_{in}(\bar{w}) = \frac{2}{N}(X^TX\bar{w}-X^Ty). $$
Remember that the computer is not psychic and will not understand "$N$", which is the number of rows in $X$.
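Translating the formula literally into NumPy is one line, with $N$ taken from `X.shape[0]`. A sketch (the `_sketch`/`_toy` names are mine), checked against a central finite-difference estimate of the gradient, which is a useful habit whenever you hand-code a gradient:

```python
import numpy as np

def mse_gradient_sketch(w, X, y):
    """(2/N) * (X^T X w - X^T y), where N = X.shape[0]."""
    N = X.shape[0]
    return 2.0 / N * (X.T @ X @ w - X.T @ y)

# Sanity check: compare against a central finite-difference estimate.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(10, 3))
y_toy = rng.normal(size=10)
w_toy = rng.normal(size=3)
E_in = lambda w: np.mean((X_toy @ w - y_toy) ** 2)
eps = 1e-6
fd = np.array([(E_in(w_toy + eps * np.eye(3)[i])
                - E_in(w_toy - eps * np.eye(3)[i])) / (2 * eps)
               for i in range(3)])
```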
def mse_gradient(w,X,y):
    ## You complete me
    ## This formula is given above
w_init = np.random.randn(2)/10
wgd,path = grad_descent(w_init,X_train,y_train,mse_gradient)
y_test_hat = X_test.dot(wgd)
print("w (normal eqn) = {}".format(w))
print("w (gradient descent) = {}".format(wgd))
print("R2 = {}".format(R2(y_test,y_test_hat)))
## First 7 w's visited in grad descent
path[:7]
## Last 5 w's visited
path[-5:]
## This will make a picture of the path your w followed during GD
xx = np.linspace(-50,80,100)
yy = np.linspace(-50,80,100)
XX,YY = np.meshgrid(xx,yy)
W = np.c_[XX.ravel(),YY.ravel()]
errs = ((X.dot(W.T)).T - y)**2
E = np.mean(errs,axis=1)
plt.contourf(XX,YY,E.reshape(100,100),levels=300)
plt.title(r"Contour plot of $z=E_{in}(\bar{w})$ (regression)")
plt.xlabel(r"$w_0$")
plt.ylabel(r"$w_1$")
plt.scatter(path[:,0],path[:,1],c='r',alpha=0.5)
plt.colorbar()
plt.show()
### This shows how the magnitude of the gradient changed over time during your GD
### When could your program have stopped?
L = [mse_gradient(ww,X,y) for ww in path]
gradients = np.linalg.norm(L,axis=1)
plt.plot(np.arange(len(gradients)),gradients)
plt.title("Magnitude of gradient over time steps")
plt.ylabel("Gradient magnitude")
plt.xlabel("Time step")
plt.show()
## This matrix version of MSE is from the bottom of pg 84 in the text
## It is 20 times faster than an implementation with python loops!
def MSE(w,X,y):
    return 1/X.shape[0]*(w.T.dot(X.T).dot(X).dot(w) - 2*w.T.dot(X.T).dot(y) + y.dot(y))
err_diffs = np.diff([MSE(ww,X,y) for ww in path])
plt.plot(np.arange(len(err_diffs)),err_diffs)
plt.title("Error differences over time")
plt.ylabel(r"$E_{in}(w_{t+1})-E_{in}(w_{t})$")
plt.xlabel("Time step")
plt.show()
Let's see how your gradient descent performs on a real dataset, such as the Boston Housing data.
import pandas as pd
df=pd.read_csv("housing.data",header=None,sep=r"\s+") ## sep=r"\s+" splits on whitespace
df.head()
We ran into some problems here.
Linear regression based on the normal equations is doing just fine, but gradient descent is struggling.
We had to put $\eta$ all the way down to 0.000001 just to keep the updates from diverging, and even after 100000 iterations we still haven't converged to a good solution.
What's the problem?
Fix it! (Hint: What should you always do to the data before doing gradient descent? You can take the code for this operation from HW 2, or use the sklearn library version.)
When you fix it the output should look roughly like this:
w (normal equations )= [ 26.53112618 -10.30154035 4.11512828 2.44959865 2.11057155
-9.36246851 17.88281394 1.23610308 -13.15196355 7.6501691
-7.15185994 -9.86921614 3.64943825 -19.76860333]
R2 (normal equations) = 0.7165453029604523
w (gradient descent) = [ 19.9368617 -6.02022064 2.87944774 1.82739763 2.25551214
-6.37598968 22.22541246 1.55715796 -8.46547661 4.94478739
-4.63963461 -8.74471267 4.76136717 -18.60811081]
R2 (gradient descent)= 0.7123144123485858
You will need to put $\eta$ and max_iter
back down to sane values to get this output.
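The hint is about feature scaling: the housing columns live on wildly different scales, which makes the loss surface badly conditioned for gradient descent. A sketch of standardization with sklearn's `StandardScaler` on made-up data (the names `A`, `A_scaled`, `scaler` are mine); the one subtlety is to fit the scaler on the training rows only and then apply it everywhere:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data whose two columns live on very different scales.
rng = np.random.default_rng(1)
A = np.c_[rng.normal(0.0, 1.0, 200), rng.normal(0.0, 1000.0, 200)]

# Fit the scaler on the training rows only, then transform everything,
# so no statistics leak from the test rows into training.
scaler = StandardScaler().fit(A[:150])
A_scaled = scaler.transform(A)
```

After scaling, each training column has mean 0 and standard deviation 1, and a step size like the default $\eta=0.1$ becomes reasonable again.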
X_nb = df.values[:,:-1]
y = df.values[:,-1]
X = np.c_[np.ones(X_nb.shape[0]),X_nb]
X_train,X_test,y_train,y_test = train_test_split(X,y)
w = np.linalg.pinv(X_train).dot(y_train)
y_train_hat = X_train.dot(w)
y_test_hat = X_test.dot(w)
print("w (normal equations )= {}".format(w))
print("R2 (normal equations) = {}".format(R2(y_test,y_test_hat)))
w_init = np.random.randn(X.shape[1])
wgd,path = grad_descent(w_init,X_train,y_train,mse_gradient,eta=0.000001,max_iter=100000)
y_test_hat = X_test.dot(wgd)
print("w (gradient descent) = {}".format(wgd))
print("R2 (gradient descent)= {}".format(R2(y_test,y_test_hat)))
err_diffs = np.diff([MSE(ww,X,y) for ww in path])
plt.plot(np.arange(len(err_diffs)),err_diffs)
plt.title("Error differences over time")
plt.ylabel(r"$E_{in}(w_{t+1})-E_{in}(w_{t})$")
plt.xlabel("Time step")
plt.show()
L = [mse_gradient(ww,X,y) for ww in path]
gradients = np.linalg.norm(L,axis=1)
plt.plot(np.arange(len(gradients)),gradients)
plt.title("Magnitude of gradient over time steps")
plt.ylabel("Gradient magnitude")
plt.xlabel("Time step")
plt.show()
err_diffs[-10:]
gradients[-10:]
In this section you will implement stochastic gradient descent and apply it to some real data.
$\nabla {\bf e}(\bar{w}) = 2(\bar{w}^T\bar{x}_n-y_n)\bar{x}_n$
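For a single example, this is a one-liner in NumPy: the scalar residual $\bar{w}^T\bar{x}_n - y_n$ times the example itself, times 2. A sketch with made-up numbers (the `_sketch`/`_toy` names are mine):

```python
import numpy as np

def pw_grad_sketch(x, w, yy):
    """Pointwise squared-error gradient: 2 * (w.x - yy) * x."""
    return 2 * (w @ x - yy) * x

x_toy = np.array([1.0, 2.0])
w_toy = np.array([0.5, 0.5])
# w.x = 1.5, so the residual is 1.5 - 1.0 = 0.5 and the gradient is x itself.
g_toy = pw_grad_sketch(x_toy, w_toy, 1.0)
```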
def pw_linreg_grad(x,w,yy):
    """parameters: x,w,yy (instance, weight vector, target value)
    return: the pointwise gradient at w
    This is the pointwise gradient of the squared error;
    the mathematical formula can be found above.
    """
    ### You complete me
def sgd(w,X,y,pw_gradient,eta=0.05,num_epochs=50):
    """parameters: w (initial weight vector)
                   X (data matrix)
                   y (target vector)
                   pw_gradient (pointwise gradient function taking params x,w,yy)"""
    history = [] ## Every time you compute a new w, do history.append(w).
    for j in range(num_epochs):
        shuff = np.random.permutation(X.shape[0]) ## Visit the examples in this order
        for i in range(X.shape[0]):
            ### You complete me
            history.append(w)
    return w,np.array(history)
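If the epoch structure is unclear: each epoch makes one pass over the whole training set in a fresh random order, taking one small step per example. A self-contained sketch on a noiseless toy problem (all names with `_sketch`/`_toy` are mine; it omits the history bookkeeping your version needs):

```python
import numpy as np

def sgd_sketch(w, X, y, pw_gradient, eta=0.05, num_epochs=50):
    """One epoch = one pass over the data in a fresh random order."""
    rng = np.random.default_rng(0)
    for _ in range(num_epochs):
        for i in rng.permutation(X.shape[0]):
            w = w - eta * pw_gradient(X[i], w, y[i])  # one step per example
    return w

# Noiseless toy problem y = 2x; SGD should recover w close to 2.
X_toy = np.linspace(0.1, 1.0, 20).reshape(-1, 1)
y_toy = 2.0 * X_toy.ravel()
pw_toy = lambda x, w, yy: 2 * (w @ x - yy) * x
w_toy = sgd_sketch(np.zeros(1), X_toy, y_toy, pw_toy, eta=0.1, num_epochs=200)
```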
### Be sure X_train and X_test are already scaled for this test.
### Actual R2 scores will vary but the normal eqns and sgd should give similar results.
w = np.linalg.pinv(X_train).dot(y_train)
yhat_test_ne = X_test.dot(w)
print("Solution from normal equations: \n",w)
print("R2_test(w) (normal equations) = {}".format(R2(y_test,yhat_test_ne)))
wsgd = np.random.randn(X.shape[1])
wsgd,path = sgd(wsgd,X_train,y_train,pw_linreg_grad,eta=0.01,num_epochs=250)
print("Solution from SGD")
yhat_test = X_test.dot(wsgd)
print("w = {},\n R2_test(w) (sgd) = {}".format(wsgd,R2(y_test,yhat_test)))
err_diffs = np.diff([MSE(ww,X,y) for ww in path])
plt.plot(np.arange(len(err_diffs)),err_diffs)
plt.title("Error differences over time")
plt.ylabel(r"$E_{in}(w_{t+1})-E_{in}(w_{t})$")
plt.xlabel("Time step")
plt.show()
L = [mse_gradient(ww,X,y) for ww in path]
gradients = np.linalg.norm(L,axis=1)
plt.plot(np.arange(len(gradients)),gradients)
plt.title("Magnitude of gradient over time steps")
plt.ylabel("Gradient magnitude")
plt.xlabel("Time step")
plt.show()