Model selection for Machine Learning
The problem of overfitting
While machine learning models are commonly trained on large datasets, it sometimes happens that the learning algorithm is powerful enough, or training runs for so long, that the model effectively "memorizes" the training data it is given. This can also happen when the training data does not represent the overall distribution well. Intuitively, a model should be able to fit the training dataset given enough time or sufficiently simple data. However, such a model is likely "overfitting" the data: when given something it has never seen, it usually produces meaningless output.
Example: Training on a selected subset of data
We still use the digits dataset from sklearn, but this time we filter out all the 9s from the training data.
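A minimal sketch of this filtering step is shown below; the variable names (X_no9, y_no9) are our own placeholders, not from the original code.

```python
from sklearn import datasets
import numpy as np

digits = datasets.load_digits()

# Keep only the samples whose label is not 9.
mask = digits.target != 9
X_no9 = digits.data[mask]
y_no9 = digits.target[mask]

# Count how many 9s remain in the filtered set.
print("number of 9s:", np.sum(y_no9 == 9))
```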
We can see that the number of 9s in the new data set is 0.
Then we train an SVM model using this subset of the data.
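Assuming the variables from the sketch above, a basic SVM fit could look like this; the gamma value is an assumption, chosen to match common digits examples.

```python
from sklearn import svm

# Fit a support vector classifier on the 9-free subset.
clf = svm.SVC(gamma=0.001)
clf.fit(X_no9, y_no9)
```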
Cross Validation
Cross validation is a very common method used to avoid overfitting. The idea is to hide a portion of the training data from the model, just like what we did above, but this time the hidden portion is selected randomly. Then, by evaluating performance on this "hidden" part of the data, we can see whether the model is "memorizing" the given data instead of generalizing.
Since the aim is just to split the data, there are many different ways to achieve this.
1. Hold out
The simplest way is to make a static hold-out set before training, so we can evaluate the model on this hold-out set during training. We can use the LeavePOut class from sklearn to do so.
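A sketch of how this could be done is below. The choice of p and the decision to keep only the first split are assumptions for illustration; LeavePOut enumerates every possible combination of p held-out samples, so we take just one and keep it fixed.

```python
from sklearn.model_selection import LeavePOut

lpo = LeavePOut(p=100)  # hold out 100 samples (p is an assumption)

# Take only the first split and use it as a static hold-out set.
train_idx, holdout_idx = next(lpo.split(digits.data))

X_train, y_train = digits.data[train_idx], digits.target[train_idx]
X_holdout, y_holdout = digits.data[holdout_idx], digits.target[holdout_idx]
```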
2. K-fold
Another common way is K-fold validation, where we split the dataset into K folds and, before every epoch of training (every time we go through the entire training set), randomly pick one fold as the validation set. That means we can rotate the validation and training sets during the training process. A sketch of generating such a split is shown below.
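As an illustration (K=5 and the shuffle settings are assumptions), the folds could be produced like this:

```python
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)

# kf.split yields pairs of index arrays, one (train, validation) pair per fold.
splits = list(kf.split(digits.data))
train_idx, val_idx = splits[0]  # pick one fold as the validation set
```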
Note that these are only indices, so you would need to apply them to digits.target and digits.data separately to get the actual data and targets.
We can check the shape of the split we get from K-fold:
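Continuing with the index arrays from the previous sketch:

```python
# For K=5, roughly 4/5 of the samples land in the training part
# and 1/5 in the validation part.
print(train_idx.shape)
print(val_idx.shape)
```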
3. Validating
Now that we have the split of data that we want to use as a "hidden" validation set for our model, we need to figure out a way to actually evaluate quantitative performance.
A good way of doing so is to simply apply the objective function to the validation set. For example, one of the simplest objective functions is the Least Squared Error (LSE):

$$\mathrm{LSE} = \sum_{i} (t_i - y_i)^2$$

where $t$ is our target and $y$ is the output of the model. In order to evaluate the LSE on the validation set, we first let the model predict on the validation set, then we simply calculate the LSE.
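A sketch of this computation is below; it reuses the SVM fitted earlier on the 9-free subset and the val_idx indices from the K-fold sketch, which are assumptions carried over from the previous examples.

```python
import numpy as np

# Predict on the validation fold, then compute the sum of squared errors.
val_pred = clf.predict(digits.data[val_idx])
val_lse = np.sum((digits.target[val_idx] - val_pred) ** 2)
print("validation LSE:", val_lse)
```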
We can also compare this with the result from the training set.
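The same computation on the data the model was fitted on, again assuming the variables from the earlier sketches:

```python
# Evaluate the LSE on the data the model was trained on, for comparison.
# Dividing both sums by the number of samples gives a size-independent comparison.
train_pred = clf.predict(X_no9)
train_lse = np.sum((y_no9 - train_pred) ** 2)
print("training LSE:", train_lse)
```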
As we can see, the SVM model we're using is very good at minimizing the LSE on the training data. However, the error on the validation set is still quite high.
Exercise
Now, let's repeat everything several times to see how training progresses over multiple epochs.