
Model selection for Machine Learning

The problem of overfitting

While it is common to train a machine learning model on a huge dataset, sometimes the learning algorithm is powerful enough, or training runs long enough, that the model effectively "memorizes" the training data given to it. It can also happen when the training data does not represent the overall population well. Intuitively, a model should be able to fit the training dataset given enough time or simple enough data. However, such a model is likely "overfitting" the data, and when given something it has never seen, it usually produces meaningless output.

Example: Training on a selected subset of data

We still use the digits dataset from sklearn, but this time we filter out all the 9s from the training data.

from sklearn import datasets

digits = datasets.load_digits()

# Keep only the samples whose label is not 9
data = []
target = []
images = []
for index, t in enumerate(digits.target):
    if t != 9:
        data.append(digits.data[index])
        target.append(digits.target[index])
        images.append(digits.images[index])

filtered = digits.copy()
filtered['target_names'] = [0, 1, 2, 3, 4, 5, 6, 7, 8]
filtered['data'] = data
filtered['target'] = target
filtered['images'] = images

We can see that the number of 9s in the new data set is 0.

# Count how many samples of each digit remain
d_list = 10 * [0]
for d in filtered['target']:
    d_list[d] += 1
d_list
[178, 182, 177, 183, 181, 182, 181, 179, 174, 0]

Then we train an SVM model using this subset of the data.

from sklearn import svm

clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(filtered['data'][:-1], filtered['target'][:-1])
SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
pre = clf.predict(digits.data)  # Save predictions for the testing data
ans = digits.target             # Save the correct answers for this data
# Fraction of the 9s (a class the model never saw) that are misclassified
total = 0
wrong = 0
for index, a in enumerate(ans):
    if a == 9:
        total += 1
        if ans[index] != pre[index]:
            wrong += 1
wrong / total
1.0
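Every 9 is misclassified, which makes sense: the classifier can only output one of the classes it was trained on (0 through 8). To see what it actually predicts for the unseen class, we can peek at its guesses for a few 9s; a minimal sketch, assuming digits and clf are still defined as above:

import numpy as np

# Indices of all 9s in the full data set (a class the model never saw)
nines = np.where(digits.target == 9)[0]

# The model must answer with one of the classes it knows (0-8),
# so every prediction here is necessarily wrong.
clf.predict(digits.data[nines[:10]])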

Cross Validation

Cross validation is a very common method used to avoid overfitting. The idea is to hide a portion of the training data from the model, much like we did above, but this time the hidden portion is chosen randomly. Then, by evaluating performance on this "hidden" part of the data, we can see whether the model is "memorizing" the given data instead of generalizing.

Since the aim is simply to split the data, there are many different ways to achieve this.

1. Hold out

The simplest way is to set aside a static hold-out set before training, so we can evaluate the model on it during training. We can use the LeavePOut class from sklearn to do so; since it enumerates every possible hold-out set of size p, we take only the first split below.

import sklearn.model_selection

# Leave 10% of the data as hold out
lpo = sklearn.model_selection.LeavePOut(p=len(digits.target) // 10)
for train, test in lpo.split(digits.target):
    print('test:', len(test))
    print('train:', len(train))
    break
test: 179
train: 1618
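For a plain hold-out split, sklearn also offers train_test_split, which shuffles the data and splits features and targets in one call; a minimal sketch, using the same 10% test fraction as above (the random_state value is an arbitrary choice):

from sklearn.model_selection import train_test_split

# Hold out 10% of the samples as a shuffled test/validation set
X_train, X_test, t_train, t_test = train_test_split(
    digits.data, digits.target, test_size=0.1, random_state=0)
X_train.shape, X_test.shape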

2. K-fold

Another common way is K-fold validation, where we split the data set into K folds and, before every epoch of training (every time we go through the entire training set), we pick one fold as the validation set. That means we can rotate the validation set and training set during the training process.

k_fold = sklearn.model_selection.KFold(n_splits=5)
for train, test in k_fold.split(digits.target):
    print('One pass through training set, train set has', len(train), 'validation set has', len(test))
One pass through training set, train set has 1437 validation set has 360
One pass through training set, train set has 1437 validation set has 360
One pass through training set, train set has 1438 validation set has 359
One pass through training set, train set has 1438 validation set has 359
One pass through training set, train set has 1438 validation set has 359

Note that these are only indices, so you would need to apply them to digits.target and digits.data separately to get the actual data and targets.

len(digits.target[test]) == len(test)
True

We can check the shape of the split we get from k fold:

digits.data[test].shape
(359, 64)
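Putting this together, the actual arrays for one fold are obtained by indexing both the data and the targets with the same index arrays; a small sketch, reusing the train and test indices left over from the last fold of the loop above:

# Build the per-fold arrays by applying the same indices to data and targets
X_train, t_train = digits.data[train], digits.target[train]
X_valid, t_valid = digits.data[test], digits.target[test]
X_train.shape, X_valid.shape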

3. Validating

Now that we have the split of data that we want to use as a "hidden" validation set for our model, we need to figure out a way to actually evaluate quantitative performance.

A good way of doing so is to simply apply the objective function to the validation set. For example, the simplest objective function is the Least Squared Error (LSE): E = \sum (t - y)^2

where t is our target and y is the output of the model. In order to evaluate the LSE on the validation set, we first let the model predict on the validation set, and then we simply calculate the LSE.

from sklearn import svm

clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(digits.data[train], digits.target[train])
SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
y_test = clf.predict(digits.data[test])
t_test = digits.target[test]
import sklearn.metrics

sklearn.metrics.mean_squared_error(t_test, y_test)
0.7381615598885793
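Note that mean_squared_error returns the mean of the squared errors, i.e. the sum in the formula above divided by the number of validation samples. Computing it by hand makes the relationship explicit; a quick check, reusing t_test and y_test from the cell above:

import numpy as np

# Sum-of-squares error from the formula, divided by N to give the mean
np.sum((t_test - y_test) ** 2) / len(t_test)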

We can also compare this with the result from the training set.

y_train = clf.predict(digits.data[train])
t_train = digits.target[train]
sklearn.metrics.mean_squared_error(t_train, y_train)
0.0

As we can see, the SVM model we're using is very good at minimizing the LSE on the training data. However, the error on the validation set is still quite high.
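If you only want the per-fold validation error and not the intermediate indices, sklearn's cross_val_score wraps the whole split-train-evaluate loop; a minimal sketch, reusing the SVC settings from above (scoring='neg_mean_squared_error' returns the negated MSE for each fold, so higher is better):

from sklearn.model_selection import cross_val_score
from sklearn import svm

# 5-fold cross validation; negate the scores to get per-fold validation MSE
scores = cross_val_score(svm.SVC(gamma=0.001, C=100.),
                         digits.data, digits.target,
                         cv=5, scoring='neg_mean_squared_error')
-scores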

Exercise

Now, let's repeat everything several times to see how training progresses over multiple epochs.

clf = svm.SVC(gamma=0.001, C=100.)
k_fold = sklearn.model_selection.KFold(n_splits=10)
for train, test in k_fold.split(digits.target):
    print('One pass')
    clf.fit(digits.data[train], digits.target[train])
    y_train = clf.predict(digits.data[train])
    t_train = digits.target[train]
    print('Train error', sklearn.metrics.mean_squared_error(t_train, y_train))
    y_test = clf.predict(digits.data[test])
    t_test = digits.target[test]
    print('Valid error', sklearn.metrics.mean_squared_error(t_test, y_test))
One pass
Train error 0.0
Valid error 0.45555555555555555
One pass
Train error 0.0
Valid error 0.0
One pass
Train error 0.0
Valid error 0.26666666666666666
One pass
Train error 0.0
Valid error 0.14444444444444443
One pass
Train error 0.0
Valid error 0.005555555555555556
One pass
Train error 0.0
Valid error 0.3611111111111111
One pass
Train error 0.0
Valid error 0.08888888888888889
One pass
Train error 0.0
Valid error 0.00558659217877095
One pass
Train error 0.0
Valid error 0.7318435754189944
One pass
Train error 0.0
Valid error 0.41899441340782123
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss='squared_loss', penalty=None, max_iter=200, tol=1e-3, warm_start=True)
k_fold = sklearn.model_selection.KFold(n_splits=10)
for train, test in k_fold.split(digits.target):
    print('One pass')
    clf.fit(digits.data[train], digits.target[train])
    y_train = clf.predict(digits.data[train])
    t_train = digits.target[train]
    print('Train error', sklearn.metrics.mean_squared_error(t_train, y_train))
    y_test = clf.predict(digits.data[test])
    t_test = digits.target[test]
    print('Valid error', sklearn.metrics.mean_squared_error(t_test, y_test))
One pass
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/stochastic_gradient.py:603: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
Train error 13.157081014223872
Valid error 10.277777777777779
One pass
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/stochastic_gradient.py:603: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
Train error 9.923933209647496
Valid error 10.372222222222222
One pass
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/stochastic_gradient.py:603: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
Train error 12.262213976499691
Valid error 11.772222222222222
One pass
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/stochastic_gradient.py:603: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
Train error 14.175015460729746
Valid error 13.905555555555555
One pass
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/stochastic_gradient.py:603: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
Train error 18.131725417439704
Valid error 17.344444444444445
One pass
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/stochastic_gradient.py:603: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
Train error 14.577612863327149
Valid error 13.527777777777779
One pass
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/stochastic_gradient.py:603: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
Train error 19.14409400123686
Valid error 19.67222222222222
One pass
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/stochastic_gradient.py:603: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
Train error 22.00494437577256
Valid error 22.748603351955307
One pass
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/stochastic_gradient.py:603: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
Train error 12.15389369592089
Valid error 14.329608938547485
One pass
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/stochastic_gradient.py:603: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
Train error 14.30840543881335
Valid error 14.033519553072626
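The ConvergenceWarning means the SGD optimizer hit max_iter before its stopping criterion was satisfied. One simple thing to try is raising the iteration cap and checking whether the validation error improves; a rough sketch, reusing the last fold's indices and the same settings as above (note that newer sklearn versions name this loss 'squared_error' instead of 'squared_loss'):

# Same model, but with a much larger iteration budget
clf_more = SGDClassifier(loss='squared_loss', penalty=None, max_iter=2000, tol=1e-3)
clf_more.fit(digits.data[train], digits.target[train])
sklearn.metrics.mean_squared_error(digits.target[test], clf_more.predict(digits.data[test]))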

clf.get_params()