Model selection for Machine Learning
The problem of overfitting
While machine learning models are commonly trained on large datasets, it sometimes happens that the learning algorithm is powerful enough, or training runs for so long, that the model effectively "memorizes" the training data it is given. This can also happen when the training data does not represent the overall distribution well. Intuitively, a model should be able to fit the training dataset given enough time or sufficiently simple data. However, such a model is likely "overfitting" the data: when given something it has never seen, it usually produces meaningless output.
Example: Training on a selected subset of data
We still use the digits dataset from sklearn, but this time we filter out all the 9s from the training data.
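A minimal sketch of this filtering step is shown below; the variable names (X_no9, y_no9) are our own placeholders, not from the original code.

```python
from sklearn import datasets
import numpy as np

digits = datasets.load_digits()

# Keep only the samples whose label is not 9.
mask = digits.target != 9
X_no9 = digits.data[mask]
y_no9 = digits.target[mask]

# Count how many 9s remain in the filtered set.
print("number of 9s:", np.sum(y_no9 == 9))
```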
We can see that the number of 9s in the new data set is 0.
Then we train an SVM model using this subset of the data.
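Assuming the variables from the sketch above, a basic SVM fit could look like this; the gamma value is an assumption, chosen to match common digits examples.

```python
from sklearn import svm

# Fit a support vector classifier on the 9-free subset.
clf = svm.SVC(gamma=0.001)
clf.fit(X_no9, y_no9)
```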
Cross Validation
Cross validation is a very common method used to avoid overfitting. The idea is to hide a portion of the training data from the model, just like what we did above, but this time the hidden portion is selected randomly. Then, by evaluating performance on this "hidden" part of the data, we can see whether the model is "memorizing" the given data instead of generalizing.
Since the aim is just to split the data, there are many different ways to achieve this.
1. Hold out
The simplest way is to make a static hold-out set before training, so we can evaluate the model on this hold-out set during training. We can use the LeavePOut class from sklearn to do so.
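A sketch of how this could be done is below. The choice of p and the decision to keep only the first split are assumptions for illustration; LeavePOut enumerates every possible combination of p held-out samples, so we take just one and keep it fixed.

```python
from sklearn.model_selection import LeavePOut

lpo = LeavePOut(p=100)  # hold out 100 samples (p is an assumption)

# Take only the first split and use it as a static hold-out set.
train_idx, holdout_idx = next(lpo.split(digits.data))

X_train, y_train = digits.data[train_idx], digits.target[train_idx]
X_holdout, y_holdout = digits.data[holdout_idx], digits.target[holdout_idx]
```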
2. K-fold
Another common way is K-fold validation, where we split the dataset into K folds and, before every epoch of training (every time we go through the entire training set), randomly pick one fold as the validation set. That means we can rotate the validation and training sets during the training process. A sketch of generating such a split is shown below.
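As an illustration (K=5 and the shuffle settings are assumptions), the folds could be produced like this:

```python
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)

# kf.split yields pairs of index arrays, one (train, validation) pair per fold.
splits = list(kf.split(digits.data))
train_idx, val_idx = splits[0]  # pick one fold as the validation set
```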
Note that these are only indices, so you would need to apply them to digits.target and digits.data separately to get the actual data and targets.
We can check the shape of the split we get from K-fold:
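Continuing with the index arrays from the previous sketch:

```python
# For K=5, roughly 4/5 of the samples land in the training part
# and 1/5 in the validation part.
print(train_idx.shape)
print(val_idx.shape)
```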
3. Validating
Now that we have the split of data that we want to use as a "hidden" validation set for our model, we need to figure out a way to actually evaluate quantitative performance.
A good way of doing so is to simply apply the objective function to the validation set. For example, one of the simplest objective functions is the Least Squared Error (LSE):

$$\mathrm{LSE} = \sum_{i} (t_i - y_i)^2$$

where $t$ is our target and $y$ is the output of the model. In order to evaluate the LSE on the validation set, we first let the model predict on the validation set, then we simply calculate the LSE.
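A sketch of this computation is below; it reuses the SVM fitted earlier on the 9-free subset and the val_idx indices from the K-fold sketch, which are assumptions carried over from the previous examples.

```python
import numpy as np

# Predict on the validation fold, then compute the sum of squared errors.
val_pred = clf.predict(digits.data[val_idx])
val_lse = np.sum((digits.target[val_idx] - val_pred) ** 2)
print("validation LSE:", val_lse)
```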
We can also compare this with the result from the training set.
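The same computation on the data the model was fitted on, again assuming the variables from the earlier sketches:

```python
# Evaluate the LSE on the data the model was trained on, for comparison.
# Dividing both sums by the number of samples gives a size-independent comparison.
train_pred = clf.predict(X_no9)
train_lse = np.sum((y_no9 - train_pred) ** 2)
print("training LSE:", train_lse)
```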
As we can see, the SVM model we're using is very good at minimizing the LSE on the training data. However, the error on the validation set is still quite high.
Exercise
Now, let's repeat everything several times to see how training progresses over multiple epochs.