Methods Of Cross Validation In ML

Cross-validation estimates a model's generalisation error by averaging the errors measured on multiple held-out subsets of the data.

Given a model learned from a data set, how can we estimate the model's error on new, unseen data?

Leave-One-Out Cross-Validation (LOOCV)

LOOCV builds on the holdout idea: a subset of the data set is withheld from training and used specifically to test the model's error.

LOOCV takes the n examples and trains on n - 1 of them, leaving exactly 1 example as the holdout set.

After we've done this n times (leaving out a different example each time), the mean of the n test errors (each computed on the single left-out example) is the estimate of the "true error" of the model.
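The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: the 1-nearest-neighbour learner and the toy data set are assumptions chosen purely to make the example self-contained.

```python
def nearest_neighbour_predict(train, x):
    # Hypothetical learner for illustration: predict the label of the
    # closest training example (1-nearest-neighbour on a 1-D feature).
    return min(train, key=lambda ex: abs(ex[0] - x))[1]

def loocv_error(data):
    # Train on n-1 examples, test on the single left-out example,
    # then average the n test errors to estimate the true error.
    errors = 0
    for i in range(len(data)):
        holdout = data[i]
        train = data[:i] + data[i + 1:]
        if nearest_neighbour_predict(train, holdout[0]) != holdout[1]:
            errors += 1
    return errors / len(data)

# Toy data set: (feature, label) pairs, cleanly separated into two classes.
data = [(0.0, "a"), (0.1, "a"), (0.2, "a"), (1.0, "b"), (1.1, "b"), (1.2, "b")]
print(loocv_error(data))  # 0.0 on this cleanly separated toy set
```

Note that the learner is retrained n times, which is exactly the computational cost the next section addresses.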

k-Fold Cross Validation

k-fold avoids the computational cost of LOOCV by partitioning the data into k (equally sized, disjoint) subsets, rather than n, so the model is retrained only k times.

Each subset is used once as the test set while the remaining k - 1 subsets form the training set.

k = 10 is usually more than reasonable, and k = 3 can be used if learning is slow.
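The partitioning step can be sketched as follows. This is a minimal, unshuffled sketch; the round-robin assignment and the function name `k_fold_splits` are assumptions for illustration (in practice the data would usually be shuffled first).

```python
def k_fold_splits(data, k):
    # Partition data into k disjoint folds (round-robin assignment),
    # then yield each (train, test) pair in turn.
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i
                 for ex in fold]
        yield train, test

# Each of the 5 folds serves as the test set exactly once.
for train, test in k_fold_splits(list(range(10)), 5):
    print(sorted(test), len(train))  # 2 test examples, 8 training examples
```

Every example appears in exactly one test fold, so across the k iterations each example is tested exactly once.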

Stratification is important: it is the process of ensuring that the class distribution in each subset matches that of the complete data set. This makes the resulting error estimate more reliable.
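One simple way to stratify is to group the examples by class and deal each class out across the folds. This is a sketch under that assumption; the function name `stratified_folds` and the round-robin dealing strategy are illustrative choices, not a standard API.

```python
from collections import defaultdict

def stratified_folds(labelled, k):
    # Group (example, label) pairs by class, then deal each class
    # round-robin across the k folds so every fold mirrors the
    # overall class distribution.
    by_class = defaultdict(list)
    for example, label in labelled:
        by_class[label].append((example, label))
    folds = [[] for _ in range(k)]
    for examples in by_class.values():
        for i, ex in enumerate(examples):
            folds[i % k].append(ex)
    return folds

# 6 positive and 6 negative examples split into 3 folds:
data = [(i, "pos") for i in range(6)] + [(i, "neg") for i in range(6)]
for fold in stratified_folds(data, 3):
    labels = [label for _, label in fold]
    print(labels.count("pos"), labels.count("neg"))  # 2 2 in every fold
```

Here every fold keeps the 50/50 class balance of the full data set, which a purely random split would only achieve on average.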