Model Validation in Machine Learning

Model validation is an important step in machine learning. Cross-validation and bootstrapping are resampling methods commonly used for it: cross-validation resamples without replacement, while the bootstrap resamples with replacement.

Reference: S. Raschka, "Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning", https://arxiv.org/pdf/1811.12808.pdf

Cross-validation

Cross-validation is a family of methods for estimating the true error of a model, to ensure that the trained model also performs well on unseen data. The methods include:

  • Hold-out validation
  • k-fold cross-validation
  • Leave-one-out cross-validation

Hold-out validation

The raw data set is split into two parts: a training set used to fit the model, and a validation set used to estimate the model's error.

Typically 20% or 30% of the data is held out as the validation set. The split is randomly stratified on the target variable Y to reduce the bias between the training/validation sets and the full data set.

Advantages and disadvantages:

  • The method is simple: it requires only a single random partition and has low computational cost.
  • Performance measured on the validation set can fluctuate considerably, because each random partition is different.
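
A minimal sketch of a stratified hold-out split, assuming scikit-learn is available; the iris data, logistic regression model, and 20% split size are arbitrary choices for illustration, not part of the method:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data set; stands in for any feature matrix X and target y.
X, y = load_iris(return_X_y=True)

# Hold out 20% of the data, stratified on y so class proportions match.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
```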

k-fold cross-validation

In k-fold cross-validation, the raw data is randomly split into k equal-sized subsets (folds). One fold is held out as the validation data, and the remaining k-1 folds are used as the training data. This process is repeated k times, so that each fold serves as the validation set exactly once, and the k error estimates are averaged.

Advantages and disadvantages:

  • Eventually, all of the data is used for fitting the model.
  • The estimate of the test error tends to be slightly pessimistic (biased upward), because each model is trained on only (k-1)/k of the data.
  • If K is too high (the extreme case is K = n), the error estimate has high variance; if K is too low (e.g., 2 or 3), it has high bias. K = 5 or K = 10 is the usual choice.
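
A minimal sketch of k-fold cross-validation with scikit-learn's cross_val_score; the data set and model are again stand-ins, and K = 5 matches the usual choice above:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cv=5 splits the data into 5 folds; each fold is used once for validation.
scores = cross_val_score(model, X, y, cv=5)
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean(), "+/-", scores.std())
```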

Leave-one-out cross-validation

A special case of k-fold cross-validation with K = n: each iteration uses n-1 samples to fit the model and the remaining single sample for evaluation, repeated n times.
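
The same sketch adapts to leave-one-out by swapping in scikit-learn's LeaveOneOut splitter (still an illustrative setup, not a prescription); note that n separate models are fit, which is expensive for large n:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# n models are fit; each is evaluated on the single held-out sample.
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", scores.mean())  # mean of n 0/1 outcomes
```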

Bootstrapping

The bootstrap is a resampling method with replacement; the same idea underlies the bagging step of random forests.

  • Resample the original data set with replacement K times, each time drawing a sample of the same size as the original.
  • Compute the statistic of interest (e.g., the mean or standard deviation) on each bootstrap sample, or fit the model on each sample to obtain parameter estimates; this yields a bootstrap distribution of the statistic/parameter that approximates the sampling distribution that would be obtained by repeatedly sampling from the population.
  • The mean of the bootstrap distribution serves as an estimate of the population parameter.
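
As an illustration of these three steps, here is a NumPy-only sketch that bootstraps the mean of a synthetic sample; the data and K = 1000 resamples are arbitrary choices for this example:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=100)  # the "original" sample

K = 1000
# Steps 1-2: draw K same-size resamples with replacement and compute the
# statistic of interest (here, the mean) on each one.
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(K)
])

# Step 3: the bootstrap distribution approximates the sampling distribution.
print("bootstrap estimate of the mean:", boot_means.mean())
print("bootstrap standard error:", boot_means.std(ddof=1))
```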