Early stopping
In machine learning, early stopping is a form of regularization used when a machine learning model (such as a neural network) is trained by on-line gradient descent. In early stopping, the training set is split into a new training set and a validation set. Gradient descent is applied to the new training set. After each sweep through the new training set, the network is evaluated on the validation set. When the performance on the validation set stops improving, the algorithm halts. The network with the best performance on the validation set is then used for actual testing on a separate set of data (the validation set is used only during learning, to decide when to stop).
This technique is a simple but effective way to deal with the problem of overfitting. Overfitting is a phenomenon in which a learning system, such as a neural network, becomes very good at dealing with one data set at the expense of becoming very bad at dealing with other data sets. Early stopping effectively limits the weights the network can reach and thus imposes a form of regularization, lowering the effective VC dimension.
Early stopping is a very common practice in neural network training and often produces networks that generalize well. However, while it often improves generalization, it does not do so in a mathematically well-defined way.
Method
- Divide the available data into training and validation sets.
- Use a large number of hidden units.
- Use very small random initial values.
- Use a slow learning rate.
- Compute the validation error rate periodically during training.
- Stop training when the validation error rate "starts to go up".
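The checklist above can be turned into a small, self-contained example. The sketch below uses NumPy, a one-hidden-layer network, and synthetic data; the network size, learning rate, and the exact stopping rule ("no new validation minimum for 500 epochs") are illustrative choices rather than part of the method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: noisy samples of a smooth target function.
X = rng.uniform(-3, 3, size=(120, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=120)

# 1. Divide the available data into training and validation sets.
idx = rng.permutation(len(X))
X_tr, y_tr = X[idx[:80]], y[idx[:80]]
X_val, y_val = X[idx[80:]], y[idx[80:]]

# 2./3. A large number of hidden units, very small random initial weights.
n_hidden = 50
W1 = 0.01 * rng.normal(size=(1, n_hidden))
b1 = np.zeros(n_hidden)
w2 = 0.01 * rng.normal(size=n_hidden)
b2 = 0.0

def forward(X, W1, b1, w2, b2):
    H = np.tanh(X @ W1 + b1)          # hidden-unit activations
    return H, H @ w2 + b2             # activations and network output

def mse(pred, target):
    return 0.5 * np.mean((pred - target) ** 2)

lr = 0.05                             # 4. A slow learning rate.
best_val, best_epoch, best_params = np.inf, 0, None

for epoch in range(10000):
    # One full-batch gradient descent step on the training set.
    H, pred = forward(X_tr, W1, b1, w2, b2)
    err = (pred - y_tr) / len(y_tr)   # d(loss)/d(pred) for the mse loss above
    grad_w2 = H.T @ err
    grad_b2 = err.sum()
    dZ = np.outer(err, w2) * (1 - H ** 2)   # back-propagate through tanh
    grad_W1 = X_tr.T @ dZ
    grad_b1 = dZ.sum(axis=0)
    W1 -= lr * grad_W1
    b1 -= lr * grad_b1
    w2 -= lr * grad_w2
    b2 -= lr * grad_b2

    # 5. Compute the validation error periodically during training.
    if epoch % 10 == 0:
        val_err = mse(forward(X_val, W1, b1, w2, b2)[1], y_val)
        if val_err < best_val:
            best_val, best_epoch = val_err, epoch
            best_params = (W1.copy(), b1.copy(), w2.copy(), b2)
        # 6. Stop when the validation error "starts to go up"
        #    (here: no new minimum for 500 consecutive epochs).
        elif epoch - best_epoch > 500:
            break

W1, b1, w2, b2 = best_params          # keep the weights with the lowest validation error
print(f"stopped at epoch {epoch}; best validation error {best_val:.4f} at epoch {best_epoch}")
```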
It is crucial to realize that the validation error is not a good estimate of the generalization error. One method for getting an unbiased estimate of the generalization error is to run the network on a third set of data, the test set, that is not used at all during the training process. The error on the test set gives an estimate of the generalization error, that is, of how well the outputs of the network approximate the target values for inputs that are not in the training set.
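A minimal sketch of such a three-way split, assuming the data are held in NumPy arrays X and y (the split fractions are illustrative):

```python
import numpy as np

def three_way_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Split the data into training, validation and test portions.
    Only the training and validation portions take part in early stopping;
    the test portion is evaluated once, at the very end, to estimate the
    generalization error of the selected network."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(test_frac * len(X))
    n_val = int(val_frac * len(X))
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))
```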
Advantages
Early stopping has several advantages:
- It is fast.
- It can be applied successfully to networks in which the number of weights far exceeds the sample size.
- It requires only one major decision by the user: what proportion of validation cases to use.
Issues
- It is not clear how many cases to assign to the training and validation sets.
- The result may depend heavily on how the data are split into training and validation sets.
- The notion of "increasing validation error" is ambiguous; the validation error may go up and down numerous times during training. The safest approach is to train to convergence and then determine which iteration had the lowest validation error, but this impairs fast training, one of the advantages of early stopping.
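A common compromise, sketched below, is to tolerate a noisy validation curve by continuing training until no new validation minimum has been seen for a fixed number of epochs (a "patience" window) and then restoring the best checkpoint; the caller-supplied callables are the same illustrative stand-ins as in the sketch in the introduction.

```python
import copy

def train_with_patience(net, train_one_epoch, validation_error,
                        patience=20, max_epochs=5000):
    """Early stopping with a patience window: keep training until the
    validation error has failed to reach a new minimum for `patience`
    consecutive epochs (or until max_epochs), then return the checkpoint
    that achieved the lowest validation error."""
    best_error = float("inf")
    best_net = copy.deepcopy(net)
    epochs_since_best = 0
    for _ in range(max_epochs):
        train_one_epoch(net)
        error = validation_error(net)
        if error < best_error:
            best_error = error
            best_net = copy.deepcopy(net)
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break  # no improvement for `patience` epochs: stop
    return best_net
```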
See also
- Overfitting, which early stopping is one of the methods used to prevent
- Cross-validation, in particular using a "validation set"
- Generalization error
External links
- What is early stopping? Internet FAQ Archives