Regularization (mathematics)
In mathematics and statistics, particularly in the fields of machine learning and inverse problems, regularization involves introducing additional information in order to solve an ill-posed problem or to prevent overfitting. This information usually takes the form of a penalty for complexity, such as restrictions on smoothness or bounds on the vector space norm.
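As an illustrative formulation (not part of the original article), such a penalty is typically added to a data-fit term, with a parameter λ controlling the trade-off between the two:

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\; \underbrace{\|y - X\beta\|_2^2}_{\text{data fit}} \;+\; \lambda\, \underbrace{R(\beta)}_{\text{complexity penalty}}, \qquad \lambda \ge 0
```

Here R is a norm-based penalty such as \|\beta\|_2^2 or \|\beta\|_1; larger values of λ favour smoother or smaller-norm solutions.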
A theoretical justification for regularization is that it attempts to impose Occam's razor on the solution. From a Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions on model parameters.
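As a sketch of this correspondence (an illustrative addition, not from the original article), maximum a posteriori estimation under a Gaussian likelihood with a zero-mean Gaussian prior on the coefficients reduces to a squared-L2 (ridge/Tikhonov) penalty:

```latex
\hat{\beta}_{\mathrm{MAP}}
  = \arg\max_{\beta}\,\bigl[\log p(y \mid X, \beta) + \log p(\beta)\bigr]
  = \arg\min_{\beta}\,\|y - X\beta\|_2^2 + \lambda\,\|\beta\|_2^2,
  \qquad \lambda = \sigma^2 / \tau^2
```

where σ² is the noise variance and τ² the prior variance; a Laplace prior on the coefficients analogously yields the L1 (lasso) penalty.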
The same idea arose in many fields of science. For example, the least-squares method can be viewed as a very simple form of regularization. A simple form of regularization applied to integral equations, generally termed Tikhonov regularization after Andrey Nikolayevich Tikhonov, is essentially a trade-off between fitting the data and reducing a norm of the solution. More recently, non-linear regularization methods, including total variation regularization, have become popular.
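A minimal numerical sketch of this trade-off (illustrative code, not from the original article; the helper name `ridge_solution` is hypothetical), using the closed-form Tikhonov/ridge estimate (XᵀX + λI)⁻¹Xᵀy:

```python
import numpy as np

def ridge_solution(X, y, lam):
    """Closed-form Tikhonov (ridge) estimate: (X^T X + lam*I)^{-1} X^T y."""
    n_features = X.shape[1]
    # Larger lam shrinks the norm of the solution; lam = 0 recovers ordinary least squares.
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Ill-conditioned toy problem: two nearly collinear predictors.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
X = np.column_stack([x1, x1 + 1e-3 * rng.normal(size=50)])
y = X @ np.array([1.0, 1.0]) + 0.1 * rng.normal(size=50)

print(ridge_solution(X, y, lam=0.0))   # unregularized: unstable, large coefficients
print(ridge_solution(X, y, lam=1.0))   # regularized: smaller-norm, more stable solution
```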
Regularization in statistics

In statistics and machine learning, regularization is used to prevent overfitting. Typical examples of regularization in statistical machine learning include ridge regression, the lasso, and the L2 norm in support vector machines.
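For concreteness (an illustrative sketch assuming scikit-learn is available; not part of the original article), ridge regression and the lasso apply different penalties to the same least-squares fit:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
beta_true = np.zeros(10)
beta_true[:3] = [2.0, -1.5, 0.5]          # sparse ground truth
y = X @ beta_true + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)        # squared-L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)        # L1 penalty: drives many coefficients to exactly zero

print(np.round(ridge.coef_, 3))
print(np.round(lasso.coef_, 3))
```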
Regularization methods are also used for model selection, where they work by implicitly or explicitly penalizing models based on the number of their parameters. For example, Bayesian learning methods make use of a prior probability that (usually) gives lower probability to more complex models. Well-known model selection techniques include the Akaike information criterion (AIC), minimum description length (MDL), and the Bayesian information criterion (BIC). Alternative methods of controlling overfitting not involving regularization include cross-validation.
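As a hedged illustration of criterion-based model selection (not from the original article; the helper name `aic_bic` is hypothetical), AIC and BIC for a Gaussian linear model can be computed from the residual sum of squares:

```python
import numpy as np

def aic_bic(X, y):
    """AIC and BIC for an ordinary least-squares fit with Gaussian errors
    (additive constants dropped): n*log(RSS/n) + penalty * k."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    aic = n * np.log(rss / n) + 2 * k
    bic = n * np.log(rss / n) + k * np.log(n)
    return aic, bic

# Compare a small model against one padded with irrelevant predictors.
rng = np.random.default_rng(2)
X_small = rng.normal(size=(200, 3))
y = X_small @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)
X_big = np.column_stack([X_small, rng.normal(size=(200, 7))])

print(aic_bic(X_small, y))
print(aic_bic(X_big, y))   # the extra noise predictors typically make both criteria larger (worse)
```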
Examples of applications of different methods of regularization to the linear model are:
Model | Fit measure | Entropy measure |
---|---|---|
AIC/BIC | $\|y - X\beta\|_2$ | $\|\beta\|_0$ |
Ridge regression | $\|y - X\beta\|_2$ | $\|\beta\|_2$ |
Lasso | $\|y - X\beta\|_2$ | $\|\beta\|_1$ |
Basis pursuit denoising | $\|y - X\beta\|_2$ | $\lambda\|\beta\|_1$ |
RLAD | $\|y - X\beta\|_1$ | $\|\beta\|_1$ |
Dantzig Selector | $\|X^\top (y - X\beta)\|_\infty$ | $\|\beta\|_1$ |
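Each row pairs a data-fit measure with a complexity (entropy) measure, combined either as a penalized objective or as a constrained problem. For instance (an illustrative rendering, not part of the original table), the lasso row corresponds to the penalized form

```latex
\hat{\beta}_{\text{lasso}} = \arg\min_{\beta}\,\|y - X\beta\|_2^2 + \lambda\,\|\beta\|_1,
```

where the fit term is usually written with a squared norm and λ ≥ 0 controls the strength of the penalty.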