Tikhonov regularization
Tikhonov regularization, named for Andrey Tikhonov, is the most commonly used method of regularization
of ill-posed problems. In statistics
, the method is known as ridge regression, and, with multiple independent discoveries, it is also variously known as the Tikhonov–Miller method, the Phillips–Twomey method, the constrained linear inversion method, and the method of linear regularization. It is related to the Levenberg–Marquardt algorithm
for non-linear least-squares
problems. When the following problem is not well posed
(either because of non-existence or non-uniqueness of $x$),

$$Ax = b,$$

then the standard approach is known as linear least squares and seeks to minimize the residual

$$\|Ax - b\|^2,$$

where $\|\cdot\|$ is the Euclidean norm. The system may be overdetermined
or underdetermined
($A$ may be ill-conditioned or singular). In the latter case this is no better than the original problem. In order to give preference to a particular solution with desirable properties, the regularization term is included in this minimization:

$$\|Ax - b\|^2 + \|\Gamma x\|^2$$

for some suitably chosen Tikhonov matrix, $\Gamma$. In many cases, this matrix is chosen as a multiple of the identity matrix ($\Gamma = \alpha I$)
, giving preference to solutions with smaller norms
. In other cases, highpass operators (e.g., a difference operator or a weighted Fourier operator
) may be used to enforce smoothness if the underlying vector is believed to be mostly continuous. This regularization improves the conditioning of the problem, thus enabling a numerical solution. An explicit solution, denoted by $\hat{x}$, is given by

$$\hat{x} = \left( A^\top A + \Gamma^\top \Gamma \right)^{-1} A^\top b.$$
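A minimal NumPy sketch of this explicit solution with $\Gamma = \alpha I$ (the matrix sizes, seed, and value of alpha below are illustrative, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)

# An ill-conditioned system: the third column nearly duplicates the first.
A = rng.standard_normal((50, 3))
A[:, 2] = A[:, 0] + 1e-6 * rng.standard_normal(50)
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true + 0.01 * rng.standard_normal(50)

alpha = 0.1
Gamma = alpha * np.eye(A.shape[1])   # Tikhonov matrix Gamma = alpha * I

# x_hat = (A^T A + Gamma^T Gamma)^{-1} A^T b, computed via a linear solve
# rather than an explicit matrix inverse.
x_hat = np.linalg.solve(A.T @ A + Gamma.T @ Gamma, A.T @ b)
print(x_hat)
```

Solving the regularized normal equations with a linear solve avoids forming an explicit inverse; an equivalent and often better-conditioned route is a least-squares solve on the stacked system $[A; \Gamma]$, $[b; 0]$.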
Bayesian interpretation
Although at first the choice of the solution to this regularized problem may look artificial, and indeed the matrix $\Gamma$ seems rather arbitrary, the process can be justified from a Bayesian point of view
. Note that for an ill-posed problem one must necessarily introduce some additional assumptions in order to get a stable solution. Statistically we might assume that a priori
we know that $x$ is a random variable with a multivariate normal distribution. For simplicity we take the mean to be zero and assume that each component is independent with standard deviation $\sigma_x$
. Our data are also subject to errors, and we take the errors in $b$ to be also independent
with zero mean and standard deviation $\sigma_b$. Under these assumptions the Tikhonov-regularized solution is the most probable
solution given the data and the a priori distribution of $x$, according to Bayes' theorem
. The Tikhonov matrix is then $\Gamma = \alpha I$ for Tikhonov factor $\alpha = \sigma_b / \sigma_x$. If the assumption of normality is replaced by assumptions of homoskedasticity and uncorrelatedness of errors
, and if one still assumes zero mean, then the Gauss–Markov theorem entails that the solution is the minimal unbiased estimate
.
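As a numerical check of this interpretation (sizes, seed, and standard deviations below are illustrative), the posterior mean of the Gaussian model coincides with the Tikhonov solution for $\alpha = \sigma_b / \sigma_x$:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 4))
b = rng.standard_normal(30)
sx, sb = 2.0, 0.5          # prior and noise standard deviations (illustrative)
alpha = sb / sx            # Tikhonov factor from the text

# Tikhonov solution with Gamma = alpha * I.
x_tik = np.linalg.solve(A.T @ A + alpha**2 * np.eye(4), A.T @ b)

# Posterior mean (= MAP for a Gaussian posterior) of the model
# x ~ N(0, sx^2 I), b | x ~ N(Ax, sb^2 I).
x_map = np.linalg.solve(A.T @ A / sb**2 + np.eye(4) / sx**2, A.T @ b / sb**2)

print(np.allclose(x_tik, x_map))   # True
```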
Generalized Tikhonov regularization
For general multivariate normal distributions for $x$ and the data error, one can apply a transformation of the variables to reduce to the case above. Equivalently, one can seek an $x$ to minimize

$$\left\| Ax - b \right\|_P^2 + \left\| x - x_0 \right\|_Q^2,$$

where we have used $\|x\|_Q^2$ to stand for the weighted norm $x^\top Q x$ (cf. the Mahalanobis distance
). In the Bayesian interpretation $P$ is the inverse covariance matrix
of $b$, $x_0$ is the expected value
of $x$, and $Q$ is the inverse covariance matrix of $x$. The Tikhonov matrix is then given as a factorization of the matrix $Q = \Gamma^\top \Gamma$ (e.g. the Cholesky factorization), and is considered a whitening filter. This generalized problem can be solved explicitly using the formula

$$x^* = x_0 + \left( A^\top P A + Q \right)^{-1} A^\top P \left( b - A x_0 \right).$$
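A short sketch of this closed-form solution; the choices of $P$, $Q$, and $x_0$ below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 20, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x0 = np.ones(n)                  # prior mean of x (illustrative)
P = np.eye(m) / 0.25             # inverse covariance of the data error
Q = np.eye(n) / 4.0              # inverse covariance of x

# x* = x0 + (A^T P A + Q)^{-1} A^T P (b - A x0)
x_star = x0 + np.linalg.solve(A.T @ P @ A + Q, A.T @ P @ (b - A @ x0))

# The corresponding Tikhonov matrix is a factor of Q, e.g. from Cholesky:
Gamma = np.linalg.cholesky(Q).T  # so that Q = Gamma^T Gamma
print(x_star)
```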
Regularization in Hilbert space
Typically discrete linear ill-conditioned problems result from discretization of integral equations, and one can formulate a Tikhonov regularization in the original infinite-dimensional context. In the above we can interpret $A$ as a compact operator
on Hilbert space
s, and $x$ and $b$ as elements in the domain and range of $A$. The operator $A^* A + \Gamma^\top \Gamma$ is then a self-adjoint
bounded invertible operator.
Relation to singular value decomposition and Wiener filter
With $\Gamma = \alpha I$, this least squares solution can be analyzed in a special way via the singular value decomposition
. Given the singular value decomposition of $A$,

$$A = U \Sigma V^\top,$$

with singular values $\sigma_i$, the Tikhonov regularized solution can be expressed as

$$\hat{x} = V D U^\top b,$$

where $D$ has diagonal values

$$D_{ii} = \frac{\sigma_i}{\sigma_i^2 + \alpha^2}$$

and is zero elsewhere. This demonstrates the effect of the Tikhonov parameter on the condition number
of the regularized problem. For the generalized case a similar representation can be derived using a generalized singular value decomposition
. Finally, it is related to the Wiener filter
:

$$\hat{x} = \sum_{i=1}^q f_i \frac{u_i^\top b}{\sigma_i} v_i,$$

where the Wiener weights are $f_i = \frac{\sigma_i^2}{\sigma_i^2 + \alpha^2}$ and $q$ is the rank
of $A$.
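The following sketch (with illustrative data) confirms numerically that the SVD filter-factor form and the Wiener-weight form agree with the direct normal-equations solution:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((25, 4))
b = rng.standard_normal(25)
alpha = 0.3

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD: A = U diag(s) Vt
d = s / (s**2 + alpha**2)                         # regularized reciprocal of each sigma_i
x_svd = Vt.T @ (d * (U.T @ b))                    # x_hat = V D U^T b

# The same filtering written with Wiener weights f_i = s_i^2 / (s_i^2 + alpha^2):
f = s**2 / (s**2 + alpha**2)
x_wiener = Vt.T @ (f * (U.T @ b) / s)

x_dir = np.linalg.solve(A.T @ A + alpha**2 * np.eye(4), A.T @ b)
print(np.allclose(x_svd, x_dir), np.allclose(x_wiener, x_dir))  # True True
```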
Determination of the Tikhonov factor
The optimal regularization parameter $\alpha$ is usually unknown and often in practical problems is determined by an ad hoc method. A possible approach relies on the Bayesian interpretation described above. Other approaches include the discrepancy principle, cross-validation, the L-curve method, restricted maximum likelihood,
and the unbiased predictive risk estimator. Grace Wahba
proved that the optimal parameter, in the sense of leave-one-out cross-validation, minimizes

$$G = \frac{\operatorname{RSS}}{\tau^2} = \frac{\left\| A \hat{x} - b \right\|^2}{\left[ \operatorname{Tr}\left( I - A \left( A^\top A + \alpha^2 I \right)^{-1} A^\top \right) \right]^2},$$

where $\operatorname{RSS}$ is the residual sum of squares
and $\tau$ is the effective number of degrees of freedom
. Using the previous SVD, we can simplify the above expression:

$$\operatorname{RSS} = \left\| b - \sum_{i=1}^q (u_i^\top b)\, u_i \right\|^2 + \left\| \sum_{i=1}^q \frac{\alpha^2}{\sigma_i^2 + \alpha^2} (u_i^\top b)\, u_i \right\|^2$$

and

$$\tau = m - \sum_{i=1}^q \frac{\sigma_i^2}{\sigma_i^2 + \alpha^2} = m - q + \sum_{i=1}^q \frac{\alpha^2}{\sigma_i^2 + \alpha^2},$$

where $m$ is the number of rows of $A$.
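A sketch of selecting $\alpha$ by minimizing this criterion over a grid, using the SVD expressions above (data, grid bounds, and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 40, 5
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
beta = U.T @ b                      # the coefficients u_i^T b
rss0 = b @ b - beta @ beta          # ||b - sum_i (u_i^T b) u_i||^2

def gcv(alpha):
    shrink = alpha**2 / (s**2 + alpha**2)
    rss = rss0 + np.sum((shrink * beta) ** 2)       # RSS from the SVD expression
    tau = m - np.sum(s**2 / (s**2 + alpha**2))      # effective degrees of freedom
    return rss / tau**2

alphas = np.logspace(-4, 1, 200)
best = alphas[np.argmin([gcv(a) for a in alphas])]
print(best)
```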