Tikhonov regularization
Tikhonov regularization, named for Andrey Tikhonov, is the most commonly used method of regularization
of ill-posed problems. In statistics
, the method is known as ridge regression, and, with multiple independent discoveries, it is also variously known as the Tikhonov–Miller method, the Phillips–Twomey method, the constrained linear inversion method, and the method of linear regularization. It is related to the Levenberg–Marquardt algorithm
for non-linear least-squares
problems. When the following problem is not well posed
(either because of non-existence or non-uniqueness of $x$),

$$Ax = b,$$

then the standard approach is known as linear least squares and seeks to minimize the residual

$$\|Ax - b\|^2,$$

where $\|\cdot\|$ is the Euclidean norm. The system may be overdetermined
or underdetermined
($A$ may be ill-conditioned or singular). In the latter case this is no better than the original problem. In order to give preference to a particular solution with desirable properties, the regularization term is included in this minimization:

$$\|Ax - b\|^2 + \|\Gamma x\|^2$$

for some suitably chosen Tikhonov matrix, $\Gamma$. In many cases, this matrix is chosen as a multiple of the identity matrix ($\Gamma = \alpha I$)
, giving preference to solutions with smaller norms
. In other cases, highpass operators (e.g., a difference operator or a weighted Fourier operator
) may be used to enforce smoothness if the underlying vector is believed to be mostly continuous. This regularization improves the conditioning of the problem, thus enabling a numerical solution. An explicit solution, denoted by $\hat{x}$, is given by

$$\hat{x} = \left( A^\top A + \Gamma^\top \Gamma \right)^{-1} A^\top b.$$
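A minimal NumPy sketch of this explicit solution with $\Gamma = \alpha I$ (the matrix sizes, seed, and value of alpha below are illustrative, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)

# An ill-conditioned system: the third column nearly duplicates the first.
A = rng.standard_normal((50, 3))
A[:, 2] = A[:, 0] + 1e-6 * rng.standard_normal(50)
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true + 0.01 * rng.standard_normal(50)

alpha = 0.1
Gamma = alpha * np.eye(A.shape[1])   # Tikhonov matrix Gamma = alpha * I

# x_hat = (A^T A + Gamma^T Gamma)^{-1} A^T b, computed via a linear solve
# rather than an explicit matrix inverse.
x_hat = np.linalg.solve(A.T @ A + Gamma.T @ Gamma, A.T @ b)
print(x_hat)
```

Solving the regularized normal equations with a linear solve avoids forming an explicit inverse; an equivalent and often better-conditioned route is a least-squares solve on the stacked system $[A; \Gamma]$, $[b; 0]$.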
Bayesian interpretation
Although at first the choice of the solution to this regularized problem may look artificial, and indeed the matrix $\Gamma$ seems rather arbitrary, the process can be justified from a Bayesian point of view
. Note that for an ill-posed problem one must necessarily introduce some additional assumptions in order to get a stable solution. Statistically we might assume that a priori
we know that $x$ is a random variable with a multivariate normal distribution. For simplicity we take the mean to be zero and assume that each component is independent with standard deviation $\sigma_x$
. Our data are also subject to errors, and we take the errors in $b$ to be also independent
with zero mean and standard deviation $\sigma_b$. Under these assumptions the Tikhonov-regularized solution is the most probable
solution given the data and the a priori distribution of $x$, according to Bayes' theorem
. The Tikhonov matrix is then $\Gamma = \alpha I$ for Tikhonov factor $\alpha = \sigma_b / \sigma_x$. If the assumption of normality is replaced by assumptions of homoskedasticity and uncorrelatedness of errors
, and if one still assumes zero mean, then the Gauss–Markov theorem entails that the solution is the minimal unbiased estimate
.
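As a numerical check of this interpretation (sizes, seed, and standard deviations below are illustrative), the posterior mean of the Gaussian model coincides with the Tikhonov solution for $\alpha = \sigma_b / \sigma_x$:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 4))
b = rng.standard_normal(30)
sx, sb = 2.0, 0.5          # prior and noise standard deviations (illustrative)
alpha = sb / sx            # Tikhonov factor from the text

# Tikhonov solution with Gamma = alpha * I.
x_tik = np.linalg.solve(A.T @ A + alpha**2 * np.eye(4), A.T @ b)

# Posterior mean (= MAP for a Gaussian posterior) of the model
# x ~ N(0, sx^2 I), b | x ~ N(Ax, sb^2 I).
x_map = np.linalg.solve(A.T @ A / sb**2 + np.eye(4) / sx**2, A.T @ b / sb**2)

print(np.allclose(x_tik, x_map))   # True
```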
Generalized Tikhonov regularization
For general multivariate normal distributions for $x$ and the data error, one can apply a transformation of the variables to reduce to the case above. Equivalently, one can seek an $x$ to minimize

$$\left\| Ax - b \right\|_P^2 + \left\| x - x_0 \right\|_Q^2,$$

where we have used $\|x\|_Q^2$ to stand for the weighted norm $x^\top Q x$ (cf. the Mahalanobis distance
). In the Bayesian interpretation $P$ is the inverse covariance matrix
of $b$, $x_0$ is the expected value
of $x$, and $Q$ is the inverse covariance matrix of $x$. The Tikhonov matrix is then given as a factorization of the matrix $Q = \Gamma^\top \Gamma$ (e.g. the Cholesky factorization), and is considered a whitening filter. This generalized problem can be solved explicitly using the formula

$$x^* = x_0 + \left( A^\top P A + Q \right)^{-1} A^\top P \left( b - A x_0 \right).$$
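A short sketch of this closed-form solution; the choices of $P$, $Q$, and $x_0$ below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 20, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x0 = np.ones(n)                  # prior mean of x (illustrative)
P = np.eye(m) / 0.25             # inverse covariance of the data error
Q = np.eye(n) / 4.0              # inverse covariance of x

# x* = x0 + (A^T P A + Q)^{-1} A^T P (b - A x0)
x_star = x0 + np.linalg.solve(A.T @ P @ A + Q, A.T @ P @ (b - A @ x0))

# The corresponding Tikhonov matrix is a factor of Q, e.g. from Cholesky:
Gamma = np.linalg.cholesky(Q).T  # so that Q = Gamma^T Gamma
print(x_star)
```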
Regularization in Hilbert space
Typically discrete linear ill-conditioned problems result from discretization of integral equations, and one can formulate a Tikhonov regularization in the original infinite-dimensional context. In the above we can interpret $A$ as a compact operator
on Hilbert space
s, and $x$ and $b$ as elements in the domain and range of $A$. The operator $A^* A + \Gamma^\top \Gamma$ is then a self-adjoint
bounded invertible operator.
Relation to singular value decomposition and Wiener filter
With $\Gamma = \alpha I$, this least squares solution can be analyzed in a special way via the singular value decomposition
. Given the singular value decomposition of $A$,

$$A = U \Sigma V^\top,$$

with singular values $\sigma_i$, the Tikhonov regularized solution can be expressed as

$$\hat{x} = V D U^\top b,$$

where $D$ has diagonal values

$$D_{ii} = \frac{\sigma_i}{\sigma_i^2 + \alpha^2}$$

and is zero elsewhere. This demonstrates the effect of the Tikhonov parameter on the condition number
of the regularized problem. For the generalized case a similar representation can be derived using a generalized singular value decomposition
. Finally, it is related to the Wiener filter
:

$$\hat{x} = \sum_{i=1}^q f_i \frac{u_i^\top b}{\sigma_i} v_i,$$

where the Wiener weights are $f_i = \frac{\sigma_i^2}{\sigma_i^2 + \alpha^2}$ and $q$ is the rank
of $A$.
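The following sketch (with illustrative data) confirms numerically that the SVD filter-factor form and the Wiener-weight form agree with the direct normal-equations solution:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((25, 4))
b = rng.standard_normal(25)
alpha = 0.3

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD: A = U diag(s) Vt
d = s / (s**2 + alpha**2)                         # regularized reciprocal of each sigma_i
x_svd = Vt.T @ (d * (U.T @ b))                    # x_hat = V D U^T b

# The same filtering written with Wiener weights f_i = s_i^2 / (s_i^2 + alpha^2):
f = s**2 / (s**2 + alpha**2)
x_wiener = Vt.T @ (f * (U.T @ b) / s)

x_dir = np.linalg.solve(A.T @ A + alpha**2 * np.eye(4), A.T @ b)
print(np.allclose(x_svd, x_dir), np.allclose(x_wiener, x_dir))  # True True
```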
Determination of the Tikhonov factor
The optimal regularization parameter $\alpha$ is usually unknown and often in practical problems is determined by an ad hoc method. A possible approach relies on the Bayesian interpretation described above. Other approaches include the discrepancy principle, cross-validation, the L-curve method, restricted maximum likelihood,
and the unbiased predictive risk estimator. Grace Wahba
proved that the optimal parameter, in the sense of leave-one-out cross-validation, minimizes

$$G = \frac{\operatorname{RSS}}{\tau^2} = \frac{\left\| A \hat{x} - b \right\|^2}{\left[ \operatorname{Tr}\left( I - A \left( A^\top A + \alpha^2 I \right)^{-1} A^\top \right) \right]^2},$$

where $\operatorname{RSS}$ is the residual sum of squares
and $\tau$ is the effective number of degrees of freedom
. Using the previous SVD, we can simplify the above expression:

$$\operatorname{RSS} = \left\| b - \sum_{i=1}^q (u_i^\top b)\, u_i \right\|^2 + \left\| \sum_{i=1}^q \frac{\alpha^2}{\sigma_i^2 + \alpha^2} (u_i^\top b)\, u_i \right\|^2$$

and

$$\tau = m - \sum_{i=1}^q \frac{\sigma_i^2}{\sigma_i^2 + \alpha^2} = m - q + \sum_{i=1}^q \frac{\alpha^2}{\sigma_i^2 + \alpha^2},$$

where $m$ is the number of rows of $A$.
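A sketch of selecting $\alpha$ by minimizing this criterion over a grid, using the SVD expressions above (data, grid bounds, and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 40, 5
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
beta = U.T @ b                      # the coefficients u_i^T b
rss0 = b @ b - beta @ beta          # ||b - sum_i (u_i^T b) u_i||^2

def gcv(alpha):
    shrink = alpha**2 / (s**2 + alpha**2)
    rss = rss0 + np.sum((shrink * beta) ** 2)       # RSS from the SVD expression
    tau = m - np.sum(s**2 / (s**2 + alpha**2))      # effective degrees of freedom
    return rss / tau**2

alphas = np.logspace(-4, 1, 200)
best = alphas[np.argmin([gcv(a) for a in alphas])]
print(best)
```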