Least squares support vector machine
Least squares support vector machines (LS-SVM) are least squares versions of support vector machines (SVM), which are a set of related supervised learning methods that analyze data and recognize patterns, and which are used for classification and regression analysis. In this version one finds the solution by solving a set of linear equations instead of a convex quadratic programming (QP) problem as for classical SVMs. Least squares SVM classifiers were proposed by Suykens and Vandewalle. LS-SVMs are a class of kernel-based learning methods.
From support vector machine to least squares support vector machine
Given a training set $\{x_i, y_i\}_{i=1}^N$ with input data $x_i \in \mathbb{R}^n$ and corresponding binary class labels $y_i \in \{-1, +1\}$, the SVM classifier, according to Vapnik's original formulation, satisfies the following conditions:
$$\begin{cases} w^T \phi(x_i) + b \ge 1, & \text{if } y_i = +1, \\ w^T \phi(x_i) + b \le -1, & \text{if } y_i = -1, \end{cases}$$
which is equivalent to
$$y_i \left[ w^T \phi(x_i) + b \right] \ge 1, \quad i = 1, \ldots, N,$$
where $\phi(x)$ is the nonlinear map from the original space to the high- (and possibly infinite-) dimensional space.
Inseparable data
In case such a separating hyperplane does not exist, we introduce so-called slack variables $\xi_i$ such that
$$\begin{cases} y_i \left[ w^T \phi(x_i) + b \right] \ge 1 - \xi_i, \\ \xi_i \ge 0, \quad i = 1, \ldots, N. \end{cases}$$
According to the structural risk minimization principle, the risk bound is minimized by the following minimization problem:
$$\min J_1(w, \xi) = \frac{1}{2} w^T w + c \sum_{i=1}^N \xi_i,$$
subject to the constraints above. To solve this problem, we construct the Lagrangian function
$$L_1(w, b, \xi; \alpha, \beta) = \frac{1}{2} w^T w + c \sum_{i=1}^N \xi_i - \sum_{i=1}^N \alpha_i \left\{ y_i \left[ w^T \phi(x_i) + b \right] - 1 + \xi_i \right\} - \sum_{i=1}^N \beta_i \xi_i,$$
where $\alpha_i \ge 0$ and $\beta_i \ge 0$ ($i = 1, \ldots, N$) are the Lagrangian multipliers. The optimal point will be in the saddle point of the Lagrangian function, and then we obtain
$$\begin{cases} \dfrac{\partial L_1}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^N \alpha_i y_i \phi(x_i), \\ \dfrac{\partial L_1}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^N \alpha_i y_i = 0, \\ \dfrac{\partial L_1}{\partial \xi_i} = 0 \;\Rightarrow\; 0 \le \alpha_i \le c, \quad i = 1, \ldots, N. \end{cases}$$
By substituting $w$ by its expression, we get the following quadratic programming problem:
$$\max Q_1(\alpha) = -\frac{1}{2} \sum_{i,j=1}^N \alpha_i \alpha_j y_i y_j K(x_i, x_j) + \sum_{i=1}^N \alpha_i,$$
where $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ is called the kernel function. Solving this QP problem subject to the constraints $\sum_{i=1}^N \alpha_i y_i = 0$ and $0 \le \alpha_i \le c$, we obtain the hyperplane in the high-dimensional space and hence the classifier in the original space.
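To make the contrast with the least squares formulation below concrete, the following sketch solves this dual QP numerically with a general-purpose solver. It is a minimal illustration, not the usual SMO-style SVM solver; the toy data, the fixed RBF bandwidth, and the value c = 1 are all assumptions for the example.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy problem: two Gaussian blobs with labels -1 / +1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.5, (10, 2)), rng.normal(1.0, 0.5, (10, 2))])
y = np.array([-1.0] * 10 + [1.0] * 10)
c = 1.0                                             # box constraint (assumed value)

# RBF kernel matrix K_ij = exp(-||x_i - x_j||^2), bandwidth fixed to 1 here.
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
H = (y[:, None] * y[None, :]) * K                   # H_ij = y_i y_j K(x_i, x_j)

# The dual maximizes Q1(a) = -1/2 a^T H a + 1^T a; we minimize its negative.
res = minimize(
    lambda a: 0.5 * a @ H @ a - a.sum(),
    x0=np.zeros(len(y)),
    method="SLSQP",
    bounds=[(0.0, c)] * len(y),                     # 0 <= alpha_i <= c
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i alpha_i y_i = 0
)
alpha = res.x

# Bias from a support vector strictly inside the box (KKT condition).
i = int(np.argmax((alpha > 1e-6) & (alpha < c - 1e-6)))
b = y[i] - (alpha * y) @ K[:, i]
```

New points are then classified by the sign of $\sum_i \alpha_i y_i K(x, x_i) + b$. Note the inequality constraints are what force an iterative QP solver here; the LS-SVM below replaces them with equalities.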
Least squares SVM formulation
The least squares version of the SVM classifier is obtained by reformulating the minimization problem as
$$\min J_2(w, b, e) = \frac{\mu}{2} w^T w + \frac{\zeta}{2} \sum_{i=1}^N e_i^2,$$
subject to the equality constraints
$$y_i \left[ w^T \phi(x_i) + b \right] = 1 - e_i, \quad i = 1, \ldots, N.$$
The least squares SVM (LS-SVM) classifier formulation above implicitly corresponds to a regression interpretation with binary targets $y_i = \pm 1$.
Using $y_i^2 = 1$, we have
$$\sum_{i=1}^N e_i^2 = \sum_{i=1}^N (y_i e_i)^2 = \sum_{i=1}^N \left( y_i - (w^T \phi(x_i) + b) \right)^2,$$
with $y_i e_i = y_i - (w^T \phi(x_i) + b)$.
Hence the LS-SVM classifier formulation is equivalent to
$$J_2(w, b) = \mu E_W + \zeta E_D,$$
with $E_W = \frac{1}{2} w^T w$ and $E_D = \frac{1}{2} \sum_{i=1}^N e_i^2 = \frac{1}{2} \sum_{i=1}^N \left( y_i - (w^T \phi(x_i) + b) \right)^2$.
Both $\mu$ and $\zeta$ should be considered as hyperparameters to tune the amount of regularization versus the sum squared error. The solution depends only on the ratio $\gamma = \zeta / \mu$, therefore the original formulation uses only $\gamma$ as a tuning parameter. We use both $\mu$ and $\zeta$ as parameters in order to provide a Bayesian interpretation of the LS-SVM.
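To see why only the ratio matters, divide $J_2$ by the positive constant $\mu$, which leaves the minimizers unchanged:
$$\frac{1}{\mu} J_2(w, b) = \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^N e_i^2, \qquad \gamma = \frac{\zeta}{\mu},$$
which is exactly the single-parameter form used in the original formulation.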
The solution of the LS-SVM regressor will be obtained after we construct the Lagrangian function (using the equivalent single-parameter objective with $\gamma = \zeta / \mu$):
$$L_2(w, b, e; \alpha) = J_2(w, b, e) - \sum_{i=1}^N \alpha_i \left\{ w^T \phi(x_i) + b + e_i - y_i \right\},$$
where $\alpha_i \in \mathbb{R}$ are the Lagrange multipliers. The conditions for optimality are
$$\begin{cases} \dfrac{\partial L_2}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^N \alpha_i \phi(x_i), \\ \dfrac{\partial L_2}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^N \alpha_i = 0, \\ \dfrac{\partial L_2}{\partial e_i} = 0 \;\Rightarrow\; \alpha_i = \gamma\, e_i, \quad i = 1, \ldots, N, \\ \dfrac{\partial L_2}{\partial \alpha_i} = 0 \;\Rightarrow\; w^T \phi(x_i) + b + e_i - y_i = 0, \quad i = 1, \ldots, N. \end{cases}$$
Elimination of $w$ and $e$ (substituting $w = \sum_j \alpha_j \phi(x_j)$ and $e_i = \alpha_i / \gamma$ into the last condition) will yield a linear system instead of a quadratic programming problem:
$$\begin{bmatrix} 0 & 1_N^T \\ 1_N & \Omega + \gamma^{-1} I_N \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ Y \end{bmatrix},$$
with $Y = [y_1, \ldots, y_N]^T$, $1_N = [1, \ldots, 1]^T$ and $\alpha = [\alpha_1, \ldots, \alpha_N]^T$. Here, $I_N$ is an $N \times N$ identity matrix, and $\Omega \in \mathbb{R}^{N \times N}$ is the kernel matrix defined by $\Omega_{ij} = \phi(x_i)^T \phi(x_j) = K(x_i, x_j)$.
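Because training reduces to one dense linear solve, an LS-SVM fits in a few lines of linear algebra. The sketch below is a minimal NumPy illustration of the system above, assuming an RBF kernel with bandwidth $\sigma$ and a user-chosen $\gamma$; the function names are illustrative and this is not the LS-SVMlab implementation.

```python
import numpy as np

def lssvm_train(X, y, gamma=1.0, sigma=1.0):
    """Solve the LS-SVM KKT system for the bias b and multipliers alpha."""
    N = len(y)
    # Kernel matrix Omega_ij = K(x_i, x_j), here an RBF kernel (assumed choice).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    Omega = np.exp(-sq / sigma**2)
    # Assemble the (N+1) x (N+1) system [[0, 1^T], [1, Omega + I/gamma]].
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = Omega + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]                     # b, alpha

def lssvm_predict(X_train, alpha, b, X_new, sigma=1.0):
    """Decision values sum_i alpha_i K(x_new, x_i) + b."""
    sq = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / sigma**2) @ alpha + b
```

For classification one thresholds the decision value at zero, $\hat{y} = \mathrm{sign}\left(\sum_i \alpha_i K(x, x_i) + b\right)$. Unlike the QP dual, every training point typically gets a nonzero $\alpha_i$ here (since $\alpha_i = \gamma e_i$), so LS-SVM solutions are generally not sparse.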
Kernel function K
For the kernel function K(•, •) one typically has the following choices:
- Linear kernel: $K(x, x_i) = x_i^T x$
- Polynomial kernel of degree $d$: $K(x, x_i) = \left(1 + x_i^T x / c\right)^d$
- Radial basis function (RBF) kernel: $K(x, x_i) = \exp\left(-\|x - x_i\|^2 / \sigma^2\right)$
- MLP kernel: $K(x, x_i) = \tanh\left(k\, x_i^T x + \theta\right)$

where $d$, $c$, $\sigma$, $k$ and $\theta$ are constants. Notice that the Mercer condition holds for all $c, \sigma \in \mathbb{R}^+$ and $d \in \mathbb{N}$ in the polynomial and RBF case, but not for all possible choices of $k$ and $\theta$ in the MLP case. The scale parameters $c$, $\sigma$ and $k$ determine the scaling of the inputs in the polynomial, RBF and MLP kernel function. This scaling is related to the bandwidth of the kernel in statistics, where it is shown that the bandwidth is an important parameter of the generalization behavior of a kernel method.
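As a quick reference, the four kernels can be written directly as small functions; the names and default constants below are illustrative, not from the original text.

```python
import numpy as np

# Each kernel maps two points (1-D arrays) to a scalar similarity.
def linear(x, xi):
    return xi @ x

def polynomial(x, xi, c=1.0, d=3):
    return (1.0 + xi @ x / c) ** d

def rbf(x, xi, sigma=1.0):
    return np.exp(-np.sum((x - xi) ** 2) / sigma**2)

def mlp(x, xi, k=1.0, theta=0.0):
    # Not positive semi-definite for all (k, theta): Mercer may fail.
    return np.tanh(k * (xi @ x) + theta)
```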
Bayesian interpretation for LS-SVM
A Bayesian interpretation of the SVM has been proposed by Smola et al. They showed that the use of different kernels in SVM can be regarded as defining different prior probability distributions on the functional space, as $P[f] \propto \exp\left(-\beta \|\hat{P} f\|^2\right)$. Here $\beta$ is a constant and $\hat{P}$ is the regularization operator corresponding to the selected kernel.
A general Bayesian evidence framework was developed by MacKay, who applied it to the problems of regression, forward neural networks and classification networks. Given a data set $D$, a model $\mathbb{H}$ with parameter vector $w$ and a so-called hyperparameter or regularization parameter $\lambda$, Bayesian inference is constructed with 3 levels of inference:
- In level 1, for a given value of $\lambda$, the first level of inference infers the posterior distribution of $w$ by the Bayesian rule
$$p(w | D, \lambda, \mathbb{H}) = \frac{p(D | w, \lambda, \mathbb{H})\, p(w | \lambda, \mathbb{H})}{p(D | \lambda, \mathbb{H})}.$$
- The second level of inference determines the value of $\lambda$ by maximizing
$$p(\lambda | D, \mathbb{H}) = \frac{p(D | \lambda, \mathbb{H})\, p(\lambda | \mathbb{H})}{p(D | \mathbb{H})}.$$
- The third level of inference in the evidence framework ranks different models by examining their posterior probabilities
$$p(\mathbb{H} | D) \propto p(D | \mathbb{H})\, p(\mathbb{H}).$$
We can see that the Bayesian evidence framework is a unified theory for learning the model and for model selection.
Kwok used the Bayesian evidence framework to interpret the formulation of SVM and model selection, and he also applied the Bayesian evidence framework to support vector regression.
Now, given the data points $\{x_i, y_i\}_{i=1}^N$ and the hyperparameters $\mu$ and $\zeta$ of the model $\mathbb{H}$, the model parameters $w$ and $b$ are estimated by maximizing the posterior $p(w, b | D, \log\mu, \log\zeta, \mathbb{H})$. Applying Bayes' rule, we obtain
$$p(w, b | D, \log\mu, \log\zeta, \mathbb{H}) = \frac{p(D | w, b, \log\mu, \log\zeta, \mathbb{H})\, p(w, b | \log\mu, \log\zeta, \mathbb{H})}{p(D | \log\mu, \log\zeta, \mathbb{H})},$$
where $p(D | \log\mu, \log\zeta, \mathbb{H})$ is a normalizing constant such that the integral over all possible $w$ and $b$ is equal to 1.
We assume $w$ and $b$ are independent of the hyperparameter $\zeta$, and are conditionally independent, i.e., we assume
$$p(w, b | \log\mu, \log\zeta, \mathbb{H}) = p(w | \log\mu, \mathbb{H})\, p(b | \log\sigma_b, \mathbb{H}).$$
When $\sigma_b \to \infty$, the distribution of $b$ will approximate a uniform distribution. Furthermore, we assume $w$ and $b$ are Gaussian distributed, so we obtain the a priori distribution of $w$ and $b$ with $\sigma_b \to \infty$ to be
$$p(w, b | \log\mu) = \left(\frac{\mu}{2\pi}\right)^{n_f / 2} \exp\left(-\frac{\mu}{2} w^T w\right) \cdot \frac{1}{\sqrt{2\pi}\,\sigma_b} \exp\left(-\frac{b^2}{2\sigma_b^2}\right) \;\propto\; \left(\frac{\mu}{2\pi}\right)^{n_f / 2} \exp\left(-\frac{\mu}{2} w^T w\right).$$
Here $n_f$ is the dimensionality of the feature space, the same as the dimensionality of $w$.
The probability of $D$ is assumed to depend only on $w$, $b$, $\zeta$ and $\mathbb{H}$. We assume that the data points are independently identically distributed (i.i.d.), so that
$$p(D | w, b, \log\zeta, \mathbb{H}) = \prod_{i=1}^N p(x_i, y_i | w, b, \log\zeta, \mathbb{H}).$$
In order to obtain the least squares cost function, it is assumed that the probability of a data point is proportional to
$$p(x_i, y_i | w, b, \log\zeta, \mathbb{H}) \propto p(e_i | w, b, \log\zeta, \mathbb{H}).$$
A Gaussian distribution is taken for the errors $e_i = y_i - (w^T \phi(x_i) + b)$:
$$p(e_i | w, b, \log\zeta, \mathbb{H}) = \sqrt{\frac{\zeta}{2\pi}} \exp\left(-\frac{\zeta e_i^2}{2}\right).$$
It is assumed that $w$ and $b$ are determined in such a way that the class centers $\hat{m}_-$ and $\hat{m}_+$ are mapped onto the targets -1 and +1, respectively. The projections $w^T \phi(x) + b$ of the class elements $\phi(x)$ follow a multivariate Gaussian distribution with variance $1/\zeta$.
Combining the preceding expressions, and neglecting all constants, Bayes' rule becomes
$$p(w, b | D, \log\mu, \log\zeta, \mathbb{H}) \;\propto\; \exp\left(-\frac{\mu}{2} w^T w - \frac{\zeta}{2} \sum_{i=1}^N e_i^2\right) = \exp\left(-J_2(w, b)\right).$$
The maximum posterior density estimates $w_{MP}$ and $b_{MP}$ are then obtained by minimizing the negative logarithm of this expression, so we arrive at the LS-SVM cost function $J_2$ above.
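Taking that negative logarithm makes the link explicit; up to additive constants,
$$-\log p(w, b | D, \log\mu, \log\zeta, \mathbb{H}) = \frac{\mu}{2} w^T w + \frac{\zeta}{2} \sum_{i=1}^N e_i^2 + \text{const} = J_2(w, b) + \text{const},$$
so maximizing the posterior is exactly the minimization of $J_2$, which gives the LS-SVM training problem a maximum a posteriori interpretation.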
External links
- www.esat.kuleuven.be/sista/lssvmlab/ "Least squares support vector machine Lab (LS-SVMlab) toolbox contains Matlab/C implementations for a number of LS-SVM algorithms."
- www.kernel-machines.org "Support Vector Machines and Kernel based methods (Smola & Schölkopf)."
- www.gaussianprocess.org "Gaussian Processes: Data modeling using Gaussian Process priors over functions for regression and classification (MacKay, Williams)"
- www.support-vector.net "Support Vector Machines and kernel based methods (Cristianini)"
- dlib: Contains a least-squares SVM implementation for large-scale datasets.