Empirical risk minimization
Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family of learning algorithms and is used to give theoretical bounds on the performance of learning algorithms.
Background
Consider the following situation, which is a general setting of many supervised learning problems. We have two spaces of objects $X$ and $Y$ and would like to learn a function $h \colon X \to Y$ (often called a hypothesis) which outputs an object $y \in Y$, given $x \in X$. To do so, we have at our disposal a training set of $n$ examples $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in X$ is an input and $y_i \in Y$ is the corresponding response that we wish to get from $h(x_i)$.
To put it more formally, we assume that there is a joint probability distribution $P(x, y)$ over $X$ and $Y$, and that the training set consists of $n$ instances $(x_1, y_1), \ldots, (x_n, y_n)$ drawn i.i.d. from $P(x, y)$. Note that the assumption of a joint probability distribution allows us to model uncertainty in predictions (e.g. from noise in the data), because $y$ is not a deterministic function of $x$, but rather a random variable with conditional distribution $P(y \mid x)$ for a fixed $x$.
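As a concrete illustration of this setup, the following Python sketch (a hypothetical toy example, not part of the original text) draws a training set of i.i.d. pairs from a simple joint distribution in which the label is a noisy function of the input, so that $y$ has a non-degenerate conditional distribution given $x$:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pair():
    """Draw one (x, y) pair from a toy joint distribution P(x, y):
    x is uniform on [-1, 1]; y equals sign(x) but is flipped with
    probability 0.1, so y is a random variable given x rather than
    a deterministic function of x."""
    x = rng.uniform(-1.0, 1.0)
    y = 1 if x >= 0 else -1
    if rng.random() < 0.1:          # label noise
        y = -y
    return x, y

# A training set of n i.i.d. examples (x_1, y_1), ..., (x_n, y_n).
n = 100
training_set = [sample_pair() for _ in range(n)]
```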
We also assume that we are given a non-negative real-valued loss function $L(\hat{y}, y)$ which measures how different the prediction $\hat{y}$ of a hypothesis is from the true outcome $y$. The risk associated with hypothesis $h(x)$ is then defined as the expectation of the loss function:

$R(h) = \mathbf{E}[L(h(x), y)] = \int L(h(x), y) \, dP(x, y).$
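When the distribution is known, this expectation can be approximated by Monte Carlo sampling. The sketch below (illustrative only; it reuses the hypothetical sample_pair distribution from the previous sketch) estimates $R(h)$ for a given hypothesis and loss function:

```python
def estimated_risk(h, loss, num_samples=100_000):
    """Monte Carlo estimate of R(h) = E[L(h(x), y)] under the toy
    distribution P(x, y) sketched above."""
    total = 0.0
    for _ in range(num_samples):
        x, y = sample_pair()        # i.i.d. draw from P(x, y)
        total += loss(h(x), y)
    return total / num_samples
```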
A loss function commonly used in theory is the 0-1 loss function: $L(\hat{y}, y) = I(\hat{y} \ne y)$, where $I(\cdot)$ is the indicator function.
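A direct rendering of the 0-1 loss, continuing the toy sketches above (illustrative, not from the original text):

```python
def zero_one_loss(y_pred, y_true):
    """0-1 loss: the indicator that the prediction differs from the truth."""
    return 1 if y_pred != y_true else 0

# Example: the sign predictor's risk under the toy distribution is about
# 0.1 (the label-noise rate), since it only errs on flipped labels.
h_sign = lambda x: 1 if x >= 0 else -1
print(estimated_risk(h_sign, zero_one_loss))
```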
The ultimate goal of a learning algorithm is to find a hypothesis $h^*$ among a fixed class of functions $\mathcal{H}$ for which the risk $R(h)$ is minimal:

$h^* = \arg\min_{h \in \mathcal{H}} R(h).$
Empirical risk minimization
In general, the risk $R(h)$ cannot be computed because the distribution $P(x, y)$ is unknown to the learning algorithm (this situation is referred to as agnostic learning). However, we can compute an approximation, called the empirical risk, by averaging the loss function on the training set:

$R_{\mathrm{emp}}(h) = \frac{1}{n} \sum_{i=1}^{n} L(h(x_i), y_i).$

The empirical risk minimization principle states that the learning algorithm should choose a hypothesis $\hat{h}$ which minimizes the empirical risk:

$\hat{h} = \arg\min_{h \in \mathcal{H}} R_{\mathrm{emp}}(h).$
Thus the learning algorithm defined by the ERM principle consists of solving the above optimization problem.
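As a minimal illustration of the principle (a toy setting that reuses training_set and zero_one_loss from the sketches above, not part of the original text), the code below performs ERM exactly by searching a small finite class of threshold classifiers for the member with the lowest empirical risk:

```python
def empirical_risk(h, training_set, loss):
    """R_emp(h): average loss of h over the training examples."""
    return sum(loss(h(x), y) for x, y in training_set) / len(training_set)

# A finite hypothesis class: threshold classifiers h_t(x) = sign(x - t).
thresholds = [t / 20.0 for t in range(-20, 21)]
hypotheses = [lambda x, t=t: 1 if x >= t else -1 for t in thresholds]

# ERM: choose the hypothesis with minimal empirical risk.
h_hat = min(hypotheses,
            key=lambda h: empirical_risk(h, training_set, zero_one_loss))
```

For richer hypothesis classes such an exhaustive search is not available, which is where the computational issues discussed next arise.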
Computational complexity
Empirical risk minimization for a classification problem with the 0-1 loss function is known to be an NP-hard problem, even for such a relatively simple class of functions as linear classifiers. It can, however, be solved efficiently when the minimal empirical risk is zero, i.e. when the data is linearly separable.
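For the linearly separable case, one classical efficient procedure is the perceptron algorithm, which is guaranteed to reach zero empirical risk in finitely many updates when a separating hyperplane exists. A minimal sketch under that separability assumption:

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Return (w, b) with y_i * (w @ x_i + b) > 0 for all i, assuming
    the data is linearly separable; labels must be +1 or -1."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # misclassified example
                w += yi * xi             # standard perceptron update
                b += yi
                mistakes += 1
        if mistakes == 0:                # zero empirical risk reached
            return w, b
    return w, b                          # data may not be separable
```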
In practice, machine learning algorithms cope with this either by employing a convex approximation to the 0-1 loss function (such as the hinge loss for SVM), which is easier to optimize, or by imposing assumptions on the distribution $P(x, y)$ (and thus ceasing to be agnostic learning algorithms, to which the above result applies).
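As an illustration of the convex-surrogate approach (a generic sketch, not a production SVM solver), the following code minimizes the average hinge loss of a linear classifier by subgradient descent; unlike the 0-1 loss, this objective is convex in the parameters, so simple first-order methods apply:

```python
import numpy as np

def hinge_loss_erm(X, y, lr=0.1, epochs=200):
    """Minimize (1/n) * sum_i max(0, 1 - y_i * (w @ x_i + b)) by
    subgradient descent; labels must be +1 or -1."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1             # examples violating the margin
        # Subgradient of the average hinge loss at the current (w, b).
        grad_w = -(y[active, None] * X[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```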