Training set
A training set is a set of data used in various areas of information science to discover potentially predictive relationships. Training sets are used in artificial intelligence, machine learning, genetic programming, intelligent systems, and statistics. In all these fields, a training set has much the same role and is often used in conjunction with a test set.
Use in artificial intelligence, machine learning, and statistics
In artificial intelligence or machine learning, a training set consists of examples, each pairing an input vector with an answer vector. It is used together with a supervised learning method to train a knowledge database (e.g. a neural net or a naive Bayes classifier) used by an AI system.
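As a minimal sketch of this idea, the following Python trains a small naive Bayes classifier on a training set of (input vector, answer) pairs. The weather-style data, the function names `train` and `predict`, and the use of add-one smoothing are all assumptions made for illustration; they are not taken from any particular library or from the article.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training set: each example pairs an input vector
# (outlook, windy) with an answer ("play" or "stay").
training_set = [
    (("sunny", "no"), "play"),
    (("sunny", "yes"), "play"),
    (("rainy", "yes"), "stay"),
    (("rainy", "no"), "play"),
    (("rainy", "yes"), "stay"),
]


def train(examples):
    """Count answers and per-position input values for each answer."""
    class_counts = Counter(answer for _, answer in examples)
    feature_counts = defaultdict(Counter)  # (position, answer) -> value counts
    values = defaultdict(set)              # position -> values seen in training
    for inputs, answer in examples:
        for i, value in enumerate(inputs):
            feature_counts[(i, answer)][value] += 1
            values[i].add(value)
    return class_counts, feature_counts, values


def predict(model, inputs):
    """Return the answer with the highest naive Bayes log-score."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    best_answer, best_score = None, -math.inf
    for answer, count in class_counts.items():
        score = math.log(count / total)  # log prior of the answer
        for i, value in enumerate(inputs):
            # Add-one (Laplace) smoothing avoids zero probabilities.
            numerator = feature_counts[(i, answer)][value] + 1
            denominator = count + len(values[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer


model = train(training_set)
print(predict(model, ("sunny", "no")))
print(predict(model, ("rainy", "yes")))
```

Here the "knowledge" learned from the training set is just the table of counts returned by `train`; a neural net would store weights instead, but the role of the training set is the same.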
In statistical modeling, a training set is used to fit a model that can be used to predict a "response value" from one or more "predictors". The fitting can include both variable selection and parameter estimation. Statistical models used for prediction are often called regression models, of which linear regression and logistic regression are two examples.
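The parameter-estimation step can be illustrated with simple linear regression on one predictor, fitted by ordinary least squares. This is a sketch under stated assumptions: the data values and the helper name `fit_line` are hypothetical, and variable selection is not shown.

```python
def fit_line(xs, ys):
    """Estimate slope and intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of the predictor.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training set: one predictor and one response per observation.
predictors = [1.0, 2.0, 3.0, 4.0]
responses = [3.1, 4.9, 7.2, 8.8]

slope, intercept = fit_line(predictors, responses)


def predict(x):
    """Predict a response value from a predictor using the fitted line."""
    return intercept + slope * x


print(slope, intercept, predict(5.0))
```

The fitted `slope` and `intercept` are the estimated parameters; `predict` then applies them to a predictor value that was not in the training set.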
In these fields, a major emphasis is placed on avoiding overfitting, so as to achieve the best possible performance on an independent test set that follows the same probability distribution as the training set.
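One way to make overfitting visible is to hold out part of the data as a test set and compare errors. The sketch below uses a deliberately extreme, hypothetical model that memorises its training set: it is perfect on data it has seen and poor on held-out data drawn from the same distribution. The names `split`, `mean_squared_error`, and `memoriser`, and the synthetic data, are assumptions for illustration only.

```python
import random


def split(examples, test_fraction=0.25, seed=0):
    """Shuffle the examples and return (training_set, test_set)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def mean_squared_error(predict, examples):
    """Average squared difference between predictions and answers."""
    return sum((predict(x) - y) ** 2 for x, y in examples) / len(examples)


# Hypothetical data: y is roughly 2x plus Gaussian noise.
rng = random.Random(1)
data = [(float(x), 2.0 * x + rng.gauss(0.0, 1.0)) for x in range(20)]
train_set, test_set = split(data)

# An overfitted "model": a lookup table of the training answers,
# falling back to the mean training answer for unseen inputs.
lookup = dict(train_set)
mean_answer = sum(y for _, y in train_set) / len(train_set)


def memoriser(x):
    return lookup.get(x, mean_answer)


train_error = mean_squared_error(memoriser, train_set)
test_error = mean_squared_error(memoriser, test_set)
print(train_error, test_error)
```

A training error far below the test error, as here, is the usual symptom of overfitting; only the test error estimates performance on new data.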
Use in intelligent systems
In general, an intelligent system consists of a function taking one or more arguments and resulting in an output vector. The learning method's task is to run the system once with the input vector as the arguments, calculate the output vector, compare it with the answer vector, and then change the function somewhat so that the output vector is more like the answer vector the next time the system is simulated.
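The run, compare, and adjust cycle described above can be sketched with a linear system trained by the delta rule (least mean squares). This is one possible learning method chosen for illustration, not the only one; the names `system` and `training_step`, the learning rate, and the data are hypothetical.

```python
def system(weights, inputs):
    """The system: a function of the input vector giving an output vector."""
    return [sum(w * x for w, x in zip(weights, inputs))]


def training_step(weights, inputs, answer, rate=0.1):
    """Run the system once, compare with the answer, and adjust slightly."""
    output = system(weights, inputs)
    error = answer[0] - output[0]
    # Move each weight a small step in the direction that reduces the error.
    return [w + rate * error * x for w, x in zip(weights, inputs)]


# Hypothetical training set of (input vector, answer vector) pairs,
# consistent with the weights [2.0, -1.0].
training_set = [
    ([1.0, 0.0], [2.0]),
    ([0.0, 1.0], [-1.0]),
    ([1.0, 1.0], [1.0]),
]

weights = [0.0, 0.0]
for _ in range(200):  # repeated passes over the training set
    for inputs, answer in training_set:
        weights = training_step(weights, inputs, answer)

print(weights)
print(system(weights, [1.0, 1.0]))
```

The small `rate` is what makes each change "somewhat" rather than wholesale: a single step only nudges the output toward the answer, and repeated passes bring the weights close to values that reproduce every answer in the training set.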
See also
- Datasets for information extraction rules inferring and testing.
- Cross-validation (statistics)