Bag of words model
The bag-of-words model is a simplifying assumption used in natural language processing and information retrieval. In this model, a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order.
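As a concrete illustration, a bag of words is often implemented as a mapping from each word to the number of times it occurs. The short Python sketch below is illustrative only; it assumes a naive whitespace tokenizer, whereas practical systems usually also strip punctuation and handle stop words:

    from collections import Counter

    def bag_of_words(text):
        # Tokenize by whitespace and count occurrences; word order is discarded.
        return Counter(text.lower().split())

    print(bag_of_words("the cat sat on the mat the cat"))
    # Counter({'the': 3, 'cat': 2, 'sat': 1, 'on': 1, 'mat': 1})

Two texts containing the same words with the same frequencies produce identical bags, regardless of the order in which the words appear.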
The bag-of-words model is used in some methods of document classification. When a Naive Bayes classifier is applied to text, for example, the conditional independence assumption leads to the bag-of-words model.
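A sketch of the resulting factorization, with w1, ..., wn denoting the document's words and C a class label (the notation is introduced here for illustration): the conditional independence assumption lets the class-conditional probability split into per-word terms, so only which words occur, and how often, affects the result, not their order:

    P(w1, ..., wn | C) ≈ P(w1 | C) · P(w2 | C) · ... · P(wn | C)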
Other methods of document classification that use this model are latent Dirichlet allocation and latent semantic analysis.
An early reference to "bag of words" in a linguistic context can be found in Zellig Harris's 1954 article "Distributional Structure".
Example: Spam filtering
In Bayesian spam filtering, an e-mail message is modeled as an unordered collection of words selected from one of two probability distributions: one representing spam and one representing legitimate e-mail ("ham").
Imagine that there are two literal bags full of words. One bag is filled with words found in spam messages, and the other bag is filled with words found in legitimate e-mail. While any given word is likely to be found somewhere in both bags, the "spam" bag will contain spam-related words such as "stock", "Viagra", and "buy" much more frequently, while the "ham" bag will contain more words related to the user's friends or workplace.
To classify an e-mail message, the Bayesian spam filter assumes that the message is a pile of words that has been poured out randomly from one of the two bags, and uses Bayesian probability to determine which bag it more likely came from.
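A minimal Python sketch of this idea follows. The two "bags" are tiny hand-made word lists, and the equal prior and example messages are illustrative assumptions rather than part of any particular filter; add-one (Laplace) smoothing is used so that a word unseen in one bag does not force its probability to zero:

    import math
    from collections import Counter

    # Illustrative "bags": word counts gathered from spam and ham messages.
    spam_bag = Counter("buy stock now buy viagra cheap stock".split())
    ham_bag = Counter("meeting at work tomorrow lunch with friends".split())

    def log_likelihood(words, bag, vocab_size):
        # Add-one (Laplace) smoothing keeps unseen words from zeroing the score.
        total = sum(bag.values())
        return sum(math.log((bag[w] + 1) / (total + vocab_size)) for w in words)

    def classify(message, p_spam=0.5):
        words = message.lower().split()
        vocab = set(spam_bag) | set(ham_bag) | set(words)
        spam_score = math.log(p_spam) + log_likelihood(words, spam_bag, len(vocab))
        ham_score = math.log(1 - p_spam) + log_likelihood(words, ham_bag, len(vocab))
        return "spam" if spam_score > ham_score else "ham"

    print(classify("buy cheap stock"))         # spam
    print(classify("lunch meeting tomorrow"))  # ham

The smoothing step corresponds to the additive smoothing technique listed under "See also".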
See also
- Natural language processing
- Additive smoothing
- Document classification
- Machine learning
- Document-term matrix
- Bag of words model in computer vision