Rocchio Classification
Encyclopedia
Rocchio Classification is based on a method of relevance feedback
found in information retrieval
systems which stemmed from the SMART Information Retrieval System
around the year 1970. Like many other retrieval systems, the Rocchio feedback approach was developed using the Vector Space Model
. The algorithm
is based on the assumption that most users have a general conception of which documents should be denoted as relevant
or non-relevant . Therefore, the user's search query is revised to include an arbitrary percentage of relevant and non-relevant documents as a means of increasing the search engine
's recall, and possibly the precision as well. The number of relevant and non-relevant documents allowed to enter a query
is dictated by the weights of the a, b, c variables listed below in the Algorithm section .
As demonstrated in the Rocchio formula, the associated weights (a, b, c) are responsible for shaping the modified vector
in a direction closer, or farther away, from the original query, related documents, and non-related documents. In particular, the values for b and c should be incremented or decremented proportionally to the set of documents classified by the user. If the user decides that the modified query should not contain documents from either the original query, related documents, or non-related documents, then the corresponding weight (a, b, c) value for the category should be set to 0.
In the later part of the algorithm, the variables Dr, and Dnr are presented to be sets of vectors
containing the coordinates of related documents and non-related documents. Though Dr and Dnr are not vectors themselves, and are the vectors used to iterate through the two sets and form vector summations
. These summations will be multiplied against the Multiplicative inverse
of their respective document set (Dr, Dnr) to complete the addition or subtraction of related or non-related documents.
In order to visualize the changes taking place on the modified vector, please refer to the image below . As the weights are increased or decreased for a particular category of documents, the coordinates for the modified vector begin to move either closer, or farther away, from the centroid
of the document collection. Thus if the weight is increased for related documents, then the modified vectors coordinate's will reflect being closer to the centroid of related documents.
for training and testing the Rocchio Classifcation algorithm are listed below and followed by the definition of each variable
. Note that when in testing phase, the time complexity can be reduced to that of calculating the euclidean distance
between a class centroid
and the respective document. As shown by: .
Training =
Testing =
document ranking will result in more precise documents being made available to the user. Therefore, traditional values for the algorithm's weights (a, b, c) in Rocchio Classification are typically around a = 1, b = 0.8, and c = 0.1. Modern information retrieval
systems have moved towards eliminating the non-related documents by setting c = 0 and thus only accounting for related documents. Although not all retrieval systems
have eliminated the need for non-related documents, most have limited the effects on modified query by only accounting for strongest non-related documents in the Dnr set.
in 1989. Therefore the two queries of "Burma" and "Myanmar" will appear much farther apart in the vector space model
, though they both contain similar origins.
Relevance feedback
Relevance feedback is a feature of some information retrieval systems. The idea behind relevance feedback is to take the results that are initially returned from a given query and to use information about whether or not those results are relevant to perform a new query...
found in information retrieval
Information retrieval
Information retrieval is the area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web...
systems which stemmed from the SMART Information Retrieval System
SMART Information Retrieval System
The SMART Information Retrieval System is an information retrieval system developed at Cornell University in the 1960s...
around the year 1970. Like many other retrieval systems, the Rocchio feedback approach was developed using the Vector Space Model
Vector space model
Vector space model is an algebraic model for representing text documents as vectors of identifiers, such as, for example, index terms. It is used in information filtering, information retrieval, indexing and relevancy rankings...
. The algorithm
Algorithm
In mathematics and computer science, an algorithm is an effective method expressed as a finite list of well-defined instructions for calculating a function. Algorithms are used for calculation, data processing, and automated reasoning...
is based on the assumption that most users have a general conception of which documents should be denoted as relevant
Relevance (information retrieval)
In information science and information retrieval, relevance denotes how well a retrieved document or set of documents meets the information need of the user.-Types:...
or non-relevant . Therefore, the user's search query is revised to include an arbitrary percentage of relevant and non-relevant documents as a means of increasing the search engine
Search engine
A search engine is an information retrieval system designed to help find information stored on a computer system. The search results are usually presented in a list and are commonly called hits. Search engines help to minimize the time required to find information and the amount of information...
's recall, and possibly the precision as well. The number of relevant and non-relevant documents allowed to enter a query
Information retrieval
Information retrieval is the area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web...
is dictated by the weights of the a, b, c variables listed below in the Algorithm section .
Algorithm
The formula and variable definitions for Rocchio Classification is as follows :Variable | Value |
---|---|
Modified Query Vector | |
Original Query Vector | |
Related Document Vector | |
Non-Related Document Vector | |
a | Original Query Weight |
b | Related Documents Weight |
c | Non-Related Documents Weight |
Dr | Set of Related Documents |
Dnr | Set of Non-Related Documents |
As demonstrated in the Rocchio formula, the associated weights (a, b, c) are responsible for shaping the modified vector
Vector space
A vector space is a mathematical structure formed by a collection of vectors: objects that may be added together and multiplied by numbers, called scalars in this context. Scalars are often taken to be real numbers, but one may also consider vector spaces with scalar multiplication by complex...
in a direction closer, or farther away, from the original query, related documents, and non-related documents. In particular, the values for b and c should be incremented or decremented proportionally to the set of documents classified by the user. If the user decides that the modified query should not contain documents from either the original query, related documents, or non-related documents, then the corresponding weight (a, b, c) value for the category should be set to 0.
In the later part of the algorithm, the variables Dr, and Dnr are presented to be sets of vectors
Tuple
In mathematics and computer science, a tuple is an ordered list of elements. In set theory, an n-tuple is a sequence of n elements, where n is a positive integer. There is also one 0-tuple, an empty sequence. An n-tuple is defined inductively using the construction of an ordered pair...
containing the coordinates of related documents and non-related documents. Though Dr and Dnr are not vectors themselves, and are the vectors used to iterate through the two sets and form vector summations
Summation
Summation is the operation of adding a sequence of numbers; the result is their sum or total. If numbers are added sequentially from left to right, any intermediate result is a partial sum, prefix sum, or running total of the summation. The numbers to be summed may be integers, rational numbers,...
. These summations will be multiplied against the Multiplicative inverse
Multiplicative inverse
In mathematics, a multiplicative inverse or reciprocal for a number x, denoted by 1/x or x−1, is a number which when multiplied by x yields the multiplicative identity, 1. The multiplicative inverse of a fraction a/b is b/a. For the multiplicative inverse of a real number, divide 1 by the...
of their respective document set (Dr, Dnr) to complete the addition or subtraction of related or non-related documents.
In order to visualize the changes taking place on the modified vector, please refer to the image below . As the weights are increased or decreased for a particular category of documents, the coordinates for the modified vector begin to move either closer, or farther away, from the centroid
Centroid
In geometry, the centroid, geometric center, or barycenter of a plane figure or two-dimensional shape X is the intersection of all straight lines that divide X into two parts of equal moment about the line. Informally, it is the "average" of all points of X...
of the document collection. Thus if the weight is increased for related documents, then the modified vectors coordinate's will reflect being closer to the centroid of related documents.
Time complexity
The time complexityTime complexity
In computer science, the time complexity of an algorithm quantifies the amount of time taken by an algorithm to run as a function of the size of the input to the problem. The time complexity of an algorithm is commonly expressed using big O notation, which suppresses multiplicative constants and...
for training and testing the Rocchio Classifcation algorithm are listed below and followed by the definition of each variable
Variable (mathematics)
In mathematics, a variable is a value that may change within the scope of a given problem or set of operations. In contrast, a constant is a value that remains unchanged, though often unknown or undetermined. The concepts of constants and variables are fundamental to many areas of mathematics and...
. Note that when in testing phase, the time complexity can be reduced to that of calculating the euclidean distance
Euclidean distance
In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" distance between two points that one would measure with a ruler, and is given by the Pythagorean formula. By using this formula as distance, Euclidean space becomes a metric space...
between a class centroid
Centroid
In geometry, the centroid, geometric center, or barycenter of a plane figure or two-dimensional shape X is the intersection of all straight lines that divide X into two parts of equal moment about the line. Informally, it is the "average" of all points of X...
and the respective document. As shown by: .
Training =
Testing =
Variable | Value |
---|---|
Labeled Document Set | |
Average Tokens Per Document | |
Class Set | |
Vocabulary/Term Set | |
Number of Tokens in Document | |
Number of Types in Document |
Usage
Though there are benefits to ranking documents as not-relevant, a relevantRelevant
Relevant may refer to:* Relevant operator, a concept in physics, see renormalization group* Relevant, Ain, a commune of the Ain département in France* Relevant Magazine, an American religious publication...
document ranking will result in more precise documents being made available to the user. Therefore, traditional values for the algorithm's weights (a, b, c) in Rocchio Classification are typically around a = 1, b = 0.8, and c = 0.1. Modern information retrieval
Information retrieval
Information retrieval is the area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web...
systems have moved towards eliminating the non-related documents by setting c = 0 and thus only accounting for related documents. Although not all retrieval systems
Information retrieval
Information retrieval is the area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web...
have eliminated the need for non-related documents, most have limited the effects on modified query by only accounting for strongest non-related documents in the Dnr set.
Limitations
The Rocchio algorithm often fails to classify multimodal classes and relationships. For instance, the country of Burma was renamed to MyanmarMyanmar
Burma , officially the Republic of the Union of Myanmar , is a country in Southeast Asia. Burma is bordered by China on the northeast, Laos on the east, Thailand on the southeast, Bangladesh on the west, India on the northwest, the Bay of Bengal to the southwest, and the Andaman Sea on the south....
in 1989. Therefore the two queries of "Burma" and "Myanmar" will appear much farther apart in the vector space model
Vector space model
Vector space model is an algebraic model for representing text documents as vectors of identifiers, such as, for example, index terms. It is used in information filtering, information retrieval, indexing and relevancy rankings...
, though they both contain similar origins.