Rocchio Classification - AbsoluteAstronomy.com

Rocchio Classification is based on a method of relevance feedback

Relevance feedback

Relevance feedback is a feature of some information retrieval systems. The idea behind relevance feedback is to take the results that are initially returned from a given query and to use information about whether or not those results are relevant to perform a new query...

found in information retrieval

Information retrieval

Information retrieval is the area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web...

systems which stemmed from the SMART Information Retrieval System

SMART Information Retrieval System

The SMART Information Retrieval System is an information retrieval system developed at Cornell University in the 1960s...

around the year 1970. Like many other retrieval systems, the Rocchio feedback approach was developed using the Vector Space Model

Vector space model

Vector space model is an algebraic model for representing text documents as vectors of identifiers, such as, for example, index terms. It is used in information filtering, information retrieval, indexing and relevancy rankings...

. The algorithm

Algorithm

In mathematics and computer science, an algorithm is an effective method expressed as a finite list of well-defined instructions for calculating a function. Algorithms are used for calculation, data processing, and automated reasoning...

is based on the assumption that most users have a general conception of which documents should be denoted as relevant

Relevance (information retrieval)

In information science and information retrieval, relevance denotes how well a retrieved document or set of documents meets the information need of the user.-Types:...

or non-relevant . Therefore, the user's search query is revised to include an arbitrary percentage of relevant and non-relevant documents as a means of increasing the search engine

Search engine

A search engine is an information retrieval system designed to help find information stored on a computer system. The search results are usually presented in a list and are commonly called hits. Search engines help to minimize the time required to find information and the amount of information...

's recall, and possibly the precision as well. The number of relevant and non-relevant documents allowed to enter a query

Information retrieval

is dictated by the weights of the a, b, c variables listed below in the Algorithm section .

Algorithm

The formula and variable definitions for Rocchio Classification is as follows :

Variable	Value
	Modified Query Vector
	Original Query Vector
	Related Document Vector
	Non-Related Document Vector
a	Original Query Weight
b	Related Documents Weight
c	Non-Related Documents Weight
Dr	Set of Related Documents
Dnr	Set of Non-Related Documents

As demonstrated in the Rocchio formula, the associated weights (a, b, c) are responsible for shaping the modified vector

Vector space

A vector space is a mathematical structure formed by a collection of vectors: objects that may be added together and multiplied by numbers, called scalars in this context. Scalars are often taken to be real numbers, but one may also consider vector spaces with scalar multiplication by complex...

in a direction closer, or farther away, from the original query, related documents, and non-related documents. In particular, the values for b and c should be incremented or decremented proportionally to the set of documents classified by the user. If the user decides that the modified query should not contain documents from either the original query, related documents, or non-related documents, then the corresponding weight (a, b, c) value for the category should be set to 0.

In the later part of the algorithm, the variables Dr, and Dnr are presented to be sets of vectors

Tuple

In mathematics and computer science, a tuple is an ordered list of elements. In set theory, an n-tuple is a sequence of n elements, where n is a positive integer. There is also one 0-tuple, an empty sequence. An n-tuple is defined inductively using the construction of an ordered pair...

containing the coordinates of related documents and non-related documents. Though Dr and Dnr are not vectors themselves,

and

are the vectors used to iterate through the two sets and form vector summations

Summation

Summation is the operation of adding a sequence of numbers; the result is their sum or total. If numbers are added sequentially from left to right, any intermediate result is a partial sum, prefix sum, or running total of the summation. The numbers to be summed may be integers, rational numbers,...

. These summations will be multiplied against the Multiplicative inverse

Multiplicative inverse

In mathematics, a multiplicative inverse or reciprocal for a number x, denoted by 1/x or x−1, is a number which when multiplied by x yields the multiplicative identity, 1. The multiplicative inverse of a fraction a/b is b/a. For the multiplicative inverse of a real number, divide 1 by the...

of their respective document set (Dr, Dnr) to complete the addition or subtraction of related or non-related documents.

In order to visualize the changes taking place on the modified vector, please refer to the image below . As the weights are increased or decreased for a particular category of documents, the coordinates for the modified vector begin to move either closer, or farther away, from the centroid

Centroid

In geometry, the centroid, geometric center, or barycenter of a plane figure or two-dimensional shape X is the intersection of all straight lines that divide X into two parts of equal moment about the line. Informally, it is the "average" of all points of X...

of the document collection. Thus if the weight is increased for related documents, then the modified vectors coordinate's will reflect being closer to the centroid of related documents.

Time complexity

The time complexity

Time complexity

In computer science, the time complexity of an algorithm quantifies the amount of time taken by an algorithm to run as a function of the size of the input to the problem. The time complexity of an algorithm is commonly expressed using big O notation, which suppresses multiplicative constants and...

for training and testing the Rocchio Classifcation algorithm are listed below and followed by the definition of each variable

Variable (mathematics)

In mathematics, a variable is a value that may change within the scope of a given problem or set of operations. In contrast, a constant is a value that remains unchanged, though often unknown or undetermined. The concepts of constants and variables are fundamental to many areas of mathematics and...

. Note that when in testing phase, the time complexity can be reduced to that of calculating the euclidean distance

Euclidean distance

In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" distance between two points that one would measure with a ruler, and is given by the Pythagorean formula. By using this formula as distance, Euclidean space becomes a metric space...

between a class centroid

Centroid

and the respective document. As shown by:

.

Training =

Testing =

Variable	Value
	Labeled Document Set
	Average Tokens Per Document
	Class Set
	Vocabulary/Term Set
	Number of Tokens in Document
	Number of Types in Document

Usage

Though there are benefits to ranking documents as not-relevant, a relevant

Relevant

Relevant may refer to:* Relevant operator, a concept in physics, see renormalization group* Relevant, Ain, a commune of the Ain département in France* Relevant Magazine, an American religious publication...

document ranking will result in more precise documents being made available to the user. Therefore, traditional values for the algorithm's weights (a, b, c) in Rocchio Classification are typically around a = 1, b = 0.8, and c = 0.1. Modern information retrieval

Information retrieval

systems have moved towards eliminating the non-related documents by setting c = 0 and thus only accounting for related documents. Although not all retrieval systems

Information retrieval

have eliminated the need for non-related documents, most have limited the effects on modified query by only accounting for strongest non-related documents in the Dnr set.

Limitations

The Rocchio algorithm often fails to classify multimodal classes and relationships. For instance, the country of Burma was renamed to Myanmar

Myanmar

Burma , officially the Republic of the Union of Myanmar , is a country in Southeast Asia. Burma is bordered by China on the northeast, Laos on the east, Thailand on the southeast, Bangladesh on the west, India on the northwest, the Bay of Bengal to the southwest, and the Andaman Sea on the south....

in 1989. Therefore the two queries of "Burma" and "Myanmar" will appear much farther apart in the vector space model

Vector space model

, though they both contain similar origins.

The source of this article is wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.