Learning to rank
Learning to rank or machine-learned ranking (MLR) is a type of supervised or semi-supervised machine learning problem in which the goal is to automatically construct a ranking model from training data. Training data consists of lists of items with some partial order specified between items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment (e.g. "relevant" or "not relevant") for each item. The ranking model's purpose is to rank, i.e. produce a permutation of, items in new, unseen lists in a way that is "similar" to the rankings in the training data.
Learning to rank is a relatively new research area which has emerged in the past decade.
In information retrieval
Ranking is a central part of many information retrieval problems, such as document retrieval, collaborative filtering, sentiment analysis, and computational advertising (online ad placement).
A possible architecture of a machine-learned search engine is shown in the figure to the right.
Training data consists of queries and documents matching them, together with the relevance degree of each match. It may be prepared manually by human assessors (or raters, as Google calls them), who check results for some queries and determine the relevance of each result. It is not feasible to check the relevance of all documents, and so typically a technique called pooling is used: only the top few documents, retrieved by some existing ranking models, are checked. Alternatively, training data may be derived automatically by analyzing clickthrough logs (i.e. search results which received clicks from users), query chains, or search engine features such as Google's SearchWiki.
Training data is used by a learning algorithm to produce a ranking model which computes relevance of documents for actual queries.
Typically, users expect a search query to complete in a short time (such as a few hundred milliseconds for web search), which makes it impossible to evaluate a complex ranking model on each document in the corpus, and so a two-phase scheme is used. First, a small number of potentially relevant documents are identified using simpler retrieval models which permit fast query evaluation, such as the vector space model, the Boolean model, weighted AND, or BM25. This phase is called top-k document retrieval, and many good heuristics have been proposed in the literature to accelerate it, such as using a document's static quality score and tiered indexes. In the second phase, a more accurate but computationally expensive machine-learned model is used to re-rank these documents.
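As an illustration, the two-phase scheme can be sketched as follows. Everything here is a hypothetical stand-in: a simple term-overlap score plays the role of the fast first-phase model, and a slightly costlier function plays the role of the machine-learned re-ranker.

```python
import heapq

def cheap_score(query, doc):
    # Phase 1: a fast term-overlap score, standing in for BM25 or a
    # vector space model (hypothetical placeholder).
    return len(set(query.split()) & set(doc.split()))

def expensive_score(query, doc):
    # Phase 2: stands in for a machine-learned model; here a toy
    # function that also rewards shorter documents.
    return cheap_score(query, doc) + 1.0 / (1 + len(doc.split()))

def two_phase_rank(query, corpus, k=3):
    # Top-k retrieval with the cheap model, then re-ranking of only
    # those k documents with the expensive one.
    top_k = heapq.nlargest(k, corpus, key=lambda d: cheap_score(query, d))
    return sorted(top_k, key=lambda d: expensive_score(query, d), reverse=True)

corpus = [
    "learning to rank tutorial",
    "rank of a matrix",
    "machine learning for search ranking",
    "cooking recipes",
]
print(two_phase_rank("learning to rank", corpus, k=2))
```

The point of the sketch is the cost structure: the expensive scorer is evaluated on only k documents, not on the whole corpus.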
In other areas
Learning to rank algorithms have been applied in areas other than information retrieval:
- In machine translation, for ranking a set of hypothesized translations;
- In computational biology, for ranking candidate 3-D structures in the protein structure prediction problem;
- In proteomics, for the identification of frequent top scoring peptides;
- In recommender systems, for identifying a ranked list of related news articles to recommend to a user after he or she has read a current news article.
Feature vectors
For the convenience of MLR algorithms, query-document pairs are usually represented by numerical vectors, which are called feature vectors. Such an approach is sometimes called the bag of features and is analogous to the bag-of-words model and vector space model used in information retrieval for the representation of documents.
Components of such vectors are called features, factors or ranking signals. They may be divided into three groups (features from document retrieval are shown as examples):
- Query-independent or static features: those which depend only on the document, but not on the query. For example, PageRank or a document's length. Such features can be precomputed in off-line mode during indexing. They may be used to compute a document's static quality score (or static rank), which is often used to speed up search query evaluation.
- Query-dependent or dynamic features: those which depend both on the contents of the document and the query, such as TF-IDF score or other non-machine-learned ranking functions.
- Query features: those which depend only on the query. For example, the number of words in a query.
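As a toy illustration of how such a feature vector might be assembled for a single query-document pair (all features and values here are invented for the example):

```python
def feature_vector(query, doc, static_rank):
    """Assemble a toy feature vector for one query-document pair."""
    q_terms = query.split()
    d_terms = doc.split()
    overlap = sum(1 for t in q_terms if t in d_terms)
    return [
        static_rank,                     # query-independent: PageRank-like score
        len(d_terms),                    # query-independent: document length
        overlap,                         # query-dependent: raw term matches
        overlap / max(len(d_terms), 1),  # query-dependent: normalized overlap
        len(q_terms),                    # query feature: words in the query
    ]

print(feature_vector("learning to rank", "a tutorial on learning to rank", 0.42))
```

A real system would use hundreds of such signals; the vector layout, not the particular features, is what MLR algorithms consume.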
Some examples of features used in the well-known LETOR dataset:
- TF, TF-IDF, BM25, and language modeling scores of a document's zones (title, body, anchor text, URL) for a given query;
- Lengths and IDF sums of a document's zones;
- A document's PageRank, HITS ranks and their variants.
Selecting and designing good features is an important area of machine learning, called feature engineering.
Evaluation measures
There are several measures (metrics) which are commonly used to judge how well an algorithm is doing on training data and to compare the performance of different MLR algorithms. Often a learning-to-rank problem is reformulated as an optimization problem with respect to one of these metrics.
Examples of ranking quality measures:
- Mean average precision (MAP);
- DCG and NDCG;
- Precision@n, NDCG@n, where "@n" denotes that the metrics are evaluated only on the top n documents;
- Mean reciprocal rank (MRR);
- Kendall's tau.
DCG and its normalized variant NDCG are usually preferred in academic research when multiple levels of relevance are used. Other metrics, such as MAP, MRR and precision, are defined only for binary judgments.
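For reference, DCG and NDCG can be computed in a few lines. This sketch uses the common log2 position discount with the raw graded relevance as gain (an exponential gain, 2^rel − 1, is another widely used variant):

```python
import math

def dcg(relevances):
    # DCG = sum over 1-based positions i of rel_i / log2(i + 1).
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    # Normalize by the DCG of the ideal (best possible) ordering.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

ranking = [3, 2, 3, 0, 1]   # graded relevance of results, in ranked order
print(round(dcg(ranking), 3))
print(round(ndcg(ranking), 3))
```

NDCG is bounded in [0, 1], which makes it comparable across queries with different numbers of relevant documents.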
Recently, several new evaluation metrics have been proposed which claim to model the user's satisfaction with search results better than the DCG metric:
- Expected reciprocal rank (ERR);
- Yandex's pfound.
Both of these metrics are based on the assumption that the user is more likely to stop looking at search results after examining a more relevant document than after a less relevant one.
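A sketch of ERR under this cascade assumption, following the usual definition in which a document of grade g satisfies the user with probability (2^g − 1)/2^g_max:

```python
def err(grades, g_max):
    # R_i: probability the user is satisfied by the document at rank i.
    probs = [(2 ** g - 1) / 2 ** g_max for g in grades]
    score, p_continue = 0.0, 1.0
    for rank, r in enumerate(probs, start=1):
        score += p_continue * r / rank   # user reaches this rank and stops here
        p_continue *= (1 - r)            # user was not satisfied, continues down
    return score

print(round(err([3, 2, 0, 1], g_max=3), 4))
```

The 1/rank factor is the reciprocal-rank utility of stopping at that position; the running product models the user scanning down the list.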
Approaches
Tie-Yan Liu of Microsoft Research Asia, in his paper "Learning to Rank for Information Retrieval" and in talks at several leading conferences, has analyzed existing algorithms for learning-to-rank problems and categorized them into three groups by their input representation and loss function:
Pointwise approach
In this case it is assumed that each query-document pair in the training data has a numerical or ordinal score. The learning-to-rank problem can then be approximated by a regression problem: given a single query-document pair, predict its score.
A number of existing supervised machine learning algorithms can be readily used for this purpose. Ordinal regression and classification algorithms can also be used in the pointwise approach when they are applied to predict the score of a single query-document pair and the score takes a small, finite number of values.
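A minimal pointwise sketch: a linear model trained with squared loss on individual query-document examples via stochastic gradient descent, then used to sort documents by predicted score. The feature vectors and labels below are made up for the example.

```python
def fit_pointwise(X, y, lr=0.05, epochs=500):
    # Linear model trained with squared loss on individual examples --
    # the pointwise reduction of ranking to regression.
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, target in zip(X, y):
            pred = sum(wi * xi for wi, xi in zip(w, x))
            grad = pred - target
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
    return w

def rank(docs, w):
    score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
    return sorted(docs, key=score, reverse=True)

# Hypothetical feature vectors: [term overlap, static quality score]
X = [[3, 0.9], [1, 0.2], [2, 0.5], [0, 0.8]]
y = [3, 1, 2, 0]    # graded relevance labels
w = fit_pointwise(X, y)
print(rank(X, w))   # documents sorted by predicted relevance
```

Note that the loss never compares two documents; each example is fit in isolation, which is exactly what distinguishes the pointwise approach from the pairwise and listwise ones.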
Pairwise approach
In this case the learning-to-rank problem is approximated by a classification problem: learning a binary classifier which can tell which document is better in a given pair of documents. The goal is to minimize the average number of inversions in ranking.
Listwise approach
These algorithms try to directly optimize the value of one of the above evaluation measures, averaged over all queries in the training data. This is difficult because most evaluation measures are not continuous functions with respect to the ranking model's parameters, and so continuous approximations or bounds on evaluation measures have to be used.
List of methods
A partial list of published learning-to-rank algorithms is shown below with the year of first publication of each method:
Year | Name | Type | Notes |
---|---|---|---|
2000 | Ranking SVM (RankSVM) | pairwise | A more recent exposition is in, which describes an application to ranking using clickthrough logs. |
2002 | Pranking | pointwise | Ordinal regression. |
2003 | RankBoost | pairwise | |
2005 | RankNet | pairwise | |
2006 | IR-SVM | pairwise | Ranking SVM with query-level normalization in the loss function. |
2006 | LambdaRank | listwise | RankNet in which pairwise loss function is multiplied by the change in the IR metric caused by a swap. |
2007 | AdaRank | listwise | |
2007 | FRank | pairwise | Based on RankNet, uses a different loss function - fidelity loss. |
2007 | GBRank | pairwise | |
2007 | ListNet | listwise | |
2007 | McRank | pointwise | |
2007 | QBRank | pairwise | |
2007 | RankCosine | listwise | |
2007 | RankGP | listwise | |
2007 | RankRLS | pairwise | Regularized least-squares based ranking. The work is extended in to learning to rank from general preference graphs. |
2007 | SVMmap | listwise | |
2008 | [ftp://ftp.research.microsoft.com/pub/tr/TR-2008-109.pdf LambdaMART] | listwise | Winning entry in the recent Yahoo Learning to Rank competition used an ensemble of LambdaMART models. |
2008 | ListMLE | listwise | Based on ListNet. |
2008 | PermuRank | listwise | |
2008 | SoftRank | listwise | |
2008 | Ranking Refinement | pairwise | A semi-supervised approach to learning to rank that uses Boosting. |
2008 | SSRankBoost | pairwise | An extension of RankBoost to learn with partially labeled data (semi-supervised learning to rank) |
2008 | SortNet | pairwise | SortNet, an adaptive ranking algorithm which orders objects using a neural network as a comparator. |
2009 | MPBoost | pairwise | Magnitude-preserving variant of RankBoost. The idea is that the more unequal are labels of a pair of documents, the harder should the algorithm try to rank them. |
2009 | BoltzRank | listwise | Unlike earlier methods, BoltzRank produces a ranking model that looks during query time not just at a single document, but also at pairs of documents. |
2009 | BayesRank | listwise | Based on ListNet. |
2010 | NDCG Boost | listwise | A boosting approach to optimize NDCG. |
2010 | GBlend | pairwise | Extends GBRank to the learning-to-blend problem of jointly solving multiple learning-to-rank problems with some shared features. |
2010 | IntervalRank | pairwise & listwise | |
2010 | CRR | pointwise & pairwise | Combined Regression and Ranking. Uses stochastic gradient descent to optimize a linear combination of a pointwise quadratic loss and a pairwise hinge loss from Ranking SVM. |
Note: as most supervised learning algorithms can be applied to the pointwise case, only those methods which are specifically designed with ranking in mind are shown above.
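The pairwise approach listed above reduces ranking to binary classification; a common transformation forms difference vectors from preference pairs, as in the Ranking SVM and RankNet family. A sketch with invented data:

```python
from itertools import combinations

def to_pairs(features, labels):
    """Turn one query's labeled documents into binary preference examples.

    Each example is (x_i - x_j, +1) when document i should rank above
    document j, and (x_i - x_j, -1) for the opposite preference. Any
    binary classifier can then be trained on the result."""
    pairs = []
    for (xi, yi), (xj, yj) in combinations(zip(features, labels), 2):
        if yi == yj:
            continue  # ties carry no preference information
        diff = [a - b for a, b in zip(xi, xj)]
        pairs.append((diff, 1 if yi > yj else -1))
    return pairs

X = [[3, 0.9], [1, 0.2], [2, 0.5]]  # hypothetical feature vectors
y = [2, 0, 1]                       # graded relevance labels
print(to_pairs(X, y))
```

A linear classifier trained on such difference vectors yields a scoring function whose pairwise ordering minimizes ranking inversions on the training pairs.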
History
C. Manning et al. trace the earliest work on the learning-to-rank problem to papers in the late 1980s and early 1990s. They suggest that these early works achieved limited results in their time due to little available training data and poor machine learning techniques.
In the mid-1990s, Berkeley researchers used logistic regression to train a successful ranking function at the TREC conference.
Several conferences, such as NIPS, SIGIR and ICML, have had workshops devoted to the learning-to-rank problem since the mid-2000s, and this has stimulated much academic research.
Practical usage by search engines
Commercial web search engines began using machine-learned ranking systems in the 2000s. One of the first search engines to start using it was AltaVista (later its technology was acquired by Overture, and then Yahoo), which launched a gradient boosting-trained ranking function in April 2003.
Bing's search is said to be powered by the RankNet algorithm, which was invented at Microsoft Research in 2005.
In November 2009 the Russian search engine Yandex announced that it had significantly increased its search quality due to the deployment of a new proprietary MatrixNet algorithm, a variant of gradient boosting which uses oblivious decision trees. Recently it has also sponsored a machine-learned ranking competition, "Internet Mathematics 2009", based on its own search engine's production data. Yahoo announced a similar competition in 2010.
As of 2008, Google's Peter Norvig denied that their search engine relies exclusively on machine-learned ranking. Cuil's CEO, Tom Costello, suggests that they prefer hand-built models because these can outperform machine-learned models when measured against metrics like click-through rate or time on landing page; this is because machine-learned models "learn what people say they like, not what people actually like".
External links
Competitions and public datasets
- LETOR: A Benchmark Collection for Research on Learning to Rank for Information Retrieval
- Yandex's Internet Mathematics 2009
- Yahoo! Learning to Rank Challenge
- Microsoft Learning to Rank Datasets
Open Source code