Full text search
Encyclopedia
In text retrieval, full text search refers to techniques for searching a single computer
-stored document
or a collection in a full text database
. Full text search is distinguished from searches based on metadata
or on parts of the original texts represented in databases (such as titles, abstracts, selected sections or bibliographical references).
In a full text search, the search engine
examines all of the words in every stored document as it tries to match search criteria (e.g., words supplied by a user). Full text searching techniques became common in online bibliographic databases in the 1990s. Many web sites and application programs (such as word processing
software) provide full-text search capabilities. Some web search engines such as AltaVista
employ full text search techniques while others index only a portion of the web pages examined by its indexing system.
, a strategy called serial scanning. This is what some rudimentary tools, such as grep
, do when searching.
However, when the number of documents to search is potentially large or the quantity of search queries to perform is substantial, the problem of full text search is often divided into two tasks: indexing and searching. The indexing stage will scan the text of all the documents and build a list of search terms, often called an index, but more correctly named a concordance
. In the search stage, when performing a specific query, only the index is referenced rather than the text of the original documents.
The indexer will make an entry in the index for each term or word found in a document and possibly its relative position within the document. Usually the indexer will ignore stop words
, such as the English "the", which are both too common and carry too little meaning to be useful for searching. Some indexers also employ language-specific stemming
on the words being indexed, so for example any of the words "drives", "drove", or "driven" will be recorded in the index under a single concept word "drive".
The diagram at right represents a low-precision, low-recall search. In the diagram the red and green dots represent the total population of potential search results for a given search. Red dots represent irrelevant results, and green dots represent relevant results. Relevancy is indicated by the proximity of search results to the center of the inner circle. Of all possible results shown, those that were actually returned by the search are shown on a light-blue background. In the example only one relevant result of three possible relevant results was returned, so the recall is a very low ratio of 1/3 or 33%. The precision for the example is a very low 1/4 or 25%, since only one of the four results returned was relevant.
Due to the ambiguities of natural language
, full text search systems typically includes options like stop words
to increase precision and stemming
to increase recall. Controlled-vocabulary
searching also helps alleviate low-precision issues by tagging
documents in such a way that ambiguities are eliminated. The trade-off between precision and recall is simple: an increase in precision can lower overall recall while an increase in recall lowers precision.
to the intended search question. Such documents are called false positives (see Type I error). The retrieval of irrelevant documents is often caused by the inherent ambiguity of natural language
. In the sample diagram at right, false positives are represented by the irrelevant results (red dots) that were returned by the search (on a light-blue background).
Clustering techniques based on Bayesian
algorithms can help reduce false positives. For a search term of "football", clustering can be used to categorize the document/data universe into "American football", "corporate football", etc. Depending on the occurrences of words relevant to the categories, search terms a search result can be placed in one or more of the categories. This technique is being extensively deployed in the e-discovery domain.
algorithm developed by Google
gives more prominence to documents to which other Web page
s have linked. See Search engine
for additional examples.
Computer
A computer is a programmable machine designed to sequentially and automatically carry out a sequence of arithmetic or logical operations. The particular sequence of operations can be changed readily, allowing the computer to solve more than one kind of problem...
-stored document
Document
The term document has multiple meanings in ordinary language and in scholarship. WordNet 3.1. lists four meanings :* document, written document, papers...
or a collection in a full text database
Full text database
A full text database or a complete text database is a database that contains the complete text of books, dissertations, journals, magazines, newspapers or other kinds of textual documents...
. Full text search is distinguished from searches based on metadata
Metadata
The term metadata is an ambiguous term which is used for two fundamentally different concepts . Although the expression "data about data" is often used, it does not apply to both in the same way. Structural metadata, the design and specification of data structures, cannot be about data, because at...
or on parts of the original texts represented in databases (such as titles, abstracts, selected sections or bibliographical references).
In a full text search, the search engine
Search engine
A search engine is an information retrieval system designed to help find information stored on a computer system. The search results are usually presented in a list and are commonly called hits. Search engines help to minimize the time required to find information and the amount of information...
examines all of the words in every stored document as it tries to match search criteria (e.g., words supplied by a user). Full text searching techniques became common in online bibliographic databases in the 1990s. Many web sites and application programs (such as word processing
Word processing
Word processing is the creation of documents using a word processor. It can also refer to advanced shorthand techniques, sometimes used in specialized contexts with a specially modified typewriter.-External links:...
software) provide full-text search capabilities. Some web search engines such as AltaVista
AltaVista
AltaVista is a web search engine owned by Yahoo!. AltaVista was once one of the most popular search engines but its popularity declined with the rise of Google...
employ full text search techniques while others index only a portion of the web pages examined by its indexing system.
Indexing
When dealing with a small number of documents it is possible for the full-text search engine to directly scan the contents of the documents with each queryInformation retrieval
Information retrieval is the area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web...
, a strategy called serial scanning. This is what some rudimentary tools, such as grep
Grep
grep is a command-line text-search utility originally written for Unix. The name comes from the ed command g/re/p...
, do when searching.
However, when the number of documents to search is potentially large or the quantity of search queries to perform is substantial, the problem of full text search is often divided into two tasks: indexing and searching. The indexing stage will scan the text of all the documents and build a list of search terms, often called an index, but more correctly named a concordance
Concordance (publishing)
A concordance is an alphabetical list of the principal words used in a book or body of work, with their immediate contexts. Because of the time and difficulty and expense involved in creating a concordance in the pre-computer era, only works of special importance, such as the Vedas, Bible, Qur'an...
. In the search stage, when performing a specific query, only the index is referenced rather than the text of the original documents.
The indexer will make an entry in the index for each term or word found in a document and possibly its relative position within the document. Usually the indexer will ignore stop words
Stop words
In computing, stop words are words which are filtered out prior to, or after, processing of natural language data . It is controlled by human input and not automated. There is not one definite list of stop words which all tools use, if even used...
, such as the English "the", which are both too common and carry too little meaning to be useful for searching. Some indexers also employ language-specific stemming
Stemming
In linguistic morphology and information retrieval, stemming is the process for reducing inflected words to their stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same...
on the words being indexed, so for example any of the words "drives", "drove", or "driven" will be recorded in the index under a single concept word "drive".
The precision vs. recall tradeoff
Recall measures the quantity of results returned by a search and precision is the measure of the quality of the results returned. Recall is the ratio of relevant results returned divided by all relevant results. Precision is the number of relevant results returned divided by the total number of results returned.The diagram at right represents a low-precision, low-recall search. In the diagram the red and green dots represent the total population of potential search results for a given search. Red dots represent irrelevant results, and green dots represent relevant results. Relevancy is indicated by the proximity of search results to the center of the inner circle. Of all possible results shown, those that were actually returned by the search are shown on a light-blue background. In the example only one relevant result of three possible relevant results was returned, so the recall is a very low ratio of 1/3 or 33%. The precision for the example is a very low 1/4 or 25%, since only one of the four results returned was relevant.
Due to the ambiguities of natural language
Natural language
In the philosophy of language, a natural language is any language which arises in an unpremeditated fashion as the result of the innate facility for language possessed by the human intellect. A natural language is typically used for communication, and may be spoken, signed, or written...
, full text search systems typically includes options like stop words
Stop words
In computing, stop words are words which are filtered out prior to, or after, processing of natural language data . It is controlled by human input and not automated. There is not one definite list of stop words which all tools use, if even used...
to increase precision and stemming
Stemming
In linguistic morphology and information retrieval, stemming is the process for reducing inflected words to their stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same...
to increase recall. Controlled-vocabulary
Controlled vocabulary
Controlled vocabularies provide a way to organize knowledge for subsequent retrieval. They are used in subject indexing schemes, subject headings, thesauri, taxonomies and other form of knowledge organization systems...
searching also helps alleviate low-precision issues by tagging
Tag (metadata)
In online computer systems terminology, a tag is a non-hierarchical keyword or term assigned to a piece of information . This kind of metadata helps describe an item and allows it to be found again by browsing or searching...
documents in such a way that ambiguities are eliminated. The trade-off between precision and recall is simple: an increase in precision can lower overall recall while an increase in recall lowers precision.
False-positive problem
Free text searching is likely to retrieve many documents that are not relevantRelevance
-Introduction:The concept of relevance is studied in many different fields, including cognitive sciences, logic and library and information science. Most fundamentally, however, it is studied in epistemology...
to the intended search question. Such documents are called false positives (see Type I error). The retrieval of irrelevant documents is often caused by the inherent ambiguity of natural language
Natural language
In the philosophy of language, a natural language is any language which arises in an unpremeditated fashion as the result of the innate facility for language possessed by the human intellect. A natural language is typically used for communication, and may be spoken, signed, or written...
. In the sample diagram at right, false positives are represented by the irrelevant results (red dots) that were returned by the search (on a light-blue background).
Clustering techniques based on Bayesian
Bayesian inference
In statistics, Bayesian inference is a method of statistical inference. It is often used in science and engineering to determine model parameters, make predictions about unknown variables, and to perform model selection...
algorithms can help reduce false positives. For a search term of "football", clustering can be used to categorize the document/data universe into "American football", "corporate football", etc. Depending on the occurrences of words relevant to the categories, search terms a search result can be placed in one or more of the categories. This technique is being extensively deployed in the e-discovery domain.
Performance improvements
The deficiencies of free text searching have been addressed in two ways: By providing users with tools that enable them to express their search questions more precisely, and by developing new search algorithms that improve retrieval precision.Improved querying tools
- Keywords. Document creators (or trained indexers) are asked to supply a list of words that describe the subject of the text, including synonyms of words that describe this subject. Keywords improve recall, particularly if the keyword list includes a search word that is not in the document text.
- Field-restricted search. Some search engines enable users to limit free text searches to a particular fieldField (computer science)In computer science, data that has several parts can be divided into fields. Relational databases arrange data as sets of database records, also called rows. Each record consists of several fields; the fields of all records form the columns....
within a stored data record, such as "Title" or "Author." - Boolean queries. Searches that use BooleanBoolean logicBoolean algebra is a logical calculus of truth values, developed by George Boole in the 1840s. It resembles the algebra of real numbers, but with the numeric operations of multiplication xy, addition x + y, and negation −x replaced by the respective logical operations of...
operators (for example, "encyclopedia" AND "online" NOT "Encarta") can dramatically increase the precision of a free text search. The AND operator says, in effect, "Do not retrieve any document unless it contains both of these terms." The NOT operator says, in effect, "Do not retrieve any document that contains this word." If the retrieval list retrieves too few documents, the OR operator can be used to increase recall; consider, for example, "encyclopedia" AND "online" OR "Internet" NOT "Encarta". This search will retrieve documents about online encyclopedias that use the term "Internet" instead of "online." This increase in precision is very commonly counter-productive since it usually comes with a dramatic loss of recall. - Phrase searchPhrase searchPhrase Search is a type of search that allows users to search for documents containing an exact sentence or phrase opposed to being limited to keywords...
. A phrase search matches only those documents that contain a specified phrase, such as "Wikipedia, the free encyclopedia." - Concept searchConcept SearchA concept search is an automated information retrieval method that is used to search electronically stored unstructured text for information that is conceptually similar to the information provided in a search query...
. A search that is based on multi-word concepts, for example Compound term processingCompound term processingCompound term processing is the name that is used for a category of techniques in Information retrieval applications that performs matching on the basis of compound terms...
. This type of search is becoming popular in many e-Discovery solutions. - Concordance search. A concordance search produces an alphabetical list of all principal words that occur in a textPlain textIn computing, plain text is the contents of an ordinary sequential file readable as textual material without much processing, usually opposed to formatted text....
with their immediate context. - Proximity search. A phrase search matches only those documents that contain two or more words that are separated by a specified number of words; a search for "Wikipedia" WITHIN2 "free" would retrieve only those documents in which the words "Wikipedia" and "free" occur within two words of each other.
- Regular expressionRegular expressionIn computing, a regular expression provides a concise and flexible means for "matching" strings of text, such as particular characters, words, or patterns of characters. Abbreviations for "regular expression" include "regex" and "regexp"...
. A regular expression employs a complex but powerful querying syntaxSyntaxIn linguistics, syntax is the study of the principles and rules for constructing phrases and sentences in natural languages....
that can be used to specify retrieval conditions with precision. - Fuzzy search will search for document that match the given terms and some variation around them (using for instance edit distanceEdit distanceIn information theory and computer science, the edit distance between two strings of characters generally refers to the Levenshtein distance. However, according to Nico Jacobs, “The term ‘edit distance’ is sometimes used to refer to the distance in which insertions and deletions have equal cost and...
to threshold the multiple variation) - Wildcard searchWildcard character-Telecommunication:In telecommunications, a wildcard character is a character that may be substituted for any of a defined subset of all possible characters....
. A search that substitutes one or more characters in a search query for a wildcard character such as an asteriskAsteriskAn asterisk is a typographical symbol or glyph. It is so called because it resembles a conventional image of a star. Computer scientists and mathematicians often pronounce it as star...
. For example using the asterisk in a search query "s*n" will find "sin", "son", "sun", etc. in a text.
Improved search algorithms
The PageRankPageRank
PageRank is a link analysis algorithm, named after Larry Page and used by the Google Internet search engine, that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set...
algorithm developed by Google
Google
Google Inc. is an American multinational public corporation invested in Internet search, cloud computing, and advertising technologies. Google hosts and develops a number of Internet-based services and products, and generates profit primarily from advertising through its AdWords program...
gives more prominence to documents to which other Web page
Web page
A web page or webpage is a document or information resource that is suitable for the World Wide Web and can be accessed through a web browser and displayed on a monitor or mobile device. This information is usually in HTML or XHTML format, and may provide navigation to other web pages via hypertext...
s have linked. See Search engine
Search engine
A search engine is an information retrieval system designed to help find information stored on a computer system. The search results are usually presented in a list and are commonly called hits. Search engines help to minimize the time required to find information and the amount of information...
for additional examples.
Software
The following is a partial list of available software products whose predominant purpose is to perform full text indexing and searching. Some of these are accompanied with detailed descriptions of their theory of operation or internal algorithms, which can provide additional insight into how full text search may be accomplished.Free and open source software
- BaseXBaseXBaseX is a native and light-weight XML database management system, developed as a community project on GitHub. It is specialized in storing, querying, and visualizing large XML documents and collections...
- DataparkSearchDataparkSearchDataparkSearch is a search engine designed to organize search within a website, group of websites, intranet or local system.DataparkSearch is written in C. Distributed under the terms of the GNU General Public License, DataparkSearch is free software....
- FerretFerret search libraryFerret is a search library for Ruby inspired by LuceneThere is a Ruby on Rails plugin called acts_as_ferret.Ferret utilizes Poshlib.- External links :** Ruby on Rails plugin for Ferret...
- Ht-//DigHt-//dight://Dig is a free software indexing and searching system created in 1995 by Andrew Scherpbier while he was employed at San Diego State University. It can provide a search engine for a single website....
- Hyper Estraier
- Lemur/Indri
- LuceneLuceneApache Lucene is a free/open source information retrieval software library, originally created in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License....
- mnoGoSearchMnoGoSearchmnoGoSearch is an open source search engine for Unix-like computer systems written in C. It is distributed under the GNU General Public License and designed to organize search within a website, group of websites, intranet or local system....
- SphinxSphinx (search engine)Sphinx is a free software search engine designed with indexing database content in mind. It currently supports MySQL, PostgreSQL, and ODBC-compliant databases as data sources natively. Other data sources can be indexed via pipe in a custom XML format...
- Swish-eSWISH-ESWISH-E stands for Simple Web Indexing System for Humans - Enhanced. It is used to index collections of documents ranging up to one million documents in size and includes import filters for many document types.- See also :...
- XapianXapianXapian is an open source probabilistic information retrieval library, released under the GNU General Public License . It is a full text search engine library for programmers....
- KinoSearchKinoSearchKinosearch is a search engine written in Perl and C and a loose port of the Java search engine library Apache Lucene.The Socialtext wiki software uses this search engine, and so does the MojoMojo wiki...
- Apache Solr
Proprietary software
- MarkLogic
- AttivioAttivioAttivio, Inc., is a privately held Newton, Massachusetts-based enterprise software company that produces and sells a unified information access platform that lets users find all types of information with a single query and analyze data extracted from text along with traditional data.Attivio's core...
- Autonomy CorporationAutonomy CorporationAutonomy is a multinational enterprise software company with joint headquarters in Cambridge, United Kingdom, and San Francisco, USA and a subsidiary of Hewlett-Packard. The company uses a combination of technologies born out of research at the University of Cambridge...
- BA Insight
- BrainwareBrainwareBrainware is an American software company that markets data capture and extraction products. The company spun out of Dulles-based SER Solutions Inc. in February 2006 when SER was acquired by The Gores Group LLC...
- BRS/SearchBRS/SearchBRS/Search is a full-text database and information retrieval system. BRS/Search uses a fully inverted indexing system to store, locate, and retrieve unstructured data...
- Concept Searching LimitedConcept Searching LimitedConcept Searching Limited is a software company which specializes in information retrieval software. Its has products for Enterprise search, Taxonomy Management and Statistical classification.-History:...
- Dieselpoint
- dtSearch
- Endeca
- ExaleadExaleadExalead is a software company that provides search platforms and search-based applications for consumer and business users. The company is headquartered in Paris, France, and is a subsidiary of Dassault Systèmes .- CloudView Platform :...
- Fast Search & TransferFast Search & TransferFast Search & Transfer ASA is a Norwegian company based in Oslo. FAST focuses on data search technologies. It also has offices located in Germany, Italy, Sri Lanka, France, Japan, the United Kingdom, the United States, Brazil, Mexico and other countries around the world. The company was founded...
- InktomiInktomiInktomi Corporation was a California company that provided software for Internet service providers. It was founded in 1996 by UC Berkeley professor Eric Brewer and graduate student Paul Gauthier. The company was initially founded based on the real-world success of the search engine they developed...
- VivísimoVivísimoVivisimo is a privately held enterprise search software company in Pittsburgh that develops and sells software products to improve search on the web and in enterprises...
- Lucid ImaginationLucid ImaginationLucid Imagination is a Redwood City, California-based company offering commercial support, consulting, training and value-add software for open source Apache Lucene and Apache Solr search technologies. Lucid Imagination is a private company founded in 2007 and publicly launched on January 26, 2009...
See also
- Compound term processingCompound term processingCompound term processing is the name that is used for a category of techniques in Information retrieval applications that performs matching on the basis of compound terms...
- Controlled vocabularyControlled vocabularyControlled vocabularies provide a way to organize knowledge for subsequent retrieval. They are used in subject indexing schemes, subject headings, thesauri, taxonomies and other form of knowledge organization systems...
- Enterprise searchEnterprise searchEnterprise search is the practice of making content from multiple enterprise-type sources, such as databases and intranets, searchable to a defined audience.-Enterprise search summary:...
- Information ExtractionInformation extractionInformation extraction is a type of information retrieval whose goal is to automatically extract structured information from unstructured and/or semi-structured machine-readable documents. In most of the cases this activity concerns processing human language texts by means of natural language...
- Information retrievalInformation retrievalInformation retrieval is the area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web...
- Faceted search
- Full text databaseFull text databaseA full text database or a complete text database is a database that contains the complete text of books, dissertations, journals, magazines, newspapers or other kinds of textual documents...
- List of enterprise search vendors
- Search engineSearch engineA search engine is an information retrieval system designed to help find information stored on a computer system. The search results are usually presented in a list and are commonly called hits. Search engines help to minimize the time required to find information and the amount of information...
- Search engine indexing - how search engines generate indices to support full text searching
- SQL Server Full Text Search (implementation of)