Subject indexing
Subject indexing is the act of describing or classifying a document by index terms or other symbols in order to indicate what the document is about, to summarize its content, or to increase its findability. In other words, it is about identifying and describing the subject of documents. Indexes are constructed, separately, on three distinct levels: terms in a document such as a book; objects in a collection such as a library; and documents (such as books and articles) within a field of knowledge.
Subject indexing is used in information retrieval, especially to create bibliographic databases to retrieve documents on a particular subject. Examples of academic indexing services are Zentralblatt MATH, Chemical Abstracts and PubMed. Index terms were traditionally assigned mostly by experts, but author keywords are also common.
The process of indexing begins with an analysis of the subject of the document. The indexer must then identify terms which appropriately identify the subject, either by extracting words directly from the document or by assigning words from a controlled vocabulary. The terms in the index are then presented in a systematic order. Indexers must decide how many terms to include and how specific the terms should be; together this determines the depth of indexing.
Subject analysis
The first step in indexing is to decide on the subject matter of the document. In manual indexing, the indexer considers the subject matter in terms of answers to a set of questions such as "Does the document deal with a specific product, condition or phenomenon?". Because the analysis is influenced by the knowledge and experience of the indexer, two indexers may analyse the content differently and so arrive at different index terms. This will affect the success of retrieval.
Automatic vs. manual subject analysis
Automatic indexing follows set processes of analysing the frequencies of word patterns and comparing the results to other documents in order to assign documents to subject categories. It requires no understanding of the material being indexed and therefore leads to more uniform indexing, but at the expense of interpreting the true meaning. A computer program will not understand the meaning of statements and may therefore fail to assign some relevant terms or assign terms incorrectly. Human indexers focus their attention on certain parts of the document, such as the title, abstract, summary and conclusions, because analysing the full text in depth is costly and time consuming. An automated system removes that time limit and allows the entire document to be analysed, but it can also be directed to particular parts of the document.
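The kind of frequency-based comparison described above can be pictured with a minimal Python sketch. The category profiles, sample sentence and threshold below are purely hypothetical and are only meant to show the general shape of such a procedure.

from collections import Counter
import re

# Hypothetical category profiles: words typical of documents already assigned to each category.
CATEGORY_PROFILES = {
    "diabetes": {"insulin", "glucose", "diabetes", "blood", "sugar"},
    "astronomy": {"star", "planet", "orbit", "telescope", "galaxy"},
}

def tokenize(text):
    # Lower-case the text and split it into alphabetic word tokens.
    return re.findall(r"[a-z]+", text.lower())

def assign_category(text, top_n=10):
    # Pick the category whose profile overlaps most with the document's most frequent words.
    frequent = {word for word, _ in Counter(tokenize(text)).most_common(top_n)}
    scores = {name: len(frequent & profile) for name, profile in CATEGORY_PROFILES.items()}
    return max(scores, key=scores.get)

print(assign_category("Insulin regulates blood glucose; glucose levels rise in diabetes."))
# prints "diabetes"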
Term selection
The second stage of indexing involves translating the subject analysis into a set of index terms. This can involve extracting terms directly from the document or assigning them from a controlled vocabulary. With the ability to conduct a full text search widely available, many people have come to rely on their own expertise in conducting information searches, and full text search has become very popular. Nevertheless, subject indexing and its experts, professional indexers, catalogers, and librarians, remain crucial to information organization and retrieval. These experts understand controlled vocabularies and are able to find information that cannot be located by full text search. The cost of expert analysis to create subject indexing is not easily compared to the cost of the hardware, software and labor needed to produce a comparable set of full-text, fully searchable materials. With new web applications that allow every user to annotate documents, social tagging has gained popularity, especially on the Web.
One application of indexing, the book index, remains relatively unchanged despite the information revolution.
Extraction indexing
Extraction indexing involves taking words directly from the document. It uses natural language and lends itself well to automated techniques in which word frequencies are calculated and those above a pre-determined threshold are used as index terms. A stop-list containing common words such as "the" and "and" is consulted, and such stop words are excluded as index terms. Automated extraction indexing may lead to a loss of meaning by indexing single words rather than phrases. Although it is possible to extract commonly occurring phrases, this becomes more difficult if key concepts are inconsistently worded as phrases.
Automated extraction indexing also has the problem that, even with the use of a stop-list to remove common words such as "the", some frequent words may not be useful for discriminating between documents. For example, the term "glucose" is likely to occur frequently in any document related to diabetes, so using it as an index term would return most or all of the documents in the database. Post-coordinate indexing, where terms are combined at the time of searching, reduces this effect, but the onus is then on the searcher rather than the information professional to link appropriate terms. In addition, terms that occur infrequently may be highly significant; for example, a new drug may be mentioned infrequently, but the novelty of the subject makes any reference to it significant.
One method for allowing rarer terms to be included and common words to be excluded by automated techniques is a relative frequency approach, in which the frequency of a word in a document is compared to its frequency in the database as a whole. A term that occurs more often in a document than would be expected from the rest of the database can then be used as an index term, while terms that occur equally frequently throughout are excluded. Another problem with automated extraction is that it does not recognise when a concept is discussed but is not identified in the text by an indexable keyword.
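The mechanisms described in this section, a stop-list, a frequency threshold, and a relative frequency comparison against the rest of the database, can be sketched in a few lines of Python. The stop-list, threshold and sample texts below are hypothetical and chosen only to mirror the glucose example above.

from collections import Counter
import re

STOP_WORDS = {"the", "and", "of", "a", "in", "is", "to"}  # illustrative stop-list

def tokenize(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def extraction_index(document, min_count=2):
    # Simple extraction indexing: non-stop words at or above the frequency threshold become index terms.
    counts = Counter(tokenize(document))
    return sorted(term for term, n in counts.items() if n >= min_count)

def relative_frequency_index(document, collection, ratio=2.0):
    # Keep terms that are markedly more frequent in the document than in the collection as a whole.
    doc_counts = Counter(tokenize(document))
    doc_total = sum(doc_counts.values())
    coll_counts = Counter()
    for other in collection:
        coll_counts.update(tokenize(other))
    coll_total = sum(coll_counts.values()) or 1
    selected = []
    for term, n in doc_counts.items():
        doc_rate = n / doc_total
        coll_rate = (coll_counts[term] + 1) / coll_total  # +1 avoids division by zero for unseen terms
        if doc_rate / coll_rate >= ratio:
            selected.append(term)
    return sorted(selected)

document = "glucose glucose insulin insulin insulin diabetes"
collection = ["glucose appears in many diabetes documents", "glucose tolerance and diet"]
print(extraction_index(document))                      # frequent terms: ['glucose', 'insulin']
print(relative_frequency_index(document, collection))  # 'glucose' is common everywhere, so only ['insulin'] remains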
Assignment indexing
An alternative is assignment indexing, where index terms are taken from a controlled vocabulary. This has the advantage of controlling for synonyms, as the preferred term is indexed and synonyms or related terms direct the user to the preferred term. This means the user can find articles regardless of the specific term used by the author and is saved from having to know and check all possible synonyms. It also removes any confusion caused by homographs through the inclusion of a qualifying term. A third advantage is that it allows the linking of related terms, whether they are linked by hierarchy or association; for example, an index entry for an oral medication may list other oral medications as related terms on the same level of the hierarchy but would also link to broader terms such as treatment. Assignment indexing is used in manual indexing to improve inter-indexer consistency, as different indexers have a controlled set of terms to choose from. Controlled vocabularies do not completely remove inconsistencies, as two indexers may still interpret the subject differently.
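A controlled vocabulary of this kind, with preferred terms, "use for" synonyms and broader terms, can be represented very simply. The following Python sketch uses an invented two-entry vocabulary to show how an author's own wording might be mapped onto preferred index terms.

# Hypothetical controlled vocabulary: preferred terms with their "use for" synonyms and broader terms.
VOCABULARY = {
    "myocardial infarction": {"use_for": {"heart attack", "mi"}, "broader": ["cardiovascular diseases"]},
    "oral medication": {"use_for": {"oral drug", "pill"}, "broader": ["treatment"]},
}

# Look-up table from every synonym (and each preferred term itself) to the preferred term.
SYNONYM_TO_PREFERRED = {
    variant: preferred
    for preferred, entry in VOCABULARY.items()
    for variant in entry["use_for"] | {preferred}
}

def assign_terms(author_keywords):
    # Map the author's wording onto preferred index terms and add their broader terms.
    assigned = set()
    for keyword in author_keywords:
        preferred = SYNONYM_TO_PREFERRED.get(keyword.lower())
        if preferred:
            assigned.add(preferred)
            assigned.update(VOCABULARY[preferred]["broader"])
    return sorted(assigned)

print(assign_terms(["Heart attack", "pill"]))
# prints ['cardiovascular diseases', 'myocardial infarction', 'oral medication', 'treatment']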
Index presentation
The final phase of indexing is to present the entries in a systematic order. This may involve linking entries. In a pre-coordinated index, the indexer determines the order in which terms are linked in an entry by considering how a user might formulate a search. In a post-coordinated index, the entries are presented singly and the user links them through searches, most commonly carried out by computer software. Post-coordination results in a loss of precision in comparison to pre-coordination.
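Post-coordination can be pictured as intersecting, at search time, the sets of documents behind each single index term. The following Python sketch uses an invented inverted index to show the idea.

# Hypothetical inverted index: each index term maps to the set of document IDs it was assigned to.
INVERTED_INDEX = {
    "diabetes": {1, 2, 5},
    "children": {2, 3},
    "treatment": {1, 2, 4},
}

def post_coordinate_search(terms):
    # Combine single terms at search time: return documents indexed under every requested term.
    document_sets = [INVERTED_INDEX.get(term, set()) for term in terms]
    return set.intersection(*document_sets) if document_sets else set()

print(post_coordinate_search(["diabetes", "children"]))   # prints {2}
print(post_coordinate_search(["diabetes", "treatment"]))  # prints {1, 2}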
Depth of indexing
Indexers must make decisions about which entries should be included and how many entries an index should incorporate. The depth of indexing describes the thoroughness of the indexing process with reference to exhaustivity and specificity.
Exhaustivity
An exhaustive index is one which lists all possible index terms. Greater exhaustivity gives higher recall, or a greater likelihood that all the relevant articles will be retrieved; however, this comes at the expense of precision, meaning that the user may retrieve a larger number of irrelevant documents or documents that deal with the subject only in little depth. In a manual system a greater level of exhaustivity brings a greater cost, as more staff hours are required; the additional time taken in an automated system would be much less significant. At the other end of the scale, in a selective index only the most important aspects are covered. Recall is reduced in a selective index because, if the indexer does not include enough terms, a highly relevant article may be overlooked. Indexers should therefore strive for a balance and consider what the document may be used for, as well as the implications of time and expense.
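Recall and precision, as used above, are the standard measures of retrieval performance: recall is the share of all relevant documents that are retrieved, and precision is the share of retrieved documents that are relevant. The short Python sketch below computes both for an invented search result.

def recall_and_precision(retrieved, relevant):
    # Recall = relevant documents retrieved / all relevant documents;
    # precision = relevant documents retrieved / all retrieved documents.
    hits = len(retrieved & relevant)
    return hits / len(relevant), hits / len(retrieved)

relevant_docs = {1, 2, 3, 4}           # hypothetical documents actually about the subject
retrieved_docs = {2, 3, 4, 5, 6, 7}    # hypothetical documents returned by an exhaustively indexed search
print(recall_and_precision(retrieved_docs, relevant_docs))  # prints (0.75, 0.5)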
Specificity
Specificity describes how closely the index terms match the topics they represent. An index is said to be specific if the indexer uses descriptors that are parallel to the concepts of the document and reflect those concepts precisely. Specificity tends to increase with exhaustivity, because the more terms that are included, the narrower those terms will be.
Indexing theory
Hjørland (2011) found that theories of indexing are, at the deepest level, connected to different theories of knowledge:
- Rationalist theories of indexing (such as Ranganathan's theory) suggest that subjects are constructed logically from a fundamental set of categories. The basic method of subject analysis is then "analytic-synthetic": to isolate a set of basic categories (analysis) and then to construct the subject of any given document by combining those categories according to certain rules (synthesis).
- Empiricist theories of indexing are based on selecting similar documents on the basis of their properties, in particular by applying numerical statistical techniques.
- Historicist and hermeneutical theories of indexing suggest that the subject of a given document is relative to a given discourse or domain, which is why the indexing should reflect the needs of a particular discourse or domain. According to hermeneutics, a document is always written and interpreted from a particular horizon. The same is true of systems of knowledge organization and of all users searching such systems: any question put to such a system is put from a particular horizon. All those horizons may be more or less in consensus or in conflict. To index a document is to try to contribute to the retrieval of "relevant" documents by knowing about those different horizons.
- Pragmatic and critical theories of indexing (such as Hjørland, 1997) agree with the historicist point of view that subjects are relative to specific discourses, but emphasize that subject analysis should support given goals and values and should consider the consequences of indexing one way or another. On these theories, indexing cannot be neutral, and it is a mistaken goal to try to index in a neutral way. Indexing is an act (and computer-based indexing acts according to the programmer's intentions), and acts serve human goals. Libraries and information services also serve human goals, which is why their indexing should be done in a way that supports those goals as much as possible. At first glance this looks strange, because the goal of libraries and information services is to identify any document or piece of information; nonetheless, any specific way of indexing always supports some kinds of use at the expense of others.
The documents to be indexed are intended to serve specific purposes in a community, and the indexing should basically be intended to serve the same purposes. Primary and secondary documents and information services are parts of the same overall social system. In such a system, different theories, epistemologies and worldviews may be at play, and users need to be able to orient themselves and to navigate among those different views. This calls for a mapping of the different epistemologies in the field and the classification of each document into such a map. Excellent examples of such different paradigms and their consequences for indexing and classification systems are provided in the domain of art by Ørom (2003) and in music by Abrahamsen (2003).
The core of indexing is, as stated by Rowley & Farrow, to evaluate a paper's contribution to knowledge and index it accordingly; or, in the words of Hjørland (1992, 1997), to index its informative potentials.
"In order to achieve good consistent indexing, the indexer must have a thorough appreciation of the structure of the subject and the nature of the contribution that the document is making to the advancement of knowledge." (Rowley & Farrow, 2000, p. 99).
See also
- Document classification
- Metadata
- Overcategorization
- Thomas of Ireland, a medieval pioneer in subject indexing