Named entity recognition
Named-entity recognition (NER) (also known as entity identification and entity extraction) is a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Most research on NER systems has been structured as taking an unannotated block of text, such as this one:
- Jim bought 300 shares of Acme Corp. in 2006.
And producing an annotated block of text, such as this one:
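- <ENAMEX TYPE="PERSON">Jim</ENAMEX> bought <NUMEX TYPE="QUANTITY">300</NUMEX> shares of <ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in <TIMEX TYPE="DATE">2006</TIMEX>.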
In this example, the annotations have been done using so-called ENAMEX tags that were developed for the Message Understanding Conference in the 1990s.
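As a rough illustration of how inline annotations of this kind can be consumed, the following Python sketch converts a tagged sentence into plain text plus a list of character-offset entity spans. The tagged sentence, the NUMEX and TIMEX tag names, the TYPE values, and the function name extract_entities are assumptions made for this example; it is not official MUC tooling and does not handle nested tags.

```python
import re

# Matches MUC-style inline tags such as <ENAMEX TYPE="PERSON">Jim</ENAMEX>.
# Assumption: tags are not nested and always carry a TYPE attribute.
TAG_PATTERN = re.compile(r'<(ENAMEX|NUMEX|TIMEX) TYPE="([A-Z]+)">(.*?)</\1>')


def extract_entities(annotated):
    """Return (plain_text, entities) for an inline-annotated string.

    Each entity is a (start, end, type, surface) tuple whose offsets
    index into the returned plain text.
    """
    plain_parts = []
    entities = []
    cursor = 0      # current position in the annotated input
    plain_len = 0   # length of the plain text built so far
    for match in TAG_PATTERN.finditer(annotated):
        # Copy the untagged text that precedes this tag.
        before = annotated[cursor:match.start()]
        plain_parts.append(before)
        plain_len += len(before)
        # Record the entity using offsets into the plain text.
        surface = match.group(3)
        entities.append((plain_len, plain_len + len(surface), match.group(2), surface))
        plain_parts.append(surface)
        plain_len += len(surface)
        cursor = match.end()
    plain_parts.append(annotated[cursor:])
    return "".join(plain_parts), entities


annotated = (
    '<ENAMEX TYPE="PERSON">Jim</ENAMEX> bought '
    '<NUMEX TYPE="QUANTITY">300</NUMEX> shares of '
    '<ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in '
    '<TIMEX TYPE="DATE">2006</TIMEX>.'
)
text, entities = extract_entities(annotated)
for start, end, label, surface in entities:
    print(f"{label:<13}{start:>3}-{end:<3} {surface}")
```

Storing entities as offsets rather than inline tags keeps the original text untouched, which makes it easier to compare the output of different systems on the same input.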
State-of-the-art NER systems for English produce near-human performance. For example, the best system entering MUC-7 achieved an F-measure of 93.39%, while human annotators scored 97.60% and 96.95%. The best system thus had between two and three times the error rate (6.61%) of the human annotators (2.40% and 3.05%).
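The F-measure cited here is the harmonic mean of precision and recall. The Python sketch below shows one way to compute it over sets of entity spans and reproduces the error-rate comparison from the figures above. It assumes exact-match scoring, where a predicted entity counts as correct only if both its boundaries and its type match the reference; this is a simplification and not necessarily the scoring procedure used at MUC-7. The function name and the toy spans are hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall and F1 over collections of entity tuples."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data: (start, end, type) spans for one sentence.
gold = {(0, 3, "PERSON"), (11, 14, "QUANTITY"),
        (25, 35, "ORGANIZATION"), (39, 43, "DATE")}
predicted = {(0, 3, "PERSON"), (25, 29, "ORGANIZATION"), (39, 43, "DATE")}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

# Error rates implied by the MUC-7 figures quoted in the text.
system_error = 100 - 93.39
human_errors = [100 - 97.60, 100 - 96.95]
ratios = [system_error / h for h in human_errors]
print([round(r, 2) for r in ratios])
```

In the toy data the system truncates the organization name, so that prediction counts as wrong under exact matching even though it overlaps the reference span.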
Approaches
NER systems have been created that use linguistic grammar-based techniques as well as statistical models. Hand-crafted grammar-based systems typically obtain better precision, but at the cost of lower recall and months of work by experienced computational linguists. Statistical NER systems typically require a large amount of manually annotated training data.
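The precision and recall trade-off of hand-crafted systems can be seen even in a toy example. The Python sketch below applies three hand-written patterns to the sample sentence used earlier in the article. The rules, labels, and function name are hypothetical and far simpler than a real grammar; they are meant only to show why narrow rules tend to be right when they fire but miss entities they were not written for.

```python
import re

# Hypothetical hand-written rules; a real grammar would be far larger.
RULES = [
    # One or more capitalized words followed by a company suffix.
    ("ORGANIZATION", re.compile(r"\b(?:[A-Z][a-z]+ )+(?:Corp\.|Inc\.|Ltd\.)")),
    # Four-digit years in the 1900s or 2000s.
    ("DATE", re.compile(r"\b(?:19|20)\d{2}\b")),
    # A capitalized word preceded by an honorific.
    ("PERSON", re.compile(r"\b(?:Mr\.|Ms\.|Dr\.) [A-Z][a-z]+")),
]


def tag(text):
    """Return sorted (start, end, label, surface) spans matched by RULES."""
    spans = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label, match.group()))
    return sorted(spans)


spans = tag("Jim bought 300 shares of Acme Corp. in 2006.")
for span in spans:
    print(span)
```

Here the organization and the year are found, but "Jim" is missed because the PERSON rule only fires after an honorific. Loosening the rule to any capitalized word would recover "Jim" but would also wrongly tag "Acme" and any other capitalized token (a sentence-initial "The", for instance), so every recall gain has to be paid for with more rules and more expert effort.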
Problem domains
Research indicates that even state-of-the-art NER systems are brittle, meaning that NER systems developed for one domain do not typically perform well on other domains. Considerable effort is involved in tuning NER systems to perform well in a new domain; this is true for both rule-based and trainable statistical systems.
Early work in NER systems in the 1990s was aimed primarily at extraction from journalistic articles. Attention then turned to processing of military dispatches and reports. Later stages of the automatic content extraction (ACE) evaluation also included several types of informal text styles, such as weblogs and transcripts of conversational telephone speech. Since about 1998, there has been a great deal of interest in entity identification in the molecular biology, bioinformatics, and medical natural language processing communities. The most common entities of interest in that domain have been names of genes and gene products.
Named entity types
In the expression named entity, the word named restricts the task to those entities for which one or many rigid designators, as defined by Kripke, stand for the referent. For instance, the automotive company created by Henry Ford in 1903 is referred to as Ford or Ford Motor Company. Rigid designators include proper names as well as certain natural kind terms like biological species and substances.
There is general agreement to include temporal expressions and some numerical expressions (e.g., money and percentages) as instances of named entities in the context of the NER task. While some instances of these types are good examples of rigid designators (e.g., the year 2001), there are also many invalid ones (e.g., I take my vacations in “June”). In the first case, the year 2001 refers to the 2001st year of the Gregorian calendar. In the second case, the month June may refer to the month of an undefined year (past June, next June, June 2020, etc.). It is arguable that the named entity definition is loosened in such cases for practical reasons. The definition of the term named entity is therefore not strict and often has to be explained in the context in which it is used.
At least two hierarchies of named entity types have been proposed in the literature. The BBN categories, proposed in 2002, are used for question answering and consist of 29 types and 64 subtypes. Sekine's extended hierarchy, proposed in 2002, is made up of 200 subtypes.
Current Challenges and Research Trends
Despite the high F1 numbers reported on the MUC-7 dataset, the problem of named entity recognition is far from being solved. The main efforts are directed at reducing the annotation labor, achieving robust performance across domains, and scaling up to fine-grained entity types.
A recently emerging task of identifying "important expressions" in text and cross-linking them to Wikipedia can be seen as an instance of extremely fine-grained named entity recognition, where the types are the actual Wikipedia pages describing the (potentially ambiguous) concepts. Below is an example output of a Wikification system:
- <ENTITY url="http://en.wikipedia.org/wiki/Michael_I._Jordan">Michael Jordan</ENTITY> is a professor at <ENTITY url="http://en.wikipedia.org/wiki/University_of_California,_Berkeley">Berkeley</ENTITY>
Available Systems
Several systems are available online. For traditional NER, the most popular publicly available systems are the Illinois NER system, the Stanford NER system, and the Lingpipe NER system.
The Illinois NER reports 90.6 F1 on the CoNLL03 NER shared task data and the Stanford NER reports 86.86 F1.
There are also several publicly available Wikification systems for identifying important expressions in the text and cross-linking them to Wikipedia. The most notable are the Illinois Wikification system, WM Wikifier, and TAGME.
NER evaluation forums
Evaluation of NER systems is critical to scientific progress of this field. Most evaluation of these systems has been performed at conferences or contests put on by government organizations, sometimes acting in concert with contractors or academics.
Conference | Acronym | Language(s) | Year(s) | Sponsor | Archive Site |
---|---|---|---|---|---|
Message Understanding Conference | MUC | English | 1987–1999 | DARPA | http://www.itl.nist.gov/iaui/894.02/related_projects/muc/index.html |
Multilingual Entity Task Conference | MET | Chinese and Japanese | 1998 | US | http://www-nlpir.nist.gov/related_projects/tipster/met.htm |
Automatic Content Extraction Program | ACE | English | 2000–2008 | NIST | http://www.nist.gov/speech/tests/ace/ |
Conference on Computational Natural Language Learning | CoNLL | Spanish and Dutch / German and English | 2002–2003 | | http://www.cnts.ua.ac.be/conll/ |
Evaluation contest for named entity recognizers in Portuguese | HAREM | Portuguese | 2004–2008 | Linguateca | http://www.linguateca.pt/HAREM/ |
Information Retrieval and Extraction Exercise | IREX | Japanese | 1998–1999 | | http://portal.acm.org/citation.cfm?id=992814&dl=acm&coll=&CFID=15151515&CFTOKEN=6184618 |
ACL Special Interest Group in Chinese | SIGHan | Chinese | 2006 | | http://sighan.cs.uchicago.edu/bakeoff2006/ |
TAC Knowledge Base Population Evaluation | TAC/KBP | English | 2009– | NIST | http://www.nist.gov/tac/ |
External links
- Named entity recognition for Arabic – Issues and challenges in morphologically rich languages such as Arabic