LRE Map
The LRE Map is a freely accessible, large database of resources dedicated to natural language processing (NLP). The original feature of the LRE Map is that its records are collected during paper submission to several major NLP conferences. The records are then cleaned and gathered into a global database called the "LRE Map".
The LRE Map is intended to be an instrument for collecting information about language resources
and to become, at the same time, a community for users, a place to share and discover resources,
discuss opinions, provide feedback, discover new trends, etc. It is an instrument for discovering, searching and documenting language resources, understood here in a broad sense as both data and tools.
The large amount of information contained in the Map can be analyzed in many different ways. A
few general analyses are available on the Resource Map website at http://www.resourcebook.eu
(click on the “Stats” link). For instance, the LRE Map can provide information about the most frequent type of resource, the
most represented language, the applications for which resources are used or are being developed,
the proportion of new resources vs. already existing ones, or the way in which resources are
distributed to the community.
Context
Several institutions worldwide maintain catalogues of language resources (ELRA, LDC, NICT Universal Catalogue, ACL Data and Code Repository, OLAC, LT World, etc.).
However, it has been estimated that only 10% of existing resources are known, either through distribution catalogues or via direct publicity by providers (web sites and the like). The rest remains hidden, emerging only briefly when a resource is presented in a research paper or report at some conference. Even then, a resource may remain in the background simply because the focus of the research is not on the resource per se.
History
The LRE Map originated under the name "LREC Map" during the preparation of the LREC 2010 conference. More precisely, the idea was discussed within the FLaReNet project and, in collaboration with ELRA, the Map was put in place at LREC-2010. The LREC organizers asked the authors to provide some basic information about all the resources (in a broad sense, i.e. including tools, standards and evaluation packages), either used or created, described in their papers. All these descriptors were then gathered in a global matrix called the LREC Map.
The same methodology and requirements for authors have since been applied and extended to other conferences, namely COLING-2010, EMNLP-2010, RANLP-2011 and LREC-2012.
Size and Content
The size of the database increases over time. The data collected at LREC-2010 consisted of 1889 entries. Each resource is described according to the following attributes:
- Resource type, e.g. lexicon, annotation tool, tagger/parser.
- Resource production status, e.g. newly created-finished, existing-updated.
- Resource availability, e.g. freely available, from data center.
- Resource modality, e.g. speech, written, sign language.
- Resource use, e.g. named entity recognition, language identification, machine translation.
- Resource language, e.g. English, 23 European Union languages, official languages of India.
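The attributes above can be pictured as a simple record type. The following is only an illustrative sketch: the class name `ResourceEntry` and its field names are hypothetical and do not reflect the actual LRE Map schema, and the sample values are taken from the examples quoted in the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceEntry:
    """One hypothetical LRE Map record; all field names are assumptions."""
    resource_type: str      # e.g. "lexicon", "annotation tool", "tagger/parser"
    production_status: str  # e.g. "existing-updated"
    availability: str       # e.g. "freely available", "from data center"
    modality: str           # e.g. "speech", "written", "sign language"
    use: str                # e.g. "machine translation"
    language: str           # e.g. "English"


# Sample record built only from example values quoted in the attribute list.
example = ResourceEntry(
    resource_type="lexicon",
    production_status="existing-updated",
    availability="freely available",
    modality="written",
    use="machine translation",
    language="English",
)
```

Making the record immutable (`frozen=True`) is a design choice of this sketch, so that entries can be used as dictionary keys or set members when counting.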
Uses
The LRE Map is an important tool for charting the NLP field. Compared to other studies based on subjective scoring, the LRE Map is built on factual data. In addition to being an information-gathering tool, the Map has great potential for many uses:
- It is an instrument for monitoring the evolution of the field (useful for funders), if applied in different contexts and at different times.
- It can be seen as a huge joint effort, the beginning of an even larger cooperative action not just among a few leaders but among all researchers.
- It is also an "educational" means towards the broad acknowledgment of the need for meta-research activities with the active involvement of many.
- It is also instrumental in introducing the new notion of "citation of resources", which could provide an award and a means of scholarly recognition for researchers engaged in resource creation.
- It is used to help organize the conferences of the field, such as LREC.
Derived matrices
The data were then cleaned and sorted by Joseph Mariani (CNRS-LIMSI IMMI) and Gil Francopoulo (CNRS-LIMSI IMMI + Tagmatica) in order to compute the various matrices of the final FLaReNet reports. One of them, the matrix for written data at LREC-2010, is as follows:

Language | Corpus | Lexicon | Ontology | Grammar/Language Model | Terminology |
---|---|---|---|---|---|
Bulgarian | 7 | 6 | 1 | 1 | 1 |
Czech | 12 | 7 | 2 | 1 | 1 |
Danish | 6 | 2 | 0 | 2 | 0 |
Dutch | 17 | 8 | 2 | 1 | 2 |
English | 206 | 77 | 18 | 11 | 10 |
Estonian | 3 | 1 | 0 | 0 | 1 |
Finnish | 3 | 2 | 0 | 1 | 0 |
French | 44 | 24 | 3 | 4 | 5 |
German | 43 | 15 | 4 | 2 | 3 |
Greek | 10 | 3 | 2 | 0 | 0 |
Hungarian | 8 | 4 | 0 | 1 | 1 |
Irish | 1 | 0 | 0 | 0 | 0 |
Italian | 32 | 16 | 4 | 2 | 0 |
Latvian | 9 | 0 | 0 | 0 | 1 |
Lithuanian | 4 | 0 | 2 | 0 | 1 |
Maltese | 1 | 0 | 0 | 1 | 0 |
Polish | 7 | 2 | 1 | 2 | 1 |
Portuguese | 19 | 6 | 1 | 1 | 0 |
Romanian | 12 | 7 | 1 | 1 | 0 |
Slovak | 2 | 0 | 0 | 1 | 0 |
Slovene | 5 | 1 | 0 | 0 | 0 |
Spanish | 29 | 19 | 4 | 5 | 2 |
Swedish | 19 | 4 | 0 | 1 | 0 |
Other Europe | 19 | 11 | 3 | 3 | 2 |
Regional Europe | 18 | 8 | 0 | 1 | 3 |
Multilingual | 5 | 3 | 1 | 0 | 1 |
Language independent | 9 | 3 | 16 | 2 | 1 |
Non applicable | 2 | 0 | 2 | 1 | 0 |
Total | 552 | 229 | 67 | 45 | 36 |
Not surprisingly, English is the most studied language, followed by French and German, and then by Italian and Spanish.
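A matrix like the one above is a cross-tabulation of entries by language and resource type. The following is a minimal sketch of that counting step, assuming each entry has been reduced to a `(language, resource_type)` pair. The function name `cross_tabulate` and the toy entries are invented for illustration; this is not the procedure or the data actually used for the FLaReNet reports.

```python
from collections import Counter


def cross_tabulate(entries):
    """Count entries per (language, resource_type) cell and per resource type."""
    cells = Counter(entries)                          # one count per table cell
    totals = Counter(rtype for _, rtype in entries)   # the "Total" row
    return cells, totals


# Toy entries invented for illustration only (not real LRE Map data).
toy_entries = [
    ("English", "Corpus"),
    ("English", "Corpus"),
    ("English", "Lexicon"),
    ("French", "Corpus"),
]

cells, totals = cross_tabulate(toy_entries)
print(cells[("English", "Corpus")])  # 2
print(totals["Corpus"])              # 3
```

A `Counter` returns 0 for a cell that never occurs, which matches the zero cells in the table without any special handling.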