Text corpus
In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts, now usually stored and processed electronically. Corpora are used for statistical analysis and hypothesis testing, for checking occurrences, and for validating linguistic rules within a specific universe.
A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora.
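As a rough illustration of what "formatted for side-by-side comparison" can mean, the sketch below pairs sentences from two languages by position. The sentences, the `align` helper and the strict one-to-one pairing are all assumptions made for this example; real aligned corpora often need many-to-one alignments and dedicated file formats.

```python
# Hypothetical toy data: two sides of a tiny sentence-aligned corpus.
english = ["The cat sleeps.", "I like tea."]
french = ["Le chat dort.", "J'aime le thé."]


def align(source, target):
    """Pair sentences by position; assumes a strict one-to-one alignment."""
    if len(source) != len(target):
        raise ValueError("both sides must contain the same number of sentences")
    return list(zip(source, target))


parallel_corpus = align(english, french)

# Print the pairs side by side, one tab-separated pair per line.
for src, tgt in parallel_corpus:
    print(f"{src}\t{tgt}")
```

Storing each pair as a tuple keeps the two language versions together, so a search hit in one language immediately yields its counterpart in the other.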
To make corpora more useful for linguistic research, they are often subjected to a process known as annotation. An example of annotating a corpus is part-of-speech tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the lemma (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual.
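One possible way to picture such annotation is as a record per word that carries the tag and the lemma alongside the surface form. The sketch below is only illustrative: the `Token` type, the tag names and the `word/TAG` output style are assumptions chosen for the example, not the format of any particular corpus.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str   # the word as it appears in the text
    pos: str    # part-of-speech tag (illustrative tagset, not a standard)
    lemma: str  # base form of the word


# A hand-annotated example sentence (hypothetical data).
sentence = [
    Token("The", "DET", "the"),
    Token("cats", "NOUN", "cat"),
    Token("were", "VERB", "be"),
    Token("sleeping", "VERB", "sleep"),
]


def to_tagged_text(tokens):
    """Render the tokens in a simple word/TAG style."""
    return " ".join(f"{t.form}/{t.pos}" for t in tokens)


def lemmas_with_pos(tokens, pos):
    """Return the lemmas of all tokens carrying the given tag."""
    return [t.lemma for t in tokens if t.pos == pos]


print(to_tagged_text(sentence))
print(lemmas_with_pos(sentence, "VERB"))
```

With the annotation in place, a query such as "all verbs, by base form" becomes a simple filter rather than a guess based on spelling.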
Some corpora have further structured levels of analysis applied. In particular, a number of smaller corpora may be fully parsed. Such corpora are usually called treebanks or parsed corpora. The difficulty of ensuring that the entire corpus is completely and consistently annotated means that these corpora are usually smaller, containing around 1 to 3 million words. Other levels of linguistic structured analysis are possible, including annotations for morphology, semantics and pragmatics.
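A parsed sentence in a treebank can be thought of as a labelled tree. The following minimal sketch represents one such tree with nested tuples and prints it in a bracketed style; the sentence, the category labels and both helper functions are assumptions made for illustration and do not reproduce the scheme of any specific treebank.

```python
# Each node is (label, child, child, ...); a leaf node is (tag, word).
# Hypothetical parse of "the dog barks".
tree = ("S",
        ("NP", ("DET", "the"), ("NOUN", "dog")),
        ("VP", ("VERB", "barks")))


def is_leaf(children):
    """A node is a leaf when its only child is the word itself."""
    return len(children) == 1 and isinstance(children[0], str)


def leaves(node):
    """Collect the words of the sentence from left to right."""
    _label, *children = node
    if is_leaf(children):
        return [children[0]]
    words = []
    for child in children:
        words.extend(leaves(child))
    return words


def bracketed(node):
    """Render the tree as a bracketed string."""
    label, *children = node
    if is_leaf(children):
        return f"({label} {children[0]})"
    return f"({label} " + " ".join(bracketed(c) for c in children) + ")"


print(leaves(tree))
print(bracketed(tree))
```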
Corpora are the main knowledge base in corpus linguistics. The analysis and processing of various types of corpora are also the subject of much work in computational linguistics, speech recognition and machine translation, where they are often used to create hidden Markov models for part-of-speech tagging and other purposes. Corpora and frequency lists derived from them are useful for language teaching.
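A frequency list of the kind mentioned above can be derived with a few lines of code. The sketch below is a minimal example under simple assumptions: the sample text is invented, and tokenization is reduced to lowercasing and matching runs of letters and apostrophes, which is far cruder than what real corpus tools do.

```python
import re
from collections import Counter

# Hypothetical miniature "corpus".
text = "the cat sat on the mat and the dog sat too"


def frequency_list(raw_text):
    """Return (word, count) pairs sorted from most to least frequent."""
    tokens = re.findall(r"[a-z']+", raw_text.lower())
    return Counter(tokens).most_common()


freq = frequency_list(text)
for word, count in freq[:3]:
    print(f"{word}\t{count}")
```

For language teaching, a list like this gives a simple way to see which words occur most often in the material at hand.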
Archaeological corpora
Text corpora are also used in the study of historical documents, for example in attempts to decipher ancient scripts, or in Biblical scholarship. Some archaeological corpora can be of such short duration that they provide a snapshot in time. One of the shortest corpora in time may be the Amarna letters (about 1350 BC), which span 15 to 30 years. The corpus of an ancient city (for example the "Kültepe Texts" of Turkey) may go through a series of corpora, determined by their find site dates.
Some notable text corpora
English language:
- Google N-Grams Corpus. Largest English corpus at 155 billion words. Also has corpora for other languages. (http://ngrams.googlelabs.com/datasets)
- American National Corpus
- Bank of English
- British National Corpus
- Corpus Juris Secundum
- Corpus of Contemporary American English (COCA). 425 million words, 1990-2011. Freely searchable online.
- Brown Corpus, forming part of the "Brown Family" of corpora, together with LOB, Frown and F-LOB.
- International Corpus of English
- Oxford English Corpus
- Scottish Corpus of Texts & Speech
Other languages:
- Hamshahri Corpus (Persian, a.k.a. Farsi)
- Amarna letters (for Akkadian, Egyptian, Sumerograms, etc.)
- TEP: Tehran English-Persian Parallel Corpus (http://ece.ut.ac.ir/nlp/)
- TMC: Tehran Monolingual Corpus, standard corpus for Persian language modeling (http://ece.ut.ac.ir/nlp/)
- Bijankhan Corpus, a contemporary Persian corpus for NLP research
- CETENFolha
- Croatian Language Corpus
- Croatian National Corpus
- Czech National Corpus
- Neo-Assyrian Text Corpus Project
- Russian National Corpus
- Slovenian National Corpus
- Thesaurus Linguae Graecae (Ancient Greek)
- Quranic Arabic Corpus (Classical Arabic)
- Eastern Armenian National Corpus (EANC). 110 million words. Freely searchable online.
- National Corpus of Polish
- German Reference Corpus (DeReKo). More than 4 billion words of contemporary written German.
- Tatoeba. A parallel corpus which contains about 913,000 sentences in 90 languages.
- Spanish text corpus by Molino de Ideas, which contains 660 million words.
- Kotonoha Japanese language corpus
See also
- Concordance
- Corpus linguistics
- Linguistic Data Consortium
- Natural language processing
- Natural Language Toolkit
- Parallel text alignment
- Search engines: they access the "web corpus".
- Speech corpus
- Translation memory
- Treebank
- Zipf's Law
External links
- Free, web-based corpora (45-425 million words each): American (COCA, COHA, TIME), British (BNC), Spanish, Portuguese
- ACL SIGLEX Resource Links: Text Corpora
- The Leipzig Glossing Rules: Conventions for interlinear morpheme-by-morpheme glosses
- Developing Linguistic Corpora: a Guide to Good Practice
- An interface for querying automatically-constructed virtual corpora.
- TEP: Tehran English-Persian Parallel Corpus.
- An interface for querying text corpora constructed through guided crawling of online news sites, the corpora (both local and virtual) constructed using the SPARTAN technique, and publicly available collections (e.g. Reuters-21578, texts from the Gutenberg project, GENIA).
- http://www.korpus.cz/intercorp/ Building synchronous parallel corpora of the languages taught at the Faculty of Arts of Charles University.