German Reference Corpus
The German Reference Corpus (original: Deutsches Referenzkorpus; short: DeReKo) is an electronic archive of text corpora of contemporary written German. It was first created in 1964 and is hosted at the Institute for the German Language (IDS) in Mannheim, Germany. The corpus archive is continuously updated and expanded. It currently comprises more than 4.0 billion word tokens (as of August 2010) and constitutes the largest linguistically motivated collection of contemporary German texts. Today, it is one of the major resources worldwide for the study of written German.
Alternative names
The German Reference Corpus is often referred to by other names, such as Mannheim corpora, IDS corpora, COSMAS corpora and their German equivalents. The name Deutsches Referenzkorpus (DeReKo) was originally used for a specific portion of the current archive, which was collected between 1999 and 2002 by a number of institutions in a joint project of the same name. Since 2004, Deutsches Referenzkorpus (DeReKo) has been the official name of the full corpus archive.
Conception and composition
The German Reference Corpus comprises fictional and academic texts, a large number of newspaper texts and several other text types. The texts cover the time range from around 1950 to the present.
In contrast to other well-known corpora and corpus archives (such as the British National Corpus), however, the German Reference Corpus is explicitly not designed as a balanced corpus: the distribution of DeReKo texts across time or text types does not follow any predefined percentages.
This design reflects the fact that whether a given corpus constitutes a balanced or even representative language sample can only be assessed with respect to a specific language domain (i.e., the statistical population). Because different linguistic investigations generally aim at different language domains, the declared purpose of the German Reference Corpus is to serve as a versatile superordinate sample, or primordial sample (German: Ur-Stichprobe), of contemporary written German, from which corpus users may draw a specialised subsample (a so-called virtual corpus) to represent the language domain they wish to investigate.
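The idea of drawing a virtual corpus from a deliberately unbalanced primordial sample can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the TextRecord fields, the function names draw_virtual_corpus and token_share_by_type, and the toy archive are hypothetical and do not reflect DeReKo's actual metadata schema or any COSMAS II interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class TextRecord:
    """Hypothetical metadata for one archived text (not DeReKo's real schema)."""
    text_id: str
    year: int
    text_type: str  # e.g. "newspaper", "fiction", "academic"
    tokens: int


def draw_virtual_corpus(
    archive: Iterable[TextRecord],
    belongs_to_domain: Callable[[TextRecord], bool],
) -> List[TextRecord]:
    """Select the texts that fall within the language domain under study."""
    return [record for record in archive if belongs_to_domain(record)]


def token_share_by_type(corpus: Iterable[TextRecord]) -> Dict[str, float]:
    """Return each text type's share of the corpus, measured in tokens."""
    totals: Dict[str, int] = {}
    for record in corpus:
        totals[record.text_type] = totals.get(record.text_type, 0) + record.tokens
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {text_type: n / grand_total for text_type, n in totals.items()}


# Toy archive: deliberately unbalanced, dominated by newspaper text.
archive = [
    TextRecord("t1", 1955, "fiction", 80_000),
    TextRecord("t2", 1998, "newspaper", 500_000),
    TextRecord("t3", 2005, "newspaper", 700_000),
    TextRecord("t4", 2007, "academic", 120_000),
    TextRecord("t5", 2009, "fiction", 100_000),
]

# Domain of interest (an arbitrary example): non-newspaper writing from 2000 on.
virtual_corpus = draw_virtual_corpus(
    archive, lambda r: r.year >= 2000 and r.text_type != "newspaper"
)

print([record.text_id for record in virtual_corpus])
print(token_share_by_type(archive))         # composition of the full archive
print(token_share_by_type(virtual_corpus))  # composition of the subsample
```

Comparing the two printed compositions shows the point of the design: the archive as a whole need not be balanced, because balance is judged only on the subsample drawn for a particular research question.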
Access
Due to copyright and licence restrictions, the DeReKo archive may be neither copied nor offered for download. It can be queried and analyzed free of charge via the COSMAS II system; end users are required to register by name and to agree to use the corpus data exclusively for non-commercial, academic purposes. COSMAS II enables users to compile from DeReKo a virtual corpus suited to their specific research questions.
See also
- Text corpus
- Corpus linguistics
- American National Corpus
- Bank of English
- British National Corpus
- Corpus of Contemporary American English (COCA)
- Oxford English Corpus
External links
- DeReKo website (German)
- COSMAS II - free DeReKo interface (German website)