Oxford English Corpus
The Oxford English Corpus is a text corpus of the English language used by the makers of the Oxford English Dictionary and by Oxford University Press's language research programme. It is the largest corpus of its kind, containing over two billion words. The sources for these words are writings of all sorts, from "literary novels and specialist journals to everyday newspapers and magazines and from Hansard to the language of chatrooms, emails, and weblogs". This may be contrasted with similar databases that sample only a specific kind of writing.
The digital version of the Oxford English Corpus is formatted in XML and usually analysed with Sketch Engine software.
Each document in the OE Corpus is accompanied by metadata naming:
- title
- author (if known; many websites make this difficult to determine reliably)
- author gender (if known)
- language type (e.g. British English, American English)
- source website
- year (+ date, if known)
- date of collection
- domain + subdomain
- document statistics (number of tokens, sentences, etc.)
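As a rough illustration of how per-document metadata of this kind might be read from an XML file, the sketch below parses a made-up record with Python's standard library. The element names (`meta`, `language_type`, `source_website`, and so on), the sample values, and the `DocumentMetadata` and `parse_metadata` names are all assumptions made for this example; the actual markup scheme of the Oxford English Corpus is not described here and may differ.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

# Hypothetical record: tag names and values are invented for illustration
# and are not taken from the real Oxford English Corpus schema.
SAMPLE_XML = """<document>
  <meta>
    <title>A sample weblog post</title>
    <language_type>British English</language_type>
    <source_website>example.org</source_website>
    <year>2006</year>
    <domain>weblog</domain>
    <subdomain>personal</subdomain>
    <tokens>12</tokens>
  </meta>
  <text>This is a short sample text with exactly twelve tokens in it.</text>
</document>"""


@dataclass
class DocumentMetadata:
    """Subset of the metadata fields listed above; any field may be unknown."""
    title: Optional[str]
    author: Optional[str]
    author_gender: Optional[str]
    language_type: Optional[str]
    source_website: Optional[str]
    year: Optional[int]
    domain: Optional[str]
    subdomain: Optional[str]
    tokens: Optional[int]


def parse_metadata(xml_text: str) -> DocumentMetadata:
    """Read the (assumed) <meta> block of one document into a dataclass."""
    root = ET.fromstring(xml_text)
    meta = root.find("meta")
    if meta is None:
        raise ValueError("document has no <meta> element")

    def text_of(tag: str) -> Optional[str]:
        # Missing or empty elements are treated as "unknown".
        value = meta.findtext(tag)
        return value.strip() if value and value.strip() else None

    def int_of(tag: str) -> Optional[int]:
        value = text_of(tag)
        return int(value) if value is not None else None

    return DocumentMetadata(
        title=text_of("title"),
        author=text_of("author"),
        author_gender=text_of("author_gender"),
        language_type=text_of("language_type"),
        source_website=text_of("source_website"),
        year=int_of("year"),
        domain=text_of("domain"),
        subdomain=text_of("subdomain"),
        tokens=int_of("tokens"),
    )


record = parse_metadata(SAMPLE_XML)
print(record.title, record.author, record.year)
```

Fields such as author and author gender are modelled as optional because the list above notes they are recorded only "if known".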
See also
- Oxford English Dictionary
- Corpus of Contemporary American English
- American National Corpus
- Frequency analysis
External links
- British National Corpus (ftp://ftp.itri.bton.ac.uk/bnc/)
- Sketch Engine website