Lancaster-Oslo-Bergen Corpus
The Lancaster-Oslo-Bergen Corpus (often abbreviated as LOB Corpus) was compiled in the 1980s in collaboration between the University of Lancaster, the University of Oslo, and the Norwegian Computing Centre for the Humanities, Bergen, to provide a British counterpart to the Brown Corpus, compiled by Kucera and Francis for American English in the 1960s.
Its composition was designed to match the original Brown Corpus as closely as possible in size and genres, using documents published in the UK by British authors. Both corpora consist of 500 samples, each comprising about 2,000 words, in the following genres:
Label | Text category | Brown Corpus | LOB Corpus |
---|---|---|---|
A | Press: reportage | 44 | 44 |
B | Press: editorial | 27 | 27 |
C | Press: reviews | 17 | 17 |
D | Religion | 17 | 17 |
E | Skills, trades and hobbies | 36 | 38 |
F | Popular lore | 48 | 44 |
G | Belles lettres, biography, essays | 75 | 77 |
H | Miscellaneous (documents, reports, etc) | 30 | 30 |
J | Learned and scientific writings | 80 | 80 |
K | General fiction | 29 | 29 |
L | Mystery and detective fiction | 24 | 24 |
M | Science fiction | 6 | 6 |
N | Adventure and western fiction | 29 | 29 |
P | Romance and love story | 29 | 29 |
R | Humour | 9 | 9 |
Total | | 500 | 500 |
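The figures in the table above can be checked with a short script. The following is a minimal sketch, not part of either corpus distribution: the names `SAMPLES`, `WORDS_PER_SAMPLE`, `totals`, and `differing_genres` are hypothetical, and the counts are simply copied from the table. It confirms that both columns sum to 500 and lists the genres where the two corpora differ.

```python
# Sample counts per genre label, copied from the table: (Brown, LOB).
SAMPLES = {
    "A": (44, 44), "B": (27, 27), "C": (17, 17), "D": (17, 17),
    "E": (36, 38), "F": (48, 44), "G": (75, 77), "H": (30, 30),
    "J": (80, 80), "K": (29, 29), "L": (24, 24), "M": (6, 6),
    "N": (29, 29), "P": (29, 29), "R": (9, 9),
}

# Approximate sample length stated in the text ("about 2000 words").
WORDS_PER_SAMPLE = 2000


def totals(samples):
    """Return the total number of samples as (Brown, LOB)."""
    brown = sum(b for b, _ in samples.values())
    lob = sum(l for _, l in samples.values())
    return brown, lob


def differing_genres(samples):
    """Return the genres whose sample counts differ between the corpora."""
    return {label: (b, l) for label, (b, l) in samples.items() if b != l}


if __name__ == "__main__":
    brown_total, lob_total = totals(SAMPLES)
    print("Samples:", brown_total, lob_total)
    print("Approximate words per corpus:", brown_total * WORDS_PER_SAMPLE)
    print("Genres that differ:", differing_genres(SAMPLES))
```

Under these assumptions, each corpus comes to roughly one million words, and the two designs differ only in categories E, F, and G.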
The corpus has also been tagged, i.e. part-of-speech categories have been assigned to every word.