Brown Corpus
The Brown University Standard Corpus of Present-Day American English (or just Brown Corpus) was compiled in the 1960s by Henry Kucera and W. Nelson Francis at Brown University, Providence, Rhode Island as a general corpus (text collection) in the field of corpus linguistics. It contains 500 samples of English-language text, totalling roughly one million words, compiled from works published in the United States in 1961.

History

In 1967, Kucera and Francis published their classic work Computational Analysis of Present-Day American English, which provided basic statistics on what is known today simply as the Brown Corpus. The Brown Corpus was a carefully compiled selection of current American English, totaling about a million words drawn from a wide variety of sources. Kucera and Francis subjected it to a variety of computational analyses, from which they compiled a rich and variegated opus, combining elements of linguistics, psychology, statistics, and sociology. It has been very widely used in computational linguistics, and was for many years among the most-cited resources in the field.

Shortly after publication of the first lexicostatistical analysis, Boston publisher Houghton-Mifflin approached Kucera to supply a million-word, three-line citation base for its new American Heritage Dictionary. This ground-breaking new dictionary, which first appeared in 1969, was the first dictionary to be compiled using corpus linguistics for word frequency and other information.

The initial Brown Corpus had only the words themselves, plus a location identifier for each. Over the following several years part-of-speech tags were applied. The Greene and Rubin tagging program (see under part of speech tagging) helped considerably in this, but the high error rate meant that extensive manual proofreading was required.

The tagged Brown Corpus used a selection of about 80 parts of speech, as well as special indicators for compound forms, contractions, foreign words and a few other phenomena, and formed the basis for many later corpora such as the Lancaster-Oslo-Bergen Corpus. The tagged corpus enabled far more sophisticated statistical analysis, much of it carried out by graduate student Andrew Mackie. Some of the analysis appears in Frequency Analysis of English Usage: Lexicon and Grammar, by Winthrop Nelson Francis and Henry Kucera, Houghton Mifflin (January, 1983) ISBN 0-395-32250-2.

One interesting result is that even for quite large samples, graphing words in order of decreasing frequency of occurrence shows a hyperbola: the frequency of the n-th most frequent word is roughly proportional to 1/n. Thus "the" constitutes nearly 7% of the Brown Corpus, "to" and "of" more than another 3% each; while about half the total vocabulary of about 50,000 words are hapax legomena: words that occur only once in the corpus. This simple rank-vs.-frequency relationship was noted for an extraordinary variety of phenomena by George Kingsley Zipf (for example, see his The Psychobiology of Language), and is known as Zipf's law.
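
The rank-versus-frequency relationship can be checked on any tokenized text. The following is a minimal sketch in Python, not the procedure Kucera and Francis used: the function names and the toy sentence are illustrative assumptions, and lower-casing the tokens is a simplification.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count, share) tuples, most frequent word first."""
    counts = Counter(t.lower() for t in tokens)
    total = sum(counts.values())
    return [
        (rank, word, count, count / total)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]


def hapax_legomena(tokens):
    """Return the sorted list of words that occur exactly once."""
    counts = Counter(t.lower() for t in tokens)
    return sorted(word for word, count in counts.items() if count == 1)


# Toy input (an assumption for illustration, not corpus data).
toy = "the cat and the dog and the bird saw the fish".split()
table = rank_frequency(toy)
hapaxes = hapax_legomena(toy)

for rank, word, count, share in table:
    # Under Zipf's law, rank * count stays roughly constant on large samples.
    print(rank, word, count, f"{share:.1%}", rank * count)
```

On a text as small as the toy sentence the 1/n pattern is only suggestive; it becomes visible on samples of corpus size.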

Although the Brown Corpus pioneered the field of corpus linguistics, typical corpora today (such as the Corpus of Contemporary American English, the British National Corpus or the International Corpus of English) tend to be much larger, on the order of 100 million words.

Sample distribution

The Corpus consists of 500 samples, distributed across 15 genres in rough proportion to the amount published in 1961 in each of those genres. All works sampled were published in 1961; as far as could be determined they were first published then, and were written by native speakers of American English.

Each sample began at a random sentence boundary in the article or other unit chosen, and continued up to the first sentence boundary after 2,000 words. In a very few cases miscounts led to samples being just under 2,000 words.
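
The sampling rule can be sketched as follows. This is an illustrative Python reading of the description above, not the compilers' actual procedure: the function name `draw_sample`, the representation of a text as a list of tokenized sentences, and the behaviour when the text runs out before 2,000 words are all assumptions.

```python
import random


def draw_sample(sentences, target_words=2000, rng=None):
    """Pick a random starting sentence and keep whole sentences until
    at least `target_words` words have been collected.

    `sentences` is a list of sentences, each a list of word tokens.
    Returns (start_index, sample_sentences, word_count).
    """
    rng = rng or random.Random()
    start = rng.randrange(len(sentences))  # random sentence boundary
    sample, n_words = [], 0
    for sentence in sentences[start:]:
        sample.append(sentence)
        n_words += len(sentence)
        if n_words >= target_words:  # first boundary at or after the target
            break
    # Assumption: if the text ends first, the shorter sample is returned.
    return start, sample, n_words


# Hypothetical text: 1,000 sentences of seven words each.
text = [["word"] * 7 for _ in range(1000)]
start, sample, n_words = draw_sample(text, rng=random.Random(0))
```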

The original data entry was done on upper-case only keypunch machines; capitals were indicated by a preceding asterisk, and various special items such as formulae also had special codes.
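
A decoder for the asterisk convention might look like the sketch below. It assumes that an asterisk capitalizes only the single letter that follows it and that all other letters are lower case; the source does not spell out the full coding scheme (for example, how words in all capitals or formulae were marked), so this is an illustration rather than the original format specification.

```python
import re


def decode_capitals(keypunched):
    """Restore mixed case from upper-case-only text in which a capital
    letter is marked by a preceding asterisk (assumed convention)."""
    lowered = keypunched.lower()
    # Replace "*x" with "X"; asterisks not followed by a letter are kept.
    return re.sub(r"\*([a-z])", lambda m: m.group(1).upper(), lowered)


decoded = decode_capitals("*THE *BROWN *CORPUS WAS COMPILED AT *BROWN.")
print(decoded)
```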

The corpus originally (1961) contained 1,014,312 words sampled from 15 text categories:
  • A. PRESS: Reportage (44 texts)
    • Political
    • Sports
    • Society
    • Spot News
    • Financial
    • Cultural
  • B. PRESS: Editorial (27 texts)
    • Institutional Daily
    • Personal
    • Letters to the Editor
  • C. PRESS: Reviews (17 texts)
    • Theatre
    • Books
    • Music
    • Dance
  • D. RELIGION (17 texts)
    • Books
    • Periodicals
    • Tracts
  • E. SKILLS AND HOBBIES (36 texts)
    • Books
    • Periodicals
  • F. POPULAR LORE (48 texts)
    • Books
    • Periodicals
  • G. BELLES-LETTRES - Biography, Memoirs, etc. (75 texts)
    • Books
    • Periodicals
  • H. MISCELLANEOUS: US Government & House Organs (30 texts)
    • Government Documents
    • Foundation Reports
    • Industry Reports
    • College Catalog
    • Industry House Organ
  • J. LEARNED (80 texts)
    • Natural Sciences
    • Medicine
    • Mathematics
    • Social and Behavioral Sciences
    • Political Science, Law, Education
    • Humanities
    • Technology and Engineering
  • K. FICTION: General (29 texts)
    • Novels
    • Short Stories
  • L. FICTION: Mystery and Detective Fiction (24 texts)
    • Novels
    • Short Stories
  • M. FICTION: Science (6 texts)
    • Novels
    • Short Stories
  • N. FICTION: Adventure and Western (29 texts)
    • Novels
    • Short Stories
  • P. FICTION: Romance and Love Story (29 texts)
    • Novels
    • Short Stories
  • R. HUMOR (9 texts)
    • Novels
    • Essays, etc.
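
As a consistency check, the per-category text counts listed above can be summed to confirm that they give 500 samples in 15 categories. The dictionary below simply transcribes the list; the variable names are illustrative.

```python
# Number of texts per category, keyed by the category letter used above.
TEXTS_PER_CATEGORY = {
    "A": 44, "B": 27, "C": 17, "D": 17, "E": 36,
    "F": 48, "G": 75, "H": 30, "J": 80, "K": 29,
    "L": 24, "M": 6, "N": 29, "P": 29, "R": 9,
}

total_texts = sum(TEXTS_PER_CATEGORY.values())
press_texts = sum(TEXTS_PER_CATEGORY[c] for c in "ABC")
fiction_texts = sum(TEXTS_PER_CATEGORY[c] for c in "KLMNP")

print(len(TEXTS_PER_CATEGORY), total_texts, press_texts, fiction_texts)
```

With about 2,000 words per text, 500 texts give roughly one million words, which agrees with the stated total of 1,014,312.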

Part-of-speech tags used

Tag Definition
. sentence closer (. ; ? *)
( left paren
) right paren
* not, n't
-- dash
, comma
: colon
ABL pre-qualifier (quite, rather)
ABN pre-quantifier (half, all)
ABX pre-quantifier (both)
AP post-determiner (many, several, next)
AT article (a, the, no)
BE be
BED were
BEDZ was
BEG being
BEM am
BEN been
BER are, art
BEZ is
CC coordinating conjunction (and, or)
CD cardinal numeral (one, two, 2, etc.)
CS subordinating conjunction (if, although)
DO do
DOD did
DOZ does
DT singular determiner/quantifier (this, that)
DTI singular or plural determiner/quantifier (some, any)
DTS plural determiner (these, those)
DTX determiner/double conjunction (either)
EX existential there
FW foreign word (hyphenated before regular tag)
HV have
HVD had (past tense)
HVG having
HVN had (past participle)
IN preposition
JJ adjective
JJR comparative adjective
JJS semantically superlative adjective (chief, top)
JJT morphologically superlative adjective (biggest)
MD modal auxiliary (can, should, will)
NC cited word (hyphenated after regular tag)
NN singular or mass noun
NN$ possessive singular noun
NNS plural noun
NNS$ possessive plural noun
NP proper noun or part of name phrase
NP$ possessive proper noun
NPS plural proper noun
NPS$ possessive plural proper noun
NR adverbial noun (home, today, west)
OD ordinal numeral (first, 2nd)
PN nominal pronoun (everybody, nothing)
PN$ possessive nominal pronoun
PP$ possessive personal pronoun (my, our)
PP$$ second (nominal) possessive pronoun (mine, ours)
PPL singular reflexive/intensive personal pronoun (myself)
PPLS plural reflexive/intensive personal pronoun (ourselves)
PPO objective personal pronoun (me, him, it, them)
PPS 3rd. singular nominative pronoun (he, she, it, one)
PPSS other nominative personal pronoun (I, we, they, you)
QL qualifier (very, fairly)
QLP post-qualifier (enough, indeed)
RB adverb
RBR comparative adverb
RBT superlative adverb
RN nominal adverb (here, then, indoors)
RP adverb/particle (about, off, up)
TO infinitive marker to
UH interjection, exclamation
VB verb, base form
VBD verb, past tense
VBG verb, present participle/gerund
VBN verb, past participle
VBZ verb, 3rd. singular present
WDT wh- determiner (what, which)
WP$ possessive wh- pronoun (whose)
WPO objective wh- pronoun (whom, which, that)
WPS nominative wh- pronoun (who, which, that)
WQL wh- qualifier (how)
WRB wh- adverb (how, where, when)

Note that some versions of the tagged Brown Corpus contain combined tags. For instance, the word "wanna" is tagged VB+TO, since it is a contracted form of the two words want/VB and to/TO. Some tags may also be negated: for instance, "aren't" would be tagged "BER*", where * signifies the negation. Additionally, tags may carry hyphenated suffixes: the tag -HL is appended to the regular tags of words in headlines, and the tag -TL to the regular tags of words in titles. The suffix -NC signifies an emphasized word. Sometimes a tag has an FW- prefix, which marks a foreign word.
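
The conventions in the note above can be unpacked mechanically. The following Python sketch splits a tag string into its parts; the `ParsedTag` class and `parse_tag` function are hypothetical names introduced here, and the handling of edge cases (such as treating a bare * as the tag for "not" rather than as a negation marker) is an assumption based on the tag list.

```python
from dataclasses import dataclass


@dataclass
class ParsedTag:
    parts: list      # base tags of each combined component, e.g. ["VB", "TO"]
    negated: bool    # True if any component carries a trailing "*"
    foreign: bool    # True if the tag has the FW- prefix
    suffixes: list   # any of "HL", "TL", "NC", in their original order


def parse_tag(tag):
    """Split a Brown-style tag into prefix, base parts, and suffixes."""
    foreign = tag.startswith("FW-")
    if foreign:
        tag = tag[len("FW-"):]
    suffixes = []
    stripped = True
    while stripped:  # peel suffixes off the right-hand end
        stripped = False
        for suffix in ("HL", "TL", "NC"):
            if tag.endswith("-" + suffix):
                suffixes.insert(0, suffix)
                tag = tag[: -(len(suffix) + 1)]
                stripped = True
    parts, negated = [], False
    for part in tag.split("+"):  # combined tags such as VB+TO
        if part.endswith("*") and part != "*":
            part = part[:-1]
            negated = True
        parts.append(part)
    return ParsedTag(parts, negated, foreign, suffixes)


print(parse_tag("VB+TO"))
print(parse_tag("BER*"))
print(parse_tag("FW-NN-TL"))
```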

See also

  • LOB Corpus, a corpus of British English based on the same parameters as the Brown Corpus

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.