Moby Project
The Moby Project is a collection of public-domain lexical resources created by Grady Ward. The resources were dedicated to the public domain and are now mirrored at Project Gutenberg. The collection contains the largest free phonetic database, with 177,267 words and corresponding pronunciations.
Hyphenator
The Moby Hyphenator II contains 187,175 hyphenated words, with 9,752 indicating that they should not be hyphenated. Hyphenation is indicated by a character value 165 (hex A5).
Language
Moby Language II contains wordlists of five languages (French, German, Italian, Japanese, and Spanish):
Language | Words | Size (in bytes) |
---|---|---|
French | 138,257 | 1,524,757 |
German | 159,809 | 2,055,986 |
Italian | 60,453 | 561,981 |
Japanese | 115,523 | 934,783 |
Spanish | 86,059 | 850,523 |
Total | 560,101 | 5,928,030 |
However, some of the lists are contaminated. For example, the Japanese list contains English words such as abnormal and non-words such as abcdefgh and m,./.
Part-of-Speech
Moby Part-of-Speech contains 233,356 words fully described by part(s) of speech, listed in priority order. The format of the file is word\parts-of-speech, with the following parts of speech being identified:
Part-of-speech | Code |
---|---|
Noun | N |
Plural | p |
Noun phrase | h |
Verb (usually participle) | V |
Transitive verb | t |
Intransitive verb | i |
Adjective | A |
Adverb | v |
Conjunction | C |
Preposition | P |
Interjection | ! |
Pronoun | r |
Definite article | D |
Indefinite article | I |
Nominative | o |
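The word\parts-of-speech format and the code table above can be decoded with a few lines of Python. This is only a minimal sketch: the function name `parse_pos_line`, the `sep` parameter, and the sample entry are hypothetical, and it assumes the separator is a literal backslash as described above (pass a different `sep` if a given copy of the file uses another delimiter).

```python
# Single-character codes taken from the table above.
POS_CODES = {
    "N": "noun",
    "p": "plural",
    "h": "noun phrase",
    "V": "verb (usually participle)",
    "t": "transitive verb",
    "i": "intransitive verb",
    "A": "adjective",
    "v": "adverb",
    "C": "conjunction",
    "P": "preposition",
    "!": "interjection",
    "r": "pronoun",
    "D": "definite article",
    "I": "indefinite article",
    "o": "nominative",
}


def parse_pos_line(line, sep="\\"):
    """Split one 'word<sep>codes' line into (word, [part-of-speech names]).

    The codes keep their original order, which the file uses as priority order.
    """
    word, _, codes = line.rstrip("\r\n").rpartition(sep)
    if not word:
        raise ValueError("no separator found in line: %r" % line)
    # Unknown codes are reported rather than silently dropped.
    return word, [POS_CODES.get(c, "unknown (%s)" % c) for c in codes]


# Hypothetical entry, used only to illustrate the format.
print(parse_pos_line("run\\NVti"))
```

Splitting on the last separator (`rpartition`) keeps any earlier separator characters inside the word itself, which seems the safer choice for phrases.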
Pronunciator
The Moby Pronunciator II contains 177,267 words with corresponding pronunciations. The Project Gutenberg distribution also contains a copy of cmudict v0.3. The file follows the format word[/part-of-speech] pronunciation. The part-of-speech field is used to disambiguate 770 of the words that have differing pronunciations depending on their part of speech. For example, for the word spelled close, the verb has the pronunciation /ˈkloʊz/, whereas the adjective is /ˈkloʊs/. The parts of speech have been assigned the following codes:
Part-of-speech | Code |
---|---|
Noun | n |
Verb | v |
Adjective | aj |
Adverb | av |
Interjection | interj |
Following this is the pronunciation. Several special symbols are present:
Symbol | Meaning |
---|---|
/ | Used to separate phonemes |
_ | Used to separate words |
' | Primary stress on the following syllable |
, | Secondary stress on the following syllable |
The rest of the symbols are used to represent IPA characters, according to the following table:
Symbol | IPA |
---|---|
& | æ |
- | ə |
@ | ʌ, ə |
@r | ɜr, ər |
A | ɑː |
aI | aɪ |
Ar | ɑr |
AU | aʊ |
b | b |
d | d |
D | ð |
dZ | dʒ |
E | ɛ |
eI | eɪ |
f | f |
g | ɡ |
h | h |
hw | hw |
i | iː |
I | ɪ |
j | j |
k | k |
l | l |
m | m |
n | n |
N | ŋ |
O | ɔː |
Oi | ɔɪ |
oU | oʊ |
p | p |
r | r |
s | s |
S | ʃ |
t | t |
T | θ |
tS | tʃ |
u | uː |
U | ʊ |
v | v |
w | w |
z | z |
Z | ʒ |
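The entry format and the two symbol tables above can be combined into a small converter. The following Python is a minimal sketch rather than a reference implementation: the names `parse_entry` and `moby_to_ipa` and the sample entry are hypothetical, the converter uses greedy longest-match over the symbol table, and where the table lists two IPA values for one symbol (@ and @r) it arbitrarily picks the first.

```python
# Symbol table from above. For "@" and "@r" the table gives two IPA
# values each; this sketch arbitrarily uses the first one (assumption).
MOBY_TO_IPA = {
    "&": "æ", "-": "ə", "@": "ʌ", "@r": "ɜr", "A": "ɑː", "aI": "aɪ",
    "Ar": "ɑr", "AU": "aʊ", "b": "b", "d": "d", "D": "ð", "dZ": "dʒ",
    "E": "ɛ", "eI": "eɪ", "f": "f", "g": "ɡ", "h": "h", "hw": "hw",
    "i": "iː", "I": "ɪ", "j": "j", "k": "k", "l": "l", "m": "m",
    "n": "n", "N": "ŋ", "O": "ɔː", "Oi": "ɔɪ", "oU": "oʊ", "p": "p",
    "r": "r", "s": "s", "S": "ʃ", "t": "t", "T": "θ", "tS": "tʃ",
    "u": "uː", "U": "ʊ", "v": "v", "w": "w", "z": "z", "Z": "ʒ",
}
STRESS = {"'": "ˈ", ",": "ˌ"}  # primary and secondary stress marks
MAX_LEN = max(len(symbol) for symbol in MOBY_TO_IPA)


def parse_entry(line):
    """Split 'word[/part-of-speech] pronunciation' into its three parts."""
    head, _, pron = line.strip().partition(" ")
    word, _, pos = head.partition("/")
    return word, (pos or None), pron


def moby_to_ipa(pron):
    """Convert a Moby pronunciation string to an IPA string."""
    out = []
    i = 0
    while i < len(pron):
        ch = pron[i]
        if ch == "/":            # phoneme separator: carries no sound
            i += 1
        elif ch == "_":          # word separator
            out.append(" ")
            i += 1
        elif ch in STRESS:
            out.append(STRESS[ch])
            i += 1
        else:
            # Try the longest symbol first so "oU" wins over a lone "o".
            for size in range(MAX_LEN, 0, -1):
                chunk = pron[i:i + size]
                if chunk in MOBY_TO_IPA:
                    out.append(MOBY_TO_IPA[chunk])
                    i += size
                    break
            else:
                out.append(ch)   # unknown symbol: pass through unchanged
                i += 1
    return "".join(out)


# Hypothetical entry, used only to illustrate the format.
word, pos, pron = parse_entry("close/v 'kl/oU/z")
print(word, pos, moby_to_ipa(pron))
```

Because "/" always ends the current match, a separator between "A" and "r" keeps them from being read as the single symbol "Ar".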
Shakespeare
Moby Shakespeare contains the complete unabridged works of Shakespeare. This specific resource is not available from Project Gutenberg.
Thesaurus
The Moby Thesaurus II contains 30,260 root words, with 2,520,264 synonyms and related terms, an average of 83.3 per root word. Each line consists of a list of comma-separated values, with the first term being the root word, and all following words being related terms.
Grady Ward placed this thesaurus in the public domain in 1996. It is also available as a Debian package.
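The line layout described above (root word first, related terms after it) maps naturally onto a dictionary. The Python below is a minimal sketch under stated assumptions: the function names and the sample line are hypothetical, and it assumes plain commas with no quoting or escaping inside terms.

```python
def parse_thesaurus_line(line):
    """Return (root_word, related_terms) for one comma-separated line."""
    terms = [term.strip() for term in line.rstrip("\r\n").split(",")]
    terms = [term for term in terms if term]  # drop empty fields
    if not terms:
        raise ValueError("empty thesaurus line")
    return terms[0], terms[1:]


def load_thesaurus(lines):
    """Build a {root_word: [related terms]} dict from an iterable of lines."""
    thesaurus = {}
    for line in lines:
        if line.strip():  # skip blank lines
            root, related = parse_thesaurus_line(line)
            thesaurus[root] = related
    return thesaurus


# Hypothetical line, used only to illustrate the format.
sample = ["car,automobile,motorcar\n"]
print(load_thesaurus(sample))
```

`load_thesaurus` accepts any iterable of strings, so an open file object can be passed directly without reading the whole file into memory first.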
Words
Moby Words II is the largest wordlist in the world. The distribution consists of the following 16 files:
Filename | Words | Description |
---|---|---|
ACRONYMS.TXT | 6,213 | Common acronyms and abbreviations |
COMMON.TXT | 74,550 | Common words present in two or more published dictionaries |
COMPOUND.TXT | 256,772 | Phrases, proper nouns, and acronyms not included in the common words file |
CROSSWD.TXT | 113,809 | Words included in the first edition of the Official Scrabble Players Dictionary |
CRSWD-D.TXT | 4,160 | Additions to the Official Scrabble Players Dictionary in the second edition |
FICTION.TXT | 467 | A list of the most commonly occurring substrings in the book The Joy Luck Club |
FREQ.TXT | 1,000 | Most frequently occurring words in the English language, listed in descending order |
FREQ-INT.TXT | 1,000 | Most frequently occurring words on Usenet in 1992, listed with corresponding percentage in decreasing order |
KJVFREQ.TXT | 1,185 | Most frequently occurring substrings in the King James Version of the Bible, listed in descending order |
NAMES.TXT | 21,986 | Most common names used in the USA and Great Britain |
NAMES-F.TXT | 4,946 | Common English female names |
NAMES-M.TXT | 3,897 | Common English male names |
OFTENMIS.TXT | 366 | Most commonly misspelled English words |
PLACES.TXT | 10,196 | Place names in the USA |
SINGLE.TXT | 354,984 | Single words excluding proper nouns, acronyms, compound words and phrases, but including archaic words and significant variant spellings |
USACONST.TXT | 7,618 | United States Constitution including all amendments current to 1993 |
Total | 863,149 | |