CMU Pronouncing Dictionary
The CMU Pronouncing Dictionary (also known as cmudict) is a public domain pronouncing dictionary created by Carnegie Mellon University (CMU). It is used as the American lexicon for the Festival Speech Synthesis System and also for the CMU Sphinx speech recognition system. The latest release is 0.7a, which contains 133,746 entries (from 123,442 baseforms).
Database Format
The database is distributed as a text file in which each entry has the format word, whitespace, pronunciation. If a word has multiple pronunciations, the headword of every entry after the first is followed by an index in parentheses. The pronunciation is encoded using a modified form of the Arpabet system: the modification is the addition of stress marks on vowels, with levels 0, 1, and 2. Not all entries carry stress marks, however.
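The entry format described above can be read with a few lines of Python. The following is a minimal sketch, not the project's own tooling. The function names (`parse_line`, `stress_pattern`, `load`) are hypothetical, and the sample lines are illustrative rather than copied from a release file. It also assumes that comment lines begin with `;;;` and that an unindexed headword is the first pronunciation.

```python
import re
from collections import defaultdict

# Matches a trailing variant index such as "(1)" on a headword.
VARIANT_RE = re.compile(r"^(?P<word>.+?)\((?P<index>\d+)\)$")


def parse_line(line):
    """Split one dictionary line into (word, variant_index, phonemes).

    Returns None for blank lines and for comment lines, which are
    assumed here to start with ';;;'.
    """
    line = line.strip()
    if not line or line.startswith(";;;"):
        return None
    head, *phonemes = line.split()
    match = VARIANT_RE.match(head)
    if match:
        return match.group("word"), int(match.group("index")), phonemes
    # Assumption: an entry without an index is the first pronunciation.
    return head, 0, phonemes


def stress_pattern(phonemes):
    """Collect the trailing stress digit (0, 1 or 2) of each marked vowel."""
    return [int(p[-1]) for p in phonemes if p[-1].isdigit()]


def load(lines):
    """Group all pronunciations by headword, in file order."""
    entries = defaultdict(list)
    for line in lines:
        parsed = parse_line(line)
        if parsed:
            word, _, phonemes = parsed
            entries[word].append(phonemes)
    return dict(entries)


# Illustrative sample in the described format (not a release excerpt).
SAMPLE = """\
;;; illustrative sample
READ  R EH1 D
READ(1)  R IY1 D
DICTIONARY  D IH1 K SH AH0 N EH2 R IY0
"""

entries = load(SAMPLE.splitlines())
print(entries["READ"])
print(stress_pattern(entries["DICTIONARY"][0]))
```

Because some entries have no stress marks, `stress_pattern` simply skips phonemes without a trailing digit instead of raising an error, so an unmarked entry yields an empty list.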
History
Version | Release date |
---|---|
0.1 | 16 September 1993 |
0.2 | 10 March 1994 |
0.3 | 28 September 1994 |
0.4 | 8 November 1995 |
0.5 | No public release |
0.6 | 11 August 1998 |
0.7a | 19 February 2008 |
Applications
- The Unifon converter is based on the CMU Pronouncing Dictionary.
- The Natural Language Toolkit contains an interface to the CMU Pronouncing Dictionary.
- The Carnegie Mellon Logios tool incorporates the CMU Pronouncing Dictionary.
External links
- The current version of the dictionary is maintained at SourceForge.
- Homepage – includes database search
- RDF version, converted to Resource Description Framework by the open source Texai project.