Metaphone
Metaphone is a phonetic algorithm, published in 1990, for indexing words by their English pronunciation. It fundamentally improves on the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation to produce a more accurate encoding, which does a better job of matching words and names that sound similar. As with Soundex, similar-sounding words should share the same keys. Metaphone is available as a built-in operator in a number of systems, including later versions of PHP.
The original author, Lawrence Philips, later produced a new version of the algorithm, which he named Double Metaphone. Unlike the original algorithm, whose application is limited to English, this version takes into account spelling peculiarities of a number of other languages. In 2009 Philips released a third version, called Metaphone 3, which achieves an accuracy of approximately 99% for English words, non-English words familiar to Americans, and first names and family names commonly found in the United States. It was developed according to modern engineering standards against a test harness of prepared correct encodings.
Procedure
Metaphone codes use the 16 consonant symbols 0BFHJKLMNPRSTWXY. The '0' represents "th" (as an ASCII approximation of Θ), 'X' represents "sh" or "ch", and the others represent their usual English pronunciations. The vowels AEIOU are also used, but only at the beginning of the code.
- Drop duplicate adjacent letters, except for C.
- If the word begins with 'KN', 'GN', 'PN', 'AE', or 'WR', drop the first letter.
- Drop 'B' if it follows 'M' and is at the end of the word.
- 'C' transforms to 'X' if followed by 'IA' or 'H' (unless, in the latter case, it is part of '-SCH-', in which case it transforms to 'K'). 'C' transforms to 'S' if followed by 'I', 'E', or 'Y'. Otherwise, 'C' transforms to 'K'.
- 'D' transforms to 'J' if followed by 'GE', 'GY', or 'GI'. Otherwise, 'D' transforms to 'T'.
- Drop 'G' if it is followed by 'H' and that 'H' is neither at the end of the word nor before a vowel. Drop 'G' if it is followed by 'N' or 'NED' at the end of the word.
- 'G' transforms to 'J' if before 'I', 'E', or 'Y', and it is not in 'GG'. Otherwise, 'G' transforms to 'K'.
- Drop 'H' if it follows a vowel and does not precede a vowel.
- 'CK' transforms to 'K'.
- 'PH' transforms to 'F'.
- 'Q' transforms to 'K'.
- 'S' transforms to 'X' if followed by 'H', 'IO', or 'IA'.
- 'T' transforms to 'X' if followed by 'IA' or 'IO'. 'TH' transforms to '0'. Drop 'T' if followed by 'CH'.
- 'V' transforms to 'F'.
- 'WH' transforms to 'W' if at the beginning. Drop 'W' if not followed by a vowel.
- 'X' transforms to 'S' if at the beginning. Otherwise, 'X' transforms to 'KS'.
- Drop 'Y' if not followed by a vowel.
- 'Z' transforms to 'S'.
- Drop all vowels except one at the beginning of the word.
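The rules above can be sketched in Python as follows. This is a minimal, literal transcription of the list, not a reference implementation: the function name `metaphone_sketch` is hypothetical, and the sketch assumes that a matched digraph ('CH', 'CK', 'PH', 'SH', 'TH', and the 'G' in 'DGE') consumes its second letter, which the list does not spell out. Built-in implementations such as PHP's may handle silent letters (for example 'GH') differently and can return different keys.

```python
VOWELS = frozenset("AEIOU")


def metaphone_sketch(word: str) -> str:
    """Encode `word` by a literal reading of the rule list (illustrative only)."""
    w = "".join(ch for ch in word.upper() if "A" <= ch <= "Z")
    if not w:
        return ""
    # Word-initial exceptions.
    if w[:2] in ("KN", "GN", "PN", "AE", "WR"):
        w = w[1:]
    elif w[:2] == "WH":
        w = "W" + w[2:]
    elif w[0] == "X":
        w = "S" + w[1:]
    n = len(w)

    def at(k: int) -> str:
        return w[k] if 0 <= k < n else ""

    out = []
    i = 0
    while i < n:
        c, prev, nxt = w[i], at(i - 1), at(i + 1)
        step = 1  # assumption: a matched digraph consumes its second letter
        if c == prev and c != "C":
            pass  # duplicate adjacent letter (except C) is dropped
        elif c in VOWELS:
            if i == 0:
                out.append(c)  # vowels survive only at the start
        elif c == "B":
            if not (prev == "M" and i == n - 1):
                out.append("B")
        elif c == "C":
            if nxt == "H":
                out.append("K" if prev == "S" else "X")
                step = 2
            elif w.startswith("IA", i + 1):
                out.append("X")
            elif nxt in ("I", "E", "Y"):
                out.append("S")
            else:
                out.append("K")
                if nxt == "K":
                    step = 2  # 'CK' -> 'K'
        elif c == "D":
            if nxt == "G" and at(i + 2) in ("E", "Y", "I"):
                out.append("J")
                step = 2  # assumption: the 'G' is not encoded again
            else:
                out.append("T")
        elif c == "G":
            silent_gh = nxt == "H" and i + 2 < n and at(i + 2) not in VOWELS
            silent_gn = w[i + 1:] in ("N", "NED")
            if silent_gh or silent_gn:
                pass
            elif nxt in ("I", "E", "Y"):
                out.append("J")
            else:
                out.append("K")  # also covers the first 'G' of 'GG'
        elif c == "H":
            if not (prev in VOWELS and nxt not in VOWELS):
                out.append("H")
        elif c == "P":
            if nxt == "H":
                out.append("F")
                step = 2
            else:
                out.append("P")
        elif c == "Q":
            out.append("K")
        elif c == "S":
            if nxt == "H":
                out.append("X")
                step = 2
            elif w.startswith(("IO", "IA"), i + 1):
                out.append("X")
            else:
                out.append("S")
        elif c == "T":
            if w.startswith(("IA", "IO"), i + 1):
                out.append("X")
            elif nxt == "H":
                out.append("0")
                step = 2
            elif not w.startswith("CH", i + 1):
                out.append("T")
        elif c == "V":
            out.append("F")
        elif c in ("W", "Y"):
            if nxt in VOWELS:
                out.append(c)
        elif c == "X":
            out.append("KS")
        elif c == "Z":
            out.append("S")
        else:
            out.append(c)  # F, J, K, L, M, N, R keep their own symbol
        i += step
    return "".join(out)


print(metaphone_sketch("Smith"))   # SM0
print(metaphone_sketch("School"))  # SKL
```

Duplicate letters are skipped inside the main loop rather than in a separate pass, so the 'GG' exception falls out naturally: the first 'G' sees another 'G' ahead and becomes 'K', and the second is dropped as a duplicate.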
Double Metaphone
The Double Metaphone phonetic encoding algorithm is the second generation of this algorithm. Its implementation was described in the June 2000 issue of C/C++ Users Journal.
It is called "Double" because it can return both a primary and a secondary code for a string; this accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry. For example, encoding the name "Smith" yields a primary code of SM0 and a secondary code of XMT, while the name "Schmidt" yields a primary code of XMT and a secondary code of SMT. The two names therefore have XMT in common.
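One way to use the two codes for matching is to treat each name as a (primary, secondary) pair and call two names a match when the pairs share any code. The sketch below illustrates only that comparison step; it does not implement Double Metaphone, the helper name `common_codes` is hypothetical, and the code pairs are the ones quoted in the example above.

```python
def common_codes(a, b):
    """Return the non-empty codes shared by two (primary, secondary) pairs."""
    return {code for code in a if code} & {code for code in b if code}


# (primary, secondary) pairs as quoted in the text above.
smith = ("SM0", "XMT")
schmidt = ("XMT", "SMT")

print(common_codes(smith, schmidt))  # {'XMT'}
```

Empty codes are filtered out so that two names lacking a secondary code are not reported as matching on an empty string.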
Double Metaphone tries to account for myriad irregularities in English of Slavic, Germanic, Celtic, Greek, French, Italian, Spanish, Chinese, and other origin. Thus it uses a much more complex ruleset for coding than its predecessor; for example, it tests for approximately 100 different contexts of the use of the letter C alone.
Metaphone 3
Developed by the same author, this algorithm aims to further improve the phonetic encoding of words in the English language, non-English words familiar to Americans, and first names and family names commonly found in the United States. It is the first version of Metaphone to be developed according to strict engineering standards and tested against a substantial target encoding set, and it improves the accuracy of the algorithm from the approximately 89% of Double Metaphone to over 99%. It adds the ability to encode Metaphone keys that take non-initial vowels into account and to encode voiced and unvoiced consonants differently. This allows the result set to be more closely focused if the developer finds that the search results include too many words that do not resemble the search term closely enough. Development of versions for other languages has been announced. Metaphone 3 is sold as source code in C++, Java and C# for 40 USD each.
See also
- Soundex
- New York State Identification and Intelligence System
- Match Rating Approach
External links
- Open Source Spell Checker
- Page for PHP implementation of Metaphone
- Project Dedupe
- Ruby implementation included in http://rubyforge.org/projects/text
- "The Double Metaphone Search Algorithm", C/C++ Users Journal, June 2000 (full-text access requires registration)
- "The Double Metaphone Search Algorithm", by Lawrence Philips, June 1, 2000, Dr Dobb's (original article)
- Code project article on double metaphone: http://www.codeproject.com/string/dmetaphone1.asp
Metaphone Implementations
- Metaphone implementation in T-SQL
- Soundex, Metaphone, and Double Metaphone implementation in Java
- Soundex, Metaphone, Caverphone implementation in Python
- Text::Metaphone Perl module from CPAN
- Text::DoubleMetaphone Perl module from CPAN
- OCaml implementation of Double Metaphone
- PHP implementation by Stephen Woodbridge
- Ruby implementation included in http://english.rubyforge.org
- Ruby implementation included in http://rubyforge.org/projects/text/
- 4GL implementation by Robert Minter
- CodeProject's article about double metaphone implementations
- FileMaker Pro custom function, requiring FileMaker Pro Advanced to implement
- Spanish Metaphone in PHP (First post), from a comment in the PHP Metaphone Manual Page
- Metaphone for Brazilian Portuguese, in C, with PHP and PostgreSQL ports
- natural: JavaScript (Node.js) natural language toolkit
Double Metaphone Implementations
- C++ see: http://web.archive.org/web/20080101012741/http://www.cuj.com/documents/s=8038/cuj0006philips/
- C# see: http://www.codeproject.com/KB/recipes/dmetaphone5.aspx
- Perl see: http://search.cpan.org/dist/Text-DoubleMetaphone/
- PHP see: http://swoodbridge.com/DoubleMetaPhone/ and native, in C: http://pecl.php.net/package/doublemetaphone
- Java see: http://commons.apache.org/codec/userguide.html
- Ruby see: http://english.rubyforge.org/ and http://rubyforge.org/projects/text/
- SQL:
  - MySQL see: http://www.atomodo.com/code/double-metaphone
  - PostgreSQL see: http://www.postgresql.org/docs/current/static/fuzzystrmatch.html
  - Transact-SQL see: http://www.sqlmag.com/Articles/ArticleID/26094/pg/1/1.html (full-text access requires subscription)
- Python see: http://www.atomodo.com/code/double-metaphone
- Smalltalk, Squeak, also with SoundEx, see: http://www.squeaksource.com/SoundsLike.html
- Visual Basic see: http://www.snakelegs.org/2008/01/18/double-metaphone-visual-basic-implementation/
- Visual Basic for Applications see: http://bytes.com/topic/access/answers/192513-metaphone-source-code/