Soundex
Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. The algorithm mainly encodes consonants; a vowel is not encoded unless it is the first letter. Soundex is the most widely known of all phonetic algorithms, as it is a standard feature of MS SQL Server and Oracle, and its name is often used (incorrectly) as a synonym for "phonetic algorithm". Improvements to Soundex are the basis for many modern phonetic algorithms.
History
Soundex was developed by Robert C. Russell and Margaret K. Odell and patented in 1918 and 1922. A variation called American Soundex was used in the 1930s for a retrospective analysis of the US censuses from 1890 through 1920. The Soundex code came to prominence in the 1960s when it was the subject of several articles in the Communications and Journal of the Association for Computing Machinery, and especially when described in Donald Knuth's The Art of Computer Programming.
The National Archives and Records Administration (NARA) maintains the current rule set for the official implementation of Soundex used by the U.S. Government. These encoding rules are available from NARA, upon request, in the form of General Information Leaflet 55, "Using the Census Soundex".
Rules
The American Soundex algorithm, which differs from the original, is described below. The Soundex code for a name consists of a letter followed by three numerical digits: the letter is the first letter of the name, and the digits encode the remaining consonants. Similar-sounding consonants share the same digit so, for example, the labial consonants B, F, P, and V are each encoded as the number 1.
The correct value can be found as follows:
- Retain the first letter of the name and drop all other occurrences of a, e, h, i, o, u, w, y.
- Replace consonants with digits as follows (after the first letter):
  - b, f, p, v => 1
  - c, g, j, k, q, s, x, z => 2
  - d, t => 3
  - l => 4
  - m, n => 5
  - r => 6
- Two adjacent letters with the same number are coded as a single number.
- Continue until you have one letter and three numbers. If you run out of letters, fill in 0s until there are three numbers.
Using this algorithm, both "Robert" and "Rupert" return the same string "R163", while "Rubin" yields "R150". "Ashcraft" and "Ashcroft" both yield "A261" and not "A226": the letters 's' and 'c' in "Ashcraft" receive a single 2 rather than 22, even though an 'h' lies between them and they are not the same letter.
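The steps listed above can be sketched in Python as follows. This is a minimal illustration of the rules exactly as worded in this section, not the official NARA implementation. The names `soundex`, `CODES` and `DROPPED` are chosen here for illustration. One assumption is made that the rules leave open: the first letter's own digit is treated as "adjacent" to the digit that follows it.

```python
# Digit assigned to each consonant, taken from the list in the rules above.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

# Letters that are dropped everywhere except in first position.
DROPPED = set("aehiouwy")


def soundex(name: str) -> str:
    """Return a letter plus three digits, following the steps as listed."""
    # Ignore anything that is not an unaccented Latin letter (spaces, hyphens, ...).
    letters = [ch for ch in name.lower() if ch.isascii() and ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]

    # Step 1: retain the first letter, drop a, e, h, i, o, u, w, y elsewhere.
    rest = [ch for ch in letters[1:] if ch not in DROPPED]

    # Step 2: replace the remaining consonants with their digits.
    digits = [CODES[ch] for ch in rest]

    # Step 3: code adjacent letters with the same number only once.
    # Assumption: the first letter's digit (if it has one) counts as adjacent.
    previous = CODES.get(first, "")
    collapsed = []
    for digit in digits:
        if digit != previous:
            collapsed.append(digit)
        previous = digit

    # Step 4: keep three digits, padding with zeros if the name runs out.
    return first.upper() + "".join(collapsed)[:3].ljust(3, "0")


for example in ("Robert", "Rupert", "Rubin", "Ashcraft", "Ashcroft"):
    print(example, soundex(example))
```

The loop at the end prints the codes for the names used as examples in this section. Because the sketch drops vowels before merging equal digits, as the steps are worded here, implementations that follow a different rule set may produce different codes for some names.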
Soundex variants
A similar algorithm called "Reverse Soundex" prefixes the last letter of the name instead of the first.
The NYSIIS algorithm was introduced by the New York State Identification and Intelligence System in 1970 as an improvement to the Soundex algorithm. NYSIIS handles some multi-character n-grams and maintains relative vowel positioning, whereas Soundex does not.
Daitch–Mokotoff Soundex (D–M Soundex) was developed in 1985 by genealogist Gary Mokotoff and later improved by genealogist Randy Daitch because of problems they encountered while trying to apply the Russell Soundex to Jews with Germanic or Slavic surnames (such as Moskowitz vs. Moskovitz or Levine vs. Lewin). D–M Soundex is sometimes referred to as "Jewish Soundex" or "Eastern European Soundex", although the authors discourage the use of these nicknames. The D–M Soundex algorithm can return as many as 32 individual phonetic encodings for a single name. Results of D–M Soundex are returned in an all-numeric format between 100000 and 999999. This algorithm is much more complex than Russell Soundex.
As a response to deficiencies in the Soundex algorithm, Lawrence Philips developed the Metaphone algorithm in 1990 for the same purpose. Philips developed an improvement to Metaphone in 2000, which he called Double Metaphone. Double Metaphone includes a much larger encoding rule set than its predecessor, handles a subset of non-Latin characters, and returns a primary and a secondary encoding to account for different pronunciations of a single word in English.
See also
- Phonetic algorithm
- Metaphone
- New York State Identification and Intelligence System
- Match Rating Approach
External links
- The Soundex Indexing System (U.S. National Archives and Records Administration)
Ready-to-use soundex converters
- Eastman's Online Genealogy Newsletter Online soundex converter
- van der Harg - Geanealogie: Soundex Dutch soundex converter
- Indic Soundex: converts from all Indian languages and English (developed by the Swatantra Malyalam Group)
Programming algorithms for soundex
- Soundex on Rosetta Code: implementations in around twenty languages
- Text::Soundex Perl module from CPAN
- PHP soundex function
- SimMetrics an open source (sourceforge) library of similarity metrics including a number of soundex variants
- Soundex in C#
- Soundex in Java
- Soundex in JavaScript (wrong: prefixes like "van der" are not excluded, original has two soundex codes for names with prefixes)
- Soundex in JavaScript (view page source for code)
- Soundex in Ruby
- Soundex in Python
- Soundex in STATA
- Soundex in PostgreSQL
- Soundex Tcl package from the tcllib library
- Indic Soundex's Source Code Code for above example.