Soundex

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. The algorithm mainly encodes consonants; a vowel will not be encoded unless it is the first letter. Soundex is the most widely known of all phonetic algorithms, as it is a standard feature of MS SQL Server and Oracle, and is often used (incorrectly) as a synonym for "phonetic algorithm". Improvements to Soundex are the basis for many modern phonetic algorithms.

History

Soundex was developed by Robert C. Russell and Margaret K. Odell and patented in 1918 and 1922. A variation called American Soundex was used in the 1930s for a retrospective analysis of the US censuses from 1890 through 1920. The Soundex code came to prominence in the 1960s when it was the subject of several articles in the Communications and Journal of the Association for Computing Machinery, and especially when described in Donald Knuth's The Art of Computer Programming.

The National Archives and Records Administration (NARA) maintains the current rule set for the official implementation of Soundex used by the U.S. Government. These encoding rules are available from NARA, upon request, in the form of General Information Leaflet 55, "Using the Census Soundex".

Rules

American Soundex differs from the original algorithm; its rules are as follows.

The Soundex code for a name consists of a letter followed by three numerical digits: the letter is the first letter of the name, and the digits encode the remaining consonants. Similar sounding consonants share the same digit so, for example, the labial consonants B, F, P, and V are each encoded as the number 1.

The correct value can be found as follows:

1. Retain the first letter of the name and drop all other occurrences of a, e, h, i, o, u, w, y.
2. Replace consonants with digits as follows (after the first letter):
   - b, f, p, v => 1
   - c, g, j, k, q, s, x, z => 2
   - d, t => 3
   - l => 4
   - m, n => 5
   - r => 6
3. Two adjacent letters with the same number are coded as a single number.
4. Continue until you have one letter and three numbers. If you run out of letters, fill in 0s until there are three numbers.

Using this algorithm, both "Robert" and "Rupert" return the same string "R163", while "Rubin" yields "R150". "Ashcraft" and "Ashcroft" both yield "A261" and not "A226": the letters 's' and 'c' in "Ashcraft" receive a single 2 rather than 22, even though an 'h' lies between them and they are not the same repeated letter.
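The steps above can be sketched in Python as follows. This is a minimal illustration that follows the rules exactly as listed in this section, not the official NARA implementation; other published rule sets handle some cases differently (for example, letters with the same code separated by a vowel), so results may differ for some names. The names `soundex` and `CODES` are hypothetical, and ignoring non-letter characters is an assumption of this sketch.

```python
# Digit assigned to each consonant, taken from the table above.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Encode a name with the four steps listed above (illustrative helper)."""
    # Assumption: characters that are not letters (spaces, hyphens) are ignored.
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first, rest = letters[0], letters[1:]

    # Steps 1 and 2: keep the first letter; elsewhere, drop a, e, h, i, o, u,
    # w, y (they have no entry in CODES) and map the other letters to digits.
    digits = [CODES[c] for c in rest if c in CODES]

    # Step 3: adjacent letters with the same number collapse to one digit.
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)

    # Step 4: pad with zeros, then keep exactly three digits.
    return first.upper() + "".join(collapsed + ["0", "0", "0"])[:3]
```

With this sketch, `soundex("Robert") == soundex("Rupert")` evaluates to true, which is the property an index lookup relies on: names are stored under their code, and a query name is encoded the same way before the lookup.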

Soundex variants

A similar algorithm called "Reverse Soundex" prefixes the last letter of the name instead of the first.

The NYSIIS algorithm was introduced by the New York State Identification and Intelligence System in 1970 as an improvement to the Soundex algorithm. NYSIIS handles some multi-character n-grams and maintains relative vowel positioning, whereas Soundex does not.

Daitch–Mokotoff Soundex (D–M Soundex) was developed in 1985 by genealogist Gary Mokotoff and later improved by genealogist Randy Daitch because of problems they encountered while trying to apply the Russell Soundex to Jews with Germanic or Slavic surnames (such as Moskowitz vs. Moskovitz or Levine vs. Lewin). D–M Soundex is sometimes referred to as "Jewish Soundex" or "Eastern European Soundex", although the authors discourage the use of these nicknames. The D–M Soundex algorithm can return as many as 32 individual phonetic encodings for a single name. Each D–M Soundex result is an all-numeric code between 100000 and 999999. The algorithm is much more complex than Russell Soundex.

As a response to deficiencies in the Soundex algorithm, Lawrence Philips developed the Metaphone algorithm in 1990 for the same purpose. Philips developed an improvement to Metaphone in 2000, which he called Double Metaphone. Double Metaphone includes a much larger encoding rule set than its predecessor, handles a subset of non-Latin characters, and returns a primary and a secondary encoding to account for different pronunciations of a single word in English.

External links

The Soundex Indexing System (U.S. National Archives and Records Administration)

Ready-to-use soundex converters

Eastman's Online Genealogy Newsletter Online soundex converter
van der Harg - Geanealogie: Soundex Dutch soundex converter
Indic Soundex Converts from all Indian languages and English (developed by the Swatantra Malyalam Group)

Programming algorithms for soundex

Soundex on Rosetta Code: implementations in around twenty languages.
Text::Soundex Perl module from CPAN
PHP soundex function
SimMetrics an open source (sourceforge) library of similarity metrics including a number of soundex variants
Soundex in C#
Soundex in Java
Soundex in JavaScript (wrong: prefixes like "van der" are not excluded, original has two soundex codes for names with prefixes)
Soundex in JavaScript (view page source for code)
Soundex in Ruby
Soundex in Python
Soundex in STATA

Soundex in PostgreSQL
Soundex Tcl package from the tcllib library

Indic Soundex's Source Code Code for above example.