Language identification
Language identification is the process of determining which natural language given content is in. Traditionally, identification of written language, as practiced, for instance, in library science, has relied on manually identifying frequent words and letters known to be characteristic of particular languages. More recently, computational approaches have been applied to the problem by viewing language identification as a kind of text categorization, a natural language processing approach which relies on statistical methods.
Non-Computational Approaches
In the field of library science, language identification is important for categorizing materials. As librarians often have to categorize materials which are in languages they are not familiar with, they sometimes rely on tables of frequent words and distinctive letters or characters to help them identify languages. While identifying a single such word or character may not suffice to distinguish a language from another with a similar orthography, identifying several is often highly reliable.
Statistical Approaches
One statistical technique compares the compressibility of the text to the compressibility of texts in a set of known languages. This approach is known as the mutual-information-based distance measure (http://www.xs4all.nl/~ajwp/langident.pdf). The same technique can also be used to empirically construct family trees of languages which closely correspond to the trees constructed using historical methods.
Another technique, as described by Dunning (1994), is to create a language n-gram model from a "training text" for each of the languages. Then, for any piece of text needing to be identified, a similar model is made and compared to each stored language model. The language whose model is most similar to the model of the input text is the most likely language. This approach is problematic when the input text is in a language for which there is no model: the method then returns whichever language happens to be "most similar", a result that is effectively arbitrary. Another problem is input text composed of several languages, as is common on the Web. For a more recent method, see Řehůřek and Kolkus (2009).
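The n-gram comparison described above can be illustrated with a minimal sketch. It is not Dunning's method: it assumes character trigrams, relative-frequency profiles and cosine similarity as the similarity measure, and the names `ngram_profile`, `cosine_similarity`, `identify`, the toy training sentences and the `threshold` parameter are all hypothetical choices made for illustration.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text, n=3):
    """Return the relative frequency of each character n-gram in `text`."""
    # Lower-case and collapse whitespace so layout does not affect the profile.
    cleaned = " ".join(text.lower().split())
    padded = f" {cleaned} "  # pad so word boundaries form n-grams too
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency profiles (assumed measure)."""
    dot = sum(weight * q.get(gram, 0.0) for gram, weight in p.items())
    norm_p = sqrt(sum(w * w for w in p.values()))
    norm_q = sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def identify(text, models, threshold=0.0):
    """Return (language, score) for the most similar stored model.

    `threshold` is an illustrative addition: if the best score falls below it,
    the language is reported as None instead of an arbitrary "most similar" one.
    """
    profile = ngram_profile(text)
    scores = {lang: cosine_similarity(profile, m) for lang, m in models.items()}
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        return None, scores[best]
    return best, scores[best]


# Toy "training texts"; real models would be built from far larger corpora.
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog and then the dog "
               "sleeps in the sun while the fox runs through the forest",
    "german": "der schnelle braune fuchs springt über den faulen hund und dann "
              "schläft der hund in der sonne während der fuchs durch den wald läuft",
}
MODELS = {lang: ngram_profile(text) for lang, text in TRAINING.items()}

language, score = identify("the dog runs over the fox", MODELS)
print(language, round(score, 3))
```

Without a threshold, `identify` always returns some language, which mirrors the weakness noted above for input in a language that has no model. A minimum-similarity cutoff is one possible mitigation, though a suitable value would have to be tuned on real data.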
See also
- Algorithmic information theory
- Artificial grammar learning
- Kolmogorov complexity
- Language Analysis for the Determination of Origin
- Machine translation
- Translation
External links
- Language Identification Tools: list of links by Gertjan van Noord, with number of languages, brief description and license information.
- LID - Language Identification in Python: algorithm and code example of an n-gram based LID tool in Python and Scheme by Damir Cavar.
- AlchemyAPI: language identification API, available as an SDK and through a RESTful API (web-based demonstration).
- PetaMem Language Identification: provides a choice between ngram, nvect and smart methods.
- Open Xerox LanguageIdentifier, available in web-based form or through API.
- What Language Is This? Online language identifier: web-based tool written by Henrik Falck.
- Rosette Language Identifier: product by Basis Technology.
- Language Identifier: product by Sematext; exposes Java API and is available through REST/Webservice.
- lid Language Identifier: by Lingua-Systems; C/C++ library and Perl extension (online demo).
- language-detection: open-source language detection library for Java (Apache License 2.0).
- lc4j, a language categorization Java library, by Marco Olivo.
- S. M. Mohammadzadeh: documents related to language identification/detection (26 February 2011).
- Microsoft Extended Linguistic Services for Windows 7: including Microsoft Language Detection.
- Windows 7 API Code Pack for .NET: including managed interfaces for the above.