Language identification - AbsoluteAstronomy.com

Language identification is the process of determining which natural language given content is in. Traditionally, identification of written language, as practiced for instance in library science, has relied on manually identifying frequent words and letters known to be characteristic of particular languages. More recently, computational approaches have been applied to the problem by treating language identification as a kind of text categorization, a natural language processing approach that relies on statistical methods.

Non-Computational Approaches

In the field of library science, language identification is important for categorizing materials. As librarians often have to categorize materials in languages they are not familiar with, they sometimes rely on tables of frequent words and distinctive letters or characters to help them identify languages. While identifying a single such word or character may not suffice to distinguish a language from another with a similar orthography, identifying several is often highly reliable.

Statistical Approaches

One statistical approach compares the compressibility of the text with the compressibility of texts in known languages. This approach is known as a mutual-information-based distance measure (http://www.xs4all.nl/~ajwp/langident.pdf). The same techniques can also be used to empirically construct family trees of languages that closely correspond to the trees constructed using historical methods.
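
The compression comparison can be sketched in a few lines of Python. This is a minimal illustration and not the method of the linked paper: it assumes zlib as the compressor, and the names compressed_size, identify_by_compression and REFERENCES, as well as the two tiny reference texts, are hypothetical. The idea is that a sample compresses more cheaply when appended to a reference text in the same language, because the compressor can reuse strings it has already seen.

```python
import zlib
from typing import Mapping

def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data."""
    return len(zlib.compress(data, 9))

def identify_by_compression(text: str, references: Mapping[str, str]) -> str:
    """Return the label whose reference text makes the sample cheapest to compress."""
    sample = text.encode("utf-8")

    def extra_cost(label: str) -> int:
        # Bytes added to the compressed reference when the sample is appended.
        reference = references[label].encode("utf-8")
        return compressed_size(reference + sample) - compressed_size(reference)

    return min(references, key=extra_cost)

# Hypothetical, very small reference texts; a real system would use far more data.
REFERENCES = {
    "english": ("the quick brown fox jumps over the lazy dog and the cat sat on "
                "the mat while the rain fell on the roof of the house"),
    "german": ("der schnelle braune fuchs springt in den wald und die katze sitzt "
               "auf der matte und der hund liegt im haus unter dem dach"),
}

guess = identify_by_compression("the cat and the dog sat on the roof of the house", REFERENCES)
```

With reference texts this short the result is only reliable for samples that share whole phrases with one reference; larger reference corpora would be needed in practice.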

Another technique, as described by Dunning (1994), is to create a language n-gram model from a "training text" for each of the languages. Then, for any piece of text needing to be identified, a similar model is made and compared to each stored language model. The language whose model is most similar to the model of the input text is the most likely language. This approach is problematic when the input text is in a language for which there is no model: the method still returns whichever stored language is "most similar", even though that answer is wrong. Another problem is input text composed of several languages, as is common on the Web. For a more recent method, see Řehůřek and Kolkus (2009).
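
The train-then-compare procedure can be illustrated with a short Python sketch. It is not Dunning's exact formulation: it assumes character trigram counts as the model and cosine similarity as the comparison, and every name in it (ngram_profile, cosine_similarity, identify_by_ngrams, TRAINING_TEXTS, MODELS, min_similarity) is hypothetical. The min_similarity threshold is an added assumption that shows one simple way to decline to answer when no stored model resembles the input.

```python
import math
from collections import Counter
from typing import Mapping, Optional

def ngram_profile(text: str, n: int = 3) -> Counter:
    """Count character n-grams of the lowercased, whitespace-normalized text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two n-gram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

def identify_by_ngrams(text: str, models: Mapping[str, Counter],
                       min_similarity: float = 0.0) -> Optional[str]:
    """Return the label of the most similar model, or None if nothing is similar enough."""
    query = ngram_profile(text)
    scores = {label: cosine_similarity(query, model) for label, model in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > min_similarity else None

# Hypothetical, very small training texts; real models need much larger corpora.
TRAINING_TEXTS = {
    "english": ("the quick brown fox jumps over the lazy dog and the cat sat on "
                "the mat while the rain fell on the roof of the house"),
    "german": ("der schnelle braune fuchs springt in den wald und die katze sitzt "
               "auf der matte und der hund liegt im haus unter dem dach"),
}
MODELS = {label: ngram_profile(text) for label, text in TRAINING_TEXTS.items()}

guess = identify_by_ngrams("the dog and the cat", MODELS)
```

Without the threshold, the function would always return some label, which is the failure mode described above for input in a language that has no model.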

External links

Language Identification Tools: list of links by Gertjan van Noord, with number of languages, brief description and license information.

LID - Language Identification in Python: algorithm and code example of an n-gram based LID tool in Python and Scheme by Damir Cavar.

AlchemyAPI: language identification API, available as SDK and through a RESTful API (web-based demonstration).

PetaMem Language Identification: provides a choice between ngram, nvect and smart methods.

Open Xerox LanguageIdentifier, available in web-based form or through API.

What Language Is This? Online language identifier: web-based tool written by Henrik Falck.

Rosette Language Identifier: product by Basis Technology.

Language Identifier: product by Sematext; exposes Java API and is available through REST/Webservice.

G2LI (Global Information Infrastructure Laboratory's Language Identifier).

lid Language Identifier: by Lingua-Systems; C/C++ library and Perl extension (online demo).

language-detection: open-source language detection library for Java (Apache License 2.0).

lc4j, a language categorization Java library, by Marco Olivo.

S.M.Mohammadzadeh: Language identification/detection related documents (26 February 2011).

Microsoft Extended Linguistic Services for Windows 7: including Microsoft Language Detection.

Windows 7 API Code Pack for .NET: including managed interfaces for the above.

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
