Text segmentation - AbsoluteAstronomy.com

Text segmentation is the process of dividing written text

Writing

Writing is the representation of language in a textual medium through the use of a set of signs or symbols . It is distinguished from illustration, such as cave drawing and painting, and non-symbolic preservation of language via non-textual media, such as magnetic tape audio.Writing most likely...

into meaningful units, such as word

Word

In language, a word is the smallest free form that may be uttered in isolation with semantic or pragmatic content . This contrasts with a morpheme, which is the smallest unit of meaning but will not necessarily stand on its own...

s, sentence

Sentence (linguistics)

In the field of linguistics, a sentence is an expression in natural language, and often defined to indicate a grammatical unit consisting of one or more words that generally bear minimal syntactic relation to the words that precede or follow it...

s, or topic

Topic

Topic or Topicality may refer to:* Topic , what is being talked about* Topic * Topic , a brand of confectionery bar* Topics , a work by Aristotle* Topical, a medication applied to body surfaces...

s. The term applies both to mental processes used by humans when reading text, and to artificial processes implemented in computers, which are the subject of natural language processing

Natural language processing

Natural language processing is a field of computer science and linguistics concerned with the interactions between computers and human languages; it began as a branch of artificial intelligence....

. The problem is non-trivial, because while some written languages have explicit word boundary markers, such as the word spaces of written English

English language

English is a West Germanic language that arose in the Anglo-Saxon kingdoms of England and spread into what was to become south-east Scotland under the influence of the Anglian medieval kingdom of Northumbria...

and the distinctive initial, medial and final letter shapes of Arabic

Arabic language

Arabic is a name applied to the descendants of the Classical Arabic language of the 6th century AD, used most prominently in the Quran, the Islamic Holy Book...

, such signals are sometimes ambiguous and not present in all written languages.

Compare speech segmentation

Speech segmentation

Speech segmentation is the process of identifying the boundaries between words, syllables, or phonemes in spoken natural languages. The term applies both to the mental processes used by humans, and to artificial processes of natural language processing....

, the process of dividing speech into linguistically meaningful portions.

Word segmentation

Word segmentation is the problem of dividing a string of written language into its component word

Word

s.

In English

English language

and many other languages using some form of the Latin alphabet

Latin alphabet

The Latin alphabet, also called the Roman alphabet, is the most recognized alphabet used in the world today. It evolved from a western variety of the Greek alphabet called the Cumaean alphabet, which was adopted and modified by the Etruscans who ruled early Rome...

, the space

Space (punctuation)

In writing, a space is a blank area devoid of content, serving to separate words, letters, numbers, and punctuation. Conventions for interword and intersentence spaces vary among languages, and in some cases the spacing rules are quite complex....

is a good approximation of a word delimiter. (Some examples where the space character alone may not be sufficient include contractions like can't for can not.)

However the equivalent to this character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages which do not have a trivial word segmentation process include Chinese

Chinese language

The Chinese language is a language or language family consisting of varieties which are mutually intelligible to varying degrees. Originally the indigenous languages spoken by the Han Chinese in China, it forms one of the branches of Sino-Tibetan family of languages...

, Japanese

Japanese language

is a language spoken by over 130 million people in Japan and in Japanese emigrant communities. It is a member of the Japonic language family, which has a number of proposed relationships with other languages, none of which has gained wide acceptance among historical linguists .Japanese is an...

, where sentences

Sentences

The Four Books of Sentences is a book of theology written by Peter Lombard in the twelfth century. It is a systematic compilation of theology, written around 1150; it derives its name from the sententiae or authoritative statements on biblical passages that it gathered together.-Origin and...

but not words are delimited, Thai

Thai language

Thai , also known as Central Thai and Siamese, is the national and official language of Thailand and the native language of the Thai people, Thailand's dominant ethnic group. Thai is a member of the Tai group of the Tai–Kadai language family. Historical linguists have been unable to definitively...

and Lao

Lao language

Lao or Laotian is a tonal language of the Tai–Kadai language family. It is the official language of Laos, and also spoken in the northeast of Thailand, where it is usually referred to as the Isan language. Being the primary language of the Lao people, Lao is also an important second language for...

, where phrases and sentences but not words are delimited, and Vietnamese

Vietnamese language

Vietnamese is the national and official language of Vietnam. It is the mother tongue of 86% of Vietnam's population, and of about three million overseas Vietnamese. It is also spoken as a second language by many ethnic minorities of Vietnam...

, where syllables but not words are delimited.

In some writing systems however, such as the Ge'ez script used for Amharic and Tigrinya among other languages, words are explicitly delimited (at least historically) with a non-whitespace character.

The Unicode Consortium has published a Standard Annex on Text Segmentation, exploring the issues of segmentation in multiscript texts.

Word splitting is the process of parsing

Parsing

In computer science and linguistics, parsing, or, more formally, syntactic analysis, is the process of analyzing a text, made of a sequence of tokens , to determine its grammatical structure with respect to a given formal grammar...

concatenated text (i.e. text that contains no spaces or other word separators) to infer where word breaks exist.

Word splitting may also refer to the process of hyphenation.

Sentence segmentation

Sentence segmentation is the problem of dividing a string of written language into its component sentences

Sentences

. In English and some other languages, using punctuation, particularly the full stop

Full stop

A full stop is the punctuation mark commonly placed at the end of sentences. In American English, the term used for this punctuation is period. In the 21st century, it is often also called a dot by young people...

character is a reasonable approximation. However even in English this problem is not trivial due to the use of the full stop character for abbreviations, which may or may not also terminate a sentence. For example Mr. is not its own sentence in "Mr. Smith went to the shops in Jones Street." When processing plain text, tables of abbreviations that contain periods can help prevent incorrect assignment of sentence boundaries.

As with word segmentation, not all written languages contain punctuation characters which are useful for approximating sentence boundaries.

Automatic segmentation approaches

Automatic segmentation is the problem in natural language processing

Natural language processing

Natural language processing is a field of computer science and linguistics concerned with the interactions between computers and human languages; it began as a branch of artificial intelligence....

of implementing a computer process to segment text.

When punctuation and similar clues are not consistently available, the segmentation task often requires fairly non-trivial techniques, such as statistical decision-making, large dictionaries, as well as consideration of syntactic and semantic constraints. Effective natural language processing systems and text segmentation tools usually operate on text in specific domains and sources. As an example, processing text used in medical records is a very different problem than processing news articles or real estate advertisements.

The process of developing text segmentation tools starts with collecting a large corpus of text in an application domain. There are two general approaches:

Manual analysis of text and writing custom software
Annotate the sample corpus with boundary information and use Machine Learning
Machine learning
Machine learning, a branch of artificial intelligence, is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data, such as from sensor data or databases...

Some text segmentation systems take advantage of any markup like HTML and know document formats like PDF to provide additional evidence for sentence and paragraph boundaries.

External Links

Word Split An open source software tool designed to split conjoined words into human-readable text.

The source of this article is wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.

Word segmentation

Sentence segmentation

Other segmentation problems

Automatic segmentation approaches

See also

External Links