Text segmentation
Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics. The term applies both to mental processes used by humans when reading text, and to artificial processes implemented in computers, which are the subject of natural language processing. The problem is non-trivial, because while some written languages have explicit word boundary markers, such as the word spaces of written English and the distinctive initial, medial and final letter shapes of Arabic, such signals are sometimes ambiguous and not present in all written languages.
Compare speech segmentation, the process of dividing speech into linguistically meaningful portions.
Word segmentation
Word segmentation is the problem of dividing a string of written language into its component words.
In English and many other languages using some form of the Latin alphabet, the space is a good approximation of a word delimiter. (Some examples where the space character alone may not be sufficient include contractions like can't for can not.)
However, the equivalent to this character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages which do not have a trivial word segmentation process include Chinese and Japanese, where sentences but not words are delimited; Thai and Lao, where phrases and sentences but not words are delimited; and Vietnamese, where syllables but not words are delimited.
In some writing systems however, such as the Ge'ez script used for Amharic and Tigrinya among other languages, words are explicitly delimited (at least historically) with a non-whitespace character.
The Unicode Consortium has published a Standard Annex on Text Segmentation, exploring the issues of segmentation in multiscript texts.
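For space-delimited scripts, the approximation described above can be illustrated with a minimal sketch. The `tokenize` function and the tiny `CONTRACTIONS` table are hypothetical names introduced here for illustration only; a practical tokenizer would need a far larger table and more careful punctuation handling.

```python
# Hypothetical, deliberately tiny contraction table (an assumption for
# illustration; real systems need many more entries).
CONTRACTIONS = {"can't": ["can", "not"], "don't": ["do", "not"]}

def tokenize(text):
    """Split on whitespace, strip surrounding punctuation, expand known contractions."""
    tokens = []
    for chunk in text.split():
        # Remove punctuation attached to either end of the chunk.
        word = chunk.strip(".,;:!?\"()")
        if not word:
            continue
        # A contraction is one space-delimited chunk but two words.
        tokens.extend(CONTRACTIONS.get(word.lower(), [word]))
    return tokens
```

Splitting on spaces alone would return "can't" as a single token; the lookup table is one simple way to recover the two underlying words.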
Word splitting is the process of parsing concatenated text (i.e. text that contains no spaces or other word separators) to infer where word breaks exist.
Word splitting may also refer to the process of hyphenation.
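One common way to split concatenated text is to search for a sequence of dictionary words that exactly covers the input. The sketch below is an assumption-laden illustration, not a description of any particular tool: `split_words` and its `vocabulary` argument are hypothetical names, and preferring the segmentation with the fewest words is just one possible tie-breaking heuristic.

```python
def split_words(text, vocabulary):
    """Return one segmentation of text into vocabulary words, or None if none exists."""
    # best[i] holds a segmentation of text[:i], or None when none is known.
    best = [None] * (len(text) + 1)
    best[0] = []
    max_len = max(map(len, vocabulary), default=0)
    for end in range(1, len(text) + 1):
        # Only look back as far as the longest vocabulary word.
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                # Heuristic: prefer segmentations with fewer (hence longer) words.
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]
```

With a vocabulary containing both "segment" and "segmentation", the fewest-words preference selects the longer word; statistical systems typically replace this heuristic with word probabilities.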
Sentence segmentation
Sentence segmentation is the problem of dividing a string of written language into its component sentences. In English and some other languages, using punctuation, particularly the full stop character, is a reasonable approximation. However, even in English this problem is not trivial, because the full stop character is also used for abbreviations, which may or may not also terminate a sentence. For example, Mr. is not its own sentence in "Mr. Smith went to the shops in Jones Street." When processing plain text, tables of abbreviations that contain periods can help prevent incorrect assignment of sentence boundaries.
As with word segmentation, not all written languages contain punctuation characters which are useful for approximating sentence boundaries.
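The abbreviation-table idea can be sketched as follows. The `ABBREVIATIONS` set and the `split_sentences` function are hypothetical and intentionally minimal; note that this sketch never treats a listed abbreviation as a sentence end, so it will miss sentences that genuinely end with one.

```python
# Hypothetical abbreviation table (an assumption for illustration);
# real tables are much larger and often domain-specific.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "st.", "etc."}

def split_sentences(text):
    """Split text at ., ! or ? unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = (
            token.endswith((".", "!", "?"))
            and token.lower() not in ABBREVIATIONS
        )
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    # Keep any trailing text that lacks final punctuation.
    if current:
        sentences.append(" ".join(current))
    return sentences
```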
Other segmentation problems
Text may also need to be divided into units other than words, including morphemes (a task usually called morphological analysis), paragraphs, topics or discourse turns.
A document may contain multiple topics, and the task of computerized text segmentation may be to discover these topics automatically and segment the text accordingly.
The topic boundaries may be apparent from section titles and paragraphs.
In other cases one needs to use techniques similar to those used in document classification.
Many different approaches have been tried.
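As one hedged illustration of automatic topic segmentation, the sketch below places a boundary wherever two adjacent sentences share very little vocabulary. This is only a simplified lexical-overlap heuristic chosen for the example; the function names and the `threshold` value are assumptions, and practical systems usually compare larger blocks of text and filter out common function words.

```python
def word_set(sentence):
    """Lower-cased words of a sentence with surrounding punctuation removed."""
    return {w.strip(".,;:!?").lower() for w in sentence.split()} - {""}

def topic_boundaries(sentences, threshold=0.1):
    """Return indices i where a topic boundary is placed before sentences[i]."""
    boundaries = []
    for i in range(1, len(sentences)):
        left, right = word_set(sentences[i - 1]), word_set(sentences[i])
        union = left | right
        # Jaccard similarity between the vocabularies of adjacent sentences.
        similarity = len(left & right) / len(union) if union else 0.0
        if similarity < threshold:
            boundaries.append(i)
    return boundaries
```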
Automatic segmentation approaches
Automatic segmentation is the problem in natural language processing of implementing a computer process to segment text.
When punctuation and similar clues are not consistently available, the segmentation task often requires fairly non-trivial techniques, such as statistical decision-making, large dictionaries, as well as consideration of syntactic and semantic constraints. Effective natural language processing systems and text segmentation tools usually operate on text in specific domains and sources. As an example, processing text used in medical records is a very different problem than processing news articles or real estate advertisements.
The process of developing text segmentation tools starts with collecting a large corpus of text in an application domain. There are two general approaches:
- Analyze the text manually and write custom software
- Annotate the sample corpus with boundary information and use machine learning
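The second approach can be illustrated with a deliberately trivial learner. In this sketch, which is an assumption made for illustration rather than a method from any specific system, the annotated corpus is a list of (token, is_boundary) pairs for tokens ending in a full stop, and the "model" simply remembers the majority label seen for each token. The names `train` and `predict` are hypothetical.

```python
from collections import Counter, defaultdict

def train(annotated_tokens):
    """Count boundary labels per token from (token, is_boundary) pairs."""
    counts = defaultdict(Counter)
    for token, is_boundary in annotated_tokens:
        counts[token.lower()][is_boundary] += 1
    return counts

def predict(counts, token, default=True):
    """Predict whether a full-stop token ends a sentence by majority vote.

    Tokens never seen in training fall back to `default`.
    """
    seen = counts.get(token.lower())
    if not seen:
        return default
    return seen[True] >= seen[False]
```

Real systems use richer features, such as the following word and its capitalization, but the workflow is the same: annotate boundaries, fit a model, then apply it to new text.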
Some text segmentation systems take advantage of markup such as HTML and of known document formats such as PDF to provide additional evidence for sentence and paragraph boundaries.
See also
- Natural language processing
- Speech segmentation
- Hyphenation
- Word count
External Links
- Word Split An open source software tool designed to split conjoined words into human-readable text.