Collocation extraction
Collocation extraction is the task of extracting collocations automatically from a corpus using a computer.

Within the area of corpus linguistics, collocation is defined as a sequence of words or terms that co-occur more often than would be expected by chance. 'Crystal clear', 'middle management', 'nuclear family', and 'cosmetic surgery' are examples of collocated pairs of words. Some words are often found together because they make up a compound noun, for example 'riding boots' or 'motor cyclist'.

The traditional method of performing collocation extraction is to define a formula, based on statistical quantities of the words involved (such as their individual and joint frequencies), that assigns a score to every word pair. Proposed formulas include mutual information, the t-test, the z-test, the chi-squared test and the likelihood ratio.
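
The scoring approach described above can be illustrated with a minimal sketch. It assumes adjacent word pairs (bigrams) as candidates, and uses common textbook forms of pointwise mutual information and the t-score, which the article itself does not spell out. The function name score_bigrams, the min_count parameter and the toy corpus are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def score_bigrams(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information and t-score.

    Illustrative sketch only: candidates are adjacent pairs, and the
    expected count assumes the two words occur independently.
    """
    n_tokens = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    scores = {}
    for (w1, w2), observed in bigrams.items():
        if observed < min_count:
            continue  # rare pairs give unreliable scores
        # Count expected by chance if w1 and w2 were independent.
        expected = unigrams[w1] * unigrams[w2] / n_tokens
        # PMI: log ratio of observed to expected co-occurrence.
        pmi = math.log2(observed / expected)
        # t-score: difference from expectation, scaled by an
        # approximation of the standard error.
        t_score = (observed - expected) / math.sqrt(observed)
        scores[(w1, w2)] = (pmi, t_score)
    return scores


# Toy corpus, invented for this example.
tokens = "strong tea and strong tea or weak coffee".split()
for pair, (pmi, t_score) in score_bigrams(tokens).items():
    print(pair, round(pmi, 3), round(t_score, 3))
```

Pairs whose observed count is well above the count expected by chance receive high scores and are the collocation candidates. On real corpora a frequency threshold such as min_count is usually needed, because mutual information tends to overrate pairs built from very rare words.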

See also

  • Collocational restriction
  • Collostructional analysis
  • Compound noun, adjective and verb
  • Phrasal verb
  • Siamese twins (English language)
  • Terminology extraction
  • n-gram analysis

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 