Collocation extraction
Collocation extraction is the task of using a computer to extract collocations automatically from a corpus.
Within the area of corpus linguistics, a collocation is defined as a sequence of words or terms which co-occur more often than would be expected by chance. 'Crystal clear', 'middle management', 'nuclear family', and 'cosmetic surgery' are examples of collocated pairs of words. Some words are often found together because they make up a compound noun, for example 'riding boots' or 'motor cyclist'.
The traditional method of performing collocation extraction is to choose a formula based on statistical quantities of the words, such as their individual and joint frequencies, and use it to calculate a score for every word pair. Proposed formulas include mutual information, the t-test, the z-test, the chi-squared test and the likelihood ratio.
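As an illustration of the scoring approach, the following is a minimal Python sketch that ranks adjacent word pairs by pointwise mutual information (PMI), one common way of applying mutual information to word pairs. The function name `pmi_scores`, the `min_count` threshold, and the toy corpus are assumptions made for this example and do not come from any particular tool or paper.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) )
    Pairs seen fewer than `min_count` times are skipped, because PMI
    is unreliable for rare events (an assumed, adjustable cutoff).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent pairs only
    n_tokens = len(tokens)
    n_bigrams = max(n_tokens - 1, 1)

    scores = {}
    for (w1, w2), pair_count in bigrams.items():
        if pair_count < min_count:
            continue
        p_pair = pair_count / n_bigrams
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy corpus, invented for illustration only.
text = (
    "the water was crystal clear and the sky was crystal clear "
    "but the answer was not clear and the water was cold"
)
tokens = text.split()
scores = pmi_scores(tokens)

# Print candidate pairs from highest to lowest score.
for pair, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, round(score, 2))
```

A positive score means the pair occurs together more often than the individual word frequencies would predict. On a corpus this small the ranking is only indicative; the other proposed formulas can be substituted for the PMI line while keeping the same counting step.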
See also
- Collocational restriction
- Collostructional analysis
- Compound noun, adjective and verb
- Phrasal verb
- Siamese twins (English language)
- Terminology extraction
- n-gram analysis
External links
- Sematext Key Phrase Extractor, a package for extraction of Collocations, Statistically Improbable Phrases (SIPs), etc. by Sematext
- Collocation generator at www.collins.co.uk/corpus/
- What is collocation