Semantic similarity
Semantic similarity or semantic relatedness is a concept whereby a set of documents or terms within term lists are assigned a metric based on the likeness of their meaning or semantic content.
Concretely, this can be achieved, for instance, by defining a topological similarity, using ontologies to define a distance between words (a naive metric for terms arranged as nodes in a directed acyclic graph such as a hierarchy would be the minimal distance, counted in separating edges, between the two term nodes), or by using statistical means such as a vector space model to correlate words and textual contexts from a suitable text corpus (co-occurrence).
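The naive edge-counting metric mentioned above can be sketched as a breadth-first search over the hierarchy. This is a minimal illustration, not a standard implementation: the function name `edge_distance`, the child-to-parents dictionary layout and the toy hierarchy are all assumptions made for the example, and edges are treated as undirected when counting.

```python
from collections import deque


def edge_distance(parents, a, b):
    """Return the minimal number of edges separating terms a and b.

    `parents` maps each term to a list of its parent terms (hypothetical
    layout). Edge direction is ignored when counting. Returns None when
    no path connects the two terms.
    """
    # Build an undirected adjacency map from the child -> parents mapping.
    neighbours = {}
    for child, term_parents in parents.items():
        for parent in term_parents:
            neighbours.setdefault(child, set()).add(parent)
            neighbours.setdefault(parent, set()).add(child)

    # Breadth-first search visits nodes in order of increasing distance.
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Toy hierarchy, invented for illustration only.
TOY_HIERARCHY = {
    "dog": ["mammal"],
    "cat": ["mammal"],
    "mammal": ["animal"],
    "sparrow": ["bird"],
    "bird": ["animal"],
}
```

In this toy hierarchy, "dog" and "cat" are separated by two edges (through "mammal"), while "dog" and "sparrow" are separated by four (through "animal"), so the former pair counts as closer.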
Taxonomy
The concept of semantic similarity is more specific than semantic relatedness, as the latter includes concepts such as antonymy and meronymy, while similarity does not. However, much of the literature uses these terms interchangeably, along with terms like semantic distance. In essence, semantic similarity, semantic distance, and semantic relatedness all mean, "How much does term A have to do with term B?" The answer to this question is usually a number between -1 and 1, or between 0 and 1, where 1 signifies extremely high similarity/relatedness, and 0 signifies little-to-none.
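One score with exactly this range is the cosine of the angle between two term vectors. The article does not name a specific statistic at this point, so the choice of cosine similarity here is an assumption made for illustration, as are the function name and the zero-vector convention.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors.

    The result lies in [-1, 1] in general, and in [0, 1] when all
    components are non-negative (for example, raw co-occurrence counts).
    """
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    # Convention chosen for this sketch: a zero vector is similar to nothing.
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Vectors pointing the same way score close to 1, orthogonal vectors score 0, and opposite vectors score close to -1.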
Visualisation
An intuitive way of visualising the semantic similarity of terms is by grouping together closely related terms and spacing more distantly related ones wider apart. This is also common, if sometimes subconscious, practice for mind maps and concept maps.
Biomedical Informatics
Semantic similarity measures have been applied and developed in biomedical ontologies, namely the Gene Ontology (GO). They are mainly used to compare genes and proteins based on the similarity of their functions rather than on their sequence similarity, but they are also being extended to other bioentities, such as chemical compounds and diseases.
These comparisons can be done using tools freely available on the web:
- ProteInOn can be used to find interacting proteins, find assigned GO terms and calculate the functional semantic similarity of UniProt proteins, and to get the information content and calculate the functional semantic similarity of GO terms.
- CMPSim provides a functional similarity measure between chemical compounds and metabolic pathways using ChEBI-based semantic similarity measures.
- CESSM provides a tool for the automated evaluation of GO-based semantic similarity measures.
GeoInformatics
Similarity is also applied to find similar geographic features or feature types:
- SIM-DL similarity server can be used to compute similarities between concepts stored in geographic feature type ontologies.
- Geo-Net-PT Similarity Calculator can be used to compute how well related two geographic concepts are in the Geo-Net-PT ontology.
Linguistics
Several metrics use WordNet: (+) humanly constructed; (−) humanly constructed (not automatically learned), cannot measure relatedness between multi-word terms, non-incremental vocabulary
Topological similarity
There are essentially two types of approaches that calculate topological similarity between ontological concepts:
- Edge-based: which use the edges and their types as the data source;
- Node-based: in which the main data sources are the nodes and their properties.
Other measures calculate the similarity between ontological instances:
- Pairwise: measure functional similarity between two instances by combining the semantic similarities of the concepts they represent
- Groupwise: calculate the similarity directly, without combining the semantic similarities of the concepts they represent
Some examples:
Edge-based
- IntelliGO:
Node-based
- Resnik: based on the notion of information content
- Lin
- Jiang and Conrath
- DiShIn: Disjunctive Shared Information between Ontology Concepts
- other alternative: GraSM (Graph-based Similarity Measure)
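The first three node-based measures are commonly stated in terms of the information content IC(c) = -log p(c) of a concept c, evaluated at the most informative common ancestor of the two concepts. The sketch below follows those commonly cited formulas. The function names, the dictionary of precomputed IC values and the toy probabilities are assumptions made for the example, and DiShIn and GraSM are not covered.

```python
import math


def information_content(probability):
    """IC of a concept whose usage probability in a corpus is `probability`."""
    return -math.log(probability)


def resnik(ic, common_ancestors):
    """Resnik similarity: IC of the most informative common ancestor."""
    return max(ic[c] for c in common_ancestors)


def lin(ic, a, b, common_ancestors):
    """Lin similarity: shared IC scaled by the IC of both concepts."""
    denom = ic[a] + ic[b]
    # Convention chosen for this sketch: return 0.0 when both ICs are zero.
    if denom == 0:
        return 0.0
    return 2 * resnik(ic, common_ancestors) / denom


def jiang_conrath_distance(ic, a, b, common_ancestors):
    """Jiang and Conrath distance: IC not shared by the two concepts."""
    return ic[a] + ic[b] - 2 * resnik(ic, common_ancestors)


# Toy probabilities, invented for illustration only.
TOY_IC = {
    "animal": information_content(1.0),
    "mammal": information_content(0.5),
    "dog": information_content(0.25),
    "cat": information_content(0.25),
}
DOG_CAT_ANCESTORS = {"animal", "mammal"}
```

With these toy values, the most informative common ancestor of "dog" and "cat" is "mammal", and the Lin score is 0.5. Note that Jiang and Conrath's measure is a distance, so lower values mean more similar concepts.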
Pairwise
- maximum of the pairwise similarities
- composite average in which only the best-matching pairs are considered (best-match average)
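Both pairwise strategies can be sketched on top of any concept-level similarity function. In the example below the function names, the `sim` callback and the toy score table are hypothetical. Averaging the best matches in both directions is one common reading of "best-match average" and is an assumption here, since the text does not spell out the formula.

```python
def pairwise_max(terms_a, terms_b, sim):
    """Similarity of two instances as the maximum over all concept pairs."""
    return max(sim(a, b) for a in terms_a for b in terms_b)


def best_match_average(terms_a, terms_b, sim):
    """Average, in both directions, of each concept's best-matching score."""
    best_a = [max(sim(a, b) for b in terms_b) for a in terms_a]
    best_b = [max(sim(a, b) for a in terms_a) for b in terms_b]
    return (sum(best_a) / len(best_a) + sum(best_b) / len(best_b)) / 2


# Toy concept-level scores, invented for illustration only.
TOY_SCORES = {
    ("t1", "t3"): 0.9,
    ("t1", "t4"): 0.2,
    ("t2", "t3"): 0.1,
    ("t2", "t4"): 0.6,
}


def toy_sim(a, b):
    """Look up the similarity of concept a (first instance) and b (second)."""
    return TOY_SCORES[(a, b)]
```

For instances annotated with ["t1", "t2"] and ["t3", "t4"], the maximum strategy reports the single best pair, while the best-match average also takes the weaker t2/t4 match into account.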
Groupwise
- Jaccard index
- simGIC
- simLP
- simUI
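Groupwise measures compare the two sets of concepts directly. The sketch below shows the Jaccard index and an information-content-weighted variant in the spirit of simGIC. simUI is usually described as the Jaccard index applied to annotation sets extended with their ancestors. The function names and toy data are assumptions, the exact definitions should be checked against the original papers, and simLP is not covered.

```python
def jaccard(set_a, set_b):
    """Jaccard index: size of the intersection over size of the union."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def sim_gic(set_a, set_b, ic):
    """Jaccard-style ratio in which each concept is weighted by its IC."""
    union = set_a | set_b
    total = sum(ic[t] for t in union)
    if total == 0:
        return 0.0
    return sum(ic[t] for t in set_a & set_b) / total


# Toy annotation sets (already extended with ancestors) and IC values,
# invented for illustration only.
TOY_A = {"root", "x", "y"}
TOY_B = {"root", "x", "z"}
TOY_GROUP_IC = {"root": 0.0, "x": 1.0, "y": 2.0, "z": 3.0}
```

The two toy sets share two of four concepts, giving a Jaccard index of 0.5. The weighted variant is lower because the shared concepts carry little information content.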
Statistical similarity
- LSA (Latent semantic analysis) (+) vector-based, adds vectors to measure multi-word terms; (−) non-incremental vocabulary, long pre-processing times
- PMI (Pointwise mutual information) (+) large vocab, because it uses any search engine (like Google); (−) cannot measure relatedness between whole sentences or documents
- SOC-PMI (Second-order co-occurrence pointwise mutual information) (+) sort lists of important neighbor words from a large corpus; (−) cannot measure relatedness between whole sentences or documents
- GLSA (Generalized Latent Semantic Analysis) (+) vector-based, adds vectors to measure multi-word terms; (−) non-incremental vocabulary, long pre-processing times
- ICAN (Incremental Construction of an Associative Network) (+) incremental, network-based measure, good for spreading activation, accounts for second-order relatedness; (−) cannot measure relatedness between multi-word terms, long pre-processing times
- NGD (Normalized Google distance) (+) large vocab, because it uses any search engine (like Google); (−) can measure relatedness between whole sentences or documents but the larger the sentence or document the more ingenuity is required (Cilibrasi & Vitanyi, 2007)
- ESA (Explicit Semantic Analysis) based on Wikipedia and the ODP
- n° of Wikipedia (noW), inspired by the game Six Degrees of Wikipedia, is a distance metric based on the hierarchical structure of Wikipedia. A directed acyclic graph is first constructed, and Dijkstra's shortest path algorithm is then employed to determine the noW value between two terms as the geodesic distance between the corresponding topics (i.e. nodes) in the graph.
- VGEM (Vector Generation of an Explicitly-defined Multidimensional Semantic Space) (+) incremental vocab, can compare multi-word terms; (−) performance depends on choosing specific dimensions
- BLOSSOM (Best path Length On a Semantic Self-Organizing Map) (+) uses a Self Organizing Map to reduce high dimensional spaces, can use different vector representations (VGEM or word-document matrix), provides 'concept path linking' from one word to another; (−) highly experimental, requires nontrivial SOM calculation
- SimRank
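Two of the count-based measures listed above, PMI and NGD, reduce to short formulas over occurrence counts such as search-engine hit counts. The sketch below follows their commonly cited definitions. The function and parameter names are assumptions made for the example, all counts are assumed to be positive, and the numbers used in the note that follows are invented.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information in bits: log2 of p(x, y) / (p(x) p(y)).

    `count_xy` is the number of contexts containing both terms, `count_x`
    and `count_y` the number containing each term, and `total` the number
    of contexts overall.
    """
    return math.log2((count_xy * total) / (count_x * count_y))


def normalized_google_distance(hits_x, hits_y, hits_xy, total_pages):
    """NGD from hit counts; 0 means the terms always occur together."""
    log_x, log_y = math.log(hits_x), math.log(hits_y)
    numerator = max(log_x, log_y) - math.log(hits_xy)
    denominator = math.log(total_pages) - min(log_x, log_y)
    return numerator / denominator
```

For example, `pmi(8, 16, 32, 1024)` evaluates to 4 bits, meaning the two terms co-occur sixteen times more often than independence would predict. NGD is a distance rather than a similarity, so smaller values indicate more closely related terms.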
Software
- WordNet-Similarity, an open source package for computing the similarity and relatedness of concepts found in WordNet
- UMLS-Similarity, an open source package for computing the similarity and relatedness of concepts found in the Unified Medical Language System (UMLS)
Web Services
- Measures of Semantic Relatedness (MSR)
- WordNet-Similarity, a web interface to WordNet-Similarity
- UMLS-Similarity, a web interface to UMLS-Similarity
See also
- Terminology extraction
- Coherence (linguistics)
- Analogy
- Semantic differential
External links
- List of related literature
- WordNet::Similarity (using WordNet as an ontology)
- WordNet Explorer (interactive graphic WordNet database editor)
- Similarity-based Learning Methods for the Semantic Web (C. d'Amato, PhD Thesis)
- Survey on Semantic Similarity Measures (C. d'Amato, S. Staab, N. Fanizzi, EKAW 2008, Springer-Verlag)
- Algorithm, Implementation and Application of the SIM-DL Similarity Server (Introduction to the SIM-DL Similarity Server)