Consensus clustering
Clustering is the assignment of objects into groups (called clusters) so that objects from the same cluster are more similar to each other than objects from different clusters. Often similarity is assessed according to a distance measure. Clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics.
Consensus clustering has emerged as an important elaboration of the classical clustering problem. Consensus clustering, also called aggregation of clusterings (or partitions), refers to the situation in which a number of different (input) clusterings have been obtained for a particular dataset and it is desired to find a single (consensus) clustering which is a better fit in some sense than the existing clusterings. Consensus clustering is thus the problem of reconciling clustering information about the same data set coming from different sources or from different runs of the same algorithm. When cast as an optimization problem, consensus clustering is known as the median partition problem, and has been shown to be NP-complete. Consensus clustering for unsupervised learning is analogous to ensemble learning in supervised learning.
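The median partition objective can be written compactly; the formulation below is a common one (the exact choice of distance between partitions varies by author). Given input partitions \(\pi_1, \dots, \pi_r\) over the same set of objects, one seeks

\[ \pi^{*} \;=\; \operatorname*{arg\,min}_{\pi} \; \sum_{i=1}^{r} d(\pi, \pi_i), \]

where a frequent choice for \(d\) is the Mirkin distance: the number of pairs of objects placed together in one partition and apart in the other.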
Issues with existing clustering techniques
- Current clustering techniques do not address all the requirements adequately.
- Dealing with a large number of dimensions and a large number of data items can be problematic because of time complexity.
- The effectiveness of a method depends on the definition of "distance" (for distance-based clustering).
- If an obvious distance measure does not exist, we must "define" one, which is not always easy, especially in multidimensional spaces.
- The result of a clustering algorithm (which in many cases is itself somewhat arbitrary) can be interpreted in different ways.
Why Consensus Clustering?
- There are potential shortcomings for each of the known clustering techniques.
- Interpretation of results is difficult in some cases.
- When the number of clusters is not known in advance, clustering becomes difficult.
- Some algorithms are extremely sensitive to their initial settings.
- Some algorithms can never undo what was done previously.
- Iterative descent clustering methods, such as SOM (self-organizing map) and k-means, circumvent some of the shortcomings of hierarchical clustering by providing univocally defined clusters and cluster boundaries. However, they lack the intuitive and visual appeal of hierarchical clustering, and the number of clusters must be chosen a priori.
- An extremely important issue in cluster analysis is the validation of the clustering results, that is, how to gain confidence in the significance of the clusters provided by the clustering technique (cluster numbers and cluster assignments). Lacking an external objective criterion (the equivalent of a known class label in supervised learning), this validation becomes somewhat elusive.
Advantages of Consensus Clustering
- Provides a method to represent the consensus across multiple runs of a clustering algorithm, to determine the number of clusters in the data, and to assess the stability of the discovered clusters (a minimal sketch follows this list).
- The method can also be used to represent the consensus over multiple runs of a clustering algorithm with random restart (such as k-means, model-based Bayesian clustering, SOM, etc.), so as to account for its sensitivity to the initial conditions.
- It also provides a visualization tool to inspect cluster number, membership, and boundaries.
- We can extract many features/attributes from multiple runs of different clustering algorithms on the data. These features can provide valuable information for a final consensus clustering.
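As a concrete illustration of the first point above, the sketch below estimates a consensus (co-association) matrix over repeated k-means runs with random restarts. It is a minimal sketch, not Monti et al.'s full resampling procedure; the function name and parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def consensus_matrix(X, k, n_runs=50, seed=None):
    """Fraction of runs in which each pair of points shares a cluster."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    together = np.zeros((n, n))
    for _ in range(n_runs):
        labels = KMeans(n_clusters=k, n_init=1,
                        random_state=int(rng.integers(2**31 - 1))).fit_predict(X)
        together += labels[:, None] == labels[None, :]  # co-membership indicator
    return together / n_runs
```

Entries near 1 or 0 indicate pairs whose relationship is stable across runs; many intermediate values suggest unstable clusters or a poorly chosen k, which is how such a matrix doubles as a visualization and model-selection tool.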
Related Work
1. Clustering Ensemble (Strehl and Ghosh): They consider various formulations for the problem, most of which reduce it to a hypergraph partitioning problem. In one of their formulations they consider the same graph as in the correlation clustering problem. The solution they propose is to compute the best k-partition of the graph, which does not take into account the penalty for merging two nodes that are far apart.
2. Clustering Aggregation (Fern and Brodley): They apply the clustering aggregation idea to a collection of soft clusterings obtained by random projections. They use an agglomerative algorithm and do not penalize merging dissimilar nodes.
3. Fred and Jain: They propose using a single-linkage algorithm to combine multiple runs of the k-means algorithm.
4. Dana Cristofor and Dan Simovici: They observe the connection between clustering aggregation and clustering of categorical data. They propose information-theoretic distance measures and genetic algorithms for finding the best aggregation solution.
5. Topchy et al.: They define clustering aggregation as a maximum-likelihood estimation problem and propose an EM algorithm for finding the consensus clustering.
Hard Ensemble Clustering
This approach by Strehl and Ghosh introduces the problem of combining multiple partitionings of a set of objects into a single consolidated clustering without accessing the features or algorithms that determined these partitionings. They discuss three approaches to solving this problem and obtaining high-quality consensus functions. Their techniques have low computational costs, which makes it feasible to evaluate each of the techniques below and arrive at the best solution by comparing the results against an objective function.
Efficient Consensus Functions
1. Cluster-based Similarity Partitioning Algorithm (CSPA)
In CSPA the similarity between two data points is defined to be directly proportional to the number of constituent clusterings of the ensemble in which they are clustered together. The intuition is that the more similar two data points are, the higher the chance that constituent clusterings will place them in the same cluster. CSPA is the simplest heuristic, but its computational and storage complexity are both quadratic in n, the number of data points. The two methods that follow are computationally less expensive.
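Before turning to those, here is a minimal sketch of the CSPA idea under stated assumptions: Strehl and Ghosh re-partition the similarity graph with METIS, for which average-linkage clustering is substituted below as a readily available stand-in.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cspa(label_runs, k):
    """label_runs: list of length-n label arrays, one per constituent clustering."""
    runs = [np.asarray(r) for r in label_runs]
    # Similarity = fraction of constituent clusterings that co-cluster i and j.
    S = sum((run[:, None] == run[None, :]).astype(float) for run in runs) / len(runs)
    D = 1.0 - S                   # turn similarity into a distance
    np.fill_diagonal(D, 0.0)      # squareform requires a zero diagonal
    # Average linkage stands in for the METIS graph partitioner used originally.
    return fcluster(linkage(squareform(D), method='average'),
                    t=k, criterion='maxclust')
```

The n-by-n similarity matrix is exactly what makes CSPA quadratic in both time and storage.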
2. Hyper-Graph Partitioning Algorithm (HGPA)
The HGPA algorithm takes a very different approach to finding the consensus clustering than the previous method. The cluster ensemble problem is formulated as partitioning a hypergraph by cutting a minimal number of hyperedges. The authors make use of hMETIS, a hypergraph partitioning package.
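A sketch of the hypergraph construction follows: every cluster from every constituent clustering becomes one hyperedge over its member objects. The writer below targets the plain hMETIS input format as commonly documented (a header line with hyperedge and vertex counts, then one 1-indexed vertex list per hyperedge); treat the format details as an assumption and verify them against your hMETIS version.

```python
def hgpa_hyperedges(label_runs):
    """Each cluster in each constituent clustering becomes one hyperedge."""
    edges = []
    for run in label_runs:
        for c in set(run):
            edges.append([i for i, lab in enumerate(run) if lab == c])
    return edges

def write_hmetis_file(edges, n_vertices, path):
    """Write a hypergraph file for hMETIS (assumed .hgr layout; verify locally)."""
    with open(path, "w") as f:
        f.write(f"{len(edges)} {n_vertices}\n")                  # header: |E| |V|
        for e in edges:
            f.write(" ".join(str(v + 1) for v in e) + "\n")      # 1-indexed vertices
```

hMETIS then cuts as few hyperedges as possible while keeping the parts balanced, and the resulting vertex parts are the consensus clusters.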
3. Meta-CLustering Algorithm (MCLA)
The Meta-CLustering Algorithm (MCLA) is based on clustering clusters. First, it tries to solve the cluster correspondence problem, and then it uses voting to place data points into the final consensus clusters. The cluster correspondence problem is solved by grouping the clusters identified in the individual clusterings of the ensemble. The clustering is performed using METIS and spectral clustering.
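The sketch below follows the MCLA stages just described, with two stated substitutions: Jaccard similarity between cluster indicator vectors defines the meta-graph edge weights, and scikit-learn's spectral clustering stands in for METIS.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def mcla(label_runs, k):
    """Meta-clustering sketch; assumes every meta-cluster ends up non-empty."""
    # One binary indicator row per cluster, across all constituent clusterings.
    H = np.vstack([(np.asarray(run) == c).astype(float)
                   for run in label_runs for c in sorted(set(run))])
    # Meta-graph edge weights: Jaccard similarity between pairs of clusters.
    inter = H @ H.T
    sizes = H.sum(axis=1)
    jaccard = inter / (sizes[:, None] + sizes[None, :] - inter + 1e-12)
    meta = SpectralClustering(n_clusters=k, affinity="precomputed",
                              random_state=0).fit_predict(jaccard)
    # Collapse each meta-cluster into an association vector, then let each
    # object vote for the meta-cluster in which it participates most strongly.
    assoc = np.vstack([H[meta == m].mean(axis=0) for m in range(k)])
    return assoc.argmax(axis=0)
```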
Soft Clustering Ensembles
Punera and Ghosh extended the idea of hard clustering ensembles to the soft clustering scenario. Each instance in a soft ensemble is represented by a concatenation of r posterior membership probability distributions obtained from the constituent clustering algorithms. We can define a distance measure between two instances using the Kullback-Leibler (KL) divergence, which calculates the "distance" between two probability distributions.
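A minimal sketch of such a distance, assuming each instance is given as its list of r posterior blocks; the symmetrisation (KL itself is asymmetric) and the epsilon smoothing are choices made here, not prescribed by Punera and Ghosh.

```python
import numpy as np

def soft_instance_distance(p_blocks, q_blocks, eps=1e-12):
    """Sum of symmetrised KL divergences over the r posterior blocks."""
    total = 0.0
    for p, q in zip(p_blocks, q_blocks):
        # Clip and renormalize each block so the logs are well defined.
        p = np.clip(np.asarray(p, dtype=float), eps, None); p /= p.sum()
        q = np.clip(np.asarray(q, dtype=float), eps, None); q /= q.sum()
        total += 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
    return total
```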
1. sCSPA
sCSPA extends CSPA by calculating a similarity matrix from the soft memberships. Each object is visualized as a point in a space with one dimension per cluster, the coordinate along each dimension being the probability of its belonging to that cluster. This technique first transforms the objects into this label space and then interprets the dot product between the vectors representing the objects as their similarity.
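In code, the label-space dot product amounts to a single matrix product (a sketch; the variable names are illustrative):

```python
import numpy as np

def scspa_similarity(membership_runs):
    """membership_runs: list of (n, k_i) posterior matrices, one per run."""
    Y = np.hstack(membership_runs)   # each row: one object's point in label space
    return Y @ Y.T                   # dot products interpreted as similarities
```

The resulting n-by-n matrix is then re-partitioned exactly as in CSPA.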
2. sMCLA
sMCLA extends MCLA by accepting soft clusterings as input. sMCLA's working can be divided into the following steps (a rough sketch appears after the list):
- Construct Soft Meta-Graph of Clusters
- Group the Clusters into Meta-Clusters
- Collapse Meta-Clusters using Weighting
- Compete for Objects
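A rough sketch of these steps under stated assumptions: cosine similarity between cluster membership columns defines the soft meta-graph, spectral clustering stands in for the graph partitioner, and the weighting and competition steps are reduced to an unweighted average and an argmax.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def smcla(membership_runs, k):
    """Soft meta-clustering sketch; assumes non-empty meta-clusters."""
    # Step 1: one row per cluster; entries are its soft memberships over objects.
    C = np.hstack(membership_runs).T
    Cn = C / (np.linalg.norm(C, axis=1, keepdims=True) + 1e-12)
    W = Cn @ Cn.T                      # soft meta-graph: cosine similarities
    # Step 2: group the clusters into k meta-clusters.
    meta = SpectralClustering(n_clusters=k, affinity="precomputed",
                              random_state=0).fit_predict(W)
    # Step 3: collapse each meta-cluster into a single membership profile.
    assoc = np.vstack([C[meta == m].mean(axis=0) for m in range(k)])
    # Step 4: objects "compete": each goes to its strongest meta-cluster.
    return assoc.argmax(axis=0)
```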
3. sHBGF
HBGF represents the ensemble as a bipartite graph with clusters and instances as nodes, and edges between the instances and the clusters they belong to. This approach can be trivially adapted to soft ensembles, since the graph partitioning algorithm METIS accepts weights on the edges of the graph to be partitioned. In sHBGF, the graph has n + t vertices, where t is the total number of underlying clusters.
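Building the weighted bipartite graph is direct; a sketch follows (partitioning the result with METIS or a spectral method is left to the reader's tooling):

```python
import numpy as np

def shbgf_adjacency(membership_runs):
    """Weighted bipartite adjacency: n instance vertices plus t cluster vertices."""
    Y = np.hstack(membership_runs)    # (n, t) soft memberships
    n, t = Y.shape
    A = np.zeros((n + t, n + t))
    A[:n, n:] = Y                     # instance-to-cluster edges, weighted by
    A[n:, :n] = Y.T                   # membership probability (kept symmetric)
    return A
```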
Further reading
- Stefano Monti, Pablo Tamayo, Jill Mesirov and Todd Golub. "Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data". Machine Learning, 52(1–2):91–118, 2003.