Canopy clustering algorithm
Encyclopedia
The canopy clustering algorithm is an unsupervised pre-clustering
algorithm, often used as preprocessing step for the K-means algorithm
or the Hierarchical clustering
algorithm.
It is intended to speed up clustering operations on large data set
s, where using another algorithm directly may be impractical due to the size of the data set.
The algorithm proceeds as follows:
Since the algorithm uses distance functions and requires the specification of distance tresholds, it's applicability for high-dimensional data is limited by the curse of dimensionality
. Only when a cheap and approximative - low dimensional - distance function is available, the produced canopies will preserve the clusters produced by K-means.
Data clustering
Cluster analysis or clustering is the task of assigning a set of objects into groups so that the objects in the same cluster are more similar to each other than to those in other clusters....
algorithm, often used as preprocessing step for the K-means algorithm
K-means algorithm
In statistics and data mining, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean...
or the Hierarchical clustering
Hierarchical clustering
In statistics, hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:...
algorithm.
It is intended to speed up clustering operations on large data set
Data set
A data set is a collection of data, usually presented in tabular form. Each column represents a particular variable. Each row corresponds to a given member of the data set in question. Its values for each of the variables, such as height and weight of an object or values of random numbers. Each...
s, where using another algorithm directly may be impractical due to the size of the data set.
The algorithm proceeds as follows:
- Cheaply partitioning the data into overlapping subsets (called "canopies")
- Perform more expensive clustering, but only within these canopies
Since the algorithm uses distance functions and requires the specification of distance tresholds, it's applicability for high-dimensional data is limited by the curse of dimensionality
Curse of dimensionality
The curse of dimensionality refers to various phenomena that arise when analyzing and organizing high-dimensional spaces that do not occur in low-dimensional settings such as the physical space commonly modeled with just three dimensions.There are multiple phenomena referred to by this name in...
. Only when a cheap and approximative - low dimensional - distance function is available, the produced canopies will preserve the clusters produced by K-means.
Benefits
- The number of instances of training data that must be compared at each step is reduced
- There is some evidence that the resulting clusters are improved