Community structure
Encyclopedia
In the study of complex network
s, a network is said to have community structure if the nodes of the network can be easily grouped into (potentially overlapping) sets of nodes such that each set of nodes is densely connected internally. In the particular case of non-overlapping community finding, this implies that the network divides naturally into groups of nodes with dense connections internally and sparser connections between groups. A more general definition, suitable also for the case of overlapping community finding, is to aim to place pairs of nodes in the same community if, and only if, they are connected to each other.
, heavy-tailed degree distribution
s, and clustering
, among others. Another common characteristic is community structure
.
In the context of networks, community structure refers to the occurrence of groups of nodes in a network that are more densely connected internally than with the rest of the network, as shown in the example image to the right. This inhomogeneity of connections suggests that the network has certain natural divisions within it. Note that communities are often defined in terms of the partition of the set
of vertices, that is each node is put into one and only one community, just as in the figure. This is a useful simplification but may not be appropriate in many cases.
Not all networks need display community structure. Many model networks, for example, such as the random graph
and the Barabási–Albert model
, do not display community structure.
In the minimum-cut method, the network is divided into a predetermined number of parts, usually of approximately the same size, chosen such that the number of edges between groups is minimized. The method works well in many of the applications for which it was originally intended but is less than ideal for finding community structure in general networks since it will find communities regardless of whether they are implicit in the structure, and it will find only a fixed number of them.
. In this method one defines a similarity measure
quantifying some (usually topological) type of similarity between node pairs. Commonly used measures include the cosine similarity
, the Jaccard index
, and the Hamming distance
between rows of the adjacency matrix
. Then one groups similar nodes into communities according to this measure. There are several common schemes for performing the grouping, the two simplest being single-linkage clustering, in which two groups are considered separate communities if and only if all pairs of nodes in different groups have similarity lower than a given threshold, and complete linkage clustering, in which all nodes within every group have similarity greater than threshold.
The Girvan–Newman algorithm returns results of reasonable quality and is popular because it has been implemented in a number of standard software packages. But it also runs slowly, taking time O(m2n) on a network of n vertices and m edges, making it impractical for networks of more than a few thousand nodes
.
is a benefit function that measures the quality of a particular division of a network into communities. The modularity maximization method detects communities by searching over possible divisions of a network for one or more that have particularly high modularity. Since exhaustive search over all possible divisions is usually intractable, practical algorithms are based on approximate optimization methods such as greedy algorithms, simulated annealing, or spectral optimization, with different approaches offering different balances between speed and accuracy.
The usefulness of modularity optimization is however questionable: on the one hand, it has been shown that modularity optimization often fails to detect clusters smaller than some scale, depending on the size of the network (resolution limit
); on the other hand the landscape of modularity values is characterized by a huge degeneracy of partitions with high modularity, close to the absolute maximum, which may be very different from each other
s are subgraphs in which every node is connected to every other node in the clique. As nodes can not be more tightly connected than this, it is not surprising that there are many approaches to community detection in networks based on the detection of cliques in a graph and the analysis of how these overlap. Note that as a node can be a member of more than one clique, a node can be a member of more than one community in these methods giving an overlapping community structure.
One approach is to find the maximal cliques, that is find the cliques which are not the subgraph of any other clique. The classic algorithm to find these is the Bron-Kerbosch algorithm. The overlap of these can be used to define communities in several ways. The simplest is to consider only maximal cliques bigger than a minimum size (number of nodes). The union of these cliques then defines a subgraph whose components (disconnected parts) then define communities. Such approaches are often implemented in social network analysis software
such as UCInet.
The alternative approach to is to use cliques of fixed size, k. The overlap of these can be used to define a type of k-regular hypergraph
or a structure which is a generalisation of the line graph (the case when k=2) known as a Clique graph. The clique graphs have vertices which represent the cliques in the original graph while the edges of the clique graph record the overlap of the clique in the original graph. Applying any of the previous community detection methods (which assign each node to a community) to the clique graph then assigns each clique to a community. This can then be used to determine community membership of nodes in the cliques. Again as a node may be in several cliques, it can be a member of several communities.
For instance the clique percolation method
defines communities as percolation clusters
of k-cliques
. To do this it
finds all k-cliques in a network, that is all the complete sub-graphs of k-nodes.
It then defines two k-cliques to be adjacent if they share k − 1 nodes, that is this is used to define edges in a clique graph. A community is then defined to be the maximal union of k-cliques in which we can reach any k-clique from any other k-clique through series of k-clique adjacencies. That is communities are just the connected components in the clique graph. Since a node can belong to several different k-clique percolation clusters at the same time, the communities can overlap with each other.
values of different algorithms applied to the same networks to see which algorithms find the community structures with highest modularity. Another approach involves applying algorithms to networks that have known community structure, to see whether that structure can be recovered. A number of real-world networks have become standard benchmarks for such problems, particularly the so-called "karate club" network, a social network drawn from a well-known study of university students in the 1970s, and the "college football" network, a network of American football games between US universities. Real-world networks have the advantage of presenting realistic challenges to algorithms, but the disadvantage that the nature and difficulty of those challenges cannot easily be controlled by the experimenter. An alternative approach is to use a computer-generated network that has community structure deliberately placed in it by the experimenter. A common example of this approach is the "four groups" test, in which a network is divided into four equally-sized groups (usually of 32 nodes each) and the probabilities of connection within and between groups varied to create more or less challenging structures for the detection algorithm. Such benchmark graphs are a special case of the planted l-partition model
of Condon
and Karp
, or more generally of "stochastic block models," a general class of random network models containing community structure. Recently, other more flexible benchmarks have been proposed that allow for varying group sizes and nontrivial degree distributions, such as the benchmark of Lancichinetti et al.
,
which is an extension of the four groups benchmark that includes heterogeneous distributions of node degree and community size, making it a more severe test of community detection methods.
Complex network
In the context of network theory, a complex network is a graph with non-trivial topological features—features that do not occur in simple networks such as lattices or random graphs but often occur in real graphs...
s, a network is said to have community structure if the nodes of the network can be easily grouped into (potentially overlapping) sets of nodes such that each set of nodes is densely connected internally. In the particular case of non-overlapping community finding, this implies that the network divides naturally into groups of nodes with dense connections internally and sparser connections between groups. A more general definition, suitable also for the case of overlapping community finding, is to aim to place pairs of nodes in the same community if, and only if, they are connected to each other.
What is community structure?
In the study of networks, such as computer and information networks, social networks or biological networks, a number of different characteristics have been found to occur commonly, including the small-world propertyWatts and Strogatz model
The Watts and Strogatz model is a random graph generation model that produces graphs with small-world properties, including short average path lengths and high clustering. It was proposed by Duncan J. Watts and Steven Strogatz in their joint 1998 Nature paper...
, heavy-tailed degree distribution
Degree distribution
In the study of graphs and networks, the degree of a node in a network is the number of connections it has to other nodes and the degree distribution is the probability distribution of these degrees over the whole network.-Definition:...
s, and clustering
Clustering coefficient
In graph theory, a clustering coefficient is a measure of degree to which nodes in a graph tend to cluster together. Evidence suggests that in most real-world networks, and in particular social networks, nodes tend to create tightly knit groups characterised by a relatively high density of ties...
, among others. Another common characteristic is community structure
.
In the context of networks, community structure refers to the occurrence of groups of nodes in a network that are more densely connected internally than with the rest of the network, as shown in the example image to the right. This inhomogeneity of connections suggests that the network has certain natural divisions within it. Note that communities are often defined in terms of the partition of the set
Partition of a set
In mathematics, a partition of a set X is a division of X into non-overlapping and non-empty "parts" or "blocks" or "cells" that cover all of X...
of vertices, that is each node is put into one and only one community, just as in the figure. This is a useful simplification but may not be appropriate in many cases.
Not all networks need display community structure. Many model networks, for example, such as the random graph
Erdos–Rényi model
In graph theory, the Erdős–Rényi model, named for Paul Erdős and Alfréd Rényi, is either of two models for generating random graphs, including one that sets an edge between each pair of nodes with equal probability, independently of the other edges...
and the Barabási–Albert model
BA model
The Barabási–Albert model is an algorithm for generating random scale-free networks using a preferential attachment mechanism. Scale-free networks are widely observed in natural and man-made systems, including the Internet, the world wide web, citation networks, and some social...
, do not display community structure.
Why are we interested in it?
Community structures are quite common in real networks. Social networks often include community groups (the origin of the term, in fact) based on common location, interests, occupation, etc. Metabolic networks have communities based on functional groupings. Citation networks form communities by research topic. Being able to identify these sub-structures within a network can provide insight into how network function and topology affect each other.Algorithms for finding communities
Finding communities within an arbitrary network can be a difficult task. The number of communities, if any, within the network is typically unknown and the communities are often of unequal size and/or density. Despite these difficulties, however, several methods for community finding have been developed and employed with varying levels of success.Minimum-cut method
One of the oldest algorithms for dividing networks into parts is the minimum-cut method (and variants such as ratio cut and normalized cut). This method sees use, for example, in load balancing for parallel computing in order to minimize communication between processor nodes.In the minimum-cut method, the network is divided into a predetermined number of parts, usually of approximately the same size, chosen such that the number of edges between groups is minimized. The method works well in many of the applications for which it was originally intended but is less than ideal for finding community structure in general networks since it will find communities regardless of whether they are implicit in the structure, and it will find only a fixed number of them.
Hierarchical clustering
Another method for finding community structures in networks is hierarchical clusteringHierarchical clustering
In statistics, hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:...
. In this method one defines a similarity measure
Similarity
-Specific definitions:Different fields provide differing definitions of similarity:-In computer science:* string metric, aka string similarity* semantic similarity in computational linguistics-In other fields:...
quantifying some (usually topological) type of similarity between node pairs. Commonly used measures include the cosine similarity
Cosine similarity
Cosine similarity is a measure of similarity between two vectors by measuring the cosine of the angle between them. The cosine of 0 is 1, and less than 1 for any other angle. The cosine of the angle between two vectors thus determines whether two vectors are pointing in roughly the same...
, the Jaccard index
Jaccard index
The Jaccard index, also known as the Jaccard similarity coefficient , is a statistic used for comparing the similarity and diversity of sample sets....
, and the Hamming distance
Hamming distance
In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different...
between rows of the adjacency matrix
Adjacency matrix
In mathematics and computer science, an adjacency matrix is a means of representing which vertices of a graph are adjacent to which other vertices...
. Then one groups similar nodes into communities according to this measure. There are several common schemes for performing the grouping, the two simplest being single-linkage clustering, in which two groups are considered separate communities if and only if all pairs of nodes in different groups have similarity lower than a given threshold, and complete linkage clustering, in which all nodes within every group have similarity greater than threshold.
Girvan–Newman algorithm
Another commonly used algorithm for finding communities is the Girvan–Newman algorithm. This algorithm identifies edges in a network that lie between communities and then removes them, leaving behind just the communities themselves. The identification is performed by employing the graph-theoretic measure betweenness, which assigns a number to each edge which is large if the edge lies "between" many pairs of nodes.The Girvan–Newman algorithm returns results of reasonable quality and is popular because it has been implemented in a number of standard software packages. But it also runs slowly, taking time O(m2n) on a network of n vertices and m edges, making it impractical for networks of more than a few thousand nodes
.
Modularity maximization
One of the most widely used methods for community detection is modularity maximization. ModularityModularity (networks)
Modularity is one measure of the structure of networks or graphs. It measures the strength of division of a network into modules . Networks with high modularity have dense connections between the nodes within modules but sparse connections between nodes in different modules...
is a benefit function that measures the quality of a particular division of a network into communities. The modularity maximization method detects communities by searching over possible divisions of a network for one or more that have particularly high modularity. Since exhaustive search over all possible divisions is usually intractable, practical algorithms are based on approximate optimization methods such as greedy algorithms, simulated annealing, or spectral optimization, with different approaches offering different balances between speed and accuracy.
The usefulness of modularity optimization is however questionable: on the one hand, it has been shown that modularity optimization often fails to detect clusters smaller than some scale, depending on the size of the network (resolution limit
); on the other hand the landscape of modularity values is characterized by a huge degeneracy of partitions with high modularity, close to the absolute maximum, which may be very different from each other
The Louvain method
The Louvain method is a greedy optimisation method that has proved to provide excellent results for a wide range of applications and is now one of the most widely used method. The method is implemented in many network analysis softwares, including NetworkX and Gephi. The method consists of two phases. First, it looks for "small" communities by optimizing modularity in a local way. Second, it aggregates nodes of the same community and builds a new network whose nodes are the communities. These steps are repeated iteratively until a maximum of modularity is attained.Clique based methods
CliqueClique
A clique is an exclusive group of people who share common interests, views, purposes, patterns of behavior, or ethnicity. A clique as a reference group can be either normative or comparative. Membership in a clique is typically exclusive, and qualifications for membership may be social or...
s are subgraphs in which every node is connected to every other node in the clique. As nodes can not be more tightly connected than this, it is not surprising that there are many approaches to community detection in networks based on the detection of cliques in a graph and the analysis of how these overlap. Note that as a node can be a member of more than one clique, a node can be a member of more than one community in these methods giving an overlapping community structure.
One approach is to find the maximal cliques, that is find the cliques which are not the subgraph of any other clique. The classic algorithm to find these is the Bron-Kerbosch algorithm. The overlap of these can be used to define communities in several ways. The simplest is to consider only maximal cliques bigger than a minimum size (number of nodes). The union of these cliques then defines a subgraph whose components (disconnected parts) then define communities. Such approaches are often implemented in social network analysis software
Social network analysis software
Social network analysis software facilitates quantitative or qualitative analysis of social networks, by describing features of a network, either through numerical or visual representation...
such as UCInet.
The alternative approach to is to use cliques of fixed size, k. The overlap of these can be used to define a type of k-regular hypergraph
Hypergraph
In mathematics, a hypergraph is a generalization of a graph, where an edge can connect any number of vertices. Formally, a hypergraph H is a pair H = where X is a set of elements, called nodes or vertices, and E is a set of non-empty subsets of X called hyperedges or links...
or a structure which is a generalisation of the line graph (the case when k=2) known as a Clique graph. The clique graphs have vertices which represent the cliques in the original graph while the edges of the clique graph record the overlap of the clique in the original graph. Applying any of the previous community detection methods (which assign each node to a community) to the clique graph then assigns each clique to a community. This can then be used to determine community membership of nodes in the cliques. Again as a node may be in several cliques, it can be a member of several communities.
For instance the clique percolation method
Clique percolation method
The clique percolation method is a popular approach for analyzing the overlapping community structure of networks. The term network community...
defines communities as percolation clusters
Percolation theory
In mathematics, percolation theory describes the behavior of connected clusters in a random graph. The applications of percolation theory to materials science and other domains are discussed in the article percolation.-Introduction:...
of k-cliques
Clique
A clique is an exclusive group of people who share common interests, views, purposes, patterns of behavior, or ethnicity. A clique as a reference group can be either normative or comparative. Membership in a clique is typically exclusive, and qualifications for membership may be social or...
. To do this it
finds all k-cliques in a network, that is all the complete sub-graphs of k-nodes.
It then defines two k-cliques to be adjacent if they share k − 1 nodes, that is this is used to define edges in a clique graph. A community is then defined to be the maximal union of k-cliques in which we can reach any k-clique from any other k-clique through series of k-clique adjacencies. That is communities are just the connected components in the clique graph. Since a node can belong to several different k-clique percolation clusters at the same time, the communities can overlap with each other.
Testing methods of finding communities algorithms
Most research on community detection has been concerned with the development of new methods and relatively little attention has been paid to the testing of those methods, i.e., to checking how good their performance is compared with other methods. To the extent that testing has taken place, a number of approaches have been employed. For instance, researchers have compared the modularityModularity (networks)
Modularity is one measure of the structure of networks or graphs. It measures the strength of division of a network into modules . Networks with high modularity have dense connections between the nodes within modules but sparse connections between nodes in different modules...
values of different algorithms applied to the same networks to see which algorithms find the community structures with highest modularity. Another approach involves applying algorithms to networks that have known community structure, to see whether that structure can be recovered. A number of real-world networks have become standard benchmarks for such problems, particularly the so-called "karate club" network, a social network drawn from a well-known study of university students in the 1970s, and the "college football" network, a network of American football games between US universities. Real-world networks have the advantage of presenting realistic challenges to algorithms, but the disadvantage that the nature and difficulty of those challenges cannot easily be controlled by the experimenter. An alternative approach is to use a computer-generated network that has community structure deliberately placed in it by the experimenter. A common example of this approach is the "four groups" test, in which a network is divided into four equally-sized groups (usually of 32 nodes each) and the probabilities of connection within and between groups varied to create more or less challenging structures for the detection algorithm. Such benchmark graphs are a special case of the planted l-partition model
of Condon
Anne Condon
Anne Elizabeth Condon is an Irish-Canadian computer scientist, a professor of computer science at the University of British Columbia. Her research focuses on computational complexity theory and bioinformatics...
and Karp
Richard Karp
Richard Manning Karp is a computer scientist and computational theorist at the University of California, Berkeley, notable for research in the theory of algorithms, for which he received a Turing Award in 1985, The Benjamin Franklin Medal in Computer and Cognitive Science in 2004, and the Kyoto...
, or more generally of "stochastic block models," a general class of random network models containing community structure. Recently, other more flexible benchmarks have been proposed that allow for varying group sizes and nontrivial degree distributions, such as the benchmark of Lancichinetti et al.
,
which is an extension of the four groups benchmark that includes heterogeneous distributions of node degree and community size, making it a more severe test of community detection methods.