Biclustering
Encyclopedia
Biclustering, co-clustering, or two-mode
clustering is a data mining technique that allows simultaneous clustering of the rows and columns of a matrix.
The term was first introduced by Mirkin and was more recently used by Cheng and Church in gene expression analysis, although the technique itself was introduced much earlier (e.g., by J.A. Hartigan).
Given a set of m rows in n columns (i.e., an m × n matrix), a biclustering algorithm generates biclusters: subsets of rows that exhibit similar behavior across a subset of columns, or vice versa.
Complexity
The complexity of the biclustering problem depends on the exact problem formulation, and particularly on the merit function used to evaluate the quality of a given bicluster. However, most interesting variants of this problem are NP-complete, requiring either large computational effort or the use of lossy heuristics to short-circuit the calculation.
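To make the notion of a merit function concrete, the sketch below computes the mean squared residue, the score commonly associated with Cheng and Church's approach. It is offered only as one example of such a function; the function name and the list-of-lists matrix layout are assumptions made for this illustration.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix picked out by `rows` and `cols`.

    A value of 0 means every entry is explained exactly by a row effect
    plus a column effect (a perfectly additive bicluster).
    """
    n_rows, n_cols = len(rows), len(cols)
    # Mean of each selected row over the selected columns, and vice versa.
    row_mean = {i: sum(matrix[i][j] for j in cols) / n_cols for i in rows}
    col_mean = {j: sum(matrix[i][j] for i in rows) / n_rows for j in cols}
    overall = sum(row_mean[i] for i in rows) / n_rows
    total = sum(
        (matrix[i][j] - row_mean[i] - col_mean[j] + overall) ** 2
        for i in rows
        for j in cols
    )
    return total / (n_rows * n_cols)


data = [
    [1, 2, 3, 9],
    [4, 5, 6, 0],
    [7, 8, 9, 5],
]
coherent_score = mean_squared_residue(data, [0, 1, 2], [0, 1, 2])
full_score = mean_squared_residue(data, [0, 1, 2], [0, 1, 2, 3])
```

Scoring one candidate is cheap. The difficulty comes from the number of row and column subsets that would have to be scored, which is why heuristics are used in practice.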
Types of Bicluster
Different biclustering algorithms use different definitions of a bicluster. The main types are:
- Bicluster with constant values (a),
- Bicluster with constant values on rows (b) or columns (c),
- Bicluster with coherent values (d, e).
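The types listed above can be illustrated with small generated matrices. This is a minimal sketch: the function names are hypothetical, and it assumes that the two coherent-value cases (d, e) correspond to an additive and a multiplicative model, which the surviving text does not state.

```python
def constant(n_rows, n_cols, value):
    # (a) every entry has the same value
    return [[value] * n_cols for _ in range(n_rows)]


def constant_rows(row_values, n_cols):
    # (b) each row is constant, but rows may differ from one another
    return [[v] * n_cols for v in row_values]


def constant_columns(n_rows, col_values):
    # (c) each column is constant, but columns may differ from one another
    return [list(col_values) for _ in range(n_rows)]


def coherent_additive(base, row_offsets, col_offsets):
    # (d, assumed) entry = base + row effect + column effect
    return [[base + r + c for c in col_offsets] for r in row_offsets]


def coherent_multiplicative(base, row_factors, col_factors):
    # (e, assumed) entry = base * row effect * column effect
    return [[base * r * c for c in col_factors] for r in row_factors]


example_a = constant(2, 3, 5)
example_b = constant_rows([1, 2], 3)
example_c = constant_columns(2, [1, 2, 3])
example_d = coherent_additive(1, [0, 1], [0, 2])
example_e = coherent_multiplicative(1, [1, 2], [1, 3])
```

In the additive case any two rows differ by a constant, and in the multiplicative case by a constant factor, so the rows behave alike without holding identical values.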
The relationship between these cluster models and other types of clustering, such as correlation clustering, is discussed in.
Algorithms
There are many biclustering algorithms developed for bioinformatics, including: block clustering, CTWC (Coupled Two-Way Clustering), ITWC (Interrelated Two-Way Clustering), δ-bicluster, δ-pCluster, δ-pattern, FLOC, OPC, Plaid Model, OPSMs (Order-preserving submatrices), Gibbs, SAMBA (Statistical-Algorithmic Method for Bicluster Analysis), Robust Biclustering Algorithm (RoBA), Crossing Minimization, cMonkey, PRMs, DCC, LEB (Localize and Extract Biclusters), QUBIC (QUalitative BIClustering), BCCA (Bi-Correlation Clustering Algorithm) and FABIA (Factor Analysis for Bicluster Acquisition). Biclustering algorithms have also been proposed and used in other application fields under the names co-clustering, bidimensional clustering, and subspace clustering.
Given the known importance of discovering local patterns in time series data, recent proposals have addressed the biclustering problem in the specific case of time series gene expression data. In this case, the interesting biclusters can be restricted to those with contiguous columns. This restriction leads to a tractable problem and enables the development of efficient exhaustive enumeration algorithms such as CCC-Biclustering and e-CCC-Biclustering. These algorithms find and report all maximal biclusters with coherent and contiguous columns, with perfect or approximate expression patterns respectively, in time linear or polynomial in the size of the time series gene expression matrix, using efficient string processing techniques based on suffix trees.
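The contiguous-column restriction can be illustrated with a deliberately naive enumeration. This sketch is not CCC-Biclustering: it uses no suffix tree, does not run in linear time, and does not filter for maximality. It also assumes the expression values have already been discretized into one symbol per time point (the letters U, D and N are illustrative), and the function name is hypothetical.

```python
from collections import defaultdict


def contiguous_column_biclusters(patterns, min_rows=2, min_cols=2):
    """List every group of rows sharing the same symbols over a column window.

    `patterns` maps a row name to a string with one symbol per time point.
    Returns tuples (row_names, start, end, shared_pattern), end exclusive.
    """
    if not patterns:
        return []
    n_cols = len(next(iter(patterns.values())))
    found = []
    for start in range(n_cols):
        for end in range(start + min_cols, n_cols + 1):
            # Rows with an identical substring over [start, end) belong together.
            groups = defaultdict(list)
            for name, symbols in patterns.items():
                groups[symbols[start:end]].append(name)
            for shared, names in groups.items():
                if len(names) >= min_rows:
                    found.append((sorted(names), start, end, shared))
    return found


genes = {"g1": "UDUN", "g2": "UDUD", "g3": "NDUD"}
biclusters = contiguous_column_biclusters(genes)
```

Because only windows of adjacent columns are considered, the number of candidate column sets grows quadratically with the number of time points rather than exponentially, which is what makes exhaustive enumeration feasible in this setting.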
Some recent algorithms, including cMonkey, have attempted to include additional support for biclustering rectangular matrices in the form of other data types.
There is an ongoing debate about how to judge the results of these methods, as biclustering allows overlap between clusters and some algorithms allow the exclusion of hard-to-reconcile columns/conditions. Not all of the available algorithms are deterministic, and the analyst must pay attention to the degree to which results represent stable minima. Because this is an unsupervised classification problem, the lack of a gold standard makes it difficult to spot errors in the results. One approach is to utilize multiple biclustering algorithms, with majority or super-majority voting amongst them deciding the best result. Another way is to analyse the quality of shifting and scaling patterns in biclusters.
Biclustering has been used in the domain of text mining (or classification), where it is popularly known as co-clustering. Text corpora are represented in a vectorial form as a matrix D whose rows denote the documents and whose columns denote the words in the dictionary. Matrix elements Dij denote occurrence of word j in document i. Co-clustering algorithms are then applied to discover blocks in D that correspond to a group of documents (rows) characterized by a group of words (columns).
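The document-by-word matrix D described above can be built in a few lines. This is a minimal sketch that assumes whitespace tokenization, lower-casing, and raw occurrence counts as the matrix entries; the function name is hypothetical, and other weightings (binary presence, for example) are equally compatible with the description.

```python
def document_word_matrix(documents):
    """Return (vocabulary, D) where D[i][j] counts word j in document i."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    column = {word: j for j, word in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in tokenized]
    for i, tokens in enumerate(tokenized):
        for word in tokens:
            matrix[i][column[word]] += 1
    return vocabulary, matrix


corpus = ["gene expression gene", "word document"]
vocabulary, D = document_word_matrix(corpus)
```

A co-clustering algorithm would then take D as input and look for row and column groups that form dense blocks.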
Several approaches have been proposed: information-theoretic approaches based on the information content of the resulting blocks, matrix-based approaches such as SVD and BVD, and graph-based approaches. Information-theoretic algorithms iteratively assign each row to a cluster of documents and each column to a cluster of words such that the mutual information is maximized. Matrix-based methods focus on the decomposition of matrices into blocks such that the error between the original matrix and the matrices regenerated from the decomposition is minimized. Graph-based methods tend to minimize the cuts between the clusters. Given two groups of documents d1 and d2, the number of cuts can be measured as the number of words that occur in documents of both groups d1 and d2.
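The cut measure described for graph-based methods can be sketched directly from a document-by-word matrix. The function names and the toy matrix are assumptions for illustration, and a word is taken to "occur in a group" if it appears in at least one of that group's documents.

```python
def words_in_group(matrix, group):
    """Column indices of words that occur in at least one document of `group`."""
    n_words = len(matrix[0])
    return {j for j in range(n_words) if any(matrix[i][j] for i in group)}


def cut_size(matrix, group_1, group_2):
    """Number of words shared between two groups of documents."""
    return len(words_in_group(matrix, group_1) & words_in_group(matrix, group_2))


# Rows are documents, columns are words; entries are occurrence counts.
D = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
shared = cut_size(D, [0, 1], [2, 3])
disjoint = cut_size(D, [0], [3])
```

A partition with fewer shared words between groups is preferred under this measure, since it separates the documents into groups with more distinct vocabularies.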
More recently, Bisson and Hussain proposed a new approach that uses the similarity between words and the similarity between documents to co-cluster the matrix. Their method (known as χ-Sim, for cross similarity) is based on finding document-document similarity and word-word similarity, and then using classical clustering methods such as hierarchical clustering. Instead of explicitly clustering rows and columns alternately, they consider higher-order occurrences of words, inherently taking into account the documents in which they occur. Thus, the similarity between two words is calculated based on the documents in which they occur and also the documents in which "similar" words occur. The idea is that two documents about the same topic do not necessarily use the same set of words to describe it, but rather a subset of those words together with other similar words that are characteristic of the topic. Taking higher-order similarities in this way brings the latent semantic structure of the whole corpus into consideration, with the aim of producing a better clustering of the documents and words.
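The mutual reinforcement between document similarity and word similarity can be sketched as a short iteration. This is a simplified illustration of the idea rather than the published χ-Sim procedure: the update order, the normalization by the product of row or column totals, and the fixed diagonal of 1 are assumptions made for this example, and all names are hypothetical.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def normalise(raw, totals):
    # Scale each pair by the product of its totals; self-similarity stays 1.
    n = len(raw)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                out[i][j] = 1.0
            elif totals[i] and totals[j]:
                out[i][j] = raw[i][j] / (totals[i] * totals[j])
    return out


def cross_similarity(d, iterations):
    """Return (document similarity, word similarity) after the given iterations."""
    dt = transpose(d)
    row_totals = [sum(row) for row in d]
    col_totals = [sum(col) for col in dt]
    doc_sim, word_sim = identity(len(d)), identity(len(dt))
    for _ in range(iterations):
        # Documents are similar if they contain similar words, and vice versa.
        new_doc_sim = normalise(matmul(matmul(d, word_sim), dt), row_totals)
        new_word_sim = normalise(matmul(matmul(dt, doc_sim), d), col_totals)
        doc_sim, word_sim = new_doc_sim, new_word_sim
    return doc_sim, word_sim


# Documents 0 and 2 share no word, but both share a word with document 1.
D = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
first_order, _ = cross_similarity(D, 1)
second_order, _ = cross_similarity(D, 2)
```

After one iteration documents 0 and 2 have zero similarity because they share no words. After two iterations the value becomes positive, because their words have themselves become similar through document 1. The resulting similarity matrices could then be passed to a classical method such as hierarchical clustering.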
In contrast to other approaches, FABIA is a multiplicative model that assumes realistic non-Gaussian signal distributions with heavy tails. FABIA utilizes well-understood model selection techniques such as variational approaches and applies the Bayesian framework. The generative framework allows FABIA to determine the information content of each bicluster in order to separate spurious biclusters from true biclusters.
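One way to read "multiplicative model" is that a single bicluster is the outer product of a sparse row vector and a sparse column vector, with the data modelled as a sum of such terms plus noise. The sketch below shows only that outer-product structure; it involves no noise, priors, or inference, and the names are hypothetical rather than taken from any FABIA implementation.

```python
def outer_product(row_loadings, col_factors):
    """Matrix whose entry (i, j) is row_loadings[i] * col_factors[j]."""
    return [[r * c for c in col_factors] for r in row_loadings]


def bicluster_members(row_loadings, col_factors, threshold=0.0):
    """Rows and columns whose coefficients exceed `threshold` in magnitude."""
    rows = [i for i, r in enumerate(row_loadings) if abs(r) > threshold]
    cols = [j for j, c in enumerate(col_factors) if abs(c) > threshold]
    return rows, cols


row_loadings = [0, 2, 1]   # row 0 does not take part in the bicluster
col_factors = [1, 0, 3]    # column 1 does not take part in the bicluster
signal = outer_product(row_loadings, col_factors)
members = bicluster_members(row_loadings, col_factors)
```

Sparsity in both vectors is what confines the pattern to a subset of rows and a subset of columns, leaving all other entries at zero.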
Others
Biclustering, co-clustering, or two-modeMode
Mode may mean:* Transport mode, a means of transportation* Block cipher modes of operation, in cryptography* A technocomplex of stone tools...
clustering is a data mining
Data mining
Data mining , a relatively young and interdisciplinary field of computer science is the process of discovering new patterns from large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics and database systems...
technique which allows simultaneous clustering of the rows and columns of a matrix
Matrix (mathematics)
In mathematics, a matrix is a rectangular array of numbers, symbols, or expressions. The individual items in a matrix are called its elements or entries. An example of a matrix with six elements isMatrices of the same size can be added or subtracted element by element...
.
The term was first introduced by Mirkin (recently by Cheng and Church in gene expression
Gene expression
Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product. These products are often proteins, but in non-protein coding genes such as ribosomal RNA , transfer RNA or small nuclear RNA genes, the product is a functional RNA...
analysis), although the technique was originally introduced much earlier (i.e., by J.A. Hartigan).
Given a set of rows in columns (i.e., an matrix), the biclustering algorithm generates biclusters - a subset of rows which exhibit similar behavior across a subset of columns, or vice versa.
Complexity
The complexity of the biclustering problem depends on the exact problem formulation, and particularly on the merit function used to evaluate the quality of a given bicluster. However most interesting variants of this problem are NP-completeNP-complete
In computational complexity theory, the complexity class NP-complete is a class of decision problems. A decision problem L is NP-complete if it is in the set of NP problems so that any given solution to the decision problem can be verified in polynomial time, and also in the set of NP-hard...
requiring either large computational effort or the use of lossy heuristics to short-circuit the calculation.
Type of Bicluster
Different biclustering algorithms have different definitions of bicluster.They are:
- Bicluster with constant values (a),
- Bicluster with constant values on rows (b) or columns (c),
- Bicluster with coherent values (d, e).
|
|
|
|
|
The relationship between these cluster models and other types of clustering such as correlation clustering
Correlation clustering
In machine learning, correlation clustering or cluster editing operates in a scenario where the relationship between the objects are known instead of the actual representation of the objects...
is discussed in.
Algorithms
There are many biclustering algorithms developed for bioinformaticsBioinformatics
Bioinformatics is the application of computer science and information technology to the field of biology and medicine. Bioinformatics deals with algorithms, databases and information systems, web technologies, artificial intelligence and soft computing, information and computation theory, software...
, including: block clustering, CTWC (Coupled Two-Way Clustering), ITWC (Interrelated Two-Way Clustering), δ-bicluster, δ-pCluster, δ-pattern, FLOC, OPC, Plaid Model, OPSMs (Order-preserving submatrixes), Gibbs, SAMBA (Statistical-Algorithmic Method for Bicluster Analysis),
, Robust Biclustering Algorithm (RoBA), Crossing Minimization
, cMonkey, PRMs, DCC, LEB (Localize and Extract Biclusters), QUBIC (QUalitative BIClustering), BCCA (Bi-Correlation Clustering Algorithm) and FABIA (Factor Analysis for Bicluster Acquisition). Biclustering algorithms have also been proposed and used in other application fields under the names coclustering, bidimentional clustering, and subspace clustering.
Given the known importance of discovering local patterns in time series data, recent proposals have addressed the biclustering problem in the specific case of time series gene expression data. In this case, the interesting biclusters can be restricted to those with contiguous columns. This restriction leads to a tractable problem and enables the development of efficient exaustive enumeration algorithms such as CCC-Biclustering and e-CCC-Biclustering . These algorithms find and report all maximal biclusters with coherent and contiguous columns with perfect/approximate expression patterns, in time linear/polynomial in the size of the time series gene expression matrix using efficient string
processing techniques based on suffix trees.
Some recent algorithms have attempted to include additional support for biclustering rectangular matrices in the form of other datatypes, including cMonkey.
There is an ongoing debate about how to judge the results of these methods, as biclustering allows overlap between clusters and some algorithms allow the exclusion of hard-to-reconcile columns/conditions. Not all of the available algorithms are deterministic and the analyst must pay attention to the degree to which results represent stable minima. Because this is an unsupervised-classification problem, the lack of a gold standard makes it difficult to spot errors in the results. One approach is to utilize multiple biclustering algorithms, with majority or super-majority voting amongst them deciding the best result. Another way is to analyse the quality of shifting and scaling patterns in biclusters. Biclustering has been used in the domain of text mining (or classification) where it is popularly known as co-clustering
. Text corpora are represented in a vectorial form as a matrix D whose rows denote the documents and whose columns denote the words in the dictionary. Matrix elements Dij denote occurrence of word j in document i. Co-clustering algorithms are then applied to discover blocks in D that correspond to a group of documents (rows) characterized by a group of words(columns).
Several approaches have been proposed based on the information contents of the resulting blocks: matrix-based approaches such as SVD and BVD, and graph-based approaches. Information-theoretic algorithms iteratively assign each row to a cluster of documents and each column to a cluster of words such that the mutual information is maximized. Matrix-based methods focus on the decomposition of matrices into blocks such that the error between the original matrix and the regenerated matrices from the decomposition is minimized. Graph-based methods tend to minimize the cuts between the clusters. Given two groups of documents d1 and d2, the number of cuts can be measured as the number of words that occur in documents of groups d1 and d2.
More recently, Bisson and Hussain proposed an approach that uses the similarity between words and the similarity between documents to co-cluster the matrix. Their method (known as χ-Sim, for cross similarity) first computes document-document and word-word similarities and then applies classical clustering methods such as hierarchical clustering. Instead of explicitly clustering rows and columns alternately, it considers higher-order occurrences of words, inherently taking into account the documents in which they occur. Thus, the similarity between two words is calculated from the documents in which they occur and also from the documents in which "similar" words occur. The idea is that two documents about the same topic do not necessarily use the same set of words to describe it, but rather a subset of those words together with other similar words that are characteristic of the topic. Taking higher-order similarities in this way accounts for the latent semantic structure of the whole corpus, with the aim of producing a better clustering of the documents and words.
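The mutual reinforcement between document similarity and word similarity can be sketched as an alternating update. The code below is a simplified illustration in the spirit of the description above, not necessarily the authors' exact formulation: the normalization by products of row and column sums, the fixed iteration count, and all function names are assumptions, and every document and word is assumed to have a non-zero total count.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def normalized_similarity(product, sums):
    """Scale entry (i, j) by sums[i] * sums[j] and fix self-similarity at 1."""
    n = len(product)
    return [
        [1.0 if i == j else product[i][j] / (sums[i] * sums[j]) for j in range(n)]
        for i in range(n)
    ]


def cross_similarity(D, iterations=2):
    """Return (SR, SC): document-document and word-word similarity matrices."""
    Dt = transpose(D)
    row_sums = [sum(row) for row in D]
    col_sums = [sum(col) for col in Dt]
    SR = identity(len(D))   # documents start out similar only to themselves
    SC = identity(len(Dt))  # likewise for words
    for _ in range(iterations):
        # Documents are similar if they contain similar words ...
        new_SR = normalized_similarity(matmul(matmul(D, SC), Dt), row_sums)
        # ... and words are similar if they occur in similar documents.
        new_SC = normalized_similarity(matmul(matmul(Dt, SR), D), col_sums)
        SR, SC = new_SR, new_SC
    return SR, SC


# Documents 0 and 2 share no word, but both overlap with document 1.
D_chain = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
SR_first, _ = cross_similarity(D_chain, iterations=1)
SR_second, _ = cross_similarity(D_chain, iterations=2)
```

After one iteration documents 0 and 2 have zero similarity because they share no words; after the second iteration their similarity becomes positive through the word similarities induced by document 1. The resulting SR and SC matrices could then be handed to any standard clustering method.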
In contrast to other approaches, FABIA is a multiplicative model that assumes realistic non-Gaussian signal distributions with heavy tails. FABIA uses well-understood model selection techniques, such as variational approaches, and applies the Bayesian framework. The generative framework allows FABIA to determine the information content of each bicluster in order to separate spurious biclusters from true biclusters.
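To make the idea of a multiplicative model concrete, the sketch below generates a matrix containing one bicluster as the outer product of a sparse row vector and a sparse column vector, plus additive noise. This only illustrates the kind of data such a model describes and is not FABIA's inference procedure; the Laplace distribution as the heavy-tailed choice, the single bicluster, the Gaussian noise, and all names are assumptions made for the example.

```python
import random


def laplace(rng):
    """Sample from a standard Laplace distribution (heavier tails than Gaussian)."""
    # The difference of two independent Exp(1) variables is Laplace(0, 1).
    return rng.expovariate(1.0) - rng.expovariate(1.0)


def sample_bicluster_matrix(n_rows, n_cols, rows, cols, noise=0.1, seed=0):
    """Return (loadings, factors, X) with X = outer(loadings, factors) + noise."""
    rng = random.Random(seed)
    # Sparse vectors: non-zero only for rows/columns that belong to the bicluster.
    loadings = [laplace(rng) if i in rows else 0.0 for i in range(n_rows)]
    factors = [laplace(rng) if j in cols else 0.0 for j in range(n_cols)]
    X = [
        [loadings[i] * factors[j] + rng.gauss(0.0, noise) for j in range(n_cols)]
        for i in range(n_rows)
    ]
    return loadings, factors, X


bicluster_rows = {0, 1, 2}
bicluster_cols = {3, 4}
loadings, factors, X = sample_bicluster_matrix(
    6, 5, bicluster_rows, bicluster_cols, noise=0.0
)
```

With the noise set to zero, entries outside the chosen rows and columns are exactly zero, and entries inside are products of one row effect and one column effect, which is what "multiplicative" means here.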
Others
- A. Tanay, R. Sharan, and R. Shamir, "Biclustering Algorithms: A Survey", in Handbook of Computational Molecular Biology, edited by Srinivas Aluru, Chapman (2004)