Sequence clustering
In bioinformatics, sequence clustering algorithms attempt to group sequences that are related in some way. The sequences can be of genomic, "transcriptomic" (ESTs), or protein origin.
For proteins, homologous sequences are typically grouped into families. For EST data, clustering is important to group sequences originating from the same gene before the ESTs are assembled to reconstruct the original mRNA.
Some clustering algorithms use single-linkage clustering, constructing a transitive closure of sequences whose similarity exceeds a particular threshold. UCLUST and CD-HIT use a greedy algorithm that identifies a representative sequence for each cluster and assigns a new sequence to that cluster if it is sufficiently similar to the representative; if a sequence matches no existing representative, it becomes the representative sequence for a new cluster. The similarity score is often based on sequence alignment. Sequence clustering is often used to make a non-redundant set of representative sequences.
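The greedy, representative-based procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used by UCLUST or CD-HIT: the function names `similarity` and `greedy_cluster` are hypothetical, the similarity score is a stand-in from Python's `difflib` rather than an alignment-based identity, and processing the longest sequences first is a choice made here for the sketch.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1] (assumption: real tools score an alignment)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(sequences, threshold=0.8):
    """Greedy clustering: each sequence joins the first representative it
    matches at or above the threshold, otherwise it founds a new cluster."""
    clusters = []  # list of (representative, members) pairs
    # Longest-first order is an assumption of this sketch; sorted() is stable,
    # so sequences of equal length keep their input order.
    for seq in sorted(sequences, key=len, reverse=True):
        for representative, members in clusters:
            if similarity(seq, representative) >= threshold:
                members.append(seq)
                break
        else:
            # No representative was similar enough: seq becomes a new one.
            clusters.append((seq, [seq]))
    return clusters


if __name__ == "__main__":
    reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGCA"]
    for representative, members in greedy_cluster(reads):
        print(representative, members)
```

Because each new sequence is compared only against the current representatives, and not against every cluster member, the result depends on the order in which the sequences are processed. The list of representatives is the non-redundant set.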
Sequence clusters often correspond closely to protein families, although the two are not identical. Determining a representative tertiary structure for each sequence cluster is the aim of many structural genomics initiatives.
Sequence clustering packages
- UCLUST: An exceptionally fast sequence clustering program for nucleotide and protein sequences
- RDB90 and nrdb90.pl: a nonredundant sequence database
- TribeMCL: a method for clustering proteins into related groups
- BAG: a graph theoretic sequence clustering algorithm
- CD-HIT: an ultra-fast method for clustering protein and nucleotide sequences, with many new applications in next generation sequencing (NGS) data
- JESAM: Open source parallel scalable DNA alignment engine with optional clustering software component
- UICluster: Parallel Clustering of EST (Gene) Sequences
- BLASTClust: single-linkage clustering with BLAST
- Clusterer: extendable Java application for sequence grouping and cluster analyses
- PATDB: a program for rapidly identifying perfect substrings
- nrdb: a program for merging trivially redundant (identical) sequences
- CluSTr: A single-linkage protein sequence clustering database from Smith-Waterman sequence similarities; covers over 7 million sequences including UniProt and IPI
- ICAtools: original (ancient) DNA clustering package with many algorithms useful for artifact discovery or EST clustering
- Virus Orthologous Clusters: A viral protein sequence clustering database; contains all predicted genes from eleven virus families organized into ortholog groups by BLASTP similarity
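
Several entries in the list above (BLASTClust, CluSTr) are described as single-linkage methods. As a rough illustration of that idea, the sketch below links every pair of sequences whose similarity reaches a threshold and then takes the transitive closure with a union-find structure. It is not the code of any listed package: the names `similarity` and `single_linkage` are hypothetical, the `difflib` score stands in for a BLAST or Smith-Waterman similarity, and the all-against-all comparison is only practical for small inputs.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1] (assumption: real tools score an alignment)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def single_linkage(sequences, threshold=0.75):
    """Group sequences connected by any chain of pairwise links whose
    similarity is at or above the threshold."""
    parent = list(range(len(sequences)))  # union-find forest over indices

    def find(i):
        # Walk to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All-against-all comparison: quadratic in the number of sequences.
    for i, j in combinations(range(len(sequences)), 2):
        if similarity(sequences[i], sequences[j]) >= threshold:
            parent[find(i)] = find(j)  # merge the two components

    clusters = {}
    for i, seq in enumerate(sequences):
        clusters.setdefault(find(i), []).append(seq)
    return list(clusters.values())


if __name__ == "__main__":
    seqs = ["AAAAAAAAAA", "AAAAAAAACC", "AAAAAACCCC", "GGGGGGGGGG"]
    print(single_linkage(seqs))
```

In the example input, the first and third sequences are not similar enough to be linked directly, but both are linked to the second, so all three end up in one cluster. This chaining behaviour is characteristic of single-linkage clustering.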