Protein family
A protein family is a group of evolutionarily related proteins, and is often nearly synonymous with gene family. The term protein family should not be confused with family as it is used in taxonomy.
Proteins in a family descend from a common ancestor (see homology) and typically have similar three-dimensional structures, functions, and significant sequence similarity. While it is difficult to evaluate the significance of functional or structural similarity, there is a fairly well-developed framework for evaluating the significance of similarity between a group of sequences using sequence alignment methods. Proteins that do not share a common ancestor are very unlikely to show statistically significant sequence similarity, making sequence alignment a powerful tool for identifying the members of protein families.
Currently, over 60,000 protein families have been defined, although ambiguity in the definition of protein family leads different researchers to widely varying numbers.
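The idea of testing whether sequence similarity is statistically significant can be illustrated with a small permutation test. The sketch below is only an illustration under stated assumptions: the scoring scheme (match, mismatch, and gap values), the function names, and the toy sequences are all invented here, and real alignment tools rely on substitution matrices and analytic score statistics rather than this simplified approach. It scores a global alignment, then asks how often a shuffled version of one sequence scores at least as well.

```python
import random

def global_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    The match/mismatch/gap values are arbitrary illustrative choices.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def shuffle_p_value(a, b, trials=200, seed=0):
    """Empirical p-value: share of shuffled copies of b scoring >= the real score."""
    rng = random.Random(seed)
    observed = global_alignment_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)  # keeps composition, destroys residue order
        if global_alignment_score(a, "".join(letters)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)

# Hypothetical toy sequences: QUERY and RELATIVE differ at three positions.
QUERY = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
RELATIVE = "MKSAYIAKQRQLSFVKSHFSRQIEERLGLIEVQ"

if __name__ == "__main__":
    print("score:", global_alignment_score(QUERY, RELATIVE))
    print("empirical p-value:", shuffle_p_value(QUERY, RELATIVE))
```

Shuffling preserves amino acid composition while removing any shared ancestry signal, so a real score that is rarely matched by shuffled sequences suggests the similarity is unlikely to be due to chance alone.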
Terminology and usage
As with many biological terms, the use of protein family is somewhat context-dependent; it may indicate large groups of proteins with the lowest possible level of detectable sequence similarity, or very narrow groups of proteins with almost identical sequence, function, and three-dimensional structure, or any kind of group in between. To distinguish between these situations, Dayhoff introduced the concept of a protein superfamily. Other terms such as protein class, protein group, and protein sub-family have been coined over the years, but all suffer similar ambiguities of usage. A common usage is superfamily > family > sub-family. In the end, caveat emptor: it is up to the reader to discern exactly how these terms are being used in a particular context.
Protein domains and motifs
The concept of protein family was conceived at a time when very few protein structures or sequences were known; at that time, the proteins studied were primarily small, single-domain proteins such as myoglobin, hemoglobin, and cytochrome c. Since that time, it has been found that many proteins comprise multiple independent structural and functional units or domains. Due to evolutionary shuffling, different domains in a protein have evolved independently. This has led, in recent years, to a focus on families of protein domains. A number of online resources are devoted to identifying and cataloging such domains (see list of links at the end of this article).
Regions of each protein have differing functional constraints (features critical to the structure and function of the protein). For example, the active site of an enzyme requires certain amino acid residues to be precisely oriented in three dimensions. On the other hand, a protein–protein binding interface may consist of a large surface with constraints on the hydrophobicity or polarity of the amino acid residues. Functionally constrained regions of proteins evolve more slowly than unconstrained regions such as surface loops, giving rise to discernible blocks of conserved sequence when the sequences of a protein family are compared (see multiple sequence alignment). These blocks are most commonly referred to as motifs, although many other terms are used (blocks, signatures, fingerprints, etc.). Again, a large number of online resources are devoted to identifying and cataloging protein motifs (see list at end of article).
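How conserved blocks emerge from a multiple sequence alignment can be sketched in a few lines. The example below is a simplified illustration, not the method of any particular motif database: the conservation measure (frequency of the most common residue per column), the threshold, the minimum block length, and the toy alignment are all assumptions made for this sketch.

```python
from collections import Counter

def column_conservation(alignment):
    """Per-column share of sequences carrying the most common residue.

    Gap characters ('-') are never counted as the consensus residue.
    """
    n = len(alignment)
    scores = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / n)
    return scores

def conserved_blocks(alignment, threshold=0.8, min_length=3):
    """Return half-open (start, end) column ranges of consecutive conserved columns."""
    blocks, start = [], None
    # The -inf sentinel closes a block that runs to the last column.
    scores = column_conservation(alignment) + [float("-inf")]
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            if i - start >= min_length:
                blocks.append((start, i))
            start = None
    return blocks

# Hypothetical toy alignment of four equal-length, pre-aligned sequences.
TOY_ALIGNMENT = [
    "MKHGDSAL-T",
    "MRHGDSGLWT",
    "AKHGDSPL-S",
    "M-HGDSQLFT",
]

if __name__ == "__main__":
    for start, end in conserved_blocks(TOY_ALIGNMENT):
        print(start, end, TOY_ALIGNMENT[0][start:end])
```

In the toy alignment, columns 2 to 5 are identical in every sequence and form the only block that passes the default settings, while the variable columns around them play the role of unconstrained regions such as surface loops.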
Evolution of protein families
According to current dogma, protein families arise in two ways. Firstly, the separation of a parent species into two genetically isolated descendant species allows a gene/protein to independently accumulate variations (mutations) in these two lineages. This results in a family of orthologous proteins, usually with conserved sequence motifs. Secondly, a gene duplication may create a second copy of a gene (termed a paralog). Because the original gene is still able to perform its function, the duplicated gene is free to diverge and may acquire new functions (by random mutation). Certain gene/protein families, especially in eukaryotes, undergo extreme expansions and contractions in the course of evolution, sometimes in concert with whole genome duplications. This expansion and contraction of protein families is one of the salient features of genome evolution, but its importance and ramifications are currently unclear.
Use and importance of protein families
As the total number of sequenced proteins increases and interest expands in proteome analysis, there is an ongoing effort to organize proteins into families and to describe their component domains and motifs. Reliable identification of protein families is critical to phylogenetic analysis, functional annotation, and the exploration of diversity of protein function in a given phylogenetic branch.
The algorithmic means for establishing protein families on a large scale are based on a notion of similarity. Most of the time the only similarity we have access to is sequence similarity.
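One simple way to turn a notion of similarity into families is single-linkage clustering: any two sequences whose similarity reaches a threshold are placed in the same group, and groups merge transitively. The sketch below is a minimal illustration under assumptions made here, not a description of how any specific database builds its families: similarity is plain percent identity over pre-aligned sequences of equal length, the threshold is arbitrary, and the sequence names and toy data are invented.

```python
def percent_identity(a, b):
    """Fraction of identical positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cluster_into_families(sequences, threshold=0.6):
    """Single-linkage clustering of a {name: sequence} mapping.

    Any pair at or above the identity threshold is linked; linked groups
    are merged with a small union-find structure.
    """
    names = list(sequences)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if percent_identity(sequences[first], sequences[second]) >= threshold:
                parent[find(first)] = find(second)

    families = {}
    for name in names:
        families.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in families.values())

# Hypothetical toy sequences; the names are labels for this example only.
TOY_SEQUENCES = {
    "globin_a": "MVLSPADKTN",
    "globin_b": "MVLSAADKTN",
    "globin_c": "MVHSAADRTN",
    "kinase_a": "GQGTFGKVWL",
    "kinase_b": "GQGSFGKVYL",
}

if __name__ == "__main__":
    for family in cluster_into_families(TOY_SEQUENCES, threshold=0.75):
        print(family)
```

With a threshold of 0.75, globin_a and globin_c are only 70% identical to each other but still end up in one family because both are linked to globin_b. This chaining behaviour is one reason why the number of families obtained depends strongly on the chosen definition and threshold.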
Related articles
- homology
- protein structure
- protein domains
- Protein subfamily
- sequence alignment
- sequence clustering
- genome annotation
- Cdx protein family
Protein structure resources
- SCOP
- CATH
- InterPro
Protein families
Words in parentheses contain a highly abridged description of the function of the family.
- Ammonium transporter
- Protein Kinase (transmission of biochemical signals)
  - MAP Kinase
  - MAP Kinase Kinase
  - MAP Kinase Kinase Kinase
  - Receptor tyrosine kinases
- Major histocompatibility complex or MHC (immune system)
- Immunoglobulin superfamily (immunity)
- Globin protein family (oxygen binding)
- G protein-coupled receptor (transmembrane receptor)
  - Olfactory Receptor
- G-proteins
- Homeobox (gene regulation)
- Heat Shock protein families (stress response)
  - HSP60 family
  - HSP70 family
  - HSP90 family
  - HSP110 family
- Polycomb-group proteins
- Cellular motor proteins (e.g., in flagella)
  - Myosin
  - Kinesin
  - Dynein
- Transcription factors
  - Cdx protein family
External links
- Pfam - Protein families database of alignments and HMMs
- PROSITE - Database of protein domains, families and functional sites
- PIRSF - SuperFamily Classification System
- PASS2 - Protein Alignment as Structural Superfamilies v2 - PASS2@NCBS
- SUPERFAMILY - Library of HMMs representing superfamilies and database of (superfamily and family) annotations for all completely sequenced organisms