Structural genomics
Structural genomics seeks to describe the 3-dimensional structure
of every protein encoded by a given genome. This genome-based approach allows for high-throughput structure determination through a combination of experimental and modeling approaches. The principal difference between structural genomics and traditional structure prediction is that structural genomics attempts to determine the structure of every protein encoded by the genome, rather than focusing on one particular protein. What makes it possible to pursue every protein in a genome at once, rather than solving structures one at a time, is the availability of full-genome sequences: structure determination can proceed more quickly through a combination of experimental and modeling approaches, especially because the large number of sequenced genomes and previously solved protein structures allows scientists to model a protein's structure on the structures of previously solved homologs.
Because protein structure is closely linked with protein function, structural genomics has the potential to inform knowledge of protein function. In addition to elucidating protein functions, structural genomics can be used to identify novel protein folds and potential targets for drug discovery. Structural genomics draws on many approaches to structure determination, including experimental methods using genomic sequences, modeling-based approaches based on sequence or structural homology to a protein of known structure, and approaches based on chemical and physical principles for a protein with no homology to any known structure.
As opposed to traditional structural biology, the determination of a protein structure through a structural genomics effort often (but not always) comes before anything is known regarding the protein's function. This raises new challenges in structural bioinformatics, i.e. determining a protein's function from its 3D structure.
Structural genomics emphasizes high-throughput determination of protein structures. This work is performed in dedicated structural genomics centers.
While most structural biologists pursue structures of individual proteins or protein groups, specialists in structural genomics pursue structures of proteins on a genome-wide scale. This implies large-scale cloning, expression, and purification. One main advantage of this approach is economy of scale. On the other hand, the scientific value of some resultant structures is at times questioned. A Science article from January 2006 analyzes the structural genomics field.
One advantage of structural genomics efforts such as the Protein Structure Initiative is that the scientific community gets immediate access to new structures, as well as to reagents such as clones and protein. A disadvantage is that many of these structures are of proteins of unknown function and do not have corresponding publications. This requires new ways of communicating this structural information to the broader research community. The bioinformatics core of the Joint Center for Structural Genomics (JCSG) has developed a wiki-based approach, The Open Protein Structure Annotation Network (TOPSAN), for annotating protein structures emerging from high-throughput structural genomics centers.
Goals
One goal of structural genomics is to identify novel protein folds. Experimental methods of protein structure determination require proteins that express and/or crystallize well, which may inherently bias the kinds of protein folds that these experimental data elucidate. A genomic, modeling-based approach such as ab initio modeling may be better able to identify novel protein folds than the experimental approaches because it is not limited by experimental constraints.
Protein function depends on 3-D structure, and these 3-D structures are more highly conserved than sequences. Thus, the high-throughput structure determination methods of structural genomics have the potential to inform our understanding of protein functions. This also has potential implications for drug discovery and protein engineering. Furthermore, every protein that is added to the structural database increases the likelihood that the database will include homologous sequences of other unknown proteins. The Protein Structure Initiative (PSI) is a multifaceted effort funded by the National Institutes of Health with various academic and industrial partners that aims to increase knowledge of protein structure using a structural genomics approach and to improve structure-determination methodology.
Methods
Structural genomics takes advantage of completed genome sequences in several ways in order to determine protein structures. The gene sequence of the target protein can be compared to a known sequence, and structural information can then be inferred from the known protein's structure. Structural genomics can be used to predict novel protein folds based on other structural data. It can also take a modeling-based approach that relies on homology between the unknown protein and a solved protein structure.
de novo methods
Completed genome sequences allow every open reading frame (ORF), the part of a gene that is likely to contain the sequence for the mRNA and protein, to be cloned and expressed as protein. These proteins are then purified and subjected to one of two types of structure determination: X-ray crystallography, for which the protein is first crystallized, or nuclear magnetic resonance (NMR). The whole genome sequence allows for the design of every primer required to amplify all of the ORFs, clone them into bacteria, and then express them. By using a whole-genome approach to this traditional method of protein structure determination, all of the proteins encoded by the genome can in principle be expressed at once. This approach aims at the structural determination of every protein that is encoded by the genome.
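The ORF-and-primer step described above can be illustrated with a minimal Python sketch. It is not taken from any structural genomics pipeline: the function names, the forward-strand-only scan, the `min_codons` cutoff, and the fixed primer length are all assumptions chosen for illustration, and real primer design also considers melting temperature, cloning overhangs, and both strands.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 30) -> List[Tuple[int, int]]:
    """Return (start, end) indices of forward-strand ORFs, end exclusive.

    An ORF here is the first ATG in a reading frame up to and including
    the next in-frame stop codon (a simplifying assumption).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Count codons including the start and stop codons.
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


def reverse_complement(seq: str) -> str:
    """Reverse-complement an uppercase DNA string of A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def naive_primers(dna: str, orf: Tuple[int, int], length: int = 20) -> Tuple[str, str]:
    """Take the ORF ends as primers: forward as-is, reverse as reverse complement."""
    start, end = orf
    forward = dna[start:start + length]
    reverse = reverse_complement(dna[end - length:end])
    return forward, reverse


if __name__ == "__main__":
    toy_genome = "CC" + "ATG" + "GCT" * 4 + "TAA" + "GG"
    for orf in find_orfs(toy_genome, min_codons=3):
        print(orf, naive_primers(toy_genome, orf, length=6))
```

Applied to a whole genome, the same loop would yield one primer pair per ORF, which is the sense in which a complete sequence makes genome-wide cloning possible.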
ab initio modeling
This approach uses protein sequence data and the chemical and physical interactions of the encoded amino acids to predict the 3-D structures of proteins with no homology to solved protein structures. One highly successful method for ab initio modeling is the Rosetta program, which divides the protein into short segments and arranges each short polypeptide chain into a low-energy local conformation. Rosetta is available for commercial use and for non-commercial use through its public program, Robetta.
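As a rough illustration of the first half of that idea, dividing a sequence into short segments, the following Python sketch enumerates overlapping windows. It is not Rosetta code: the function name and the default window size of 9 residues are assumptions for illustration, and the step that assigns each segment a low-energy conformation is not modeled here.

```python
from typing import List, Tuple


def sliding_fragments(sequence: str, size: int = 9) -> List[Tuple[int, str]]:
    """Return (start_index, fragment) for every overlapping window of `size` residues."""
    if size <= 0:
        raise ValueError("size must be positive")
    # A sequence shorter than `size` yields no fragments.
    return [(i, sequence[i:i + size]) for i in range(len(sequence) - size + 1)]


if __name__ == "__main__":
    for start, fragment in sliding_fragments("MKTAYIAK", size=3):
        print(start, fragment)
```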
Sequence-based modeling
This modeling technique compares the gene sequence of an unknown protein with sequences of proteins with known structures. Depending on the degree of similarity between the sequences, the structure of the known protein can be used as a model for solving the structure of the unknown protein. Highly accurate modeling is considered to require at least 50% amino acid sequence identity between the unknown protein and the solved structure. Sequence identity of 30-50% gives a model of intermediate accuracy, and sequence identity below 30% gives low-accuracy models. It has been predicted that at least 16,000 protein structures will need to be determined in order for all structural motifs to be represented at least once, thus allowing the structure of any unknown protein to be solved accurately through modeling. One disadvantage of this method, however, is that structure is more conserved than sequence, and thus sequence-based modeling may not be the most accurate way to predict protein structures.
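The identity thresholds above can be expressed as a small Python sketch. The function names are hypothetical, the two sequences are assumed to be already aligned with `-` marking gaps, and computing identity over gap-free aligned columns is one common convention among several, not something the text specifies.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of identical residues over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def expected_model_accuracy(identity: float) -> str:
    """Map percent identity to the accuracy tiers quoted in the text."""
    if identity >= 50.0:
        return "high"
    if identity >= 30.0:
        return "intermediate"
    return "low"


if __name__ == "__main__":
    identity = percent_identity("ACDE-G", "ACDQFG")
    print(identity, expected_model_accuracy(identity))
```

The tiers are a rule of thumb; a template just above or below a threshold will not behave very differently in practice.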
Threading
Threading bases structural modeling on fold similarities rather than sequence identity. This method may help identify distantly related proteins and can be used to infer molecular functions.
Examples of Structural Genomics
There are currently a number of ongoing efforts to solve the structures for every protein in a given proteome.
The Thermotoga maritima proteome
One current goal of the Joint Center for Structural Genomics (JCSG), a part of the Protein Structure Initiative (PSI), is to solve the structures for all the proteins in Thermotoga maritima, a thermophilic bacterium. T. maritima was selected as a structural genomics target based on its relatively small genome, consisting of 1,877 genes, and the hypothesis that the proteins expressed by a thermophilic bacterium would be easier to crystallize.
Lesley et al. used Escherichia coli to express all the open reading frames (ORFs) of T. maritima. These proteins were then crystallized, and structures were determined for the successfully crystallized proteins using X-ray crystallography. Among other structures, this structural genomics approach allowed for the determination of the structure of the TM0449 protein, which was found to exhibit a novel fold, as it did not share structural homology with any known protein.
The Mycobacterium tuberculosis proteome
The goal of the TB Structural Genomics Consortium is to determine the structures of potential drug targets in Mycobacterium tuberculosis, the bacterium that causes tuberculosis. The development of novel drug therapies against tuberculosis is particularly important given the growing problem of multi-drug-resistant tuberculosis.
The fully sequenced genome of M. tuberculosis has allowed scientists to clone many of these protein targets into expression vectors for purification and structure determination by X-ray crystallography. Studies have identified a number of target proteins for structure determination, including extracellular proteins that may be involved in pathogenesis, iron-regulatory proteins, current drug targets, and proteins predicted to have novel folds. So far, structures have been determined for 708 of the proteins encoded by M. tuberculosis.
Protein Structure Databases and Classifications
- Protein Data Bank (PDB): repository for protein sequence and structural information
- UniProt: provides sequence and functional information
- Structural Classification of Proteins (SCOP): hierarchical-based approach
- Class, Architecture, Topology and Homologous superfamily (CATH): hierarchical-based approach
See also
- Genomics
- Omics
- Structural proteomics
- Protein Structure Initiative
External links
- Structural Genomics Consortium
- Protein Structure Initiative (PSI)
- PSI Structural Genomics Knowledgebase: A Nature Gateway
- Northeast Structural Genomics Consortium
- The Midwest Center for Structural Genomics
- Berkeley Structural Genomics Center
- Center for Eukaryotic Structural Genomics
- Yeast Structural Genomics (Genomique Structurale de la levure)
- RIKEN Structural Genomics/Proteomics Initiative
- Structural Genomics of Pathogenic Protozoa
- The Joint Center for Structural Genomics
- Mycobacterium tuberculosis Structural Genomics Consortium
- New York SGX Research Center for Structural Genomics (NYSGXRC)
- NJCST Initiative in Structural Genomics and Bioinformatics
- Structural Genomics at Brookhaven Natl. Labs
- Structure to Function Pilot Project: CARB
- The Southeast Collaboratory for Structural Genomics
- Toronto Structural Proteomics Consortium
- Protein Structure Factory
- Oxford Protein Production Facility
- Center for Structural Genomics of Infectious Diseases
- Seattle Structural Genomics Center for Infectious Disease
- Structural Proteomics in Europe SPINE
- Forum for European Structural Proteomics (FESP)
- Israel Structural Proteomics Center (ISPC)