1000 Plant Genomes Project
Announced in 2008, shortly after the human 1000 Genomes Project, the 1000 Plant Genomes Project is a similar large-scale genomics endeavour that takes advantage of the speed and efficiency of next-generation DNA sequencing. Headed by Dr. Gane Ka-Shu Wong and Dr. Michael Deyholos of the University of Alberta, the project aims to obtain the transcriptome (expressed genes) of 1000 different plant species over the next few years.

Recent advances in DNA sequencing technologies have dramatically reduced the cost and time needed to sequence an organism’s entire genome, and large-scale sequencing projects (involving many organisms) have been and are currently being undertaken as a result. The recently started 1000 Genomes Project, for example, aims to obtain high genome coverage of 1000 individual people to better understand human genetic variation, because genomic sequence is the best way to assess this.

Goals of the Project

Although the current number of classified green plant species is around 370,000, there are probably many thousands more yet unclassified. Despite this number, very few of these species have detailed DNA sequence information to date; 79,486 species are represented in GenBank as of this writing, but most (>95%) have DNA sequence for only one or two genes. “…almost none of the roughly half million plant species known to humanity has been touched by genomics at any level”. Furthermore, almost all detailed genetic maps are constructed from mitochondrial or chloroplast DNA rather than the nuclear genomic DNA of the plants. The 1000 Plant Genomes Project will produce roughly a 100x increase in the number of species with available broad genome sequence.

Evolutionary Relationships

There have been efforts to determine the evolutionary relationships between the known plant species, but phylogenies (or phylogenetic trees) created solely from morphological data, cellular structures, single enzymes, or only a few sequences (like rRNA) can be prone to error. Morphological features are especially vulnerable when two species look physically similar though they are not closely related (as a result of convergent evolution rather than homology, for example), or when two closely related species look very different because, for example, they are able to change in response to their environment very well. These situations are very common in the plant kingdom. An alternative method for constructing evolutionary relationships is to compare changes in the DNA sequence of many genes between the different species, which is often more robust to the problem of similar-appearing species. With the amount of genomic sequence produced by this project, many predicted evolutionary relationships can be better tested by sequence alignment (figure 1) to improve their certainty.
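
As a rough illustration of how aligned sequences can be turned into a measure of relatedness, the sketch below computes the proportion of differing sites (the p-distance) between every pair of species in a small alignment. The species names and sequences are hypothetical toy data, and the p-distance is only one simple choice of distance; the project's actual analysis methods are not described here.

```python
from itertools import combinations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned, ungapped positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for base_a, base_b in zip(seq_a, seq_b):
        if base_a == "-" or base_b == "-":
            continue  # skip alignment gaps
        compared += 1
        if base_a != base_b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return mismatches / compared


# Hypothetical, already-aligned toy sequences ("-" marks a gap).
alignment = {
    "species_A": "ATGGCCTTAGGA",
    "species_B": "ATGGCCTTAGGT",
    "species_C": "ATGACC-TCGGT",
}

# Pairwise distances between all species in the alignment.
distances = {
    (x, y): p_distance(alignment[x], alignment[y])
    for x, y in combinations(sorted(alignment), 2)
}

# The pair with the smallest distance is the most similar at the sequence level.
closest_pair = min(distances, key=distances.get)
```

In practice such distances would be computed over many genes per species, which is what makes sequence-based trees less sensitive to misleading similarities in appearance.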

Biotechnology applications

The list of plant genomes to be sequenced in the project is not random. Instead, the project will focus on plants that produce valuable chemicals or other products (secondary metabolites in many cases), in the hope that characterizing the genes involved will allow the underlying biosynthetic processes to be used or modified. For example, many plants are known to produce oils (like olives), and the oils from certain plants, like the oil palm and hydrocarbon-producing species, bear a strong chemical resemblance to petroleum products. If these plant mechanisms could be used to produce mass quantities of industrially useful oil, or modified so that they do, they would be of great value. Knowing the sequence of the plant’s genes involved in the metabolic pathway producing the oil is a large first step towards such utilization. A recent example of engineering a natural biochemical pathway is Golden rice, whose pathway was genetically modified so that a precursor to vitamin A is produced in large quantities, making the golden-colored rice a potential solution for vitamin A deficiency. This concept of engineering plants to do “work” is popular, and its potential would dramatically increase as a result of gene information on 1000 plant species.
Biosynthetic pathways could also be used for mass production of medicinal compounds in plants, rather than by the manual organic chemical reactions through which most are currently made.

Project Approach

Sequencing will be carried out on the 28 Illumina Genome Analyzer next-generation DNA sequencing machines at the Beijing Genomics Institute (BGI; Shenzhen, China). The 3 Gb/run (3 billion base pairs per experiment) capacity of each of these machines will enable fast and accurate sequencing of the plant samples.
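
The capacity figures above lend themselves to a quick back-of-the-envelope calculation. The sketch below multiplies the machine count by the per-run capacity and estimates how many rounds of runs 1000 samples would need. The 2 Gb of data per sample is a purely hypothetical assumption chosen for illustration, as is the idea that all 28 machines are dedicated to the project and that each species contributes one sample.

```python
import math

MACHINES = 28      # number of sequencers, from the text
GB_PER_RUN = 3.0   # capacity per machine per run in Gb, from the text


def combined_capacity_gb(machines: int = MACHINES, gb_per_run: float = GB_PER_RUN) -> float:
    """Total sequence output if every machine completes one run."""
    return machines * gb_per_run


def rounds_needed(samples: int, gb_per_sample: float,
                  machines: int = MACHINES, gb_per_run: float = GB_PER_RUN) -> int:
    """Rounds of simultaneous runs needed to sequence all samples."""
    samples_per_round = math.floor(combined_capacity_gb(machines, gb_per_run) / gb_per_sample)
    if samples_per_round < 1:
        raise ValueError("one sample needs more data than a full round provides")
    return math.ceil(samples / samples_per_round)


# ASSUMPTION: 2 Gb per sample is a hypothetical target, not a project figure.
total_gb = combined_capacity_gb()           # 28 * 3 = 84 Gb per round
rounds = rounds_needed(1000, gb_per_sample=2.0)
```

Changing `gb_per_sample` shows how sensitive the total effort is to the amount of data collected per species.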

Species selection

The list of plant species to be sequenced has been nearly finalized through an international collaboration of the various funding agencies and research groups expressing their interest in certain plants. There has been a focus on plant species known to have useful biosynthetic capacity, to facilitate the biotechnology goals of the project, along with other species selected to fill in gaps and explain some unknown evolutionary relationships in the current plant phylogeny. In addition to industrial compound biosynthetic capacity, plant species known or suspected to produce medically active chemicals (such as poppies producing opiates) were assigned a high priority to better understand the synthesis process, explore commercial production potential, and discover new pharmaceutical options. A large number of plant species with medicinal properties have been selected from traditional Chinese medicine (TCM). The largely completed list of selected species can be publicly viewed at www.onekp.com/samples/list.php.

Transcriptome vs. genome sequencing

Rather than sequencing the entire genome (all DNA sequence) of the various plant species, the project will sequence only those regions of the genome that produce a protein product (coding genes), that is, the transcriptome. This approach is justified by the focus on biochemical pathways, where only the genes producing the involved proteins are required to understand the synthetic mechanism, and because these thousands of sequences would provide adequate sequence detail to construct very robust evolutionary relationships through sequence comparison. The number of coding genes in plant species can vary considerably, but all have tens of thousands or more, making the transcriptome a large collection of information. Even so, non-coding sequence makes up the majority (>90%) of the genome content. Although this approach is conceptually similar to expressed sequence tags (ESTs), it is fundamentally different in that the entire sequence of each gene will be acquired with high coverage, rather than just a small portion of the gene sequence as with an EST. To distinguish the two, the non-EST method is known as “shotgun transcriptome sequencing”.

Transcriptome shotgun sequencing

mRNA (messenger RNA) is collected from a sample, converted to cDNA by a reverse transcriptase enzyme, and then fragmented so that it can be sequenced. Besides transcriptome shotgun sequencing, this technique has also been called RNA-seq and whole transcriptome shotgun sequencing (WTSS).
Once the cDNA fragments are sequenced, they will be assembled de novo (without aligning to a reference genome sequence) back into the complete gene sequence by combining all of the fragments from that gene during the data analysis phase.
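
The idea of rebuilding a gene sequence from overlapping fragments without a reference can be sketched with a toy greedy assembler that repeatedly merges the two pieces sharing the longest suffix-to-prefix overlap. This is a deliberately simplified stand-in, not the assembly software used by the project (which is not named here); the fragments are hypothetical, error-free, and all from the same strand.

```python
def overlap(left: str, right: str, min_len: int) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(fragments, min_len=3):
    """Merge fragments pairwise by largest overlap until no overlaps remain."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap(left, right, min_len)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Hypothetical fragments of one toy transcript, given out of order.
fragments = ["TTAGCCTGA", "ATGGCGTACG", "GTACGTTAGC"]
assembled = greedy_assemble(fragments)
```

Real reads contain sequencing errors, come from both strands, and number in the millions, so production assemblers rely on more scalable data structures than this pairwise comparison.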

Plant tissue sampling

The samples will come from around the world, with a number of particularly rare species being supplied by botanical gardens such as the Fairy Lake Garden (Shenzhen, China). The type of tissue collected will be determined by the expected location of biosynthetic activity; for example, if an interesting process or chemical is known to exist primarily in the leaves, the sample will come from the leaves.

Limitations

Since only the transcriptome is being sequenced, the project will not reveal information about gene regulatory sequences, non-coding RNAs, DNA repetitive elements, or other genomic features that are not part of the coding sequence. Based on the few whole plant genomes sequenced so far, these non-coding regions make up the majority of the genome, and the non-coding DNA may actually be the primary driver of trait differences seen between species.

Since mRNA is the starting material, the amount of sequence representation for a given gene will depend on its expression level (how many mRNA molecules it produces). This means that highly expressed genes get better coverage because there is more sequence to work from. As a result, some genes with important biochemical functions may not be reliably detected by the project if they are expressed at a low level.
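
The dependence of coverage on expression level can be made concrete with a small calculation. The sketch below assumes, as a simplification, that reads are drawn from each gene in proportion to the number of its mRNA molecules multiplied by the transcript length; the gene names, molecule counts, read count, and 50-base read length are all hypothetical values chosen for illustration.

```python
def expected_depth(molecules, lengths, total_reads, read_len):
    """Expected per-base coverage depth for each gene.

    Assumes reads are sampled in proportion to (mRNA molecules x transcript length).
    """
    pool_bases = sum(molecules[gene] * lengths[gene] for gene in molecules)
    depth = {}
    for gene in molecules:
        share = molecules[gene] * lengths[gene] / pool_bases  # fraction of all reads
        reads = total_reads * share
        depth[gene] = reads * read_len / lengths[gene]
    return depth


# Hypothetical sample: one abundant transcript and one rare transcript.
molecules = {"abundant_gene": 9900, "rare_gene": 100}
lengths = {"abundant_gene": 1000, "rare_gene": 1000}

depth = expected_depth(molecules, lengths, total_reads=2000, read_len=50)
```

Under these assumptions the abundant gene is covered about 99 times per base while the rare gene is covered about once, which is too thin to reconstruct it reliably.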

Many plant species (especially agriculturally manipulated ones) are known to have undergone large genome-wide changes through duplication of the whole genome. Rice and wheat are examples; wheat can carry 4-6 copies of the whole genome, whereas animals typically have only 2 (diploidy). These duplicated genes may pose a problem for the de novo assembly of sequence fragments, because repeated sequences are difficult for assembly programs to resolve when putting the fragments together, and the duplicates can be difficult to track through evolution.
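
One rough way to see why duplicated genes are hard to assemble is to count how many short subsequences (k-mers) two gene copies have in common, since a fragment made only of shared k-mers cannot be assigned to one copy or the other. The sketch below uses two hypothetical copies that differ at a single base; the sequences and the choice of k = 5 are illustrative assumptions.

```python
def kmers(seq: str, k: int) -> set:
    """All distinct substrings of length k in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(copy_a: str, copy_b: str, k: int) -> float:
    """Fraction of all observed k-mers that occur in both gene copies."""
    a, b = kmers(copy_a, k), kmers(copy_b, k)
    return len(a & b) / len(a | b)


# Hypothetical duplicated gene copies differing at one position.
copy_a = "ATGGCGTACGTTAGCCTGA"
copy_b = "ATGGCGTACGTTAGCTTGA"

fraction = shared_kmer_fraction(copy_a, copy_b, k=5)
```

The closer this fraction is to 1, the fewer fragments carry a distinguishing difference, and the more likely an assembler is to collapse the copies into one sequence or join them incorrectly.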

Similarities

The Beijing Genomics Institute in Shenzhen, China, is one of the major genomics centers involved in the 1000 Genomes Project and is also the site of sequencing for the 1000 Plant Genomes Project.
Both projects are large-scale efforts to obtain detailed DNA sequence information to improve our understanding of the organisms, and both projects will utilize next-generation sequencing to facilitate a timely completion.

Differences

The goals of the two projects are significantly different. While the 1000 Genomes Project focuses on genetic variation in a single species, the 1000 Plant Genomes Project looks at the evolutionary relationships and genes of 1000 different plant species.

While the 1000 Genomes Project was initially estimated to cost up to $50 million USD, the 1000 Plant Genomes Project will likely not be as expensive; the difference in cost comes from the target sequence in the genomes. Since the 1000 Plant Genomes Project will sequence only the transcriptome, whereas the human project will sequence as much of the genome as is deemed feasible, this more specific approach requires much less sequencing effort. While this means that there will be less overall sequence output relative to the 1000 Genomes Project, the non-coding portions of the genomes excluded from the 1000 Plant Genomes Project are not as important to its goals as they are to the human project. The more focused approach of the 1000 Plant Genomes Project therefore minimizes cost while still achieving its goals.

Funding

The project will be funded by the Informatics Circle of Research Excellence (iCORE), the Alberta Agricultural Research Institute (AARI), Genome Alberta, the University of Alberta, the Beijing Genomics Institute (BGI), and Musea Ventures (a USA-based private investment firm). To date, the project has received $1.5 million CAD from the Alberta Government and another $0.5 million from Musea Ventures. An additional $2.5 million CAD will be contributed by the Alberta government over the next 3 years.
In January 2010, BGI announced that it would be contributing $100 million to large-scale sequencing projects of plants and animals (including the 1000 Plant Genomes Project).

Related projects

  • The 1000 Genomes Project (www.1000genomes.org/)
  • The 1001 Genomes Project – Sequencing the whole genome of 1001 Arabidopsis strains (http://1001genomes.org/index.html)
  • Genome 10K – Whole genome sequence of 10000 vertebrate species (www.genome10k.org)
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 