De novo transcriptome assembly
De novo transcriptome assembly is the method of creating a transcriptome without the aid of a reference genome.
Introduction
Before de novo transcriptome assembly, transcriptome information was only readily available for a handful of model organisms utilized by the international scientific research community. With the advent of high-throughput sequencing (also called next-generation sequencing) technologies that are both cost- and labor-effective, it is now possible to expand the range of organisms studied via these methods. Within the past few years, transcriptomes have been created for chickpea, planaria, Parhyale hawaiensis, as well as the brains of the Nile crocodile, the corn snake, the bearded dragon, and the red-eared slider, to name just a few. Studying non-model organisms can provide novel insights into the mechanisms underlying the “diversity of fascinating morphological innovations” that have enabled the abundance of life on planet Earth. In animals and plants, the “innovations” that cannot be examined in common model organisms include mimicry, mutualism, parasitism, and asexual reproduction.
De novo vs. reference-based assembly
Prior to the development of transcriptome assembly computer programs, transcriptome data were analyzed primarily by mapping onto a reference genome. Though genome alignment is a robust way of characterizing transcript sequences, this method is limited by its inability to account for structural alterations of mRNA transcripts, such as alternative splicing. A set of assembled transcripts allows for initial gene expression studies.
Transcriptome vs. genome assembly
Unlike genome sequence coverage levels – which can vary randomly as a result of repeat content in non-coding intron regions of DNA – transcriptome sequence coverage levels can be directly indicative of gene expression levels. These repeated sequences also create ambiguities in the formation of contigs in genome assembly, while ambiguities in transcriptome assembly contigs usually correspond to spliced isoforms, or minor variation among members of a gene family.
RNA-seq
(Main article: RNA-seq)
Once mRNA is extracted and purified from cells, it is sent to a high-throughput sequencing facility, where it is first reverse transcribed to create a cDNA library. This cDNA can then be fragmented into various lengths depending on the platform used for sequencing. Each of the following platforms utilizes a different type of technology to sequence millions of short reads: 454 Sequencing, Illumina, and SOLiD.
Assembly algorithms
These sequences are input into a short read transcript assembly program, a number of which are available (see Assemblers). Although these programs have been generally successful in assembling genomes, transcriptome assembly presents some unique challenges. Whereas high sequence coverage for a genome may indicate the presence of repetitive sequences (and thus be masked), for a transcriptome it may indicate abundance. In addition, unlike genome sequencing, transcriptome sequencing can be strand-specific, due to the possibility of both sense and antisense transcripts. Finally, it can be difficult to reconstruct and tease apart all splicing isoforms.
Short read assemblers generally use one of two basic algorithms: overlap graphs and de Bruijn graphs. Overlap graphs are utilized for most assemblers designed for Sanger sequenced reads. The overlap between each pair of reads is computed and compiled into a graph, in which each node represents a single sequence read. This algorithm is more computationally intensive than de Bruijn graphs, and is most effective in assembling fewer reads with a high degree of overlap.
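The pairwise comparison behind an overlap graph can be illustrated with a minimal Python sketch. It is not the implementation of any particular assembler: the function names, the `min_length` threshold, and the toy reads are assumptions made for illustration, and only exact suffix-prefix matches are considered.

```python
from itertools import permutations


def overlap_length(a, b, min_length=3):
    """Return the length of the longest suffix of `a` that is a prefix of `b`.

    Overlaps shorter than `min_length` are reported as 0.
    """
    start = 0
    while True:
        # Look for the next position in `a` where the start of `b` occurs.
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        # If the rest of `a` from here is a prefix of `b`, this is the overlap.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def build_overlap_graph(reads, min_length=3):
    """Map each ordered pair of reads (a, b) to its overlap length, if any."""
    graph = {}
    # Every ordered pair is compared, so the work grows quadratically
    # with the number of reads.
    for a, b in permutations(reads, 2):
        length = overlap_length(a, b, min_length)
        if length > 0:
            graph[(a, b)] = length
    return graph


# Toy reads (hypothetical data).
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
overlap_graph = build_overlap_graph(reads)
```

Because every ordered pair of reads is compared, the cost of this naive version grows with the square of the number of reads, which is consistent with the approach being better suited to fewer reads.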
De Bruijn graphs align k-mers (usually 25-50 bp) based on k-1 sequence conservation to create contigs. The use of k-mers – which are shorter than the read lengths – in de Bruijn graphs reduces the computational intensity of this method.
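As a rough illustration of the k-1 overlap idea, the following Python sketch builds a toy de Bruijn graph and walks one unambiguous path to produce a contig. It is a simplification under stated assumptions: the function names are hypothetical, k is set to 4 only because the toy reads are short, sequencing errors and reverse complements are ignored, and the walk checks only the number of successors of each node.

```python
from collections import defaultdict


def build_de_bruijn_graph(reads, k):
    """Build a graph whose nodes are (k-1)-mers and whose edges are distinct k-mers."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen:
                continue  # each distinct k-mer contributes a single edge
            seen.add(kmer)
            # The edge joins the k-mer's prefix and suffix, which share k-2 bases.
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def walk_unambiguous_path(graph, start):
    """Extend a contig from `start` while each node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, [])) == 1:
        node = graph[node][0]
        if node in visited:
            break  # stop rather than loop forever on a cycle
        visited.add(node)
        contig += node[-1]
    return contig


# Toy reads (hypothetical data); real assemblers use much larger k.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
de_bruijn_graph = build_de_bruijn_graph(reads, k=4)
contig = walk_unambiguous_path(de_bruijn_graph, "ATG")
```

Note that no read is ever compared with another read directly: each read is only decomposed into k-mers, which is where the saving over pairwise overlap computation comes from.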
Functional annotation
Functional annotation of the assembled transcripts allows for insight into the particular molecular functions, cellular components, and biological processes in which the putative proteins are involved. Blast2GO enables Gene Ontology-based data mining to annotate sequence data for which no GO annotation is available yet. It works by blasting assembled contigs against a non-redundant nucleotide database, then annotating them based on sequence similarity. It is a research tool often employed in functional genomics research on non-model species.
Contigs can also be screened for open reading frames (ORFs) in order to predict the amino acid sequence of proteins derived from these transcripts.
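A minimal Python sketch of ORF screening is shown below. It is an illustration rather than the method of any specific tool: it assumes the standard genetic code, scans only the three forward reading frames (a fuller screen would also scan the reverse complement), and the function name, the `min_protein_length` parameter, and the example contig are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def find_orfs(sequence, min_protein_length=2):
    """Return (start, end, protein) tuples for ORFs in the three forward frames.

    An ORF runs from the first ATG in a frame to the next in-frame stop codon;
    `end` is the position just past the stop codon.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        codons = [sequence[i:i + 3] for i in range(frame, len(sequence) - 2, 3)]
        start = None
        for index, codon in enumerate(codons):
            if start is None and codon == "ATG":
                start = index
            elif start is not None and CODON_TABLE.get(codon, "X") == "*":
                # Translate from the start codon up to, but excluding, the stop.
                protein = "".join(
                    CODON_TABLE.get(c, "X") for c in codons[start:index]
                )
                if len(protein) >= min_protein_length:
                    orfs.append((frame + 3 * start, frame + 3 * (index + 1), protein))
                start = None
    return orfs


# Hypothetical contig containing ATG GCG TAC CTG TGA in the third frame.
orfs = find_orfs("CCATGGCGTACCTGTGATT")
```

Codons containing ambiguous bases such as N are translated as "X" here, and ORFs that run off the end of the contig without a stop codon are not reported.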
Verification and quality control
Since a reference genome is not available, the quality of computer-assembled contigs may be verified by aligning the sequences of conserved gene domains found in mRNA transcripts to transcriptomes or genomes of closely related species. Another method is to design PCR primers for predicted transcripts, then attempt to amplify them from the cDNA library. Oftentimes, exceptionally short and exceptionally long reads are filtered out; the short sequences in particular are unlikely to represent functional proteins.
Assemblers
The following is a compendium of assembly software that has been used to generate transcriptomes, and has also been cited in scientific literature.
Velvet
(Main article: Velvet assembler)
The Velvet algorithm uses de Bruijn graphs to assemble transcripts. In simulations, Velvet can produce contigs up to 50-kb N50 length using prokaryotic data and 3-kb N50 in mammalian bacterial artificial chromosomes (BACs). These preliminary transcripts are transferred to Oases, which uses paired end read and long read information to build transcript isoforms.
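The N50 figures quoted above are a summary statistic of contig lengths: the N50 is the length of the contig at which, going from longest to shortest, the running total first reaches half of the total assembled length. A minimal Python sketch of that calculation follows; the function name and the example lengths are illustrative assumptions, not Velvet output.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 for an empty input)."""
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Hypothetical contig lengths in base pairs.
example_n50 = n50([80, 70, 50, 40, 30, 20])
```

In this example the total length is 290, and the two longest contigs (80 + 70 = 150) are the first to cover at least half of it, so the N50 is 70.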
Trans-ABySS
ABySS is a parallel, paired-end sequence assembler. Trans-ABySS (Assembly By Short Sequences) is a software pipeline written in Python and Perl for analyzing ABySS-assembled transcriptome contigs. This pipeline can be applied to assemblies generated across a wide range of k values. It first reduces the dataset into smaller sets of non-redundant contigs, and identifies splicing events including exon-skipping, novel exons, retained introns, novel introns, and alternative splicing. The Trans-ABySS algorithms are also able to estimate gene expression levels, identify potential polyadenylation sites, as well as candidate gene-fusion events.
Trinity
Trinity first divides the sequence data into a number of de Bruijn graphs, each representing transcriptional variations at a single gene or locus. It then extracts full-length splicing isoforms and distinguishes transcripts derived from paralogous genes from each graph separately. Trinity consists of three independent software modules, which are used sequentially to produce transcripts:
- Inchworm assembles the RNA-Seq data into transcript sequences, often generating full-length transcripts for a dominant isoform, but then reports just the unique portions of alternatively spliced transcripts.
- Chrysalis clusters the Inchworm contigs and constructs complete de Bruijn graphs for each cluster. Each cluster represents the full transcriptional complexity for a given gene (or a family or set of genes that share a conserved sequence). Chrysalis then partitions the full read set among these separate graphs.
- Butterfly then processes the individual graphs in parallel, tracing the paths of reads within the graph, ultimately reporting full-length transcripts for alternatively spliced isoforms, and teasing apart transcripts that correspond to paralogous genes.
See also
- Transcriptome
- Human-transcriptome database for alternative splicing (H-DBAS)
- UniGene
- Full-parasites