Phred base calling
Phred is a base-calling computer program that identifies a base (nucleobase) sequence from the fluorescence "trace" data generated by an automated DNA sequencer that uses electrophoresis and the four-fluorescent-dye method. When originally developed, Phred produced significantly fewer errors in the data sets examined than other methods, averaging 40-50% fewer errors. Phred quality scores have become widely accepted as a way to characterize the quality of DNA sequences, and can be used to compare the efficacy of different sequencing methods.

Background

Fluorescent-dye DNA sequencing is a molecular biology technique that involves labeling single-stranded DNA fragments of varied length with four fluorescent dyes (corresponding to the four different bases used in DNA) and then separating the fragments by slab-gel or capillary electrophoresis (see DNA sequencing). The electrophoresis run is monitored by a CCD on the DNA sequencer, which produces time "trace" data (or a "chromatogram") of the fluorescent "peaks" that pass the CCD point. Examining the fluorescence peaks in the trace data makes it possible to determine the order of the individual bases (nucleobases) in the DNA. However, because the intensity, shape, and location of a fluorescence peak are not always consistent or unambiguous, determining (or "calling") the correct bases accurately can be difficult or time-consuming when done manually.

Automated DNA sequencing techniques have revolutionized the field of molecular biology, generating vast amounts of DNA sequence data. However, the raw data is produced at a significantly higher rate than it can be processed (that is, than the trace data can be interpreted to produce sequence data), which creates a bottleneck. Removing the bottleneck requires both automated software that speeds up processing with improved accuracy and a reliable measure of that accuracy. Many software programs have been developed to meet this need. One such program is Phred.

History

Phred was originally conceived in the early 1990s by Phil Green, then a professor at Washington University in St. Louis. LaDeana Hillier, Michael Wendl, David Ficenec, Tim Gleeson, Alan Blanchard, and Richard Mott also contributed to the codebase and algorithm. Green moved to the University of Washington in the mid-1990s, after which development was primarily managed by Green and Brent Ewing. Phred played a notable role in the Human Genome Project, where large amounts of sequence data were processed by automated scripts. It is currently the most widely used base-calling program in both academic and commercial DNA sequencing laboratories because of its high base-calling accuracy. Phred is distributed commercially by CodonCode Corporation and is used to perform the "Call bases" function in the program CodonCode Aligner. It is also used by the MacVector plugin Assembler.

Methods

Phred uses a four-phase procedure, as outlined by Ewing et al., to determine a sequence of base calls from the processed DNA sequence trace:
  1. Predicted peak locations are determined. This phase relies on the assumption that fragments are, on average, relatively evenly spaced in most regions of the gel. It establishes the correct number of bases and their idealized, evenly spaced locations in regions where the peaks are poorly resolved, noisy, or displaced (as in compressions).
  2. Observed peaks are identified in the trace.
  3. Observed peaks are matched to the predicted peak locations, omitting some peaks and splitting others. Because each observed peak comes from a specific array and is thus associated with one of the four bases (A, G, T, or C), the ordered list of matched observed peaks determines a base sequence for the trace.
  4. The unmatched observed peaks are checked for any peak that appears to represent a base but could not be assigned to a predicted peak in the third phase. If such a peak is found, the corresponding base is inserted into the read sequence.

The entire procedure is rapid, usually taking less than half a second per trace.
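
The four phases can be illustrated with a deliberately simplified Python sketch. This is not Phred's actual algorithm: the function names (make_trace, find_peaks, call_bases), the two height thresholds, the use of the median gap between strong peaks as the spacing estimate, and the "tallest nearby peak" matching rule are all assumptions made for illustration. Peak splitting, which the third phase performs in Phred, is left out.

```python
from statistics import median


def make_trace(length, peaks):
    """Build a synthetic one-channel trace; peaks maps position -> height."""
    trace = [0.0] * length
    for pos, height in peaks.items():
        trace[pos] = height
        trace[pos - 1] = trace[pos + 1] = height / 2  # crude peak shoulders
    return trace


def find_peaks(trace, min_height):
    """Phase 2 (toy): indices of local maxima at or above min_height."""
    return [i for i in range(1, len(trace) - 1)
            if trace[i] >= min_height and trace[i - 1] < trace[i] >= trace[i + 1]]


def call_bases(traces, min_height=10.0, strong_height=50.0):
    """traces maps a base letter to its intensity array (hypothetical layout)."""
    # Phase 2: collect observed peaks from all four arrays as (position, base, height).
    observed = sorted((i, base, trace[i])
                      for base, trace in traces.items()
                      for i in find_peaks(trace, min_height))
    strong = [pos for pos, _, height in observed if height >= strong_height]
    if len(strong) < 2:
        return "".join(b for _, b, h in observed if h >= strong_height)

    # Phase 1 (toy): assume even spacing, estimated as the median gap
    # between strong peaks, and lay out idealized peak locations.
    spacing = max(1, median(b - a for a, b in zip(strong, strong[1:])))
    n_steps = round((strong[-1] - strong[0]) / spacing)
    predicted = [strong[0] + k * spacing for k in range(n_steps + 1)]

    # Phase 3: match each predicted location to the tallest unused observed
    # peak within half a spacing; peaks never chosen are omitted.
    unmatched = list(observed)
    calls = []
    for location in predicted:
        nearby = [p for p in unmatched if abs(p[0] - location) <= spacing / 2]
        if nearby:
            best = max(nearby, key=lambda p: (p[2], -abs(p[0] - location)))
            calls.append(best)
            unmatched.remove(best)

    # Phase 4: insert unmatched peaks that still look like real bases.
    calls.extend(p for p in unmatched if p[2] >= strong_height)
    return "".join(base for _, base, _ in sorted(calls))


traces = {
    "A": make_trace(50, {10: 100}),
    "C": make_trace(50, {20: 100}),
    "G": make_trace(50, {30: 100}),
    "T": make_trace(50, {15: 20, 40: 100}),  # small noise peak at position 15
}
print(call_bases(traces))  # prints "ACGT"; the noise peak is omitted
```

In this toy the weak peak at position 15 loses the match to its taller neighbors in phase 3 and falls below the strong threshold in phase 4, so it never reaches the read.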

Applications

Phred is often used together with Phrap, a program for DNA sequence assembly. Phrap was routinely used in some of the largest sequencing projects in the Human Genome Project and is currently one of the most widely used DNA sequence assembly programs in the biotech industry. Phrap uses Phred quality scores to determine highly accurate consensus sequences and to estimate the quality of those consensus sequences. It also uses Phred quality scores to estimate whether discrepancies between two overlapping sequences are more likely to arise from random errors or from different copies of a repeated sequence.
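
The article does not define the quality score itself. Under the standard definition, a Phred quality score Q relates to the estimated probability P that a base call is wrong by Q = -10 log10(P), so Q = 20 corresponds to a 1 in 100 chance of error and Q = 30 to 1 in 1,000. The Python sketch below shows the conversion in both directions; the function names are illustrative and are not part of Phred or Phrap.

```python
import math


def phred_quality(error_probability):
    """Convert an estimated base-call error probability into a quality score."""
    return -10 * math.log10(error_probability)


def error_probability(quality):
    """Convert a quality score back into an error probability."""
    return 10 ** (-quality / 10)


def expected_errors(qualities):
    """Expected number of miscalled bases in a read, given per-base scores."""
    return sum(error_probability(q) for q in qualities)


print(round(phred_quality(0.001)))         # 30
print(error_probability(20))               # 0.01
print(expected_errors([20, 20, 30, 40]))   # roughly 0.0211
```

Summing the per-base error probabilities gives a rough expected error count for a read, which is one way such scores make sequences from different runs or methods comparable.
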
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 