Warren Gish
Warren Gish is the owner of Advanced Biocomputing, LLC. He joined Washington University in St. Louis as a junior faculty member in 1994 and was a Research Associate Professor of Genetics from 2002 to 2007. Gish is known primarily for his implementation of, and algorithmic contributions to, the NCBI BLAST sequence analysis program, the WU-BLAST package, and, most recently, the AB-BLAST package.
While working at U.C. Berkeley, Gish sped up the FASTA program of Pearson and Lipman by 2-fold on Sun and 3-fold on VAX hardware, without altering the results. A centralized search service was also envisioned at this time, in which all nucleotide sequences from GenBank would be held in memory (in compressed form, because memory was limited) to eliminate I/O bottlenecks, with clients invoking searches remotely via the Internet.
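As a rough illustration of what holding nucleotide sequences in compressed form can mean, the sketch below packs four bases into each byte using two bits per base. The specific bit assignment (A=0, C=1, G=2, T=3) and the function names `pack` and `unpack` are assumptions made for this example; the source does not describe the encoding used by any BLAST package, and ambiguity codes such as N are not handled here.

```python
# Minimal sketch of 2-bit nucleotide packing (assumed encoding, for illustration only).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack an A/C/G/T string into bytes, four bases per byte, first base in the high bits."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (6 - 2 * (i % 4))
    return bytes(out)


def unpack(packed: bytes, length: int) -> str:
    """Recover the first `length` bases from a packed buffer."""
    return "".join(
        BASES[(packed[i // 4] >> (6 - 2 * (i % 4))) & 3] for i in range(length)
    )


if __name__ == "__main__":
    packed = pack("ACGTAC")
    print(len(packed), unpack(packed, 6))
```

The sequence length has to be stored separately, because the final byte may be only partly filled. A packed sequence occupies about a quarter of the memory of a one-byte-per-base representation.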
Gish's earliest contributions to BLAST were made while working at the NCBI, starting in July 1989. These include: the use of compressed nucleotide sequences, both as an efficient storage format and as a rapid, native search format; identification of word-hits using DFAs (Mealy machine architecture); parallel processing; memory-mapped I/O; the use of sentinel bytes and words at the start and end of sequences to improve the speed of word-hit extension; the first implementations of BLASTX, TBLASTN and TBLASTX; the transparent use of external (plug-in) programs such as seg, xnu, and dust to mask low-complexity regions in query sequences at run time; the NCBI BLAST E-mail Service with optional public key-encrypted communications; the NCBI Experimental BLAST Network Service; the NCBI non-redundant (nr) protein and nucleotide sequence databases, typically updated on a daily basis with all data from GenBank, Swiss-Prot, and the PIR; a BLAST function library used in specialized applications for EST analysis and Entrez data production, as well as in the NCBI BLAST suite version 1.4; and project management for the earliest NCBI Dispatcher for distributed services (inspired by CORBA's Object Request Broker). The NCBI Experimental BLAST Network Service, running the latest BLAST software on SMP hardware against the latest sequence databases, established the NCBI in December 1989 as a convenient, one-stop shop for sequence similarity searching.
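The idea of finding word-hits with a Mealy-style finite-state machine can be sketched as follows: the state is the last w-1 letters read from the subject sequence, and each transition on a new letter outputs the query positions (if any) of the word just completed. This is a simplified sketch under stated assumptions, not the BLAST implementation: it matches exact words only, computes transitions on the fly rather than from a precompiled table, and the names `build_word_table` and `scan` are hypothetical.

```python
from collections import defaultdict


def build_word_table(query: str, w: int) -> dict:
    """Map every length-w word in the query to the list of positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i)
    return table


def scan(subject: str, query: str, w: int) -> list:
    """Return (query_pos, subject_pos) pairs for exact length-w word matches.

    The machine's state is the last w-1 subject letters; the output on each
    transition depends on both the state and the incoming letter (Mealy style).
    """
    table = build_word_table(query, w)
    state = ""  # fewer than w-1 letters until the machine has warmed up
    hits = []
    for j, letter in enumerate(subject):
        word = state + letter
        if len(word) == w:
            for i in table.get(word, ()):
                hits.append((i, j - w + 1))
            state = word[1:]  # keep only the trailing w-1 letters
        else:
            state = word
    return hits


if __name__ == "__main__":
    print(scan("GGACGTAC", "ACGT", 3))
```

The subject is read once, one letter at a time, and no position is revisited, which is the property that makes an automaton attractive for scanning large databases.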
At Washington University in St. Louis, Gish developed the first practical BLAST suite of programs to combine rapid gapped sequence alignment with statistical evaluation methods appropriate for gapped alignment scores. The resulting search programs were significantly more sensitive, but only marginally slower, than ungapped BLAST, due to a novel application of the BLAST dropoff score X during gapped alignment extension.
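The role of a dropoff score X can be illustrated with a deliberately simplified, ungapped extension: the alignment is extended one position at a time, and extension stops once the running score falls more than X below the best score seen so far. The sketch below is an assumption-laden illustration, not the WU-BLAST method: the function name `extend_right`, the match/mismatch scores, and the restriction to ungapped extension are all choices made for brevity. A gapped extension would apply a comparable cutoff while filling in dynamic-programming cells.

```python
def extend_right(q: str, s: str, qi: int, si: int, x_drop: int,
                 match: int = 5, mismatch: int = -4) -> tuple:
    """Extend an ungapped alignment rightward from (qi, si).

    Returns (best_score, best_length). Extension stops when the running
    score drops more than x_drop below the best score seen so far.
    """
    score = best = 0
    best_len = 0
    k = 0
    while qi + k < len(q) and si + k < len(s):
        score += match if q[qi + k] == s[si + k] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > x_drop:
            break  # score has fallen too far below the best; give up
    return best, best_len


if __name__ == "__main__":
    # A generous X lets the extension cross a single mismatch; a small X does not.
    print(extend_right("AAAATAAAA", "AAAACAAAA", 0, 0, x_drop=10))
    print(extend_right("AAAATAAAA", "AAAACAAAA", 0, 0, x_drop=3))
```

A larger X explores further past poorly scoring regions at the cost of more work, so the choice of X trades sensitivity against speed.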
Sensitivity of gapped BLAST was further improved by his novel application of Karlin-Altschul Sum statistics to the evaluation of multiple, gapped alignment scores in all BLAST search modes. Sum statistics were originally (and analytically) developed for the evaluation of multiple, ungapped alignment scores. The empirical use of Sum statistics in the treatment of gapped alignments was validated in collaboration with Stephen Altschul from 1994 to 1995.
In May 1996, WU-BLAST version 2.0 with gapped alignments was publicly released as a drop-in upgrade for existing users of ungapped NCBI BLAST and WU-BLAST (both at version 1.4, after having forked in 1994). Little NIH funding (averaging 20% FTE) was received for his WU-BLAST development, starting in November 1995 and ending shortly after the September 1997 release of NCBI gapped BLAST (“blastall”).
As an option in WU-BLAST, Gish implemented a two-hit BLAST algorithm that is faster, more memory-efficient, and more sensitive than the one used by the NCBI software.
In 1999, Gish added support to WU-BLAST for the Extended Database Format (XDF), the first BLAST database format capable of accurately representing the entire draft sequence of the human genome in full-length chromosome sequence objects. This was also the first time any BLAST package introduced a new database format in a manner transparent to existing users and without abandoning support for prior formats, a result of abstracting the database I/O functions completely from the data analysis functions. WU-BLAST with XDF was the first BLAST suite to support accurate, comprehensive indexed retrieval of NCBI standard sequence identifiers; to allow users to retrieve individual sequences in part or in whole, natively, translated or reverse-complemented; and to dump the entire contents of a BLAST database back into human-readable FASTA format.
In 2000, unique support for reporting of links (consistent sets of HSPs) was added, along with the ability for users to limit the distance between HSPs allowed in the same set to a biologically relevant length (e.g., the length of the longest intron in the species of interest), with the distance limitation entering into the calculation of p-values.
Between 2001 and 2003, Gish improved the speed of the DFA code used in WU-BLAST. Gish also proposed multiplexing query sequences to speed up BLAST searches by an order of magnitude or more (MPBLAST); implemented segmented sequences with internal sentinel bytes, in part to aid multiplexing with MPBLAST and in part to aid analysis of segmented query sequences from shotgun sequencing assemblies; and directed the use of WU-BLAST as a fast, flexible search engine for accurately identifying and masking repetitive elements and low-complexity sequences in genome sequences (the MaskerAid package for RepeatMasker).
With doctoral student Miao Zhang, Gish directed development of EXALIN, which significantly improved the accuracy of spliced alignment predictions through a novel approach that combined information from donor and acceptor splice site models with information from sequence conservation. Although EXALIN performed full dynamic programming by default, it could optionally use the output from WU-BLAST to seed the dynamic programming and speed up the process by about 100-fold with little loss of sensitivity or accuracy.
In 2008, Gish founded Advanced Biocomputing, LLC, where he continues to improve and support the AB-BLAST package.