Bioinformatics workflow management systems
A bioinformatics workflow management system is a specialized form of workflow management system designed to compose and execute a series of computational or data manipulation steps, or a workflow, in the scientific domain of bioinformatics.
There are currently many different workflow systems. Some have been developed more generally as scientific workflow systems for use by scientists from many different disciplines, such as astronomy and earth science.
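As a rough illustration of the definition above, a workflow can be thought of as an ordered list of steps in which each step consumes the output of the previous one. The sketch below is not taken from any of the systems described in this article; the step names (`parse_fasta`, `drop_short`, `gc_content`), the runner `run_workflow`, and the toy input are all hypothetical and chosen only to make the idea concrete.

```python
from typing import Any, Callable, Dict, List

# A step is any function that takes one input and returns one output.
Step = Callable[[Any], Any]


def parse_fasta(text: str) -> Dict[str, str]:
    """Step 1 (hypothetical): parse FASTA text into {sequence id: sequence}."""
    records: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The id is the first word after '>'.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def drop_short(records: Dict[str, str], min_length: int = 8) -> Dict[str, str]:
    """Step 2 (hypothetical): keep only sequences of at least min_length bases."""
    return {k: v for k, v in records.items() if len(v) >= min_length}


def gc_content(records: Dict[str, str]) -> Dict[str, float]:
    """Step 3 (hypothetical): fraction of G and C bases per sequence."""
    return {k: (v.count("G") + v.count("C")) / len(v) for k, v in records.items()}


def run_workflow(steps: List[Step], data: Any) -> Any:
    """Run the steps in order, feeding each step's output into the next."""
    for step in steps:
        data = step(data)
    return data


EXAMPLE_FASTA = """>seq1 toy record
ATGCGCGC
>seq2
ATAT
>seq3
GGGGCCCCAAAATTTT
"""

result = run_workflow([parse_fasta, drop_short, gc_content], EXAMPLE_FASTA)
print(result)  # {'seq1': 0.75, 'seq3': 0.5}
```

Real workflow systems add much more on top of this idea, such as graphical composition, provenance tracking, and distributed execution, but the core notion of chaining data manipulation steps is the same.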
Examples
- Anduril is an open source component-based workflow framework for scientific data analysis developed at the University of Helsinki. Anduril provides an execution engine written in Java, a large number of components for bioinformatics analysis, and the AndurilScript language to create and manage workflows.
- BioBike is a biocomputing platform based upon the KnowOS (Knowledge Operating System) e-science technology. Written entirely in Lisp, KnowOS's main distinguishing feature is "through-the-browser" programmability.
- BioExtract harnesses the power of online informatics tools for creating and customizing workflows. Users can query online sequence data, analyze it using an array of informatics tools, create and share custom workflows for repeated analysis, and save the resulting data and workflows in standardized reports.
- BioManager is a bioinformatic data management and analysis workflow developed by the University of Sydney.
- CellProfiler is open source modular image analysis software developed at the Broad Institute. Capable of handling hundreds of thousands of images, it contains advanced algorithms for image analysis of cell-based assays and is optimized for high-throughput work. The software allows the user to construct a pipeline of individual modules; each module performs an image processing step, such as image loading, object identification, or feature extraction.
- Discovery Net (circa 2000) is one of the earliest examples of a scientific workflow system. It was the winner of the “Most Innovative Data Intensive Application Award” at the ACM SC02 (Supercomputing 2002) conference and exhibition, based on a demonstration of a fully interactive distributed genome annotation pipeline for a Malaria genome case study. The Discovery Net system originated from a £2m EPSRC-funded project of the same name at Imperial College London, which investigated the development of an e-Science platform for scientific discovery from the data generated by a wide variety of high-throughput devices. Many of the features of the system (architecture features, visual front-end, simplified access to remote Web and Grid Services, and inclusion of a workflow store) were considered novel at the time, and have since found their way into other academic and commercial systems.
- Ergatis is a web-based system used to create, run, and monitor reusable bioinformatics analysis pipelines. It contains pre-built components for common bioinformatics analysis tasks, such as BLAST searches or storing data in a Chado database. These components can be arranged graphically to create highly configurable pipelines.
- Galaxy is an open source workflow system developed at Penn State and Emory University. Galaxy is available as a free public web server and as downloadable software. Galaxy stresses ease of use and the sharing and persisting of analyses.
- GenePattern is a genomic analysis platform developed at the Broad Institute of MIT & Harvard that provides access to more than 150 tools for gene expression analysis, proteomics, SNP analysis, RNA-seq, flow cytometry, and common data processing tasks. A web-based interface provides access to these tools and allows the creation of multi-step analysis pipelines that enable reproducible in silico research.
- Geodise (Grid Enabled Optimisation and Design Search for Engineering) was developed at the University of Southampton.
- Kepler enables scientists in a variety of disciplines, such as biology, ecology and astronomy, to compose and execute workflows. Kepler is based on the Ptolemy II system for heterogeneous, concurrent modeling and design. Ptolemy II was developed by the members of the Ptolemy project at the University of California, Berkeley. Although not originally intended for scientific workflows, it provides a mature platform for building and executing workflows, and supports multiple models of computation.
- LONI Pipeline is a Java-based distributed graphical data-analysis environment for constructing, validating, executing and disseminating scientific workflows. Because the LONI Pipeline references all data, services and tools as external objects, it allows resource interoperability directly, without the need to rebuild the software.
- Medicel Integrator Workflow is a cluster-enabled bioinformatics workflow design and execution application. It can be used stand-alone or integrated with a biology data warehouse.
- Pegasus is a flexible framework that enables the mapping of complex scientific workflows onto the grid, developed at the Information Sciences Institute at the University of Southern California.
- Pegasys is software for executing and integrating analyses of biological sequences, developed by the University of British Columbia.
- Pipeline Pilot is Accelrys' scientific informatics platform that streamlines data integration and analysis by using a visual programming language (similar to LabVIEW) to build a pipeline that transforms any number of inputs (raw data) into any number of outputs.
- Taverna workbench is an open source workflow system that enables scientists (typically, though not exclusively, in bioinformatics) to compose and execute scientific workflows. It has been developed as part of a £5.5m EPSRC project called myGrid based at the University of Manchester. Independently, other researchers have created programming by example workflow development tools that are interoperable with Taverna.
- Triana is an open source problem solving environment developed at Cardiff University that combines an intuitive visual interface with powerful data analysis tools.
- Wildfire is a distributed, Grid-enabled workflow construction and execution environment. It has a graphical user interface for constructing and running workflows. Wildfire borrows user interface features from Jemboss and adds a drag-and-drop interface allowing the user to compose EMBOSS (and other) programs into workflows. For execution, Wildfire uses GEL, the underlying workflow execution engine, which can exploit available parallelism on multiple-CPU machines, including Beowulf-class clusters and Grids.
- Sight is a web agent-oriented workflow platform that has historically offered extensive means to integrate websites with ordinary web forms and HTML responses (WSDL is also supported). The system has a GUI-based workflow composer that supports modules with multiple ports and allows a module to access data from modules that stand earlier in the workflow. Sight was developed at the University of Ulm using Java and is currently released under the GPL.
- RetroGuide is a framework for querying retrospective bioinformatics data.
- UGENE Workflow Designer is an open source visual environment designed for building and executing bioinformatics workflows. The main purpose of the system is to provide a user-friendly GUI for creating computational workflows that can be executed on commodity hardware as well as on high-performance clusters and supercomputers.
- HCDC is an open source workflow system developed at ETH Zurich that focuses on large-scale image-based biological experiments. It includes a large collection of components for multiwell plate handling (96, 384, ...).
- Mobyle is a framework and web portal specifically aimed at the integration of bioinformatics software and databanks. Mobyle is the successor of Pise and the RPBS server, previous systems that provided web environments to define and execute bioinformatics analyses.
- Remora is a web server implemented according to the BioMoby web-service specifications, providing life science researchers with an easy-to-use workflow generator and launcher, a repository of predefined workflows and a survey system.
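
Several of the systems listed above let users arrange components into a graph of steps and run independent steps in parallel. The sketch below illustrates that general idea only; it does not reproduce the engine or API of any listed system. It assumes Python 3.9 or later (for the standard-library `graphlib` module), and the function `run_dag`, the task names, and the toy data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, dependencies, max_workers=4):
    """Run tasks in dependency order, one batch of ready tasks at a time.

    tasks:        {name: function(inputs)} where inputs maps each
                  prerequisite name to that prerequisite's result
    dependencies: {name: set of prerequisite names}
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            futures = {}
            # Every task returned here has all of its prerequisites finished,
            # so the tasks in one batch can run concurrently.
            for name in sorter.get_ready():
                inputs = {dep: results[dep] for dep in dependencies.get(name, ())}
                futures[name] = pool.submit(tasks[name], inputs)
            # Simplification: wait for the whole batch before scheduling more.
            for name, future in futures.items():
                results[name] = future.result()
                sorter.done(name)
    return results


# Hypothetical toy workflow: two independent analyses feed a final report.
tasks = {
    "reads": lambda inputs: ["ATGC", "GGCC", "ATAT"],
    "lengths": lambda inputs: [len(s) for s in inputs["reads"]],
    "gc_counts": lambda inputs: [s.count("G") + s.count("C") for s in inputs["reads"]],
    "report": lambda inputs: [
        g / n for g, n in zip(inputs["gc_counts"], inputs["lengths"])
    ],
}
dependencies = {
    "reads": set(),
    "lengths": {"reads"},
    "gc_counts": {"reads"},
    "report": {"lengths", "gc_counts"},
}

results = run_dag(tasks, dependencies)
print(results["report"])  # [0.5, 1.0, 0.0]
```

Here `lengths` and `gc_counts` depend only on `reads`, so they are submitted to the thread pool in the same batch, while `report` runs only after both have finished. Production systems typically schedule such steps across clusters or grids rather than local threads.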
External links
- This paper from the ACM SIGMOD Record reviews some of the above workflow systems
- Portal of a joint European Grid and web-services project called EMBRACE. Provides much information and many worked-out bioinformatics examples and web-services.
- Galaxy
- GenePattern Website and (Nature Genetics) paper in CIBEC'08 comparing multiple workflow systems for bioinformatics applications
- Workflow technology based Solutions for Bioinformatics
- Mobyle
- Remora