Galaxy (computational biology)
Galaxy is a scientific workflow, data integration, and data and analysis persistence and publishing platform that aims to make computational biology accessible to research scientists who do not have computer programming experience. Although it was initially developed for genomics research, it is largely domain agnostic and is now used as a general bioinformatics workflow management system.
Functionality
Galaxy is a scientific workflow system. These systems provide a means to build multi-step computational analyses akin to a recipe. They typically provide a graphical user interface for specifying what data to operate on, what steps to take, and what order to do them in.
Galaxy is also a data integration platform for biological data. It supports data uploads from the user's computer, by URL, and from many online resources (such as the UCSC Genome Browser, BioMart and InterMine).
Galaxy supports a range of widely used biological data formats, and translation between those formats. Galaxy provides a web interface to many text manipulation utilities, enabling researchers to do their own custom reformatting and manipulation without having to do any programming.
Galaxy includes interval manipulation utilities for doing set-theoretic operations (e.g. intersection, union, ...) on intervals. Many biological file formats include genomic interval data (a frame of reference, e.g., chromosome or contig name, and start and stop positions), allowing these data to be integrated.
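The set-theoretic operations on genomic intervals described above can be illustrated with a minimal Python sketch. It is not Galaxy's own implementation: the `Interval` type, the function names, the sample data, and the 0-based, end-exclusive coordinate convention are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str  # frame of reference: chromosome or contig name
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # exclusive (assumed convention)


def merge(intervals):
    """Collapse overlapping or touching intervals within one set."""
    by_chrom = defaultdict(list)
    for iv in intervals:
        by_chrom[iv.chrom].append(iv)
    merged = []
    for chrom in sorted(by_chrom):
        current = None
        for iv in sorted(by_chrom[chrom], key=lambda i: (i.start, i.end)):
            if current is not None and iv.start <= current.end:
                # Overlaps or touches the running interval: extend it.
                current = Interval(chrom, current.start, max(current.end, iv.end))
            else:
                if current is not None:
                    merged.append(current)
                current = iv
        if current is not None:
            merged.append(current)
    return merged


def union(a, b):
    """Regions covered by at least one of the two interval sets."""
    return merge(list(a) + list(b))


def intersect(a, b):
    """Regions covered by both interval sets."""
    result = []
    for x in merge(a):
        for y in merge(b):
            if x.chrom != y.chrom:
                continue  # intervals on different references never overlap
            start, end = max(x.start, y.start), min(x.end, y.end)
            if start < end:
                result.append(Interval(x.chrom, start, end))
    return result


# Hypothetical sample data.
genes = [Interval("chr1", 100, 200), Interval("chr1", 150, 300), Interval("chr2", 50, 80)]
peaks = [Interval("chr1", 180, 250), Interval("chr2", 80, 120)]

print(union(genes, peaks))
print(intersect(genes, peaks))
```

With this sample data the union collapses to one region per chromosome, and the intersection contains only the chr1 region from 180 to 250. The two chr2 intervals merely touch at position 80, so under the end-exclusive convention they share no bases. The nested loop in `intersect` is quadratic and is kept only for readability.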
Finally, Galaxy also supports data and analysis persistence and publishing. See Reproducibility and Transparency below.
Project Goals
Galaxy is "an open, web-based platform for performing accessible, reproducible, and transparent genomic science."
Accessibility
Computational biology is a specialized domain that often requires knowledge of computer programming. Galaxy aims to give biomedical researchers access to computational biology without also requiring them to understand computer programming. Galaxy does this by stressing a simple user interface over the ability to build complex workflows. This design choice makes it relatively easy to build typical analyses, but more difficult to build complex workflows that include, for example, looping constructs. (See Taverna workbench for an example system that supports looping.)
Reproducibility
Reproducibility is a key goal of science: when scientific results are published, the publications should include enough information that others can repeat the experiment and get the same results. There have been many recent efforts to extend this goal from the bench (the "wet lab") to computational experiments (the "dry lab") as well. This has proved to be a more difficult task than initially expected.
Galaxy supports reproducibility by capturing sufficient information about every step in a computational analysis, so that the analysis can be repeated, exactly, at any point in the future. This includes keeping track of all input, intermediate, and final datasets, as well as the parameters provided to, and the order of, each step of the analysis.
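As a rough illustration of what capturing this information can look like, the following Python sketch records, for each step, its position in the analysis, the tool name, the parameters, and digests of the input and output datasets. The `ProvenanceLog` class, the `digest` helper, and the two toy tools are hypothetical and are not part of Galaxy.

```python
import hashlib
import json


def digest(dataset):
    """Short content digest, so a dataset can be identified later."""
    payload = json.dumps(dataset, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


class ProvenanceLog:
    """Hypothetical recorder: one entry per executed step, in execution order."""

    def __init__(self):
        self.steps = []

    def run(self, tool, func, inputs, params):
        output = func(*inputs, **params)
        self.steps.append({
            "order": len(self.steps) + 1,
            "tool": tool,
            "params": dict(params),
            "input_digests": [digest(i) for i in inputs],
            "output_digest": digest(output),
        })
        return output


# Two toy "tools" standing in for real analysis steps.
def keep_at_least(values, threshold):
    return [v for v in values if v >= threshold]


def sort_values(values, descending):
    return sorted(values, reverse=descending)


log = ProvenanceLog()
raw = [5, 12, 7, 30]
filtered = log.run("keep_at_least", keep_at_least, [raw], {"threshold": 7})
final = log.run("sort_values", sort_values, [filtered], {"descending": True})

print(json.dumps(log.steps, indent=2))
```

Because the output digest of the first step equals the input digest of the second, the log alone is enough to reconstruct which dataset fed which step, in which order, and with which parameters.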
Transparency
Galaxy supports transparency in scientific research by enabling researchers to share any of their Galaxy Objects either publicly, or with specific individuals. Shared items can be examined in detail, rerun at will and copied and modified to test hypotheses.
Galaxy Objects: Histories, Workflows, Datasets and Pages
Galaxy objects are anything that can be saved, persisted, and shared in Galaxy:
Histories:
- Histories are computational analyses (recipes) run with specified input datasets, computational steps and parameters. Histories include all intermediate and output datasets as well.
Workflows:
- Workflows are computational analyses that specify all the steps (and parameters) in the analysis, but none of the data. Workflows are used to run the same analysis against multiple sets of input data.
Datasets:
- Datasets include any input, intermediate, or output dataset used or produced in an analysis.
Pages:
- Histories, workflows and datasets can include user-provided annotation. Galaxy Pages enables the creation of a virtual paper that describes the how and why of the overall experiment. Tight integration of Pages with Histories, Workflows, and Datasets supports this goal.
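The distinction between a workflow (steps and parameters, but no data) and a history (one run of those steps against specific data) can be sketched in Python as follows. The `Step` and `History` types, `run_workflow`, and the sample sequences are hypothetical illustrations, not Galaxy's data model.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Step:
    name: str
    func: Callable
    params: dict


@dataclass
class History:
    input_dataset: list
    # One (step name, parameters, output dataset) record per executed step.
    datasets: list = field(default_factory=list)


def run_workflow(workflow, input_dataset):
    """Apply every step in order and keep each intermediate dataset."""
    history = History(input_dataset=input_dataset)
    current = input_dataset
    for step in workflow:
        current = step.func(current, **step.params)
        history.datasets.append((step.name, step.params, current))
    return history


# The workflow names steps and parameters only; it holds no data.
workflow = [
    Step("drop_short", lambda seqs, min_length: [s for s in seqs if len(s) >= min_length],
         {"min_length": 4}),
    Step("uppercase", lambda seqs: [s.upper() for s in seqs], {}),
]

# The same workflow run against two input datasets yields two histories.
history_a = run_workflow(workflow, ["acgt", "ac", "ggcta"])
history_b = run_workflow(workflow, ["tt", "tgca"])

print(history_a.datasets[-1][2])
print(history_b.datasets[-1][2])
```

Each history keeps its input, every intermediate dataset, and the final output, while the workflow itself stays reusable because it never stores data.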
Availability
Galaxy is available:
- As a free public web server, supported by the Galaxy Project. This server includes many bioinformatics tools that are widely useful in many areas of genomics research. Users can create logins, and save histories, workflows, and datasets on the server. These saved items can also be shared with others.
- As open-source software that can be downloaded, installed and customized to address specific needs. Galaxy can be installed locally or using a computing cloud.
- Public web servers hosted by other organizations. Several organizations with their own Galaxy installation have also opted to make those servers available to others.
Implementation
Galaxy is open-source software implemented using the Python programming language. It is developed by the Galaxy team at Penn State and Emory University, and the Galaxy Community.