FileQuirks
FileQuirks is a bioinformatics web server for recognizing biological data types, developed at the Laboratory of Bioinformatics and Protein Engineering of the International Institute of Molecular and Cell Biology (IIMCB) in Warsaw (GeneSilico). It lets users quickly check the format of a file containing biological data.
Background
Publicly available bioinformatic tools and data are growing rapidly, and so is the number of data formats in use, such as FASTA format, mass spectrometry data formats, European Data Format, and the Protein Data Bank file format. For example, despite several unification attempts, more than 20 formats for biological sequences are in use, and the number is still growing. Although various initiatives promote standardized XML, CSV, or tabular formats, most commonly used file formats are raw text files with no characteristic features that could be used to identify or distinguish them. As a result, users of bioinformatic software spend a significant amount of time checking which file formats they have and assessing whether these are compatible with the input or output formats of the tools they want to use.
Algorithm
FileQuirks checks the format of a data file using an extremely simple, data-driven algorithm. Example files for each file format are stored in a database. Adding a new file format to recognize requires only providing example files in that format.
The system calculates a set of descriptors (hundreds or more), each with a value of 0 or 1, which are evaluated for each of the stored files. The descriptors currently used are regular expressions, designed to recognize patterns common in biological data, such as the word "BLAST", present in every BLAST report, or the ">" sign at the beginning of a line in sequence formats. If a regular expression matches a given file, the value of the descriptor is 1; otherwise it is 0. Matching is performed with the Python re module, with the multiline flag enabled.
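The descriptor step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three patterns are hypothetical stand-ins (the article does not list the real descriptor set), and the use of re.search rather than re.match is a guess at how "matches a given file" is meant.

```python
import re

# Hypothetical descriptor patterns, chosen for illustration only.
DESCRIPTOR_PATTERNS = [
    r"BLAST",        # the word "BLAST", as found in BLAST reports
    r"^>",           # ">" at the start of a line, as in sequence formats
    r"^ATOM\s+\d+",  # assumed pattern for coordinate records in PDB-like files
]

# The multiline flag makes "^" match at the start of every line,
# not only at the start of the file.
DESCRIPTORS = [re.compile(p, re.MULTILINE) for p in DESCRIPTOR_PATTERNS]


def describe(text):
    """Return one 0/1 value per descriptor for the given file content."""
    return [1 if rx.search(text) else 0 for rx in DESCRIPTORS]


fasta_text = ">seq1 example\nACGTACGTACGT\n"
blast_text = "BLASTP 2.2.26\nQuery= seq1\n"

print(describe(fasta_text))  # [0, 1, 0]
print(describe(blast_text))  # [1, 0, 0]
```

Each stored example file and each user query can be reduced to such a 0/1 vector in the same way.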
The user's query is evaluated against all regular expressions in the database. The data formats whose example files match a similar set of regular expressions are then presented to the user.
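The article does not say how similarity between the query and the example files is measured. The sketch below assumes one simple option, the Hamming distance between 0/1 descriptor vectors, and ranks formats by their closest example file. The descriptors, the example files, and the names describe, hamming, and rank_formats are all hypothetical.

```python
import re

# Hypothetical descriptors and example files, for illustration only.
DESCRIPTORS = [
    re.compile(p, re.MULTILINE)
    for p in (r"BLAST", r"^>", r"^ATOM\s+\d+")
]

EXAMPLES = {
    "FASTA": [">seq1\nACGTACGT\n", ">p1 protein\nMKVLAAGIV\n"],
    "BLAST report": ["BLASTN 2.2.26\nQuery= seq1\n"],
    "PDB": ["ATOM      1  N   MET A   1\n"],
}


def describe(text):
    """Return the 0/1 descriptor vector of a file as a tuple."""
    return tuple(1 if rx.search(text) else 0 for rx in DESCRIPTORS)


def hamming(a, b):
    """Count the positions at which two descriptor vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def rank_formats(query):
    """Rank formats by the distance from the query to their closest example."""
    q = describe(query)
    scores = {
        fmt: min(hamming(q, describe(f)) for f in files)
        for fmt, files in EXAMPLES.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


ranking = rank_formats(">query\nGGCCTTAA\n")
print(ranking[0])  # ('FASTA', 0)
```

Under this assumption, supporting a new format only means adding its example files to EXAMPLES, which mirrors the data-driven design described above.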
To improve the results, the system also includes a set of "expert" regular expressions, each designed to recognize only one specific format. An example of such an expression is "(>([^\t\n\r\f\v]*)\r?\n\r?([ANCTGUanctgu\n\r]{20,})){2,}", which (believe it or not) matches only files with more than one nucleic acid sequence in FASTA format. Expert expressions are evaluated against every user query, and the matching data types are presented.
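The expert expression quoted above can be tried directly with the Python re module. In this sketch the pattern is copied from the text, while the helper name is_multi_nucleotide_fasta, the sample inputs, and the use of re.search are assumptions made for illustration.

```python
import re

# Expert expression quoted in the article. The multiline flag is passed for
# consistency with the rest of the system; it has no effect here because the
# pattern contains no "^" or "$" anchors.
EXPERT_FASTA_NT = re.compile(
    r"(>([^\t\n\r\f\v]*)\r?\n\r?([ANCTGUanctgu\n\r]{20,})){2,}",
    re.MULTILINE,
)


def is_multi_nucleotide_fasta(text):
    """True if the text contains two or more consecutive nucleotide FASTA records."""
    return EXPERT_FASTA_NT.search(text) is not None


two_records = ">seq1\nACGTACGTACGTACGTACGTACGT\n>seq2\nTTGGCCAATTGGCCAATTGGCCAA\n"
one_record = ">seq1\nACGTACGTACGTACGTACGTACGT\n"
protein = ">p1\nMKVLHHWERDFYPQSIKLMVRE\n>p2\nMKVLHHWERDFYPQSIKLMVRE\n"

print(is_multi_nucleotide_fasta(two_records))  # True
print(is_multi_nucleotide_fasta(one_record))   # False
print(is_multi_nucleotide_fasta(protein))      # False
```

The {20,} quantifier requires at least 20 sequence characters (line breaks included) per record, and the outer {2,} requires at least two records in a row, which is why the single-record and protein inputs are rejected.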
See also
- BioCatalogue Search services by data feature
- BioCatalogue