Biomedical text mining
Biomedical text mining refers to text mining applied to texts and literature of the biomedical and molecular biology domain. It is a relatively recent research field at the intersection of natural language processing, bioinformatics, medical informatics and computational linguistics.
Interest in text mining and information extraction strategies applied to the biomedical and molecular biology literature is growing because of the increasing number of electronically available publications stored in databases such as PubMed.
Main applications
The main developments in this area have been related to the identification of biological entities (named entity recognition), such as protein and gene names in free text; the association of gene clusters obtained by microarray experiments with the biological context provided by the corresponding literature; and the automatic extraction of protein interactions and of associations between proteins and functional concepts (e.g. Gene Ontology terms). The extraction of kinetic parameters and of the subcellular location of proteins from text has also been addressed by information extraction and text mining technology.
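As a rough illustration of the named entity recognition task mentioned above, the following minimal sketch matches tokens against a small lexicon of gene symbols. The lexicon, the function name `find_gene_mentions` and the example sentence are assumptions made for illustration only; the tools in this field generally rely on curated resources and machine learning rather than a plain lookup.

```python
import re

# Hypothetical toy lexicon of gene symbols (an assumption for this sketch).
GENE_LEXICON = {"BRCA1", "TP53", "EGFR"}

def find_gene_mentions(text, lexicon=GENE_LEXICON):
    """Return (start, end, name) for every token that exactly matches a lexicon entry."""
    mentions = []
    # Tokens are runs of letters, digits and hyphens; matching is case-sensitive.
    for match in re.finditer(r"[A-Za-z0-9-]+", text):
        token = match.group(0)
        if token in lexicon:
            mentions.append((match.start(), match.end(), token))
    return mentions

# Made-up example sentence, not taken from any publication.
sentence = "Mutations in BRCA1 and TP53 are frequent in this cohort."
mentions = find_gene_mentions(sentence)
for start, end, name in mentions:
    print(start, end, name)
```

A lookup of this kind misses spelling variants and cannot resolve ambiguous symbols, which is one reason the task is treated as a research problem in its own right.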
Examples
- KLEIO - an advanced information retrieval system providing knowledge enriched searching for biomedicine.
- FACTA+ - a MEDLINE search engine for finding associations between biomedical concepts. The FACTA+ Visualizer helps intuitive understanding of FACTA+ search results through graphical visualization of the results.
- U-Compare - an integrated text mining/natural language processing system based on the UIMA Framework, with an emphasis on components for biomedical text mining.
- TerMine - a term management system that identifies key terms in biomedical and other text types.
- MEDIE - an intelligent search engine to retrieve biomedical correlations from MEDLINE, based on indexing by Natural Language Processing and Text Mining techniques
- AcroMine - an acronym dictionary which can be used to find distinct expanded forms of acronyms from MEDLINE.
- AcroMine Disambiguator - Disambiguates abbreviations in biomedical text with their correct full forms.
- GENIA tagger - Analyses biomedical text and outputs base forms, part-of-speech tags, chunk tags, and named entity tags
- NEMine - Recognises gene/protein names in text
- Yeast MetaboliNER - Recognizes yeast metabolite names in text.
- Smart Dictionary Lookup - machine learning-based gene/protein name lookup.
- Chilibot - a tool for finding relationships between genes or gene products.
- EBIMed - a web application that combines information retrieval and extraction from MEDLINE.
- FABLE - a gene-centric text-mining search engine for MEDLINE.
- GOAnnotator - an online tool that uses semantic similarity for verification of electronic protein annotations using GO terms automatically extracted from literature.
- GoPubMed - retrieves PubMed abstracts for a search query, then detects ontology terms from the Gene Ontology and Medical Subject Headings in the abstracts and allows the user to browse the search results by exploring the ontologies and displaying only papers mentioning selected terms, their synonyms or descendants.
- Anne O'Tate - retrieves sets of PubMed records, using a standard PubMed interface, and analyzes them, arranging the content of PubMed record fields (MeSH, author, journal, words from title and abstracts, and others) in order of frequency.
- Information Hyperlinked Over Proteins (iHOP): "A network of concurring genes and proteins extends through the scientific literature touching on phenotypes, pathologies and gene function. iHOP provides this network as a natural way of accessing millions of PubMed abstracts. By using genes and proteins as hyperlinks between sentences and abstracts, the information in PubMed can be converted into one navigable resource, bringing all advantages of the internet to scientific literature research."
- LitInspector - gene and signal transduction pathway data mining in PubMed abstracts.
- NextBio - a life sciences search engine with text mining functionality that uses PubMed abstracts and clinical trials to return concepts relevant to the query, based on a number of heuristics including ontology relationships, journal impact, publication date, and authorship.
- PubAnatomy - an interactive visual search engine that provides new ways to explore relationships among MEDLINE literature, text mining results, anatomical structures, gene expression and other background information.
- PubGene - co-occurrence network display of gene and protein symbols as well as MeSH, GO, PubChem and interaction terms (such as "binds" or "induces") as these appear in MEDLINE records (that is, PubMed titles and abstracts).
- Whatizit - identifies molecular biology terms and links them to publicly available databases.
- XTractor - discovers newer scientific relations across PubMed abstracts; a tool to obtain manually annotated, expert-curated relationships for proteins, diseases, drugs and biological processes as they are published in PubMed.
- Medical Abstract - an aggregator of medical journal abstracts from PubMed.
- MuGeX - a tool for finding disease-specific mutation-gene pairs.
- MedCase - an experimental tool from the Faculties of Veterinary Medicine and Computer Science in Cluj-Napoca, designed as a homeostatic serving system with natural language support for medical applications.
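Several of the tools listed above are described in terms of co-occurrence of terms within abstracts. The general idea can be sketched as follows, assuming a simple whitespace tokenizer and three invented toy sentences standing in for abstracts; the function name `cooccurrence_counts` is hypothetical, and this is not the method of any particular tool named in the list.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(abstracts, terms):
    """Count how many abstracts mention each unordered pair of terms."""
    counts = Counter()
    for abstract in abstracts:
        # Naive tokenization: lowercase and split on whitespace.
        words = set(abstract.lower().split())
        present = sorted(t for t in terms if t.lower() in words)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts

# Invented toy sentences used only to exercise the function.
abstracts = [
    "TP53 binds MDM2 in stressed cells",
    "MDM2 induces degradation of TP53",
    "EGFR signalling is independent of MDM2 here",
]
counts = cooccurrence_counts(abstracts, ["TP53", "MDM2", "EGFR"])
ranked = counts.most_common()  # pairs in order of frequency
for pair, n in ranked:
    print(pair, n)
```

Each counted pair can be read as a weighted edge in a co-occurrence network. A count alone says nothing about the kind of relationship, which is why interaction terms such as "binds" or "induces" are tracked alongside the entities.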
Conferences at which BioNLP research is presented
BioNLP research is presented at a variety of meetings:
- Pacific Symposium on Biocomputing: in plenary session
- Intelligent Systems for Molecular Biology: in plenary session and also in the BioLINK and Bio-ontologies workshops
- Association for Computational Linguistics and North American Chapter of the Association for Computational Linguistics annual meetings and associated workshops: in plenary session and as part of the BioNLP workshop (see below)
- BioNLP 2010
- American Medical Informatics Association annual meeting: in plenary session