Text mining - AbsoluteAstronomy.com

Text mining, sometimes referred to as text data mining and roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived by identifying patterns and trends through means such as statistical pattern learning. Text mining usually involves structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
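
The three stages described above (structuring the input, deriving patterns, evaluating the output) can be illustrated with a minimal Python sketch. It is only one possible reading of that process: the toy corpus, the stop-word list, and the choice of TF-IDF weighting with cosine similarity are assumptions made for illustration, not methods prescribed by this article.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus; real input would come from files or a database.
DOCS = [
    "Text mining derives information from text.",
    "Data mining discovers patterns in large data sets.",
    "Sentiment analysis extracts opinions from text.",
]

# A tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"from", "in", "the", "a", "of"}


def tokenize(text):
    """Structuring step: lowercase, split into word tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tf_idf(docs):
    """Pattern step: weight each term by its frequency and its rarity."""
    token_lists = [tokenize(d) for d in docs]
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Evaluation step: similarity between two sparse term vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vectors = tf_idf(DOCS)
print(cosine(vectors[0], vectors[2]))  # the two documents that share "text"
print(cosine(vectors[0], vectors[1]))  # the two documents that share "mining"
```

Sparse dictionaries are used instead of dense arrays so that the sketch needs only the standard library. A similarity score of this kind is one simple basis for the text clustering and categorization tasks listed above.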

History

Labor-intensive manual text mining approaches first surfaced in the mid-1980s, but technological advances have enabled the field to advance during the past decade. Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. As most information (common estimates say over 80%) is currently stored as text, text mining is believed to have high commercial potential.
Increasing interest is being paid to multilingual data mining: the ability to gain information across languages and cluster similar items from different linguistic sources according to their meaning.

Security applications

Many text mining software packages are marketed for security applications, especially the analysis of plain text sources such as Internet news. Text mining is also involved in the study of text encryption.

Biomedical applications

A range of text mining applications in the biomedical literature has been described.

One of the more important online text mining applications for the biomedical literature is GoPubMed, which was the first semantic search engine on the Web.
Another example is PubGene, which combines biomedical text mining with network visualization as an Internet service.

Software and applications

Text mining methods and software are also being researched and developed by major firms, including IBM and Microsoft, to further automate the mining and analysis processes, and by firms working in the area of search and indexing in general as a way to improve their results.
Within the public sector, much effort has been concentrated on creating software for tracking and monitoring terrorist activities.

Online media applications

Text mining is being used by large media companies, such as the Tribune Company, to disambiguate information and to provide readers with better search experiences, which in turn increases site "stickiness" and revenue. Additionally, on the back end, editors benefit from being able to share, associate and package news across properties, significantly increasing opportunities to monetize content.

Marketing applications

Text mining is starting to be used in marketing as well, more specifically in analytical customer relationship management. Coussement and Van den Poel (2008) apply it to improve predictive analytics models for customer churn (customer attrition).

Sentiment analysis

Sentiment analysis may involve analysis of movie reviews for estimating how favorable a review is for a movie.
Such an analysis may need a labeled data set or labeling of the affectivity of words.
A resource for affectivity of words has been made for WordNet.

Text has been used to detect emotions in the related area of affective computing. Text-based approaches to affective computing have been used on multiple corpora such as student evaluations, children's stories and news stories.
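
As a rough illustration of how labeled word affectivity can be used to estimate how favorable a review is, the following Python sketch averages the polarity of the words it recognizes. The lexicon, its scores, and the simple negation rule are all hypothetical; real affect resources are far larger and more nuanced.

```python
import re

# Hypothetical affect lexicon: word -> polarity score in [-1, 1].
AFFECT_LEXICON = {
    "great": 1.0, "enjoyable": 1.0, "good": 0.5,
    "boring": -1.0, "awful": -1.0, "weak": -0.5,
}
# Hypothetical negation words that flip the next word's polarity.
NEGATIONS = {"not", "never", "no"}


def review_score(review):
    """Average polarity of lexicon words in a review.

    The sign is flipped only for a lexicon word that directly follows
    a negation word. Returns 0.0 when no lexicon word is found.
    """
    tokens = re.findall(r"[a-z']+", review.lower())
    scores = []
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            negate = True
            continue
        if token in AFFECT_LEXICON:
            value = AFFECT_LEXICON[token]
            scores.append(-value if negate else value)
        negate = False
    return sum(scores) / len(scores) if scores else 0.0


print(review_score("A great and enjoyable film"))
print(review_score("The plot was not good"))
```

A lexicon-based score like this needs no labeled reviews, only labeled words. The alternative mentioned above, a labeled data set, would instead be used to train a classifier.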

Academic applications

The issue of text mining is of importance to publishers who hold large databases of information needing indexing for retrieval. This is especially true in scientific disciplines, in which highly specific information is often contained within written text. Therefore, initiatives have been taken such as Nature's proposal for an Open Text Mining Interface (OTMI) and the National Institutes of Health's common Journal Publishing Document Type Definition (DTD) that would provide semantic cues to machines to answer specific queries contained within text without removing publisher barriers to public access.

Academic institutions have also become involved in the text mining initiative:

The National Centre for Text Mining (NaCTeM) is the first publicly funded text mining centre in the world. NaCTeM is operated by the University of Manchester in close collaboration with the Tsujii Lab, University of Tokyo. NaCTeM provides customised tools and research facilities and offers advice to the academic community. It is funded by the Joint Information Systems Committee (JISC) and two of the UK Research Councils (EPSRC & BBSRC). With an initial focus on text mining in the biological and biomedical sciences, research has since expanded into the areas of social sciences.

In the United States, the School of Information at the University of California, Berkeley is developing a program called BioText to assist biology researchers in text mining and analysis.

Notable software and applications

Text mining computer programs are available from many commercial and open source companies and sources.

Commercial

AeroText – provides a suite of text mining applications for content analysis. Content used can be in multiple languages.
Attensity – hosted, integrated and stand-alone text mining (analytics) software that uses natural language processing technology to address collective intelligence in social media and forums; the voice of the customer in surveys and emails; customer relationship management; e-services; research and e-discovery; risk and compliance; and intelligence analysis.
Autonomy – suite of text mining, clustering and categorization solutions for a variety of industries.
Basis Technology – provides a suite of text analysis modules to identify language, enable search in more than 20 languages, extract entities, and efficiently search for and translate entities.
Clarabridge – offers SaaS, hosted or on-premise sentiment and text analytics (text mining) software solutions that use natural language processing (NLP), machine learning, clustering and categorization to extract insights from unstructured and structured data.
Endeca Technologies – provides software to analyze and cluster unstructured text.
Expert System S.p.A. – suite of semantic technologies and products for developers and knowledge managers.
Fair Isaac – leading provider of decision management solutions powered by advanced analytics (includes text analytics).
Inxight – provider of text analytics, search, and unstructured visualization technologies. (Inxight was bought by Business Objects, which was in turn bought by SAP AG in 2008.)
LanguageWare – text analysis libraries and customization tooling from IBM.
Language Computer Corporation – provides a suite of customizable text extraction and analysis tools, available in multiple languages.
LexisNexis – provider of business intelligence solutions based on an extensive news and company information content set. Through the recent acquisition of Datops, LexisNexis is leveraging its search and retrieval expertise to become a player in the text and data mining field.
Mathematica – provides built-in tools for text alignment, pattern matching, clustering and semantic analysis.
Nstein Technologies – text mining solution that creates rich metadata to allow publishers to increase page views, increase site stickiness, optimize SEO, automate tagging, improve search experience, increase editorial productivity, decrease operational publishing costs, and increase online revenues. In combination with search engines it is used to create semantic search applications.
SAS – solutions including SAS Text Miner and Teragram; commercial text analytics, natural language processing, and taxonomy software leveraged for information management. SAS Text Miner was rated as the third most used text mining software (9%) by Rexer's Annual Data Miner Survey in 2010.
IBM SPSS – provider of IBM SPSS Modeler and IBM SPSS Text Analytics (now called IBM SPSS Modeler Premium). Rated as the second (17%) and fourth (7%) most used text mining software, respectively, by Rexer's Annual Data Miner Survey in 2010.
StatSoft – provides STATISTICA Text Miner as an optional extension to STATISTICA Data Miner, for predictive analytics solutions. Rated as the most used text mining software (19%) by Rexer's Annual Data Miner Survey in 2010.
Thomson Data Analyzer – enables complex analysis on patent information, scientific publications and news.

Free libre open-source

Carrot2 – text and search results clustering framework.
GATE – natural language processing and language engineering tool.
OpenNLP – natural language processing toolkit.
Natural Language Toolkit (NLTK) – a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python programming language.
RapidMiner with its Text Processing Extension – data and text mining software. Rated as the fifth most used text mining software (6%) by Rexer's Annual Data Miner Survey in 2010.
Unstructured Information Management Architecture (UIMA) – a component framework to analyze unstructured content such as text, audio and video, originally developed by IBM.
tm: Text Mining Package – a framework for text mining applications within R, originally created by Ingo Feinerer as part of his dissertation at the Institute for Statistics and Mathematics of the Vienna University of Economics and Business Administration.

Implications

Until recently, websites most often used text-based searches, which only found documents containing specific user-defined words or phrases. Now, through use of a semantic web, text mining can find content based on meaning and context (rather than just by a specific word).
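
The difference between matching a specific word and matching by meaning can be sketched in Python as follows. The concept table and documents are hypothetical, and a hand-written synonym table is only a stand-in for the much richer semantic resources a real system would use.

```python
# Hypothetical concept table: a concept and the words that express it.
CONCEPTS = {
    "car": {"car", "automobile", "vehicle"},
    "doctor": {"doctor", "physician"},
}

# Hypothetical document collection, keyed by document id.
DOCS = {
    1: "the physician examined the patient",
    2: "a new automobile was unveiled",
    3: "the car dealer met a doctor",
}


def keyword_search(query, docs):
    """Return ids of documents that contain the literal query word."""
    return [doc_id for doc_id, text in docs.items() if query in text.split()]


def concept_search(query, docs, concepts):
    """Return ids of documents that contain any word for the query's concept."""
    terms = concepts.get(query, {query})
    return [doc_id for doc_id, text in docs.items() if terms & set(text.split())]


print(keyword_search("doctor", DOCS))            # literal match only
print(concept_search("doctor", DOCS, CONCEPTS))  # also matches "physician"
```

The keyword search misses document 1 because it never contains the word "doctor", while the concept search finds it through the shared concept.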

Additionally, text mining software can be used to build large dossiers of information about specific people and events. For example, large datasets based on data extracted from news reports can be built to facilitate social network analysis or counter-intelligence. In effect, the text mining software may act in a capacity similar to an intelligence analyst or research librarian, albeit with a more limited scope of analysis.

Text mining is also used in some email spam filters as a way of determining the characteristics of messages that are likely to be advertisements or other unwanted material.
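
One common way to learn the characteristics of unwanted messages is a naive Bayes classifier over word counts. The article does not say which method spam filters use, so the technique, the training messages, and the label names in this Python sketch are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical labeled training messages ("ham" means wanted mail).
TRAINING = [
    ("win a free prize now", "spam"),
    ("free offer click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the project report", "ham"),
]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Count word occurrences per label and messages per label."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in examples:
        word_counts[label].update(tokenize(text))
        label_counts[label] += 1
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Pick the label with the higher log-probability (add-one smoothing)."""
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    total_messages = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in ("spam", "ham"):
        score = math.log(label_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


word_counts, label_counts = train(TRAINING)
print(classify("free prize offer", word_counts, label_counts))
print(classify("project meeting on monday", word_counts, label_counts))
```

Add-one smoothing keeps a word that was never seen in training from driving a label's probability to zero. The sketch assumes that both labels appear at least once in the training data.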

External links

Marti Hearst: What Is Text Mining? (October, 2003)

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
