Text mining
Text mining, sometimes alternately referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
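The three stages described above (structuring the input, deriving patterns, and interpreting the output) can be illustrated with a minimal sketch. It assumes a tiny hypothetical stop-word list and uses TF-IDF term weighting as one common stand-in for "statistical pattern learning"; the function names and sample documents are illustrative only and do not correspond to any particular text mining product.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "on"}


def tokenize(text):
    """Structuring step: lowercase, split into word tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tf_idf(documents):
    """Pattern step: return one {term: weight} dict per document."""
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    weights = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def top_terms(weights, k=2):
    """Interpretation step: the k highest-weighted terms (ties broken alphabetically)."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


docs = [
    "The cat sat on the mat",
    "The dog chased the cat",
    "Stocks fell on the market",
]
for doc, w in zip(docs, tf_idf(docs)):
    print(doc, "->", top_terms(w))
```

Terms that occur in every document receive a weight of zero under this scheme, which is one simple way of expressing that they carry little novelty or relevance for distinguishing documents.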
Text mining can find content based on meaning and context (rather than just by a specific word).
Additionally, text mining software can be used to build large dossiers of information about specific people and events. For example, large datasets based on data extracted from news reports can be built to facilitate social network analysis or counter-intelligence. In effect, the text mining software may act in a capacity similar to an intelligence analyst or research librarian, albeit with a more limited scope of analysis.
Text mining is also used in some email spam filters as a way of determining the characteristics of messages that are likely to be advertisements or other unwanted material.
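As a rough illustration of how a filter might learn the characteristics of unwanted messages, the following sketch implements a small multinomial naive Bayes classifier with add-one smoothing. This is one widely used technique, not necessarily what any given spam filter does; the class name, labels, and training messages are hypothetical, and the sketch assumes at least one training message per label.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesSpamFilter:
    """Hypothetical two-class filter; labels are "spam" and "ham"."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.word_counts[label].update(tokens(text))
        self.message_counts[label] += 1

    def _log_score(self, text, label):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_words = sum(self.word_counts[label].values())
        total_msgs = sum(self.message_counts.values())
        # Log prior: fraction of training messages carrying this label.
        score = math.log(self.message_counts[label] / total_msgs)
        for tok in tokens(text):
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            likelihood = (self.word_counts[label][tok] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        return score

    def classify(self, text):
        return max(("spam", "ham"), key=lambda label: self._log_score(text, label))


spam_filter = NaiveBayesSpamFilter()
spam_filter.train("win free money now", "spam")
spam_filter.train("free prize claim now", "spam")
spam_filter.train("meeting agenda for monday", "ham")
spam_filter.train("lunch with the project team", "ham")

print(spam_filter.classify("claim your free money"))
print(spam_filter.classify("agenda for the team meeting"))
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together.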
History
Labor-intensive manual text mining approaches first surfaced in the mid-1980s, but technological advances have enabled the field to advance during the past decade. Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. As most information (common estimates say over 80%) is currently stored as text, text mining is believed to have a high commercial potential value.
Increasing interest is being paid to multilingual data mining: the ability to gain information across languages and cluster similar items from different linguistic sources according to their meaning.
Security applications
Many text mining software packages are marketed for security applications, especially analysis of plain text sources such as Internet news. Text mining is also involved in the study of text encryption.
Biomedical applications
A range of text mining applications in the biomedical literature has been described. One of the more important online text mining applications in the biomedical literature is GoPubMed, which was the first semantic search engine on the Web. Another example is PubGene, which combines biomedical text mining with network visualization as an Internet service.
Software and applications
Text mining methods and software are also being researched and developed by major firms, including IBM and Microsoft, to further automate the mining and analysis processes, and by different firms working in the area of search and indexing in general as a way to improve their results.
Within the public sector, much effort has been concentrated on creating software for tracking and monitoring terrorist activities.
Online media applications
Text mining is being used by large media companies, such as the Tribune Company, to disambiguate information and to provide readers with greater search experiences, which in turn increases site "stickiness" and revenue. Additionally, on the back end, editors are benefiting by being able to share, associate and package news across properties, significantly increasing opportunities to monetize content.
Marketing applications
Text mining is starting to be used in marketing as well, more specifically in analytical customer relationship management. Coussement and Van den Poel (2008) apply it to improve predictive analytics models for customer churn (customer attrition).
Sentiment analysis
Sentiment analysis may involve analysis of movie reviews for estimating how favorable a review is for a movie. Such an analysis may need a labeled data set or labeling of the affectivity of words. A resource for affectivity of words has been made for WordNet.
Text has been used to detect emotions in the related area of affective computing. Text-based approaches to affective computing have been used on multiple corpora such as student evaluations, children's stories and news stories.
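A lexicon-based approach to estimating how favorable a review is can be sketched as follows. The affect lexicon here is a small hypothetical word list with invented valence scores, not the WordNet resource mentioned above, and the threshold is an arbitrary assumption.

```python
import re

# Hypothetical affect lexicon: word -> valence in [-1, 1].
AFFECT = {
    "great": 0.8,
    "wonderful": 0.9,
    "good": 0.5,
    "boring": -0.6,
    "bad": -0.5,
    "terrible": -0.9,
}


def review_score(text):
    """Mean valence of the lexicon words found in the review (0.0 if none)."""
    words = re.findall(r"[a-z']+", text.lower())
    scored = [AFFECT[w] for w in words if w in AFFECT]
    return sum(scored) / len(scored) if scored else 0.0


def label_review(text, threshold=0.1):
    """Map the mean valence to a coarse favorability label."""
    score = review_score(text)
    if score > threshold:
        return "favorable"
    if score < -threshold:
        return "unfavorable"
    return "neutral"


for review in (
    "A wonderful film with a great cast",
    "Boring plot and terrible acting",
    "The film runs two hours",
):
    print(review, "->", label_review(review))
```

This sketch ignores negation ("not good"), intensifiers, and context, which is one reason a labeled data set is often preferred over word-level scores alone.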
Academic applications
The issue of text mining is of importance to publishers who hold large databases of information needing indexing for retrieval. This is especially true in scientific disciplines, in which highly specific information is often contained within written text. Therefore, initiatives have been taken such as Nature's proposal for an Open Text Mining Interface (OTMI) and the National Institutes of Health's common Journal Publishing Document Type Definition (DTD) that would provide semantic cues to machines to answer specific queries contained within text without removing publisher barriers to public access.
Academic institutions have also become involved in the text mining initiative:
- The National Centre for Text Mining (NaCTeM) is the first publicly funded text mining centre in the world. NaCTeM is operated by the University of Manchester in close collaboration with the Tsujii Lab, University of Tokyo. NaCTeM provides customised tools and research facilities and offers advice to the academic community. It is funded by the Joint Information Systems Committee (JISC) and two of the UK Research Councils (EPSRC & BBSRC). With an initial focus on text mining in the biological and biomedical sciences, research has since expanded into the areas of social sciences.
- In the United States, the School of Information at University of California, Berkeley is developing a program called BioText to assist biology researchers in text mining and analysis.
Notable software and applications
Text mining computer programs are available from many commercial and open source companies and sources.
Commercial
- AeroText – provides a suite of text mining applications for content analysis. Content used can be in multiple languages.
- Attensity – hosted, integrated and stand-alone text mining (analytics) software that uses natural language processing technology to address collective intelligence in social media and forums; the voice of the customer in surveys and emails; customer relationship management; e-services; research and e-discovery; risk and compliance; and intelligence analysis.
- Autonomy – suite of text mining, clustering and categorization solutions for a variety of industries.
- Basis Technology – provides a suite of text analysis modules to identify language, enable search in more than 20 languages, extract entities, and efficiently search for and translate entities.
- Clarabridge – offers SaaS, hosted or on-premise sentiment and text analytics (text mining) software solutions that utilize natural language processing (NLP), machine learning, clustering and categorization to extract insights from unstructured and structured data.
- Endeca Technologies – provides software to analyze and cluster unstructured text.
- Expert System S.p.A. – suite of semantic technologies and products for developers and knowledge managers.
- Fair Isaac – leading provider of decision management solutions powered by advanced analytics (includes text analytics).
- Inxight – provider of text analytics, search, and unstructured visualization technologies. (Inxight was bought by Business Objects, which was bought by SAP AG in 2008.)
- LanguageWare – text analysis libraries and customization tooling from IBM.
- Language Computer Corporation – provides a suite of customizable text extraction and analysis tools, available in multiple languages.
- LexisNexis – provider of business intelligence solutions based on an extensive news and company information content set. Through the recent acquisition of Datops, LexisNexis is leveraging its search and retrieval expertise to become a player in the text and data mining field.
- Mathematica – provides built-in tools for text alignment, pattern matching, clustering and semantic analysis.
- Nstein Technologies – text mining solution that creates rich metadata to allow publishers to increase page views, increase site stickiness, optimize SEO, automate tagging, improve search experience, increase editorial productivity, decrease operational publishing costs, and increase online revenues. In combination with search engines it is used to create semantic search applications.
- SAS – solutions including SAS Text Miner and Teragram; commercial text analytics, natural language processing, and taxonomy software leveraged for information management. SAS Text Miner was rated as the third most used text mining software (9%) by Rexer's Annual Data Miner Survey in 2010.
- IBM SPSS – provider of IBM SPSS Modeler and IBM SPSS Text Analytics (now called IBM SPSS Modeler Premium). Rated as the second (17%) and fourth (7%), respectively, most used text mining software by Rexer's Annual Data Miner Survey in 2010.
- StatSoft – provides STATISTICA Text Miner as an optional extension to STATISTICA Data Miner, for predictive analytics solutions. Rated as the top used text mining software (19%) by Rexer's Annual Data Miner Survey in 2010.
- Thomson Data Analyzer – enables complex analysis on patent information, scientific publications and news.
Free libre open-source
- Carrot2 – text and search results clustering framework.
- GATE (General Architecture for Text Engineering) – natural language processing and language engineering tool.
- OpenNLP – natural language processing.
- Natural Language Toolkit (NLTK) – a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python programming language.
- RapidMiner with its Text Processing Extension – data and text mining software. Rated as the fifth most used text mining software (6%) by Rexer's Annual Data Miner Survey in 2010.
- Unstructured Information Management Architecture (UIMA) – a component framework to analyze unstructured content such as text, audio and video, originally developed by IBM.
- tm: Text Mining Package – a framework for text mining applications within R, originally created by Ingo Feinerer as part of his dissertation at the Institute for Statistics and Mathematics of the Vienna University of Economics and Business Administration.
Implications
Until recently, websites most often used text-based searches, which only found documents containing specific user-defined words or phrases. Now, through use of a semantic web, text mining can find content based on meaning and context (rather than just by a specific word).
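The difference between the two kinds of search can be illustrated with a minimal Python sketch. It is not how any particular search engine or semantic web system works: the `CONCEPTS` table is a hypothetical, hand-written stand-in for real semantic knowledge, and the function names and sample documents are likewise invented for illustration.

```python
import re

# Hypothetical sample documents, invented for this illustration.
DOCS = [
    "The automobile industry reported record sales.",
    "Car sales rose sharply in March.",
    "The bank raised interest rates.",
]

# Hypothetical hand-written table mapping a word to related terms.
# A real system would derive such relations from an ontology or from data.
CONCEPTS = {"car": {"automobile", "vehicle"}}


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_search(query, documents):
    """Return indices of documents that contain every query word verbatim."""
    terms = set(tokenize(query))
    return [i for i, doc in enumerate(documents) if terms <= set(tokenize(doc))]


def expand(term, concepts):
    """Return the term together with any related terms listed for it."""
    return {term} | concepts.get(term, set())


def concept_search(query, documents, concepts):
    """Return indices of documents that mention each query word or a related term."""
    hits = []
    for i, doc in enumerate(documents):
        words = set(tokenize(doc))
        if all(words & expand(term, concepts) for term in tokenize(query)):
            hits.append(i)
    return hits


print(keyword_search("car", DOCS))            # [1]
print(concept_search("car", DOCS, CONCEPTS))  # [0, 1]
```

The plain keyword search misses the first document because it never uses the word "car", while the concept-expanded search also retrieves it through the related term "automobile".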
Additionally, text mining software can be used to build large dossiers of information about specific people and events. For example, large datasets based on data extracted from news reports can be built to facilitate social network analysis or counter-intelligence. In effect, the text mining software may act in a capacity similar to an intelligence analyst or research librarian, albeit with a more limited scope of analysis.
Text mining is also used in some email spam filters as a way of determining the characteristics of messages that are likely to be advertisements or other unwanted material.
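One common way to learn such characteristics is a naive Bayes classifier over word counts. The following Python sketch is an illustration under stated assumptions, not the method of any specific spam filter: the class name, the tiny training messages and the add-one smoothing scheme are all chosen here for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a message and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesSpamFilter:
    """Toy multinomial naive Bayes classifier with add-one smoothing."""

    LABELS = ("ham", "spam")  # ties are resolved in favour of "ham"

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.message_counts = {label: 0 for label in self.LABELS}

    def train(self, text, label):
        """Record the words of one labelled message."""
        self.word_counts[label].update(tokenize(text))
        self.message_counts[label] += 1

    def _log_score(self, tokens, label):
        """Log of the smoothed prior times the smoothed word likelihoods."""
        vocabulary = set(self.word_counts["ham"]) | set(self.word_counts["spam"])
        total_words = sum(self.word_counts[label].values())
        total_messages = sum(self.message_counts.values())
        # Smoothed prior, so a label with no training messages is still defined.
        score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
        # One extra vocabulary slot is reserved for words never seen in training.
        denominator = total_words + len(vocabulary) + 1
        for token in tokens:
            score += math.log((self.word_counts[label][token] + 1) / denominator)
        return score

    def classify(self, text):
        """Return the label with the higher log score for the message."""
        tokens = tokenize(text)
        return max(self.LABELS, key=lambda label: self._log_score(tokens, label))


# Hypothetical training messages, invented for this illustration.
spam_filter = NaiveBayesSpamFilter()
spam_filter.train("Buy cheap pills now", "spam")
spam_filter.train("Cheap offer, buy now", "spam")
spam_filter.train("Meeting agenda for Monday", "ham")
spam_filter.train("Lunch on Monday?", "ham")

print(spam_filter.classify("cheap pills offer"))         # spam
print(spam_filter.classify("agenda for lunch meeting"))  # ham
```

Scores are summed in log space to avoid numeric underflow on long messages; a practical filter would train on far more data and typically use additional features beyond individual words.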
See also
- Approximate nonnegative matrix factorization, an algorithm used for text mining
- BioCreative, text mining evaluation in biomedical literature
- Business intelligence
- Computational linguistics
- Concept Mining
- Data mining
- Information retrieval
- Name resolution
- National Centre for Text Mining (NaCTeM)
- Natural language processing
- Stop words
- Text analytics
- Text classification sometimes is considered a (sub)task of text mining.
- OpenNLP, Java NLP library from Apache
- UIMA, Unstructured Information Management Architecture from IBM.
- Web mining, a task that may involve text mining (e.g. first find appropriate web pages by classifying crawled web pages, then extract the desired information from the text content of these pages considered relevant).
- w-shingling
External links
- Marti Hearst: What Is Text Mining? (October, 2003)