AeroText
AeroText is a suite of text mining applications used for content analysis. The content can be in multiple languages.
AeroText was developed at the Integrated Systems and Solutions division of Lockheed Martin Corporation, a leading U.S. defense contractor. Rocket Software acquired AeroText from Lockheed Martin on June 5, 2008, and continues to develop and support it.
History
Originally developed for the U.S. intelligence community (Department of Defense), the solution has become one of the leading solutions available and is often integrated into other solutions. AeroText solutions provide both information extraction and link analysis capabilities.
Functionality
AeroText converts unstructured information into structured information, and the user can define the parameters of both. AeroText output is normalized and stored within the solution’s cache as templates. The information can also be output in a variety of ways: the Run Time Integration Toolkit (RIT) integrates the output into existing systems through RIT modules. Wrappers for XML and the DARPA Agent Markup Language (DAML) are also provided, making the solution flexible enough to be used in other domains. For instance, the solution was presented to the National Institutes of Health’s Biomedical Computing Interest Group (BCIG) in April 2002, where it was shown to apply well to the biomedical domain.
“AeroText is data-independent, which means it does not rely on or have a bias towards a particular domain, document type, document source, or natural language” (Haser and Childs, 2002). Sample target applications include automatic database generation, document routing, browsing, summarization, enhanced full-text search, and targeted document search, in addition to link analysis. The solution’s multilingual utility is also a strength. The technology can also support format standards such as DAML (Kogut and Holmes), which aid in law enforcement activities.
The current 5.x release consists of a set of components that are used to carry out integration and data mining tasks. The Integrated Development Environment (IDE) is perhaps the most important component, as it provides the rule development, modification, and coordination capabilities; it is described as “a complete environment to build, test, and analyze linguistic knowledge bases” (Kogut and Holmes). This graphical interface includes not only object-oriented editors and rule wizards but also visual tools for analyzing extracted data, debugging linguistic data, and analyzing performance (AeroText). As a result, customized logic domains are available.
The Instance Based Run-Time Engine carries out the extraction on input documents by applying a Knowledge Base (see below). According to the company, “an Instance is defined as the creation of a single Document Object in the AeroText Application Program Interface (API).” The engine is available through Java, C, or COM APIs and has wrappers for XML and DAML.
The Run Time Integration Toolkit (RIT) helps to deploy AeroText by minimizing the need for integration code. It integrates AeroText output into existing systems through the use of RIT modules.
The Corpus Analyzer clusters documents based on entity and conceptual similarities between documents.
The Answer Key Editor creates an information store for scoring by assigning “an Answer Key that corresponds to a specific collection of documents” (AeroText). This Key objectively measures the accuracy of the extraction process. The scoring capability is integrated into the development environment, enabling the developer to identify and analyze extraction errors in large sets of data during the development process.
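The idea of scoring extraction output against an answer key can be sketched in a few lines of Python. This is only an illustration, not AeroText's scoring method: the `Mention` record, the exact-match comparison, and the choice of precision, recall, and F1 as metrics are all assumptions made for the example.

```python
from __future__ import annotations

from typing import NamedTuple


class Mention(NamedTuple):
    """Hypothetical record: one extracted item in one document."""
    doc_id: str
    entity_type: str
    text: str


def score(answer_key: set[Mention], extracted: set[Mention]) -> dict[str, float]:
    """Compare engine output with a hand-built answer key (exact match only)."""
    correct = answer_key & extracted
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(answer_key) if answer_key else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def errors(answer_key: set[Mention], extracted: set[Mention]):
    """Return (missed, spurious) mentions so a developer can inspect them."""
    return answer_key - extracted, extracted - answer_key


if __name__ == "__main__":
    key = {
        Mention("doc1", "Organization", "AAA Corporation"),
        Mention("doc1", "Organization", "ZZZ Inc."),
        Mention("doc1", "Location", "Tampa"),
    }
    output = {
        Mention("doc1", "Organization", "AAA Corporation"),
        Mention("doc1", "Organization", "ZZZ Inc."),
    }
    print(score(key, output))
    print(errors(key, output))
```

Listing the missed and spurious items separately mirrors the error analysis described above: a missed item suggests a rule is absent or too narrow, while a spurious one suggests a rule is too broad.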
Much of the solution’s technology is provided within the company’s Knowledge Bases (KBs). English serves as the core KB and provides linguistically driven rules, covering nearly 100 entity types, that are used to extract text. KBs are also available for the Arabic, Chinese (simplified and traditional), Spanish, and Indonesian (including Melayu) languages. A KB Compiler is used to convert “linguistic data files into an efficient run-time knowledge base” (Kogut and Holmes).
AeroText’s solution components are available separately or as one of two product bundles. The Standard bundle includes the IDE, Instance Based Run-Time Engine, Core English Knowledge Base, and the Customization Tool. The Professional bundle includes the Standard components as well as the Corpus Analyzer and the Answer Key Editor.
AeroText accepts textual input in either ASCII or Unicode, both of which the Instance Based Run-Time Engine supports.
AeroText’s main focus is on information extraction, which includes both named entity extraction and intrasource link analysis. “AeroText information extraction technology is designed for natural language text” (AeroText, 2003). The company has organized its capabilities into several groupings. For information extraction, entities (persons, organizations, places, etc.), key phrases (time expressions, money amounts, etc.), and grammatical phrases (verb phrases, etc.) can all be extracted. For link analysis, the solution provides entity coreference (resolution of multiple mentions of the same entity, including pronouns), entity associations (identification of relationships), event extraction (who, what, when, where), topic categorization (subject matter determinations), temporal resolution (resolution of time expressions, etc.), and location resolution (identification of a particular place, which can be tied to GIS). Additionally, the company’s BlockFinder can be used to understand textual tables (Haser and Childs, 2002).
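Temporal resolution can be pictured with a small Python sketch that resolves a relative time expression against a known date. It is not AeroText code: the regular expression, the function name `resolve_within`, and the use of the document date as the anchor are assumptions made for illustration. The `text`, `StartDate`, and `EndDate` field names follow the time template described in this article, and the sentence is the article's own example.

```python
import re
from datetime import date, timedelta

# Hypothetical pattern for one kind of relative time expression.
WITHIN_DAYS = re.compile(r"within (\d+) days", re.IGNORECASE)


def resolve_within(text: str, anchor: date):
    """Resolve the first 'within N days' expression against an anchor date.

    Returns a dict with the matched text plus StartDate and EndDate,
    or None when the text contains no such expression.
    """
    match = WITHIN_DAYS.search(text)
    if match is None:
        return None
    days = int(match.group(1))
    return {
        "text": match.group(0),
        "StartDate": anchor,
        "EndDate": anchor + timedelta(days=days),
    }


sentence = ("Feb. 28, 2002 AAA Corporation will acquire "
            "Tampa-based ZZZ Inc. within 60 days.")
resolved = resolve_within(sentence, anchor=date(2002, 2, 28))
print(resolved["text"], resolved["StartDate"], resolved["EndDate"])
# prints: within 60 days 2002-02-28 2002-04-29
```

A rule-based system needs one such pattern per way of phrasing a time span, which is why the rule sets described below tend to grow with each new domain.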
The solution gains its flexibility and broad range of applicability from the fact that the system is based on manually crafted rules. These rules are used to perform both entity extraction and intrasource link analysis. While the modules developed for one application will be highly subject-matter specific, the solution can be modified to handle the requirements of a different domain. Therefore, in order to use the solution, “an AeroText specialist must generate a set of extraction rules. These rules describe for AeroText how to identify and structure the information to be extracted. In effect, they create fairly abstract templates that describe all the different ways a concept can be expressed in the target language” (Noble, b). These rules not only extract the information from the text, but also specify how the information should be structured within event records (Noble, a).
Haser and Childs (2002) explain that the fundamental components of the solution include features, elements, templates, packages, rulebases, and caches. These terms are explained using the following example: “Feb. 28, 2002 AAA Corporation will acquire Tampa-based ZZZ Inc. within 60 days.”
- A feature is “a list of terms that represents a common idea based on meaning or grammar,” e.g., ‘inc.’ and ‘corp.’ are business designations {CorpDesignator}.
- An element is “a set of regular expressions that allow binding of information to matched text”; for instance, “FEB” and “February” both refer to the second month (month = “2”).
- A template is “a frame with slots used to hold extracted text and sometimes related information.” A time template, for example, would include a “text” field as well as “StartDate” and “EndDate” fields.
- A package is “a set of rules, similar to elements, but with associated actions that fill template slots with extracted information.” The example above would have Time, Organization, and Location templates into which extracted information could be organized.
- A rulebase is “a collection of packages that are activated at the appropriate time during a processing sequence.” This example would have the Time and Organization templates feed into an Acquisition template.
- A cache provides “a virtual bin for storing extracted information.”
An entities cache stores times, organizations, and other such information, while an events cache stores event information, such as acquisitions. A high-level overview of how the solution is set up is provided by the adjacent figure. Given a test document, a knowledge engineer produces an answer key of expected output, while the knowledge base engine uses pre-packaged and user-developed rules to extract the entities and relationships from the text. These two outputs are compared and scored. If changes need to be made, the knowledge engineer creates additional rules or makes other enhancements to the knowledge base (which in turn updates the knowledge base engine).
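The terms defined in the list above (feature, element, template, package, rulebase, cache) can be made concrete with a toy Python sketch run on the example sentence. It only mirrors the vocabulary; it does not reproduce AeroText's rule language or API. Every name here (`Template`, `RULEBASE`, `run`, the regular expressions, and the "last organization before the verb buys the first one after it" heuristic) is an assumption made for the example.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from datetime import date

# Feature: a list of terms that represent a common idea.
CORP_DESIGNATOR = ["Corporation", "Corp.", "Inc."]

# Elements: regular expressions that bind information to matched text.
MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}
DATE_ELEMENT = re.compile(r"\b([A-Z][a-z]{2})[a-z]*\.? (\d{1,2}), (\d{4})")
ORG_ELEMENT = re.compile(
    r"\b[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)* (?:%s)"
    % "|".join(re.escape(term) for term in CORP_DESIGNATOR))
LOCATION_ELEMENT = re.compile(r"\b([A-Z][a-z]+)-based\b")


@dataclass
class Template:
    """Template: a frame with slots that hold extracted text."""
    kind: str
    slots: dict


# Packages: rules with actions that fill template slots.
def time_package(text, cache):
    for m in DATE_ELEMENT.finditer(text):
        month = MONTHS.get(m.group(1).lower())
        if month is None:
            continue  # capitalized word that is not a month name
        start = date(int(m.group(3)), month, int(m.group(2)))
        cache["entities"].append(
            Template("Time", {"text": m.group(0), "StartDate": start}))


def organization_package(text, cache):
    for m in ORG_ELEMENT.finditer(text):
        cache["entities"].append(
            Template("Organization", {"text": m.group(0), "start": m.start()}))


def location_package(text, cache):
    for m in LOCATION_ELEMENT.finditer(text):
        cache["entities"].append(Template("Location", {"text": m.group(1)}))


def acquisition_package(text, cache):
    """Combine earlier Organization templates into an Acquisition event."""
    verb = re.search(r"\bwill acquire\b", text)
    if verb is None:
        return
    orgs = [t for t in cache["entities"] if t.kind == "Organization"]
    buyers = [o for o in orgs if o.slots["start"] < verb.start()]
    targets = [o for o in orgs if o.slots["start"] > verb.end()]
    if buyers and targets:
        cache["events"].append(Template("Acquisition", {
            "buyer": buyers[-1].slots["text"],
            "target": targets[0].slots["text"]}))


# Rulebase: packages activated in a fixed processing sequence.
RULEBASE = [time_package, organization_package,
            location_package, acquisition_package]


def run(text):
    cache = defaultdict(list)  # Cache: bins for entities and events
    for package in RULEBASE:
        package(text, cache)
    return cache


SENTENCE = ("Feb. 28, 2002 AAA Corporation will acquire "
            "Tampa-based ZZZ Inc. within 60 days.")

if __name__ == "__main__":
    result = run(SENTENCE)
    for template in result["entities"] + result["events"]:
        print(template.kind, template.slots)
```

The ordering in `RULEBASE` matters in this sketch: the acquisition package reads Organization templates that earlier packages placed in the entities cache, which is a simplified picture of one template feeding into another.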
Further reading
Haser, Tom and Childs, Lois (2002). “Drug Discovery through Information Extraction Technology.” Presentation at NIH BCIG. April 18, 2002. Online. http://www.altum.com/bcig/events/seminars/502002_04.pdf and http://www.altum.com/bcig/events/seminars/2002_04.htm Accessed January 9, 2006.
Hill, Ryan (2005). Lockheed Martin Signs NetMap Analytics as Authorized Distributor of AeroText™ Information Extraction Software. August 3, 2005. Online. http://www.netmapanalytics.com/press/AeroText.pdf Accessed January 9, 2006.
KMWorld. KMWorld Buyers Guide: Lockheed Martin Corporation. Online. http://www.kmworld.com/buyersGuide/ReadCompany.aspx?CategoryID=77&CompanyID=17
Kogut, Paul and Holmes, William. AeroDAML: Applying Information Extraction to Generate DAML Annotations from Web Pages. Online. http://semannot2001.aifb.uni-karlsruhe.de/positionpapers/AeroDAML3.pdf
Mordoff, Keith (2004). Lockheed Martin’s NEW AeroText™ Version 4.0 Helps Users Tackle Data Overload, Pinpoint Critical Information. April 14, 2005. Online. http://www.lockheedmartin.com/data/assets/10586.pdf
Noble, David (a). Fusion of Open Source Information. Online. http://www.ebrinc.com/files/Noble_Fusion.pdf
Noble, David (b). Structuring Open Source Information to Support Intelligence Analysis. Online. http://www.ebrinc.com/files/Noble_Structuring.pdf
Roberts, Gregory (2003). AeroText™ Products: Executive Summary Information. Online. http://www.lockheedmartin.com/data/assets/3504.pdf
Taylor, Sarah M. (2004). “Information Extraction Tools: Deciphering Human Language.” IT Professional. Vol. 06, no. 6, pages 28-34. November/December, 2004. Online. http://ieeexplore.ieee.org/iel5/6294/30282/01390870.pdf?tp=&arnumber=1390870&isnumber=30282
External links
- AeroText homepage at Lockheed Martin
- AeroText for Linux
- AeroText MACE results
- CMD Solutions: apparent reseller/provider
- DAML presentation
- Defense Intelligence Agency Boosts Search Firepower
See also
- Data mining
- Lexical analysis
- DAML