Data transformation
- This article is about data transformation in computer science (metadata). For the statistical application, see data transformation (statistics).
In metadata and data warehouse contexts, a data transformation converts data from a source data format into a destination data format.
Data transformation can be divided into two steps:
- data mapping, which maps data elements from the source to the destination and captures any transformation that must occur
- code generation, which creates the actual transformation program
Data element to data element mapping is frequently complicated by complex transformations that require one-to-many and many-to-one transformation rules.
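As a minimal sketch of what one-to-many and many-to-one rules can look like, the following Python fragment maps a source record to a destination record. The field names (`full_name`, `street`, `city`) and the rule functions are hypothetical and chosen only for illustration.

```python
# Hypothetical source record; the field names are assumptions for this sketch.
source = {"full_name": "Ada Lovelace", "street": "12 St James's Square", "city": "London"}


def split_name(record):
    """One-to-many rule: one source element feeds two destination elements."""
    first, _, last = record["full_name"].partition(" ")
    return {"first_name": first, "last_name": last}


def join_address(record):
    """Many-to-one rule: two source elements feed one destination element."""
    return {"address": f'{record["street"]}, {record["city"]}'}


# The mapping is an ordered list of rules, each producing destination elements.
MAPPING = [split_name, join_address]


def apply_mapping(record, mapping=MAPPING):
    """Build the destination record by applying every rule to the source record."""
    destination = {}
    for rule in mapping:
        destination.update(rule(record))
    return destination


result = apply_mapping(source)
```

Keeping each rule as a separate function makes it easy to see which source elements feed which destination elements.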
The code generation step takes the data element mapping specification and creates an executable program that can be run on a computer system. Code generation can also create transformation programs in easy-to-maintain computer languages such as Java or XSLT.
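The idea of generating a program from a mapping specification can be sketched in a few lines. The example below emits Python source rather than Java or XSLT, purely to keep the sketch short. The specification format (`SPEC`), the function names and the field names are all assumptions made for illustration, not a standard.

```python
# Hypothetical mapping specification: each destination element is paired with
# an expression over the source record, which the generated code calls `src`.
SPEC = {
    "first_name": 'src["full_name"].split(" ")[0]',
    "last_name": 'src["full_name"].split(" ")[-1]',
    "address": 'src["street"] + ", " + src["city"]',
}


def generate_transform_source(spec, func_name="generated_transform"):
    """Emit the source text of a function that applies the mapping."""
    lines = [f"def {func_name}(src):", "    dst = {}"]
    for dest, expr in spec.items():
        lines.append(f"    dst[{dest!r}] = {expr}")
    lines.append("    return dst")
    return "\n".join(lines) + "\n"


generated = generate_transform_source(SPEC)

# Load the generated program. Using exec is acceptable here only because the
# specification is trusted; a real generator would normally write a file.
namespace = {}
exec(generated, namespace)
generated_transform = namespace["generated_transform"]

record = {"full_name": "Ada Lovelace", "street": "1 Main St", "city": "London"}
output = generated_transform(record)
```

Printing `generated` shows the emitted program, which can be inspected or maintained by hand like any other source file.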
When the mapping is indirect via a mediating data model, the process is also called data mediation.
Transformational languages
There are numerous languages available for performing data transformation. Many transformation languages require a grammar to be provided. In many cases the grammar is structured using something closely resembling Backus–Naur Form (BNF). The languages available for such purposes vary in their accessibility (cost) and general usefulness. Examples of such languages include:
- AWK - one of the oldest and most popular text data transformation languages;
- Perl - a high-level language with both procedural and object-oriented syntax, capable of powerful operations on binary or text data;
- Template languages - specialized for transforming data into documents (see also template processor);
- TXL - a language for prototyping language-based descriptions, used for source code or data transformation;
- XSLT - the standard XML data transformation language (XQuery is a suitable alternative in many applications).
Although transformational languages are typically best suited for transformation, something as simple as regular expressions can be used to achieve useful transformation. A text editor like Emacs or TextPad supports the use of regular expressions with arguments. This allows all instances of a particular pattern to be replaced with another pattern using parts of the original pattern. For example:
foo ("some string", 42, gCommon);
bar (someObj, anotherObj);
foo ("another string", 24, gCommon);
bar (myObj, myOtherObj);
could both be transformed into a more compact form like:
foobar("some string", 42, someObj, anotherObj);
foobar("another string", 24, myObj, myOtherObj);
In other words, all instances of a function invocation of foo with three arguments, followed by a function invocation of bar with two arguments, would be replaced with a single function invocation using some or all of the original set of arguments.
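The foo/bar rewrite described above can be sketched with Python's `re` module instead of a text editor. The pattern assumes that no argument contains a comma or parenthesis of its own, which holds for this sample but not for C code in general.

```python
import re

source = '''foo ("some string", 42, gCommon);
bar (someObj, anotherObj);
foo ("another string", 24, gCommon);
bar (myObj, myOtherObj);
'''

# An invocation of foo with three arguments followed by an invocation of bar
# with two arguments. Each argument is captured as a numbered group.
PATTERN = re.compile(
    r'\bfoo\s*\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*,\s*([^,()]+?)\s*\)\s*;\s*'
    r'bar\s*\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*\)\s*;'
)

# Groups 1, 2, 4 and 5 are reused in the replacement; group 3 (gCommon) is dropped.
result = PATTERN.sub(r'foobar(\1, \2, \4, \5);', source)
```

The backreferences `\1`, `\2`, `\4` and `\5` play the role of the "arguments" that the editor's regular expression replacement would use.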
Another advantage of using regular expressions is that they will not fail the null transform test. To run the test, take your transformational language of choice and run a sample program through a transformation that doesn't perform any transformations; the output should be identical to the input. Many transformational languages will fail this test.
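A null transform test can be sketched as a simple equality check between input and output. In the sketch below, a JSON parse-and-regenerate round trip stands in for a tool that rebuilds its output from a parsed representation. It is an assumption chosen for illustration, not a claim about any particular transformation language.

```python
import json
import re

# Sample input with irregular spacing and a line break inside the object.
sample = '{"b":  1,\n "a": [1,2]}\n'

# Regex-based transform whose pattern matches nothing in the sample.
regex_output = re.sub(r'foo\s*\(', 'foobar(', sample)

# Parse-and-regenerate transform that deliberately changes nothing in the data.
roundtrip_output = json.dumps(json.loads(sample))

# The null transform test: is the output text identical to the input text?
regex_passes = regex_output == sample
roundtrip_passes = roundtrip_output == sample

# The round trip preserves the data even though it fails the textual test.
same_data = json.loads(roundtrip_output) == json.loads(sample)
```

Here the regular expression leaves untouched text exactly as it was, while the round trip normalizes whitespace and so changes the text even though no transformation was requested.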
Transforming source code
Program synthesis, automatic programming and other fields use data transformation strategies to translate, adapt or even generate software source code. Conversely, these source transformation tools can be used for data transformation, typically to transform "document source code" such as HTML or an XML dialect (see also template processors).
For further information on (software) source transformation see(Chapter 2.4) or.
Generally, the different types of transformations fall into one of two categories:
- Translation: a transformation from a language X into another language Y.
- Rephrasing: a transformation within the same language, in which the same content is merely stated a different way.
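The two categories can be illustrated with a small sketch. The rephrasing rule rewrites one C statement form into an equivalent C statement, and the translation rule converts `key=value` lines into JSON. Both functions and their input formats are hypothetical examples, not part of any standard tool.

```python
import json
import re


def rephrase(c_source):
    """Rephrasing (C to C): rewrite "x = x + n;" as the equivalent "x += n;"."""
    return re.sub(r'\b(\w+)\s*=\s*\1\s*\+\s*(\w+)\s*;', r'\1 += \2;', c_source)


def translate(properties_text):
    """Translation: turn "key=value" lines into the text of a JSON object."""
    pairs = (
        line.split("=", 1)
        for line in properties_text.splitlines()
        if "=" in line
    )
    return json.dumps({key.strip(): value.strip() for key, value in pairs})


rephrased = rephrase("count = count + 1;")
translated = translate("host = example.org\nport = 8080\n")
```

The rephrasing rule is deliberately conservative: it only fires when the right-hand side is the same variable plus a single identifier or number, so statements such as `x = x + y * 2;` are left unchanged.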
Example
A difficult problem to address in C++ is "unstructured preprocessor directives". These are preprocessor directives that do not contain blocks of code with simple grammatical descriptions, as in this function definition:
void MyFunc()
{
  if (x>17)
  { printf("test");
# ifdef FOO
  } else {
# endif
    if (gWatch)
      mTest = 42;
  }
}
A really general solution to handling this is very hard because such preprocessor directives can essentially edit the underlying language in arbitrary ways.
However, because such directives are not, in practice, used in completely arbitrary ways, one can build practical tools for handling preprocessed languages. The DMS Software Reengineering Toolkit is capable of handling structured macros and preprocessor conditionals.
Brabrand and Schwartzbach (2000) offer another approach, replacing the C preprocessor with a metamorphic one.
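As a rough illustration of why the function definition in this example counts as unstructured, the sketch below flags conditional directives whose guarded text has unbalanced braces. It is a heuristic written for this article's example only: it ignores braces inside strings and comments, and it is not how DMS or any other tool mentioned here works.

```python
import re

# Matches the conditional directives this sketch cares about.
DIRECTIVE = re.compile(r'^\s*#\s*(ifdef|ifndef|if|else|elif|endif)\b')


def unstructured_conditionals(source):
    """Return 1-based line numbers of conditionals guarding unbalanced braces."""
    flagged = set()
    stack = []  # one [start_line, depth, closed_too_early] entry per open conditional
    for number, line in enumerate(source.splitlines(), start=1):
        match = DIRECTIVE.match(line)
        if match is None:
            # Ordinary code line: update the brace depth of every open region.
            for region in stack:
                for ch in line:
                    if ch == "{":
                        region[1] += 1
                    elif ch == "}":
                        region[1] -= 1
                        if region[1] < 0:
                            region[2] = True
            continue
        kind = match.group(1)
        if kind in ("if", "ifdef", "ifndef"):
            stack.append([number, 0, False])
        elif stack:
            # A branch ends at #else, #elif or #endif; check its balance.
            region = stack[-1]
            if region[1] != 0 or region[2]:
                flagged.add(region[0])
            if kind == "endif":
                stack.pop()
            else:
                region[1], region[2] = 0, False
    return sorted(flagged)


example = '''void MyFunc()
{
  if (x>17)
  { printf("test");
# ifdef FOO
  } else {
# endif
    if (gWatch)
      mTest = 42;
  }
}
'''

structured = '''#ifdef FOO
  if (x) { log(); }
#endif
'''

flagged_lines = unstructured_conditionals(example)
clean_lines = unstructured_conditionals(structured)
```

In the example, the text guarded by the `# ifdef FOO` on line 5 closes a brace that was opened outside the conditional, so that line is reported. The structured variant produces no reports.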
See also
Languages and typical transforms:
- ATLAS Transformation Language (ATL)
- Identity transform
- QVT
- TXL (general)
- XQuery (XML)
- XSLT (XML)