Data exchange
Data exchange is the process of taking data structured under a source schema and transforming it into data structured under a target schema, so that the target data is an accurate representation of the source data. Data exchange is similar to the related concept of data integration, except that in data exchange the data is actually restructured, with possible loss of content. There may be no way to transform an instance given all of the constraints. Conversely, there may be numerous ways to transform the instance (possibly infinitely many), in which case a "best" choice of solutions must be identified and justified.
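As a minimal sketch of the idea, the following Python fragment restructures records from a source schema into a target schema. Both schemas (`SourceEmployee`, `TargetPerson`) and the `exchange` function are hypothetical names invented for this illustration. The example assumes that the phone number has no place in the target schema (loss of content) and that the last word of a name is the family name, which is only one of several defensible choices.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical source schema: one record per employee.
@dataclass
class SourceEmployee:
    full_name: str
    office: str
    phone: Optional[str]


# Hypothetical target schema: the name is split, the phone field is dropped.
@dataclass
class TargetPerson:
    given_name: str
    family_name: str
    site: str


def exchange(record: SourceEmployee) -> TargetPerson:
    """Restructure one source record into the target schema.

    Raises ValueError when no target instance can represent the record.
    """
    parts = record.full_name.split()
    if len(parts) < 2:
        # No solution exists under the constraint "both name parts required".
        raise ValueError(f"cannot split name: {record.full_name!r}")
    # One of several possible solutions: the last token is the family name.
    return TargetPerson(
        given_name=" ".join(parts[:-1]),
        family_name=parts[-1],
        site=record.office,
    )


source = [
    SourceEmployee("Ada Lovelace", "London", "555-0100"),
    SourceEmployee("Alan Mathison Turing", "Manchester", None),
]
target = [exchange(r) for r in source]
```

A record such as `SourceEmployee("Plato", "Athens", None)` has no valid target instance under these assumptions, which corresponds to the case where no transformation exists.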
Data exchange languages
A data exchange language is a domain-independent language that can be used for any kind of data. Its semantic expression capabilities and qualities are largely judged by comparison with the capabilities of natural languages. The term is also applied to any file format that can be read by more than one program, including proprietary formats such as Microsoft Office documents. However, a file format is not a real language, as it lacks a grammar and vocabulary.
Practice has shown that certain types of formal languages are better suited for this task than others, since their specification is driven by a formal process instead of the implementation needs of a particular piece of software. For example, XML is a markup language that was designed to enable the creation of dialects (the definition of domain-specific sublanguages) and is now a popular choice, in particular on the internet. However, it does not contain domain-specific dictionaries or fact types. Reliable data exchange benefits from the availability of standard dictionary-taxonomies and of tool libraries such as parsers, schema validators and transformation tools.
Popular languages used for data exchange
The following is an incomplete list of popular generic languages used for data exchange in multiple domains.
Language | Schemas | Flexible | Semantic verification | Dictionary-Taxonomy | Synonyms and homonyms | Dialecting | Web standard | Transformations | Lightweight | Human readable | Compatibility |
---|---|---|---|---|---|---|---|---|---|---|---|
XML | | | | | | | | | | | subset of SGML, HTML |
Atom | | | | | | | | | | | XML dialect |
JSON | | | | | | | | | | | subset of JavaScript |
YAML | | | | | | | | | | | superset of JSON |
REBOL | | | | | | | | | | | |
Gellish | | | | | | | ISO | | | | SQL, RDF/XML, OWL |
Nomenclature
- Schemas - Whether the language definition is available in a computer interpretable form.
- Flexible - Whether the language enables extension of the semantic expression capabilities without modifying the schema.
- Semantic verification - Whether the language definition enables semantic verification of the correctness of expressions in the language.
- Dictionary-Taxonomy - Whether the language includes a dictionary and a taxonomy (subtype-supertype hierarchy) of concepts with inheritance.
- Synonyms and homonyms - Whether the language includes and supports the use of synonyms and homonyms in the expressions.
- Dialecting - Whether the language definition is available in multiple natural languages or dialects.
- Web or ISO standard - Organization that endorsed the language as a standard.
- Transformations - Whether the language includes a translation to other standards.
- Lightweight - Whether a lightweight version is available, in addition to a full version.
- Human readable - Whether expressions in the language are readable by humans without training.
- Compatibility - Which other tools are possible or required when using the language.
Notes:
- The schema of XML contains a very limited grammar and vocabulary.
- Available as an extension.
- In the default format, not the compact syntax.
- The syntax is fairly simple (the language was designed to be human-readable); the dialects may require domain knowledge.
- The standardized fact types are denoted by standardized English phrases, whose interpretation and use require some training.
- The Parse dialect is used to specify, validate, and transform dialects.
- The English version includes a Gellish English Dictionary-Taxonomy that also includes standardized fact types (kinds of relations).
XML for data exchange
XML is popular for data exchange on the World Wide Web for several reasons. First of all, it is closely related to the preexisting standards Standard Generalized Markup Language (SGML) and Hypertext Markup Language (HTML), so a parser written to support these two languages can be easily extended to support XML as well. For example, XHTML has been defined as a format that is well-formed XML but is understood correctly by most (if not all) HTML parsers. This led to quick adoption of XML support in web browsers and in the toolchains used for generating web pages.
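The role of widely available parsers can be illustrated with a short sketch that reads a small XML document using Python's standard `xml.etree.ElementTree` module. The `<order>` and `<item>` elements and their attributes form a hypothetical dialect invented for this example; they are not part of any standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical dialect: an <order> element holding <item> children.
document = """<order id="42">
  <item sku="A1" quantity="2">Widget</item>
  <item sku="B7" quantity="1">Gadget</item>
</order>"""

# Parse the text into an element tree; malformed XML raises ET.ParseError.
root = ET.fromstring(document)

# Convert each <item> element into a plain dictionary.
items = [
    {
        "sku": item.get("sku"),
        "quantity": int(item.get("quantity")),
        "name": item.text,
    }
    for item in root.findall("item")
]
```

The parser only checks that the document is well-formed. Checking it against the rules of a particular dialect would require a separate schema validator.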
JSON for data exchange
JSON (JavaScript Object Notation) originated as part of the JavaScript programming language and was split out into a low-level format for structured data exchange. While it was originally not designed for data exchange at all, it was discovered to be useful. In contrast to XML, the core JSON format defines no schema language and has no support for dialecting. The key benefits of this language are its low overhead (the amount of data needed for structuring) compared to XML and its similarly wide support: every web browser that has JavaScript support can also process JSON.
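The overhead claim can be checked informally with the sketch below, which serializes the same record as compact JSON and as XML using only Python's standard library. The record and the element-per-field XML layout are assumptions made for this illustration; other XML encodings, for example one based on attributes, would give different sizes.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record used for both encodings.
record = {"name": "Widget", "sku": "A1", "quantity": 2}

# Compact JSON: no whitespace after separators.
as_json = json.dumps(record, separators=(",", ":"))

# One possible XML encoding: a child element per field.
element = ET.Element("item")
for key, value in record.items():
    child = ET.SubElement(element, key)
    child.text = str(value)
as_xml = ET.tostring(element, encoding="unicode")

# JSON keeps the number type on the way back; the XML text would need casting.
round_tripped = json.loads(as_json)

print(len(as_json), len(as_xml))
```

For this record the JSON text is shorter than the XML text, because each field name appears once rather than in both an opening and a closing tag.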
YAML for data exchange
YAML is a language that was designed to be human-readable and, as such, easy to edit with any standard text editor. Its notation is often similar to reStructuredText or a wiki syntax, which also try to be readable by both humans and computers. YAML 1.2 also includes a shorthand notation that is compatible with JSON, so any JSON document is also valid YAML; the converse, however, does not hold.
REBOL for data exchange
REBOL is a language that was designed to be human-readable and easy to edit using any standard text editor. To achieve that, it uses a simple free-form syntax with minimal punctuation and a rich set of datatypes. REBOL datatypes such as URLs, e-mail addresses, date and time values, tuples, strings and tags respect the common standards. REBOL is designed in a metacircular fashion so that it does not need any additional meta-language. This metacircularity is the reason why, for example, the Parse dialect, which is used (not exclusively) for definitions and transformations of REBOL dialects, is itself a dialect of REBOL. REBOL was a source of inspiration for the designer of JSON.
Gellish for data exchange
Gellish English is a formalized subset of natural English. It includes a simple grammar and a large, extensible English Dictionary-Taxonomy that defines general and domain-specific terminology (terms for concepts). The concepts are arranged in a subtype-supertype hierarchy (a taxonomy), which supports inheritance of knowledge and requirements. The Dictionary-Taxonomy also includes standardized fact types (also called relation types). The terms and relation types together can be used to create and interpret expressions of facts, knowledge, requirements and other information. Gellish can be used in combination with SQL, RDF/XML, OWL and various other meta-languages. The Gellish standard is being adopted as ISO 15926-11.
See also
- Atom (file format)
- Gellish
- Lightweight markup language
- Markup language
- REBOL
- RSS
- Standard Generalized Markup Language