SXML
SXML is a way to write and process XML data in the form of S-expressions.
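Textual correspondence between SXML and XML for a sample XML snippet is shown below:
XML | SXML |
---|---|
<tag attr1="value1" ... | (tag (@ (attr1 "value1") ... |
The following two observations can be drawn from the above example:
- The textual notations for XML and SXML are much alike; informally, SXML differs from XML in using parentheses instead of angle brackets.
- SXML is not only a straightforward textual notation for XML data but also a primary data structure for a family of functional programming languages, which gives an illustrative way to process XML data with a general-purpose programming language.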
The similarity between XML and S-expressions, made concrete in SXML, allows close integration between XML data and programming language expressions, which makes XML data processing clearer and simpler for an application programmer.
Motivation
XML data processing languages suggested by the W3 Consortium, such as XPath, XSLT, XQuery, etc., are not general-purpose programming languages and are thus not sufficient for implementing complete applications. For this reason, most XML applications are implemented by means of traditional programming languages, such as C and Java, or scripting languages, for example Perl, JavaScript and Python.
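An attempt to combine two different languages (for example, XPath and Java) leads to a problem known as impedance mismatch.
The impedance mismatch problem has two aspects:
- Different data models. For example, XPath models an XML document as a tree, while most general-purpose programming languages have no native data type for a tree.
- Different programming paradigms. For example, XSLT is a functional language, while Java is object-oriented and Perl is procedural.
Impedance mismatch requires complicated converters and APIs (for example, the DOM) to combine two such languages.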
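The impedance mismatch problem can be reduced, and even eliminated, by using the Scheme functional programming language for XML data processing:
- Nested lists (S-expressions in Scheme) provide a natural representation for nested XML documents. Scheme represents its code and data as nested lists of dynamically typed elements. An XML document, being a hierarchical structure of nested XML elements, can be thought of as a hierarchical nested Scheme list (a so-called S-expression).
- Scheme is a functional language, as most XML languages are (e.g. XSLT and XQuery). Scheme processes (nested) lists in a recursive manner, which can be thought of as traversing or transforming the document tree.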
Scheme, being a Lisp dialect, is a widely recognized scripting language.
It is one of the most elegant and compact programming languages in practical use: the description of the Scheme standard is only 40 pages long.
Scheme is a high-level programming language, suitable for fast prototyping.
Moreover, Scheme programs are generally several times more compact than the equivalent C programs.
XML and SXML textual notations are much alike: informally, SXML replaces XML start/end tags with opening/closing parentheses.
Moreover, SXML is an S-expression, the main data structure of the Scheme programming language, and consequently SXML can be easily and naturally processed in Scheme.
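The recursive style of processing described above can be illustrated with a minimal sketch. The procedure name sxml-text and the sample data are assumptions made for illustration and are not part of any SXML library; the code uses only standard Scheme procedures and relies on the reader accepting @ as a symbol, as SXML-aware implementations do.

```scheme
;; Collect all character data of an SXML node, in document order.
;; Assumption: attribute lists (lists tagged with @) contribute no text.
(define (sxml-text node)
  (cond
    ((string? node) node)                          ; character data
    ((and (pair? node) (eq? (car node) '@)) "")    ; skip attribute lists
    ((pair? node)                                  ; element: recurse into members
     (apply string-append (map sxml-text (cdr node))))
    (else "")))

;; Hypothetical sample element.
(define doc
  '(p (@ (class "note")) "SXML is " (b "just") " a list."))

(sxml-text doc) ; evaluates to "SXML is just a list."
```

The traversal needs no parser, converter or tree API: the document is already an ordinary list, so car, cdr and map are sufficient.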
XML, XML Information Set and SXML
An XML document is essentially a tree structure.
The start and end tags of the root element enclose the whole content of the document, which may include other elements or arbitrary character data.
Text with the familiar angle brackets is an external representation of an XML document.
Applications ought to deal with an internalized form: an XML Information Set, or its specializations (such as the DOM).
This internalized form lets an application locate specific data or transform an XML tree into another tree.
The W3 Consortium defines the XML Information Set (Infoset) as an abstract data set that describes the information available in a well-formed XML document.
An XML document's information set consists of a number of information items, which denote elements, attributes, character data, processing instructions, and other components of the document.
Each information item has a number of associated properties, e.g., name, namespace URI.
Some properties, for example "children" and "attributes", are collections of other information items.
Although technically the Infoset is specified for XML, it largely applies to other semi-structured data formats, in particular HTML.
XML document parsing is just one of the possible ways to create an instance of the XML Infoset.
It is worth noting that the XML Information Set Recommendation does not attempt to be exhaustive, nor does it constitute a minimum set of information items and properties.
Its purpose is to provide a consistent set of definitions for use in other specifications that need to refer to the information in a well-formed XML document.
The abstract data model defined in the XML Information Set Recommendation is applicable to every XML-related specification of the W3 Consortium.
Namely, the Document Object Model can be considered the application programming interface (API) for dealing with information items; the XPath data model uses the concept of nodes, which can be derived from information items, etc.
The DOM and the XPath data model are thus two instances of the XML Information Set.
The XML Information Set Recommendation itself imposes no restrictions on data structures or interfaces for accessing information items.
Different interpretations are thus possible for the XML Information Set abstract data model.
For example, it is convenient to consider an XML Information Set a tree structure, and the terms "information set" and "information item" are then similar in meaning to the generic terms "tree" and "node" respectively.
An information item may also be considered a container for its properties, which are either text strings (e.g. name, namespace URI) or containers themselves (e.g. child elements for an XML element).
The information set is thus a hierarchy of nested containers.
Such a hierarchy of containers comprising text strings and other containers lends itself to being described by an S-expression, because the latter is recursively defined as a list whose members are either atomic values or S-expressions themselves.
S-expressions are easy to parse into an internal representation suitable for traversal; they also have a simple external notation, which is relatively easy to compose even by hand.
SXML is a concrete instance of the XML Infoset in the form of S-expressions.
Infoset's goal is to present in some form all relevant pieces of data and their abstract, container-slot relationships with each other.
SXML gives the nest of containers a concrete realization as S-expressions, and provides means of accessing items and their properties.
SXML is a "relative" of XPath and the DOM, whose data models are two other instances of the XML Infoset.
SXML is particularly suitable for Scheme-based XML/HTML authoring, XPath queries, and tree transformations.
XML and SXML can thus be considered two syntactically different representations for the XML Information Set.
Further discussion on SXML in this section is based on the SXML specification.
A simplified SXML grammar in EBNF notation is presented below.
An SXML name is a single Scheme symbol.
Since an information item in the XML Infoset is a sum of its properties, a list is a particularly suitable data structure to represent an item.
The head of the list, a Scheme identifier, names the item.
For many items this is their (expanded) item name.
For an information item that denotes an XML element, the corresponding list starts with the element name, optionally followed by a collection of attributes.
The rest of the element item list is an ordered sequence of element children: character data, processing instructions, and other elements in turn.
Every child is unique; items never share their children even if the latter have identical content.
The following example illustrates an XML element and its SXML form (which satisfies the element production in the SXML grammar).
The value of an attribute is normally a string; it may be omitted (in the case of HTML) for a boolean attribute, e.g., the "certified" attribute in the above example.
A collection of attributes is considered an information item in its own right, tagged with a special name @.
The character "@" may not occur in a well-formed XML name; therefore an attribute list cannot be mistaken for a list that represents an element.
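The list layout described above (a name, an optional attribute list tagged with @, then the children) can be sketched with a few accessors. The procedure names and the sample element are hypothetical and chosen only for illustration; they are not taken from the SXML specification.

```scheme
;; Hypothetical sample element; names and values are illustrative only.
(define item
  '(weight (@ (unit "pound"))
     (net (@ (certified)) "67")
     (gross "95")))

;; True if x is an attribute list, i.e. a list tagged with @.
(define (attr-list? x)
  (and (pair? x) (eq? (car x) '@)))

;; True if the member right after the element name is an attribute list.
(define (has-attrs? elem)
  (and (pair? (cdr elem)) (attr-list? (cadr elem))))

;; The head of the list names the item.
(define (sxml-name elem) (car elem))

;; The attributes of elem, or the empty list if it has no attribute list.
(define (sxml-attributes elem)
  (if (has-attrs? elem) (cdr (cadr elem)) '()))

;; Everything after the name and the optional attribute list.
(define (sxml-children elem)
  (if (has-attrs? elem) (cddr elem) (cdr elem)))

(sxml-name item)                              ; weight
(sxml-attributes item)                        ; ((unit "pound"))
(sxml-attributes (car (sxml-children item)))  ; ((certified)), a value-less attribute
```

Because @ cannot be an XML name, the single eq? test in attr-list? is enough to tell an attribute list from a child element.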
An XML document renders attributes, processing instructions and other meta-data differently from the element markup.
In contrast, SXML represents element content and meta-data uniformly, as tagged lists.
RELAX NG, a schema language for XML, also aims to treat attributes as uniformly as possible with elements.
This uniform treatment, argues James Clark, is a significant factor in simplifying the language.
SXML takes advantage of the fact that every XML name is also a valid Scheme identifier, but not every Scheme identifier is a valid XML name.
This observation lets us introduce administrative names such as @, *PI*, *TOP* without worrying about potential name clashes.
The observation also makes the relationship between XML and SXML well-defined.
An XML document converted to SXML can be reconstructed into an equivalent XML document (in terms of the Infoset).
Moreover, due to the implementation freedom given by the Infoset specification, SXML itself is an instance of the Infoset.
The XML Recommendation specifies that processing instructions (PI) are distinct from elements and character data; processing instructions must be passed through to applications.
In SXML, PIs are therefore represented by nodes of a dedicated type *PI*.
XPath and the DOM Level 2 treat processing instructions in a similar way.
A sample XML document and its SXML representation are both shown below, thus providing an illustrative comparison between nested XML tags and nested SXML lists.
Note that the SXML document is a bit more compact than its XML counterpart.
SXML can also be considered an abstract syntax tree for a parsed XML document.
An XML document, or a well-formed part of it, can automatically be converted into the corresponding SXML form via SSAX, a functional XML parsing framework for Scheme.
It is worth noting that SXML represents all the information contained in XML documents, including comments, namespaces and external entities.
These are omitted in this section for the sake of simplicity, but they are considered in the SXML specification.
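Consider an XHTML page which looks like this:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>An example page</title>
  </head>
  <body>
    <h1 id="greeting">Hi, there</h1>
    <p>This is just an &gt;&gt;example&lt;&lt; to show XHTML &amp; SXML.</p>
  </body>
</html>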
When translated to SXML it looks like this:
(*TOP* (@ (*NAMESPACES* (x "http://www.w3.org/1999/xhtml")))
  (x:html (@ (xml:lang "en") (lang "en"))
    (x:head
      (x:title "An example page"))
    (x:body
      (x:h1 (@ (id "greeting")) "Hi, there")
      (x:p "This is just an >>example<< to show XHTML & SXML."))))
Each element's tag pair is replaced by a set of parentheses. The tag's name is not repeated at the end; it is simply the first symbol in the list. The element's contents follow, which are either elements themselves or strings. No special syntax is required for XML attributes: in SXML they are represented as just another node, which has the special name @. This cannot cause a name clash with an actual "@" tag, because @ is not allowed as a tag name in XML. This is a common pattern in SXML: whenever a tag is used to indicate a special status or something that is not possible in XML, a name is used that does not constitute a valid XML identifier.
We can also see that there is no need to escape otherwise meaningful characters like & and > as &amp; and &gt; entities. All string content is automatically escaped because it is considered to be pure content, with no tags or entities in it. This also makes it much easier to insert autogenerated content, and there is no danger of forgetting to escape user input when displaying it to other users (which could lead to cross-site scripting attacks and other annoyances).
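The point about escaping can be illustrated with a minimal serializer sketch. It is not the serializer of any particular SXML library: the procedure names are hypothetical, only elements, attribute lists and strings are handled (no *TOP*, *PI* or namespace nodes), and writing a value-less attribute as name="name" is an assumption.

```scheme
;; Replace markup-significant characters by entity references.
;; The double quote is escaped only inside attribute values.
(define (escape s in-attribute?)
  (apply string-append
         (map (lambda (c)
                (cond ((char=? c #\&) "&amp;")
                      ((char=? c #\<) "&lt;")
                      ((char=? c #\>) "&gt;")
                      ((and in-attribute? (char=? c #\")) "&quot;")
                      (else (string c))))
              (string->list s))))

;; attr is (name "value") or (name); a missing value is written
;; as the attribute's own name (assumption).
(define (attr->xml attr)
  (let ((name (symbol->string (car attr))))
    (string-append " " name "=\""
                   (escape (if (pair? (cdr attr)) (cadr attr) name) #t)
                   "\"")))

;; Serialize an element or a string; escaping happens here, in one place.
(define (sxml->xml node)
  (cond
    ((string? node) (escape node #f))
    ((pair? node)
     (let* ((name (symbol->string (car node)))
            (has-attrs? (and (pair? (cdr node))
                             (pair? (cadr node))
                             (eq? (car (cadr node)) '@)))
            (attrs (if has-attrs? (cdr (cadr node)) '()))
            (kids (if has-attrs? (cddr node) (cdr node))))
       (string-append "<" name
                      (apply string-append (map attr->xml attrs))
                      ">"
                      (apply string-append (map sxml->xml kids))
                      "</" name ">")))
    (else "")))

(sxml->xml '(p (@ (id "x")) "XHTML & SXML"))
;; evaluates to "<p id=\"x\">XHTML &amp; SXML</p>"
```

Since strings in the tree are always plain content, the code that builds the tree never escapes anything itself.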
The tail (cdr) of an SXML attribute list forms an association list, so that, when SXML is read into a Lisp program, any SXML attribute can be extracted from an attribute list using Lisp's built-in assoc function.
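A minimal sketch of such a lookup follows. The helper name attr-value and the sample attribute list are assumptions made for illustration.

```scheme
;; Hypothetical attribute list.
(define attrs '(@ (lang "en") (id "greeting")))

;; Dropping the leading @ leaves the association list
;; ((lang "en") (id "greeting")), which assoc can search directly.
;; Returns the value string, or #f if the attribute is absent or has no value.
(define (attr-value attr-list name)
  (let ((entry (assoc name (cdr attr-list))))
    (and entry (pair? (cdr entry)) (cadr entry))))

(attr-value attrs 'id)     ; "greeting"
(attr-value attrs 'class)  ; #f
```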
For the SXML data model, attributes and processing instructions look like regular elements with a distinguished name.
Therefore, query and transformation functions dedicated to attributes become redundant, because ordinary functions with distinguished names can be used.
The uniform representation for SXML elements and attributes is especially convenient for practical tasks.
Differences between elements and attributes in XML are blurred.
Choosing either an element or an attribute for representing concrete practical information is often a question of style, and such a choice can later be changed.
Such a change in a data structure is expressed in SXML as simply an addition or removal of one hierarchy level, namely an attribute list.
This requires only minimal modification of an SXML application.
For the SXML notation, the only difference between an attribute and an element is that the former is contained within the attribute-list (which is a special SXML node) and cannot have nested elements.
For example, if data restructuring requires that the weight of the delivered load, initially represented as a nested element, be represented as an attribute, the SXML element
will be changed to
Such a notation for elements and attributes simplifies SXML data restructuring and allows uniform queries to be used for data processing.
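The kind of restructuring discussed above can be sketched as follows. The element names (delivery, weight, item) and the procedure names are hypothetical; the sketch assumes that the child being moved contains a single string, so that it already has the (name "value") shape of an attribute.

```scheme
;; Keep the members of lst that satisfy pred.
(define (keep pred lst)
  (cond ((null? lst) '())
        ((pred (car lst)) (cons (car lst) (keep pred (cdr lst))))
        (else (keep pred (cdr lst)))))

;; Move every child element called name into the attribute list of elem,
;; creating the attribute list if elem does not have one yet.
(define (child->attribute elem name)
  (let* ((has-attrs? (and (pair? (cdr elem))
                          (pair? (cadr elem))
                          (eq? (car (cadr elem)) '@)))
         (attrs (if has-attrs? (cdr (cadr elem)) '()))
         (kids (if has-attrs? (cddr elem) (cdr elem)))
         (named? (lambda (x) (and (pair? x) (eq? (car x) name))))
         (moved (keep named? kids))
         (others (keep (lambda (x) (not (named? x))) kids)))
    (cons (car elem)
          (cons (cons '@ (append attrs moved))
                others))))

(child->attribute '(delivery (weight "67") (item "crate")) 'weight)
;; evaluates to (delivery (@ (weight "67")) (item "crate"))
```

The moved node itself is untouched; only its position in the hierarchy changes, which is why the restructuring amounts to adding one level.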
for nodes of this tree.
An SXML node can be defined on the basis of SXML grammar as a single production [N] given below.
Alternatively, an SXML node can be defined as a set of two mutually recursive datatypes: [N1], [N2] and [N3].
In the latter case, a Node is constructed by adding a name to the Nodelist as its leftmost member; a Nodelist is itself a (sorted) list of Nodes.
Such a consideration emphasizes SXML tree structure and the uniform representation for information items as S-expressions.
This makes it possible and convenient for Scheme programs to be treated as semi-structured data and vice versa.
Since an SXML document and its nodes are S-expressions, they can be used for representing a Scheme program. For this to be possible, it is sufficient that the first member of every list contained in the SXML tree is a function; the use of macros offers more possibilities. The remaining members of the list are then the arguments passed to that function. In accordance with the SXML grammar, attribute and element names and special names must be bound to functions.
An SXML document or an SXML node that fulfills these requirements may be considered a Scheme program, which can be evaluated, for example, by means of the eval function.
For example, if para and bold are defined as functions as follows:
then the following SXML element
can be treated as a program, and the result of its evaluation is the SXML element:
Note that the result of evaluating such a program is not necessarily an SXML element.
For instance, a program may return a textual representation of the source data in XML or HTML, or even have a side effect, such as saving the SXML data in a relational database.
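The idea of treating an SXML node as a program can be sketched without relying on eval, whose environment argument differs between Scheme implementations. In this sketch the definitions of para and bold, the table bindings and the procedure run are all assumptions made for illustration: names are looked up in an explicit table instead of the global environment.

```scheme
;; Hypothetical functions: each builds an SXML element from its arguments.
(define (para . args) (cons 'p args))
(define (bold . args) (cons 'b args))

;; Table binding SXML names to procedures.
(define bindings
  (list (cons 'para para)
        (cons 'bold bold)))

;; Evaluate an SXML node as a program: strings evaluate to themselves;
;; a list applies the procedure bound to its head to the evaluated members.
(define (run node)
  (cond
    ((string? node) node)
    ((pair? node)
     (let ((entry (assq (car node) bindings)))
       (if entry
           (apply (cdr entry) (map run (cdr node)))
           (error "unbound SXML name:" (car node)))))
    (else node)))

(run '(para "plain" (bold "highlighted") "plain"))
;; evaluates to (p "plain" (b "highlighted") "plain")
```

Binding the same names to procedures that return strings or perform output instead would turn the same document into a different program.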
Arbitrary positional queries to a node's children, as in DOM's NodeList.item or XPath's [exp] accessor (where exp is an expression that returns an integer), are not optimal, because the list must be traversed up to the nth node, which is O(n), versus O(1) access for a vector or array.
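The cost difference can be seen in a short sketch. The sample element and the name nth-child are hypothetical, and the element is assumed to have no attribute list.

```scheme
;; Hypothetical element without an attribute list.
(define elem '(ul (li "a") (li "b") (li "c")))

;; list-ref walks the list from its head, so each query costs O(n).
(define (nth-child elem n)
  (list-ref (cdr elem) n))

;; One O(n) conversion; each later vector-ref costs O(1).
(define children (list->vector (cdr elem)))

(nth-child elem 1)       ; (li "b")
(vector-ref children 2)  ; (li "c")
```

Converting the children to a vector pays off only when many positional queries are made against the same node.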
Only a subset of DOM and XPath traversal operations are possible with plain SXML, although a generator (parser or constructor) or a post-processor may annotate each node with the necessary information to fully support all traversal operations (e.g. the parent node and its index within the parent) or they may in fact annotate with any other information (e.g. references to and from other nodes by ID, reference to the represented object, etc.).
Detailed introduction, motivation and real-life case-studies of SSAX, SXML, SXPath and SXSLT.
The paper and the complementary talk presented at the International Lisp Conference 2002.
XML
Extensible Markup Language is a set of rules for encoding documents in machine-readable form. It is defined in the XML 1.0 Specification produced by the W3C, and several other related specifications, all gratis open standards....
data in the form of S-expression
S-expression
S-expressions or sexps are list-based data structures that represent semi-structured data. An S-expression may be a nested list of smaller S-expressions. S-expressions are probably best known for their use in the Lisp family of programming languages...
s.
Textual correspondence between SXML and XML for a sample XML snippet is shown below:
XML | SXML |
---|---|
| (tag (@ (attr1 "value1") |
The following two observation can be drawn from the above example:
- Textual notations for XML and SXML are much alike; informally, SXML textually differs from XML in relying on round brackets instead of angular braces.
- Additionally, SXML is not only a straightforward textual notation for XML data, but also a primary data structureData structureIn computer science, a data structure is a particular way of storing and organizing data in a computer so that it can be used efficiently.Different kinds of data structures are suited to different kinds of applications, and some are highly specialized to specific tasks...
for a family of functional programmingFunctional programmingIn computer science, functional programming is a programming paradigm that treats computation as the evaluation of mathematical functions and avoids state and mutable data. It emphasizes the application of functions, in contrast to the imperative programming style, which emphasizes changes in state...
languages, thus providing an illustrative approach for processing XML data with a general-purpose programming language.
Similarity between XML and S-expressions reified in SXML allows achieving close integration
System integration
In engineering, system integration is the bringing together of the component subsystems into one system and ensuring that the subsystems function together as a system...
between XML data and programming language
Programming language
A programming language is an artificial language designed to communicate instructions to a machine, particularly a computer. Programming languages can be used to create programs that control the behavior of a machine and/or to express algorithms precisely....
expressions, resulting in illustrativeness and simplicity of XML data processing
Data processing
Computer data processing is any process that a computer program does to enter data and summarise, analyse or otherwise convert data into usable information. The process may be automated and run on a computer. It involves recording, analysing, sorting, summarising, calculating, disseminating and...
for an application
Application software
Application software, also known as an application or an "app", is computer software designed to help the user to perform specific tasks. Examples include enterprise software, accounting software, office suites, graphics software and media players. Many application programs deal principally with...
programmer
Programmer
A programmer, computer programmer or coder is someone who writes computer software. The term computer programmer can refer to a specialist in one area of computer programming or to a generalist who writes code for many kinds of software. One who practices or professes a formal approach to...
.
Motivation
XML data processing languages suggested by the W3 ConsortiumWorld Wide Web Consortium
The World Wide Web Consortium is the main international standards organization for the World Wide Web .Founded and headed by Tim Berners-Lee, the consortium is made up of member organizations which maintain full-time staff for the purpose of working together in the development of standards for the...
, such as XPath
XPath
XPath is a language for selecting nodes from an XML document. In addition, XPath may be used to compute values from the content of an XML document...
, XSLT
XSLT
XSLT is a declarative, XML-based language used for the transformation of XML documents. The original document is not changed; rather, a new document is created based on the content of an existing one. The new document may be serialized by the processor in standard XML syntax or in another format,...
, XQuery
XQuery
- Features :XQuery provides the means to extract and manipulate data from XML documents or any data source that can be viewed as XML, such as relational databases or office documents....
etc., are not general-purpose programming languages and are thus not sufficient for implementing complete applications. For this reason, most XML applications are implemented by means of traditional programming languages, such as C and Java; or scripting languages, for example, Perl, JavaScript and Python.
An attempt to combine two different languages (for example, XPath and Java) leads to a problem known as impedance mismatch.
Impedance mismatch problem consists of two aspects:
- Different data modelData modelA data model in software engineering is an abstract model, that documents and organizes the business data for communication between team members and is used as a plan for developing applications, specifically how data is stored and accessed....
s. E.g. XPathXPathXPath is a language for selecting nodes from an XML document. In addition, XPath may be used to compute values from the content of an XML document...
models an XML document as a tree, while most general purpose programming languages have no native data typeData typeIn computer programming, a data type is a classification identifying one of various types of data, such as floating-point, integer, or Boolean, that determines the possible values for that type; the operations that can be done on values of that type; the meaning of the data; and the way values of...
s for a tree. - Different programming paradigmProgramming paradigmA programming paradigm is a fundamental style of computer programming. Paradigms differ in the concepts and abstractions used to represent the elements of a program and the steps that compose a computation A programming paradigm is a fundamental style of computer programming. (Compare with a...
s. Say, XSLTXSLTXSLT is a declarative, XML-based language used for the transformation of XML documents. The original document is not changed; rather, a new document is created based on the content of an existing one. The new document may be serialized by the processor in standard XML syntax or in another format,...
is a functionalFunctional programmingIn computer science, functional programming is a programming paradigm that treats computation as the evaluation of mathematical functions and avoids state and mutable data. It emphasizes the application of functions, in contrast to the imperative programming style, which emphasizes changes in state...
language, while JavaJava (programming language)Java is a programming language originally developed by James Gosling at Sun Microsystems and released in 1995 as a core component of Sun Microsystems' Java platform. The language derives much of its syntax from C and C++ but has a simpler object model and fewer low-level facilities...
is object-orientedObject-oriented programmingObject-oriented programming is a programming paradigm using "objects" – data structures consisting of data fields and methods together with their interactions – to design applications and computer programs. Programming techniques may include features such as data abstraction,...
, and PerlPerlPerl is a high-level, general-purpose, interpreted, dynamic programming language. Perl was originally developed by Larry Wall in 1987 as a general-purpose Unix scripting language to make report processing easier. Since then, it has undergone many changes and revisions and become widely popular...
is a proceduralProcedural programmingProcedural programming can sometimes be used as a synonym for imperative programming , but can also refer to a programming paradigm, derived from structured programming, based upon the concept of the procedure call...
one.
Impedance mismatch requires complicated converters and API
Application programming interface
An application programming interface is a source code based specification intended to be used as an interface by software components to communicate with each other...
s (for example, DOM
Document Object Model
The Document Object Model is a cross-platform and language-independent convention for representing and interacting with objects in HTML, XHTML and XML documents. Aspects of the DOM may be addressed and manipulated within the syntax of the programming language in use...
) to be used for combining such two languages.
Impedance mismatch problem can be reduced and even eliminated using the Scheme functional programming language for XML data processing
-
- Nested lists (S-expressions in Scheme) provide a natural representation for nested XML documents. Scheme represents its code and data as nested lists of dynamically typed elements. XML document, being a hierarchical structure of nested XML elements, can be thought of as a hierarchical nested Scheme list (so-called S-expression).
- Scheme is a functional language, as most XML-languages are (e.g. XSLT and XQuery). Scheme processes (nested) lists in a recursiveRecursion (computer science)Recursion in computer science is a method where the solution to a problem depends on solutions to smaller instances of the same problem. The approach can be applied to many types of problems, and is one of the central ideas of computer science....
manner which can be thought of as traversing/transforming the document tree.
Scheme, being a Lisp dialect, is a widely recognized scripting language
Scripting language
A scripting language, script language, or extension language is a programming language that allows control of one or more applications. "Scripts" are distinct from the core code of the application, as they are usually written in a different language and are often created or at least modified by the...
.
It is one of the most elegant and compact programming languages practically used: the description of Scheme standard consists
of 40 pages only.
Scheme is a high-level programming language
High-level programming language
A high-level programming language is a programming language with strong abstraction from the details of the computer. In comparison to low-level programming languages, it may use natural language elements, be easier to use, or be from the specification of the program, making the process of...
, suitable for fast prototyping.
Moreover, Scheme programs are generally several times more compact than the equivalent C
C (programming language)
C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....
programs.
XML and SXML textual notations are much alike: informally, SXML replaces XML start/end tags with opening/closing brackets.
On the other hand, SXML is an S-expression and is thus the main data structure
Data structure
In computer science, a data structure is a particular way of storing and organizing data in a computer so that it can be used efficiently.Different kinds of data structures are suited to different kinds of applications, and some are highly specialized to specific tasks...
for Scheme programming language, and consequently SXML can be easily and naturally processed via Scheme.
XML, XML Information Set and SXML
An XML document is essentially a tree structureTree (data structure)
In computer science, a tree is a widely-used data structure that emulates a hierarchical tree structure with a set of linked nodes.Mathematically, it is an ordered directed tree, more specifically an arborescence: an acyclic connected graph where each node has zero or more children nodes and at...
.
The start and the end tags of the root element
Root element
Each XML document has exactly one single root element. This element is also known as the document element. It encloses all the other elements and is therefore the sole parent element to all the other elements....
enclose the whole content of the document, which may include other elements or arbitrary character data.
Text with familiar angular brackets is an external representation of an XML document.
Applications ought to deal with an internalized form: an XML Information Set
XML Information Set
XML Information Set is a W3C specification describing an abstract data model of an XML document in terms of a set of information items...
, or its specializations (such as the DOM
Document Object Model
The Document Object Model is a cross-platform and language-independent convention for representing and interacting with objects in HTML, XHTML and XML documents. Aspects of the DOM may be addressed and manipulated within the syntax of the programming language in use...
).
This internalized form lets an application locate specific data or transform an XML tree into another tree.
The W3 Consortium
World Wide Web Consortium
The World Wide Web Consortium is the main international standards organization for the World Wide Web .Founded and headed by Tim Berners-Lee, the consortium is made up of member organizations which maintain full-time staff for the purpose of working together in the development of standards for the...
defines the XML Information Set (Infoset) as an abstract data set
Abstract data type
In computing, an abstract data type is a mathematical model for a certain class of data structures that have similar behavior; or for certain data types of one or more programming languages that have similar semantics...
that describes information available in a well-formed XML document.
An XML document's information set consists of a number of information items, which denote elements, attributes, character data, processing instruction
Processing Instruction
A Processing Instruction is an SGML and XML node type, which may occur anywhere in the document, intended to carry instructions to the application....
s, and other components of the document.
Each information item has a number of associated properties, e.g., name, namespace URI.
Some properties—for example, "children" and "attributes" -- are collections
Collection (computing)
In computer science, a collection is a grouping of some variable number of data items that have some shared significance to the problem being solved and need to be operated upon together in some controlled fashion. Generally, the data items will be of the same type or, in languages supporting...
of other information items.
Although technically Infoset is specified for XML, it largely applies to other semi-structured data
Semi-structured model
The semi-structured model is a database model. In this model, there is no separation between the data and the schema, and the amount of structure used depends on the purpose.The advantages of this model are the following:...
formats, in particular, HTML
HTML
HyperText Markup Language is the predominant markup language for web pages. HTML elements are the basic building-blocks of webpages....
.
XML document parsing
Parsing
In computer science and linguistics, parsing, or, more formally, syntactic analysis, is the process of analyzing a text, made of a sequence of tokens , to determine its grammatical structure with respect to a given formal grammar...
is just one of possible ways to create an instance of XML Infoset.
It is worth a note that XML Information Set recommendation does not attempt to be exhaustive, nor does it constitute a minimum set of information items and properties.
Its purpose is to provide a consistent set of definitions for use in other specifications that need to refer to the information in a well-formed XML document.
The abstract data model defined in the XML Information Set Recommendation is applicable to every XML-related specification of the W3 Consortium.
Namely, the Document Object Model can be considered the application programming interface
Application programming interface
An application programming interface is a source code based specification intended to be used as an interface by software components to communicate with each other...
(API) for dealing with information items; the XPath data model
Xpath data model
XPath is a language for selecting portions of an XML Document . XPath uses a specific conceptual interpretation of XML documents, referred to as the XPath Data Model. Technical documents on XML often use the same terminology as the XPath data model....
uses the concept of nodes which can be derived from information items, etc.
The DOM and the XPath data model are thus two instances of XML
Information Set.
The XML Information Set Recommendation itself imposes no restrictions on data structures or interfaces for accessing information items.
Different interpretations are thus possible for the XML Information Set abstract data model.
For example, it is convenient to consider an XML Information Set a tree structure, and the terms "information set" and "information item" are then similar in meaning to the generic terms "tree" and "node" respectively.
An information item may also be considered a container for its properties, which are either text strings (e.g. name, namespace URI) or containers themselves (e.g. child elements for an XML element).
The information set is thus a hierarchy of nested containers.
Such a hierarchy of containers comprising text strings and other containers lends itself well to description by an S-expression, because the latter is recursively defined as a list whose members are either atomic values or S-expressions themselves.
S-expressions are easy to parse into an internal representation suitable for traversal; they also have a simple external notation, which is relatively easy to compose even by hand.
SXML is a concrete instance of the XML Infoset in the form of S-expressions.
Infoset's goal is to present in some form all relevant pieces of data and their abstract, container-slot relationships with each other.
SXML gives the nest of containers a concrete realization as S-expressions, and provides means of accessing items and their properties.
SXML is a "relative" of XPath and the DOM, whose data models are two other instances of the XML Infoset.
SXML is particularly suitable for Scheme-based XML/HTML authoring, XPath queries, and tree transformations.
XML and SXML can thus be considered two syntactically different representations for the XML Information Set.
SXML Specification
As noted in the previous section, SXML is a concrete instance of the XML Infoset in the form of S-expressions. Further discussion of SXML in this section is based on the SXML specification.
A simplified SXML grammar in EBNF notation is presented below.
An SXML name is a single Scheme symbol.
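[1] <TOP> ::= ( *TOP* <PI>* <Element> )
[2] <Element> ::= ( <name> <attributes-list>? <child-of-element>* )
[3] <attributes-list> ::= ( @ <attribute>* )
[4] <attribute> ::= ( <name> "value"? )
[5] <child-of-element> ::= <Element> | "character data" | <PI>
[6] <PI> ::= ( *PI* pi-target "processing instruction content string" )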
Since an information item in the XML Infoset is a sum of its properties, a list is a particularly suitable data structure to represent an item.
The head of the list, a Scheme identifier, names the item.
For many items this is their (expanded) item name.
For an information item that denotes an XML element, the corresponding list starts with the element name, optionally followed by a collection of attributes.
The rest of the element item list is an ordered sequence of the element's children: character data, processing instructions, and other elements in turn.
Every child is unique; items never share their children even if the latter have the identical content.
The following example illustrates an XML element and its SXML form (which satisfies production [2] of the grammar):
(WEIGHT (@ (unit "pound")) ...)
The value of an attribute is normally a string; it may be omitted (in the case of HTML) for a boolean attribute, e.g., a "certified" attribute in the above example.
A collection of attributes is considered an information item in its own right, tagged with a special name @.
The character "@" may not occur in a well-formed XML name; therefore an attribute list cannot be mistaken for a list that represents an element.
An XML document renders attributes, processing instructions and other meta-data differently from
the element markup.
In contrast, SXML represents element content and meta-data uniformly—as tagged lists.
RELAX NG, a schema language for XML, also aims to treat attributes as uniformly as possible with elements.
This uniform treatment, argues James Clark, is a significant factor in simplifying the language.
SXML takes advantage of the fact that every XML name is also a valid Scheme identifier, but not every Scheme identifier is a
valid XML name.
This observation lets us introduce administrative names such as @, *PI*, *TOP* without worrying about potential name clashes.
The observation also makes the relationship between XML and SXML well-defined.
An XML document converted to SXML can be reconstructed into an equivalent XML document (in terms of the Infoset).
Moreover, due to the implementation freedom given by the Infoset specification, SXML itself is an instance of the Infoset.
The XML Recommendation specifies that processing instructions (PI) are distinct from elements and character data; processing instructions must be passed through to applications.
In SXML, PIs are therefore represented by nodes of a dedicated type *PI*.
XPath and the DOM Level 2 treat processing instructions in a similar way.
A sample XML document and its SXML representation are both shown below, thus providing an illustrative comparison between nested XML tags and nested SXML lists.
Note that the SXML document is a bit more compact than its XML counterpart.
(*TOP* (*PI* xml "version='1.0'") ...)
SXML can also be considered an abstract syntax tree for a parsed XML document.
An XML document, or a well-formed part of it, can be converted automatically into the corresponding SXML form by SSAX, a functional XML parsing framework for Scheme.
It is worth noting that SXML represents all the information contained in XML documents, including comments, namespaces and external entities.
These are omitted in this section for the sake of simplicity,
but they are considered in the SXML specification.
Example
For example, a simple XHTML page which looks like this:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>An example page</title>
  </head>
  <body>
    <h1 id="greeting">Hi, there!</h1>
    <p>This is just an &gt;&gt;example&lt;&lt; to show XHTML &amp; SXML.</p>
  </body>
</html>
When translated to SXML it looks like this:
(*TOP* (@ (*NAMESPACES* (x "http://www.w3.org/1999/xhtml")))
  (x:html (@ (xml:lang "en") (lang "en"))
    (x:head
      (x:title "An example page"))
    (x:body
      (x:h1 (@ (id "greeting")) "Hi, there!")
      (x:p "This is just an >>example<< to show XHTML & SXML."))))
Each element's tag pair is replaced by a set of parentheses. The tag's name is not repeated at the end; it is simply the first symbol in the list. The element's contents follow, which are either elements themselves or strings. No special syntax is required for XML attributes: in SXML they are represented as just another node, which has the special name @.
We can also see that there is no need to "escape" otherwise meaningful characters like & and > as &amp; and &gt; entities. All string content is automatically escaped because it is considered to be pure content, and has no tags or entities in it. This also means it is much easier to insert autogenerated content, and there is no danger of forgetting to escape user input when it is displayed to other users (which could lead to cross-site scripting attacks or other annoyances).
SXML features
This section considers some important features of SXML, deducible from the SXML grammar and the properties of S-expressions.
SXML attributes
The cdr of an SXML attribute list forms an association list, so that, when SXML is read into a Lisp program, any SXML attribute can be extracted from an attribute list using Lisp's built-in assoc function.
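The lookup described above can be sketched in a few lines of Scheme. This is only an illustration: the element elem and the helper names attr-list and attr-value are hypothetical, and assq is used in place of assoc because attribute names are symbols.

```scheme
;; Hypothetical sample element; its second member is an attribute list.
(define elem '(item (@ (id "a1") (lang "en")) "text"))

;; Return the attribute list node (@ ...) of an element, or #f if it has none.
(define (attr-list elem)
  (let ((rest (cdr elem)))
    (if (and (pair? rest) (pair? (car rest)) (eq? (caar rest) '@))
        (car rest)
        #f)))

;; Look up an attribute value by name. The cdr of the attribute list is an
;; association list, so assq finds the (name "value") entry directly.
;; Returns #f for a missing attribute or for one written without a value.
(define (attr-value elem name)
  (let ((al (attr-list elem)))
    (and al
         (let ((entry (assq name (cdr al))))
           (and entry (pair? (cdr entry)) (cadr entry))))))

(display (attr-value elem 'lang))   ; prints en
(newline)
```

Returning #f for an absent attribute is a choice made for this sketch; a real application might prefer to signal an error or accept a default value.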
SXML elements and attributes
The uniformity of the SXML representation for elements, attributes, and processing instructions simplifies queries and transformations. For the SXML data model, attributes and processing instructions look like regular elements with a distinguished name.
Therefore, query and transformation functions dedicated to attributes become redundant, because ordinary functions with distinguished names can be used.
The uniform representation for SXML elements and attributes is especially convenient for practical tasks.
Differences between elements and attributes in XML are blurred.
Choosing either an element or an attribute for representing concrete practical information is often a question of style, and
such a choice can later be changed.
Such a change in a data structure is expressed in SXML as simply an addition/removal of one hierarchy level, namely an
attribute-list.
This requires the minimal modification of an SXML application.
For the SXML notation, the only difference between an attribute and an element is that the former is contained within the attribute-list (which is a special SXML node) and cannot have nested elements.
For example, if data restructuring requires that the weight of the delivered load, initially represented as a nested element, be represented as an attribute, the SXML element
(delivery
  ...
  (weight "789"))
will be changed to
(delivery
(@ (weight "789"))
...)
Such a notation for elements and attributes simplifies SXML data restructuring and allows uniform queries to be used for data processing.
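As a rough illustration of such a uniform query, the sketch below looks up a value by name regardless of whether it is stored as an attribute or as a child element. The function find-value and both sample elements are hypothetical, and the code assumes the value is the first member after the name.

```scheme
;; Find the value stored under `name` in elem, whether it is an attribute
;; inside (@ ...) or a direct child element. Returns #f if nothing matches.
(define (find-value elem name)
  (let loop ((members (cdr elem)))
    (if (null? members)
        #f
        (let ((node (car members)))
          (cond ((not (pair? node))          ; character data: skip it
                 (loop (cdr members)))
                ((eq? (car node) '@)         ; attribute list: search it like an element
                 (or (find-value node name)
                     (loop (cdr members))))
                ((eq? (car node) name)       ; matching child or attribute
                 (and (pair? (cdr node)) (cadr node)))
                (else (loop (cdr members))))))))

;; The same datum in both layouts (sample data invented for this sketch).
(define as-element   '(delivery (route "north") (weight "789")))
(define as-attribute '(delivery (@ (weight "789")) (route "north")))

(display (find-value as-element 'weight))     ; prints 789
(newline)
(display (find-value as-attribute 'weight))   ; prints 789
(newline)
```

Because the attribute list is just another tagged list, the only extra work is one recursive call when the @ node is met.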
SXML document as a tree of uniform nodes
Since an SXML document is essentially a tree structure, it can be described in a more uniform way by introducing the term SXML node for the nodes of this tree.
An SXML node can be defined on the basis of SXML grammar as a single production [N] given below.
Alternatively, an SXML node can be defined as a set of two mutually recursive datatypes, Node [N1] and Nodelist [N2], together with the production [N3] for names.
In the latter case, a Node is constructed by adding a name to the Nodelist as its leftmost member; a Nodelist is itself an (ordered) list of Nodes.
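[N] <Node> ::= <Element> | <attributes-list> | <attribute> | "character data: text string" | <TOP> | <PI>
[N1] <Node> ::= ( <node-name> . <Nodelist> ) | "text string"
[N2] <Nodelist> ::= ( <Node>* )
[N3] <node-name> ::= <name> | @ | *TOP* | *PI*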
Such a consideration emphasizes SXML tree structure and the uniform representation for information items as S-expressions.
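One consequence of the Node/Nodelist view is that a single recursive function can walk any SXML tree. The sketch below collects the character data of a node; the name text-of and the sample element are hypothetical, and skipping attribute lists and processing instructions is a choice made for this example.

```scheme
;; Concatenate all character data below a node.
;; A node is either a string or a list whose head is a name and whose
;; tail is a list of nodes, so one recursive case covers every element.
(define (text-of node)
  (cond ((string? node) node)                  ; character data
        ((not (pair? node)) "")                ; anything else that is not a list
        ((memq (car node) '(@ *PI*)) "")       ; ignore attributes and PIs
        (else (apply string-append
                     (map text-of (cdr node))))))

(define sample '(p (@ (id "x")) "plain " (b "highlighted") " plain"))

(display (text-of sample))   ; prints plain highlighted plain
(newline)
```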
SXML as a Scheme program
The syntax of the Lisp family of programming languages, Scheme in particular, is based on S-expressions, which are used to represent both data and code. This makes it possible and convenient for Scheme programs to be treated as semi-structured data and vice versa.
Since an SXML document and its nodes are S-expressions, they can be used for representing a Scheme program. For this to be possible, it is sufficient that the first member of every list contained in the SXML tree is bound to a function; the use of macros offers more possibilities. The remaining members of the list are then the arguments passed to that function. In accordance with the SXML grammar, attribute names, element names and special names must all be bound to functions.
An SXML document or an SXML node that fulfills these requirements may be considered a Scheme program, which can be evaluated, for example, by means of the eval function.
For example, if para and bold are defined as functions as follows:
(define (para . x) (cons 'p x))
(define (bold . x) (cons 'b x))
then the following SXML element
(para "plain"
(bold "highlighted")
"plain")
can be treated as a program, and the result of its evaluation is the SXML element:
(p "plain"
(b "highlighted")
"plain")
Note that the result of evaluating such a program is not necessarily an SXML element.
For instance, a program may return a textual representation of the source data in XML or HTML, or it may have a side effect, such as saving the SXML data in a relational database.
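The evaluation model can be imitated without eval, whose environment argument differs between Scheme implementations. The sketch below is a hand-written stand-in, not how eval itself works: the table bindings and the function run are hypothetical, while para and bold repeat the definitions given earlier in this section.

```scheme
;; Functions from the example above.
(define (para . x) (cons 'p x))
(define (bold . x) (cons 'b x))

;; Hypothetical table that binds node names to functions.
(define bindings
  (list (cons 'para para)
        (cons 'bold bold)))

;; Evaluate an SXML node bottom-up: evaluate the members of the list first,
;; then apply the function bound to the head of the list to the results.
;; Strings evaluate to themselves. An unbound name is an error in this sketch.
(define (run node)
  (if (pair? node)
      (let ((binding (assq (car node) bindings)))
        (if binding
            (apply (cdr binding) (map run (cdr node)))
            (error "unbound SXML name" (car node))))
      node))

(write (run '(para "plain" (bold "highlighted") "plain")))
;; writes (p "plain" (b "highlighted") "plain")
(newline)
```

Swapping in functions that build strings instead of lists would make the same tree produce XML or HTML text.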
SXML shortcomings
Because the underlying structure is based on singly linked lists, nodes have no default access to their parent node or their sibling nodes, only to their child nodes. Arbitrary positional queries on a node's children, as in DOM's NodeList.item or XPath's [exp] accessor (where exp is an expression that returns an integer), are not optimal, because the list must be traversed up to the nth node, which is O(n), versus O(1) access for a vector or array.
Only a subset of DOM and XPath traversal operations are possible with plain SXML, although a generator (parser or constructor) or a post-processor may annotate each node with the necessary information to fully support all traversal operations (e.g. the parent node and its index within the parent) or they may in fact annotate with any other information (e.g. references to and from other nodes by ID, reference to the represented object, etc.).
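The cost of positional access can be made concrete with a small sketch. The helper children and the sample list are hypothetical; copying the children into a vector is one possible post-processing step, shown here only to contrast the two access patterns.

```scheme
;; Return the children of an element, skipping a leading attribute list.
(define (children elem)
  (let ((rest (cdr elem)))
    (if (and (pair? rest) (pair? (car rest)) (eq? (caar rest) '@))
        (cdr rest)
        rest)))

(define sample-list '(ul (@ (id "l")) (li "a") (li "b") (li "c")))

;; list-ref walks the list from its head, so reaching child n costs O(n).
(write (list-ref (children sample-list) 2))   ; writes (li "c")
(newline)

;; Copying the children into a vector once gives O(1) access afterwards.
(define child-vector (list->vector (children sample-list)))
(write (vector-ref child-vector 2))           ; writes (li "c")
(newline)
```

The vector must be rebuilt whenever the children change, which is the usual price of this kind of annotation.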
External links
- SXML Tools Tutorial by Dmitry Lizorkin
- Main SSAX/SXML page
- XML Matters: Investigating SXML and SSAX: Manipulating XML in the Scheme programming language by David Mertz, Ph.D. IBM developerWorks article
Detailed introduction, motivation and real-life case-studies of SSAX, SXML, SXPath and SXSLT.
The paper and the complementary talk presented at the International Lisp Conference 2002.