Tag soup
In web development, "tag soup" refers to formatted markup written for a web page that is very much like HTML but does not consist of correct HTML syntax and document structure. Because web browsers have historically treated HTML syntax or structural errors leniently, there has been little pressure for web developers to follow published standards. As a result, all browser implementations need to be able to treat what looks like HTML as "tag soup", accepting and correcting for invalid syntax and structure.
An HTML parser (part of a web browser) that is capable of interpreting HTML-like markup even if it contains invalid syntax or structure may be called a tag soup parser. All major web browsers currently have a tag soup parser for interpreting malformed HTML.
"Tag soup" may collectively refer to a large number of common authoring mistakes, such as malformed HTML tags, improperly nested HTML elements, and unescaped character entities (especially ampersands (&) and less-than signs (<)).
The Markup Validation Service is a resource for web page authors to avoid creating tag soup.
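The unescaped-character mistake mentioned above can be avoided mechanically. The following is a minimal sketch using the Python standard library; the helper name escape_text and the sample string are illustrative assumptions, not part of any standard.

```python
from html import escape, unescape

def escape_text(text: str) -> str:
    """Escape the characters that are significant in HTML text content."""
    # quote=False leaves quotation marks alone; only &, < and > are replaced.
    return escape(text, quote=False)

raw = "Fish & chips < 5 pounds"   # hypothetical text destined for a web page
safe = escape_text(raw)
print(safe)  # Fish &amp; chips &lt; 5 pounds
```

Passing the escaped string through unescape returns the original text, so the transformation loses no information.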
Overview
"Tag soup" is a term used to denigrate various practices in web authoring. Some of these (roughly ordered from most severe to least severe) include:
- Malformed markup where tags are improperly nested or incorrectly closed. For example, the following: <p>This is a malformed fragment of <em>HTML</p></em>
- Invalid structure where elements are improperly nested according to the DTD (Document Type Definition) for the document. Examples of this include nesting a "ul" element directly inside another "ul" element for any of the HTML 4.01 or XHTML DTDs.
- Use of proprietary or undefined elements and attributes instead of those defined in W3C recommendations.
Malformed markup
Malformed markup is arguably the most severe problem in web authoring. However, thanks to better education and information and perhaps with some help from XHTML, the issue of malformed markup is becoming less common. Browsers, when faced with malformed markup, must guess the intended meaning of the author. They must infer closing tags where they expect them and then infer opening tags to match other closing tags. The interpretation can vary markedly from one browser to the next. Ian Hickson wrote a detailed article investigating the differences between how browsers handle tag soup.
While many graphical web editors produce well-formed markup, an author writing code manually with a text editor and then testing only in one browser can easily miss such errors. The presentation can therefore vary drastically from one browser to another as each tries to "correct" the author's intent in different ways and then applies styling to those "corrections".
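The kind of mismatch a browser has to guess its way around can be detected with a simple stack of open elements. The sketch below uses Python's standard html.parser module; the class NestingChecker, the function find_nesting_problems, and the partial list of void elements are assumptions made for illustration. It is stricter than HTML itself, which allows some closing tags to be omitted, and it does not reproduce any browser's recovery algorithm.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (a partial list, for illustration).
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}

class NestingChecker(HTMLParser):
    """Records closing tags that do not match the most recently opened element."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # stack of currently open tag names
        self.problems = []       # human-readable descriptions of mismatches

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_elements.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return  # e.g. the end half of <br/>; nothing to match
        if not self.open_elements:
            self.problems.append(f"</{tag}> has no matching opening tag")
        elif self.open_elements[-1] == tag:
            self.open_elements.pop()
        else:
            top = self.open_elements[-1]
            self.problems.append(f"</{tag}> closed while <{top}> was open")

def find_nesting_problems(markup: str) -> list:
    """Return a list of nesting problems found in markup (empty if none)."""
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    # Anything still on the stack was never closed.
    checker.problems.extend(f"<{t}> never closed" for t in checker.open_elements)
    return checker.problems

print(find_nesting_problems("<p>This is <em>malformed</p></em>"))
```

For the malformed sample the checker reports that </p> arrived while <em> was still open and that <p> was therefore never closed; a correctly nested fragment yields an empty list.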
Invalid document structure
Invalid document structure here means only the use of attributes and elements where they do not belong. For example, placing a "cite" attribute on a "cite" element is invalid since the HTML and XHTML DTDs do not ascribe any meaning to that attribute on that element. Similarly, including a "p" element within the content of an "em" element is also invalid. With the move toward separating malformed markup from invalid markup, the problems with invalid markup have increasingly been seen as less severe. Some have begun to advocate looser content models that allow greater flexibility in authoring HTML documents (whether in HTML or XHTML). However, use of invalid markup can blur the author's intended meaning, though not as severely as malformed markup.
Many graphical web editors still produce invalid markup. Moreover, many professional web designers and authors pay little attention to issues of validity. It is common to see invalid markup in many of the sites throughout the World Wide Web.
Use of proprietary/discontinued elements
In the early age of the web (much of the 1990s), the design of the official HTML specification became increasingly strained, compared to the desire of designers for flexibility in creating visually vibrant designs. In response to this pressure, browser makers unilaterally added new proprietary features to HTML that fell outside of the standards at the time. This meant there were proprietary elements in HTML that worked in some browsers, but not in others.
To some extent, this problem was slowed by the introduction of new standards by the W3C, such as CSS, first published as a W3C Recommendation in 1996, which helped to provide greater flexibility in the presentation and layout of web pages without the need for large numbers of additional HTML elements and attributes.
In later standards, many elements have either been combined into a single semantic construct (such as "object" elements replacing proprietary "applet" and "embed" elements) or have been deprecated (such as the "s", "strike" and "u" elements). Nevertheless, browser developers have continued to introduce new elements to HTML when they have perceived a need. Some browsers allow tabindex attributes on any element. WebKit developers aligned with Apple introduced the "canvas" element that behaves much like the "object" or "embed" element. Mozilla then introduced their own "canvas" element, which behaves even more like the "object" element.
Evolving specifications to solve tag soup
While some of the issues of tag soup are due to shortcomings of browsers and sometimes due to a lack of information for web authors, some of the proliferation of tag soup was due to missing links in the web standards themselves. The W3C has spearheaded several efforts to address the shortcomings of web standards. As more browsers support newer revisions of standards, the pressure on web developers to use non-standard code to solve problems diminishes.
Cascading Style Sheets (CSS)
Cascading Style Sheets (CSS) provide a mechanism to specify the presentation of elements in a document without altering the markup structure of the document. Before CSS was commonplace, web developers may have resorted to some structurally invalid markup to achieve certain presentational goals, for example including block-level elements within inline elements to obtain a particular effect.
XML and XHTML
XHTML is a reformulation of the HTML language based on XML. XHTML was developed to address many of the problems associated with tag soup.
XML allows parsers to separate the process of interpreting the document syntax and its structure. In HTML and SGML, a parser needed to know certain rules about elements during parsing, such as what elements could be contained within other elements and which elements implicitly close the previous element. This is because in HTML and SGML, closing tags and even opening tags were optional on some elements. By requiring all elements to have explicit opening and closing tags, XML parsers can parse the document and produce a document tree without any knowledge of the document type. This allows parsers to be universal and very light-weight, and to be separated from the process of validating or interpreting the document.
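As an illustration of parsing without knowledge of the document type, the sketch below feeds a made-up vocabulary to the generic XML parser in Python's standard library. The "recipe" markup and the outline helper are hypothetical; no schema or DTD for the vocabulary exists or is needed.

```python
import xml.etree.ElementTree as ET

def outline(element, depth=0):
    """Return an indented list of element names for the subtree at element."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

# A made-up vocabulary: the parser has no schema or DTD for it.
document = (
    "<recipe><title>Soup</title>"
    "<steps><step>Boil</step><step>Serve</step></steps></recipe>"
)
tree = ET.fromstring(document)
print("\n".join(outline(tree)))
```

The tree is recovered purely from the explicit opening and closing tags; deciding whether a "step" is allowed inside "steps" would be a separate validation stage.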
The XML specification requires that a conforming processor (such as a web browser parsing XHTML) treat any well-formedness error as fatal: it must report the error and must not continue normal processing of the document. Thus, a browser interpreting a web page as XHTML will refuse to display the page if it encounters a formation error. This can help ensure that when authors test XHTML code against a conforming browser they will immediately be informed of malformation problems, perhaps the most severe problem facing web browsers. When code is malformed, the intent of the author is ambiguous. Without the directives of XML, HTML browsers must use complex algorithms to infer the author's intended meaning in a wide range of cases where invalid syntax is encountered.
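The strict behaviour described above can be observed with any conforming XML parser. The following sketch uses Python's standard xml.etree.ElementTree module, which is an XML parser rather than a browser; the function name is_well_formed and the sample fragments are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """Return True if markup parses as XML, False on any syntax error."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        # The parser stops at the first error instead of guessing a repair.
        return False
    return True

good = "<p>This is <em>well-formed</em>.</p>"
bad_nesting = "<p>This is <em>malformed</p></em>"
bad_ampersand = "<p>Fish & chips</p>"

for sample in (good, bad_nesting, bad_ampersand):
    print(is_well_formed(sample), sample)
```

Only the first sample is accepted; the misnested tags and the unescaped ampersand are both rejected outright, where a tag soup parser would attempt a correction.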
XML and XHTML introduce the concept of namespaces. With namespaces, authors or communities of authors can define new elements and attributes with new semantics, and intermix those within their XHTML documents. Namespaces ensure that element names from the various namespaces will not be conflated. For example, a "table" element could be defined in a new namespace with new semantics different from the HTML "table" element and the browser will be able to differentiate between the two. In providing namespaces, XHTML combined with CSS allows authoring communities to easily extend the semantic vocabulary of documents. This accommodates the use of proprietary elements so long as those elements can be presented to the intended audience through complete style sheet definitions (including aural/speech and tactile styles).
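The "table" example can be made concrete with a namespace-aware parser. In the sketch below, the XHTML namespace URI is the real one, while the furniture namespace, its "f" prefix, and the qualified helper are hypothetical names introduced only for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
FURNITURE_NS = "http://example.com/furniture"  # hypothetical namespace

def qualified(namespace: str, name: str) -> str:
    """Build the '{namespace}name' form that ElementTree uses for tags."""
    return "{" + namespace + "}" + name

document = (
    '<body xmlns="' + XHTML_NS + '" xmlns:f="' + FURNITURE_NS + '">'
    "<table><tr><td>cell</td></tr></table>"
    "<f:table><f:legs>4</f:legs></f:table>"
    "</body>"
)
root = ET.fromstring(document)

# Both children are called "table", but their namespaces keep them apart.
html_tables = root.findall(qualified(XHTML_NS, "table"))
furniture_tables = root.findall(qualified(FURNITURE_NS, "table"))
print(len(html_tables), len(furniture_tables))  # 1 1
```

Each lookup finds exactly one element, so a consumer that only understands XHTML tables never mistakes the furniture element for one.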
XHTML documents may be served on the web using the internet media type application/xhtml+xml or text/html. Current Microsoft Internet Explorer versions (6, 7 and 8) do not display XHTML documents served as application/xhtml+xml. IE9 beta releases appear to be compliant. See also the discussion of this issue in the XHTML article.
HTML5
HTML5 aims to be the most complete solution to the problem of tag soup thus far while remaining as backwards- and forwards-compatible as possible. By contrast to XHTML, which departs from backwards compatibility and takes the approach that parsers should become less tolerant of badly-formed markup, HTML5 acknowledges that badly-formed HTML code already exists in large quantities and will probably continue to be used, and takes the view that the specification should be expanded to ensure maximum compatibility with such code.
Thus, the HTML5 specification has altered its definition of HTML syntax both to accommodate common syntax in use today, and to describe explicitly how "badly-formed code" should be treated by the parser. The handling of badly-formed code now has a place in the specification itself, hopefully reducing the need for future HTML parsers to implement additional, out-of-specification measures for dealing with code that they do not recognize.
Tools to fix tag soup
- HTML Tidy is a software tool available for many platforms which can correct invalid syntax, and most invalid document structure, converting HTML-like code to HTML or XHTML.
- Aggiorno is a Visual Studio add-in that focuses on making websites standards-compliant.
- Tagsoup is a Java library that parses HTML, cleans it up, and delivers a stream of SAX events representing well-formed and valid XHTML.
- Beautiful Soup is a Python DOM parser for soupy HTML/XML.