One of the things I hate most about HTML5 is that documents don't need to be well-formed XML, and are even encouraged not to be (`<hr>` instead of `<hr/>`, etc.), which shuts XML processing tools out of working with HTML documents. One then has to support tag soup / follow the HTML5 parsing algorithm to the letter, when the whole mess could easily have been avoided. This affects text editor plugins etc., where one might want a single plugin/codebase that uses XPath to traverse both XML and HTML documents.
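To see the complaint concretely, here is a minimal Python sketch (the snippets are illustrative): a strict XML parser rejects the HTML5-style void tag, while the XHTML-style version parses and can be traversed with XPath-like queries.

```python
import xml.etree.ElementTree as ET

html_style = "<body><p>intro<hr></body>"        # valid HTML5, not well-formed XML
xhtml_style = "<body><p>intro</p><hr/></body>"  # well-formed XML

try:
    ET.fromstring(html_style)
    parsed = True
except ET.ParseError:
    # the unclosed <p> and bare <hr> break any conforming XML parser
    parsed = False

tree = ET.fromstring(xhtml_style)
# XPath-style traversal only works once the document is well-formed:
hrs = tree.findall(".//hr")
```

So the same tooling works for XML and XHTML, but plain HTML5 needs a dedicated tag-soup parser first.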
I don't quite understand the XML fetishism. HTML is originally based on SGML, and SGML is every bit as structured as XML by definition, since XML is specified as a proper subset of SGML. From the XML spec:
> The Extensible Markup Language (XML) is a subset of SGML that is completely described in this document. Its goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. XML has been designed for ease of implementation and for interoperability with both SGML and HTML.
The "generic" part refers to XML being canonical, fully-tagged markup that doesn't require vocabulary-specific markup declarations for tag omission/inference, empty elements, and enumerated attributes, as HTML and other SGML vocabularies making use of those features do.
That XML has failed on the web doesn't mean one has to give up structured documents. In fact, HTML can be converted easily into XHTML using SGML [1]. If anything, markup geeks should embrace SGML (an ISO standard no less) to discover the power of a true text authoring format. For example, SGML supports Wiki syntaxes (short references) such as markdown.
Look at this: `<p<a href="/">first part of the text</> second part`. This is a valid document fragment in HTML 4.01, because HTML 4.01 is defined as an SGML application and SGML's SHORTTAG features allow unclosed start-tags (`<p<a`) and empty end-tags (`</>`).
Writing a correct XML parser is much easier than writing a correct SGML parser, and what's more important, it's much easier to recognize errors.
I agree with OP that HTML5 should have been XML from the start. Nowadays, you hardly write any HTML by hand and even if you do, it's easy to write syntactically correct XML.
It's true that you can convert any HTML into XML with ease but it's still a stupid, unnecessary step.
> I agree with OP that HTML5 should have been XML from the start.
The key requirement for HTML5, and the reason it succeeded where XHTML had only limited success, was that existing HTML docs had to work with it. Which is why it has both an HTML and an XML serialization.
It was not wrong for it not to be pure XML, it was absolutely necessary.
> You could write XHTML 1.0 documents that were backwards compatible to browsers that only understood HTML 4.01.
You could and a lot of people _tried_, or at least pretended to. But the vast majority of documents that tried to do this failed to actually be well-formed XML, for various reasons... In practice, even restricting parsing as XML to cases when the page was explicitly sent with the application/xhtml+xml MIME type would leave a browser with problems when sites sent non-well-formed XML with that MIME type. This was a pretty serious problem for Gecko back in the day when we attempted to push XHTML usage (e.g. by putting "application/xhtml+xml" ahead of "text/html" in the Accept header). So we stopped pushing that, since it was actively harming our users...
The point is that this hasn't happened, not back in XML's heyday and much less today. Now you can bemoan XML's demise until the end of time, or you can fall back on XML's big sister, SGML. As I said, SGML has lots of features over XML that are in fact desirable for an authoring format, such as Wiki syntaxes, type-safe/injection-free templating, stylesheets, etc., on top of being able to parse HTML. Many of these features are being reinvented in modern file-based CMSs and static site generators, so there's definitely a use case for this. Whereas editing XML (a delivery rather than an authoring format) by hand is quite cumbersome, verbose and redundant, yet still doesn't help at all with how text content is actually created on the web.
Is SGML even still used? The only use case I remember besides HTML is DocBook, and that of course has also had an XML variant for a long time.
SGML is needlessly complex as an authoring format. Even HTML was considered too complex, and that's why we got lightweight markup languages like Markdown and AsciiDoc.
I would be very surprised if we ever turn back to something like SGML, especially as there are well-designed lightweight markup languages such as AsciiDoc and reStructuredText.
To give you an idea of what SGML is capable of, see my tutorial at [1]. It implements a lightweight content app where Markdown syntax is parsed and transformed into HTML via SGML short references, then has HTML5 sectioning elements inferred (e.g. the HTML5 outlining algorithm is implemented in SGML), then gets rendered as a page with a table-of-contents nav list linking to the full body text, and with HTML boilerplate added, all without procedural code.
SGML was in fact designed to be typed by hand, as an evolution of earlier mainframe markup languages at IBM. The idiosyncratic shortcut features are supposed to reduce the number of keystrokes needed for entering text.
HTML was based on SGML, but HTML5 is explicitly not SGML anymore; the specification calls the format "inspired by SGML". So to be fully conformant you need a custom processor instead of being able to use standard tools.
HTML5 doesn't cease to be based on SGML just because a browser cartel with the express intent of transforming the web into JavaScript-heavy web apps declares it so. WHATWG isn't an accredited standards body, so what they declare a "standard" or "conformant" means shit. Especially since they don't bother to actually publish a standard that doesn't change all the time. Their "living standard" is at best a collaborative wiki space of sorts where (a closed group of) browser vendors attempt to agree on how to do things, and it has been falling apart lately. WHATWG's "standard" has watched the web become a Chrome monopoly, and Opera and Microsoft cease browser development altogether.
SGML is the only game in town able to parse (a significant part of) HTML based on an actual standard, and is also the only realistic perspective for folks interested in the web as a standardized communication medium going forward.
HTML was "based on" SGML only in the sense that it borrowed a lot from SGML. In practice, however, it was never an application of SGML. HTML 4 tried its best to move developers to SGML-based HTML, but devs ignored it.
HTML5 recognises that there was this gulf between the specification and actual usage, and sided with real-world usage.
Even if `<hr />` (or other void elements, e.g. `<img />`) were required, XML processing would not work. HTML (prior to HTML5) was full of parsing quirks (e.g. table handling, formatting elements, ...) that cannot be expressed by the kind of DTD XML uses. As a result, the DOM as seen from an XML point of view could always differ from the real DOM, even when the source could be parsed.
HTML5 just standardized all these quirks, leading to a uniform parsing model instead of an even bigger cross-browser mess.
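A concrete case of the XML-view/real-DOM divergence is HTML5's table handling: non-table content inside a `<table>` gets "foster-parented" in front of the table in the DOM, whereas an XML parser keeps the tree exactly as written. A minimal Python sketch (the HTML5 behaviour is described in comments, not executed, since the stdlib has no HTML5 tree builder):

```python
import xml.etree.ElementTree as ET

src = "<table><div>not table content</div><tr><td>cell</td></tr></table>"

# The XML view keeps the <div> nested inside the <table>:
xml_tree = ET.fromstring(src)
stray_div = xml_tree.find("div")  # found as a child of <table>

# An HTML5 parser (e.g. the one in every browser) instead produces
# roughly: <div>not table content</div><table><tbody><tr><td>cell</td>...
# i.e. the <div> ends up *before* the table, a tree no DTD can describe.
```

Same source, two different trees, which is exactly why DTD-driven XML tooling could never faithfully reproduce the browser DOM.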
I wouldn’t say “easily” avoided. You can’t ignore the billions of web pages that would have already existed in a format that was non-compliant with XML at the time. With so much “prior art”, there is simply no way that any browser will ever be able to throw out its fuzzy/imprecise parser, which means that support for well-formed XML requires them to maintain two readers: precise and imprecise.
As far as XML “tools”, I am shocked at how even now I encounter real XML parsers that don’t necessarily reject malformed data files but do atrocious things with them (like silently pretend that certain tags were not even in the file). Thus, I end up using extra steps like a linter as a front-end sanity check. And while this example is a pure-data application, a linter is also a sensible front-end sanity check for HTML. XML isn’t going to win over HTML if it requires the same steps to clean up imperfections in the process.