Navarr's Tech Side The Technical Side of my Life

3 Jan 2008 (23 comments)

Why HTML should become a dead language

Go ahead, take a gander around the Internet.  Let's see who is using HTML, and who is using XHTML.

  • XHTML
    • Blogger [Served as HTML]
    • Opera [Served as XML]
    • Firefox (Mozilla) [Served as HTML]
    • Facebook [Served as HTML]
    • MySpace [Served as HTML]
    • Twitter [Served as HTML]
    • W3C [Served as HTML]
  • HTML
    • Microsoft
    • Yahoo!
    • Apple
    • Google
    • WHATWG

As you can see, the "newer" websites are serving in XHTML.  A few of them are Transitional (mainly websites where users can input HTML); however, a large number of them are marked as XHTML 1.0 Strict using the W3C DTD.

On to my main point: HTML should NOT receive an update.  The primary reason people want it to is that it is FAMILIAR to designers and developers.  However, HTML is very lax in how it is interpreted, a little too lax in my personal opinion.

An example: <input type="text" disabled>.  The tag does not end.  It's interpreted as a single tag, and disabled means that it's disabled.  I, myself, find this to be horrible, horrible code.  In XHTML, all tags must end, and all attributes must have a value.  Which is easier to program, an HTML or an XHTML parser?

Probably the XHTML one.  HTML was not designed to be parsed, and is really quite a pain to attempt to parse using such things as regex, considering how lax a language it is.
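To make the comparison concrete, here is a small Python sketch (standard library only) that feeds the example tag to a lenient HTML tokenizer and to a strict XML parser. The class and function names are made up for illustration, and `html.parser` is only a tokenizer, not a browser-grade HTML parser, so treat this as a rough demonstration rather than a measure of how hard either parser is to write.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

HTML_FORM = '<input type="text" disabled>'
XHTML_FORM = '<input type="text" disabled="disabled" />'


class AttributeCollector(HTMLParser):
    """Records the tag name and attributes of every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # A valueless attribute such as `disabled` arrives with value None.
        self.seen.append((tag, dict(attrs)))


def parse_as_html(markup):
    """Tokenize markup leniently and return the start tags found."""
    collector = AttributeCollector()
    collector.feed(markup)
    collector.close()
    return collector.seen


def is_well_formed_xml(markup):
    """Return True only if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print(parse_as_html(HTML_FORM))
    print(is_well_formed_xml(HTML_FORM))
    print(is_well_formed_xml(XHTML_FORM))
```

The HTML tokenizer accepts both spellings, while the XML parser rejects the first one outright because the attribute has no value and the element is never closed.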

In my own personal opinion, HTML should become a dead language.  Some people hold on to it, let them.  Let them continue making their websites in archaic HTML 4.  But do we really need to run an update to it?  No.  We should move to XHTML.  XML has already proven to be a very dependable language.  It's used in such popular applications as Jabber and Twitter, and is used commonly with AJAX, which is the height of the "Web 2.0" Revolution.

So why are we trying to revise it?  Why can't we adopt a stricter set of rules so that they can better be implemented?  In my personal opinion, work on HTML should be halted, and moved to XHTML.  I still believe that the W3C and WHATWG should work together on a new version of XHTML.  XMLEvents and XForms should adopt the magnificent features that Web Forms 2.0 has created.  Again, this is my personal opinion.

EDIT: Added what some sites are served as.  It's actually pretty depressing.

  • Hixie

    Putting an XHTML DOCTYPE at the top of the file or putting an xmlns="" XHTML namespace on the root element isn’t enough to actually use XHTML. You also have to use the right MIME type. None of the sites you mention use the right MIME type. They all use text/html. So actually, they’re all really using HTML, just mislabeled.
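The rule in this comment can be sketched as a tiny Python function: the processing mode follows the Content-Type header, not the DOCTYPE or the namespace. The function name and the three return labels are hypothetical, and real browsers apply more rules than this (content sniffing, for one), so it is a simplification under those assumptions.

```python
def processing_mode(content_type_header):
    """Guess how a client would process a document from its Content-Type.

    Returns "html", "xml", or "other". Parameters such as charset are ignored.
    """
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    if media_type == "text/html":
        # Tag-soup rules apply, whatever DOCTYPE the file declares.
        return "html"
    if media_type in ("application/xhtml+xml", "application/xml", "text/xml"):
        return "xml"
    if media_type.endswith("+xml"):
        return "xml"
    return "other"


if __name__ == "__main__":
    for header in ("text/html; charset=utf-8", "application/xhtml+xml"):
        print(header, "->", processing_mode(header))
```

Under this reading, every site in the post's XHTML list except Opera lands in the "html" branch.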

  • Navarr Barnier

    As true as that is, it doesn’t really invalidate my point that we should move to a stricter language... but you are right. It’s a tad bit depressing that the only one serving with a proper MIME type is Opera.

  • jgraham

    It’s not clear to me that “it’s easier to write a parser” is a sufficient reason to prefer XHTML over HTML*. The number of people actually writing parsers need not be large – for example there is a small set of widely used XML parsing libraries (expat, libxml2, etc.) and most people will never roll their own (and, generally speaking, when they try to implement simplistic parsers using e.g. regexps, they make significant errors). For HTML the situation has historically been different because the language has had all sorts of de-facto-required, yet undocumented, error-handling behavior. However HTML 5 is changing this with a detailed spec for parsing HTML in a manner compatible with major browsers. This means that we are starting to see robust HTML-parsing libraries with well defined, browser-compatible, behavior e.g. html5lib (Python and Ruby) and the validator.nu parser (Java). So, the situation for people looking to extract information from XHTML and HTML is not so different – your best bet is to use a library that has been written to the relevant standard and well tested; trying to roll your own may cause problems.

    If the situation for people trying to parse the two languages isn’t so different, what other reasons are there to prefer one over the other? Well XML does have the advantage of being more extensible – you can add SVG or MathML content inline in an XHTML document but not, presently, in an HTML document. This is something that there is interest in changing for HTML 5 but we’ll have to see how that pans out.

    The other big issue is error handling. HTML has it and XHTML doesn’t. If you get anything that is a fatal XML error wrong in your backend system using XHTML then your users see an incomprehensible error message where they were expecting your site. These problems can often be tiny things that could easily be handled, like encountering a character forbidden in XML 1.0. The amount of effort needed by all the people who write website backends (of whom there are far more than write parsers) to get this right should not be under-estimated. It seems like a poor decision to help the few by making the many pay.

    Even worse, when things go wrong with XHTML, it’s the poor person trying to access the site that suffers. End users are generally totally the wrong people to show errors to because not only are they in no position to offer a fix (or often even inform you of the problem) but it undermines their confidence in your website. It’s hard to imagine why someone like Amazon or Ebay would want to add an extra, highly user visible, mode of failure to their website just so the front end code is more aesthetically pleasing. HTML with its generous error handling has a real advantage here.

    In my opinion, the future of XML on the web depends on XML developing some form of error handling. This need not be identical to that in HTML – indeed it probably ought to be much more uniform. However, it does have to ensure that every possible input sequence results in an output tree, containing as much of the original text as possible, never a fatal error. Happily work in this direction has started, for example “XML5”.

    * As an aside Doctypes add a whole bunch of complexity to XML parsers that isn’t there in HTML. But HTML parsers are probably more complex overall.
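The "tiny thing" failure mode described in this comment can be reproduced with Python's standard library. This is a hedged sketch: the sample comment text and the helper names are invented for illustration, and expat plus `html.parser` stand in for a browser's XML and HTML parsers, which they only approximate.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# &#1; refers to U+0001, a character that XML 1.0 forbids.
COMMENT = "<p>Nice post!&#1;</p>"


class TextCollector(HTMLParser):
    """Collects the text content seen by the lenient HTML tokenizer."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def xml_outcome(markup):
    """Return "parsed" or "fatal" depending on the strict XML parser."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return "fatal"
    return "parsed"


def html_text(markup):
    """Return whatever text the HTML tokenizer manages to recover."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)


if __name__ == "__main__":
    print(xml_outcome(COMMENT))
    print(repr(html_text(COMMENT)))
```

A single stray reference is enough to make the XML parse fail as a whole, while the HTML path still hands back the visible text.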

  • distler

    I believe that XML5 is a hopelessly ill-conceived idea (as you may be able to tell from my comments on Anne’s blog).

    The “present” of XHTML on the web is that it is a niche language for those with special needs (like embedding MathML and SVG).

    Its “future” may be to fade away (if a suitable extension mechanism for HTML5 is adopted). Or it may be to blossom, if someone comes up with a “killer application” for it that motivates CMS authors to write something capable of reliably producing XHTML.

    Either way, the “future” is a long way off, and those of us with “special needs” are not waiting around for it to arrive.

  • distler

    I wrote:

      as you may be able to tell from my comments on Anne’s blog

    Sorry. I meant on Sam’s blog.

  • jgraham

    Jacques, it’s not clear to me why you think that XML5 is a “hopelessly ill conceived idea”. Clearly there is some disagreement over the details – e.g. whether ASCII representations of non-ASCII characters should be included (via entity references or some other mechanism) and how such content should be served. But I’m not sure what the major philosophical objection to a language with a non-vocabulary-specific parsing model and error handling is. In particular, your set of proposed futures for XHTML fails to cover the case where no workable extensibility mechanism for HTML can be found, but people are still as unwilling to accept the XHTML error handling behavior as today.

    If we don’t explore the path forward, the future will be a long way off forever :)

  • distler

    Anne’s conception, literally, was that XML5 would be a replacement for XML 1.0. I think that is a nonstarter. There is simply too much XML 1.0 infrastructure in place.

    Sam had a more realistic suggestion: a “liberal” parsing mode (with error correction) for XHTML5.

    That’s a perfectly reasonable suggestion. But it does have its pitfalls: Aside from the MIME type, you have no way of guessing whether a given XHTML5 document is processable with XML 1.0 tools. People will surely try to use XML 1.0 tools to consume such content, either directly, or when syndicated in an Atom feed, or …

    So you are setting yourself up for the interoperability problems that Appendix C engendered: lots and lots of content that looks like XML, but isn’t.

    This is not an insuperable objection to Sam’s suggestion. But it is something to worry about.

    XML5, however, as a blanket replacement for XML 1.0, just makes no sense. Heck, not even XML 1.1 has gotten any traction and, by comparison, it’s a minor modification of XML 1.0.

  • distler

    Oh, and lest I forget, there’s another reason why XML5 is a dumb idea.

    Successful error-correction requires not merely that the error-correction algorithm be well-specified. It also needs to offer a better-than-even chance of capturing the author’s intent.

    This is highly language-dependent. It requires a knowledge of what are the common authoring errors in that particular language.

    In the case of HTML5, that knowledge was gathered over the space of many years by the browser vendors, on the basis of billions of error-laden HTML pages.

    But I think it is impossible to ascertain (even if the question made any sense) what the most common authoring errors are in an arbitrary (unknown) XML dialect.

    So, even if it made sense to try, I’m not sure you could come up with a successful general-purpose error-correction algorithm for XML.

  • karl

    Just to make it clear. Hixie’s comment is intentionally misleading. The Web sites use an HTML MIME type on an XHTML format (XML). This MIME type is authorized by the XHTML 1.0 specification. (I’m not discussing here if it was right to do this or not.) So basically the document is not viewed as XML on the Web in a browser, but that doesn’t mean it loses all its XML properties.

    Using an XHTML format has benefits for certain users who are using an XML hub to process their documents. You can also get the character stream of XML data, and process it locally.

    So no reasons to be depressed, you can use XHTML if you need it. It’s just a question of being pragmatic for yourself.

  • distler

    From XHTML Media Types:

    —————
    The ‘text/html’ media type [RFC2854] is primarily for HTML, not for XHTML. In general, this media type is NOT suitable for XHTML. However, as [RFC2854] says, [XHTML1] defines a profile of use of XHTML which is compatible with HTML 4.01 and which may also be labeled as text/html.

    [XHTML1], Appendix C “HTML Compatibility Guidelines” summarizes design guidelines for authors who wish their XHTML documents to render on existing HTML user agents. The use of ‘text/html’ for XHTML SHOULD be limited for the purpose of rendering on existing HTML user agents, and SHOULD be limited to [XHTML1] documents which follow the HTML Compatibility Guidelines. In particular, ‘text/html’ is NOT suitable for XHTML Family document types that adds elements and attributes from foreign namespaces, such as XHTML+MathML [XHTML+MathML].

    XHTML documents served as ‘text/html’ will not be processed as XML [XML10], e.g. well-formedness errors may not be detected by user agents. Also be aware that HTML rules will be applied for DOM and style sheets (see C.11 and C.13 of [XHTML1] respectively).
    ————————

    The last point is particularly important. From the client’s perspective, any document sent as ‘text/html’ is to be treated as an HTML document.

    Karl added:

    Using an XHTML format has benefits for certain users who are using an XML hub to process their documents.

    I highly doubt any of the websites mentioned in this post are using an XML-based backend to produce their “XHTML” documents.

    Not that it’s impossible to do so, just that — in practice — no current CMS’s actually work that way.

  • karl

    Jacques :)

    1. There are a few CMSes working with XML
    2. You took only one side of my assertion. I said two things: a) content producer using XML b) content consumer using XML.

    In some cases, I don’t care if a web site serves its XHTML pages as text/html, when I write a web scraping application which downloads the document and can play with it locally. Then it makes my work a lot easier when it is already XML :) But I guess it is a question of taste ;)

  • distler

    Karl wrote:

       
    1. There are a few CMSes working with XML

    Names?

    I am, for obvious reasons, very interested in such CMS’s.

       You took only one side of my assertion. I said two things: a) content producer using XML b) content consumer using XML.

    Yes, it’s possible, as a content-consumer, to take a text/html document and feed it to an XML parser.

    It is also possible to feed it to a GIF decoder.

    In both cases, in ignoring the MIME type, the results will be unreliable.

    The percentage of “XHTML” websites which are
    a) served as text/html
    b) well-formed
    is so minute as to be indistinguishable from zero.

    If you want to screen-scrape, there are far more reliable off-the-shelf HTML parsers you could use. Using an XML parser would be a rather poor idea.
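As a rough illustration of the screen-scraping advice, the Python sketch below pulls links out of a deliberately sloppy page with the standard library's HTML tokenizer, and shows that a strict XML parser refuses the same input. The sample page and the names `LinkCollector`, `extract_links`, and `xml_parser_accepts` are invented here; for real work a spec-following library such as html5lib (mentioned earlier in the thread) would be the more faithful choice.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# An "XHTML-looking" page with a bare ampersand, an unclosed <br> and <p>,
# and an unquoted attribute value.
PAGE = (
    "<html><body><p>Tom & Jerry<br>"
    '<a href="/one">one</a> <a href=two.html>two</a></body></html>'
)


class LinkCollector(HTMLParser):
    """Collects the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_links(markup):
    """Return all link targets found by the lenient HTML tokenizer."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


def xml_parser_accepts(markup):
    """Return True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print(extract_links(PAGE))
    print(xml_parser_accepts(PAGE))
```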

  • jgraham

    Uniform error correction for all XML vocabularies is, as you note, not going to correct errors as well as vocabulary-specific error correction and so it’s not surprising that (iirc) there are people working on XML parsers with pluggable per-language error recovery.

    However XML5 is interesting partially because it is the simplest thing that could possibly work.

    Sure it’s not going to render broken markup exactly as the author intended. But, with a language like HTML, for most classes of error, it will allow a site where an unexpected error creeps in to keep working. Yes, authors will have to be more careful than they are when producing text/html, but not that much more careful.

    My hope is that there is a sweet spot somewhere on the spectrum between having a format with a language-specific grammar, excellent error recovery, but severe constraints on the future extensibility, and a format designed for syntactic uniformity and extensibility but with no tolerance of a wide range of errors. XML 5 seems like a decent first stab at finding that sweet spot.

  • karl
  • distler

    Out of curiosity, Karl:

    1) Have you used any of these “XML CMSs”?
    2) If not, why not?
    3) If yes, has the experience given you any insight as to why only a negligible fraction of “XHTML” websites are well-formed?

    James wrote:

       Sure it’s not going to render broken markup exactly as the author intended. But, with a language like HTML, for most classes of error, it will allow a site where an unexpected error creeps in to keep working.

    I don’t argue that error-correction for XHTML5 (along the lines of Sam’s suggestion) could be of value.

    What I question is whether there is any value in a general-purpose error correcting replacement for XML 1.0.

    XML5 seems like the worst possible compromise: if you’re targeting XHTML, you could do a much better, language-specific job of error correction. If you’re targeting generic XML, this seems like a “solution” in search of a problem.

  • jgraham

    I don’t argue that error-correction for XHTML5 (along the lines of Sam’s suggestion) could be of value.

    What I question is whether there is any value in a general-purpose error correcting replacement for XML 1.0.

    So I set up an XHTML5 weblog which allows some markup in comments and, for a long time, I unwittingly fail to strip numeric character references to XML-forbidden characters from the input. Even if someone happens to enter such a reference there is no problem because the XHTML-specific error correction takes care of it. One day I decide to allow SVG in my comments and all of a sudden I have a problem because there is no error handling for the SVG content and a stray reference to a forbidden character can bring down the site.

    Of course one could go through and define error handling rules for each and every XML vocabulary that will ever be served to a web browser. However that rather breaks the idea of distributed extensibility since the browser will have to have knowledge of the error handling rules for each and every type of content it encounters.

    So I agree that, as long as the set of markup languages being served to browsers is somehow kept small, per-language error-handling rules will provide better recovery. Having said that, it’s not clear that the improvement will be worth the cost. Specifying error handling is hard so people are quite likely to do it badly. Moreover implementing different error recovery schemes depending on the current node’s language is much harder than taking a uniform approach, leading to a greater possibility of bugs and incompatibilities.

    If, however, you believe in distributed extensibility, browsers cannot have error handling for every unknown language they encounter and so something like XML5 is the only alternative to die-on-error.
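For the weblog scenario in this comment, the missing sanitizing step might look like the Python sketch below. It removes numeric character references whose code point falls outside the XML 1.0 `Char` production and leaves everything else alone. The function names are hypothetical, only the XML spellings `&#N;` and `&#xN;` are matched, and raw forbidden characters typed directly into the input are out of scope.

```python
import re
import xml.etree.ElementTree as ET

# Matches decimal (&#65;) and hexadecimal (&#x41;) character references.
NCR = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")


def is_xml10_char(code_point):
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def strip_forbidden_ncrs(text):
    """Drop references to forbidden characters; keep all other references."""

    def replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return match.group(0) if is_xml10_char(code_point) else ""

    return NCR.sub(replace, text)


if __name__ == "__main__":
    dirty = "<p>ok&#1;&#x41;&#xFFFE;</p>"
    clean = strip_forbidden_ncrs(dirty)
    print(clean)
    print(ET.fromstring(clean).text)
```

Applied to every comment before it is stored or served, a filter like this keeps one stray reference from taking down an XML-served page, whichever vocabulary it is embedded in.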

  • distler

      So I set up an XHTML5 weblog which allows some markup in comments and, for a long time, I unwittingly fail to strip numeric character references to XML-forbidden characters from the input. Even if someone happens to enter such a reference there is no problem because the XHTML-specific error correction takes care of it. One day I decide to allow SVG in my comments and all of a sudden I have a problem because there is no error handling for the SVG content and a stray reference to a forbidden character can bring down the site.

    I’m not sure I quite understand your scenario. The host language for the page on which the SVG comments appear is XHTML5, right?

    Ergo, it will be parsed by the client using whatever error-correction pertains to XHTML5, no? If this handles stray NCRs to illegal characters, great. Otherwise …

    I don’t see why embedding SVG markup (perfectly welcome, I should say, in the comments on my blog) changes (or should change) that.

    What this scenario does highlight, however, is a different problem.

    Note that, in Sam’s proposal, the default handling of this content, in non-XHTML5-aware browsers, would be as text/html.

    With XML5, if I understand the proposal correctly, the default handling in non-XML5-aware user agents (such as the Gecko browser I am currently using) would be as XML.

    In your scenario, that would be a disaster, with or without the embedded SVG.

  • jgraham

    I’m not sure I quite understand your scenario. The host language for the page on which the SVG comments appear is XHTML5, right?

    A scheme where the error correction appropriate to the document root namespace would apply to all content in the document regardless of namespace didn’t occur to me. Doesn’t that approach undermine the benefits of per-language error correction, since, for example, SVG in an HTML document would have HTML-optimised error correction, whilst HTML in an SVG document would have SVG-optimised error correction? It could also cause confusion as people copied fragments from one context to another and got different error-correcting behavior.

    I agree that the issue of fallback would need to be addressed, since fallback to XML 1.0 would be unpleasant.

  • distler

       Doesn’t that approach undermine the benefits of per-language error correction, since, for example, SVG in an HTML document would have HTML-optimised error correction,

    How practical would switching error-correction algorithms, based on the namespace of the current content, be? After all, since we’re not assuming the document is well-formed, you’d have to do a certain amount of error-correction before you could even decide what error-correction algorithm to apply!

    Moreover, while a large amount of research has gone into the error correction algorithms for (X)HTML, little or none has been done on what are the common authoring errors, and what should be the error-correcting algorithm for other XML dialects.

    Given this lack of research*, you could just take a stab in the dark and choose some algorithm or other. But whatever algorithm you choose, chances are that it will produce poor results.

    In the case of XHTML5 pages with embedded markup from other namespaces, you can concentrate on doing a good job with the part you know how to fix, and accept that you are going to do a poor job on the embedded markup.

    For other XML dialects, it’s not even worth trying. The results are going to be poor and (unlike XHTML5, where you could arrange a fallback to text/html processing), the interoperability problems are very serious.

    ——————–

    * I’m not sure how you’d do this research either. Unlike with HTML, there isn’t a large reservoir of broken SVG (say) pages, nor are SVG user-agent vendors continuously tweaking their liberal parsers to handle those broken pages.

  • karl

    hehe I didn’t see that the discussions were going on.

    So to answer Jacques Distler: I tried a few in the past when I was creating a list of CMSes for fun in my personal time.

    My *personal* blog is an XML pipe and I’m satisfied with it served as application/xhtml+xml. :)

    For the QA blog, some bloggers should make the difference between ego and community. The QA blog is not mine.

  • distler

    Karl,

    Well, I’m curious what “XML pipeline” you use for your personal blog, then.

    The #1 hit on that Google Search you suggested is Syncato. As it turns out, I looked into that system several years ago.

    1) The project has been moribund since about 2003.
    2) Of the User Sites listed, only one (in addition to the Syncato site itself) is still using Syncato. Not even the author is still using Syncato for his own site. (Insert prose about eating one’s own dogfood here.)
    3) In both cases, the site is served as text/html.
    4) Which is good, because in both cases, the site is not Namespace-Well-Formed. (This is because XHTML requires Namespaces, but Syncato is not Namespace-aware; or, at least, that’s the excuse.)

    I haven’t looked into the others, but if Syncato is at all representative of what one can find in the “xml content management system” universe, then the outlook is very dark, indeed.

  • Navarr Barnier

    Okay, we should all remember that serving as text/html is allowable for any version of XHTML less than 1.1, for browser compatibility reasons.

  • distler

    Sure it’s “allowed.”

    But what you are getting is HTML with extra forward slashes.

    It’s not XML, and if you tried to treat it as XML (as the example of the output of Syncato illustrates), it would burst into flames.

    That was the point made in Hixie’s comment which started this thread.