Thursday, January 3, 2008

Why HTML should become a dead language

Go ahead, take a gander around the Internet. Let's see who is using
HTML, and who is using XHTML.

  • XHTML
    • Blogger [Served as HTML]
    • Opera [Served as XML]
    • Firefox (Mozilla) [Served as HTML]
    • Facebook [Served as HTML]
    • MySpace [Served as HTML]
    • Twitter [Served as HTML]
    • W3C [Served as HTML]
  • HTML
    • Microsoft
    • Yahoo!
    • Apple
    • Google
    • WHATWG

As you can see, the "newer" websites are serving XHTML. A few of them use a Transitional DOCTYPE (mainly websites where users can input HTML), but a large number of them are marked as XHTML 1.0 Strict using the W3C DTD.

On to my main point: HTML should NOT receive an update. The primary reason people want one is that HTML is FAMILIAR to designers and developers. However, HTML is very lax in how it is interpreted, a little too lax in my personal opinion.

An example: <input type="text" disabled>. The tag does not end; it's interpreted as a single tag, and the bare disabled attribute means the control is disabled. I, myself, find this to be horrible, horrible code. In XHTML, every tag must be closed and every attribute must have a value. Which is easier to program, an HTML parser or an XHTML parser?
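To make the contrast concrete, here is the same control written both ways (the second form is what XHTML's well-formedness rules require):

    <!-- HTML 4: the tag never closes, and the attribute is minimized -->
    <input type="text" disabled>

    <!-- XHTML 1.0: the tag is closed, and the attribute has an explicit value -->
    <input type="text" disabled="disabled" />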

Probably the XHTML one. HTML was never designed with easy machine parsing in mind, and it is quite a pain to attempt to parse with things like regular expressions, considering how lax a language it is.
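To illustrate (a minimal Python sketch of my own, not code from any site mentioned above): a stock XML parser accepts the well-formed XHTML version immediately, and rejects the lax HTML version outright.

    import xml.etree.ElementTree as ET

    # The well-formed XHTML fragment parses cleanly.
    good = '<input xmlns="http://www.w3.org/1999/xhtml" type="text" disabled="disabled" />'
    print(ET.fromstring(good).get('disabled'))  # prints: disabled

    # The lax HTML fragment is not well-formed XML: the tag never closes
    # and the attribute has no value, so the parser refuses it.
    try:
        ET.fromstring('<input type="text" disabled>')
    except ET.ParseError as err:
        print('Not XML:', err)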

In my own personal opinion, HTML should become a dead language. Some people hold on to it; let them. Let them continue making their websites in archaic HTML 4. But do we really need to run an update to it? No. We should move to XHTML. XML has already proven to be a very dependable format. It's used in popular applications like Jabber and Twitter, and it is commonly used with AJAX, which is the height of the "Web 2.0" revolution.

So why are we trying to revise HTML? Why can't we adopt a stricter set of rules that can be implemented more consistently? In my personal opinion, work on HTML should be halted and moved to XHTML. I still believe that the W3C and the WHATWG should work together on a new version of XHTML. XML Events and XForms should adopt the magnificent features that Web Forms 2.0 has created. Again, this is my personal opinion.

EDIT: Added what the sites above are actually served as. It's actually pretty depressing.
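If you want to check for yourself how a site is served, look at the Content-Type response header. A minimal Python sketch (any URL from the list above works the same way):

    import urllib.request

    # text/html means the page is treated as HTML;
    # application/xhtml+xml means it is processed as XML.
    with urllib.request.urlopen('http://www.w3.org/') as resp:
        print(resp.headers.get('Content-Type'))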

22 comments:

  1. Putting an XHTML DOCTYPE at the top of the file or putting an xmlns XHTML namespace declaration on the root element isn't enough to actually use XHTML. You also have to use the right MIME type. None of the sites you mention use the right MIME type. They all use text/html. So actually, they're all really using HTML, just mislabeled.

  2. As true as that is, it doesn't really invalidate my point that we should move to a stricter language. But you are right: it's a tad bit depressing that the only one serving with a proper MIME type is Opera.
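    For reference, the difference comes down to a single response header; these two values are the possibilities, not captures from any particular site:

        Content-Type: text/html                (treated as HTML; tag soup tolerated)
        Content-Type: application/xhtml+xml    (treated as XML; must be well-formed)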

  3. I believe that XML5 is a hopelessly ill-conceived idea (as you may be able to tell from my comments on Anne's blog).

    The "present" of XHTML on the web is that it is a niche language for those with special needs (like embedding MathML and SVG).

    Its "future" may be to fade away (if a suitable extension mechanism for HTML5 is adopted). Or it may be to blossom, if someone comes up with a "killer application" for it that motivates CMS authors to write something capable of reliably producing XHTML.

    Either way, the "future" is a long way off, and those of us with "special needs" are not waiting around for it to arrive.

    ReplyDelete
  4. I wrote:

      as you may be able to tell from my comments on Anne's blog

    Sorry. I meant on Sam's blog.

  5. Jacques, it's not clear to me why you think that XML5 is a "hopelessly ill-conceived idea". Clearly there is some disagreement over the details, e.g. whether ASCII representations of non-ASCII characters should be included (via entity references or some other mechanism) and how such content should be served. But I'm not sure what the major philosophical objection is to a language with a non-vocabulary-specific parsing model and error handling. In particular, your set of proposed futures for XHTML fails to cover the case where no workable extensibility mechanism for HTML can be found, but people are still as unwilling to accept the XHTML error-handling behavior as they are today.

    If we don't explore the path forward, the future will be a long way off forever :)

  6. Anne's conception, literally, was that XML5 would be a replacement for XML 1.0. I think that is a nonstarter. There is simply too much XML 1.0 infrastructure in place.

    Sam had a more realistic suggestion: a "liberal" parsing mode (with error correction) for XHTML5.

    That's a perfectly reasonable suggestion. But it does have its pitfalls: aside from the MIME type, you have no way of guessing whether a given XHTML5 document is processable with XML 1.0 tools. People will surely try to use XML 1.0 tools to consume such content, either directly, or when syndicated in an Atom feed, or ...

    So you are setting yourself up for the interoperability problems that Appendix C engendered: lots and lots of content that looks like XML, but isn't.

    This is not an insuperable objection to Sam's suggestion. But it is something to worry about.

    XML5, however, as a blanket replacement for XML 1.0, just makes no sense. Heck, not even XML 1.1 has gotten any traction and, by comparison, it's a minor modification of XML 1.0.

  7. Oh, and lest I forget, there's another reason why XML5 is a dumb idea.

    Successful error-correction requires not merely that the error-correction algorithm be well-specified. It also needs to offer a better-than-even chance of capturing the author's intent.

    This is highly language-dependent. It requires knowledge of what the common authoring errors are in that particular language.

    In the case of HTML5, that knowledge was gathered over the space of many years by the browser vendors, on the basis of billions of error-laden HTML pages.
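    (One illustrative example, not the whole story: mis-nested inline markup like

        <b><i>bold and italic</b> still italic?</i>

    is so common that browsers long ago converged on compatible recovery behavior for it. An XML parser would simply reject it.)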

    But I think it is impossible to ascertain (even if the question made any sense) what the most common authoring errors are in an arbitrary (unknown) XML dialect.

    So, even if it made sense to try, I'm not sure you could come up with a successful general-purpose error-correction algorithm for XML.

  8. Just to make it clear: Hixie's comment is intentionally misleading. The websites use an HTML MIME type on an XHTML format (XML). This MIME type is authorized by the XHTML 1.0 specification. (I'm not discussing here whether it was right to do this or not.) So while the document is not viewed as XML on the Web in a browser, that doesn't mean it loses all its XML properties.

    Using an XHTML format has benefits for certain users who are using an XML hub to process their documents. You can also get the character stream of XML data and process it locally.

    So there is no reason to be depressed; you can use XHTML if you need it. It's just a question of being pragmatic for yourself.

  9. From XHTML Media Types:

    ---------------
    The 'text/html' media type [RFC2854] is primarily for HTML, not for XHTML. In general, this media type is NOT suitable for XHTML. However, as [RFC2854] says, "[XHTML1] defines a profile of use of XHTML which is compatible with HTML 4.01 and which may also be labeled as text/html."

    [XHTML1], Appendix C "HTML Compatibility Guidelines" summarizes design guidelines for authors who wish their XHTML documents to render on existing HTML user agents. The use of 'text/html' for XHTML SHOULD be limited for the purpose of rendering on existing HTML user agents, and SHOULD be limited to [XHTML1] documents which follow the HTML Compatibility Guidelines. In particular, 'text/html' is NOT suitable for XHTML Family document types that add elements and attributes from foreign namespaces, such as XHTML+MathML [XHTML+MathML].

    XHTML documents served as 'text/html' will not be processed as XML [XML10], e.g. well-formedness errors may not be detected by user agents. Also be aware that HTML rules will be applied for DOM and style sheets (see C.11 and C.13 of [XHTML1] respectively).
    ------------------------

    The last point is particularly important. From the client's perspective, any document sent as 'text/html' is to be treated as an HTML document.

    Karl added:

      Using an XHTML format has benefits for certain users who are using an XML hub to process their documents.

    I highly doubt any of the websites mentioned in this post are using an XML-based backend to produce their "XHTML" documents.

    Not that it's impossible to do so, just that, in practice, no current CMSs actually work that way.

  10. Jacques :)

    1. There are a few CMSs working with XML.
    2. You took only one side of my assertion. I said two things: a) content producers using XML; b) content consumers using XML.

    In some cases, I don't care if a website serves its XHTML pages as text/html, when I am writing a web-scraping application that downloads the document and can play with it locally. It makes my work a lot easier when it is already XML :) But I guess it is a question of taste ;)

  11. Karl wrote:

      1. There are a few CMSs working with XML.

    Names? I am, for obvious reasons, very interested in such CMSs.

      You took only one side of my assertion. I said two things: a) content producers using XML; b) content consumers using XML.

    Yes, it's possible, as a content-consumer, to take a text/html document and feed it to an XML parser.

    It is also possible to feed it to a GIF decoder.

    In both cases, in ignoring the MIME type, the results will be unreliable.

    The percentage of "XHTML" websites which are a) served as text/html and b) well-formed is so minute as to be indistinguishable from zero.

    If you want to screen-scrape, there are far more reliable off-the-shelf HTML parsers you could use. Using an XML parser would be a rather poor idea.
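    A minimal Python sketch of the distinction (the LinkCollector name is purely illustrative; the standard library's lenient HTML parser stands in for those off-the-shelf parsers):

        from html.parser import HTMLParser

        # A tolerant HTML parser: collects href values from <a> tags
        # and never complains about unclosed elements.
        class LinkCollector(HTMLParser):
            def handle_starttag(self, tag, attrs):
                if tag == 'a':
                    print(dict(attrs).get('href'))

        # Tag soup that an XML parser would reject parses without complaint.
        LinkCollector().feed('<p>Hello <a href="/about">about<br>us')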

  12. Uniform error correction for all XML vocabularies is, as you note, not going to correct errors as well as vocabulary-specific error correction, so it's not surprising that (iirc) there are people working on XML parsers with pluggable per-language error recovery.

    However, XML5 is interesting partially because it is the simplest thing that could possibly work.

    Sure, it's not going to render broken markup exactly as the author intended. But, with a language like HTML, for most classes of error, it will allow a site where an unexpected error creeps in to keep working. Yes, authors will have to be more careful than they are when producing text/html, but not that much more careful.

    My hope is that there is a sweet spot somewhere on the spectrum between a format with a language-specific grammar, excellent error recovery, but severe constraints on future extensibility, and a format designed for syntactic uniformity and extensibility but with no tolerance for a wide range of errors. XML5 seems like a decent first stab at finding that sweet spot.

  13. Out of curiosity, Karl:

    1) Have you used any of these "XML CMSs"?
    2) If not, why not?
    3) If yes, has the experience given you any insight as to why only a negligible fraction of "XHTML" websites are well-formed?

    James wrote:

      Sure, it's not going to render broken markup exactly as the author intended. But, with a language like HTML, for most classes of error, it will allow a site where an unexpected error creeps in to keep working.

    I don't argue that error-correction for XHTML5 (along the lines of Sam's suggestion) could be of value.

    What I question is whether there is any value in a general-purpose error-correcting replacement for XML 1.0.

    XML5 seems like the worst possible compromise: if you're targeting XHTML, you could do a much better, language-specific job of error correction. If you're targeting generic XML, this seems like a "solution" in search of a problem.

  14.   I don't argue that error-correction for XHTML5 (along the lines of Sam's suggestion) could be of value.

        What I question is whether there is any value in a general-purpose error-correcting replacement for XML 1.0.

    So I set up an XHTML5 weblog which allows some markup in comments, and for a long time I unwittingly fail to strip numeric character references to XML-forbidden characters from the input. Even if someone happens to enter such a reference, there is no problem, because the XHTML-specific error correction takes care of it. One day I decide to allow SVG in my comments, and all of a sudden I have a problem, because there is no error handling for the SVG content and a stray reference to a forbidden character can bring down the site.
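    (Concretely, and picking one forbidden character as an arbitrary example: a comment containing

        <p>oops, a form feed: &#12;</p>

    is recoverable tag soup in HTML, but in XML 1.0 a character reference to a character outside the legal range is a fatal well-formedness error.)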

    Of course, one could go through and define error-handling rules for each and every XML vocabulary that will ever be served to a web browser. However, that rather breaks the idea of distributed extensibility, since the browser would have to have knowledge of the error-handling rules for each and every type of content it encounters.

    So I agree that, as long as the set of markup languages being served to browsers is somehow kept small, per-language error-handling rules will provide better recovery. Having said that, it's not clear that the improvement will be worth the cost. Specifying error handling is hard, so people are quite likely to do it badly. Moreover, implementing different error-recovery schemes depending on the current node's language is much harder than taking a uniform approach, leading to a greater possibility of bugs and incompatibilities.

    If, however, you believe in distributed extensibility, browsers cannot have error handling for every unknown language they encounter, and so something like XML5 is the only alternative to die-on-error.

  15.   So I set up an XHTML5 weblog which allows some markup in comments, and for a long time I unwittingly fail to strip numeric character references to XML-forbidden characters from the input. Even if someone happens to enter such a reference, there is no problem, because the XHTML-specific error correction takes care of it. One day I decide to allow SVG in my comments, and all of a sudden I have a problem, because there is no error handling for the SVG content and a stray reference to a forbidden character can bring down the site.

    I'm not sure I quite understand your scenario. The host language for the page on which the SVG comments appear is XHTML5, right?

    Ergo, it will be parsed by the client using whatever error-correction pertains to XHTML5, no? If this handles stray NCRs to illegal characters, great. Otherwise ...

    I don't see why embedding SVG markup (perfectly welcome, I should say, in the comments on my blog) changes (or should change) that.

    What this scenario does highlight, however, is a different problem.

    Note that, in Sam's proposal, the default handling of this content, in non-XHTML5-aware browsers, would be as text/html.

    With XML5, if I understand the proposal correctly, the default handling in non-XML5-aware user agents (such as the Gecko browser I am currently using) would be as XML.

    In your scenario, that would be a disaster, with or without the embedded SVG.

  16.   I'm not sure I quite understand your scenario. The host language for the page on which the SVG comments appear is XHTML5, right?

    A scheme where the error correction appropriate to the document root's namespace would apply to all content in the document, regardless of namespace, didn't occur to me. Doesn't that approach undermine the benefits of per-language error correction, since, for example, SVG in an HTML document would get HTML-optimised error correction, whilst HTML in an SVG document would get SVG-optimised error correction? It could also cause confusion as people copied fragments from one context to another and got different error-correcting behavior.

    I agree that the issue of fallback would need to be addressed, since fallback to XML 1.0 would be unpleasant.

  17.   Doesn't that approach undermine the benefits of per-language error correction, since, for example, SVG in an HTML document would get HTML-optimised error correction,

    How practical would switching error-correction algorithms based on the namespace of the current content be? After all, since we're not assuming the document is well-formed, you'd have to do a certain amount of error-correction before you could even decide which error-correction algorithm to apply!

    Moreover, while a large amount of research has gone into the error-correction algorithms for (X)HTML, little or none has been done on what the common authoring errors are, and what the error-correcting algorithm should be, for other XML dialects.

    Given this lack of research*, you could just take a stab in the dark and choose some algorithm or other. But whatever algorithm you choose, chances are that it will produce poor results.

    In the case of XHTML5 pages with embedded markup from other namespaces, you can concentrate on doing a good job with the part you know how to fix, and accept that you are going to do a poor job on the embedded markup.

    For other XML dialects, it's not even worth trying. The results are going to be poor and (unlike XHTML5, where you could arrange a fallback to text/html processing) the interoperability problems are very serious.

    --------------------

    * I'm not sure how you'd do this research either. Unlike with HTML, there isn't a large reservoir of broken SVG (say) pages, nor are SVG user-agent vendors continuously tweaking their liberal parsers to handle those broken pages.

  18. Hehe, I didn't see that the discussion was still going on.

    So, to answer, Jacques Distler: I tried a few in the past, when I was creating a list of CMSs for fun in my personal time.

    My *personal* blog is an XML pipe, and I'm satisfied with it served as application/xhtml+xml. :)

    As for the QA blog, some bloggers should recognize the difference between ego and community. The QA blog is not mine.

  19. Karl,

    Well, I'm curious what "XML pipeline" you use for your personal blog, then.

    The #1 hit on that Google search you suggested is Syncato. As it turns out, I looked into that system several years ago.

    1) The project has been moribund since about 2003.
    2) Of the User Sites listed, only one (in addition to the Syncato site itself) is still using Syncato. Not even the author is still using Syncato for his own site. (Insert prose about eating one's own dogfood here.)
    3) In both cases, the site is served as text/html.
    4) Which is good, because in both cases, the site is not Namespace-Well-Formed. (This is because XHTML requires Namespaces, but Syncato is not Namespace-aware; or, at least, that's the excuse.)

    I haven't looked into the others, but if Syncato is at all representative of what one can find in the "xml content management system" universe, then the outlook is very dark indeed.

  20. Okay, we should all remember that serving as text/html is allowable for any version of XHTML earlier than 1.1, for browser-compatibility reasons.

  21. Sure, it's "allowed."

    But what you are getting is HTML with extra forward slashes.

    It's not XML, and if you tried to treat it as XML (as the example of the output of Syncato illustrates), it would burst into flames.

    That was the point made in Hixie's comment which started this thread.
