Thursday, January 3, 2008

Why HTML should become a dead language

Go ahead, take a gander around the Internet. Let's see who is using
HTML, and who is using XHTML.

  • XHTML
    • Blogger [Served as HTML]
    • Opera [Served as XML]
    • Firefox (Mozilla) [Served as HTML]
    • Facebook [Served as HTML]
    • MySpace [Served as HTML]
    • Twitter [Served as HTML]
    • W3C [Served as HTML]
  • HTML
    • Microsoft
    • Yahoo!
    • Apple
    • Google
    • WHATWG

As you can see, the "newer" websites are serving XHTML. A few of them use a Transitional DOCTYPE (mainly websites where users can input HTML), but a large number of them are marked as XHTML 1.0 Strict using the W3C DTD.

On to my main point: HTML should NOT receive an update. The primary reason people want one is that HTML is FAMILIAR to designers and developers. However, HTML is very lax in how it is interpreted, a little too lax in my personal opinion.

An example: <input type="text" disabled>. The tag does not end; it's interpreted as a single tag, and the bare disabled attribute means the control is disabled. I, myself, find this to be horrible, horrible code. In XHTML, every tag must be closed and every attribute must have a value. Which is easier to program, an HTML parser or an XHTML parser?
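To make the contrast concrete, here is the same control written both ways (the second form is what XHTML's well-formedness rules require):

    <!-- HTML 4: the tag never closes, and the attribute is minimized -->
    <input type="text" disabled>

    <!-- XHTML 1.0: the tag is closed, and the attribute has an explicit value -->
    <input type="text" disabled="disabled" />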

Probably the XHTML one. HTML was never designed with easy machine parsing in mind, and it is quite a pain to attempt to parse with things like regular expressions, considering how lax a language it is.
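To illustrate (a minimal Python sketch of my own, not code from any site mentioned above): a stock XML parser accepts the well-formed XHTML version immediately, and rejects the lax HTML version outright.

    import xml.etree.ElementTree as ET

    # The well-formed XHTML fragment parses cleanly.
    good = '<input xmlns="http://www.w3.org/1999/xhtml" type="text" disabled="disabled" />'
    print(ET.fromstring(good).get('disabled'))  # prints: disabled

    # The lax HTML fragment is not well-formed XML: the tag never closes
    # and the attribute has no value, so the parser refuses it.
    try:
        ET.fromstring('<input type="text" disabled>')
    except ET.ParseError as err:
        print('Not XML:', err)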

In my own personal opinion, HTML should become a dead language. Some people hold on to it; let them. Let them continue making their websites in archaic HTML 4. But do we really need to run an update to it? No. We should move to XHTML. XML has already proven to be a very dependable format. It's used in popular applications like Jabber and Twitter, and it is commonly used with AJAX, which is the height of the "Web 2.0" revolution.

So why are we trying to revise HTML? Why can't we adopt a stricter set of rules that can be implemented more consistently? In my personal opinion, work on HTML should be halted and moved to XHTML. I still believe that the W3C and the WHATWG should work together on a new version of XHTML. XML Events and XForms should adopt the magnificent features that Web Forms 2.0 has created. Again, this is my personal opinion.

EDIT: Added what the sites above are actually served as. It's actually pretty depressing.
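If you want to check for yourself how a site is served, look at the Content-Type response header. A minimal Python sketch (any URL from the list above works the same way):

    import urllib.request

    # text/html means the page is treated as HTML;
    # application/xhtml+xml means it is processed as XML.
    with urllib.request.urlopen('http://www.w3.org/') as resp:
        print(resp.headers.get('Content-Type'))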

22 comments:

  1. Putting an XHTML DOCTYPE at the top of the file or putting an xmlns XHTML namespace declaration on the root element isn't enough to actually use XHTML. You also have to use the right MIME type. None of the sites you mention use the right MIME type. They all use text/html. So actually, they're all really using HTML, just mislabeled.

  2. As true as that is, it doesn't really invalidate my point that we should move to a stricter language. But you are right: it's a tad bit depressing that the only one serving with a proper MIME type is Opera.
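    For reference, the difference comes down to a single response header; these two values are the possibilities, not captures from any particular site:

        Content-Type: text/html                (treated as HTML; tag soup tolerated)
        Content-Type: application/xhtml+xml    (treated as XML; must be well-formed)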

  3. I believe that XML5 is a hopelessly ill-conceived idea (as you may be able to tell from my comments on Anne's blog).

    The "present" of XHTML on the web is that it is a niche language for those with special needs (like embedding MathML and SVG).

    Its "future" may be to fade away (if a suitable extension mechanism for HTML5 is adopted). Or it may be to blossom, if someone comes up with a "killer application" for it that motivates CMS authors to write something capable of reliably producing XHTML.

    Either way, the "future" is a long way off, and those of us with "special needs" are not waiting around for it to arrive.

    ReplyDelete
  4. I wrote:

      as you may be able to tell from my comments on Anne's blog

    Sorry. I meant on Sam's blog.

  5. Jacques, it's not clear to me why you think that XML5 is a "hopelessly ill-conceived idea". Clearly there is some disagreement over the details, e.g. whether ASCII representations of non-ASCII characters should be included (via entity references or some other mechanism) and how such content should be served. But I'm not sure what the major philosophical objection is to a language with a non-vocabulary-specific parsing model and error handling. In particular, your set of proposed futures for XHTML fails to cover the case where no workable extensibility mechanism for HTML can be found, but people are still as unwilling to accept the XHTML error-handling behavior as they are today.

    If we don't explore the path forward, the future will be a long way off forever :)

  6. Anne's conception, literally, was that XML5 would be a replacement for XML 1.0. I think that is a nonstarter. There is simply too much XML 1.0 infrastructure in place.

    Sam had a more realistic suggestion: a "liberal" parsing mode (with error correction) for XHTML5.

    That's a perfectly reasonable suggestion. But it does have its pitfalls: aside from the MIME type, you have no way of guessing whether a given XHTML5 document is processable with XML 1.0 tools. People will surely try to use XML 1.0 tools to consume such content, either directly, or when syndicated in an Atom feed, or ...

    So you are setting yourself up for the interoperability problems that Appendix C engendered: lots and lots of content that looks like XML, but isn't.

    This is not an insuperable objection to Sam's suggestion. But it is something to worry about.

    XML5, however, as a blanket replacement for XML 1.0, just makes no sense. Heck, not even XML 1.1 has gotten any traction and, by comparison, it's a minor modification of XML 1.0.

  7. Oh, and lest I forget, there's another reason why XML5 is a dumb idea.

    Successful error-correction requires not merely that the error-correction algorithm be well-specified. It also needs to offer a better-than-even chance of capturing the author's intent.

    This is highly language-dependent. It requires knowledge of what the common authoring errors are in that particular language.

    In the case of HTML5, that knowledge was gathered over the space of many years by the browser vendors, on the basis of billions of error-laden HTML pages.
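    (One illustrative example, not the whole story: mis-nested inline markup like

        <b><i>bold and italic</b> still italic?</i>

    is so common that browsers long ago converged on compatible recovery behavior for it. An XML parser would simply reject it.)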

    But I think it is impossible to ascertain (even if the question made any sense) what the most common authoring errors are in an arbitrary (unknown) XML dialect.

    So, even if it made sense to try, I'm not sure you could come up with a successful general-purpose error-correction algorithm for XML.

  8. Just to make it clear: Hixie's comment is intentionally misleading. The websites use an HTML MIME type on an XHTML format (XML). This MIME type is authorized by the XHTML 1.0 specification. (I'm not discussing here whether it was right to do this or not.) So while the document is not viewed as XML on the Web in a browser, that doesn't mean it loses all its XML properties.

    Using an XHTML format has benefits for certain users who are using an XML hub to process their documents. You can also get the character stream of XML data and process it locally.

    So there is no reason to be depressed; you can use XHTML if you need it. It's just a question of being pragmatic for yourself.

  9. From XHTML Media Types:

    ---------------
    The 'text/html' media type [RFC2854] is primarily for HTML, not for XHTML. In general, this media type is NOT suitable for XHTML. However, as [RFC2854] says, "[XHTML1] defines a profile of use of XHTML which is compatible with HTML 4.01 and which may also be labeled as text/html."

    [XHTML1], Appendix C "HTML Compatibility Guidelines" summarizes design guidelines for authors who wish their XHTML documents to render on existing HTML user agents. The use of 'text/html' for XHTML SHOULD be limited for the purpose of rendering on existing HTML user agents, and SHOULD be limited to [XHTML1] documents which follow the HTML Compatibility Guidelines. In particular, 'text/html' is NOT suitable for XHTML Family document types that add elements and attributes from foreign namespaces, such as XHTML+MathML [XHTML+MathML].

    XHTML documents served as 'text/html' will not be processed as XML [XML10], e.g. well-formedness errors may not be detected by user agents. Also be aware that HTML rules will be applied for DOM and style sheets (see C.11 and C.13 of [XHTML1] respectively).
    ------------------------

    The last point is particularly important. From the client's perspective, any document sent as 'text/html' is to be treated as an HTML document.

    Karl added:

      Using an XHTML format has benefits for certain users who are using an XML hub to process their documents.

    I highly doubt any of the websites mentioned in this post are using an XML-based backend to produce their "XHTML" documents.

    Not that it's impossible to do so, just that, in practice, no current CMSs actually work that way.

  10. Jacques :)

    1. There are a few CMSs working with XML.
    2. You took only one side of my assertion. I said two things: a) content producers using XML; b) content consumers using XML.

    In some cases, I don't care if a website serves its XHTML pages as text/html, when I am writing a web-scraping application that downloads the document and can play with it locally. It makes my work a lot easier when it is already XML :) But I guess it is a question of taste ;)

  11. Karl wrote:

      1. There are a few CMSs working with XML.

    Names? I am, for obvious reasons, very interested in such CMSs.

      You took only one side of my assertion. I said two things: a) content producers using XML; b) content consumers using XML.

    Yes, it's possible, as a content-consumer, to take a text/html document and feed it to an XML parser.

    It is also possible to feed it to a GIF decoder.

    In both cases, in ignoring the MIME type, the results will be unreliable.

    The percentage of "XHTML" websites which are a) served as text/html and b) well-formed is so minute as to be indistinguishable from zero.

    If you want to screen-scrape, there are far more reliable off-the-shelf HTML parsers you could use. Using an XML parser would be a rather poor idea.
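    A minimal Python sketch of the distinction (the LinkCollector name is purely illustrative; the standard library's lenient HTML parser stands in for those off-the-shelf parsers):

        from html.parser import HTMLParser

        # A tolerant HTML parser: collects href values from <a> tags
        # and never complains about unclosed elements.
        class LinkCollector(HTMLParser):
            def handle_starttag(self, tag, attrs):
                if tag == 'a':
                    print(dict(attrs).get('href'))

        # Tag soup that an XML parser would reject parses without complaint.
        LinkCollector().feed('<p>Hello <a href="/about">about<br>us')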

  12. Uniform error correction for all XML vocabularies is, as you note, not going to correct errors as well as vocabulary-specific error correction, so it's not surprising that (iirc) there are people working on XML parsers with pluggable per-language error recovery.

    However, XML5 is interesting partially because it is the simplest thing that could possibly work.

    Sure, it's not going to render broken markup exactly as the author intended. But, with a language like HTML, for most classes of error, it will allow a site where an unexpected error creeps in to keep working. Yes, authors will have to be more careful than they are when producing text/html, but not that much more careful.

    My hope is that there is a sweet spot somewhere on the spectrum between a format with a language-specific grammar, excellent error recovery, but severe constraints on future extensibility, and a format designed for syntactic uniformity and extensibility but with no tolerance for a wide range of errors. XML5 seems like a decent first stab at finding that sweet spot.

  13. Out of curiosity, Karl:

    1) Have you used any of these "XML CMSs"?
    2) If not, why not?
    3) If yes, has the experience given you any insight as to why only a negligible fraction of "XHTML" websites are well-formed?

    James wrote:

      Sure, it's not going to render broken markup exactly as the author intended. But, with a language like HTML, for most classes of error, it will allow a site where an unexpected error creeps in to keep working.

    I don't argue that error-correction for XHTML5 (along the lines of Sam's suggestion) could be of value.

    What I question is whether there is any value in a general-purpose error-correcting replacement for XML 1.0.

    XML5 seems like the worst possible compromise: if you're targeting XHTML, you could do a much better, language-specific job of error correction. If you're targeting generic XML, this seems like a "solution" in search of a problem.

  14.   I don't argue that error-correction for XHTML5 (along the lines of Sam's suggestion) could be of value.

        What I question is whether there is any value in a general-purpose error-correcting replacement for XML 1.0.

    So I set up an XHTML5 weblog which allows some markup in comments, and for a long time I unwittingly fail to strip numeric character references to XML-forbidden characters from the input. Even if someone happens to enter such a reference, there is no problem, because the XHTML-specific error correction takes care of it. One day I decide to allow SVG in my comments, and all of a sudden I have a problem, because there is no error handling for the SVG content and a stray reference to a forbidden character can bring down the site.
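    (Concretely, and picking one forbidden character as an arbitrary example: a comment containing

        <p>oops, a form feed: &#12;</p>

    is recoverable tag soup in HTML, but in XML 1.0 a character reference to a character outside the legal range is a fatal well-formedness error.)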

    Of course, one could go through and define error-handling rules for each and every XML vocabulary that will ever be served to a web browser. However, that rather breaks the idea of distributed extensibility, since the browser would have to have knowledge of the error-handling rules for each and every type of content it encounters.

    So I agree that, as long as the set of markup languages being served to browsers is somehow kept small, per-language error-handling rules will provide better recovery. Having said that, it's not clear that the improvement will be worth the cost. Specifying error handling is hard, so people are quite likely to do it badly. Moreover, implementing different error-recovery schemes depending on the current node's language is much harder than taking a uniform approach, leading to a greater possibility of bugs and incompatibilities.

    If, however, you believe in distributed extensibility, browsers cannot have error handling for every unknown language they encounter, and so something like XML5 is the only alternative to die-on-error.

  15.   So I set up an XHTML5 weblog which allows some markup in comments, and for a long time I unwittingly fail to strip numeric character references to XML-forbidden characters from the input. Even if someone happens to enter such a reference, there is no problem, because the XHTML-specific error correction takes care of it. One day I decide to allow SVG in my comments, and all of a sudden I have a problem, because there is no error handling for the SVG content and a stray reference to a forbidden character can bring down the site.

    I'm not sure I quite understand your scenario. The host language for the page on which the SVG comments appear is XHTML5, right?

    Ergo, it will be parsed by the client using whatever error-correction pertains to XHTML5, no? If this handles stray NCRs to illegal characters, great. Otherwise ...

    I don't see why embedding SVG markup (perfectly welcome, I should say, in the comments on my blog) changes (or should change) that.

    What this scenario does highlight, however, is a different problem.

    Note that, in Sam's proposal, the default handling of this content, in non-XHTML5-aware browsers, would be as text/html.

    With XML5, if I understand the proposal correctly, the default handling in non-XML5-aware user agents (such as the Gecko browser I am currently using) would be as XML.

    In your scenario, that would be a disaster, with or without the embedded SVG.

  16.   I'm not sure I quite understand your scenario. The host language for the page on which the SVG comments appear is XHTML5, right?

    A scheme where the error correction appropriate to the document root's namespace would apply to all content in the document, regardless of namespace, didn't occur to me. Doesn't that approach undermine the benefits of per-language error correction, since, for example, SVG in an HTML document would get HTML-optimised error correction, whilst HTML in an SVG document would get SVG-optimised error correction? It could also cause confusion as people copied fragments from one context to another and got different error-correcting behavior.

    I agree that the issue of fallback would need to be addressed, since fallback to XML 1.0 would be unpleasant.

  17.   Doesn't that approach undermine the benefits of per-language error correction, since, for example, SVG in an HTML document would get HTML-optimised error correction,

    How practical would switching error-correction algorithms based on the namespace of the current content be? After all, since we're not assuming the document is well-formed, you'd have to do a certain amount of error-correction before you could even decide which error-correction algorithm to apply!

    Moreover, while a large amount of research has gone into the error-correction algorithms for (X)HTML, little or none has been done on what the common authoring errors are, and what the error-correcting algorithm should be, for other XML dialects.

    Given this lack of research*, you could just take a stab in the dark and choose some algorithm or other. But whatever algorithm you choose, chances are that it will produce poor results.

    In the case of XHTML5 pages with embedded markup from other namespaces, you can concentrate on doing a good job with the part you know how to fix, and accept that you are going to do a poor job on the embedded markup.

    For other XML dialects, it's not even worth trying. The results are going to be poor and (unlike XHTML5, where you could arrange a fallback to text/html processing) the interoperability problems are very serious.

    --------------------

    * I'm not sure how you'd do this research either. Unlike with HTML, there isn't a large reservoir of broken SVG (say) pages, nor are SVG user-agent vendors continuously tweaking their liberal parsers to handle those broken pages.

  18. Hehe, I didn't see that the discussion was still going on.

    So, to answer, Jacques Distler: I tried a few in the past, when I was creating a list of CMSs for fun in my personal time.

    My *personal* blog is an XML pipe, and I'm satisfied with it served as application/xhtml+xml. :)

    As for the QA blog, some bloggers should recognize the difference between ego and community. The QA blog is not mine.

  19. Karl,

    Well, I'm curious what "XML pipeline" you use for your personal blog, then.

    The #1 hit on that Google search you suggested is Syncato. As it turns out, I looked into that system several years ago.

    1) The project has been moribund since about 2003.
    2) Of the User Sites listed, only one (in addition to the Syncato site itself) is still using Syncato. Not even the author is still using Syncato for his own site. (Insert prose about eating one's own dogfood here.)
    3) In both cases, the site is served as text/html.
    4) Which is good, because in both cases, the site is not Namespace-Well-Formed. (This is because XHTML requires Namespaces, but Syncato is not Namespace-aware; or, at least, that's the excuse.)

    I haven't looked into the others, but if Syncato is at all representative of what one can find in the "xml content management system" universe, then the outlook is very dark indeed.

  20. Okay, we should all remember that serving as text/html is allowable for any version of XHTML earlier than 1.1, for browser-compatibility reasons.

  21. Sure, it's "allowed."

    But what you are getting is HTML with extra forward slashes.

    It's not XML, and if you tried to treat it as XML (as the example of the output of Syncato illustrates), it would burst into flames.

    That was the point made in Hixie's comment which started this thread.
