Support HTML mummification.
We need to support mummification of "normal" HTML documents, that is, HTML files that are not XHTML. We already support Markdown and XHTML, which cover both the "common user" and the "power user". But we should probably support HTML too, both for the "technically-savvy user" and to assist in converting legacy sites.
The parsing should be somewhat lenient, but there is no need to handle obscure or ancient HTML, as long as an error is produced if something isn't understood. The idea here is not to consume every HTML source that exists, but to provide a useful format for users who know what they're doing, as well as to convert reasonable HTML that already exists. The parser should err on the side of accuracy (i.e. not changing semantics) rather than leniency; for really mangled HTML it should produce an error rather than guess.
Note that HTML should use the RDFa prefix designation when determining prefixes for namespaces; see GUISE-68.
We'll need to find the best HTML parsing library that produces a good XHTML DOM tree, ideally with the correct HTML namespace, although this can be rectified after parsing if the library doesn't provide it.
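As a rough illustration of rectifying the namespace after parsing, the sketch below walks a W3C DOM tree and moves every element that has no namespace into the XHTML namespace, using the standard DOM Level 3 `Document.renameNode()` method. The `HtmlNamespaces` class and its method names are hypothetical, and the sketch assumes the parser hands back a plain `org.w3c.dom.Document`; elements that already have a namespace are left alone.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical helper for illustration; not part of any existing library. */
final class HtmlNamespaces {

    /** The XHTML namespace URI. */
    static final String XHTML_NAMESPACE_URI = "http://www.w3.org/1999/xhtml";

    private HtmlNamespaces() {}

    /** Moves every element in the document that has no namespace into the XHTML namespace. */
    static void applyXhtmlNamespace(Document document) {
        Element root = document.getDocumentElement();
        if (root != null) {
            applyXhtmlNamespace(document, root);
        }
    }

    private static void applyXhtmlNamespace(Document document, Node node) {
        Node current = node;
        if (node.getNodeType() == Node.ELEMENT_NODE && node.getNamespaceURI() == null) {
            // renameNode() may return a replacement node rather than renaming in place,
            // so continue the traversal from whatever it returns.
            current = document.renameNode(node, XHTML_NAMESPACE_URI, node.getNodeName());
        }
        Node child = current.getFirstChild();
        while (child != null) {
            // Capture the next sibling first, in case the child is replaced while being renamed.
            Node next = child.getNextSibling();
            applyXhtmlNamespace(document, child);
            child = next;
        }
    }
}
```

Calling `HtmlNamespaces.applyXhtmlNamespace(document)` once after parsing would be enough under these assumptions. Because `renameNode()` is allowed to create new nodes, any element references obtained before the call should be looked up again afterwards.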
A top contender is jsoup which apparently "parses HTML to the same DOM as modern browsers do", but from looking at the source code this unfortunately doesn't appear to mean the actual W3C Java DOM interfaces. Update: Good news: there seems to be a W3CDom class which can convert to a real DOM tree, although watch out for bugs such as #1096 and #1098.
From personal experience HtmlCleaner does not emphasize standards-compliance or rigor, and instead just provides a way to parse old, mangled HTML as sort of an initial pass, relying on the consumer to do further processing.
See also "Which HTML Parser is the best?" on Stack Overflow.
As a side note, one thing to watch out for is that the DOM API for retrieving an element attribute indicates that it returns an empty string "" instead of null for missing values, but it has been rumored that many browser DOM implementations actually return null and that DOM 4 may switch to returning null. So if an HTML parsing library is mimicking a browser, we need to verify which behavior it implements. The very-old GlobalMentor XML parsing code, which originally followed the letter of the W3C DOM API, may need to be updated to make sure it handles both transparently.
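One way to handle both behaviors transparently is to never rely on the return value of `getAttribute()` to detect a missing attribute, and to ask `hasAttribute()` instead. The following is a minimal sketch under that assumption; the `DomAttributes` class and `findAttribute()` method are hypothetical names, not existing GlobalMentor code.

```java
import java.util.Optional;
import org.w3c.dom.Element;

/** Hypothetical helper for illustration; not existing library code. */
final class DomAttributes {

    private DomAttributes() {}

    /**
     * Returns the value of the named attribute, or an empty Optional if the attribute is absent.
     * The result is the same whether the DOM implementation returns "" or null for a missing attribute.
     */
    static Optional<String> findAttribute(Element element, String name) {
        if (!element.hasAttribute(name)) {
            return Optional.empty(); // absent, regardless of what getAttribute() would return
        }
        String value = element.getAttribute(name);
        // Defensive: treat a null from a nonconforming implementation as an empty value.
        return Optional.of(value != null ? value : "");
    }
}
```

With this approach a present-but-empty attribute such as `href=""` yields `Optional.of("")`, while a missing attribute yields `Optional.empty()`, so callers can distinguish the two cases without knowing which convention the underlying DOM follows.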