Clean up XHTML 1.1 document implied attributes.
Parsing a document against the legacy XHTML 1.1 modular DTD -//W3C//DTD XHTML 1.1//EN results in many implied attributes being reified in the DOM, and others apparently being added erroneously. The attributes xmlns="http://www.w3.org/1999/xhtml" and xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" are added on almost every element, and moreover xml:space="preserve" is added for no apparent reason. (Note that it was later found that HTML preserves space by default, so maybe that was the reasoning.)
For example, this simple XHTML 1.1 document:
Results in this DOM tree:
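The mechanism can be reproduced without fetching the real DTD. The following is a minimal sketch, assuming a tiny internal DTD subset as a stand-in for the XHTML 1.1 modules; the class name, the sample document, and the single defaulted attribute are illustrative assumptions, not the actual XHTML 1.1 declarations.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class DefaultedAttributesDemo {

  // Assumption: a miniature DTD standing in for the XHTML 1.1 modules.
  // It declares a default value for xml:space on <p>.
  public static final String SAMPLE =
      "<?xml version=\"1.0\"?>\n"
      + "<!DOCTYPE html [\n"
      + "  <!ELEMENT html (p)>\n"
      + "  <!ELEMENT p (#PCDATA)>\n"
      + "  <!ATTLIST p xml:space (default|preserve) 'preserve'>\n"
      + "]>\n"
      + "<html><p>text</p></html>";

  /** Parses the given XML text into a namespace-aware DOM. */
  public static Document parse(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException("Unable to parse sample document", e);
    }
  }

  public static void main(String[] args) {
    Element p = (Element) parse(SAMPLE).getElementsByTagName("p").item(0);
    Attr space = p.getAttributeNode("xml:space");
    // The attribute is present in the DOM although the source never wrote it.
    // getSpecified() tells DTD-supplied defaults apart from authored attributes.
    System.out.println(space.getName() + "=" + space.getValue()
        + " specified=" + space.getSpecified());
  }
}
```

Checking getSpecified() is a convenient way to see which attributes in a parsed tree came from the DTD rather than from the document author.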
See the discussion at "Java XML parser adding unnecessary xmlns and xml:space attributes".
Add one or more cleanup steps that remove all this distracting cruft. For example, xmlns="http://www.w3.org/1999/xhtml" should only be declared on the root element.
will do a lot to mitigate this, as it will result in dropping the xml:space attribute altogether, as HTML5 prescribes.
I've found a workaround, although it's not ideal. The idea is that when a document asks to be parsed with the XHTML 1.1 DTD -//W3C//DTD XHTML 1.1//EN, the parser actually uses the XHTML 1.0 Strict DTD -//W3C//DTD XHTML 1.0 Strict//EN instead. For most practical purposes this DTD is almost the same as the one the document asked for, but it doesn't bring in all the defaulted cruft.
The implementation looks something like this:
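A minimal sketch of such a substitution, assuming a SAX EntityResolver backed by a hypothetical in-memory catalog that maps public IDs to DTD text. The class name and the catalog are illustrative assumptions, not the project's actual code; a real resolver would more likely load bundled copies of the DTDs.

```java
import java.io.StringReader;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

public class SubstitutingEntityResolver implements EntityResolver {

  public static final String XHTML_1_1 = "-//W3C//DTD XHTML 1.1//EN";
  public static final String XHTML_1_0_STRICT = "-//W3C//DTD XHTML 1.0 Strict//EN";

  // Hypothetical local catalog: public ID -> DTD text.
  private final Map<String, String> catalog;

  public SubstitutingEntityResolver(Map<String, String> catalog) {
    this.catalog = catalog;
  }

  /** Maps a requested public ID to the one that is actually served. */
  public static String effectivePublicId(String publicId) {
    return XHTML_1_1.equals(publicId) ? XHTML_1_0_STRICT : publicId;
  }

  @Override
  public InputSource resolveEntity(String publicId, String systemId) {
    if (publicId == null) {
      return null; // no public ID: let the parser resolve the system ID itself
    }
    String effectiveId = effectivePublicId(publicId);
    String dtd = catalog.get(effectiveId);
    if (dtd == null) {
      return null; // unknown DTD: fall back to default resolution
    }
    InputSource source = new InputSource(new StringReader(dtd));
    source.setPublicId(effectiveId);
    source.setSystemId(systemId);
    return source;
  }
}
```

Such a resolver would be installed with DocumentBuilder.setEntityResolver() before parsing. Returning null for anything outside the catalog keeps the parser's normal behavior for all other entities.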
It's somewhat of a kludge, and semantically I don't like it. But for mummification of legacy documents in Guise Mummy we just need a clean, well-formed parsed document with correct entity replacement, so in practice it may produce effectively the same results for most documents.
One approach that at least seems to work, as mentioned above, is importing the DOM into another document altogether, one without a DTD. It looks something like this:
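A minimal sketch of the import approach, assuming the same kind of miniature internal DTD as a stand-in for XHTML 1.1 (the class name and sample document are illustrative assumptions). It relies on the DOM rule that Document.importNode() copies only specified attributes of an element, so DTD-supplied defaults are left behind when the target document has no DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class DtdFreeImport {

  // Assumption: a miniature DTD standing in for the XHTML 1.1 modules.
  public static final String SAMPLE =
      "<?xml version=\"1.0\"?>\n"
      + "<!DOCTYPE html [\n"
      + "  <!ELEMENT html (p)>\n"
      + "  <!ELEMENT p (#PCDATA)>\n"
      + "  <!ATTLIST p xml:space (default|preserve) 'preserve'\n"
      + "              class CDATA #IMPLIED>\n"
      + "]>\n"
      + "<html><p class=\"intro\">text</p></html>";

  private static DocumentBuilderFactory newFactory() {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    return factory;
  }

  /** Parses the given XML text into a namespace-aware DOM. */
  public static Document parse(String xml) {
    try {
      return newFactory().newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException("Unable to parse document", e);
    }
  }

  /** Copies the element tree into a fresh document that has no doctype. */
  public static Document importWithoutDtd(Document source) {
    try {
      Document target = newFactory().newDocumentBuilder().newDocument();
      // Deep import: authored attributes come along, defaulted ones do not.
      target.appendChild(target.importNode(source.getDocumentElement(), true));
      return target;
    } catch (Exception e) {
      throw new IllegalStateException("Unable to import document", e);
    }
  }

  public static void main(String[] args) {
    Document clean = importWithoutDtd(parse(SAMPLE));
    Element p = (Element) clean.getElementsByTagName("p").item(0);
    System.out.println("class present: " + p.hasAttribute("class"));
    System.out.println("xml:space present: " + p.hasAttribute("xml:space"));
  }
}
```

The cost is a full deep copy of the tree, and the new document loses the doctype, so any later serialization has to add one back if it is wanted.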
I thought I was getting somewhere. I was just going to remove the unnecessary and undesired attributes. It started out like this:
Unfortunately this doesn't work; the DOM just adds back the deleted attributes because they have default values specified in the DTD. I have an open question on Stack Overflow, but so far there is no solution. The only way forward right now seems to be to import the DOM tree into a new document with no DTD, but that takes a while.
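The reappearing attribute can be observed in isolation. This is a minimal sketch, assuming the JDK's default DOM implementation and a miniature internal DTD as a stand-in for XHTML 1.1; the class name and sample document are illustrative assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class DefaultAttributeRemovalDemo {

  // Assumption: a miniature DTD standing in for the XHTML 1.1 modules.
  public static final String SAMPLE =
      "<?xml version=\"1.0\"?>\n"
      + "<!DOCTYPE html [\n"
      + "  <!ELEMENT html (p)>\n"
      + "  <!ELEMENT p (#PCDATA)>\n"
      + "  <!ATTLIST p xml:space (default|preserve) 'preserve'>\n"
      + "]>\n"
      + "<html><p>text</p></html>";

  /** Parses the given XML text into a namespace-aware DOM. */
  public static Document parse(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException("Unable to parse document", e);
    }
  }

  /** Tries to remove the attribute and reports whether it is still there. */
  public static boolean stillPresentAfterRemoval(Element element, String name) {
    element.removeAttribute(name);
    // While the doctype declares a default, the DOM reinstates the attribute.
    return element.hasAttribute(name);
  }

  public static void main(String[] args) {
    Element p = (Element) parse(SAMPLE).getElementsByTagName("p").item(0);
    System.out.println("still present: " + stillPresentAfterRemoval(p, "xml:space"));
  }
}
```

This matches the DOM specification's description of removeAttribute(): when the removed attribute has a default value, a new attribute with that default immediately takes its place.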
Interestingly, this is not a problem with XHTML 1.0 Strict.
When parsed, that yields:
Just as we would expect. But the extra attributes are also added with -//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN, so this seems to be a problem specific to XHTML 1.1 and the DTDs built on it.