Convert Windows-1252 characters to Unicode.
Description
The Windows-1252 character set differs from ISO-8859-1 in the range 0x80 (128) to 0x9F (159), where it defines printable characters that correspond to Unicode characters with other code points. Many HTML documents nevertheless contain numeric character references such as &#146; in a mistaken attempt to indicate a Unicode character such as the right single quotation mark (U+2019). Some browsers display the intended character even though the referenced code point comes from Windows-1252, so it may not be immediately obvious that there is a problem.
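The mismatch can be seen in a short Python sketch. It is only an illustration of the problem, not part of Guise; the variable names are made up for the example, and the value 146 is assumed as the number used in the reference. The last line uses Python's html.unescape, which applies the same lenient handling that browsers do.

```python
import html
import unicodedata

code = 146  # the number in a reference such as &#146;

# Read strictly as a Unicode code point, 146 is U+0092, a C1 control character.
as_unicode = chr(code)

# Read as a Windows-1252 byte, 0x92 is the right single quotation mark.
as_cp1252 = bytes([code]).decode("cp1252")

# Lenient, browser-style decoding of the reference yields the Windows-1252 reading.
as_browser = html.unescape("&#146;")

print(f"U+{ord(as_unicode):04X}", unicodedata.category(as_unicode))
print(f"U+{ord(as_cp1252):04X}", unicodedata.name(as_cp1252))
print(as_browser == as_cp1252)
```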
Add a Guise feature to convert Unicode code points in the range U+0080 to U+009F (C1 control characters, which are rarely used) to the Unicode code points that were presumably intended. Because this technically changes content, and there is a small chance the content really did intend these obscure code points, the feature should perhaps use heuristics to determine which was intended, provide a configuration setting to enable or disable the behavior, or both. The intended mappings are:
&#128; - € (U+20AC)
&#130; - ‚ (U+201A)
&#131; - ƒ (U+0192)
&#132; - „ (U+201E)
&#133; - … (U+2026)
&#134; - † (U+2020)
&#135; - ‡ (U+2021)
&#136; - ˆ (U+02C6)
&#137; - ‰ (U+2030)
&#138; - Š (U+0160)
&#139; - ‹ (U+2039)
&#140; - Œ (U+0152)
&#142; - Ž (U+017D)
&#145; - ‘ (U+2018)
&#146; - ’ (U+2019)
&#147; - “ (U+201C)
&#148; - ” (U+201D)
&#149; - • (U+2022)
&#150; - – (U+2013)
&#151; - — (U+2014)
&#152; - ˜ (U+02DC)
&#153; - ™ (U+2122)
&#154; - š (U+0161)
&#155; - › (U+203A)
&#156; - œ (U+0153)
&#158; - ž (U+017E)
&#159; - Ÿ (U+0178)
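As a rough sketch of what such a conversion might look like, the Python below applies the mappings listed above to already-decoded text. The names CP1252_FIXUPS and fix_cp1252_code_points, and the enabled flag standing in for a configuration setting, are hypothetical and not existing Guise API. No heuristics are attempted here.

```python
# Hypothetical table: C1 code point -> character Windows-1252 assigns to that byte.
# The five values Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D)
# are deliberately absent and therefore pass through unchanged.
CP1252_FIXUPS = {
    0x80: "\u20AC", 0x82: "\u201A", 0x83: "\u0192", 0x84: "\u201E",
    0x85: "\u2026", 0x86: "\u2020", 0x87: "\u2021", 0x88: "\u02C6",
    0x89: "\u2030", 0x8A: "\u0160", 0x8B: "\u2039", 0x8C: "\u0152",
    0x8E: "\u017D", 0x91: "\u2018", 0x92: "\u2019", 0x93: "\u201C",
    0x94: "\u201D", 0x95: "\u2022", 0x96: "\u2013", 0x97: "\u2014",
    0x98: "\u02DC", 0x99: "\u2122", 0x9A: "\u0161", 0x9B: "\u203A",
    0x9C: "\u0153", 0x9E: "\u017E", 0x9F: "\u0178",
}


def fix_cp1252_code_points(text: str, enabled: bool = True) -> str:
    """Replace C1 code points with their presumed Windows-1252 meaning.

    `enabled` is an assumed stand-in for a configuration setting; when it is
    False the text is returned untouched.
    """
    if not enabled:
        return text
    # str.translate accepts a dict keyed by code point.
    return text.translate(CP1252_FIXUPS)


if __name__ == "__main__":
    sample = "It\u0092s 5\u0080 \u0096 really"
    print(fix_cp1252_code_points(sample))
```

A table lookup keeps the change limited to the 27 listed code points, so any other text, including the five unassigned C1 values, is left exactly as it was.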
See Character References Explained and Elimination of Text Corruption in XML for more discussion.