Convert Windows-1252 characters to Unicode.

Description

The Windows-1252 character set differs from ISO-8859-1 in the range 0x80 (128) to 0x9F (159): where ISO-8859-1 has control characters, Windows-1252 has mostly printable characters whose Unicode code points lie elsewhere. Many HTML documents nevertheless contain numeric character references such as &#146; that are mistakenly meant to indicate Unicode characters such as the right single quotation mark (U+2019). Some browsers display the intended character even though the number is a Windows-1252 code rather than a Unicode code point, so it may not be immediately obvious that there is a problem.
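The mismatch can be reproduced outside a browser. The following is a minimal Python sketch, not Guise code; it relies on the standard library's html.unescape, which follows the HTML5 parsing rules, and the variable names are only illustrative.

```python
import html

reference = "&#146;"

# Taken literally, the reference names code point 146 (U+0092),
# which is a C1 control character in Unicode.
literal = chr(146)

# html.unescape applies the HTML5 rules, which remap this reference
# to the Windows-1252 character, much as a browser does.
browser_like = html.unescape(reference)

# The same byte value decoded under each character set.
as_cp1252 = b"\x92".decode("cp1252")
as_latin1 = b"\x92".decode("latin-1")

print(f"literal: U+{ord(literal):04X}, browser-like: U+{ord(browser_like):04X}")
```

The literal reading and the ISO-8859-1 decoding give the control character, while the browser-like reading and the Windows-1252 decoding give the quotation mark.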

Add a Guise feature to convert Unicode code points in the range U+0080 to U+009F (C1 control characters, which are rarely used) to the Unicode code points that were presumably intended. Because this technically changes content, and there is a small chance that the content really did intend these obscure code points, perhaps heuristics should try to determine which was intended, and/or a configuration setting should enable or disable this behavior. The conversions would be as follows:

  • &#128; - € (U+20AC)

  • &#130; - ‚ (U+201A)

  • &#131; - ƒ (U+0192)

  • &#132; - „ (U+201E)

  • &#133; - … (U+2026)

  • &#134; - † (U+2020)

  • &#135; - ‡ (U+2021)

  • &#136; - ˆ (U+02C6)

  • &#137; - ‰ (U+2030)

  • &#138; - Š (U+0160)

  • &#139; - ‹ (U+2039)

  • &#140; - Œ (U+0152)

  • &#142; - Ž (U+017D)

  • &#145; - ‘ (U+2018)

  • &#146; - ’ (U+2019)

  • &#147; - “ (U+201C)

  • &#148; - ” (U+201D)

  • &#149; - • (U+2022)

  • &#150; - – (U+2013)

  • &#151; - — (U+2014)

  • &#152; - ˜ (U+02DC)

  • &#153; - ™ (U+2122)

  • &#154; - š (U+0161)

  • &#155; - › (U+203A)

  • &#156; - œ (U+0153)

  • &#158; - ž (U+017E)

  • &#159; - Ÿ (U+0178)

See Character References Explained and Elimination of Text Corruption in XML for more discussion.
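The conversion listed above could be sketched roughly as follows. This is a hypothetical illustration in Python rather than the Guise implementation: the names C1_TO_UNICODE and fix_c1_controls and the enabled flag are assumptions, with the flag standing in for the proposed configuration setting. No heuristics are attempted.

```python
# Hypothetical sketch; the names here are illustrative and not part of Guise.
# Keys are the C1 code points found in the content; values are the
# Unicode code points presumably intended (the Windows-1252 characters).
C1_TO_UNICODE = {
    0x80: 0x20AC, 0x82: 0x201A, 0x83: 0x0192, 0x84: 0x201E,
    0x85: 0x2026, 0x86: 0x2020, 0x87: 0x2021, 0x88: 0x02C6,
    0x89: 0x2030, 0x8A: 0x0160, 0x8B: 0x2039, 0x8C: 0x0152,
    0x8E: 0x017D, 0x91: 0x2018, 0x92: 0x2019, 0x93: 0x201C,
    0x94: 0x201D, 0x95: 0x2022, 0x96: 0x2013, 0x97: 0x2014,
    0x98: 0x02DC, 0x99: 0x2122, 0x9A: 0x0161, 0x9B: 0x203A,
    0x9C: 0x0153, 0x9E: 0x017E, 0x9F: 0x0178,
}


def fix_c1_controls(text: str, enabled: bool = True) -> str:
    """Replace mapped C1 code points with their Windows-1252 characters.

    The enabled flag is a stand-in for a configuration setting.
    Code points with no Windows-1252 character (U+0081, U+008D,
    U+008F, U+0090, U+009D) are left unchanged.
    """
    if not enabled:
        return text
    return text.translate(C1_TO_UNICODE)


if __name__ == "__main__":
    # U+0092 where a right single quotation mark was intended.
    print(fix_c1_controls("It\u0092s"))
```

Keeping the table explicit mirrors the list in this issue; the five unmapped positions pass through untouched, which avoids guessing at characters that Windows-1252 does not define.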

Environment

None

Assignee

Garret Wilson

Reporter

Garret Wilson

Labels

None

Components

Priority

Minor