Monday, February 17, 2014

[xvwobgcd] Unicode considered harmful

I suspect that defaulting to UTF-8 when decoding text you read, rather than to the C locale, is dangerous in much the same way as leaving JavaScript always turned on.  There are many pitfalls, some known and some as yet unknown, that can trap an unsuspecting user.  A recent example is the mess of Punycode and the IDN homograph attack.
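
To make the homograph problem concrete, here is a minimal Python sketch.  The domain name and the helper `describe` are made up for illustration; the only thing relied on is the standard library's `unicodedata` module and the built-in `idna` codec.  Two strings that look alike on screen compare unequal, and only one of them is pure ASCII.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) for every non-ASCII character in text."""
    return [
        (f"U+{ord(c):04X}", unicodedata.name(c, "UNKNOWN"))
        for c in text
        if ord(c) > 127
    ]

# Hypothetical example names: the second starts with a Cyrillic letter
# that most fonts draw exactly like the Latin "a".
latin = "apple.com"
spoof = "\u0430pple.com"

print(latin == spoof)        # the strings differ despite looking alike
print(describe(latin))       # no non-ASCII characters
print(describe(spoof))       # reveals the Cyrillic letter
print(spoof.encode("idna"))  # Punycode form, as it would appear in DNS
```

Listing the non-ASCII code points, as `describe` does, is one cheap way to see what a rendered string is hiding.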

Avoid writing in UTF-8 when 7-bit ASCII will do.
Render UTF-8 only when you expect content that needs it and, if possible, render only the subset you expect, e.g., the characters of a certain language, or smart quotes.
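
One way to act on this, sketched in Python under some assumptions: the allowed subset here is printable ASCII plus the four smart quote characters, and the names `ALLOWED_EXTRA` and `render_safe` are invented for this example.  Anything outside the subset is shown as a visible escape instead of being rendered.

```python
# Assumed whitelist for this sketch: the four "smart quote" characters.
ALLOWED_EXTRA = frozenset("\u2018\u2019\u201c\u201d")

def render_safe(raw: bytes, allowed_extra=ALLOWED_EXTRA) -> str:
    """Decode UTF-8 bytes, passing through only ASCII and the allowed subset.

    Every other character (including ASCII control characters and the
    replacement character produced by invalid bytes) becomes <U+XXXX>.
    """
    text = raw.decode("utf-8", errors="replace")
    out = []
    for ch in text:
        if ch.isascii() and (ch.isprintable() or ch in "\n\t"):
            out.append(ch)
        elif ch in allowed_extra:
            out.append(ch)
        else:
            out.append(f"<U+{ord(ch):04X}>")
    return "".join(out)

print(render_safe("caf\u00e9".encode("utf-8")))            # accent is escaped
print(render_safe("\u201cquoted\u201d".encode("utf-8")))   # smart quotes pass
```

A whitelist is used rather than a blacklist because the set of troublesome characters is open-ended, while the set you expect is usually small.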

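Inspired by the mess of Unicode normalization.  Because the same visible text can be encoded as several different code point sequences, "abnormal" (non-normalized) Unicode could be used for steganography and watermarking.  Also inspired by Mosh, which forces the use of UTF-8.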
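
A rough sketch of how such a watermark might work, in Python.  The scheme and the function name `embed` are assumptions made up for illustration, not a description of any existing tool: each character that has a canonical decomposition carries one bit, written composed for 0 and decomposed for 1.

```python
import unicodedata

def embed(text, bits):
    """Hide bits in text by choosing composed (0) or decomposed (1) forms.

    Hypothetical scheme: characters without a canonical decomposition,
    and characters left over once the bits run out, are emitted unchanged.
    """
    bits = list(bits)
    out = []
    for ch in unicodedata.normalize("NFC", text):
        decomposed = unicodedata.normalize("NFD", ch)
        if decomposed != ch and bits:
            out.append(decomposed if bits.pop(0) else ch)
        else:
            out.append(ch)
    return "".join(out)

plain = "caf\u00e9 n\u00e9e"
marked = embed(plain, [1, 0])

print(marked == plain)                                 # the code points differ
print(unicodedata.normalize("NFC", marked) == plain)   # normalizing erases the mark
```

The two strings render identically, and normalizing to NFC before storing or comparing text removes this particular hidden channel.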
