Re: Mangling Unicode (was: Fwd: Gzarondan: Spelling Review)
From: Mark J. Reed <markjreed@...>
Date: Monday, October 18, 2004, 4:32
On Sat, Oct 16, 2004 at 11:01:40PM +0300, Isaac A. Penzev wrote:
> Paul Bennett wrote:
> > It has to do with their
> > underlying bytes containing values 128-160, IIRC, but there's no simple
> > memorable way to tell which actual characters will be affected.
I don't know about 128-159, but 160 is definitely a problem. The list
server software turns every 160 (0xA0) byte into a 32 (0x20, the ASCII
space), so any UTF-8 sequence containing a 160 byte is hosed (technical
term). And even if that is the only problematic
byte, it means that 50,418 code points are affected, although only 2,030
of them are in the basic multilingual plane.
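Those counts are easy to check by brute force. Here's a quick Python sketch (mine, purely illustrative) that encodes each code point by hand, so that surrogates U+D800-U+DFFF are included in the tally the same way the purely mathematical count above includes them (Python's built-in encoder would refuse them):

```python
# Brute-force count of code points whose UTF-8 encoding contains the
# byte 0xA0 (160) and so gets corrupted by the 160 -> 32 substitution.
# Encoding is done by hand so surrogates are counted too.

def utf8_bytes(cp):
    """Return the UTF-8 byte sequence for code point cp as a list of ints."""
    if cp < 0x80:
        return [cp]
    if cp < 0x800:
        return [0xC0 | cp >> 6, 0x80 | cp & 0x3F]
    if cp < 0x10000:
        return [0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F]
    return [0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
            0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F]

bmp   = sum(0xA0 in utf8_bytes(cp) for cp in range(0x10000))
total = sum(0xA0 in utf8_bytes(cp) for cp in range(0x110000))
print(bmp, total)  # 2030 50418
```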
The mathematics of UTF-8 encoding dictates that code points ending in
0x20, 0x60, 0xA0, or 0xE0 -- that is, those whose low six bits are
100000, so their final continuation byte comes out as 0xA0 -- are
potential victims, though not all such code points are mangled. There
are also a few ranges of 64 consecutive characters which
are all subject to mangling (in the BMP; in the full space there are
ranges of 4096 consecutive mangleworthy code points!).
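To make those patterns concrete, here's another small sketch of my own, this time leaning on Python's built-in encoder (which suffices since none of these examples are surrogates):

```python
# Code points whose value ends in 0x20/0x60/0xA0/0xE0 have low six
# bits 100000, so their final continuation byte is 0x80 | 0x20 = 0xA0.
for cp in (0x00A0, 0x0120, 0x1E60, 0x20E0):
    assert cp & 0x3F == 0x20
    assert chr(cp).encode("utf-8")[-1] == 0xA0

# Not all such values are mangled: below U+0080, 0x20 and 0x60 are
# plain ASCII (space and backquote) and encode as single bytes.
assert 0xA0 not in " `".encode("utf-8")

# The 64-character runs come from the *middle* continuation byte of a
# 3-byte sequence: every code point U+0800..U+083F has middle byte 0xA0.
assert all(chr(cp).encode("utf-8")[1] == 0xA0 for cp in range(0x800, 0x840))

# Outside the BMP, the second byte of a 4-byte sequence covers 4096
# code points at a time: U+20000..U+20FFF are all mangleworthy.
assert all(0xA0 in chr(cp).encode("utf-8") for cp in range(0x20000, 0x21000))

print("all patterns confirmed")
```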
-Marcos