
Re: Mangling Unicode (was: Fwd: Gzarondan: Spelling Review)

From: Mark J. Reed <markjreed@...>
Date: Monday, October 18, 2004, 4:32
On Sat, Oct 16, 2004 at 11:01:40PM +0300, Isaac A. Penzev wrote:
> Paul Bennett wrote:
> > It has to do with their
> > underlying bytes containing values 128-160, IIRC, but there's no simple
> > memorable way to tell which actual characters will be affected.
I don't know about 128-159, but 160 is definitely a problem. The list server software turns 160s into 32s, so any UTF-8 sequence containing 160 is hosed (technical term).

And even if that is the only problematic byte, it means that 50,418 code points are affected, although only 2,030 of them are in the basic multilingual plane.

The mathematics of UTF-8 encoding dictates that code points of the form xx20, xx60, xxA0, xxE0 are potential victims, though not all such code points are mangled. There are also a few ranges of 64 characters which are all subject to mangling (in the BMP; in the full space there are ranges of 4096 consecutive mangleworthy code points!).

-Marcos
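For anyone who wants to check the arithmetic, here is a rough Python sketch that brute-forces the whole code space and looks for byte 160 (0xA0) in each UTF-8 encoding. The helper names (is_affected, longest_run) are made up for illustration. One assumption worth flagging: the totals only come out to the figures above if surrogate code points (U+D800..U+DFFF) are counted, so the sketch encodes them with Python's "surrogatepass" error handler. That is a guess at how the original count was made, not something the message states.

```python
def utf8_bytes(cp):
    # "surrogatepass" lets surrogates be encoded as ordinary 3-byte sequences.
    # ASSUMPTION: surrogates are counted among the affected code points.
    return chr(cp).encode("utf-8", "surrogatepass")


def is_affected(cp, bad_byte=0xA0):
    # True if any byte of the UTF-8 encoding equals the byte the server rewrites.
    return bad_byte in utf8_bytes(cp)


def longest_run(cps):
    # Length of the longest stretch of consecutive code points in a sorted list.
    best = run = 0
    prev = None
    for cp in cps:
        run = run + 1 if prev is not None and cp == prev + 1 else 1
        best = max(best, run)
        prev = cp
    return best


# Scan every code point from U+0000 to U+10FFFF (takes a second or so).
affected = [cp for cp in range(0x110000) if is_affected(cp)]
bmp_affected = [cp for cp in affected if cp <= 0xFFFF]

print(len(affected), len(bmp_affected))                    # 50418 2030
print(longest_run(bmp_affected), longest_run(affected))    # 64 4096

# What the reported 160 -> 32 substitution does to U+00E0 (bytes C3 A0):
mangled = "à".encode("utf-8").replace(b"\xa0", b"\x20")
print(mangled)                                             # b'\xc3 '
```

The last line shows why the result is unrecoverable: the lead byte C3 is left followed by a plain space, which is no longer valid UTF-8.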