Conlang: Re: Unicode vs The Rest Of The World (Again) (was Re: Re: Le tilde a-t-il été utilisé en français?) (Garth Wallace, May 1 '04, 4:42)

> On Fri, 30 Apr 2004 17:38:02 -0700, Garth Wallace <gwalla@...>
> wrote:
>
>> Paul Bennett wrote:
>>
>>> It's only a small subset of Unicode that gets mangled, rather than every
>>> character (we've seen it on the Georgian alphabet, notably), at least
>>> with
>>> UTF-8. UTF-8 is not merely raw Unicode, but rather a set of multi-byte
>>> codes, only some of which lie within the deadly 128-150 range.
>>
>> Ah, so it's only the Unicode characters that contain bytes matching
>> ASCII control characters with the 8th bit set that get mangled. Okay.
>
> Not Unicode characters. UTF-8 strings, which are not the same thing. A
> UTF-8 string can be one or more bytes long (bytes which are kinda supposed
> to be "safe" bytes to pass), and resolves mathematically to a single
> Unicode character. See my example below.
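To make the point in the quoted exchange concrete, here is a rough Python sketch that lists which bytes of a character's UTF-8 encoding land in the troublesome range. It assumes the "deadly" range is the C1 control block, bytes 0x80-0x9F (128-159), which is slightly wider than the 128-150 quoted above. The helper name `c1_bytes` is made up for this illustration.

```python
def c1_bytes(text):
    """Return the UTF-8 bytes of `text` that fall in 0x80-0x9F.

    Assumption: the range that gets mangled is the C1 control block.
    """
    return [b for b in text.encode("utf-8") if 0x80 <= b <= 0x9F]


# U+00E9 (e with acute) and U+10D0 (Georgian letter AN)
for ch in ("\u00e9", "\u10d0"):
    encoded = ch.encode("utf-8")
    risky = [hex(b) for b in c1_bytes(ch)]
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')} risky bytes: {risky}")
```

Under that assumption, U+00E9 encodes to `c3 a9` with no byte in the range, while U+10D0 encodes to `e1 83 90`, where both continuation bytes fall inside it. That would be consistent with Georgian text being hit while accented Latin letters come through intact.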

>>> Should anyone post in pure UTF-16, I imagine the problem might manifest
>>> itself more often, especially if they use the right (or wrong?) Unicode
>>> pages.
>>
>> Yeah, UTF-16 interpreted as ASCII would be chock-full of nulls.
>
> Nulls that any sensible software[*] would simply either skip or print as a
> non-spacing space.
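The remark about nulls can be checked with a short Python sketch. It assumes little-endian UTF-16 without a byte order mark and a sample string of plain ASCII letters; the variable names are only for illustration.

```python
text = "Hello"

# Each ASCII letter becomes two bytes in UTF-16: the letter, then 0x00.
utf16 = text.encode("utf-16-le")
null_count = utf16.count(0)

# Software that simply skips the nulls is left with the ASCII text.
skipped = utf16.replace(b"\x00", b"").decode("ascii")

print(utf16.hex(" "))
print(null_count, skipped)
```

For ASCII-only text every second byte is a null, so skipping them happens to recover the original. That shortcut would not hold for characters outside the ASCII range, where the second byte carries real data.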

From:	Garth Wallace <gwalla@...>
Date:	Saturday, May 1, 2004, 4:42