
Re: Unicode vs The Rest Of The World (Again) (was Re: Re: Le tilde a-t-il été utilisé en français?)

From: Paul Bennett <paul-bennett@...>
Date: Saturday, May 1, 2004, 1:27

On Fri, 30 Apr 2004 17:38:02 -0700, Garth Wallace <gwalla@...> wrote:

> Paul Bennett wrote:
>> It's only a small subset of Unicode that gets mangled, rather than every
>> character (we've seen it on the Georgian alphabet, notably), at least with
>> UTF-8. UTF-8 is not merely raw Unicode, but rather a set of multi-byte
>> codes, only some of which lie within the deadly 128-150 range.
>
> Ah, so it's only the Unicode characters that contain bytes matching
> ASCII control characters with the 8th bit set that get mangled. Okay.

Not Unicode characters. UTF-8 byte sequences, which are not the same thing. A UTF-8 sequence can be one or more bytes long (bytes which are kinda supposed to be "safe" bytes to pass), and resolves mathematically to a single Unicode character. See my example below.
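
To make the character-versus-bytes distinction concrete, here is a minimal Python sketch that prints the UTF-8 bytes behind a few code points. The helper name `utf8_bytes` and the sample characters are illustrative choices, not anything from the list software.

```python
def utf8_bytes(ch: str) -> list[int]:
    """Return the UTF-8 byte values that encode a single character."""
    return list(ch.encode("utf-8"))


# An ASCII letter, a Latin-1 letter, a Georgian letter, and the euro sign.
for ch in ("A", "\u00e9", "\u10d0", "\u20ac"):
    hex_bytes = " ".join(f"{b:02X}" for b in utf8_bytes(ch))
    # One code point, but anywhere from one to three bytes on the wire here.
    print(f"U+{ord(ch):04X} -> {hex_bytes}")
```

Only the first of these is a single byte; the others are multi-byte sequences that decode back to exactly one code point each.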

>> Should anyone post in pure UTF-16, I imagine the problem might manifest
>> itself more often, especially if they use the right (or wrong?) Unicode
>> pages.
>
> Yeah, UTF-16 interpreted as ASCII would be chock-full of nulls.

Nulls that any sensible software[*] would simply either skip or print as a non-spacing space.

The problem comes in the fact that UTF-16 is a much "purer" Unicode representation. It represents all Plane 0 characters as 2-byte sequences, matching 1:1 with the Unicode number of that character. UTF-8 represents all Unicode values U+007F and below (identical to ASCII) with single bytes, but this then means that multi-byte sequences are needed to represent anything outside of ASCII.

<completely fictional example alert>

Suppose the Zygluzistani character {capital fred} occurs at U+9004. It'll actually get thru unscathed, because under UTF-8 it's represented by some other string of bytes, maybe C0,45,31 or something (hey, gimme a break, I'm making numbers up as I go along, here). Whereas, if it were sent as UTF-16, it would get mangled to U+1004 (the 90 byte loses its 8th bit and becomes 10), which is still a valid character. *Every* Zygluzistani character would get similarly mangled in UTF-16, but untouched in UTF-8 (except maybe a few other (p)lucky examples).

Take another example, the Sfenqese character {third-case gviqqertc hkujg}, which occurs at U+DEAD. This passes thru unscathed in UTF-16, but is (say) AC,74,8D in UTF-8, which gets mangled and read as AC,74,0D. That happens (in spur-of-the-moment-example land) not to resolve to a valid Unicode character, so it gets shown as a question mark in a diamond, a hollow rectangle, a question mark, or some other "ARGH MY BRAIN" symbol. Then again, it's not likely that very many other Sfenqese characters would contain such a forbidden byte.

[*] Um? We're not strictly talking about sensible software here, are we? Humph.

Paul

... and here I stop, for fear of crossing a certain line if I babble much more today
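
The byte values above are invented, so to check real characters here is a minimal Python sketch. It assumes the gateway clears the 8th bit of any byte in the 0x80-0x9F range, which is only inferred from the 90 -> 10 and 8D -> 0D examples and is not a confirmed description of the list software. The names `mangle` and `survives` are hypothetical helpers.

```python
def mangle(data: bytes) -> bytes:
    """Assumed damage model: clear the high bit of bytes in 0x80-0x9F."""
    return bytes(b & 0x7F if 0x80 <= b <= 0x9F else b for b in data)


def survives(ch: str, encoding: str) -> bool:
    """True if the encoded character passes through mangle() unchanged."""
    raw = ch.encode(encoding)
    return mangle(raw) == raw


def hexdump(data: bytes) -> str:
    return " ".join(f"{b:02X}" for b in data)


# A Latin-1 letter, a Georgian letter, and the real character at U+9004.
for ch in ("\u00e9", "\u10d0", "\u9004"):
    for enc in ("utf-8", "utf-16-be"):
        raw = ch.encode(enc)
        status = "ok" if survives(ch, enc) else "MANGLED"
        print(f"U+{ord(ch):04X} {enc:9} {hexdump(raw)} -> {hexdump(mangle(raw))} {status}")
```

Under this assumed model, the real U+9004 is E9 80 84 in UTF-8, so it would be damaged in both encodings, while Georgian U+10D0 is damaged as UTF-8 (E1 83 90) but passes as UTF-16 (10 D0). Which characters are hit depends entirely on which bytes the gateway really touches, so treat the output as an illustration only.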
