Re: Unicode vs The Rest Of The World (Again) (was Re: Re: Le tilde a-t-il été utilisé en français?)
From: Paul Bennett <paul-bennett@...>
Date: Saturday, May 1, 2004, 1:27
On Fri, 30 Apr 2004 17:38:02 -0700, Garth Wallace <gwalla@...>
wrote:
> Paul Bennett wrote:
>> It's only a small subset of Unicode that gets mangled, rather than every
>> character (we've seen it on the Georgian alphabet, notably), at least
>> with UTF-8. UTF-8 is not merely raw Unicode, but rather a set of
>> multi-byte codes, only some of which lie within the deadly 128-150 range.
>
> Ah, so it's only the Unicode characters that contain bytes matching
> ASCII control characters with the 8th bit set that get mangled. Okay.
Not Unicode characters. UTF-8 sequences, which are not the same thing. A
UTF-8 sequence is one to four bytes long (bytes which are kinda supposed
to be "safe" bytes to pass), and resolves mathematically to a single
Unicode character. See my example below.
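
To make "resolves mathematically" concrete, here is a minimal Python sketch of how one well-formed UTF-8 sequence maps back to a single code point. The function name decode_utf8_sequence is made up for illustration, and the sketch skips checks a real decoder performs (overlong forms, surrogates, the U+10FFFF upper limit).

```python
def decode_utf8_sequence(seq: bytes) -> int:
    """Return the code point encoded by one well-formed UTF-8 sequence."""
    lead = seq[0]
    if lead < 0x80:               # 0xxxxxxx: plain ASCII, a single byte
        return lead
    if lead >> 5 == 0b110:        # 110xxxxx: lead byte of a 2-byte sequence
        value, extra = lead & 0x1F, 1
    elif lead >> 4 == 0b1110:     # 1110xxxx: lead byte of a 3-byte sequence
        value, extra = lead & 0x0F, 2
    elif lead >> 3 == 0b11110:    # 11110xxx: lead byte of a 4-byte sequence
        value, extra = lead & 0x07, 3
    else:
        raise ValueError("not a valid lead byte")
    if len(seq) != extra + 1:
        raise ValueError("wrong sequence length")
    for b in seq[1:]:
        if b >> 6 != 0b10:        # continuation bytes look like 10xxxxxx
            raise ValueError("not a continuation byte")
        value = (value << 6) | (b & 0x3F)  # append six payload bits
    return value


# Georgian letter "an" is the three bytes E1 83 90 in UTF-8.
print(hex(decode_utf8_sequence(b"\xe1\x83\x90")))
```

Every continuation byte falls in the range 0x80-0xBF, so any non-ASCII character has at least one byte with the 8th bit set, and some of those bytes land in the low end of that range.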
>> Should anyone post in pure UTF-16, I imagine the problem might manifest
>> itself more often, especially if they use the right (or wrong?) Unicode
>> pages.
>
> Yeah, UTF-16 interpreted as ASCII would be chock-full of nulls.
Nulls that any sensible software[*] would simply either skip or print as a
non-spacing space. The problem is that UTF-16 is a much "purer" Unicode
representation. It represents all Plane 0 characters as 2-byte sequences,
matching 1:1 with the Unicode number of that character.
UTF-8 represents all Unicode values 7F and less (identical to ASCII) with
single bytes, but this then means that multi-byte sequences are needed to
represent anything outside of ASCII.
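
As a rough illustration of the difference, the following Python sketch prints the bytes of three real characters in both encodings. The helper show_bytes is a made-up name, and big-endian UTF-16 without a byte-order mark is assumed so that the bytes line up with the code point digits.

```python
def show_bytes(ch: str) -> tuple[str, str]:
    """Return (UTF-8 bytes, UTF-16BE bytes) of one character as hex strings."""
    return ch.encode("utf-8").hex(" "), ch.encode("utf-16-be").hex(" ")


# ASCII "A", Georgian letter "an" (U+10D0), and a CJK ideograph (U+9AAA).
for ch in ("A", "\u10d0", "\u9aaa"):
    utf8, utf16 = show_bytes(ch)
    print(f"U+{ord(ch):04X}  UTF-8: {utf8:<8}  UTF-16BE: {utf16}")
```

The ASCII letter stays a single byte in UTF-8 but gains a null byte in UTF-16, while the other two need three bytes in UTF-8 against two in UTF-16.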
<completely fictional example alert>
Suppose the Zygluzistani character {capital fred} occurs at U+9004, it'll
actually get thru unscathed, because under UTF-8 it's represented by some
other string of bytes, maybe C0,45,31 or something (hey, gimme a break,
I'm making numbers up as I go along, here). Whereas, if it were sent as
UTF-16, it would get mangled to U+1004, which is still a valid character.
*Every* Zygluzistani character would get similarly mangled in UTF-16, but
untouched in UTF-8 (except maybe a few other (p)lucky examples).
Take another example, of the Sfenqese character {third-case gviqqertc
hkujg}, which occurs at U+DEAD. This passes thru unscathed in UTF-16, but
is (say) AC,74,8D in UTF-8, which gets mangled, and read as AC, 74, 0D,
which happens (in spur-of-the-moment-example land) to not resolve to a
valid Unicode character, and gets shown as a question mark in a diamond, a
hollow rectangle, a question mark, or some other "ARGH MY BRAIN" symbol.
Then again, it's not likely that very many other Sfenqese characters would
contain such a forbidden byte.
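
The two fictional cases above can be reproduced with real characters in a short Python sketch. Everything about the "gateway" is an assumption: mangle is a hypothetical stand-in that clears the 8th bit of any byte in 0x80-0x9F, which is the behaviour the examples imply (0x90 becoming 0x10, 0x8D becoming 0x0D). The real software may do something different. UTF-16 is assumed to be big-endian with no byte-order mark.

```python
def mangle(data: bytes) -> bytes:
    """Hypothetical gateway: clear the 8th bit of any byte in 0x80-0x9F."""
    return bytes(b & 0x7F if 0x80 <= b <= 0x9F else b for b in data)


def round_trip(ch: str, encoding: str) -> str:
    """Encode, pass through the hypothetical gateway, and decode again."""
    damaged = mangle(ch.encode(encoding))
    # errors="replace" turns undecodable bytes into U+FFFD, the
    # "question mark in a diamond" replacement character.
    return damaged.decode(encoding, errors="replace")


# U+9AAA (CJK): UTF-8 is E9 AA AA (untouched), UTF-16BE is 9A AA (hit).
# U+10D0 (Georgian): UTF-8 is E1 83 90 (hit), UTF-16BE is 10 D0 (untouched).
for ch in ("\u9aaa", "\u10d0"):
    for enc in ("utf-8", "utf-16-be"):
        result = round_trip(ch, enc)
        status = "intact" if result == ch else "mangled"
        print(f"U+{ord(ch):04X} via {enc}: {status}")
```

Under these assumptions the CJK character survives UTF-8 but turns into U+1AAA in UTF-16, while the Georgian letter survives UTF-16 and decodes to a replacement character followed by stray control codes in UTF-8.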
[*] Um? We're not strictly talking about sensible software here, are we?
Humph.
Paul
... and here I stop, for fear of crossing a certain line if I babble much
more today