
Re: TECH: Testing again, no new on-topic content (was Re: "Language Creation" in your conlang)

From: Mark J. Reed <markjreed@...>
Date: Monday, November 17, 2003, 1:19
On Sun, Nov 16, 2003 at 07:49:32PM -0500, Paul Bennett wrote:
> Er, are you sure? I thought that virtually every 128-255 character came
> through unscathed between Latin-1 and UTF-8.
Absolutely not. Unicode *code points* from 128-255 are the same as Latin-1, but the way the UTF-8 encoding works, characters with those code points are encoded as two bytes.

For instance, good old LATIN SMALL LETTER E WITH ACUTE, é, is code point U+00E9: e9 hexadecimal, 233 decimal. In Latin-1 email, that is sent as a single byte whose value is 233. However, in UTF-8, it's sent as two bytes: 195 followed by 169. So if a message is sent in UTF-8 (like this one) but treated by the receiving end as Latin-1, everywhere there's supposed to be a lowercase e with an acute accent there will instead be a capital A with a tilde followed by a copyright sign.

This difference is necessary to enable the encoding of characters above 255; if every byte represented itself there'd be no way to say "Hey, interpreter! This character is more than one byte long!". Obviously there are many ways that information could have been encoded, but the compromise between efficiency and compatibility chosen for UTF-8 is that only the first 128 characters (code points 0 through U+007F = 127 decimal) represent themselves, and bytes with the high bit set (values >=128) represent parts of multibyte characters.

-Mark
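For anyone who wants to check the é example by hand, here is a rough Python sketch. It is not part of the original exchange; the helper name utf8_encode_two_byte is made up for illustration, and it only handles the two-byte range (U+0080 through U+07FF), not UTF-8 in general.

```python
def utf8_encode_two_byte(code_point: int) -> bytes:
    """Hand-rolled UTF-8 for code points U+0080..U+07FF only (illustrative helper)."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("this sketch only covers the two-byte range")
    # Lead byte: bits 110xxxxx carrying the top 5 bits of the code point.
    lead = 0b11000000 | (code_point >> 6)
    # Continuation byte: bits 10xxxxxx carrying the low 6 bits.
    cont = 0b10000000 | (code_point & 0b00111111)
    return bytes([lead, cont])


e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

latin1_bytes = e_acute.encode("latin-1")  # one byte: 233
utf8_bytes = e_acute.encode("utf-8")      # two bytes: 195, 169

# What a Latin-1 reader shows when handed the UTF-8 bytes:
# capital A with tilde followed by a copyright sign.
misread = utf8_bytes.decode("latin-1")

print(list(latin1_bytes), list(utf8_bytes), misread)
print(utf8_encode_two_byte(0xE9) == utf8_bytes)
```

Both bytes of the UTF-8 form have the high bit set, which is how a decoder can tell they belong to a multibyte character rather than standing for themselves.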
