Re: TECH: Testing again, no new on-topic content (was Re: "Language Creation" in your conlang)
From: Paul Bennett <paul-bennett@...>
Date: Monday, November 17, 2003, 3:46
On Sun, 16 Nov 2003 20:16:49 -0500, John Cowan <cowan@...>
wrote:
> Paul Bennett scripsit:
>
>> >[B]e aware that many people
>> >who would otherwise receive Latin-1 characters fine won't even see
>> >those if they're in a UTF-8 message.
>>
>> Er, are you sure? I thought that virtually every 128-255 character came
>> through unscathed between Latin-1 and UTF-8. I bow to John as final
>> arbiter, obviously, but that has always been my understanding.
>
> You're confusing characters and their representations. It's true that
> the first 256 characters of Unicode are identical to the 256 characters
> of Latin-1. But the UTF-8 *representation* of the last 128 characters
> of Latin-1 is quite different from the Latin-1 representation. To
> say no more, UTF-8 represents each of them with two bytes, whereas
> Latin-1 uses a single byte for each.
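For anyone who wants to see the difference John describes, here is a minimal sketch in Python. The sample characters and the helper name byte_forms are just illustrative choices, not anything from the thread.

```python
# Compare the Latin-1 and UTF-8 byte representations of the same characters.
# The sample characters below are arbitrary picks for illustration.
samples = ["e", "\u00e9", "\u00fc", "\u00ff"]  # e, e-acute, u-trema, y-trema


def byte_forms(ch):
    """Return (latin1_bytes, utf8_bytes) for a single character."""
    return ch.encode("latin-1"), ch.encode("utf-8")


for ch in samples:
    latin1, utf8 = byte_forms(ch)
    print(f"U+{ord(ch):04X}  latin-1={latin1.hex()}  utf-8={utf8.hex()}")
```

The code points are the same in both cases; only the bytes on the wire differ, and every byte UTF-8 emits for these characters has the 8th bit set.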
This I knew, but I thought that there were only very few (maybe 2) of the
8th-bit characters which were used to signal an upcoming multibyte
character. You live and learn, I guess.
>> What happens if I set my mail client to default encoding of "Latin-1"
>> and paste some non-Latin-1 characters into the email? Is there an RFC
>> that defines a suitable way of coping?
>
> If your mail client understands Unicode at all, then it depends on the
> conversion libraries that it uses: the usual convention is to map
> unrepresentable characters into a question mark. If the mail client
> is Unicode-blind, it probably sends out the bytes of the UTF-8
> representation, ignoring the Latin-1 encoding tag, which produces
> gibberish. If this second process is iterated, then the gibberish
> doubles each time, as each byte is reinterpreted as Latin-1 and
> then re-encoded as UTF-8 again.
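Both behaviours John mentions can be reproduced in a few lines of Python. This is only a sketch of the effect, not of what any particular mail client does; the function name misread_once is made up for the example.

```python
# Case 1: a Unicode-aware converter maps an unrepresentable character to "?".
greek_alpha = "\u03b1"
print(greek_alpha.encode("latin-1", errors="replace"))


# Case 2: UTF-8 bytes get mislabelled as Latin-1, then re-encoded as UTF-8.
def misread_once(text):
    """Encode as UTF-8, then wrongly decode those bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


text = "\u00e9"  # a single e-acute
for round_number in range(4):
    # The length of the non-ASCII run doubles on every pass.
    print(round_number, len(text), repr(text))
    text = misread_once(text)
```

Plain ASCII survives the second process untouched, which is why the damage only shows up on the accented characters.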
And Muke also pointed out...
> Opera's M2 appears to upsmart the outgoing encoding to UTF-8
> automagically.
>
Bah, and a considerable amount of humbug. We've touched in the recent past
on email headers being able to include strings of non-ASCII by means of a
pre-defined escape character sequence, and I was hoping that something very
similar would hold in the body of a message, too. If not, then not, I
suppose, but it strikes me as an area in which there is some scope for
improvement. I ought to be able to send a Latin-1 email, and if I
accidentally slip a few Greek or Devanagari characters in, they should be
encodable without having to change the encoding of the whole message.
What's the "suggestions" address for those fine folks at RFC headquarters?
;-)
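For reference, the header mechanism I have in mind is, as far as I can tell, the "encoded-word" syntax of RFC 2047, which is defined for headers and not for the body. A rough sketch using Python's standard email.header module follows; the sample string is invented for the example.

```python
from email.header import Header, decode_header, make_header

# A made-up header value mixing Latin-1 range and Greek characters.
original = "Gr\u00fc\u00dfe, \u03a0\u03b1\u03cd\u03bb\u03bf\u03c2"

# Encode as an RFC 2047 encoded-word: the result is pure ASCII.
encoded = Header(original, "utf-8").encode()

# Decode it back to the original text.
decoded = str(make_header(decode_header(encoded)))

print(encoded)
print(decoded == original)
```

The encoded form carries its own charset label, which is exactly the property I was hoping for in a message body.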
For now, I will continue to post ASCIIfication alongside UTF-8, and leave
it to the individual reader to become RFC-compliant. It's not hard. I just
did it a few days ago, almost painlessly.
OTOH, what's the most useful purely 8-bit encoding for me to post in? My
main requirements are to post acuted and trema'd vowels, some kind of
accented "s" (ideally s-acute, but any other diacritised "s" will do), and
some kind of accented "n" (ideally n-acute, but eng would do). Also nice to
have are "combining acute", "combining ogonek" and "combining dot below",
but I suspect there's not an 8-bit encoding that handles any of those.
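One way to answer this empirically is to ask a codec library which candidates can encode the characters in question. The sketch below uses Python's built-in codecs; the candidate list and the particular vowels and accented letters tested are my own guesses at what matters, so adjust to taste.

```python
# Candidate 8-bit encodings to try (an assumed shortlist, not exhaustive).
CANDIDATES = [
    "latin-1", "iso-8859-2", "iso-8859-3", "iso-8859-4", "iso-8859-10",
    "iso-8859-13", "iso-8859-15", "iso-8859-16", "cp1250", "cp1252", "cp1257",
]

# Characters of interest, grouped by requirement (illustrative choices).
WANTED = {
    "acute vowels": "\u00e1\u00e9\u00ed\u00f3\u00fa",
    "trema vowels": "\u00e4\u00eb\u00ef\u00f6\u00fc",
    "accented s": "\u015b\u0161\u015f",          # s-acute, s-caron, s-cedilla
    "accented n / eng": "\u0144\u00f1\u0148\u014b",  # n-acute, n-tilde, n-caron, eng
    "combining marks": "\u0301\u0328\u0323",     # acute, ogonek, dot below
}


def encodable(chars, codec):
    """Return the characters from `chars` that `codec` can encode."""
    kept = []
    for ch in chars:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            continue
        kept.append(ch)
    return "".join(kept)


for codec in CANDIDATES:
    print(codec)
    for label, chars in WANTED.items():
        print(f"  {label}: {encodable(chars, codec)!r}")
```

An empty string against a label means that encoding covers none of the characters in that group.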
Paul