Re: TECH: Testing again, no new on-topic content (was Re: "Language Creation" in your conlang)
From: John Cowan <cowan@...>
Date: Monday, November 17, 2003, 5:05
Paul Bennett scripsit:
> This I knew, but I thought that there were only very few (maybe 2) of the
> 8th-bit characters which were used to signal an upcoming multibyte
> character. You live and learn, I guess.
A multibyte character in UTF-8 consists of a lead byte between C2 and F4
followed by one, two, or three continuation bytes between 80 and BF.
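
As a rough illustration of that lead/continuation structure, here is a small Python sketch. The helper name `classify` is made up for this example, and the byte ranges assume well-formed UTF-8 as currently specified (lead bytes C2-F4, continuation bytes 80-BF).

```python
def classify(byte: int) -> str:
    """Classify a single byte of well-formed UTF-8 by its value alone."""
    if byte <= 0x7F:
        return "ascii"          # a one-byte character
    if 0x80 <= byte <= 0xBF:
        return "continuation"   # second, third, or fourth byte of a character
    if 0xC2 <= byte <= 0xF4:
        return "lead"           # first byte of a multibyte character
    return "invalid"            # C0, C1 and F5-FF never occur in well-formed UTF-8


# e-acute (2 bytes), euro sign (3 bytes), musical G clef (4 bytes)
for ch in ["\u00e9", "\u20ac", "\U0001D11E"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", [f"{b:02X}:{classify(b)}" for b in encoded])
```

Each multibyte character should show exactly one "lead" byte followed only by "continuation" bytes.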
> Bah, and a considerable amount of humbug. We've touched in the recent past
> on email headers being able to include strings of non-ASCII by means of a
> pre-defined escape character sequence, and I was hoping that something very
> similar would hold in the body of a message, too. If not, then not, I
> suppose, but it strikes me as an area in which there is some scope for
> improvement. I ought to be able to send a Latin-1 email, and if I
> accidentally slip a few Greek or Devanagari characters in, they should be
> encodable without having to change the encoding of the whole message.
There's the *character set* used in the message, and then there's the
*transfer encoding syntax* used. If I write a message in Latin-1
characters, it gets encoded using one byte per character. But then
there are three different ways it can actually get transmitted:
"8bit", which means that it actually goes out one byte per byte;
"quoted-printable", which sends bytes in the range 20-7E (except "="
itself) as themselves, and other bytes using "=xx" syntax where xx is a
two-digit hex number;
"base64", which encodes every run of three bytes into four bytes
in the ranges corresponding to ASCII a-z, A-Z, 0-9, +, and /.
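
To make the split between character set and transfer encoding concrete, here is a minimal Python sketch using the standard `quopri` and `base64` modules. The sample text and the variable names are invented for the illustration; it only shows the same Latin-1 bytes in three transmitted forms, not how any particular mail client does it.

```python
import base64
import quopri

text = "na\u00efve caf\u00e9"      # i-trema and e-acute are both in Latin-1
raw = text.encode("latin-1")        # character set step: one byte per character

as_8bit = raw                         # "8bit": the bytes go out unchanged
as_qp = quopri.encodestring(raw)      # "quoted-printable": =EF, =E9 for the high bytes
as_b64 = base64.b64encode(raw)        # "base64": every 3 bytes become 4 ASCII bytes

print(as_8bit)
print(as_qp)
print(as_b64)

# All three forms carry the same Latin-1 bytes once decoded.
assert quopri.decodestring(as_qp) == raw
assert base64.b64decode(as_b64) == raw
```

Whichever transfer encoding is chosen, the receiver ends up with the same bytes and still has to interpret them as Latin-1.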
But if the encoding of the document is Latin-1, you can only use characters
from the Latin-1 repertoire in it.
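
A quick way to see the repertoire limit, sketched in Python. The function name `fits` is hypothetical, and Python's codec names stand in here for the message's declared charset.

```python
def fits(text: str, charset: str) -> bool:
    """Return True if every character of text exists in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


latin_only = "caf\u00e9"                 # e-acute only
with_greek = "caf\u00e9 \u03b1\u03b2"    # plus Greek alpha and beta

print(fits(latin_only, "latin-1"))   # the Latin-1 text encodes fine
print(fits(with_greek, "latin-1"))   # the Greek letters have no Latin-1 byte
print(fits(with_greek, "utf-8"))     # a Unicode encoding covers both
```

No choice of transfer encoding changes the second result, since the failure happens before any bytes exist to transfer.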
> OTOH, what's the most useful purely 8-bit encoding for me to post in? My
> main requirements are to post acuted and trema'd vowels, some kind of
> accented "s" (ideally s-acute, but any other diacritised "s" will do), and
> some kind of accented "n" (ideally n-acute, but eng would do). nice
Sounds like Latin-2 might serve your purpose, aka ISO 8859-2.
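
If you want to check a wish list against a candidate encoding, something like the following Python sketch would do. The `wanted` table and the helper `missing` are made up for the example, and the particular letters are my guess at what the wish list covers; Python's `iso8859_2` and `latin-1` codecs are assumed to match the standards.

```python
# Hypothetical wish list: acuted vowels, trema'd vowels, accented s and n, eng.
wanted = {
    "a-acute": "\u00e1", "e-acute": "\u00e9", "i-acute": "\u00ed",
    "o-acute": "\u00f3", "u-acute": "\u00fa",
    "a-trema": "\u00e4", "e-trema": "\u00eb", "i-trema": "\u00ef",
    "o-trema": "\u00f6", "u-trema": "\u00fc",
    "s-acute": "\u015b", "n-acute": "\u0144", "eng": "\u014b",
}


def missing(chars: dict, charset: str) -> list:
    """List the names of the characters that charset cannot represent."""
    absent = []
    for name, ch in chars.items():
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            absent.append(name)
    return absent


print("Latin-2 lacks:", missing(wanted, "iso8859_2"))
print("Latin-1 lacks:", missing(wanted, "latin-1"))
```

On this list Latin-2 covers s-acute and n-acute but not eng, and it drops i-trema, so it is worth checking the exact vowels you need.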
> have are "combining acute", "combining ogonek" and "combining dot below",
> but I suspect there's not an 8-bit encoding that handles any of those.
There are 8-bit encodings that handle combining characters, but they
are very rarely used outside library contexts.
--
"We are lost, lost. No name, no business, no Precious, nothing. Only empty.
Only hungry: yes, we are hungry. A few little fishes, nassty bony little
fishes, for a poor creature, and they say death. So wise they are; so just,
so very just." --Gollum jcowan@reutershealth.com www.ccil.org/~cowan