Conlang: Re: TECH: Testing again, no new on-topic content (was Re: "Language Creation" in your conlang) (John Cowan, Nov 17 '03, 5:05)

Re: TECH: Testing again, no new on-topic content (was Re: "Language Creation" in your conlang)

From:	John Cowan <cowan@...>
Date:	Monday, November 17, 2003, 5:05

Paul Bennett scripsit:

> This I knew, but I thought that there were only very few (maybe 2) of the 8th-bit characters which were used to signal an upcoming multibyte character. You live and learn, I guess.

A multibyte character in UTF-8 consists of a lead byte between C2 and F4 followed by one, two, or three continuation bytes between 80 and BF.
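To make the byte layout concrete, here is a minimal Python sketch that labels each byte of a character's UTF-8 encoding. It relies on the fact that continuation bytes always fall in 80-BF and any other byte with the high bit set is a lead byte. The helper name `classify_utf8_bytes` and the sample characters are my own choices for illustration, not anything from the thread.

```python
def classify_utf8_bytes(text):
    """Return (hex, role) pairs for each byte of text's UTF-8 encoding."""
    labels = []
    for b in text.encode("utf-8"):
        if b < 0x80:
            role = "ascii"          # single-byte character, high bit clear
        elif b <= 0xBF:
            role = "continuation"   # 80-BF: second or later byte of a sequence
        else:
            role = "lead"           # first byte of a multibyte sequence
        labels.append((f"{b:02X}", role))
    return labels


# Samples: Latin e, e-acute, Greek alpha, Devanagari ka, Gothic hwair.
for ch in ["e", "\u00e9", "\u03b1", "\u0915", "\U00010348"]:
    print(f"U+{ord(ch):04X}", classify_utf8_bytes(ch))
```

Each non-ASCII sample should show exactly one lead byte followed by one, two, or three continuation bytes.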

> Bah, and a considerable amount of humbug. We've touched in the recent past on email headers being able to include strings of non-ASCII by means of a pre-defined escape character sequence, and I was hoping that something very similar would hold in the body of a message, too. If not, then not, I suppose, but it strikes me as an area in which there is some scope for improvement. I ought to be able to send a Latin-1 email, and if I accidentally slip a few Greek or Devanagari characters in, they should be encodable without having to change the encoding of the whole message.

There's the *character set* used in the message, and then there's the *transfer encoding syntax* used. If I write a message in Latin-1 characters, it gets encoded using one byte per character. But then there are three different ways it can actually get transmitted: "8bit", which means that it actually goes out one byte per byte; "quoted-printable", which sends most bytes in the range 20-7E as themselves (the exception being "=" itself), and other bytes using "=XX" syntax, where XX is a two-digit hex number; and "base64", which encodes every run of three bytes into four bytes in the ranges corresponding to ASCII A-Z, a-z, 0-9, +, and /. But if the character set of the document is Latin-1, you can only use characters from the Latin-1 repertoire in it.
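The split between character set and transfer encoding can be sketched with Python's standard `quopri` and `base64` modules. This is only an illustration under my own assumptions: the sample text is made up, and a real mail client would also fold long lines and set the MIME headers, which this sketch leaves out.

```python
import base64
import quopri

text = "na\u00efve caf\u00e9"       # sample text with two Latin-1 letters

# Step 1: the character set turns characters into bytes (one each in Latin-1).
raw = text.encode("latin-1")

# Step 2: the transfer encoding decides how those bytes travel.
eight_bit = raw                      # "8bit": the bytes go out unchanged
qp = quopri.encodestring(raw)        # "quoted-printable": =XX for non-ASCII bytes
b64 = base64.b64encode(raw)          # "base64": 3 bytes become 4 ASCII characters

print(len(text), len(raw))
print(qp)
print(b64)

# All three decode back to the same Latin-1 bytes.
assert quopri.decodestring(qp) == raw
assert base64.b64decode(b64) == raw

# A character outside the Latin-1 repertoire fails at step 1,
# before any transfer encoding is involved.
try:
    "\u03b1".encode("latin-1")       # Greek alpha
except UnicodeEncodeError as err:
    print("not in Latin-1:", err.reason)
```

The point of the last step is that no choice of transfer encoding rescues a character the character set cannot represent.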

> OTOH, what's the most useful purely 8-bit encoding for me to post in? My main requirements are to post acuted and trema'd vowels, some kind of accented "s" (ideally s-acute, but any other diacritised "s" will do), and some kind of accented "n" (ideally n-acute, but eng would do). nice

Sounds like Latin-2 (aka ISO 8859-2) might serve your purpose.
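One way to check a candidate 8-bit character set is simply to try encoding the characters you need. Here is a small Python sketch; the helper name `encodable` and the particular sample of vowels are my own assumptions, picked to match the requirements quoted above.

```python
def encodable(text, charset):
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# a/e/i/o/u with acute, a/e/o/u with trema, s-acute, n-acute
wanted = ("\u00e1\u00e9\u00ed\u00f3\u00fa"
          "\u00e4\u00eb\u00f6\u00fc"
          "\u015b\u0144")
eng = "\u014b"

for charset in ["latin-1", "iso8859-2"]:
    print(charset, encodable(wanted, charset), encodable(eng, charset))
```

With Python's bundled codecs, the sample encodes in ISO 8859-2 but not in Latin-1 (which lacks s-acute and n-acute), while eng is in neither, so n-acute is the one to use there.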

> have are "combining acute", "combining ogonek" and "combining dot below", but I suspect there's not an 8-bit encoding that handles any of those.

There are 8-bit encodings that handle combining characters, but they are very rarely used outside library contexts.

-- 
"We are lost, lost. No name, no business, no Precious, nothing. Only empty. Only hungry: yes, we are hungry. A few little fishes, nassty bony little fishes, for a poor creature, and they say death. So wise they are; so just, so very just." --Gollum
jcowan@reutershealth.com
www.ccil.org/~cowan