Re: OT: Question: Unicode
From: Carlos Thompson <chlewey@...>
Date: Monday, May 19, 2003, 6:16
----- Original Message -----
From: "Roger Mills" <romilly@...>
To: <CONLANG@...>
Sent: Sunday, May 18, 2003 10:28 AM
Subject: Re: Question: Unicode
> Carlos Thompson wrote:
>
> > Roger Mills wrote:
> >
> >
> > > I've created a web page using MS Word, and Lucida Sans Unicode. In the
> > > header, MS says "charset-MS 1252" or somesuch. Should this be changed to
> > > UTF8?
> >
> > Well, you should say UTF-8 if the text file is in UTF format, that is, if
> > you will give entities above ASCII with variable length codes (those that
> > look like Ã¡ for an á).
> (snip)
> Gracias, Carlos. Yes, I do include characters from the range above 0255,
> such as glottal stop, and several vowels with macrons or breves. So it
> appears I should change the setting to UTF 8.
>
> Oddly, although I can type in the characters (hex number e.g. 012B, plus
> Alt-x) I see that they're automatically converted to decimal. How clever of
> MS Word.
>
> > UTF-8: makes shorter files if you are using lots of codes not available in
> > Latin-1 or any other ISO-8859 code page. The UTF-8 files are difficult to
> > edit in common text editors (vi, pico, notepad, wordpad, etc)
>
> In what way is it difficult?
Because you won't see a-acute as a-acute (á) but as A-tilde followed by an
inverted exclamation mark (Ã¡) (assuming an editor in Windows or any
other Latin-1 environment).
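
A quick way to check this yourself, as a minimal Python sketch (Python and
the helper name seen_as are only my choices for illustration): encode the
text as UTF-8, then decode the bytes the way a one-byte-per-character editor
would. Character names are printed instead of the glyphs so the output
survives any terminal.

```python
import unicodedata

def seen_as(text, editor_codec="cp1252"):
    """Return what UTF-8 encoded text looks like in a one-byte-per-character editor.

    editor_codec is an assumption about the editor: cp1252 stands in for
    "Microsoft Latin-1".
    """
    return text.encode("utf-8").decode(editor_codec, errors="replace")

garbled = seen_as("\u00e1")  # a-acute
print([unicodedata.name(c) for c in garbled])
# ['LATIN CAPITAL LETTER A WITH TILDE', 'INVERTED EXCLAMATION MARK']
```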
So, if you want to represent a glottal stop, you have to check your Unicode
chart, where you find it is 660 decimal or 294 hexadecimal, which is
1010010100 in binary. This is more than 7 binary digits (ASCII) but fewer
than 12, so you need two bytes to represent it in UTF-8: the first byte has
the pattern 110x xxxx and the second one 10xx xxxx, where the 11 x's hold
the code of the character: 01010010100. So the first byte is 1100 1010,
which is Latin-1 for E-circumflex (Ê), and the second one is 1001 0100,
which is not a legal ISO Latin-1 character but looks like a closing double
quotation mark in Microsoft Latin-1. So a glottal stop will look like Ê",
where " will be a closing quotation mark or a square box, according to your
software. (In a DOS editor it will look like a box-drawing character
followed by o-dieresis.) So you have to learn all those Latin-1
representations of UTF-8 encodings.
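
The bit arithmetic for the glottal stop can be checked mechanically. This is
only a sketch under my own assumptions: utf8_two_bytes is a made-up helper
that handles just the two-byte range, and cp1252 and cp437 stand in for
"Microsoft Latin-1" and the DOS code page.

```python
import unicodedata

def utf8_two_bytes(code_point):
    """Pack a code point in the range 128..2047 into 110xxxxx 10xxxxxx."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("two-byte UTF-8 only covers 128..2047")
    first = 0b11000000 | (code_point >> 6)         # top 5 of the 11 bits
    second = 0b10000000 | (code_point & 0b111111)  # low 6 bits
    return bytes([first, second])

glottal = utf8_two_bytes(660)
print(format(660, "011b"))  # 01010010100, the 11 payload bits
print(glottal.hex())        # ca94

# What a one-byte-per-character editor shows for those two bytes:
for codec in ("cp1252", "cp437"):
    print(codec, [unicodedata.name(c) for c in glottal.decode(codec)])
```

The hand-packed bytes agree with what Python's own encoder gives for
"\u0294".encode("utf-8").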
Note that UTF-8 encodes characters below 128 using one byte, characters
between 128 and 2047 using 2 bytes, between 2048 and 65535 using 3 bytes,
etc.
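
Those thresholds fit in a few lines. A small sketch (utf8_length is a
hypothetical helper, and the sample code points are just ones I picked),
cross-checked against Python's built-in encoder:

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 needs for one code point (up to 0x10FFFF)."""
    if code_point < 0x80:      # below 128: plain ASCII
        return 1
    if code_point < 0x800:     # 128..2047
        return 2
    if code_point < 0x10000:   # 2048..65535
        return 3
    return 4                   # 65536 and above

# A, a-acute, glottal stop, euro sign, and a character beyond 65535
for cp in (0x41, 0xE1, 0x294, 0x20AC, 0x10348):
    print(hex(cp), utf8_length(cp), len(chr(cp).encode("utf-8")))
```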
At least with HTML numeric entities you have a more straightforward way to
know what each code represents: it begins with &# and ends with ; and in
between is the decimal code you can check in any Unicode chart. (But a
glottal stop encoded as a numeric entity, whether the file is ASCII, Latin-1
or UTF-8, will use 6 bytes: "&#660;".)
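
To compare the sizes, here is a small sketch (numeric_entity is a made-up
helper name, not anything from a library): it builds the decimal entity from
the code point and counts bytes both ways.

```python
def numeric_entity(char):
    """Build the decimal HTML numeric entity for a single character."""
    return "&#{};".format(ord(char))

entity = numeric_entity("\u0294")          # glottal stop
print(entity, len(entity.encode("ascii")))  # &#660; 6
print(len("\u0294".encode("utf-8")))        # 2
```

The entity is pure ASCII, so it costs the same 6 bytes whichever of the
three encodings the file is saved in.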
Well, unless your text editor handles UTF-8 natively: then you will see a
glottal stop. But most text editors match one character per byte (or
possibly one character per 2 or 4 bytes if they are Unicode, like the
Unicode version of Wordpad), and rarely handle variable-length characters
like those in UTF-8.
Well, you can still set the file encoding as UTF-8 and write the characters
above 255 as HTML numeric entities (a good compromise for supporting
Netscape 4.x), but then you also need to write Latin-1 characters either as
HTML entities (numeric or literal) or as two-byte UTF-8 codes: a-acute will
then look like &aacute;, &#225;, &#xE1;, or A-tilde + inverted exclamation
mark (Ã¡), but not like a-acute in your text editor.
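
All of these spellings come out as the same character once a browser has
decoded them. A sketch using Python's standard html.unescape as a stand-in
for the browser (that substitution is my assumption, not how any particular
browser works internally):

```python
import html

# Literal entity, decimal entity, hexadecimal entity
spellings = ["&aacute;", "&#225;", "&#xE1;"]
decoded = [html.unescape(s) for s in spellings]

# The raw two-byte UTF-8 form, decoded as UTF-8
decoded.append(b"\xc3\xa1".decode("utf-8"))

print([hex(ord(c)) for c in decoded])  # ['0xe1', '0xe1', '0xe1', '0xe1']
```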
Well, it will look like a acute (á) in your WYSIWYG editor, so why worry?
-- Carlos Th