Conlang: Re: OT: Question: Unicode (Carlos Thompson, May 19 '03, 6:16)

Re: OT: Question: Unicode

From:	Carlos Thompson <chlewey@...>
Date:	Monday, May 19, 2003, 6:16

----- Original Message -----
From: "Roger Mills" <romilly@...>
To: <CONLANG@...>
Sent: Sunday, May 18, 2003 10:28 AM
Subject: Re: Question: Unicode

> Carlos Thompson wrote:
>
> > Roger Mills wrote:
> >
> > > I've created a web page using MS Word, and Lucida Sans Unicode. In the
> > > header, MS says "charset-MS 1252" or somesuch. Should this be changed to
> > > UTF8?
> >
> > Well, you should say UTF-8 if the text file is in UTF format, that is,
> > you will give entities above ASCII with variable length codes (those that
> > look like Ã¡ for an á).
> (snip)
> Gracias, Carlos. Yes, I do include characters from the range above 0255,
> such as glottal stop, and several vowels with macrons or breves. So it
> appears I should change the setting to UTF 8.
>
> Oddly, although I can type in the characters (hex number e.g. 012B, plus
> Alt-x) I see that they're automatically converted to decimal. How clever of
> MS Word.
>
> > UTF-8: makes shorter files if you are using lots of codes not available in
> > Latin-1 or any other ISO-8859 code page. The UTF-8 files are difficult to
> > edit in common text editors (vi, pico, notepad, wordpad, etc)
>
> In what way is it difficult?

Because you won't see a-acute as a-acute (á) but as A-tilde followed by an inverted exclamation mark (Ã¡), assuming an editor in Windows or any other Latin-1 environment.

So, if you want to represent a glottal stop, you have to check your Unicode chart, where you find it is 660 decimal or 294 hexadecimal, which is 1010010100 binary. This is above 7 binary digits (ASCII) but below 12, so you need two bytes to represent it in UTF-8: the first byte has the pattern 110x xxxx and the second one 10xx xxxx, where the 11 x's hold the code of the character, padded with leading zeros: 01010010100. So the first byte is 1100 1010, which is Latin-1 for E-circumflex (Ê), and the second one is 1001 0100, which is not a printable ISO Latin-1 character but looks like a closing double quotation mark in Microsoft Latin-1 (Windows-1252). So a glottal stop will look like Ê”, where ” will be a closing quotation mark or a square box, according to your software. (In a DOS editor it will look like a box-drawing character followed by o-dieresis.) So you have to learn all those Latin-1 representations of UTF-8 encodings.

Note that UTF-8 encodes characters below 128 using one byte, characters between 128 and 2047 using 2 bytes, between 2048 and 65535 using 3 bytes, etc.

At least using HTML numeric entities you have a more straightforward way to know what each code represents: it begins with &# and ends with ; and between is the decimal code you can check in any Unicode chart. (But a glottal stop written as an HTML numeric entity, "&#660;", uses 6 bytes whether the file is ASCII, Latin-1 or UTF-8.)

Well, unless your text editor handles UTF-8 natively: then you will see a glottal stop. But most text editors match one character per byte (or possibly one character per 2 or 4 bytes if they are Unicode, like the Unicode version of Wordpad, but rarely variable-length characters like those in UTF-8).

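The two-byte packing described above can be checked with a few lines of Python. This is only a minimal sketch: the helper name `utf8_two_byte` is made up for illustration, and it deliberately handles nothing outside the 128 to 2047 range. The codec names (`cp1252` for Microsoft Latin-1, `cp437` for the DOS code page, `latin-1`) are the ones Python's standard library uses.

```python
def utf8_two_byte(code_point: int) -> bytes:
    """Pack a code point in 128..2047 into the 110xxxxx 10xxxxxx pattern."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("this sketch only handles the two-byte range")
    first = 0b11000000 | (code_point >> 6)         # top 5 of the 11 bits
    second = 0b10000000 | (code_point & 0b111111)  # low 6 bits
    return bytes([first, second])

glottal = utf8_two_byte(0x294)        # glottal stop, 660 decimal
a_acute = utf8_two_byte(0xE1)         # a-acute, 225 decimal

# The hand-packed bytes agree with Python's own UTF-8 encoder.
assert glottal == "\u0294".encode("utf-8")

print(glottal.hex())                  # ca94
print(glottal.decode("cp1252"))       # what a Windows one-byte editor shows
print(glottal.decode("cp437"))        # what a DOS editor shows
print(a_acute.decode("latin-1"))      # a-acute misread as Latin-1

# Byte counts at the edges of the one-, two- and three-byte ranges.
lengths = {cp: len(chr(cp).encode("utf-8")) for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF)}
print(lengths)
```

Decoding the same two bytes with different one-byte code pages is exactly the "learn all those representations" problem: the bytes never change, only the table used to display them.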
Well, you can still set the file encoding as UTF-8 and write the characters above 255 as HTML numeric entities (a good compromise for supporting Netscape 4.x), but then you also need to write the Latin-1 characters above 127 either as HTML entities (numeric or literal) or as two-byte UTF-8 codes: a-acute will then look like &aacute;, &#225;, &#xE1;, or A-tilde + inverted exclamation mark (Ã¡), but not like a-acute in your text editor. Well, it will look like a-acute (á) in your WYSIWYG editor, so why worry?

-- Carlos Th
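As a rough illustration of the entity compromise, the sketch below uses Python's standard `html.unescape` to show that the named, decimal and hexadecimal entity forms all resolve to the same character, and that an entity is plain ASCII and so has the same bytes under any of the three encodings. The variable names are arbitrary.

```python
import html

# Named, decimal and hexadecimal references for a-acute.
forms = ["&aacute;", "&#225;", "&#xE1;"]
decoded = [html.unescape(form) for form in forms]
print(decoded)

# A numeric entity is pure ASCII, so its bytes are the same in all three encodings.
entity = "&#660;"
sizes = [len(entity.encode(enc)) for enc in ("ascii", "latin-1", "utf-8")]
print(sizes)                                  # bytes used by the entity
print(len("\u0294".encode("utf-8")))          # bytes used by raw UTF-8

# The entity still resolves to the glottal stop once a browser parses it.
resolved = html.unescape(entity)
print(resolved == "\u0294")
```

The trade-off is size against readability in a one-byte editor: the entity costs three times as many bytes as the raw UTF-8 sequence, but it stays legible as a decimal code.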
