
Re: Unicode 3.0

From:Don Blaheta <dpb@...>
Date:Friday, October 1, 1999, 18:11
Quoth Paul Bennett:
> Rob writes:
> > Since I've been looking for fonts I encountered the word Unicode. What is
> > it? How can I get all those neat scripts like Thaana and (especially)
> > Mongolian vertical script?
>
> To answer this question in "layman's terms":
I'll get a little more technical on ya. :)
> Computers store documents as a sequence of numbers, with different
> numeric values representing the symbols for different letters, numbers
> and so forth. Traditional numbering schemes (ASCII, ANSI, EBCDIC are
> examples) can store only about 250 characters, so you need different
> (and conflicting) mappings from number to character for each character
> set. Worse still, the most popular scheme (ASCII) is only really
> firmly defined for about 130 of those 250.
To be exact, ASCII is a 7-bit set; it has 128 possible values, of which 33 are taken up with "control" values, like "null", "backspace", "end of line", and so forth. The printable ASCII characters are exactly those which appear on a standard US keyboard. In the 80s there were a number of "national" sets which replaced characters such as {} with their own forms like n-tilde, a-umlaut, and so on.

The ISO approved a series of 8-bit character sets (iso-8859) in the late 80s (?), each of which had 256 potential characters. But the first 128 of each set were identical to ASCII, and 32 of the remaining 128 were taken for more control characters (which have never really been used...). So effectively each iso-8859 charset has 96 "foreign" characters in it. iso-8859-1 (aka iso-latin-1) is the most commonly used one; it covers most western European languages, was adopted by ANSI and Microsoft, and is mostly mappable to the main Apple Mac charset. Many people refer to it as ASCII, although that's not exactly true.

The other iso-8859 charsets cover the remainder of Europe and some other latin-alphabet-using scripts, as well as Greek, Hebrew, Russian, and Arabic, but few people use the Russian and Arabic sets (not sure about Greek and Hebrew). The whole thing is sort of a mess, because each document has to specify which character set it's in, and a lot of programs just assume iso-latin-1.
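With modern hindsight, you can see exactly why this is a mess in a few lines of Python (which happens to know all the iso-8859 sets by name): the very same byte decodes to a different character depending on which charset you assume.

```python
# The same 8-bit byte means different things under different iso-8859 sets.
raw = bytes([0xE4])

print(raw.decode("iso8859-1"))   # 'ä'  (Latin-1: a-umlaut, western Europe)
print(raw.decode("iso8859-5"))   # 'ф'  (Cyrillic: ef)
print(raw.decode("iso8859-7"))   # 'δ'  (Greek: delta)

# ...which is why every document has to say which charset it's in;
# a program that just assumes iso-latin-1 will show the wrong letter.
```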
> Unicode is different to this, as it allows over 65,500 possible
> characters, letting you have (pretty much) a unique number for any
> symbol in any script, and the value-to-symbol mapping is guaranteed to
> be compatible with any other Unicode-using software.
Enter Unicode. Rather than restrict itself to 8 bits, the Unicode consortium decided to make a 16-bit standard. This gave them 65,536 character values to play with; finally, they could create one character set to include every character in every script currently in use, and several that aren't.

Of course, this isn't without its problems. One-byte codes are *very* entrenched in the computer world, and there is a lot of extant code that assumes that characters are only one byte long. In fact, almost all of it. You've heard of the Y2K problem---this is worse (if perhaps a bit less urgent). Essentially, every bit of code that is intended to handle Unicode has to be written from scratch, and that's not trivial.

And what's more, there are a lot of brand-new issues that never got play during the ASCII/iso-latin days, like: what about joining letters? In e.g. Arabic, a letter has a different form based on what it's adjacent to, and the display has to pick the correct glyph for each letter and join it up. Then there are font-coordination issues; naturally no single font is going to include all the glyphs for all 65,536 letters; they'll typically handle small subsets like "all the letters used in England, France, and Germany" or "all the syllables of Japanese" or even "all the characters used in Chinese", which though large is still swamped by the rest. And so any editor or display program has to be able to deal well with searching through several fonts to find one with the requisite glyphs.... Then there are issues that exist already but still have yet to be solved, like capitalisation (many C programs know that A and a are paired, but what about Á and á?) and "identical letters" like A and capital-alpha.

WHAT THIS MEANS FOR YOU: Unicode is something you can plan to have... eventually. Virtually everyone doing anything in the computer world pledges full unicode support (some of them claim they already have it, but afaict they're all lying).
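The capitalisation and "identical letters" problems above are easy to demonstrate in (modern-day) Python, which carries the Unicode tables around with it:

```python
# Case pairing: ASCII-only code knows 'A'/'a' are paired, but once the
# character set records the relationship, Á and á pair up just as well.
print("á".upper())    # 'Á'
print("Á".lower())    # 'á'

# "Identical letters": Latin capital A and Greek capital alpha look the
# same in most fonts, but Unicode gives them different numbers, so a
# search for one will never find the other.
print(ord("A"))       # 65   (LATIN CAPITAL LETTER A)
print(ord("Α"))       # 913  (GREEK CAPITAL LETTER ALPHA)
print("A" == "Α")     # False
```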
True full unicode support will mean support at the OS level, the application level, and at the GUI level, plus having all the right fonts that you need. That'll take a while. So for now, I'd say, design anything you do with an eye toward using Unicode in the future, but don't count on using it terribly soon....

--
-=-Don Blaheta-=-=-dpb@cs.brown.edu-=-=-<http://www.cs.brown.edu/~dpb/>-=-
For large values of one, one equals two, for small values of two.