Re: Unicode 3.0
From: Don Blaheta <dpb@...>
Date: Friday, October 1, 1999, 18:11
Quoth Paul Bennett:
> Rob writes:
> > Since I've been looking for fonts I encountered the word Unicode. What is
> > it? How can I get all those neat scripts like Thaana and (especially)
> > Mongolian vertical script?
>
> To answer this question in "layman's terms":
I'll get a little more technical on ya. :)
> Computers store documents as a sequence of numbers, with different
> numeric values representing the symbols for different letters, numbers
> and so forth. Traditional numbering schemes (ASCII, ANSI, EBCDIC are
> examples) can store only about 250 characters, so you need different
> (and conflicting) mappings from number to character for each character
> set. Worse still, the most popular scheme (ASCII) is only really
> firmly defined for about 130 of those 250.
To be exact, ASCII is a 7-bit set; it has 128 possible values, of which
33 are taken up with "control" values, like "null", "backspace", "end
of line", and so forth. The printable ASCII characters are exactly
those that appear on a standard US keyboard.
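If you want to see that concretely, here's a throwaway C sketch (purely
my own illustration, not from any standard) that walks all 128 values
and labels each one control or printable:

  #include <stdio.h>

  /* Rough sketch: the 33 control characters are codes 0-31 plus 127
   * (DEL); the other 95 values, 32-126, are the printable set. */
  int main(void)
  {
      int c;
      for (c = 0; c < 128; c++) {
          if (c < 32 || c == 127)
              printf("%3d  control\n", c);
          else
              printf("%3d  printable '%c'\n", c, c);
      }
      return 0;
  }
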
In the 80s there were a number of "national" sets that replaced
characters such as {} with their own forms, like n-tilde or a-umlaut.
ISO approved a series of 8-bit character sets (iso-8859) in the late
80s, each of which had 256 potential characters. But the first 128 of
each set were identical to ASCII, and 32 of the remaining 128 were
taken for more control characters (which have never really been
used...). So effectively each iso-8859 charset has 96 "foreign"
characters in it. iso-8859-1 (aka iso-latin-1) is the most commonly
used one; it covers most western European languages, was adopted by
ANSI and Microsoft, and is mostly mappable to the main Apple Mac
charset. Many people refer to it as ASCII, although that's not exactly
true. The other iso-8859 charsets cover the remainder of Europe and
some other Latin-alphabet-using scripts, as well as Greek, Hebrew,
Cyrillic, and Arabic, but few people use the Cyrillic and Arabic sets
(not sure about Greek and Hebrew).

The whole thing is sort of a mess, because each document has to specify
which character set it's in, and a lot of programs just assume
iso-latin-1.
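To make that shared layout explicit, here's another little C sketch of
my own (the function name iso8859_region is made up for illustration):

  #include <stdio.h>

  /* Illustration only: the byte layout every iso-8859 set shares.
   * 0x00-0x7F is plain ASCII, 0x80-0x9F is the extra block of control
   * characters, and 0xA0-0xFF is the 96 "foreign" characters that
   * differ from one iso-8859-N set to the next. */
  const char *iso8859_region(unsigned char b)
  {
      if (b < 0x80) return "same as ASCII";
      if (b < 0xA0) return "extra control characters (rarely used)";
      return "charset-specific printable character";
  }

  int main(void)
  {
      unsigned char samples[] = { 0x41, 0x85, 0xE9, 0xF1 };
      int i;
      for (i = 0; i < 4; i++)
          printf("0x%02X  %s\n", samples[i], iso8859_region(samples[i]));
      return 0;
  }

In iso-8859-1 that last range holds the Western European accented
letters and symbols; in iso-8859-7 the very same byte values hold Greek
letters, which is exactly why every document needs to say which set
it's in.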
> Unicode is different to this, as it allows over 65,500 possible
> characters, letting you have (pretty much) a unique number for any
> symbol in any script, and the value-to-symbol mapping is guaranteed to
> be compatible with any other Unicode-using software.
Enter Unicode. Rather than restrict itself to 8 bits, the Unicode
Consortium decided to make a 16-bit standard. This gave them 65,536
character values to play with; finally, they could create one character
set to include every character in every script currently in use, and
several that aren't.
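Concretely, and assuming the straight 16-bit representation (the type
name unichar below is made up; C doesn't give you a portable 16-bit
character type), a Unicode string is just an array of 16-bit values:

  #include <stdio.h>

  /* Sketch only: one 16-bit value per character, straight from the
   * code charts.  Latin, Latin-with-accent, and Greek all live in the
   * same sequence with no charset switching. */
  typedef unsigned short unichar;   /* assumed 16-bit on this machine */

  int main(void)
  {
      unichar text[] = { 0x0041,   /* LATIN CAPITAL LETTER A */
                         0x00F1,   /* LATIN SMALL LETTER N WITH TILDE */
                         0x0391,   /* GREEK CAPITAL LETTER ALPHA */
                         0x0000 };
      int i;
      for (i = 0; text[i] != 0; i++)
          printf("U+%04X\n", (unsigned) text[i]);
      return 0;
  }
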
Of course, this isn't without its problems. One-byte codes are *very*
entrenched in the computer world, and there is a lot of extant code
that assumes characters are only one byte long. In fact, almost all of
it. You've heard of the Y2K problem; this is worse (if perhaps a bit
less urgent). Essentially, every bit of code that is intended to handle
Unicode has to be written from scratch, and that's not trivial.

What's more, there are a lot of brand-new issues that never got play
during the ASCII/iso-latin days. What about joining letters? In
Arabic, for example, a letter takes a different form based on what it's
adjacent to, and the display has to pick the correct glyph for each
letter and join it up.

Then there are font-coordination issues: no single font is going to
include glyphs for all 65,536 characters. Fonts will typically cover
small subsets like "all the letters used in England, France, and
Germany" or "all the syllables of Japanese" or even "all the characters
used in Chinese", which though large is still swamped by the rest. So
any editor or display program has to be able to deal well with
searching through several fonts to find one with the requisite
glyphs....

And then there are issues that exist already but have yet to be solved,
like capitalisation (many C programs know that A and a are paired, but
what about Á and á?) and "identical letters" like A and capital-alpha.
WHAT THIS MEANS FOR YOU: Unicode is something you can plan to have...
eventually. Virtually everyone doing anything in the computer world
pledges full Unicode support (some of them claim they already have it,
but afaict they're all lying). True full Unicode support will mean
support at the OS level, the application level, and the GUI level, plus
having all the fonts you need. That'll take a while. So for now, I'd
say, design anything you do with an eye toward using Unicode in the
future, but don't count on using it terribly soon....
-- 
-=-Don Blaheta-=-=-dpb@cs.brown.edu-=-=-<http://www.cs.brown.edu/~dpb/>-=-
For large values of one, one equals two, for small values of two.