Re: OT: TECH: Dumb Unicode question
From: John Cowan <cowan@...>
Date: Friday, November 21, 2003, 18:13
Mark J. Reed scripsit:
> It seems that either 1,048,576/0x100000 (16 planes) or
> 16,777,216/0x1000000 (256 planes) would have been more logical than
> the chosen 1,114,112/0x110000 (17 planes), which doesn't fit into any
> nice power of two.
/me takes a deep breath, inserts bit between teeth, and begins...
There are actually two different standards, Unicode (a vendor standard
developed by the Unicode Technical Committee) and ISO 10646 (an
international standard developed by ISO/IEC JTC1/SC2/WG2). The difference
is almost never important, since the two committees closely interoperate
to make sure that the repertoire of characters in both standards is
always exactly the same. Unicode also includes lots of details about
each character that ISO 10646 does not, but that doesn't matter here.
In the beginning, Unicode had one plane of 2^16 = 65536 codepoints (but
didn't call it a "plane") and ISO 10646 had no less than 2^15 planes
(32768 planes) for 2^31 or 2,147,483,648 codepoints. By definition,
Plane 0 of ISO 10646 was the same as Unicode.
When in the course of standardization events, it became clear that
Unicode would need more characters than could fit on Plane 0, a hacque
was devised. By this hacque, two blocks of 1024 characters each were
set aside, called the "low surrogates" and the "high surrogates",
and that allowed 1024^2 = 2^20 = 1,048,576 additional characters to be
represented using two consecutive Plane 0 codepoints, one from each block.
These 2^20 characters were mapped onto ISO 10646 planes 1 through 16,
which together with Plane 0 makes the 17 planes you asked about.
WG2 then agreed never to assign any characters in planes 17 through
32767, and eventually agreed to remove them from ISO 10646 altogether.
So now both standards have exactly the same architecture as well as
assigning the same meaning to each codepoint.
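
The surrogate arithmetic can be sketched in a few lines of Python. This is
only an illustration, not text from either standard: the block starting
points 0xD800 and 0xDC00 are the values the Unicode Standard assigns to the
two surrogate blocks (they are not stated above), and the function and
constant names are made up for the example.

```python
# Illustrative names; the two block start values come from the Unicode Standard.
HIGH_START = 0xD800   # first codepoint of the high-surrogate block
LOW_START = 0xDC00    # first codepoint of the low-surrogate block
BLOCK_SIZE = 1024     # each surrogate block holds 1024 codepoints
PLANE_SIZE = 0x10000  # 2^16 codepoints per plane


def to_surrogates(cp):
    """Split a codepoint from planes 1-16 into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("codepoint is not in planes 1 through 16")
    offset = cp - 0x10000                 # 20-bit value, 0 .. 2^20 - 1
    high = HIGH_START + (offset >> 10)    # top 10 bits pick the high surrogate
    low = LOW_START + (offset & 0x3FF)    # bottom 10 bits pick the low surrogate
    return high, low


def from_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into one codepoint."""
    if not HIGH_START <= high < HIGH_START + BLOCK_SIZE:
        raise ValueError("not a high surrogate")
    if not LOW_START <= low < LOW_START + BLOCK_SIZE:
        raise ValueError("not a low surrogate")
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


# Where the odd-looking total comes from: one plane addressed directly,
# plus 1024 * 1024 codepoints addressed through surrogate pairs.
ASTRAL_COUNT = BLOCK_SIZE ** 2              # 2^20 = 1,048,576
TOTAL_CODEPOINTS = PLANE_SIZE + ASTRAL_COUNT  # 0x110000 = 1,114,112
TOTAL_PLANES = TOTAL_CODEPOINTS // PLANE_SIZE  # 17
```

So the limit is not a power of two because it is a sum of two of them:
2^16 + 2^20.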
Eventually the 16-bit representation of Unicode was demoted from *the*
representation to one of three equal representations: UTF-8, using one
to three bytes for each Plane 0 character and four bytes for each Astral
Plane [not an official term] character; UTF-16, using one short integer
for each Plane 0 character and two short integers for each Astral Plane
character; and UTF-32, using one long integer for every character.
The codepoints assigned to the surrogate blocks are never used in UTF-8
or UTF-32 representations.
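
If you want to check the sizes yourself, here is a rough sketch using the
codecs built into Python 3. The function names are invented for the
example, and the surrogate check relies on Python 3's default strict
error handling, which refuses to encode a lone surrogate.

```python
def code_units(cp):
    """Count the units each encoding form needs for one codepoint."""
    ch = chr(cp)
    return {
        "utf8_bytes": len(ch.encode("utf-8")),
        # The -be codec variants write no byte order mark, so the
        # lengths reflect the character alone.
        "utf16_shorts": len(ch.encode("utf-16-be")) // 2,
        "utf32_longs": len(ch.encode("utf-32-be")) // 4,
    }


def utf8_accepts(cp):
    """Report whether Python's strict UTF-8 codec will encode this codepoint."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False
```

For example, code_units(0x20AC) reports 3 bytes, 1 short and 1 long for
a Plane 0 character, code_units(0x1F600) reports 4 bytes, 2 shorts and
1 long for an Astral Plane character, and utf8_accepts(0xD800) is False.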
--
John Cowan jcowan@reutershealth.com www.ccil.org/~cowan www.reutershealth.com
"If he has seen farther than others,
it is because he is standing on a stack of dwarves."
--Mike Champion, describing Tim Berners-Lee (adapted)