Conlang: Re: OT: TECH: Dumb Unicode question (John Cowan, Nov 21 '03, 18:13)

Re: OT: TECH: Dumb Unicode question

From:	John Cowan <cowan@...>
Date:	Friday, November 21, 2003, 18:13

Mark J. Reed scripsit:

> It seems that either 1,048,576/0x100000 (16 planes) or
> 16,777,216/0x1000000 (256 planes) would have been more logical than
> the chosen 1,114,112/0x110000 (17 planes), which doesn't fit into any
> nice power of two.

/me takes a deep breath, inserts bit between teeth, and begins...

There are actually two different standards, Unicode (a vendor standard developed by the Unicode Technical Committee) and ISO 10646 (an international standard developed by ISO/IEC JTC1/SC2/WG2). The difference is almost never important, since the two committees closely interoperate to make sure that the repertoire of characters in both standards is always exactly the same. Unicode also includes lots of details about each character that ISO 10646 does not, but that doesn't matter here.

In the beginning, Unicode had one plane of 2^16 = 65,536 codepoints (but didn't call it a "plane") and ISO 10646 had no fewer than 2^15 = 32,768 planes, for 2^31 or 2,147,483,648 codepoints. By definition, Plane 0 of ISO 10646 was the same as Unicode.

When in the course of standardization events, it became clear that Unicode would need more characters than could fit on Plane 0, a hacque was devised. By this hacque, two blocks of 1024 codepoints each were set aside, called the "high surrogates" and the "low surrogates", and that allowed 1024^2 = 2^20 = 1,048,576 additional characters to be represented using two consecutive Plane 0 codepoints, one from each block. These 2^20 characters were mapped onto ISO 10646 planes 1 through 16. Together with Plane 0 that makes 17 planes, or 17 * 65,536 = 1,114,112 = 0x110000 codepoints, which is where the odd-looking number comes from.

WG2 then agreed never to assign any characters in planes 17 through 32767, and eventually agreed to remove them from ISO 10646 altogether. So now both standards have exactly the same architecture as well as assigning the same meaning to each codepoint.

Eventually the 16-bit representation of Unicode was demoted from *the* representation to one of three equal representations: UTF-8, using one to three bytes for each Plane 0 character and four bytes for each Astral Plane [not an official term] character; UTF-16, using one short (16-bit) integer for each Plane 0 character and two short integers for each Astral Plane character; and UTF-32, using one long (32-bit) integer for every character.
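For anyone who wants to check the arithmetic, here is a minimal Python sketch of the surrogate hacque and of the three representations. The function and constant names are made up for illustration, and the block starting points 0xD800 and 0xDC00 are not stated above; they are the values given in the Unicode code charts.

```python
# Illustrative sketch only; these names come from neither standard.
HIGH_BASE = 0xD800    # first high surrogate (value from the code charts, not from the post)
LOW_BASE = 0xDC00     # first low surrogate (likewise)
BLOCK_SIZE = 1024     # codepoints in each surrogate block


def to_surrogates(cp):
    """Split a codepoint from planes 1 through 16 into a (high, low) pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("codepoint is not in planes 1 through 16")
    offset = cp - 0x10000                 # 20-bit value, 0 .. 0xFFFFF
    return HIGH_BASE + (offset >> 10), LOW_BASE + (offset & 0x3FF)


def from_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into one codepoint."""
    if not HIGH_BASE <= high < HIGH_BASE + BLOCK_SIZE:
        raise ValueError("not a high surrogate")
    if not LOW_BASE <= low < LOW_BASE + BLOCK_SIZE:
        raise ValueError("not a low surrogate")
    return 0x10000 + ((high - HIGH_BASE) << 10) + (low - LOW_BASE)


def encoded_lengths(cp):
    """Byte lengths of one codepoint in UTF-8, UTF-16, and UTF-32."""
    ch = chr(cp)
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be"))


pair_count = BLOCK_SIZE * BLOCK_SIZE      # 2**20 pairs, i.e. 16 planes
total_codepoints = 0x10000 + pair_count   # Plane 0 plus 16 more planes

if __name__ == "__main__":
    high, low = to_surrogates(0x1D11E)
    print(hex(high), hex(low))            # 0xd834 0xdd1e
    print(hex(total_codepoints))          # 0x110000
    print(encoded_lengths(0x20AC))        # (3, 2, 4)
    print(encoded_lengths(0x1D11E))       # (4, 4, 4)
```

The byte counts use the big-endian codec names only so that Python does not prepend a byte-order mark; that is a detail of the sketch, not of the encodings themselves.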
The codepoints assigned to the surrogate blocks are never used in UTF-8 or UTF-32 representations.

-- 
John Cowan  jcowan@reutershealth.com  www.ccil.org/~cowan  www.reutershealth.com
"If he has seen farther than others, it is because he is standing on a stack
of dwarves." --Mike Champion, describing Tim Berners-Lee (adapted)
