Re: Tech: Unicode (was...)

From: Mark J. Reed <markjreed@...>
Date: Saturday, May 8, 2004, 17:40
On Sat, May 08, 2004 at 05:56:37AM -0700, Philippe Caquant wrote:
> Looks like I missed something again. I had just
> understood that Unicode used two bytes for encoding a
> single character.
"Unicode" doesn't deal in *bytes* at all. Unicode is a set of mappings from numbers - called "code points" - to characters. The maximum number in the Unicode range is 1,114,111, but not all numbers in the range are assigned to characters, and several ranges are reserved for other purposes and will never be assigned to characters. Currently there are around 90,000 assigned characters.

But again, all that is being mapped is numbers. Unicode tells you that, for instance, the Cyrillic lowercase letter yeru is represented by the number whose hexadecimal representation is 44B, which is the same number whose decimal representation is 1099. But the Unicode standard itself doesn't tell you anything about how to go about conveying that number in a file or on a network or whatever. That final step is called "encoding", and there are many different encodings to choose from.

All "national" single-byte character sets, for instance, can be regarded from the Unicode point of view as encodings of Unicode characters - incomplete ones, in that most code points are completely unrepresentable, but encodings nevertheless.

For complete encodings, where every possible Unicode character is representable, the basic choices are the Unicode Transformation Formats, or UTFs, which are numbered according to the *minimum* number of bits required to encode a character.

Thus, UTF-32 uses 32 bits or four bytes per character - wasteful of space, in that you never need more than 21 of those 32 bits, but time-efficient for modern computer architectures to process. UTF-32 is especially wasteful since most of the currently-defined Unicode characters are in the range 0 to 65535 - what is called the "Basic Multilingual Plane", or BMP. If all your characters are in this range, then half of the space is being wasted in UTF-32.

UTF-16 uses 16 bits or two bytes for each character in the BMP.
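To make the number-versus-bytes distinction concrete, here is a minimal Python sketch using the yeru example. The choice of Python, the variable names, the big-endian "-be" codecs (picked so that no byte order mark is added), and KOI8-R as the sample single-byte "national" set are all my own assumptions for illustration, not anything the Unicode standard prescribes.

```python
# A character is just a number (a code point) until you pick an encoding.
yeru = "\u044b"                      # CYRILLIC SMALL LETTER YERU
code_point = ord(yeru)

# The same number in hexadecimal and in decimal.
as_hex = format(code_point, "X")     # "44B"
as_decimal = str(code_point)         # "1099"

# Encoding turns the number into bytes.
utf32_bytes = yeru.encode("utf-32-be")   # four bytes: 00 00 04 4B
utf16_bytes = yeru.encode("utf-16-be")   # two bytes:  04 4B

# A single-byte national set (KOI8-R here) is an incomplete encoding:
# yeru fits in one byte, but a Chinese character has no representation.
koi8_bytes = yeru.encode("koi8-r")
try:
    "\u4e2d".encode("koi8-r")
    representable = True
except UnicodeEncodeError:
    representable = False

print(as_hex, as_decimal, utf32_bytes.hex(), utf16_bytes.hex(), representable)
```

For a BMP character like this one, the UTF-32 form is simply the UTF-16 form padded with two zero bytes, which is the wasted half mentioned above.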
Unicode strings in memory in many programming languages, including Java, and C++ on Windows when using the Windows Unicode system API, are generally in UTF-16. Characters outside the BMP are represented by pairs of 16-bit numbers chosen from a range in Unicode reserved for this purpose - that is, these numbers will never represent individual characters, but are reserved to represent half of the encoding for characters outside the BMP.

Note that UTF-16 and UTF-32 don't specify byte order; whatever the natural order is on a given system may be used. There is a Unicode character called the Byte Order Mark (U+FEFF), whose byte-reversed value is illegal, that is customarily put at the start of such files to indicate the endianness of the encoding.

UTF-8 uses 8 bits or one byte for characters in the range 0-127, that is, the equivalent of US-ASCII. Higher-numbered characters require more bytes: two bytes for the range 128-2047, three bytes for the rest of the BMP, and four bytes for characters outside the BMP.

UTF-7 uses 7 bits (actually, one byte with a 0 high bit) for characters in the US-ASCII range except for +, and multi-byte sequences, every byte of which still has the high bit clear, for + and for characters outside the US-ASCII range. As a result, it is 7-bit-clean and suitable for transmission through 8-bit-hostile environments, such as our friendly neighborhood conlang list server.

Then there are the SCSU and BOCU encodings, which are called "compressions" because they attempt to get an average storage requirement equivalent to national character sets for messages whose characters all come from the equivalent Unicode range. Thus, you can encode Western European languages at close to 8 bits per character, and Japanese at close to 16 bits per character, while still having access to the entire Unicode repertoire when needed.

Finally, note that there is not a one-to-one correspondence between "character", which is what Unicode encodes, and "glyph", which is what shows up on your screen.
Different fonts may have very different glyphs for what is logically the same character. Some characters are non-spacing marks which are intended to be superimposed over an adjacent glyph, and the font may treat the combination as a separate glyph entirely instead of a composition, etc.

-Mark
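For anyone who wants to check the byte counts, the reserved 16-bit pairs, the Byte Order Mark and the non-spacing marks for themselves, here is a rough Python sketch. The sample characters (euro sign, musical G clef, combining acute accent) and all variable names are my own picks for illustration, and the "-be"/"-le" codecs are used only to pin the byte order down explicitly.

```python
import unicodedata

# One character from each UTF-8 length class, with the expected byte count.
samples = {
    "A": 1,            # U+0041, US-ASCII range
    "\u044b": 2,       # U+044B, range 128-2047
    "\u20ac": 3,       # U+20AC EURO SIGN, rest of the BMP
    "\U0001d11e": 4,   # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP
}
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

# Outside the BMP, UTF-16 uses a pair of reserved 16-bit values.
clef = "\U0001d11e"
raw = clef.encode("utf-16-be")
high = int.from_bytes(raw[:2], "big")
low = int.from_bytes(raw[2:], "big")

# The same pair worked out by hand from the code point.
offset = ord(clef) - 0x10000
manual_high = 0xD800 + (offset >> 10)
manual_low = 0xDC00 + (offset & 0x3FF)

# The Byte Order Mark U+FEFF as it appears in each byte order.
bom_be = "\ufeff".encode("utf-16-be")   # FE FF
bom_le = "\ufeff".encode("utf-16-le")   # FF FE

# Character vs. glyph: "e" plus a non-spacing acute accent is two
# characters, even though a font will normally draw one glyph.
decomposed = "e\u0301"
is_nonspacing = unicodedata.category("\u0301") == "Mn"

print(utf8_lengths == samples, hex(high), hex(low), len(decomposed), is_nonspacing)
```

The hand computation subtracts 0x10000 from the code point and splits the remaining 20 bits into two 10-bit halves, added to 0xD800 and 0xDC00 respectively; those two bases mark the start of the ranges reserved for the pairs.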