Re: Tech: Unicode (was...)

From:	Mark J. Reed <markjreed@...>
Date:	Saturday, May 8, 2004, 17:40
On Sat, May 08, 2004 at 05:56:37AM -0700, Philippe Caquant wrote:
> Looks like I missed something again. I had just
> understood that Unicode used two bytes for encoding a
> single character.
"Unicode" doesn't deal in *bytes* at all.  Unicode is a set of mappings
from numbers - called "code points" - to characters.  The maximum number
in the Unicode range is 1,114,111, but not all numbers in the range are
assigned to characters, and several ranges are reserved for other
purposes and will never be assigned to characters.  Currently there are
around 90,000 assigned characters.

But again, all that is being mapped is numbers.  Unicode tells you that,
for instance, the Cyrillic lowercase letter yeru is represented by the
number whose hexadecimal representation is 44B, which is the same number
whose decimal representation is 1099.  But the Unicode standard itself
doesn't tell you anything about how to go about conveying that number in
a file or on a network or whatever.
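To make the "numbers, not bytes" point concrete, here is a small sketch in Python 3 (my choice of language for illustration; nothing in the standard depends on it).  It only asks for the code point and the official name of yeru and never touches an encoding.

```python
import sys
import unicodedata

# A character and the number Unicode assigns to it; no bytes are involved yet.
yeru = "\u044b"
code_point = ord(yeru)                # the code point as a plain integer
name = unicodedata.name(yeru)         # the character's official Unicode name

print(hex(code_point), code_point, name)

# The top of the Unicode range, as this Python build reports it.
max_code_point = sys.maxunicode
print(max_code_point)
```

Note that `ord` returns the same integer whether you choose to write it as 0x44B or as 1099; hexadecimal and decimal are just two spellings of one number.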

That final step is called "encoding", and there are many different
encodings to choose from.  All "national" single-byte character sets,
for instance, can be regarded from the Unicode point of view as
encodings of Unicode characters - incomplete ones, in that some code
points are completely unrepresentable, but encodings nevertheless.
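As a rough illustration of "incomplete encodings", the sketch below uses two of the single-byte codecs that ship with Python 3, KOI8-R and Latin-1 (picked by me as examples).  The first can represent yeru in one byte; the second has no byte for it at all.

```python
yeru = "\u044b"

# KOI8-R is a Cyrillic single-byte set, so yeru fits in exactly one byte.
koi8_bytes = yeru.encode("koi8_r")
print(len(koi8_bytes), koi8_bytes.decode("koi8_r") == yeru)

# Latin-1 covers only code points 0-255, so yeru is unrepresentable there.
try:
    yeru.encode("latin-1")
    latin1_can_encode = True
except UnicodeEncodeError:
    latin1_can_encode = False
print(latin1_can_encode)
```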

For complete encodings, where every possible Unicode character is
representable, the basic choices are the Unicode Transformation Formats,
or UTFs, which are numbered according to the *minimum* number of bits
required to encode a character.  Thus, UTF-32 uses 32 bits or four bytes
per character - wasteful of space, in that you never need more than 21
of those 32 bits, but time-efficient for modern computer architectures
to process.

UTF-32 is especially wasteful since most of the currently-defined
Unicode characters are in the range 0 to 65535 - what is called the
"Basic Multilingual Plane", or BMP.  If all your characters are in this
range, then half of the space is being wasted in UTF-32.
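A quick way to see the waste is to encode some BMP-only text as UTF-32 and look at the bytes.  This is only a sketch, using Python 3's built-in "utf-32-be" codec (big-endian, no byte order mark) and a sample string I made up.

```python
text = "conlang \u044b"          # sample text, every character inside the BMP

utf32 = text.encode("utf-32-be")

# Split the byte string into 4-byte units, one per character.
units = [utf32[i:i + 4] for i in range(0, len(utf32), 4)]

# For a BMP character the two high-order bytes of each unit are always zero.
wasted = sum(2 for unit in units if unit[:2] == b"\x00\x00")

print(len(text), len(utf32), wasted)
```

For this string every unit starts with two zero bytes, which is exactly the "half of the space" mentioned above.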

UTF-16 uses 16 bits or two bytes for each character in the BMP.  Unicode
strings in memory in many programming environments, including Java, and C++
on Windows when using the Windows Unicode system API, are generally in this
form.  Characters outside the BMP are represented by pairs of 16-bit numbers
("surrogate pairs") chosen from a range in Unicode reserved for this
purpose; that is, these numbers will never represent individual characters,
but are reserved to represent half of the encoding for characters outside
the BMP.
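The pairing arithmetic can be sketched in a few lines of Python 3.  The function name `to_surrogates` is mine, not anything standard, and the test character (U+1D11E, the musical G clef) is just a convenient example from outside the BMP.  The result is checked against Python's own "utf-16-be" codec.

```python
def to_surrogates(cp):
    """Split a code point above the BMP into its two 16-bit UTF-16 halves."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = cp - 0x10000               # 20 bits are left to distribute
    high = 0xD800 + (offset >> 10)      # top 10 bits go in the first half
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go in the second half
    return high, low

high, low = to_surrogates(0x1D11E)
print(hex(high), hex(low))

# Cross-check against the built-in encoder (big-endian, no byte order mark).
builtin = "\U0001D11E".encode("utf-16-be")
by_hand = high.to_bytes(2, "big") + low.to_bytes(2, "big")
print(builtin == by_hand)
```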

Note that UTF-16 and UTF-32 don't specify byte order; whatever the
natural order is on a given system may be used.  There is a Unicode
character called the Byte Order Mark, whose byte-reversed value is
illegal, that is customarily put at the start of such files to indicate
the endianness of the encoding.
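Here is a small sketch of the byte-order issue, again in Python 3.  The `codecs` module exposes the two possible UTF-16 byte order marks as constants, and the plain "utf-16" decoder uses whichever one it finds at the start of the data to pick the byte order.

```python
import codecs

yeru = "\u044b"

# The same 16-bit number, written out in the two possible byte orders.
big_endian = yeru.encode("utf-16-be")
little_endian = yeru.encode("utf-16-le")
print(big_endian, little_endian)

# The Byte Order Mark is the character U+FEFF encoded in each order.
print(codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)

# With a BOM in front, the generic decoder works out the order by itself.
from_be = (codecs.BOM_UTF16_BE + big_endian).decode("utf-16")
from_le = (codecs.BOM_UTF16_LE + little_endian).decode("utf-16")
print(from_be == yeru, from_le == yeru)
```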

UTF-8 uses 8 bits or one byte for characters in the range 0-127, that
is, the equivalent of US-ASCII.  Higher-numbered characters require more
bytes: two bytes for the range 128-2047, three bytes for the rest of the
BMP, and four bytes for characters outside the BMP.
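The length rules are easy to check with Python 3's built-in "utf-8" codec.  The sketch below encodes the code points sitting on either side of each boundary; the helper name `utf8_length` is my own.

```python
def utf8_length(cp):
    """Number of bytes UTF-8 needs for one code point."""
    return len(chr(cp).encode("utf-8"))

# Code points on each side of the boundaries described above.
boundaries = [0, 127, 128, 2047, 2048, 0xFFFF, 0x10000, 0x10FFFF]
lengths = {cp: utf8_length(cp) for cp in boundaries}

for cp, n in lengths.items():
    print(f"U+{cp:04X} -> {n} byte(s)")

# Yeru, at 1099, falls in the two-byte range.
yeru_length = utf8_length(0x44B)
print(yeru_length)
```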

UTF-7 uses 7 bits (actually, one byte with a 0 high bit) for characters
in the US-ASCII range except for +.  The + itself, and any character
outside the US-ASCII range, is written as a sequence of several bytes
introduced by +, and those bytes also all have the high bit clear.  As a
result, it is 7-bit-clean and suitable for transmission through
8-bit-hostile environments, such as our friendly neighborhood conlang
list server.
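Python 3 happens to ship a "utf-7" codec, so the 7-bit-clean property can be checked directly.  This is only a sketch with sample strings of my own; it tests that every output byte has its high bit clear and that decoding gets the original text back.

```python
plain = "conlang".encode("utf-7")     # plain ASCII letters pass through as-is
plus = "+".encode("utf-7")            # the + sign has to be escaped
yeru = "\u044b".encode("utf-7")       # non-ASCII becomes an escape sequence

print(plain, plus, yeru)

# Every byte of the output has a 0 high bit, i.e. is below 128.
seven_bit_clean = all(b < 0x80 for b in plain + plus + yeru)

# Decoding recovers the original characters.
round_trip = yeru.decode("utf-7") == "\u044b" and plus.decode("utf-7") == "+"
print(seven_bit_clean, round_trip)
```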

Then there are the SCSU and BOCU encodings, which are called
"compressions" because they attempt to get an average storage requirement
equivalent to national character sets for messages whose characters all
come from the equivalent Unicode range.  Thus, you can encode Western
European languages at close to 8 bits per character, and Japanese at
close to 16 bits per character, while still having access to the entire
Unicode repertoire when needed.

Finally, note that there is not a one-to-one correspondence between
"character", which is what Unicode encodes, and "glyph", which is what
shows up on your screen.  Different fonts may have very different glyphs
for what is logically the same character.  Some characters are
non-spacing marks which are intended to be superimposed over an adjacent glyph,
and the font may treat the combination as a separate glyph entirely
instead of a composition, etc.
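The character side of this can be poked at from Python 3 with the standard `unicodedata` module; what the font then draws is outside anything code like this can show.  The sketch below builds an accented e in two ways, as one precomposed character and as a base letter followed by a non-spacing mark.  The normalization step at the end is my addition, included only to show that Unicode regards the two spellings as equivalent.

```python
import unicodedata

precomposed = "\u00e9"       # LATIN SMALL LETTER E WITH ACUTE, one character
decomposed = "e\u0301"       # plain e followed by COMBINING ACUTE ACCENT

# Two different character sequences that should normally display the same.
print(len(precomposed), len(decomposed), precomposed == decomposed)

# The accent is a non-spacing mark: category "Mn", non-zero combining class.
mark_category = unicodedata.category("\u0301")
mark_class = unicodedata.combining("\u0301")
print(mark_category, mark_class)

# Normalizing to the composed form (NFC) makes the two spellings compare equal.
composed = unicodedata.normalize("NFC", decomposed)
print(composed == precomposed)
```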

-Mark