Re: OT: TECH: Dumb Unicode question

From:	John Cowan <cowan@...>
Date:	Friday, November 21, 2003, 19:34


Mark J. Reed scripsit:

> > These 2^16 characters were mapped onto ISO 10646 planes 1 through 17.
>
> You mean 2^20, not 2^16.
Yes.
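
For the arithmetic, here is a rough Python sketch, assuming the characters
in question are the ones reachable through UTF-16 surrogate pairs.  The
names are made up for illustration; the ranges and the offset are the
standard UTF-16 ones.

```python
# Surrogate code unit ranges used by UTF-16.
HIGH_SURROGATES = range(0xD800, 0xDC00)  # 1024 leading code units
LOW_SURROGATES = range(0xDC00, 0xE000)   # 1024 trailing code units

# Every (high, low) combination names one supplementary character.
pair_count = len(HIGH_SURROGATES) * len(LOW_SURROGATES)


def decode_pair(high: int, low: int) -> int:
    """Map a surrogate pair to its code point (U+10000 and up)."""
    if high not in HIGH_SURROGATES or low not in LOW_SURROGATES:
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(pair_count == 2 ** 20)                # 1024 * 1024 combinations
print(hex(decode_pair(0xD800, 0xDC00)))     # first supplementary code point
print(hex(decode_pair(0xDBFF, 0xDFFF)))     # last one
```

The pairs cover U+10000 through U+10FFFF, which is 2^20 code points.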

> I realize that modern computers deal most efficiently with 16-
> and 32-bit quantities, but it still seems like there ought to be a
> UTF-24, for external storage if nothing else.  That extra byte per
> character in UTF-32 is never ever needed for anything.
Well, on the DECsystem-10 and -20 computers (which these days exist
pretty much only in emulation), the UTF-9 standard uses 9-bit bytes
packed into the native 36-bit words.  UTF-9 represents all Unicode
characters as one, two, or three bytes, where a set high-order bit
indicates that there are more bytes to come.
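
A minimal Python sketch of that scheme, under some assumptions of mine:
each 9-bit byte carries 8 data bits below the continuation bit, and the
most significant group comes first (the byte order is a guess, not
something stated above).  The function names are hypothetical, and the
packing of four bytes into a 36-bit word is left out.

```python
CONTINUE = 0x100  # high-order bit of a 9-bit byte: more bytes follow


def utf9_encode(code_point: int) -> list[int]:
    """Encode one code point as a list of 9-bit byte values."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    groups = []
    while True:
        groups.append(code_point & 0xFF)  # peel off 8 data bits
        code_point >>= 8
        if code_point == 0:
            break
    groups.reverse()  # assumed order: most significant group first
    # Set the continuation bit on every byte except the last.
    return [g | CONTINUE for g in groups[:-1]] + [groups[-1]]


def utf9_decode(nonets: list[int]) -> list[int]:
    """Decode a sequence of 9-bit byte values into code points."""
    result, current = [], 0
    for nonet in nonets:
        current = (current << 8) | (nonet & 0xFF)
        if not nonet & CONTINUE:  # clear high bit ends the character
            result.append(current)
            current = 0
    return result


print([oct(n) for n in utf9_encode(0x41)])      # one byte
print([oct(n) for n in utf9_encode(0x4E2D)])    # two bytes
print([oct(n) for n in utf9_encode(0x10FFFF)])  # three bytes
```

Since the largest code point, U+10FFFF, fits in 21 bits, three bytes of
8 data bits each are always enough.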

> But I guess the idea is to always use UTF-8 or SCSU for external
> storage, and UTF-16 or UTF-32 for in-memory processing.
No, sometimes UTF-16 wins for external representation, especially in
Chinese, Japanese, and Korean, whose commonly used characters require
3 bytes each in UTF-8 and only two in UTF-16.  There are also two
compression schemes, SCSU and BOCU-1, which attempt to reduce the
asymptotic storage requirements for Unicode down to about 8 bits per
character for alphabetic languages and 16 bits per character for
ideographic ones.
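
The UTF-8 versus UTF-16 size difference is easy to check with Python's
built-in codecs.  This is only an illustration: the sample strings and
the helper name are mine, and "utf-16-be" is used so that no byte order
mark is counted.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


# Hypothetical sample words, chosen only to show the trend.
samples = {
    "English": "Unicode",
    "Chinese": "统一码",
    "Japanese": "ユニコード",
    "Korean": "유니코드",
}

for language, word in samples.items():
    utf8_bytes, utf16_bytes = encoded_sizes(word)
    print(f"{language}: {len(word)} chars, "
          f"UTF-8 {utf8_bytes} bytes, UTF-16 {utf16_bytes} bytes")
```

ASCII text doubles in size under UTF-16, while the CJK samples shrink
by a third compared with UTF-8.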

--
Knowledge studies others / Wisdom is self-known;      John Cowan
Muscle masters brothers / Self-mastery is bone;       jcowan@reutershealth.com
Content need never borrow / Ambition wanders blind;   www.ccil.org/~cowan
Vitality cleaves to the marrow / Leaving death behind.    --Tao 33 (Bynner)

