Re: OT: TECH: Dumb Unicode question
From: John Cowan <cowan@...>
Date: Friday, November 21, 2003, 19:34
Mark J. Reed scripsit:
> > These 2^16 characters were mapped onto ISO 10646 planes 1 through 17.
>
> You mean 2^20, not 2^16.
Yes.
> I realize that modern computers deal most efficiently with 16-
> and 32-bit quantities, but it still seems like there ought to be a
> UTF-24, for external storage if nothing else. That extra byte per
> character in UTF-32 is never ever needed for anything.
Well, on the DECsystem-10 and -20 computers (which these days exist
pretty much only in emulation), the UTF-9 standard uses 9-bit bytes
packed into the native 36-bit words. UTF-9 represents every Unicode
character as one, two, or three bytes, where a set high-order bit
indicates that more bytes follow.
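
For the curious, here is a rough Python sketch of the scheme just
described. It assumes the low 8 bits of each 9-bit byte carry one octet
of the code point, most significant octet first, with leading zero
octets dropped; the byte order and the function names are my own
assumptions for illustration, not something stated above.

```python
def utf9_encode(cp: int) -> list[int]:
    """Encode one code point as a list of 9-bit values (0..0x1FF)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Split into octets, least significant first, until nothing is left.
    octets = []
    while True:
        octets.append(cp & 0xFF)
        cp >>= 8
        if cp == 0:
            break
    octets.reverse()  # assumed order: most significant octet first
    # Set the high-order (ninth) bit on every byte except the last one.
    return [o | 0x100 for o in octets[:-1]] + [octets[-1]]


def utf9_decode(nine_bit_bytes: list[int]) -> int:
    """Decode the 9-bit bytes of a single character back to a code point."""
    cp = 0
    for b in nine_bit_bytes:
        cp = (cp << 8) | (b & 0xFF)  # drop the continuation bit
    return cp


print([hex(b) for b in utf9_encode(0x41)])      # one byte
print([hex(b) for b in utf9_encode(0x4E2D)])    # two bytes
print([hex(b) for b in utf9_encode(0x10FFFF)])  # three bytes
```

Since the largest code point, U+10FFFF, fits in three octets, three
9-bit bytes always suffice under these assumptions.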
> But I guess the idea is to always use UTF-8 or SCSU for external
> storage, and UTF-16 or UTF-32 for in-memory processing.
No, sometimes UTF-16 wins for external representation, especially for
Chinese, Japanese, and Korean text, which requires three bytes per
character in UTF-8 and only two in UTF-16. There are also two compression schemes,
SCSU and BOCU-1, which attempt to reduce the asymptotic storage
requirements for Unicode down to about 8 bits for alphabetic languages
and 16 bits for ideographic ones.
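
The size comparison is easy to check with a few lines of Python. This
is only an illustration: the sample strings and the helper name are
mine, and the byte-order mark is left out by using the explicit
big-endian codecs.

```python
def encoded_sizes(text: str) -> tuple[int, int, int]:
    """Return the byte length of text in UTF-8, UTF-16, and UTF-32."""
    return (
        len(text.encode("utf-8")),
        len(text.encode("utf-16-be")),  # "-be" variants emit no BOM
        len(text.encode("utf-32-be")),
    )


# Three Japanese characters: 3 bytes each in UTF-8, 2 each in UTF-16.
print(encoded_sizes("日本語"))  # (9, 6, 12)

# Plain ASCII goes the other way: 1 byte each in UTF-8, 2 in UTF-16.
print(encoded_sizes("hello"))  # (5, 10, 20)

# The largest code point needs only 21 bits, so the top byte of every
# UTF-32 code unit is always zero.
print((0x10FFFF).bit_length())  # 21
```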
--
Knowledge studies others / Wisdom is self-known; John Cowan
Muscle masters brothers / Self-mastery is bone; jcowan@reutershealth.com
Content need never borrow / Ambition wanders blind; www.ccil.org/~cowan
Vitality cleaves to the marrow / Leaving death behind. --Tao 33 (Bynner)