Re: OT: TECH: Dumb Unicode question

From:John Cowan <cowan@...>
Date:Friday, November 21, 2003, 19:34
Mark J. Reed scripsit:

> > These 2^16 characters were mapped onto ISO 10646 planes 1 through 17.
>
> You mean 2^20, not 2^16.

Yes.

> I realize that modern computers deal most efficiently with 16-
> and 32-bit quantities, but it still seems like there ought to be a
> UTF-24, for external storage if nothing else. That extra byte per
> character in UTF-32 is never ever needed for anything.

Well, on the DECsystem-10 and -20 computers (which these days exist pretty
much only in emulation), the UTF-9 standard uses 9-bit bytes packed into
the native 36-bit words. UTF-9 represents every Unicode character as one,
two, or three bytes, where a set high-order bit indicates that there are
more bytes to come.
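
A minimal sketch of that scheme in Python, for illustration only. It assumes each 9-bit byte carries 8 data bits below the continuation bit and that the most significant 8-bit group comes first; neither detail is stated above, and the names utf9_encode and utf9_decode are made up for this example. Packing the bytes into 36-bit words is left out.

```python
CONTINUE = 0x100  # high-order bit of a 9-bit byte: more bytes follow


def utf9_encode(code_point: int) -> list[int]:
    """Encode one code point as one to three 9-bit bytes (ints 0..0x1FF)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Split the value into 8-bit groups, least significant first.
    groups = []
    while True:
        groups.append(code_point & 0xFF)
        code_point >>= 8
        if code_point == 0:
            break
    groups.reverse()  # assumed order: most significant group first
    # Set the continuation bit on every byte except the last one.
    return [g | CONTINUE for g in groups[:-1]] + [groups[-1]]


def utf9_decode(nonets: list[int]) -> int:
    """Decode the 9-bit bytes of a single character back to its code point."""
    value = 0
    for n in nonets:
        value = (value << 8) | (n & 0xFF)
    return value


if __name__ == "__main__":
    for cp in (0x41, 0x4E2D, 0x1F600):
        encoded = utf9_encode(cp)
        print(f"U+{cp:04X} -> {[oct(n) for n in encoded]}")
        assert utf9_decode(encoded) == cp
```

Under these assumptions three bytes carry 24 data bits, which is more than enough for the 21 bits of the largest code point, U+10FFFF.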

> But I guess the idea is to always use UTF-8 or SCSU for external
> storage, and UTF-16 or UTF-32 for in-memory processing.

No, sometimes UTF-16 wins for external representation, especially in
Chinese, Japanese, and Korean, which require 3 bytes per character in
UTF-8 and only two in UTF-16. There are also two compression schemes,
SCSU and BOCU-1, which attempt to reduce the asymptotic storage
requirements for Unicode down to about 8 bits per character for
alphabetic languages and 16 bits for ideographic ones.

-- 
Knowledge studies others / Wisdom is self-known;      John Cowan
Muscle masters brothers / Self-mastery is bone;       jcowan@reutershealth.com
Content need never borrow / Ambition wanders blind;   www.ccil.org/~cowan
Vitality cleaves to the marrow / Leaving death behind.    --Tao 33 (Bynner)
