Re: OT: TECH: Dumb Unicode question
From: Mark J. Reed <markjreed@...>
Date: Friday, November 21, 2003, 20:09
On Fri, Nov 21, 2003 at 02:34:02PM -0500, John Cowan wrote:
> Well, on the DECsystem-10 and -20 computers (which these days exist
> pretty much only in emulation), the UTF-9 standard uses 9-bit bytes
> packed into the native 36-bit words. UTF-9 represents all Unicode
> characters as one, two, or three bytes, where the high-order-bit set
> indicates that there are more bytes to come.
There really is a UTF-9, eh? I was recently contemplating the creation of
such a beast for use in a ternary system (the -9 would have
referred to trits rather than bits).
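For concreteness, here is a rough sketch of the scheme as John describes it. It rests on assumptions the description doesn't spell out: that each 9-bit byte carries 8 payload bits below the continuation bit, and that the most significant group comes first. The names utf9_encode and utf9_decode are made up for illustration, and each 9-bit byte is modeled as a plain integer in the range 0..511.

```python
def utf9_encode(code_point: int) -> list[int]:
    """Encode one code point as a list of 9-bit values (0..511)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Split into 8-bit payload groups (assumption: 8 payload bits per byte).
    groups = []
    while True:
        groups.append(code_point & 0xFF)
        code_point >>= 8
        if code_point == 0:
            break
    groups.reverse()  # assumption: most significant group first
    # High-order (9th) bit set means "more bytes to come".
    return [g | 0x100 for g in groups[:-1]] + [groups[-1]]


def utf9_decode(nonets: list[int]) -> list[int]:
    """Decode a sequence of 9-bit values back into code points."""
    code_points = []
    value = 0
    for n in nonets:
        value = (value << 8) | (n & 0xFF)
        if not n & 0x100:  # high bit clear: last byte of this character
            code_points.append(value)
            value = 0
    return code_points
```

Under those assumptions, anything up to U+00FF fits in one byte, the rest of the BMP takes two, and everything beyond it takes three, which matches the "one, two, or three bytes" in the description.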
> No, sometimes UTF-16 wins for external representation, especially in
> Chinese, Japanese, and Korean, which require 3 bytes per character in
> UTF-8 and only two in UTF-16. There are also two compression schemes,
> SCSU and BOCU-1, which attempt to reduce the asymptotic storage
> requirements for Unicode down to about 8 bits for alphabetic languages
> and 16 bits for ideographic ones.
And I'd expect Chinese/Japanese/Korean Unicode applications to opt for
SCSU (or BOCU-1, which I'm not familiar with) in lieu of UTF-16, since
they'd then be guaranteed to do at least somewhat better than 2 bytes
per character, possibly much better with good use of the windows.
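The 3-versus-2 figure for UTF-8 and UTF-16 is easy to check with the standard codecs. This is only a quick sketch; the helper name encoded_sizes and the sample strings are mine, and SCSU and BOCU-1 are left out because the Python standard library has no codec for either.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the encoded length in 8-bit bytes for UTF-8 and UTF-16."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-be" avoids the 2-byte byte-order mark that plain
        # "utf-16" would prepend, so only the characters are counted.
        "utf-16": len(text.encode("utf-16-be")),
    }


cjk = "中文日本語한국어"   # 8 characters, all in the BMP
latin = "Unicode"          # 7 ASCII characters

print(len(cjk), encoded_sizes(cjk))
print(len(latin), encoded_sizes(latin))
```

The CJK sample comes out at 3 bytes per character in UTF-8 and 2 in UTF-16, while the ASCII sample goes the other way (1 versus 2).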
-Mark