
Re: OT: TECH: Dumb Unicode question

From: Mark J. Reed <markjreed@...>
Date: Friday, November 21, 2003, 20:09

On Fri, Nov 21, 2003 at 02:34:02PM -0500, John Cowan wrote:
> Well, on the DECsystem-10 and -20 computers (which these days exist
> pretty much only in emulation), the UTF-9 standard uses 9-bit bytes
> packed into the native 36-bit words. UTF-9 represents all Unicode
> characters as one, two, or three bytes, where the high-order-bit set
> indicates that there are more bytes to come.

There really is a UTF-9, eh? I was recently contemplating the creation of such a beast for use in a ternary system (the -9 would have referred to trits rather than bits).
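
For anyone who wants to see the quoted scheme concretely, here is a minimal Python sketch of one plausible reading of it. The assumptions are mine, not John's: each 9-bit byte carries 8 payload bits, the payload bytes are the code point's value most-significant first, and the ninth bit is set on every byte except the last. The names utf9_encode and utf9_decode are made up for illustration.

```python
def utf9_encode(cp: int) -> list[int]:
    """Encode one code point as a list of 9-bit values (0..511).

    Assumption: 8 payload bits per 9-bit byte, most significant
    byte first, ninth bit set when more bytes follow.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    # Peel off 8-bit chunks, least significant first, then reverse.
    chunks = []
    while True:
        chunks.append(cp & 0xFF)
        cp >>= 8
        if cp == 0:
            break
    chunks.reverse()
    # Flag every chunk except the last with the continuation bit.
    return [c | 0x100 for c in chunks[:-1]] + [chunks[-1]]


def utf9_decode(nonets) -> list[int]:
    """Decode a sequence of 9-bit values back into code points."""
    out = []
    cp = 0
    pending = False
    for n in nonets:
        cp = (cp << 8) | (n & 0xFF)
        if n & 0x100:
            pending = True      # more bytes to come for this character
        else:
            out.append(cp)      # last byte of this character
            cp = 0
            pending = False
    if pending:
        raise ValueError("sequence ends in the middle of a character")
    return out
```

Under these assumptions, code points up to U+00FF take one 9-bit byte, the rest of the BMP takes two, and everything above takes three, which matches the "one, two, or three bytes" in the quote.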

> No, sometimes UTF-16 wins for external representation, especially in
> Chinese, Japanese, and Korean, which require 3 bytes per character in
> UTF-8 and only two in UTF-16. There are also two compression schemes,
> SCSU and BOCU-1, which attempt to reduce the asymptotic storage
> requirements for Unicode down to about 8 bits for alphabetic languages
> and 16 bits for ideographic ones.

And I'd expect Chinese/Japanese/Korean Unicode applications to opt for SCSU (or BOCU-1, which I'm not familiar with?) in lieu of UTF-16, since they'd then be guaranteed to do at least somewhat better than 2 bytes per character, possibly much better with good use of the windows.

-Mark
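
The UTF-8 versus UTF-16 sizes quoted above are easy to check with Python's built-in codecs. This is only a rough sketch: encoded_sizes is a made-up helper name, the sample strings are my own, and "utf-16-le" is used so that no byte order mark is counted. SCSU and BOCU-1 are left out because the standard library has no codec for them.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for text.

    'utf-16-le' is used so the 2-byte byte order mark is not counted.
    """
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Illustrative samples only: three BMP characters each.
for label, sample in [("Chinese", "中文字"),
                      ("Japanese", "日本語"),
                      ("Korean", "한국어"),
                      ("ASCII", "abc")]:
    utf8_len, utf16_len = encoded_sizes(sample)
    print(f"{label}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

Each three-character CJK sample comes to 9 bytes in UTF-8 and 6 in UTF-16, while the ASCII sample goes the other way at 3 and 6.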
