Re: Optimum number of symbols

From: John Cowan <jcowan@...>
Date: Wednesday, May 22, 2002, 22:00

And Rosta scripsit:

> Suppose we can recognize the following sorts of basic unit:
>
> A. Minimal independent character. [analogous to words]
> B. Minimal meaningful component of a character. [analogous to morphemes]
> C. Minimal graphical component of a character (e.g. bow, stem,
> x-scender, orientation etc.). [analogous to phoneme]
>
> Am I right that in choosing which of A-C to encode for a script,
> Unicode chooses the optimal balance between tradition and whichever
> gives the smallest number of encoded units?

If under tradition you include traditional character encodings on computers, then a qualified yes. Radical/phonetic encoding of Chinese would produce many fewer encoded units, but it would be a labor like unto moving mountains to do the work, and it would greatly complicate any rendering engines that had to do the work of reassembly into the little squares. Also, I think C is never used; the choice, when there is a choice, is between A and B. A partial exception would be the "pieces of brackets" used to construct large-size brackets for mathematical use: there are special encodings for "top of bracket", "vertical part of bracket" (use N times to get the right size) and "bottom of bracket". But this is not linguistic.
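
As a rough illustration of the bracket pieces mentioned above, the following Python sketch stacks three code points from the Miscellaneous Technical block. The post does not say which code points it has in mind, so the choice of U+23A1, U+23A2 and U+23A3 is an assumption, and the helper name `tall_left_bracket` is invented for this example.

```python
import unicodedata

# Assumed code points for the "pieces of brackets"; the post does not name them.
TOP = "\u23a1"        # LEFT SQUARE BRACKET UPPER CORNER
EXTENSION = "\u23a2"  # LEFT SQUARE BRACKET EXTENSION
BOTTOM = "\u23a3"     # LEFT SQUARE BRACKET LOWER CORNER


def tall_left_bracket(extension_rows):
    """Return the rows of a tall left bracket, top to bottom.

    The extension piece is repeated `extension_rows` times, which is the
    "use N times to get the right size" step described in the post.
    """
    if extension_rows < 0:
        raise ValueError("extension_rows must be non-negative")
    return [TOP] + [EXTENSION] * extension_rows + [BOTTOM]


if __name__ == "__main__":
    for row in tall_left_bracket(3):
        print(row, unicodedata.name(row))
```

Each row is a separate character on a separate line; lining them up into one visually continuous bracket is left to the font and the renderer.
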

> If there were no pressures on 'encoding space', so that the encoding
> reflected purely graphological considerations, which of A-C do
> you think should be encoded? I shall desist from offering my own
> answer, my sagacity in these matters being of lesser mettle. (To
> which asseveration you will of course respond Pish!)

I think the pressure on encoding space is negligible: 65536 characters provide for all of modern use nicely, and 1,114,112 total characters provide for far more than will ever be needed.

Unicode favors style B, but compromises with A where necessary. In particular, it has now been laid down that no more decomposable characters will ever be encoded: even if it is proved that some language absolutely requires Q with circumflex, this will appear in Unicode as two characters, Q and COMBINING CIRCUMFLEX.

On reflection, I am interpreting "meaningful" loosely: thus I take A WITH ACUTE to have two "meaningful" parts, even though the meaning of ACUTE varies from language to language. And though Nordics will tell you that their A WITH RING and O WITH DIAERESIS and what have you have nothing to do with the underlying Latin letters, to this claim we respond Pish!

-- 
John Cowan <jcowan@...>     http://www.reutershealth.com
I amar prestar aen, han mathon ne nen,    http://www.ccil.org/~cowan
han mathon ne chae, a han noston ne 'wilith.  --Galadriel, _LOTR:FOTR_
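
The decomposition point in the post above can be checked with a small Python sketch using the standard `unicodedata` module. This is only an illustration, not part of the original message: the variable names are invented here, and the remark that the 1,114,112 figure equals 17 planes of 65536 code points is a fact about the Unicode code space that the post itself does not spell out.

```python
import unicodedata

# A WITH ACUTE was encoded as a single precomposed character, so
# canonical composition (NFC) folds the two-character sequence into one.
a_acute = unicodedata.normalize("NFC", "A\u0301")

# There is no precomposed Q WITH CIRCUMFLEX, so NFC leaves the sequence
# as two characters: Q followed by the combining circumflex.
q_circumflex = unicodedata.normalize("NFC", "Q\u0302")

# Canonical decomposition (NFD) splits A WITH RING ABOVE into the
# underlying Latin letter plus a combining mark.
a_ring_parts = unicodedata.normalize("NFD", "\u00c5")

# Code space arithmetic: 17 planes of 65536 code points each.
total_code_points = 17 * 65536

if __name__ == "__main__":
    for label, text in [("A + acute", a_acute),
                        ("Q + circumflex", q_circumflex),
                        ("A with ring", a_ring_parts)]:
        print(label, len(text), [unicodedata.name(c) for c in text])
    print(total_code_points)
```

The contrast between the first two results is the "style A versus style B" compromise in miniature: older letters kept their single precomposed code point, while new combinations are spelled as a base letter plus a combining mark.
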