Re: Optimum number of symbols
From: John Cowan <jcowan@...>
Date: Wednesday, May 22, 2002, 22:00
And Rosta scripsit:
> Suppose we can recognize the following sorts of basic unit:
>
> A. Minimal independent character. [analogous to words]
> B. Minimal meaningful component of a character. [analogous to morphemes]
> C. Minimal graphical component of a character (e.g. bow, stem,
> x-scender, orientation etc.). [analogous to phonemes]
>
> Am I right that in choosing which of A-C to encode for a script,
> Unicode chooses the optimal balance between tradition and whichever
> gives the smallest number of encoded units?

If under tradition you include traditional character encodings on
computers, then a qualified yes. Radical/phonetic encoding of Chinese
would produce many fewer encoded units, but it would be a labor like
unto moving mountains to do the work, and it would greatly complicate
any rendering engines that had to do the work of reassembly into the
little squares.
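To make the trade-off concrete, here is a minimal Python sketch. It uses an Ideographic Description Sequence as a stand-in for a component-based encoding. That is an assumption made purely for illustration: such sequences describe a character's structure and are not meant to be rendered as the assembled character. The choice of 好 and its two components is likewise only an example.

```python
import unicodedata

# One precomposed ideograph, as Unicode actually encodes it.
whole = "\u597d"  # 好

# The same character described by components (illustrative only):
# a left-to-right description character followed by 女 and 子.
described = "\u2ff0\u5973\u5b50"  # ⿰女子


def describe(text):
    """Return (code point, name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for label, text in (("whole", whole), ("described", described)):
    print(label, len(text), describe(text))
```

A renderer handed the second form would have to lay the components out inside one square itself, which is the reassembly work referred to above.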
Also, I think C is never used; the choice, when there is a choice, is
between A and B. A partial exception would be the "pieces of brackets"
used to construct large-size brackets for mathematical use: there are
special encodings for "top of bracket", "vertical part of bracket"
(use N times to get the right size) and "bottom of bracket". But this
is not linguistic.
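A small Python sketch of how such pieces stack, assuming the square-bracket pieces at U+23A1 through U+23A3. The helper name `tall_left_bracket` is hypothetical, and real mathematical layout is done by a typesetting engine, not by printing lines of text.

```python
# Bracket pieces, written as code points so the source stays ASCII.
TOP = "\u23a1"     # LEFT SQUARE BRACKET UPPER CORNER
MIDDLE = "\u23a2"  # LEFT SQUARE BRACKET EXTENSION
BOTTOM = "\u23a3"  # LEFT SQUARE BRACKET LOWER CORNER


def tall_left_bracket(extensions):
    """Return the lines of a left bracket with `extensions` middle pieces."""
    if extensions < 0:
        raise ValueError("extensions must be non-negative")
    return [TOP] + [MIDDLE] * extensions + [BOTTOM]


# One top, three extension pieces, one bottom, each on its own line.
print("\n".join(tall_left_bracket(3)))
```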
> If there were no pressures on 'encoding space', so that the encoding
> reflected purely graphological considerations, which of A-C do
> you think should be encoded? I shall desist from offering my own
> answer, my sagacity in these matters being of lesser mettle. (To
> which asseveration you will of course respond Pish!)

I think the pressure on encoding space is negligible: the 65,536 code
points of the Basic Multilingual Plane provide for all of modern use
nicely, and 1,114,112 code points in total provide for far more than
will ever be needed. Unicode favors style B,
but compromises with A where necessary. In particular, it has now been
laid down that no more decomposable characters will ever be encoded: even
if it is proved that some language absolutely requires Q with circumflex,
this will appear in Unicode as two characters, Q and COMBINING
CIRCUMFLEX ACCENT.
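Both points can be checked with a short Python sketch, assuming the standard `unicodedata` module. It first confirms the arithmetic behind the larger figure (17 planes of 65,536 code points), then compares how NFC normalization treats a pair that has a precomposed form with one that does not. The helper `nfc_length` is a hypothetical name used only here.

```python
import sys
import unicodedata

# 17 planes of 65,536 code points each give the total quoted above.
PLANES = 17
PLANE_SIZE = 0x10000
total_code_points = PLANES * PLANE_SIZE
print(total_code_points)                        # 1114112
print(sys.maxunicode + 1 == total_code_points)  # True


def nfc_length(text):
    """Return the number of code points in text after NFC normalization."""
    return len(unicodedata.normalize("NFC", text))


a_acute = "A\u0301"       # A + COMBINING ACUTE ACCENT
q_circumflex = "Q\u0302"  # Q + COMBINING CIRCUMFLEX ACCENT

print(nfc_length(a_acute))       # composes to one code point, U+00C1
print(nfc_length(q_circumflex))  # no precomposed form, so it stays at two
```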
On reflection, I am interpreting "meaningful" loosely: thus I take A WITH
ACUTE to have two "meaningful" parts, even though the meaning of ACUTE
varies from language to language. And though Nordics will tell you that
their A WITH RING and O WITH DIAERESIS and what have you have nothing
to do with the underlying Latin letters, to this claim we respond Pish!
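Unicode's own data takes the same view, as a brief Python sketch shows. It assumes the standard `unicodedata` module; `canonical_parts` is a hypothetical helper name. Each precomposed letter is decomposed canonically (NFD) and the names of the resulting parts are listed.

```python
import unicodedata


def canonical_parts(ch):
    """Return the names of the characters in ch's canonical decomposition."""
    return [unicodedata.name(c) for c in unicodedata.normalize("NFD", ch)]


# A WITH RING, O WITH DIAERESIS, A WITH ACUTE (capital forms).
for ch in ("\u00c5", "\u00d6", "\u00c1"):
    print(unicodedata.name(ch), "->", canonical_parts(ch))
```

Each letter comes apart into its base Latin letter plus one combining mark, whatever the letter's status in a given alphabet.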
--
John Cowan <jcowan@...> http://www.reutershealth.com
I amar prestar aen, han mathon ne nen, http://www.ccil.org/~cowan
han mathon ne chae, a han noston ne 'wilith. --Galadriel, _LOTR:FOTR_