stevo's Unicode hanzi project
Date: Monday, March 10, 2008, 3:06
In a message dated 3/9/2008 15:35:16 PM Central Daylight Time,
> I have a personal descriptive code for each character now (it took me
> a few months) to help me find characters in Unicode. My code can also
> describe characters not in Unicode.
> How does it work?
To illustrate, here's a short list of some characters containing the small
square ('mouth'), represented by the term "c32". (I can't put the hanzi
themselves in an email.)
[Unicode (or "swl") / total strokes / my code]
4e2d  4  $c32a01        =.d43
swl   5  $c32b03        =.e69   =$b03c32
53f2  5  $c32b77        =.e11
675f  7  $d140c32       =.g42   ~f68
77f3  5  (/a10a17c32    =.e35
5449  7  (/a20c9084c32  =.g921
53ef  5  (b00c32        =.e48
518f  7  (b19/b38c32    =.g47   =(b19e934
518b  5  (b19c32        =.e85
The basic idea is that the codes are expressions. An expression is 1) a term
or 2) a joining symbol followed (usually) by two expressions. They are
Terms are letter-number combinations, e.g., c32 (small square, 'mouth'), a00
(horizontal stroke; '1'). The letter in a term indicates how many strokes the
component has, and the number gives the sequential position in the list of all
the codes with that number of strokes. Right now the numbers are simply handed
out in the order in which I create the terms, so they are in no useful order
at all. I intend to reallocate the numbers after I've finished assigning them
all, in order to bring similar characters together in the list.
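
A minimal sketch in Python of reading a term, assuming the letter maps to the stroke count as a=1, b=2, c=3 and so on. That mapping is an inference from the examples in this message (a00 is one stroke, c32 is the three-stroke 'mouth', and the table's =.d43 belongs to a four-stroke character); the function name term_strokes is made up for illustration, and what happens beyond 26 strokes is not covered.

```python
import re

# A term is assumed to be one lowercase letter followed by digits, e.g. "c32".
TERM_RE = re.compile(r"([a-z])(\d+)")


def term_strokes(term: str) -> int:
    """Return the stroke count implied by the term's letter (assumed a=1, b=2, ...)."""
    match = TERM_RE.fullmatch(term)
    if match is None:
        raise ValueError(f"not a term: {term!r}")
    return ord(match.group(1)) - ord("a") + 1


for term in ("a00", "c32", "d140"):
    print(term, term_strokes(term))
```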
The (roughly 22) joining symbols [e.g., "/|$( "] indicate the structure in
dyadic prefix notation. The monadic symbol "." indicates the definition of a
"$" means the components represented by the two terms that follow it cross
"|" means the components represented by the two terms that follow it are
arranged as a left-right group.
"!" is a special case of "|". It is monadic and represents two identical
components next to each other. Three identical components in a row are indicated
by monadic "!3".
"/" means the components represented by the two terms that follow it are
arranged as a top-bottom group.
"_" is a special case of "/". Like "!", it is monadic and represents two
identical components one over the other.
"^" represents three identical components, one over two side by side.
"(" means the components represented by the two terms that follow it are
arranged as an outer-inner group.
"=" indicates alternative expressions for all or (occasionally) part of the
"~" means 'is similar to'.
Any expression that has, e.g., "(b19c32" (indicating component b19
surrounding component c32) can (and should) have that expression replaced by the term
Using my expressions I can find all the Unicode hanzi that have a particular
component, e.g., all the c32's. (There are 929, not counting the 100 different
terms that use c32 as a defining component.)
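
The search described above can be sketched roughly as follows in Python. The expressions come from the table earlier in this message. The definition of e934 as /b38c32 is an inference from the two alternative expressions given for 518f, and the names CODES, DEFS, components and find_with are hypothetical. The idea is to collect every term in an expression and to expand any term that has a known definition, so that c32 is also found when it is hidden inside another term.

```python
import re

TERM_RE = re.compile(r"[a-z]\d+")

# Code point -> expression, taken from the table in this message.
CODES = {
    "4e2d": "$c32a01",
    "53f2": "$c32b77",
    "53ef": "(b00c32",
    "518b": "(b19c32",
    "518f": "(b19e934",
}

# Term -> defining expression. Assumption: e934 stands for /b38c32,
# inferred from "(b19/b38c32 ... =(b19e934" in the table.
DEFS = {"e934": "/b38c32"}


def components(code, defs):
    """Return every term in code, expanding defined terms recursively."""
    found = set()
    stack = TERM_RE.findall(code)
    while stack:
        term = stack.pop()
        if term in found:
            continue
        found.add(term)
        if term in defs:
            stack.extend(TERM_RE.findall(defs[term]))
    return found


def find_with(component, codes, defs):
    """Return the sorted code points whose expression contains component."""
    return sorted(cp for cp, code in codes.items()
                  if component in components(code, defs))


print(find_with("c32", CODES, DEFS))
print(find_with("b38", CODES, DEFS))
```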
Another example: Using A, B, C, D for distinct terms, one can write (AB.
This means that A surrounds B. |AB means that A is on the left and B is on the
right. /AB means that A is above B. $AB means that A crosses B.
|/ABC means the combination /AB, where A is above B, is left of C. |A/BC
means that A is left of the combination /BC, where B is above C. If I assign a
new name, D, to the combination /BC, then |A/BC becomes |AD. D can now be used
as a component in other codes.
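
The prefix notation above can be read by a small recursive parser. This is only a sketch under stated assumptions: it handles the dyadic symbols | / $ ( and the monadic symbols ! _ ^, it does not handle counted forms such as "!3" or the comma and semicolon groupings, and the terms a00, b19, c32 merely stand in for A, B, C. The function names are made up.

```python
import re

DYADIC = set("|/$(")   # joining symbols followed by two expressions
MONADIC = set("!_^")   # repetition symbols followed by one expression
TOKEN_RE = re.compile(r"[a-z]\d+|[|/$(!_^]")


def tokenize(code):
    """Split a code into terms and joining symbols."""
    tokens = TOKEN_RE.findall(code)
    if "".join(tokens) != code:
        raise ValueError(f"unrecognised characters in {code!r}")
    return tokens


def _parse(tokens, pos):
    """Parse one expression starting at pos; return (tree, next position)."""
    if pos >= len(tokens):
        raise ValueError("expression ends too early")
    tok = tokens[pos]
    if tok in DYADIC:
        left, pos = _parse(tokens, pos + 1)
        right, pos = _parse(tokens, pos)
        return (tok, left, right), pos
    if tok in MONADIC:
        operand, pos = _parse(tokens, pos + 1)
        return (tok, operand), pos
    return tok, pos + 1  # a plain term


def parse(code):
    """Parse a whole code into nested tuples."""
    tokens = tokenize(code)
    tree, pos = _parse(tokens, 0)
    if pos != len(tokens):
        raise ValueError("extra material after the expression")
    return tree


print(parse("|/a00b19c32"))  # (A over B) left of C
print(parse("|a00/b19c32"))  # A left of (B over C)
```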
An example of how this works: the character for "tree, wood" is d140. Two
of those next to each other make !d140 = h911. h911 inside c61 gives (c61h911
= k05. k05 surrounding h041 gives (k05h041 = s08.
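
As a check on this example, the stroke counts can be added up in Python. This is a sketch that assumes the letter of a term gives its stroke count (a=1, b=2, ...), that "!" and "_" double a component and "^" triples it, and that each expression is already written out as a nested tuple. The helper names are hypothetical.

```python
REPEAT = {"!": 2, "_": 2, "^": 3}  # monadic symbols and their assumed copy counts


def term_strokes(term):
    """Stroke count implied by the term's letter (assumed a=1, b=2, ...)."""
    return ord(term[0]) - ord("a") + 1


def strokes(tree):
    """Total strokes of an expression given as a term or a nested tuple."""
    if isinstance(tree, str):
        return term_strokes(tree)
    op, *parts = tree
    if op in REPEAT:
        return REPEAT[op] * strokes(parts[0])
    return sum(strokes(part) for part in parts)


def letter_for(count):
    """Letter a term with this many strokes would carry (valid for 1..26)."""
    return chr(ord("a") + count - 1)


steps = [
    ("h911", ("!", "d140")),         # two trees side by side
    ("k05", ("(", "c61", "h911")),   # h911 inside c61
    ("s08", ("(", "k05", "h041")),   # k05 surrounding h041
]
for name, tree in steps:
    count = strokes(tree)
    print(name, count, letter_for(count) == name[0])
```

Under these assumptions each new term's letter agrees with the stroke total of its parts: 8 for h911, 11 for k05 and 19 for s08.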
Every new component name, i.e., term, essentially inherits all of the
attributes of its constituent components.
Some characters are hard to break down into smaller pieces. In the worst
case, I just list the individual strokes that make them up, along with how they
are connected. E.g., ,4a12a07a02a02 =.d000 is the description of the character
'heart'. The initial comma indicates that the next components are close to
one another, but not touching. (Touching is indicated by ";".) The '4' after
the comma means there are four components grouped by the comma: the individual
strokes a12 (short upper-right to lower-left), a07 (down, then right, with a
flip up at the end), then a02 (like a12, but upper-left to lower-right), then
All of the Unicode characters for CJK have codes now. All but about 3800 are
simplified to two components. If a component occurs only as itself and in
one other character, then I do not create a new term for that component; it remains
as an expression of two (or more) terms, and the other character is not
simplified. If at least two characters have a common component not represented by an
existing character, then I create an ad hoc intermediate virtual character
and give that a new term. Then this new term is used to simplify the
expressions for the other characters.
The ranges of characters in my list are
4e00..9fa5 = 20902 characters
f900..fa2d = 302 characters
swl = 361 elements (ad hoc characters)
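
The character counts for the two Unicode ranges follow from the end points, counting both ends. A small Python check (the function name range_size is made up for illustration):

```python
def range_size(first_hex, last_hex):
    """Number of code points from first_hex to last_hex, inclusive."""
    return int(last_hex, 16) - int(first_hex, 16) + 1


unified = range_size("4e00", "9fa5")
compat = range_size("f900", "fa2d")
swl = 361  # ad hoc elements, as stated in the list above

print(unified, compat, unified + compat + swl)
```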