
stevo's Unicode hanzi project

Date: Monday, March 10, 2008, 3:06
In a message dated 3/9/2008 15:35:16 PM Central Daylight Time,
dedalvs@GMAIL.COM writes:

> Stevo:
> <<
> I have a personal descriptive code for each character now (it took me
> a few months) to help me find characters in Unicode. My code can also
> describe characters not in Unicode.
> >>
>
> How does it work?
>
> -David

To illustrate, here's a short list of some characters containing the small square ('mouth'), represented by the term "c32". (I can't put the hanzi themselves in an email.)

[Unicode (or "swl") / total strokes / my code]

4e2d  4  $c32a01 =.d43
swl   5  $c32b03 =.e69 =$b03c32
53f2  5  $c32b77 =.e11
675f  7  $d140c32 =.g42 ~f68
77f3  5  (/a10a17c32 =.e35
5449  7  (/a20c9084c32 =.g921
53ef  5  (b00c32 =.e48
518f  7  (b19/b38c32 =.g47 =(b19e934
518b  5  (b19c32 =.e85

The basic idea is that the codes are expressions. An expression is 1) a term or 2) a joining symbol followed (usually) by two expressions. They are recursive.

Terms are letter-number combinations, e.g., c32 (small square, 'mouth'), a00 (horizontal stroke; '1'). The letter in a term indicates how many strokes the component has, and the number gives the sequential position in the list of all the codes with that number of strokes. Right now the numbers are assigned in chronological order as I assign new ones. As a result, they are in no useful order at all. I intend to reallocate the numbers after I've finished assigning them all, in order to bring similar characters together in the list.

The (roughly 22) joining symbols [e.g., "/|$( "] indicate the structure in dyadic prefix notation.

"." is monadic and indicates the definition of a new code.
"$" means the components represented by the two terms that follow it cross one another.
"|" means the components represented by the two terms that follow it are arranged as a left-right group.
"!" is a special case of "|". It is monadic and represents two identical components next to each other. Three identical components in a row are indicated by monadic "!3".
"/" means the components represented by the two terms that follow it are arranged as a top-bottom group.
"_" is a special case of "/". Like "!", it is monadic and represents two identical components one over the other.
"^" represents three identical components, one over two side by side.
"(" means the components represented by the two terms that follow it are arranged as an outer-inner group.
"=" indicates alternative expressions for all or (occasionally) part of the character.
"~" means 'is similar to'.

Any expression that has, e.g., "(b19c32" (indicating component b19 surrounding component c32) can (and should) have that expression replaced by the term e85.

Using my expressions I can find all the Unicode hanzi that have a particular component, e.g., all the c32's (there are 929, not including all the different terms (100, in fact) that use c32 as a defining component).

Another example: Using A, B, C, D for distinct terms, one can write:

(AB    means that A surrounds B.
|AB    means that A is on the left and B is on the right.
/AB    means that A is above B.
$AB    means that A crosses B.
|/ABC  means the combination /AB, where A is above B, is left of C.
|A/BC  means that A is left of the combination /BC, where B is above C.

If I assign a new name, D, to the combination /BC, then |A/BC becomes |AD. D can now be used as a component in other codes.

An example of how this works: the character for "tree, wood" is d140. Two of those next to each other make !d140 = h911. h911 inside c61 gives (c61h911 = k05. k05 surrounding h041 gives (k05h041 = s08. Every new component name, i.e., term, essentially inherits all of the attributes of its constituent components.

Some characters are hard to break down into smaller pieces. In the worst case, I just list the individual strokes that make them up, along with how they are connected. E.g., ,4a12a07a02a02 =.d000 is the description of the character 'heart'. The initial comma indicates that the next components are close to one another, but not touching. (Touching is indicated by ";".)
The '4' after the comma means there are four components grouped by the comma: the individual strokes a12 (short upper-right to lower-left), a07 (down, then right, with a flip up at the end), then a02 (like a12, but upper-left to lower-right), then another a02.

All of the Unicode characters for CJK have codes now. All but about 3800 are simplified to two components. If a component occurs only as itself and in one other character, then I do not create a new term for that component; it remains as an expression of two (or more) terms, and the other character is not simplified. If at least two characters have a common component not represented by an existing character, then I create an ad hoc intermediate virtual character and give that a new term. Then this new term is used to simplify the expressions for the other characters.

The ranges of characters in my list are:

4e00..9fa5 = 20902 characters
f900..fa2d = 302 characters
swl        = 361 elements (ad hoc characters)

stevo
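
The expression grammar described in this message (a term, or a joining symbol followed by its operands) lends itself to a small recursive parser. The Python sketch below is an illustration only, not stevo's own tooling. It assumes that terms are a lowercase letter followed by digits; that the stroke letter maps a=1, b=2, c=3 and so on (consistent with the examples, e.g. c61 + h911 = k05, but not stated outright); that "!", "_" and "^" take a single operand; and that "," and ";" take a count followed by that many operands. Only the four dyadic symbols named in the message are handled, and "=", "~" and "." are left out. All class and function names are made up for the example.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple

TERM = re.compile(r"[a-z]\d+")       # a letter followed by digits, e.g. c32
DIGITS = re.compile(r"\d+")
DYADIC = {"/", "|", "$", "("}        # two operands follow
REPEAT = {"!": 2, "_": 2, "^": 3}    # one operand, repeated this many times
GROUP = {",", ";"}                   # a count, then that many operands


@dataclass
class Node:
    op: str                                   # "term" or a joining symbol
    parts: List["Node"] = field(default_factory=list)
    name: str = ""                            # set only when op == "term"
    copies: int = 1                           # > 1 only for REPEAT symbols


def _count(code: str, pos: int, default: int) -> Tuple[int, int]:
    """Read an optional number at pos (as in "!3" or ",4")."""
    m = DIGITS.match(code, pos)
    return (int(m.group()), m.end()) if m else (default, pos)


def _parse(code: str, pos: int) -> Tuple[Node, int]:
    if pos >= len(code):
        raise ValueError("unexpected end of code")
    m = TERM.match(code, pos)
    if m:
        return Node("term", name=m.group()), m.end()
    op, pos = code[pos], pos + 1
    copies = 1
    if op in DYADIC:
        arity = 2
    elif op in REPEAT:
        # Assumption: any repeat symbol may carry a count; only "!3" is attested.
        copies, pos = _count(code, pos, REPEAT[op])
        arity = 1
    elif op in GROUP:
        # Assumption: a missing count means two operands.
        arity, pos = _count(code, pos, 2)
    else:
        raise ValueError(f"unknown symbol {op!r}")
    parts = []
    for _ in range(arity):
        part, pos = _parse(code, pos)
        parts.append(part)
    return Node(op, parts, copies=copies), pos


def parse(code: str) -> Node:
    """Parse one prefix-notation expression such as "(b19/b38c32"."""
    node, pos = _parse(code, 0)
    if pos != len(code):
        raise ValueError(f"trailing text at position {pos}")
    return node


def strokes(node: Node) -> int:
    """Total stroke count, assuming the term letter encodes a=1, b=2, ..."""
    if node.op == "term":
        return ord(node.name[0]) - ord("a") + 1
    return node.copies * sum(strokes(p) for p in node.parts)


def terms(node: Node) -> List[str]:
    """All terms in the expression, left to right."""
    if node.op == "term":
        return [node.name]
    return [t for p in node.parts for t in terms(p)]


if __name__ == "__main__":
    for code in ["$c32a01", "(c61h911", "!d140", ",4a12a07a02a02"]:
        tree = parse(code)
        print(code, strokes(tree), terms(tree))
```

Under these assumptions the four sample codes come out at 4, 11, 8 and 4 strokes, which agrees with the letters of the terms they define in the message (d43, k05, h911 and d000). Collecting the terms of every expression in this way is also one possible route to the component search mentioned above, such as finding every character whose expression contains c32.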