
Re: Avoiding near-collisions in vocabulary coinage

From:Eldin Raigmore <eldin_raigmore@...>
Date:Wednesday, August 6, 2008, 18:03
On Tue, 5 Aug 2008 10:56:11 -0400, Jim Henry <jimhenry1973@...>
wrote:
>[snip]
>That sounds useful. Do you have or know of a script
>to apply this wickelphonic-similarity calculation to a set of words?
I wish I did, but I don't. I'd bet there is one somewhere, though; I just don't know how to find it, and haven't gotten around to learning how to make one myself.
>However, the vast majority of gzb roots being only 2-4 phonemes long,
>it might be less useful for gzb than for other languages.
(1) I was talking about "words", not "roots". I hadn't realized you were talking about "roots"; I apologize.

I might use something different for "roots" myself; something like what Veoler and Henrik mentioned. (But I'd go for more redundancy; I'd want 225000 to 625000 "legal" phoneme-sequences of appropriate length, to cover 12000 to 50000 roots (though in fact I doubt I'd ever get around to defining more than 2000 to 5000 roots).)

I'd use slightly longer roots than are absolutely necessary; require that they differ in at least two places instead of just one; and require that at least one of the phonemes differ in at least two characteristics (PoA and MoA, or PoA and voicing, or MoA and voicing) rather than only one.

For instance, if I wanted (C)V(C)(V)(C) roots, I might require that any two roots differ in at least one consonant, and also that they differ in at least one of the first four phonemes. So only one root of, for instance, the form bVdVC would show up.

(2) I don't see how you get by with such short roots. I need roots four to six phonemes long; (C)V((C)V((C)V)) for 1-to-3-syllable roots when the syllable structure is (C)V, (C)V(C)((C)V(C)) for 1-to-2-syllable roots when the syllable structure is (C)V(C), (C)(C)V(V)(C)(C) for 1-syllable roots.

By straining I can get enough 1-to-5-phoneme roots with the (C)V(C)(V)(C) or (C)(V)(C)V(C) structures.

To get enough roots with only up-to-four-phonemes I need to allow all variations; (C)(C)(C)V, (C)(C)V(C), (C)(C)V(V), (C)V(C)(C), (C)V(C)(V), (C)V(V)(C), (C)V(V)(V), V(C)(C)(C), V(C)(C)(V), V(C)(V)(C), V(C)(V)(V), V(V)(C)(C), V(V)(C)(V), V(V)(V)(C), V(V)(V)(V).

Do you have a large phoneme inventory? Or phonemic tone or lexical tone?
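
The root-distinctness rule under (1) (differ in at least two places, with at least one phoneme differing in at least two characteristics) could be checked along the following lines. This is only a rough sketch in Python, not an existing tool: the FEATURES table, the tiny one-letter-per-phoneme inventory, and the function names are all made up for illustration, and treating roots of different lengths as automatically distinct is an assumption the rule itself does not settle.

```python
# Hypothetical feature table (an assumption, not any real conlang's inventory).
# Consonants: (place, manner, voicing). Vowels: (height, backness, rounding).
FEATURES = {
    "p": ("labial", "stop", "voiceless"),
    "b": ("labial", "stop", "voiced"),
    "t": ("alveolar", "stop", "voiceless"),
    "d": ("alveolar", "stop", "voiced"),
    "k": ("velar", "stop", "voiceless"),
    "g": ("velar", "stop", "voiced"),
    "s": ("alveolar", "fricative", "voiceless"),
    "z": ("alveolar", "fricative", "voiced"),
    "a": ("low", "central", "unrounded"),
    "e": ("mid", "front", "unrounded"),
    "i": ("high", "front", "unrounded"),
    "o": ("mid", "back", "rounded"),
    "u": ("high", "back", "rounded"),
}


def feature_distance(p, q):
    """Count the characteristics in which two phonemes differ (0 to 3)."""
    return sum(1 for x, y in zip(FEATURES[p], FEATURES[q]) if x != y)


def sufficiently_distinct(root_a, root_b):
    """True if the roots differ in at least two slots and at least one
    slot differs in at least two characteristics.

    Assumption: each character is one phoneme, and roots of different
    lengths are simply treated as distinct.
    """
    if len(root_a) != len(root_b):
        return True
    distances = [feature_distance(p, q) for p, q in zip(root_a, root_b)]
    differing_slots = sum(1 for d in distances if d > 0)
    return differing_slots >= 2 and max(distances, default=0) >= 2


if __name__ == "__main__":
    for other in ("pad", "pat", "kat", "sad"):
        print("bad", other, sufficiently_distinct("bad", other))
```

With this table, "bad" and "pat" are rejected (two slots differ, but each only in voicing), while "bad" and "kat" pass (b/k differ in both place and voicing, and d/t differ as well).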
>[snip]
>In gzb nearly all morphemes are one syllable, and no syllable
>is more than five phonemes (the average is 3.36 phonemes per
>root morpheme). So for nearly all words, all phonemes
>would be relevant for this kind of similarity calculation. (There's only one
>root in the lexicon with more than 8 phonemes, {θrî'sě'kjurn} "ibis".)
>
>My present technique just looks for words that have similar phonemes
>in each or any slot in the word. So (simplifying, and using Kalusa
>phonology instead of gzb phonology), if I were checking to see if
>a potential word "kalu" were too similar to an existing word, I would run it
>through a script that turns it into the regex
>
>/^[kg][ae][lr][uo]$/
>
>and then searches the lexicon for words matching that regex, which
>would turn up (if they existed) "galu", "gero", "karu", etc., along
>with their glosses, and I could decide if any of them really
>sounded too similar to "kalu" and also had too-similar meanings.
(1) In my conlangs as well morphemes tend to be shorter than 8 phonemes. In those that aren't isolating and analytic, the words tend to be longer than the roots, and the roots tend to be longer than the non-root morphemes; and as I've mentioned, the roots tend to be four to six phonemes long.

The average word (other than pronouns, adpositions, and conjunctions (possibly also omitting adverbs)) is usually at least two morphemes long counting the root; so the average word is probably around two to four syllables long. (In the more highly synthetic conlangs the average word is probably more like four or more morphemes long.)

(2) I think I would want at least one phoneme in each root to differ more than minimally from "the phoneme in the same slot" in the other root. Or, have two (or more) slots that are filled by different phonemes in the different roots. (Or both.)

One problem is, what constitutes "the same slot" when the roots are not the same length?
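
For reference, the regex technique quoted above could be sketched roughly like this in Python. It is not Jim's actual script: the similarity classes beyond the four in his example, the function names, and the placeholder lexicon with its glosses are all assumptions made up for illustration. Note that because the pattern is anchored at both ends, it only ever compares words of equal length, so it sidesteps the "same slot" question rather than answering it.

```python
import re

# Hypothetical similarity classes. "kg", "ae", "lr" and "uo" come from the
# quoted example; "td" and "pb" are added here purely as an assumption.
SIMILARITY_CLASSES = ["kg", "td", "pb", "lr", "ae", "uo"]
CLASS_OF = {ph: cls for cls in SIMILARITY_CLASSES for ph in cls}


def similarity_regex(word):
    """Turn a word into a regex with one character class per slot.

    Assumption: each character of the word is one phoneme. Phonemes with
    no similarity class only match themselves.
    """
    parts = []
    for ph in word:
        cls = CLASS_OF.get(ph)
        parts.append("[" + cls + "]" if cls else re.escape(ph))
    return re.compile("^" + "".join(parts) + "$")


def near_collisions(candidate, lexicon):
    """Return (word, gloss) pairs from the lexicon that match the
    candidate's similarity regex, excluding the candidate itself."""
    pattern = similarity_regex(candidate)
    return [
        (word, gloss)
        for word, gloss in lexicon.items()
        if word != candidate and pattern.match(word)
    ]


if __name__ == "__main__":
    # Placeholder lexicon; the words and glosses are invented.
    lexicon = {
        "galu": "gloss 1",
        "mina": "gloss 2",
        "karu": "gloss 3",
        "gero": "gloss 4",
        "kalut": "gloss 5",
    }
    print(similarity_regex("kalu").pattern)
    for word, gloss in near_collisions("kalu", lexicon):
        print(word, gloss)
```

The list of hits still has to be judged by ear and by meaning; the script only narrows down which existing words are worth looking at.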
>[snip]
>Do you have a script to apply this calculation you describe?
Again, I wish I did but I don't. I'm pretty sure there isn't one and won't be one until I, or someone who's read my posts, gets around to making one. Or, at least, not one that implements the actual count I described before.

There may be one that implements something that is, in effect, the same or a very similar concept. And maybe it looks at, say, the first three phonemes and the last phoneme, or the first two phonemes and the last two phonemes, or some such thing. And maybe it takes into account syllable-stress, and regards the first stressed syllable and the last stressed syllable in case these aren't the first syllable and the last syllable of the word.

Anyway, as described, it works better for words than for roots, at least for my conlangs whose roots usually aren't longer than about six phonemes. I didn't illustrate what happens when one or both of the words is shorter than eight phonemes, but the maximum score, and hence the denominator of the fraction, would be decreased. I didn't say at all how to handle what happens when one or both of the words is shorter than four phonemes; whoever writes the program could think of something.