Re: Avoiding near-collisions in vocabulary coinage
From: Eldin Raigmore <eldin_raigmore@...>
Date: Wednesday, August 6, 2008, 18:03
On Tue, 5 Aug 2008 10:56:11 -0400, Jim Henry <jimhenry1973@...>
wrote:
>[snip]
>That sounds useful. Do you have or know of a script
>to apply this wickelphonic-similarity calculation to a set of words?
I wish I did, but I don't.
I'd bet there is one somewhere, though; I just don't know how to find it, and
haven't gotten around to learning how to make one myself.
>However, the vast majority of gzb roots being only 2-4 phonemes long,
>it might be less useful for gzb than for other languages.
(1) I was talking about "words", not "roots".
I hadn't realized you were talking about "roots"; I apologize.
I might use something different for "roots" myself; something like what Veoler
and Henrik mentioned. (But I'd go for more redundancy; I'd want 225000 to
625000 "legal" phoneme-sequences of appropriate length, to cover 12000 to
50000 roots (though in fact I doubt I'd ever get around to defining more than
2000 to 5000 roots).)
I'd use slightly longer roots than are absolutely necessary; require that they
differ in at least two places instead of just one; and require that at least one
of the phonemes differ in at least two characteristics (PoA and MoA, or PoA
and voicing, or MoA and voicing) rather than only one.
For instance, if I wanted (C)V(C)(V)(C) roots, I might require that any two
roots differ in at least one consonant, and also that they differ in at least one
of the first four phonemes. So only one root of, for instance, the form bVdVC
would show up.
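The stricter rule described above (differ in at least two places, with at least one phoneme differing in at least two characteristics) might be sketched in Python roughly like this. It is only an illustration, not a tested tool: the feature table is a hypothetical toy inventory, vowels are assumed to differ in just one characteristic, the names FEATURES, feature_distance and distinct_enough are made up, and it only compares roots of equal length.

```python
# Hypothetical toy inventory: (place, manner, voicing) for each consonant.
FEATURES = {
    "p": ("labial", "stop", "voiceless"),
    "b": ("labial", "stop", "voiced"),
    "t": ("alveolar", "stop", "voiceless"),
    "d": ("alveolar", "stop", "voiced"),
    "k": ("velar", "stop", "voiceless"),
    "g": ("velar", "stop", "voiced"),
    "s": ("alveolar", "fricative", "voiceless"),
    "z": ("alveolar", "fricative", "voiced"),
    "m": ("labial", "nasal", "voiced"),
    "n": ("alveolar", "nasal", "voiced"),
}


def feature_distance(p, q):
    """Number of characteristics in which two phonemes differ."""
    if p == q:
        return 0
    if p in FEATURES and q in FEATURES:
        return sum(x != y for x, y in zip(FEATURES[p], FEATURES[q]))
    # Assumption: vowels (or anything not in the table) count as
    # differing in one characteristic only.
    return 1


def distinct_enough(a, b):
    """True if a and b differ in at least two slots and at least one
    slot differs in at least two characteristics."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares roots of equal length")
    dists = [feature_distance(p, q) for p, q in zip(a, b)]
    differing_slots = sum(d > 0 for d in dists)
    return differing_slots >= 2 and max(dists, default=0) >= 2
```

With this table, "pata" and "bada" differ in two slots but only in voicing each time, so they would be rejected as too close; "pata" and "zati" would pass.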
(2) I don't see how you get by with such short roots.
I need roots four to six phonemes long; (C)V((C)V((C)V)) for 1-to-3-syllable
roots when the syllable structure is (C)V, (C)V(C)((C)V(C)) for 1-to-2-syllable
roots when the syllable structure is (C)V(C), (C)(C)V(V)(C)(C) for 1-syllable
roots.
By straining I can get enough 1-to-5-phoneme roots with the (C)V(C)(V)(C)
or (C)(V)(C)V(C) structures.
To get enough roots of at most four phonemes I need to allow all
variations: (C)(C)(C)V, (C)(C)V(C), (C)(C)V(V), (C)V(C)(C), (C)V(C)(V),
(C)V(V)(C), (C)V(V)(V), V(C)(C)(C), V(C)(C)(V), V(C)(V)(C), V(C)(V)(V),
V(V)(C)(C), V(V)(C)(V), V(V)(V)(C), V(V)(V)(V).
Do you have a large phoneme inventory?
Or phonemic tone or lexical tone?
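How many phoneme sequences a given template allows depends on the inventory size, and that can be counted mechanically. The following Python sketch is only an illustration: the function names and the template syntax it parses (C, V, and parenthesized optional slots) are assumptions, and it assumes the consonant and vowel inventories do not overlap and ignores any phonotactic restrictions beyond the template.

```python
import re


def skeletons(template):
    """Set of distinct C/V skeletons a template like '(C)V(C)' allows."""
    # Each slot is either optional, e.g. '(C)', or required, e.g. 'V'.
    slots = re.findall(r"\((C|V)\)|(C|V)", template)
    results = {""}
    for optional, required in slots:
        if required:
            results = {s + required for s in results}
        else:
            # An optional slot may be filled or left empty.
            results = results | {s + optional for s in results}
    return results


def count_sequences(template, n_consonants, n_vowels):
    """Number of distinct phoneme strings the template allows."""
    total = 0
    for skel in skeletons(template):
        total += (n_consonants ** skel.count("C")) * (n_vowels ** skel.count("V"))
    return total
```

Collecting skeletons in a set matters because one template can reach the same shape in more than one way; for example (C)V(C)(V)(C) yields CVC whether the third or the fifth slot supplies the final consonant.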
>[snip]
>In gzb nearly all morphemes are one syllable, and no syllable
>is more than five phonemes (the average is 3.36 phonemes per
>root morpheme). So for nearly all words, all phonemes
>would be relevant for this kind of similarity calculation. (There's only one
>root in the lexicon with more than 8 phonemes, {θrî'sě'kjurn} "ibis".)
>
>My present technique just looks for words that have similar phonemes
>in each or any slot in the word. So (simplifying, and using Kalusa
>phonology instead of gzb phonology), if I were checking to see if
>a potential word "kalu" were too similar to an existing word, I would run it
>through a script that turns it into the regex
>
>/^[kg][ae][lr][uo]$/
>
>and then searches the lexicon for words matching that regex, which
>would turn up (if they existed) "galu", "gero", "karu", etc., along
>with their glosses, and I could decide if any of them really
>sounded too similar to "kalu" and also had too-similar meanings.
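The approach quoted above could be sketched in Python along these lines. This is not Jim's actual script: the similarity classes are hypothetical ones chosen to reproduce the quoted "kalu" example, the function names are invented, and one character per phoneme is assumed.

```python
import re

# Hypothetical classes of phonemes considered "similar" to each other.
SIMILAR = ["kg", "ae", "lr", "uo"]


def similarity_regex(word):
    """Build a regex matching words with a similar phoneme in each slot."""
    parts = []
    for ph in word:
        # Use the phoneme's class if it has one, otherwise the phoneme itself.
        cls = next((c for c in SIMILAR if ph in c), ph)
        parts.append("[" + cls + "]" if len(cls) > 1 else re.escape(cls))
    return re.compile("^" + "".join(parts) + "$")


def near_collisions(candidate, lexicon):
    """Existing words that match the candidate's similarity regex."""
    pattern = similarity_regex(candidate)
    return [w for w in lexicon if w != candidate and pattern.match(w)]
```

For "kalu" this builds the pattern ^[kg][ae][lr][uo]$, so a lexicon containing "galu", "gero" and "karu" would return all three for manual review.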
(1)
In my conlangs as well, morphemes tend to be shorter than 8 phonemes. In
those that aren't isolating and analytic, the words tend to be longer than the
roots, and the roots tend to be longer than the non-root morphemes; and as
I've mentioned, the roots tend to be four to six phonemes long. The average
word (other than pronouns, adpositions, and conjunctions (possibly also
omitting adverbs)) is usually at least two morphemes long counting the root;
so the average word is probably around two to four syllables long. (In the
more highly synthetic conlangs the average word is probably more like four or
more morphemes long.)
----
(2)
I think I would want at least one phoneme in each root to differ more than
minimally from "the phoneme in the same slot" in the other root. Or, have two
(or more) slots that are filled by different phonemes in the different roots. (Or
both).
One problem is, what constitutes "the same slot" when the roots are not the
same length?
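One possible way around the "same slot" question, which is a standard technique swapped in here rather than anything proposed in this thread, is plain edit distance (Levenshtein distance): it counts substitutions, insertions and deletions, so roots of different lengths need no fixed slot-to-slot pairing. A minimal Python sketch, assuming one character per phoneme and treating every change as equally costly:

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme strings."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # delete pa
                            curr[j - 1] + 1,      # insert pb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

Requiring a distance of at least 2 between any two roots would then play the role of "differ in at least two places", and the substitution cost could be weighted by phonetic similarity if minimal differences should count for less.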
>[snip]
>Do you have a script to apply this calculation you describe?
Again, I wish I did but I don't.
I'm pretty sure there isn't one and won't be one until I, or someone who's read
my posts, gets around to making one.
Or, at least, not one that implements the actual count I described before.
There may be one that implements something that is, in effect, the same or a
very similar concept. And maybe it looks at, say, the first three phonemes
and the last phoneme, or the first two phonemes and the last two phonemes,
or some such thing. And maybe it takes into account syllable stress, and
looks at the first stressed syllable and the last stressed syllable in case these
aren't the first syllable and the last syllable of the word.
Anyway, as described, it works better for words than for roots, at least for my
conlangs whose roots usually aren't longer than about six phonemes. I didn't
illustrate what happens when one or both of the words is shorter than eight
phonemes, but the maximum score, and hence the denominator of the
fraction, would be decreased. I didn't say at all how to handle the case where
one or both of the words is shorter than four phonemes; whoever writes
the program could think of something.