
Re: Avoiding near-collisions in vocabulary coinage

From: Eldin Raigmore <eldin_raigmore@...>
Date: Monday, August 4, 2008, 23:18

I have four methods, of which I will relate two here.

(See also
< http://conscripts.s4.bizhat.com/conscripts-ftopic450.html >
)

One, which I won't go over, is one of those you mentioned: just searching for
substrings.

The other three all count on being able to quantify "how similar are these two
words?".

One of them, which I also won't go over, wasn't my invention; it was put
forward by someone on fantasyconlangguild (if there still is such a thing; I
can no longer access it).  It requires being able to quantify how similar two
phonemes are.

The other two are based on ideas I've read from professional linguists.

---------------------------

One is counting Wickelphones.  (A Wickelphone is a sequence of three
consecutive phonemes that occurs in a word; also count one for the start of the
word plus its first two phonemes, and one for its last two phonemes plus the
end of the word.  So "haplology" would have wickelphones
{$ha hap apl plo lol olo log ogy gy#} in it.)  Make a fraction whose numerator
is the number of wickelphones that occur in both words, and whose denominator
is the number of wickelphones that occur in at least one of them.  If this is
"1", then every wickelphone in either word is also in the other; the words are
too similar to have both.  If it's "0" (for instance, "a" has only {$a#}, "i"
has only {$i#}, and "o" has only {$o#}; no two of them share even one
wickelphone), then they are considered so dissimilar that they couldn't be
confused with each other.  (I know, that might just be wishful thinking; this
is a heuristic, not something perfect.)

For example, "haplogy" has {$ha hap apl plo log ogy gy#} in it, all of which are
also in "haplology".  9 wickelphones occur in at least one of the two words,
and 7 occur in both, so "haplogy" and "haplology" are 7/9 (77.78%) similar.
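
The Wickelphone count described above could be sketched in Python roughly as
follows.  This is only an illustration, not software the post refers to.  It
assumes each character of a string stands for one phoneme (pass a list of
phoneme strings otherwise), and the function names are made up for the example.
The fraction itself is what is usually called the Jaccard index of the two
sets.

```python
def wickelphones(word):
    """Return the set of wickelphones of a word (a sequence of phonemes)."""
    segs = list(word)  # assumption: one character per phoneme
    if not segs:
        return set()
    # "$" marks the start of the word and "#" the end, as in the post.
    padded = ["$"] + segs + ["#"]
    # Every run of three consecutive symbols, boundary markers included.
    return {"".join(padded[i:i + 3]) for i in range(len(padded) - 2)}


def wickel_similarity(word1, word2):
    """Shared wickelphones divided by wickelphones found in either word."""
    w1, w2 = wickelphones(word1), wickelphones(word2)
    either = w1 | w2
    if not either:
        return 0.0
    return len(w1 & w2) / len(either)


print(sorted(wickelphones("haplogy")))
print(wickel_similarity("haplogy", "haplology"))  # 7/9, about 0.78
print(wickel_similarity("a", "i"))                # 0.0
```

Using sets means a wickelphone that occurs twice in one word is counted only
once, which matches "the number of wickelphones that occur in both words".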

--------------------------------

Another is based on the fact that words with the same sounds in the first four
or last four phonemes of the word are likely to be confused with each other.

The segments (that is, phonemes) closer to the beginnings and closer to the
endings of words make more difference in how similar the words are felt to be
than the segments far from either end.

Those nearer the beginning are somewhat more important than those nearer
the end. (Think of the word as a guy sitting in his bathtub; his head sticks out
at one end and his feet stick out at the other, but his head sticks out further
than his feet.)

A book I've read (whose author I unfortunately can't recall at the moment)
showed research on how likely two words were to be confused with each
other based on the first, first two, first three, or first four phonemes, and
based on the last, last two, last three, or last four phonemes.

Concentrating on the beginning for the moment:
Every segment that's one of the first four segments of both words, makes
them a bit similar; if the segment shows up in the same position in both words
it makes them a bit more similar than if it doesn't.
Every segment that's one of the first three segments of both words, makes
them a bit more similar than if it's the fourth segment in one of them; if the
segment shows up in the same position in both words it makes them a bit more
similar than if it doesn't.
Every segment that's one of the first two segments of both words, makes
them a bit more similar than if it's the third segment in one of them; if the
segment shows up in the same position in both words it makes them a bit more
similar than if it doesn't.
If the two words have the same first segment, that makes them a bit more
similar than if the first segment of one is the second segment of the other.

Concentrating on the end for the moment, we have similar results:
Every segment that's one of the last four segments of both words, makes
them a bit similar; if the segment shows up in the same position in both words
it makes them a bit more similar than if it doesn't.
Every segment that's one of the last three segments of both words, makes
them a bit more similar than if it's the fourth-from-last segment in one of
them; if the segment shows up in the same position in both words it makes
them a bit more similar than if it doesn't.
Every segment that's one of the last two segments of both words, makes
them a bit more similar than if it's the antepenultimate segment in one of
them; if the segment shows up in the same position in both words it makes
them a bit more similar than if it doesn't.
If the two words have the same last segment, that makes them a bit more
similar than if the last segment of one is the penultimate segment of the other.

In all cases, similarity at the beginning is a bit more important than similarity at
the end.

(A fact I shall ignore for the purposes of this post:
If the first syllable of a word is unstressed, its influence is somewhat shared
with the first stressed syllable. If the last syllable of a word is unstressed, its
influence is somewhat shared with the last stressed syllable.)

So, here's my suggestion (and I'm just going to assume that both words have
the first and last syllables stressed):
Give the pair 14 points if they have the same first phoneme.
Give the pair 13 more points if they have the same last phoneme.
Give the pair 12 more points if they have the same second phoneme.
Give the pair 11 more points if they have the same penultimate phoneme.
Give the pair 10 more points if the first word's 2nd phoneme is the 2nd word's
1st phoneme.
Give the pair 10 more points if the 2nd word's 2nd phoneme is the 1st word's
1st phoneme.
Give the pair 9 more points if the first word's penultimate phoneme is the 2nd
word's last phoneme.
Give the pair 9 more points if the 2nd word's penultimate phoneme is the 1st
word's last phoneme.
Give the pair 8 more points if they have the same 3rd phoneme.
Give the pair 7 more points if they have the same antepenultimate phoneme.
Give the pair 6 more points if the 1st word's 3rd phoneme is the 1st or 2nd
phoneme of the 2nd word.
Give the pair 6 more points if the 2nd word's 3rd phoneme is the 1st or 2nd
phoneme of the 1st word.
Give the pair 5 more points if the 1st word's antepenultimate phoneme is the
last or penultimate phoneme of the 2nd word.
Give the pair 5 more points if the 2nd word's antepenultimate phoneme is the
last or penultimate phoneme of the 1st word.
Give the pair 4 more points if they have the same 4th phoneme.
Give the pair 3 more points if they have the same 4th-from-last phoneme.
Give the pair 2 more points if the 1st word's 4th phoneme is the 1st or 2nd or
3rd phoneme of the 2nd word.
Give the pair 2 more points if the 2nd word's 4th phoneme is the 1st or 2nd or
3rd phoneme of the 1st word.
Give the pair 1 more point if the 1st word's 4th-from-last phoneme is the last
or penultimate or antepenultimate phoneme of the 2nd word.
Give the pair 1 more point if the 2nd word's 4th-from-last phoneme is the last
or penultimate or antepenultimate phoneme of the 1st word.

If both words have at least 8 segments, the maximum possible score is 72
(=14+13+12+11+8+7+4+3). The similarity can be expressed as a fraction: the
actual score divided by the maximum possible score.

Obviously if either word is shorter than four segments, the maximum possible
score will be different; and if either is shorter than eight segments, it could
complicate things.

Also, if both words are longer than eight segments, they could get the
maximum score and still not be the same.
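
The point list above could be turned into a small Python function along these
lines.  It is a sketch under assumptions, not a finished tool: each character
is taken to be one phoneme, both words are taken to have stressed first and
last syllables (as the post assumes), positions that a short word doesn't have
simply never match, and the maximum is always taken to be 72 even for short
words.  The names edge_score, edge_similarity and at are invented for the
example.

```python
MAX_SCORE = 72  # 14+13+12+11+8+7+4+3, the maximum stated for 8+ segments


def at(segs, i):
    """Phoneme at index i (negative counts from the end), or None if absent."""
    return segs[i] if -len(segs) <= i < len(segs) else None


def edge_score(word1, word2):
    """Add up the points from the list above for one pair of words."""
    a, b = list(word1), list(word2)  # assumption: one character per phoneme
    score = 0
    # Same phoneme at the same distance from the start / from the end.
    for k, front, back in [(0, 14, 13), (1, 12, 11), (2, 8, 7), (3, 4, 3)]:
        if at(a, k) is not None and at(a, k) == at(b, k):
            score += front
        if at(a, -1 - k) is not None and at(a, -1 - k) == at(b, -1 - k):
            score += back
    # Phoneme k steps in from an edge of one word that sits nearer
    # that same edge in the other word; checked in both directions.
    for k, front, back in [(1, 10, 9), (2, 6, 5), (3, 2, 1)]:
        for x, y in ((a, b), (b, a)):
            nearer_start = [at(y, j) for j in range(k)]
            nearer_end = [at(y, -1 - j) for j in range(k)]
            if at(x, k) is not None and at(x, k) in nearer_start:
                score += front
            if at(x, -1 - k) is not None and at(x, -1 - k) in nearer_end:
                score += back
    return score


def edge_similarity(word1, word2):
    """Score as a fraction of the (assumed fixed) maximum of 72."""
    return edge_score(word1, word2) / MAX_SCORE


print(edge_score("florist", "florits"))  # 66
```

Keeping the points in small tables rather than twenty-two separate tests makes
it easy to try other weights.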

Here are some examples:
"florist" and "florits": 14 + 12 + 9 + 9 + 8 + 7 + 4 + 3 = 66/72
"florist" and "florsit": 14 + 13 + 12 + 8 + 5 + 5 + 4 + 3 = 64/72
"florist" and "floirst": 14 + 13 + 12 + 11 + 8 + 1 + 1 = 60/72
"florist" and "flroist": 14 + 13 + 12 + 11 + 7 + 2 + 2 = 61/72
"florist" and "folrist": 14 + 13 + 11 + 7 + 6 + 6 + 4 + 3 = 64/72
"florist" and "lforist": 13 + 11 + 10 + 10 + 8 + 7 + 4 + 3 = 66/72

----------------------------------------------------------------------------

I try to avoid words that are "too" similar by any of the above criteria.  The
bigger my lexicon gets, the more similar "not too similar" becomes.  I might
start out by saying "any two words have to be less than 50% similar"; then go
on to "less than 2/3 similar", "not more than 90% similar", "not more than 95%
similar", etc.
I don't think I'll ever allow two words that are 100% similar by both criteria.
(Obviously having search-software that can search my lexicon and calculate
these similarities helps a lot; at the moment I don't have that.)
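
The kind of lexicon search mentioned above might look something like this in
Python.  Again this is only a sketch, not an existing tool: find_conflicts and
its parameters are invented names, each character is assumed to be one phoneme,
and the Wickelphone fraction is used as the default measure purely for
illustration.  Any function that takes two words and returns a number between
0 and 1 could be passed as measure instead.

```python
def wickelphones(word):
    """Set of wickelphones, with "$" and "#" marking the word boundaries."""
    segs = list(word)  # assumption: one character per phoneme
    if not segs:
        return set()
    padded = ["$"] + segs + ["#"]
    return {"".join(padded[i:i + 3]) for i in range(len(padded) - 2)}


def wickel_similarity(word1, word2):
    """Shared wickelphones divided by wickelphones found in either word."""
    w1, w2 = wickelphones(word1), wickelphones(word2)
    either = w1 | w2
    return len(w1 & w2) / len(either) if either else 0.0


def find_conflicts(candidate, lexicon, threshold, measure=wickel_similarity):
    """List (word, similarity) for lexicon words at or above the threshold."""
    scored = [(word, measure(candidate, word)) for word in lexicon]
    hits = [(word, s) for word, s in scored if s >= threshold]
    # Most similar first, so the worst clashes are seen at the top.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


lexicon = ["haplology", "florist"]
# With the "less than 50% similar" rule, only "haplology" is flagged (7/9).
print(find_conflicts("haplogy", lexicon, 0.5))
```

Raising the threshold as the lexicon grows (0.5, then 2/3, then 0.9, and so on)
only means changing the third argument.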
