Re: Avoiding near-collisions in vocabulary coinage
From: Jim Henry <jimhenry1973@...>
Date: Wednesday, August 6, 2008, 19:14
On Wed, Aug 6, 2008 at 2:43 PM, Benct Philip Jonsson <bpj@...> wrote:
> I've had thoughts on creating a script for identifying
> minimal pairs in an existing vocabulary but not come up
> with anything better than successively replacing every grapheme
> of every word with \w and compare it with every other word
> in the dictionary. Any ideas?
An adaptation of my findsimilar.pl script, so that it
compares every word of the lexicon with every other
word instead of comparing its command line argument
to every word of the lexicon, would be better than
nothing. It has a couple of flaws, though; it fails
to find minimal pairs where one morpheme is one
phoneme longer or shorter than another (e.g. /ka/
vs /kap/, /an/ vs. /tan/, /pef/ vs. /pwef/, etc.) and it
turns up too many false positives, words of the same
general pattern but where almost every individual
phoneme is different. That's OK when I'm looking
for one word at a time, but would be overwhelming
when comparing every word to every other.
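For the substitution-only case, the wildcard idea quoted above can be
sketched without comparing every word against every other directly.
This is a rough illustration in Python, not findsimilar.pl; the function
name is made up, and it assumes one character per phoneme and that "."
never occurs in the orthography. It shares the first flaw mentioned
above: /ka/ vs. /kap/ is not found.

```python
from collections import defaultdict

def substitution_pairs(words):
    """Return sorted pairs of equal-length words that differ in exactly one position."""
    buckets = defaultdict(set)
    for word in set(words):
        for i in range(len(word)):
            # Replace position i with a wildcard: "kaf" -> ".af", "k.f", "ka."
            key = word[:i] + "." + word[i + 1:]
            buckets[key].add(word)
    pairs = set()
    for group in buckets.values():
        ordered = sorted(group)
        # Every two words sharing a wildcard key form a minimal pair.
        for index, a in enumerate(ordered):
            for b in ordered[index + 1:]:
                pairs.add((a, b))
    return sorted(pairs)

lexicon = ["kaf", "gaf", "kif", "ka", "kap", "tan"]
print(substitution_pairs(lexicon))
# [('gaf', 'kaf'), ('kaf', 'kap'), ('kaf', 'kif')]
```

Bucketing by wildcard key keeps the work roughly proportional to the
total number of characters in the lexicon, rather than to the square
of the number of words.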
A script to do this properly would need to know not
only the orthography of the language involved,
but enough about its phonotactics to identify
slots where an optional phoneme is missing but could be
added, or where a phoneme is present but could be
omitted (to catch those pairs I mentioned above).
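When both words are already in the lexicon, a cruder check that knows
nothing about phonotactics may be enough to catch those pairs: test
whether two words differ by exactly one substituted, inserted, or
deleted phoneme. The sketch below is Python, with made-up function
names, and assumes one character per phoneme; for an orthography with
digraphs you would first split each word into a tuple of phonemes,
which the same slicing handles.

```python
from itertools import combinations

def differs_by_one(a, b):
    """True if a and b differ by one substituted, inserted or deleted phoneme."""
    if a == b:
        return False
    if len(a) > len(b):
        a, b = b, a                      # make a the shorter (or equal-length) word
    if len(b) - len(a) > 1:
        return False
    i = 0
    while i < len(a) and a[i] == b[i]:   # skip the common prefix
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]    # one substitution at position i
    return a[i:] == b[i + 1:]            # b has one extra phoneme at position i

def near_collisions(words):
    """Compare every word with every other and keep the pairs one edit apart."""
    unique = sorted(set(words))
    return [(a, b) for a, b in combinations(unique, 2) if differs_by_one(a, b)]

print(near_collisions(["ka", "kap", "an", "tan", "pef", "pwef"]))
# [('an', 'tan'), ('ka', 'kap'), ('pef', 'pwef')]
```

This treats all phonemes as equally similar, so it says nothing about
which pairs are most confusable.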
Also, instead of generating one regex to find
all similar words, it should probably identify each
slot in the word and generate a regex for each
slot. E.g., for input /kaf/ where the phonotactic rule
is C(S)V(S)(C), you would use regexes like
[kgx]af
k[jrw]af
k[aiu]f
ka[jrw]f
ka[fvp]
I think that would identify all minimal pairs, assuming
/k g x f v p b j r w a i u/ is our phoneme inventory.
For a broader definition of "minimal pair" you would
use
[kgxfvpbjrw]af
k[jrw]af
k[aiu]f
ka[jrw]f
ka[kgxfvpbjrw]
(Also, all those regexes should be wrapped in /^ .... $/, else
you would get false positive substring matches like
/pikokafitex/.)
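The per-slot generation could be sketched roughly as follows, in
Python. The names and the slot representation are made up for
illustration: a word is given as a list of (slot kind, phoneme) pairs
following C(S)V(S)(C), with "" for an empty optional slot, and the
narrow similarity classes are simply copied from the /kaf/ example
above. It assumes one character per phoneme.

```python
import re

# Narrow classes copied from the /kaf/ example; a real table would cover every consonant.
SIMILAR = {"k": "kgx", "f": "fvp"}
# Everything allowed in each slot kind, from the inventory /k g x f v p b j r w a i u/.
SLOT_CLASS = {"S": "jrw", "V": "aiu", "C": "kgxfvpbjrw"}

def slot_regexes(slots, broad=False):
    """Build one anchored regex per slot of a word."""
    patterns = []
    for i, (kind, phoneme) in enumerate(slots):
        if kind == "C" and phoneme and not broad:
            choices = SIMILAR.get(phoneme, phoneme)   # only similar consonants
        else:
            choices = SLOT_CLASS[kind]                # anything the slot allows
        before = "".join(p for _, p in slots[:i])
        after = "".join(p for _, p in slots[i + 1:])
        # ^ and $ prevent substring matches such as /pikokafitex/.
        patterns.append(re.compile("^" + before + "[" + choices + "]" + after + "$"))
    return patterns

def similar_words(slots, lexicon, broad=False):
    """Return lexicon words, other than the input itself, matching any slot regex."""
    word = "".join(p for _, p in slots)
    patterns = slot_regexes(slots, broad)
    return sorted(w for w in set(lexicon)
                  if w != word and any(p.match(w) for p in patterns))

kaf = [("C", "k"), ("S", ""), ("V", "a"), ("S", ""), ("C", "f")]
print([p.pattern for p in slot_regexes(kaf)])
# ['^[kgx]af$', '^k[jrw]af$', '^k[aiu]f$', '^ka[jrw]f$', '^ka[fvp]$']
```

Like the regex lists above, this sketch does not cover the case where
a filled optional slot is emptied (e.g. /ka/ for /kaf/).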
--
Jim Henry
http://www.pobox.com/~jimhenry/