Iterative conlang design with corpus analysis, Or, Build one to throw away
From: Jim Henry <jimhenry1973@...>
Date: Tuesday, April 18, 2006, 19:03
Though I'm still working out the exact phonology and phonotactics of
my new engelang, and parts of the grammar are still uncertain
(VSO and prepositional, mostly isolating; but ergative or active?),
I think I've come up with a good approach to lexicon design that
may not have been tried before, as far as I can recall hearing.
I'd like to sketch it out and see if any of y'all see serious flaws in it.
I'll start out with a quite small lexicon -- on the order of Toki Pona,
plus or minus a couple of dozen morphemes. I'm probably going
to use frequency analyses on my Toki Pona corpus and my gzb
corpus, an already available list of the most common morphemes
in Esperanto, and the list of "semantic primes" from the Wikipedia
article on "Natural semantic metalanguage" as sources for ideas,
along with the first few sections of Rick Harrison's ULD. However,
any concept I can figure out how to represent with a phrase rather
than a root word (compounds won't figure in this language, at least
in the first version), I will. Probably I'll have a considerably more radical
use of opposite-derivation than Esperanto, for instance (except the "mal-"
morpheme will be a preposition rather than an affix).
Then I'll start writing sample sentences as I work out the grammar,
trying to include at least one sample sentence for every word in the
lexicon; and after a bit, translate several short texts, probably coining
some new words as necessary in the process.
When I've got a corpus of at least a thousand words, I'll do a frequency
analysis on it -- not just of the relative frequency of words, but of
two- and three-word sequences. I may also use my experience with the
language so far to rule out some consonant clusters that I thought
feasible at first but that have proven too difficult to pronounce
consistently; I'd then modify the phonology format files and generate
a new list of root words.
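A minimal sketch of such a frequency analysis, in Python for illustration. The function name ngram_counts, the ASCII-only tokenizer and the three-sentence sample text are all assumptions made for the example, not part of any existing script. Sequences are counted within a sentence only, so that a phrase never spans a sentence break.

```python
from collections import Counter
import re

def ngram_counts(text, max_n=3):
    """Count every sequence of 1 to max_n words, sentence by sentence."""
    counts = Counter()
    # Split on sentence-final punctuation so sequences never span sentences.
    for sentence in re.split(r"[.!?]+", text.lower()):
        # Assumption: the orthography uses only ASCII letters and apostrophes.
        words = re.findall(r"[a-z']+", sentence)
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts

# Made-up sample text, standing in for a real corpus file.
sample = "mi moku e kili. sina moku e kili. mi lukin e sina."
counts = ngram_counts(sample)
for phrase, freq in counts.most_common(5):
    print(freq, phrase)
```

Keeping single words and longer sequences in one Counter makes it easy to rank words and phrases against each other, which the relex rules call for.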
Then I'll relex the language, applying the following rules:
1. If a word occurs more commonly than another word in the corpus,
it should be at least as short in the relex.
2. Any sequence of 2+ words that occurs often enough in the corpus
will get its own word in the relex, which will be at least as short as
other words and phrases of equal or lesser frequency.
3. The above rules may need to be bent to allow part-of-speech marking.
For instance, if the 100th most common word or phrase in the corpus
is a noun and the 101st most common is a verb, and I've run out of
monosyllabic noun roots but have some monosyllabic verb roots left,
then a slightly more common word will get a form that's longer than a
less common word. But if this happens a lot I'll adjust the part-of-speech
marking system to allow more roots for the kinds of words that
occur more often.
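One way the three rules might be sketched in Python: rank items by frequency and give each the shortest unused form from the pool for its part of speech. The function assign_forms, the two pools and all the counts are hypothetical, and length is counted in characters purely for simplicity.

```python
def assign_forms(freqs, pos_of, pools):
    """freqs: item -> corpus count; pos_of: item -> part of speech;
    pools: part of speech -> list of unused word forms."""
    # Shortest forms first within each pool (length in characters here).
    queues = {pos: sorted(forms, key=len) for pos, forms in pools.items()}
    relex = {}
    # Most frequent items choose first, so they get the shortest forms.
    for item in sorted(freqs, key=freqs.get, reverse=True):
        queue = queues[pos_of[item]]
        if not queue:
            raise ValueError(f"no forms left for {pos_of[item]}")
        relex[item] = queue.pop(0)
    return relex

# Hypothetical counts and pools; the verb pool has only one short form.
freqs = {"eat": 40, "water": 35, "see": 20, "person": 12, "fruit": 9}
pos_of = {"eat": "verb", "water": "noun", "see": "verb",
          "person": "noun", "fruit": "noun"}
pools = {"noun": ["ta", "kuli", "po"], "verb": ["mi", "sanu"]}
relex = assign_forms(freqs, pos_of, pools)
print(relex)
```

In this made-up data "see" is more common than "person" but ends up with a longer form, because the short verb forms ran out first. That is the kind of bending rule 3 allows; if it showed up often, the pools would need rebalancing.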
Then convert the corpus to the new lexicon, spend some time
familiarizing myself with the relex, and write and translate more
text. When the corpus gets larger and more representative, do
another frequency analysis and relex again.
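Converting the corpus could be sketched as a longest-match substitution, so that a phrase which has earned its own root is replaced as a unit before its component words are looked up. The function convert and the mapping below are hypothetical; the sketch also assumes the relex does not change word order.

```python
def convert(words, relex, max_n=3):
    """Rewrite a list of old words using relex (old word or phrase -> new form)."""
    out, i = [], 0
    while i < len(words):
        # Prefer the longest old phrase starting here that has a new form.
        for n in range(min(max_n, len(words) - i), 0, -1):
            key = " ".join(words[i:i + n])
            if key in relex:
                out.append(relex[key])
                i += n
                break
        else:
            # No entry at all: keep the old word so it can be reviewed by hand.
            out.append(words[i])
            i += 1
    return out

# Hypothetical mapping: the phrase "pona ala" now has its own root.
mapping = {"mi": "na", "pona": "ka", "pona ala": "su"}
converted = convert("mi pona ala".split(), mapping)
print(" ".join(converted))
```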
Obviously I'm not going to make a serious attempt to become fluent
in this language until it has gone through several such relexes
(each probably less drastic than the last).
There are a couple of points I'd like y'all's advice about:
1. How to count length of words? By phonemes, or syllables,
or some weighted count where vowels count more than
continuant consonants which count more than plosives?
2. How to ensure a representative corpus? I could try to duplicate
the Brown Corpus in miniature, i.e. have texts of the same genres
in roughly similar proportions. I also expect
the corpus will get more representative over time as it includes
more connected narratives and articles and the grammar
example sentences are a smaller proportion of it. Maybe I
should simply exclude the latter once I have enough connected text.
3. What should be the criterion for a phrase occurring "often enough"
in the corpus to deserve its own root word? There will probably be
some hapax legomena in the corpus, names of animals and plants
for instance, and I don't think two occurrences is enough for
a phrase to get its own root.
4. I have an old Awk script to find the frequencies of words in an
input file; does anyone know of an existing program (ideally
in Awk or Perl, else C or C++, or a Windows I386 binary) that
will do the same not only for words but for sequences of two or
more words? This wouldn't be hard to write in Perl, and maybe
I should just go ahead and do so to refresh my memory on the
bits of Perl I haven't used recently. [Edited to add: I have
actually started writing it and it sort of works, but I'd still like
to see how other people might have implemented it.]
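For the first question, a weighted count might look like the Python sketch below. The weights (3, 2, 1), the phoneme classes and the one-letter-per-phoneme spelling are all arbitrary assumptions, chosen only to show the shape of such a measure.

```python
# Assumed phoneme classes for a one-letter-per-phoneme romanization.
VOWELS = set("aeiou")
PLOSIVES = set("pbtdkg")

def weighted_length(word):
    """Sum per-phoneme weights: vowel 3, continuant 2, plosive 1."""
    total = 0
    for ch in word:
        if ch in VOWELS:
            total += 3
        elif ch in PLOSIVES:
            total += 1
        else:
            total += 2  # everything else is treated as a continuant
    return total

# "tak" and "sal" both have three phonemes but different weighted lengths.
candidates = ["sal", "tak", "a"]
ranked = sorted(candidates, key=weighted_length)
print([(w, weighted_length(w)) for w in ranked])
```

A function like this could serve as the sort key for candidate root words in place of a plain phoneme or syllable count.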
I'll probably try to write Perl scripts to automatically sort words and phrases
by their frequency in the corpus and assign them to the newly generated
words sorted by length; the tricky bit will be matching up the part of speech
tagging, especially for the common phrases. Maybe I'll have the script
automatically relex the root words, leaving some word-shapes unallocated,
and give me a list of common phrases that it recommends should be replaced
by new roots. The lexicon will be a tab-delimited flat file database
that grows by another column with every relex, various columns containing
the old forms of a word in previous versions.
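The growing flat file could be handled along these lines in Python. The layout assumed here (a header row, a gloss in the first column, and the newest form in the last column) is a guess for the sake of the example, as are the function and column names.

```python
import csv
import io

def add_relex_column(tsv_text, new_forms, column_name):
    """Append one column of new forms; older columns are left untouched."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header + [column_name])
    for row in body:
        current = row[-1]  # assumed: the last column holds the current form
        # Words that were not reassigned simply carry their form forward.
        writer.writerow(row + [new_forms.get(current, current)])
    return out.getvalue()

# Hypothetical two-entry lexicon in which only one word changes.
lexicon = "gloss\tv1\neat\tmoku\nwater\ttelo\n"
updated = add_relex_column(lexicon, {"moku": "mi"}, "v2")
print(updated)
```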