Re: Iterative conlang design with corpus analysis, Or, Build one to throw away
From: Herman Miller <hmiller@...>
Date: Wednesday, April 19, 2006, 1:52
Jim Henry wrote:
> Though I'm still working out the exact phonology and phonotactics of
> my new engelang, and parts of the grammar are still uncertain
> (VSO and prepositional, mostly isolating; but ergative or active?),
> I think I've come up with a good approach to lexicon design that
> may not have been tried before, as far as I can recall hearing.
> I'd like to sketch it out and see if any of y'all see serious flaws in it.
>
> I'll start out with a quite small lexicon -- on the order of Toki Pona,
> plus or minus a couple of dozen morphemes. I'm probably going
> to use frequency analyses on my Toki Pona corpus and my gzb
> corpus, an already available list of the most common morphemes
> in Esperanto, and the list of "semantic primes" from the Wikipedia
> article on "Natural semantic metalanguage" as sources for ideas,
> along with the first few sections of Rick Harrison's ULD.
I looked at the example definitions in the Wikipedia article and I'm
not sure this is a good approach for defining words. Plants are defined
as "living things / these things can't feel something / these things
can't do something", but doesn't it count as "doing something" when a
plant's leaves turn to face the sun? When a Venus flytrap snaps shut?
You could say the definition is alluding to the fact that plants have no
nervous systems, but not every living thing without a nervous system is a plant.
Is the sky really a "place"? Or is it just a name for what we see when
we look away from the ground and nothing nearby is visible above us?
These definitions are more confusing than helpful.
And yet some of the "primitives" can be defined in terms of other words
that would seem to be more useful in definitions. "Die" could be defined
as "stop living"; "stop" seems like a more useful primitive than "die".
"A long time", "a short time", and "for some time" could all be derived
from "duration". "Now" is "this moment" ("this" and "moment" are both
already in the primitives list!). "Here" is "this place". "Other" is "not the same".
If numbers can be reduced to "one" and "two", why not define "two" as
"one more than one"? "There is" is listed as distinct from "have", but
these are related concepts in some languages.
The ULD is always a good source for ideas, but I wouldn't assume that
the most basic words are all in the first few sections.
> However,
> any concept I can figure out how to represent with a phrase rather
> than a root word (compounds won't figure in this language, at least
> in the first version), I will. Probably I'll have a considerably more radical
> use of opposite-derivation than Esperanto, for instance (except the "mal-"
> morpheme will be a preposition rather than an affix).
You could end up with some really long phrases that way. Jeffrey
Henning's Dublex, with a basic vocabulary of 400 roots, had some rather
long compounds. (I'm guessing this was the source of the Kali-sise
vocabulary.) Still, this could be a useful way to organize a vocabulary.
live
stop living = die
cause to stop living = kill
able to cause to stop living = deadly
substance able to cause to stop living = poison
producing substance able to cause to stop living = venomous
living thing with long body producing substance able to cause to stop
    living = venomous snake
living thing with long body colored red and black with yellow stripes
    producing substance able to cause to stop living = coral snake
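
To see how fast such phrases grow, the chain of definitions above can be sketched as a small lookup table that gets expanded back down to root words. This is only an illustration: the `DEFINITIONS` table, the `expand` function, and the choice to define each word in terms of the previous one are my own assumptions, not part of any existing tool.

```python
# Hypothetical mini-lexicon: each derived word is defined by a phrase made of
# simpler words. The entries mirror the chain of definitions above.
DEFINITIONS = {
    "die": "stop living",
    "kill": "cause to die",
    "deadly": "able to kill",
    "poison": "substance deadly",
    "venomous": "producing poison",
    "venomous_snake": "living thing with long body venomous",
}


def expand(word):
    """Recursively replace a word by its definition until only roots remain."""
    if word not in DEFINITIONS:
        return [word]  # a root word: nothing left to expand
    roots = []
    for part in DEFINITIONS[word].split():
        roots.extend(expand(part))
    return roots


# Print every derived word with its fully expanded phrase and its length.
for word in DEFINITIONS:
    phrase = expand(word)
    print(f"{word}: {' '.join(phrase)} ({len(phrase)} words)")
```

With these entries, "kill" expands to four root words and "venomous_snake" to thirteen, which gives a feel for how quickly a phrase-only vocabulary gets unwieldy.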
> Then I'll relex the language, applying the following rules:
>
> 1. If a word occurs more commonly than another word in the corpus,
> it should be at least equally short in the relex.
>
> 2. Any sequence of 2+ words that occurs often enough in the corpus
> will get its own word in the relex which will be at least equally short as
> other words and phrases of equal or lesser frequency.
This could help for some of the simpler concepts, but I think you'll
still end up with some rather long words. With my examples, "stop
living" and "living thing" are likely to occur often enough to have
their own words. Possibly "able to cause" and "cause to stop" would as
well. Since those phrases overlap, you'd then have to look at the different
ways you could reduce "able to cause to stop living". But unless your corpus has lots of
phrases relating to snakes, you might not recognize "living thing with
long body" as something that should be reduced.
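
A rough sketch of the phrase-counting step, assuming the corpus is just a list of whitespace-separated sentences. The function names, the `max_n` and `min_count` thresholds, and the toy corpus are all made up for illustration; a real corpus would need a real tokenizer and a threshold chosen by experiment.

```python
from collections import Counter


def ngram_counts(sentences, max_n=4):
    """Count every word sequence of length 1..max_n within each sentence."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def relex_candidates(counts, min_count=2):
    """Sequences of 2+ words seen at least min_count times, most frequent first."""
    frequent = [g for g, c in counts.items() if len(g) >= 2 and c >= min_count]
    return sorted(frequent, key=lambda g: (-counts[g], -len(g), g))


# Toy corpus (an assumption): the snake phrase appears only once.
corpus = [
    "living thing stop living",
    "cause to stop living",
    "able to cause to stop living",
    "living thing with long body",
]

counts = ngram_counts(corpus)
for gram in relex_candidates(counts):
    print(counts[gram], " ".join(gram))
```

In this toy corpus "stop living" and "living thing" pass the threshold, while "living thing with long" is seen only once and never comes up as a candidate, which is the problem with rare phrases described above.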
> 1. How to count length of words? By phonemes, or syllables,
> or some weighted count where vowels count more than
> continuant consonants which count more than plosives?
Depends on the rhythm of the language; some syllables are longer than
others. You could make it like Japanese and count morae. At the least,
you'd probably want to weight vowels and consonants differently.
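
One possible weighted count, as a sketch only. It assumes one letter per phoneme, and the weights (3 for vowels, 2 for continuants, 1 for everything else) and the sound classes are arbitrary placeholders; in particular, lumping the nasals in with the continuants is a judgment call. The `rule1_violations` helper is likewise hypothetical: it just checks the "more common means at least as short" rule against whatever length measure you settle on.

```python
# Assumed sound classes, one letter per phoneme (placeholders, not a real phonology).
VOWELS = set("aeiou")
CONTINUANTS = set("fvszmnlrwjh")  # nasals included here by assumption


def weighted_length(word, vowel_w=3, continuant_w=2, plosive_w=1):
    """Sum a per-phoneme weight; anything not listed above counts as a plosive."""
    total = 0
    for ch in word:
        if ch in VOWELS:
            total += vowel_w
        elif ch in CONTINUANTS:
            total += continuant_w
        else:
            total += plosive_w
    return total


def rule1_violations(freq, forms):
    """Pairs (a, b) where a is more frequent than b but has the longer form."""
    return [
        (a, b)
        for a in freq
        for b in freq
        if freq[a] > freq[b]
        and weighted_length(forms[a]) > weighted_length(forms[b])
    ]


# Made-up frequencies and word forms, purely for illustration.
freq = {"live": 50, "snake": 2}
forms = {"live": "kala", "snake": "ti"}
print(weighted_length("kala"), weighted_length("ti"))
print(rule1_violations(freq, forms))
```

Here the common word has the heavier form, so the pair is reported as a violation; swapping the two forms clears it. Counting morae instead would mean replacing `weighted_length` with a syllable-based measure.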
> 2. How to ensure a representative corpus? I could try to duplicate
> the Brown Corpus in miniature, i.e. have texts of the same genres
> in roughly similar proportions. I also expect
> the corpus will get more representative over time as it includes
> more connected narratives and articles and the grammar
> example sentences are a smaller proportion of it. Maybe I
> should simply exclude the latter once I have enough connected
> text?
Translate sentences picked from random Wikipedia articles?