
Re: Iterative conlang design with corpus analysis, Or, Build one to throw away

From:Herman Miller <hmiller@...>
Date:Wednesday, April 19, 2006, 1:52
Jim Henry wrote:
> Though I'm still working out the exact phonology and phonotactics of
> my new engelang, and parts of the grammar are still uncertain
> (VSO and prepositional, mostly isolating; but ergative or active?),
> I think I've come up with a good approach to lexicon design that
> may not have been tried before, as far as I can recall hearing.
> I'd like to sketch it out and see if any of y'all see serious flaws in it.
>
> I'll start out with a quite small lexicon -- on the order of Toki Pona,
> plus or minus a couple of dozen morphemes. I'm probably going
> to use frequency analyses on my Toki Pona corpus and my gzb
> corpus, an already available list of the most common morphemes
> in Esperanto, and the list of "semantic primes" from the Wikipedia
> article on "Natural semantic metalanguage" as sources for ideas,
> along with the first few sections of Rick Harrison's ULD.

I looked at the examples of definitions in the Wikipedia article and I'm not sure this is a good approach for defining words. Plants are defined as "living things / these things can't feel something / these things can't do something", but doesn't it count as "doing something" when a plant's leaves turn to face the sun? When a Venus flytrap snaps shut? You could say that's alluding to the fact that plants have no nervous systems, but not every living thing without a nervous system is a plant. Is the sky really a "place"? Or is it just a name for what we see when we look away from the ground and nothing nearby is visible above us? These definitions are more confusing than helpful.

And yet some of the "primitives" can be defined in terms of other words that would seem to be more useful in definitions. "Die" could be defined as "stop living"; "stop" seems like a more useful primitive than "die". "A long time", "a short time", and "for some time" could all be derived from "duration". "Now" is "this moment" (both of which are already in the primitives list!). "Here" is "this place". "Other" is "not the same". If numbers can be reduced to "one" and "two", why not define "two" as "one more than one"? "There is" is listed as distinct from "have", but these are related concepts in some languages.

The ULD is always a good source for ideas, but I wouldn't assume that the most basic words are all in the first few sections.

> However,
> any concept I can figure out how to represent with a phrase rather
> than a root word (compounds won't figure in this language, at least
> in the first version), I will. Probably I'll have a considerably more radical
> use of opposite-derivation than Esperanto, for instance (except the "mal-"
> morpheme will be a preposition rather than an affix).

You could end up with some really long phrases that way. Jeffrey Henning's Dublex had some rather long compounds with a basic vocabulary of 400 roots. (I'm guessing this was the source of the Kali-sise vocabulary.) But this could be a useful way to organize a vocabulary.

live
stop living = die
cause to stop living = kill
able to cause to stop living = deadly
substance able to cause to stop living = poison
producing substance able to cause to stop living = venomous
living thing with long body producing substance able to cause to stop living = venomous snake
living thing with long body colored red and black with yellow stripes producing substance able to cause to stop living = coral snake

> Then I'll relex the language, applying the following rules:
>
> 1. If a word occurs more commonly than another word in the corpus,
> it should be at least equally short in the relex.
>
> 2. Any sequence of 2+ words that occurs often enough in the corpus
> will get its own word in the relex which will be at least equally short as
> other words and phrases of equal or lesser frequency.

This could help for some of the simpler concepts, but I think you'll still end up with some rather long words. With my examples, "stop living" and "living thing" are likely to occur often enough to have their own words. Possibly "able to cause" and "cause to stop" would as well. You'd then have to look at the different ways you could reduce "able to cause to stop living". But unless your corpus has lots of phrases relating to snakes, you might not recognize "living thing with long body" as something that should be reduced.
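
The two relex rules quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Jim's actual procedure: the function names, the `min_phrase_count` threshold standing in for "often enough", and the placeholder alphabet are all hypothetical, and the generated forms ignore phonotactics entirely.

```python
from collections import Counter
from itertools import product

def count_units(sentences, max_n=2, min_phrase_count=2):
    """Count single words, plus word sequences of length 2..max_n that
    occur at least min_phrase_count times (an assumed reading of rule 2)."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    # Keep every single word; keep a phrase only if it is frequent enough.
    return {unit: c for unit, c in counts.items()
            if " " not in unit or c >= min_phrase_count}

def candidate_forms(alphabet):
    """Yield placeholder forms in order of non-decreasing length."""
    length = 1
    while True:
        for letters in product(alphabet, repeat=length):
            yield "".join(letters)
        length += 1

def relex(counts, alphabet="ptkaiu"):
    """Give more frequent units forms that are never longer than those
    of less frequent units (rule 1, extended to phrases by rule 2)."""
    ranked = sorted(counts, key=lambda unit: (-counts[unit], unit))
    return dict(zip(ranked, candidate_forms(alphabet)))

if __name__ == "__main__":
    corpus = ["stop living", "living thing stop living", "living thing"]
    for unit, form in relex(count_units(corpus), alphabet="pa").items():
        print(f"{unit!r} -> {form}")
```

With a toy corpus like this one, "stop living" and "living thing" pass the threshold and get their own forms, while a one-off sequence such as "thing stop" does not. A phrase like "living thing with long body" would only surface if `max_n` were raised and the corpus actually contained it often enough.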

> 1. How to count length of words? By phonemes, or syllables,
> or some weighted count where vowels count more than
> continuant consonants which count more than plosives?

Depends on the rhythm of the language; some syllables are longer than others. You could make it like Japanese and count morae. At least you'd probably want to give a different weight to vowels and consonants.
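
A weighted count of the kind discussed here might look like the sketch below. The phoneme classes, the one-letter-per-phoneme spelling, and the particular weights (3, 2, 1) are assumptions made up for illustration; a mora-based count would need syllable structure, which this does not model.

```python
# Hypothetical inventory: one letter per phoneme.
VOWELS = set("aeiou")
PLOSIVES = set("ptkbdg")

# Assumed weights: vowels > continuant consonants > plosives.
WEIGHTS = {"vowel": 3, "continuant": 2, "plosive": 1}

def weighted_length(word, weights=WEIGHTS):
    """Sum a per-phoneme weight over the word.
    Any consonant that is not a plosive is treated as a continuant."""
    total = 0
    for phoneme in word:
        if phoneme in VOWELS:
            total += weights["vowel"]
        elif phoneme in PLOSIVES:
            total += weights["plosive"]
        else:
            total += weights["continuant"]
    return total

if __name__ == "__main__":
    for word in ["pa", "sa", "pat", "sas"]:
        print(word, weighted_length(word))
```

Under these weights "pat" and "sa" come out equal even though one has three phonemes and the other two, which is the sort of trade-off a plain phoneme count would hide.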

> 2. How to ensure a representative corpus? I could try to duplicate
> the Brown Corpus in miniature, i.e. have texts of the same genres
> in roughly similar proportions. I also expect
> the corpus will get more representative over time as it includes
> more connected narratives and articles and the grammar
> example sentences are a smaller proportion of it. Maybe I
> should simply exclude the latter once I have enough connected
> text?

Translate sentences picked from random Wikipedia articles?