
Re: Iterative conlang design with corpus analysis, Or, Build one to throw away

From: Jim Henry <jimhenry1973@...>
Date: Wednesday, April 19, 2006, 18:05
On 4/18/06, Herman Miller <hmiller@...> wrote:

> Jim Henry wrote:
> > I'll start out with a quite small lexicon -- on the order of Toki Pona,
> > plus or minus a couple of dozen morphemes.  I'm probably going
> > to use frequency analyses on my Toki Pona corpus and my gzb
> > corpus, an already available list of the most common morphemes
> > in Esperanto, and the list of "semantic primes" from the Wikipedia
> > article on "Natural semantic metalanguage" as sources for ideas,
> > along with the first few sections of Rick Harrison's ULD.
>
> I looked at the examples of definitions on the Wikipedia article and I'm
> not sure this is a good approach for defining words. Plants are defined
> as "living things / these things can't feel something / these things
... I agree that this theory may have problems, but it still seems a useful list of fairly basic concepts if taken critically.
> And yet some of the "primitives" can be defined in terms of other words
> that would seem to be more useful in definitions. "Die" could be defined
> as "stop living"; "stop" seems like a more useful primitive than "die".
Agreed. I'm not going to uncritically create a root for every "semantic prime" in the list, but I will make sure that the first version of the language can express all those ideas if not with a root, then with a short phrase.
> The ULD is always a good source for ideas, but I wouldn't assume that
> the most basic words are all in the first few sections.
Good point.
> > However,
> > any concept I can figure out how to represent with a phrase rather
> > than a root word (compounds won't figure in this language, at least
> > in the first version), I will.  Probably I'll have a considerably more radical
> > use of opposite-derivation than Esperanto, for instance (except the "mal-"
> > morpheme will be a preposition rather than an affix).
>
> You could end up with some really long phrases that way. ....
> live
> stop living = die
> cause to stop living = kill
> able to cause to stop living = deadly
> substance able to cause to stop living = poison
> producing substance able to cause to stop living = venomous
>
> living thing with long body producing substance able to cause to stop
> living = venomous snake
Yes, I expect the first version will be quite verbose, much like Toki Pona. But with iterative relexes replacing frequent sequences of morphemes with their own roots, it will get terser as it grows its vocabulary.
> > Then I'll relex the language, applying the following rules:
> >
> > 1. If a word occurs more commonly than another word in the corpus,
> > it should be at least equally short in the relex.
> >
> > 2. Any sequence of 2+ words that occurs often enough in the corpus
> > will get its own word in the relex which will be at least equally short as
> > other words and phrases of equal or lesser frequency.
>
> This could help for some of the simpler concepts, but I think you'll
> still end up with some rather long words. With my examples, "stop
> living" and "living thing" are likely to occur often enough to have
> their own words. Possibly "able to cause" and "cause to stop" would as
> well. You'd then have to look at the different ways you could reduce
> "able to cause to stop living". But unless your corpus has lots of
> phrases relating to snakes, you might not recognize "living thing with
> long body" as something that should be reduced.
Probably so. But remember that this analyze-and-relex process will be iterated several times (assuming I stick with the project long enough). If the first relex creates roots for "die" and "kill", then the second relex might find common two- and three-word phrases involving those roots and replace them with root words for "deadly" or "venom" or whatever.

And I think I've already mentioned that I plan another approach for zoological terms. In the first draft several genera and families will get two-syllable nouns, and species will be denoted by those plus adjectives. In later iterations the more commonly mentioned genera and species will get monosyllabic roots, and eventually the least commonly mentioned may get roots of three or more syllables instead.
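To make those two rules concrete, here's a rough sketch in Python -- just for illustration; the two-word-phrase cutoff, the toy phoneme inventory, and the little sample corpus are all placeholders, not decisions I've made:

from collections import Counter
from itertools import product

def count_units(tokens, phrase_threshold=5):
    """Count single words, plus any two-word phrase that occurs often
    enough to deserve its own root (rule 2).  The threshold is a placeholder."""
    units = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    for pair, n in bigrams.items():
        if n >= phrase_threshold:
            units[" ".join(pair)] = n
    return units

def candidate_forms():
    """Yield word shapes from fewest syllables to most: V, CV, then
    two-syllable combinations, and so on.  Toy phoneme inventory."""
    consonants = "ptkmnsl"
    vowels = "aeiou"
    syllables = list(vowels) + [c + v for c, v in product(consonants, vowels)]
    length = 1
    while True:
        for combo in product(syllables, repeat=length):
            yield "".join(combo)
        length += 1

def relex(units):
    """Rule 1: hand out forms in frequency order, so a more frequent
    unit never gets more syllables than a less frequent one."""
    forms = candidate_forms()
    return {unit: next(forms) for unit, _ in units.most_common()}

corpus = "mi moku e kili mi moku e telo mi moku e kili".split()
print(relex(count_units(corpus, phrase_threshold=2)))

Length there is counted in whole syllables; with a weighted measure like the one discussed below, the candidate forms would just have to be generated in order of weighted length instead.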
> > 2. How to ensure a representative corpus?  I could try to duplicate
> > the Brown Corpus in miniature, i.e. have texts of the same genres
> Translate sentences picked from random Wikipedia articles?
A randomly chosen article is likely to be one auto-generated from census data about a U.S. county or city.  But using Wikipedia as a source for nonfiction texts about a variety of subjects to translate -- that's a good idea.

On 4/18/06, David J. Peterson <dedalvs@...> wrote:
> Sounds like a fantastic project!  Will you be webifying it step-by-
> step?
Probably. The paper about generating redundant vocabulary was the first step.
> 1. How to count length of words?  By phonemes, or syllables,
> or some weighted count where vowels count more than
> continuant consonants which count more than plosives?
> >>
>
> Not by phonemes, but not necessarily by conventional weight
> measurements, I'd recommend.  It, of course, depends a lot on
> your phonology, though.  I believe I read that this is your
.........
> then that'd be your measure.  For values, I'd recommend the
> following:
>
> Onset = 0
> Main V = 1
> Coda C = 1 (modulo your preference for stress rules--see below)
> Onset Cluster C = 0.5 (or less, but not 0)
.......

Yes; I also figure that in general a fricative is likely to be longer than a stop, and nasals longer than liquids -- but, as you point out, it's going to depend on how I actually pronounce the language.  I'll have to figure out the weighting factors for each phoneme or class of phonemes as I go along.
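Something along these lines is what I have in mind, though the position values are just the ones suggested above and the per-phoneme adjustments are guesses I'll certainly revise once the phonology is settled:

POSITION_WEIGHT = {
    "onset": 0.0,          # a single onset consonant
    "onset_cluster": 0.5,  # each additional consonant in an onset cluster
    "nucleus": 1.0,        # the main vowel
    "coda": 1.0,           # a coda consonant (modulo stress rules)
}

# Guessed per-phoneme adjustments, added on top of the position weight:
# fricatives a bit longer than stops, nasals a bit longer than liquids.
PHONEME_ADJUST = {
    "s": 0.25, "f": 0.25,
    "m": 0.15, "n": 0.15,
    "l": 0.05, "r": 0.05,
    # stops and vowels default to 0
}

def weighted_length(syllables):
    """syllables: a list of (onset, nucleus, coda) strings,
    e.g. [("st", "a", "n")] for a hypothetical word "stan"."""
    total = 0.0
    for onset, nucleus, coda in syllables:
        for i, c in enumerate(onset):
            position = "onset" if i == 0 else "onset_cluster"
            total += POSITION_WEIGHT[position] + PHONEME_ADJUST.get(c, 0.0)
        total += POSITION_WEIGHT["nucleus"] + PHONEME_ADJUST.get(nucleus, 0.0)
        for c in coda:
            total += POSITION_WEIGHT["coda"] + PHONEME_ADJUST.get(c, 0.0)
    return total

print(weighted_length([("st", "a", "n")]))   # 0.25 (s) + 0.5 (t) + 1.0 (a) + 1.15 (n)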
> 3. What should be the criterion for a phrase occurring "often enough"
> in the corpus to deserve its own root word?
> >>
>
> I'd suggest that you'd have to see it to know. : \  I don't think you
> can come up with a metric beforehand.  Sounds like it's going to be
Yes, I reckon so. I've been testing out my word and phrase frequency analysis script on my Toki Pona corpus (basically, a bunch of text files into which I've copied and pasted Toki Pona text from web pages and email messages), and have just about decided that too rigid an application of my rule #2 would probably be a bad idea. If I were applying these rules to a relex of Toki Pona, some of the most common nouns in object position would get a separate accusative form, a few would get a dative... That is, the most common two-word phrases are mostly a preposition or other particle plus a common noun root.
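The counting itself is nothing fancy; a rough Python approximation of what the script does (not my actual script, and with a cruder tokenizer than the real thing) looks something like this:

import re
import sys
from collections import Counter

def tokens(text):
    # crude tokenizer -- the real script's rules are likely different
    return re.findall(r"[a-z']+", text.lower())

def phrase_counts(words, min_len=1, max_len=2):
    """Count every run of min_len to max_len consecutive words."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

if __name__ == "__main__":
    text = open(sys.argv[1]).read() if len(sys.argv) > 1 else sys.stdin.read()
    for phrase, n in phrase_counts(tokens(text), 1, 2).most_common():
        print("%d\t%s" % (n, phrase))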
> Anyway, though, sounds like a cool project!  Sounds like something
> I'd like to try too, in fact, if I didn't happen to lack any programming
> skill or knowledge how to use a corpus...
The frequency analysis script is at

http://www.pobox.com/~jimhenry/conlang/frequencies.pl

Default behavior is to count individual words.  You can count two-word phrases with -p, three-word phrases with -p 3, phrases of 1 to 4 words with -r 1-4, etc.  It reads standard input or a list of filenames given after the other arguments.  The output has to be sorted with sort -n.

--
Jim Henry
http://www.pobox.com/~jimhenry/gzb/gzb.htm