Re: Iterative conlang design with corpus analysis, Or, Build one to throw away
From: Jim Henry <jimhenry1973@...>
Date: Wednesday, April 19, 2006, 18:05
On 4/18/06, Herman Miller <hmiller@...> wrote:
> Jim Henry wrote:
> > I'll start out with a quite small lexicon -- on the order of Toki Pona,
> > plus or minus a couple of dozen morphemes. I'm probably going
> > to use frequency analyses on my Toki Pona corpus and my gzb
> > corpus, an already available list of the most common morphemes
> > in Esperanto, and the list of "semantic primes" from the Wikipedia
> > article on "Natural semantic metalanguage" as sources for ideas,
> > along with the first few sections of Rick Harrison's ULD.
>
> I looked at the examples of definitions on the Wikipedia article and I'm
> not sure this is a good approach for defining words. Plants are defined
> as "living things / these things can't feel something / these things
...
I agree that this theory may have problems, but taken critically
it still seems like a useful list of fairly basic concepts.
> And yet some of the "primitives" can be defined in terms of other words
> that would seem to be more useful in definitions. "Die" could be defined
> as "stop living"; "stop" seems like a more useful primitive than "die".
Agreed. I'm not going to uncritically create a root for
every "semantic prime" in the list, but I will make sure
that the first version of the language can express all those
ideas if not with a root, then with a short phrase.
> The ULD is always a good source for ideas, but I wouldn't assume that
> the most basic words are all in the first few sections.
Good point.
> > However,
> > any concept I can figure out how to represent with a phrase rather
> > than a root word (compounds won't figure in this language, at least
> > in the first version), I will. Probably I'll have a considerably more radical
> > use of opposite-derivation than Esperanto, for instance (except the "mal-"
> > morpheme will be a preposition rather than an affix).
>
> You could end up with some really long phrases that way. ....
> live
> stop living = die
> cause to stop living = kill
> able to cause to stop living = deadly
> substance able to cause to stop living = poison
> producing substance able to cause to stop living = venomous
>
> living thing with long body producing substance able to cause to stop
> living = venomous snake
Yes, I expect the first version will be quite verbose, much
like Toki Pona. But with iterative relexes replacing frequent
sequences of morphemes with their own roots, it will get terser
as it grows its vocabulary.
> > Then I'll relex the language, applying the following rules:
> >
> > 1. If a word occurs more commonly than another word in the corpus,
> > it should be at least as short in the relex.
> >
> > 2. Any sequence of 2+ words that occurs often enough in the corpus
> > will get its own word in the relex, which will be at least as short as
> > other words and phrases of equal or lesser frequency.
>
> This could help for some of the simpler concepts, but I think you'll
> still end up with some rather long words. With my examples, "stop
> living" and "living thing" are likely to occur often enough to have
> their own words. Possibly "able to cause" and "cause to stop" would as
> well. You'd then have to look at the different ways you could reduce
> "able to cause to stop living". But unless your corpus has lots of
> phrases relating to snakes, you might not recognize "living thing with
> long body" as something that should be reduced.
Probably so. But remember that this analyze-and-relex cycle will be
iterated several times (assuming I stick with the project long
enough). If the first relex creates roots for "die" and "kill",
then the second relex might find common two- and three-word phrases
involving those roots and replace them with root words
for "deadly" or "venom" or whatever.
And I think I've already mentioned that I plan another approach
for zoological terms. In the first draft several genera and
families will get two-syllable nouns, and species will be
denoted by those plus adjectives. In later iterations, the
more commonly mentioned genera and species
will get monosyllabic roots, and eventually the least
commonly mentioned may get roots of three or more syllables instead.
> > 2. How to ensure a representative corpus? I could try to duplicate
> > the Brown Corpus in miniature, i.e. have texts of the same genres ...
> Translate sentences picked from random Wikipedia articles?
A randomly chosen article is likely to be one auto-generated
from census data about a U.S. county or city. But using Wikipedia
as a source for nonfiction texts about a variety of
subjects to translate -- that's a good idea.
On 4/18/06, David J. Peterson <dedalvs@...> wrote:
> Sounds like a fantastic project! Will you be webifying it step-by-
> step?
Probably. The paper about generating redundant
vocabulary was the first step.
> 1. How to count length of words? By phonemes, or syllables,
> or some weighted count where vowels count more than
> continuant consonants which count more than plosives?
>
> Not by phonemes, but not necessarily by conventional weight
> measurements, I'd recommend. It, of course, depends a lot on
> your phonology, though. I believe I read that this is your
.........
> then that'd be your measure. For values, I'd recommend the
> following:
>
> Onset = 0
> Main V = 1
> Coda C = 1 (modulo your preference for stress rules--see below)
> Onset Cluster C = 0.5 (or less, but not 0)
.......
Yes; I also figure that in general fricatives are likely to be longer
than stops, and nasals longer than liquids -- but, as
you point out, it's going to depend on how I actually
pronounce the language. I'll have to figure out the
weighting factors for each phoneme or class
of phonemes as I go along.
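Something like this, say, with David's segment weights and invented
class factors standing in for the numbers I'd eventually settle on
(a Python sketch; the cluster handling is just one reading of the
scheme above):

    # Per-class factors are placeholders, not measurements.
    CLASS_FACTOR = {"stop": 0.8, "fricative": 1.0,
                    "nasal": 1.0, "liquid": 0.9}

    def weighted_length(syllables, phoneme_class):
        # syllables: (onset, nucleus, coda) strings for one word;
        # phoneme_class maps a consonant to a class name above.
        total = 0.0
        for onset, nucleus, coda in syllables:
            # Onset = 0; each extra C in a cluster costs 0.5.
            for c in onset[1:]:
                total += 0.5 * CLASS_FACTOR[phoneme_class(c)]
            total += 1.0 * len(nucleus)    # Main V = 1
            for c in coda:                 # Coda C = 1
                total += 1.0 * CLASS_FACTOR[phoneme_class(c)]
        return total

    def pclass(c):
        # Toy classifier for the example; a real one would cover
        # the whole phoneme inventory.
        return {"t": "stop", "s": "fricative",
                "m": "nasal", "l": "liquid"}[c]

    # "stam": 0.5 * 0.8 (cluster t) + 1.0 (vowel) + 1.0 (coda m) = 2.4
    length = weighted_length([("st", "a", "m")], pclass)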
> 3. What should be the criterion for a phrase occurring "often enough"
> in the corpus to deserve its own root word?
>
> I'd suggest that you'd have to see it to know. : \ I don't think you
> can come up with a metric beforehand. Sounds like it's going to be ...
Yes, I reckon so. I've been testing out my word and
phrase frequency analysis script on my Toki Pona corpus
(basically, a bunch of text files into which I've copied and
pasted Toki Pona text from web pages and email messages),
and have just about decided that too rigid an application of my
rule #2 would probably be a bad idea. If I were applying
these rules to a relex of Toki Pona, some of the most
common nouns in object position would get a
separate accusative form, and a few would get a dative...
That is, the most common two-word phrases are mostly a
preposition or other particle plus a common noun root.
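So a rigid version of rule #2 would need a guard of some kind; one
simple possibility is a stoplist of particles when deciding which
phrases deserve their own roots. A rough Python sketch, using Toki
Pona's particles for the stoplist:

    # Skip candidate phrases that start or end with a grammatical
    # particle, so particle + noun pairs like "e kili" don't get
    # promoted into case forms.
    PARTICLES = {"li", "e", "la", "pi", "o"}

    def promotable(phrase):
        return (phrase[0] not in PARTICLES
                and phrase[-1] not in PARTICLES)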
> Anyway, though, sounds like a cool project! Sounds like something
> I'd like to try too, in fact, if I didn't happen to lack any programming
> skill or knowledge of how to use a corpus...
The frequency analysis script is at
http://www.pobox.com/~jimhenry/conlang/frequencies.pl
Default behavior is to count individual words. You can count
two-word phrases with -p, three-word phrases with
-p 3; phrases of 1 to 4 words with -r 1-4, etc.
It reads standard input or a list of filenames given
after the other arguments. The output has to be
sorted with sort -n.
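Typical invocations might look something like this (the corpus
path, and running the script via perl, are just examples):

    perl frequencies.pl corpus/*.txt | sort -n          # single words
    cat corpus/*.txt | perl frequencies.pl -p | sort -n # two-word phrases
    perl frequencies.pl -p 3 corpus/*.txt | sort -n     # three-word phrases
    perl frequencies.pl -r 1-4 corpus/*.txt | sort -n   # 1- to 4-word phrases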
--
Jim Henry
http://www.pobox.com/~jimhenry/gzb/gzb.htm