
Re: Efficiency/Spatial Compactness

From:Jim Henry <jimhenry1973@...>
Date:Saturday, July 21, 2007, 18:25
On 7/21/07, MorphemeAddict@wmconnect.com <MorphemeAddict@...> wrote:
> In a message dated 7/19/2007 3:26:43 PM Central Daylight Time,
> joerg_rhiemeier@WEB.DE writes:
> > The basic idea, as far as I understand it, is to build a language,
> > translate texts into it, measure the token frequencies of the
> > morphemes, and then relex it using the shortest morphs for the
> > most frequent morphemes.
> I started a project like this just a few days ago. The original language is
> Esperanto (translated from German). The text is "La Karavano" by Wilhelm
> Hauff (found at the Gutenberg Project), over 35,000 words long.
> I'm in the process of splitting all the words into their morphemes right now.
> Then I'll make a frequency list of the morphemes, and, finally, I'll assign
> the Esperanto morphemes to new ones by their frequency (and probably morpheme
> type, too).
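The procedure quoted above (count morpheme tokens, then give the shortest new morphs to the most frequent morphemes) could be sketched in Python roughly as follows. This is only an illustration, not the Perl script mentioned in this thread. It assumes the text has already been split into morphemes by hand, and the function names and the tiny consonant/vowel inventory are made up for the example. It also ignores morpheme type, which a real relex would probably want to take into account.

```python
from collections import Counter
from itertools import product


def morph_candidates(consonants="ptkmnsl", vowels="aiu"):
    """Yield candidate morphs from shortest to longest: V, then CV, then CVC.

    The phoneme inventory and syllable shapes are hypothetical; with these
    defaults there are only 3 + 21 + 147 = 171 candidates.
    """
    yield from vowels
    for c, v in product(consonants, vowels):
        yield c + v
    for c1, v, c2 in product(consonants, vowels, consonants):
        yield c1 + v + c2


def relex(morpheme_tokens):
    """Map each old morpheme to a new morph, most frequent morpheme first."""
    counts = Counter(morpheme_tokens)
    # most_common() sorts by descending token frequency.
    ranked = [morpheme for morpheme, _ in counts.most_common()]
    # zip() stops silently if the candidates run out before the morphemes do.
    return dict(zip(ranked, morph_candidates()))


# Assumed input: a text already segmented into Esperanto morphemes.
sample = ["la", "karavan", "o", "ir", "is", "la", "dezert", "o", "n", "la", "o"]
for old, new in relex(sample).items():
    print(old, "->", new)
```

With a 35,000-word text the candidate generator would need a larger inventory or longer shapes, since every morpheme beyond the 171st would otherwise be left unassigned.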
Cool. What program are you using to measure the frequency of tokens? Does it measure frequency of phrases as well? You can get such a script (in Perl) from my site:

http://www.pobox.com/~jimhenry/conlang/frequencies.pl

(I have a newer, better version than what is on my website, but I can't FTP-upload it from the hospital wireless network. I'll do that sometime after I get out. Meanwhile I could email it to you if you want it.)

If you have something that will measure the frequency of wildcard phrases (e.g. how often two words occur with any word between them, or with any two words, or...) let me know.

This thread has inspired me to start work again on säb zjed'a. Maybe in another few months I can report results from the first frequency analysis and relex of the corpus. (It's only about 650 words so far, all isolated sentences with no continuous text; I want it over 2000 or 3000 words before I do the first relex.)

-- 
Jim Henry
http://www.pobox.com/~jimhenry
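As a rough illustration of the "wildcard phrase" count asked about above (how often two words occur with any one word, or any two words, between them), here is a minimal Python sketch. It is not taken from frequencies.pl; the function name `gapped_pairs` and the `max_gap` parameter are assumptions made for the example, and tokenization is reduced to splitting on whitespace.

```python
from collections import Counter


def gapped_pairs(words, max_gap=2):
    """Count (first, gap, second) triples.

    A triple (a, g, b) means word a was followed by word b with exactly
    g words of any kind between them, for g from 1 up to max_gap.
    """
    counts = Counter()
    for i, first in enumerate(words):
        for gap in range(1, max_gap + 1):
            j = i + gap + 1  # index of the word after the wildcard slot(s)
            if j < len(words):
                counts[(first, gap, words[j])] += 1
    return counts


# Assumed input: a whitespace-tokenized sentence.
words = "la viro kaj la virino kaj la infano".split()
for (first, gap, second), n in gapped_pairs(words).most_common(3):
    print(first, "+", gap, "wildcard(s) +", second, ":", n)
```

Keeping the gap size in the key lets the one-word and two-word cases be reported separately; summing over the gap would give a single "anything in between" count instead.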

