Re: Efficiency/Spatial Compactness
From: Jim Henry <jimhenry1973@...>
Date: Saturday, July 21, 2007, 18:25
On 7/21/07, MorphemeAddict@wmconnect.com <MorphemeAddict@...> wrote:
> In a message dated 7/19/2007 3:26:43 PM Central Daylight Time,
> joerg_rhiemeier@WEB.DE writes:
> > The basic idea, as far as I understand it, is to build a language,
> > translate texts into it, measure the token frequencies of the
> > morphemes, and then relex it, using the shortest morphs for the
> > most frequent morphemes.
> I started a project like this just a few days ago. The original language is
> Esperanto (translated from German). The text is "La Karavano" by Wilhelm
> Hauff (found at the Gutenberg Project), over 35,000 words long.
> I'm in the process of splitting all the words into their morphemes right now.
> Then I'll make a frequency list of the morphemes, and, finally, I'll assign
> the Esperanto morphemes to new ones by their frequency (and probably morpheme
> type, too).
Cool. What program are you using to measure the frequency of
tokens? Does it measure frequency of phrases as well?
You can get such a script (in Perl) from my site:
http://www.pobox.com/~jimhenry/conlang/frequencies.pl
(I have a newer, better version than what is on my website,
but I can't FTP-upload it from the hospital wireless network.
I'll do that sometime after I get out. Meanwhile I could email
it to you if you want it.)
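The core of such a script is only a few lines, for what it's worth.
Here's a bare-bones sketch of the idea -- not the actual
frequencies.pl, and the tokenizing regex is just a rough guess at
something reasonable:

#!/usr/bin/perl
# Bare-bones word-frequency counter (illustration only).
use strict;
use warnings;

my %count;
while (my $line = <>) {
    # lowercase, then split on anything that isn't a letter or apostrophe
    for my $token (split /[^[:alpha:]']+/, lc $line) {
        $count{$token}++ if length $token;
    }
}

# most frequent tokens first
for my $token (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
    printf "%6d  %s\n", $count{$token}, $token;
}

You'd run it as something like "perl wordfreq.pl lakaravano.txt"
(those file names are made up, of course).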
If you have something that will measure the frequency of
wildcard phrases (e.g. how often two words occur with
any word between them, or with any two words, or...),
let me know.
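For the simplest case -- exactly one arbitrary word between two fixed
words -- a naive sketch might look like this (it slurps the whole
text into memory, so it's only a starting point, not a real tool):

#!/usr/bin/perl
# Sketch: count how often each pair of words occurs with exactly one
# arbitrary word between them (a "w1 _ w3" pattern). Illustration only.
use strict;
use warnings;

local $/;                       # slurp the whole input at once
my @words = grep { length } split /[^[:alpha:]']+/, lc <>;

my %gapped;
for my $i (0 .. $#words - 2) {
    $gapped{"$words[$i] _ $words[$i+2]"}++;
}

for my $pair (sort { $gapped{$b} <=> $gapped{$a} || $a cmp $b } keys %gapped) {
    printf "%6d  %s\n", $gapped{$pair}, $pair;
}

Longer gaps, or two-word wildcards, would just mean more loops over
the possible offsets.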
This thread has inspired me to start work again on säb zjed'a.
Maybe in another few months I can report results from the
first frequency analysis and relex of the corpus. (It's only about
650 words so far, all isolated sentences with no continuous
text; I want it over 2000 or 3000 words before I do the first
relex.)
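The relex step itself -- handing the shortest new forms to the most
frequent morphemes, as Joerg described -- might look something like
this sketch. The frequencies and the syllable inventory below are
made up purely for illustration; a real form generator would follow
the language's own phonotactics:

#!/usr/bin/perl
# Sketch of the relex step: most frequent morphemes get the shortest forms.
use strict;
use warnings;

my %freq = (ACC => 812, PAST => 655, go => 97, caravan => 41);  # toy numbers

# candidate forms: all one-syllable CV forms, then all two-syllable forms
my @cons = qw(p t k s m n);
my @vow  = qw(a i u);
my @syll = map { my $c = $_; map { "$c$_" } @vow } @cons;
my @forms = (@syll, map { my $s = $_; map { "$s$_" } @syll } @syll);

my %new;
my @by_freq = sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq;
$new{$_} = shift @forms for @by_freq;

printf "%-10s %6d  ->  %s\n", $_, $freq{$_}, $new{$_} for @by_freq;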
--
Jim Henry
http://www.pobox.com/~jimhenry