Re: Unsupervised learning of natural languages
From: tomhchappell <tomhchappell@...>
Date: Thursday, November 3, 2005, 19:27
--- In conlang@yahoogroups.com, Sanghyeon Seo <sanxiyn@G...> wrote:
>
> I thought people on this list may be interested in the following
> paper:
>
> http://www.pnas.org/cgi/content/short/102/33/11629
> http://www.cs.tau.ac.il/~ruppin/pnas_adios.pdf
>
> Unsupervised learning of natural languages
> Zach Solan, David Horn, Eytan Ruppin, and Shimon Edelman
>
> This inducts grammar rule from raw data (unsegmented writing,
> continuous speech, etc.), and is also generative and predictive. The
> algorithm is also believed to be linear, thus computationally
> feasible.
>
> Applying this to your conlang and generating few sentences may be an
> interesting experience... If someone can implement this.
>
> Seo Sanghyeon
>
Thank you.
I read the paper and liked it a lot, which is not to say I have
completely understood it yet.
They made concrete, and filled in the details of, an idea that hazily
occurred to me soon after I started thinking of this topic.
The only other thing I could wish for is a system that takes
unsegmented input. In all of the computerized applications I have seen
so far, including theirs, the input to the computer is already
segmented. With any natural language, whether signed or spoken, the
learner must somehow figure out for himself or herself what the
segments are and what the words are.
I have yet to see how a computer might do that.
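One well-known idea from the statistical-learning literature (not the
method of the paper above) is to guess a word boundary wherever the
transitional probability from one unit to the next drops. The sketch
below is only a toy under strong assumptions: the stream is already cut
into syllable-sized units, which is itself a simplification, and the
three nonsense words, the function names, and the 0.75 threshold are
all made up for illustration.

```python
from collections import Counter


def transitional_probabilities(units):
    """Estimate P(next | current) for every adjacent pair of units."""
    pair_counts = Counter(zip(units, units[1:]))
    # Count each unit only where it has a successor, so the
    # probabilities leaving any one unit sum to 1.
    first_counts = Counter(units[:-1])
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


def segment(units, threshold=0.75):
    """Split a unit stream wherever the transitional probability dips.

    The threshold is an illustrative assumption, not a tuned value.
    """
    if not units:
        return []
    tp = transitional_probabilities(units)
    words, current = [], [units[0]]
    for prev, nxt in zip(units, units[1:]):
        if tp[(prev, nxt)] < threshold:
            # Low predictability: treat this point as a word boundary.
            words.append("".join(current))
            current = []
        current.append(nxt)
    words.append("".join(current))
    return words


# Hypothetical toy "language": three nonsense words, run together
# with no spaces, in an order where each word has varied successors.
order = ["bidaku", "padoti", "golabu", "bidaku", "golabu", "padoti", "bidaku"]
stream = "".join(order)

# Assumption: the learner already perceives two-letter syllables.
syllables = [stream[i:i + 2] for i in range(0, len(stream), 2)]

print(segment(syllables))
```

Inside a word, each syllable is always followed by the same next
syllable, so the probability is 1; across a word boundary the next
syllable varies, so the probability falls and the stream gets cut
there. Real speech is far messier than this, so a fixed threshold like
the one here would probably not survive contact with a real corpus.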
Perhaps contrary to what Henrik T. seemed to be conjecturing a while
ago, it looks like they were able to use their algorithm on a corpus
of 31,100 sentences in each of six languages: Danish, Swedish,
French, Spanish, English, and Chinese. The corpora were all meant to
be translations of the same text, the Bible.
They "typologized" these six languages according to what paths the
algorithm had to take in dealing with the corpus in that language.
Unsurprisingly, Chinese was an outlier from the European languages.
Unsurprisingly, the Romance pair were closer to each other than to
anything else. Unsurprisingly, the Scandinavian pair were closer to
each other than to anything else. Surprisingly, English was closer to
the Romance pair than to its fellow Germanic languages, though only
from the point of view of this algorithm, and only while dealing with
the Bible.
Tom H.C. in MI