
Re: Unsupervised learning of natural languages

From:tomhchappell <tomhchappell@...>
Date:Thursday, November 3, 2005, 19:27
--- In conlang@yahoogroups.com, Sanghyeon Seo <sanxiyn@G...> wrote:
> I thought people on this list may be interested in the following
> paper:
>
> http://www.pnas.org/cgi/content/short/102/33/11629
> http://www.cs.tau.ac.il/~ruppin/pnas_adios.pdf
>
> Unsupervised learning of natural languages
> Zach Solan, David Horn, Eytan Ruppin, and Shimon Edelman
>
> This inducts grammar rule from raw data (unsegmented writing,
> continuous speech, etc.), and is also generative and predictive. The
> algorithm is also believed to be linear, thus computationally
> feasible.
>
> Applying this to your conlang and generating few sentences may be an
> interesting experience... If someone can implement this.
>
> Seo Sanghyeon

Thank you. I read the paper and liked it a lot, which is not to say I have completely understood it yet. They made concrete, and filled in the details of, an idea that occurred to me hazily soon after I started thinking about this topic.

The only other thing I could wish for: in all of the computerized applications I have seen so far, including theirs, the input to the computer is already segmented. In any natural language, whether signed or spoken, the learner must somehow figure out for himself or herself what the segments are and what the words are. I have yet to see how a computer might do that.

Perhaps contrary to what Henrik T. seemed to be conjecturing a while ago, it looks like they were able to use their algorithm on a corpus of 31,100 sentences in each of six languages: Danish, Swedish, French, Spanish, English, and Chinese. The six corpora were all meant to be translations of the same text, the Bible.

They "typologized" these six languages according to what paths the algorithm had to take in dealing with the corpus in that language. Unsurprisingly, Chinese was an outlier from the European languages. Unsurprisingly, the Romance pair were closer to each other than to anything else. Unsurprisingly, the Scandinavian pair were closer to each other than to anything else. Surprisingly, English was closer to the Romance pair than to its fellow Germanic languages, though just from the point of view of this algorithm, and just while dealing with the Bible.

Tom H.C. in MI

Reply

Henrik Theiling <theiling@...>