
Re: Unsupervised learning of natural languages

From: Gary Shannon <fiziwig@...>
Date: Wednesday, November 2, 2005, 19:45
--- Henrik Theiling <theiling@...> wrote:

<snip>
> Interesting, I will have to read that.
>
> > Applying this to your conlang and generating few sentences may be an
> > interesting experience... If someone can implement this.
<snip>

After reading the article (my background is comp sci, so I'm pretty familiar with the techniques discussed), it looks like the same technique could be applied to a corpus of individual words at the letter level, rather than a corpus of individual sentences at the word level. The technique would then extract not the rules of grammar, but the rules of orthography and word formation.

Those rules could then be used generatively to create new words for a conlang in the style of whatever vocabulary was supplied as the training sample. Thus there could be generative rules to create words with an Icelandic flavor, or words with a Tibetan flavor, or words with a Korean flavor, etc. And by blending extracted rule sets (or, equivalently, blending input lexicons), rules could be found for generating words with compound or hybrid flavors, like Russian-Japanese hybrid words or Polynesian-Hungarian hybrid words.

That sounds like an interesting project!
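To make the word-generation idea concrete, here is a rough sketch. It does not implement the article's method; it swaps in a much simpler letter-level Markov chain (each letter is chosen from the letters that followed the previous two in the sample lexicon). The function names `train`, `blend`, and `generate`, the `order` parameter, and the tiny word lists are all my own placeholders for illustration.

```python
import random
from collections import defaultdict

START, END = "^", "$"  # padding symbols, assumed absent from the lexicon


def train(words, order=2):
    """Count which letter follows each `order`-letter context."""
    model = defaultdict(lambda: defaultdict(float))
    for word in words:
        padded = START * order + word.lower() + END
        for i in range(order, len(padded)):
            model[padded[i - order:i]][padded[i]] += 1
    return model


def blend(model_a, model_b, weight_a=0.5):
    """Mix two models by weighting their normalised follower counts."""
    mixed = defaultdict(lambda: defaultdict(float))
    for model, weight in ((model_a, weight_a), (model_b, 1.0 - weight_a)):
        if weight <= 0:
            continue  # a model with no weight contributes nothing
        for context, followers in model.items():
            total = sum(followers.values())
            for letter, count in followers.items():
                mixed[context][letter] += weight * count / total
    return mixed


def generate(model, order=2, max_len=12, rng=random):
    """Walk the model from the start context until END or max_len."""
    context, letters = START * order, []
    while len(letters) < max_len:
        followers = model.get(context)
        if not followers:
            break
        choices, weights = zip(*followers.items())
        letter = rng.choices(choices, weights=weights)[0]
        if letter == END:
            break
        letters.append(letter)
        context = (context + letter)[-order:]
    return "".join(letters)


if __name__ == "__main__":
    # Toy lexicons, far too small for real use.
    icelandic = ["fjall", "vatn", "jökull", "hestur", "dagur", "nótt"]
    hawaiian = ["aloha", "mahalo", "wahine", "kahuna", "moana"]
    hybrid = blend(train(icelandic), train(hawaiian))
    for _ in range(5):
        print(generate(hybrid))
```

With lexicons this small the output will mostly echo the input words; a real experiment would need a few hundred words per language at least. Blending the models and blending the input lexicons are not quite the same thing here: `blend` gives each language equal say per context regardless of lexicon size, whereas pooling the word lists favors the larger lexicon.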
> Also, it would be interesting to see what it does for highly
> inflecting languages like Kalaallisut or Ancient Greek. If it fails
> here, too, the whole approach would not be too surprising at all,
> since one would naturally expect these things to fail.
It looks to me as though this method would have no trouble with inflected languages. The method extracts rules recursively, starting at the lowest level and rewriting the graph at a more abstract, or generalized, level before extracting rules at the next higher level. It would have to build its initial graph on the basis of individual letters, rather than individual words, however, so that the first rules it extracted would be at the level of the inflection rules.

The article does mention that the input string can be in the form of letters without spaces between the words (letterswithoutspacesbetweenthewords) and that the grammar rules are still successfully extracted once the low-level word boundary rules have been extracted.

--gary
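As a rough illustration of the "extract at the lowest level, rewrite, then extract again" idea applied to unspaced letters, here is a small sketch. It is not the article's algorithm; it substitutes a byte-pair-style merge, where each pass finds the most frequent adjacent pair of units, rewrites every sequence with that pair fused into one unit, and repeats on the rewritten sequences. The names `merge_pass` and `extract_units` and the sample strings are hypothetical.

```python
from collections import Counter


def merge_pass(sequences):
    """Fuse the most frequent adjacent pair of units into a single unit."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    if not pairs:
        return sequences, None
    (left, right), count = pairs.most_common(1)[0]
    if count < 2:
        return sequences, None  # nothing repeats, so nothing to generalise
    merged = left + right
    rewritten = []
    for seq in sequences:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                out.append(merged)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        rewritten.append(out)
    return rewritten, merged


def extract_units(texts, passes=10):
    """Start from single letters and repeatedly rewrite at a coarser level."""
    sequences = [list(text) for text in texts]
    units = []
    for _ in range(passes):
        sequences, unit = merge_pass(sequences)
        if unit is None:
            break
        units.append(unit)
    return sequences, units


if __name__ == "__main__":
    # Toy unspaced input with shared stems and endings.
    corpus = ["walkedtalkedjumped", "walkingtalkingjumping"]
    sequences, units = extract_units(corpus, passes=8)
    print(units)
    print(sequences)
```

On a corpus this small the units found are only suggestive, but the shape of the process is the point: the early passes pick up short, frequent letter groups (the level where inflectional endings live), and later passes operate on those groups rather than on raw letters.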
> But, ok, these thoughts are premature -- I haven't read the article
> yet.
>
> **Henrik
