Re: Unsupervised learning of natural languages
From: Gary Shannon <fiziwig@...>
Date: Wednesday, November 2, 2005, 19:45
--- Henrik Theiling <theiling@...> wrote:
<snip>
>
> Interesting, I will have to read that.
>
> > Applying this to your conlang and generating few sentences may be an
> > interesting experience... If someone can implement this.
<snip>
After reading the article (my background is comp sci
so I'm pretty familiar with the techniques discussed)
it looks like the same technique could be applied to a
corpus of individual words at the letter level, rather
than a corpus of individual sentences at the word
level, and the technique would extract not the rules
of grammar, but the rules of orthography and word
formation. Those rules could then be used generatively
to create new words for a conlang in the style of
whatever vocabulary was supplied as the training
sample. Thus there could be generative rules to create
words with an Icelandic flavor, or words with a
Tibetan flavor, or words with a Korean flavor, etc.
And by blending extracted rule sets (or equivalently,
blending input lexicons) rules could be found for
generating words with compound or hybrid flavors like
Russian-Japanese hybrid words, or Polynesian-Hungarian
hybrid words.
That sounds like an interesting project!
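
To make the letter-level idea concrete, here is a minimal sketch in Python. It does not implement the graph-based rule extraction from the article; it swaps in a much simpler letter n-gram (Markov chain) model, which only captures which letter tends to follow which short context. The function names, the `order` parameter, and the tiny Icelandic word list are all assumptions made for illustration, and a real training sample would need to be far larger.

```python
import random
from collections import defaultdict

START, END = "^", "$"  # padding symbols; assumed not to occur in the lexicon


def train(words, order=2):
    """Record, for every `order`-letter context, the letters seen after it."""
    model = defaultdict(list)
    for word in words:
        padded = START * order + word.lower() + END
        for i in range(len(padded) - order):
            model[padded[i:i + order]].append(padded[i + order])
    return model


def generate(model, order=2, rng=random, max_len=12):
    """Walk the model from the start context until END or max_len letters."""
    context, out = START * order, []
    while len(out) < max_len:
        nxt = rng.choice(model[context])
        if nxt == END:
            break
        out.append(nxt)
        context = context[1:] + nxt
    return "".join(out)


# Tiny illustrative sample only (a handful of Icelandic nouns).
icelandic = ["fjall", "vatn", "hestur", "bjarg", "dalur",
             "foss", "land", "eldur", "steinn"]

model = train(icelandic)
for _ in range(5):
    print(generate(model))
```

Blending flavors, in this simplified setting, amounts to training on the concatenation of two lexicons, e.g. `train(icelandic + other_lexicon)`, where `other_lexicon` is whatever second word list is supplied.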
>
> Also, it would be interesting to see what it does for highly
> inflecting languages like Kalaallisut or Ancient Greek. If it fails
> here, too, the whole approach would not be too surprising at all,
> since one would naturally expect these things to fail.
>
It looks to me as though this method would have no
trouble with inflected languages. The method extracts
rules recursively starting at the lowest level and
re-writing the graph at a more abstract, or
generalized level before extracting rules at the next
higher level. It would have to build its initial graph
on the basis of individual letters, rather than
individual words, however, so that the first rules it
extracted would be at the level of the inflection
rules.
The article does mention that the input string can be
in the form of letters without spaces between the
words (letterswithoutspacesbetweenthewords) and that
the grammar rules are still successfully extracted
once the low-level word boundary rules have been
extracted.
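
As a rough illustration of how word boundaries might be recovered from unspaced letters, the sketch below uses a classic successor-variety heuristic: a position where many different letters can follow the preceding context is a plausible boundary. This is a stand-in technique, not the method described in the article, and the function names, the `order` parameter, and the toy corpus are assumptions made for the example.

```python
from collections import defaultdict


def successor_sets(corpus, order=3):
    """Map each `order`-letter context to the set of letters seen after it."""
    followers = defaultdict(set)
    for i in range(len(corpus) - order):
        followers[corpus[i:i + order]].add(corpus[i + order])
    return followers


def boundary_scores(text, followers, order=3):
    """Score each position by how many distinct letters followed its context.

    Returns (position, score) pairs; a higher score suggests a likelier
    boundary immediately before text[position].
    """
    scores = []
    for i in range(order, len(text)):
        context = text[i - order:i]
        scores.append((i, len(followers.get(context, ()))))
    return scores


# Toy unspaced corpus, for illustration only.
corpus = "thecatthedogthecowthepig"
followers = successor_sets(corpus)
print(boundary_scores("thecat", followers))
```

In this toy corpus the context "the" is followed by more distinct letters than any context inside "cat", so the highest score falls at the gap after "the". A real corpus would need far more text and a more careful threshold.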
--gary
> But, ok, these thoughts are premature -- I haven't
> read the article
> yet.
>
> **Henrik
>