
Re: Unsupervised learning of natural languages

From: Henrik Theiling <theiling@...>
Date: Thursday, November 3, 2005, 21:05

Hi!

tomhchappell <tomhchappell@...> writes:
>...
> Perhaps contrary to what Henrik T. seemed to be conjecturing a while
> ago, it looks like they were able to use their algorithm on a corpus
> of 31,100 sentences in each of six languages; Danish, Swedish,
> French, Spanish, English, and Chinese.
>...

Hmm? I never doubted they perform well in the given six languages. What I said was not about the number of languages they handle, but about the structure of the languages. The six given languages have a relatively context-free syntax structure with nicely embedded sub-phrases. I merely said I would have been more surprised by a working algorithm if they had tested a more interesting language. E.g. Dutch, which has a very funny verb order in embedded phrases:

    ... dat  ik jou zag lezen.
        that I  you saw read
        'that I saw you read.'

The interesting part is that 'I saw' is one sub-phrase and 'you read' is another, and that the final structure contains the subjects in a row followed by the verbs in the same order. For arbitrarily deep nesting, this cannot be generated with a context-free grammar. Further, with a given context length, you can only generate a fixed number of reversals, so I think the grammar structure they are generating is just not suited for Dutch and thus for natural language in general. In Dutch you can have:

    dat ik jou haar hem hoor vragen helpen koken.
        A  B   C    D   a    b      c      d

    that I hear you ask her help him cook.
         A a    B   b   C   c    D   d

This *is* quite an artificial example, but it illustrates the algorithmic problems I suspect.

Btw., German is much nicer than Dutch wrt. context-free grammar structure, since it uses a bracket-like structure:

    daß ich Dich sie ihm kochen helfen bitten höre.
        A   B    C   D   d      c      b      a

(Ok, ok, this is almost incomprehensible...)

I think production and rewriting rules are not the perfect means for natural language processing, since even context-free grammars are too powerful in allowing arbitrary nesting, which the human brain doesn't handle, while on the other hand they are too restricted for structures like those in Dutch. Further, there are languages with free word order, so even searching for syntax rules about the order of words is an algorithmic guideline and thus a form of supervision. So 'Unsupervised learning of natural languages' just feels like an illegitimate euphemism.

If the paper title were a little less lurid, I'd probably be much more positive. :-)

**Henrik

PS: The funny thing is, intuitively I feel that the German structure is the most complex one for my brain, followed by the Dutch one, which is easier for me, followed by the English structure, which is the easiest of the three. I.e., I can parse the English example above without problems, and adding more sub-phrases is hardly a problem. I can quite easily parse the Dutch example, but the German example is just one sub-phrase too deeply nested to be understood easily. This is despite the fact that German is my mother tongue and that I am therefore quite used to parentheses in syntax structure. So this supports my feeling that context-free structures (and rewriting systems) are inappropriate for natural languages, as the order of perceived difficulty is different from what I'd expect from CFGs.
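
To make the three orderings above concrete, here is a small sketch in Python. It is only an illustration under my own assumptions, not anything from the paper: upper-case letters stand for the nouns, the matching lower-case letters for their verbs, and all function names (`cross_serial`, `nested`, `interleaved`, `dependencies_ok`) are made up for this example. It pairs each verb with a pending noun either last-in-first-out (a stack, which is roughly the kind of memory a context-free parser has) or first-in-first-out (a queue), and checks whether the pairing comes out right. It illustrates the difference between the orders; it proves nothing about generative power.

```python
from collections import deque


def cross_serial(nouns):
    """Dutch-like order: A B C D a b c d (verbs in the same order as nouns)."""
    return nouns + [n.lower() for n in nouns]


def nested(nouns):
    """German-like order: A B C D d c b a (verbs in reverse, bracket-like)."""
    return nouns + [n.lower() for n in reversed(nouns)]


def interleaved(nouns):
    """English-like order: A a B b C c D d (each verb right after its noun)."""
    out = []
    for n in nouns:
        out += [n, n.lower()]
    return out


def dependencies_ok(tokens, lifo):
    """Pair every verb with a pending noun and check that each pair matches.

    lifo=True  takes the most recent pending noun (stack discipline).
    lifo=False takes the oldest pending noun (queue discipline).
    """
    pending = deque()
    for tok in tokens:
        if tok.isupper():            # a noun: remember it until its verb shows up
            pending.append(tok)
        else:                        # a verb: it needs a pending noun
            if not pending:
                return False
            noun = pending.pop() if lifo else pending.popleft()
            if noun.lower() != tok:
                return False
    return not pending               # every noun must have found its verb


if __name__ == "__main__":
    nouns = list("ABCD")
    for name, order in [("Dutch-like", cross_serial(nouns)),
                        ("German-like", nested(nouns)),
                        ("English-like", interleaved(nouns))]:
        print(name, " ".join(order),
              "stack:", dependencies_ok(order, lifo=True),
              "queue:", dependencies_ok(order, lifo=False))
```

With four nouns, the stack discipline pairs up the German-like order correctly but fails on the Dutch-like one, the queue discipline does the opposite, and the English-like order works with either, since there is never more than one noun pending.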


tomhchappell <tomhchappell@...>