Re: Unsupervised learning of natural languages
From: Henrik Theiling <theiling@...>
Date: Thursday, November 3, 2005, 21:05
Hi!
tomhchappell <tomhchappell@...> writes:
>...
> Perhaps contrary to what Henrik T. seemed to be conjecturing a while
> ago, it looks like they were able to use their algorithm on a corpus
> of 31,100 sentences in each of six languages: Danish, Swedish,
> French, Spanish, English, and Chinese.
>...
Hmm? I never doubted that they perform well on the given six languages.
What I said was not about the number of languages they handle, but
about the structure of those languages. The six given languages have a
relatively context-free syntactic structure with nicely embedded
sub-phrases. I merely said I would have been more surprised by a
working algorithm if they had tested a more interesting language, e.g.
Dutch, which has a very funny verb order in embedded phrases:
... dat ik jou zag lezen.
that I you saw read
'that I saw you read.'
The interesting part is that 'I saw' is one sub-phrase and 'you read'
is another, and that the final structure contains the subjects in a row
followed by the verbs in the same order. For arbitrarily deep
nesting, this cannot be generated by a context-free grammar.
Further, with a given context length, you can only generate a fixed
number of reversals, so I think the grammar structure they are
generating is just not suited for Dutch and thus for natural language
in general. In Dutch you can have:
dat ik jou haar hem hoor vragen helpen koken.
A B C D a b c d
that I hear you ask her help him cook.
A a B b C c D d
This *is* quite an artificial example, but it illustrates the
algorithmic problems I suspect. Btw., German is much nicer than
Dutch wrt. context-free grammar structure, since it uses a
bracket-like structure:
daß ich Dich sie ihm kochen helfen bitten höre.
A B C D d c b a
(Ok, ok, this is almost incomprehensible...)
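To make this concrete, here is a little Python toy (my own
illustration, nothing from the paper): reduce each clause to its
subject letters A B C D and verb letters a b c d, and check the word
order with a single stack, i.e. with exactly the kind of nested
dependency a context-free rule can express. The English and German
orders pass, the Dutch one does not.

    # Toy illustration (not from the paper): uppercase = subject,
    # lowercase = the verb that belongs to it ('A' pairs with 'a').
    def nests_properly(tokens):
        """True if every verb pairs with the most recently opened,
        still-unpaired subject, i.e. the nesting discipline that a
        single pushdown stack (and hence a CFG) can enforce."""
        stack = []
        for t in tokens:
            if t.isupper():                   # subject: open a dependency
                stack.append(t)
            elif not stack or stack.pop() != t.upper():
                return False                  # verb must close the innermost one
        return not stack                      # nothing left unpaired

    english = list("AaBbCcDd")   # I hear you ask her help him cook
    german  = list("ABCDdcba")   # ... ich Dich sie ihm kochen helfen bitten hoere
    dutch   = list("ABCDabcd")   # ... ik jou haar hem hoor vragen helpen koken

    for name, order in [("English", english), ("German", german), ("Dutch", dutch)]:
        print(name, nests_properly(order))
    # English True / German True / Dutch False

This only shows that the crossed Dutch order breaks the nesting
discipline, not the full formal argument, but it is exactly the
pattern that keeps growing as you add more clauses.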
I think production and rewriting rules are not the perfect means for
natural language processing: even context-free grammars are too
powerful in that they allow arbitrary nesting, which the human brain
does not handle, while on the other hand they are too restricted for
structures like the Dutch one.
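And for the 'too powerful' half: a handful of context-free rules is
already enough to spin out arbitrarily deep centre-embedding. Another
toy sketch, again only my illustration, using the words from the
German example above:

    # A context-free grammar for German-style centre-embedding,
    # written as paired "brackets":
    #
    #   S -> "dass ich" S "hoere"
    #      | "Dich"     S "bitten"
    #      | "sie"      S "helfen"
    #      | "ihm"      S "kochen"
    #      | (empty)
    #
    # The function expands one particular derivation; the rules
    # themselves allow nesting far deeper than a human parser copes with.
    PAIRS = [("dass ich", "hoere"), ("Dich", "bitten"),
             ("sie", "helfen"), ("ihm", "kochen")]

    def centre_embed(level=0):
        if level == len(PAIRS):
            return []
        subj, verb = PAIRS[level]
        return [subj] + centre_embed(level + 1) + [verb]

    print(" ".join(centre_embed()))
    # dass ich Dich sie ihm kochen helfen bitten hoere
    # No comparably simple set of context-free rules produces the crossed
    # Dutch order (subjects and verbs both left to right) for unbounded depth.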
Further, there are languages with free word order, so even searching
for syntax rules governing word order is an algorithmic guideline and
thus a form of supervision.
So 'Unsupervised learning of natural languages' just feels like an
illegitimate euphemism. If the paper title was a little less lurid,
I'd probably be much more positive. :-)
**Henrik
PS: The funny thing is, intuitively I feel that the German structure is
the most complex one for my brain, followed by the Dutch one,
which is easier for me, followed by the English structure, which
is easiest of the three for me. I.e., I can parse the English
example above without problems and adding more sub-phrases is
hardly a problem, and I can quite easily parse the Dutch example,
but the German example is just one sub-phrase too deeply nested
to be understood easily. This is despite the fact that German is
my mother tongue and that I am therefore quite used to parentheses
in syntax structure. This supports my feeling that context-free
structures (and rewriting systems) are inappropriate for natural
languages, as the order of perceived ease is different from what
I'd expect from CFGs.