Re: About making a translator
From: H. S. Teoh <hsteoh@...>
Date: Tuesday, October 26, 2004, 23:35
On Wed, Oct 27, 2004 at 02:30:28AM +0400, Alexander Savenkov wrote:
> Hello,
>
> 2004-10-26T16:06:04+03:00 Ray Brown <ray.brown@...> wrote:
>
> > But, as Richard has written & I have discovered from experience, it
> > is a highly non-trivial task.
>
> According to what I've read, this is an impossible task for now.
> Machine translation will be possible with the invention of AI.
[...]
Impossible to get 100% correct, yes. But it may be possible to
produce an approximation.
The essence of the problem is that natural language is inherently
ambiguous, and requires (usually implicit) context to interpret
correctly. Take for example the following quote, which I got from
somebody on this list:
Time flies like an arrow.
Fruit flies like a banana.
The second sentence is particularly pathological, in that it has two
possible parses, both of which have sensible semantics:
1) Fruit-NP flies-V (like a banana)-PP  (fruit moves through the air
   the way a banana would)
2) (Fruit flies)-NP like-V (a banana)-NP  (the insects are fond of a
   banana)
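To make this concrete, here is a toy sketch in Python using the NLTK
library (the mini-grammar is my own invention, just enough English
for this one sentence) that asks a chart parser to enumerate the
parses:

    import nltk

    # A deliberately tiny, hypothetical grammar: "flies" may be a noun
    # or a verb, and "like" may be a verb or a preposition, so the
    # grammar is ambiguous in exactly the way the English fragment is.
    grammar = nltk.CFG.fromstring("""
        S  -> NP VP
        NP -> N | N N | Det N
        VP -> V NP | V PP
        PP -> P NP
        Det -> 'a'
        N -> 'fruit' | 'flies' | 'banana'
        V -> 'flies' | 'like'
        P -> 'like'
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse("fruit flies like a banana".split()):
        print(tree)   # prints both parse trees

Run it and you get exactly the two trees above; nothing in the
grammar itself tells the parser which one the writer meant.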
The problem with this kind of ambiguity is that it is inherent in
English grammar. (And it's not just English alone; I believe most, if
not all, natlangs are inherently ambiguous.) We don't have good
algorithms for dealing with ambiguous grammars: a chart parser can
recognize such sentences in polynomial time, but the number of
distinct parses can grow exponentially with sentence length, so
actually enumerating all possible parses takes exponential time.
Furthermore, once this has been done, you need some way to decide
which of these parses is the correct one.
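A back-of-the-envelope illustration of that blow-up (plain Python, no
parsing involved): the number of binary-branching trees over n words
is a Catalan number, and it explodes long before sentences get
unusually long:

    from math import comb

    # catalan(n) = C(2n, n) / (n + 1): the number of distinct
    # binary-branching trees over n + 1 leaves.
    def catalan(n: int) -> int:
        return comb(2 * n, n) // (n + 1)

    for words in (5, 10, 20, 40):
        print(words, "words:", catalan(words - 1),
              "possible binary trees")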
But worse yet, context is required to properly interpret natlang
sentences. We don't have good algorithms for dealing with
context-sensitive grammars. In fact, I doubt that anyone even knows
how to go about writing a context-sensitive grammar that captures
even basic contextual dependencies, such as whether a referent has
occurred earlier in the given text. Even if such a grammar were
written, it would be extremely complex and difficult to understand.
And we still don't have a feasible algorithm for parsing text
according to context-sensitive grammars.
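What one ends up writing instead is crude bookkeeping outside the
grammar. A hypothetical Python sketch (not a real system): resolve
each pronoun to the most recently mentioned noun:

    # Hypothetical sketch: referent tracking done *outside* the
    # grammar, as mutable state threaded through the text. No grammar
    # rule ever sees this; it is pure ad-hoc bookkeeping.
    PRONOUNS = {"it", "he", "she", "they"}

    def resolve_pronouns(tagged_words):
        """tagged_words: list of (word, part_of_speech) pairs."""
        recent_nouns = []
        resolved = []
        for word, pos in tagged_words:
            if pos == "N":
                recent_nouns.append(word)
                resolved.append(word)
            elif word in PRONOUNS and recent_nouns:
                resolved.append(recent_nouns[-1])  # naive: latest noun
            else:
                resolved.append(word)
        return resolved

    print(resolve_pronouns([("the", "Det"), ("dog", "N"),
                            ("barked", "V"), ("because", "C"),
                            ("it", "Pro"), ("was", "V"),
                            ("hungry", "Adj")]))
    # -> ['the', 'dog', 'barked', 'because', 'dog', 'was', 'hungry']

The heuristic falls over as soon as two plausible antecedents
compete, which is exactly where context matters.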
Now, existing computer language compilers do deal with
context-sensitivity, but only to a limited extent. The programming
language's context-sensitive grammar is reduced to an unambiguous,
context-free grammar, and the context-sensitivity is implemented as
ad-hoc rules applied after the parse. The language itself is designed
and refined so that these rules are relatively straightforward to
implement. When it comes to natural language, however, we don't have
this option. And the context-sensitivity rules aren't as well-defined
as in the computer language case; they are usually mere heuristics.
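A minimal sketch of what I mean, in Python (real compilers do this
with symbol tables over a full syntax tree): the context-free parser
accepts any identifier anywhere, and the context-sensitive rule
"declare before use" is a separate pass bolted on afterward:

    # "statements" stands in for the output of a context-free parse.
    # The context-sensitive rule is enforced by this separate, ad-hoc
    # pass over the parse result, not by the grammar.
    def check_declared_before_use(statements):
        declared = set()
        for kind, name in statements:
            if kind == "decl":
                declared.add(name)
            elif kind == "use" and name not in declared:
                raise NameError(f"'{name}' used before declaration")

    check_declared_before_use([("decl", "x"), ("use", "x"),
                               ("use", "y")])
    # -> NameError: 'y' used before declaration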
Add on top of this the fact that most natlang texts leave out a lot
of the context required to interpret them properly. For example, some
technical jargon uses common English words, but with meanings that
differ from common usage. In technical journals, however, such words
are usually not explained, because their meaning is assumed to be
known to the audience. From a computer's standpoint, this means that
the context required to interpret such a text is simply not
available, and so the ambiguity cannot be resolved. Sometimes, the
necessary context is *never* defined anywhere, because it is
culturally understood. To write an algorithm to interpret such texts
would require encoding cultural conventions, which I doubt we even
know how to represent digitally in a form usable by a parsing
algorithm.
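To see why the missing context is fatal, consider a naive word-sense
disambiguator in the style of the Lesk algorithm (a hypothetical
Python sketch; the mini-dictionary is invented). It picks the sense
whose gloss shares the most words with the surrounding text, so when
the text supplies no clues it has no basis to choose at all:

    # Hypothetical mini-dictionary: "field" as used in algebra
    # versus farming.
    SENSES = {
        "field": {
            "algebra": "a set with addition and multiplication operations",
            "farming": "an area of open land used for crops or pasture",
        },
    }

    def disambiguate(word, context_words):
        """Naive Lesk: pick the sense whose gloss overlaps most."""
        best_sense, best_overlap = None, 0
        for sense, gloss in SENSES[word].items():
            overlap = len(set(gloss.split()) & set(context_words))
            if overlap > best_overlap:
                best_sense, best_overlap = sense, overlap
        return best_sense  # None when the context gives no clues

    print(disambiguate("field",
                       "the set is closed under addition".split()))
    # -> 'algebra'
    print(disambiguate("field",
                       "consider the following field".split()))
    # -> None: the journal assumes the reader already knows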
And this doesn't even begin to account for regional, dialectal, and
personal differences, which can sometimes play a big role in properly
interpreting a given piece of text. And even if you can somehow
surmount all these barriers, you still have to map the highly
idiosyncratic parse you've just constructed onto the grammar,
context, and conventions of the target language. That mapping is
rarely trivial: what usually ends up happening is that you take the
common denominator between the two languages and throw out everything
else. Unfortunately, what gets thrown out is often precisely what
carries the most important information.
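As a cartoon of that lossy mapping (hypothetical Python; the feature
names are invented for illustration), picture each parsed clause as a
bundle of grammatical features and the target language as an
inventory of features it can actually express:

    # Hypothetical feature bundles: the source language marks
    # evidentiality and dual number; the target marks neither.
    source_clause = {"tense": "past", "evidential": "hearsay",
                     "number": "dual"}
    target_inventory = {"tense", "person"}

    def map_to_target(features, inventory):
        kept = {k: v for k, v in features.items() if k in inventory}
        lost = {k: v for k, v in features.items() if k not in inventory}
        return kept, lost

    kept, lost = map_to_target(source_clause, target_inventory)
    print("translated:", kept)  # {'tense': 'past'}
    print("discarded:", lost)   # the hearsay marking, often the point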
Having said all this, all hope is not lost; it is still possible to
write approximate algorithms that can more-or-less parse natural
language and produce approximate translations. Heuristics can be
applied to make guesses that are right, say, 90% of the time.
"Approximate" is the key word here, though, because currently
existing translators fall woefully short of the quality needed for
general use. As someone once said, "Heuristics are buggy by
definition, because if they weren't buggy, they'd be algorithms."
Perhaps AI might help improve this, but given the current state of
AI, I'm not holding my breath.
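One classic heuristic of that sort, sketched hypothetically in Python
(the tag table is invented, as if gathered from a corpus): tag each
word with whatever part of speech it carries most often. A guesser
like this is right roughly nine times out of ten on real text, and
wrong in exactly the cases that matter:

    # Hypothetical most-frequent-tag table. Frequency is not
    # understanding: it commits "flies" to being a verb no matter
    # what the writer meant.
    MOST_FREQUENT_TAG = {"fruit": "N", "flies": "V", "like": "P",
                         "a": "Det", "banana": "N"}

    def tag(sentence):
        return [(w, MOST_FREQUENT_TAG.get(w, "N"))
                for w in sentence.split()]

    print(tag("fruit flies like a banana"))
    # -> "flies" tagged V and "like" tagged P: the heuristic silently
    # picks the "fruit is flying" reading, and nothing tells it the
    # writer meant the insects.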
T
--
Creativity is not an excuse for sloppiness.