Re: A way to identify languages algorithmically (was: Re: Online Language Identifier)
From: Yahya Abdal-Aziz <yahya@...>
Date: Friday, September 9, 2005, 7:41
On Thu, 8 Sep 2005, taliesin the storyteller wrote:
> > I spent some time looking at the Xerox site, but I don't think they
> > want to tell us how their methods work. Patents, you know; money and such...
> I happen to have read the paper in the reference that is now 404. It
> compares two techniques:
> * Comparing most frequent trigrams
> * Comparing most frequent short words
> In both cases you need a database of either frequent trigrams per
> language or frequent short words per language. For the former I'd think
> "'s " and " th" are good trigrams for English. For the latter approach,
> "a", "is", "the", "of" are dead giveaways for English.
> One of the methods was better for short texts but I don't remember
> which one...
=== AND ===
> Found the paper in question:
Excellent! Thank you, Taliesin.
Your intuition on " th" was good, tho "'s " didn't make the top ten.
The trigram method was superior for short sentences. One likely
reason is that any short sentence consisting only of short words
contains about three or more times as many trigrams as words.
One comparison they did _not_ make was with the number of
attributes held equal. In other words, a fairer test of the
discriminative power of the two methods would have been to
compare:
(1) the short-word method on sentences containing N short words
(2) the trigram method on sentences containing N trigrams.
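To make the counting argument concrete, here is a rough Python sketch. It is only an illustration under my own assumptions, not the paper's procedure: the function names, the one-space padding at each end, the five-letter cutoff for a "short" word, and the example sentence are all invented for the purpose.

```python
def char_trigrams(text):
    """Overlapping character trigrams, padding the text with one space at each end.

    The padding convention is an assumption of this sketch.
    """
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def short_words(text, max_len=5):
    """Words of at most max_len letters; the cutoff of 5 is an assumption."""
    return [w for w in text.lower().split() if len(w) <= max_len]


def equalised(text):
    """Truncate both feature lists to the same length N, as in the fairer test."""
    trigrams = char_trigrams(text)
    words = short_words(text)
    n = min(len(trigrams), len(words))
    return trigrams[:n], words[:n]


sentence = "the cat is on the mat"  # invented example, all short words
print(len(char_trigrams(sentence)), len(short_words(sentence)))  # prints: 21 6
```

For this sentence the trigram method gets 21 attributes to work with against 6 for the short-word method, a ratio of 3.5. The equalised() helper shows one simple way to hold N equal; a real experiment would more likely pick sentences of matching size than truncate them.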
Still, either method seems useful when we have a suitably large
corpus, though the short-word method has the added advantage
that we may already have a dictionary to hand.
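For anyone who wants to play with the two approaches, here is a minimal sketch of both in Python. The profiles are tiny hand-picked toy sets, not the frequency tables from the paper; the English short words are the ones Taliesin suggested, and everything else (the French entries, the names, the hit-counting score) is my own assumption.

```python
# Toy profiles: hand-picked for illustration only, NOT taken from the paper.
TOY_TRIGRAMS = {
    "english": {" th", "the", "he ", " of", "of ", " is"},
    "french": {" de", "de ", " le", "le ", " la", "es "},
}
TOY_WORDS = {
    "english": {"a", "is", "the", "of"},
    "french": {"le", "la", "de", "et"},
}


def char_trigrams(text):
    """Overlapping character trigrams, with one space of padding at each end."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def guess(features, profiles):
    """Pick the language whose profile matches the most features.

    Returns None when no feature matches any profile.
    """
    scores = {
        lang: sum(1 for f in features if f in profile)
        for lang, profile in profiles.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def guess_by_trigrams(text):
    return guess(char_trigrams(text), TOY_TRIGRAMS)


def guess_by_words(text):
    return guess(text.lower().split(), TOY_WORDS)


print(guess_by_trigrams("the end of the road"))  # prints: english
print(guess_by_words("le chat de la maison"))    # prints: french
```

A real identifier would build its profiles from a suitably large corpus per language and weight each feature by frequency rather than just counting hits.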