Re: A way to identify languages algorithmically (was: Re: Online Language Identifier)

From: Yahya Abdal-Aziz <yahya@...>
Date: Friday, September 9, 2005, 7:41
On Thu, 8 Sep 2005, taliesin the storyteller wrote:
> > I spent some time looking at the Xerox site, but I don't think they
> > want to tell us how their methods. Patents, you know; money and such
>
> I happen to have read the paper in the reference that is now 404. It
> compares two techniques:
>
> * Comparing most frequent trigrams
> * Comparing most frequent short words
>
> In both cases you need a database of either frequent trigrams per
> language or frequent short words per language. For the former I'd think
> "'s " and " th" are good trigrams for English. For the latter approach,
> "a", "is", "the", "of" are dead giveaways for English.
>
> One of the methods were better for short texts but I don't remember
> which one...
=== AND ===
> Found the paper in question:
>
> -two-language-identification-schemes.pdf

Excellent!  Thank you, Taliesin.

Your intuition on " th" was good, tho "'s " didn't make the top ten.

The trigram method was superior for short sentences.  One likely reason
is that any short sentence consisting only of short words contains
roughly three or more times as many trigrams as it does words, so the
trigram method simply has more evidence to work with.
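
To make that ratio concrete, here is a rough Python sketch that counts
both kinds of attribute in one short sentence.  It assumes overlapping
character trigrams that include spaces, which may not be exactly how the
paper counts them; the function names and the sample sentence are my own.

```python
def char_trigrams(text):
    """Overlapping character trigrams, spaces included (an assumption)."""
    s = " ".join(text.lower().split())  # normalise case and whitespace
    return [s[i:i + 3] for i in range(len(s) - 2)]


def words(text):
    """Whitespace-separated words, lower-cased."""
    return text.lower().split()


sentence = "the cat is on the mat"
n_trigrams = len(char_trigrams(sentence))
n_words = len(words(sentence))
print(n_trigrams, n_words, round(n_trigrams / n_words, 2))
# prints: 19 6 3.17
```

A string of L characters yields L - 2 such trigrams, so with short words
the trigram count always runs well ahead of the word count.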

One comparison they did _not_ make was between the two methods when
each is given the same number of attributes.  In other words, a fairer
test of the discriminative power of both methods would have been to
compare:
(1) the short-word method on sentences containing N short words with
(2) the trigram method on sentences containing N trigrams.
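
One way such matched samples might be prepared, as a sketch only: take
the first N short words and the first N trigrams from the same text, and
hand each set to its own method.  The cutoff max_len=5 for a "short"
word and the helper names are my assumptions, not taken from the paper.

```python
import re


def trigrams(text):
    """Overlapping character trigrams, spaces included (an assumption)."""
    s = " ".join(text.lower().split())
    return [s[i:i + 3] for i in range(len(s) - 2)]


def short_words(text, max_len=5):
    """Words of at most max_len letters; max_len=5 is an assumed cutoff."""
    found = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    return [w for w in found if len(w) <= max_len]


def matched_samples(text, n, max_len=5):
    """Return (n short words, n trigrams), or None if the text is too short.

    Both methods then see exactly n attributes, as in the fairer test.
    """
    w = short_words(text, max_len)
    g = trigrams(text)
    if len(w) < n or len(g) < n:
        return None
    return w[:n], g[:n]


print(matched_samples("the cat is on the mat", 4))
# prints: (['the', 'cat', 'is', 'on'], ['the', 'he ', 'e c', ' ca'])
```

Note that the N trigrams cover far less of the sentence than the N short
words do, which is exactly what makes the comparison interesting.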

Still, either method seems useful when we have a suitably large corpus,
though the short-word method has the added advantage that we may
already have a dictionary to hand.