Re: A way to identify languages algorithmically (was: Re: Online Language Identifier)
From: taliesin the storyteller <taliesin-conlang@...>
Date: Thursday, September 8, 2005, 10:31
* Yahya Abdal-Aziz said on 2005-09-08 07:18:10 +0200
> * Peter Bleackley said:
> > I don't know if this is what Xerox are doing, but one way of performing
> > language identification is as follows:
> >
> > 1) Take large files of text in each of your candidate languages. Run
> > each of them through a data compression algorithm, which uses the
> > statistical properties of the data to improve the efficiency of its
> > binary representation. Note the file sizes produced.
> > 2) Append your sample text to each of the uncompressed files.
> > Compress them again and compare the new file sizes with the old ones.
> > 3) The language for which the compressed file size shows the smallest
> > increase is the one whose statistical properties best match those of
> > the sample text.
>
> I spent some time looking at the Xerox site, but I don't think they
> want to tell us how their methods work. Patents, you know; money and such ...
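Peter's recipe is easy to try at home; it amounts to using a
general-purpose compressor as a crude language model. A minimal sketch
in Python, with zlib standing in for the compressor and toy strings
standing in for the large per-language files (both are just my stand-ins,
not anything Xerox or Peter specified):

import zlib

# Toy stand-ins for the large per-language files of step 1; real use
# would load megabytes of text per language.
CORPORA = {
    "english": b"the cat sat on the mat and the dog barked at the cat " * 50,
    "german":  b"die katze sass auf der matte und der hund bellte sie an " * 50,
}

def size_increase(corpus, sample):
    """Extra compressed bytes once the sample is appended (steps 2 and 3)."""
    return len(zlib.compress(corpus + sample)) - len(zlib.compress(corpus))

def identify(sample):
    """The language whose corpus absorbs the sample most cheaply wins."""
    return min(CORPORA, key=lambda lang: size_increase(CORPORA[lang], sample))

print(identify(b"the quick brown fox jumps over the lazy dog"))

One caveat: zlib's 32 KB window means only the tail of a really large
corpus influences how the appended sample compresses, so a compressor
with a bigger window (bzip2, lzma) would follow step 1 more faithfully.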
I happen to have read the paper in the reference that is now 404. It
compares two techniques:
* Comparing most frequent trigrams
* Comparing most frequent short words
In both cases you need a database of either frequent trigrams per
language or frequent short words per language. For the former I'd think
"'s " and " th" are good trigrams for English. For the latter approach,
"a", "is", "the", "of" are dead giveaways for English.
One of the methods was better for short texts, but I don't remember
which one...
t.