Re: A way to identify languages algorithmically (was: Re: Online Language Identifier)
From: taliesin the storyteller <taliesin-conlang@...>
Date: Thursday, September 8, 2005, 10:31
* Yahya Abdal-Aziz said on 2005-09-08 07:18:10 +0200
> * Peter Bleackley said:
> > I don't know if this is what Xerox are doing, but one way of performing
> > language identification is as follows:
> >
> > 1) Take large files of text in each of your candidate languages. Run
> > each of them through a data compression algorithm, which uses the
> > statistical properties of the data to improve the efficiency of its
> > binary representation. Note the file sizes produced.
> > 2) Append your sample text to each of the uncompressed files.
> > Compress them again and compare the new file sizes with the old ones.
> > 3) The language for which the compressed file size shows the smallest
> > increase is the one whose statistical properties best match those of
> > the sample text.
>
> I spent some time looking at the Xerox site, but I don't think they
> want to tell us how their methods work. Patents, you know; money and such ...
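Peter's recipe is easy to try at home; it amounts to using a
general-purpose compressor as a crude language model. A minimal sketch
in Python, with zlib standing in for the compressor and toy strings
standing in for the large per-language files (both are just my stand-ins,
not anything Xerox or Peter specified):

import zlib

# Toy stand-ins for the large per-language files of step 1; real use
# would load megabytes of text per language.
CORPORA = {
    "english": b"the cat sat on the mat and the dog barked at the cat " * 50,
    "german":  b"die katze sass auf der matte und der hund bellte sie an " * 50,
}

def size_increase(corpus, sample):
    """Extra compressed bytes once the sample is appended (steps 2 and 3)."""
    return len(zlib.compress(corpus + sample)) - len(zlib.compress(corpus))

def identify(sample):
    """The language whose corpus absorbs the sample most cheaply wins."""
    return min(CORPORA, key=lambda lang: size_increase(CORPORA[lang], sample))

print(identify(b"the quick brown fox jumps over the lazy dog"))

One caveat: zlib's 32 KB window means only the tail of a really large
corpus influences how the appended sample compresses, so a compressor
with a bigger window (bzip2, lzma) would follow step 1 more faithfully.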
I happen to have read the paper in the reference that is now 404. It
compares two techniques:
* Comparing most frequent trigrams
* Comparing most frequent short words
In both cases you need a database of either frequent trigrams per
language or frequent short words per language. For the former I'd think
"'s " and " th" are good trigrams for English. For the latter approach,
"a", "is", "the", "of" are dead giveaways for English.
One of the methods was better for short texts, but I don't remember
which one...
t.