Re: Online Language Identifier

From: Paul Bennett <paul-bennett@...>
Date: Wednesday, August 31, 2005, 9:28

On Wed, 31 Aug 2005 04:42:08 -0400, Peter Bleackley
<Peter.Bleackley@...> wrote:

> I don't know if this is what Xerox are doing, but one way of performing
> language identification is as follows
>
> 1) Take large files of text in each of your candidate languages. Run
> each of them through a data compression algorithm, which uses the
> statistical properties of the data to improve the efficiency of its
> binary representation. Note the file sizes produced.
> 2) Append your sample text to each of the uncompressed files. Compress
> them again and compare the new file sizes with the old ones.
> 3) The language for which the compressed file size shows the smallest
> increase is the one whose statistical properties best match those of the
> sample text.

Ingenious! Somewhat like using a cannon to kill a mosquito, but effective,
and a very clever use of existing technology.

Paul
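
For anyone who wants to try the three quoted steps, here is a minimal sketch in Python. It is only one possible reading of the description, not necessarily what Xerox or anyone else actually does. The names compressed_size, size_increases and identify_language are made up for illustration, and the choice of zlib as the compressor is an assumption; the original message does not name one.

```python
import zlib
from typing import Dict, Mapping


def compressed_size(data: bytes) -> int:
    """Return the size in bytes of `data` after zlib compression (level 9)."""
    return len(zlib.compress(data, 9))


def size_increases(sample: str, corpora: Mapping[str, str]) -> Dict[str, int]:
    """For each language, measure how much the compressed corpus grows
    when the sample is appended to it (steps 1 and 2 of the quoted method)."""
    sample_bytes = sample.encode("utf-8")
    increases = {}
    for language, text in corpora.items():
        corpus_bytes = text.encode("utf-8")
        # Step 1: compress the reference text alone and note the size.
        baseline = compressed_size(corpus_bytes)
        # Step 2: append the sample, compress again, and compare.
        combined = compressed_size(corpus_bytes + b"\n" + sample_bytes)
        increases[language] = combined - baseline
    return increases


def identify_language(sample: str, corpora: Mapping[str, str]) -> str:
    """Step 3: pick the language whose corpus grows least with the sample."""
    increases = size_increases(sample, corpora)
    return min(increases, key=increases.get)
```

You would call identify_language(sample_text, {"english": english_text, "german": german_text}) with reference texts of your own. One caveat with this particular choice: zlib only looks back over a 32 KB window, so with genuinely large reference files the sample is effectively compared against the tail of each file. A compressor with a longer memory, such as the bz2 or lzma modules in the standard library, may be a better fit there.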