Re: Online Language Identifier
From: Paul Bennett <paul-bennett@...>
Date: Wednesday, August 31, 2005, 9:28
On Wed, 31 Aug 2005 04:42:08 -0400, Peter Bleackley
<Peter.Bleackley@...> wrote:
> I don't know if this is what Xerox are doing, but one way of performing
> language identification is as follows:
>
> 1) Take large files of text in each of your candidate languages. Run
> each of them through a data compression algorithm, which uses the
> statistical properties of the data to improve the efficiency of its
> binary representation. Note the file sizes produced.
> 2) Append your sample text to each of the uncompressed files. Compress
> them again and compare the new file sizes with the old ones.
> 3) The language for which the compressed file size shows the smallest
> increase is the one whose statistical properties best match those of the
> sample text.
Ingenious! Somewhat like using a cannon to kill a mosquito, but effective,
and a very clever use of existing technology.
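
For what it's worth, the three quoted steps can be sketched in a few lines of Python. This is only a rough sketch under assumptions that are not in the original post: zlib stands in for the unspecified "data compression algorithm", the reference texts are held in memory rather than in files, and the names compressed_size, identify_language and corpora are made up for illustration.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Return the size in bytes of data after zlib compression (level 9)."""
    return len(zlib.compress(data, 9))


def identify_language(sample: str, corpora: Mapping[str, str]) -> str:
    """Guess the language of sample from reference texts keyed by language name.

    corpora is a hypothetical mapping such as {"english": "...", "german": "..."}.
    """
    sample_bytes = sample.encode("utf-8")
    increases = {}
    for language, text in corpora.items():
        corpus_bytes = text.encode("utf-8")
        # Step 1: compress the reference text alone and note its size.
        baseline = compressed_size(corpus_bytes)
        # Step 2: append the sample to the uncompressed text and compress again.
        combined = compressed_size(corpus_bytes + sample_bytes)
        increases[language] = combined - baseline
    # Step 3: the smallest increase marks the best statistical match.
    return min(increases, key=increases.get)
```

One caveat with this particular stand-in: zlib's DEFLATE format only looks back 32 KB, so with large reference texts only the tail of each one influences how the appended sample is compressed. A compressor with a larger window (Python's lzma module, for instance) might suit big reference files better, though I have not measured that.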
Paul