Conlang: Re: Online Language Identifier (Paul Bennett, Aug 31 '05, 9:28)

> I don't know if this is what Xerox are doing, but one way of performing
> language identification is as follows
>
> 1) Take large files of text in each of your candidate languages. Run
> each of them through a data compression algorithm, which uses the
> statistical properties of the data to improve the efficiency of its
> binary representation. Note the file sizes produced.
> 2) Append your sample text to each of the uncompressed files. Compress
> them again and compare the new file sizes with the old ones.
> 3) The language for which the compressed file size shows the smallest
> increase is the one whose statistical properties best match those of the
> sample text.
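
The three quoted steps can be sketched in a few lines of Python. This is only an illustration under assumptions: the post does not name a compressor, so zlib (DEFLATE) stands in here, the corpora are held in memory as bytes rather than as files, and the names compressed_size and identify_language are made up for the example. DEFLATE only looks back over a 32 KB window, so with very large corpora a compressor with a longer memory would probably suit the method better.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Return the size in bytes of `data` after zlib compression."""
    return len(zlib.compress(data, 9))


def identify_language(corpora: Mapping[str, bytes], sample: bytes) -> str:
    """Pick the language whose corpus grows least when `sample` is appended.

    `corpora` maps a language label (hypothetical, e.g. "en") to a block
    of reference text in that language.
    """
    increases = {}
    for lang, corpus in corpora.items():
        # Step 1: compress the reference text alone and note its size.
        base = compressed_size(corpus)
        # Step 2: append the sample, compress again, note the new size.
        combined = compressed_size(corpus + sample)
        increases[lang] = combined - base
    # Step 3: the smallest increase marks the best statistical match.
    return min(increases, key=increases.get)


if __name__ == "__main__":
    # Toy reference texts; real use would need far larger corpora.
    corpora = {
        "en": b"the quick brown fox jumps over the lazy dog " * 20,
        "de": b"der schnelle braune fuchs springt ueber den faulen hund " * 20,
    }
    print(identify_language(corpora, b"the lazy dog jumps over the fox"))
```

The difference in compressed size is used instead of the compressed size of the sample alone because the compressor's model is built from the reference text, which is what lets the reference text's statistics bear on the sample.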

From:	Paul Bennett <paul-bennett@...>
Date:	Wednesday, August 31, 2005, 9:28