A way to identify languages algorithmically (was: Re: Online Language Identifier)
From: Yahya Abdal-Aziz <yahya@...>
Date: Thursday, September 8, 2005, 5:18
Date: Wed, 31 Aug 2005 09:42:08 +0100
From: Peter Bleackley
Subject: Re: Online Language Identifier
> I don't know if this is what Xerox are doing, but one way of performing
> language identification is as follows
> 1) Take large files of text in each of your candidate languages. Run
> each of them through a data compression algorithm, which uses the
> statistical properties of the data to improve the efficiency of its
> binary representation. Note the file sizes produced.
> 2) Append your sample text to each of the uncompressed files.
> Compress them again and compare the new file sizes with the old ones.
> 3) The language for which the compressed file size shows the smallest
> increase is the one whose statistical properties best match those of
> the sample text.
I think I recall seeing something like this a long time ago, but can't
remember where. The logic behind this approach is that when the reference
file is in the same language as the sample, it will already contain most of
the words in the sample, so the compressor has to add fewer entries to its
"dictionary" of popular words. OTOH, in every other language, the reference
file will contain few - or even none - of the words in the sample, so the
compressor has to add more entries to its dictionary. Similar logic applies
for those compression algorithms that don't use explicit dictionaries, but
still record the most popular character strings.
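
For anyone who wants to try the idea without juggling zip files by hand,
here is a minimal sketch in Python. It is only my reading of the three
steps quoted above, not whatever Xerox does. It uses zlib (the same deflate
method that zip uses); the function names and the dictionary of reference
texts are made up for illustration.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Size in bytes of the data after deflate compression at maximum level."""
    return len(zlib.compress(data, 9))


def size_increase(reference: bytes, sample: bytes) -> int:
    """How much the compressed reference grows when the sample is appended."""
    return compressed_size(reference + sample) - compressed_size(reference)


def identify(sample: bytes, references: Mapping[str, bytes]) -> str:
    """Return the language whose reference text grows least (hypothetical helper)."""
    return min(references, key=lambda lang: size_increase(references[lang], sample))


if __name__ == "__main__":
    # Tiny stand-in reference texts; a real test would use whole books.
    refs = {
        "en": b"the quick brown fox jumps over the lazy dog and the cat sat on the mat " * 20,
        "es": b"el rapido zorro marron salta sobre el perro perezoso y el gato se sento " * 20,
    }
    print(identify(b"the lazy dog sat on the mat and the quick cat jumps over the brown fox", refs))
```

One thing to keep in mind: deflate only looks back 32 KB, so with a
book-sized reference file it is mostly the tail of the reference that can
match the sample directly.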
As an experiment, I zipped the text of Miguel de Cervantes' "Don Quijote"
in Spanish and in English, took the Prólogo (The Author's Preface) as the
sample text in each language, appended each sample to each (full book)
text file and zipped them again.
The original files are es.txt and en.txt; their zips are es.zip and en.zip.
The Prologue is es Prólogo.txt and en The Author's Preface.txt.
In a (Windows XP) command window, I issued these commands:
> copy /b es.txt+"es Prólogo.txt" eses.txt
> copy /b es.txt+"en The Author's Preface.txt" esen.txt
> copy /b en.txt+"es Prólogo.txt" enes.txt
> copy /b en.txt+"en The Author's Preface.txt" enen.txt
I then zipped each of these four text files.
The resulting file sizes, in bytes, are:
13,632 es Prólogo.txt
13,995 en The Author's Preface.txt
With sample file es Prólogo.txt, the Spanish zip expands from es.zip
(761,481) to eses.zip (766,749), an increase of 5,268. The English zip
expands from en.zip (797,042) to enes.zip (802,972), an increase of
5,930. So Spanish best compresses Spanish.
With sample file en The Author's Preface.txt, the Spanish zip expands
from es.zip (761,481) to esen.zip (767,763), an increase of 6,282. The
English zip expands from en.zip (797,042) to enen.zip (802,494), an
increase of 5,452. So English best compresses English.
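
To double-check the subtraction, here is a small Python sketch that uses
only the zip sizes reported above. The variable and function names are
mine, chosen for illustration.

```python
# Zip sizes in bytes, exactly as reported in this post.
base = {"es": 761_481, "en": 797_042}  # es.zip, en.zip

# Keys are (reference language, sample language).
combined = {
    ("es", "es"): 766_749,  # eses.zip
    ("en", "es"): 802_972,  # enes.zip
    ("es", "en"): 767_763,  # esen.zip
    ("en", "en"): 802_494,  # enen.zip
}


def increases(sample_lang):
    """Growth of each reference zip after appending the given sample."""
    return {ref: combined[(ref, sample_lang)] - base[ref] for ref in base}


def best_match(sample_lang):
    """Reference language whose zip grew the least for this sample."""
    inc = increases(sample_lang)
    return min(inc, key=inc.get)


if __name__ == "__main__":
    for sample in ("es", "en"):
        print(sample, increases(sample), "->", best_match(sample))
```

In both cases the smallest increase belongs to the reference file in the
sample's own language, by a margin of 662 bytes for the Spanish sample and
830 bytes for the English one.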
So it seems to work quite well in this limited experiment. I might
note that the tone and mood of the Prólogo is quite unlike that of
the body of the novel.
I spent some time looking at the Xerox site, but I don't think they
want to tell us how their methods work. Patents, you know; money and such ...