Conlang: A way to identify languages algorithmically (was: Re: Online Language Identifier) (Yahya Abdal-Aziz, Sep 8 '05, 5:18)

A way to identify languages algorithmically (was: Re: Online Language Identifier)

From:	Yahya Abdal-Aziz <yahya@...>
Date:	Thursday, September 8, 2005, 5:18

Original message:
--------------------
Date: Wed, 31 Aug 2005 09:42:08 +0100
From: Peter Bleackley
Subject: Re: Online Language Identifier

[much clipt]

> I don't know if this is what Xerox are doing, but one way of performing
> language identification is as follows
>
> 1) Take large files of text in each of your candidate languages. Run
> each of them through a data compression algorithm, which uses the
> statistical properties of the data to improve the efficiency of its
> binary representation. Note the file sizes produced.
> 2) Append your sample text to each of the uncompressed files.
> Compress them again and compare the new file sizes with the old ones.
> 3) The language for which the compressed file size shows the smallest
> increase is the one whose statistical properties best match those of
> the sample text.
>
> Pete
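The three quoted steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Xerox's method and not necessarily what Pete had in mind: it uses the standard zlib module (DEFLATE, the same family of compression as zip), and the function names compressed_size, size_increase and identify_language are made up for this sketch. Note that DEFLATE only looks back about 32 KB, so with large reference files it is mostly the tail of each file that gets compared with the sample.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed form of data (step 1)."""
    return len(zlib.compress(data, 9))


def size_increase(reference: bytes, sample: bytes) -> int:
    """Growth of the compressed reference when the sample is appended (step 2)."""
    return compressed_size(reference + sample) - compressed_size(reference)


def identify_language(references: Mapping[str, bytes], sample: bytes) -> str:
    """Return the language whose reference text grows least (step 3)."""
    return min(references, key=lambda lang: size_increase(references[lang], sample))


if __name__ == "__main__":
    # Toy reference texts, hypothetical and far too small for real use.
    refs = {
        "en": b"the quick brown fox jumps over the lazy dog and the cat "
              b"sleeps in the house by the river ",
        "es": b"el rapido zorro marron salta sobre el perro perezoso y el "
              b"gato duerme en la casa junto al rio ",
    }
    print(identify_language(refs, b"el gato salta sobre el perro en la casa"))
```

In practice the reference texts would be read from large files, one per candidate language, and passed in as bytes.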

Hi all,

I think I recall seeing something like this a long time ago, but can't recall where.

The logic behind this approach is that when your reference file is in the same language as the sample, it will already contain most of the words in the sample, so the compressor will have to add fewer new entries to its "dictionary" of popular words. OTOH, in every other language, the reference file will contain few - or even none - of the words in the sample, so the compressor will have to add more entries to its dictionary. Similar logic applies for those compression algorithms that don't use explicit dictionaries, but still record the most popular character strings.

As an experiment, I zipped the text of Miguel de Cervantes' "Don Quijote" in Spanish and in English, took the Prólogo (The Author's Preface) as the sample text in each language, added each sample to the back of each (full book) text file and zipped them again.

The original files are es.txt and en.txt; their zips are es.zip and en.zip. The sample files are "es Prólogo.txt" and "en The Author's Preface.txt". In a (Windows XP) command window, I issued these commands:

> copy /b es.txt+"es Prólogo.txt" eses.txt
> copy /b es.txt+"en The Author's Preface.txt" esen.txt
> copy /b en.txt+"es Prólogo.txt" enes.txt
> copy /b en.txt+"en The Author's Preface.txt" enen.txt

I then zipped each of these four text files. The resulting file sizes, in bytes, are:

2,098,246  es.txt
2,184,744  en.txt
  797,042  en.zip
  761,481  es.zip
   13,632  es Prólogo.txt
   13,995  en The Author's Preface.txt
2,111,879  eses.txt
2,198,377  enes.txt
2,198,740  enen.txt
2,112,242  esen.txt
  766,749  eses.zip
  802,972  enes.zip
  802,494  enen.zip
  767,763  esen.zip

With sample file es Prólogo.txt, the Spanish zip expands from es.zip (761,481) to eses.zip (766,749), an increase of 5,268. The English zip expands from en.zip (797,042) to enes.zip (802,972), an increase of 5,930. So Spanish best compresses Spanish.

With sample file en The Author's Preface.txt, the Spanish zip expands from es.zip (761,481) to esen.zip (767,763), an increase of 6,282. The English zip expands from en.zip (797,042) to enen.zip (802,494), an increase of 5,452. So English best compresses English.

So it seems to work quite well in this limited experiment. I might note that the tone and mood of the Prólogo are quite unlike those of the body of the novel.

I spent some time looking at the Xerox site, but I don't think they want to tell us how their methods work. Patents, you know; money and such ...

Regards,
Yahya
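The arithmetic on the zip sizes can be re-checked with a short Python sketch. The byte counts are the ones reported in the message; the names base_zip, combined_zip, increases and best_match are made up for this illustration.

```python
# Zip sizes in bytes, as reported in the message.
base_zip = {"es": 761_481, "en": 797_042}

# Keys are (reference language, sample language).
combined_zip = {
    ("es", "es"): 766_749,  # eses.zip
    ("en", "es"): 802_972,  # enes.zip
    ("en", "en"): 802_494,  # enen.zip
    ("es", "en"): 767_763,  # esen.zip
}

# Increase in compressed size caused by appending each sample.
increases = {
    (ref, sample): size - base_zip[ref]
    for (ref, sample), size in combined_zip.items()
}

# For each sample, pick the reference language with the smallest increase.
best_match = {
    sample: min(base_zip, key=lambda ref: increases[(ref, sample)])
    for sample in ("es", "en")
}

if __name__ == "__main__":
    for (ref, sample), delta in sorted(increases.items()):
        print(f"reference {ref} + sample {sample}: +{delta} bytes")
    print(best_match)
```

The margins are modest (5,268 against 5,930 bytes for the Spanish sample, 5,452 against 6,282 for the English one), so this checks the sums rather than showing how robust the method is.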

Replies

taliesin the storyteller <taliesin-conlang@...>