
Re: Online Language Identifier

From: Peter Bleackley <peter.bleackley@...>
Date: Wednesday, August 31, 2005, 8:42
At 07:42 31/08/2005, you wrote:
>>Anyway, try it out! It's great fun! Plus, this might help out the
>>"What language is this song/text in?" threads. I hear for real
>>languages it's pretty accurate.
>For Seinundjé (the full Litany against Fear text, and, separately,
>the Monastery Key text in archaic dialect) it guesses Hungarian,
>which is not the first time someone's suggested Sein' looks like an
>eastern European language.
I used that text myself - not really surprising, as I did write it. As I mentioned, Khangaþyagon came out as Turkish_iso9. In some ways, I suppose it's not surprising that it came out as another agglutinating language.

I don't know if this is what Xerox are doing, but one way of performing language identification is as follows:

1) Take large files of text in each of your candidate languages. Run each of them through a data compression algorithm, which uses the statistical properties of the data to improve the efficiency of its binary representation. Note the file sizes produced.

2) Append your sample text to each of the uncompressed files. Compress them again and compare the new file sizes with the old ones.

3) The language for which the compressed file size shows the smallest increase is the one whose statistical properties best match those of the sample text.

I'm back from the wilds of NOMAIL, by the way.

Pete
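
The three steps above can be sketched roughly as follows in Python. This is only an illustration under assumptions: it uses zlib as the compressor (any general-purpose compressor would do), the tiny CORPORA strings are made-up stand-ins for the "large files" of step 1, and the names compressed_size and identify are hypothetical. It is not a claim about what Xerox actually does.

```python
import zlib

# Hypothetical stand-ins for the large reference files of step 1.
# Real use would need far more text per language.
CORPORA = {
    "english": (
        "the quick brown fox jumps over the lazy dog and the cat sat on "
        "the mat while the rain fell on the old grey town"
    ),
    "german": (
        "der schnelle braune fuchs springt über den faulen hund und die "
        "katze saß auf der matte während der regen auf die alte graue "
        "stadt fiel"
    ),
}


def compressed_size(text: str) -> int:
    """Return the size in bytes of the zlib-compressed UTF-8 text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def identify(sample: str, corpora: dict) -> tuple:
    """Return (best_language, size_increase_per_language)."""
    increases = {}
    for lang, corpus in corpora.items():
        # Step 1: compress the reference text alone and note its size.
        base = compressed_size(corpus)
        # Step 2: append the sample, compress again, record the growth.
        combined = compressed_size(corpus + "\n" + sample)
        increases[lang] = combined - base
    # Step 3: the smallest increase marks the best statistical match.
    best = min(increases, key=increases.get)
    return best, increases


if __name__ == "__main__":
    language, deltas = identify("the cat sat on the mat", CORPORA)
    print(language, deltas)
```

With corpora this small the byte differences are only a few bytes either way, so results for anything other than near-verbatim samples should not be trusted; the method relies on the reference files being large enough for the compressor to learn each language's statistics.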


Paul Bennett <paul-bennett@...>