Re: TECH: Language Detection

From: Henrik Theiling <theiling@...>
Date: Thursday, July 6, 2006, 11:55

Hi!

For longer texts, a good way of determining the language seems to be
to append the text to a reference text in a known language and
compress the result with a standard compression algorithm.  Repeat
this for every candidate language and pick the one whose reference
text lets the appended text compress best, i.e. adds the fewest bytes
to the compressed size.
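
The compression test described above could be sketched in Python
roughly like this.  It is only an illustration: the function names are
made up, zlib is just one possible "standard compression algorithm",
and the reference texts are assumed to be supplied by the caller (real
ones would need to be far longer than anything shown here).

```python
import zlib
from typing import Mapping


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(reference: str, sample: str) -> int:
    """How many bytes the sample adds when appended to the reference.

    The more the sample resembles the reference, the more of it the
    compressor can express as back-references, and the smaller this is.
    """
    return compressed_size(reference + " " + sample) - compressed_size(reference)


def guess_language(sample: str, references: Mapping[str, str]) -> str:
    """Return the key of the reference text that compresses the sample best.

    `references` maps a language label to a reference text in that
    language (hypothetical input, chosen by the caller).
    """
    return min(references, key=lambda lang: extra_bytes(references[lang], sample))
```

One thing to keep in mind with zlib in particular: DEFLATE only looks
back 32 KiB, so reference texts longer than that bring no further
benefit with this compressor.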

This was posted a while ago here, and I think one of our fellow
conlangers programmed it this way.  But the method needs enough text
to work: a short text gives the compressor too little to go on.  For
shorter texts, I'd first try to find typical words using a lexicon,
thereby reducing the number of languages to distinguish, and then run
the compression test as a last resort.
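
A lexicon pre-filter of this kind might look like the sketch below.
The function name, the `min_hits` threshold and the shape of the
lexicons (one set of typical lower-case words per language) are
assumptions made for the example, not a tested design.

```python
import re
from typing import AbstractSet, List, Mapping


def candidate_languages(
    sample: str,
    lexicons: Mapping[str, AbstractSet[str]],
    min_hits: int = 1,
) -> List[str]:
    """Narrow down the languages worth testing further.

    `lexicons` maps a language label to a set of typical words in
    lower case (hypothetical data supplied by the caller).  A language
    stays a candidate if at least `min_hits` distinct words of the
    sample occur in its lexicon.  If no language qualifies, all of
    them are returned, since the lexicon gave no evidence either way.
    """
    words = set(re.findall(r"\w+", sample.lower()))
    kept = [
        lang
        for lang, vocab in lexicons.items()
        if len(words & vocab) >= min_hits
    ]
    return sorted(kept) if kept else sorted(lexicons)
```

If more than one language is left over, those are the ones to hand to
the compression test.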

Also, you'll depend on text normalisation -- e.g. in emails, you
might encounter "a written for ä in German.  Without normalisation,
the above technique will probably fail.
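
As an illustration of such a normalisation step for German, here is a
small Python sketch.  The mapping table and the function name are my
own assumptions; it only covers the "a style of writing umlauts and
would need extending for other conventions and other languages.

```python
import unicodedata

# ASCII spellings seen in email, mapped to the intended letters.
# Illustrative table only: it covers just the quote-plus-letter style.
QUOTE_UMLAUTS = {
    '"a': "ä", '"o': "ö", '"u': "ü",
    '"A': "Ä", '"O': "Ö", '"U': "Ü",
    '"s': "ß",
}


def normalise_german(text: str) -> str:
    """Replace quote-style umlauts and unify the Unicode representation."""
    for ascii_form, letter in QUOTE_UMLAUTS.items():
        text = text.replace(ascii_form, letter)
    # NFC merges a base letter plus combining diaeresis into one code
    # point, so that both spellings of ä compare and compress alike.
    return unicodedata.normalize("NFC", text)
```

Note that this naive replacement will also mangle a real quotation
mark that happens to stand before one of these letters, so it should
only be applied where the "a convention is actually expected.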

**Henrik