Re: TECH: Language Detection
From: Henrik Theiling <theiling@...>
Date: Thursday, July 6, 2006, 11:55
Hi!
For longer texts, a good way of determining the language seems to be
to append the text to a reference text in a known language and
compress the result with a standard compression algorithm. Repeat
this for every language in question and pick the one for which the
appended text compresses best.
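
The compression test described above could be sketched roughly as
follows. This is only an illustration, not the program mentioned in
this thread: the choice of zlib, the function names and the idea of
measuring the size difference (rather than the total size) are all
assumptions made for the example.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Size in bytes of data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def extra_bytes(reference: str, sample: str) -> int:
    """How many more compressed bytes the sample costs when appended
    to the reference text. Smaller means the sample resembles the
    reference more closely."""
    ref = reference.encode("utf-8")
    both = ref + sample.encode("utf-8")
    return compressed_size(both) - compressed_size(ref)


def detect_language(sample: str, references: Mapping[str, str]) -> str:
    """Return the key of the reference text against which the sample
    compresses best. `references` maps a language name to a reference
    text in that language (hypothetical input, supplied by the caller)."""
    return min(references, key=lambda lang: extra_bytes(references[lang], sample))
```

Taking the difference instead of the total size keeps reference texts
of different lengths comparable. Note that zlib only looks back 32 KB,
so only the tail of a longer reference text can help the sample.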
This was posted here a while ago, and I think one of our fellow
conlangers programmed it this way. The limitation is the length of
the text: short texts give the compressor too little to work with.
For shorter texts, I'd first look for typical words using a lexicon,
thereby reducing the number of candidate languages, and then run the
compression test as a last resort.
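
The lexicon step for short texts might look something like this. It
is a minimal sketch under assumptions: the lexicons are plain sets of
lower-case words supplied by the caller, and the name
`candidate_languages` and the `min_hits` threshold are invented for
the example.

```python
import re
from typing import List, Mapping, Set


def candidate_languages(
    sample: str,
    lexicons: Mapping[str, Set[str]],
    min_hits: int = 1,
) -> List[str]:
    """Return the languages whose lexicon contains at least `min_hits`
    distinct words of the sample. If no language qualifies, return all
    of them, since the lexicon gave no evidence either way."""
    words = set(re.findall(r"\w+", sample.lower()))
    hits = {lang: len(words & lexicon) for lang, lexicon in lexicons.items()}
    matching = sorted(lang for lang, n in hits.items() if n >= min_hits)
    return matching if matching else sorted(lexicons)
```

The languages that remain would then be passed on to the compression
test, which only has to choose among those few.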
Also, you'll depend on text normalisation -- e.g. in emails, you
might encounter "a for ä in German. Without normalisation, the above
technique will probably fail.
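
As an illustration of such a normalisation step, here is a small
sketch that rewrites the "a style of spelling into real umlauts before
any comparison is made. The mapping table and the function name are
assumptions for the example; a real setup would need rules for each
language and each transliteration habit it expects.

```python
import re
import unicodedata

# Assumed convention: a double quote followed by a vowel (or s)
# stands for the umlauted letter (or sharp s).
_QUOTE_UMLAUTS = {
    "a": "ä", "o": "ö", "u": "ü",
    "A": "Ä", "O": "Ö", "U": "Ü",
    "s": "ß",
}


def normalise_german(text: str) -> str:
    """Replace "a, "o, "u, "s etc. by ä, ö, ü, ß and bring the result
    into Unicode NFC form, so that equivalent spellings compare equal."""
    text = re.sub(r'"([aouAOUs])', lambda m: _QUOTE_UMLAUTS[m.group(1)], text)
    return unicodedata.normalize("NFC", text)
```

This rule is ambiguous: a genuine quotation that starts with one of
these letters would be altered as well, so it is probably best applied
only when German is already among the candidates.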
**Henrik