Re: TECH: Language Detection
From: taliesin the storyteller <taliesin-conlang@...>
Date: Saturday, July 8, 2006, 22:27
* Arthaey Angosii said on 2006-07-05 21:47:43 +0200
> I'm working on a project to automatically detect what language some
> text is in. Said text is really more like a phrase at a time, with a
> high percentage of proper nouns.
Grefenstette, Gregory (1995). "Comparing two language identification
schemes". Proceedings of JADT, 3rd International Conference on
Statistical Analysis of Textual Data, Rome.
In short: stopwords (common words) are the safer choice for short texts,
while trigrams do just as well on longer texts. Both are easily derived
from known good data. If your text really is that noun-heavy, it might be
worth a paper to report how well the two methods work on such text.
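
To make the two approaches concrete, here is a rough Python sketch of both.
It is not the procedure from the paper, just a minimal illustration under
stated assumptions: the two sample sentences are toy data invented for this
example, the stopword list is simply the five most frequent words of each
sample, and the trigram score is a plain sum of relative frequencies. All
names (SAMPLES, guess_by_stopwords, guess_by_trigrams) are hypothetical.

```python
import re
from collections import Counter

# Toy "known good" samples, invented for this sketch. Real profiles
# would need far more text per language.
SAMPLES = {
    "english": "the cat is on the table and the dog is in the garden "
               "it is a good day to read the book of the old man "
               "and to walk in the park",
    "german": "die katze ist auf dem tisch und der hund ist in dem garten "
              "es ist ein guter tag um das buch von dem alten mann zu lesen "
              "und in dem park zu gehen",
}

def words(text):
    # Lowercased runs of letters; digits and punctuation are dropped.
    return re.findall(r"[^\W\d_]+", text.lower())

def trigrams(text):
    # Character trigrams over the space-joined words, padded with one
    # space on each side so word boundaries count too.
    padded = " " + " ".join(words(text)) + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def stopword_profile(sample, top_n=5):
    # Assumption: "stopwords" = the top_n most frequent words of the sample.
    return {w for w, _ in Counter(words(sample)).most_common(top_n)}

def trigram_profile(sample):
    # Relative frequency of each trigram in the sample.
    counts = Counter(trigrams(sample))
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

STOPWORDS = {lang: stopword_profile(s) for lang, s in SAMPLES.items()}
TRIGRAMS = {lang: trigram_profile(s) for lang, s in SAMPLES.items()}

def guess_by_stopwords(text):
    # Score = number of tokens that are stopwords of the language.
    tokens = words(text)
    scores = {lang: sum(t in sw for t in tokens)
              for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def guess_by_trigrams(text):
    # Score = sum of the profile frequencies of the text's trigrams
    # (a simplification; unseen trigrams just contribute nothing).
    grams = trigrams(text)
    scores = {lang: sum(p.get(g, 0.0) for g in grams)
              for lang, p in TRIGRAMS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(guess_by_stopwords("the book is in the park"))
print(guess_by_stopwords("Arthaey Angosii"))
print(guess_by_trigrams("der hund ist in dem garten"))
```

Note what happens with a phrase made only of proper nouns: the stopword
method finds no hits at all and returns None, which is exactly the
noun-heavy case raised above. The trigram method will still produce a
guess for such input, but how trustworthy that guess is would have to be
measured on real data.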