Re: TECH: Language Detection

From: taliesin the storyteller <taliesin-conlang@...>
Date: Saturday, July 8, 2006, 22:27
* Arthaey Angosii said on 2006-07-05 21:47:43 +0200
> I'm working on a project to automatically detect what language some
> text is in. Said text is really more like a phrase at a time, with a
> high percentage of proper nouns.

Grefenstette, Gregory (1995) "Comparing two language identification
schemes" Proceedings of JADT, 3rd International conference on
Statistical Analysis of Textual Data, Rome
http://www.xrce.xerox.com/competencies/content-analysis/tools/publis/jadt.ps

In short: using stopwords (common words) is safer for short texts,
using trigrams is just as good for longer texts; both are easily
derived from known good data.

If your text is so noun-heavy it might be worth a paper to tell how
well these two methods work on such text.

HTH,

t.
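
For illustration, here is a minimal Python sketch of the two schemes mentioned above: one counts hits against a per-language list of common words, the other scores character trigrams against a per-language frequency profile. Both profiles are derived from sample text. The sample sentences, the function names, the `top_n` cutoff and the scoring rule (a plain sum of relative trigram frequencies) are all assumptions made for this example; they are not taken from the paper, and real profiles would need far more known good data.

```python
from collections import Counter

# Hypothetical toy training data; real profiles need much larger samples.
SAMPLES = {
    "en": "the cat and the dog are in the house and it is not for the man of the town",
    "fr": "le chat et le chien sont dans la maison et il est pas pour les hommes de la ville",
}

def words(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

def trigrams(text):
    """Character trigrams, with a space marking each word boundary."""
    padded = " " + " ".join(words(text)) + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def build_profiles(samples, top_n=10):
    """Derive a stopword set and a trigram frequency table per language."""
    stop, tri = {}, {}
    for lang, text in samples.items():
        word_counts = Counter(words(text))
        stop[lang] = {w for w, _ in word_counts.most_common(top_n)}
        gram_counts = Counter(trigrams(text))
        total = sum(gram_counts.values())
        tri[lang] = {g: c / total for g, c in gram_counts.items()}
    return stop, tri

def guess_by_stopwords(text, stop):
    """Pick the language whose common words occur most often in the text."""
    tokens = words(text)
    scores = {lang: sum(t in sw for t in tokens) for lang, sw in stop.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # None: no common word seen

def guess_by_trigrams(text, tri):
    """Pick the language whose trigram profile best covers the text."""
    grams = trigrams(text)
    scores = {lang: sum(p.get(g, 0.0) for g in grams) for lang, p in tri.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

stop_profiles, trigram_profiles = build_profiles(SAMPLES)
print(guess_by_stopwords("the man and the boat", stop_profiles))
print(guess_by_trigrams("le chien et la maison", trigram_profiles))
print(guess_by_stopwords("Arthaey Angosii", stop_profiles))
```

The last call shows the noun-heavy problem: a phrase made only of proper nouns contains no common words, so the stopword method has nothing to go on and returns None here.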