Re: TECH: Language Detection
From: taliesin the storyteller <taliesin-conlang@...>
Date: Saturday, July 8, 2006, 22:27
* Arthaey Angosii said on 2006-07-05 21:47:43 +0200
> I'm working on a project to automatically detect what language some
> text is in. Said text is really more like a phrase at a time, with a
> high percentage of proper nouns.
Grefenstette, Gregory (1995) "Comparing two language identification schemes"
Proceedings of JADT, 3rd International Conference on Statistical
Analysis of Textual Data, Rome.
http://www.xrce.xerox.com/competencies/content-analysis/tools/publis/jadt.ps
In short: stopwords (common words) are the safer choice for short texts,
while trigrams do just as well on longer texts; both are easily derived
from known-good data. If your text is that noun-heavy, it might be worth a
paper to report how well these two methods work on such text.
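
To make the two schemes concrete, here is a rough sketch in Python. It is not
the paper's exact scoring. The sample sentences are toy stand-ins for
"known good data", and the names (SAMPLES, guess_by_stopwords,
guess_by_trigrams) are made up for illustration. The trigram score is a
simple frequency-weighted overlap, which is an assumption on my part, and
real profiles would need far more text per language.

```python
from collections import Counter

# Toy "known good" samples (hypothetical, far too small for real use).
SAMPLES = {
    "english": ("the cat sat on the mat and the dog lay in the sun "
                "and the bird sat in the tree and sang on and on"),
    "french": ("le chat est sur le tapis et le chien est dans le jardin "
               "et le chat est dans la maison et il dort"),
}

def trigrams(text):
    """Count character trigrams, padding the text with one space at each end."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def top_words(text, n=5):
    """Take the n most frequent words as a stand-in stopword list."""
    return {word for word, _ in Counter(text.lower().split()).most_common(n)}

# Both profiles are derived from the same sample text.
STOPWORDS = {lang: top_words(sample) for lang, sample in SAMPLES.items()}
TRIGRAMS = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}

def guess_by_stopwords(text):
    """Pick the language whose stopword list matches the most tokens."""
    words = text.lower().split()
    scores = {lang: sum(w in stop for w in words)
              for lang, stop in STOPWORDS.items()}
    return max(scores, key=scores.get)

def guess_by_trigrams(text):
    """Pick the language whose trigram profile overlaps most with the text."""
    grams = trigrams(text)
    scores = {}
    for lang, profile in TRIGRAMS.items():
        total = sum(profile.values())
        # Weight each trigram in the text by its relative frequency in the profile.
        scores[lang] = sum(n * profile[g] / total for g, n in grams.items())
    return max(scores, key=scores.get)

if __name__ == "__main__":
    for phrase in ("the bird and the tree", "le chien est dans la maison"):
        print(phrase, "->", guess_by_stopwords(phrase), guess_by_trigrams(phrase))
```

A phrase made up mostly of proper nouns would hit few or no stopwords, so the
stopword guess would fall back to whichever language comes first in a tie.
That is the case where comparing it against the trigram guess gets interesting.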
HTH,
t.