Re: TECH: Language Detection

From:	taliesin the storyteller <taliesin-conlang@...>
Date:	Saturday, July 8, 2006, 22:27


* Arthaey Angosii said on 2006-07-05 21:47:43 +0200
> I'm working on a project to automatically detect what language some
> text is in. Said text is really more like a phrase at a time, with a
> high percentage of proper nouns.

Grefenstette, Gregory (1995). "Comparing two language identification
schemes". Proceedings of JADT, 3rd International Conference on
Statistical Analysis of Textual Data, Rome.

http://www.xrce.xerox.com/competencies/content-analysis/tools/publis/jadt.ps

In short: stopwords (common words) are the safer choice for short texts,
while trigrams do just as well on longer texts; both are easily derived
from known good data. If your text is that noun-heavy, it might be worth
writing a paper on how well these two methods work on such text.
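
A rough sketch of the two approaches in Python, not the paper's actual
procedure. The two sample sentences, the top-5 cutoff for stopwords, the
overlap-count scoring and all function names are assumptions made up for
illustration; real use would need far more "known good data" per language.

```python
from collections import Counter

# Toy "known good data" (assumption): one short sample per language.
SAMPLES = {
    "en": "the cat sat on the mat and the dog lay in the sun with a bone of its own",
    "de": "die katze sitzt auf der matte und der hund liegt in der sonne mit einem knochen",
}


def top_words(text, n=5):
    """Treat the n most frequent words of a sample as its stopwords."""
    return {word for word, _ in Counter(text.split()).most_common(n)}


def trigrams(text):
    """Count character trigrams, padding with spaces to mark word edges."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


STOPWORDS = {lang: top_words(text) for lang, text in SAMPLES.items()}
PROFILES = {lang: trigrams(text) for lang, text in SAMPLES.items()}


def best_or_none(scores):
    """Return the top-scoring language, or None if nothing matched at all."""
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def guess_by_stopwords(phrase):
    """Score each language by how many words of the phrase are its stopwords."""
    words = phrase.lower().split()
    return best_or_none(
        {lang: sum(w in stop for w in words) for lang, stop in STOPWORDS.items()}
    )


def guess_by_trigrams(phrase):
    """Score each language by trigram overlap with its sample profile."""
    grams = trigrams(phrase)
    return best_or_none(
        {
            lang: sum(min(count, profile[g]) for g, count in grams.items())
            for lang, profile in PROFILES.items()
        }
    )
```

A phrase made up only of proper nouns, such as "Arthaey Angosii", contains
no stopwords, so guess_by_stopwords() returns None for it; that is the
noun-heavy case the original question is worried about.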

HTH,


t.
