Re: TECH: Language Detection

From:Gary Shannon <fiziwig@...>
Date:Thursday, July 6, 2006, 7:10
I wrote one many, many years ago, but I don't have the
code any more. As I recall it took a surprisingly small
number of the most common words from each language. If
you find "the" and "and" you can be virtually certain
it's English. If you find "der", "und", "wir", or
"ich"... etc.

Perhaps 10 or 20 words, or if the text is short, maybe
as many as a couple hundred of the very most common
words from each target language should suffice. Start
with the word for "and", any definite or indefinite
articles, if the language has them, common titles like
Mr., Mrs., "senior", common high-use words like "mit",
"auf", "to", "with", "uno", "mas"...
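
A minimal sketch of the word-list idea in Python, not the original code (which no longer exists). The tiny word lists and the names COMMON_WORDS and guess_language are made up for illustration; a real detector would want a larger list per language and some handling for ties.

```python
import re

# Hypothetical word lists: a handful of very common words per language.
# These are illustrative only and far smaller than a real detector needs.
COMMON_WORDS = {
    "english": {"the", "and", "to", "with", "of", "a"},
    "german": {"der", "und", "wir", "ich", "mit", "auf"},
    "spanish": {"el", "la", "y", "uno", "mas", "con"},
}


def guess_language(text):
    """Return the language whose common words occur most often, or None."""
    # Split into lowercase runs of letters (Unicode-aware, no digits/underscores).
    words = re.findall(r"[^\W\d_]+", text.lower())
    # Count how many tokens fall in each language's common-word list.
    scores = {
        lang: sum(1 for w in words if w in vocab)
        for lang, vocab in COMMON_WORDS.items()
    }
    # On a tie, max() keeps whichever language comes first in COMMON_WORDS.
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_language("The cat and the dog went to the park"))
print(guess_language("Wir gehen mit der Katze auf den Berg"))
```

Counting hits instead of stopping at the first match keeps a single shared word from deciding the result, which matters more for short, phrase-length input.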

One way to try out your selection of words is to do a
Google search on a half-dozen words in some language
and see if it brings up only pages in that language.

Just for fun, I tried Googling "siku mimi nyumba" just
now, and every hit was, indeed, in Swahili. And if
you Google "der ihr uns", virtually every page that
pops up is in German, or has German text on the page.

--gary

--- Arthaey Angosii <arthaey@...> wrote:

> Since we have a fair number of programmers here, I figured it was a
> good place to ask my question. :)
>
> I'm working on a project to automatically detect what language some
> text is in. Said text is really more like a phrase at a time, with a
> high percentage of proper nouns.
>
> Do any of you have any experience with programmatic language
> detection? I'll probably be using character and n-gram frequencies,
> perhaps supplemented by a custom dictionary (so the proper nouns that
> reoccur can be used to increase accuracy in the future).
>
> Any other techniques I should consider, or common pitfalls I should
> avoid?
>
> Thanks!
>
> --
> AA
> http://conlang.arthaey.com
