From: "Mark P. Line" <mark@...>
> I was reading a TR the other day about collecting minority-language corpus
> material from the Web. They were going through all kinds of high-falootin'
> machine learning shenanigans, which was the point of their (computer
> science) research -- but it struck me that it doesn't have to be that hard
> if you know something about the language you're looking for.
...
> I spot-checked the first 500 hits for "ang mga", and there wasn't a single
> page that wasn't in Tagalog (very high precision), although a couple
> seemed to be mixed Tagalog/English. The query pulled up 66,200 pages --
> which is probably meaningless except for comparison with possible
> alternative queries that might pull in even more hits. I didn't check for
> deadwood.
...
I got 2.78 million hits for Arabic /la:/ 'no' (a ligature) with pretty high
precision. For Hindi, there are 40,700 pages with /hai/ 'he/she/it is', but
there may be some other Devanagari-script languages involved.
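
For what it's worth, the checking step behind numbers like these can be sketched in a few lines of Python. This is only an illustration of the idea in the thread, not anything from the TR: the function names (contains_marker, spot_check_precision) are made up here, and the only marker phrase taken from the discussion is "ang mga". It assumes you already have page text in hand from some search or crawl.

```python
import re

def contains_marker(text, phrase):
    """Return True if `phrase` occurs in `text` as whole words.

    Case-insensitive; any run of whitespace between the words is accepted.
    """
    words = r"\s+".join(re.escape(w) for w in phrase.split())
    pattern = r"(?<!\w)" + words + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def spot_check_precision(judgements):
    """Fraction of spot-checked hits judged to be in the target language.

    `judgements` is a list of booleans, one per page inspected by hand.
    """
    if not judgements:
        return 0.0
    return sum(judgements) / len(judgements)

# Hypothetical usage: 500 hits inspected, all judged to be Tagalog.
precision = spot_check_precision([True] * 500)
```

The lookarounds are a crude word-boundary test that is adequate for Latin-script markers like "ang mga". For Arabic or Devanagari markers the boundary handling would probably need more care, since combining vowel signs are not counted as word characters by Python's `\w`.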