magic natlang corpus harvesting
From: Mark P. Line <mark@...>
Date: Thursday, May 27, 2004, 3:50
I was reading a TR the other day about collecting minority-language corpus
material from the Web. They were going through all kinds of high-falootin'
machine learning shenanigans, which was the point of their (computer
science) research -- but it struck me that it doesn't have to be that hard
if you know something about the language you're looking for.
I googled for "ang mga" (including the quotes, so it looks for this as a
phrase) and pulled up the Tagalog Subweb in all its glory.
Some informal information retrieval metrics might be useful for getting an
idea of how well a query is working for a particular language: the more
hits that are really in the target language, the better (precision), and
the more hits in comparison to the (presumably unknowable) total number of
indexed target-language web pages, the better (recall).
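As a minimal sketch of that informal metric (the numbers and function name here are my own, not from any actual spot-check), precision is just the fraction of checked hits that turn out to be in the target language:

```python
def sample_precision(labels):
    """Informal precision estimate: the fraction of spot-checked
    search hits that are really in the target language."""
    return sum(labels) / len(labels)

# Hypothetical spot-check of 10 hits: True = page is in the target language.
checked = [True] * 9 + [False]
print(sample_precision(checked))  # 0.9
```

Recall can't be computed the same way, since the total number of indexed target-language pages is unknowable; the best you can do is compare reported hit counts across alternative queries.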
I spot-checked the first 500 hits for "ang mga", and there wasn't a single
page that wasn't in Tagalog (very high precision), although a couple
seemed to be mixed Tagalog/English. The query pulled up 66,200 pages --
which is probably meaningless except for comparison with possible
alternative queries that might pull in even more hits. I didn't check for
Would anybody be interested in trying out and reporting on
corpus-harvesting Google queries for other languages (or improving on "ang
mga", for that matter)? I can compile everything that we figure out onto a
webpage, if there's any interest.
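If anybody wants to automate the spot-checking step, here's a crude sketch of the idea: score a fetched page by how many of its tokens come from a set of high-frequency function words. The marker set and threshold below are my own guesses, not a vetted Tagalog stoplist, so treat them as placeholders:

```python
import re

# Assumed high-frequency Tagalog function words -- a real list would
# need vetting by someone who actually knows the language.
TAGALOG_MARKERS = {"ang", "mga", "ng", "sa", "na", "ay"}

def looks_tagalog(text, threshold=0.05):
    """Crude language filter: return True if at least `threshold` of
    the page's alphabetic tokens belong to the marker word set."""
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    if not tokens:
        return False
    hits = sum(1 for t in tokens if t in TAGALOG_MARKERS)
    return hits / len(tokens) >= threshold

print(looks_tagalog("Binili ng mga bata ang mga libro sa tindahan"))  # True
print(looks_tagalog("The quick brown fox jumps over the lazy dog"))   # False
```

A filter like this would also flag the mixed Tagalog/English pages if you raised the threshold, though where to set it is an empirical question.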
I'm assuming that anything that will render in Mozilla can be pasted into
the Google search field and that Google will search correctly with it. I
tried this with a (Cyrillic) Mongolian query and got about 30% precision
and low recall, but I was just using an interrogative particle (yuu) that
randomly came to mind. I'll try to find a good Mongolian query tonight.
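For comparing candidate queries, one crude scoring idea (my own, not anything rigorous) is to multiply sample precision by the reported hit count, giving a rough estimate of in-language pages retrieved; since true recall is unknowable, the score is only good for ranking queries against each other. The 1.0 and 0.30 precisions below come from the spot-checks described above, but the Mongolian hit count is made up for illustration:

```python
def query_score(sample_precision, reported_hits):
    """Crude comparison score for alternative queries: the estimated
    number of in-language pages among the reported hits."""
    return sample_precision * reported_hits

candidates = {
    '"ang mga"': query_score(1.00, 66_200),  # Tagalog spot-check above
    '"yuu"':     query_score(0.30, 12_000),  # made-up Mongolian hit count
}
best = max(candidates, key=candidates.get)
print(best)  # '"ang mga"'
```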
I guess there's no reason why the same technique couldn't be used to find
Sindarin, Quenya, Klingon, Lojban or Volapük texts... *shrug*