Re: magic natlang corpus harvesting

From:	Danny Wier <dawiertx@...>
Date:	Thursday, May 27, 2004, 10:04


From: "Mark P. Line" <mark@...>

> I was reading a TR the other day about collecting minority-language corpus
> material from the Web. They were going through all kinds of high-falootin'
> machine learning shenanigans, which was the point of their (computer
> science) research -- but it struck me that it doesn't have to be that hard
> if you know something about the language you're looking for.
...

> I spot-checked the first 500 hits for "ang mga", and there wasn't a single
> page that wasn't in Tagalog (very high precision), although a couple
> seemed to be mixed Tagalog/English. The query pulled up 66,200 pages --
> which is probably meaningless except for comparison with possible
> alternative queries that might pull in even more hits. I didn't check for
> deadwood.
...

I got 2.78 million hits for Arabic /la:/ 'no' (written as a single ligature),
with pretty high precision. For Hindi, /hai/ 'he/she/it is' turns up 40,700
pages, though some of those may be in other Devanagari-script languages.
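
For anyone who wants to repeat the spot-check offline, here is a rough Python sketch of the idea: treat a common function-word sequence as a marker, keep the pages that contain it, and measure precision against hand-assigned labels. Everything in it is an assumption made for illustration: the function names, the language codes, and the tiny sample are invented, and it does not query any search engine. The word-boundary test is also crude and has not been tried on Arabic or Devanagari text.

```python
import re


def contains_marker(text, marker):
    """Return True if `marker` occurs in `text` as a whole phrase (case-insensitive)."""
    # The lookarounds keep the marker from matching inside a longer word.
    pattern = r"(?<!\w)" + re.escape(marker) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def precision(pages, marker, target_lang):
    """Share of marker-matching pages whose hand-assigned label is `target_lang`.

    `pages` is a list of (text, language_label) pairs.
    Returns None when no page matches the marker.
    """
    hits = [lang for text, lang in pages if contains_marker(text, marker)]
    if not hits:
        return None
    return sum(1 for lang in hits if lang == target_lang) / len(hits)


# Hypothetical hand-labelled sample, standing in for a batch of downloaded pages.
sample = [
    ("Ang mga bata ay naglalaro sa labas.", "tl"),
    ("Dumating ang mga bisita kahapon.", "tl"),
    ("The mango harvest was good this year.", "en"),
    ("Sabi nila, ang mga tao dito ay mabait. Welcome to our site!", "tl"),
]

print(precision(sample, "ang mga", "tl"))
```

The mixed Tagalog/English page still counts as a hit here, which matches the way the spot-check above was described. Recall is not measured at all, so this only says how clean the matches are, not how much material the marker misses.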

