
Re: magic natlang corpus harvesting

From: Danny Wier <dawiertx@...>
Date: Thursday, May 27, 2004, 10:04
From: "Mark P. Line" <mark@...>

> I was reading a TR the other day about collecting minority-language corpus material from the Web. They were going through all kinds of high-falootin' machine learning shenanigans, which was the point of their (computer science) research -- but it struck me that it doesn't have to be that hard if you know something about the language you're looking for.
...
> I spot-checked the first 500 hits for "ang mga", and there wasn't a single page that wasn't in Tagalog (very high precision), although a couple seemed to be mixed Tagalog/English. The query pulled up 66,200 pages -- which is probably meaningless except for comparison with possible alternative queries that might pull in even more hits. I didn't check for deadwood.
... I got 2.78 million hits for Arabic /la:/ 'no' (a ligature) with pretty high precision. For Hindi, there are 40,700 pages with /hai/ 'he/she/it is', but there may be some other Devanagari-script languages involved.
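
For anyone who wants to try the same trick on pages they have already downloaded, here is a minimal Python sketch of the idea, not anything either poster actually ran. It checks a text for a marker phrase as a whole-word match and computes precision from a hand-labelled spot check. The function names and the sample sentences are made up for illustration; only the marker "ang mga" comes from the thread.

```python
import re


def contains_marker(text, phrases):
    """Return True if any marker phrase occurs as whole words in text.

    Matching is case-insensitive. The lookarounds keep a marker from
    matching inside a longer word.
    """
    for phrase in phrases:
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            return True
    return False


def precision(judgments):
    """Fraction of spot-checked hits judged to be in the target language.

    judgments is a list of booleans, one per hit that was checked by hand.
    """
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(1 for ok in judgments if ok) / len(judgments)


if __name__ == "__main__":
    # Hypothetical sample texts, for illustration only.
    print(contains_marker("Ang mga bata ay naglalaro.", ["ang mga"]))
    print(contains_marker("The children are playing.", ["ang mga"]))
    # A spot check in which every one of 500 hits was on target.
    print(precision([True] * 500))
```

A spot check like the one quoted above (500 hits, none off target) would come out as a precision of 1.0 on that sample, which says nothing about recall or about duplicate and dead pages.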

Replies

Danny Wier <dawiertx@...>
Emily Zilch <emily0@...> MNCH (was: magic natlang corpus harvesting)