Re: magic natlang corpus harvesting
From: Gary Shannon <fiziwig@...>
Date: Thursday, May 27, 2004, 21:25
--- "Mark P. Line" <mark@...> wrote:
<snip>
>
> Would anybody be interested in trying out and reporting on
> corpus-harvesting google queries for other languages (or
> improving on "ang mga", for that matter)? I can compile
> everything that we figure out onto a webpage, if there's
> any interest.
This is a very interesting idea. Just a random brain
dump of thoughts:
It seems that the ideal search string would have
certain characteristics such as:
1. The word is unique to the target language, or
significantly more common in the target language than
in any other language. Thus "der", "das", "ist" might
be good search strings for German, but "die", "den",
and "hat" would not, because those same three-letter
strings are also ordinary English words.
2. The word or string of words should be common enough
that they can be expected to appear in any reasonable
text in the target language. Thus
"versicherungsgesselschaft", while possibly "more
unique" to German than "der", "das", and "ist", would
be less likely to occur on any given web page.
(der+das+ist yields 139 million hits, the first 200 of
which are all purely German pages, while
"versicherungsgesselschaft" yields only two hits, both
of which are purely German.)
2a. Corollary: The search string should not be too
"exclusive", i.e., it should ideally not exclude any
pages, and in practice not exclude too many pages in
the target language.
3. It should be possible to combine the
language-identifying words with subject-specific search
terms to find specific topics in a specific language.
For example, to find German-language pages on the
subject of global warming one could use the search
terms der+das+ist+Erderwärmung (34 hits, all in
German). This is possibly more precise than simply
using the German word for global warming,
"Erderwärmung", by itself, which produces 79 hits,
including 7 non-German-language sites, 1 bibliography
with book titles in assorted languages, 4 sites for
English-speaking students learning German, one site
for German-speaking students of English, and several
mixed-language forums where the word is mentioned.
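
As a rough sketch of criteria 1 to 3, here is how one
might score a candidate word and assemble a query in
Python. Everything here is an assumption for
illustration: the function names are made up, and the
per-language page counts (doc_freq, pages_in_lang) would
have to come from some sample corpus you already have.
It does not talk to any search engine.

```python
def specificity(word, doc_freq, target):
    """Criterion 1: of all sampled pages containing `word`,
    what fraction are in the target language?"""
    total = sum(freqs.get(word, 0) for freqs in doc_freq.values())
    if total == 0:
        return 0.0
    return doc_freq[target].get(word, 0) / total


def coverage(word, doc_freq, pages_in_lang, target):
    """Criterion 2: of all sampled target-language pages,
    what fraction contain `word`?"""
    return doc_freq[target].get(word, 0) / pages_in_lang[target]


def build_query(markers, topic=None):
    """Criterion 3: join the language-marker words and an
    optional topic term with '+', as in der+das+ist+Erderwärmung."""
    terms = list(markers)
    if topic:
        terms.append(topic)
    return "+".join(terms)
```

For example, build_query(["der", "das", "ist"],
"Erderwärmung") gives the string used above. A good
marker word would score high on both specificity and
coverage; a rare but very German word scores high on
the first and low on the second.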
Even using words that are not unique to the target
language, good results can be had by combining several
non-unique words into the search. For example,
consider these three words which exist in both German
and English, but are more common in German than in
English: die, den, hat. Using them as search terms
turned up the following percentages of German-language
pages. (100% simply means no non-German pages were
found in the first 200 pages returned from the
search.)
hat             5% German
den            19% German
die            58% German
die+hat        98% German
die+den        99% German
den+hat        99% German
die+den+hat   100% German
The non-German pages were mostly English, Dutch and
Afrikaans, with a few pages in Swedish or Danish.
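
The percentages above were counted by hand, but the
tally itself is easy to sketch in Python. This assumes
you already have one language label per returned page
(from eyeballing them or from some language-guessing
tool); the function names and the "de"/"en" style
labels are made up for illustration.

```python
from collections import Counter


def language_share(labels, target, top_n=200):
    """Percent of the first `top_n` result pages labelled
    as the target language."""
    sample = labels[:top_n]
    if not sample:
        return 0.0
    return 100.0 * Counter(sample)[target] / len(sample)


def other_languages(labels, target, top_n=200):
    """Counts of the non-target languages in the first
    `top_n` results, most frequent first."""
    counts = Counter(l for l in labels[:top_n] if l != target)
    return counts.most_common()
```

Given 200 labels for a query such as die+hat, the first
function gives the figure in the table and the second
shows which languages account for the stray pages.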