Conlang: Re: magic natlang corpus harvesting (Gary Shannon, May 27 '04, 21:25)

Re: magic natlang corpus harvesting

From:	Gary Shannon <fiziwig@...>
Date:	Thursday, May 27, 2004, 21:25

--- "Mark P. Line" <mark@...> wrote: <snip>

> Would anybody be interested in trying out and reporting on
> corpus-harvesting google queries for other languages (or improving on
> "ang mga", for that matter)? I can compile everything that we figure
> out onto a webpage, if there's any interest.

This is a very interesting idea. Just a random brain dump of thoughts:

It seems that the ideal search string would have certain characteristics such as:

1. The word is unique to the target language, or significantly more common in the target language than in any other language. Thus "der", "das", "ist" might be good search strings for German, but "die", "den", and "hat" would not, because those same three-letter combinations occur in English as well as in German.

2. The word or string of words should be common enough that it can be expected to appear in any reasonable text in the target language. Thus "versicherungsgesselschaft", while possibly "more unique" to German than "der", "das", and "ist", would be less likely to occur on any given web page. (der+das+ist yields 139 million hits, the first 200 of which are all purely German pages, while "versicherungsgesselschaft" yields only two hits, both of which are purely German.)

2a. Corollary: The search string should not be too "exclusive", i.e., it should ideally not exclude any pages, and in practice not exclude too many pages in the target language.

3. Language-identifying words should be able to be concatenated with subject-specific search terms to find specific topics in a specific language. For example, to find German-language pages on the subject of global warming one could use the search terms der+das+ist+Erderwärmung (34 hits, all in German). This is possibly more precise than simply using the German word for global warming, "Erderwärmung", which produces 79 hits, including 7 non-German-language sites, 1 bibliography with book titles in assorted languages, 4 sites for English-speaking students learning German, one site for German-speaking students of English, and several mixed-language forums where the word is mentioned.

Even using words that are not unique to the target language, good results can be had by combining several non-unique words into the search.
For example, consider these three words which exist in both German and English, but are more common in German than in English: die, den, hat. Using them as search terms turned up the following percentages of German-language sites. (100% simply means no non-German sites were found in the first 200 pages returned from the search.)

    hat            5% German
    den           19% German
    die           58% German
    die+hat       98% German
    die+den       99% German
    den+hat       99% German
    die+den+hat  100% German

The non-German pages were mostly English, Dutch and Afrikaans, with a few pages in Swedish or Danish.
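
For anyone who wants to repeat this kind of test for other languages, here is a rough Python sketch of the two mechanical steps: gluing marker words (plus an optional topic word) into a "+"-joined search string, and working out what percentage of the pages you looked at were in the target language. The function names build_query and percent_in_target are made up for illustration, and the language labels are assumed to come from checking the result pages by hand; nothing here talks to a search engine.

```python
from typing import Iterable, Optional


def build_query(marker_words: Iterable[str], topic: Optional[str] = None) -> str:
    """Join language-marker words, plus an optional topic term, with '+'.

    Hypothetical helper: it only builds the string, it does not run a search.
    """
    terms = [w.strip() for w in marker_words if w.strip()]
    if topic and topic.strip():
        terms.append(topic.strip())
    return "+".join(terms)


def percent_in_target(labels: Iterable[str], target: str) -> float:
    """Percentage of hand-labelled result pages that are in the target language.

    `labels` holds one language code per inspected page (an assumed format),
    e.g. "de" for a German page, "en" for an English one.
    """
    labels = list(labels)
    if not labels:
        raise ValueError("no labelled pages to evaluate")
    hits = sum(1 for lang in labels if lang == target)
    return 100.0 * hits / len(labels)
```

So build_query(["der", "das", "ist"], topic="Erderwärmung") gives the string der+das+ist+Erderwärmung, and percent_in_target run over the labels for the first 200 result pages gives a figure comparable to the percentages in the table.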