Re: MNCH (was: magic natlang corpus harvesting)
From: Danny Wier <dawiertx@...>
Date: Friday, May 28, 2004, 6:54
From: "Emily Zilch" <emily0@...>
> { 20040527,0304 | Danny Wier }
>
> "I got 2.78 million hits for Arabic /la:/ 'no' (a ligature) with pretty
> high precision. For Hindi, there are 40,700 pages with /hai/ 'he/she/it
> is', but there may be some other Devanagari-script languages involved."
>
> The ligature LA+ALIF is used with great frequency in Farsi. In fact, I
> bet it appears in every Arabic-alifba-using natlang.
But I searched for the word /la:/ by itself, not as part of a word. I
wasn't aware, though, that the word for 'not' in Arabic is also the word for
'strand' or 'layer' in Farsi (I just looked that up), so a better test word
is needed.
I tried /?\ala:/ 'on, over, above' in Arabic, but it returned almost 6
million hits, a lot of them in Farsi. And no native Farsi words have either
the voiceless or the voiced pharyngeal fricative.
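One rough way to weed non-Arabic pages out of such hits would be to look for
letters that standard Arabic does not use. This is only a sketch: the
function name and threshold are hypothetical, and the letters listed also
occur in Urdu and other languages, so a match means "probably not Arabic",
not "Farsi".

```python
# Letters used in Persian (and also Urdu, among others) that are not part
# of the standard Arabic alphabet. Assumption: the page text is already
# decoded to Unicode.
NON_ARABIC_LETTERS = {
    "\u067E",  # ARABIC LETTER PEH
    "\u0686",  # ARABIC LETTER TCHEH
    "\u0698",  # ARABIC LETTER JEH
    "\u06AF",  # ARABIC LETTER GAF
    "\u06A9",  # ARABIC LETTER KEHEH
    "\u06CC",  # ARABIC LETTER FARSI YEH
}


def looks_non_arabic(text, threshold=1):
    """Return True if text has at least `threshold` non-Arabic letters."""
    hits = sum(1 for ch in text if ch in NON_ARABIC_LETTERS)
    return hits >= threshold
```

A page containing only /la:/ (LAM + ALEF) would pass as possibly Arabic,
while one containing a word such as Farsi "pedar" (with PEH) would be
flagged.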
> Of course, there may be a qualitative difference in the encoding since
> Farsi et al. use a different handwriting style, the so-called KUFIC or
> "horizontal" script, while Arabic(s) and African natlang alifba
> borrowers use the "vertical" script, but this may appear in coding
> simply as a font choice.
It's just a font choice, and Farsi can be written in any calligraphic style
that Arabic can. It's Urdu that is normally written in Nastaleeq rather than
Naskh.
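
To illustrate the font-choice point with the LA+ALIF ligature: in Unicode
the ligature is normally just the two letters LAM and ALEF, and the joined
shape is left to the font. A separate compatibility character for the
ligature does exist (U+FEFB), but compatibility normalization folds it back
to the two letters. A small sketch, assuming Python's standard unicodedata
module; the helper name is made up for the example.

```python
import unicodedata


def decompose_to_codepoints(text):
    """Apply NFKC normalization and list the resulting code points."""
    normalized = unicodedata.normalize("NFKC", text)
    return [f"U+{ord(ch):04X}" for ch in normalized]


# U+FEFB is ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM.
lam_alef = decompose_to_codepoints("\uFEFB")
```

Here lam_alef comes out as U+0644 (LAM) followed by U+0627 (ALEF), the same
sequence whether the text is displayed in Naskh or Nastaleeq.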