Re: Unknown Language Identifier!
From: dirk elzinga <dirk.elzinga@...>
Date: Monday, January 29, 2001, 16:23
On Sun, 28 Jan 2001, Padraic Brown wrote:
Hmmm. Seems orthography is very important. I used a short Tepa
text and got:
Oromo: 0.0450
Czech: 0.0381
AngloSaxon: 0.0314
Somali: 0.0306
I fiddled a bit with the orthography, substituting <q> with <g>
(velar nasal), <y> with <j> (palatal glide), and <e> with <y>
(high central unrounded vowel). I ran it through again and got:
Oromo: 0.0480
Somali: 0.0384
Hausa: 0.0314
Klingon: 0.0278
Oromo is still on top, but of the remaining three only Somali
showed up again.
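
For anyone who wants to repeat the respelling step, here is a minimal sketch. It assumes the substitutions run in the direction stated above (<q> becomes <g>, <y> becomes <j>, <e> becomes <y>) and that they are applied simultaneously, so that a <y> produced from <e> is not turned into <j> afterwards. The name `respell` and the test string are hypothetical; the string is not real Tepa.

```python
# Hypothetical helper: apply all three respellings in a single pass.
# str.translate maps each character independently, so the <y> that
# comes from <e> is not caught by the <y> -> <j> rule.
RESPELL = str.maketrans({"q": "g", "y": "j", "e": "y"})


def respell(text: str) -> str:
    """Return text with q->g, y->j and e->y applied simultaneously."""
    return text.translate(RESPELL)


# Made-up test string, not an actual Tepa word.
print(respell("qeya"))  # gyja

# Chained replacements in this order collapse <e> and <y> together.
print("qeya".replace("q", "g").replace("e", "y").replace("y", "j"))  # gjja
```

Doing the substitutions one after another only works if <y> is replaced before <e>; the single-pass table avoids having to think about the order.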
I reverted to the original orthography, and used a larger sample
and got:
Czech: 0.0425
Swahili: 0.0418
Lithuanian: 0.0341
SerboCroatian: 0.0303
Comparing with the original short sample, the longer sample
maintains some similarity to Czech (according to the software,
anyway), but the other three languages are replaced.
I also ran Shemspreg through (Shemspreg is my PIE take-off) and
got:
Hungarian: 0.0541
French: 0.0437
Manx: 0.0362
Latin: 0.0333
Three out of the four are IE languages, but Hungarian gets top billing. Nice.
Then, just out of curiosity, I ran through a Shoshoni story. My
first impression of Shoshoni written in the official orthography
was of Finnish--both have a limited consonantal inventory and
lots of geminates. This is what I got:
Hausa: 0.0810
Swahili: 0.0717
Finnish: 0.0683
Kutchin: 0.0673
Finnish is in there, but not on top.
I have to agree with Matt: this doesn't seem very useful to
someone who is genuinely interested in the proper identification
of a text (unless that text happens to be in a language included
in the database).
Also, the fact that the orthography can muck things up so easily
is disappointing (though not surprising).
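
To illustrate why spelling alone can move the numbers, here is a rough sketch of one common approach to this kind of comparison: character trigram profiles compared by cosine similarity. This is an assumption for illustration only; I don't know what method the identifier actually uses, and the function names and the English sample string are made up.

```python
from collections import Counter
from math import sqrt


def trigram_profile(text: str) -> Counter:
    """Count overlapping character trigrams, with whitespace normalised."""
    padded = f" {' '.join(text.lower().split())} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two trigram count profiles (0.0 if either is empty)."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Illustrative sample only; the respelling is the same three-letter swap.
sample = "the quick brown fox"
respelled = sample.translate(str.maketrans({"q": "g", "y": "j", "e": "y"}))

original = trigram_profile(sample)
print(cosine(original, original))                    # identical text scores about 1.0
print(cosine(original, trigram_profile(respelled)))  # same text, respelled, scores lower
```

Under a measure like this, the same text in two spellings does not even match itself perfectly, so a shift in which languages come out on top after a respelling is to be expected.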
Dirk
--
Dirk Elzinga dirk.elzinga@m.cc.utah.edu
"The strong craving for a simple formula
has been the undoing of linguists." - Edward Sapir