Re: Unknown Language Identifier!
From: dirk elzinga <dirk.elzinga@...>
Date: Monday, January 29, 2001, 16:23
On Sun, 28 Jan 2001, Padraic Brown wrote:
Hmmm. Seems orthography is very important. I used a short Tepa
text and got:
Oromo:       0.0450
Czech:       0.0381
AngloSaxon:  0.0314
Somali:      0.0306
I fiddled a bit with the orthography, substituting <q> with <g>
(velar nasal), <y> with <j> (palatal glide), and <e> with <y>
(high central unrounded vowel). I ran it through again and got:
Oromo:    0.0480
Somali:   0.0384
Hausa:    0.0314
Klingon:  0.0278
Oromo is still on top, but of the remaining three only Somali
showed up again.
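For anyone who wants to repeat the respelling step, here is a minimal sketch in Python. It assumes the three substitutions listed above are applied simultaneously rather than one after another; done sequentially, an <e> that has just become <y> could be turned into <j> by the next rule. The function name `respell` and the sample string are made up for illustration and are not real Tepa.

```python
# Respelling table from the message: <q> -> <g>, <y> -> <j>, <e> -> <y>.
# str.translate applies every mapping in a single pass over the text,
# so a <y> produced from <e> is never re-mapped to <j>.
RESPELL = str.maketrans({"q": "g", "y": "j", "e": "y"})


def respell(text: str) -> str:
    """Return text with the three orthographic substitutions applied at once."""
    return text.translate(RESPELL)


if __name__ == "__main__":
    # Hypothetical sample string, chosen only to exercise all three rules.
    sample = "yeqe"
    print(respell(sample))  # jygy
```

Only lowercase letters are handled here; capitals would need their own entries in the table.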
I reverted to the original orthography, and used a larger sample
and got:
Czech:          0.0425
Swahili:        0.0418
Lithuanian:     0.0341
SerboCroatian:  0.0303
Comparing with the original short sample, the longer sample
maintains some similarity to Czech (according to the software,
anyway), but the other three languages are replaced.
I also ran Shemspreg through (Shemspreg is my PIE take-off) and
got:
Hungarian:  0.0541
French:     0.0437
Manx:       0.0362
Latin:      0.0333
Three of the four are IE languages, but Hungarian (not IE) gets top billing. Nice.
Then, just out of curiosity, I ran through a Shoshoni story. My
first impression of Shoshoni written in the official orthography
was of Finnish--both have a limited consonantal inventory and
lots of geminates. This is what I got:
Hausa:    0.0810
Swahili:  0.0717
Finnish:  0.0683
Kutchin:  0.0673
Finnish is in there, but not on top.
I have to agree with Matt: this doesn't seem very useful to
someone who is genuinely interested in the proper identification
of a text (unless that text happens to be in a language included
in the database).
Also the fact that the orthography can muck things up so easily
is disappointing (though not surprising).
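The thread doesn't say how the identifier computes its scores, so the following is only a guess at the general idea, not the software's actual method. It assumes a comparison of character-bigram counts by cosine similarity, which is one common way to do this, and shows why respelling alone can move a score. All names and sample strings are hypothetical.

```python
from collections import Counter
from math import sqrt


def bigram_profile(text: str) -> Counter:
    """Count adjacent character pairs, keeping only letters and spaces."""
    cleaned = "".join(c for c in text.lower() if c.isalpha() or c == " ")
    return Counter(cleaned[i:i + 2] for i in range(len(cleaned) - 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bigram count profiles (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    # Made-up strings: a "reference" sample and one word in two spellings.
    reference = bigram_profile("banana bandana")
    original = bigram_profile("banana")
    respelled = bigram_profile("bonono")  # same word with <a> written as <o>

    print("original spelling: ", cosine(original, reference))
    print("respelled:         ", cosine(respelled, reference))
```

Under this assumption the respelled word shares no bigrams with the reference, so its score drops even though nothing about the language has changed, only the letters used to write it.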
Dirk
--
Dirk Elzinga                          dirk.elzinga@m.cc.utah.edu
"The strong craving for a simple formula
has been the undoing of linguists."               - Edward Sapir