Re: Unknown Language Identifier!

From:	J Matthew Pearson <pearson@...>
Date:	Monday, January 29, 2001, 4:17
First, as a test, I tried a few sentences of Malagasy, and got the
following results:

  Sotho     0.0935
  Choctaw   0.0458
  Swahili   0.0410
  Manx      0.0330

I then added a few more sentences of Malagasy and tried it again.  I got
essentially the same results, except that Manx and Choctaw did a switch:

  Sotho     0.0717
  Manx      0.0554
  Swahili   0.0442
  Choctaw   0.0397

At first I was disappointed.  But then I remembered that while Malagasy
is an Austronesian language (closely related to Tagalog), the phonology
and phonotactics of the language have been heavily influenced by Bantu.
So I guess it's no surprise that two Bantu languages (Sotho and Swahili)
show up on the list.  What the hell Manx is doing there is beyond me.

Next I tried Tokana, entering random example sentences from my Reference
Grammar, and here's what I got:

  Tosk      0.0678
  Polish    0.0501
  Gheg      0.0490
  Tzeltal   0.0349

Again, a curious result.  The phonology of Tokana is modelled on
Choctaw, Lakhota, Quechua, Finnish, Malagasy, and Greek.  That none of
these languages made the top five is disappointing.  Furthermore, I find
Albanian and Polish to be extremely unlovely--at least on the printed
page (phonetically they're much more appealing).  The Mayan languages I
like not much better...

Wearing my linguist hat, I have to say that I'm extremely dubious that
this would be a useful tool in determining the relatedness of languages.
It seems to have all of the disadvantages of glottochronology and other
discredited techniques, and none of the advantages.  (Malagasy is a good
test case for this, since it looks superficially like it should belong
to one family, Bantu, but in fact belongs to a completely different
family, Austronesian.)  Of course, the algorithm might work better if
(a) it had more texts to compare the sample to, and (b) you could
abstract away from orthographic differences (perhaps by entering samples
in IPA?).  Even so, though, I suspect it would do a better job of
placing the unknown language in the correct geographic area (Sprachbund)
than in identifying its genetic affiliation.  (I would refer those who
are interested in this issue to R.M.W. Dixon's recent book, "The Rise
and Fall of Languages".)
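
To make the orthography point concrete, here is a minimal sketch of how a tool of this kind might score a sample. It assumes the identifier compares character-trigram frequency profiles by cosine similarity; the algorithm behind the figures above is not stated, so this is only one plausible approach. The function names and the toy reference strings (`ref_a`, `ref_b`) are hypothetical placeholders, not real language data.

```python
from collections import Counter
from math import sqrt


def trigram_profile(text: str) -> Counter:
    """Count overlapping character trigrams in lowercased, space-normalised text."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two trigram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_languages(sample: str, references: dict[str, str]) -> list[tuple[str, float]]:
    """Score the sample against each reference text, best match first."""
    sample_profile = trigram_profile(sample)
    scores = {
        name: cosine_similarity(sample_profile, trigram_profile(text))
        for name, text in references.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical placeholder "languages": made-up strings, not real corpora.
toy_references = {
    "ref_a": "mana tana lana kana",
    "ref_b": "strzp krszt prszk",
}

for name, score in rank_languages("tana mana", toy_references):
    print(f"{name:8s} {score:.4f}")
```

Because the profiles in this sketch are built from raw spelling, two closely related languages written in different orthographies would share few trigrams and score low, while unrelated languages with similar spelling conventions and syllable shapes would score high. Transcribing every sample into a common notation such as IPA before building the profiles would remove the first effect but not the second.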

Wearing my conlanger hat, however, I have to say that it's a fun toy to
play with!

Matt.