Re: Unknown Language Identifier!

From: dirk elzinga <dirk.elzinga@...>
Date: Monday, January 29, 2001, 19:24
On Mon, 29 Jan 2001, John Cowan wrote:

> dirk elzinga wrote:
>
> > Also the fact that the orthography can muck things up so easily
> > is disappointing (though not surprising).
>
> An orthography is a standard part of a written language. When
> we want to identify text in English, we expect it to be written
> using English orthography, not some random orthography.

So do all of the languages in his database have an established Latin
orthography? I wonder if some of the languages, especially the Native
American languages, have several competing schemes or even none at
all. (I'm thinking specifically of Apache and Kutchin; I don't know
that either of these has an official orthography established by the
respective tribal government.) So what is to be said about
orthography in those cases? Is it random? No, but which orthography
you choose will have implications for the recognition algorithm, as
my little Tepa experiment showed.

Also, the author explicitly states that the orthographies were
ASCIIized; how was this done? By simply stripping all characters of
their diacritics? Do this to Navajo and what you get isn't really
Navajo anymore, is it? Or were the texts coded in a
SAMPA/Kirschenbaum sort of way? There already seems to be a great
deal of potential indeterminacy in the orthographic forms the
algorithm looks at.

So this leads to the following idea. Take an English text, respell it
using Wijk's Regularized English, and see how it scores. I'd imagine
that if Regularized English spelling were really that (regularized),
it would score as well as or *better* than standard English spelling.
Try the same for New Spelling, Cut Spelling, etc.

Dirk

--
Dirk Elzinga
dirk.elzinga@m.cc.utah.edu

"The strong craving for a simple formula has been
the undoing of linguists." - Edward Sapir