
Re: Online Language Identifier

From: Julia "Schnecki" Simon <helicula@...>
Date: Wednesday, August 31, 2005, 5:49
Hello!

On 8/30/05, David J. Peterson <dedalvs@...> wrote:

[snip snip]

> To add to the fun, I did a bunch of nonsense sentences. The results
> (copied from my blog):
>
> -Aaaaaaaaaaa aaaa aaaaaaaaaaaaa aaa aaaaa. Finnish. (This is Finnish
> for, "And they forged the sampo." I'm certain it is.)

It must be one of the weirder dialects, possibly from the Savo or Kotka areas. ;-)

I decided to join in the fun and run some tests of my own. Since the vocabulary of my own language is still extremely small (just a few words that I needed as stems when working on inflection), I decided to dig out some of the (semi-) weird "real-world" languages I've accumulated over the years. Ah, if my old linguistics professors could see me now. ;-)

David hypothesized:

> I think if they don't know what to do with a language, they call
> it Romanian. And perhaps if they like the language they call it
> Estonian. ;)

That's plausible, considering that Old High German (specifically, a combination of some spells -- _Lorscher Bienensegen_, _Pro Nessia_, and the _Merseburger Zaubersprüche_) and Middle High German (the poem _Du bist mîn_) were both identified as Romanian. (The _Merseburger Zaubersprüche_ on their own were recognized as Latin, though. No idea why, since I took great care to remove all Latin phrases from my sources.) I didn't get the analysis "Estonian" for anything, though. ;-)

With some other Old and Middle High German material I entered, the system did much better: both _Merigarto_ (OHG) and _Ich saz ûf eime steine_ (MHG) were recognized as German. Yay. :-)

It also managed to recognize older versions of Italian (_Cantico del Frate Sole_) and Finnish (some random snippings from Agricola's _Rucousciria_) as Italian and Finnish, respectively. On the other hand, I'm highly suspicious of any algorithm that looks at Agricola-style spelling and determines that the language of the text must be Finnish. What about all those icky foreign letters like <w> and <c>? Surely no Finn in his or her right mind would ever use *those*? ;-)

Then I decided to get nasty and found the following:

- Mohawk is consistently identified as Hungarian. Probably all those accents. ;-)
- Nahuatl in traditional orthography is identified as Slovakian for some reason. Nahuatl in modern orthography (as used in some of the stories on http://www.kokone.com.mx/) is identified as Indonesian (and I can see why -- the patterns do look similar).
- Tocharian B is identified as Latvian. No idea why.

(Do I sense an inspiration for conlangs based on odd combinations of natlangs here? A language with a huuuuge morphology based on Hungarian treatment of nouns and Mohawk treatment of verbs? A Slovakian con-dialect with lateral affricates and glottal stops popping up in odd places?)

Oh well, that's enough fun for now. Back to (serious) work.

Regards,
Julia

--
Julia Simon (Schnecki) -- Sprachen-Freak vom Dienst
_@" schnecki AT iki DOT fi / helicula AT gmail DOT com "@_
si hortum in bybliotheca habes, deerit nihil (M. Tullius Cicero)