Re: Online Language Identifier
From: Julia "Schnecki" Simon <helicula@...>
Date: Wednesday, August 31, 2005, 5:49
Hello!
On 8/30/05, David J. Peterson <dedalvs@...> wrote:
[snip snip]
> To add to the fun, I did a bunch of nonsense sentences. The results
> (copied from my blog):
>
> -Aaaaaaaaaaa aaaa aaaaaaaaaaaaa aaa aaaaa. Finnish. (This is Finnish
> for, "And they forged the sampo." I'm certain it is.)
It must be one of the weirder dialects, possibly from the Savo or
Kotka areas. ;-)
I decided to join in the fun and run some tests of my own. Since the
vocabulary of my own language is still extremely small (just a few
words that I needed as stems when working on inflection), I decided to
dig out some of the (semi-)weird "real-world" languages I've
accumulated over the years. Ah, if my old linguistics professors could
see me now. ;-)
David hypothesized:
> I think if they don't know what to do with a language, they call
> it Romanian. And perhaps if they like the language they call it
> Estonian. ;)
That's plausible, considering that Old High German (specifically, a
combination of some spells -- _Lorscher Bienensegen_, _Pro Nessia_,
and the _Merseburger Zaubersprüche_) and Middle High German (the poem
_Du bist mîn_) were both identified as Romanian. (The _Merseburger
Zaubersprüche_ on their own were recognized as Latin, though. No idea
why, since I took great care to remove all Latin phrases from my
sources.)
I didn't get the analysis "Estonian" for anything, though. ;-)
With some other Old and Middle High German material I entered, the
system did much better: both _Merigarto_ (OHG) and _Ich saz ûf eime
steine_ (MHG) were recognized as German. Yay. :-)
It also managed to recognize older versions of Italian (_Cantico del
Frate Sole_) and Finnish (some random snippets from Agricola's
_Rucouskiria_) as Italian and Finnish, respectively. On the other
hand, I'm highly suspicious of any algorithm that looks at
Agricola-style spelling and determines that the language of the text
must be Finnish. What about all those icky foreign letters like <w>
and <c>? Surely no Finn in his or her right mind would ever use
*those*? ;-)
Then I decided to get nasty and found the following:
- Mohawk is consistently identified as Hungarian. Probably all those
accents. ;-)
- Nahuatl in traditional orthography is identified as Slovakian for
some reason. Nahuatl in modern orthography (as used in some of the
stories on http://www.kokone.com.mx/) is identified as Indonesian
(and I can see why -- the patterns do look similar).
- Tocharian B is identified as Latvian. No idea why.
(Do I sense an inspiration for conlangs based on odd combinations of
natlangs here? A language with a huuuuge morphology based on Hungarian
treatment of nouns and Mohawk treatment of verbs? A Slovakian
con-dialect with lateral affricates and glottal stops popping up in
odd places?)
Oh well, that's enough fun for now. Back to (serious) work.
Regards,
Julia
--
Julia Simon (Schnecki) -- Sprachen-Freak vom Dienst
_@" schnecki AT iki DOT fi / helicula AT gmail DOT com "@_
si hortum in bybliotheca habes, deerit nihil
(M. Tullius Cicero)