Re: Unknown Language Identifier!
From: Dan Sulani <dnsulani@...>
Date: Tuesday, January 30, 2001, 7:23
On 29 Jan, Dirk Elzinga wrote:
>I have to agree with Matt: this doesn't seem very useful to
>someone who is genuinely interested in the proper identification
>of a text (unless that text happens to be in a language included
>in the database).
>
>Also the fact that the orthography can muck things up so easily
>is disappointing (though not surprising).
Not taking the whole thing too seriously,
but wanting to follow up on the idea of orthography,
I ran some rap lyrics through the identifier, just for the fun of it.
My "best" result was with the following:
[Snoop] Yeah, ha ha, Snoop Dogg
[W.C.] Dub C.. heh, yeah
[Snoop] All up in here, bay-bay.. yeah
[W.C.] Uh-huh
[Snoop] Straight G thang, yeah
That got me:  AngloSaxon  0.0266
followed by:  Choctaw     0.0166
              Klingon     0.0163   (Klingon?! ;-) )
              Sotho       0.0150
When I tried the complete lyrics of songs (including this one)
by a number of rappers, I kept getting
English at around the 31 percent level, always followed by
Scots at around 24 percent, AngloSaxon at around 12 percent
and Icelandic at around 1 percent.
Dan Sulani
--------------------------------------------------------------------
likehsna rtem zuv tikuhnuh auag inuvuz vaka'a.
A word is an awesome thing.