THEORY: Unknown Language Confuser!
From: Lars Henrik Mathiesen <thorinn@...>
Date: Tuesday, January 30, 2001, 2:27
Will you PLEASE read just a little of the explanatory text on that web
page?
Unless the highest language is at least twice as good as the next
language (that isn't closely related to it), it is a negative result.
'A negative result' means IT DOESN'T MEAN ANYTHING AT ALL!
Statistically speaking, it's noise. Nonsense. Gibberish. It may look
suggestive, but that just means it's suggestive gibberish.
And if you get a positive result, you can only use the top language.
Unless they are closely related to that, the identities of the lower
ranked languages are totally irrelevant. They are only named to let
you check if they are in fact closely related, and how close their
point scores are to the top language.
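The rule above can be sketched in a few lines of Python. This is only an illustration of the reading rule, not the program's own code: the function name `interpret`, the `close_relatives` argument (the reader's own judgement of which languages are closely related to the top one) and the `min_ratio` default are all assumptions made for the example.

```python
def interpret(scores, close_relatives=(), min_ratio=2.0):
    """Return the top language if the result is positive, else None.

    scores: dict mapping language name -> point score (higher is better).
    close_relatives: languages the caller judges closely related to the
        top-ranked one; they are skipped when looking for the runner-up.
    min_ratio: how many times better the top score must be (assumed 2.0).
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None
    top_lang, top_score = ranked[0]
    # Scores of lower-ranked languages that are NOT close relatives of the top one.
    rivals = [s for lang, s in ranked[1:] if lang not in close_relatives]
    runner_up = rivals[0] if rivals else 0.0
    if top_score > 0 and top_score >= min_ratio * runner_up:
        return top_lang  # positive result: only the top language is usable
    return None          # negative result: noise, supports no conclusion


# Made-up scores, purely for illustration: the top score is nowhere near
# twice the next one, so the result is negative.
print(interpret({"A": 0.20, "B": 0.18, "C": 0.15}))
```

Note that the function returns only the top language, or nothing at all. It never returns the lower-ranked names, because they carry no information beyond letting you check how close they are to the top.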
Note well that this is when going from test results to conclusions
about relations. The test simply doesn't support _any_ conclusion
unless it gets a positive result. Pace Matt Pearson, the test was not
designed to avoid false negatives, and I can see nothing on the
website to give the impression that it can do so. It's false positives
that would be news to the author.
Of course, if you do have a language that's, say, Insular Celtic with a
heavy Latin adstrate, you can strongly predict --- but not be sure ---
that the test will pick some Insular Celtic language with a similar
orthography as the top match, if such a language exists in the
database --- and that it will be more likely to pick other such languages,
or languages similar to the adstrate, for the lower ranked matches.
The only almost-positive result I've seen quoted was in fact for
Brithenig, which showed as related to Welsh --- with Latin and
Rhaetoroman dialects in the next three slots. This is totally
unsurprising, because that's how Brithenig was designed to be. But if
the program had picked Inuktitut as the second language, that would
have been just as unsurprising (to a statistician), because the test
is not designed to pick 'correct' second languages.
ALL the "surprising" results I've seen quoted have actually been
negative results. In the old sf movies, the computer would just say
"insufficient data." But because a university researcher very rightly
decided to use his time on something productive instead of writing
code to spoonfeed that conclusion to the supposedly intelligent
members of this list, we've now had to endure reams of unmitigated
gibberish, even done up in new statistics in a search for the conlangy
significance of Somali.
There is none. Trust me.
(Used properly, as for instance when fed a nice juicy 2000-word debate
article from a Danish online newspaper, the program told me Danish
.48, Swedish .23, Frisian .11. Those are probably the kind of numbers you
need for a 99% confidence level --- with .40+ on the top language and
a factor of 4+ to the first non-close-relative.)
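For the Danish figures, the arithmetic works out as below. The scores are the ones quoted above. Treating Swedish as the close relative to skip, and the variable names, are assumptions of this sketch, and the .40 and factor-4 thresholds are only the rough guess given above, not a calibrated confidence level.

```python
scores = {"Danish": 0.48, "Swedish": 0.23, "Frisian": 0.11}
close_relatives = {"Swedish"}  # assumption: counted as a close relative of Danish

top_lang = max(scores, key=scores.get)
top_score = scores[top_lang]

# Best score among languages that are neither the top one nor a close relative.
rival_score = max(
    s for lang, s in scores.items()
    if lang != top_lang and lang not in close_relatives
)

factor = top_score / rival_score  # 0.48 / 0.11
high_confidence = top_score >= 0.40 and factor >= 4.0

print(top_lang, round(factor, 2), high_confidence)
```

Here the top score clears .40 and the factor to Frisian is a little over 4, so both conditions hold.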
Lars Mathiesen (U of Copenhagen CS Dep) <thorinn@...> (Humour NOT marked)