Conlang: Re: Korean/Japanese/Inupiaq/Aleut/Yupik/Samenoid(sp) (Lars Henrik Mathiesen, Aug 21 '00, 15:16)

Re: Korean/Japanese/Inupiaq/Aleut/Yupik/Samenoid(sp)

From:	Lars Henrik Mathiesen <thorinn@...>
Date:	Monday, August 21, 2000, 15:16


> Date: Sun, 20 Aug 2000 23:40:17 -0700
> From: Marcus Smith <smithma@...>

> Cavali-Svorza (I believe that's the spelling) has worked at a
> genetic family tree of all the races. There is lots of interesting
> stuff related to Native Americans, but since genetics is not
> something I understand really well, I can't recall most of it.
> Unusually large groupings around a small number of alleles,
> suggestive that there was a small number of original settlers who
> were closely related. Can't recall anything else.

At the time this came out, Cavalli-Sforza was criticized for not taking into consideration (or even understanding) the limitations of the technique he used. The computer will always spit out a nice tree, but in some cases the branch structure may be very sensitive to small changes in the input data.

For instance, if we have three genomes

    A  xxNxxNxxNxxNxxNxx
    B  xxNxxNxxOxxOxxOxx
    C  xxPxxPxxPxxPxxPxx

then the distance between A and B is 3, and both A and B are at distance 5 from C; the correct tree will clearly group A and B together:

    A   B     C
     \ /     /
      Y     /
       \   /
        \ /
         Y

But what about this case:

    A  xxNxxNxxNxxNxxNxx
    B  xxNxxNxxNxxOxxOxx
    C  xxOxxOxxOxxOxxOxx

The distance between A and B is now 2, and still 5 between A and C. But B is now only 3 away from C, and there's no way of making a tree that correctly represents all three distances. The program has to choose which one to get wrong:

    A   B     C      A   B     C      B   A C
     \ /     /        \   \   /        \ / /
      Y     /          \   \ /          Y /
       \   /            \   Y            Y
        \ /              \ /
         Y                Y

IIRC, the program will output the first tree, not only making B seem much farther away from C than it really is, but also hiding the fact that changing just one gene in B from N to O would change the branch structure at the 5 deep level.

(If the program is written properly, it will of course output warnings when this sort of ambiguity is found, as well as sensitivity analyses on the input data, but the user has to understand the mathematics to be able to interpret all that.)

I don't know if C-S ever put up a defense against these critics (apart from the classical "I plugged it into the chi-square table/a computer program, and this is what I got, so of course it's right." Cf. Ruhlen and the "I multiplied a lot of numbers, so there" argument).

Lars Mathiesen (U of Copenhagen CS Dep) <thorinn@...>
(Humour NOT marked)
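
The second three-genome case above can be checked mechanically. The following is only a minimal sketch in Python, under some assumptions that are not in the post: distance is taken to be the number of differing positions (Hamming distance), "a tree that represents all three distances" is read as a rooted tree with all leaves at the same depth (for three taxa that requires the two largest distances to be equal), and the program is assumed to join the closest pair first. The post does not say which algorithm the actual program used, and the function names here are made up for illustration.

```python
from itertools import combinations


def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s, t))


def pairwise(genomes):
    """Map each unordered pair of names to its Hamming distance."""
    return {
        frozenset((p, q)): hamming(genomes[p], genomes[q])
        for p, q in combinations(sorted(genomes), 2)
    }


def fits_equal_depth_tree(dist):
    """Three-taxon check (assumption: equal-depth rooted tree).

    Such a tree can reproduce all three distances only if the two
    largest distances are equal.
    """
    d = sorted(dist.values())
    return d[1] == d[2]


def closest_pair(dist):
    """Pair that a closest-first clustering would join first (ties not handled)."""
    return tuple(sorted(min(dist, key=dist.get)))


# The second example from the post.
second = {
    "A": "xxNxxNxxNxxNxxNxx",
    "B": "xxNxxNxxNxxOxxOxx",
    "C": "xxOxxOxxOxxOxxOxx",
}
dist = pairwise(second)
print({"".join(sorted(k)): v for k, v in dist.items()})
print("fits an equal-depth tree:", fits_equal_depth_tree(dist))
print("joined first:", closest_pair(dist))

# Sensitivity: change one more gene in B from N to O and look again.
flipped = dict(second, B="xxNxxNxxOxxOxxOxx")
print("joined first after one change in B:", closest_pair(pairwise(flipped)))
```

Under these assumptions the distances come out as 2, 5 and 3, the equal-depth check fails, and a single N to O change in B switches the first join from A with B to B with C, which is the kind of instability the tree drawing alone does not show.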