Re: Korean/Japanese/Inupiaq/Aleut/Yupik/Samenoid(sp)
From: Lars Henrik Mathiesen <thorinn@...>
Date: Monday, August 21, 2000, 15:16
> Date: Sun, 20 Aug 2000 23:40:17 -0700
> From: Marcus Smith <smithma@...>
> Cavali-Svorza (I believe that's the spelling) has worked at a
> genetic family tree of all the races. There is lots of interesting
> stuff related to Native Americans, but since genetics is not
> something I understand really well, I can't recall most of it.
> Unusually large groupings around a small number of alleles,
> suggestive that there was a small number of original settlers who
> were closely related. Can't recall anything else.
At the time this came out, Cavalli-Sforza was criticized for not
taking into consideration (or even understanding) the limitations of
the technique he used. The computer will always spit out a nice tree,
but in some cases the branch structure may be very sensitive to small
changes in the input data.
For instance, if we have three genomes
A xxNxxNxxNxxNxxNxx
B xxNxxNxxNxxOxxOxx
C xxPxxPxxPxxPxxPxx
then the distance between A and B is 2, and both A and B are at
distance 5 from C; the correct tree will clearly group A and B
together:
    A   B     C
     \ /     /
      Y     /
       \   /
        \ /
         Y
But what about this case:
A xxNxxNxxNxxNxxNxx
B xxNxxNxxNxxOxxOxx
C xxOxxOxxOxxOxxOxx
The distance between A and B is again 2, and 5 between A and C. But B
is now only 3 away from C, and there's no way of making a tree of this
kind (all three tips at the same level) that correctly represents all
three distances. The program has to choose which one to get wrong:
    A   B     C    A   B     C    B   A C
     \ /     /      \   \   /      \ / /
      Y     /        \   \ /        Y /
       \   /          \   Y          Y
        \ /            \ /
         Y              Y
IIRC, the program will output the first tree, not only making B seem
much farther away from C than it really is, but also hiding the fact
that changing just one gene in B from N to O would change the branch
structure at the 5-deep level.
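
For anyone who wants to check the arithmetic, here is a minimal Python
sketch. It assumes that "distance" means the number of positions at
which two strings differ, and that the trees in question are the kind
drawn with all tips at the same level, which can only fit three
distances exactly when the two largest are equal. The function names
are made up for illustration; this is not the program Cavalli-Sforza
used.

```python
from itertools import combinations

# The three genomes from the second example.
genomes = {
    "A": "xxNxxNxxNxxNxxNxx",
    "B": "xxNxxNxxNxxOxxOxx",
    "C": "xxOxxOxxOxxOxxOxx",
}


def distance(s, t):
    """Count the positions where s and t differ (assumed meaning of 'distance')."""
    return sum(a != b for a, b in zip(s, t))


def pairwise(gs):
    """Distance for every unordered pair of genome names."""
    return {(p, q): distance(gs[p], gs[q]) for p, q in combinations(sorted(gs), 2)}


def fits_level_tree(dists):
    """True if three distances can all be shown exactly on a tree whose
    tips are level, i.e. the two largest distances are equal."""
    ordered = sorted(dists.values())
    return ordered[-1] == ordered[-2]


d = pairwise(genomes)
print(d)                   # {('A', 'B'): 2, ('A', 'C'): 5, ('B', 'C'): 3}
print(fits_level_tree(d))  # False
```

With distances 2, 5 and 5 the check passes; with 2, 3 and 5 it fails,
which is why one of the three distances has to be misrepresented.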
(If the program is written properly, it will of course output warnings
when this sort of ambiguity is found, as well as sensitivity analyses
on the input data --- but the user has to understand the mathematics
to be able to interpret all that).
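
A crude version of such a sensitivity check can be sketched in Python.
This assumes the program joins the closest pair first, as simple
clustering methods such as UPGMA do; whether the program under
discussion worked that way is not stated here. The helper names are
hypothetical. It tries every single N-to-O change in B from the second
example and reports the ones that change which pair is grouped first.

```python
def distance(s, t):
    """Count the positions where s and t differ."""
    return sum(a != b for a, b in zip(s, t))


def closest_pair(gs):
    """The pair of names with the smallest distance (assumed to be joined first)."""
    names = sorted(gs)
    pairs = [(p, q) for i, p in enumerate(names) for q in names[i + 1:]]
    return min(pairs, key=lambda pq: distance(gs[pq[0]], gs[pq[1]]))


genomes = {
    "A": "xxNxxNxxNxxNxxNxx",
    "B": "xxNxxNxxNxxOxxOxx",
    "C": "xxOxxOxxOxxOxxOxx",
}

baseline = closest_pair(genomes)

# Try each single-gene change N -> O in B and record any that flip the grouping.
flips = []
b = genomes["B"]
for i, ch in enumerate(b):
    if ch == "N":
        variant = b[:i] + "O" + b[i + 1:]
        pair = closest_pair(dict(genomes, B=variant))
        if pair != baseline:
            flips.append((i, pair))

print(baseline)  # ('A', 'B')
print(flips)     # [(2, ('B', 'C')), (5, ('B', 'C')), (8, ('B', 'C'))]
```

Every one of the three possible single changes moves B from A's side
of the tree to C's, which is exactly the kind of fragility a single
output tree hides.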
I don't know if C-S ever put up a defense against these critics (apart
from the classical "I plugged it into the chi-square table/a computer
program, and this is what I got, so of course it's right." Cf. Ruhlen
and the "I multiplied a lot of numbers, so there" argument).
Lars Mathiesen (U of Copenhagen CS Dep) <thorinn@...> (Humour NOT marked)