August 08, 2005

Biogeographical Ancestry testing

John Hawks replies to some criticism of his earlier post by a blogger at Majority Rights. The subject is the meaning and usefulness of the DNA Print admixture test (AncestryByDNA).

First, we should distinguish between admixture testing in general, and the DNA Print family of tests in particular. Problems with the latter, e.g., "Native American" affiliation in Greeks, "South Asian" affiliation in Iberians, or "Middle Eastern" affiliation in the Irish do not invalidate the utility of admixture testing in general.

Now, let's see how AncestryByDNA-type tests work:

Allele frequency data are obtained for several reference populations, i.e., European, East Asian, Sub-Saharan African, and Native American. For example, if a locus X has three alleles, A, B, and C, then it is recorded that Europeans may have 50% of A, 30% of B, and 20% of C, while Sub-Saharan Africans may have 20% of A, 40% of B, and 40% of C. This type of frequency information is recorded for all alleles in all reference populations.

Next, individual genotypes are recorded for each customer. For example, for locus X, the customer may have allele C. This is the "hard" mechanistic part of the test, which is almost completely accurate.

Finally, the maximum likelihood estimate of admixture proportions is calculated. Suppose, for example, that our hypothetical individual is 100% European and 0% Sub-Saharan African; the probability of observing a C is then simply 0.2, i.e., the frequency of C in the European population. If he is 100% Sub-Saharan African and 0% European, then this probability is 0.4. If he is 50% European and 50% Sub-Saharan African, then the probability is 0.5 × 0.2 + 0.5 × 0.4 = 0.3. All such admixture proportions are tested in a systematic, algorithmic way.
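The single-locus arithmetic above can be written out as a short sketch. The frequencies are the ones given for the hypothetical locus X; the names `FREQS` and `allele_probability` are only illustrative and are not taken from the actual test.

```python
# Allele frequencies at the hypothetical locus X, taken from the example in the text.
FREQS = {
    "European": {"A": 0.5, "B": 0.3, "C": 0.2},
    "Sub-Saharan African": {"A": 0.2, "B": 0.4, "C": 0.4},
}


def allele_probability(allele, proportions):
    """Probability of observing `allele` in an individual whose ancestry is
    split between the reference populations according to `proportions`."""
    return sum(share * FREQS[pop][allele] for pop, share in proportions.items())


# Try the three compositions discussed in the text: 100%, 50% and 0% European.
for european_share in (1.0, 0.5, 0.0):
    mix = {
        "European": european_share,
        "Sub-Saharan African": 1.0 - european_share,
    }
    print(european_share, round(allele_probability("C", mix), 3))
# prints 0.2, 0.3 and 0.4 for the three compositions, as in the text
```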

So, we see that the most likely ancestral composition of this individual, based on a single locus, is 100% Sub-Saharan African and 0% European, since that composition gives the observed C allele its highest probability (0.4). The same kind of calculation can be used for multiple loci. As more loci are tested, the confidence in the admixture proportions increases. With a single locus, the hypothetical individual presented above could quite easily be a European: after all, 60% of Africans do not have the C allele, whereas 20% of Europeans do. However, if we systematically observed that this individual had such common African alleles at multiple loci, then the probability that he had European admixture would be diminished.
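A minimal sketch of the multi-locus version, under simplifying assumptions: two reference populations only, loci treated as independent, one observed allele per locus, and a plain grid search in place of whatever optimiser the real test uses. Only the first frequency pair comes from the text; the second is made up for illustration.

```python
import math

# Each pair: frequency of the observed allele in (Europeans, Sub-Saharan Africans).
LOCUS_X = (0.2, 0.4)       # allele C at locus X, from the text
OTHER_LOCUS = (0.6, 0.3)   # hypothetical locus pointing the other way


def log_likelihood(european_share, observed):
    """Sum of log-probabilities of the observed alleles for a given European share."""
    return sum(
        math.log(european_share * f_eur + (1.0 - european_share) * f_afr)
        for f_eur, f_afr in observed
    )


def estimate_european_share(observed, steps=100):
    """Grid search over 0%, 1%, ..., 100% European ancestry."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda share: log_likelihood(share, observed))


print(estimate_european_share([LOCUS_X]))               # 0.0, as in the text
print(estimate_european_share([LOCUS_X, OTHER_LOCUS]))  # 0.5 for this made-up pair
```

With one locus the estimate sits at an extreme; conflicting evidence from a second locus pulls it into the interior, and further loci would narrow it down.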

So, admixture proportions depend on the following factors:
  1. Individual genotype
  2. Parental reference populations
What we should note here is that the test does not measure exact admixture proportions from races which were once thought to be pure. Rather, if the individual can be modelled as deriving his ancestry from the reference populations, then his most likely ancestral proportions are reported.

In some cases the "if" is justified. For example, the inhabitants of Brazil can be reasonably seen as the product of admixture between Europeans, Native Americans and Sub-Saharan Africans, because these groups are known to have settled that country.

In other cases, this assumption is not justified. For example, South Asians are not the result of admixture between the four groups listed in the AncestryByDNA test; this is established by the phylogeny of haploid markers such as mtDNA and the Y chromosome, which shows that South Asians have a high proportion of markers that are specific to themselves, e.g., South Asian-specific subclades of mtDNA macrohaplogroup M. So, South Asians are not reasonably modelled as the product of admixture between the four groups, because these four groups do not include a significant component of South Asian ancestry.

It is important to see what the problem is here: any genotype will be assigned admixture proportions by the maximum likelihood estimation algorithm. Even a chimp's genotype would be assigned some proportions that add up to 100%. These proportions make sense only if the individual can be reasonably expected to be derived from the differentiated populations used as references.
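The point can be illustrated with a small sketch. It assumes a coarse grid search over all four-way splits, which is not necessarily how the commercial test does it, and it feeds in a "genotype" whose allele frequencies in the reference populations are random numbers, i.e., an individual with no real relationship to any of them. All names and data here are hypothetical.

```python
import itertools
import math
import random

POPULATIONS = ["European", "East Asian", "Sub-Saharan African", "Native American"]


def simplex_grid(n_parts, steps):
    """Yield every split into n_parts proportions, in multiples of 1/steps, summing to 1."""
    for combo in itertools.product(range(steps + 1), repeat=n_parts - 1):
        if sum(combo) <= steps:
            last = steps - sum(combo)
            yield tuple(c / steps for c in combo) + (last / steps,)


def best_proportions(observed_freqs, steps=10):
    """observed_freqs holds, per locus, the frequency of the observed allele in
    each reference population. Returns the grid point with the highest likelihood."""
    def log_lik(props):
        return sum(
            math.log(sum(p * f for p, f in zip(props, locus)))
            for locus in observed_freqs
        )
    return max(simplex_grid(len(POPULATIONS), steps), key=log_lik)


# Random frequencies stand in for a genotype unrelated to the reference populations.
rng = random.Random(0)
unrelated = [[rng.uniform(0.01, 0.99) for _ in POPULATIONS] for _ in range(50)]
estimate = best_proportions(unrelated)
print(dict(zip(POPULATIONS, estimate)), sum(estimate))
```

Whatever the input, the search can only ever return a point on the grid, so the four numbers always total 100%, meaningful or not.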

This brings us to the second problem: which reference populations? For example, it is true that Europeans settled both Brazil and the United States, but not the same kind of Europeans. So, the frequency data from a pan-European or English sample do not represent the European component in Brazilians.
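To see how much the choice of reference can matter, here is a sketch in which the same two observed alleles are scored against two hypothetical "European" reference panels. The frequencies are invented for the illustration and do not describe any real population; the grid search is likewise just a convenient stand-in for the real estimation procedure.

```python
import math

# Per locus: frequency of the observed allele in
# (hypothetical European panel 1, hypothetical European panel 2, Sub-Saharan Africans).
LOCI = [(0.2, 0.3, 0.4), (0.6, 0.6, 0.3)]
AFRICAN_COLUMN = 2


def estimate_european_share(european_column, steps=100):
    """Maximum likelihood European share, using the chosen European panel."""
    def log_lik(share):
        return sum(
            math.log(share * locus[european_column]
                     + (1.0 - share) * locus[AFRICAN_COLUMN])
            for locus in LOCI
        )
    return max((i / steps for i in range(steps + 1)), key=log_lik)


share_panel_1 = estimate_european_share(0)
share_panel_2 = estimate_european_share(1)
print(share_panel_1, share_panel_2)  # 0.5 with panel 1, 1.0 with panel 2
```

The genotype is identical in both runs; only the reference frequencies at one locus differ, and the reported European share moves from one half to the whole.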

In conclusion, admixture testing works best when the parental populations are well-defined, highly differentiated and known to have historically admixed in a given territory. It does not work well when these conditions are not met.

Addendum

John suggests an alternative way of presenting the results of the test:
Compare to this hypothetical result, based on alleles only without any reference to Linnaean taxonomy. The person is told he has 89 alleles that are common worldwide, 35 that are common in Europe but rare elsewhere, 4 that are very common in East Africa and moderately common in the Near East, 10 that are very common in China and Thailand, moderately common in India and Pakistan, and present but less common in the Near East, and 2 alleles that are very high frequency in Native Americans, but also present in Siberia, Caucasus, the Near East, and Greece.

Certainly, it would be nice to have this type of information accompanying the haplotype results. However, this type of presentation can be deceptive, because many alleles have only slight frequency differences between human populations. For example, an allele may have a frequency of 50% in Africans and 40% in Europeans. We certainly cannot be sure whether it is derived from a European or an African ancestor: it is one of the alleles that are "common worldwide" in the quoted paragraph. However, the co-occurrence of many alleles of this kind carries information: if an individual has, say, 10 such alleles that are slightly more frequent in Africans than in Europeans and 3 that have the opposite pattern, then we can still conclude that African ancestry is highly probable.
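A rough sketch of how such weak signals accumulate, assuming for simplicity that every one of these alleles has exactly the 50% versus 40% frequencies of the example, that loci are independent, and that we only compare fully African against fully European ancestry. The function name is illustrative.

```python
# Assumed frequencies, reusing the 50% vs 40% example from the text.
HIGH, LOW = 0.5, 0.4


def likelihood_ratio(n_african_leaning, n_european_leaning):
    """Likelihood of the observed alleles under fully African ancestry divided
    by their likelihood under fully European ancestry."""
    african = HIGH ** n_african_leaning * LOW ** n_european_leaning
    european = LOW ** n_african_leaning * HIGH ** n_european_leaning
    return african / european


print(round(likelihood_ratio(1, 0), 2))   # a single allele: 1.25
print(round(likelihood_ratio(10, 3), 2))  # ten against three: 4.77
```

One such allele barely shifts the odds, while the 10-versus-3 pattern already favours African ancestry by nearly five to one under these assumptions, and the ratio keeps multiplying as more loci of this kind are added.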

Update

There is a discussion at the Majority Rights blog which raises some objections to my comments. Let's address them one at a time:
2. Dienekes’ labelling of certain DNAP results as “problems” is, in my opinion, not justified without further evidence. Given that “Middle Eastern” in the Irish does NOT imply any sort of direct admixture of Middle Easterners (at least, as they exist today) into the Irish genepool, how it is known a priori that this is a problem?
It is certainly a problem to claim that the Irish have more MIDEAS affiliation than the Turks. It goes against geography, history, physical anthropology and common sense. Until a satisfactory explanation for this unexpected result is offered, we are justified in viewing it as a bug of the test.
I have discovered that Dienekes is incorrect about the chimp comment. From http://www.dnawitness.net we see:-

Number of failed Loci

Chimpanzee - 157 out of 176
Gorilla - 151 out of 176
Orangutan - 137 out of 176”

This is completely irrelevant to my point. My point was that the genotype of a chimpanzee would have to be assigned four numbers adding to 100%. That has nothing to do with the procedure used to obtain the genotype in the first place. Of course, we expect to have failed loci when we try to read a SNP in a different species, because primers developed for humans will not generally work in a different species; however, if we could read the letters at the 176 loci (or at as many loci as are shared between the human and chimp sequences), and plugged the resulting genotype into the estimation algorithm, we would still get four numbers that add up to 100%.

In any case, one can repeat the example using an Aboriginal Australian instead of a chimp. An Aboriginal Australian's genotype would be assigned to the four groups with numbers adding to 100%, even though Aboriginal Australians cannot be viewed as being a mix of the four groups in question.
4. Dienekes’ comments about South Asians disregard the Euro 1.0 test.
The EURO-DNA test measures the "South Asian" component of the "European" component. However, I was not referring to the "South Asian" component of the "European" component, but to the aboriginal South Asian component which is _not_ related to the Western Eurasian (Caucasoid) component, and is evidenced primarily by the predominance of extremely ancient South Asian specific clades of mtDNA macrohaplogroup M. In short, with the exception of some Mongoloid tribes, South Asians are not descended from East Asians (Mongoloids). They are descended from very old indigenous South Asian populations as well as more recent Central Asian (Caucasoid) populations. Whatever similarity they have with East Asians is due to common ancestry _before_ the emergence of Mongoloids.

In other words, a population X may be genetically affiliated to East Asians either due to the joint possession of shared ancestral alleles, or due to the introgression of East Asian alleles into X. Someone labelled as, e.g., 50% European + 50% East Asian according to DNAP may be (i) a first generation Japanese-Briton, or (ii) a Central Asian Turk of ancient Caucasoid-Mongoloid ancestry, or (iii) a South Asian. In cases (i) and (ii) he is the progeny of the admixture between Caucasoid and Mongoloid ancestors, whereas in case (iii) he is the progeny of Caucasoid and Proto-South Asian ancestors without any significant Mongoloid ancestry.
