"The problem is that Structure, which uses an algorithm called “k-means,”"
I pointed out that Structure does not use k-means, and a small discussion ensued on Twitter. I see that the above statement has now been removed from the article, but an endnote on the topic remains:
*Originally, I wrote that STRUCTURE uses the k-means algorithm. Some population geneticists thought that I oversimplified what STRUCTURE does. Different clustering algorithms make different assumptions. STRUCTURE is indeed very similar to k-means, but with a particular error structure – binomial instead of gaussian. This is a fine technical detail compared with the principal point, which is that k is picked by the user, and does not emerge from the data automatically. To learn more, see this Twitter chain and this and this. Thanks to Graham Coop at UC Davis.
I did not intend to spend more time on this, but since the author of the article invited me to comment at more than 140 characters on the topic, I thought it was a good idea to do so.
k-means is completely unrelated to the structure algorithm of Pritchard and Stephens. Remember that structure can be run in either a no-mixture or a mixture mode. In both modes, the input is a set of N individuals and K, the number of ancestral populations. In the no-mixture mode, each individual is assigned to one of K populations, while in the mixture mode, each individual's ancestry proportions from the K populations are inferred. (Incidentally, allele frequencies in the K ancestral populations are also inferred, although usually not reported.)
k-means has no mixture mode; it is a clustering algorithm that assigns each individual to exactly one of K populations. Thus, it can be used to solve the same problem as the no-mixture mode of structure. The two algorithms solve this problem in entirely different ways. Saying that structure uses k-means is equivalent to saying that any method that partitions data into k groups uses k-means.
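To make the "one label per individual" point concrete, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. The toy genotype matrix (0/1/2 allele counts at four loci) is invented for illustration, and the deterministic farthest-first initialisation is a simplification chosen for this sketch; standard implementations usually start from random centroids.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iters=100):
    """Lloyd's algorithm: returns one hard cluster label per point."""
    # Farthest-first initialisation (an assumption of this sketch):
    # start from the first point, then repeatedly add the point that is
    # farthest from the centroids chosen so far.
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))

    labels = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical data: six individuals, four loci, genotypes coded 0/1/2.
genotypes = [
    [0, 0, 2, 2], [0, 1, 2, 2], [0, 0, 2, 1],
    [2, 2, 0, 0], [2, 1, 0, 0], [2, 2, 0, 1],
]
labels = kmeans(genotypes, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The output is a list of N integers and nothing else: every individual belongs wholly to one cluster, and there is no way to express partial membership.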
More importantly, structure is commonly used in mixture mode, including in the landmark paper by Rosenberg et al. (2002) that both Wade and the author of the review refer to. In this mode, structure does not even solve the same problem as k-means. Rather than find some partitioning of N individuals into K disjoint clusters, it estimates, for each of the N individuals, the proportion of ancestry drawn from each of the K populations. In practice (including the paper by Rosenberg et al. 2002), many individuals have most (or all) of their ancestry from one or a few of the K populations. If humans had no structure at a particular K, the algorithm could very well produce a jumbled mess of different colors. Instead it produces neat ancestral populations that correspond well to what may be instantly recognizable as major human groups.
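The kind of quantity estimated in mixture mode can be illustrated with a small sketch. This is not the structure algorithm: structure uses Bayesian MCMC sampling and infers the ancestral allele frequencies jointly, whereas the code below is a simplified maximum-likelihood EM that assumes those frequencies are already known. The function name, the frequencies, and the genotypes are all hypothetical.

```python
def estimate_ancestry(genotype, freqs, iters=200):
    """EM estimate of one individual's ancestry proportions q (length K).

    genotype: list of 0/1/2 counts of the reference allele, one per locus.
    freqs:    freqs[k][l] is the reference-allele frequency at locus l in
              population k, assumed known and strictly between 0 and 1
              (a simplification of this sketch).
    """
    k = len(freqs)
    q = [1.0 / k] * k  # start from equal ancestry in every population
    for _ in range(iters):
        totals = [0.0] * k
        n_alleles = 0
        for l, g in enumerate(genotype):
            # Each locus carries two allele copies: g reference, 2 - g alternate.
            for is_ref, copies in ((True, g), (False, 2 - g)):
                if copies == 0:
                    continue
                # E-step: probability that such a copy came from each population.
                w = [
                    q[j] * (freqs[j][l] if is_ref else 1.0 - freqs[j][l])
                    for j in range(k)
                ]
                s = sum(w)
                for j in range(k):
                    totals[j] += copies * w[j] / s
                n_alleles += copies
        # M-step: new proportions are the average assignment over all copies.
        q = [t / n_alleles for t in totals]
    return q


# Two hypothetical ancestral populations with very different frequencies.
freqs = [[0.9, 0.9, 0.9, 0.9],
         [0.1, 0.1, 0.1, 0.1]]

unmixed = estimate_ancestry([2, 2, 2, 2], freqs)  # looks like population 0
mixed = estimate_ancestry([1, 1, 1, 1], freqs)    # heterozygous everywhere
print(unmixed, mixed)
```

Under these assumptions the first individual's proportions approach (1, 0) and the second's stay at (0.5, 0.5). The result for each individual is a vector of K proportions summing to one, not a single cluster label.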
The reader is invited to look at any standard implementation of k-means, such as the one in R, to be convinced that k-means does not even produce the same output as structure. The point is a trivial one, but k-means estimates N parameters (the cluster label for each of N individuals), whereas structure estimates N(K-1) parameters (the mixture proportions of N individuals in K populations; only K-1 numbers are needed per individual, as the proportions have to add up to unity).
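The counting argument can be checked with a few lines of Python. The sample size, the value of K, and the ancestry row below are made-up numbers used only to illustrate the arithmetic.

```python
def kmeans_label_count(n):
    """k-means reports one cluster label per individual."""
    return n


def mixture_free_parameters(n, k):
    """Mixture mode reports K proportions per individual, of which
    only K - 1 are free because each row must sum to one."""
    return n * (k - 1)


n, k = 1000, 5  # hypothetical sample size and number of populations
print(kmeans_label_count(n))          # 1000
print(mixture_free_parameters(n, k))  # 4000

# One hypothetical row of ancestry proportions: the last entry is fully
# determined by the first K - 1 entries.
row = [0.5, 0.25, 0.125, 0.0, 0.125]
implied_last = 1.0 - sum(row[:-1])
print(implied_last)  # 0.125
```

For K = 2 the two counts coincide, which is one way to see that the difference between the outputs grows with K.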
The only thing these algorithms have in common is that they require that the user input K. This point has been used by the plethora of negative reviews of Wade's book to argue that the classification of humans into biological races is arbitrary as it is subjective (it relies on user input of K).
This is a rather weak objection, for at least a couple of reasons. First, K can also be estimated from the data. There are clustering algorithms (such as fineStructure) that do not require the user to supply K: they identify a value of K themselves and organize the K ancestral populations into a hierarchical tree whose deep splits correspond exactly to the continental human races. Another popular algorithm, ADMIXTURE, offers a cross-validation procedure for choosing K. So the choice of K can be automated and need not be subjective.
The more important reason against the "subjective K" objection is that it does not in any way invalidate the partitioning of humans into different K at different levels of granularity. This is reasonably easy to understand: the whole field of taxonomy divides living things into a hierarchical structure. In some cases it is useful to speak of vertebrates, and in others it's useful to speak of mammals, or primates, etc. In humans it's sometimes useful to speak of the entire species H. sapiens in contradistinction to other species, when studying what is common to humans, and sometimes it is useful to speak of major populations of H. sapiens (such as Europeans or East Asians), or minor ones (e.g., Mongols and Vietnamese), when studying how human groups differ from one another. These groupings are not arbitrary, but appear when biological traits (e.g., SNPs) are subjected to various types of analysis (including structure and similar algorithms).