12. Qiemo 且末[Ch’ieh-mo] = modern Charchan or Cherchen. There has been some confusion about this name as first Chavannes (1907), p. 156, and then Stein (1921), pp. 296 ff., gave the wrong romanization for the first character (Chavannes, using the French EFEO romanization system, gave tsiu, and Stein used the Wade-Giles chü). In fact the character is correctly represented by ch’ieh in Wade-Giles and qie in Pinyin. However, there has never been any serious dispute about its identification with modern Charchan – see, for example, Stein (1921), p. 295, CICA, p. 92, n. 125; although Pulleyblank (1963), p. 109, following Hamilton (1958), p. 121, suggests it was Shanshan (see note 1.13 below).
Charchan is strategically located at the junction of the main route from Dunhuang to Khotan, and the route which goes south through the mountains, around the southern shore of Koko Nor, and on to Xining, and China. A branch from this second route goes south from Kharakhoto to Lhasa.
An ancient trail ran from Xining via Koko Nor and the southern turn-off towards Lhasa at Kharakhoto, and then on to Charchan in the Tarim Basin. Chinese travellers at times attempted to make use of this route to avoid the horrors of the desert journey between Dunhuang and Loulan, south of Lop Nor. West of Koko Nor the trail goes through barren country with little fodder, which was inhabited by a hostile Qiang tribe (or tribes) referred to in the Chinese texts as the Chuo (‘Unsubdued’ or ‘Unruly’) Qiang:
“Charchand is reported to lie at a month’s distance from Khoten by a road which leads all the way along the foot of a mountain range (the so-called Kue-lun of Chinese and European geographers), and between it and the great Desert of Takla-Makān or Gobi. No roads are known to lead across this range further East than that from Poloo, which brings the traveler over to the Pangong Lake in Western Tibet; but there is a road leading eastward into China, which, however, was not used by the Chinese when they were in possession of the country.” Shaw (1871), p. 37.
“We had to go down the other side of the mountain in a cloud of dust, and then, once more, we were on a blistered, yellow table-land. It was bordered with abrupt, eroded mountains on which nothing grew. The great trail from Dulan to Lhasa by Barun wound through here and I even thought I saw traces of the plough. Yes, I was right. There were field shapes, a wall, an earthen roof. We were at Kharakhoto [west of Koko Nor at the junction of the trails from Cherchen to Xining and the road south to Lhasa; about two days’ march to the east of Dzun].” Maillart (1937), p. 101.
13. The kingdom of Shanshan 鄯善[Shan-shan] = ancient Loulan and included the region around Lop Nor (‘Lop Lake’). Its capital during early Han times is called 扜泥 Yüni (often incorrectly transcribed as Wuni) in the Hanshu – see CICA, pp. 81-82 and n. 77 – which probably referred to the region of modern Ruoqiang or the Charklik oasis, to the southwest of the dried-up bed of Lop Nor. It is also sometimes referred to as the kingdom of Krorän. For the use of NW Prakrit in Kroraina (and in Kucha and Karashahr) see Bailey (1985), pp. 4-5.
“Lou-lan is the Kror’iṃna or Krorayina of the Kharoṣṭhī-documents ; it was originally, it seems, the name of the whole country and known as such to the Chinese – although they may have been ignorant of its position – since 176 B.C., when the Hsiung-nu ruler Mao-tun informed emperor Wen of his conquest of this and of other states (HS 94 A.10b, Urkunden I, p. 76). In a more restricted sense, Lou-lan continued to refer to the town of Kror’iṃna, i.e. the area designated LA by Stein (1921), vol. I, pp. 414-415 : see also Enoki (1963), p. 147.” CICA, p. 81, n. 77.
This kingdom also controlled the strategic community located near the northwest corner of Lop Nor, near the outflow of the Tarim River, which was on the route from Dunhuang to Korla. This has caused considerable confusion about where the “capital” lay. I tend to agree with Yu Taishan that the seat of government was always in the fertile Charkhlik oasis:
“On the location of the royal government of Shanshan, there have been two main theories. The first suggests that Wuni was situated southwest of Lob Nor, around present Ruoqiang 婼羌 county. The second suggests that Wuni lay northwest of Lob Nor, around the ruins of Loulan (Kroraimna, Krorayina). In addition, it has been suggested that Shanshan had established its capital at Kroraimna when the name of the state was Loulan, and later moved its capital south of Lob Nor. In my opinion, Shanshan (i.e. Loulan) never moved its capital and the seat of the royal government had always been southwest of Lob Nor.” Yu (1998), p. 197 – and see the whole of his Appendix 2, “On the Location of Capital of the State of Shanshan,” ibid., pp. 197-211, for his detailed presentation of this scenario.
“The town of Wuni was not situated northwest of Lob Nor, but was situated in the present Ruoqiang country (Qarkilik), on the south bank of the Charchen River, by the northern foothills of the Altyn Tagh, southwest of Lob Nor.” Ibid., p. 201.
As it seems that both CICA and Taishan Yu have given the wrong romanization for the first character of the name of the capital (the modern Pinyin should read yu not wu), I think I should give more details here:
扜 yu [yü] GR 13088 [64:3] “1. To make a hand sign; 2. to pull to oneself (the string of a bow).” Couvreur (p. 345) gives: “to make a hand sign. To take.” This character is, unfortunately, not listed in either Pulleyblank or Karlgren.
泥 ni [85:5] EMC: nεj or nεjh; K. 563d * niər.
I am unable to see why this original cannot be accepted, but, as I am no expert in these matters, and several alternatives are presented in CICA, I thought I should include them here, for the reader’s consideration:
“Wu-ni, however, has given rise to considerable discussion because of the uncertainties surrounding the word here transcribed as wu, viz. 扜. According to Yen Shih-ku, it is pronounced ·o· / ·ua, and this view is repeated in T’ai-p’ing yü-lan 792.5a (it is not clear whether this passage belongs to the original Hua-lin p’ien-lüeh of 524, or whether it was copied from a later – T’ang or Sung – manuscript of the Han-shu only around 983; see Tjan (1949), pp. 60-61; Pulleyblank (1963), p. 89, calls it “an anonymous gloss”, but the chances are that it is Yen Shih-ku’s remark).
Secondly, although wu 扜 is included in the dictionary Shuo-wen chieh-tzu of A.D. 100 (see Shuo-wen chieh-tzu ku-lin 5505a) and even in the earlier wordlist Fang-yen, compiled before A.D. 18, if we follow the emendation by Tai Chen (1937) in his Fang-yen su-cheng, p. 295, in the Shuo-wen it is not written 扜 but 㧍. Still more curious, however, is the fact that it does not seem to occur in Han inscriptions or in pre-Han literature, i.e. it is not found in Uchino’s index to the Li shih (1966) nor in Grammata Serica Recensa. According to its rare occurrences in Han literature, assembled in T. Morohashi’s Dai Kan-wa jiten, vol. V, p. 103, no. 11799, wu 扜 seems only to occur in these few placenames in HS 96!
Thirdly, there are variant readings, where wu 扜 is replaced by 扞 (K. 139q : g’ân / ɣân), or 拘, K. 108p : ku / ku, or ku / kəu, or g’u / g’u). These variants occur in some editions of SC 123 (Shao-hsing ed. 123.1b. Palace ed. of 1739, 123.3b. This reading has not been adopted by Takigawa, SC 123.7, who writes 扜 without further explanation), cq. HHS Mem. 78.6bff., both not regarding the city of Wu-ni, but the country of Wu-ni (HS 96.16a ; see note 138).
Now either Wu-mi or Chü-mi (not ni !) may be correct for the completely different country (see below), but, as regards the capital of Lou-lan – Shanshan, it would seem that 扞泥 Han-ni is right, supported as it is by the reading 驩泥 (GSR 158.l : χwân / χuân) in the Hou Han chi by Yüan Hung (328-375), for this agrees with the word occurring in the kharoṣṭhī inscriptions : kuhani (or kvhani), meaning “capital” (Enoki (1963), pp. 129-135 as well as Enoki (1961) and Enoki (1967), cf. Brough (1965)). Pulleyblank (1963), p. 89, reconstructs the “Old Chinese” pronunciation of Wu-ni as ·wāĥ-ne(δ) and believes it “unnecessary … to adopt the reading 扞 … The variant 驩泥 *hwan-nei seems closer to ·waĥ than to *ganh as an attempt to render the hypothetical original behind khvani or kuhani”.” CICA, pp. 81-82, n. 77.
The name of the kingdom Loulan was changed by the Chinese to Shanshan in 77 BCE. See: Chavannes (1905), p. 537, n. 2; Brough (1965), p. 592; Molè (1970), p. 116, n. 183 [note that the date given there for the name change to Shan-shan, AD 77, is incorrect]; CICA, p. 81, n. 77.
It seems that the ancient kingdom of Krorän also included the territory of the oasis of Charkhlik to the southwest of Lop Nor, on the Southern Route, and that this was, indeed, the “capital” or “seat of government”.
Note, however, that Pulleyblank (1963), p. 109, suggests that Shanshan was plausibly identified by Hamilton (1958, p. 121): “with modern Charchan < *Jarjan. The name Shan-shan appears as a substitute for the earlier Lou-lan in the first century BCE. If the foreign original had indeed palatials at this period, we must suppose that the Chinese palatials were already beginning to develop, perhaps in an intermediate stage *d. There are too many uncertainties, however, for this to provide a firm argument.” This suggestion does, however, seem to have been overridden by the argument that the name must refer to the largest oasis in the region – that of modern Charklik:
“There can be no doubt that by Shan-shan is here meant [that is, in the Weilue] the present Lop tract with its main oasis of Charkhlik.” Stein (1921), p. 328. See also: Giles (1930-1932), p. 830; Part 4, note 15; Pelliot (1963), p. 770.
“The northern and southern routes came together again on the eastern rim of the Tarim near the great salt marsh of Lopnur in the kingdom of Krorän (Kroraina, Loulan) before continuing into the world of the ethnic Chinese.” Mallory and Mair (2000), pp. 64-65.
“The statelet that formed about the great salt marshes of Lopnur was known in Chinese sources originally as Loulan and then later was called Shanshan . . . when the territory came under Chinese dominion in 77 BC. The name Loulan reflects an attempt to render in Chinese what we find in later Indian (Kharoṣṭhī) documents from the region as Krora’ina or Krorayina (now Krorän).... As for the name ‘Shanshan’, this has been seen as a precursor to the name ‘Chärchän’, where some of the most spectacular mummies have been recovered. This is hardly unexpected as the region possesses immense deserts laden with salt that both early Chinese guidebooks and modern explorers have described in detail. A 1st-century BC document informs us that from the Chinese outpost at Dunhuang to Krorän there was a desert that stretched for 500 li [208 km] in which there was neither water nor grass.” Ibid., p. 81.
“During the Han period the population of Krorän is given as 1,570 households, with 14,100 people of whom 2,912 could bear arms. The agricultural potential of the region is described as limited, its soils being too sandy and salty, and food crops had to be brought in from neighbouring states. There was an important nomadic component in the region where asses, horses and many camels are reckoned. Other products were jade, rushes, tamarisk and balsam poplar.” Ibid., p. 85.
“Another prominent site to see some excavation is the town of Krorän which was excavated by Sven Hedin, Aurel Stein and, most recently, by Hou Can. Its stamped earth walls still stand up to 4 m (13 ft) in height and ran about 330 m (1,083 ft) on each side. It housed clusters of temples, the government central office, residential quarters and what has been dismissed as a ‘slum’. Within 5 km (3 miles) of the town Hou Can uncovered seven tombs. Among the burials was a middle-aged woman who had a child placed over her head, reminiscent of the ‘Scream Baby’ excavated at Zaghunluq.” Ibid., p. 165.
“The next town [to the east, after Quemo/Cherchen], Ruoqiang (Charkhlik) is no bigger but is nevertheless the most important in the vast region encompassing the salt seabed of the dried-up Lop Nor. In the first century BC it formed part of the Kingdom of Loulan, which was later to change its name to ‘Shanshan.’ At Ruoqiang the road divides, one branch heading north to Korla, the other taking a more southerly road than the original Silk road, crossing into Qinghai Province and then turning northeast to Dunhuang. East of Ruoqiang lies another archaeological site, Miran, which Stein visited in 1906. In the 1970s Chinese archaeologists found a Han-Dynasty system of irrigation canals here. To the south lie the Altun Mountains, where a large nature reserve has been established. It was here in the 1880s that the Russian explorer, Nikolai Prejewalski, discovered the only existing species of the original horse, which was named Equus prezawalski. Extinct in the wild, the species is now only bred in zoos.” Bonavia (1988), p. 192.
“Krorän was included in the lists of conquests carried out by the Xiongnu leader Modu in 176 BC and, with the westward expansion of the Han, it found itself in the middle of two warring empires…. The king saw that it was impossible to navigate between two such masters and tilted his hand towards the Han, who took advantage of the situation, assassinating one king and beheading another until they had installed someone they could trust, and in 77 BC the name of Shanshan. Although ostensibly under Han control, as late as AD 25 it was recorded that Krorän was still in league with the Xiongnu.
Understandably, during the floruit of the Silk Road, Krorän was a place of great strategic importance. About
Citation: Serang O, Mollinari M, Garcia AAF (2012) Efficient Exact Maximum a Posteriori Computation for Bayesian SNP Genotyping in Polyploids. PLoS ONE 7(2): e30906. https://doi.org/10.1371/journal.pone.0030906
Editor: Fabio Rapallo, University of East Piedmont, Italy
Received: October 22, 2011; Accepted: December 27, 2011; Published: February 17, 2012
Copyright: © 2012 Serang et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This research was supported by Fundacao de Amparo a Pesquisa do Estado de Sao Paulo (grant 2008/52197-4 and 2008/54402-4). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Most agriculturally important plant species, such as potato, sugarcane, coffee, cotton and alfalfa, are polyploids. In fact, about half of the natural flowering plant species are polyploids. Despite their importance, our understanding of these species does not fully benefit from marker technology. Molecular markers are widely used for diploid species and can be very useful for building linkage maps, finding genomic regions associated with variation in quantitative traits (or QTL), studying the genetic architecture of quantitative traits, and assembling genome sequences.
Accurate genotyping of polyploids (even for largely uncharacterized species or in cases when the ploidy is unknown) is a missing keystone in genetics that must be solved in order to utilize the approaches that have marked a revolution in biology over the past hundred years. Accurate genotypes are necessary to understand the genetic mechanisms and specific loci that determine phenotypes via QTL mapping and association studies. These genotypes are also necessary for the creation of linkage maps, which are exceedingly useful in developing a greater understanding of genome evolution. These linkage maps will be essential for the assembly of complex polyploid genomes.
The current approach used for several genetic studies on polyploids, especially for linkage mapping, is based on marker loci with only a single copy (simplex) in one of the parents and a nulliplex in the other, in populations obtained from the cross of non-inbred parents. Markers such as AFLP and SSR (i.e. microsatellites) are then scored as presence or absence of bands and behave like dominant markers. For sugarcane, most available linkage maps are based on markers segregating in 1:1 (single dose in one parent) or 3:1 patterns (single dose in both parents). Even if complex statistical methods are applied to obtain integrated maps that combine information from markers with both patterns simultaneously, the available maps are based on a small sample of the genome, since markers with higher doses are normally not included; therefore, they are not well saturated and are not very informative for genome assembly. For QTL studies in sugarcane, the situation is similar. Statistical models developed for backcrosses are used for simplex × nulliplex configurations with available software that was developed for diploids. Since the ploidy level could be related to gene expression, these approaches need to be modified to incorporate allele dosage using more efficient marker systems.
Nowadays, new technologies such as Illumina GoldenGate™ and Sequenom iPLEX MassARRAY® allow researchers to generate high-throughput genotyping data from SNPs. These data usually contain two signals for each SNP locus, each one corresponding to an intensity recorded for one of the two possible alleles. The expected value of each signal intensity is proportional to the corresponding allele dosage, and therefore SNPs are the marker of choice for genetic studies in polyploids. They are more informative than presence/absence markers, and should allow a better coverage of the genome and the development of more realistic models for linkage studies, QTL and association mapping, among other applications.
In order to explore the full potential of such technologies, a first required step is the development of statistical methods for SNP genotype calling, i.e. inferring the (discrete) genotype of each individual for each locus, identifying the number of copies of each allele. For diploids, including humans, a number of methods are already available. This is not the case for polyploids. Methods for polyploid genotyping need to be able to deal not only with multiple copies of the alleles, but also with some complex problems such as aneuploidy and unknown ploidy, which can be present for some species.
Voorrips et al. presented an approach based on mixture models for genotype calling in autotetraploids, similar to approaches used in diploids. Based on the (transformed) allele signal ratio (ratio of one signal peak to the total), they fitted a mixture of five normal distributions, each one corresponding to one genotype class (from zero to four copies of the allele). They compared several models and were also able to test for Hardy-Weinberg equilibrium in a panel of tetraploid potato varieties. Their model could be expanded to include more classes in the mixture in order to be useful for other autopolyploids; however, in certain situations the ploidy (and hence the number of classes) is unknown and needs to be estimated. Also, crosses with distinct ploidies and parents may result in similar segregation patterns, making the selection of the best model a complicated task. This is the case for sugarcane, which is a very complex polyploid and aneuploid species. Genotype calling in sugarcane is extremely difficult, especially if commercial varieties are used, since they are interspecific hybrids between domesticated and wild relatives.
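The five-class mixture idea can be sketched as follows. This is only an illustration under stated assumptions, not the fitTetra implementation: the class means are fixed at the untransformed ratios k/4, all classes share one standard deviation, and the mixing weights are supplied rather than fitted. The function names are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def class_posteriors(ratio, weights, sd=0.05, ploidy=4):
    """Posterior probability of each dosage class for one signal ratio.

    Assumption: class k has mean k / ploidy and every class shares `sd`.
    """
    means = [k / ploidy for k in range(ploidy + 1)]
    joint = [w * normal_pdf(ratio, m, sd) for w, m in zip(weights, means)]
    total = sum(joint)
    return [j / total for j in joint]


def call_dosage(ratio, weights, sd=0.05, ploidy=4):
    """Return the dosage class with the highest posterior probability."""
    post = class_posteriors(ratio, weights, sd, ploidy)
    return max(range(len(post)), key=post.__getitem__)
```

With equal weights, a ratio close to 0.5 is called as two copies of the allele. A real analysis would estimate the means, standard deviations and weights from the data, and the number of classes would have to change with the ploidy, which is the limitation discussed above.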
Here we present a graphical Bayesian model for SNP genotype calling. Our method can infer genotypes even when the ploidy of the population is unknown. At the core of Bayesian thinking is the notion of modeling processes forwards rather than trying to model their inverse. Generally, a great deal of prior knowledge is available regarding the way any process behaves running forwards; when the process is modeled generatively (i.e. running forwards), this prior knowledge can be exploited to improve the fidelity with which the model describes the process. In graphical models, prior knowledge regarding independence and conditional independence of variables can be visualized in the structure of the graph. The highly connected subunits of the graph can be considered with modularity; that is, a subunit can be easily interchanged with another. This modularity is what allows our model and inference procedure to work with populations in Hardy-Weinberg equilibrium, the progeny of an F1 cross, or any population with a known theoretical distribution of genotypes. There are many other ways that our model, and similarly motivated models, can be easily changed and improved because of their modularity and generality.
We also introduce an algorithm for finding the exact maximum a posteriori (MAP) genotype configuration with this model. This algorithm is implemented in a freely available software package named SuperMASSA. We demonstrate the utility, efficiency, and flexibility of the model and algorithm by applying them to data from two polyploids processed with two different platforms: potato using the Illumina GoldenGate™ assay and sugarcane using Sequenom iPLEX MassARRAY®.
Materials and Methods
An autotetraploid potato collection was used, comprising 384 SNPs scored in a panel of 224 individuals using the Illumina GoldenGate™ assay, as described previously. This data set is distributed along with the free R package fitTetra, under the GNU General Public License. To exemplify the results obtained using the mixture model, Voorrips et al. chose three loci, PotSNP016, PotSNP034 (Figure 1) and PotSNP192. However, for locus PotSNP192, they noted that the Illumina GoldenGate assay produced significantly different signal strengths for the alleles, resulting in skewed clusters. Thus, the intensity ratio between those alleles cannot be easily used to infer genotypes. Since our model assumes the signal strength of each allele is proportional to the dosage (and that the proportionality constant for both alleles is similar), we used only PotSNP016 and PotSNP034 to exemplify our method. For this data set, we use the same model of the genotype distribution as Voorrips et al. (i.e. Hardy-Weinberg). These two SNPs were also scored in 64 diploid potato varieties that were used for a visual check of the goodness of fit, and we also analyze these diploid individuals using PotSNP016 and PotSNP034. Moreover, since we know the ploidy for both the diploid and tetraploid potatoes, we can check whether the ploidy estimated by our model matches the actual one.
A sugarcane mapping population derived from a cross between two commercial varieties (IACSP 95-3018 × IACSP 93-3046) was used. It comprised 180 individuals scored for 241 SNPs using the Sequenom iPLEX MassARRAY® technology. This assay is based on allele-specific primer extension with a mass-modified terminator. The DNA products of this reaction are analyzed by a MALDI-TOF mass spectrometer and each polymorphic region of interest is detected by the mass of the allele-specific primer. Both parents were also scored 12 times for each SNP. If the ionization efficiency is similar for both alleles, the intensities produced by mass spectrometry are proportional to abundance (with a very similar proportionality constant if run in the same sample prep); therefore, if the amplification of both alleles is similar, the skew is minimal. We observe much less skew in the sugarcane data set compared to the potato data set.
Modern sugarcane varieties have highly polyploid and aneuploid genomes, with ploidy levels ranging from 5 to 16. Therefore, unless there is strong cytological information for a marker, it is important to also estimate the ploidy. Since we want to test our model and do not have a reference point for sugarcane (such as the known diploid or tetraploid potato varieties), and also because sugarcane meiosis frequently results in deviations from the expected Mendelian segregation ratios, we used a blind method to curate the data and evaluate SuperMASSA.
First, all sugarcane loci were curated by eye using several criteria. For each locus, an expert looked at raw scatter plots as shown in Figure 1 and assessed the following: i) the overall quality; ii) the number of clusters; and iii) the expected ploidy level based on parental data. This resulted in 27 SNPs that were easily classified by eye. SuperMASSA was used to predict the ploidy and number of clusters for each of these 27 loci and three of them (the three judged to be of the highest quality) are used to show the results of our model.
It is important to note that in this blind validation experiment, SuperMASSA was not used to curate the data and the model behind SuperMASSA was not changed after observing and curating the data.
Probabilistic Graphical Model
We use a Bayesian approach to model the probability of the observed data given the ploidy and all genotypes. By modeling the generative process (i.e. the process by which the data are produced, assuming we know the ploidy and genotypes of all individuals), we can build the model from realistic assumptions about the data. Using the model, we then perform inference (described in the Probabilistic Inference section) to effectively enumerate all possible ploidies and genotypes for individuals in the population, and choose the configuration that maximizes the posterior probability under the model. This configuration is known as the maximum a posteriori (MAP) configuration and, by definition, has the highest posterior probability.
In Figure 2 we present two probabilistic directed graphical models of the SNP genotyping process for a single locus: a Hardy-Weinberg model and an F1 model. These models represent dependencies using directed edges. Both models share similar motivation and notation; the few differences arise from different models of the distribution of genotypes in the population. We first present the shared model components and then present the details specific to each model.
Hardy-Weinberg and F1 Model Similarities.
For both models, the “genotype configuration” is the collection of genotype assignments for all individuals in the data set. Because the ploidy, denoted $P$, determines the possible set of genotype outcomes, the genotype configuration $G$ depends on the ploidy $P$. Denote the set of possible genotype outcomes for a given ploidy as $\Gamma_P$. For example, for a diploid locus $P = 2$ and the set of possible genotypes is $\Gamma_2 = \{(0,2), (1,1), (2,0)\}$. Both models use a uniform prior on the ploidy $P$; it should be noted that for the data we analyzed, the influence of any weak priors is negligible because of a pronounced drop in suboptimal posteriors relative to the MAP configuration.
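The genotype sets described above can be sketched in a few lines, assuming a genotype is written as an ordered pair of allele dosages that sums to the ploidy. The function names are illustrative and are not taken from SuperMASSA.

```python
def possible_genotypes(ploidy):
    """All ordered dosage pairs (first allele, second allele) for a ploidy."""
    return [(a, ploidy - a) for a in range(ploidy + 1)]


def uniform_ploidy_prior(candidate_ploidies):
    """Uniform prior over a list of candidate ploidies."""
    weight = 1.0 / len(candidate_ploidies)
    return {p: weight for p in candidate_ploidies}
```

For a diploid locus this gives three genotypes, and for a tetraploid locus five, one per possible dosage of the first allele.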
The observed data is composed of a collection of data points , each of which comprises an intensity pair and an individual that gave the sample producing the pair. We assume that each data point depends only on the individual that produced it; therefore, the likelihood of any genotype configuration can be written as a product over individuals:For some , we model the likelihood proportional to using a normal distribution with unknown standard deviation :where the operator is used to perform normalization on and . This likelihood effectively uses the expected angles of each genotype and penalizes individuals deviant from the genotype of the expected angle. For this reason, “skewed” data, where the intensities measured by allele 1 and allele 2 use very different constants of proportionality with their respective dosages, cannot be modeled without including a latent variable for the skew. Sigma is given a uniform prior and inference is performed in a manner similar to inference over all ploidy.
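A hedged sketch of such a likelihood is given below. The normalization operator itself is not recoverable from the text, so the sketch assumes that each intensity pair and each genotype is reduced to an angle with `atan2` and that the deviation between the two angles is normally distributed with standard deviation `sigma`. This is an assumption made for illustration and may differ from the authors' exact form.

```python
import math


def angle(pair):
    """Angle of a pair of non-negative values (intensities or dosages)."""
    x, y = pair
    return math.atan2(y, x)


def log_likelihood_point(intensity, genotype, sigma):
    """Log normal density of the angular deviation from the expected angle."""
    deviation = angle(intensity) - angle(genotype)
    return (-0.5 * (deviation / sigma) ** 2
            - math.log(sigma * math.sqrt(2.0 * math.pi)))


def log_likelihood_configuration(data, configuration, sigma):
    """Sum of log-likelihoods over individuals and their data points.

    data: individual -> list of intensity pairs (replicates allowed)
    configuration: individual -> genotype (ordered dosage pair)
    """
    return sum(
        log_likelihood_point(point, configuration[individual], sigma)
        for individual, points in data.items()
        for point in points
    )
```

Because the log-likelihood is a sum over individuals, it matches the product form stated above. Skewed data would shift the observed angles away from the expected ones, which is why an angle-only likelihood cannot handle skew without a further latent variable.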
For any genotype configuration $G$, both models also compute $C$, the distribution of possible genotypes: $C = (c_g)$ over the possible genotypes $g$, where $c_g$ equals $|\{i : g_i = g\}|$, the number of individuals assigned to genotype $g$. The probability of any distribution $C$ is modeled using the theoretical distribution of genotypes $T$. Given the theoretical genotype frequencies $T = (t_g)$ for the population, where $\sum_g t_g = 1$, the probability of observing any genotype distribution $C$ over $n = \sum_g c_g$ individuals is multinomial:

$$\Pr(C \mid T) = \frac{n!}{\prod_g c_g!} \prod_g t_g^{c_g}$$

Both the Hardy-Weinberg and F1 models allow for individuals with replicate data points. If all individuals have the same number of replicate data points, then the MAP configuration is guaranteed to be found (as shown in Supplement S1).
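The multinomial term can be sketched as below, working in log space for numerical stability. The function names and the dictionary representation of a configuration are assumptions made for illustration.

```python
import math
from collections import Counter


def genotype_counts(configuration, genotypes):
    """Number of individuals assigned to each genotype, in `genotypes` order."""
    tally = Counter(configuration.values())
    return [tally.get(g, 0) for g in genotypes]


def log_multinomial(counts, frequencies):
    """Log probability of genotype counts under theoretical frequencies."""
    n = sum(counts)
    log_p = math.lgamma(n + 1)
    for count, freq in zip(counts, frequencies):
        if count == 0:
            continue
        if freq == 0.0:
            return float("-inf")  # an impossible genotype was observed
        log_p += count * math.log(freq) - math.lgamma(count + 1)
    return log_p
```

For example, counts of (1, 2, 1) under frequencies (0.25, 0.5, 0.25) have probability 12 × 0.25 × 0.25 × 0.25 = 0.1875.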
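Figure 2B depicts the dependencies of the F1 model. In the F1 model, the theoretical distribution of genotypes is modeled using hypergeometric distributions for the gametes (it is important to note that any model could be trivially applied instead). Denote $a(g)$ to be the dosage for the first allele in the ordered pair $g$ and $b(g)$ to be the dosage of the second allele in the pair. Given parents $p$ and $q$, both of which have values in the set of possible genotypes for ploidy $P$, the probability of observing gamete $h$ from $p$ (without loss of generality) is

$$\Pr(h \mid p) = \frac{\binom{a(p)}{a(h)} \binom{b(p)}{b(h)}}{\binom{P}{P/2}}, \qquad a(h) + b(h) = P/2$$

Therefore, the probability of observing offspring $g$ is

$$\Pr(g \mid p, q) = \sum_{h + h' = g} \Pr(h \mid p) \, \Pr(h' \mid q)$$

In the F1 model, the parent genotypes $p$ and $q$ depend on the ploidy $P$, since the outcomes of both must be possible genotypes for that ploidy. We model the prior probability as uniform for the number of unique outcomes, i.e. the unordered pairs of the $P + 1$ possible genotypes:

$$\Pr(p, q \mid P) = \binom{P + 2}{2}^{-1}$$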
In Figure 2B dashed nodes and arrows represent variables and dependencies that exist only when data from the parents is included. The probability of these parameters can be modeled as conditionally independent, just like :When parental data is used, the parents are distinct and so the number of unique parental combinations becomes ; therefore, when parental data is available, the prior probability on parental configurations becomes uniform over these distinguishable outcomes.
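A minimal sketch of the hypergeometric gamete model for the F1 cross follows. It assumes an even ploidy, gametes carrying half the parental chromosomes, and no double reduction. Genotypes are reduced to the dosage of the first allele, and the function names are illustrative.

```python
from math import comb


def gamete_distribution(dosage, ploidy):
    """Probability of each gamete dosage (0..ploidy/2) from one parent.

    Hypergeometric: draw ploidy/2 chromosomes without replacement from a
    parent carrying `dosage` copies of the first allele.
    """
    half = ploidy // 2
    total = comb(ploidy, half)
    return [comb(dosage, k) * comb(ploidy - dosage, half - k) / total
            for k in range(half + 1)]


def offspring_distribution(dosage_1, dosage_2, ploidy):
    """Theoretical offspring dosage frequencies for a cross of two parents."""
    gametes_1 = gamete_distribution(dosage_1, ploidy)
    gametes_2 = gamete_distribution(dosage_2, ploidy)
    offspring = [0.0] * (ploidy + 1)
    for i, p in enumerate(gametes_1):
        for j, q in enumerate(gametes_2):
            offspring[i + j] += p * q  # offspring dosage is the gamete sum
    return offspring
```

In a tetraploid, a simplex × nulliplex cross yields dosages 0 and 1 in equal proportion (the 1:1 pattern), and a simplex × simplex cross yields 0.25, 0.5, 0.25 for dosages 0, 1 and 2, which scores as 3:1 when only presence or absence is recorded.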
Generalized Population Model.
The inference procedure described does not make any special use of the type of parameters that determine ; therefore, given the parameters that determine (and do not depend on , , or ), our inference method will find the MAP genotype configuration. This illustrates that both the Hardy-Weinberg and models are specific instances of a general model (where and , respectively). is searched in a similar manner, but since we use a uniform prior, we search all parameter configurations for a given and omit from for simplicity (this strategy also allows us to cache the table of likelihoods for a given ). When parental data is included in the model, it can be modeled by setting the prior probability (that is, the probability including available parent data but excluding data from progeny) to
We define the “generalized population model” as the model defined using . For each we will compute the MAP genotype configuration ; using the prior probability of , we can enumerate the possible outcomes of and compute both the genotype configuration and parameters that jointly maximize the posterior probability for these parameters. Using this approach we can also approximate , the posterior belief that the MAP parameter and genotype configuration is correct.
Before inference is performed, it is necessary to demonstrate that the parameters can be inferred with a sufficient amount of data (i.e. they are “identifiable”). By the law of large numbers, the densities of the genotypes and allele intensities converge to the density expected from the parameters as the number of individuals $n \to \infty$; therefore, with enough individuals, the exact distribution of genotypes and allele intensities is known. In order to prove that the parameters are identifiable, we must demonstrate that the parameters can be computed from this density (i.e. that the mapping from parameters to density is one-to-one). It is sufficient here to prove that no two non-identical sets of parameters can yield the same density.
By assumption, our model considers data that are a weighted sum of Gaussians (one for each genotype), each with a mean at the expected slope for the two allele intensities. Algebraically, for two densities to be equal, the two equivalent sums of shifted Gaussians must use identical sets of shifts (wherever the corresponding weights are nonzero). Furthermore, the corresponding weights must be equal for Gaussians with the same shift. Together, these statements require that identical densities must be created by sets of parameters with identical angles for all possible genotypes. This requires that the two parameter sets yield equal dosage-to-ploidy ratios for each possible genotype.
If this set of contains more than one possible genotype, then the difference between the two dosages increases for the larger ploidy (because the ploidy, the denominator in both slopes, has increased, but the slopes remain constant). Because these dosages are necessarily integers, the difference must increase by at least one, indicating a new genotype class with expected slope between the other two. Therefore, to have the same set of , the larger ploidy has a possible genotype class that is not possible with the smaller ploidy. Thus, the larger ploidy must assign a weight to that new genotype class.
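The integer argument above can be checked on a small case. The following is a minimal sketch, not taken from the original work: the helper names `ratio_set` and `new_classes` are hypothetical, and genotype classes are assumed to be identified only by their dosage-to-ploidy ratio.

```python
from fractions import Fraction

def ratio_set(ploidy):
    """All dosage-to-ploidy ratios attainable at the given ploidy."""
    return {Fraction(dosage, ploidy) for dosage in range(ploidy + 1)}

def new_classes(observed, ploidy):
    """Ratios attainable at `ploidy` that fall strictly between the
    smallest and largest observed ratio but were not observed."""
    lo, hi = min(observed), max(observed)
    return sorted(r for r in ratio_set(ploidy)
                  if lo < r < hi and r not in observed)

# Illustrative data: the three ratios produced by a diploid.
observed = {Fraction(0), Fraction(1, 2), Fraction(1)}
print(new_classes(observed, 2))  # no extra classes at ploidy 2
print(new_classes(observed, 4))  # ratios 1/4 and 3/4 appear at ploidy 4
```

In this toy case a tetraploid can reproduce the observed ratios only by also admitting the intermediate classes 1/4 and 3/4, which is the situation the argument relies on.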
However, both models considered (Hardy-Weinberg and ) create unimodal (or flat) weight distributions. For this reason, they cannot create sequential weights that are nonzero, zero and then nonzero again. Furthermore, given the ploidy, the weights (or expected frequencies) are sufficient to estimate . Therefore, if more than one possible genotype exists, the parameters are identifiable (the lowest ploidy that could produce the desired angles is the only one possible). When only one possible genotype exists, the ploidy cannot be estimated (it could be any multiple of a ploidy that produces the correct angle). In this case, we use an Occam's razor approach by placing a decreasing prior on the ploidy .
In order to perform inference on the generalized population model described in the Probabilistic Graphical Model section, we introduce three approaches: a greedy approach (maximum likelihood), an exact approach (MAP) via dynamic programming, and a substantially more efficient exact approach (also MAP). For all inference methods, assume is known. The best greedy genotype configuration and can be chosen by enumerating all outcomes of and selecting the one with highest posterior.
Graphically, it is trivial to demonstrate why MAP inference is difficult. Consider , a single bin in the distribution ; it has incoming edges from all individuals' genotypes . Thus, in the moral graph (in which all nodes with a common successor are joined by an undirected edge), an edge joins each pair of nodes , resulting in a clique of size . The treewidth of a graph containing an clique is at least , so standard inference methods (e.g., naive enumeration or junction tree inference) will require a number of steps at least exponential in ; for problems of the size we consider (), a runtime exponential in is infeasible.
The combinatorial dependencies between genotypes in different individuals must be recognized in order to compute the MAP genotype configuration. It is tempting to approximate these dependencies with a mixture model. A mixture model approach treats all as independent draws from the distribution ; however, a mixture model rewards configurations assigning all individuals the most probable genotype in . In reality, such a configuration is extremely improbable because there is only one series of genotype assignments that results in this outcome. On the other hand, if is chosen so that not all individuals are assigned the most probable genotype in , the multinomial probability may be larger because there are many genotype configurations that could lead to (compared to the single configuration that yields the most probable genotypes). Modeling this dependency between all individuals, although computationally challenging, is extremely important.
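The counting argument above can be made concrete with the multinomial coefficient, which gives the number of distinct genotype configurations that produce a given vector of genotype counts. The following is a minimal sketch; `num_configurations` is an illustrative helper name rather than part of the original method, and the example counts are made up.

```python
from math import factorial

def num_configurations(counts):
    """Number of ways to assign sum(counts) individuals to genotype
    classes so that class g receives counts[g] individuals."""
    total = factorial(sum(counts))
    for c in counts:
        total //= factorial(c)  # each partial quotient is an integer
    return total

# Ten individuals, three genotype classes (illustrative numbers).
print(num_configurations((10, 0, 0)))  # 1
print(num_configurations((6, 3, 1)))   # 840
```

Only one configuration places all ten individuals in the same class, whereas 840 configurations yield the counts (6, 3, 1); a mixture model that scores individuals independently ignores this factor.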
In the simplest approach, all possible genotype configurations can be enumerated naively in exponential time, resulting in the tree shown in Figure 3A. Although it is infeasible to enumerate the entire tree, it may be possible to ignore subtrees that cannot lead to an optimum, substantially reducing the search space.
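For very small inputs, the naive enumeration can be written directly. The sketch below is illustrative only and makes several assumptions: `loglik[i][g]` is a hypothetical table of per-individual log-likelihoods, `count_score` is a hypothetical function returning the log of any term that depends only on the genotype counts, and no bounding is performed.

```python
from itertools import product

def naive_map(loglik, count_score):
    """Exhaustively score every genotype configuration.

    loglik[i][g]: assumed log-likelihood of individual i having genotype g.
    count_score(counts): assumed log-term depending only on genotype counts.
    Visits len(loglik[0]) ** len(loglik) leaves, so it is only usable
    for toy problems.
    """
    n_genotypes = len(loglik[0])
    best_score, best_config = float("-inf"), None
    for config in product(range(n_genotypes), repeat=len(loglik)):
        counts = tuple(config.count(g) for g in range(n_genotypes))
        score = sum(row[g] for row, g in zip(loglik, config))
        score += count_score(counts)
        if score > best_score:
            best_score, best_config = score, config
    return best_score, best_config
```

With three individuals and two genotypes this visits 8 leaves; the count grows exponentially with the number of individuals, which is why pruning subtrees matters.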
Figure 3. Illustration of Exact Inference.
Exact MAP computation can be performed by enumerating all possible genotype configurations. Because each individual's genotype is among , searching through genotype configurations can be viewed as a tree in which each individual genotype assignment branches into separate outcomes. (A) A naive search progresses downward through the tree and chooses the series of genotype assignments that leads to the highest posterior probability. A naive branch and bound method derived from this tree bounds genotype configurations for which the prefix determines that all subsequent paths are poor. (B) A multinomial graph (i.e., the subset graph of the power-set of G) merges outcomes that result in the same genotype counts . Multiple paths (from the top) can lead to any given set of genotype counts; therefore, dynamic programming is used. Given the layer above, each node can compute the most likely path from the top that leads to it. Once the most likely path and score are computed for each node in a layer, the next layer can progress. At the bottom layer, the node with the highest combined likelihood (computed via dynamic programming) and (the same for any path terminating at the node) maximizes the posterior probability. As in the naive tree, once all lower adjacent nodes in a subtree are provably suboptimal, then the subtree can be bounded. The dynamic programming approach is substantially more time- and space-efficient than the naive approach.
Consider individuals in an arbitrary order with some genotypes assigned: Let denote for and denote the unassigned genotypes . We refer to the assigned genotypes as a “prefix” genotype configuration and the unknown as a “suffix”. Given a prefix genotype configuration, it is possible to bound the joint probability of all configurations with this prefix by bounding the likelihood for the remaining configurations: (4)
Given a genotype and parameter configuration , any configuration including the prefix satisfying the following inequality is suboptimal: The prefixes correspond to paths from the top of the tree in Figure 3A; prefixes that are shown to be suboptimal can be “bounded,” meaning that they are not branched and searched further down. The second product may be cached for all for a speedup of . It is worth noting that this second product must be included, because the likelihood constant on is unknown and so we cannot guarantee that . With all of the branch and bound approaches, the initial values can be computed using the greedy maximum likelihood approach and then improved as more probable configurations are found.
A more sophisticated dynamic programming approach (shown in Figure 3B) merges nodes of equal depth that produce identical distribution prefixes and the number of individuals with each genotype in the genotype prefix. Because , then if two prefixes produce the same distribution prefixes , the suffixes satisfying are the same as the suffixes satisfying . For this reason, other than the prefix likelihoods and , all other values in equation 4 will be the same; therefore, all prefixes producing the same prefix distribution can be grouped together, using the greatest prefix likelihood and corresponding prefix path. These grouped nodes can be added in batches for each depth to produce a “layer;” by induction the best path to each node in a layer includes the best path to the nodes in the layer above. The same bound from the naive tree is used, but subproblems that are identical are grouped and solved together to avoid redundant computation and storage.
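The layer-by-layer merging can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: `loglik[i][g]` is a hypothetical table of per-individual log-likelihoods, `log_multinomial` stands in for whatever count-dependent term the model uses (here a multinomial with fixed, strictly positive frequencies), and the bounding step is omitted for brevity.

```python
from math import lgamma, log

def log_multinomial(counts, freqs):
    """Log multinomial probability of `counts` under class frequencies
    `freqs` (assumed strictly positive); a stand-in count-dependent term."""
    out = lgamma(sum(counts) + 1)
    for c, f in zip(counts, freqs):
        out += c * log(f) - lgamma(c + 1)
    return out

def dp_best_by_counts(loglik, count_score):
    """Merge prefixes with equal genotype counts, keeping for each count
    vector only the most likely prefix path (no bounding in this sketch)."""
    n_genotypes = len(loglik[0])
    # Each layer maps a count vector to (best prefix log-likelihood, path).
    layer = {(0,) * n_genotypes: (0.0, ())}
    for row in loglik:  # one new layer per individual
        nxt = {}
        for counts, (score, path) in layer.items():
            for g, ll in enumerate(row):
                key = counts[:g] + (counts[g] + 1,) + counts[g + 1:]
                cand = (score + ll, path + (g,))
                if key not in nxt or cand[0] > nxt[key][0]:
                    nxt[key] = cand
        layer = nxt
    # The count-dependent term is identical for all paths ending at a node,
    # so it is added once per node in the bottom layer.
    best = max(layer, key=lambda c: layer[c][0] + count_score(c))
    score, path = layer[best]
    return score + count_score(best), path
```

With g genotype classes, layer k holds at most C(k + g - 1, g - 1) nodes (the number of count vectors summing to k), compared with g^k nodes at the same depth of the naive tree.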
Efficient Exact Inference.
There are a number of reasons that the naive and dynamic programming branch and bound methods are inefficient. First, the number of nodes visited in these trees may be as much as and , both of which are exponential in . This number of nodes determines the time and (if implemented in a manner that emphasizes runtime efficiency) the space required. Secondly, the suffix path is unconstrained; given , there is no restriction on , and so the bound must use the maximum likelihood for the remaining likelihood. Most importantly, the bound in equation 4 is very conservative; in order to bound a subtree with prefix , the overall likelihood of all subsequent trees must be less than the product of the overall likelihood and multinomial multiplier for a full configuration . Because even the largest multinomial probability is usually very small, the bound is extremely conservative. It is not feasible to use either the naive or dynamic programming branch and bound methods on the presented data.
For these reasons, we introduce a novel geometric branch and bound method; this method has several advantages. First, when the number of individuals is substantially larger than the ploidy (), the worst-case tree produced by our method is several orders of magnitude smaller ( rather than ). Secondly, our geometric method allows us to substantially constrain valid suffix configurations. Lastly, our method makes use of the multinomial probability in the bound; this multinomial probability is very influential in selecting the optimum (especially when the optimal is not very close to zero). Our geometric method has these advantages because it exploits a geometric property that MAP configurations must exhibit. By searching only configurations with this property, our method dramatically reduces the possible search space.
To present our branch and bound method, we first rephrase the problem in a geometric context and then derive a geometric property of optimal configurations (Figure 4). In the likelihood , both the data and the theoretical genotypes are normalized so that and . This likelihood is therefore equivalent to . This normalization effectively places the points along the line . For all and , define the operator to order them using their normalized values along the line (the direction of the ordering is arbitrary). Similarly, for all and define the distance to operate on normalized values of the points on this line. It should be noted that other methods of normalization (e.g., normalizing on a unit circle) will also enable ordering the points in this way and are compatible with this method.
Figure 4. Illustration of a Suboptimal Genotype Configuration.
The essential motivation behind the geometric branch and bound is demonstrated. The top figure shows the original data and the bottom figure shows the data after being normalized to and within the likelihood function . The two figures on the left correspond to a suboptimal genotype configuration. In the figures on the right, a pair of “blue” and “red” points (highlighted) are switched to the opposite class. After swapping the categories, the numbers of individuals with each genotype do not change, but the total distance between these two points and their classes decreases. Decreasing this distance increases the likelihood while holding constant. Thus the joint probability . Because the MAP configuration cannot be improved by any such swaps, it must correspond to contiguous groups of class assignments along the normalized axis. Searching only the configurations that result in contiguous class assignments dramatically narrows the search space and makes inference computationally feasible where it would not be with the dynamic programming branch and bound method.
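The contiguity property can be illustrated on a one-dimensional toy example. The sketch below is an assumption-laden illustration rather than the method itself: points are taken to be already normalized to a single axis, `centers` are hypothetical class positions sorted in increasing order, and squared distance is used as a stand-in for the negative log-likelihood.

```python
def contiguous_assignment(points, counts):
    """Assign the counts[0] smallest points to class 0, the next
    counts[1] points to class 1, and so on (sum(counts) == len(points))."""
    order = sorted(range(len(points)), key=lambda i: points[i])
    labels = [None] * len(points)
    pos = 0
    for g, c in enumerate(counts):
        for i in order[pos:pos + c]:
            labels[i] = g
        pos += c
    return labels

def total_sq_distance(points, labels, centers):
    """Sum of squared distances from each point to its class center."""
    return sum((p - centers[g]) ** 2 for p, g in zip(points, labels))

# Illustrative normalized positions and class centers.
points = [0.9, 0.1, 0.55, 0.45, 0.05, 0.8]
centers = [0.0, 0.5, 1.0]
labels = contiguous_assignment(points, (2, 2, 2))
print(labels, total_sq_distance(points, labels, centers))
```

For fixed genotype counts there is exactly one contiguous labeling, so a search restricted to contiguous configurations only needs to consider one candidate per count vector.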
Fix the genotype distribution . In the joint probability