Sorenson Genomics

Research & Development

Estimating Genetic Ancestry Using a 5-Population Model

M. Bauchet, J.J. Bryan, A.B. Carter, V.L. Vance, H. Chen, C.L. Mouritsen

Estimating the genetic ancestry of an individual has many applications. Industries whose products may have variable effects depending on an individual’s genetic ancestry (pharmaceutics, personal care products, etc.) can benefit greatly from the ability to qualitatively stratify study participants. In forensics, estimating the genetic heritage of DNA evidence collected at a crime scene can save precious time and money when little or no other information is available. It also informs professional genealogists and their customers, since genetic ancestry is an important clue to one’s ethno-geographic background.

We have designed a novel method of estimating human genetic ancestry against a model of 5 putative parental populations. The populations are represented by the following reference samples: Western European (HapMap¹ CEU, Northwest European descent residing in Utah), West Sub-Saharan African (HapMap YRI, Yoruba from Ibadan, Nigeria), East Asian (HapMap CHB from Beijing, China), Indigenous American (HGDP-CEPH², indigenous to North, Central, and South America, including Maya, Pima, Karitiana, Surui, and Arawak descent), and the Indian Subcontinent (HapMap GIH, Gujarati Indian descent residing in Houston, TX).

The Sorenson World-Wide Ancestry™ Test uses 190 SNP Ancestry Informative Markers (AIMs) chosen for their scored ability to differentiate specifically between the 5 reference populations, using Principal Component Analysis (PCA) as the comparative analysis tool, and includes some markers identified as informative in previous genetic ancestry estimation publications. Using the program frappe³ and uniquely designed algorithms, the method compares an unknown individual sample to at least a hundred randomly selected subsets of individuals from the reference populations. Background interference is calculated simultaneously and used to estimate confidence ranges, based on a calibration performed on thousands of worldwide individuals. We also evaluated hundreds of individuals of known origin different from any of the reference populations, giving us an indication of ancestry profiles for people who do not exactly match our model. We show that our method and test offer a comprehensive and robust estimate of an individual’s genetic ancestry.

An Admixture Simulation Program for Validating Genetic Ancestry Estimation Systems

J.J. Bryan, V.L. Vance, M. Bauchet, C.L. Mouritsen

A key area of interest in studying population genetics is detection and estimation of genetic admixture in offspring from parents with mixed ancestry.  Current methods used to estimate admixture vary in the type and number of genetic markers analyzed as well as the statistical algorithms used in the calculations.  The goal of each method is to measure the genetic affinity of an individual to representative populations, generally reported in terms of relative percentages.  It is important to determine the ability of the method to accurately estimate levels of population admixture through controlled experiments, testing as many permutations as possible.  

Identifying and testing subjects with known levels of admixture for specific populations can be very costly and time consuming. In practice, finding and testing samples that represent all possible admixture ratios, even between just a few populations, can be extremely difficult. To address this problem, we have designed a program that generates and controls targeted levels of admixture between the population reference samples in order to validate systems for estimating genetic ancestry. The simulation algorithm can randomly generate innumerable offspring, each with a unique genotype, from virtual unions of up to 8 parental (P1) individuals. Because each of the 8 P1 individuals contributes one eighth of the expected ancestry of a 3rd filial (F3) generation individual, the program allows admixture contributions from parental populations to be controlled in increments of 12.5%.

P1 samples are randomly chosen and paired from an established parental genotype database to fulfill the desired admixture ratios for a final F3 generation individual. At each union in a generation, 1 allele for every bi-allelic marker in each individual is randomly selected to contribute to the next generation. Using this new program, 10,000 admixed simulations were generated from 383 real samples used in the P1 generation. The P1 samples were selected from 5 putative parental populations and were combined in a variety of combinations and ratios, in increments as low as 12.5%, with multiple unique offspring generated from each simulated mating. Reproducibility was tested using different P1 samples for each admixture combination and ratio. This simulation program was developed to facilitate the validation of a 190-SNP human genetic ancestry estimation algorithm. The program allowed for well-controlled experimentation and significant savings in time and expense.
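The pairing scheme described above can be illustrated with a minimal Python sketch. It is not the authors' program: the function names, the genotype representation (one allele pair per marker), and the toy reference data are all assumptions made for illustration. Eight P1 genotypes are reduced to a single F3 genotype through three rounds of unions, with one allele per marker drawn at random from each parent at every union.

```python
import random

# Assumed representation: a genotype is a list of (allele, allele) tuples,
# one tuple per bi-allelic marker.

def gamete(individual, rng):
    """Randomly select one allele at every marker of an individual."""
    return [rng.choice(pair) for pair in individual]

def mate(parent_a, parent_b, rng):
    """Form one offspring genotype from one gamete of each parent."""
    return list(zip(gamete(parent_a, rng), gamete(parent_b, rng)))

def simulate_f3(p1_individuals, rng):
    """Reduce 8 P1 genotypes to one F3 genotype (8 -> 4 -> 2 -> 1)."""
    if len(p1_individuals) != 8:
        raise ValueError("exactly 8 P1 individuals are required")
    generation = list(p1_individuals)
    while len(generation) > 1:
        generation = [
            mate(generation[i], generation[i + 1], rng)
            for i in range(0, len(generation), 2)
        ]
    return generation[0]

def draw_p1(reference, counts, rng):
    """Draw P1 individuals so each population fills `counts[pop]` of the
    8 founder slots; each slot corresponds to 12.5% expected ancestry."""
    chosen = []
    for population, n in counts.items():
        chosen.extend(rng.sample(reference[population], n))
    rng.shuffle(chosen)  # random pairing of founders
    return chosen

# Toy reference data (hypothetical): two populations, 10 individuals each,
# 5 markers, fixed for different alleles so ancestry is easy to inspect.
reference = {
    "PopA": [[("A", "A")] * 5 for _ in range(10)],
    "PopB": [[("G", "G")] * 5 for _ in range(10)],
}
rng = random.Random(1)
# 6 founders from PopA and 2 from PopB: a 75% / 25% expected admixture.
founders = draw_p1(reference, {"PopA": 6, "PopB": 2}, rng)
f3_genotype = simulate_f3(founders, rng)
```

The ratios are expected values only: because alleles are drawn at random at every union, the realized ancestry of any single simulated F3 individual varies around the target, which is why many offspring per combination are useful.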

The Development of Interactive Maps to Further Describe Y and mtDNA Haplogroups - A New Educational Tool

H. Chen, C.L. Mouritsen, A.B. Carter

Background: Mitochondrial DNA and Y chromosome haplogroup maps can provide visual depictions of the geographical distributions of known haplogroups. Fourteen peer-reviewed publications were reviewed and a database was developed of 109 different populations, each evaluated for its proportional makeup from any of 18 primary Y haplogroups. Latitude and longitude coordinates for each population were determined from the original location of the samples used in each publication. In the database, haplogroup percentages were assigned to the relevant global geographic coordinates using the Boundary Map feature of Mapview software. These coordinates allowed the haplogroups to be plotted on maps with their geographical associations. Ultimately, 540 data points were used to create 26 Y haplogroup maps (18 world gradient maps, 7 continental pie maps and 1 world pie map). Using the same method, a database of 96 different mtDNA populations was constructed from 25 peer-reviewed articles, each population being evaluated for 27 primary mtDNA haplogroups. This resulted in 790 data points used in the creation of 34 mtDNA haplogroup maps (27 world gradient maps, 6 sub-continental pie maps and 1 world pie map).

Validation: Data within the newly constructed database were verified against the haplogroup specifications defined by the YCC (Y Chromosome Consortium) for Y haplogroups and by PhyloTree for mtDNA haplogroups.

Map descriptions: a) Gradient maps were developed to exhibit the approximate geographical distribution of a specific haplogroup using representative color intensity; b) Continental/sub-continental pie maps were created to show the percent makeup of each haplogroup in association with other haplogroups for a specific region; c) World pie maps were created to demonstrate the percent makeup of each haplogroup in association with other haplogroups for the entire world.  

Features: Special features were designed to allow viewers to navigate the maps without returning to the home map. 1) Clicking map graphics opens the corresponding gradient maps; 2) Clicking the title or the space between letters opens the world pie map; 3) When viewing the world pie map, users can select continental/sub-continental areas to view the continental/sub-continental pie maps.

Conclusion: This educational tool provides 60 data-dependent maps of Y and mtDNA haplogroups with convenient navigation features. 

hYperplex - a Y-STR Multiplex Genotyping Panel with Hyper-discriminatory Capacity

H. Chen, M. Szczepanski, C.L. Mouritsen

Y chromosome Short Tandem Repeat (Y-STR) testing has been broadly exploited in forensic casework for profiling male DNA, especially in the presence of excessive amounts of female DNA or mixtures of more than one male biological sample, as in some sexual assault cases. The discriminatory capacity of commercially available Y-STR kits is limited by the moderate diversity values of the loci used in the kits.

Estimating Genetic Ancestry Using the Investigative LEAD™ (Law Enforcement Ancestry DNA) Test

J. Bryan, M. Bauchet, V. Vance, D. Hellwig, C.L. Mouritsen

The use of the many worldwide DNA databases is essential in modern criminal investigations. Unfortunately, when an evidentiary DNA profile does not provide a viable suspect after a database search, the investigator may be left with little forensic direction. To assist in these critical situations, Sorenson Forensics introduces Investigative LEAD™, a single nucleotide polymorphism (SNP) based DNA test designed to estimate genetic ancestry against a model of 5 genetically distinct, putative parental populations. The populations and the reference samples representing them are as follows: Western European (HapMap CEU, Northwest European descent residing in Utah), West Sub-Saharan African (HapMap YRI, Yoruba from Ibadan, Nigeria), East Asian (HapMap CHB from Beijing, China), Indigenous American (compilation of samples identified as being from populations indigenous to North, Central, and South America, including Maya, Pima, Karitiana, Surui, and Arawak descent), and the Indian Subcontinent (HapMap GIH, Gujarati Indian descent residing in Houston, TX). Our method uses 190 SNP Ancestry Informative Markers (AIMs) chosen for their scored ability to differentiate specifically between the 5 reference populations, using Principal Component Analysis (PCA) as the comparative analysis tool, and includes some markers identified as informative in previous genetic ancestry estimation publications. Using the program FRAPPE and uniquely designed algorithms, the method compares an unknown individual sample to at least a hundred randomly selected subsets of individuals from the reference populations. Background interference is calculated simultaneously and is used to estimate confidence intervals, based on a calibration performed on thousands of worldwide individuals.
Validation data have shown that the Investigative LEAD™ test is a viable, robust and adequately sensitive test, capable of functioning on a variety of forensic samples and DNA extract types. We believe this test will provide law enforcement investigators with valuable information regarding the genetic ancestry of potential suspects. This test can be of great benefit in solving cold cases and other criminal investigations.

An Automated Method for Deriving Mitochondrial DNA (mtDNA) Haplogroups Based on Changes within the Hypervariable Regions

V.L. Vance, J.J. Bryan, M.R. Szczepanski, A. Carter, C.L. Mouritsen

In 2009, van Oven and Kayser described a comprehensive phylogenetic tree of human mtDNA variation, which has been made accessible at www.Phylotree.org. This phylogenetic tree is based on both coding and control (hypervariable) regions. Van Oven and Kayser identified the variants from the revised Cambridge Reference Sequence (rCRS) that define an individual’s haplotype and corresponding mtDNA haplogroup. A new computer-based method has been developed for assigning mtDNA haplogroups using the variants and haplogroup nomenclature described by van Oven and Kayser. This new method makes use of Structured Query Language (SQL) and a mathematical algorithm that allows for the reliable determination of one’s haplogroup based solely on mtDNA sequence from the Hypervariable Regions (HVR). The SQL-based algorithm combines a database search process with a method that walks stepwise through the phylogenetic tree, which is rooted with rCRS at the first position. Using a novel scoring method that accounts for the number and stability of the markers that define each haplogroup, an individual’s HVR differences from rCRS are compared with the haplogroup designations defined in the mtDNA Phylotree. The algorithm has a high degree of reliability even when potential “back-mutations” and/or recent mutations are observed at key haplogroup-defining positions, in which case a haplogroup is assigned based on likelihood and match criteria thresholds defined within the algorithm. In instances of ambiguous calls, the algorithm can select the nearest parental haplogroup in the tree. This new method was validated by comparing the haplogroups assigned by our method to the haplogroups assigned by van Oven and Kayser for samples with mtDNA haplotypes published in Phylotree. This comparison showed the concordance of our method to be greater than 95%.
Our system can accurately and quickly estimate over 800 different mtDNA haplogroups across the mtDNA tree using only rCRS differences within Hypervariable Region 1, over 1000 haplogroups using Hypervariable Regions 1 and 2, and nearly 1100 haplogroups using Hypervariable Regions 1, 2 and 3. Since Phylotree is continually being updated as new data are published, we have incorporated a parsing tool that allows the program to be updated as the science progresses.
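The general idea of a stepwise, scored walk through a haplogroup tree can be sketched as follows. This is only an illustration under stated assumptions, not the SQL-based algorithm described above: it is written in Python with an in-memory tree, the scoring rule (stability-weighted fraction of defining markers observed), the threshold value, and every haplogroup and marker name (`Hg1`, `m1`, and so on) are hypothetical placeholders rather than Phylotree data.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One haplogroup in a toy tree. `defining` maps each defining
    marker to an assumed stability weight (higher = more stable)."""
    name: str
    defining: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

def match_score(node, observed):
    """Stability-weighted fraction of a node's defining markers that
    appear among the observed differences from the reference."""
    total = sum(node.defining.values())
    if total == 0:
        return 0.0
    hit = sum(w for marker, w in node.defining.items() if marker in observed)
    return hit / total

def assign_haplogroup(root, observed, threshold=0.6):
    """Walk down from the root, moving to the best-scoring child while
    it clears the threshold; on a weak or tied call, stop and report
    the nearest parental haplogroup reached so far."""
    current = root
    while current.children:
        scored = [(match_score(child, observed), child)
                  for child in current.children]
        best_score = max(score for score, _ in scored)
        best = [child for score, child in scored if score == best_score]
        if best_score < threshold or len(best) > 1:
            break
        current = best[0]
    return current.name

# Hypothetical toy tree; none of these names are real haplogroups.
tree = Node("root", children=[
    Node("Hg1", {"m1": 1.0, "m2": 0.5}, children=[
        Node("Hg1a", {"m4": 1.0}),
        Node("Hg1b", {"m5": 1.0, "m6": 1.0}),
    ]),
    Node("Hg2", {"m3": 1.0}),
])

# m2 (a low-stability marker) is missing, as after a back-mutation,
# yet the weighted score for Hg1 still clears the threshold.
call = assign_haplogroup(tree, {"m1", "m4"})
```

Weighting by stability means that losing an unstable marker costs less than losing a stable one, which is one plausible way to tolerate back-mutations; the actual likelihood and match criteria used by the authors are not given in the abstract.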

A Simulation Model of DNA Template Quality, Used in Validating a Genetic Ancestry Estimation System for Forensic Applications

J. Bryan, M. Bauchet, V. Vance, D. Hellwig, C.L. Mouritsen

Sorenson Forensics recently released a genetic ancestry estimation test known as Investigative LEAD℠ (Law Enforcement Ancestry DNA). This new test provides a means for law enforcement agencies to identify the genetic ancestry of suspects and/or victims. Software systems used to estimate genetic ancestry may vary in the type and number of genetic markers observed, generally Single Nucleotide Polymorphisms (SNPs), and in the statistical algorithms used. The common premise of these systems, however, is to measure the genetic affinity of an individual in relation to representative parental populations. The result is generally reported in terms of relative percentages for each population. Laboratory test systems used to generate SNP data may vary by chemistry and analytical method, which can play a role in the number, quality, and accuracy of the genotypes used to derive ancestry estimations for a given DNA sample. Forensic-type samples often have low DNA copy numbers due to minimal sample amount or degradation, and they frequently contain inhibitors. These factors can lead to lower recovery rates of targeted loci and to allele dropout or other stochastic effects. When DNA quality or quantity is compromised, it is important to determine whether accurate genetic ancestry estimations can still be made. Sorenson Forensics’ I-LEAD test makes use of the Applied Biosystems TaqMan OpenArray technology to genotype 192 autosomal SNPs. A proprietary algorithm developed at Sorenson is used to create the estimations of ancestry. An ancillary software program (Genotype Degrader) was created to assist in the validation of the algorithm used for the I-LEAD test system. Initially, full genotype profiles for 190 SNPs were generated from DNA samples with known genetic ancestry. Settings were selected within the Genotype Degrader simulation program to randomly introduce specific amounts of locus dropout (up to 50%) and stochastic effect (up to 30%) into each given genotype set.
The simulation for each parent genotype profile can be run as many times as desired, creating innumerable unique combinations of the genotype set at a desired ‘quality-level’. The ancestry estimation of the “degraded” genotypes can then be compared to that of the parent genotype to assess the impact random degradation would have on ancestry estimations. The Genotype Degrader software tool allowed for well-controlled experimentation to demonstrate the accuracy of the I-LEAD test when locus dropout and stochastic effects are observed in genotype data. These simulations greatly reduced the time and expense of this type of study compared with actual sample testing in the forensic laboratory.
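A degradation step of this kind can be sketched in a few lines of Python. This is not the Genotype Degrader program itself: the function name, the data layout, and in particular the way the stochastic effect is modelled (one randomly chosen allele is lost, so the locus reads as homozygous for the other) are assumptions made for illustration.

```python
import random

def degrade(genotypes, dropout_rate, stochastic_rate, rng):
    """Return a degraded copy of a genotype profile.

    genotypes: dict mapping locus name -> (allele, allele).
    dropout_rate: fraction of loci reported as missing (None).
    stochastic_rate: fraction of the surviving loci at which one
        allele is lost, so the call becomes homozygous (assumed model).
    """
    if not (0 <= dropout_rate <= 1 and 0 <= stochastic_rate <= 1):
        raise ValueError("rates must lie between 0 and 1")
    loci = list(genotypes)
    dropped = set(rng.sample(loci, round(dropout_rate * len(loci))))
    survivors = [locus for locus in loci if locus not in dropped]
    affected = set(rng.sample(survivors,
                              round(stochastic_rate * len(survivors))))
    degraded = {}
    for locus in loci:
        if locus in dropped:
            degraded[locus] = None            # locus dropout: no call
        elif locus in affected:
            kept = rng.choice(genotypes[locus])
            degraded[locus] = (kept, kept)    # one allele lost
        else:
            degraded[locus] = genotypes[locus]
    return degraded

# Hypothetical 190-locus profile, heterozygous at every locus.
profile = {f"snp{i}": ("A", "G") for i in range(190)}
rng = random.Random(7)
# Several independent degradations of the same parent profile,
# each at 50% locus dropout and 20% stochastic effect.
replicates = [degrade(profile, 0.5, 0.2, rng) for _ in range(3)]
```

Each replicate would then be passed to the ancestry estimator and its result compared with the estimate from the undegraded parent profile; that comparison step depends on the estimation software and is not shown.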