dmalmer/ancestor_hmm

Ancestor inference of inbred populations with missing founder sequences

A method for inferring ancestral haplotypes that allows for genotype data to be absent for zero, one, or more ancestors

Introduction

Here we present a method for inferring haplotype ancestry in descendant strains bred from multi-parent populations. Through recombination, the descendant strains of multi-parent crosses are a mosaic of haplotypes inherited from their inbred ancestors. Using the hidden Markov model described below, this program takes single-nucleotide polymorphism (SNP) data from the descendant and ancestor strains and infers the most likely ancestral origin of each segment of the descendant's genome.

Unique to this program, SNP information from one or more ancestors may be missing from the input data. Additionally, the program does not require knowledge of the breeding structure or recombination rates of organisms being studied. However, recombination rates and the effective number of generations used during breeding may be supplied to better inform the model.

This work was originally created for studying the genomes of Inbred Long-Sleep (ILS) and Inbred Short-Sleep (ISS) mice, where two out of the eight ancestor strains were unsequenced. The work is published in Dowell et al. "Genome Characterization of the Selected Long and Short Sleep Mouse Lines" Mammalian Genome (2016).

Usage

Input file

The ancestor inference program takes as input a single BED format file containing chromosomal locations of homozygous single-nucleotide polymorphisms (SNPs) in each genotyped ancestor strain as well as SNPs in the descendant strain. SNPs shared by multiple ancestors and/or the descendant are separated with an underscore in the fourth column of the BED file. For example, the beginning of the ISS input file looks like:

chr1    3001278 3001279 C3HHe
chr1    3001770 3001771 C3HHe
chr1    3007280 3007281 ISS
chr1    3007334 3007335 ISS_AKR_DBA2
chr1    3007757 3007758 ISS_AKR_DBA2

That is, C3HHe contains SNPs at the first and second positions that are not shared by any other ancestor or by the ISS strain; ISS contains a SNP at the third position that is not shared by any of the ancestor strains; and ISS, AKR, and DBA2 all contain SNPs at the fourth and fifth positions.
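The input format above can be read with a few lines of Python. This is a minimal sketch and not the program's own parser: the `Snp` record and the `parse_snp_line` helper are hypothetical names, and the sketch assumes that strain names themselves never contain an underscore.

```python
from collections import namedtuple

# Hypothetical record type for one input line.
Snp = namedtuple("Snp", ["chrom", "start", "end", "strains"])

def parse_snp_line(line):
    """Parse one whitespace-delimited input line into a Snp record."""
    chrom, start, end, label = line.split()
    # Strains sharing the SNP are joined by underscores in column four.
    return Snp(chrom, int(start), int(end), frozenset(label.split("_")))

example = """\
chr1    3001278 3001279 C3HHe
chr1    3007280 3007281 ISS
chr1    3007334 3007335 ISS_AKR_DBA2
"""

snps = [parse_snp_line(line) for line in example.splitlines() if line.strip()]
descendant = "ISS"
for snp in snps:
    ancestors_with_snp = sorted(snp.strains - {descendant})
    print(snp.chrom, snp.start, descendant in snp.strains, ancestors_with_snp)
```

Splitting the fourth column into a set makes the later question "does ancestor X share this SNP with the descendant?" a simple membership test.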

Output file

The program outputs another BED file containing regions classified as being inherited from an ancestor. For example, running the program on the ISS data might output:

chr1	3001278	4610335	DBA2
chr1	4610502	8133696	AKR
chr1	8133749	8148378	Unk
chr1	8148457	11598300	AKR
chr1	11598590	11599319	Unk
chr1	11599350	11834039	A_AKR_BALBc

Regions classified as being inherited from an unsequenced ancestor are labeled as "Unk". Regions where multiple ancestors are identical by descent have each ancestor labeled, separated by an underscore.
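As an illustration of how per-SNP ancestor calls can become the regions shown above, the sketch below merges runs of consecutive SNPs that carry the same label. It is an assumption, not a description of the program's code, that a region runs from the start of its first SNP to the end of its last SNP; that convention would at least be consistent with the gaps between neighbouring regions in the sample output. The function name and toy coordinates are hypothetical.

```python
from itertools import groupby

def merge_into_blocks(snps, labels):
    """Merge per-SNP labels into regions.

    snps:   list of (chrom, start, end) tuples, sorted by position
    labels: one ancestor label per SNP, in the same order
    """
    blocks = []
    paired = zip(snps, labels)
    # Consecutive SNPs on the same chromosome with the same label form a block.
    for (chrom, label), run in groupby(paired, key=lambda x: (x[0][0], x[1])):
        run = list(run)
        first_snp, last_snp = run[0][0], run[-1][0]
        # ASSUMPTION: block = first SNP start .. last SNP end.
        blocks.append((chrom, first_snp[1], last_snp[2], label))
    return blocks

# Toy data, not taken from the ISS results.
toy_snps = [("chr1", 100, 101), ("chr1", 250, 251), ("chr1", 900, 901),
            ("chr1", 1500, 1501), ("chr2", 50, 51)]
toy_labels = ["DBA2", "DBA2", "AKR", "AKR", "AKR"]

blocks = merge_into_blocks(toy_snps, toy_labels)
for block in blocks:
    print("\t".join(str(field) for field in block))
```

Grouping on the pair (chromosome, label) keeps a block from running across a chromosome boundary even when the label does not change.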

Command line usage

usage: ancestor_inference.py [-h] -i INPUT_FILE -d DESC_STRAIN [-o OUTPUT_DIR]
                             [-t TRANS_IN_P] [-e EMIT_SAME_P] [-m MAX_ITER]
                             [-c PROB_DIST_CUTOFF] [-p] [-r RECOMB_RATES_FILE]
                             [-a ADJUST_RECOMB] [-u USE_UNKNOWN]
                             [-k UNK_CUTOFF] [-ep EFFECTIVE_POP]
                             [-ng NUM_GENERATIONS] [-si SV_INSERTIONS_FILE]
                             [-sd SV_DELETIONS_FILE] [-gs GRID_SIZE] [-ad]
                             [-ap] [-w] [-v]

arguments:
  -h, --help            show this help message and exit
  -i INPUT_FILE, --input-file INPUT_FILE
                        Input SNP data file (BED file format) (default: None)
  -d DESC_STRAIN, --desc-strain DESC_STRAIN
                        Name of descendant in input SNP file (default: None)
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Directory for output files (if not specified, output
                        directory will default to the same directory as the
                        input file) (default: None)
  -t TRANS_IN_P, --trans-in-p TRANS_IN_P
                        Starting trans-in probability (can be a value or range
                        of values in the form "[x-y]", which is divided into
                        parts with the -gs flag) (default: 0.64)
  -e EMIT_SAME_P, --emit-same-p EMIT_SAME_P
                        Starting emit-same probability (can be a value or
                        range of values in the form "[x-y]", which is divided
                        into parts with the -gs flag) (default: 0.99)
  -m MAX_ITER, --max-iter MAX_ITER
                        Maximum number of EM iterations (default: 50)
  -c PROB_DIST_CUTOFF, --prob-dist-cutoff PROB_DIST_CUTOFF
                        Probability distance cutoff to end EM loop (default:
                        0.001)
  -p, --parallel        Run viterbi algorithm over each chromosome in parallel
                        (default: False)
  -r RECOMB_RATES_FILE, --recomb-rates-file RECOMB_RATES_FILE
                        Input file with recombination rates to be used as
                        priors for transition probabilities (default: None)
  -a ADJUST_RECOMB, --adjust-recomb ADJUST_RECOMB
                        Multiplier to adjust expected number of recombinations
                        (can be a value or range of values in the form
                        "[x-y]", which is divided into parts with the -gs
                        flag) (default: 1.0)
  -u USE_UNKNOWN, --use-unknown USE_UNKNOWN
                        Set to true to capture ungenotyped ancestors in an
                        Unknown state (default: False)
  -k UNK_CUTOFF, --unk-cutoff UNK_CUTOFF
                        Cutoff for fraction of Unk SNPs required for an
                        ancestor block to be relabeled as Unk (can be a value
                        or range of values in the form "[x-y]", which is
                        divided into parts with the -gs flag) (default: 1.0)
  -ep EFFECTIVE_POP, --effective-pop EFFECTIVE_POP
                        Effective population (N_e) used in recombination rate
                        calculations (default: 1)
  -ng NUM_GENERATIONS, --num-generations NUM_GENERATIONS
                        Estimated number of generations between ancestors and
                        descendant used in recombination rate calculations
                        (default: 1)
  -si SV_INSERTIONS_FILE, --sv-insertions-file SV_INSERTIONS_FILE
                        Input file for insertion structural variants used to
                        score HMM results (default: None)
  -sd SV_DELETIONS_FILE, --sv-deletions-file SV_DELETIONS_FILE
                        Input file for deletion structural variants used to
                        score HMM results (default: None)
  -gs GRID_SIZE, --grid-size GRID_SIZE
                        Number of items to divide a range of input values into
                        (default: 2)
  -ad, --append-date    Append date to output filename (default: False)
  -ap, --append-params  Append string to output filename based on the input
                        parameters (default: False)
  -w, --write-iter      Calculate scores and write to output file at each
                        iteration within the EM loop (default: False)
  -v, --verbose         Verbose (default: False)
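
As a usage illustration, the sketch below assembles one possible invocation from the flags documented above and prints it as a shell command. The file and directory names are hypothetical, and it is an assumption that `-u` accepts the literal string `true`; the help text only says "Set to true".

```python
import shlex

# Flags are taken from the help text above; paths are hypothetical.
argv = [
    "python", "ancestor_inference.py",
    "-i", "data/ISS_snps.bed",   # hypothetical input SNP file
    "-d", "ISS",                 # descendant name as written in column four
    "-o", "results",             # hypothetical output directory
    "-u", "true",                # ASSUMPTION: "true" is an accepted spelling
    "-t", "[0.5-0.9]",           # range of starting trans-in probabilities
    "-gs", "5",                  # number of items the range is divided into
    "-p",                        # run Viterbi over chromosomes in parallel
]

command = shlex.join(argv)
print(command)
```

Passing the same list to `subprocess.run(argv, check=True)` would launch the program without going through a shell, which avoids having to quote the square brackets in the range argument.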

Model

Overview

In multi-parental populations, descendant strains inherit distinct haplotype blocks from each ancestor strain. The haploblocks, therefore, contain identifying genotypic markers present in the originating ancestor, with slight modification due to de novo mutations. Intuitively, segments of the descendant's genome whose single-nucleotide polymorphisms (SNPs) are consistently shared with a single ancestor were likely inherited from that ancestor, and boundaries between haploblocks represent historical recombination events (Figure 1a). To probabilistically infer the most likely boundaries and ancestral origin of every such segment in a descendant's genome, we developed the hidden Markov model (HMM) defined here. For our HMM training set, we use all SNP positions in the descendant genome and in every ancestor genome where SNP data is available. Segments likely to have come from an unsequenced ancestor (for which there is no SNP data) are classified as "Unknown". As such, our fully-connected HMM consists of a state for each genotyped ancestor and an Unknown state if the descendant was derived from additional unsequenced ancestors (Figure 1b).

Figure 1: Model overview. a) The ancestor assignments of haploblocks are based on the consistency with which the descendant strain shares SNPs with ancestor strains in a particular region. In the cartoon example shown, A1, A2, and A3 ancestor SNPs are colored if they share the SNP location with a Desc SNP and are black if they do not. The inferred origin of each region is output by our model, where gaps between inferred haploblocks (black arrows) indicate regions where a recombination event took place during breeding. b) State diagram of the hidden Markov model used to infer ancestry. Each state A1, A2, ..., AN represents a sequenced ancestor and the Unk state captures any unsequenced ancestor strains. All possible transitions from state A1 are highlighted. Transitions between distinct states correspond to recombination events during breeding, while transitions to the same state indicate adjacent SNPs belong to the same ancestral haploblock.

As input, the model takes homozygous SNP data from each ancestor and the descendant strain in the form of a BED format file. Each SNP position corresponds to a single observation in the HMM. Initial emission rates are based on whether the ancestor SNPs at a given position are consistent with the descendant SNP. For example, if a given position on the genome contains a SNP from ancestor S1 and a SNP from the descendant, state S1 is given a high emission rate for that position. If a given position contains a SNP from the descendant but not from ancestor S1, or a SNP from ancestor S1 but not from the descendant, state S1 is given a low emission rate for that position. Because inconsistencies can arise from sequencing errors and de novo mutations, ancestor states still have emission rates greater than 0.0 at positions where the ancestor SNP data is inconsistent with the descendant SNP data. Transitions between distinct states correspond to recombination events during breeding, while transitions to the same state indicate adjacent SNPs belong to the same ancestral haploblock. As adjacent SNPs are likely to be from the same inherited region, we set the initial probability of remaining in the same state high and the initial probability of moving to a different state low.

Formal description

More formally, we define our HMM as follows. We denote each position in the genome where any ancestor or the descendant strain contains a SNP as $p = 1, 2, \ldots$ and the state at position $p$ as $q_p$, where $q_p \in S$, $S = \{S_1, S_2, \ldots, S_N\}$, and $S_1, S_2, \ldots, S_N$ are states representing the ancestor from which position $p$ in the descendant genome originated. Note that one of the states in $S$ can be the "Unknown" state ($S_U$), which represents a position in the descendant genome originating from an ancestor for which we have no genotype information.

At a given position $p$, we observe SNPs in one or more of the ancestors and/or in the descendant. We denote the possible SNPs as $K = \{k_1, k_2, \ldots, k_M, k_D\}$, where $k_D$ corresponds to a SNP in the descendant genome and $k_1, k_2, \ldots, k_M$ correspond to SNPs in the ancestor genomes. Notably, $k_1, k_2, \ldots, k_M$ all have corresponding states in $S$, but $S$ can contain the Unknown state, $S_U$, which is not represented in $K$. The set of SNPs observed at position $p$ is $O_p$, where $O_p \subseteq K$.

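Each state has two observation symbols, $V = \{v_C, v_I\}$, corresponding to whether the descendant SNP data is consistent ($v_C$) or inconsistent ($v_I$) with the state SNP data at a given position. For a known ancestor state $S_i \neq S_U$, the symbol $V_{i,p}$ at position $p$ follows the cases given in the overview: it is $v_C$ when both $k_i$ and $k_D$ are in $O_p$, and $v_I$ when exactly one of them is.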

For the Unknown state $S_U$, a SNP in the descendant genome is considered consistent only if none of the known ancestors contains a SNP at that position. We note that this creates a bias towards under-calling the Unknown state, which is addressed via an input parameter to the model. For the Unknown ancestor state $S_i = S_U$, $V_{i,p}$ at position $p$ is defined as:

$$
V_{U,p} =
\begin{cases}
v_C & \text{if } k_D \in O_p \text{ and } k_j \notin O_p \text{ for all } j = 1, \ldots, M \\
v_I & \text{otherwise}
\end{cases}
$$
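The consistency rules for known and Unknown states can be sketched as a small function. This is an illustrative sketch, not the program's code: the function name and the `"C"`/`"I"` encoding are hypothetical, and treating a position where neither the ancestor nor the descendant has a SNP as consistent is an assumption, since the text only spells out the other three cases.

```python
def observation_symbols(observed, ancestors, descendant, use_unknown=True):
    """Return {state: "C" or "I"} for one SNP position.

    observed: set of strain names that have a SNP at this position (O_p)
    """
    desc_has_snp = descendant in observed
    symbols = {}
    for anc in ancestors:
        # Consistent when ancestor and descendant agree on SNP presence.
        # ASSUMPTION: "absent in both" is counted as consistent.
        symbols[anc] = "C" if (anc in observed) == desc_has_snp else "I"
    if use_unknown:
        any_ancestor = any(anc in observed for anc in ancestors)
        # Unknown is consistent only for descendant-only SNPs.
        symbols["Unk"] = "C" if desc_has_snp and not any_ancestor else "I"
    return symbols

ancestors = ["AKR", "DBA2", "C3HHe"]
shared = observation_symbols({"ISS", "AKR", "DBA2"}, ancestors, "ISS")
desc_only = observation_symbols({"ISS"}, ancestors, "ISS")
print(shared)
print(desc_only)
```

A descendant-only SNP is the only kind of position that supports the Unknown state, which is the source of the under-calling bias mentioned above.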

The state transition probability distribution is denoted as $A = \{\alpha_{i,j}\}$ where:

$$
\alpha_{i,j} = P(q_{p+1} = S_j \mid q_p = S_i), \qquad 1 \le i, j \le N
$$

The observation symbol probability distribution in state $i$ is denoted as $B = \{b_i(v_w)\}$, $v_w \in \{v_C, v_I\}$, where:

$$
b_i(v_w) = P(V_{i,p} = v_w \mid q_p = S_i)
$$

Each state is given an equal initial probability.
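With states, observation symbols, transitions, emissions, and a uniform start all defined, the maximum-likelihood path can be found with the Viterbi algorithm. The NumPy sketch below is a generic log-space Viterbi adapted to this model's per-state consistency symbols; it is not the repository's implementation. How the starting matrices are built from the `-t` and `-e` values is an assumption (stay in the same state with probability `trans_in_p`, split the remainder evenly, and emit $v_C$ with probability `emit_same_p`), and all function names are hypothetical.

```python
import numpy as np

def initial_parameters(n_states, trans_in_p=0.64, emit_same_p=0.99):
    # ASSUMPTION: diagonal = trans_in_p, remainder split evenly off-diagonal.
    off_diag = (1.0 - trans_in_p) / (n_states - 1)
    A = np.full((n_states, n_states), off_diag)
    np.fill_diagonal(A, trans_in_p)
    # Column 0 holds b_i(v_I), column 1 holds b_i(v_C).
    B = np.tile([1.0 - emit_same_p, emit_same_p], (n_states, 1))
    pi = np.full(n_states, 1.0 / n_states)   # equal initial probabilities
    return A, B, pi

def viterbi(consistent, A, B, pi):
    """consistent[p, i] is 1 when V_{i,p} = v_C and 0 when it is v_I."""
    n_pos, n_states = consistent.shape
    log_A = np.log(A)
    # log_emit[p, i] = log b_i(V_{i,p})
    log_emit = np.log(B[np.arange(n_states), consistent])
    score = np.log(pi) + log_emit[0]
    back = np.zeros((n_pos, n_states), dtype=int)
    for p in range(1, n_pos):
        cand = score[:, None] + log_A        # cand[i, j]: from state i to j
        back[p] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[p]
    # Trace the best path backwards from the best final state.
    path = [int(score.argmax())]
    for p in range(n_pos - 1, 0, -1):
        path.append(int(back[p, path[-1]]))
    return path[::-1]

# Toy data: the first three SNPs agree with state 0, the last three with state 1.
consistent = np.array([[1, 0, 0], [1, 0, 0], [1, 0, 0],
                       [0, 1, 0], [0, 1, 0], [0, 1, 0]])
A, B, pi = initial_parameters(3)
path = viterbi(consistent, A, B, pi)
print(path)
```

Working in log space avoids numerical underflow, which matters because a chromosome can contain a very large number of SNP positions.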

To further inform the model, we allow for a genetic map to be input as a prior for the transition probabilities. We converge on optimal transition and emission probabilities by running the HMM through an Expectation-Maximization (EM) loop. In each iteration, the maximum-likelihood path (MLP) through the HMM is found using the Viterbi algorithm (maximization step), then transition and emission probabilities are recalculated based on the results (expectation step). The first MLP is found using initial transition and emission probabilities passed in by the user. In every subsequent iteration, transition probabilities are recalculated with:

$$
\alpha_{i,j} = \frac{C_{i \to j}}{C_i}
$$

where $C_{i \to j}$ is the number of times $S_j$ is classified at a SNP position following a SNP position classified as $S_i$, and $C_i$ is the total number of times $S_i$ is classified (i.e., the number of times $S_i$ is followed by any state, including itself).
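The transition update amounts to counting adjacent pairs of states along the Viterbi path and normalizing each row. A minimal sketch follows; the function name is hypothetical, and giving a state that never appears in the path a uniform row is an assumption, since the text does not say how that case is handled.

```python
import numpy as np

def reestimate_transitions(path, n_states):
    """Recompute transition probabilities from a classified path."""
    counts = np.zeros((n_states, n_states))
    for i, j in zip(path[:-1], path[1:]):
        counts[i, j] += 1                         # C_{i -> j}
    totals = counts.sum(axis=1, keepdims=True)    # C_i
    safe_totals = np.where(totals > 0, totals, 1.0)
    # ASSUMPTION: states with no outgoing counts get a uniform row.
    uniform = np.full_like(counts, 1.0 / n_states)
    return np.where(totals > 0, counts / safe_totals, uniform)

new_A = reestimate_transitions([0, 0, 0, 1, 1, 1], n_states=3)
print(new_A)
```

Note that the final position in the path contributes no count, because it is not followed by any state.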

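Emission probabilities are recalculated as follows:

$$
b_i(v_C) = \frac{1}{Z_i}\,\frac{R_i}{T_i}, \qquad
b_i(v_I) = \frac{1}{Z_i}\,\frac{\bar{R}_i}{\bar{T}_i}, \qquad
Z_i = \frac{R_i}{T_i} + \frac{\bar{R}_i}{\bar{T}_i}
$$

where $R_i$ is the total number of times that $S_i$ is classified at locations where $V_{i,p}$ equals $v_C$ and $T_i$ is the total number of locations where $V_{i,p}$ equals $v_C$. Likewise, $\bar{R}_i$ is the total number of times that $S_i$ is classified at locations where $V_{i,p}$ does not equal $v_C$ and $\bar{T}_i$ is the total number of locations where $V_{i,p}$ does not equal $v_C$. The two fractions are then normalized by $Z_i$ so that they sum to 1.0.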
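The emission update can be sketched in the same style, using the two pairs of counts described above (over consistent and over inconsistent positions). The function and variable names are hypothetical, and falling back to equal probabilities when a state is never classified is an assumption.

```python
import numpy as np

def reestimate_emissions(path, consistent):
    """Recompute [b_i(v_I), b_i(v_C)] for each state from a classified path.

    consistent[p, i] is 1 when V_{i,p} = v_C and 0 otherwise.
    """
    path = np.asarray(path)
    n_states = consistent.shape[1]
    B = np.zeros((n_states, 2))
    for i in range(n_states):
        is_con = consistent[:, i] == 1
        chosen = path == i
        T_con, T_inc = is_con.sum(), (~is_con).sum()
        R_con = (chosen & is_con).sum()      # classified as S_i and consistent
        R_inc = (chosen & ~is_con).sum()     # classified as S_i and inconsistent
        frac_con = R_con / T_con if T_con else 0.0
        frac_inc = R_inc / T_inc if T_inc else 0.0
        Z = frac_con + frac_inc
        # ASSUMPTION: equal probabilities if the state was never classified.
        B[i] = [frac_inc / Z, frac_con / Z] if Z else [0.5, 0.5]
    return B

toy_consistent = np.array([[1, 0], [1, 0], [0, 0], [0, 1], [1, 1], [0, 1]])
new_B = reestimate_emissions([0, 0, 0, 1, 1, 1], toy_consistent)
print(new_B)
```

In this toy example state 1 is never classified at an inconsistent position, so its inconsistent emission probability becomes exactly 0. Since the overview says emission rates stay above 0.0, a real implementation would presumably need some floor or smoothing before the next Viterbi pass; the text does not say which.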

Additional details

Additionally, we identify regions within the descendant genome that are identical by descent (IBD) in multiple ancestors. Sometimes large segments of the ancestor genomes are identical and indistinguishable from one another. We sought to identify these regions within the HMM output, where a particular ancestor is chosen (by Viterbi) based on few or no informative positions. In these cases, there is no way to truly distinguish the ancestor of origin, so we reclassify the segment as IBD in multiple ancestors, with each ancestor separated by "_" in the output BED file.

Haplotypes in the descendant strain inherited from an ungenotyped ancestor may contain informative markers shared by other ancestor strains. In these cases, the markers show evidence for the region having originated from the shared ancestor strains rather than the ungenotyped ancestor. In fact, informative markers from the ungenotyped ancestor will only show in the data when there is a SNP in the descendant strain that is not shared by any of the sequenced ancestors. This creates a natural bias towards labeling regions as being inherited from the sequenced ancestors rather than from the unsequenced ancestors. To adjust for this bias, we include an input parameter to give more weight towards classifying a region as the Unknown state when there are informative markers in the descendant that don't belong to any of the genotyped ancestors.
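One way such an adjustment could work is to relabel a block as Unk when enough of its SNPs are descendant-only. The sketch below is a guess at the idea behind the `-k`/`--unk-cutoff` flag, not the program's actual rule: the denominator (all SNPs in the block), the `>=` comparison, and the function name are all assumptions.

```python
def relabel_unknown_blocks(blocks, unk_cutoff=1.0):
    """Relabel blocks dominated by descendant-only SNPs as "Unk".

    blocks: list of (label, flags), where flags holds one boolean per SNP
            in the block, True when the SNP is seen only in the descendant.
    """
    relabeled = []
    for label, desc_only_flags in blocks:
        # ASSUMPTION: fraction is taken over every SNP in the block.
        fraction = sum(desc_only_flags) / len(desc_only_flags)
        relabeled.append("Unk" if fraction >= unk_cutoff else label)
    return relabeled

toy_blocks = [
    ("AKR", [False, False, True, False]),   # 1 of 4 SNPs is descendant-only
    ("DBA2", [True, True, False, True]),    # 3 of 4
    ("AKR", [True, True]),                  # 2 of 2
]
strict = relabel_unknown_blocks(toy_blocks, unk_cutoff=1.0)
relaxed = relabel_unknown_blocks(toy_blocks, unk_cutoff=0.75)
print(strict, relaxed)
```

Lowering the cutoff relabels more blocks as Unk, which is the direction needed to counteract the bias described above.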

After classifying each SNP as having been inherited from a particular ancestor, adjacent SNPs inherited from the same ancestor are considered to belong to the same ancestral haploblock in the descendant genome. To assess the accuracy of the model's output, we include the ability to compare small insertion and deletion structural variants (indels) called in the descendant and ancestor strains against haploblocks classified by the HMM. Each descendant indel is labeled as either a “hit” or a “miss” based on whether an ancestor indel overlapping the descendant indel is consistent or inconsistent with the HMM-classified haploblock over that region. That is, for each haploblock classified by the model, we find all indels in the descendant strain and the ancestor strains that overlap that haploblock region. Then for each unique indel region, we check for consistency between the model classification and the ancestor indels if a descendant indel is also present. We score a “hit” if the indel region contains the indel of the ancestor classified from the HMM and a “miss” if the region does not contain that indel. The final ratio of hits to misses gives the HMM output a score.
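The hit/miss scoring can be sketched with simple interval overlaps. This is an illustrative reading of the paragraph above rather than the program's scoring code. Several choices are assumptions: Unk blocks are skipped, descendant indels with no overlapping ancestor indel are ignored, and a multi-ancestor (IBD) block counts as a hit if any of its ancestors carries the indel. All names and coordinates are hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True when two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end

def score_blocks(blocks, desc_indels, ancestor_indels):
    """Count hits and misses of descendant indels against classified blocks.

    blocks:          list of (chrom, start, end, label)
    desc_indels:     list of (chrom, start, end)
    ancestor_indels: {ancestor_name: [(chrom, start, end), ...]}
    """
    hits = misses = 0
    for chrom, b_start, b_end, label in blocks:
        called = set(label.split("_"))
        if called == {"Unk"}:
            continue  # ASSUMPTION: nothing to compare for unsequenced ancestors
        for d_chrom, d_start, d_end in desc_indels:
            if d_chrom != chrom or not overlaps(d_start, d_end, b_start, b_end):
                continue
            carriers = {
                anc for anc, indels in ancestor_indels.items()
                if any(c == chrom and overlaps(s, e, d_start, d_end)
                       for c, s, e in indels)
            }
            if not carriers:
                continue  # ASSUMPTION: uninformative, no ancestor indel here
            if carriers & called:
                hits += 1
            else:
                misses += 1
    return hits, misses

toy_blocks = [("chr1", 100, 1000, "DBA2"), ("chr1", 1200, 2000, "AKR")]
toy_desc = [("chr1", 300, 305), ("chr1", 1500, 1510), ("chr1", 1800, 1803)]
toy_anc = {"DBA2": [("chr1", 301, 304), ("chr1", 1500, 1508)],
           "AKR": [("chr1", 1799, 1804)]}
hits, misses = score_blocks(toy_blocks, toy_desc, toy_anc)
print(hits, misses)
```

Because indels are not used to train the HMM, they act as held-out evidence: a high proportion of hits suggests the SNP-based block assignments agree with an independent class of variants.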
