ERSA Alternative Hypothesis: Modeling Recent Shared Ancestry
Overview
The alternative hypothesis in ERSA (Estimation of Recent Shared Ancestry) models the scenario where two individuals share recent genealogical ancestry. Unlike the null hypothesis (random population members), these individuals have traceable common ancestors within a few generations.
Key Concepts:
- IBD segments arise from two sources: recent ancestry and population background
- Relationship characterized by parameters d (meioses) and a (ancestors)
- Segments from recent ancestry are typically longer and more numerous
- The likelihood combines both recent and background components
Variable Definitions
The variables used in ERSA alternative hypothesis:
- d = total meioses separating the individuals
- a = number of common ancestors (1 or 2)
- n_A = number of segments from recent ancestry
- n_P = number of background segments - segments from other ancestors in the population
- s_A = set of segment (lengths) from recent ancestry
- s_P = set of segment (length) from distant ancestry
- t = minimum segment threshold (2.5 cM)
Important Relationships
Note that n_P + n_A = n where n is the total number of segments shared between the pair.
s_A and s_P are two mutually exclusive subsets of s
1. Mathematical Framework of the Alternative Hypothesis
Under the alternative hypothesis H_a, IBD segments arise from two sources:
- Recent ancestry segments (subscript A): Inherited from recent common ancestors
- Background segments (subscript P): Population-level background sharing
The likelihood is:
where:
- d = total meioses separating the individuals
- a = number of common ancestors (1 or 2)
- n_A = number of segments from recent ancestry
- n_P = number of background segments - segments from other ancestors in the population
- s_A = set of segment (lengths) from recent ancestry
- s_P = set of segment (length) from distant ancestry
- t = minimum segment threshold (2.5 cM)
n_P and thus L_P follows Equation 1 of the Null Hypothesis, which is that the pair is no more related than two randomly selected individuals from the population.
"L_A is the likelihood that two individuals share n autosomal segments from recent ancestor(s) specified by d and a, with the segment lengths specified by s_A" (Huff et al., 2011, p. 770).
L_A is the likelihood (not probability) that can be expressed as the product of likelihoods of the number of shared segments and the length of each segment, which parallels Equation 1:
Key difference: The alternative hypothesis (Equation 5) includes the parameters d (total meioses) and a (number of common ancestors) that specify the targeted relatedness between individuals, whereas the null hypothesis (Equation 1) assumes no specific relationship beyond random population-level sharing.
Likelihood vs Probability
Likelihood (what ERSA uses):
This reads as: "The likelihood of observing data (nA, sA) given that the relationship is (d, a)"
- Fixed parameters: d, a (the relationship)
- Variable: the observed data (n_A, s_A)
- Question: How likely is this data if we assume this specific relationship?
Probability (Bayesian approach):
This reads as: "The probability that the relationship is (d, a) given that we observed data (nA, sA)"
- Fixed: the observed data (n_A, s_A)
- Variable: the relationship parameters (d, a)
- Question: What's the probability of this relationship given what we observed?
Key Difference
- Likelihood: L(data|relationship) - How likely is the data given a specific relationship?
- Probability: P(relationship|data) - What's the probability of the relationship given the data?
In Practice:
ERSA uses maximum likelihood estimation (MLE), finding the relationship (d, a) that maximizes the likelihood:
To get actual probabilities of relationships, you would need:
- Prior probabilities P(d, a) for each relationship type
- The marginal likelihood P(nA, sA|t) (sum over all possible relationships)
- Apply Bayes' theorem
Modeling Segment Lengths Under the Alternative Hypothesis
Recall equation 5:
The second component, SA(sA|d, t), represents the likelihood of observing the specific set of segment lengths under recent ancestry. This is computed as:
Explanation: "Equation 6 assumes that, for a given value of d, the lengths of segments are independent. This assumption is not strictly true. One might imagine that the presence of a particularly long segment would reduce the genomic space available for additional segments. However, because the length of any one segment is small relative to the genome and because the genome is physically divided into chromosomes, the segment lengths are approximately independent" (Huff et al., 2011, p. 770).
Parallel Structure: Equation 6 directly parallels Equation 2 from the null hypothesis:
Both equations use the same product structure, assuming independence of segment lengths. The key difference is:
- Equation 2 uses FP(i|t) - the population background distribution
- Equation 6 uses FA(i|t) - the distribution under recent ancestry with parameters d and a
Understanding Segment Length Distribution Under Recent Ancestry
Let's look at FA(i|t) component from equation 6.
The Inheritance Probability Framework
"For two individuals who are related by an inheritance path that is d meioses long, the probability that they will inherit any particular autosomal segment from a common ancestor on that path is equal to 1/2(d-1)" (Huff et al., 2011, p. 770).
Understanding the 1/2(d-1) formula
Let's trace how a specific segment travels from a common ancestor to two relatives. Consider first cousins who share a grandmother:
At each meiosis (when a parent passes DNA to a child), each autosomal segment has a 1/2 chance of being inherited because:
- The parent has two copies of each chromosome (one from each of their parents)
- The child randomly inherits one of these two copies
- Therefore, any specific segment from a grandparent has a 1/2 chance of being passed on
For first cousins:
- The grandmother has a specific DNA segment on one of her two chromosome copies
- Path to cousin 1: grandmother → parent 1 (1/2 chance of inheriting that specific segment) → cousin 1 (1/2 chance)
- Path to cousin 2: grandmother → parent 2 (1/2 chance) → cousin 2 (1/2 chance)
- Total: 4 meioses, so d = 4
However, let's consider probabilities:
- We specify that there is a specific segment from grandmother to parent 1 that we are following.
- This removes the 1/2 chance from grandmother to parent 1. There is no chance here.
- We have selected the specific transmitted segment that we are following.
- That same segment getting transmitted from parent 1 to cousin 1 has a 1/2 chance.
- That same segment getting transmitted from grandmother to parent 2 has a 1/2 chance.
- That same segment getting transmitted from parent 2 to cousin 2 has a 1/2 chance.
So, the probability both cousins inherit the same segment from grandmother is:
(1/2) × (1/2) × (1/2) = (1/2)3 = 1/2(d-1) where d-1 = 3
The formula is 1/2(d-1) rather than 1/2d because once we selected the segment transmitted from the common ancestor in the first meiosis, we only need to track d-1 chance (independent) transmission events for both individuals to share it.
Expected Number of Shared Segments
"The expected number of shared autosomal segments that could potentially be inherited from a common ancestor is equal to rd + c, where c is the number of autosomes and r is the expected number of recombination events per haploid genome per generation" (Huff et al., 2011, p. 770).
Breaking this down:
- c segments from chromosomes: At minimum, if no recombination occurs, relatives can share up to c = 22 autosomal segments (one per chromosome)
- rd segments from recombination: Each generation adds approximately r ≈ 35.3 recombination events, creating new segment boundaries. Over d meioses, this creates rd additional potential segments
- Total potential segments: rd + c
Understanding rd + c: How Recombination Creates Potential IBD Segments
The value rd + c represents the maximum possible number of distinct IBD segments that could be inherited from a common ancestor—it's the total number of potential segment boundaries created by recombination.
For example:
- Parent and child (d=1, a=1): Potential segments: rd + c = 35.3(1) + 22 = 57
- First cousins (d=4, a=2): Potential segments: rd + c = 35.3(4) + 22 = 163
- Third cousins (d=8, a=2): Potential segments: rd + c = 35.3(8) + 22 = 304
However, the actual number of segments shared between two relatives is much smaller. To estimate the actual number of shared segments, we recognize that each potential segment has only a 1/2(d-1) probability of being inherited by both individuals. And we also consider the number of ancestors, 1 or 2.
- The number of potential segments: rd + c
- The probability each is inherited: 1/2^{(d-1)}
- The number of common ancestors: a (doubles the expectation for full siblings vs half-siblings)
Therefore, "the expected number of shared segments is equal to a(rd + c)/2^{(d-1)}" (Huff et al., 2011, p. 770).
For example:
- Parent and child (d=1, a=1):
- Potential segments: rd + c = 35.3(1) + 22 = 57
- Expected shared segments: 1 × 57 / 2^0 = 57/1 = 57 segments
- 100% of potential segments are shared (as expected for parent-child!)
- First cousins (d=4, a=2):
- Potential segments: rd + c = 35.3(4) + 22 = 163
- Expected shared segments: 2 × 163 / 2^3 = 326/8 ≈ 41 segments
- Only ~25% of potential segments are actually shared
- Third cousins (d=8, a=2):
- Potential segments: rd + c = 35.3(8) + 22 = 304
- Expected shared segments: 2 × 304 / 2^7 = 608/128 ≈ 5 segments
- Only ~1.6% of potential segments are actually shared
Segment Length Distribution
Recall from equation 6 that the likelihood of observing a set of segment lengths under the alternative hypothesis is:
where FA(i|t) is the probability density for observing a segment of length i under recent ancestry. Here, i represents the length of an individual IBD segment in centiMorgans (cM). Equation 6 states that the total likelihood for a set of segments is the product of individual segment likelihoods, assuming independence between segment lengths.
Let's look at the segment i.
"Given d, the expected value of i is 100/d. Without conditioning on t, the distribution of segment length is exponential with mean 100/d" (Huff et al., 2011, p. 770). In this statement, i is the mean segment length whereas in equation 6, i is the individual segment length.
The expected value of i on average is 100/d.
Understanding Morgans and the 100/d Relationship
First, let's define genetic distance units:
- 1 Morgan (M) = the genetic distance over which there is, on average, 1 recombination event per meiosis
- 1 centiMorgan (cM) = 1/100 of a Morgan = 1% probability of recombination per meiosis
- The human genome is approximately 100 Morgans (or 10,000 cM) long
What this means concretely:
- A 1 cM segment has a 1% or 0.01 probability of experiencing a recombination in one meiosis
- A 10 cM segment has a 10% or 0.10 probability of experiencing a recombination in one meiosis
- A 100 cM (1 Morgan) segment has a 100% probability or 1.0 expected recombination per meiosis
Why does 100 appear in the formulas?
The 100 appears because we measure in centiMorgans but need to convert to probability:
- L cM = L% chance of recombination = L/100 as a decimal probability
- It's just like converting centimeters to meters - you divide by 100
- If we measured in Morgans instead of centiMorgans, we wouldn't need the 100
Conditioning on Minimum Length
When we only observe segments longer than threshold t (typically 2.5 cM), we need the conditional distribution. Since short segments cannot be reliably detected, we must adjust our probability density:
Understanding this formula
The unconditional exponential density is f(i) = (d/100)e-di/100 for any length i ≥ 0.
When we condition on observing only segments ≥ t:
- We shift the distribution by t: (i-t)
- We renormalize so the total probability from t to ∞ equals 1
Breaking down equation 7:
- Numerator e-d(i-t)/100:
- This is the shifted exponential decay
- The (i-t) term shifts our reference point to the threshold
- Segments just above threshold t have the highest density
- Denominator 100/d:
- This ensures the density integrates to 1 over [t, ∞)
- It's actually ∫t∞ e-d(i-t)/100 di = 100/d
- Note: This is NOT the mean of the truncated distribution (which would be t + 100/d)
- The 100 appears because:
- We measure in centiMorgans (1% units)
- The rate parameter is d/100 when using cM
Distribution of the Number of Segments
Recall from equation 5 that the alternative hypothesis likelihood has two components:
We've just covered SA (segment lengths). Now let's examine NA(n|d, a, t), the likelihood of observing n segments given the relationship (d, a).
"The probability p(t) that a shared segment is longer than t is equal to e-dt/100 (Thomas et al. 1994). Because the distribution of the number of shared segments is approximately Poisson (Thomas et al. 1994)," (Huff et al., 2011, p. 770) we have:
Breaking down this formula
Equation 8 is based on the Poisson distribution formula:
which is probability of observing exactly n segments. Appropriate because segment inheritance events are independent.
The Poisson parameter λ:
This combines:
- a(rd+c): total potential segments from recent ancestry
- 1/2(d-1): probability each is inherited by both individuals
- p(t) = e-dt/100: probability each exceeds detection threshold
Relationship Parameters and Expected Segments
Expected IBD Segment Count Distributions
Segment Length Distributions by Relationship
Maximizing the Alternative Hypothesis Likelihood
"Given nA and nP, the maximum value of the likelihood function (Eq. 4) is equal to:" (Huff et al., 2011, p. 770)
"where sx:n is equal to the x th smallest value in s. Equation 9 asserts that the likelihood is maximized when the set of segments resulting from recent ancestry is equal to the longest nA segments in s, with the remaining nP segments being due to the population background. In the Supplemental Methods section, we show that Equation 9 holds as long as μ < a(rd + c), which is true whenever a and d specify shared ancestry that is recent relative to pairs of individuals selected at random from the population." (Huff et al., 2011, p. 770)
Key insight from equation 9
- The likelihood is maximized by attributing the longest segments to recent ancestry
- The shortest segments are attributed to population background
- This makes biological sense: recent ancestry produces longer segments (less time for recombination to break them up)
- Population background produces shorter segments (many generations of recombination)
Notation clarification:
- s1:n = smallest (shortest) segment
- sn:n = largest (longest) segment
- {s1:n ... snP:n} = the nP shortest segments
- {snP+1:n ... sn:n} = the nA longest segments
Finding the Maximum Likelihood
"To identify the maximum value of the likelihood function (Eq. 4) given d, a, and t, we evaluate all possible values of nP and nA in Eq. 9:" (Huff et al., 2011, p. 770)
What equation 10 does
- Takes the maximum likelihood over all possible ways to partition the n observed segments
- nP: number of segments attributed to population background
- n - nP: number of segments attributed to recent ancestry (nA)
- For each partition, calculate likelihood using equation 9
- Select the partition that gives the highest likelihood
How Equations 9 and 10 Work Together
Equation 9 calculates the likelihood for a specific split of segments:
- Given: which segments are from ancestry vs background
- Calculate: the likelihood of that particular assignment
Equation 10 finds the best split:
- Try all possible ways to split the segments
- Use equation 9 to calculate likelihood for each split
- Pick the split with the highest likelihood
Concrete example with 5 segments
Say we observe segments of lengths: [3.1, 5.7, 8.2, 15.4, 32.6] cM
- Try nP = 0 (all from ancestry):
- Background segments: none
- Ancestry segments: [3.1, 5.7, 8.2, 15.4, 32.6]
- Calculate MLR using equation 9 → get likelihood L0
- Try nP = 1 (shortest from background):
- Background segments: [3.1]
- Ancestry segments: [5.7, 8.2, 15.4, 32.6]
- Calculate MLR using equation 9 → get likelihood L1
- Try nP = 2:
- Background segments: [3.1, 5.7]
- Ancestry segments: [8.2, 15.4, 32.6]
- Calculate MLR using equation 9 → get likelihood L2
... and so on until nP = 5
Pick the maximum: If L2 is highest, then the optimal split is 2 background, 3 ancestry
This makes biological sense: short segments are more likely from ancient background, long segments from recent ancestry!
References and Implementation Notes
ERSA has its own code at www.hufflab.org but the site was down when I last checked. The code that follows is an interpretation of the ERSA theory written in Python. Alternatively, https://github.com/rmunoz12/ersa looks promising, but I haven't used it yet.
ERSA 1: https://genome.cshlp.org/content/21/5/768.full
ERSA 2: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004144
Summary
The ERSA alternative hypothesis provides a comprehensive framework for modeling IBD patterns under recent shared ancestry. Key components include:
- Dual-source model: Segments from recent ancestry (longer, more numerous) and population background (shorter, fewer)
- Poisson segment counts: Expected number depends on relationship parameters and detection threshold
- Exponential segment lengths: Truncated distributions conditioned on minimum detectable length
- Optimal partitioning: Longest segments attributed to recent ancestry, shortest to background
- Maximum likelihood estimation: Finding the relationship parameters that best explain observed data
This framework enables quantitative relationship inference by comparing the alternative hypothesis likelihood with the null hypothesis, providing statistical evidence for specific genealogical relationships based on patterns of IBD segment sharing.