ERSA Alternative Hypothesis: Modeling Recent Shared Ancestry

Overview

The alternative hypothesis in ERSA (Estimation of Recent Shared Ancestry) models the scenario where two individuals share recent genealogical ancestry. Unlike the null hypothesis (random population members), these individuals have traceable common ancestors within a few generations.

Key Concepts:

IBD segments arise from two sources: recent ancestry and population background
Relationship characterized by parameters $d$ (meioses) and $a$ (ancestors)
Segments from recent ancestry are typically longer and more numerous
The likelihood combines both recent and background components

# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.special import factorial, comb
import pandas as pd
import seaborn as sns
from typing import List, Tuple, Dict, Optional
import warnings
warnings.filterwarnings('ignore')

# Set style for better visualizations
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Set random seed for reproducibility
np.random.seed(42)
                        

Variable Definitions

The variables used in ERSA alternative hypothesis:

$d$ = total meioses separating the individuals
$a$ = number of common ancestors (1 or 2)
$n_A$ = number of segments from recent ancestry
$n_P$ = number of background segments - segments from other ancestors in the population
$s_A$ = set of segment (lengths) from recent ancestry
$s_P$ = set of segment (length) from distant ancestry
$t$ = minimum segment threshold (2.5 cM)

Important Relationships

Note that $n_P + n_A = n$ where $n$ is the total number of segments shared between the pair.
$s_A$ and $s_P$ are two mutually exclusive subsets of $s$

1. Mathematical Framework of the Alternative Hypothesis

Under the alternative hypothesis $H_a$ , IBD segments arise from two sources:

Recent ancestry segments (subscript A): Inherited from recent common ancestors
Background segments (subscript P): Population-level background sharing

The likelihood is:

L_R = L_A(n_A, s_A|d, a, t) · L_P(n_P, s_P|t) (4)

where:

$d$ = total meioses separating the individuals
$a$ = number of common ancestors (1 or 2)
$n_A$ = number of segments from recent ancestry
$n_P$ = number of background segments - segments from other ancestors in the population
$s_A$ = set of segment (lengths) from recent ancestry
$s_P$ = set of segment (length) from distant ancestry
$t$ = minimum segment threshold (2.5 cM)

$n_P$ and thus $L_P$ follows Equation 1 of the Null Hypothesis, which is that the pair is no more related than two randomly selected individuals from the population.

" $L_A$ is the likelihood that two individuals share $n$ autosomal segments from recent ancestor(s) specified by $d$ and $a$ , with the segment lengths specified by $s_A$ " (Huff et al., 2011, p. 770).

L_A(n_A, s_A|d, a, t) = N_A(n|d, a, t) · S_A(s_A|d, t) (5)

$L_A$ is the likelihood (not probability) that can be expressed as the product of likelihoods of the number of shared segments and the length of each segment, which parallels Equation 1:

L_P(n, s|t) = N_P(n|t) · S_P(s|t) (1)

Key difference: The alternative hypothesis (Equation 5) includes the parameters $d$ (total meioses) and $a$ (number of common ancestors) that specify the targeted relatedness between individuals, whereas the null hypothesis (Equation 1) assumes no specific relationship beyond random population-level sharing.

Likelihood vs Probability

Likelihood (what ERSA uses):

L_A(n_A, s_A|d, a, t) = N_A(n|d, a, t) · S_A(s_A|d, t) (5)

This reads as: "The likelihood of observing data (n_A, s_A) given that the relationship is (d, a)"

Fixed parameters: $d, a$ (the relationship)
Variable: the observed data $(n_A, s_A)$
Question: How likely is this data if we assume this specific relationship?

Probability (Bayesian approach):

P(d, a|n_A, s_A, t) = [L_A(n_A, s_A|d, a, t) · P(d, a)] / P(n_A, s_A|t) (Bayes)

This reads as: "The probability that the relationship is (d, a) given that we observed data (n_A, s_A)"

Fixed: the observed data $(n_A, s_A)$
Variable: the relationship parameters $(d, a)$
Question: What's the probability of this relationship given what we observed?

Key Difference

Likelihood: L(data|relationship) - How likely is the data given a specific relationship?
Probability: P(relationship|data) - What's the probability of the relationship given the data?

In Practice:

ERSA uses maximum likelihood estimation (MLE), finding the relationship (d, a) that maximizes the likelihood:

(d̂, â) = argmax_d,a L_A(n_A, s_A|d, a, t)

To get actual probabilities of relationships, you would need:

Prior probabilities P(d, a) for each relationship type
The marginal likelihood P(n_A, s_A|t) (sum over all possible relationships)
Apply Bayes' theorem

Modeling Segment Lengths Under the Alternative Hypothesis

Recall equation 5:

L_A(n_A, s_A|d, a, t) = N_A(n|d, a, t) · S_A(s_A|d, t) (5)

The second component, S_A(s_A|d, t), represents the likelihood of observing the specific set of segment lengths under recent ancestry. This is computed as:

S_A(s|d, t) = ∏_{i ∈ s} F_A(i|t) (6)

Explanation: "Equation 6 assumes that, for a given value of $d$ , the lengths of segments are independent. This assumption is not strictly true. One might imagine that the presence of a particularly long segment would reduce the genomic space available for additional segments. However, because the length of any one segment is small relative to the genome and because the genome is physically divided into chromosomes, the segment lengths are approximately independent" (Huff et al., 2011, p. 770).

Parallel Structure: Equation 6 directly parallels Equation 2 from the null hypothesis:

S_P(s|t) = ∏_{i ∈ s} F_P(i|t) (2)

Both equations use the same product structure, assuming independence of segment lengths. The key difference is:

Equation 2 uses F_P(i|t) - the population background distribution
Equation 6 uses F_A(i|t) - the distribution under recent ancestry with parameters $d$ and $a$

Understanding Segment Length Distribution Under Recent Ancestry

Let's look at F_A(i|t) component from equation 6.

The Inheritance Probability Framework

"For two individuals who are related by an inheritance path that is $d$ meioses long, the probability that they will inherit any particular autosomal segment from a common ancestor on that path is equal to 1/2^(d-1)" (Huff et al., 2011, p. 770).

Understanding the 1/2^(d-1) formula

Let's trace how a specific segment travels from a common ancestor to two relatives. Consider first cousins who share a grandmother:

At each meiosis (when a parent passes DNA to a child), each autosomal segment has a 1/2 chance of being inherited because:

The parent has two copies of each chromosome (one from each of their parents)
The child randomly inherits one of these two copies
Therefore, any specific segment from a grandparent has a 1/2 chance of being passed on

For first cousins:

The grandmother has a specific DNA segment on one of her two chromosome copies
Path to cousin 1: grandmother → parent 1 (1/2 chance of inheriting that specific segment) → cousin 1 (1/2 chance)
Path to cousin 2: grandmother → parent 2 (1/2 chance) → cousin 2 (1/2 chance)
Total: 4 meioses, so $d = 4$

However, let's consider probabilities:

We specify that there is a specific segment from grandmother to parent 1 that we are following.
- This removes the 1/2 chance from grandmother to parent 1. There is no chance here.
- We have selected the specific transmitted segment that we are following.
That same segment getting transmitted from parent 1 to cousin 1 has a 1/2 chance.
That same segment getting transmitted from grandmother to parent 2 has a 1/2 chance.
That same segment getting transmitted from parent 2 to cousin 2 has a 1/2 chance.

So, the probability both cousins inherit the same segment from grandmother is:

(1/2) × (1/2) × (1/2) = (1/2)³ = 1/2^(d-1) where d-1 = 3

The formula is 1/2^(d-1) rather than 1/2^d because once we selected the segment transmitted from the common ancestor in the first meiosis, we only need to track d-1 chance (independent) transmission events for both individuals to share it.

Expected Number of Shared Segments

"The expected number of shared autosomal segments that could potentially be inherited from a common ancestor is equal to $rd + c$ , where $c$ is the number of autosomes and $r$ is the expected number of recombination events per haploid genome per generation" (Huff et al., 2011, p. 770).

Breaking this down:

$c$ segments from chromosomes: At minimum, if no recombination occurs, relatives can share up to $c = 22$ autosomal segments (one per chromosome)
$rd$ segments from recombination: Each generation adds approximately $r \approx 35.3$ recombination events, creating new segment boundaries. Over $d$ meioses, this creates $rd$ additional potential segments
Total potential segments: $rd + c$

Understanding rd + c: How Recombination Creates Potential IBD Segments

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.patches import Rectangle, FancyBboxPatch
import matplotlib.patches as mpatches

# Create figure with subplots
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
fig.suptitle('Understanding $rd + c$: How Recombination Creates Potential IBD Segments', 
             fontsize=16, fontweight='bold')

# Parameters
r = 35.3  # recombinations per generation
c = 22    # number of autosomes

# Set random seed for consistency
np.random.seed(42)

# Generate recombination positions that accumulate
# First generation breaks
first_gen_breaks = {}
for i in range(22):
    n_breaks = np.random.poisson(35.3/22)
    if n_breaks > 0:
        first_gen_breaks[i] = np.random.uniform(1.5, 8.5, min(n_breaks, 2))
    else:
        first_gen_breaks[i] = []

# ... [full visualization code from notebook] ...
                        

The value $rd + c$ represents the maximum possible number of distinct IBD segments that could be inherited from a common ancestor—it's the total number of potential segment boundaries created by recombination.

For example:

Parent and child (d=1, a=1): Potential segments: $rd + c = 35.3(1) + 22 = 57$
First cousins (d=4, a=2): Potential segments: $rd + c = 35.3(4) + 22 = 163$
Third cousins (d=8, a=2): Potential segments: $rd + c = 35.3(8) + 22 = 304$

However, the actual number of segments shared between two relatives is much smaller. To estimate the actual number of shared segments, we recognize that each potential segment has only a 1/2^(d-1) probability of being inherited by both individuals. And we also consider the number of ancestors, 1 or 2.

The number of potential segments: $rd + c$
The probability each is inherited: $1/2^{(d-1)}$
The number of common ancestors: $a$ (doubles the expectation for full siblings vs half-siblings)

Expected shared segments = a × (rd + c) × 1/2^(d-1)

Therefore, "the expected number of shared segments is equal to $a(rd + c)/2^{(d-1)}$ " (Huff et al., 2011, p. 770).

For example:

Parent and child (d=1, a=1):
- Potential segments: $rd + c = 35.3(1) + 22 = 57$
- Expected shared segments: $1 \times 57 / 2^0 = 57/1 = 57$ segments
- 100% of potential segments are shared (as expected for parent-child!)
First cousins (d=4, a=2):
- Potential segments: $rd + c = 35.3(4) + 22 = 163$
- Expected shared segments: $2 \times 163 / 2^3 = 326/8 \approx 41$ segments
- Only ~25% of potential segments are actually shared
Third cousins (d=8, a=2):
- Potential segments: $rd + c = 35.3(8) + 22 = 304$
- Expected shared segments: $2 \times 304 / 2^7 = 608/128 \approx 5$ segments
- Only ~1.6% of potential segments are actually shared

Segment Length Distribution

Recall from equation 6 that the likelihood of observing a set of segment lengths under the alternative hypothesis is:

S_A(s|d, t) = ∏_{i ∈ s} F_A(i|t) (6)

where F_A(i|t) is the probability density for observing a segment of length $i$ under recent ancestry. Here, $i$ represents the length of an individual IBD segment in centiMorgans (cM). Equation 6 states that the total likelihood for a set of segments is the product of individual segment likelihoods, assuming independence between segment lengths.

Let's look at the segment $i$ .

"Given $d$ , the expected value of $i$ is $100/d$ . Without conditioning on $t$ , the distribution of segment length is exponential with mean $100/d$ " (Huff et al., 2011, p. 770). In this statement, $i$ is the mean segment length whereas in equation 6, $i$ is the individual segment length.

𝔼[i] = 100/d

The expected value of $i$ on average is $100/d$ .

Understanding Morgans and the 100/d Relationship

First, let's define genetic distance units:

1 Morgan (M) = the genetic distance over which there is, on average, 1 recombination event per meiosis
1 centiMorgan (cM) = 1/100 of a Morgan = 1% probability of recombination per meiosis
The human genome is approximately 100 Morgans (or 10,000 cM) long

What this means concretely:

A 1 cM segment has a 1% or 0.01 probability of experiencing a recombination in one meiosis
A 10 cM segment has a 10% or 0.10 probability of experiencing a recombination in one meiosis
A 100 cM (1 Morgan) segment has a 100% probability or 1.0 expected recombination per meiosis

Why does 100 appear in the formulas?
The 100 appears because we measure in centiMorgans but need to convert to probability:

$L$ cM = $L$ % chance of recombination = $L/100$ as a decimal probability
It's just like converting centimeters to meters - you divide by 100
If we measured in Morgans instead of centiMorgans, we wouldn't need the 100

Conditioning on Minimum Length

When we only observe segments longer than threshold $t$ (typically 2.5 cM), we need the conditional distribution. Since short segments cannot be reliably detected, we must adjust our probability density:

F_A(i|d, t) = e^-d(i-t)/100 / (100/d) (7)

Understanding this formula

The unconditional exponential density is f(i) = (d/100)e^-di/100 for any length i ≥ 0.

When we condition on observing only segments ≥ t:

We shift the distribution by t: (i-t)
We renormalize so the total probability from t to ∞ equals 1

Breaking down equation 7:

Numerator e^-d(i-t)/100:
- This is the shifted exponential decay
- The (i-t) term shifts our reference point to the threshold
- Segments just above threshold t have the highest density
Denominator 100/d:
- This ensures the density integrates to 1 over [t, ∞)
- It's actually ∫_t^∞ e^-d(i-t)/100 di = 100/d
- Note: This is NOT the mean of the truncated distribution (which would be t + 100/d)
The 100 appears because:
- We measure in centiMorgans (1% units)
- The rate parameter is d/100 when using cM

Distribution of the Number of Segments

Recall from equation 5 that the alternative hypothesis likelihood has two components:

L_A(n_A, s_A|d, a, t) = N_A(n|d, a, t) · S_A(s_A|d, t) (5)

We've just covered S_A (segment lengths). Now let's examine N_A(n|d, a, t), the likelihood of observing $n$ segments given the relationship (d, a).

"The probability p(t) that a shared segment is longer than $t$ is equal to e^-dt/100 (Thomas et al. 1994). Because the distribution of the number of shared segments is approximately Poisson (Thomas et al. 1994)," (Huff et al., 2011, p. 770) we have:

N_A(n|d, a, t) = [e^{-[a(rd+c)p(t)]/2^d-1} × [a(rd+c)p(t)/2^d-1]ⁿ] / n! (8)

Breaking down this formula

Equation 8 is based on the Poisson distribution formula:

P(N=n) = e^-λλⁿ / n!

which is probability of observing exactly n segments. Appropriate because segment inheritance events are independent.

The Poisson parameter λ:

λ = [a(rd+c)p(t)] / 2^d-1 = [a(rd+c)e^-dt/100] / 2^d-1

This combines:

a(rd+c): total potential segments from recent ancestry
1/2^(d-1): probability each is inherited by both individuals
p(t) = e^-dt/100: probability each exceeds detection threshold

Relationship Parameters and Expected Segments

Expected IBD Segment Count Distributions

Segment Length Distributions by Relationship

import numpy as np
from scipy import stats
from typing import List

class ERSAAlternativeHypothesis:
    """
    Implementation of ERSA Alternative Hypothesis for relationship inference
    """
    
    def __init__(self, r: float = 35.3, c: int = 22, t: float = 2.5):
        """
        Initialize ERSA Alternative Hypothesis model
        
        Parameters:
        -----------
        r : float
            Expected number of recombinations per generation (default: 35.3)
        c : int
            Number of autosomes (default: 22 for humans)
        t : float
            Minimum segment threshold in cM (default: 2.5)
        """
        self.r = r
        self.c = c
        self.t = t
        
    def expected_segments(self, d: int, a: int) -> float:
        """
        Calculate expected number of IBD segments
        
        Parameters:
        -----------
        d : int
            Total number of meioses
        a : int
            Number of common ancestors (1 or 2)
            
        Returns:
        --------
        float
            Expected number of segments
        """
        return a * (self.r * d + self.c) / (2 ** (d - 1))
    
    def segment_probability(self, d: int) -> float:
        """
        Probability that a segment exceeds threshold t
        
        Parameters:
        -----------
        d : int
            Total number of meioses
            
        Returns:
        --------
        float
            Probability of segment > t
        """
        return np.exp(-d * self.t / 100)
    
    def poisson_parameter(self, d: int, a: int) -> float:
        """
        Calculate lambda parameter for Poisson distribution of segment counts
        
        Parameters:
        -----------
        d : int
            Total number of meioses
        a : int
            Number of common ancestors
            
        Returns:
        --------
        float
            Lambda parameter for Poisson distribution
        """
        return self.expected_segments(d, a) * self.segment_probability(d)
    
    def segment_count_likelihood(self, n: int, d: int, a: int) -> float:
        """
        Likelihood of observing n segments given relationship (d, a)
        Implements equation 8 from ERSA paper
        
        Parameters:
        -----------
        n : int
            Number of observed segments
        d : int
            Total number of meioses
        a : int
            Number of common ancestors
            
        Returns:
        --------
        float
            Probability of observing n segments
        """
        lambda_param = self.poisson_parameter(d, a)
        return stats.poisson.pmf(n, lambda_param)
    
    def segment_length_density(self, length: float, d: int) -> float:
        """
        Probability density for segment length under alternative hypothesis
        Implements equation 7 from ERSA paper (F_A(i|d,t))
        
        Parameters:
        -----------
        length : float
            Segment length in cM
        d : int
            Total number of meioses
            
        Returns:
        --------
        float
            Probability density
        """
        if length < self.t:
            return 0
        # Equation 7: F_A(i|d,t) = (d/100) * exp(-d(i-t)/100) / p(t)
        # where p(t) = exp(-dt/100) is the normalizing constant
        p_t = np.exp(-d * self.t / 100)
        return (d/100) * np.exp(-d * (length - self.t) / 100) / p_t
    
    def segment_set_likelihood(self, segments: List[float], d: int) -> float:
        """
        Likelihood of observing set of segment lengths
        Implements S_A(s|d,t) from equation 6
        
        Parameters:
        -----------
        segments : List[float]
            List of segment lengths in cM
        d : int
            Total number of meioses
            
        Returns:
        --------
        float
            Product of individual segment likelihoods
        """
        likelihood = 1.0
        for seg_length in segments:
            likelihood *= self.segment_length_density(seg_length, d)
        return likelihood
        
# Define all relationship types with their parameters
relationships = [
    {"name": "Parent-Child", "d": 1, "a": 1, "degree": 1},
    {"name": "Siblings", "d": 2, "a": 2, "degree": 1},
    {"name": "Half-Siblings", "d": 2, "a": 1, "degree": 2},
    {"name": "Grandparent-Grandchild", "d": 2, "a": 1, "degree": 2},
    {"name": "Aunt/Uncle-Niece/Nephew", "d": 3, "a": 1, "degree": 3},
    {"name": "Half-Aunt/Uncle", "d": 3, "a": 1, "degree": 3},
    {"name": "First Cousins", "d": 4, "a": 2, "degree": 3},
    {"name": "Half First Cousins", "d": 4, "a": 1, "degree": 4},
    {"name": "First Cousins Once Removed", "d": 5, "a": 2, "degree": 4},
    {"name": "Second Cousins", "d": 6, "a": 2, "degree": 5},
    {"name": "Second Cousins Once Removed", "d": 7, "a": 2, "degree": 6},
    {"name": "Third Cousins", "d": 8, "a": 2, "degree": 7},
    {"name": "Fourth Cousins", "d": 10, "a": 2, "degree": 9},
    {"name": "Fifth Cousins", "d": 12, "a": 2, "degree": 11}
]
                        

Maximizing the Alternative Hypothesis Likelihood

"Given n_A and n_P, the maximum value of the likelihood function (Eq. 4) is equal to:" (Huff et al., 2011, p. 770)

ML_R(n_P, n_A, s|d, a, t) = N_P(n_P|t)N_A(n_A|d, a, t) · S_P({s_1:n ... s_{n_P:n}}|t)S_A({s_{n_P+1:n} ... s_n:n}|d, a, t) (9)

"where s_x:n is equal to the x th smallest value in s. Equation 9 asserts that the likelihood is maximized when the set of segments resulting from recent ancestry is equal to the longest n_A segments in s, with the remaining n_P segments being due to the population background. In the Supplemental Methods section, we show that Equation 9 holds as long as μ < a(rd + c), which is true whenever a and d specify shared ancestry that is recent relative to pairs of individuals selected at random from the population." (Huff et al., 2011, p. 770)

Key insight from equation 9

The likelihood is maximized by attributing the longest segments to recent ancestry
The shortest segments are attributed to population background
This makes biological sense: recent ancestry produces longer segments (less time for recombination to break them up)
Population background produces shorter segments (many generations of recombination)

Notation clarification:

s_1:n = smallest (shortest) segment
s_n:n = largest (longest) segment
{s_1:n ... s_{n_P:n}} = the n_P shortest segments
{s_{n_P+1:n} ... s_n:n} = the n_A longest segments

Finding the Maximum Likelihood

"To identify the maximum value of the likelihood function (Eq. 4) given d, a, and t, we evaluate all possible values of n_P and n_A in Eq. 9:" (Huff et al., 2011, p. 770)

ML_R(n, s|d, a, t) = max{ML_R(n_P, n - n_P, s) : n_P ∈ {0, 1...n}} (10)

What equation 10 does

Takes the maximum likelihood over all possible ways to partition the n observed segments
n_P: number of segments attributed to population background
n - n_P: number of segments attributed to recent ancestry (n_A)
For each partition, calculate likelihood using equation 9
Select the partition that gives the highest likelihood

How Equations 9 and 10 Work Together

Equation 9 calculates the likelihood for a specific split of segments:

Given: which segments are from ancestry vs background
Calculate: the likelihood of that particular assignment

Equation 10 finds the best split:

Try all possible ways to split the segments
Use equation 9 to calculate likelihood for each split
Pick the split with the highest likelihood

Concrete example with 5 segments

Say we observe segments of lengths: [3.1, 5.7, 8.2, 15.4, 32.6] cM

Try n_P = 0 (all from ancestry):
- Background segments: none
- Ancestry segments: [3.1, 5.7, 8.2, 15.4, 32.6]
- Calculate ML_R using equation 9 → get likelihood L₀
Try n_P = 1 (shortest from background):
- Background segments: [3.1]
- Ancestry segments: [5.7, 8.2, 15.4, 32.6]
- Calculate ML_R using equation 9 → get likelihood L₁
Try n_P = 2:
- Background segments: [3.1, 5.7]
- Ancestry segments: [8.2, 15.4, 32.6]
- Calculate ML_R using equation 9 → get likelihood L₂

... and so on until n_P = 5

Pick the maximum: If L₂ is highest, then the optimal split is 2 background, 3 ancestry

This makes biological sense: short segments are more likely from ancient background, long segments from recent ancestry!

References and Implementation Notes

ERSA has its own code at www.hufflab.org but the site was down when I last checked. The code that follows is an interpretation of the ERSA theory written in Python. Alternatively, https://github.com/rmunoz12/ersa looks promising, but I haven't used it yet.

ERSA 1: https://genome.cshlp.org/content/21/5/768.full
ERSA 2: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004144

from typing import List, Tuple, Dict
import numpy as np
from scipy import stats

class ERSACompleteLikelihood(ERSAAlternativeHypothesis):
    """
    Extended class with complete likelihood calculations including null hypothesis
    """
    
    def __init__(self, eta: float = 12, mu: float = 3.12, **kwargs):
        """
        Initialize with additional parameters for null hypothesis
        
        Parameters:
        -----------
        eta : float
            Mean number of background segments (default: 12 for CEU)
        mu : float
            Mean background segment length (default: 3.12 cM)
        """
        super().__init__(**kwargs)
        self.eta = eta
        self.mu = mu
    
    def null_segment_count_likelihood(self, n: int) -> float:
        """
        Likelihood of n segments under null hypothesis
        Implements N_P(n|t) from equation 1
        """
        return stats.poisson.pmf(n, self.eta)
    
    def null_segment_length_density(self, length: float) -> float:
        """
        Segment length density under null hypothesis
        Implements F_P(i|t) from equation 3
        
        Note: This is a truncated exponential with threshold t
        """
        if length < self.t:
            return 0
        # This is the corrected formula from equation 3
        return np.exp(-(length - self.t) / self.mu) / self.mu
    
    def alternative_likelihood(self, n_total: int, segments: List[float], 
                             d: int, a: int) -> Tuple[float, int]:
        """
        Complete alternative hypothesis likelihood
        Implements equation 9 and 10 from ERSA paper
        
        Returns:
        --------
        Tuple[float, int]
            (maximum likelihood, optimal n_A)
        """
        segments = sorted(segments, reverse=True)  # Sort longest first
        max_log_likelihood = -np.inf
        best_n_A = 0
        
        for n_A in range(n_total + 1):
            n_P = n_total - n_A
            
            # Use log likelihoods to avoid numerical issues
            log_likelihood = 0
            
            # Background component
            log_likelihood += stats.poisson.logpmf(n_P, self.eta)
            for seg in segments[n_A:]:
                density = self.null_segment_length_density(seg)
                if density > 0:
                    log_likelihood += np.log(density)
                else:
                    log_likelihood = -np.inf
                    break
            
            # Recent ancestry component
            if n_A > 0:
                ancestry_count_likelihood = self.segment_count_likelihood(n_A, d, a)
                if ancestry_count_likelihood > 0:
                    log_likelihood += np.log(ancestry_count_likelihood)
                else:
                    log_likelihood = -np.inf
                    
                for seg in segments[:n_A]:
                    density = self.segment_length_density(seg, d)
                    if density > 0:
                        log_likelihood += np.log(density)
                    else:
                        log_likelihood = -np.inf
                        break
            
            if log_likelihood > max_log_likelihood:
                max_log_likelihood = log_likelihood
                best_n_A = n_A
                
        return np.exp(max_log_likelihood) if max_log_likelihood > -np.inf else 0, best_n_A
    
    def likelihood_ratio_test(self, segments: List[float], d: int, a: int) -> Dict:
        """
        Perform likelihood ratio test
        
        Returns:
        --------
        Dict
            Test statistics and p-value
        """
        n_total = len(segments)
        
        # Null hypothesis log-likelihood
        log_L_null = stats.poisson.logpmf(n_total, self.eta)
        for seg in segments:
            density = self.null_segment_length_density(seg)
            if density > 0:
                log_L_null += np.log(density)
            else:
                log_L_null = -np.inf
                break
        
        # Alternative hypothesis likelihood
        L_alt, n_A = self.alternative_likelihood(n_total, segments, d, a)
        log_L_alt = np.log(L_alt) if L_alt > 0 else -np.inf
        
        # Likelihood ratio test statistic
        if log_L_alt > log_L_null and log_L_null > -np.inf:
            LR = -2 * (log_L_null - log_L_alt)
        else:
            LR = 0
            
        # Chi-square test with 2 degrees of freedom
        p_value = stats.chi2.sf(LR, df=2) if LR > 0 else 1.0
        
        return {
            'L_null': np.exp(log_L_null) if log_L_null > -np.inf else 0,
            'L_alt': L_alt,
            'log_L_null': log_L_null,
            'log_L_alt': log_L_alt,
            'n_A': n_A,
            'n_P': n_total - n_A,
            'LR': LR,
            'p_value': p_value,
            'significant': p_value < 0.001
        }
    
    def null_likelihood(self, segments: List[float]) -> float:
        """
        Calculate complete null hypothesis likelihood
        Implements L_P(n,s|t) from equation 1
        """
        n = len(segments)
        log_likelihood = stats.poisson.logpmf(n, self.eta)
        for seg in segments:
            density = self.null_segment_length_density(seg)
            if density > 0:
                log_likelihood += np.log(density)
            else:
                return 0
        return np.exp(log_likelihood)
                        

Summary

The ERSA alternative hypothesis provides a comprehensive framework for modeling IBD patterns under recent shared ancestry. Key components include:

Dual-source model: Segments from recent ancestry (longer, more numerous) and population background (shorter, fewer)
Poisson segment counts: Expected number depends on relationship parameters and detection threshold
Exponential segment lengths: Truncated distributions conditioned on minimum detectable length
Optimal partitioning: Longest segments attributed to recent ancestry, shortest to background
Maximum likelihood estimation: Finding the relationship parameters that best explain observed data

This framework enables quantitative relationship inference by comparing the alternative hypothesis likelihood with the null hypothesis, providing statistical evidence for specific genealogical relationships based on patterns of IBD segment sharing.

ERSA Alternative Hypothesis: Modeling Recent Shared Ancestry

Overview

Key Concepts:

View Code

Variable Definitions

Important Relationships

1. Mathematical Framework of the Alternative Hypothesis

Likelihood vs Probability

Likelihood (what ERSA uses):

Probability (Bayesian approach):

Key Difference

In Practice:

Modeling Segment Lengths Under the Alternative Hypothesis

Understanding Segment Length Distribution Under Recent Ancestry

The Inheritance Probability Framework

Understanding the 1/2(d-1) formula

Expected Number of Shared Segments

Breaking this down:

Understanding rd + c: How Recombination Creates Potential IBD Segments

View Code

Segment Length Distribution

Understanding Morgans and the 100/d Relationship

Conditioning on Minimum Length

Understanding this formula

Distribution of the Number of Segments

Breaking down this formula

Relationship Parameters and Expected Segments

Expected IBD Segment Count Distributions

Segment Length Distributions by Relationship

View Code: ERSA Alternative Hypothesis Implementation

Maximizing the Alternative Hypothesis Likelihood

Key insight from equation 9

Finding the Maximum Likelihood

What equation 10 does

How Equations 9 and 10 Work Together

Concrete example with 5 segments

References and Implementation Notes

View Code: Complete Likelihood Implementation

Summary

Understanding the 1/2^(d-1) formula