ERSA Alternative Hypothesis: Modeling Recent Shared Ancestry

Overview

The alternative hypothesis in ERSA (Estimation of Recent Shared Ancestry) models the scenario where two individuals share recent genealogical ancestry. Unlike the null hypothesis (random population members), these individuals have traceable common ancestors within a few generations.

Key Concepts:

  • IBD segments arise from two sources: recent ancestry and population background
  • Relationship characterized by parameters d (meioses) and a (ancestors)
  • Segments from recent ancestry are typically longer and more numerous
  • The likelihood combines both recent and background components

# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.special import factorial, comb
import pandas as pd
import seaborn as sns
from typing import List, Tuple, Dict, Optional
import warnings

warnings.filterwarnings('ignore')

# Set style for better visualizations
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Set random seed for reproducibility
np.random.seed(42)

Variable Definitions

The variables used in ERSA alternative hypothesis:

  • d = total meioses separating the individuals
  • a = number of common ancestors (1 or 2)
  • n_A = number of segments from recent ancestry
  • n_P = number of background segments, i.e., segments shared through other, more distant ancestors in the population
  • s_A = set of segment lengths from recent ancestry
  • s_P = set of segment lengths from distant (population background) ancestry
  • t = minimum segment threshold (2.5 cM)

Important Relationships

Note that n_P + n_A = n, where n is the total number of segments shared between the pair.
Similarly, s_A and s_P are two mutually exclusive subsets of s whose union is s.

1. Mathematical Framework of the Alternative Hypothesis

Under the alternative hypothesis H_a, IBD segments arise from two sources:

  1. Recent ancestry segments (subscript A): Inherited from recent common ancestors
  2. Background segments (subscript P): Population-level background sharing

The likelihood is:

L_R = L_A(n_A, s_A|d, a, t) · L_P(n_P, s_P|t)     (4)

where:

  • d = total meioses separating the individuals
  • a = number of common ancestors (1 or 2)
  • n_A = number of segments from recent ancestry
  • n_P = number of background segments, i.e., segments shared through other, more distant ancestors in the population
  • s_A = set of segment lengths from recent ancestry
  • s_P = set of segment lengths from distant (population background) ancestry
  • t = minimum segment threshold (2.5 cM)

n_P and s_P, and thus L_P, follow Equation 1 of the null hypothesis, which models the pair as no more related than two randomly selected individuals from the population.

"L_A is the likelihood that two individuals share n autosomal segments from recent ancestor(s) specified by d and a, with the segment lengths specified by s_A" (Huff et al., 2011, p. 770).

L_A(n_A, s_A|d, a, t) = N_A(n|d, a, t) · S_A(s_A|d, t)     (5)

L_A is a likelihood (not a probability); it can be expressed as the product of the likelihood of the number of shared segments and the likelihoods of the individual segment lengths, which parallels Equation 1:

L_P(n, s|t) = N_P(n|t) · S_P(s|t)     (1)

Key difference: The alternative hypothesis (Equation 5) includes the parameters d (total meioses) and a (number of common ancestors) that specify the targeted relatedness between individuals, whereas the null hypothesis (Equation 1) assumes no specific relationship beyond random population-level sharing.

Likelihood vs Probability

Likelihood (what ERSA uses):

L_A(n_A, s_A|d, a, t) = N_A(n|d, a, t) · S_A(s_A|d, t)     (5)

This reads as: "The likelihood of observing data (n_A, s_A) given that the relationship is (d, a)"

  • Fixed parameters: d, a (the relationship)
  • Variable: the observed data (n_A, s_A)
  • Question: How likely is this data if we assume this specific relationship?

Probability (Bayesian approach):

P(d, a|n_A, s_A, t) = [L_A(n_A, s_A|d, a, t) · P(d, a)] / P(n_A, s_A|t)     (Bayes)

This reads as: "The probability that the relationship is (d, a) given that we observed data (n_A, s_A)"

  • Fixed: the observed data (n_A, s_A)
  • Variable: the relationship parameters (d, a)
  • Question: What's the probability of this relationship given what we observed?

Key Difference

  • Likelihood: L(data|relationship) - How likely is the data given a specific relationship?
  • Probability: P(relationship|data) - What's the probability of the relationship given the data?

In Practice:

ERSA uses maximum likelihood estimation (MLE), finding the relationship (d, a) that maximizes the likelihood:

(d̂, â) = argmax_{d, a} L_A(n_A, s_A|d, a, t)

To get actual posterior probabilities of relationships, you would need (a minimal sketch follows this list):

  1. Prior probabilities P(d, a) for each relationship type
  2. The marginal likelihood P(n_A, s_A|t) (a sum over all possible relationships)
  3. Apply Bayes' theorem
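
As a rough illustration (not part of ERSA itself), the sketch below applies Bayes' theorem to convert likelihoods into posterior probabilities under a uniform prior; the candidate relationships and the likelihood values are made-up numbers for demonstration only.

# Hypothetical likelihoods L_A(n_A, s_A | d, a, t) for three candidate relationships.
# These numbers are illustrative; in practice they come from Equations 5-8.
candidates = {
    "first cousins (d=4, a=2)": 1.2e-8,
    "second cousins (d=6, a=2)": 4.5e-9,
    "unrelated (null)": 6.0e-11,
}

# Assume a uniform prior P(d, a) over the candidates (an arbitrary modeling choice).
prior = 1.0 / len(candidates)

# Bayes' theorem: posterior = likelihood * prior / marginal likelihood.
marginal = sum(lik * prior for lik in candidates.values())
posteriors = {rel: lik * prior / marginal for rel, lik in candidates.items()}

for rel, post in posteriors.items():
    print(f"{rel}: posterior probability = {post:.3f}")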

Modeling Segment Lengths Under the Alternative Hypothesis

Recall equation 5:

L_A(n_A, s_A|d, a, t) = N_A(n|d, a, t) · S_A(s_A|d, t)     (5)

The second component, S_A(s_A|d, t), represents the likelihood of observing the specific set of segment lengths under recent ancestry. This is computed as:

S_A(s|d, t) = ∏_{i ∈ s} F_A(i|t)     (6)

Explanation: "Equation 6 assumes that, for a given value of d, the lengths of segments are independent. This assumption is not strictly true. One might imagine that the presence of a particularly long segment would reduce the genomic space available for additional segments. However, because the length of any one segment is small relative to the genome and because the genome is physically divided into chromosomes, the segment lengths are approximately independent" (Huff et al., 2011, p. 770).

Parallel Structure: Equation 6 directly parallels Equation 2 from the null hypothesis:

S_P(s|t) = ∏_{i ∈ s} F_P(i|t)     (2)

Both equations use the same product structure, assuming independence of segment lengths. The key difference is:

  • Equation 2 uses F_P(i|t) - the population background distribution
  • Equation 6 uses F_A(i|t) - the distribution under recent ancestry with parameters d and a

Understanding Segment Length Distribution Under Recent Ancestry

Let's look at the F_A(i|t) component from Equation 6.

The Inheritance Probability Framework

"For two individuals who are related by an inheritance path that is d meioses long, the probability that they will inherit any particular autosomal segment from a common ancestor on that path is equal to 1/2(d-1)" (Huff et al., 2011, p. 770).

Understanding the 1/2^(d-1) formula

Let's trace how a specific segment travels from a common ancestor to two relatives. Consider first cousins who share a grandmother:

At each meiosis (when a parent passes DNA to a child), each autosomal segment has a 1/2 chance of being inherited because:

  • The parent has two copies of each chromosome (one from each of their parents)
  • The child randomly inherits one of these two copies
  • Therefore, any specific segment from a grandparent has a 1/2 chance of being passed on

For first cousins:

  • The grandmother has a specific DNA segment on one of her two chromosome copies
  • Path to cousin 1: grandmother → parent 1 (1/2 chance of inheriting that specific segment) → cousin 1 (1/2 chance)
  • Path to cousin 2: grandmother → parent 2 (1/2 chance) → cousin 2 (1/2 chance)
  • Total: 4 meioses, so d = 4

However, let's consider probabilities:

  • We specify that there is a specific segment from grandmother to parent 1 that we are following.
    • This removes the 1/2 chance from grandmother to parent 1. There is no chance here.
    • We have selected the specific transmitted segment that we are following.
  • That same segment getting transmitted from parent 1 to cousin 1 has a 1/2 chance.
  • That same segment getting transmitted from grandmother to parent 2 has a 1/2 chance.
  • That same segment getting transmitted from parent 2 to cousin 2 has a 1/2 chance.

So, the probability both cousins inherit the same segment from grandmother is:

(1/2) × (1/2) × (1/2) = (1/2)^3 = 1/2^(d-1), where d - 1 = 3

The formula is 1/2^(d-1) rather than 1/2^d because, once we have selected the segment transmitted from the common ancestor in the first meiosis, we only need to track the remaining d - 1 independent transmission events for both individuals to share it. The short sketch below illustrates this.
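
As a quick numeric illustration of this point, the snippet below evaluates 1/2^(d-1) for a few values of d (the relationship labels are only for orientation).

# Probability that both relatives inherit a particular segment
# transmitted by the common ancestor: 1/2^(d-1).
examples = [
    ("parent-child", 1),
    ("half-siblings", 2),
    ("first cousins", 4),
    ("third cousins", 8),
]

for name, d in examples:
    prob = 1 / 2 ** (d - 1)
    print(f"{name:15s} d={d:2d}  P(both inherit segment) = {prob:.5f}")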

Expected Number of Shared Segments

"The expected number of shared autosomal segments that could potentially be inherited from a common ancestor is equal to rd + c, where c is the number of autosomes and r is the expected number of recombination events per haploid genome per generation" (Huff et al., 2011, p. 770).

Breaking this down:

  • c segments from chromosomes: At minimum, if no recombination occurs, relatives can share up to c = 22 autosomal segments (one per chromosome)
  • rd segments from recombination: Each generation adds approximately r ≈ 35.3 recombination events, creating new segment boundaries. Over d meioses, this creates rd additional potential segments
  • Total potential segments: rd + c

Understanding rd + c: How Recombination Creates Potential IBD Segments

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.patches import Rectangle, FancyBboxPatch
import matplotlib.patches as mpatches

# Create figure with subplots
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
fig.suptitle('Understanding $rd + c$: How Recombination Creates Potential IBD Segments',
             fontsize=16, fontweight='bold')

# Parameters
r = 35.3  # recombinations per generation
c = 22    # number of autosomes

# Set random seed for consistency
np.random.seed(42)

# Generate recombination positions that accumulate
# First generation breaks
first_gen_breaks = {}
for i in range(22):
    n_breaks = np.random.poisson(35.3 / 22)
    if n_breaks > 0:
        first_gen_breaks[i] = np.random.uniform(1.5, 8.5, min(n_breaks, 2))
    else:
        first_gen_breaks[i] = []

# ... [full visualization code from notebook] ...

The value rd + c represents the expected number of distinct segments that could potentially be inherited from a common ancestor: the c chromosomes contribute one segment each, and the roughly r recombination events in each of the d meioses add further segment boundaries.

For example:

  • Parent and child (d=1, a=1): Potential segments: rd + c = 35.3(1) + 22 = 57
  • First cousins (d=4, a=2): Potential segments: rd + c = 35.3(4) + 22 = 163
  • Third cousins (d=8, a=2): Potential segments: rd + c = 35.3(8) + 22 = 304

However, the actual number of segments shared between two relatives is much smaller. To estimate it, note that each potential segment has only a 1/2^(d-1) probability of being inherited by both individuals, and that the expectation scales with the number of common ancestors a (1 or 2). The snippet after the worked examples below reproduces these numbers.

  • Number of potential segments: rd + c
  • Probability each is inherited by both individuals: 1/2^(d-1)
  • Number of common ancestors: a (doubles the expectation for full siblings vs. half-siblings)

Expected shared segments = a × (rd + c) / 2^(d-1)

Therefore, "the expected number of shared segments is equal to a(rd + c)/2^{(d-1)}" (Huff et al., 2011, p. 770).

For example:

  • Parent and child (d=1, a=1):
    • Potential segments: rd + c = 35.3(1) + 22 = 57
    • Expected shared segments: 1 × 57 / 2^0 = 57/1 = 57 segments
    • 100% of potential segments are shared (as expected for parent-child!)
  • First cousins (d=4, a=2):
    • Potential segments: rd + c = 35.3(4) + 22 = 163
    • Expected shared segments: 2 × 163 / 2^3 = 326/8 ≈ 41 segments
    • Only ~25% of potential segments are actually shared
  • Third cousins (d=8, a=2):
    • Potential segments: rd + c = 35.3(8) + 22 = 304
    • Expected shared segments: 2 × 304 / 2^7 = 608/128 ≈ 5 segments
    • Only ~1.6% of potential segments are actually shared
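
The short calculation below reproduces these three worked examples directly from a(rd + c)/2^(d-1), using r = 35.3 and c = 22 as above (values are rounded in the text).

r, c = 35.3, 22  # recombinations per meiosis, number of autosomes

for name, d, a in [("Parent-child", 1, 1),
                   ("First cousins", 4, 2),
                   ("Third cousins", 8, 2)]:
    potential = r * d + c                     # rd + c potential segments
    expected = a * potential / 2 ** (d - 1)   # a(rd + c) / 2^(d-1) shared on average
    print(f"{name:13s} potential = {potential:5.1f}  "
          f"expected shared = {expected:5.1f}  ({expected / potential:.1%} of potential)")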

Segment Length Distribution

Recall from equation 6 that the likelihood of observing a set of segment lengths under the alternative hypothesis is:

S_A(s|d, t) = ∏_{i ∈ s} F_A(i|t)     (6)

where F_A(i|t) is the probability density for observing a segment of length i under recent ancestry. Here, i represents the length of an individual IBD segment in centiMorgans (cM). Equation 6 states that the total likelihood for a set of segments is the product of individual segment likelihoods, assuming independence between segment lengths.

Let's take a closer look at the segment length i.

"Given d, the expected value of i is 100/d. Without conditioning on t, the distribution of segment length is exponential with mean 100/d" (Huff et al., 2011, p. 770). In this statement, i is the mean segment length whereas in equation 6, i is the individual segment length.

𝔼[i] = 100/d

In other words, a segment inherited through d meioses is, on average, 100/d cM long; the snippet below lists this value for a few relationships.
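
A minimal illustration of the 100/d rule for a few values of d (relationship labels for orientation only):

# Expected IBD segment length (in cM) is 100/d for a relationship with d meioses.
for name, d in [("parent-child", 1), ("siblings", 2),
                ("first cousins", 4), ("third cousins", 8)]:
    print(f"{name:13s} d={d}  expected segment length = {100 / d:6.1f} cM")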

Understanding Morgans and the 100/d Relationship

First, let's define genetic distance units:

  • 1 Morgan (M) = the genetic distance over which there is, on average, 1 recombination event per meiosis
  • 1 centiMorgan (cM) = 1/100 of a Morgan = 1% probability of recombination per meiosis
  • The human genome is approximately 100 Morgans (or 10,000 cM) long

What this means concretely:

  • A 1 cM segment has a 1% or 0.01 probability of experiencing a recombination in one meiosis
  • A 10 cM segment has a 10% or 0.10 probability of experiencing a recombination in one meiosis
  • A 100 cM (1 Morgan) segment has 1.0 expected recombination per meiosis (an expected count, not a literal 100% probability of recombining)

Why does 100 appear in the formulas?
The 100 appears because we measure in centiMorgans but need to convert to probability:

  • L cM = L% chance of recombination = L/100 as a decimal probability
  • It's just like converting centimeters to meters - you divide by 100
  • If we measured in Morgans instead of centiMorgans, we wouldn't need the 100

Conditioning on Minimum Length

When we only observe segments longer than threshold t (typically 2.5 cM), we need the conditional distribution. Since short segments cannot be reliably detected, we must adjust our probability density:

F_A(i|d, t) = e^(-d(i-t)/100) / (100/d)     (7)

Understanding this formula

The unconditional exponential density is f(i) = (d/100)e^(-di/100) for any length i ≥ 0.

When we condition on observing only segments ≥ t:

  1. We shift the distribution by t: (i-t)
  2. We renormalize so the total probability from t to ∞ equals 1

Breaking down equation 7:

  • Numerator e^(-d(i-t)/100):
    • This is the shifted exponential decay
    • The (i-t) term shifts our reference point to the threshold
    • Segments just above threshold t have the highest density
  • Denominator 100/d:
    • This ensures the density integrates to 1 over [t, ∞)
    • It's actually ∫_t^∞ e^(-d(i-t)/100) di = 100/d
    • Note: This is NOT the mean of the truncated distribution (which would be t + 100/d)
  • The 100 appears because:
    • We measure in centiMorgans (1% units)
    • The rate parameter is d/100 when using cM
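
As a quick sanity check (a sketch, not ERSA's own code), the snippet below implements Equation 7 and verifies numerically that it integrates to 1 over [t, ∞); it also shows that the mean of the truncated distribution is t + 100/d, as noted above.

import numpy as np
from scipy import integrate

def f_A(i, d, t=2.5):
    """Equation 7: truncated exponential density for a segment of length i >= t (in cM)."""
    if i < t:
        return 0.0
    return (d / 100) * np.exp(-d * (i - t) / 100)

d, t = 4, 2.5  # e.g., first cousins with a 2.5 cM detection threshold
total, _ = integrate.quad(lambda i: f_A(i, d, t), t, np.inf)
mean, _ = integrate.quad(lambda i: i * f_A(i, d, t), t, np.inf)
print(f"integral over [t, inf) = {total:.4f}")                      # ~1.0
print(f"mean length = {mean:.2f} cM (t + 100/d = {t + 100 / d:.2f})")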

Distribution of the Number of Segments

Recall from equation 5 that the alternative hypothesis likelihood has two components:

L_A(n_A, s_A|d, a, t) = N_A(n|d, a, t) · S_A(s_A|d, t)     (5)

We've just covered S_A (segment lengths). Now let's examine N_A(n|d, a, t), the likelihood of observing n segments given the relationship (d, a).

"The probability p(t) that a shared segment is longer than t is equal to e-dt/100 (Thomas et al. 1994). Because the distribution of the number of shared segments is approximately Poisson (Thomas et al. 1994)," (Huff et al., 2011, p. 770) we have:

N_A(n|d, a, t) = [e^(-a(rd+c)p(t)/2^(d-1)) × [a(rd+c)p(t)/2^(d-1)]^n] / n!     (8)

Breaking down this formula

Equation 8 is based on the Poisson distribution formula:

P(N = n) = e^(-λ) λ^n / n!

which gives the probability of observing exactly n segments. The Poisson model is appropriate because segment inheritance events are approximately independent.

The Poisson parameter λ:

λ = a(rd+c)p(t) / 2^(d-1) = a(rd+c)e^(-dt/100) / 2^(d-1)

This combines:

  • a(rd+c): total potential segments from recent ancestry
  • 1/2^(d-1): probability each is inherited by both individuals
  • p(t) = e^(-dt/100): probability each exceeds the detection threshold
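
A small sketch of Equation 8: compute λ for an example relationship (first cousins, d = 4, a = 2, chosen only for illustration) and evaluate the Poisson probability of a few segment counts with scipy.

import numpy as np
from scipy import stats

r, c, t = 35.3, 22, 2.5   # recombination rate, autosomes, threshold (cM)
d, a = 4, 2               # e.g., first cousins

p_t = np.exp(-d * t / 100)                   # P(segment length > t)
lam = a * (r * d + c) * p_t / 2 ** (d - 1)   # Poisson parameter from Eq. 8

print(f"lambda = {lam:.1f} expected detectable segments")
for n in [20, 30, 37, 45]:
    print(f"P(N = {n}) = {stats.poisson.pmf(n, lam):.4f}")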

Relationship Parameters and Expected Segments

(Figures omitted: expected IBD segment count distributions and segment length distributions by relationship.)

import numpy as np
from scipy import stats
from typing import List


class ERSAAlternativeHypothesis:
    """
    Implementation of ERSA Alternative Hypothesis for relationship inference
    """

    def __init__(self, r: float = 35.3, c: int = 22, t: float = 2.5):
        """
        Initialize ERSA Alternative Hypothesis model

        Parameters:
        -----------
        r : float
            Expected number of recombinations per generation (default: 35.3)
        c : int
            Number of autosomes (default: 22 for humans)
        t : float
            Minimum segment threshold in cM (default: 2.5)
        """
        self.r = r
        self.c = c
        self.t = t

    def expected_segments(self, d: int, a: int) -> float:
        """
        Calculate expected number of IBD segments: a(rd + c)/2^(d-1)

        Parameters:
        -----------
        d : int
            Total number of meioses
        a : int
            Number of common ancestors (1 or 2)

        Returns:
        --------
        float
            Expected number of segments
        """
        return a * (self.r * d + self.c) / (2 ** (d - 1))

    def segment_probability(self, d: int) -> float:
        """
        Probability that a segment exceeds threshold t: p(t) = e^(-dt/100)

        Parameters:
        -----------
        d : int
            Total number of meioses

        Returns:
        --------
        float
            Probability of segment > t
        """
        return np.exp(-d * self.t / 100)

    def poisson_parameter(self, d: int, a: int) -> float:
        """
        Calculate lambda parameter for the Poisson distribution of segment counts

        Parameters:
        -----------
        d : int
            Total number of meioses
        a : int
            Number of common ancestors

        Returns:
        --------
        float
            Lambda parameter for Poisson distribution
        """
        return self.expected_segments(d, a) * self.segment_probability(d)

    def segment_count_likelihood(self, n: int, d: int, a: int) -> float:
        """
        Likelihood of observing n segments given relationship (d, a).
        Implements equation 8 from the ERSA paper.

        Parameters:
        -----------
        n : int
            Number of observed segments
        d : int
            Total number of meioses
        a : int
            Number of common ancestors

        Returns:
        --------
        float
            Probability of observing n segments
        """
        lambda_param = self.poisson_parameter(d, a)
        return stats.poisson.pmf(n, lambda_param)

    def segment_length_density(self, length: float, d: int) -> float:
        """
        Probability density for segment length under the alternative hypothesis.
        Implements equation 7 from the ERSA paper (F_A(i|d,t)).

        Parameters:
        -----------
        length : float
            Segment length in cM
        d : int
            Total number of meioses

        Returns:
        --------
        float
            Probability density
        """
        if length < self.t:
            return 0
        # Equation 7: F_A(i|d,t) = (d/100) * exp(-d(i-t)/100)
        # The shift by t already conditions on i >= t, so no further normalization
        # is needed (the density integrates to 1 over [t, inf)).
        return (d / 100) * np.exp(-d * (length - self.t) / 100)

    def segment_set_likelihood(self, segments: List[float], d: int) -> float:
        """
        Likelihood of observing a set of segment lengths.
        Implements S_A(s|d,t) from equation 6.

        Parameters:
        -----------
        segments : List[float]
            List of segment lengths in cM
        d : int
            Total number of meioses

        Returns:
        --------
        float
            Product of individual segment likelihoods
        """
        likelihood = 1.0
        for seg_length in segments:
            likelihood *= self.segment_length_density(seg_length, d)
        return likelihood


# Define all relationship types with their parameters
relationships = [
    {"name": "Parent-Child", "d": 1, "a": 1, "degree": 1},
    {"name": "Siblings", "d": 2, "a": 2, "degree": 1},
    {"name": "Half-Siblings", "d": 2, "a": 1, "degree": 2},
    {"name": "Grandparent-Grandchild", "d": 2, "a": 1, "degree": 2},
    {"name": "Aunt/Uncle-Niece/Nephew", "d": 3, "a": 2, "degree": 2},
    {"name": "Half-Aunt/Uncle", "d": 3, "a": 1, "degree": 3},
    {"name": "First Cousins", "d": 4, "a": 2, "degree": 3},
    {"name": "Half First Cousins", "d": 4, "a": 1, "degree": 4},
    {"name": "First Cousins Once Removed", "d": 5, "a": 2, "degree": 4},
    {"name": "Second Cousins", "d": 6, "a": 2, "degree": 5},
    {"name": "Second Cousins Once Removed", "d": 7, "a": 2, "degree": 6},
    {"name": "Third Cousins", "d": 8, "a": 2, "degree": 7},
    {"name": "Fourth Cousins", "d": 10, "a": 2, "degree": 9},
    {"name": "Fifth Cousins", "d": 12, "a": 2, "degree": 11},
]
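
A brief usage example for the class defined above (the values in the comments are approximate):

model = ERSAAlternativeHypothesis()  # r = 35.3, c = 22, t = 2.5

# First cousins: d = 4 meioses, a = 2 common ancestors
print(model.expected_segments(d=4, a=2))             # ~40.8 expected segments
print(model.poisson_parameter(d=4, a=2))             # ~36.9 detectable (> t) segments
print(model.segment_count_likelihood(30, d=4, a=2))  # P(N = 30) under Eq. 8
print(model.segment_length_density(10.0, d=4))       # F_A(10 cM | d = 4, t)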

Maximizing the Alternative Hypothesis Likelihood

"Given nA and nP, the maximum value of the likelihood function (Eq. 4) is equal to:" (Huff et al., 2011, p. 770)

ML_R(n_P, n_A, s|d, a, t) = N_P(n_P|t) N_A(n_A|d, a, t) · S_P({s_1:n ... s_nP:n}|t) S_A({s_nP+1:n ... s_n:n}|d, a, t)     (9)

"where sx:n is equal to the x th smallest value in s. Equation 9 asserts that the likelihood is maximized when the set of segments resulting from recent ancestry is equal to the longest nA segments in s, with the remaining nP segments being due to the population background. In the Supplemental Methods section, we show that Equation 9 holds as long as μ < a(rd + c), which is true whenever a and d specify shared ancestry that is recent relative to pairs of individuals selected at random from the population." (Huff et al., 2011, p. 770)

Key insight from equation 9

  • The likelihood is maximized by attributing the longest segments to recent ancestry
  • The shortest segments are attributed to population background
  • This makes biological sense: recent ancestry produces longer segments (less time for recombination to break them up)
  • Population background produces shorter segments (many generations of recombination)

Notation clarification:

  • s_1:n = smallest (shortest) segment
  • s_n:n = largest (longest) segment
  • {s_1:n ... s_nP:n} = the n_P shortest segments
  • {s_nP+1:n ... s_n:n} = the n_A longest segments

Finding the Maximum Likelihood

"To identify the maximum value of the likelihood function (Eq. 4) given d, a, and t, we evaluate all possible values of nP and nA in Eq. 9:" (Huff et al., 2011, p. 770)

ML_R(n, s|d, a, t) = max{ML_R(n_P, n - n_P, s) : n_P ∈ {0, 1, ..., n}}     (10)

What equation 10 does

  • Takes the maximum likelihood over all possible ways to partition the n observed segments
  • n_P: number of segments attributed to population background
  • n - n_P: number of segments attributed to recent ancestry (n_A)
  • For each partition, calculate likelihood using equation 9
  • Select the partition that gives the highest likelihood

How Equations 9 and 10 Work Together

Equation 9 calculates the likelihood for a specific split of segments:

  • Given: which segments are from ancestry vs background
  • Calculate: the likelihood of that particular assignment

Equation 10 finds the best split:

  • Try all possible ways to split the segments
  • Use equation 9 to calculate likelihood for each split
  • Pick the split with the highest likelihood

Concrete example with 5 segments

Say we observe segments of lengths: [3.1, 5.7, 8.2, 15.4, 32.6] cM

  1. Try nP = 0 (all from ancestry):
    • Background segments: none
    • Ancestry segments: [3.1, 5.7, 8.2, 15.4, 32.6]
    • Calculate MLR using equation 9 → get likelihood L0
  2. Try nP = 1 (shortest from background):
    • Background segments: [3.1]
    • Ancestry segments: [5.7, 8.2, 15.4, 32.6]
    • Calculate MLR using equation 9 → get likelihood L1
  3. Try nP = 2:
    • Background segments: [3.1, 5.7]
    • Ancestry segments: [8.2, 15.4, 32.6]
    • Calculate MLR using equation 9 → get likelihood L2

... and so on until nP = 5

Pick the maximum: If L2 is highest, then the optimal split is 2 background, 3 ancestry

This makes biological sense: short segments are more likely from ancient background, long segments from recent ancestry!

References and Implementation Notes

ERSA has its own code at www.hufflab.org but the site was down when I last checked. The code that follows is an interpretation of the ERSA theory written in Python. Alternatively, https://github.com/rmunoz12/ersa looks promising, but I haven't used it yet.

ERSA 1: https://genome.cshlp.org/content/21/5/768.full
ERSA 2: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004144

from typing import List, Tuple, Dict
import numpy as np
from scipy import stats


class ERSACompleteLikelihood(ERSAAlternativeHypothesis):
    """
    Extended class with complete likelihood calculations including the null hypothesis
    """

    def __init__(self, eta: float = 12, mu: float = 3.12, **kwargs):
        """
        Initialize with additional parameters for the null hypothesis

        Parameters:
        -----------
        eta : float
            Mean number of background segments (default: 12 for CEU)
        mu : float
            Mean background segment length (default: 3.12 cM)
        """
        super().__init__(**kwargs)
        self.eta = eta
        self.mu = mu

    def null_segment_count_likelihood(self, n: int) -> float:
        """
        Likelihood of n segments under the null hypothesis.
        Implements N_P(n|t) from equation 1.
        """
        return stats.poisson.pmf(n, self.eta)

    def null_segment_length_density(self, length: float) -> float:
        """
        Segment length density under the null hypothesis.
        Implements F_P(i|t) from equation 3.
        Note: This is a truncated exponential with threshold t.
        """
        if length < self.t:
            return 0
        # This is the corrected formula from equation 3
        return np.exp(-(length - self.t) / self.mu) / self.mu

    def alternative_likelihood(self, n_total: int, segments: List[float],
                               d: int, a: int) -> Tuple[float, int]:
        """
        Complete alternative hypothesis likelihood.
        Implements equations 9 and 10 from the ERSA paper.

        Returns:
        --------
        Tuple[float, int]
            (maximum likelihood, optimal n_A)
        """
        segments = sorted(segments, reverse=True)  # Sort longest first
        max_log_likelihood = -np.inf
        best_n_A = 0

        for n_A in range(n_total + 1):
            n_P = n_total - n_A

            # Use log likelihoods to avoid numerical issues
            log_likelihood = 0

            # Background component
            log_likelihood += stats.poisson.logpmf(n_P, self.eta)
            for seg in segments[n_A:]:
                density = self.null_segment_length_density(seg)
                if density > 0:
                    log_likelihood += np.log(density)
                else:
                    log_likelihood = -np.inf
                    break

            # Recent ancestry component
            if n_A > 0:
                ancestry_count_likelihood = self.segment_count_likelihood(n_A, d, a)
                if ancestry_count_likelihood > 0:
                    log_likelihood += np.log(ancestry_count_likelihood)
                else:
                    log_likelihood = -np.inf
                for seg in segments[:n_A]:
                    density = self.segment_length_density(seg, d)
                    if density > 0:
                        log_likelihood += np.log(density)
                    else:
                        log_likelihood = -np.inf
                        break

            if log_likelihood > max_log_likelihood:
                max_log_likelihood = log_likelihood
                best_n_A = n_A

        return np.exp(max_log_likelihood) if max_log_likelihood > -np.inf else 0, best_n_A

    def likelihood_ratio_test(self, segments: List[float], d: int, a: int) -> Dict:
        """
        Perform the likelihood ratio test.

        Returns:
        --------
        Dict
            Test statistics and p-value
        """
        n_total = len(segments)

        # Null hypothesis log-likelihood
        log_L_null = stats.poisson.logpmf(n_total, self.eta)
        for seg in segments:
            density = self.null_segment_length_density(seg)
            if density > 0:
                log_L_null += np.log(density)
            else:
                log_L_null = -np.inf
                break

        # Alternative hypothesis likelihood
        L_alt, n_A = self.alternative_likelihood(n_total, segments, d, a)
        log_L_alt = np.log(L_alt) if L_alt > 0 else -np.inf

        # Likelihood ratio test statistic
        if log_L_alt > log_L_null and log_L_null > -np.inf:
            LR = -2 * (log_L_null - log_L_alt)
        else:
            LR = 0

        # Chi-square test with 2 degrees of freedom
        p_value = stats.chi2.sf(LR, df=2) if LR > 0 else 1.0

        return {
            'L_null': np.exp(log_L_null) if log_L_null > -np.inf else 0,
            'L_alt': L_alt,
            'log_L_null': log_L_null,
            'log_L_alt': log_L_alt,
            'n_A': n_A,
            'n_P': n_total - n_A,
            'LR': LR,
            'p_value': p_value,
            'significant': p_value < 0.001
        }

    def null_likelihood(self, segments: List[float]) -> float:
        """
        Calculate the complete null hypothesis likelihood.
        Implements L_P(n,s|t) from equation 1.
        """
        n = len(segments)
        log_likelihood = stats.poisson.logpmf(n, self.eta)
        for seg in segments:
            density = self.null_segment_length_density(seg)
            if density > 0:
                log_likelihood += np.log(density)
            else:
                return 0
        return np.exp(log_likelihood)
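
A usage sketch for the class above, applied to the five segments from the partitioning example earlier; the relationship tested (third cousins, d = 8, a = 2) and the default η = 12, μ = 3.12 cM are chosen only for illustration.

segments = [3.1, 5.7, 8.2, 15.4, 32.6]  # cM, from the partitioning example above

model = ERSACompleteLikelihood(eta=12, mu=3.12)

# Test a third-cousin relationship (d = 8 meioses, a = 2 common ancestors)
result = model.likelihood_ratio_test(segments, d=8, a=2)

print(f"optimal split: n_A = {result['n_A']} ancestry, n_P = {result['n_P']} background")
print(f"log L_alt = {result['log_L_alt']:.2f}, log L_null = {result['log_L_null']:.2f}")
print(f"LR statistic = {result['LR']:.2f}, p-value = {result['p_value']:.3g}")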

Summary

The ERSA alternative hypothesis provides a comprehensive framework for modeling IBD patterns under recent shared ancestry. Key components include:

  1. Dual-source model: Segments from recent ancestry (longer, more numerous) and population background (shorter, fewer)
  2. Poisson segment counts: Expected number depends on relationship parameters and detection threshold
  3. Exponential segment lengths: Truncated distributions conditioned on minimum detectable length
  4. Optimal partitioning: Longest segments attributed to recent ancestry, shortest to background
  5. Maximum likelihood estimation: Finding the relationship parameters that best explain observed data

This framework enables quantitative relationship inference by comparing the alternative hypothesis likelihood with the null hypothesis, providing statistical evidence for specific genealogical relationships based on patterns of IBD segment sharing.