4.1 Conditional Likelihood

In this section, we derive the probability distribution for the number and lengths of IBD segments, under the condition that at least one segment is observed. This conditional framing reflects the reality that we are only studying pairs who share detectable IBD—so the distribution is implicitly filtered on IBD presence.

We follow the notation of Ko and Nielsen (2017). Let a genealogical relationship be defined as:

R = (u, v, a)

where:

a ∈ {1, 2} indicates whether the pair shares one or two common ancestors,
u is the number of meioses from individual i to the common ancestor(s),
v is the number of meioses from individual j to the common ancestor(s).

From this, the total number of meioses separating the two individuals is:

m = u + v

and the degree of relationship is defined as:

d = m - a + 1
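
As a quick check of these definitions, the following minimal sketch computes m and d from a triple (u, v, a). The function name degree_of_relationship is illustrative only and does not come from Ko and Nielsen (2017).

```python
def degree_of_relationship(u: int, v: int, a: int) -> tuple[int, int]:
    """Return (m, d) for a relationship R = (u, v, a)."""
    if a not in (1, 2):
        raise ValueError("a must be 1 or 2")
    m = u + v        # total number of meioses separating the pair
    d = m - a + 1    # degree of relationship
    return m, d


# Full first cousins: two meioses on each side, two common ancestors.
print(degree_of_relationship(2, 2, 2))  # (4, 3)
# Half siblings: one meiosis on each side, one common ancestor.
print(degree_of_relationship(1, 1, 1))  # (2, 2)
```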

Now let n denote the number of observed IBD segments between i and j.

Some of these segments are inherited through the common ancestor(s) relevant to the relationship R—those are the segments we aim to model. Others may come from background shared ancestry with more distant individuals and are treated as noise.

We define:

n_d: the number of IBD segments that descend from the ancestor(s) relevant to R,
n_b: the number of IBD segments that arise from other ancestors in the broader pedigree.

By definition:

n = n_d + n_b

Our goal is to derive the distribution of n and of the segment lengths, conditional on observing n ≥ 1. This will allow us to construct likelihood functions for different relationship types, based only on the observed segment patterns between the pair.

Let {ℓ₁, …, ℓ_n} represent the lengths of the n = n_d + n_b IBD segments observed between individuals i and j, measured in centimorgans (cM).

We define the event O to be: "i and j share at least one IBD segment."
Our goal is to compute the probability distribution:

ℙ(ℓ₁, …, ℓ_n | O; a, m)

That is, we want the joint probability of observing these segment lengths, conditional on having at least one shared IBD segment and given:

m: the number of meioses separating i and j
a: the number of most-recent common ancestors (1 or 2)

Only the n_d segments inherited through the targeted common ancestor(s) of R are modeled as a function of (a, m); the n_b background segments are handled by a separate model.

We follow the approach developed in Huff et al. (2011), who derived a similar distribution for IBD segment lengths in the unconditional case (i.e., without assuming at least one segment is observed).

To simplify the derivation, we adopt a key assumption:

The n_d segments transmitted through the most recent IBD-contributing ancestor(s) are the longest segments observed.

This assumption lets us ignore the background segments (n_b), which might come from more distant ancestors. It also removes the need to sum over all possible subsets of observed segments that could have originated from the focal ancestor(s).

With this simplifying assumption in place, the probability distribution of segment lengths becomes more tractable and can be derived directly from the genealogical parameters (a, m).
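
The longest-segments assumption amounts to sorting the observed lengths and cutting the sorted list at position n_d. A minimal sketch, in which the function name split_by_length and the example lengths are hypothetical:

```python
def split_by_length(lengths, n_d):
    """Assign the n_d longest segments to 'direct' and the rest to 'background'."""
    if not 0 <= n_d <= len(lengths):
        raise ValueError("n_d must lie between 0 and the number of segments")
    ordered = sorted(lengths, reverse=True)  # longest first
    return ordered[:n_d], ordered[n_d:]


# Four made-up segment lengths in cM, with n_d = 2.
direct, background = split_by_length([12.5, 3.1, 40.2, 7.8], 2)
print(direct)      # [40.2, 12.5]
print(background)  # [7.8, 3.1]
```

Because the split is fully determined by n_d, there is exactly one assignment to evaluate per value of n_d, rather than one per subset of segments.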

Interpreting the Conditional Probability Expression

The term

ℙ(ℓ₁, …, ℓ_n | O; a, m)

represents the joint probability of observing the specific set of IBD segment lengths ℓ₁, …, ℓ_n between two individuals, given:

O: the event that at least one IBD segment is observed,
a: the number of shared common ancestors (either 1 or 2),
m: the total number of meioses separating the individuals.

This probability expression reflects a conditional model of how IBD segments arise from a genealogical relationship. It incorporates:

The constraint that we are only analyzing pairs who share detectable IBD (i.e., O has occurred),
The genealogical structure of the relationship (number of shared ancestors and meioses),
And the assumption that segment lengths are informative about the underlying relationship.

The conditioning on O is important. In real-world applications, we only observe pairs who share at least one segment, so all inferences must be made under that constraint. Without conditioning on O, we would be modeling a space that includes pairs with no shared segments—an irrelevant and misleading scenario for most IBD-based inference.

This term therefore forms the foundation for likelihood-based methods that estimate the most probable relationship between individuals based on the number and lengths of shared IBD segments.

We now approximate the conditional probability of observing the segment lengths ℓ₁, …, ℓ_n as follows:

ℙ(ℓ₁, …, ℓ_n | O; a, m) ≈
∑_{i = 1}ⁿ ℙ(ℓ₁, …, ℓ_n | n_d = i, n_b = n - i, O; a, m) ·
ℙ(n_d = i, n_b = n - i | O; a, m) (1a)

This expression reflects a mixture model, where we marginalize over all possible values of n_d—the number of IBD segments that originate from the ancestor(s) relevant to the relationship R.

Each term in the sum includes:

The likelihood of observing the given segment lengths assuming n_d = i of them are from the genealogical relationship of interest and the rest (n_b = n - i) are background.
The probability of having exactly n_d = i direct segments (and n_b background), given that at least one segment is observed and the genealogical parameters (a, m).

In practice, this approach allows us to model the segment length distribution without needing to know exactly which segments came from the target ancestor(s), by treating the counts probabilistically.
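
The sum in Equation (1a) can be sketched generically as a weighted sum over i. The two callables below (lik_given_split and split_prob) are placeholders for the component models, which are not specified at this point in the derivation; their names and signatures are assumptions made for illustration.

```python
from typing import Callable, Sequence


def marginal_over_splits(
    lengths: Sequence[float],
    lik_given_split: Callable[[Sequence[float], int], float],
    split_prob: Callable[[int, int], float],
) -> float:
    """Sum over i = 1..n of
    P(lengths | n_d = i, n_b = n - i, O) * P(n_d = i, n_b = n - i | O)."""
    n = len(lengths)
    return sum(
        lik_given_split(lengths, i) * split_prob(i, n - i)
        for i in range(1, n + 1)
    )


# Toy check with placeholder components: a constant likelihood of 0.5
# and uniform weights 1/3 over the three possible splits.
toy = marginal_over_splits(
    [10.0, 5.0, 2.0],
    lambda ls, i: 0.5,
    lambda i, n_b: 1.0 / 3.0,
)
print(toy)
```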

Why This Is Interpreted as a Mixture Model

Although we apply the law of total probability to sum over possible values of n_d, the structure of the equation also reflects what is known in statistics as a mixture model.

In this context, we assume that the n observed IBD segments may come from two biologically distinct sources:

Some segments (n_d) are inherited through the genealogical relationship we are trying to infer (i.e., from the closest shared ancestor(s)),
The remaining segments (n_b = n - n_d) arise from other, more distantly shared ancestors.

Because we do not observe the source of each segment, we construct the overall probability distribution by:

Considering every possible way the n segments could be split into "direct" and "background," and
Weighting the likelihood of each split by how probable that configuration is under the model.

This structure — summing over unobserved categories, each with its own likelihood and weight — is characteristic of a mixture model.

While the law of total probability gives us the mathematical license to sum over hidden variables like n_d,
the interpretation as a mixture model comes from the biological assumption that segment lengths are generated by two underlying sources.
It is the presence of these separate sources that gives the model its mixture-like structure.

Law of Total Probability Interpretation

The summation in Equation (1a) above is an application of the law of total probability.

Because we don't observe which of the n segments came from the shared ancestor(s) and which came from background, we treat the number of direct segments n_d as a latent (hidden) variable. The law of total probability lets us express the full probability as a weighted average over all possible values of n_d.

Here's how each component contributes:

ℙ(ℓ₁, …, ℓ_n | n_d = i, n_b = n - i, O; a, m)
This is the likelihood of observing the specific set of segment lengths ℓ₁, …, ℓ_n
assuming exactly i segments came from the relationship of interest and the remaining n - i came from background ancestry.
It is a conditional likelihood under a fixed configuration of segment origin.

ℙ(n_d = i, n_b = n - i | O; a, m)
This is the probability of that configuration occurring, given that we observed at least one segment (O) and given genealogical parameters (a, m).
It captures the uncertainty in how many of the n observed segments actually reflect the genealogical relationship of interest.

∑_{i = 1}ⁿ ⋯
This summation marginalizes over all possible values i of n_d, from 1 to n, allowing the full likelihood to reflect all plausible ways the observed segments could have arisen.

In effect, we are computing a mixture distribution, where each term in the sum corresponds to one possible way of splitting the n segments into "direct" and "background" sources.

By applying the law of total probability, we:

Avoid having to specify which segments are which,
Integrate over the hidden variable n_d,
And obtain a valid, tractable expression for the joint probability of the observed data under genealogical model assumptions.

This approach enables principled inference, even in the presence of unobservable ancestral origins for each segment.

Note: Deriving n_b = n - i from Basic Algebra

We observe a total of n IBD segments. In each term of the summation, we assume exactly n_d = i of those segments came from the genealogical relationship of interest.

We can derive the number of background segments n_b as follows:

n = n_d + n_b      (total segments = direct + background)
n - n_d = n_d + n_b - n_d      (subtract n_d from both sides)
n - n_d = n_b
n - i = n_b      (substitute n_d = i)

This confirms that for each assumed value of n_d = i, the number of background segments must be n - i.

Expanding the approximation in Equation (1a), we can break the segment lengths into two components: those coming from the target relationship and those from background ancestry. This gives:

= ∑_{i = 1}ⁿ ℙ(ℓ⁽¹⁾, …, ℓ^(n_d); a, m) · ℙ_b(ℓ^(n_d+1), …, ℓ⁽ⁿ⁾) ·
ℙ(n_d = i, n_b = n - i | O; a, m) (1b)

Where:

ℓ⁽¹⁾, …, ℓ^(n_d) are the n_d longest observed IBD segments, assumed to come from the ancestor(s) specified by the relationship R.
ℓ^(n_d+1), …, ℓ⁽ⁿ⁾ are the remaining n_b = n - n_d segments, treated as arising from background shared ancestry.
ℙ(ℓ⁽¹⁾, …, ℓ^(n_d); a, m) is the probability of the direct segments under the relationship model (e.g., Huff et al.).
ℙ_b(⋅) is the probability model for background segments.
ℙ(n_d = i, n_b = n - i | O; a, m) is the conditional probability that exactly i of the observed segments came from the focal ancestor(s), given that at least one IBD segment was observed.

This form explicitly separates the signal (direct segments) from the noise (background segments) and weights each mixture component according to its probability under the genealogical model.

Segment Partitioning Assumption and Product Rule

To simplify the joint probability over all observed segment lengths, we assume the n_d longest IBD segments originate from the most recent IBD-contributing ancestor(s), while the remaining n_b = n - n_d shorter segments are due to other (distant) ancestors.

This segment-length ordering allows us to deterministically assign the top n_d segments to the focal relationship and the remaining n_b to older genealogical sources, without needing to marginalize over all possible segment assignments.

The factorization above follows from the product rule of probability combined with conditional independence. In general, the product rule gives ℙ(A, B | θ) = ℙ(A | B, θ) · ℙ(B | θ). If A and B are conditionally independent given the parameters θ, then ℙ(A | B, θ) = ℙ(A | θ), and the joint probability reduces to the product of the two conditional probabilities:

ℙ(A, B | θ) = ℙ(A | θ) · ℙ(B | θ)

In our case, we assume that the segment lengths inherited from the most recent ancestor(s) and those inherited from other ancestors are conditionally independent given the genealogical parameters (a, m) and the split (n_d = i, n_b = n - i). This allows us to factor the joint likelihood:

ℙ(ℓ₁, …, ℓ_n | n_d = i, n_b = n - i, O; a, m) = ℙ(ℓ⁽¹⁾, …, ℓ^(n_d) | a, m) · ℙ_b(ℓ^(n_d+1), …, ℓ⁽ⁿ⁾)

This use of the product rule allows us to treat the segments from the two sources separately in the likelihood function.
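
In log space the factorization becomes a sum of two independent terms. The sketch below uses an exponential length density for both components purely as a stand-in; the means direct_mean and background_mean, and both function names, are assumptions for illustration rather than the models used in this section. Treating the i longest segments as independent draws also ignores the ordering constraint implied by the longest-segments assumption.

```python
import math


def exp_logpdf(x: float, mean: float) -> float:
    """Log density of an exponential distribution with the given mean (placeholder model)."""
    return -math.log(mean) - x / mean


def log_lik_given_split(lengths, i, direct_mean, background_mean):
    """log[ P(direct lengths) * P_b(background lengths) ], with the i longest as direct."""
    ordered = sorted(lengths, reverse=True)
    direct, background = ordered[:i], ordered[i:]
    log_direct = sum(exp_logpdf(x, direct_mean) for x in direct)
    log_background = sum(exp_logpdf(x, background_mean) for x in background)
    # Independence of the two sources turns the product into a sum of logs.
    return log_direct + log_background


print(log_lik_given_split([40.0, 10.0, 5.0], 1, direct_mean=25.0, background_mean=5.0))
```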

To recap the three terms of Equation (1b):

Direct segment component:

ℙ(ℓ⁽¹⁾, …, ℓ^(n_d); a, m)

The likelihood of observing the n_d longest segments under a model defined by the number of common ancestors a and the number of meioses m separating the pair through the recent shared ancestor(s).

Other ancestor segment component:

ℙ_b(ℓ^(n_d+1), …, ℓ⁽ⁿ⁾)

The likelihood of observing the remaining n_b = n - n_d shorter segments arising from IBD with other, more distant ancestors.

Split probability component (unchanged from Equation 1a):

ℙ(n_d = i, n_b = n - i | O; a, m)

This term captures the uncertainty in how the observed n IBD segments are partitioned between the recent ancestor(s) and other ancestors, given that at least one segment was observed.

Now that we have completed the factorization of the likelihood term using the segment partitioning assumption, we turn next to modeling this split probability and understanding how it depends on (a, m).

Summary of Equation (1b) Components

The conditional likelihood of the n observed segment lengths is approximated as:

ℙ(ℓ₁, …, ℓ_n | O; a, m) ≈ ∑_{i = 1}ⁿ ℙ(ℓ⁽¹⁾, …, ℓ^(n_d) | a, m) · ℙ_b(ℓ^(n_d+1), …, ℓ⁽ⁿ⁾) · ℙ(n_d = i, n_b = n - i | O; a, m)

This decomposition reflects both a segment-length-based partitioning assumption and a use of the product rule, which allows separation of the segment likelihoods due to presumed conditional independence. Next, we rewrite the split probability term so that the conditioning on O is explicit.

We now make the conditioning on O explicit. By the definition of conditional probability, the split term equals the joint probability of the split and O divided by ℙ(O; a, m), the probability of observing at least one IBD segment given the genealogical parameters (a, m). This gives:

= ∑_{i = 1}ⁿ ℙ(ℓ⁽¹⁾, …, ℓ^(n_d); a, m) · ℙ_b(ℓ^(n_d+1), …, ℓ⁽ⁿ⁾) ·
[ℙ(n_d = i, n_b = n - i, O; a, m)] / [ℙ(O; a, m)] (1c)

This is the fully conditional probability of observing segment lengths ℓ₁, …, ℓ_n, given that at least one IBD segment is present. It includes:

A model for the direct segments (first i segments, assumed longest),
A model for background segments (remaining n - i segments),
A joint probability of the segment count breakdown (n_d, n_b) and the event O,
Normalization by ℙ(O; a, m) to ensure the distribution is properly scaled under the condition that at least one segment is observed.

This form helps isolate the contribution of the target genealogical relationship from other genealogical relationships, and it adheres to proper probabilistic conditioning.

We next assume that the processes generating direct and background segments are independent, so that the two segment counts are independent of each other.

This allows us to treat:

The direct segment count, n_d, as governed by the genealogical parameters (a, m), and
The background segment count, n_b = n - i, as drawn from a separate, fixed distribution independent of (a, m).

With this assumption, we factor the joint probability as follows:

ℙ(n_d = i, n_b = n - i, O; a, m) = ℙ(n_d = i; a, m) · ℙ(n_b = n - i)

The event O does not appear on the right-hand side because every term in the sum has i ≥ 1, so the event n_d = i already implies that at least one segment is observed. The conditioning on O is still enforced by the overall normalization in the denominator ℙ(O; a, m).
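
The step can be written out explicitly. The following is a sketch that assumes O is exactly the event n_d + n_b ≥ 1 and that the two counts are independent; the last line, an expression for the normalizing constant, follows from those same two assumptions and is not stated in the derivation above.

```latex
\begin{align*}
\mathbb{P}(n_d = i,\, n_b = n - i,\, O;\, a, m)
  &= \mathbb{P}(n_d = i,\, n_b = n - i;\, a, m)
     && \text{since } i \ge 1 \text{ implies } O \\
  &= \mathbb{P}(n_d = i;\, a, m)\,\mathbb{P}(n_b = n - i)
     && \text{by independence of the counts} \\[4pt]
\mathbb{P}(O;\, a, m)
  &= 1 - \mathbb{P}(n_d = 0;\, a, m)\,\mathbb{P}(n_b = 0)
     && \text{complement of ``no segments at all''}
\end{align*}
```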

Putting everything together, the fully normalized form becomes:

ℙ(ℓ₁, …, ℓ_n | O; a, m) =
∑_{i = 1}ⁿ ℙ(ℓ⁽¹⁾, …, ℓ^(n_d); a, m) · ℙ_b(ℓ^(n_d+1), …, ℓ⁽ⁿ⁾) ·
[ℙ(n_d = i; a, m) · ℙ(n_b = n - i)] / [ℙ(O; a, m)] (1)

Where:

ℙ_b(⋅) represents the probability distribution over IBD segment lengths that originate from background ancestors—i.e., segments not linked to the focal genealogical relationship.
ℙ(n_d = i; a, m) is the probability of observing i direct IBD segments from the ancestor(s) in relationship R.
ℙ(n_b = n - i) is the background probability of observing the remaining segments from other sources.
The denominator ℙ(O; a, m) ensures that the total probability is conditioned on the observation of at least one IBD segment.

This final form separates the direct genealogical model, the background process, and the segment count terms, while combining them into a single probability model conditioned on O.
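
To make the structure of Equation (1) concrete, the sketch below evaluates it with placeholder component models. Everything model-specific here is an assumption made for illustration and not part of the derivation: Poisson segment counts with means lam_d and lam_b, exponential segment lengths with mean 100/m cM for direct segments and background_mean cM for background segments, and O taken to be the event n_d + n_b ≥ 1. How lam_d depends on (a, m) is left to the caller, and detection thresholds and the ordering constraint on the longest segments are ignored.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability mass function (placeholder count model)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def exp_pdf(x: float, mean: float) -> float:
    """Exponential density with the given mean (placeholder length model)."""
    return math.exp(-x / mean) / mean


def conditional_likelihood(lengths, m, lam_d, lam_b, background_mean):
    """Evaluate Equation (1) under the placeholder models described above."""
    ordered = sorted(lengths, reverse=True)  # longest segments first
    n = len(ordered)
    if n == 0:
        raise ValueError("Equation (1) conditions on at least one segment")
    direct_mean = 100.0 / m  # assumed mean length in cM for m meioses
    # P(O) = 1 - P(n_d = 0) * P(n_b = 0) under independent counts.
    p_obs = 1.0 - poisson_pmf(0, lam_d) * poisson_pmf(0, lam_b)
    total = 0.0
    for i in range(1, n + 1):
        p_direct = math.prod(exp_pdf(x, direct_mean) for x in ordered[:i])
        p_background = math.prod(exp_pdf(x, background_mean) for x in ordered[i:])
        p_counts = poisson_pmf(i, lam_d) * poisson_pmf(n - i, lam_b)
        total += p_direct * p_background * p_counts
    return total / p_obs


# Compare two candidate values of m on made-up segment lengths (cM).
segments = [40.0, 25.0, 6.0]
for m in (4, 8):
    print(m, conditional_likelihood(segments, m, lam_d=2.0, lam_b=1.0, background_mean=5.0))
```

In a real application the products of densities would usually be accumulated in log space to avoid underflow when many segments are present.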

Recap: Building the Fully Conditional Likelihood

We began by expressing the probability of observing the segment lengths ℓ₁, …, ℓ_n, conditioned on the presence of at least one IBD segment, as a sum over possible partitions between direct and background segments. Each term in the sum accounts for a different number i of segments attributed to the direct genealogical relationship, with the remaining n - i segments treated as background. This gave rise to Equation (1a):

ℙ(ℓ₁, …, ℓ_n | O; a, m) ≈
∑_{i = 1}ⁿ ℙ(ℓ₁, …, ℓ_n | n_d = i, n_b = n - i, O; a, m) ·
ℙ(n_d = i, n_b = n - i | O; a, m) (1a)

Under the segment partitioning assumption, the i longest segments are assigned to the direct relationship, and the direct and background segment lengths are treated as independent given their counts. This allows us to factor the first term in each summand as a product of two independent likelihoods: one for the direct segments and one for the background segments. This leads to Equation (1b):

= ∑_{i = 1}ⁿ ℙ(ℓ⁽¹⁾, …, ℓ^(n_d); a, m) · ℙ_b(ℓ^{(n_d + 1)}, …, ℓ⁽ⁿ⁾) ·
ℙ(n_d = i, n_b = n - i | O; a, m) (1b)

To clarify how conditioning on the observation event O affects the expression, we apply the definition of conditional probability:

ℙ(n_d = i, n_b = n - i | O; a, m) = [ℙ(n_d = i, n_b = n - i, O; a, m)] / [ℙ(O; a, m)]

Substituting this into Equation (1b) yields Equation (1c):

= ∑_{i = 1}ⁿ ℙ(ℓ⁽¹⁾, …, ℓ^(n_d); a, m) · ℙ_b(ℓ^{(n_d + 1)}, …, ℓ⁽ⁿ⁾) ·
[ℙ(n_d = i, n_b = n - i, O; a, m)] / [ℙ(O; a, m)] (1c)

Finally, we make use of the conditional independence assumption between direct and background segment counts:

The count of direct segments depends on genealogical parameters a and m,
The background segment count is modeled separately and is independent of a, m.

With that, we arrive at the fully factorized and normalized Equation (1):

ℙ(ℓ₁, …, ℓ_n | O; a, m) =
∑_{i = 1}ⁿ ℙ(ℓ⁽¹⁾, …, ℓ^(n_d); a, m) · ℙ_b(ℓ^{(n_d + 1)}, …, ℓ⁽ⁿ⁾) ·
[ℙ(n_d = i; a, m) · ℙ(n_b = n - i)] / [ℙ(O; a, m)] (1)

This equation cleanly decomposes the likelihood into a sum over direct-background segment partitions, with each term reflecting:

The segment length distributions for direct and background segments,
The respective probabilities of observing n_d = i and n_b = n - i,
And normalization over the conditioning event O.

This expression serves as the foundation for modeling the likelihood using explicit models such as Huff et al. (2011) for segment lengths and segment counts.