Central limit theorem for the homozygosity of the hierarchical Pitman-Yor process

Source: arXiv:2605.12475 · Published 2026-05-12 · By Shui Feng, J. E. Paguyo

TL;DR

This paper studies the hierarchical Pitman-Yor process (HPYP) in the large-concentration regime and asks a precise asymptotic question: when the top- and second-level concentration parameters both go to infinity together, what is the Gaussian fluctuation of the homozygosity (power-sum symmetric polynomial) of the HPYP weights? The problem is the hierarchical analogue of earlier CLTs for the Pitman-Yor process and the hierarchical Dirichlet process, but the Pitman-Yor case is harder because the discount parameters create power-law structure and the hierarchy introduces an extra source of randomness.

The main contribution is a central limit theorem for the mth-order homozygosity of the HPYP with L groups under the scaling θ0, θ→∞ and θ0/θ→c∈(0,∞). The authors derive an explicit centering term, an explicit √θ scaling, and a closed-form asymptotic variance decomposed into level-1, gamma, and stable components. They also show that special cases recover previously known CLTs: setting α=0 and letting β→0 gives the hierarchical Dirichlet process result, and taking L=1 with c→∞ collapses to the single-level Pitman-Yor homozygosity CLT. The proof is mathematically heavy: it relies on a subordinator representation, combinatorial moment expansions via generalized factorial coefficients, a law of large numbers for the weights, and conditional multivariate CLTs combined with Slutsky and delta-method arguments.

Key findings

Theorem 1.1 proves a CLT for the HPYP homozygosity H_{m,L}(α,θ0,β,θ) when θ0,θ→∞ with θ0/θ→c∈(0,∞), with √θ scaling and an explicit asymptotic variance σ²_{c,m,L}.
The mean of H_{m,L} is asymptotically f(α,β,θ;m,L,c) = (L^m θ^{m−1})^{−1} · ∑_{j=1}^{m} A_j(β,m,L) (1−α)_{(j−1)} / (β^j c^{j−1}), where (x)_{(n)} denotes the rising factorial, and Proposition 3.2 shows H_{m,L} / f → 1 in probability.
The asymptotic variance is decomposed into three named pieces, σ²_{X,m,L} + σ²_{T,m,L} + σ²_{1,m,L}, plus cross terms captured through the final explicit formula in Theorem 1.1.
Setting α=0 and taking β→0 recovers the hierarchical Dirichlet process CLT from the authors’ earlier work [11]; this is stated explicitly after Lemma 2.2.
For L=1 and c→∞, the variance formula reduces to the single-level Pitman-Yor homozygosity CLT from Feng [7]/Handa [15], which the authors verify in Remark (1) under Theorem 1.1.
Lemma 2.3 gives exact conditional moments of the HPYP increments W_k: E[W_k^m | V, γ_{1/β}] = ∑_{j=1}^{m} C(m,j,β) γ_{1/β}^j V_k^j, making the generalized factorial coefficients C(m,j,β) the key algebraic object.
Lemma 2.6 shows θ^{m−1}/(1−α)_{(m−1)} · H_m(α,θ) → 1 almost surely and extends this to truncated sums up to ⌊θ^r⌋, providing the law-of-large-numbers backbone for the HPYP proof.
The proof of the stable term uses a truncation at ⌊θ0^r⌋ with r = ⌊1/(1−α)⌋+1 and a conditional Lyapunov CLT; the tail beyond the truncation is shown to vanish by a power bound that depends on α and r.
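The law-of-large-numbers statement from Lemma 2.6 can be probed numerically. The following is a minimal sketch, not taken from the paper: it approximates PD(α,θ) weights by truncated stick-breaking (homozygosity does not depend on the ordering of the weights) and prints the scaled homozygosity θ^{m−1}/(1−α)_{(m−1)} · H_m, which should sit near 1 for large θ. The function names, truncation level, and parameter values are illustrative assumptions.

```python
import numpy as np

def gem_weights(alpha, theta, n_sticks, rng):
    """Truncated stick-breaking weights for a Pitman-Yor(alpha, theta) process."""
    k = np.arange(1, n_sticks + 1)
    u = rng.beta(1.0 - alpha, theta + k * alpha)
    # log of prod_{i<k} (1 - u_i), accumulated in log space for stability
    log_rest = np.concatenate(([0.0], np.cumsum(np.log1p(-u[:-1]))))
    return u * np.exp(log_rest)

def rising(x, n):
    """Rising factorial x (x+1) ... (x+n-1); equals 1 when n = 0."""
    return float(np.prod(x + np.arange(n)))

def scaled_homozygosity(alpha, theta, m, n_sticks, rng):
    """theta^(m-1) / (1-alpha)_(m-1) * sum_i V_i^m on the truncated weights."""
    w = gem_weights(alpha, theta, n_sticks, rng)
    h_m = np.sum(w ** m)
    return theta ** (m - 1) / rising(1.0 - alpha, m - 1) * h_m

rng = np.random.default_rng(0)
# Illustrative values: alpha = 0.3, theta = 2000, m = 2, 200k sticks.
ratios = [scaled_homozygosity(0.3, 2000.0, 2, 200_000, rng) for _ in range(5)]
print(ratios)
```

The spread of the printed ratios around 1 is of order 1/√θ, which is the fluctuation scale that the CLT quantifies.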

Methodology — deep read

The paper is a purely theoretical probability result, so the “threat model” is best read as an asymptotic regime rather than an adversary setting. The authors analyze the HPYP weights under the joint large-concentration limit θ0, θ→∞ with θ0/θ→c. The random object of interest is the vector of HPYP masses Zk,i for each group, and the statistic is the mth-order homozygosity Hm,L, which is a symmetrized power-sum polynomial across groups. The assumptions are standard for HPYP theory: α,β∈(0,1), θ0>−α, θ>−β, and the groups are conditionally iid given the shared level-1 Pitman-Yor base measure. There is no learning adversary, no poisoning, and no empirical dataset; the “problem” is to identify the limiting Gaussian law and variance decomposition for these random measures.

The data in the paper are not observed samples but the model’s latent random weights and the associated homozygosity functionals. The authors define V=(V_1,V_2,…) as PD(α,θ0) weights and then define the HPYP group weights Z_ℓ via the hierarchical construction Ξ_{α,θ0,β,θ,ν} = Ξ_{β,θ,Ξ_{α,θ0,ν}}, that is, a level-2 Pitman-Yor random measure with parameters (β,θ) whose base measure is itself a level-1 Pitman-Yor random measure with parameters (α,θ0) and base ν. For the L-group case, they represent each group’s masses Z_{ℓ,i} in terms of increments W_{ℓ,i} generated by a subordinator acting on the level-1 masses V_i. The key combinatorial coefficients are the generalized factorial coefficients C(m,j,β), together with their Stirling-number limits as β→0. No finite dataset, labels, or train/test splits are used anywhere because the paper is not empirical.
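To make the hierarchical construction concrete, here is a rough simulation sketch for a single group. It is not the authors' subordinator representation: it uses the standard fact that a Pitman-Yor random measure with base ν can be written as stick-breaking weights attached to atoms drawn i.i.d. from ν, so the mass on level-1 atom i is the total weight of the level-2 sticks that landed on it. The truncation level, parameter values, and function names are assumptions, and the reference value in `exact_h2` is the usual two-sample coincidence probability (same table, or different tables sharing a dish), which is my own check rather than a formula quoted from the paper.

```python
import numpy as np

def gem_weights(discount, concentration, n_sticks, rng):
    """Truncated, renormalised stick-breaking weights for Pitman-Yor."""
    k = np.arange(1, n_sticks + 1)
    u = rng.beta(1.0 - discount, concentration + k * discount)
    log_rest = np.concatenate(([0.0], np.cumsum(np.log1p(-u[:-1]))))
    w = u * np.exp(log_rest)
    return w / w.sum()

def hpyp_group_weights(v, beta, theta, n_sticks, rng):
    """Masses Z_i that one PY(beta, theta) group places on the level-1 atoms."""
    p = gem_weights(beta, theta, n_sticks, rng)        # level-2 sticks
    labels = rng.choice(v.size, size=n_sticks, p=v)    # atoms drawn i.i.d. from V
    return np.bincount(labels, weights=p, minlength=v.size)

def mean_h2(alpha, theta0, beta, theta, n_sticks, n_rep, seed=0):
    """Monte Carlo mean of the second-order homozygosity for one group."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_rep):
        v = gem_weights(alpha, theta0, n_sticks, rng)  # level-1 weights
        z = hpyp_group_weights(v, beta, theta, n_sticks, rng)
        vals.append(np.sum(z ** 2))
    return float(np.mean(vals))

def exact_h2(alpha, theta0, beta, theta):
    """P(two draws from the group coincide): same table, or same dish otherwise."""
    same_table = (1.0 - beta) / (theta + 1.0)
    same_dish = (1.0 - alpha) / (theta0 + 1.0)
    return same_table + (1.0 - same_table) * same_dish

est = mean_h2(0.3, 40.0, 0.4, 20.0, n_sticks=5000, n_rep=300)
ref = exact_h2(0.3, 40.0, 0.4, 20.0)
print(est, ref)
```

For large θ with θ0 = cθ, `exact_h2` behaves like ((1−β) + (1−α)/c)/θ, so even in this simplest case both hierarchy levels contribute at the same order.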

Architecturally, the novelty is the subordinator-based decomposition of the HPYP into three fluctuation sources. They first derive a level-1 representation from the Pitman-Yor weights, then a gamma-subordinator representation for the HPYP increments, and then express the HPYP masses as normalized increments. Lemma 2.3 is central: conditional on V and γ_{1/β}, the mth moment of each increment W_k is a finite polynomial in V_k with coefficients C(m,j,β). This allows the homozygosity of the whole hierarchy to be expanded into sums over partitions of m across L groups (the sets M_{m,L} and M_{j,L}), with coefficients A_j(β,m,L) and Ã_j(β,m,L). The statistic is then decomposed into (i) a level-1 term driven by the PD(α,θ0) base measure, (ii) a gamma term driven by γ_{1/β}, and (iii) a stable term capturing the centered conditional fluctuation around the subordinator increments.
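Since the generalized factorial coefficients carry most of the algebra, a small exact calculator may help when checking low-order cases by hand. This sketch assumes the common convention in which C(m,j,β) is defined through rising factorials by (βt)_{(m)} = ∑_j C(m,j,β) (t)_{(j)}, which yields the recursion C(n+1,k) = (n − kβ) C(n,k) + β C(n,k−1); the paper's normalization may differ. The helper `conditional_moment` simply evaluates the polynomial from Lemma 2.3 as summarized above. All function names are hypothetical.

```python
from fractions import Fraction

def gen_factorial_coeffs(m, beta):
    """Return [C(m,0,beta), ..., C(m,m,beta)] under the assumed convention."""
    row = [Fraction(1)]  # C(0,0) = 1
    for n in range(m):
        new = [Fraction(0)] * (n + 2)
        for k in range(n + 2):
            stay = (n - k * beta) * row[k] if k <= n else 0
            grow = beta * row[k - 1] if k >= 1 else 0
            new[k] = stay + grow
        row = new
    return row

def unsigned_stirling_first(m):
    """Return [|s(m,0)|, ..., |s(m,m)|], the beta -> 0 limits of C(m,j,beta)/beta^j."""
    row = [1]
    for n in range(m):
        new = [0] * (n + 2)
        for k in range(n + 2):
            new[k] = (n * row[k] if k <= n else 0) + (row[k - 1] if k >= 1 else 0)
        row = new
    return row

def conditional_moment(m, beta, gamma, v):
    """sum_{j=1}^m C(m,j,beta) * gamma^j * v^j, the polynomial in Lemma 2.3."""
    c = gen_factorial_coeffs(m, beta)
    return sum(c[j] * gamma ** j * v ** j for j in range(1, m + 1))

beta = Fraction(1, 2)
print(gen_factorial_coeffs(3, beta))
print(unsigned_stirling_first(4))
```

Using `Fraction` keeps the coefficients exact, which is convenient when comparing against hand-derived expressions for small m.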

The proof strategy is constructive and stepwise. First, they derive exact expectations of H_{m,L} using conditional expectation and the moment formula for W_k, obtaining Lemma 3.1 and the asymptotic mean f(α,β,θ;m,L,c). Second, they prove a law of large numbers (Proposition 3.2) by Chebyshev’s inequality. Third, they write the centered-and-scaled homozygosity as the sum X_{m,L} + T_{m,L} + Y_{m,L} plus a normalization correction term, where each term isolates a different source of randomness. Fourth, they prove separate CLTs for Y_{m,L} (Lemma 3.4), T_{m,L} (Lemma 3.5), and X_{m,L} (Lemma 3.6). For Y_{m,L}, they rewrite it in terms of the scaled level-1 homozygosities H̃_j(α,θ0) and then apply the multivariate CLT from Lemma 2.1; the limit variance is an explicit quadratic form in the A_j coefficients. For T_{m,L}, they use the gamma CLT for βγ_{1/β}/θ around 1 and then a multivariate delta method on g(x)=(x,x²,…,x^m) to capture all powers simultaneously. For X_{m,L}, they split the sum at ⌊θ0^r⌋, show the tail is negligible, and then establish a conditional Lyapunov condition given V and γ_{1/β}. The fourth-moment estimates are expanded combinatorially and bounded to show that the normalized conditional fourth moment vanishes like O(1/θ); this verifies the Lyapunov condition, which in turn implies the Lindeberg condition needed for the conditional Lindeberg-Feller CLT.

The evaluation protocol is asymptotic and analytic rather than experimental. The “metrics” are convergence in distribution to a multivariate normal and explicit formulas for the limiting covariance matrices. The authors check consistency against known special cases: Theorem 1.1 reduces to the single-level Pitman-Yor CLT when L=1 and c→∞, and to the HDP CLT when α=0 and β→0 via Lemma 2.2 and the Stirling-number limit C(m,j,β)/β^j → [m j], where [m j] is the unsigned Stirling number of the first kind. The proof uses no simulation-based validation, no numerical experiments, and no statistical tests; instead, the evaluation is by exact derivation and reduction to prior theorems. Reproducibility rests entirely on the written proof and its lemmas: the source excerpt mentions no code, numeric tables, or external dataset.

One concrete end-to-end example is the derivation of the gamma contribution T_{m,L}. Starting from the conditional moment identity for W_{ℓ,i}, the authors isolate the difference between γ_{1/β}^j and its deterministic mean ∏_{s=0}^{j−1}(θ+sβ)/β^j, weight it by the level-1 homozygosity terms ∑_i V_i^j, and scale by √θ/f̃. They then split the j-th power into a centered gamma fluctuation plus a deterministic correction. The centered gamma term becomes asymptotically Gaussian because βγ_{1/β} is a sum of iid gamma variables, while the vector of powers ((βγ_{1/β}/θ)−1, (βγ_{1/β}/θ)²−1, …) is handled by a delta method around 1. The resulting covariance matrix is β times the outer-product matrix of (1,2,…,m), which is then pushed through the A_j weights to produce σ²_{T,m,L}.
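The delta-method step in this example is easy to sanity-check by simulation. The sketch below assumes that the gamma variable has shape θ/β and unit scale, an inference from the stated mean ∏_{s=0}^{j−1}(θ+sβ)/β^j rather than something the summary spells out. It compares the empirical covariance of √θ((βγ/θ)^j − 1), j=1,…,m, with β times the outer product of (1,2,…,m). Function names and parameter values are illustrative.

```python
import numpy as np

def gamma_power_fluctuations(beta, theta, m, n_samples, rng):
    """Samples of sqrt(theta) * ((beta*g/theta)^j - 1) for j = 1..m."""
    # Assumption: g ~ Gamma(shape=theta/beta, scale=1).
    g = rng.gamma(shape=theta / beta, scale=1.0, size=n_samples)
    x = beta * g / theta                         # concentrates at 1 as theta grows
    powers = x[:, None] ** np.arange(1, m + 1)   # columns x, x^2, ..., x^m
    return np.sqrt(theta) * (powers - 1.0)

def delta_method_cov(beta, m):
    """Limit covariance: beta * grad grad^T with grad = (1, 2, ..., m)."""
    grad = np.arange(1, m + 1, dtype=float)
    return beta * np.outer(grad, grad)

rng = np.random.default_rng(1)
sample = gamma_power_fluctuations(0.5, 1e4, 3, 200_000, rng)
emp = np.cov(sample, rowvar=False)
print(np.round(emp, 3))
print(delta_method_cov(0.5, 3))
```

The limit covariance has rank one, which reflects that all m powers are driven by the same scalar gamma fluctuation.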

Technical innovations

A three-way fluctuation decomposition for HPYP homozygosity into level-1, gamma, and stable components, which is sharper than the single-source decompositions used for HDP.
A subordinator-based representation of HPYP masses that turns hierarchical moments into finite polynomial expansions with generalized factorial coefficients C(m,j,β).
An explicit asymptotic variance formula for the HPYP homozygosity CLT that exposes how the level-1 concentration θ0, level-2 concentration θ, discount parameters α,β, and group count L enter separately.
A conditional Lyapunov CLT proof for the stable component using truncation at ⌊θ0^r⌋, where r=⌊1/(1−α)⌋+1, to control the infinite-weight tail.
A reduction framework showing the HPYP result collapses to the HDP theorem when α=0 and β→0, and to the single-level Pitman-Yor theorem when L=1 and c→∞.

Limitations

No empirical validation or simulation study is reported in the provided text; all results are analytic asymptotics.
The theorem is proved only in the joint limit θ0,θ→∞ with θ0/θ→c∈(0,∞); other regimes such as θ0/θ→0 or ∞ are only mentioned informally in remarks.
The result is for α,β∈(0,1); the boundary cases α=0 or β=0 are handled only as limiting recoveries of known theorems, not as part of the main theorem statement.
The proof is technically intricate and relies on several truncation and conditional CLT steps; the excerpt does not provide a compact, easily checkable closed-form simplification of σ²c,m,L.
No code, symbolic notebook, or numerical verification is mentioned in the excerpt, so independent reproduction would require re-deriving the algebra from the manuscript.
The paper focuses on homozygosity; it does not address other functionals of HPYP sampling formulas, posterior inference, or finite-sample approximation error.

Open questions / follow-ons

Can the same subordinator-and-decomposition method be extended to other HPYP functionals beyond homozygosity, such as occupancy counts, diversity indices, or frequency spectra?
What happens in unbalanced concentration limits, e.g. θ0/θ→0 or θ0/θ→∞, where the hierarchy may effectively collapse to one level?
Can the asymptotic variance be simplified or re-expressed in a way that makes the contributions of α,β,L and c more interpretable for downstream Bayesian nonparametric applications?
Is there a functional CLT or joint CLT for the entire collection (Hm,L)m≥2 rather than fixed-order m?

Why it matters for bot defense

This paper is not about bots or CAPTCHA directly, but the proof style is relevant if you use hierarchical Bayesian nonparametrics to model clustered user behavior, device populations, or latent session groups. The main lesson is methodological: when a hierarchical random-measure model has multiple latent randomness sources, you should not treat the observed statistic as a single black-box random variable; you want to separate base-process variability, hierarchy-level variability, and normalization noise. That decomposition is useful if you are trying to reason about how stable a bot-defense prior will be as traffic volume grows or as cluster granularity changes.

For CAPTCHA or bot-defense practitioners, the most transferable idea is the sensitivity analysis across hierarchy levels. If you build a model of user populations with a Pitman-Yor-like prior because you expect heavy-tailed cluster sizes, this paper tells you how concentration parameters affect asymptotic fluctuation and which layer of the hierarchy dominates in the large-data limit. It also serves as a warning that power-law priors can have mathematically nontrivial variance structure compared with Dirichlet-style models, so calibration and uncertainty quantification may need more care than in lighter-tailed settings.

Cite

@article{arxiv2605_12475,
  title={Central limit theorem for the homozygosity of the hierarchical Pitman-Yor process},
  author={Shui Feng and J. E. Paguyo},
  journal={arXiv preprint arXiv:2605.12475},
  year={2026},
  url={https://arxiv.org/abs/2605.12475}
}
