Semi-supervised Method for Risk Prediction with Doubly Censored EHR Data
Source: arXiv:2605.08046 · Published 2026-05-08 · By Jie Zhou, Enhao Wang, Xuan Wang
TL;DR
This paper addresses a methodological gap in survival analysis for electronic health record (EHR) data: how to estimate risk effects (covariate coefficients in a semiparametric transformation model) when the outcome is doubly censored (both left- and right-censored) and gold-standard labels are available for only a small fraction of the cohort. Prior semi-supervised learning (SSL) work for survival analysis handled right-censored or binary outcomes but not double censoring, which arises naturally in EHR studies because disease onset may predate the first recorded visit (left censoring) or remain unobserved at end of follow-up (right censoring). The double-censoring structure compounds the usual EHR challenge of expensive manual chart review, leaving most patients unlabeled while cheap but error-prone surrogate outcomes (e.g., first ICD diagnostic code date) are available for everyone.
The authors construct a two-step SSL estimator. Step 1 obtains a supervised estimator of the risk coefficient vector β using only the small labeled set via the nonparametric maximum likelihood EM algorithm of Li et al. (2018) for doubly censored semiparametric transformation models. Step 2 augments this estimator using the discrepancy between two estimates of a surrogate working-model parameter γ: one fit on labeled data only (γ̂) and one fit on all data (γ̄). The augmentation weight is chosen to minimize asymptotic variance, yielding the projection-type correction β̂_SSL = β̂_SL − Ω̂ Σ̂_γ⁻¹(γ̂ − γ̄). Two working models for the surrogate are proposed and combined into a single estimator (SSL3) that is provably at least as efficient as either alone.
Simulation studies across four model configurations (proportional hazards / proportional odds, matched and mismatched between true and surrogate) with n=200 and n=400 labeled samples out of 6n total consistently show relative efficiency (RE) gains of roughly 1.36–1.52 for SSL3 over the supervised estimator, with coverage probabilities remaining near the nominal 95%. In a real T2D cohort of 115,236 patients with only 1,613 labeled via chart review, SSL3 reduces standard errors by factors up to 1.81 relative to the supervised-only estimator, making inference on clinically meaningful covariates (baseline age, sex, race) substantially more precise.
Key findings
- SSL3 (combined estimator) achieves relative efficiency of 1.42–1.52 over the supervised-only estimator in the proportional hazards setting (r=0, r*=0) at n=200, and 1.40–1.53 at n=400, across both β₁ and β₂ (Table 1).
- SSL3 achieves relative efficiency of 1.51–1.52 in the proportional odds setting (r=1, r*=1) at n=200 and approximately 1.52 at n=400, indicating that the efficiency gains hold across transformation model families (Table 1).
- Even when working model SSL1 is misspecified for T* (e.g., r=1 for true T but r*=0 for T*, so SSL1 uses the wrong error distribution for the surrogate), SSL1 still achieves RE ≈ 1.46–1.50, demonstrating robustness to working model misspecification (Table 1).
- All SSL estimators maintain empirical coverage probabilities between 93.0% and 96.3% across all simulation scenarios (500 replications), confirming validity of the theoretical variance estimator (Table 1).
- In the real T2D EHR dataset (115,236 patients in total; n=1,613 labeled), SSL3 reduces the ESE for baseline age from 0.0039 to 0.0022 (RE=1.77), for male gender from 0.1788 to 0.0988 (RE=1.81), and for white race from 0.2211 to 0.1320 (RE=1.68) relative to the supervised estimator (Table 2).
- The labeled T2D subset has an extreme imbalance: only 168 of 1,613 labeled patients (10.4%) had exactly observed onset times; 1,401 (86.9%) were right-censored and 44 (2.7%) left-censored, illustrating the practical severity of the labeling bottleneck.
- SSL3 is provably at least as efficient as SSL1 and SSL2 individually (Corollary 3: Σ₁ − Σ₃ and Σ₂ − Σ₃ are both positive definite), so combining working models never hurts asymptotically.
- The method requires the label-missing rate to approach 1 (ρ = n/(n+N) → 0); standard missing-data imputation approaches are explicitly noted as inapplicable in this regime.
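The bias, ESE, and coverage figures quoted in the list above are Monte Carlo summaries over replications. A minimal sketch of how such summaries are typically computed, assuming Wald-type 95% intervals; the function name `summarize` and the toy normal estimator are hypothetical and not taken from the paper:

```python
import random
import statistics


def summarize(estimates, std_errors, truth, z=1.959964):
    """Monte Carlo summary for one coefficient across replications.

    estimates  : point estimate from each replication
    std_errors : estimated standard error from each replication
    truth      : true coefficient value used to generate the data
    """
    bias = statistics.fmean(estimates) - truth
    sample_se = statistics.stdev(estimates)   # empirical spread of the estimates
    ese = statistics.fmean(std_errors)        # average estimated SE
    covered = sum(
        abs(est - truth) <= z * se for est, se in zip(estimates, std_errors)
    )
    return {
        "bias": bias,
        "sample_se": sample_se,
        "ese": ese,
        "coverage": covered / len(estimates),
    }


# Toy check: 500 replications of a hypothetical, exactly normal estimator.
rng = random.Random(0)
truth, se = 0.5, 0.08
ests = [rng.gauss(truth, se) for _ in range(500)]
summary = summarize(ests, [se] * 500, truth)
print(summary)
```

When the variance estimator is valid, `sample_se` and `ese` should be close and `coverage` should sit near 0.95, which is the pattern the coverage bullet describes.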
Threat model
n/a — this is a statistical methodology paper for clinical risk estimation, not a security or adversarial ML paper. The paper does not model a malicious adversary. The primary sources of error are structural (double censoring, label scarcity) and stochastic (noise in surrogate outcomes), not intentional manipulation. The closest analog to an adversarial assumption is the MCAR condition on label availability, which the authors acknowledge may be violated if chart-review selection is outcome-dependent.
Methodology — deep read
Threat model and problem setup. The statistical adversary here is bias and inefficiency, not a malicious attacker. The threat is two-fold: (1) left- and right-censoring jointly on the true event time T (disease onset may predate first visit or postdate last visit), and (2) most patients lack gold-standard labels because chart review is resource-prohibitive. The paper assumes labels are missing completely at random (MCAR), motivated by random sampling for chart review in the T2D study. Surrogate outcomes T* (e.g., first ICD code date) are available for all N+n patients but are noisy proxies of T. The censoring interval [L, U] (first and last visit ages) is assumed independent of (T, T*, Z), which is standard but potentially violated in EHR settings due to healthcare utilization patterns.
Data provenance and structure. The real dataset is a US health system EHR cohort of 115,236 patients with at least one ICD code relevant to T2D (following Liao et al., 2010 and Cipparone et al., 2015). The labeled subset of n=1,613 patients (915 female, 698 male) had true T2D onset ascertained via manual chart review. The surrogate T* for all 115,236 patients is age at first encounter with PheCode 250.2 (T2D PheCode). Covariates are baseline age (age at enrollment), gender, and race. No external validation cohort is reported. For simulations, N=5n with n∈{200,400} across 500 Monte Carlo replications; the labeled fraction is thus ~17% of total cohort size, representing a relatively generous labeled proportion compared to the real data (~1.4%).
Model class. The paper works within the semiparametric linear transformation model h(T) = −β'Z + ε, where h is an unknown monotone increasing function and ε has a known parametric distribution. This class nests the Cox proportional hazards model (ε ~ extreme-value, corresponding to G(x,r) with r→0) and the proportional odds model (ε ~ logistic, r=1). The baseline cumulative hazard Λ(t) is treated as an infinite-dimensional nuisance parameter approximated by a step function with jumps only at observed uncensored event times. Under double censoring, the observed data per subject is (X_i = max(L_i, min(T_i, U_i)), δ_i) where δ_i ∈ {1=exact, 2=right-censored, 3=left-censored}.
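The observation scheme at the end of this paragraph can be sketched in a few lines. This is an illustration of the definition X = max(L, min(T, U)) with the δ coding given above; the function name `observe` and the choice to treat ties at the interval boundary as exactly observed are assumptions:

```python
def observe(t, left, right):
    """Return (x, delta) for one subject with censoring interval [left, right].

    delta follows the coding in the text:
      1 = exactly observed, 2 = right-censored, 3 = left-censored.
    Ties at the boundary (t == left or t == right) are treated as exact,
    which is an assumption of this sketch.
    """
    if left > right:
        raise ValueError("expected left <= right")
    x = max(left, min(t, right))
    if t < left:
        delta = 3      # onset predates the first visit
    elif t > right:
        delta = 2      # onset not yet observed at the last visit
    else:
        delta = 1      # onset falls inside the follow-up window
    return x, delta


# Hypothetical ages: first visit at 45, last visit at 60.
print(observe(40.0, 45.0, 60.0))   # left-censored at the first visit
print(observe(52.0, 45.0, 60.0))   # exactly observed
print(observe(70.0, 45.0, 60.0))   # right-censored at the last visit
```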
Step 1 — supervised estimator β̂_SL. Following Li et al. (2018), the authors apply an EM algorithm with latent Poisson frailty variables to handle double censoring. In the E-step, conditional expectations E(N_ik) (pseudo-event counts at each uncensored time t_k) and E(μ_i) (frailty expectations) are computed separately for left-censored, exactly observed, and right-censored subjects using formulas that differ by censoring indicator. In the M-step, jump sizes λ_k are updated in closed form and β is updated by solving a profile score equation. The influence function ξ_i of β̂_SL is derived explicitly in Appendix A via a first-order Taylor expansion of the profile score around the true β₀, yielding ξ_i = A_β⁻¹ ψ_i(β₀, β₀), where A_β is the negative expected derivative of the score.
Step 2 — semi-supervised augmentation. The key insight is an augmentation of the form β̂_SSL = β̂_SL − Ω̂ Σ̂_γ⁻¹(γ̂ − γ̄), where γ̂ is estimated from labeled data only and γ̄ from all data, under a working model for the surrogate T*. The correction exploits the fact that γ̂ − γ̄ is asymptotically mean-zero but correlated with the estimation error of β̂_SL; projecting out this correlated component reduces variance. The optimal weight Ω Σ_γ⁻¹ is estimated from the labeled data using plug-in influence function estimates. Two working models are considered: (4) the same semiparametric transformation model as for T (giving SSL1), and (5) a more flexible model where both h* and the error distribution are unspecified, identified only through the sufficient dimension reduction result of Li and Duan (1989) using a combined logistic (for left-censoring probability) and Cox (for right-censored part) convex likelihood (giving SSL2). SSL3 stacks γ̂₁ and γ̂₂ into a 2p-dimensional vector and applies the same augmentation formula, provably dominating either alone.
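The augmentation step can be sketched with plug-in moments of the influence functions. This is a minimal illustration under stated assumptions, not the authors' code: `xi` and `phi` are hypothetical arrays holding per-subject influence function estimates for β̂_SL and γ̂ on the labeled set (assumed mean-zero), and the standard error formula treats γ̄ as fixed, which corresponds to the regime where the unlabeled pool is much larger than the labeled set.

```python
import numpy as np


def augment(beta_sl, gamma_lab, gamma_all, xi, phi):
    """Projection-type correction beta_sl - Omega Sigma_gamma^{-1} (gamma_lab - gamma_all).

    beta_sl   : (p,) supervised estimate
    gamma_lab : (q,) working-model estimate from labeled data only
    gamma_all : (q,) working-model estimate from all data
    xi        : (n, p) influence function estimates for beta_sl (hypothetical input)
    phi       : (n, q) influence function estimates for gamma_lab (hypothetical input)
    """
    n = xi.shape[0]
    omega = xi.T @ phi / n            # (p, q) plug-in estimate of E[xi phi']
    sigma_gamma = phi.T @ phi / n     # (q, q) plug-in estimate of E[phi phi']
    # weight = omega @ inv(sigma_gamma), computed with a solve for stability
    weight = np.linalg.solve(sigma_gamma, omega.T).T
    beta_ssl = beta_sl - weight @ (gamma_lab - gamma_all)

    sigma_sl = xi.T @ xi / n
    sigma_ssl = sigma_sl - weight @ omega.T   # assumes gamma_all is effectively fixed
    se_sl = np.sqrt(np.diag(sigma_sl) / n)
    se_ssl = np.sqrt(np.diag(sigma_ssl) / n)
    return beta_ssl, se_sl, se_ssl


# Toy inputs: influence values for beta correlated with those for gamma.
rng = np.random.default_rng(0)
phi = rng.normal(size=(400, 2))
xi = 0.8 * phi + 0.6 * rng.normal(size=(400, 2))
beta_sl = np.array([0.5, -0.3])
gamma_lab = np.array([-0.28, 0.72])
gamma_all = np.array([-0.30, 0.70])

beta_ssl, se_sl, se_ssl = augment(beta_sl, gamma_lab, gamma_all, xi, phi)
print(beta_ssl, se_sl / se_ssl)
```

For the stacked SSL3 variant, `phi` would simply hold the influence values of both working models side by side (q = 2p columns); the same function applies unchanged.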
Simulation design. Covariates are bivariate normal with off-diagonal correlation 0.3. The true coefficients are β=(0.5, −0.3) and the surrogate coefficients are γ=(−0.3, 0.7), with the errors of T and T* coupled via a Gaussian copula with correlation 0.85. Censoring times are L_i ~ U(0.5τ_l, 1.5τ_l) and U_i ~ U(0.5τ_r, 1.5τ_r), where τ_l and τ_r are the 20th and 80th percentiles of the marginal survival distribution, inducing roughly 20% left and 20% right censoring. Four model combinations are considered: (r=0,r*=0), (r=0,r*=1), (r=1,r*=0), (r=1,r*=1), covering both matched and mismatched error distributions between true and surrogate models. Metrics are bias, sample SE, average estimated SE (ESE), 95% empirical coverage probability, and relative efficiency (RE = ESE_SL / ESE_SSL).
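A rough sketch of how one subject's true and surrogate event times could be drawn under this design. Several details are assumptions of the sketch rather than facts from the paper: covariates have zero mean and unit variance, both transformations are taken to be h = h* = log, and the surrogate follows the same transformation form with coefficient γ. The names `error_quantile` and `draw_subject` are hypothetical.

```python
import math
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def error_quantile(u, r):
    """Inverse CDF of the error term: extreme value for r=0, logistic for r=1."""
    if r == 0:
        return math.log(-math.log1p(-u))   # F(x) = 1 - exp(-exp(x))
    if r == 1:
        return math.log(u / (1.0 - u))     # F(x) = 1 / (1 + exp(-x))
    raise ValueError("only r in {0, 1} is sketched here")


def draw_subject(rng, beta, gamma, r, r_star, cov_corr=0.3, copula_corr=0.85):
    """Draw (Z, T, T*) for one subject; distributional details are assumptions."""
    # Bivariate normal covariates with unit variances and correlation cov_corr.
    a, b = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    z = (a, cov_corr * a + math.sqrt(1.0 - cov_corr ** 2) * b)

    # Gaussian copula: correlated normals mapped to uniforms, then to errors.
    c, d = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    e = copula_corr * c + math.sqrt(1.0 - copula_corr ** 2) * d
    clamp = lambda u: min(max(u, 1e-12), 1.0 - 1e-12)
    eps = error_quantile(clamp(_STD_NORMAL.cdf(c)), r)
    eps_star = error_quantile(clamp(_STD_NORMAL.cdf(e)), r_star)

    # Assumed h = h* = log, so h(T) = -beta'Z + eps gives T = exp(-beta'Z + eps).
    t = math.exp(-(beta[0] * z[0] + beta[1] * z[1]) + eps)
    t_star = math.exp(-(gamma[0] * z[0] + gamma[1] * z[1]) + eps_star)
    return z, t, t_star


rng = random.Random(1)
sample = [draw_subject(rng, (0.5, -0.3), (-0.3, 0.7), r=0, r_star=0) for _ in range(5)]
for z, t, t_star in sample:
    print(f"Z=({z[0]:+.2f}, {z[1]:+.2f})  T={t:.3f}  T*={t_star:.3f}")
```

Censoring would then be applied to both T and T* using the uniform draws for L_i and U_i described in the design; that step is omitted here because τ_l and τ_r depend on the marginal distribution of T.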
Reproducibility. No code repository is mentioned. The paper provides explicit formulas for the EM updates, influence functions (Appendices A and B), and full simulation parameters, so re-implementation is feasible in principle. The real EHR dataset is from an unnamed US health system and is not publicly available. The labeled subset size (n=1,613) and surrogate construction (PheCode 250.2) are specified, but the data sharing status is not discussed.
Technical innovations
- First SSL framework for risk effect estimation under double censoring: extends the augmented estimation paradigm (previously limited to right-censored or binary outcomes, e.g., Tong et al., 2020; Ahuja et al., 2023) to accommodate simultaneous left and right censoring of both true and surrogate outcomes.
- Doubly-censored convex surrogate likelihood (eq. 6): combines a logistic model for left-censoring probability with a partial Cox likelihood for the right-censored portion, enabling dimension-reduction-based estimation of γ's direction under a fully nonparametric error distribution (extending Li and Duan, 1989 to double censoring).
- Projection-based augmentation with double-censored EM: the augmentation weight Ω Σ_γ⁻¹ is estimated entirely from labeled-set influence functions, avoiding any need to model the marginal distribution of unlabeled covariates or the missing-data mechanism beyond MCAR.
- Combined estimator SSL3 with dominance guarantee: stacking multiple surrogate working models into a single augmentation step yields an estimator provably no worse than any individual working model (Corollary 3), with formal proof via positive definiteness of covariance differences.
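The variance reduction behind the projection-based augmentation, and the intuition for the dominance result in the last item, can be seen from a standard control-variate calculation. The sketch below uses φ_i as assumed notation for the influence function of γ̂ and works in the regime ρ → 0, where γ̄ is effectively fixed; it is a heuristic outline, not the paper's proof.

```latex
\begin{aligned}
\sqrt{n}\,(\hat\beta_{SL}-\beta_0) &\approx n^{-1/2}\sum_{i=1}^{n}\xi_i,
\qquad
\sqrt{n}\,(\hat\gamma-\bar\gamma) \approx n^{-1/2}\sum_{i=1}^{n}\phi_i
\quad (\rho \to 0),\\[4pt]
\Sigma_{SL} &= E\!\left[\xi_i\xi_i^{\top}\right],\qquad
\Omega = E\!\left[\xi_i\phi_i^{\top}\right],\qquad
\Sigma_{\gamma} = E\!\left[\phi_i\phi_i^{\top}\right],\\[4pt]
\operatorname{Var}\!\left(\xi_i - W\phi_i\right)
&= \Sigma_{SL} - W\Omega^{\top} - \Omega W^{\top} + W\Sigma_{\gamma}W^{\top},\\[4pt]
W^{*} = \Omega\,\Sigma_{\gamma}^{-1}
\;&\Longrightarrow\;
\Sigma_{SSL} = \Sigma_{SL} - \Omega\,\Sigma_{\gamma}^{-1}\Omega^{\top} \preceq \Sigma_{SL}.
\end{aligned}
```

Stacking the two working models replaces φ_i with the concatenation of both influence functions, so ξ_i is projected onto a larger linear span and the residual variance cannot increase, which is the intuition behind Corollary 3.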
Datasets
- US health system T2D EHR cohort — 115,236 patients total; 1,613 labeled via manual chart review — non-public, unnamed US health system
- Simulated doubly-censored survival data — n+N subjects with N=5n, n∈{200,400}, 500 replications per scenario — synthetic
Baselines vs proposed
- Supervised-only (SL, labeled data only), β₁, r=0,r*=0, n=200: ESE=0.0844 vs SSL3: ESE=0.0595 (RE=1.42)
- Supervised-only (SL), β₁, r=1,r*=1, n=200: ESE=0.1290 vs SSL3: ESE=0.0851 (RE=1.51)
- Supervised-only (SL), β₁, r=1,r*=1, n=400: ESE=0.0900 vs SSL3: ESE=0.0591 (RE=1.52)
- Supervised-only (SL), β_male, T2D real data: ESE=0.1788 vs SSL3: ESE=0.0988 (RE=1.81)
- Supervised-only (SL), β_baseline-age, T2D real data: ESE=0.0039 vs SSL3: ESE=0.0022 (RE=1.77)
- Supervised-only (SL), β_white, T2D real data: ESE=0.2211 vs SSL3: ESE=0.1320 (RE=1.68)
- SSL1 vs SSL3, β₁, r=1,r*=1, n=400: ESE=0.0592 vs 0.0591 (RE 1.5206 vs 1.5241 — marginal SSL3 advantage)
- SSL2 vs SSL3, β₁, r=1,r*=1, n=200: ESE=0.0959 vs 0.0851 (RE 1.3451 vs 1.5149 — SSL3 substantially better than SSL2 alone)
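The RE values in the list above can be recomputed from the quoted ESEs using RE = ESE_SL / ESE_SSL. A small check; the dictionary `rows` simply transcribes the supervised-versus-SSL3 entries from this section, and small discrepancies are expected because the ESEs are rounded to four decimals:

```python
# (ESE of SL, ESE of SSL3, RE as reported in this section)
rows = {
    "beta_1, r=0, r*=0, n=200": (0.0844, 0.0595, 1.42),
    "beta_1, r=1, r*=1, n=200": (0.1290, 0.0851, 1.51),
    "beta_1, r=1, r*=1, n=400": (0.0900, 0.0591, 1.52),
    "beta_male, T2D": (0.1788, 0.0988, 1.81),
    "beta_baseline-age, T2D": (0.0039, 0.0022, 1.77),
    "beta_white, T2D": (0.2211, 0.1320, 1.68),
}


def relative_efficiency(ese_sl, ese_ssl):
    """Ratio of standard errors, as RE is defined in the simulation design."""
    return ese_sl / ese_ssl


recomputed = {}
for label, (ese_sl, ese_ssl, reported) in rows.items():
    recomputed[label] = relative_efficiency(ese_sl, ese_ssl)
    print(f"{label}: recomputed RE = {recomputed[label]:.3f}, reported = {reported}")
```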
Limitations
- MCAR assumption: labels are assumed missing completely at random, motivated by random sampling for chart review. This will not hold in EHR cohorts where chart review is triggered by clinical suspicion or disease severity, potentially inducing selection bias that the method cannot correct.
- Independence of censoring from covariates: the method assumes (L, U) ⊥ (T, T*, Z), but in EHR data, healthcare utilization (and thus visit frequency, first/last visit age) is strongly associated with covariates like age, race, and comorbidities, making this assumption likely violated in practice.
- No adversarial or distribution-shift evaluation: simulation scenarios use the same data-generating process for labeled and unlabeled data. No experiments assess robustness to model misspecification in the censoring mechanism or covariate distribution shift between labeled and unlabeled pools.
- Single real dataset, no external validation: the real-data application uses one unnamed health system's cohort. Generalizability of the efficiency gains to other EHR systems, coding practices, or disease endpoints is untested.
- Efficiency gain depends on surrogate-outcome quality, which is not systematically characterized: the paper notes that gains depend on surrogate-true correlation but does not provide guidance on when the surrogate is too weak to justify the additional modeling complexity, nor does it quantify this for the T2D PheCode surrogate specifically.
- Computational cost of SSL3 not reported: combining two working models requires fitting both the transformation-model EM and the convex surrogate likelihood on the labeled and the full data, plus the corresponding influence-function estimates; no wall-clock timing or scalability analysis is provided for cohorts larger than ~115k patients.
- No multiple-surrogate extension demonstrated empirically: while the paper notes that additional surrogates can be incorporated in principle, the simulation and real-data analysis use only a single surrogate outcome T*, leaving the multi-surrogate regime unexplored.
Open questions / follow-ons
- Extension to missing-at-random (MAR) or missing-not-at-random (MNAR) label mechanisms: the current MCAR assumption is unlikely to hold in most real EHR settings; deriving an augmented estimator that remains consistent under informative labeling is an open and practically important problem.
- Covariate-dependent censoring: relaxing the assumption that (L, U) ⊥ Z to allow healthcare-utilization-driven censoring would require a substantially different identification strategy and is explicitly flagged by the authors as future work.
- Optimal surrogate selection and combination: the paper shows that stacking working models never hurts asymptotically, but provides no data-adaptive procedure for selecting or weighting surrogates when many noisy ICD codes or NLP-derived phenotype scores are available, nor a finite-sample penalty for including uninformative surrogates.
- Scalability to high-dimensional covariates and regularization: the current framework assumes a low-dimensional covariate vector Z with a finite-dimensional, unpenalized β; extending the SSL augmentation to penalized or high-dimensional regression settings (e.g., LASSO-type penalties on β) under double censoring is not addressed.
Why it matters for bot defense
At first glance this paper is distant from bot defense and CAPTCHA engineering. However, there is a structural analogy worth noting for practitioners who work on behavioral risk scoring from event-log data. EHR double censoring mirrors a common problem in bot-detection pipelines: a user's true 'onset of malicious behavior' may predate the first observed session in a logging window (left censoring — the bot was active before instrumentation began) or may not yet be confirmed at the end of an observation window (right censoring — the account has not yet been actioned or verified). Surrogate signals — such as the timestamp of the first flagged action, first CAPTCHA failure, or first velocity anomaly — are cheap to compute for all accounts but are noisy proxies of true compromise time. The framework here shows how to combine a small set of manually verified ground-truth labels (analogous to fraud analyst-reviewed cases) with large-scale surrogate timestamps to get more efficient risk coefficient estimates without bias inflation.
The practical takeaway for a bot-defense engineer is methodological rather than immediately deployable: (1) if you have a doubly censored event time (behavioral onset predating or postdating your observation window) and only a small labeled set from manual review, this paper gives a statistically principled way to leverage surrogate timestamps from your full user base to sharpen covariate effect estimates; (2) the augmentation is robust to misspecification of the surrogate model, which is reassuring given that bot behavioral surrogates are often ad hoc; (3) the MCAR assumption is the main point of friction: in fraud and abuse contexts, manual review is rarely random (high-risk accounts get reviewed first), so the MAR/MNAR extension flagged as future work would be essential before direct application. The reported RE of roughly 1.4–1.8 is a ratio of standard errors (RE = ESE_SL / ESE_SSL), and standard errors shrink like 1/√n, so matching the supervised precision would take about 1/RE² of the labels. If the analogy holds, one could potentially reduce manual review burden by roughly 50–70% while maintaining the same inferential precision on risk factors; the more conservative 30–45% applies only if RE is read as a variance ratio.
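How an RE in this range translates into saved review effort depends on whether RE is read as a ratio of standard errors (as defined in the simulation design) or as a ratio of variances. A back-of-the-envelope sketch covering both readings, assuming variance scales as 1/n in the number of labels and that the unlabeled pool stays much larger than the labeled set; `label_saving` is a hypothetical helper:

```python
def label_saving(re, re_is_se_ratio=True):
    """Fraction of labels that could be dropped at equal precision.

    Assumes the estimator variance is proportional to 1/n in the number of
    labeled subjects, so n_needed / n = 1 / (variance ratio).
    """
    variance_ratio = re ** 2 if re_is_se_ratio else re
    return 1.0 - 1.0 / variance_ratio


for re in (1.4, 1.8):
    as_se = label_saving(re, re_is_se_ratio=True)
    as_var = label_saving(re, re_is_se_ratio=False)
    print(f"RE={re}: {as_se:.0%} fewer labels (SE ratio), {as_var:.0%} (variance ratio)")
```

Either way, the figure is an upper bound on what to expect in practice, since it ignores the MCAR caveat raised in point (3).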
Cite
@article{arxiv2605_08046,
title={Semi-supervised Method for Risk Prediction with Doubly Censored EHR Data},
author={Jie Zhou and Enhao Wang and Xuan Wang},
journal={arXiv preprint arXiv:2605.08046},
year={2026},
url={https://arxiv.org/abs/2605.08046}
}