Network Recovery from Cascade Data: A Debiased Jacobian-Based Machine Learning Approach

Source: arXiv:2606.07483 · Published 2026-06-05 · By Lei Huang

TL;DR

This paper addresses the fundamental problem of recovering hidden influence networks from observed cascade data sequences, such as disease spread, product adoption, or information diffusion. Existing approaches typically rely on strong parametric assumptions about the diffusion process (e.g., exponential or power-law transmission kernels) and suffer from substantial performance degradation when those assumptions are violated. Moreover, they do not provide formal statistical inference for estimated edges, limiting interpretability and decision-making utility. To overcome these limitations, the author proposes CascadeNet, a novel two-step machine learning framework that (1) flexibly estimates the one-step transition function governing cascade dynamics without assuming its parametric form, and (2) recovers the underlying influence network by debiasing the Jacobian matrix of this estimated transition function using Neyman-orthogonal corrections leveraging the Riesz representer. This debiasing achieves √n-consistency and asymptotic normality of individual edge estimates, enabling confidence intervals and hypothesis tests for edge presence.

The empirical validation includes extensive experiments on nine synthetic diffusion data-generating processes that vary from classical models like independent cascade, linear threshold, SIS, and SIR, to complex nonlinear dynamics. Across these, CascadeNet substantially outperforms classical baselines by up to 50 percentage points in Pearson correlation with ground truth edges, particularly when baselines are misspecified. In a real-world application to COVID-19 transmission across Spain’s 52 provinces, CascadeNet recovers networks whose edges significantly correlate with observed inter-province mobility patterns, while all baseline methods fail to show meaningful alignment. This demonstrates the method’s robustness to diffusion model misspecification and practical relevance.

Overall, the paper contributes a flexible, model-agnostic framework for network recovery from cascades paired with debiased inference guarantees, allowing reliable estimation and uncertainty quantification of influence edges from limited observational data without requiring strong parametric diffusion assumptions.

Key findings

CascadeNet achieves the highest network recovery accuracy across nine synthetic diffusion models, outperforming existing methods by up to 50 percentage points in Pearson correlation.
On a closed-form tanh synthetic example, the naive plug-in Jacobian correlates with the truth at r=0.07, while the debiased Riesz-corrected estimator attains r=0.77, an 11-fold increase in correlation.
For COVID-19 data from Spain’s 52 provinces, CascadeNet’s recovered network edges significantly correlate with true inter-province mobility flows (statistical significance reported), whereas all classical baselines fail to produce any significant correlation.
Neyman-orthogonal debiasing via the Riesz representer yields √n-consistent, asymptotically normal edge estimates, enabling formal inference (confidence intervals and hypothesis tests) for each network edge.
CascadeNet’s flexible transition function estimator accommodates binary, count, or continuous outcomes, and nests classical diffusion models such as independent cascade, linear threshold, SIS, SIR, and Hawkes as special cases.
The absorbing wrapper enforces irreversible adoption dynamics by making previously adopted states absorbing with probability one, improving estimation in epidemiological or irreversible adoption settings.
CascadeNet handles panels with as few as a single cascade trajectory consisting of T time steps and N nodes, estimating an N×N influence matrix robustly despite sparse observations.
Cross-fitting and separate nuisance estimation folds control overfitting and bias from using the same data for transition function estimation and Jacobian inference.
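
The asymptotic normality claim in the findings above is what makes per-edge confidence intervals and tests possible. A minimal sketch of that step, assuming a debiased edge estimate and its standard error are already available (the function name `edge_inference` and the numbers are hypothetical and not taken from the paper):

```python
from statistics import NormalDist


def edge_inference(estimate: float, std_error: float, level: float = 0.95):
    """Wald interval and two-sided p-value for H0: edge weight = 0.

    Valid only if the estimator is approximately normal, which is what
    the debiasing step is meant to deliver.
    """
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    lower = estimate - z_crit * std_error
    upper = estimate + z_crit * std_error
    z_stat = estimate / std_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z_stat)))
    return lower, upper, p_value


# Hypothetical edge: estimate 0.30 with standard error 0.10.
lower, upper, p_value = edge_inference(estimate=0.30, std_error=0.10)
```

With N×N edges tested at once, a multiple-testing adjustment would likely be needed; the summary does not say how the paper handles this.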

Threat model

The adversary is not the focus; rather, the model assumes observation of cascade data generated by an unknown diffusion mechanism over a hidden influence network. The analysis assumes data is not adversarially manipulated and covariates are exogenous. The goal is statistical recovery of the influence matrix under realistic misspecification but not under active adversarial attacks.

Methodology — deep read

Threat model & assumptions: The framework assumes a Markovian cascade process where the observed system evolves in discrete time steps. At each time t, the system state Y_{c,t} for cascade c is an N-dimensional vector of agent outcomes (infection status, adoption, case counts, etc.), updated by an unknown transition function m0 plus additive noise. The analyst observes C independent cascades, each with temporal data and covariates X_{c,t}. The primary assumption is Markovian dynamics: Y_{c,t+1} depends only on Y_{c,t} and X_{c,t}, not on earlier history. Covariates X are treated as exogenous. The goal is to recover the N×N influence network encoded in the Jacobian matrix J0 of the transition function with respect to the prior state.
Data: The method operates on panel data consisting of C independent trajectories (cascades), each with T time points and N nodes. Outcomes Y_{c,t} can be binary, count, or continuous. Covariates X_{c,t} may vary over time and across agents. The synthetic experiments use nine known diffusion models spanning independent cascade, linear threshold, continuous-time SI with different kernels, epidemic models with recovery/removal (SIS, SIR), complex aggregate models, and nonlinear DGPs. The empirical example uses COVID-19 infection counts from Spain’s 52 provinces, with mobility flow data as benchmark.
Architecture / algorithm: CascadeNet has two key components: (a) a flexible estimator m̂ for the one-step transition function m0 that predicts Y_{c,t+1} from (Y_{c,t}, X_{c,t}), and (b) an estimator for the network Jacobian, defined as the matrix of partial derivatives J0[i,j] = ∂m_{0,i}/∂y_j.

The default m̂ is a linear-index model: m̂_i(Y_t, X_{t,i}) = ℓ( (J Y_t)_i + k_i x_{t,i} + b_i ), where J is the learnable N×N network weight matrix, (J Y_t)_i is the i-th entry of the product J Y_t, k_i is the direct effect of node i's covariates, b_i is a bias term, and ℓ is a flexible link function (sigmoid for binary outcomes, identity or an MLP for continuous ones).
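
To make the linear-index form concrete, here is a minimal sketch of the forward pass with a sigmoid link. It assumes one scalar covariate per node, and the function names and the 3-node example are hypothetical, not the paper's implementation:

```python
import math


def sigmoid(z: float) -> float:
    """Logistic link, used here for binary outcomes."""
    return 1.0 / (1.0 + math.exp(-z))


def linear_index_predict(J, k, b, y_t, x_t, link=sigmoid):
    """Return m_i = link((J y_t)_i + k_i * x_{t,i} + b_i) for every node i.

    J    : N x N list of lists; J[i][j] is the weight of node j on node i
    k, b : length-N lists (covariate effect and bias per node)
    y_t  : length-N current state
    x_t  : length-N list, one scalar covariate per node (an assumption)
    """
    n = len(y_t)
    preds = []
    for i in range(n):
        index = sum(J[i][j] * y_t[j] for j in range(n)) + k[i] * x_t[i] + b[i]
        preds.append(link(index))
    return preds


# Hypothetical 3-node network: node 0 is active and influences node 1 only.
J = [[0.0, 0.0, 0.0],
     [2.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
k = [0.0, 0.0, 0.0]
b = [-1.0, -1.0, -1.0]
probs = linear_index_predict(J, k, b, y_t=[1.0, 0.0, 0.0], x_t=[0.0, 0.0, 0.0])
```

Node 1 receives the extra index contribution J[1][0] * 1.0, so its predicted probability is higher than that of node 2, which has no incoming edge.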

An absorbing wrapper is optionally applied to enforce irreversibility: m̂_{abs,i} = Y_{t,i} + (1 - Y_{t,i}) m̂_{raw,i}.
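
The wrapper formula can be read off directly in code. A minimal sketch, assuming binary states stored as floats and raw predictions in [0, 1] (the function name is hypothetical):

```python
def absorbing_wrapper(y_t, m_raw):
    """Apply y + (1 - y) * m_raw elementwise.

    A node with y = 1 is mapped to 1 regardless of the raw prediction,
    and a node with y = 0 keeps its raw prediction unchanged.
    """
    return [y + (1.0 - y) * m for y, m in zip(y_t, m_raw)]


# Node 0 has already adopted; nodes 1 and 2 have not.
wrapped = absorbing_wrapper([1.0, 0.0, 0.0], [0.27, 0.73, 0.27])
```

Here `wrapped[0]` is exactly 1.0, so the adopted node cannot revert, while the other two entries equal the raw predictions.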

Estimation is by regularized empirical risk minimization: a prediction loss (cross-entropy for binary outcomes, MSE for continuous ones) plus an ℓ1 or ℓ2 penalty on J, optimized with the Adam stochastic gradient method.

The Jacobian of the estimator is obtained by differentiating m̂ with respect to Y; however, regularization biases this plug-in derivative, typically shrinking it toward zero.
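
For the linear-index model with a sigmoid link, the plug-in Jacobian has a closed form by the chain rule: ∂m̂_i/∂y_j = ℓ'(z_i) J[i,j], where z_i is the index of node i. The sketch below checks that formula against a central finite difference. Covariates are omitted for brevity and all names and values are hypothetical:

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(J, b, y):
    """Linear-index prediction with sigmoid link (no covariates in this sketch)."""
    n = len(y)
    return [sigmoid(sum(J[i][j] * y[j] for j in range(n)) + b[i]) for i in range(n)]


def plugin_jacobian(J, b, y):
    """Analytic Jacobian: d m_i / d y_j = p_i * (1 - p_i) * J[i][j]."""
    p = predict(J, b, y)
    n = len(y)
    return [[p[i] * (1.0 - p[i]) * J[i][j] for j in range(n)] for i in range(n)]


def finite_difference_jacobian(J, b, y, eps=1e-6):
    """Central-difference approximation, used only as a sanity check."""
    n = len(y)
    jac = [[0.0] * n for _ in range(n)]
    for j in range(n):
        up, down = list(y), list(y)
        up[j] += eps
        down[j] -= eps
        p_up, p_down = predict(J, b, up), predict(J, b, down)
        for i in range(n):
            jac[i][j] = (p_up[i] - p_down[i]) / (2.0 * eps)
    return jac


J = [[0.0, 0.5], [1.5, 0.0]]
b = [-1.0, -0.5]
y = [0.2, 0.7]
analytic = plugin_jacobian(J, b, y)
numeric = finite_difference_jacobian(J, b, y)
max_gap = max(abs(analytic[i][j] - numeric[i][j]) for i in range(2) for j in range(2))
```

Because every entry is proportional to J[i][j], any shrinkage that a penalty applies to J carries over to the plug-in Jacobian, which is the bias the debiasing step targets.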

Training regime: The model is fit on observed cascade data, with hyperparameters (regularization λ) selected by held-out cross-validation on independent folds of trajectories. Cross-fitting is employed for debiasing: m̂ is trained multiple times on separate folds, and inference uses held-out prediction residuals to avoid bias.
Evaluation protocol: Metrics are Pearson correlation between estimated and true networks (for synthetic) and correlation with external mobility data (for COVID-19). Baselines include classical parametric diffusion network inference methods (NetInf, NetRate, DANI) and recent GNN-based models. Ablations compare naive plug-in Jacobians to the debiased estimator. Statistical significance of correlations is tested where applicable. Results demonstrate robustness to misspecified diffusion kernels and improved interpretability via confidence intervals.
Reproducibility: Code and data availability are not explicitly stated, but synthetic experiments use publicly documented diffusion benchmarks. COVID-19 data is from official provincial case reports; ground-truth mobility data is presumably from external sources. The methodology fully specifies the estimator class and debiasing procedure, allowing replication given data.
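
The cross-fitting described in the training regime above splits data by whole trajectories, so a cascade never contributes to both the fitted transition function and the residuals used for inference. A minimal sketch of such a split, with a hypothetical helper name and fold count:

```python
import random


def trajectory_folds(num_cascades: int, num_folds: int, seed: int = 0):
    """Return a list of (train_ids, heldout_ids) pairs over cascade indices.

    Whole cascades are assigned to folds, so time steps from one
    trajectory never appear on both sides of a split.
    """
    ids = list(range(num_cascades))
    random.Random(seed).shuffle(ids)
    folds = [ids[f::num_folds] for f in range(num_folds)]
    splits = []
    for f in range(num_folds):
        heldout = sorted(folds[f])
        train = sorted(i for g in range(num_folds) if g != f for i in folds[g])
        splits.append((train, heldout))
    return splits


# Hypothetical panel of 10 cascades split into 5 folds.
splits = trajectory_folds(num_cascades=10, num_folds=5)
```

This scheme needs at least as many cascades as folds. How the paper splits a single long trajectory is not described in this summary, so that case is left out of the sketch.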

Example end-to-end: In the COVID-19 application, the 52 provinces form nodes, state vectors Y_{c,t} represent daily case counts (continuous), and X_{c,t} includes covariates like policy variables. CascadeNet fits the linear-index model with a softplus or identity link function, regularized with ℓ2 penalties. The Jacobian matrix of the estimated transition function is computed and debiased with the Riesz representer, leveraging cross-fitting to estimate the bias correction from residuals. The resulting edge weights are then correlated with inter-province mobility matrices to validate recovered transmission patterns, revealing significant alignment not achieved by baselines.
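
The final validation step, correlating recovered edge weights with a reference matrix, can be sketched as follows. Excluding the diagonal is an assumption made here (self-influence is not an inter-province edge), and both matrices are made-up illustrations, not data from the paper:

```python
import math


def offdiag(matrix):
    """Flatten the off-diagonal entries of a square matrix in row-major order."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(n) if i != j]


def pearson(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical 3-province example: the reference is an exact rescaling of the estimate.
estimated = [[0.0, 0.2, 0.1],
             [0.4, 0.0, 0.3],
             [0.6, 0.5, 0.0]]
reference = [[0, 20, 10],
             [40, 0, 30],
             [60, 50, 0]]
r = pearson(offdiag(estimated), offdiag(reference))
```

Pearson correlation is invariant to positive rescaling, so the differing units of edge weights and mobility flows do not matter for this comparison; in this toy case `r` is 1 up to floating-point error.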

Technical innovations

Formulating influence network recovery as estimation of the Jacobian matrix of an unknown one-step transition function, avoiding parametric diffusion kernel assumptions.
Applying Neyman-orthogonal debiasing via the Riesz representer to correct bias in Jacobian estimates induced by regularization, achieving √n-consistent, asymptotically normal inference per edge.
Extending double machine learning debiasing techniques to high-dimensional N×N matrix functionals (network Jacobian) with entry-specific orthogonal scores.
Introducing an absorbing wrapper construction to enforce irreversibility constraints in transition functions estimating diffusion models with absorbing adoption states.
Demonstrating that flexible transition function estimators (linear index model with flexible link or GNNs) coupled with debiased derivatives provide robust network recovery under misspecification.
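
The debiasing idea listed above can be illustrated on a one-dimensional toy problem. The sketch uses the generic orthogonal score from the debiased machine learning literature, ψ = ∂m̂/∂y + α(y)(z − m̂(y)), for an average derivative; whether the paper uses exactly this score is an assumption. The representer α(y) = y is exact only because the state is simulated as standard normal; in practice it would have to be estimated. All functions and constants are hypothetical:

```python
import math
import random


def m0(y):            # hypothetical true transition function
    return math.tanh(0.8 * y)


def m0_prime(y):
    return 0.8 * (1.0 - math.tanh(0.8 * y) ** 2)


def m_hat(y):         # deliberately shrunken fit, standing in for a regularized estimator
    return math.tanh(0.4 * y)


def m_hat_prime(y):
    return 0.4 * (1.0 - math.tanh(0.4 * y) ** 2)


def alpha(y):         # Riesz representer of the average derivative when y ~ N(0, 1)
    return y


rng = random.Random(0)
n = 20000
ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
zs = [m0(y) + rng.gauss(0.0, 0.1) for y in ys]   # next-step outcome with additive noise

truth = sum(m0_prime(y) for y in ys) / n          # target: average derivative of m0
plugin = sum(m_hat_prime(y) for y in ys) / n      # naive plug-in, biased by shrinkage

# Orthogonal score: plug-in derivative plus representer-weighted residual.
scores = [m_hat_prime(y) + alpha(y) * (z - m_hat(y)) for y, z in zip(ys, zs)]
debiased = sum(scores) / n
std_error = math.sqrt(sum((s - debiased) ** 2 for s in scores) / (n - 1) / n)
```

The correction term works here because of Stein's identity for Gaussian inputs: the expected value of y times the residual equals the expected gap between the true and fitted derivatives, so the shrinkage bias cancels in expectation. The sample standard deviation of the scores also gives the standard error used for per-edge inference.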

Datasets

Synthetic cascades from nine diffusion models (IC, LT, Exp, PL, SIS, SIR, Complex, Hawkes, nonlinear DGP) — size unspecified — simulation
COVID-19 case counts — 52 provinces in Spain — public health records
Inter-province mobility network — Spain — external mobility data (source not specified)

Baselines vs proposed

NetInf: Pearson correlation with true network < CascadeNet by up to 50 percentage points across multiple diffusion models
NetRate: Pearson correlation < CascadeNet by significant margins (exact numbers not specified)
DANI: Correlation with true diffusion network substantially lower than CascadeNet
Naive plug-in Jacobian: r=0.07 vs CascadeNet debiased estimator r=0.77 on tanh synthetic example
Baseline methods on COVID-19 data: networks recovered show no statistically significant correlation with mobility vs CascadeNet showing significant correlation

Limitations

The method assumes Markovian dynamics and exogeneity of covariates; extensions to endogenous covariates or non-Markovian processes are left for future work.
Cross-fitting and debiasing introduce computational overhead, which may limit scalability to very large networks or very long cascades.
The empirical COVID-19 application uses correlation with mobility data as proxy ground truth; causal validity of recovered edges is not directly evaluated.
The absorbing wrapper assumes irreversible adoption, which may not hold in all practical cascades where recovery or dropout states exist.
No explicit adversarial robustness or attacks on the network recovery procedure are studied, limiting security implications.
Code release and reproducibility details are not specified in the paper, limiting immediate adoption.

Open questions / follow-ons

How to extend the framework to handle endogenous covariates and leverage instrumental variables for network recovery under confounding?
Can the debiased Jacobian estimation approach be scaled efficiently to networks with thousands or millions of nodes?
What is the robustness of CascadeNet to adversarial cascades or active manipulation of observed outcomes?
How to integrate temporal heterogeneity and non-Markovian dynamics more fully, beyond state augmentations?

Why it matters for bot defense

For bot-defense or CAPTCHA practitioners, CascadeNet’s debiased framework exemplifies how to recover latent dependency networks from observed sequential outcome data without strong parametric assumptions, which parallels challenges in detecting coordinated bot behaviors from interaction cascades on platforms. The formal inference on edges with confidence intervals enables principled decisions about the presence and strength of influence (or coordinated) links, moving beyond heuristic or point estimates. This is relevant in adversarial detection where false positives/negatives can have costly consequences and uncertainty quantification is critical. The combination of flexible machine learning models with orthogonal debiasing techniques may inspire new approaches to building robust graph-based bot detection systems that are not brittle to model misspecification. However, security analysts should be aware that extensions to adversarial robustness and endogenous confounding are future work.

Cite

@article{arxiv2606_07483,
  title={Network Recovery from Cascade Data: A Debiased Jacobian-Based Machine Learning Approach},
  author={Lei Huang},
  journal={arXiv preprint arXiv:2606.07483},
  year={2026},
  url={https://arxiv.org/abs/2606.07483}
}
