TurboServe: Serving Streaming Video Generation Efficiently and Economically

Source: arXiv:2606.19271 · Published 2026-06-17 · By Youhe Jiang, Haoxu Wang, Haotong Bao, Kai Jiang, Jianfei Chen, Jun Zhu et al.

TL;DR

This paper addresses the emerging challenge of serving streaming video generation workloads, where users interact with long-lived sessions that progressively generate video chunks with real-time latency constraints. Unlike offline video generation or typical large language model (LLM) serving, this workload requires preserving session state over active and idle periods and repeatedly scheduling chunk generation within tight latency targets. Key challenges include session duration heterogeneity—some sessions last seconds while others last minutes—and temporal user-demand heterogeneity, where the number of active sessions sharply fluctuates. To address these, the authors present TurboServe, the first serving system designed specifically for streaming video generation that jointly optimizes session placement and GPU autoscaling in a closed loop. TurboServe uses migration-aware placement to rebalance load across GPUs and a load-driven autoscaling controller to adapt GPU resource allocation dynamically. Runtime optimizations such as coalesced chunk processing, GPU-CPU offloading, and NCCL-based GPU-GPU session migration enable efficient management of stateful sessions.

Empirical evaluation on real production traces from Shengshu Technology across multiple model sizes and clusters up to 64 NVIDIA B300 GPUs shows that TurboServe reduces worst-case per-chunk latency by 37.5% and total GPU operating cost by 37.2% on average compared to baselines. This demonstrates its ability to deliver stable, low-latency streaming video generation serving economically under dynamic workloads. The system and code are publicly released for further research and adoption.

Key findings

TurboServe reduces worst-case per-chunk latency by 37.5% compared to baseline serving with fixed placement and fixed GPU allocation.
TurboServe reduces total GPU operating cost by 37.2% on average compared to baselines when tested on real-world production streaming video generation traces.
Session migration alone reduces worst-case latency by 26.53% without increasing GPU cost, isolating the benefit of balanced session placement.
GPU autoscaling alone reduces total GPU cost by 32.57% while maintaining the same worst-case latency, isolating the benefit of dynamic resource provisioning.
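Joint migration and autoscaling reduces worst-case latency by 8.17% and GPU cost by 40.25% compared to the fixed baseline. It gives up part of the latency gain of migration alone in exchange for the largest cost saving, which shows that the two mechanisms are complementary.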
Coalesced chunk processing batches concurrent sessions on the same GPU to improve utilization while respecting latency targets.
GPU-CPU offloading frees GPU memory by suspending idle sessions’ state to host memory, allowing long-lived sessions without occupying GPU slots.
NCCL-based GPU-GPU migration moves sessions between GPUs online, which makes dynamic load balancing possible.

Threat model

The paper defines no security adversary. The closest analogue is ordinary workload dynamics: users who start, interact with, idle in, and end long-lived video generation sessions that produce streaming chunks under strict latency targets. These users are not assumed to disrupt the system maliciously; they stand in for workload variability and the cost of keeping session state alive. The system assumes no attempts to corrupt session state or to overload GPUs beyond their capacity constraints. The model is therefore operational rather than adversarial in a security sense, centered on workload heterogeneity and resource efficiency.

Methodology — deep read

The paper's setting is a multi-user, multi-GPU serving environment for streaming video generation under dynamic workloads. There is no explicit adversary: users interact with stateful sessions that generate video chunk by chunk under strict per-chunk latency targets. Two key assumptions are that each GPU worker can handle at most K concurrent active sessions without violating latency, and that session state can be offloaded or migrated efficiently.

Data provenance involves replaying production workload traces collected from Shengshu Technology, featuring arrival/departure of sessions and transitions between active and idle periods. These traces capture session duration heterogeneity (ranging from seconds to several minutes) and temporal fluctuations in concurrent active sessions, allowing realistic evaluation. Multiple video generation model sizes and GPU cluster sizes (up to 64 NVIDIA B300 GPUs) are evaluated.

The core architecture is a closed-loop scheduling system combining two tightly coupled controllers: (1) a migration-aware placement controller that decides per-event session-to-GPU assignments and rebalances sessions to minimize the maximum per-chunk latency, and (2) an autoscaling controller that dynamically adjusts the number of provisioned GPUs based on runtime load feedback (GPU utilization, latency, session counts). Session placement decisions consider minimizing worst-case per-GPU latency subject to GPU capacity constraints (max K sessions per GPU) and ensuring active user sessions run on GPUs (not suspended).
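The placement side of this loop can be sketched as follows. This is a minimal illustration, not the authors' algorithm: it assumes that the number of active sessions on a GPU is a reasonable proxy for its per-chunk latency, and the names `place_session`, `rebalance`, and `max_moves` (a crude stand-in for a migration-overhead budget) are hypothetical.

```python
def place_session(load, k):
    """Pick a GPU for a newly active session.

    load: list where load[g] is the number of active sessions on GPU g.
    k:    maximum concurrent active sessions per GPU (the paper's K).
    Returns the index of the least-loaded GPU with a free slot, or None
    if every GPU is full (a cue for the autoscaler to add capacity).
    """
    candidates = [g for g, n in enumerate(load) if n < k]
    if not candidates:
        return None
    return min(candidates, key=lambda g: load[g])


def rebalance(load, max_moves):
    """Greedy min-max rebalancing (assumed heuristic, not from the paper).

    Repeatedly moves one session from the most loaded GPU to the least
    loaded one until the gap is at most 1 or the move budget is spent.
    Returns the new load vector and the list of (src, dst) moves.
    """
    load = list(load)  # do not mutate the caller's list
    moves = []
    for _ in range(max_moves):
        src = max(range(len(load)), key=lambda g: load[g])
        dst = min(range(len(load)), key=lambda g: load[g])
        if load[src] - load[dst] <= 1:
            break  # already balanced; further moves cannot lower the max
        load[src] -= 1
        load[dst] += 1
        moves.append((src, dst))
    return load, moves
```

Calling `rebalance([6, 1, 2], max_moves=10)` evens the three GPUs out to three sessions each using three moves. A real controller would also weigh each move's transfer cost against the latency it saves.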

At runtime, the system uses coalesced chunk processing which batches chunk generation requests across concurrent active sessions assigned to the same GPU to improve utilization without violating latency constraints. GPU-CPU offloading is used to suspend idle sessions’ state to host memory, freeing GPU slots. For migration, sessions’ context is moved between GPUs using NCCL to enable load balancing without disrupting real-time latency guarantees. The closed-loop scheduler is invoked on system events (session arrival, departure, active/idle transitions) to adapt placement and GPU counts.
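Coalesced chunk processing can be illustrated with a small grouping routine. The sketch below assumes that pending chunk requests arrive as `(session_id, gpu_id)` pairs and that a fixed `max_batch` size stands in for whatever limit keeps a batch within its latency target; both are illustrative assumptions, and the function name `coalesce` is hypothetical.

```python
from collections import defaultdict


def coalesce(requests, max_batch):
    """Group pending chunk requests into per-GPU batches.

    requests:  iterable of (session_id, gpu_id) pairs, in arrival order.
    max_batch: assumed upper bound on sessions served in one batched step.
    Returns {gpu_id: [batch, ...]} where each batch is a list of session ids.
    """
    per_gpu = defaultdict(list)
    for session_id, gpu_id in requests:
        per_gpu[gpu_id].append(session_id)
    # Split each GPU's queue into consecutive batches of at most max_batch.
    return {
        gpu: [ids[i:i + max_batch] for i in range(0, len(ids), max_batch)]
        for gpu, ids in per_gpu.items()
    }
```

Sessions on the same GPU share one batched generation step, while sessions on different GPUs never share a batch, which is why placement and coalescing interact.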

No model training is involved. The evaluation runs the scheduling system live on replayed traces on a real GPU cluster. Metrics include worst-case per-chunk latency across sessions within time windows, total GPU operating cost (in cloud-price-equivalent dollars), and GPU utilization. Ablations isolate the impact of migration only, autoscaling only, and the joint strategy against a fixed baseline.
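The two headline metrics are simple to compute from a trace. The following is a sketch under assumptions: the window length, the `(timestamp, latency)` sample format, and the per-GPU-hour price are placeholders, since the summary does not state the exact definitions used in the paper.

```python
def worst_case_latency_per_window(samples, window_s):
    """Maximum per-chunk latency observed in each time window.

    samples:  iterable of (timestamp_s, latency_s) for generated chunks.
    window_s: window length in seconds (assumed; not given in the summary).
    Returns {window_index: worst latency in that window}.
    """
    worst = {}
    for t, latency in samples:
        w = int(t // window_s)
        worst[w] = max(worst.get(w, 0.0), latency)
    return worst


def gpu_cost_usd(gpu_count_intervals, price_per_gpu_hour):
    """Total GPU operating cost for a piecewise-constant GPU count.

    gpu_count_intervals: iterable of (duration_s, gpu_count).
    price_per_gpu_hour:  placeholder cloud price, not a figure from the paper.
    """
    gpu_hours = sum(d * n for d, n in gpu_count_intervals) / 3600
    return gpu_hours * price_per_gpu_hour
```

Under this definition, autoscaling lowers cost by shrinking `gpu_count` during quiet intervals, and migration lowers the per-window maximum without changing the cost term.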

The evaluation demonstrates that static placement and fixed GPU provisioning lead to latency imbalances and resource inefficiencies. Adding session migration lowers maximum latency by rebalancing load without adding GPUs, and autoscaling cuts wasted GPU cost while preserving latency. The joint scheme provides the best tradeoffs across load conditions. Experimental results are detailed in figures such as Fig. 3, which shows latency and cost over time, and Fig. 4, which quantifies each approach's improvement. The code and system design are publicly released, but exact reproduction depends on access to proprietary workload traces from Shengshu Technology.

Technical innovations

Formulation of streaming video generation serving as an online joint scheduling problem coordinating session placement and GPU autoscaling under strict per-chunk latency constraints.
A migration-aware placement controller using event-driven min-max rebalancing that reassigns sessions to reduce worst-case per-GPU latency while accounting for migration overhead.
A load-driven autoscaling controller that adapts GPU cluster size based on runtime load and latency feedback to optimize cost-latency tradeoffs.
Runtime coalesced chunk processing that batches concurrent chunk generation requests on the same GPU to improve throughput without breaching latency targets.
GPU-CPU offloading and NCCL-based GPU-GPU migration mechanisms that enable efficient suspension/resumption and dynamic session migration maintaining session state for long-lived streaming workloads.
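The load-driven autoscaling idea listed above can be sketched as a single control step. This is an assumed policy for illustration only: it uses the active session count as the sole load signal (the paper also mentions utilization and latency feedback), and `headroom`, `min_gpus`, `max_gpus`, and the one-GPU-per-step scale-down rule are hypothetical choices.

```python
import math


def target_gpu_count(active_sessions, k, current,
                     headroom=0.8, min_gpus=1, max_gpus=64):
    """Return the GPU count to provision for the next control step.

    active_sessions: sessions currently generating chunks.
    k:               per-GPU capacity (the paper's K).
    current:         GPUs provisioned right now.
    headroom:        assumed target fill ratio, leaving slack for bursts.
    """
    needed = math.ceil(active_sessions / (k * headroom)) if active_sessions else 0
    needed = max(min_gpus, min(max_gpus, needed))
    if needed > current:
        return needed  # scale up at once to protect latency
    # Scale down gradually (one GPU per step) to avoid oscillation.
    return max(needed, current - 1)
```

Scaling up immediately but scaling down slowly is a common asymmetry in autoscalers; whether TurboServe uses it is not stated in this summary.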

Datasets

Shengshu Technology streaming video generation workload traces — multi-minute session durations with detailed session state transitions — proprietary production traces

Baselines vs proposed

Baseline (fixed placement and GPU allocation): worst-case latency = 0.71s, cost = $3.99
Migration only (fixed GPU budget + periodic session migration): latency reduced by 26.53%, cost unchanged
Autoscaling only (dynamic GPU provisioning without migration): cost reduced by 32.57%, latency maintained ≤ 0.71s
Joint migration + autoscaling: latency reduced by 8.17%, cost reduced by 40.25% compared to baseline
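The percentages above can be converted to approximate absolute values from the baseline figures. The numbers produced below are derived arithmetic, not values reported by the paper, and the autoscaling-only latency is an upper bound because the source only states that it stays at or below 0.71 s.

```python
BASE_LATENCY_S = 0.71  # baseline worst-case latency from the list above
BASE_COST_USD = 3.99   # baseline GPU cost from the list above

# (latency reduction %, cost reduction %) as listed above
REDUCTIONS = {
    "migration only": (26.53, 0.0),
    "autoscaling only": (0.0, 32.57),  # latency is only bounded, not reduced
    "joint": (8.17, 40.25),
}


def apply_reduction(value, percent):
    """Scale a value down by the given percentage."""
    return value * (1 - percent / 100)


absolute = {
    name: (round(apply_reduction(BASE_LATENCY_S, lat), 3),
           round(apply_reduction(BASE_COST_USD, cost), 2))
    for name, (lat, cost) in REDUCTIONS.items()
}
# absolute["migration only"]   -> (0.522, 3.99)
# absolute["autoscaling only"] -> (0.71, 2.69)
# absolute["joint"]            -> (0.652, 2.38)
```

Read this way, the joint configuration lands at roughly 0.65 s and $2.38, against roughly 0.52 s at full cost for migration alone.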

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.19271.

Fig 1

Fig 1: Stateless isolated generation jobs (top) compared with stateful streaming generation sessions (bottom) in

Limitations

Evaluation relies on proprietary production traces from one company, limiting generalization and reproducibility without access to this data.
The study focuses on average and worst-case latency and GPU cost but does not quantify user-level QoE impacts or throughput metrics explicitly.
Assumes GPUs have fixed capacity K and does not evaluate impact of varying model architectures or heterogeneous GPU types.
Does not explore adversarial usage patterns or robustness of scheduling under malicious workload injections.
Migration overhead is modeled with an α-β model, but real-world migration costs may vary with session state size and network conditions.
Autoscaling controller tuning parameters may require customization for different workloads; stability under rapid demand spikes is not elaborated.
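For readers unfamiliar with the α-β model mentioned in the list above, it is the standard communication cost model in which a transfer takes a fixed startup time α plus a per-byte time β multiplied by the message size. The sketch below shows that formula and one hypothetical way a scheduler could use it; the decision rule `worth_migrating` and all parameter values are assumptions, not taken from the paper.

```python
def migration_time_s(state_bytes, alpha_s, beta_s_per_byte):
    """Alpha-beta estimate of the time to move one session's state.

    alpha_s:         fixed startup latency per transfer, in seconds.
    beta_s_per_byte: transfer time per byte (inverse of bandwidth).
    """
    return alpha_s + beta_s_per_byte * state_bytes


def worth_migrating(latency_gain_s, state_bytes, alpha_s, beta_s_per_byte):
    """Hypothetical rule: migrate only if the expected latency gain
    exceeds the estimated transfer time."""
    return latency_gain_s > migration_time_s(state_bytes, alpha_s, beta_s_per_byte)
```

Because the β term grows linearly with state size, a fixed α and β fitted on one model size or network may misestimate costs elsewhere, which is the limitation noted above.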

Open questions / follow-ons

How does TurboServe perform under adversarial or bursty workloads designed to stress migration and autoscaling controllers?
Can the scheduling framework be extended to heterogeneous GPU clusters with diverse architectures and capacities?
How sensitive is the system to autoscaling controller parameters and latency-cost tradeoff weights in widely varying real-world environments?
Can TurboServe incorporate predictive workload modeling to proactively schedule and provision resources ahead of demand spikes?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, TurboServe highlights key challenges and solutions around efficiently serving long-lived, stateful streaming workloads under tight latency constraints with fluctuating user demand. The migration-aware session placement and elastic GPU autoscaling principles could inform designs for scalable interactive CAPTCHA challenge generation or dynamic bot-challenge video content that must be served with low latency and cost efficiency. Understanding session state suspension, chunk coalescing, and migration tradeoffs may also help in defending against bots that exploit session persistence or latency variability.

While TurboServe targets video generation, the scheduling framework and closed-loop control concepts generalize to any multi-session interactive AI service where workload heterogeneity and temporal burstiness affect resource utilization and latency. Practitioners can draw from these insights when architecting CAPTCHA and bot-defense systems that require real-time streaming outputs and efficient GPU resource management.

Cite

@article{arxiv2606_19271,
  title={TurboServe: Serving Streaming Video Generation Efficiently and Economically},
  author={Youhe Jiang and Haoxu Wang and Haotong Bao and Kai Jiang and Jianfei Chen and Jun Zhu and Fangcheng Fu and Jintao Zhang},
  journal={arXiv preprint arXiv:2606.19271},
  year={2026},
  url={https://arxiv.org/abs/2606.19271}
}
