
AI-Driven Multi-Region Provisioning for Cloud Services Using Spot Fleets

Source: arXiv:2605.22778 · Published 2026-05-21 · By Javier Fabra, Enrique Molina-Giménez, Pedro García-López

TL;DR

This paper addresses the challenge of provisioning large spot instance fleets in cloud environments when pricing, availability, and interruption rates are dynamic and vary by region. Existing services such as the AWS EC2 Spot Service offer allocation strategies limited to a single region and lack the ability to estimate fleet costs before deployment, which hinders cost-aware, multi-region orchestration.

The authors propose an AI-driven provisioning service that leverages monitoring of the EC2 Spot Service combined with LSTM-based predictive models to estimate fleet configurations and prices prior to deployment across multiple regions. This approach preserves EC2 Spot Service compatibility while enabling users to make region-aware provisioning decisions and optimize costs.

The system was validated on AWS with fleets of up to 1500 vCPUs across nine regions. It achieved 99.79% prediction accuracy against EC2 Spot Service plans and demonstrated potential cost savings of up to 64% by exploiting regional price differences. The work highlights the value of predictive intelligence for cloud service orchestration at scale.

Key findings

  • The prediction model achieves 99.79% accuracy compared to EC2 Spot Service recommendations across 720 evaluated fleets (Table 1).
  • 92.78% of predicted provisioning plans exactly match the EC2 plans; the mismatched plans deviate in price by only 2.97% on average.
  • Multi-region provisioning can yield up to 64% cost savings by exploiting price variability across 9 AWS regions (Fig 4).
  • Price differences between regions remain significant for fleets of 64, 512, and 1500 vCPUs, reaching a factor of up to 2.77×.
  • Within a single region, changing the allocation strategy changes the price by less than 10%, so the choice of region matters more than the choice of strategy.
  • The lowest-price allocation strategy has a very high eviction probability (~94.9% after 1 hour), whereas capacity-optimized is more stable.
  • No provisioning requests recommended by the predictive service were rejected by AWS, preserving feasibility.
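
The headline figures above can be cross-checked with a few lines of arithmetic. The sketch below rests on an assumption that the summary does not state: that "accuracy" means average price agreement, with exact-match plans contributing zero deviation and mismatched plans contributing the reported mean deviation. It also assumes the maximum saving corresponds to moving from the most expensive region to the cheapest at the largest reported price ratio.

```python
# ASSUMPTION: accuracy = 1 - (mismatch rate x mean price deviation of mismatches).
# This is one reading that reconciles the reported numbers, not the paper's stated definition.
exact_match_rate = 0.9278    # share of predicted plans identical to the EC2 plan
mismatch_deviation = 0.0297  # mean price deviation among mismatched plans
max_price_ratio = 2.77       # largest reported price ratio between regions

mismatch_rate = 1.0 - exact_match_rate
mean_deviation = mismatch_rate * mismatch_deviation
price_accuracy = 1.0 - mean_deviation

# Moving from the most expensive region to the cheapest saves 1 - 1/ratio.
max_saving = 1.0 - 1.0 / max_price_ratio

print(f"mismatch rate:  {mismatch_rate:.2%}")
print(f"price accuracy: {price_accuracy:.2%}")
print(f"max saving:     {max_saving:.1%}")
```

Under that reading the script prints 7.22%, 99.79%, and 63.9%, which agrees with the mismatch rate, accuracy, and rounded 64% saving quoted in this summary.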

Threat model

The threat model is not adversarial in the security sense. It concerns uncertainty in spot instance markets, where pricing, availability, and interruption risk fluctuate under the control of the cloud provider and market forces. The user cannot manipulate AWS pricing or availability and must cope with incomplete information. The system assumes no ability to alter the underlying cloud platform's behavior, only to predict it.

Methodology — deep read

  1. Threat Model & Assumptions: The adversary is implicit: cloud market dynamics create uncertainty through fluctuating prices, availability, and interruption risk across regions, so users need accurate pre-launch provisioning estimates. The system assumes black-box access to the EC2 Spot Service APIs and no ability to influence AWS internal behavior.

  2. Data: The system monitors provisioning plans by issuing 720 fleet requests per region over 9 AWS regions, covering capacities from 64 to 1500 vCPUs during a 90-day period. Metadata collected includes timestamps, region, instance types, allocation strategies, realized prices, and fleet composition. Data is stored in AWS S3.

  3. Architecture / Algorithm: The key novel component is a set of region- and allocation-strategy-specific Long Short-Term Memory (LSTM) models trained on temporal data of past provisioning requests. Inputs are past provisioning plans with temporal dependencies; outputs are fleet configuration and estimated cost predictions. The modular architecture includes monitoring (data collection via EC2 API probes), prediction (LSTM models), auditing (validates model predictions vs real outcomes), and API nodes providing recommendations.

  4. Training Regime: Models are trained individually per region and strategy using AWS SageMaker. Epoch counts, batch sizes, and optimizer settings are not reported. Training uses the temporal dependencies in the provisioning time series to forecast the plan EC2 would return before a request is submitted.

  5. Evaluation Protocol: Over one week of live evaluation (Jan 24–30, 2025), 720 experiments were conducted, launching 10 fleets every 3 hours with random sizes and strategies across 9 regions. Metrics include mismatch rate versus EC2 Spot Service provisioning, price deviation, provisioning success (request acceptance), and estimated cost savings across regions. No statistical significance tests or cross-validation are mentioned.

  6. Reproducibility: The implementation is deployed on AWS infrastructure using standard EC2 APIs and SageMaker. Whether the models and full dataset will be released is unknown; the training data resides on S3. The approach is architecture-general and could extend to other clouds.
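
The two agreement metrics named in the evaluation protocol (mismatch rate and price deviation) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the `Plan` type, the `evaluate` function, the plan contents, and the prices are all hypothetical, and relative price deviation is assumed to be measured only over mismatched plans.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Plan:
    """Hypothetical provisioning plan: instance type -> count, plus hourly price."""
    instances: Dict[str, int]
    hourly_price: float


def evaluate(predicted: List[Plan], actual: List[Plan]) -> Dict[str, float]:
    """Compare predicted plans with the plans EC2 actually returned."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")

    deviations = []  # relative price deviation of each mismatched plan
    for pred, real in zip(predicted, actual):
        if pred.instances != real.instances:
            deviations.append(abs(pred.hourly_price - real.hourly_price) / real.hourly_price)

    mismatch_rate = len(deviations) / len(predicted)
    mean_dev = sum(deviations) / len(deviations) if deviations else 0.0
    return {"mismatch_rate": mismatch_rate, "mean_mismatch_deviation": mean_dev}


# Made-up example: two 64 vCPU requests, one predicted exactly, one not.
predicted = [Plan({"c6i.large": 32}, 1.00), Plan({"c7i.xlarge": 16}, 1.10)]
actual = [Plan({"c6i.large": 32}, 1.00), Plan({"c6i.xlarge": 16}, 1.00)]
metrics = evaluate(predicted, actual)
print(metrics)
```

Treating a plan as matching only when its full instance composition is identical is the strictest possible criterion; a looser one (for example, same instance family) would report a lower mismatch rate.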

A concrete example: for a request to provision a 1500 vCPU fleet with a capacity-optimized strategy, the prediction model takes recent historic provisioning data from that region and outputs an estimated fleet composition and cost before submitting the real EC2 request. Agreement with EC2's actual allocation is measured, and the cost estimate guides whether to deploy in that region or elsewhere for savings.
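
The final decision step in this example, choosing where to deploy once per-region cost estimates are available, might look like the sketch below. The function name `pick_region`, the region list, and the hourly prices are hypothetical; in the described system the estimates would come from the per-region LSTM models rather than a hard-coded dictionary.

```python
from typing import Dict, Tuple


def pick_region(predicted_cost: Dict[str, float], home_region: str) -> Tuple[str, float]:
    """Return the cheapest region and the fractional saving versus home_region."""
    if home_region not in predicted_cost:
        raise KeyError(f"no prediction for home region {home_region!r}")
    best = min(predicted_cost, key=predicted_cost.get)
    saving = 1.0 - predicted_cost[best] / predicted_cost[home_region]
    return best, saving


# Hypothetical hourly cost estimates (USD) for one 1500 vCPU fleet request.
estimates = {"us-east-1": 30.0, "eu-west-1": 24.0, "ap-south-1": 18.0}
region, saving = pick_region(estimates, home_region="us-east-1")
print(region, f"{saving:.0%}")
```

A production selector would likely also weigh data-transfer cost, latency, and interruption risk, none of which this sketch models.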

Technical innovations

  • Use of region- and strategy-specific LSTM models to predict spot fleet provisioning plans and costs before deployment, capturing temporal dependencies.
  • Integration of monitoring, prediction, auditing, and API modules in a modular service overlay compatible with AWS EC2 Spot Service.
  • Multi-region provisioning approach enabling cost-aware deployment decision-making that leverages regional spot price variability absent in native spot services.
  • Feedback-driven auditing module to retrain models when prediction error thresholds are exceeded, preserving alignment with cloud market dynamics.
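
The feedback-driven auditing idea can be illustrated with a rolling error monitor. This is a sketch under stated assumptions: the summary does not say which error measure, threshold, or window the authors use, so the `Auditor` class, the 5% threshold, and the window size are all placeholders.

```python
from collections import deque


class Auditor:
    """Tracks recent prediction errors and flags when retraining looks warranted."""

    def __init__(self, threshold: float = 0.05, window: int = 20):
        self.threshold = threshold          # hypothetical tolerated mean relative error
        self.errors = deque(maxlen=window)  # rolling window of recent relative errors

    def record(self, predicted_price: float, actual_price: float) -> bool:
        """Store one observation; return True if the rolling mean error exceeds the threshold."""
        self.errors.append(abs(predicted_price - actual_price) / actual_price)
        return sum(self.errors) / len(self.errors) > self.threshold


auditor = Auditor(threshold=0.05, window=3)
first = auditor.record(10.0, 10.0)   # perfect prediction, rolling mean error is 0
second = auditor.record(10.0, 12.5)  # 20% error, rolling mean error rises to 10%
print(first, second)
```

Averaging over a window rather than reacting to a single bad prediction avoids retraining on isolated outliers, at the cost of a slower response to a genuine market shift.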

Datasets

  • AWS Spot Fleet provisioning logs — 720 fleets per region × 9 regions — collected over 90 days, including 64 to 1500 vCPU sizes, stored in AWS S3.

Baselines vs proposed

  • EC2 Spot Service allocation plans: 7.22% of predicted plans differ from the EC2 plan; the reported prediction accuracy is 99.79%.
  • Within-region allocation strategies differ by <10% in price, whereas the proposed multi-region approach enables savings of up to 64%.
  • Eviction probability after 1 hour: ~94.9% for the lowest-price strategy and substantially lower for capacity-optimized, so the trade-off between price and stability still applies.

Limitations

  • Evaluation limited to AWS EC2 spot instances and x86 c6i/c7i instance families, not extended to other clouds or instance types.
  • No detailed adversarial evaluation or robustness tests under market shocks or attacks on provisioning APIs.
  • The 7-day evaluation period after training may not capture longer-term temporal variability or seasonality.
  • No reported uncertainty quantification or confidence intervals in predictions.
  • Training hyperparameters and model architecture details are not fully disclosed, limiting reproducibility.
  • Possible overfitting to the 9 specific AWS regions monitored; generalization to regions outside this set, including future ones, is untested.

Open questions / follow-ons

  • How do predictions and savings scale with more diverse instance families or GPU/accelerator types?
  • Can the approach be generalized effectively to multi-cloud scenarios combining different providers?
  • How robust are the LSTM models to sudden market disruptions or changes in cloud provider policies?
  • What is the impact of integrating predictive provisioning at the application orchestration level on end-to-end service QoS and latency?

Why it matters for bot defense

For bot-defense or CAPTCHA practitioners, this work is centered on cloud spot instance provisioning, but its core insight, predictive modeling of uncertain and dynamic infrastructure allocation, can inform scale-out decisions for bot detection. Accurate multi-region cost estimates make it possible to place compute-intensive defenses or CAPTCHA generation systems with explicit cost-performance trade-offs. Integrating predictive provisioning APIs into service orchestration could allow anti-bot infrastructure to scale automatically across geographies, optimizing cost while maintaining service resilience.

Furthermore, the demonstrated benefits of multi-region, AI-driven provisioning highlight the importance of anticipating variable infrastructure costs and availability in real-time defense workflows. Capturing temporal and spatial variability in resource pricing could inspire analogous techniques in defending against distributed automated traffic that utilizes global cloud infrastructure. However, this paper does not address security threats directly or adversarial evasion, so the relevance is primarily architectural and operational rather than detection methodology.

Cite

bibtex
@article{arxiv2605_22778,
  title={AI-Driven Multi-Region Provisioning for Cloud Services Using Spot Fleets},
  author={Javier Fabra and Enrique Molina-Giménez and Pedro García-López},
  journal={arXiv preprint arXiv:2605.22778},
  year={2026},
  url={https://arxiv.org/abs/2605.22778}
}


Articles are CC BY 4.0 — feel free to quote with attribution