Joint Beamforming and Antenna Placement Optimization in Pinching Antenna Systems with User Mobility: A Deep Reinforcement Learning Approach

Source: arXiv:2605.08039 · Published 2026-05-08 · By Ali Amhaz, Mohamed Elhattab, Chadi Assi, Sanaa Sharafeddine

TL;DR

This paper addresses a gap in the pinching antenna systems (PASS) literature: virtually all prior work assumes users are stationary or quasi-static, yet the optimal placement of pinching points along dielectric waveguides is a direct function of the user's real-time position. The authors set up a downlink scenario where a base station equipped with N parallel waveguides, each carrying Pn reconfigurable pinching antennas, must continuously reposition those antennas and recompute a beamforming vector as a single mobile user moves according to a Random Waypoint (RWP) model and experiences stochastic LoS/NLoS blockage transitions modeled via a distance-dependent Bernoulli process. The joint optimization over continuous-valued beamforming weights and continuous antenna positions is non-convex with coupled variables, ruling out closed-form or standard convex solvers.

To solve this, the authors frame the problem as a Markov Decision Process and apply Deep Deterministic Policy Gradient (DDPG), an actor-critic reinforcement learning algorithm suited for continuous action spaces. The state encodes the previous action, beamforming power, in-waveguide propagation channels, and wireless channels to the user. The reward combines instantaneous achievable rate with penalty terms for QoS violations and geometric constraint violations (minimum inter-antenna spacing, antennas outside waveguide bounds). Power budget feasibility is enforced via a normalization step on the beamforming vector rather than through the reward.

Simulation results (150 Monte Carlo realizations, Table I parameters) show that the DDPG-optimized mobile pinching configuration consistently outperforms a fixed-pinching-location benchmark across all tested BS transmit power levels (15–35 dBm) and that blockage density β meaningfully degrades performance, with β=0.1 noticeably reducing average sum rate relative to β=0.05. Convergence is reported within roughly 250 episodes (Fig. 2 reaches ~10 bps/Hz). Fig. 4 qualitatively confirms that the learned policy tracks the user's trajectory along the waveguides and tightens antenna clustering under stricter QoS thresholds.

Key findings

The DDPG-controlled movable pinching configuration outperforms a fixed-pinching-location baseline at every BS power level tested (15–35 dBm), with the gap visually largest around 20–25 dBm in Fig. 3 (exact dB deltas not numerically tabulated in the paper).
Increasing blockage density from β=0.05 to β=0.1 produces a visible drop in average sum rate for the movable scheme in Fig. 3; β=0.15 causes further degradation, confirming that the probabilistic LoS/NLoS model is a material performance driver.
The DDPG agent converges to a stable average reward of approximately 10 bps/Hz within ~250 training episodes under the default parameter set (N=2 waveguides, Pn=4 antennas each, PBS=20 dBm, β=0.05), as shown in Fig. 2.
Fig. 4 demonstrates that under a stricter QoS threshold (Rth=2 bps/Hz vs. Rth=1 bps/Hz), the learned policy clusters pinching antennas more tightly around the user's instantaneous position, indicating the policy adapts its spatial aggressiveness to QoS requirements.
Post-deployment inference complexity scales as O(WLNe), which is linear in the number of time steps, layers, and neurons. The authors argue this is lower than the SCA-based alternative, whose complexity includes a Z^4.5 term in the problem dimension Z and an iterative search factor Nitr.
The simulation assumes reconfiguration time is negligible relative to the 1-second control interval, justified by a maximum user speed of 5 m/s and correspondingly small required per-step antenna displacements.
Results are averaged over 150 Monte Carlo realizations of the RWP mobility and blockage process, providing at least moderate statistical stability for the reported curves.

Methodology — deep read

THREAT MODEL & ASSUMPTIONS: This is not a security paper; there is no adversarial threat model. The key engineering assumption is that the BS has perfect instantaneous channel state information (CSI) at each time slot, covering both the in-waveguide propagation channels gn and the wireless channel vectors hn,k. User mobility follows the Random Waypoint model (speed and waypoints drawn stochastically), and blockage is modeled per link via an independent Bernoulli variable whose LoS probability decays exponentially with distance: Pr(LoS)=exp(−β·d). The system is downlink single-user; no inter-user interference is modeled. In-waveguide propagation is assumed lossless.
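The blockage model above can be sketched in a few lines. This is a minimal illustration of Pr(LoS)=exp(−β·d) with independent per-link Bernoulli draws; the function names, the example distances, and the seed are hypothetical and not taken from the paper.

```python
import numpy as np

def los_probability(distance_m, beta):
    """Pr(LoS) = exp(-beta * d), the distance-dependent model described above."""
    return np.exp(-beta * np.asarray(distance_m, dtype=float))

def sample_los(distances_m, beta, rng):
    """Draw one independent Bernoulli LoS indicator per link (True means LoS)."""
    p = los_probability(distances_m, beta)
    return rng.random(p.shape) < p

rng = np.random.default_rng(0)            # arbitrary seed, for repeatability only
distances = np.array([10.0, 15.0, 25.0])  # hypothetical PA-to-user distances in metres

# Compare the three blockage densities used in the summary.
for beta in (0.05, 0.10, 0.15):
    print(beta, np.round(los_probability(distances, beta), 3))

los_flags = sample_los(distances, 0.05, rng)  # one boolean per link
```

A larger β lowers the LoS probability at every distance, which is consistent with the rate degradation the summary reports for denser blockage.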

DATA & ENVIRONMENT: There is no real-world dataset. All results come from synthetic simulation using the parameters in Table I: 2 waveguides (N=2), each 100 m long at heights A1=A2=10 m, separated by y=3 m, each carrying Pn=4 pinching antennas. Carrier wavelength λ=11.1 mm (≈27 GHz mmWave), path-loss exponent α=3.9, noise PSD σ²=−174 dBm/Hz, PBS=20 dBm baseline. The RWP user moves in a 25×25 m area at up to 5 m/s. Episodes consist of T time steps (exact T not explicitly stated in the text). 150 Monte Carlo runs are performed for each configuration.
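A Random Waypoint trajectory of the kind described above might be generated as follows. This is a sketch under stated assumptions: the 25 m square area, 5 m/s speed cap, and 1-second step come from the summary, while the minimum speed, the absence of pause times, and the function name `random_waypoint` are illustrative choices that the paper does not specify.

```python
import numpy as np

def random_waypoint(num_steps, area_m=25.0, v_max=5.0, dt=1.0, v_min=0.1, rng=None):
    """Return a (num_steps, 2) array of user positions under a simple RWP model.

    Assumptions (not from the paper): no pause time, speed drawn uniformly
    from [v_min, v_max] for each leg, waypoints uniform over the square.
    """
    if rng is None:
        rng = np.random.default_rng()
    pos = rng.uniform(0.0, area_m, size=2)
    target = rng.uniform(0.0, area_m, size=2)
    speed = rng.uniform(v_min, v_max)
    track = np.empty((num_steps, 2))
    for t in range(num_steps):
        track[t] = pos
        to_target = target - pos
        dist = np.linalg.norm(to_target)
        step = speed * dt
        if dist <= step:
            # Waypoint reached within this step: snap to it and pick a new leg.
            pos = target
            target = rng.uniform(0.0, area_m, size=2)
            speed = rng.uniform(v_min, v_max)
        else:
            pos = pos + to_target / dist * step
    return track

track = random_waypoint(60, rng=np.random.default_rng(0))
per_step_move = np.linalg.norm(np.diff(track, axis=0), axis=1)
print(track.shape, float(per_step_move.max()))
```

With a 1-second step and a 5 m/s cap, the user never moves more than 5 m between control decisions, which is the displacement bound the negligible-reconfiguration-time argument relies on.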

ARCHITECTURE & ALGORITHM: The solution uses DDPG with an actor-critic architecture in which the actor and the critic each have a training network and a target network (four networks total). The actor μ maps state s(t) to action a(t), which is a concatenated vector of the complex beamforming vector wk (dimension N=2) and all PA x-axis positions (N×Pn=8 continuous scalars). The critic Q takes (state, action) as joint input and outputs a scalar Q-value. Hidden layer sizes L1 and L2 are referenced symbolically in the complexity analysis, but their exact neuron counts are not specified in the paper. The state vector s(t) includes the previous action a(t−1), the beamforming power ‖wk(t−1)‖², the in-waveguide channel vectors g1…gN, and the wireless channel vectors h1,k…hN,k, which effectively gives the agent full current channel knowledge plus action memory. The reward (Eq. 16) is the instantaneous rate Rk(t) minus weighted penalties for QoS violation (pen1), out-of-bounds pinching (pen2), and inter-antenna spacing violation (pen3). The power constraint is not penalized but enforced by normalizing wk to satisfy ‖wk‖²≤PBS (Eq. 20–21) before executing the action in the environment. Soft updates with τ=0.001 move the two target networks toward their training counterparts.
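The penalty-augmented reward and the power normalization can be illustrated as below. This is a hedged sketch, not the paper's Eq. 16 or Eq. 20–21: the penalty weights are unreported, so the unit weights are placeholders; the minimum spacing value in the usage lines is hypothetical; and scaling the beamformer only when it exceeds the budget is one plausible reading of the normalization step.

```python
import numpy as np

def dbm_to_watts(p_dbm):
    """Convert a power level in dBm to watts."""
    return 10.0 ** ((p_dbm - 30.0) / 10.0)

def normalize_beamformer(w, p_max_w):
    """Scale w so that ||w||^2 <= p_max_w; leave it unchanged if already feasible."""
    power = np.vdot(w, w).real
    if power > p_max_w:
        w = w * np.sqrt(p_max_w / power)
    return w

def reward(rate, r_th, positions, length_m, min_gap_m, pen=(1.0, 1.0, 1.0)):
    """Rate minus indicator-gated penalties (pen values are placeholders).

    positions has shape (N, Pn): x-coordinates of the PAs on each waveguide.
    """
    positions = np.asarray(positions, dtype=float)
    qos_violated = rate < r_th
    out_of_bounds = np.any((positions < 0.0) | (positions > length_m))
    gaps = np.diff(np.sort(positions, axis=1), axis=1)  # spacing along each waveguide
    too_close = np.any(gaps < min_gap_m)
    return rate - pen[0] * qos_violated - pen[1] * out_of_bounds - pen[2] * too_close

p_bs = dbm_to_watts(20.0)  # 20 dBm baseline from Table I
w = normalize_beamformer(np.array([1 + 1j, 2 - 1j]), p_bs)
pas = np.array([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]])
print(np.vdot(w, w).real, reward(3.0, 1.0, pas, 100.0, 0.0055))
```

Handling the power budget by projection keeps every executed action feasible, so the agent does not have to learn that constraint from the reward signal.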

TRAINING REGIME: The agent trains for M=250 episodes (inferred from Fig. 2 x-axis). Each episode resets user location, channel conditions, and beamforming initialization. Actions are generated by the actor, stored as (s,a,r,s') tuples in a replay buffer F, and a random minibatch B is sampled to compute TD-error loss (Eq. 13) for critic updates and deterministic policy gradient for actor updates. Exact batch size B, replay buffer capacity, learning rates, activation functions, and random seed strategy are not reported, which is a reproducibility gap. The discount factor η is mentioned but its value is not given in the paper.
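The replay and update mechanics described above follow the standard DDPG recipe, which can be sketched without a deep learning framework. The buffer capacity, batch size, and discount factor used here are placeholders because the paper does not report them; network parameters are represented as plain dictionaries of arrays, and the forms shown are the textbook ones rather than a transcription of Eq. 13.

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next) tuples; oldest entries drop first."""
    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.storage.append((s, a, r, s_next))

    def sample(self, batch_size):
        batch = random.sample(list(self.storage), batch_size)
        s, a, r, s_next = map(np.array, zip(*batch))
        return s, a, r, s_next

def td_targets(rewards, next_q, eta):
    """y = r + eta * Q'(s', mu'(s')), where next_q comes from the target networks."""
    return rewards + eta * next_q

def critic_loss(q_pred, targets):
    """Mean squared TD error over the minibatch."""
    return float(np.mean((targets - q_pred) ** 2))

def soft_update(target_params, online_params, tau=0.001):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    return {k: tau * online_params[k] + (1.0 - tau) * target_params[k]
            for k in target_params}

buf = ReplayBuffer(capacity=1000)  # capacity is a placeholder value
for i in range(10):
    buf.push(np.zeros(4), np.zeros(3), float(i), np.ones(4))
s, a, r, s_next = buf.sample(4)    # batch size is a placeholder value
y = td_targets(r, next_q=np.zeros(4), eta=0.99)  # eta=0.99 is an assumed value
print(critic_loss(np.zeros(4), y))
```

With τ=0.001 each soft update moves a target parameter only 0.1% of the way toward its online counterpart, which keeps the TD targets slowly varying during training.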

EVALUATION PROTOCOL: The primary metric is average sum rate (bps/Hz) versus BS transmit power (dBm). The sole baseline is a fixed-pinching-location scheme where PAs are distributed at predetermined static positions along the waveguides — the paper does not compare against SCA, FP, or other optimization baselines despite describing SCA complexity. No ablation study isolates the contribution of individual reward penalty terms or the effect of state representation choices. No held-out mobility trajectory test set is used; evaluation appears to use the same stochastic RWP generator as training. Fig. 4 provides a qualitative spatial visualization of learned PA placement along a sample user trajectory under two QoS thresholds.

CONCRETE EXAMPLE END-TO-END: At t=1, the user is at position (x,y)=(5,5) m. The actor network receives s(1), which encodes the initialized channel vectors and the prior action. It outputs action a(1): a 2-element complex beamforming vector plus 8 PA x-positions. The beamforming vector is normalized via Eq. 20 so that its power does not exceed PBS=20 dBm. The environment computes LoS/NLoS status for each of the 8 PA-to-user links by drawing Bernoulli samples with probability exp(−0.05·d). Channel vectors hn,p,k are assembled per Eq. 4–7. The received SNR and rate Rk(1) are computed per Eq. 11. Reward r(2) is Rk(1) minus any active penalties. The tuple (s(1),a(1),r(2),s(2)) is stored; a minibatch is drawn, the actor and critic training networks are updated, and the target networks are soft-updated. This repeats for T steps, then a new episode begins.
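The action layout in this walk-through can be made concrete with a small decoding helper. This is a sketch under an assumption the paper does not confirm: the complex beamformer is carried in the actor output as separate real and imaginary parts, so the flat action has 2N + N·Pn = 12 real entries. The helper name `decode_action` and the example values are hypothetical.

```python
import numpy as np

def decode_action(action, n_waveguides=2, n_pa=4):
    """Split a flat real-valued action into (complex beamformer, PA positions).

    Assumed layout (illustrative only):
      [Re(w_1..w_N), Im(w_1..w_N), x positions for waveguide 1, ..., waveguide N]
    """
    action = np.asarray(action, dtype=float)
    n = n_waveguides
    expected = 2 * n + n * n_pa
    if action.size != expected:
        raise ValueError(f"expected {expected} entries, got {action.size}")
    w = action[:n] + 1j * action[n:2 * n]          # complex beamforming vector
    positions = action[2 * n:].reshape(n, n_pa)    # one row of x-positions per waveguide
    return w, positions

raw = np.concatenate([[0.2, -0.1], [0.05, 0.3],    # hypothetical beamformer parts
                      [4.0, 4.5, 5.0, 5.5],        # hypothetical PAs on waveguide 1 (m)
                      [4.2, 4.8, 5.3, 5.9]])       # hypothetical PAs on waveguide 2 (m)
w, positions = decode_action(raw)
print(w, positions.shape)
```

The decoded positions are left unclipped on purpose, so that an out-of-bounds or too-closely-spaced placement can still be detected and penalized by the reward.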

REPRODUCIBILITY: No code release is mentioned. No pre-trained weights are shared. Key hyperparameters (batch size, learning rate, hidden layer dimensions, discount factor η, penalty weights pen1/pen2/pen3, episode length T) are absent from the paper, making independent replication difficult.

Technical innovations

First formulation (per authors' claim) of joint beamforming and pinching antenna placement optimization under user mobility with a stochastic distance-dependent blockage model in PASS, as opposed to prior static-user PASS work such as Xu et al. (IEEE WCL 2025).
Application of DDPG to the PASS placement problem, enabling direct optimization over the fully continuous joint action space of complex beamforming weights and continuous antenna positions without discretization — a direct extension from its prior use in RIS resource allocation (Huang et al., JSAC 2020).
Penalty-augmented reward design with three separate indicator-function-gated penalty terms that enforce QoS, waveguide boundary, and minimum inter-antenna spacing constraints entirely within the RL reward signal, combined with a deterministic normalization layer for the power budget constraint.
Separation-of-timescales argument justifying negligible reconfiguration latency: maximum 5 m/s user speed at 1-second control intervals limits per-step antenna displacement, making continuous repositioning physically tractable within the assumed hardware.

Baselines vs proposed

Fixed Pinching Locations (β=0.05): average sum rate ≈ 5 bps/Hz at PBS=20 dBm vs. proposed DDPG Movable (β=0.05): ≈ 8 bps/Hz at PBS=20 dBm (values read approximately from Fig. 3; exact numbers not tabulated).
Fixed Pinching Locations (β=0.05): average sum rate ≈ 17 bps/Hz at PBS=35 dBm vs. proposed DDPG Movable (β=0.05): ≈ 20 bps/Hz at PBS=35 dBm (approximate from Fig. 3).
DDPG Movable (β=0.05) vs. DDPG Movable (β=0.1): visible rate degradation at all power levels in Fig. 3; no exact delta reported.
DDPG Movable (β=0.05) vs. DDPG Movable (β=0.15): further rate degradation beyond β=0.1 case in Fig. 3; no exact delta reported.
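For a rough sense of scale, the approximate Fig. 3 readings quoted above for β=0.05 can be turned into absolute and relative gains. The inputs are the eyeballed values from this summary, not tabulated results from the paper, so the outputs carry the same uncertainty; the helper name `gain` is illustrative.

```python
# Approximate Fig. 3 readings quoted above: {P_BS in dBm: (fixed, movable)} in bps/Hz.
readings = {20: (5.0, 8.0), 35: (17.0, 20.0)}

def gain(fixed, movable):
    """Return (absolute gain in bps/Hz, relative gain as a fraction of the baseline)."""
    return movable - fixed, (movable - fixed) / fixed

for p_dbm, (fixed, movable) in readings.items():
    abs_gain, rel_gain = gain(fixed, movable)
    print(f"{p_dbm} dBm: +{abs_gain:.1f} bps/Hz ({rel_gain:.0%})")
```

On these readings the absolute gap is about 3 bps/Hz at both power levels, while the relative gain falls from roughly 60% at 20 dBm to roughly 18% at 35 dBm because the baseline itself improves with power.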

Limitations

Single-user scenario only — no multi-user interference, no NOMA or SDMA, limiting direct applicability to realistic dense deployments that motivated PASS in the first place.
Key hyperparameters (batch size, learning rates, hidden layer sizes, discount factor η, penalty coefficients pen1/pen2/pen3, episode length T) are unreported, making independent reproduction infeasible.
The fixed-pinching baseline is the only comparison; no successive convex approximation (SCA) or FP-based solver baseline is included, even though the SCA complexity expression is derived. This prevents a meaningful quality-of-solution comparison beyond a trivially weak benchmark.
Perfect instantaneous CSI is assumed at every time slot; no channel estimation error, feedback delay, or quantization is modeled, which is unrealistic especially for mmWave channels with rapid blockage transitions.
The claim that pinching antenna reconfiguration time is negligible at a 1-second control interval is asserted without hardware measurement or citation — actual PASS mechanical or electromechanical repositioning latency is uncharacterized.
No distribution shift evaluation: the RL policy is trained and tested under the same RWP model and β range; generalization to different mobility patterns (e.g., pedestrian, vehicular) or different blockage densities outside the training range is untested.
Figures are the only result presentation — no tables of exact numerical values, no confidence intervals, and no statistical significance tests despite 150 Monte Carlo runs being used.
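The missing confidence intervals noted above would be straightforward to report from the 150 Monte Carlo runs. The sketch below uses a normal-approximation 95% interval; the per-run rates are synthetic placeholders generated for illustration only and are not results from the paper.

```python
import numpy as np

def mean_with_ci(samples, z=1.96):
    """Return (mean, lower, upper) using a normal-approximation confidence interval."""
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean()
    half_width = z * samples.std(ddof=1) / np.sqrt(samples.size)
    return mean, mean - half_width, mean + half_width

# Synthetic placeholder data: 150 made-up per-run average rates in bps/Hz.
rng = np.random.default_rng(0)
fake_rates = rng.normal(loc=8.0, scale=1.5, size=150)

mean, lo, hi = mean_with_ci(fake_rates)
print(f"mean={mean:.2f} bps/Hz, 95% CI=[{lo:.2f}, {hi:.2f}]")
```

With 150 runs the normal approximation is usually adequate; a Student-t quantile would give a slightly wider interval for smaller sample counts.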

Open questions / follow-ons

How does the DDPG policy generalize to mobility models other than RWP (e.g., Gauss-Markov, vehicular traces) or to blockage statistics with spatial correlation, which would more closely model urban canyon environments?
Can a model-based or hybrid RL approach (e.g., incorporating a Kalman-filtered trajectory predictor as part of the state) reduce the number of training episodes required and improve performance under rapid blockage transitions compared to the purely reactive DDPG policy?
What is the practical reconfiguration latency and energy cost of physically repositioning pinching antennas at the timescales required by 5 m/s mobility, and does hardware reality invalidate the negligible-reconfiguration-time assumption?
How does the joint optimization scale to multi-user PASS scenarios where pinching antenna positions must balance competing user locations, and is a centralized DDPG or a multi-agent RL formulation more tractable there?

Why it matters for bot defense

This paper is a wireless communications / signal processing paper with no direct connection to CAPTCHA, bot defense, or web security. It does not address human-vs-bot discrimination, behavioral biometrics, challenge-response protocols, or any authentication mechanism. A bot-defense engineer would find no immediately applicable techniques or threat models here.

The only tangential relevance is methodological: the paper demonstrates DDPG applied to a continuous-action optimization problem with non-stationary environment dynamics and constraint-augmented reward shaping. Bot-defense practitioners working on adaptive challenge difficulty, dynamic rate limiting policies, or RL-based bot behavior simulation might note the reward-penalty design pattern (Eq. 16) as a template for encoding multiple operational constraints into a single scalar reward. However, this is a generic RL engineering pattern, not a PASS-specific contribution, and the paper offers nothing specific to web traffic, user behavior modeling, or adversarial ML.

Cite


@article{arxiv2605_08039,
  title={Joint Beamforming and Antenna Placement Optimization in Pinching Antenna Systems with User Mobility: A Deep Reinforcement Learning Approach},
  author={Ali Amhaz and Mohamed Elhattab and Chadi Assi and Sanaa Sharafeddine},
  journal={arXiv preprint arXiv:2605.08039},
  year={2026},
  url={https://arxiv.org/abs/2605.08039}
}
