FFR: Forward-Forward Learning for Regression

Source: arXiv:2606.03927 · Published 2026-06-02 · By Xinyang Liu, Xuanyu Liang, Shiqi Ding, Boyang Li, Zhiqiang Que, Jiayang Li et al.

TL;DR

This paper extends the Forward-Forward (FF) algorithm, a biologically plausible, layer-local alternative to backpropagation (BP), from classification to real-world regression. The original FF contrasts positive and negative samples, which has no direct analogue in a continuous target space with no natural negative examples. The authors propose FFR (Forward-Forward for Regression), which combines three components: an ordinal competitive goodness function, a stratified ladder architecture that refines predictions from coarse to fine granularity, and a hierarchical multi-scale prediction scheme that also yields uncertainty estimates.

FFR is evaluated on five diverse real-world regression benchmarks—spanning smart-home IoT, industrial tool wear, indoor localization, wearable health, and image quality assessment—as well as four synthetic regression problems. It achieves on average 98.6% of the accuracy of standard BP-trained networks (BP-UR) while using only 8–27% of BP's peak training memory depending on depth and reducing per-iteration compute time to about 72% of BP's. It also substantially outperforms all previously proposed BP-free or FF-based regression methods, which either failed to scale beyond toy problems or suffered large accuracy drops. The stratified ladder design and the ordinal competitive goodness function are shown via ablations to be essential to these gains.

Key findings

FFR recovers on average 98.6% of BP-UR's accuracy across five real-world regression benchmarks, with absolute RMSE and MAE within 1–5% of BP-UR on datasets including Appliances Energy, Tool Wear, UJIIndoorLoc, BIDMC, and KonIQ-10k.
FFR reduces peak training memory footprint to approximately 27% of BP at depth 8 and 8% at depth 32, due to layer-local updates and no backward pass activation storage (Fig 3b).
Per-iteration training time for FFR is about 72% that of BP on the KonIQ-10k CNN backbone at depth 8, benefiting from forward-only locality and avoiding backward gradient chains (Fig 3c).
FFR significantly outperforms all FF-based and BP-free baselines (FF-MSE, FF-CLF, FF-CAR, FF-Zero, Trifecta, PEPITA, F3) on both synthetic and real-world benchmarks, often by large margins in RMSE/MAE (Tables 1 and 2).
Ablation studies show ordinal soft labels provide the largest single contribution to performance; stratified growing group counts per layer improve granularity of learned representations; ladder aggregation and hierarchical prediction further enhance accuracy (Table 3).
FFR's hierarchical prediction scheme yields uncertainty estimates as a byproduct at no extra cost; on multiple datasets these estimates track error magnitudes (Fig 4).
Performance is robust across group partition schedules Kℓ and hierarchical prediction weighting schemes, with the default {16, 32, 64} group schedule and uniform weights providing good balance (Table 4).
Naive FF adaptations to regression like FF-MSE collapse representations, converging to poor local minima and underperforming FFR by an order of magnitude in error (Appendix A.5.1).

Methodology — deep read

Threat Model & Assumptions: The study does not consider adversarial settings and models no adversary in the security sense; its goal is efficient, biologically plausible local training for regression without backpropagation.
Data: The datasets comprise four synthetic regression tasks (Sin-Cos, Exp-Trig-Poly, MT-A, MT-B) and five real-world regression benchmarks from different domains: Appliances Energy (appliance energy consumption), Machine Tool Wear, UJIIndoorLoc (indoor localization), BIDMC (wearable health monitoring), and KonIQ-10k (image quality assessment). Dataset sizes and splits follow prior literature, and the splits are held constant across methods for fair comparison.
Architecture & Algorithm: The core idea is a local, layer-wise training objective that replaces FF’s positive-negative pair contrast with an ordinal competitive goodness function. At each layer ℓ, the continuous target space is discretized into Kℓ ordered bins, with the number of bins doubling at each successive layer for finer granularity. Neurons at each layer are partitioned into disjoint groups Gℓ,k, one per bin. The goodness of a group is its average squared activation, and normalizing the goodness values across groups gives a distribution pℓ over bins. The layer is trained with a cross-entropy loss to match pℓ to a soft ordinal target distribution qℓ, a Gaussian bump over the bin midpoints centered on the true target. This preserves ordinal relations and imposes local competition between groups.
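The per-layer objective described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the summary does not say how goodness values are normalized into pℓ, so a softmax is assumed here, and the equal-width bins, contiguous equal-sized neuron groups, and the `sigma` width of the Gaussian bump are likewise hypothetical choices.

```python
import math

def bin_midpoints(y_min, y_max, num_bins):
    """Midpoints of num_bins equal-width bins covering [y_min, y_max]."""
    width = (y_max - y_min) / num_bins
    return [y_min + (k + 0.5) * width for k in range(num_bins)]

def soft_ordinal_target(y, midpoints, sigma):
    """Gaussian bump over the bin midpoints, normalized to sum to 1 (q_l)."""
    weights = [math.exp(-((c - y) ** 2) / (2.0 * sigma ** 2)) for c in midpoints]
    total = sum(weights)
    return [w / total for w in weights]

def group_goodness(activations, num_groups):
    """Mean squared activation of each contiguous, equal-sized neuron group."""
    size = len(activations) // num_groups
    return [
        sum(a * a for a in activations[k * size:(k + 1) * size]) / size
        for k in range(num_groups)
    ]

def softmax(values):
    """Numerically stable softmax (ASSUMED normalization that yields p_l)."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def ordinal_competitive_loss(activations, y, y_min, y_max, num_groups, sigma):
    """Cross-entropy between the soft target q_l and the goodness distribution p_l."""
    p = softmax(group_goodness(activations, num_groups))
    q = soft_ordinal_target(y, bin_midpoints(y_min, y_max, num_groups), sigma)
    return -sum(qk * math.log(pk + 1e-12) for qk, pk in zip(q, p))

# Toy layer: 8 neurons in 4 groups; the target 0.3 falls in the second bin.
acts_good = [0.1, 0.1, 2.0, 2.0, 0.1, 0.1, 0.1, 0.1]  # second group most active
acts_bad = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 2.0, 2.0]   # fourth group most active
q = soft_ordinal_target(0.3, bin_midpoints(0.0, 1.0, 4), sigma=0.15)
loss_good = ordinal_competitive_loss(acts_good, 0.3, 0.0, 1.0, 4, sigma=0.15)
loss_bad = ordinal_competitive_loss(acts_bad, 0.3, 0.0, 1.0, 4, sigma=0.15)
print(loss_good < loss_bad)
```

Because the soft target also places mass on the bins next to the true one, activity in a neighbouring group is penalized less than activity in a distant group, which is how the loss keeps the ordinal structure of the target.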

The model uses a stratified ladder architecture where Kℓ grows exponentially per layer (e.g., 16, 32, 64 groups) enabling shallow layers to coarsely discriminate bins and deeper layers to refine finer differences. Group-wise normalization per partition stabilizes activations. Outputs from all layers’ group goodness vectors are concatenated and combined via a terminal regression head trained with MSE loss to produce a continuous final prediction. Gradients for the terminal layer do not backpropagate beyond it, preserving layer locality.
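As a rough sketch of the ladder described above, the snippet below builds a doubling group schedule, partitions each layer's neurons, concatenates the per-layer goodness vectors, and takes MSE gradient steps on a linear head. The layer width of 128, the synthetic activations, the linear form of the head, and the learning rate are all assumptions made for illustration; the summary only states that the head is trained with MSE and that its gradients stop at the head.

```python
def ladder_schedule(base_groups, num_layers):
    """Group counts that double with depth, e.g. 16, 32, 64."""
    return [base_groups * (2 ** layer) for layer in range(num_layers)]

def partition_neurons(width, num_groups):
    """Split neuron indices 0..width-1 into disjoint, equal-sized groups."""
    if width % num_groups != 0:
        raise ValueError("layer width must be divisible by the group count")
    size = width // num_groups
    return [list(range(k * size, (k + 1) * size)) for k in range(num_groups)]

def group_goodness(activations, groups):
    """Mean squared activation within each group of neuron indices."""
    return [sum(activations[i] ** 2 for i in g) / len(g) for g in groups]

def head_sgd_step(head_w, head_b, features, target, lr):
    """One MSE gradient step on a linear head.

    The features are plain numbers, so the update touches only the head's
    own parameters and nothing propagates back into the layers.
    """
    pred = sum(w * f for w, f in zip(head_w, features)) + head_b
    err = pred - target
    new_w = [w - lr * 2.0 * err * f for w, f in zip(head_w, features)]
    new_b = head_b - lr * 2.0 * err
    return new_w, new_b, err ** 2

schedule = ladder_schedule(16, 3)   # [16, 32, 64]
layer_width = 128                   # hypothetical width, divisible by 64

# Synthetic stand-ins for the activations of three trained layers.
activations = [
    [((i * (layer + 1)) % 7) / 7.0 for i in range(layer_width)]
    for layer in range(len(schedule))
]

# Concatenate the goodness vectors of all layers: 16 + 32 + 64 = 112 features.
features = []
for acts, num_groups in zip(activations, schedule):
    features.extend(group_goodness(acts, partition_neurons(layer_width, num_groups)))

head_w, head_b = [0.0] * len(features), 0.0
losses = []
for _ in range(20):
    head_w, head_b, loss = head_sgd_step(head_w, head_b, features, 0.7, lr=0.01)
    losses.append(loss)
print(len(features), losses[0] > losses[-1])
```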

For prediction, each layer’s distribution pℓ is converted into a mean prediction µℓ by weighting bin midpoints. These multi-scale predictions plus the terminal output are combined with non-negative weights into an ensemble prediction, providing improved robustness and an uncertainty estimate derived from prediction variance across layers without extra sampling.
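The prediction step above can be illustrated with a small sketch. The per-layer mean follows the description directly (bin midpoints weighted by pℓ); the uncertainty is computed here as a weighted variance of the per-layer predictions around the ensemble mean, which is one plausible reading of "prediction variance across layers" and should be treated as an assumption. The toy distributions and the terminal output value are made up.

```python
def bin_midpoints(y_min, y_max, num_bins):
    """Midpoints of num_bins equal-width bins covering [y_min, y_max]."""
    width = (y_max - y_min) / num_bins
    return [y_min + (k + 0.5) * width for k in range(num_bins)]

def layer_mean(p, midpoints):
    """Expected target under a layer's bin distribution (mu_l)."""
    return sum(pk * c for pk, c in zip(p, midpoints))

def ensemble(predictions, weights):
    """Weighted mean of the predictions and their weighted variance around it."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    norm = [w / total for w in weights]
    mean = sum(w * m for w, m in zip(norm, predictions))
    variance = sum(w * (m - mean) ** 2 for w, m in zip(norm, predictions))
    return mean, variance

# Toy two-layer ladder on targets in [0, 1] with 2 and 4 bins.
p1 = [0.2, 0.8]
p2 = [0.0, 0.1, 0.6, 0.3]
mu1 = layer_mean(p1, bin_midpoints(0.0, 1.0, 2))  # 0.2*0.25 + 0.8*0.75
mu2 = layer_mean(p2, bin_midpoints(0.0, 1.0, 4))
terminal = 0.66                                    # hypothetical head output

# Uniform weights: layers that agree give a small variance...
mean_agree, var_agree = ensemble([mu1, mu2, terminal], [1.0, 1.0, 1.0])
# ...while a layer that disagrees inflates it.
mean_split, var_split = ensemble([0.25, mu2, terminal], [1.0, 1.0, 1.0])
print(var_agree < var_split)
```

No sampling is involved: the spread comes from predictions the ladder already produces in a single forward pass.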

Training Regime: The architectures are 4-layer fully connected backbones for tabular and synthetic data and an 8-layer CNN for images (KonIQ-10k). Batch sizes, epoch counts, learning rates, and optimizers are chosen per dataset to ensure convergence and fair baseline comparisons; details are in the appendix. Training runs on standard single-GPU setups with random seeds set for reproducibility.
Evaluation Protocol: Metrics are RMSE and MAE on held-out test sets for all datasets. Baselines include two BP references (BP-UR: unified regression; BP-EX: adds grouped-classification losses) and the FF-based and BP-free methods FF-MSE, FF-CLF, FF-CAR, FF-Zero, Trifecta, PEPITA, and F3. Ablations remove or alter components such as soft ordinal labels, the stratification schedule, ensemble aggregation, and hierarchical prediction. Statistical robustness, via repeated runs or confidence intervals, is not explicitly reported. Runtime and memory benchmarks isolate depth scaling effects.
Reproducibility: The main text does not mention a release of code or model weights. The datasets are public or standard benchmarks, and no closed datasets are involved. Hyperparameters, architectures, and normalization schemes are detailed in the appendices to enable replication, and experimental settings are matched across baselines.

Example End-to-End: A sample from the UJIIndoorLoc dataset (indoor location coordinates) is fed into the stratified ladder network. At layer 1, the target latitude/longitude is discretized into 16 bins. The group-wise normalized activations produce goodness values per bin group, converted into a probability distribution. The cross-entropy loss with Gaussian-smoothed soft labels forces competition among groups closest to the target location. This continues across layers with finer binning (32 groups at layer 2, 64 at layer 3). Finally, the concatenated multi-scale goodness vectors feed a regression head predicting continuous latitude/longitude. The ensemble of intermediate predictions produces a final output plus uncertainty estimation. The model updates each layer independently with its local ordinal competitive loss, enabling forward-only weight updates without backpropagation. This yields prediction accuracy close to BP baselines but with greatly reduced memory and compute costs.
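To make the layer-local update in this walkthrough concrete, here is a toy sketch in which each layer minimizes its own ordinal loss using a gradient that never leaves the layer. Everything specific is an assumption for illustration: linear layers without a nonlinearity, softmax normalization of the goodness values, length normalization between layers, a scalar target in [0, 1], and the tiny 4- and 8-group schedule with made-up inputs.

```python
import math
import random

def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target(y, num_bins, sigma):
    """Gaussian soft label over equal-width bins on [0, 1]."""
    mids = [(k + 0.5) / num_bins for k in range(num_bins)]
    weights = [math.exp(-((c - y) ** 2) / (2.0 * sigma ** 2)) for c in mids]
    total = sum(weights)
    return [w / total for w in weights]

def local_step(W, x, q, lr):
    """Forward pass plus in-place update of ONE linear layer.

    Returns the loss and activations computed before the update.
    """
    size = len(W) // len(q)  # neurons per group
    h = [sum(w * xj for w, xj in zip(row, x)) for row in W]
    g = [sum(v * v for v in h[k * size:(k + 1) * size]) / size for k in range(len(q))]
    p = softmax(g)
    loss = -sum(qk * math.log(pk + 1e-12) for qk, pk in zip(q, p))
    for i, row in enumerate(W):
        k = i // size
        # Softmax + cross-entropy gives dL/dg_k = p_k - q_k; dg_k/dh_i = 2 h_i / size.
        dh = (p[k] - q[k]) * 2.0 * h[i] / size
        for j in range(len(row)):
            row[j] -= lr * dh * x[j]
    return loss, h

def normalize(v):
    """Scale to unit length so the next layer sees direction, not magnitude."""
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]

random.seed(0)
x, y = [0.5, -0.3, 0.8, 0.1], 0.3   # made-up input and target
group_counts = [4, 8]               # toy ladder schedule
widths = [8, 16]                    # two neurons per group
layers, in_dim = [], len(x)
for width in widths:
    layers.append([[random.uniform(-0.5, 0.5) for _ in range(in_dim)]
                   for _ in range(width)])
    in_dim = width
targets = [soft_target(y, k, sigma=0.15) for k in group_counts]

first_layer_losses = []
for step in range(200):
    inp = x
    for idx, (W, q) in enumerate(zip(layers, targets)):
        loss, h = local_step(W, inp, q, lr=0.05)
        if idx == 0:
            first_layer_losses.append(loss)
        inp = normalize(h)  # passed forward as plain numbers; nothing flows back
print(first_layer_losses[0] > first_layer_losses[-1])
```

Each layer needs only its own input, activations, and soft target to update, so no activations have to be retained for a later backward pass.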

Technical innovations

Introduction of an ordinal competitive goodness function that replaces positive-negative pair contrasts with intra-layer competition across partitioned neuron groups supervised by a distance-aware ordinal soft label.
Stratified ladder architecture with exponentially increasing group counts per layer, allowing shallow layers to learn coarse ordinal bins and deeper layers to refine predictions progressively.
Aggregation of goodness vectors from all intermediate layers into a terminal regression head trained independently, preserving pure layer-local updates without backpropagation through earlier layers.
Hierarchical multi-scale prediction ensemble providing both robust continuous regression outputs and uncertainty estimates as a free byproduct without sampling or Bayesian approximations.

Datasets

Sin-Cos — synthetic benchmark — constructed from trigonometric functions
Exp-Trig-Poly — synthetic benchmark — composed of exponential, trigonometric, and polynomial functions
MT-A — synthetic multi-target benchmark with 2 outputs — constructed from Sin-Cos and Exp-Trig-Poly
MT-B — synthetic multi-target benchmark with 4 outputs — constructed from Sin-Cos and Exp-Trig-Poly variants
Appliances Energy [3] — real-world tabular data on appliance energy usage
Machine Tool Wear [6] — industrial sensor data for tool wear prediction
UJIIndoorLoc [26] — indoor localization dataset with latitude/longitude targets
BIDMC [21] — wearable health monitoring data
KonIQ-10k [12] — image quality assessment dataset with continuous quality scores

Baselines vs proposed

BP-UR: average RMSE = 0.008 (synthetic), 12.357 (KonIQ-10k); FFR: 0.013 (synthetic), 12.731 (KonIQ-10k). The 98.6% relative performance is the average over the five real-world benchmarks
FF-MSE: average synthetic RMSE ~0.193, roughly 15x FFR’s 0.013; FFR outperforms FF-MSE across all datasets
FF-CLF: competitive on single-target synthetic but errors grow over 10x on multi-target MT-A/B; FFR improves on these by stratified ladder design
FF-CAR [20]: error substantially worse and training/inference more expensive than FFR
FF-Zero [10]: traded accuracy for hardware implementability, performance well below FFR
PEPITA [7] and F3 [9]: consistently worse RMSE and MAE on real-world datasets than FFR by large margins
Ablation without ordinal soft labels: RMSE rises from 76.7 to 81.7 on Appliances and from 0.017 to 0.048 on UJIIndoorLoc, indicating their critical role (Table 3)
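The ratios behind the figures quoted in this list can be checked with a few lines of arithmetic. The definition of relative performance used here (BP-UR RMSE divided by FFR RMSE, since lower RMSE is better) is an assumption; the summary does not state how the paper computes its 98.6% average.

```python
# RMSE values quoted in the list above.
bp_ur_koniq, ffr_koniq = 12.357, 12.731
ffr_synth, ff_mse_synth = 0.013, 0.193

# ASSUMED definition of relative performance: BP error divided by FFR error.
koniq_ratio = bp_ur_koniq / ffr_koniq      # about 0.97 on KonIQ-10k alone
ff_mse_factor = ff_mse_synth / ffr_synth   # about 15x

# Effect of removing ordinal soft labels (ablation numbers from Table 3).
appliances_increase = (81.7 - 76.7) / 76.7  # relative RMSE increase
uji_factor = 0.048 / 0.017                  # multiplicative RMSE increase

print(f"KonIQ-10k relative performance: {koniq_ratio:.1%}")
print(f"FF-MSE vs FFR on synthetic tasks: {ff_mse_factor:.1f}x")
print(f"Appliances RMSE increase without soft labels: {appliances_increase:.1%}")
print(f"UJIIndoorLoc RMSE factor without soft labels: {uji_factor:.1f}x")
```

Under this definition KonIQ-10k alone comes out near 97%, slightly below the five-benchmark average of 98.6%.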

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.03927.

Fig 1: Overview of FFR. (a) FFR framework and regression applications. (b) The gap between …

Fig 2: FFR framework. Stratified ladder architecture is trained with ordinal competitive goodness …

Fig 3 (page 4).

Limitations

Current implementation uses standard floating-point precision without exploring low-bit or quantization-aware training which could yield further efficiency gains.
Forward-only locality is not yet demonstrated on specialized low-bit accelerators, analog circuits, or neuromorphic hardware; benefits outside GPU remain untested.
Uncertainty estimates lack formal calibration analysis or out-of-distribution detection tests beyond qualitative visualization.
No explicit adversarial or robustness evaluation against corrupted or malicious inputs was conducted.
Experiments focus on 4- to 8-layer models; accuracy on extremely deep networks or very large datasets is not fully explored.
No code release is mentioned, which limits immediate reproducibility for the wider community.

Open questions / follow-ons

Can the ordinal competitive goodness function and stratified ladder architecture be effectively combined with quantized or extremely low-bit precision training to further reduce resource usage?
How well does FFR’s forward-only training adapt to neuromorphic hardware implementations or analog substrates, and what accuracy/efficiency tradeoffs emerge?
Can the predictive uncertainty output by FFR be formally calibrated and leveraged for selective prediction, anomaly detection, or distribution shift identification in deployed systems?
How does FFR handle extremely large or high-dimensional regression targets, e.g., in multi-output or structured regression settings beyond the tested 4-output synthetic benchmarks?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, FFR is a candidate alternative to backpropagation for training regression models where latency, memory, or on-device computation is the binding constraint. Its forward-only local learning scheme reduces memory footprint and permits more parallel, asynchronous weight updates, which could suit low-power edge devices running real-time anomaly detection or sensor regression models relevant to bot detection. The ordinal competitive goodness function supplies supervision for continuous-valued predictions, which could be adapted to scoring-based challenge-response or behavioral feature regression in CAPTCHAs. The built-in uncertainty estimates could support confidence-aware decisions, helping separate interactions the model scores confidently from uncertain, possibly malicious inputs. FFR is not plug-and-play for security-critical settings, but its locality and efficiency tradeoffs make it worth exploring for bot-detection models that must run on constrained hardware or train on-device without centralized gradient updates.

Cite

bibtex

@article{arxiv2606_03927,
  title={FFR: Forward-Forward Learning for Regression},
  author={Xinyang Liu and Xuanyu Liang and Shiqi Ding and Boyang Li and Zhiqiang Que and Jiayang Li and Guosheng Hu},
  journal={arXiv preprint arXiv:2606.03927},
  year={2026},
  url={https://arxiv.org/abs/2606.03927}
}
