OpenCLAW-P2P v6.0: Resilient Multi-Layer Persistence, Live Reference Verification, and Production-Scale Evaluation of Decentralized AI Peer Review

Source: arXiv:2604.19792 · Published 2026-04-06 · By Francisco Angulo de Lafuente, Teerth Sharma, Vladimir Veselov, Seid Mohammed Abdu, Nirmal Tej Kumar, Guillermo Perry

TL;DR

The authors provide an honest production-scale evaluation, including failure-mode analysis, a successful recovery protocol that restored 25 lost papers, and lessons on scaling AI peer review networks. Core v5.0 features such as the tribunal system, multi-LLM scoring ensemble, calibrated 14-rule deception detection, Proof of Value consensus, and formal verification via the Lean 4-based AETHER engine are retained and further hardened. The paper offers detailed architecture and operational protocols, with all source code released open-source, making it a rare example of a mature, decentralized, automated peer-review system evaluated under live conditions.

Key findings

  • The multi-layer paper persistence system prevented any paper loss across infrastructure redeployments or outages when at least one of Cloudflare R2 or GitHub is reachable (Proposition 7.1).
  • The multi-layer retrieval cascade reduced median retrieval latency from >3 seconds in v5.0 to under 50 milliseconds at the in-memory cache after warm-up (Fig 2, Table 10).
  • Live reference verification querying CrossRef, arXiv, Semantic Scholar, PubChem, UniProt, OEIS, and Materials Project achieved >85% accuracy in detecting fabricated or ghost citations, a major improvement over structural heuristics.
  • The scientific API proxy service enabled rate-limited, cached access to 7 major public scientific databases, supporting verifiable grounding of AI-generated references in real time.
  • In production, 14 autonomous agents and 23 simulated citizens jointly produced 50+ papers with word counts between 2,072 and 4,073 words and leaderboard scores from 6.4 to 8.1.
  • The paper recovery protocol successfully restored all 25 papers lost to a critical 500-character truncation bug from v5.0, with 56% passing tribunal clearance in round one and 100% by round two, demonstrating robustness of re-examination (Table 12).
  • The 17-judge multi-LLM scoring ensemble combined 18 model judgments from diverse providers with calibration to correct for LLM score inflation, yielding rigorously calibrated final paper scores.
  • The tribunal cognitive examination consists of 8 questions from 7 balanced categories, requiring ≥60% to publish, ensuring gatekeeping without human intervention.

Threat model

Adversaries are autonomous AI agents that may attempt to publish papers with fabricated citations or inflated claims to game the system. They cannot simultaneously disable all four storage tiers, nor can they alter previously committed proofs without detection due to cryptographic and formal verification safeguards. The model assumes no human gatekeepers, treating deception detection and factual grounding as key defenses.

Methodology — deep read

The authors assume an adversary capable of submitting AI-generated papers including fabricated citations but unable to tamper with storage tiers simultaneously due to the multi-layer architecture. The platform runs 14 autonomous AI agents (3 research-focused, 5 meta-intelligence, 6 recovery/specialist) and 23 labeled simulated citizens to scale tests in production-like settings. Data comprises 50+ papers with 2,072–4,073 word counts, published and peer-reviewed live through the OpenCLAW-P2P system.

At publication, papers undergo a tribunal cognitive exam: a three-phase process requiring agents to answer 8 questions drawn from a 26-question pool across 7 categories, seeded by keywords and covering a range of cognitive skills; a passing score of ≥60% grants a one-time clearance token that enables submission. Submitted papers then enter a multi-LLM judge ensemble scoring process in which 17 judges run in parallel, evaluating 10 granular dimensions (Abstract, Introduction, Methodology, etc.) on 0–10 scales. Scores are corrected for LLM inflation using a calibrated affine transform and are further subject to 14 heuristic calibration rules addressing missing sections, evidence gaps, and deception patterns (e.g., ghost citations, cargo-cult structure).
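The tribunal gate described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the question-pool structure, field names, and sampling strategy (one question per category plus one extra) are assumptions consistent with the stated numbers (8 of 26 questions, 7 categories, ≥60% to pass).

```python
import random

PASS_THRESHOLD = 0.60
QUESTIONS_PER_EXAM = 8

def draw_exam(pool, rng=random):
    """Sample 8 questions from the 26-question pool, covering all 7 categories."""
    by_category = {}
    for q in pool:
        by_category.setdefault(q["category"], []).append(q)
    # One question from each of the 7 categories, plus one extra from the rest.
    exam = [rng.choice(qs) for qs in by_category.values()]
    remaining = [q for q in pool if q not in exam]
    exam.append(rng.choice(remaining))
    return exam[:QUESTIONS_PER_EXAM]

def grade(answers, exam):
    """Return (passed, score); passing grants the one-time clearance token."""
    correct = sum(1 for q, a in zip(exam, answers) if a == q["answer"])
    score = correct / len(exam)
    return score >= PASS_THRESHOLD, score
```

With 8 questions, the ≥60% bar means at least 5 correct answers (5/8 = 62.5%).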

New in v6.0, a live reference verification subsystem queries CrossRef, arXiv, Semantic Scholar, and other scientific APIs via a rate-limited cached API proxy to confirm citations in real time. Papers exceeding 50% unverifiable citations trigger ghost-citation flags affecting scores.
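The ghost-citation rule above reduces to a simple ratio check. In this sketch, `verify_citation` stands in for the live lookups against CrossRef, arXiv, Semantic Scholar, and the other APIs; the function name and interface are assumptions, and "exceeding 50%" is read as strictly greater than the threshold.

```python
GHOST_CITATION_THRESHOLD = 0.50

def ghost_citation_flag(citations, verify_citation):
    """Return (flagged, unverifiable_ratio) for a paper's citation list."""
    if not citations:
        return False, 0.0
    unverifiable = sum(1 for c in citations if not verify_citation(c))
    ratio = unverifiable / len(citations)
    # A paper is flagged only when unverifiable citations exceed 50%.
    return ratio > GHOST_CITATION_THRESHOLD, ratio
```

The flag then feeds into the 14-rule calibration pipeline as a score penalty rather than an outright rejection.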

Paper content is written synchronously to four storage tiers on publish: in-memory cache, Gun.js graph DB, Cloudflare R2 object store, and GitHub repo backup. A retrieval cascade attempts fetches top-down through these tiers, backfilling faster tiers on a miss. This layered persistence resolves earlier data-loss issues, including a paper-content truncation bug, fixed by preserving the full content together with explicit word-count metadata.
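The write-to-all / read-through-with-backfill pattern can be sketched as below. The dict-backed tiers are stand-ins for the production clients (memory cache, Gun.js, R2, GitHub); the class and method names are illustrative assumptions, not the authors' API.

```python
class Tier:
    """A toy storage tier; real tiers wrap Gun.js, R2, or GitHub clients."""
    def __init__(self, name):
        self.name = name
        self.store = {}
    def get(self, key):
        return self.store.get(key)
    def put(self, key, value):
        self.store[key] = value

class PaperStore:
    def __init__(self, tiers):
        self.tiers = tiers  # fastest first: [memory, gunjs, r2, github]

    def publish(self, paper_id, content):
        # Synchronous write to all four tiers on publish.
        for tier in self.tiers:
            tier.put(paper_id, content)

    def retrieve(self, paper_id):
        # Try tiers top-down; on a hit, backfill every faster tier.
        for i, tier in enumerate(self.tiers):
            content = tier.get(paper_id)
            if content is not None:
                for faster in self.tiers[:i]:
                    faster.put(paper_id, content)
                return content
        return None
```

As long as any one tier still holds the paper, a retrieval both succeeds and repopulates the faster tiers, which is the mechanism behind the zero-loss claim of Proposition 7.1.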

To recover papers lost to pre-fix truncations, locally saved copies were re-submitted with force flags after passing tribunal re-examination, using rotating agent IDs to work around system rate limits.

All submitted papers pass through the Proof of Value consensus protocol involving automated formal proof verification via a Lean 4-based kernel (AETHER engine), mempool publication with Ed25519 signatures, and τ-aligned peer verification.
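The tamper-evidence property of the mempool step can be illustrated with a dependency-free sketch. The production system signs entries with Ed25519; here an HMAC-SHA256 over the canonical entry stands in so the example runs on the standard library alone, and all field names are illustrative assumptions.

```python
import hashlib
import hmac
import json

def make_entry(paper_id, content, author_key: bytes):
    """Build a mempool entry binding a paper ID to a hash of its content."""
    entry = {
        "paper_id": paper_id,
        "content_hash": hashlib.sha256(content.encode()).hexdigest(),
    }
    payload = json.dumps(entry, sort_keys=True).encode()
    entry["signature"] = hmac.new(author_key, payload, hashlib.sha256).hexdigest()
    return entry

def verify_entry(entry, content, author_key: bytes):
    """Recompute both the content hash and the signature; any change to a
    previously committed entry or its content is detected."""
    unsigned = {k: v for k, v in entry.items() if k != "signature"}
    payload = json.dumps(unsigned, sort_keys=True).encode()
    expected_sig = hmac.new(author_key, payload, hashlib.sha256).hexdigest()
    return (
        hmac.compare_digest(entry["signature"], expected_sig)
        and unsigned["content_hash"] == hashlib.sha256(content.encode()).hexdigest()
    )
```

With real Ed25519 keys the verification side needs only the public key, which is what lets τ-aligned peers check entries without shared secrets.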

Evaluation captures retrieval latencies per tier (cache median <1 ms, Gun.js up to 3 s), scoring distributions, production throughput metrics, and failure mode stats. No adversarial attack tests reported. Open-source code and public APIs enable reproducibility, though the exact seed strategy and hardware for training LLM judges are not detailed.

One concrete workflow: agent passes tribunal → publishes paper, triggering synchronous four-tier storage writes → paper is scored by 17 independent LLM judges → scores are calibrated via the 14-rule pipeline, including live reference verification → paper status is promoted after multiple independent validation passes → paper is retrievable via the multi-layer cache cascade, typically within 50 ms.

Technical innovations

  • Design and implementation of a four-tier paper persistence architecture combining in-memory cache, Gun.js graph DB, Cloudflare R2 durable object store, and GitHub repo backup to guarantee zero data loss through redeployments and outages.
  • Multi-layer retrieval cascade with automatic backfill that reduces lookup latency by over 98% (from >3 seconds to <50 milliseconds) by querying successively slower but more durable storage tiers.
  • Live reference verification system querying seven scientific databases (CrossRef, arXiv, Semantic Scholar, PubChem, UniProt, OEIS, Materials Project) in real-time during scoring to detect ghost/fabricated citations with >85% accuracy.
  • Scientific API proxy providing rate-limited, cached, normalized access to multiple heterogeneous public databases enabling agents to ground citations and claims in verifiable data.
  • A calibrated ensemble of 17 heterogeneous LLM judges from diverse global providers combined with a 14-rule calibration and deception detection pipeline to mitigate LLM score inflation and detect semantic and structural paper failures.
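The calibration step in the last bullet (an affine de-inflation transform on the ensemble mean, followed by rule-based penalties) can be sketched as below. The transform parameters and penalty weights are illustrative assumptions, not the paper's fitted values.

```python
def calibrate(judge_scores, a=1.25, b=-2.5, penalties=()):
    """Apply the affine transform s' = a * mean(s) + b, subtract any
    rule-based penalties, and clamp to the 0-10 scale."""
    mean = sum(judge_scores) / len(judge_scores)
    score = a * mean + b - sum(penalties)
    return max(0.0, min(10.0, score))
```

An a > 1 slope with a negative intercept stretches the compressed upper range that inflated LLM judges tend to produce, while the penalty terms let the 14 heuristic rules (missing sections, ghost-citation flags, etc.) pull a score down independently of the judges.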

Datasets

  • OpenCLAW-P2P Papers — 50+ papers, 2,072–4,073 words each — collected/generated by platform agents during production operation
  • Tribunal Question Pool — 26 questions across 7 cognitive categories — internal to platform
  • Scientific Public Databases (CrossRef, arXiv, Semantic Scholar, PubChem, UniProt, OEIS, Materials Project) — varied sizes — public APIs

Baselines vs proposed

  • v5.0 retrieval latency: >3 seconds vs v6.0 multi-layer cache latency: <50 milliseconds at the median
  • Reference verification: structural heuristics ghost-citation detection <50% accuracy vs live external API verification >85% accuracy
  • Paper recovery success: N/A vs 100% recovery of 25 previously lost papers after truncation bug fix with tribunal re-examination
  • Leaderboard score range for papers produced: 6.4–8.1, with recovered papers scoring 6.9–8.6

Limitations

  • No reported adversarial evaluation testing deliberate attempts to evade deception detectors or disrupt persistence layers.
  • No distribution shift analysis exploring performance with papers from domains or agents outside the training/validation distribution.
  • Limited transparency on training or hyperparameter tuning for constituent LLM judges beyond architecture origins and providers.
  • The recovery protocol relied on local agent file caches, which are not available in all failure modes, limiting its general applicability.
  • No explicit user or real-world human-in-the-loop evaluation metrics or acceptance criteria presented.
  • API proxy rate limits could bottleneck scaling beyond current load; impact on latency under heavy load not evaluated.

Open questions / follow-ons

  • How robust is live reference verification against sophisticated citation fabrication or coordinated disinformation attacks?
  • Can the tribunal question adaptation and dynamic new question generation improve long-term gatekeeping as attacker strategies evolve?
  • What are the limits of multi-LLM ensemble calibration in mitigating systemic biases or vulnerabilities to adversarial prompting?
  • How does the distributed persistence and retrieval strategy scale under much higher publication volumes and adversarial storage attacks?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, OpenCLAW-P2P v6.0 offers valuable insights into building resilient decentralized systems that must maintain data availability despite unreliable distributed components, analogous to how CAPTCHA backends must remain robust under network partitions or backend failures. The multi-layer caching and persistence approach, with automatic backfill and asynchronous consistency guarantees, is a compelling pattern to ensure low-latency, reliable retrieval of validation data, improving user experience without sacrificing durability. Additionally, the live verification against multiple trusted external sources parallels CAPTCHA services integrating real-time threat intelligence APIs to detect fraudulent activity dynamically.

Moreover, the multi-judge ensemble scoring and deception detection pipeline illustrate how leveraging diverse independent judgments calibrated for bias can reduce false positives and negatives—principles applicable when combining multiple bot detection signals or CAPTCHAs of varying difficulty types. Lastly, formal verification and cryptographic hashing of claims provide a model for integrity checking that could inspire enhanced anti-tampering mechanisms in bot-defense telemetry. However, the system's complexity and calibration-heavy scoring highlight challenges in balancing security with usability and throughput, a perennial concern in CAPTCHA deployment.

Cite

bibtex
@article{arxiv2604_19792,
  title={OpenCLAW-P2P v6.0: Resilient Multi-Layer Persistence, Live Reference Verification, and Production-Scale Evaluation of Decentralized AI Peer Review},
  author={Francisco Angulo de Lafuente and Teerth Sharma and Vladimir Veselov and Seid Mohammed Abdu and Nirmal Tej Kumar and Guillermo Perry},
  journal={arXiv preprint arXiv:2604.19792},
  year={2026},
  url={https://arxiv.org/abs/2604.19792}
}

Read the full paper

Articles are CC BY 4.0 — feel free to quote with attribution