The Origins of MEV: Systematic Attribution of Arbitrage Opportunity Creation at Scale

Source: arXiv:2604.27979 · Published 2026-04-30 · By Andrei Seoev, Dmitry Belousov, Anastasiia Smirnova, Ksenia Kurinova, Aleksei Smirnov, Denis Fedyanin et al.

TL;DR

This paper asks a question that prior MEV work largely sidestepped: not how arbitrage is extracted, but which on-chain transaction(s) created the opportunity in the first place. The authors formalize “MEV opportunity attribution” for atomic arbitrage on EVM chains, then build four attribution methods with different assumptions and cost/accuracy trade-offs: bot-data-driven, simulation-based counterfactual replay, coefficient/K-value based, and Shapley-based cooperative attribution.

The empirical claim is that, in competitive MEV markets, opportunity creation is usually attributable to a single transaction. On Polygon, a March 2026 crawl of 1,050,000 blocks yields 360,026 atomic arbitrage transactions, and in a separate February 2026 comparison window the authors find that 96.7% of arbitrage events can be tied to a single source transaction. The large-scale retrospective analysis also finds strong concentration: a small number of protocols account for most arbitrage opportunity creation. The paper’s value is less in a new extraction strategy and more in turning MEV from an extraction-only measurement problem into a causal provenance problem.

Key findings

Across the March 2–29, 2026 Polygon crawl (blocks 83,770,001–84,820,000), they identified 360,026 atomic arbitrage transactions with total extracted MEV of $334,799 at execution-time oracle prices.
Across 2,526 atomic arbitrage events in the February 4, 2026 comparison window (blocks 82,546,747–82,567,395), the authors say 96.7% were attributable to a single source transaction.
The simulation method defines the edge transaction with a 5% threshold, the point at which simulated profit falls to 5% or less of the observed profit; the authors report that 99.3% of attributable opportunities had the edge transaction within 7 blocks of the arbitrage transaction (within a search window of D = 100 blocks).
The coefficient/K-value method is explicitly cheaper but less faithful for nonlinear price impact; the paper states it is most suitable when |C| ≲ 1000, while simulation is recommended for |C| ≲ 200 and exact Shapley only for |C| < 20.
Exact Shapley attribution is exponential in candidate-set size, and the authors state Monte Carlo with 1000 samples stabilizes estimates within 5% of asymptotic values for typical candidate sizes in their dataset.
The paper claims MEV creation is highly concentrated across protocols, but the truncated text does not provide the exact protocol shares or a full top-k ranking.
Bot-data attribution is limited to opportunities observed by live searcher infrastructure from February 2026 onward, so it is not complete coverage; the paper explicitly uses it as validation rather than ground truth.
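
The candidate-set thresholds quoted in the findings above can be read as a feasibility check. The sketch below is illustrative only: the function name `feasible_methods` and the returned labels are hypothetical, and treating the approximate bounds (≲) as hard cut-offs is an assumption of this example rather than something the paper states.

```python
from typing import List


def feasible_methods(num_candidates: int) -> List[str]:
    """List the attribution methods whose stated size bound covers |C|.

    Bounds follow the summary: exact Shapley for |C| < 20, simulation for
    |C| <= 200, coefficient-based for |C| <= 1000. Using them as hard
    cut-offs is an assumption made for illustration.
    """
    if num_candidates < 0:
        raise ValueError("candidate-set size must be non-negative")
    methods: List[str] = []
    if num_candidates < 20:
        methods.append("exact_shapley")
    if num_candidates <= 200:
        methods.append("simulation")
    if num_candidates <= 1000:
        methods.append("coefficient")
    return methods


print(feasible_methods(150))
```

An empty list for very large candidate sets reflects only that the summary gives no guidance beyond |C| ≈ 1000.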

Threat model

The adversary is implicit and structural: a sequence of prior on-chain transactions that perturb pool reserves and create an arbitrage opportunity, combined with uncertainty about which of those transactions is the true source. The core method needs no off-chain price feed, but it cannot attribute opportunities caused by off-chain events unless those events are reflected on-chain. It also assumes deterministic EVM execution with replayable historical state; if state history is incomplete or a transaction depends on unavailable externalities, attribution may fail or become approximate.

Methodology — deep read

The threat model is causal attribution rather than adversarial security in the classic sense: given an observed atomic arbitrage transaction T_arb, the goal is to identify which prior on-chain transaction(s) created the price imbalance that made the arbitrage profitable. The adversary is not a malicious external attacker but the informational uncertainty inherent in MEV provenance: multiple earlier transactions can change pool reserves, and the arbitrage transaction itself only reveals that a profitable state existed. The paper assumes the blockchain is a deterministic state machine (EVM-compatible chain), historical state is replayable via an archive node, and the relevant MEV category is atomic arbitrage, which is fully on-chain and observable. It explicitly excludes other MEV classes such as sandwiches and liquidations, which would require different causal models.

Data are split into two evaluation corpora. For the large-scale analysis, they process Polygon blocks 83,770,001 through 84,820,000, corresponding to March 2–29, 2026, and identify 360,026 atomic arbitrage transactions. For the method-comparison study, they use Polygon blocks 82,546,747 through 82,567,395, a roughly 12-hour window on February 4, 2026, containing 2,526 atomic arbitrage events. The source text says they use a modified Geth archive node to access historical state snapshots and transaction traces; internal computation normalizes profit in MATIC, while reported totals are converted to USD at execution-time oracle prices. The truncated excerpt does not specify train/test splits in the ML sense because this is not a predictive model evaluation; instead, the February window serves as a focused benchmark for method comparison and the March window for distributional analysis.
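
The block ranges and event counts above can be sanity-checked with a few lines of arithmetic. This is a small worked example using only numbers quoted in this summary; the density and per-arbitrage averages are derived here for illustration and are not figures reported by the paper.

```python
def block_count(first_block: int, last_block: int) -> int:
    """Number of blocks in an inclusive block range."""
    return last_block - first_block + 1


march_blocks = block_count(83_770_001, 84_820_000)  # 1,050,000 blocks
feb_blocks = block_count(82_546_747, 82_567_395)    # 20,649 blocks

march_arbs, feb_arbs = 360_026, 2_526
march_usd = 334_799  # total extracted MEV in USD, March crawl

# Derived quantities (illustrative, not reported by the authors).
march_density = 1000 * march_arbs / march_blocks  # arbitrages per 1,000 blocks
feb_density = 1000 * feb_arbs / feb_blocks
mean_usd_per_arb = march_usd / march_arbs

print(march_blocks, feb_blocks)
print(round(march_density, 1), round(feb_density, 1), round(mean_usd_per_arb, 2))
```

The inclusive counts match the dataset sizes listed later in this summary, and the mean works out to a bit under one dollar per arbitrage, the same order of magnitude as the Figure 1 example.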

The architecture is a four-method attribution system.

First, bot-data attribution uses a production MEV searcher model as an external signal: a GNN encodes liquidity-pool topology and reserves, then an MLP value head predicts expected arbitrage profit; the agent is trained with PPO on historical Polygon data and is designed for sub-10 ms inference. This method does not infer causality from first principles; it treats a bot’s bidding and route-selection behavior as a proxy for what market participants believed was the source transaction.

Second, the simulation-based method is the main causal engine. It filters candidate transactions to those touching the same pools as the arbitrage route, then binary-searches backward from T_arb to find an “edge” transaction where simulated profit falls to 5% or less of the observed profit, and finally scans backward to compute each transaction’s marginal impact as the profit delta between adjacent replay states. The source is the transaction with the maximum positive marginal impact, with ties broken by proximity to the arbitrage transaction.

Third, the coefficient-based/K-value method avoids full replay and instead tracks the change in a theoretical price-multiplier coefficient k derived from pool reserves; the source is the transaction that maximizes Δk. The novelty here is efficiency, but the paper explicitly acknowledges that it ignores slippage and fees, so it is less faithful for large trades.

Fourth, the Shapley-based method frames preceding candidate transactions as players in a cooperative game where the value of any subset is the arbitrage profit obtainable after executing exactly that subset; exact Shapley is exponential, so the authors use Monte Carlo permutation sampling for larger candidate sets.
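
The simulation-based procedure can be sketched in a few lines. The code below is a toy under stated assumptions, not the authors' implementation: `profit_after(i)` is a hypothetical stand-in for an archive-node replay that returns the arbitrage profit when only the first `i` candidate transactions are applied, the binary search assumes the below-threshold condition is monotone in `i`, and the function names are invented for illustration.

```python
from typing import Callable, Optional


def find_edge(profit_after: Callable[[int], float], n: int,
              threshold: float = 0.05) -> Optional[int]:
    """Index of the edge transaction among n time-ordered candidates.

    The edge is the candidate whose inclusion first lifts simulated profit
    above threshold * observed profit. Returns None if no edge exists.
    """
    observed = profit_after(n)
    if observed <= 0:
        return None
    cutoff = threshold * observed
    if profit_after(0) > cutoff:
        return None  # the opportunity predates the candidate window
    lo, hi = 0, n  # invariant: profit_after(lo) <= cutoff < profit_after(hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if profit_after(mid) <= cutoff:
            lo = mid
        else:
            hi = mid
    return lo


def attribute_source(profit_after: Callable[[int], float], n: int,
                     threshold: float = 0.05) -> Optional[int]:
    """Candidate with the largest positive marginal profit impact."""
    edge = find_edge(profit_after, n, threshold)
    if edge is None:
        return None
    best_index, best_delta = None, 0.0
    for j in range(n - 1, edge - 1, -1):  # scan backward from T_arb to the edge
        delta = profit_after(j + 1) - profit_after(j)
        if delta > best_delta:  # strict '>' keeps the tx closest to T_arb on ties
            best_index, best_delta = j, delta
    return best_index


# Toy replay: profit after applying the first i of 4 candidates (made-up numbers).
profits = [0.0, 0.0, 0.01, 0.30, 0.31]
source = attribute_source(lambda i: profits[i], n=4)
print(source)
```

In this toy trace the third candidate (index 2) is both the edge and the source, since it accounts for nearly all of the profit. A real replay would replace the list lookup with state simulation, which is where the cost of this method comes from.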

Training and implementation details are only partially specified in the excerpt. The bot-data RL agent is trained with PPO on historical Polygon data, but the excerpt does not give epochs, batch size, learning rate, random seeds, or validation strategy. The rest of the system is offline and retrospective rather than trained in the machine-learning sense. The authors implement performance-critical components in Rust, wrap analysis and visualization in Python, and run experiments on a multi-core cluster (the hardware description is truncated after “Intel Xeon Pl…”). For the Shapley method, the paper states that 1000 Monte Carlo samples are enough for estimates to stabilize within 5% of asymptotic values for typical candidate-set sizes, and the simulation method uses a hardcoded D = 100 blocks on Polygon as a conservative lookback window.
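
Monte Carlo permutation sampling for Shapley values, as mentioned above, follows a standard pattern that can be sketched as below. This is a generic illustration, not the authors' code: `value` is a hypothetical callback standing in for "arbitrage profit after executing exactly this subset of candidates", and the toy payoff numbers are made up.

```python
import random
from typing import Callable, FrozenSet, List


def monte_carlo_shapley(n_players: int,
                        value: Callable[[FrozenSet[int]], float],
                        num_samples: int = 1000,
                        seed: int = 0) -> List[float]:
    """Estimate Shapley values by averaging marginal gains over random orderings."""
    rng = random.Random(seed)
    totals = [0.0] * n_players
    order = list(range(n_players))
    for _ in range(num_samples):
        rng.shuffle(order)
        coalition = set()
        previous = value(frozenset())
        for player in order:
            coalition.add(player)
            current = value(frozenset(coalition))
            totals[player] += current - previous  # marginal contribution
            previous = current
    return [total / num_samples for total in totals]


def toy_value(subset: FrozenSet[int]) -> float:
    """Made-up payoff: tx 2 alone creates 1.0; txs 0 and 1 jointly add 0.5."""
    profit = 1.0 if 2 in subset else 0.0
    if 0 in subset and 1 in subset:
        profit += 0.5
    return profit


estimates = monte_carlo_shapley(3, toy_value, num_samples=1000)
print(estimates)
```

Each sampled ordering costs one value evaluation per candidate, which is consistent with the O(N·|C|) cost cited for the Monte Carlo variant. In the toy game the exact values are 0.25, 0.25, and 1.0, so the estimate for transaction 2 is exact while the other two fluctuate around 0.25.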

Evaluation is comparative and method-oriented. The primary metrics reported in the visible text are attribution coverage, single-source rate, and method scalability/complexity rather than standard classifier metrics. Table 1 summarizes asymptotic costs: bot-data is O(1) after loading the model, coefficient-based is O(1), simulation is O(log |C| + m), exact Shapley is O(2^|C|), and Monte Carlo Shapley is O(N·|C|). Candidate set sizes drive feasibility: the authors recommend coefficient-based filtering when the candidate set is large, simulation for most cases, and Shapley only for small sets or ground-truth checks. The comparison study on the February window is meant to validate the single-source hypothesis and compare the four methods on the same events; however, the truncated excerpt does not include the per-method accuracy numbers, disagreement rates, or statistical tests. One concrete end-to-end example is Figure 1: an atomic arbitrage on Polygon block 58,329,504 that swaps between Uniswap V3 and Uniswap V2 WMATIC/USDC pools, producing 0.3121 WMATIC profit (about $0.24 at execution time). The attribution framework would trace that arbitrage backward through the relevant pool-interacting transactions, simulate counterfactual states, and assign the opportunity to the transaction with the largest marginal increase in executable profit.
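
For a two-pool route like the one in Figure 1, the coefficient-based idea can be illustrated as follows. The summary does not give the paper's exact definition of k, so this sketch assumes k is the product of marginal exchange rates around the cycle (k > 1 suggests an opportunity before fees and slippage). The function names, the tie-break toward the latest transaction, and the reserve numbers are all hypothetical.

```python
from typing import List, Tuple

Reserves = Tuple[float, float]  # (token X reserve, token Y reserve)


def cycle_coefficient(pool_a: Reserves, pool_b: Reserves) -> float:
    """Assumed k: sell X for Y on pool A, then Y for X on pool B, at marginal prices."""
    xa, ya = pool_a
    xb, yb = pool_b
    return (ya / xa) * (xb / yb)


def source_by_delta_k(states: List[Tuple[Reserves, Reserves]]) -> int:
    """states[i] holds both pools' reserves after the first i candidates.

    Returns the index of the candidate with the largest increase in k;
    ties go to the later transaction (an assumption of this sketch).
    """
    ks = [cycle_coefficient(a, b) for a, b in states]
    deltas = [ks[i + 1] - ks[i] for i in range(len(ks) - 1)]
    return max(range(len(deltas)), key=lambda i: (deltas[i], i))


# Made-up reserve snapshots: tx 0 nudges pool A, tx 1 moves it hard, tx 2 nudges pool B.
states = [
    ((1000.0, 1000.0), (1000.0, 1000.0)),
    ((990.0, 1010.1), (1000.0, 1000.0)),
    ((900.0, 1111.1), (1000.0, 1000.0)),
    ((900.0, 1111.1), (1001.0, 999.0)),
]
print(source_by_delta_k(states))
```

Only reserve snapshots are needed, with no transaction replay, which is why this approach is cheap. The same property is why it cannot see trade-size effects.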

Reproducibility is mixed. The paper clearly states that it uses Polygon historical data, a modified Geth archive node, and publicly specified algorithms, but the excerpt does not mention a code release, frozen weights for the bot model, or public release of the labeled dataset. Because the method depends on historical state replay and possibly non-public bot bidding logs, full replication may require substantial infrastructure and access beyond the paper text. The truncated section also leaves some implementation details unspecified, especially for the bot-data pipeline and the exact Monte Carlo sample schedule for Shapley beyond the 1000-sample heuristic.

Technical innovations

Formally defines MEV opportunity attribution as a causal provenance problem over deterministic blockchain state transitions, rather than as extraction or mitigation.
Introduces a simulation-based backward replay method that finds an edge transaction via binary search and then assigns source responsibility by marginal profit impact.
Adapts a Shapley-value formulation to MEV opportunity creation, treating preceding transactions as players in a cooperative game whose payoff is arbitrage profit.
Uses live searcher bot bidding behavior as an external validation signal for opportunity-source attribution, rather than relying only on retrospective replay.
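
For reference, the standard Shapley value that the cooperative-game framing relies on can be written as below. The notation is an assumption chosen to match this summary: C is the candidate set, and v(S) is the arbitrage profit obtainable after executing exactly the subset S.

```latex
\phi_i(v) \;=\; \sum_{S \subseteq C \setminus \{i\}}
  \frac{|S|!\,\bigl(|C| - |S| - 1\bigr)!}{|C|!}\,
  \bigl[\, v(S \cup \{i\}) - v(S) \,\bigr],
\qquad
\sum_{i \in C} \phi_i(v) \;=\; v(C) - v(\emptyset).
```

Evaluating this exactly requires v on all 2^{|C|} subsets; at |C| = 20 that is already 1,048,576 replays, which is consistent with the small-set restriction on exact Shapley noted in this summary.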

Datasets

Polygon March 2026 crawl — 1,050,000 blocks (83,770,001–84,820,000) — public blockchain data via modified Geth archive node
Polygon February 4, 2026 window — 20,649 blocks (82,546,747–82,567,395) — public blockchain data via modified Geth archive node
Atomic arbitrage events (March 2026) — 360,026 events — derived from Polygon on-chain traces using Vostrikov et al. criteria
Atomic arbitrage events (February 2026) — 2,526 events — derived from Polygon on-chain traces using Vostrikov et al. criteria

Baselines vs proposed

Bot-data attribution vs. simulation-based attribution: no numeric accuracy/comparison value visible in the excerpt
Coefficient-based attribution vs. simulation-based attribution: no numeric accuracy/comparison value visible in the excerpt
Exact Shapley vs. Monte Carlo Shapley: 1000 Monte Carlo samples reportedly stabilize within 5% of asymptotic values for typical candidate sets
Single-source hypothesis check: 96.7% of atomic arbitrage opportunities attributable to a single source transaction

Limitations

The excerpt does not report the full per-method accuracy table, disagreement rates, or error bars, so comparative performance is only partially visible.
Bot-data attribution depends on non-public bidding logs / mempool surveillance and only covers opportunities observed from February 2026 onward.
The simulation method assumes deterministic replay fidelity; any mismatch between archive-node replay and actual historical execution would bias attribution.
The coefficient-based method ignores slippage, fees, and liquidity depth, so it can misattribute large or nonlinear arbitrage opportunities.
The scope is limited to atomic arbitrage on Polygon; results may not generalize to sandwiches, liquidations, cross-chain MEV, or non-EVM chains.
The excerpt truncates key implementation details, including hardware configuration, seeds, and whether code or labeled data are released.
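
The slippage limitation of the coefficient-based method can be made concrete with a constant-product pool. The sketch below assumes a Uniswap V2-style pool with a 0.3% fee and made-up reserves; concentrated-liquidity pools such as Uniswap V3 behave differently, and the function names are hypothetical.

```python
def constant_product_out(x_reserve: float, y_reserve: float, dx: float,
                         fee: float = 0.003) -> float:
    """Output of token Y for dx of token X under x * y = const, fee on input."""
    dx_after_fee = dx * (1.0 - fee)
    return y_reserve * dx_after_fee / (x_reserve + dx_after_fee)


def marginal_price_out(x_reserve: float, y_reserve: float, dx: float) -> float:
    """Linear estimate that ignores fees and price impact."""
    return dx * y_reserve / x_reserve


def relative_gap(x_reserve: float, y_reserve: float, dx: float,
                 fee: float = 0.003) -> float:
    """Fraction by which the linear estimate overstates the real output."""
    linear = marginal_price_out(x_reserve, y_reserve, dx)
    actual = constant_product_out(x_reserve, y_reserve, dx, fee)
    return (linear - actual) / linear


small_gap = relative_gap(1000.0, 1000.0, 1.0)    # trade is 0.1% of reserves
large_gap = relative_gap(1000.0, 1000.0, 100.0)  # trade is 10% of reserves
print(small_gap, large_gap)
```

For the small trade the gap is roughly the fee itself, while for the large trade it grows to about nine percent, so a purely reserve-ratio signal becomes less reliable as trade size rises relative to pool depth.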

Open questions / follow-ons

How well does the attribution framework generalize to non-atomic MEV types like sandwiches, liquidations, and cross-domain arbitrage?
Can the single-source assumption be relaxed into a principled multi-source causal model without exponential cost in the common case?
How sensitive are the simulation and coefficient methods to reorgs, partial archive-node inconsistencies, or nonstandard DEX contract behavior?
Can protocol-level interventions use these attribution signals to reduce MEV leakage without suppressing legitimate price discovery?

Why it matters for bot defense

For bot-defense practitioners, this paper is relevant less as a blockchain study than as a template for causal attribution under adversarial competition. The key idea is to move from “who extracted value?” to “what upstream event made the exploit possible?”, which is directly analogous to explaining why a bot succeeded on a protected workflow. The simulation-based and Shapley-style methods suggest a way to rank candidate precursor actions by marginal contribution, but the same trade-off appears in captcha defense: high-fidelity counterfactual reasoning is expensive, while cheaper heuristics can miss nonlinear interactions.

A bot-defense engineer could adapt the paper’s methodology to analyze fraud or automation incidents where multiple user actions jointly create a vulnerable state. The caution is that the paper’s strongest results depend on deterministic replay and a tightly scoped problem (atomic arbitrage on EVM chains); if your system has hidden state, non-deterministic behavior, or opaque client-side signals, the attribution confidence drops quickly. In other words: good pattern for causal forensics, but only when your instrumentation is strong enough to replay the world.

Cite

@article{arxiv2604_27979,
  title={The Origins of MEV: Systematic Attribution of Arbitrage Opportunity Creation at Scale},
  author={Andrei Seoev and Dmitry Belousov and Anastasiia Smirnova and Ksenia Kurinova and Aleksei Smirnov and Denis Fedyanin and Yury Yanovich},
  journal={arXiv preprint arXiv:2604.27979},
  year={2026},
  url={https://arxiv.org/abs/2604.27979}
}
