GeoSVG-RL: Geometry-Aware Reinforcement Learning for Layout-Constrained Text-to-SVG Diagram Generation

Source: arXiv:2605.25447 · Published 2026-05-25 · By Sifan Li, Yujun Cai, Hongkai Chen, Yiwei Wang

TL;DR

GeoSVG-RL targets structurally reliable, layout-constrained text-to-SVG diagram generation, a task where standard large language models often produce outputs with subtle geometric errors (e.g., misaligned connectors, overlapping text, content overflowing the canvas). The key innovation is a reinforcement learning framework that optimizes SVG generation with executable, browser-rendered geometric feedback rather than relying solely on token-level likelihood objectives. The model first predicts a structured layout plan that acts as a geometric contract, then generates SVG code conditioned on this plan. A browser-backed verifier extracts detailed geometry to compute fine-grained rewards for rendering validity, canvas fit, text containment, anchor accuracy, graph structure, and code cleanliness.

Quantitative experiments on a synthetic diagram dataset show that GeoSVG-RL outperforms prior state-of-the-art methods (e.g., VFig, AutoFigure-Edit) and supervised baselines on most of its geometry-aware metrics. Arrow anchor accuracy, text containment rate, and graph edge F1 improve by large margins over the supervised warm start and by smaller margins over VFig. An ablation shows that layout planning, verifier feedback, and reinforcement learning each contribute to gains in local geometric precision and global layout adherence. These results suggest that combining structured planning with executable verifier rewards in an RL framework is a promising route to professional-grade, editable SVG diagrams that meet strict spatial constraints.

Key findings

GeoSVG-RL achieves 92.3% Render Success Rate (RSR), outperforming VFig (90.2%) and AutoFigure-Edit (91.5%) on valid SVG output generation.
Arrow Anchor Accuracy (AAcc) increases from 43.2% under supervised warm start to 78.6% after RL refinement, a +35.4 point improvement.
Text-In-Box Rate (TBR) improves from 39.8% (supervised baseline) to 83.0% with GeoSVG-RL, more than doubling text containment accuracy.
Edge Connectivity F1 (E-F1) rises from 47.3% to 90.4% after RL, nearly doubling graph structural consistency.
Global Fit Rate (GFR) reaches 89.7%, highest among compared models, indicating superior canvas boundary adherence.
Overflow Area Ratio (OAR, lower is better) is 2.9% for GeoSVG-RL, slightly worse than AutoFigure-Edit’s 2.2%, suggesting residual challenges with dense layouts.
Code Cleanliness score of 91.8% reflects improved semantic SVG structure versus baselines (89.5% and 90.2%).
Ablation confirms that layout plan conditioning, supervised warm start, verifier-based reranking, and RL each provide incremental gains across almost all metrics.
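
As a rough illustration of what three of the metrics above measure, the following sketch computes a Text-In-Box Rate, an Overflow Area Ratio, and an edge F1 from axis-aligned boxes and edge sets. The `Box` type, the function names, and the exact formulas are assumptions made for this example; the paper's precise definitions may differ.

```python
from typing import NamedTuple

class Box(NamedTuple):
    """Axis-aligned box: top-left corner (x, y), width w, height h."""
    x: float
    y: float
    w: float
    h: float

    @property
    def area(self) -> float:
        return max(self.w, 0.0) * max(self.h, 0.0)

def intersect_area(a: Box, b: Box) -> float:
    """Area of the overlap between two boxes (0 if they are disjoint)."""
    ix = min(a.x + a.w, b.x + b.w) - max(a.x, b.x)
    iy = min(a.y + a.h, b.y + b.h) - max(a.y, b.y)
    return max(ix, 0.0) * max(iy, 0.0)

def contains(outer: Box, inner: Box, pad: float = 0.0) -> bool:
    """True if `inner` lies inside `outer`, leaving `pad` on every side."""
    return (inner.x >= outer.x + pad
            and inner.y >= outer.y + pad
            and inner.x + inner.w <= outer.x + outer.w - pad
            and inner.y + inner.h <= outer.y + outer.h - pad)

def text_in_box_rate(pairs) -> float:
    """Fraction of (node_box, text_box) pairs whose text stays in its node."""
    if not pairs:
        return 0.0
    return sum(contains(node, text) for node, text in pairs) / len(pairs)

def overflow_area_ratio(canvas: Box, elements) -> float:
    """Share of total element area that falls outside the canvas."""
    total = sum(e.area for e in elements)
    if total == 0:
        return 0.0
    inside = sum(intersect_area(canvas, e) for e in elements)
    return (total - inside) / total

def edge_f1(pred_edges, gold_edges) -> float:
    """F1 between predicted and reference sets of (source, target) edges."""
    pred, gold = set(pred_edges), set(gold_edges)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

In this reading, a lower Overflow Area Ratio is better, while the other two scores are better when higher.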

Threat model

n/a — This work is not centered on a security or adversarial threat model but rather on improving structured, constrained SVG diagram generation reliability through reinforcement learning.

Methodology — deep read

Threat model & assumptions: The adversary is effectively the distribution of possible text prompts and the stochastic generation process. The system assumes access to a dataset of synthetic text-to-layout-to-SVG triples. No explicit security adversary is modeled; rather, the problem is how to generate structurally sound SVG that meets layout constraints from text prompts.
Data: A procedurally generated synthetic dataset of box-arrow-text diagrams is created using layout engines to assign bounding boxes, anchors, and edges. The dataset includes templates for pipeline, stacked, grouped, and branching layouts. Each example contains a prompt, a plan (layout specification z), an SVG program (y), and geometric metadata. Data splits use random seeds and graph templates to ensure test-time novelty.
Architecture: The system factorizes SVG generation as p(y, z|x) = p_φ(z|x) p_θ(y|x, z). The layout planner p_φ outputs a JSON plan (canvas size, node bounding boxes, labels, anchors, graph edges). The SVG generator p_θ is an autoregressive code model conditioned on prompt x and layout z that generates SVG code. The pipeline enforces an explicit geometric contract to constrain generation and reduce coordinate drift.
Training regime: The model is initialized by supervised fine-tuning (SFT) of p_φ and p_θ on synthetic layout-SVG pairs using a token-level likelihood objective. Reinforcement learning with Group Relative Policy Optimization (GRPO) then refines p_θ using executable geometric rewards. At each RL iteration, multiple SVG candidates are sampled per prompt, rendered in a headless browser, parsed back to extract geometric metadata, and scored with a reward function combining execution validity, canvas fit, anchor placement accuracy, text containment and padding, graph connectivity (F1), and code cleanliness. Rewards within each candidate group are normalized into advantages, which update the model parameters via a clipped policy gradient objective for stability.
Evaluation protocol: Metrics measured on a held-out test split include Render Success Rate (RSR), Global Fit Rate (GFR), Overflow Area Ratio (OAR), Element-In-Canvas Rate (EICR), Arrow Anchor Accuracy (AAcc), Anchor Endpoint Error (AEE), Text-In-Box Rate (TBR), Text Padding Violation Rate (TPVR), Edge Connectivity F1 (E-F1), and code Cleanliness. Evaluation uses the browser rendering and verification pipeline to extract geometry. Baselines include the VFig and AutoFigure-Edit systems. Ablations isolate the effects of layout planning, warm start, reranking, RL, verifier fidelity, curriculum weighting, and the cleanliness reward.
Reproducibility: Code is publicly released (GitHub link in paper). The dataset is synthetic and described procedurally, but not published. Hardware and seed details are not specified in detail, but training uses multiple iterations of supervised and RL phases. Rendering and verification rely on a headless Chromium browser and standard XML parsers.
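
The group-relative update in the training regime can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the reward weights, the function names, and the use of one log-probability per sampled SVG (rather than per token) are all assumptions, and regularization terms that GRPO implementations often add are omitted.

```python
import math
from statistics import mean, pstdev

# Hypothetical weights for the verifier's reward components; the paper's
# actual coefficients are not reproduced here.
WEIGHTS = {"valid": 1.0, "canvas_fit": 1.0, "anchor": 1.0,
           "text": 1.0, "edge_f1": 1.0, "clean": 0.5}

def total_reward(components, weights=WEIGHTS):
    """Weighted sum of verifier scores, each assumed to lie in [0, 1]."""
    return sum(w * components.get(name, 0.0) for name, w in weights.items())

def group_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt's candidates to zero mean, unit scale."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_surrogate(logp_new, logp_old, advantages, clip=0.2):
    """Clipped policy-gradient objective (to be maximized) over one group.

    logp_new / logp_old are sequence-level log-probabilities of each sampled
    SVG under the current and the sampling policy, a simplification for this
    sketch.
    """
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)
        clipped_ratio = min(max(ratio, 1.0 - clip), 1.0 + clip)
        # Taking the minimum removes the incentive to move the ratio
        # far outside [1 - clip, 1 + clip].
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return mean(terms)
```

Because advantages are computed relative to the other candidates for the same prompt, a group whose candidates all receive the same reward contributes no gradient signal, which is one reason sampling several diverse candidates per prompt matters.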

Example end-to-end: Given a prompt "draw a pipeline with three modules and arrows," the planner produces a JSON layout plan specifying three rect nodes with positions and labels, and the edges connecting them. The generator samples SVG code conditioned on the prompt and plan. This code is rendered in Chromium, geometric features are extracted (bounding boxes, text regions, anchors, edges), and rewards are computed for anchor alignment, text fit, and the other criteria. Multiple SVG samples are evaluated per prompt, advantages are computed from their relative rewards, and the generator is updated to maximize expected reward, improving accuracy in subsequent generations.
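
To make the example concrete, here is a hypothetical layout plan for a three-module pipeline together with a toy check of an SVG against it. The plan schema, the node names, and the helper functions are invented for illustration. The check reads static SVG attributes with Python's standard XML parser as a stand-in for the Chromium-based verifier, so it cannot measure rendered text, and it assumes connectors appear in the same order as the plan's edges.

```python
import math
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

# Hypothetical plan: boxes are [x, y, width, height] in canvas pixels.
PLAN = {
    "canvas": {"width": 400, "height": 120},
    "nodes": [
        {"id": "a", "label": "Load", "box": [20, 40, 80, 40]},
        {"id": "b", "label": "Train", "box": [160, 40, 80, 40]},
        {"id": "c", "label": "Eval", "box": [300, 40, 80, 40]},
    ],
    "edges": [["a", "b"], ["b", "c"]],
}

# A candidate SVG whose second connector stops 4 px short of node "c".
SVG = """<svg xmlns="http://www.w3.org/2000/svg" width="400" height="120">
  <rect id="a" x="20" y="40" width="80" height="40"/>
  <rect id="b" x="160" y="40" width="80" height="40"/>
  <rect id="c" x="300" y="40" width="80" height="40"/>
  <line x1="100" y1="60" x2="160" y2="60"/>
  <line x1="240" y1="60" x2="296" y2="60"/>
</svg>"""

def rects_match_plan(plan, svg_text, tol=0.5):
    """True if every planned node has a <rect> with matching geometry."""
    root = ET.fromstring(svg_text)
    drawn = {r.get("id"): [float(r.get(k)) for k in ("x", "y", "width", "height")]
             for r in root.iter(SVG_NS + "rect")}
    return all(
        n["id"] in drawn
        and all(abs(a - b) <= tol for a, b in zip(drawn[n["id"]], n["box"]))
        for n in plan["nodes"]
    )

def anchor_errors(plan, svg_text):
    """Worst endpoint distance (px) per edge, for left-to-right connectors.

    Expected anchors are the source box's right midpoint and the target
    box's left midpoint; lines are paired with plan edges by position.
    """
    root = ET.fromstring(svg_text)
    lines = [[float(ln.get(k)) for k in ("x1", "y1", "x2", "y2")]
             for ln in root.iter(SVG_NS + "line")]
    boxes = {n["id"]: n["box"] for n in plan["nodes"]}
    errors = []
    for (src, dst), (x1, y1, x2, y2) in zip(plan["edges"], lines):
        sx, sy, sw, sh = boxes[src]
        tx, ty, tw, th = boxes[dst]
        start_err = math.hypot(x1 - (sx + sw), y1 - (sy + sh / 2))
        end_err = math.hypot(x2 - tx, y2 - (ty + th / 2))
        errors.append(max(start_err, end_err))
    return errors
```

With these inputs the rectangles match the plan, while the second connector shows a 4 px endpoint error, the kind of small drift that a token-level likelihood objective does not penalize directly but a geometric reward can.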

Technical innovations

Formulating layout-constrained text-to-SVG diagram generation as a constrained program synthesis problem that combines layout planning with geometry-aware generation.
Introducing executable, browser-backed verification to extract rich geometric feedback from rendered SVG output for use as reinforcement learning rewards.
Applying Group Relative Policy Optimization (GRPO) to train SVG generation policies based on relative advantages within candidate sets, stabilizing RL updates.
Designing a modular pipeline that predicts explicit structured layout plans as geometric contracts prior to SVG code generation to mitigate coordinate drift and enforce constraints.
Developing a curriculum weighting scheme that gradually shifts reward emphasis from local geometry (e.g., anchor placement) to global layout properties (e.g., canvas fit).
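
One simple way to realize the curriculum weighting described in the last item is to interpolate between two weight vectors over the course of training. The linear schedule, the component names, and the start and end values below are assumptions for illustration; the summary does not specify the paper's actual schedule.

```python
def curriculum_weights(step, total_steps, start, end):
    """Linearly interpolate reward weights from `start` to `end`.

    `start` and `end` map reward-component names to weights and must share
    the same keys. Progress is clamped to [0, 1].
    """
    t = min(max(step / total_steps, 0.0), 1.0)
    return {name: (1.0 - t) * start[name] + t * end[name] for name in start}

# Hypothetical schedule: emphasize local geometry early, global fit later.
EARLY = {"anchor": 0.6, "text": 0.3, "canvas_fit": 0.1}
LATE = {"anchor": 0.2, "text": 0.3, "canvas_fit": 0.5}

halfway = curriculum_weights(500, 1000, EARLY, LATE)
```

The weights returned at each step would then be used when combining the verifier's per-component scores into a single scalar reward.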

Datasets

Synthetic diagram corpus — size unspecified (~tens of thousands presumed) — procedurally generated with known templates and layouts

Baselines vs proposed

VFig: Render Success Rate = 90.2% vs GeoSVG-RL: 92.3%
VFig: Arrow Anchor Accuracy = 76.6% vs GeoSVG-RL: 78.6%
VFig: Text-In-Box Rate = 81.8% vs GeoSVG-RL: 83.0%
VFig: Edge Connectivity F1 = 85.7% vs GeoSVG-RL: 90.4%
AutoFigure-Edit: Overflow Area Ratio = 2.2% vs GeoSVG-RL: 2.9%
Supervised warm start baseline: Arrow Anchor Accuracy = 43.2% vs GeoSVG-RL: 78.6%
Supervised warm start baseline: Text-In-Box Rate = 39.8% vs GeoSVG-RL: 83.0%
Verifier-based reranking baseline: Edge Connectivity F1 = 68.4% vs GeoSVG-RL: 90.4%

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.25447.

Fig 1: The GeoSVG-RL framework adopts a plan-then-generate approach, where a structured

Fig 2: Overview of the GeoSVG-RL framework. Given a textual prompt x, the model first predicts

Fig 3 (page 2).

Fig 4 (page 4).

Fig 5 (page 4).

Fig 6 (page 4).

Fig 7 (page 4).

Fig 8 (page 4).

Limitations

Dependence on synthetic training data limits generalization to real-world, diverse diagram styles and noisy prompts.
Residual issues remain with global canvas overflow (the OAR metric), especially in dense layouts, implying limited global packing optimization.
RL training stability depends on curriculum weighting and careful hyperparameter tuning; training details and compute requirements are not documented in detail.
Browser-based rendering and verification pipeline introduces additional computational overhead and complexity for training.
No evaluation on adversarial or out-of-distribution prompts; robustness to noisy or ambiguous instructions is untested.
Real-world deployment risks producing plausible but inaccurate diagrams that still require human oversight.

Open questions / follow-ons

How can the framework be extended to real-world, diverse technical diagram corpora beyond synthetic templates?
Can global layout packing and overflow minimization be improved without compromising local geometric precision?
How does the model perform under noisy, ambiguous, or adversarial prompts that may cause semantic misinterpretations?
Could integrating learned font metric models or differentiable rendering further improve text containment rewards?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, GeoSVG-RL highlights critical challenges and solutions in generating structurally valid, editable vector graphics under strict geometric constraints. Its approach of tying generation to executable verification and explicit layout planning could inspire similar methods for ensuring structural integrity in automated challenge generation, such as CAPTCHA diagrams or puzzles that require spatial coherence and precise component alignment. The reinforcement learning setup integrating fine-grained geometric rewards from rendering feedback exemplifies how to enforce complex validity constraints beyond traditional token-based objectives, which could be adapted for robust CAPTCHA content creation resistant to automated solver strategies. The explicit verification of rendering validity and layout adherence also offers insights into developing detection mechanisms for malformed or bot-generated graphics in security applications.

Cite

bibtex

@article{arxiv2605_25447,
  title={GeoSVG-RL: Geometry-Aware Reinforcement Learning for Layout-Constrained Text-to-SVG Diagram Generation},
  author={Sifan Li and Yujun Cai and Hongkai Chen and Yiwei Wang},
  journal={arXiv preprint arXiv:2605.25447},
  year={2026},
  url={https://arxiv.org/abs/2605.25447}
}
