GGBound: A Genome-Grounded Agent for Microbial Life-Boundary Prediction

Source: arXiv:2605.14442 · Published 2026-05-14 · By Hanbo Huang, Xuan Gong, Jing Wang, Lei Bai, Xiang Xiao, Weishu Zhao et al.

TL;DR

This paper addresses the problem of predicting microbial physiological life boundaries (viable temperature, pH, and salinity ranges, substrate utilization, and morphology) directly from strain-level genomic data. Existing methods either treat physiological traits independently or use biological foundation models as static encoders, and so fail to bridge genotype to phenotype. The authors formulate a unified genome-to-physiology prediction task and propose GGBound, a genome-conditioned, tool-augmented large language model agent that combines frozen genome embeddings with external biological evidence gathered through tools. They curate a strain-centric benchmark of 1,525 microbial strains and 6,448 phenotype-labeled instances from IJSEM, NCBI, and BacDive covering multiple physiological trait categories. Architecturally, the agent injects compact LucaOne genome embeddings into a Qwen LLM and reasons dynamically with a similarity-based retrieval-augmented generation (RAG) module and a genome-scale metabolic model (GEM) perturbation tool. Training proceeds through gene–text alignment, supervised fine-tuning on distilled tool-usage trajectories, and reinforcement learning (RL) with a counterfactual gene-grounding reward that enforces causal dependence on genomic input. The resulting 4B-parameter agent matches or surpasses larger frontier LLMs such as DeepSeek and GLM variants in predicting physiological boundaries and traits, and ablations confirm the contribution of genome fusion, tool integration, and gene-grounded policy optimization. The result is a multi-modal LLM framework that couples genome embeddings, domain tools, and causal RL rewards for interpretable microbial life-boundary prediction.

Key findings

The curated benchmark dataset contains 1,525 unique microbial strains and 6,448 phenotype-labeled instances across five physiological prediction tasks, including viability intervals, optimal growth conditions, metabolism, categorical traits, and morphology.
Injecting frozen LucaOne genome embeddings into the Qwen LLM backbone via token fusion yields modest gains over the base language model, especially for continuous traits like optimal growth temperature (RMSE reduced from 8.570 to 8.449).
Agentic supervised fine-tuning (SFT) on distilled tool-use trajectories substantially improves performance across all tasks; for example, salinity-range interval coverage rate (ICR) rises from 0.471 (fusion) to 0.781 (SFT), and electron acceptor subtype mAP@5 jumps from 0.003 to 0.233.
Subsequent reinforcement learning with counterfactual gene-grounding further refines the model with smaller but consistent improvements, e.g., pH range ICR from 0.458 to 0.497 and nitrogen source mAP@5 from 0.209 to 0.217.
Tool-use statistics show that the base and fusion models make far more tool calls (6.966 on average) yet perform worse than the agentic models, which make fewer calls (SFT 1.574, RL 1.732); this indicates that the agentic models acquire evidence more selectively and effectively.
In a direct comparison, the 4B-parameter GGBound agent matches or outperforms substantially larger models like DeepSeek-V3.2 and GLM-4.7 across aggregated prediction metrics (see Fig 7), demonstrating efficiency and domain-specific gains.
The counterfactual gene-grounding reward is essential for enforcing genuine genomic conditioning: it counters language-prior bias by rewarding only answers that are more probable under the authentic genome embedding than under a zeroed genome input.
Ablation studies confirm that genome-token fusion, dynamic tool usage, and counterfactual RL each contribute independently to accuracy.
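The counterfactual gene-grounding idea in the findings above can be illustrated with a minimal sketch. The summary states only that token log-probabilities under the real genome embedding are compared with those under a zeroed embedding; the mean-gap form, the clipping, and the names `gene_grounding_reward` and `cap` are assumptions made for illustration, not the authors' formula.

```python
from typing import Sequence


def gene_grounding_reward(
    logp_real: Sequence[float],
    logp_zero: Sequence[float],
    cap: float = 1.0,
) -> float:
    """Hypothetical counterfactual reward for one sampled answer.

    logp_real: per-token log-probs of the answer given the real genome embedding.
    logp_zero: per-token log-probs of the same tokens given an all-zero embedding.
    """
    if len(logp_real) != len(logp_zero) or not logp_real:
        raise ValueError("need two non-empty, equal-length log-prob sequences")
    # Mean per-token log-likelihood gap between the two conditions.
    gap = sum(r - z for r, z in zip(logp_real, logp_zero)) / len(logp_real)
    # Assumed shaping: reward only a positive gap, clipped at `cap`.
    return max(0.0, min(cap, gap))


# The answer is much more likely with the real genome: positive reward.
grounded = gene_grounding_reward([-0.2, -0.1, -0.3], [-1.0, -0.9, -1.1])
# The genome makes no difference (language prior only): zero reward.
prior_only = gene_grounding_reward([-0.5, -0.5], [-0.5, -0.5])
print(grounded, prior_only)
```

Under this assumed shaping, an answer that the model would have produced equally well without the genome earns nothing, which is the behaviour the finding describes.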

Threat model

N/A. The paper focuses on microbial physiological prediction rather than a security or adversarial threat scenario. To avoid reliance on memorized taxonomic or language priors, strain names are anonymized and causal dependence on the genome embedding is enforced, but no malicious adversary is considered.

Methodology — deep read

Threat model & assumptions: No adversary is defined, since this is a computational microbial prediction task rather than a security paper. The model predicts physiological traits for microbial strains using only genome encodings and tool observations, and strain names are anonymized to prevent memorization of taxonomic identity. The model has no access to taxonomic labels, and counterfactual ablation rewards discourage reliance on language priors alone.
Data: The authors curate a large-scale strain-centric dataset by integrating literature-derived physiological traits extracted with LLM pipelines (DeepSeek-V3.2) from 18,498 IJSEM articles, genome and protein sequences from NCBI, and phenotype annotations from BacDive. The combined corpus covers 1,525 unique strains and 6,448 task instances spanning continuous viability ranges, optimal conditions, and categorical traits. Raw data are filtered, normalized, and harmonized with controlled vocabularies. Genome sequences are encoded by the frozen LucaOne model into 2560-dimensional CLS token embeddings, which serve as the genomic modality. Benchmark data are held out and disjoint from the agent's training data.
Architecture / algorithm: The core agent is based on a Qwen3.5-4B language model backbone. Genome embeddings are injected as continuous genome tokens via a lightweight 2-layer MLP projector (token fusion) into the textual context. Two biological tools augment the agent: (1) a similarity-based Retrieval-Augmented Generation (RAG) module retrieves top-3 nearest strains’ annotations via cosine similarity in gene-embedding space, providing relevant evidence; (2) a genome-scale metabolic model (GEM) perturbation tool simulates minimal media growth under nutrient removals (e.g., sulfate, ferric iron) using strain-specific metabolic models reconstructed by CarveMe. The agent interleaves query, tool calls, and observations to produce physiological trait predictions.
Training regime: Training proceeds in three stages. Stage 1: gene–text alignment instruction tuning with full parameter updates on a gene-aligned Qwen3.5-4B checkpoint (one epoch, batch=256, LR=2e-5). Stage 2: agentic supervised fine-tuning (SFT) on 54,249 distilled trajectories generated by a Qwen3.5-27B teacher, including stepwise tool use and grounding rationales (one epoch, batch=128, LR=5e-6). Stage 3: reinforcement learning with Group Relative Policy Optimization (GRPO) on a separate prompt set of 2,000 strains, optimizing a composite reward that combines JSON format, correctness, tool use, and, critically, a counterfactual gene-grounding reward that compares token log-probabilities under real versus zeroed genome embeddings (LR=1e-6). The LucaOne encoder and modal fusion remain frozen throughout.
Evaluation protocol: Metrics are task-specific: Interval Coverage Rate (ICR) for range predictions, RMSE for optima, mAP@5 for substrate utilization, and accuracy for categorical traits. Benchmark samples are anonymized. The authors report the performance progression on five physiological trait groups across the base LLM, genome-fusion, agentic SFT, and RL stages, and describe the gains as consistent. They also analyze tool usage (number and type of calls), run ablations on each main component, and compare against larger frontier LLM baselines (DeepSeek V3.2/R1, GLM-4.7, Kimi-K2).
Reproducibility: The authors state that the benchmark dataset will be released. Genome and phenotype data come from external biological resources (NCBI, BacDive). Release of code or weights is not explicitly indicated. Hyperparameters and training details are described in Appendix B. The teacher checkpoints and the trajectory distillation method depend on proprietary Qwen3.5-27B models.
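The similarity-based retrieval step described under Architecture (top-3 nearest strains by cosine similarity in gene-embedding space) can be sketched as follows. This is a toy illustration under stated assumptions: the strain identifiers, the 3-dimensional vectors (the real embeddings are 2560-dimensional), and the function names `cosine` and `retrieve_top_k` are hypothetical and not taken from the paper.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_top_k(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k strains whose embeddings are most similar to the query."""
    scored = [(strain_id, cosine(query, emb)) for strain_id, emb in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index of anonymized strains (hypothetical values).
index = {
    "strain_A": [1.0, 0.0, 0.0],
    "strain_B": [0.9, 0.1, 0.0],
    "strain_C": [0.0, 1.0, 0.0],
    "strain_D": [0.7, 0.7, 0.0],
}
hits = retrieve_top_k([1.0, 0.05, 0.0], index)
print([strain_id for strain_id, _ in hits])  # ['strain_A', 'strain_B', 'strain_D']
```

In an agent loop, the annotations of the returned strains would then be passed back to the model as a tool observation; how GGBound formats that observation is not specified in this summary.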

Technical innovations

Formulating microbial life-boundary prediction as a unified genome-to-physiology task integrating viability intervals, environmental optima, substrate utilization, categorical and morphological traits.
Designing a genome-conditioned, tool-augmented LLM agent architecture that injects frozen LucaOne genome embeddings into a Qwen backbone via lightweight token fusion and dynamically reasons with retrieval-based (RAG) and genome-scale metabolic model (GEM) perturbation tools.
Developing a three-stage training pipeline combining gene–text alignment, supervised fine-tuning on distilled multi-step tool-use trajectories, and Group Relative Policy Optimization (GRPO) with a novel counterfactual gene-grounding reward that enforces causal dependence on genome input.
Introducing the counterfactual gene-grounding reward that compares token log-probabilities conditioned on authentic versus zeroed genome embeddings, thus penalizing reliance on language priors and strengthening genuine genomic conditioning.
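To make the RL stage more concrete, the sketch below combines the four reward terms named in this summary (JSON format, correctness, tool use, gene grounding) and then applies the group-relative normalization that GRPO is generally described as using: each sampled response is scored against the mean and standard deviation of its own group. The weights in `WEIGHTS` and the per-term scores are hypothetical; the summary does not give the actual weighting.

```python
import statistics
from typing import Dict, List

# Hypothetical weights: the four terms are named in the summary, the values are not.
WEIGHTS = {"format": 0.1, "correct": 0.6, "tool": 0.1, "grounding": 0.2}


def composite_reward(terms: Dict[str, float]) -> float:
    """Weighted sum of the four reward terms for one sampled response."""
    return sum(WEIGHTS[name] * terms[name] for name in WEIGHTS)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one strain prompt (toy scores).
group = [
    {"format": 1.0, "correct": 1.0, "tool": 1.0, "grounding": 0.8},
    {"format": 1.0, "correct": 0.0, "tool": 1.0, "grounding": 0.0},
    {"format": 0.0, "correct": 0.0, "tool": 0.0, "grounding": 0.0},
    {"format": 1.0, "correct": 1.0, "tool": 0.0, "grounding": 0.1},
]
rewards = [composite_reward(terms) for terms in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Because advantages are computed relative to the group, a correct but weakly grounded response (the fourth sample) still receives a smaller advantage than a correct and well-grounded one (the first), which is the pressure the grounding term is meant to add.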

Datasets

Unified microbial physiology benchmark — 1,525 strains, 6,448 phenotype instances — curated from IJSEM literature, NCBI genomes, and BacDive phenotype annotations
Training data for SFT and RL — 10,000 BacDive strain records with aligned genomic embeddings and physiological traits (8,000 for SFT, 2,000 for RL)
IJSEM literature corpus — 88,927 high-quality microbial strain records extracted via LLM pipeline from 18,498 articles

Baselines vs proposed

Base Qwen3.5-4B model: Salinity range ICR = 0.466 vs GGBound RL agent: 0.818
Fusion model (with genome-token injection): pH range ICR = 0.296 vs RL agent: 0.497
Agentic SFT model: Electron acceptor mAP@5 = 0.233 vs RL agent: 0.249
RL Agent: Optimal growth temperature RMSE = 6.373 vs base model: 8.570
RL Agent: Oxygen tolerance accuracy = 0.549 vs base model: 0.271
GGBound 4B model: average performance surpasses larger LLMs (DeepSeek-V3.2, DeepSeek-R1, GLM-4.7, Kimi-K2) on physiological boundary prediction (Fig 7; exact metrics not given)
Tool-use count: base model averages 6.966 calls per sample vs RL agent 1.732 with higher accuracy
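The comparisons above mix metrics where lower is better (RMSE) and higher is better (ICR, accuracy), so it can help to convert them to explicit changes. The short calculation below uses only the numbers reported in this section; the helper name `relative_change` is introduced here for illustration.

```python
def relative_change(before: float, after: float) -> float:
    """Signed relative change from `before` to `after`."""
    return (after - before) / before


# Optimal growth temperature RMSE, base model vs RL agent (lower is better).
rmse_change = relative_change(8.570, 6.373)
# Salinity range ICR, base model vs RL agent (higher is better).
salinity_gain = 0.818 - 0.466
# Oxygen tolerance accuracy, base model vs RL agent (higher is better).
oxygen_gain = 0.549 - 0.271
# Average tool calls per sample, base model vs RL agent.
calls_ratio = 6.966 / 1.732

print(f"RMSE change: {rmse_change:.1%}")            # RMSE change: -25.6%
print(f"Salinity ICR gain: {salinity_gain:+.3f}")   # Salinity ICR gain: +0.352
print(f"Oxygen accuracy gain: {oxygen_gain:+.3f}")  # Oxygen accuracy gain: +0.278
print(f"Tool-call ratio: {calls_ratio:.1f}x")       # Tool-call ratio: 4.0x
```

Read this way, the RL agent cuts temperature RMSE by about a quarter while making roughly a quarter as many tool calls as the base model.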

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.14442.

Fig 1: Overview of GGBound. The agent conditions on LucaOne genome embeddings, integrates …

Figs 2–8 (page 2).

Limitations

No explicit adversarial evaluation or robustness testing against noisy genomic inputs or adversarial manipulation.
Experimental validation of predicted physiological boundaries remains limited; the predictions are currently hypotheses rather than verified outcomes.
The benchmark and training data, though large, are constrained to cultivable strains with available annotations, potentially biasing predictions toward well-studied microbes.
Counterfactual gene-grounding reward relies on zeroed genome embedding as baseline, which may underestimate partial genome signal contributions or ignore other confounding factors.
The method depends on datasets that are not yet released and on large pretrained LLMs (Qwen3.5, Qwen3.5-27B teacher), which may limit reproducibility and accessibility.
Genome-token fusion uses frozen embeddings without end-to-end genome encoder tuning, which could limit adaptation to specific phenotype prediction nuances.

Open questions / follow-ons

Can the agent’s life-boundary predictions be prospectively validated experimentally to quantify real-world utility in strain cultivation and biotechnology applications?
How well does the approach generalize to uncultured or poorly annotated microbial taxa outside the benchmark datasets?
Could end-to-end training jointly tuning genome encoders and the LLM policy further improve prediction accuracy or allow discovery of novel genotype-phenotype links?
What are the limits of the counterfactual gene-grounding reward paradigm, and can alternative causal attribution methods better enforce genome input usage?

Why it matters for bot defense

This work is not directly related to CAPTCHA or bot defense, but its methodology shows how domain-specific data and tool-augmented large language models can be combined with counterfactual training rewards to improve grounding and interpretability. Bot-defense engineers could borrow the genome-token fusion and retrieval-augmented interaction techniques when designing systems that feed external evidence into LLMs with explicit grounding signals, which may improve robustness to adversarial or out-of-distribution inputs. The counterfactual reward, which enforces causal dependence on essential conditioning information, might also apply to CAPTCHA challenges or bot classifiers that must rely on specific input features rather than on textual or superficial priors. More generally, the paper shows how to fuse pretrained embeddings with LLM agents and optimize tool-use policies, which could inform future multi-modal verification and bot-detection approaches.

Cite

bibtex

@article{arxiv2605_14442,
  title={GGBound: A Genome-Grounded Agent for Microbial Life-Boundary Prediction},
  author={Hanbo Huang and Xuan Gong and Jing Wang and Lei Bai and Xiang Xiao and Weishu Zhao and Shiyu Liang},
  journal={arXiv preprint arXiv:2605.14442},
  year={2026},
  url={https://arxiv.org/abs/2605.14442}
}
