Autonomous Intelligent Agents for Natural-Language-Driven Web Execution with Integrated Security Assurance

Source: arXiv:2605.15281 · Published 2026-05-14 · By Vinil Pasupuleti, Siva Rama Krishna Varma Bayyavarapu, Shrey Tyagi

TL;DR

This paper addresses the brittleness and maintenance overhead of web test automation suites caused by UI refactors, ambiguous element locators, timing-related race conditions, and flaky navigation. The authors introduce an autonomous testing framework that pairs large language models (LLMs) with a five-strategy enhancement pipeline: navigation reliability, context-aware selector generation, script validation, heuristic wait injection, and failure learning. Together these strategies target the main failure modes of earlier LLM-based test script generation. Evaluated on 176 test scenarios across four production applications, the framework raises the script generation success rate from a 55% baseline to 93%, cuts navigation failures eight-fold, removes 80% of timing and race-condition errors, and reduces test creation time by 75% compared with manual Selenium scripting. The framework also extends to security testing: testers write natural-language attack descriptions, which the system converts into OWASP Top 10-aligned browser probes that detect 85% of authentication bypass and 95% of input validation vulnerabilities with false positive rates under 12%. The result is a practical approach to accessible, natural-language-driven functional and security test automation.

Key findings

Script generation success improved from 55% baseline to 93% after applying five integrated strategies (Table III).
Navigation failures fell 8×, from 40% of navigations failing to 5% (Table V).
Element-not-found errors decreased from 30% to 10% after context-enriched selector generation (Strategy 2).
Timing and race condition failures dropped from 25% to 5% through heuristic smart wait injection (Strategy 4).
Test creation time was cut by 75%, averaging 6.1 minutes per test with 93% success versus 23.5 minutes for manual Selenium scripting.
Security testing aligned to OWASP Top 10 detected 85% of authentication bypass vulnerabilities and 95% of input validation flaws, with false positive rates under 12% (Table II).
The autonomous agent achieved 67% self-recovery success from failed test steps and 92% goal completion over 85 complex goals.
Containerized worker architecture achieved 99% job completion compared to 12% for pure serverless approaches (Table I).
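
The relative reductions implied by these figures can be checked with a few lines of arithmetic. This is only a sanity check on the numbers as quoted in this summary; the helper name `relative_reduction` is introduced here for illustration and is not from the paper.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original value that was eliminated."""
    return (before - after) / before


# Figures as quoted in the key findings (percentages or minutes per test).
nav_fold = 40 / 5                           # navigation failures, fold change
timing_cut = relative_reduction(25, 5)      # timing/race-condition failures
element_cut = relative_reduction(30, 10)    # element-not-found errors
time_cut = relative_reduction(23.5, 6.1)    # test creation time

print(f"navigation: {nav_fold:.0f}x fewer failures")
print(f"timing: {timing_cut:.0%} reduction")
print(f"element-not-found: {element_cut:.0%} reduction")
print(f"creation time: {time_cut:.0%} reduction")
```

The creation-time figure works out to roughly 74%, which is close to the 75% reported.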

Threat model

The core "threat" is not a malicious actor but nondeterministic web application behavior that makes tests brittle: ambiguous selectors, timing and race conditions, and UI changes. In the security extension, the adversaries are authenticated or unauthenticated web users attempting horizontal privilege escalation, injection through input fields, or unauthorized access that is detectable at the browser layer. The system cannot detect server-side vulnerabilities that produce no browser-visible signal, and it does not yet fully handle adversarial prompt injection.

Methodology — deep read

The authors built an autonomous testing system combining large language models with a containerized execution infrastructure and a five-strategy test generation enhancement pipeline.

Threat model and assumptions: For functional testing, the "adversary" is the unpredictability of the web application itself (UI refactors, ambiguous selectors, timing and race conditions) rather than a malicious actor. Malicious actors appear only in the security testing extension, which models authenticated and unauthenticated attackers whose actions are observable at the browser layer.
Data: The evaluation covers four production web applications of varying architectures (SPA, hybrid, server-rendered) and 176 test scenarios, including complex multi-step workflows. For security validation, 47 known vulnerabilities were seeded. A script was labeled a success if at least 80% of its scripted test steps passed.
Architecture and algorithms: The core architecture (Fig 1) stacks User UI, Backend Orchestration (DB + API), AI Decision Engine (vision-enabled LLM), Headless Browser Automation, and Security Validation. The five strategies are:

Strategy 1: Navigation reliability, which replaces ambiguous click selectors that point to links with direct URL navigation.
Strategy 2: Selector specificity, which enriches selectors with parent context (headings, ARIA labels) to resolve ambiguity.
Strategy 3: Post-generation validation, which scores each script through static analysis and filters out invalid ones before execution.
Strategy 4: Smart wait injection, which heuristically adds waits after navigation and click steps to mitigate timing issues.
Strategy 5: Failure learning, which clusters failure logs to identify anti-patterns and feeds them back into improvements.
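
To make the script-rewriting strategies more concrete, the following is a minimal sketch of Strategies 1, 4, and 5 over a simple list-of-steps representation. The step format, the `href` field, the `wait_for_idle` action, the base URL, and all function names are assumptions made for illustration; the paper's actual script format and heuristics are not described in this summary.

```python
import re
from collections import Counter
from urllib.parse import urljoin

BASE_URL = "https://app.example.test"  # hypothetical application under test


def prefer_direct_navigation(steps, base_url=BASE_URL):
    """Strategy 1 (sketch): replace clicks on links with direct URL navigation."""
    rewritten = []
    for step in steps:
        if step["action"] == "click" and step.get("href"):
            rewritten.append({"action": "goto", "url": urljoin(base_url, step["href"])})
        else:
            rewritten.append(dict(step))
    return rewritten


def inject_waits(steps, timeout_ms=5000):
    """Strategy 4 (sketch): add an explicit wait after every navigation or click."""
    out = []
    for step in steps:
        out.append(dict(step))
        if step["action"] in ("goto", "click"):
            out.append({"action": "wait_for_idle", "timeout_ms": timeout_ms})
    return out


def cluster_failures(log_lines):
    """Strategy 5 (sketch): group failure messages by a normalized signature."""
    def signature(line):
        line = re.sub(r"'[^']*'", "'<sel>'", line)  # mask quoted selectors
        return re.sub(r"\d+", "<n>", line)           # mask numbers such as timeouts
    return Counter(signature(line) for line in log_lines)


steps = [
    {"action": "click", "selector": "a.nav-link", "href": "/pricing"},
    {"action": "click", "selector": "button#buy"},
]
script = inject_waits(prefer_direct_navigation(steps))

logs = [
    "Timeout 3000ms waiting for 'button#buy'",
    "Timeout 5000ms waiting for 'a.nav-link'",
    "Element not found: 'div.modal'",
]
clusters = cluster_failures(logs)
print([s["action"] for s in script])
print(clusters.most_common(1))
```

The link click becomes a `goto`, every navigation or click is followed by a wait, and the two timeout messages collapse into one cluster that would surface as the dominant anti-pattern.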

The autonomous agent runs a perceive-reason-act loop in which transformer-based LLMs combine DOM semantics with vision-based screenshot analysis. The system supports goal decomposition, re-planning, and error recovery through retries and alternative selectors.
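
A rough sketch of the recovery part of such a loop is shown below. `FakePage` stands in for a real browser, and the rule-based `choose_selector` is a placeholder for the LLM's reasoning step; both, along with the `blocked` notion (an element that is visible but cannot be clicked), are assumptions for illustration rather than the authors' design.

```python
from dataclasses import dataclass, field


@dataclass
class FakePage:
    """Stand-in for a browser page: selectors present, and which are unclickable."""
    present: set
    blocked: set = field(default_factory=set)
    clicked: list = field(default_factory=list)

    def snapshot(self):
        # Perceive: a real agent would capture the DOM and a screenshot here.
        return set(self.present)

    def click(self, selector):
        if selector not in self.present or selector in self.blocked:
            raise LookupError(f"cannot click: {selector}")
        self.clicked.append(selector)


def choose_selector(observation, candidates, already_tried):
    """Reason: pick the first untried candidate visible in the observation."""
    for selector in candidates:
        if selector in observation and selector not in already_tried:
            return selector
    return None


def run_step(page, candidates, max_attempts=3):
    """Act with recovery: retry with alternative selectors until one works."""
    tried = []
    for _ in range(max_attempts):
        selector = choose_selector(page.snapshot(), candidates, tried)
        if selector is None:
            break
        tried.append(selector)
        try:
            page.click(selector)
            return {"ok": True, "selector": selector, "attempts": len(tried)}
        except LookupError:
            continue  # recover by trying the next candidate
    return {"ok": False, "selector": None, "attempts": len(tried)}


page = FakePage(present={"#submit", "button[type=submit]"}, blocked={"#submit"})
result = run_step(page, ["#submit", "button[type=submit]", "text=Submit"])
print(result)
```

Here the first selector is visible but fails on click, so the loop falls back to the second candidate and succeeds on the second attempt.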

Security testing maps plain English attack steps to browser-based probes aligned with OWASP Top 10 categories. The defense-in-depth stack includes input validation, sanitization, isolated container execution, access control, and audit logging.
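
As an illustration of what a browser-layer probe could look like once an attack description has been translated, here is a minimal sketch of a horizontal privilege escalation check. It assumes the natural-language-to-probe step has already happened, and it runs against two toy in-memory "apps" rather than a real browser. The `Probe` fields, the toy apps, and `run_probe` are hypothetical; only the OWASP Top 10 (2021) category names are taken from the public standard.

```python
from dataclasses import dataclass

# OWASP Top 10 (2021) category labels used to tag findings.
OWASP = {
    "access_control": "A01:2021 Broken Access Control",
    "injection": "A03:2021 Injection",
    "authentication": "A07:2021 Identification and Authentication Failures",
}


@dataclass
class Probe:
    description: str   # tester's plain-English attack description
    category: str      # key into OWASP
    actor: str         # user the probe is logged in as
    target_path: str   # resource the actor should NOT be able to read


def vulnerable_app(user, path):
    """Toy app with no ownership check: any logged-in user can read any order."""
    return (200, "order details") if user else (401, "login required")


def fixed_app(user, path):
    """Toy app that serves /orders/<owner>/<id> only to that owner."""
    if not user:
        return (401, "login required")
    owner = path.split("/")[2]
    return (200, "order details") if owner == user else (403, "forbidden")


def run_probe(app, probe):
    """Flag a finding when the forbidden resource is served (visible 200)."""
    status, _body = app(probe.actor, probe.target_path)
    return {"finding": status == 200, "category": OWASP[probe.category], "status": status}


probe = Probe(
    description="Log in as alice and open bob's order page",
    category="access_control",
    actor="alice",
    target_path="/orders/bob/42",
)
print(run_probe(vulnerable_app, probe))
print(run_probe(fixed_app, probe))
```

The oracle relies only on a signal the client can observe (the response status), which mirrors the stated limitation that flaws without a browser-visible signal go undetected.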

Training regime: Not applicable in the traditional model-training sense. Development was iterative, with extensive trial-and-error tuning of the selector, wait, and validation-threshold heuristics.
Evaluation protocol: Metrics include script success rate (at least 80% of steps passed), reduction per failure type, test creation time, security vulnerability detection rate, and false positive rate. The baselines are manually authored Selenium scripts and pure serverless job execution. Ablations measure the isolated impact of each strategy.
Reproducibility: Evaluation artifacts, sample applications, configuration, and scripts are included as supplementary materials. The Docker-based deployment supports multiple LLM providers. The exact code release status is not specified.
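
The success-labeling rule and detection metrics from the evaluation protocol can be written down directly. The sketch below uses made-up inputs purely to exercise the functions; the function names are illustrative. This summary does not give the paper's exact false positive rate definition, so `unmatched_report_share` (the share of reported findings that match no seeded vulnerability) is an assumption, not the authors' formula.

```python
def script_succeeded(step_results, threshold=0.80):
    """A script is a success when at least `threshold` of its steps pass."""
    if not step_results:
        return False
    return sum(step_results) / len(step_results) >= threshold


def suite_success_rate(scripts):
    """Share of scripts labeled successful under the per-script rule."""
    return sum(script_succeeded(s) for s in scripts) / len(scripts)


def detection_rate(seeded, reported):
    """Share of seeded vulnerabilities that appear among reported findings."""
    seeded = set(seeded)
    return len(seeded & set(reported)) / len(seeded)


def unmatched_report_share(seeded, reported):
    """Assumed false-positive measure: reported findings with no seeded match."""
    reported = set(reported)
    return len(reported - set(seeded)) / len(reported)


# Illustrative inputs only; these are not data from the paper.
scripts = [
    [True, True, True, True, True],      # 100% of steps pass -> success
    [True, True, True, True, False],     # 80% -> success (meets the threshold)
    [True, False, False, True, True],    # 60% -> failure
]
seeded = {"v1", "v2", "v3", "v4"}
reported = {"v1", "v2", "v3", "x9"}

print(f"suite success rate: {suite_success_rate(scripts):.2f}")
print(f"detection rate: {detection_rate(seeded, reported):.2f}")
print(f"unmatched share: {unmatched_report_share(seeded, reported):.2f}")
```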

A concrete example: A generated script initially clicks a link through an ambiguous selector. Strategy 1 rewrites the click as direct URL navigation; Strategy 2 enriches the remaining selectors with context to fix element-not-found errors; Strategy 3 filters out invalid scripts through static validation; Strategy 4 inserts waits after navigations to avoid race conditions; and Strategy 5 logs failed runs for future learning. Together, this pipeline raises the success rate from 55% to 93%.
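
The selector-enrichment and validation steps in this example might look roughly like the following. This is a sketch under stated assumptions: the scoping rule (prefixing a parent matched by `aria-label` or `id`), the list of checks, the action names, and the pass threshold are all invented for illustration, since the paper's scoring rubric is not given in this summary.

```python
def enrich_selector(selector, aria_label=None, container_id=None):
    """Scope an ambiguous selector to a parent identified by ARIA label or id."""
    if aria_label:
        return f'[aria-label="{aria_label}"] {selector}'
    if container_id:
        return f"#{container_id} {selector}"
    return selector


KNOWN_ACTIONS = {"goto", "click", "fill", "wait_for_idle", "assert_text"}


def validation_score(steps):
    """Static checks on a generated script; returns the fraction that pass."""
    checks = [
        bool(steps),                                        # script is non-empty
        bool(steps) and steps[0]["action"] == "goto",       # starts with navigation
        all(s["action"] in KNOWN_ACTIONS for s in steps),   # only known actions
        all(s.get("selector") for s in steps
            if s["action"] in {"click", "fill"}),           # selectors are non-empty
        all(s.get("url", "").startswith("http") for s in steps
            if s["action"] == "goto"),                      # URLs are absolute
    ]
    return sum(checks) / len(checks)


def passes_gate(steps, min_score=1.0):
    """Only scripts at or above the (assumed) threshold reach the browser."""
    return validation_score(steps) >= min_score


scoped = enrich_selector("button.save", aria_label="Billing address")
good = [
    {"action": "goto", "url": "https://app.example.test/settings"},
    {"action": "click", "selector": scoped},
]
bad = [
    {"action": "click", "selector": ""},
    {"action": "hover", "selector": "a"},
]
print(scoped)
print(validation_score(good), validation_score(bad))
```

Because the gate is purely static, an invalid script is rejected without spending any browser time on it.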

Technical innovations

A five-strategy pipeline for LLM-generated web test scripts that reduces failure modes proactively rather than through reactive self-healing (in contrast to prior self-healing locator systems).
Containerized worker architecture decoupling fast, stateless job orchestration from long-running browser-based test execution, overcoming serverless timeout limits.
Integration of vision-enabled large language models combining DOM and screenshot analysis in a perceive-reason-act autonomous agent loop for intelligent test execution and error recovery.
Natural-language-driven security testing translating attack scenario descriptions into OWASP Top 10 aligned browser probes executable in sandboxed containers without requiring security expertise.
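
The decoupling behind the containerized worker architecture can be sketched with an in-process queue: the orchestrator records and enqueues a job and returns immediately, while a separate worker runs the long browser session outside any request timeout. Everything here (the in-memory `jobs` dict standing in for the database, `queue.Queue` standing in for a durable queue, a thread standing in for a container, and the sleep standing in for a browser run) is an assumption for illustration, not the authors' infrastructure.

```python
import queue
import threading
import time
import uuid

jobs = {}                   # job_id -> status record (stand-in for the database)
job_queue = queue.Queue()   # stand-in for a durable message queue


def submit_job(script):
    """Orchestrator: record the job, enqueue it, and return immediately."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "queued", "result": None}
    job_queue.put((job_id, script))
    return job_id


def run_browser_test(script):
    """Placeholder for a long-running headless-browser session."""
    time.sleep(0.05 * len(script))
    return {"steps_run": len(script)}


def worker():
    """Worker: pull jobs and run them to completion, however long they take."""
    while True:
        job_id, script = job_queue.get()
        jobs[job_id]["status"] = "running"
        try:
            jobs[job_id]["result"] = run_browser_test(script)
            jobs[job_id]["status"] = "done"
        except Exception as exc:
            jobs[job_id]["status"] = "failed"
            jobs[job_id]["result"] = {"error": str(exc)}
        finally:
            job_queue.task_done()


threading.Thread(target=worker, daemon=True).start()
job_id = submit_job(["goto", "click", "assert_text"])
job_queue.join()            # a real client would poll the job status instead
print(jobs[job_id])
```

The submit path does no browser work at all, which is what lets it stay fast and stateless while execution time is bounded only by the worker.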

Datasets

4 production web applications — 176 test scenarios — internal/anonymous

Baselines vs proposed

Manual Selenium scripting: test creation time = 23.5 min/test vs proposed: 6.1 min/test
Pure serverless execution architecture: job completion = 12% vs containerized workers: 99%
Baseline LLM-generated scripts (no strategies): success rate = 55% vs enhanced pipeline: 93%

Limitations

LLM usage introduces non-trivial cost ($0.03–$0.15 per test) and latency (2–5 seconds generation delay) which scale linearly with test volume.
Lower success rates (78–85%) on complex multi-step workflows exceeding 8 steps, modal-heavy flows, and dynamic async content (WebSocket-driven).
Limited coverage of Shadow DOM and iframe-embedded content, which accounts for 23% of the remaining failures.
Security testing scope restricted to vulnerabilities observable at browser level; server-side flaws with no client-visible signal remain undetectable.
Evaluation conducted on four modern SPA/hybrid apps; uncertain generalization to legacy or highly customized applications.
Adversarial prompt injection through malicious page content is a known risk to the LLM's reasoning, and no principled mitigation is implemented yet.

Open questions / follow-ons

How to integrate principled defenses against adversarial prompt and content injection targeting LLM reasoning in browser contexts?
Can visual intelligence enhancements (OCR, layout analysis) significantly improve robustness in complex UIs and failure recovery?
To what extent can fine-tuning or domain adaptation of LLMs on application-specific data boost test generation success beyond heuristic strategies?
How effective are policy-compliant multi-agent orchestration frameworks at enforcing security and compliance constraints pre-execution in real enterprise workflows?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this work shows how multimodal LLMs can drive resilient automated web interaction, a capability that overlaps with modern bot behavior simulation. The containerized architecture and failure-mode analysis offer a blueprint for scalable, reliable agent execution, which can in turn inform detection strategies that target autonomous-agent fingerprints. The natural-language-driven security testing lowers the skill barrier for vulnerability probing and could be applied to CAPTCHA bypass risk assessments through browser-layer probes. The failure learning and recovery mechanisms also show where robust scripting of interactive elements remains hard, which is relevant to designing CAPTCHAs and bot traps that exploit edge cases and timing conditions autonomous agents handle poorly. Finally, the reported limitations in asynchronous state prediction and complex UI handling point to gaps that defenders might exploit.

Cite

bibtex

@article{arxiv2605_15281,
  title={Autonomous Intelligent Agents for Natural-Language-Driven Web Execution with Integrated Security Assurance},
  author={Vinil Pasupuleti and Siva Rama Krishna Varma Bayyavarapu and Shrey Tyagi},
  journal={arXiv preprint arXiv:2605.15281},
  year={2026},
  url={https://arxiv.org/abs/2605.15281}
}
