Automated Testing of Broken Authentication Vulnerabilities in Web APIs with AuthREST

Source: arXiv:2509.10320 · Published 2025-09-12 · By Davide Corradini, Mariano Ceccato, Mohammad Ghafari

TL;DR

AuthREST is an open-source black-box security testing tool for a very specific but common class of API failures: broken authentication. The paper targets three practical attack surfaces that are often tested manually or with ad hoc scripts—credential stuffing, password brute forcing, and acceptance of invalid/forged tokens—and automates them using only an API’s OpenAPI specification plus HTTP interaction history. The key design point is that AuthREST does not depend on implementation language or framework; it infers login operations and relevant parameters from specification patterns, then generates HTTP sequences and checks them with oracles.

The main result is empirical rather than algorithmic novelty: on six public APIs, AuthREST flagged four previously unknown authentication vulnerabilities across public endpoints, and the authors manually confirmed those findings. For the benchmark APIs, the credential-stuffing and password-brute-force strategies reported vulnerabilities with 100% precision (no false positives) in the evaluated set, while the unchecked-token strategy was validated on one API (Tradematic) and produced no false positives there. The paper’s strongest contribution is practical coverage: it shows that automated, specification-driven testing can uncover real broken-authentication bugs that were not previously known to the owners.

Key findings

  • The benchmark contained 6 public APIs sourced from APIs.guru; 4 of them were reported vulnerable to both credential stuffing and password brute forcing, and all 4 were manually confirmed.
  • Across the 10 repeated runs, credential-stuffing detection achieved true positives on the vulnerable APIs with 0 false positives, which the paper summarizes as 100% precision.
  • Password-brute-force detection also achieved 0 false positives and found the same four vulnerable APIs in the benchmark.
  • Unchecked token authenticity was evaluated on only 1 API (Tradematic) and AuthREST reported no vulnerabilities there, with 0 false positives.
  • The authors initially sampled 50 APIs from APIs.guru, but excluded unreachable/non-working APIs, APIs that did not require authentication, and APIs without an explicit login operation in the spec, leaving 6 suitable targets.
  • AuthREST uncovered 4 previously unknown authentication vulnerabilities in publicly accessible APIs and the authors reported them to the respective owners.
  • For credential stuffing, the tool sends 100 login requests in 10 seconds; the paper cites OWASP best practice that mitigations should trigger after 3-10 rapid requests from the same source.

Threat model

The adversary is a remote black-box attacker with access to the API’s HTTP interface and, for the token tests, a legitimate token captured during nominal use of the API. They can send repeated requests, vary credential values, and mutate tokens, but they cannot access source code, server internals, or private implementation data. The tool is intended to detect missing lockout/rate limiting, weak password-guessing defenses, and acceptance of malformed or invalid authentication tokens; it does not model deeper post-auth compromise or account takeover beyond observing whether the API resists these attacks.

Methodology — deep read

AuthREST is explicitly framed as a black-box security tester. The adversary model is the usual internet attacker trying to abuse API authentication: the tool itself does not assume source code access, only the HTTP interface and an OpenAPI specification. The paper’s testing logic assumes that vulnerable APIs will either fail to rate-limit repeated login attempts, fail to vary lockout behavior, or accept malformed/incorrect tokens on protected endpoints. For the token test, the authors also assume they can obtain one valid token from a legitimate user session during nominal exploration, then mutate it to probe server-side validation. What the paper does not assume is any privileged instrumentation, internal telemetry, or knowledge of implementation details.

Data provenance is relatively modest but well-defined. The evaluation benchmark comprises 6 publicly accessible real-world APIs drawn from APIs.guru, which hosts public API specifications. The authors first randomly sampled 50 APIs from the catalog to manually assess suitability, then excluded APIs that were unreachable, non-working, did not require authentication, or did not explicitly list a login operation in the specification. The six surviving APIs are named in the paper as Here, ID4I, 6 Dot, BRAINBI, BeezUP, and Tradematic. The benchmark was used both to validate the login-related heuristics and to test the three security strategies. The authors repeated each experiment 10 times because parts of the pipeline are non-deterministic due to random input generation. They report averages over those runs but provide no confidence intervals or statistical significance tests.

Architecturally, AuthREST sits on top of RestTestGen. The first stage is specification processing: it parses the OpenAPI spec to identify operations, parameters, and useful values for test generation. For credential stuffing and password brute forcing, AuthREST uses heuristics to discover login operations and login-related parameters. Login operations are detected when the HTTP method is POST or GET and the path contains a normalized form of “login” or “signin” (with punctuation removed). For parameters, passwords are prioritized by names containing “password,” then “pass”; user identifiers are prioritized by a fixed substring order: “username,” “email,” “login,” “user,” “phone,” “mail,” then “id.” Once a login endpoint is found, AuthREST generates request sequences by filling the user and password fields with random values and populating any extra parameters using RestTestGen’s value sources: OpenAPI examples, enum/default values, values seen in previous request/response histories, and random generation. The paper notes that this is not “real” credential stuffing in the sense of using stolen credentials; instead, it is a black-box simulation that is behaviorally similar from the API’s perspective.
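The spec-processing heuristics above can be sketched in a few lines. This is a minimal illustration of the rules the paper states (POST/GET plus a normalized "login"/"signin" path, and substring priority orders for user and password parameters); the function names are my own, not AuthREST's API.

```python
import re

# Keyword and priority lists as described in the paper.
LOGIN_KEYWORDS = ("login", "signin")
USER_PRIORITY = ("username", "email", "login", "user", "phone", "mail", "id")
PASSWORD_PRIORITY = ("password", "pass")

def normalize(path: str) -> str:
    """Strip punctuation so '/auth/sign-in' matches the keyword 'signin'."""
    return re.sub(r"[^a-z0-9]", "", path.lower())

def is_login_operation(method: str, path: str) -> bool:
    """Detect a login operation: POST or GET whose path contains a keyword."""
    return method.upper() in ("POST", "GET") and any(
        kw in normalize(path) for kw in LOGIN_KEYWORDS
    )

def pick_param(names, priority):
    """Return the parameter matching the earliest priority substring, if any."""
    for needle in priority:
        for name in names:
            if needle in name.lower():
                return name
    return None

params = ["Email", "passwd", "rememberMe"]
assert is_login_operation("POST", "/v1/auth/sign-in")
assert pick_param(params, USER_PRIORITY) == "Email"
assert pick_param(params, PASSWORD_PRIORITY) == "passwd"
```

Note how the priority ordering matters: "passwd" is selected via the lower-priority substring "pass" only because no parameter name contains "password".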

The credential-stuffing strategy is concrete: AuthREST sends 100 login requests within 10 seconds to the identified login operation. The oracle then inspects the HTTP trace for two signs of missing lockout: absence of rate limiting, operationalized as no 429 responses, and lack of meaningful error-message change across repeated attempts. Because raw string comparison is fragile, the authors tokenize error messages using Porter stemming and treat two messages as related if they share at least 70% of tokens. That threshold is author-chosen rather than empirically justified, and the paper acknowledges that semantically equivalent messages can differ because of timestamps, IDs, or other metadata. Password brute forcing reuses the same machinery but keeps the user identifier constant while randomizing the password on each attempt.
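The two-part oracle can be sketched as follows. The paper stems tokens with the Porter algorithm before comparing; for a dependency-free sketch, plain lowercased tokens stand in for stems, and Jaccard overlap stands in for the paper's (unspecified in this summary) overlap measure. The 0.7 threshold is the authors' chosen value.

```python
import re

def tokens(message: str) -> set[str]:
    """Lowercased word tokens (the paper additionally applies Porter stemming)."""
    return set(re.findall(r"[a-z0-9]+", message.lower()))

def related(msg_a: str, msg_b: str, threshold: float = 0.7) -> bool:
    """Treat two error messages as 'related' if they share >= 70% of tokens."""
    a, b = tokens(msg_a), tokens(msg_b)
    if not a or not b:
        return a == b
    return len(a & b) / len(a | b) >= threshold

def missing_lockout(responses) -> bool:
    """responses: list of (status_code, error_message) from a login burst.
    Flags likely missing lockout: no 429 and no meaningful message change."""
    if any(code == 429 for code, _ in responses):
        return False
    first = responses[0][1]
    return all(related(first, msg) for _, msg in responses[1:])

burst = [(401, "Invalid username or password")] * 3
assert missing_lockout(burst)
assert not missing_lockout(burst + [(429, "Too many requests")])
```

Timestamps or request IDs embedded in error payloads are exactly what this token-overlap comparison is meant to absorb, and why raw string equality would be too strict.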

The unchecked-token-authenticity strategy is more involved. First, AuthREST performs nominal exploration using a previously published strategy from the same research line: operations are executed according to producer-consumer data dependencies, again using values from the OpenAPI spec, past HTTP interactions, and random generation. During this nominal run, the tool collects a valid token and records the HTTP interaction history. It then replays the interaction sequence multiple times while mutating the token in three ways: alter one character, remove one character, or add one character, with care taken not to break JWT structural separators such as dots. Because operations that do not require authentication may accept requests whether the token is valid or not, the authors run a third replay without any token at all. The oracle computes W, the set of operations that succeed with mutated tokens, and N, the set of operations that succeed with no token; the final vulnerability set is V = W − N, intended to filter out unauthenticated endpoints that would otherwise look like false positives. In the evaluation, only Tradematic was suitable for this strategy because the other APIs were commercial or lacked usable credentials.
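The mutation step and the V = W − N oracle can be sketched as below. Function names are illustrative; the three mutations (alter, remove, add one character) and the preservation of JWT dot separators follow the paper's description.

```python
import random
import string

def mutate_token(token: str, rng: random.Random) -> str:
    """Apply one character-level mutation, never touching '.' separators."""
    positions = [i for i, ch in enumerate(token) if ch != "."]
    i = rng.choice(positions)
    op = rng.choice(["alter", "remove", "add"])
    if op == "alter":
        repl = rng.choice([c for c in string.ascii_letters + string.digits
                           if c != token[i]])
        return token[:i] + repl + token[i + 1:]
    if op == "remove":
        return token[:i] + token[i + 1:]
    return token[:i] + rng.choice(string.ascii_letters) + token[i:]

def vulnerable_operations(succeeds_with_mutated, succeeds_with_no_token):
    """V = W - N: operations accepting a forged token, minus operations that
    succeed with no token at all (i.e., unauthenticated endpoints)."""
    return set(succeeds_with_mutated) - set(succeeds_with_no_token)

rng = random.Random(0)
jwt = "aaa.bbb.ccc"
mutated = mutate_token(jwt, rng)
assert mutated != jwt and mutated.count(".") == 2  # separators preserved
W, N = {"GET /orders", "GET /status"}, {"GET /status"}
assert vulnerable_operations(W, N) == {"GET /orders"}
```

The subtraction of N is the key design choice: without it, every open endpoint would be reported as "accepts a forged token" and flood the results with false positives.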

Evaluation is manual and verification-oriented rather than purely automated benchmark scoring. The authors inspect the HTTP histories flagged by AuthREST and also reproduce the issues with Postman to confirm actual exploitability. For the credential-stuffing and brute-force strategies, they report true positives and false positives averaged over 10 runs, and state that all discovered issues were confirmed by manual inspection. They explicitly note that they could not measure false negatives because there was no ground truth. For the token-authenticity test, the single evaluated API (Tradematic) produced no vulnerabilities, which the authors interpret as evidence of low false-positive tendency but also caution that one case study is too small for generalization. The paper does not report baselines in the traditional comparative sense, because the authors claim no fully automated prior tool exists for broken-authentication testing in the literature.

Technical innovations

  • Heuristic, specification-driven discovery of login operations and login parameters from OpenAPI specs without human labeling.
  • A black-box credential-stuffing oracle that combines high-rate request bursts with rate-limit and error-message stability checks.
  • A token-authenticity testing workflow that distinguishes genuinely protected endpoints from endpoints that simply ignore tokens by replaying the same sequence with and without a token.
  • Use of nominal API exploration to harvest valid tokens and request sequences before mutating token values for invalid-token testing.

Datasets

  • Public APIs.guru benchmark — 6 APIs after filtering from an initial random sample of 50 APIs — public APIs.guru/OpenAPI specifications
  • Heuristic-building corpus — 100 APIs from APIs.guru reviewed for login-operation and login-parameter patterns — public APIs.guru/OpenAPI specifications

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2509.10320.

Fig 1: The architecture of AuthREST.

Fig 2 (page 1).

Fig 3 (page 1).

Limitations

  • No baseline comparison was possible because the authors state there was no prior fully automated broken-authentication testing tool to compare against.
  • The benchmark is small: only 6 APIs survived the filtering criteria, and only 1 API was suitable for unchecked-token-authenticity evaluation.
  • False negatives were not measurable because there was no ground truth for all vulnerabilities across the benchmark.
  • The login-operation and parameter heuristics are pattern-based and may miss unconventional endpoint names or parameter naming schemes.
  • The 70% token-overlap threshold for error-message similarity is heuristic and may fail on some APIs with unusual localization, templating, or structured error payloads.
  • The evaluated findings were manually confirmed, but the paper does not provide a large-scale public ground-truth corpus for independent replication of precision/recall.

Open questions / follow-ons

  • How well do the login heuristics generalize to APIs with non-English naming, OAuth/OIDC flows, or non-standard authentication endpoints?
  • Can the token-authenticity approach be extended to bearer tokens other than JWTs, and to APIs that rotate or introspect tokens server-side?
  • What is the false-negative rate on a larger benchmark with known vulnerable and non-vulnerable APIs, especially for lockout policies that are adaptive or challenge-based rather than 429-based?
  • Can the error-message oracle be replaced with a more robust semantic classifier that handles structured JSON errors, localization, and templated fields?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this paper is useful because it formalizes the kinds of automated abuse that usually trigger defensive challenges: repeated login attempts, password spraying, and invalid token reuse. An engineer could adapt AuthREST-style test generation to validate whether rate limits, IP/device reputation, CAPTCHA prompts, or secondary challenges actually appear after the expected number of attempts, rather than assuming the control exists because documentation says so.
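A defense-side adaptation of this idea can be sketched without any AuthREST internals: send a burst of failing logins and record the attempt at which a defensive signal first appears, then check it falls within the OWASP-suggested 3-10 window cited in the paper. The signal detection (429, a "captcha" or "locked" marker in the body) and the simulated API are hypothetical.

```python
def attempts_before_defense(send_login, max_attempts=10):
    """Call send_login(i) for i = 1..max_attempts; return the attempt number
    at which a defensive signal first appears, or None if none did."""
    for i in range(1, max_attempts + 1):
        status, body = send_login(i)
        if status == 429 or "captcha" in body.lower() or "locked" in body.lower():
            return i
    return None

# Simulated API that starts rate limiting after 5 rapid failures.
def fake_login(i):
    return (429, "Too Many Requests") if i > 5 else (401, "Invalid credentials")

first_block = attempts_before_defense(fake_login)
assert first_block == 6
# OWASP guidance cited in the paper: mitigation within 3-10 rapid attempts.
assert 3 <= first_block <= 10
```

Run against a staging deployment, a `None` result is the actionable finding: the documented lockout or challenge never fired.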

It also highlights a practical caveat: defenses that only change the error message or return code are easy to miss or game, while defenses that rely on account lockout, challenge escalation, or authentication-state transitions must be tested as sequences, not single requests. If you are building bot defense, the paper is a reminder to test for “accepts bad token but still serves the protected operation” and for “repeated failures still look like ordinary login errors,” because both are exploitable failure modes that automated black-box tooling can catch.

Cite

bibtex
@article{arxiv2509_10320,
  title={Automated Testing of Broken Authentication Vulnerabilities in Web APIs with AuthREST},
  author={Davide Corradini and Mariano Ceccato and Mohammad Ghafari},
  journal={arXiv preprint arXiv:2509.10320},
  year={2025},
  url={https://arxiv.org/abs/2509.10320}
}

Read the full paper

Articles are CC BY 4.0 — feel free to quote with attribution