Integrating Log-Based Security Analytics in Agile Workflows: A Real-World Experience Report

Source: arXiv:2605.00352 · Published 2026-05-01 · By Arpit Thool, Chris Brown

TL;DR

This experience report documents the 'Red Flag Project,' an effort by a cross-functional IT team at an anonymized organization to integrate log-based fraud detection into an existing Kanban/Agile workflow following a real-world account takeover incident. The system combined Splunk (SIEM/log analytics), a custom Flask middleware, Grouper (IAM group management), and a Python/Mutt cron-based email alerter to detect and notify stakeholders of accounts exhibiting three correlated suspicious behaviors: repeated MFA failures, malicious inbox-rule creation targeting direct-deposit emails, and unauthorized payroll direct-deposit changes. The project is documented through the action research cycle, with the first author serving as an embedded engineer, making this a first-person practitioner account rather than an external audit.

After deploying version 1.0 to production, the researchers conducted semi-structured IRB-approved interviews with all seven non-author team members (ranging from 7 to 45 years of industry experience) to surface perceptions around adoption willingness, friction points, and day-to-day impact. All interviewees reported willingness to adopt the system, and six of the seven expressed comfort continuing to use it, though enthusiasm was frequently tempered by concerns about long-term ownership, technical fragility of the log-format-dependent Splunk queries, and reliance on a single dedicated engineer. Day-to-day workflow disruption was reported as minimal, with the main overhead being an additional weekly synchronization meeting.

The paper's primary value is qualitative and process-oriented rather than quantitative or algorithmic. It surfaces practitioner-level concerns that rarely appear in security-engineering literature: the 'software rust' problem of unmaintained detection rules, the challenge of translating alerts into actionable stakeholder response, the organizational risk of single-person knowledge concentration, and the tension between alert precision (keeping false positives low) and recall (not missing real fraud) in a low-volume, high-stakes fraud context. The authors distill concrete recommendations around governance, alert design, ownership rituals, and Agile backlog patterns for teams attempting similar integrations.

Key findings

  • All 7 interviewed team members (100%) expressed willingness to adopt the Splunk-based fraud detection system; 6/7 expressed comfort continuing to use it post-deployment.
  • 5/7 participants reported low or minimal impact on day-to-day development activities; the primary overhead was attendance at one additional short weekly meeting.
  • 5/7 participants cited Project & Resource Trade-offs as the top challenge, specifically that team velocity was preserved only because a dedicated engineer was assigned — removal of that resource was flagged as a capability risk.
  • 3/7 participants reported conditional or limited perceived security improvement, contingent on producing actionable detections; 3/7 reported clear security benefits including reduced time-to-detection.
  • 4/7 participants identified multi-signal correlation (failed MFA + Outlook rule creation + payroll change) as the central value proposition, providing higher-confidence flagging than any single signal.
  • 5/7 participants reported heightened individual or organizational security awareness as a result of the project; 2/7 reported unchanged awareness.
  • 2/7 participants cited false positive rate as a primary concern, though informal testing during rollout did not surface a high false positive rate — no quantitative FPR figure is reported in the paper.
  • The deployed architecture relied on a shared Red Hat server subject to developer reboots and lacking strict access controls, which 1/7 participants explicitly flagged as unsuitable for a production-critical service.

Threat model

The adversary is an external actor conducting account takeover (ATO) via stolen or phished credentials against an organizational IT environment. The attacker is assumed to have valid user credentials and to follow a stereotyped post-compromise playbook: modifying Outlook inbox rules to intercept direct-deposit notification emails and then changing payroll direct-deposit destination accounts. The system assumes the attacker does not know the specific detection heuristics (>2 MFA failures in 24h, Outlook rule modifications, payroll change attempts) and therefore does not attempt to evade them by staying below thresholds or spacing events across the 24-hour window. The adversary is not modeled as an insider threat. The system has no capability to detect the initial credential compromise vector (e.g., phishing, credential stuffing at login) — it only detects post-authentication behavioral signals in production logs. The threat model is grounded in a real incident the organization experienced prior to the project.
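The ">2 MFA failures in 24h" heuristic and the evasion assumption above can be made concrete with a small sketch. This is not the paper's Splunk query, which is not released. The function name, the inclusive window boundary, and the sample timestamps are all assumptions for illustration.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=24)
THRESHOLD = 2  # flag only when failures in the window exceed this (">2")


def exceeds_mfa_threshold(failure_times, now, window=WINDOW, threshold=THRESHOLD):
    """Return True if more than `threshold` failures fall in the trailing window."""
    # Assumption: a failure exactly `window` old still counts as inside the window.
    recent = [t for t in failure_times if timedelta(0) <= now - t <= window]
    return len(recent) > threshold


now = datetime(2026, 5, 1, 12, 0)
# Three failures within three hours: over the threshold.
burst = [now - timedelta(hours=h) for h in (1, 2, 3)]
# Three failures spaced so that only two fall inside any 24-hour window.
careful = [now - timedelta(hours=h) for h in (1, 13, 25)]

print(exceeds_mfa_threshold(burst, now))
print(exceeds_mfa_threshold(careful, now))
```

The second case shows why the threat model has to assume an attacker who does not know the rule: spacing the same number of failures slightly wider keeps the account unflagged.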

Methodology — deep read

The study employs an action research design structured around five iterative phases: problem diagnosis, action planning, action taking, evaluation, and specifying learning. The first author was an embedded software engineer on the team throughout the project lifecycle, making this a participant-observer study rather than an external evaluation. This insider positionality gives rich access to implementation details but also introduces potential observer bias, which the authors acknowledge only implicitly through the IRB approval and dual-coding process.

Threat model and system scope: The adversary is an external attacker performing account takeover (ATO) via credential stuffing or phishing, who then conducts financial fraud by modifying payroll direct-deposit information and intercepting related email notifications via Outlook inbox rules. The threat model is empirically grounded — the organization experienced a real incident with this exact attack chain before the project began. The system does not model insider threats and does not address the initial credential compromise vector; it is purely a post-authentication behavioral detection layer.

System architecture: Three Splunk saved-search alerts were configured to fire webhooks upon detecting the target behaviors (>2 MFA failures in 24h; creation/modification of Outlook rules touching direct-deposit keywords; payroll direct-deposit change attempts). Webhooks were received by a Flask web application hosted on an internal Red Hat server. The Flask app parsed alert payloads and called Grouper's API to insert the flagged account into one of three time-bounded detection groups. Grouper's native auto-expiration feature enforced a 24-hour TTL on MFA-failure flags. A final 'Red Flag' Grouper group was configured to contain only accounts present in all three sub-groups simultaneously, implementing a logical AND aggregation across signals. A separate Python script, scheduled via cron, queried the final Red Flag group, identified new entries since the last execution, and dispatched email alerts via Mutt to stakeholders (a Google Group distribution list). Version 1.0 was deployed to production.
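The cron step described above, "identified new entries since the last execution", amounts to a set difference against state saved by the previous run. The sketch below shows one plausible way to do that with a JSON state file. The authors' script is not released, so the function name, the state-file format, and the account names are assumptions. Fetching members from Grouper and sending mail via Mutt are deliberately left out.

```python
import json
import os
import tempfile
from pathlib import Path


def find_new_members(current_members, state_path):
    """Return members not seen on the previous run, then persist the current set."""
    state_file = Path(state_path)
    if state_file.exists():
        previous = set(json.loads(state_file.read_text()))
    else:
        previous = set()  # first run: everything counts as new
    current = set(current_members)
    new = sorted(current - previous)
    state_file.write_text(json.dumps(sorted(current)))
    return new


# Three simulated cron ticks against a throwaway state file.
with tempfile.TemporaryDirectory() as tmp:
    state = os.path.join(tmp, "red_flag_state.json")  # hypothetical file name
    first = find_new_members(["alice"], state)
    second = find_new_members(["alice", "bob"], state)
    third = find_new_members(["bob"], state)

print(first, second, third)
```

Because the state file stores only the latest membership, an account that leaves the group and later re-enters would be reported again, which seems the desirable behavior for a repeat incident.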

Data and evaluation: This is not an ML paper — no training data, held-out test sets, or quantitative performance benchmarks are reported. The 'data' in the study is the interview corpus: 7 semi-structured interviews, each 30–45 minutes, conducted virtually via Microsoft Teams or Zoom and recorded with consent. Transcription used Microsoft Word's built-in transcription functionality. The interview protocol consisted of 10 open-ended questions organized around the three research questions. The researchers applied open coding independently, then reconciled codes to establish final thematic categories. An example of the coding process was published on Figshare; raw transcripts are not shared due to confidentiality. Inter-rater reliability statistics (e.g., Cohen's kappa) are not reported.

Concrete end-to-end example of the detection pipeline: A user's account triggers a Splunk alert for >2 MFA failures within a 24-hour rolling window. Splunk fires a webhook to the Flask service. Flask parses the account identifier from the payload and calls Grouper's API to add the account to the 'MFA-failures' sub-group with a 24-hour TTL. If, within that window, the same account also triggers the Outlook-rule and payroll-change alerts and is inserted into those sub-groups, Grouper's nested-group logic automatically promotes the account into the final 'Red Flag' group. The cron job runs on its next scheduled tick, detects the new entry, and sends a Mutt-generated email to the stakeholder Google Group distribution list. If MFA failures stop and the 24-hour TTL expires without re-triggering, the account is removed from the MFA sub-group and consequently exits the Red Flag group.

Reproducibility: The interview coding example is on Figshare. Raw interview transcripts are withheld due to confidentiality. No Splunk queries, Flask source code, Grouper configuration, or cron scripts are released. The organization is anonymized. This substantially limits reproducibility of both the technical artifact and the qualitative findings.

Technical innovations

  • Multi-signal AND-aggregation via Grouper's nested group membership: rather than alerting on individual behavioral signals, the system only flags accounts present simultaneously in all three time-bounded sub-groups, reducing noise compared to single-signal SIEM alerting.
  • Use of Grouper's native time-bounded membership (auto-expiration TTL) as a stateful sliding-window mechanism for fraud signal aggregation, avoiding custom database state management in the Flask middleware.
  • A lightweight webhook-to-IAM bridge pattern using Flask as a stateless intermediary between Splunk (alerting) and Grouper (group management), enabling integration between two systems with no native interoperability.
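The bridge pattern in the last bullet can be sketched as a pure function that maps an alert payload to a group insertion. The web framework is omitted so the example runs on the standard library alone. The `search_name` and `result` keys follow the general shape of Splunk webhook payloads, but the `user` field, the saved-search names, the group names, and the `add_member` callback (standing in for a Grouper API call) are all assumptions.

```python
# Hypothetical saved-search name -> hypothetical sub-group name.
ALERT_TO_GROUP = {
    "mfa_failures": "redflag:mfa_failures",
    "outlook_rule_change": "redflag:outlook_rule",
    "payroll_change": "redflag:payroll_change",
}


def handle_alert(payload, add_member):
    """Map one webhook payload to a sub-group insertion; keeps no state itself."""
    search_name = payload.get("search_name")
    account = (payload.get("result") or {}).get("user")  # assumed field name
    group = ALERT_TO_GROUP.get(search_name)
    if group is None or not account:
        return None  # unknown alert or missing account: ignore rather than guess
    add_member(group, account)
    return group, account


calls = []


def record(group, account):
    """Test double that records what a real Grouper call would receive."""
    calls.append((group, account))


handled = handle_alert(
    {"search_name": "mfa_failures", "result": {"user": "jdoe"}}, record
)
ignored = handle_alert({"search_name": "unrelated", "result": {"user": "x"}}, record)

print(handled, ignored, calls)
```

Injecting `add_member` keeps the routing logic testable without a live IAM system, which matters given the single-maintainer concern the interviewees raised.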

Limitations

  • Extremely small interview sample (n=7) from a single anonymized organization in one IT division; findings are not statistically generalizable and may reflect idiosyncratic organizational culture.
  • No quantitative performance metrics are reported: false positive rate, false negative rate, mean time to alert, precision, and recall are all absent. The paper notes FPR appeared low in testing but provides no numbers.
  • The first author's insider positionality (he was the implementing engineer) creates a significant risk of observer and response bias; team members may have given more positive assessments knowing a colleague would analyze the interviews.
  • Inter-rater reliability for qualitative coding is not reported (no Cohen's kappa or equivalent), making the thematic analysis difficult to independently validate.
  • The deployed production system has known infrastructure fragility: shared server, no failover, no load balancing, and loose access controls — the authors acknowledge this but do not report whether these were remediated before or after interviews.
  • The detection heuristics are fixed and hand-crafted; there is no adversarial evaluation of whether an attacker who knows the rules (e.g., keeping to 2 or fewer MFA failures per 24 hours, spacing out payroll changes) can trivially evade the system.
  • No longitudinal follow-up: the paper reports perceptions shortly after v1.0 deployment; whether the system was still operational, maintained, or abandoned months later is unknown.

Open questions / follow-ons

  • How does detection efficacy degrade when an attacker learns the specific signal thresholds and spacing requirements — what is the evasion surface of a purely rule-based, threshold-driven SIEM detection system in this ATO scenario?
  • What is the minimum viable ownership and maintenance structure (FTE hours per week, governance ritual cadence) required to prevent 'software rust' in log-based fraud detection rules as log schemas and attacker TTPs evolve?
  • Can the Grouper-based AND-aggregation pattern be generalized to a weighted scoring model (e.g., risk score accumulation across more than three signals) without requiring full ML infrastructure, and what precision/recall trade-offs emerge?
  • How do alert fatigue dynamics evolve over time as the system matures — do stakeholder response rates to email notifications degrade, and does integration into an incident-response platform like ServiceNow measurably improve response fidelity?
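To make the weighted-scoring question above concrete, here is a minimal sketch of what such a generalization might look like. It is not something the paper implements or evaluates. The signal names, integer weights, and thresholds are invented for illustration.

```python
# Hypothetical integer weights per signal (integers avoid float rounding).
WEIGHTS = {"mfa_failures": 3, "outlook_rule": 4, "payroll_change": 5}


def risk_score(active_signals, weights=WEIGHTS):
    """Sum the weights of the signals currently active for one account."""
    return sum(weights.get(s, 0) for s in set(active_signals))


def should_escalate(active_signals, threshold, weights=WEIGHTS):
    return risk_score(active_signals, weights) >= threshold


STRICT_AND = sum(WEIGHTS.values())  # 12: only reachable with all three signals
RELAXED = 9                         # reachable with the two strongest signals

two_signals = ["outlook_rule", "payroll_change"]
all_signals = ["mfa_failures", "outlook_rule", "payroll_change"]

print(should_escalate(two_signals, STRICT_AND))
print(should_escalate(two_signals, RELAXED))
print(should_escalate(all_signals, STRICT_AND))
```

Setting the threshold to the sum of all weights reproduces the strict AND rule. Lowering it trades precision for recall, which is the trade-off the open question asks about.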

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this paper is most relevant as a process and organizational case study rather than a technical one. The core detection architecture — correlating multiple weak behavioral signals (failed authentication, anomalous account modification attempts) via a time-windowed AND-aggregation — is a pattern directly applicable to bot and ATO detection pipelines. The specific signals used (MFA failures, inbox-rule creation, payroll changes) are domain-specific to IT account fraud, but the structural pattern of 'require corroboration across N independent signals within a TTL window before escalating' is a standard risk-scoring primitive. The Grouper-as-stateful-aggregator pattern is an unconventional but pragmatic choice that bot-defense engineers might replicate in constrained infrastructure environments where a full feature-store or ML serving layer is unavailable.

The organizational findings are the more transferable contribution for practitioners building or sustaining bot-defense operations within product teams. The paper surfaces recurring problems familiar to anyone who has maintained fraud detection rules: signal degradation as attacker TTPs shift, alert fatigue from high false-positive rates, knowledge concentration in a single SME, and the 'quiet period abandonment' risk when no attacks are actively firing alerts. The recommendation to integrate alerts into an existing incident-response workflow (e.g., ServiceNow ticket creation) rather than relying on ad-hoc email distribution is directly applicable to CAPTCHA and bot-defense teams that struggle with stakeholder responsiveness. The paper's honest account of the gap between 'system deployed' and 'system operationally effective', captured in participant M1's statement that improved security has 'not happened yet' because effectiveness depends on actionable detection, is a useful calibration point for teams measuring bot-defense program maturity.

Cite

bibtex
@article{arxiv2605_00352,
  title={Integrating Log-Based Security Analytics in Agile Workflows: A Real-World Experience Report},
  author={Arpit Thool and Chris Brown},
  journal={arXiv preprint arXiv:2605.00352},
  year={2026},
  url={https://arxiv.org/abs/2605.00352}
}

Articles are CC BY 4.0 — feel free to quote with attribution