Context-Aware Web Attack Detection in Open-Source SIEM Systems via MITRE ATT&CK-Enriched Behavioral Profiling

Source: arXiv:2605.13337 · Published 2026-05-13 · By Badr Alboushy, Assef Jafar, Mohamad Aljnidi, Mohamad Bashar Disoki, Aref Shaheed

TL;DR

Evaluated on a purpose-built dataset of 46,454 Wazuh security events, Smart-SIEM adds per-source-IP context features that raise macro F1 from approximately 0.705 to as high as 0.967 for binary detection and from 0.552–0.590 to as high as 0.914 for six-class attack categorization. The system also outperforms Wazuh's native rule engine: it detects 100% of Brute Force and 98.3% of Broken Authentication events, none of which the rules detect. The authors also define an adaptive retraining mechanism that addresses concept drift, recovering F1 from 0.465 to 0.814 once the new attack labels are included in training. The architecture integrates with Wazuh via Kafka and Elasticsearch, so that context-aware classification can scale in deployment. Overall, the work demonstrates the value of context-enriched behavioural feature engineering and a hybrid gradient boosting cascade for web-attack detection in open-source SIEMs.

Key findings

Contextual behavioural features improve macro F1 from ~0.705 to 0.947–0.967 for binary detection (Stage 1) and from 0.552–0.590 to 0.876–0.914 for six-class attack classification (Stage 2), average gains of +0.254 and +0.324 respectively.
The hybrid cascade with LightGBM for binary detection and XGBoost for six-class classification achieves best results: 0.967 F1 (binary), 0.914 F1 (multi-class).
Wazuh's native rule engine detects 0% of Brute Force and Broken Authentication events; the AI module detects 100% and 98.3% respectively.
A context window size of N=30 prior events per source IP is identified as optimal via ablation study.
SMOTE-NC oversampling balances training classes, especially boosting minority classes from as low as 0.6% (Broken Auth) to equalized 10% proportions.
The self-adaptive retraining mechanism recovers from concept drift: F1 drops from 0.905 to 0.465 when unseen attacks appear, improves to 0.695 with Phase 2 retraining only, and finally reaches 0.814 after retraining both stages.
All eight algorithms tested (gradient boosting variants and baselines) benefit similarly from the added context features, indicating that feature engineering rather than the specific model choice drives performance.
Context features dominate feature importance analysis, contributing far more gain than single-event fields.

Threat model

The adversary is a remote attacker performing multi-step web application attacks (SQL injection, XSS, brute force, broken authentication, scanning, sensitive data exposure) from distinct source IPs. The attacker cannot subvert or censor log collection, and cannot tamper with SIEM events or the classification pipeline. The defender observes enriched Wazuh security events with rule metadata and MITRE ATT&CK mappings, and can store and query historical events per source IP to detect correlated campaigns across event sequences. The threat model excludes insider threats, encrypted or obfuscated traffic that escapes logging, and evasion that completely mimics benign temporal behaviors.

Methodology — deep read

Threat Model and Assumptions: The adversary attempts multi-step web application attacks (SQL injection, XSS, brute force, broken auth, scanning, sensitive data exposure) against a vulnerable web application (OWASP Juice Shop). Attacks originate from distinct IPs; the defender observes Wazuh SIEM security events normalized from Apache logs with rules mapped to MITRE ATT&CK techniques. The attacker cannot modify or tamper with log collection or backend AI classification pipelines. The defender assumes a clean source-IP-to-actor mapping for labeling. The system exploits intra-source temporal correlation to detect coordinated campaigns invisible to stateless rules.
Data: The dataset contains 46,454 security events collected from a controlled testbed. The OWASP Juice Shop server runs a Wazuh agent monitoring Apache access logs. Normal legitimate traffic is automated with Selenium from a separate source IP. Attacking machines run tools like SQLMAP (SQL injection), Acunetix (web scanning/XSS), Burp Suite (brute force/broken authentication), an XSS automation tool, and Gobuster (directory scanning), each from distinct IPs. Ground truth labels are deterministic: events matching attacker IPs and timestamps within recorded attack sessions get corresponding attack class labels; all others are NORMAL. The dataset is split chronologically and stratified into 64% train, 16% validation, 20% test. SMOTE-NC oversampling and undersampling balance classes in training, creating a balanced 12,500-event set with 10% per attack class and 40% normal.
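The deterministic labeling rule described above (attacker IP plus a timestamp inside a recorded attack session, otherwise NORMAL) can be sketched as follows. This is an illustrative reading of the rule, not the authors' code: the `AttackSession` type, the `label_event` function, the IP addresses, the timestamps, and the label strings are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable


@dataclass(frozen=True)
class AttackSession:
    """One recorded attack session (hypothetical structure)."""
    source_ip: str
    start: datetime
    end: datetime
    label: str


def label_event(source_ip: str, timestamp: datetime,
                sessions: Iterable[AttackSession]) -> str:
    """Return the attack label if the event matches an attacker IP and
    falls inside that session's time window; otherwise return NORMAL."""
    for session in sessions:
        if (source_ip == session.source_ip
                and session.start <= timestamp <= session.end):
            return session.label
    return "NORMAL"


# Hypothetical session log: one session per attack class, one IP per tool.
SESSIONS = [
    AttackSession("10.0.0.21", datetime(2025, 1, 1, 10, 0),
                  datetime(2025, 1, 1, 10, 30), "SQL_INJECTION"),
    AttackSession("10.0.0.22", datetime(2025, 1, 1, 11, 0),
                  datetime(2025, 1, 1, 11, 45), "BRUTE_FORCE"),
]

inside = label_event("10.0.0.21", datetime(2025, 1, 1, 10, 5), SESSIONS)
outside = label_event("10.0.0.21", datetime(2025, 1, 1, 12, 0), SESSIONS)
print(inside, outside)
```

Because the label depends only on source IP and time, the rule inherits the clean source-IP-to-actor assumption stated in the threat model.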
Architecture and Algorithms: Smart-SIEM attaches to Wazuh as a module. It receives security events above rule level 0 via syslog and streams them into Apache Kafka, which provides back-pressure and scaling for the consumer workers. For each event, a consumer queries Elasticsearch for the N=30 most recent events from the same source IP that are strictly earlier in time. Contextual behavioural features are then computed over this window by aggregating HTTP response status code distributions, maximum rule activation counts, and cumulative MITRE ATT&CK technique frequency counts, producing a per-source-IP, session-aware feature vector.
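The per-source-IP lookup could be expressed as an Elasticsearch query body along the following lines. This is a minimal sketch rather than the authors' query: the field names `data.srcip` and `timestamp` and the helper `build_context_query` are assumptions chosen for illustration, and the resulting dict would be handed to whichever Elasticsearch client the deployment uses.

```python
from typing import Any, Dict


def build_context_query(source_ip: str, event_timestamp: str,
                        window_size: int = 30) -> Dict[str, Any]:
    """Build a search body for the `window_size` most recent events from
    `source_ip` that are strictly earlier than `event_timestamp`.

    Field names are assumptions; adjust them to the actual index mapping.
    """
    return {
        "size": window_size,
        "query": {
            "bool": {
                "filter": [
                    # Same source IP as the event being classified.
                    {"term": {"data.srcip": source_ip}},
                    # "lt" (not "lte") keeps the window strictly prior in time.
                    {"range": {"timestamp": {"lt": event_timestamp}}},
                ]
            }
        },
        # Newest first, so the top `size` hits are the most recent ones.
        "sort": [{"timestamp": {"order": "desc"}}],
    }


query = build_context_query("10.0.0.21", "2025-01-01T10:05:00Z")
print(query["size"])
```

Using a strict `lt` bound keeps the current event out of its own context window, which matches the "strictly prior in time" requirement.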

Classification is a two-stage hybrid cascade. Stage 1 is a LightGBM binary classifier (normal vs attack), selected for the best binary F1. Stage 2 is an XGBoost multi-class classifier over six attack categories, used for precise attack-type attribution. Feature sets combine base event fields with the contextual features. Missing data is handled with default values and appropriate encoding, and SMOTE-NC handles class imbalance in training.
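The control flow of such a cascade is simple enough to sketch without the actual models. In the example below the two stages are passed in as plain callables; `toy_stage1` and `toy_stage2` are hypothetical rule-based stand-ins for the trained LightGBM and XGBoost classifiers, and the feature names and label strings are assumptions.

```python
from typing import Callable, Mapping

Features = Mapping[str, float]


def classify_event(features: Features,
                   is_attack: Callable[[Features], bool],
                   attack_type: Callable[[Features], str]) -> str:
    """Stage 1 gates Stage 2: only events flagged as attacks are routed
    to the multi-class model for an attack category."""
    if not is_attack(features):
        return "NORMAL"
    return attack_type(features)


# Toy stand-ins for the trained models (illustration only).
def toy_stage1(features: Features) -> bool:
    return features.get("status_4xx", 0) >= 10


def toy_stage2(features: Features) -> str:
    return "BRUTE_FORCE" if features.get("mitre_T1110", 0) > 0 else "SCANNING"


benign = classify_event({"status_4xx": 2}, toy_stage1, toy_stage2)
attack = classify_event({"status_4xx": 25, "mitre_T1110": 25},
                        toy_stage1, toy_stage2)
print(benign, attack)
```

One consequence of this design is that a Stage 1 false negative can never be corrected by Stage 2, which is why binary recall matters so much for the cascade as a whole.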

Training Regime: Models are trained on the balanced training split at session level, with early stopping on a validation subset. Hyperparameters are tuned but not fully disclosed. Because the testbed contains a single session per class, the reported results are upper-bound estimates of generalization. The retraining loop runs daily and evaluates macro-averaged accuracy against the Elasticsearch knowledge base of labeled events. If accuracy drops below a configurable 90% threshold (due to concept drift or new attacks), both Stage 1 and Stage 2 classifiers are retrained on the combined updated corpus.
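The retraining trigger can be sketched as a threshold check on a macro-averaged score. The summary does not define "macro-averaged accuracy" precisely, so the version below assumes it means the unweighted mean of per-class accuracy (the fraction of each true class predicted correctly). The function names and label strings are hypothetical.

```python
from collections import defaultdict
from typing import Sequence


def macro_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class accuracy (assumed definition)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def needs_retraining(y_true: Sequence[str], y_pred: Sequence[str],
                     threshold: float = 0.90) -> bool:
    """True when the score on analyst-labeled events falls below threshold."""
    return macro_accuracy(y_true, y_pred) < threshold


# Analyst labels vs model predictions: one of two brute-force events missed.
labels = ["NORMAL"] * 4 + ["BRUTE_FORCE"] * 2
preds = ["NORMAL"] * 4 + ["NORMAL", "BRUTE_FORCE"]
print(macro_accuracy(labels, preds), needs_retraining(labels, preds))
```

In this toy case plain accuracy is 5/6, but the macro average is 0.75 because the small attack class counts as much as the normal class, so the check fires.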
Evaluation Protocol: The reported metrics are macro F1 scores on held-out test sets with natural class proportions (no oversampling in test). Baselines include the native Wazuh rule engine and multiple gradient boosting variants with and without context features, to isolate the contribution of context. Ablation studies identify the optimal N and quantify the performance drop from removing context. Confusion matrices and feature importance analyses support these findings. Retraining experiments simulate concept drift by introducing unseen attacks and measuring recovery.
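Macro F1 averages the per-class F1 scores with equal weight, which is why it is a sensible metric on a test set with natural (heavily imbalanced) class proportions. A minimal pure-Python sketch, with hypothetical labels, makes the difference from plain accuracy concrete:

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for cls in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# A classifier that always predicts the majority class.
truth = ["NORMAL", "NORMAL", "NORMAL", "BROKEN_AUTH"]
always_normal = ["NORMAL"] * 4

accuracy = sum(t == p for t, p in zip(truth, always_normal)) / len(truth)
score = macro_f1(truth, always_normal)
print(accuracy, round(score, 4))
```

Here accuracy is 0.75 while macro F1 is 3/7 (about 0.43), because the missed minority class contributes an F1 of zero to the average.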
Reproducibility: The dataset is purpose-built and not publicly released, and the Smart-SIEM code is not explicitly stated as released. The methodology does give explicit detail on dataset generation, labeling, preprocessing, and feature engineering. The hybrid cascade uses off-the-shelf LightGBM and XGBoost with standard configurations, enhanced by the novel contextual features.

Concrete Example: For a new security event from source IP X, the pipeline queries Elasticsearch for the 30 most recent events from IP X timestamped before this one. Over that window it counts HTTP response codes (e.g., the number of 2xx, 4xx, and 5xx responses), finds the maximum number of times any single rule fired, and tallies the occurrences of each MITRE ATT&CK technique observed. This contextual vector is concatenated with the base features of the current event, and the enriched feature vector is fed first to LightGBM to decide normal or attack. If the event is classified as an attack, the vector proceeds to the XGBoost multi-class model for attack-type classification. The event is then stored with its predictions in Elasticsearch for analyst visualization and retraining-trigger monitoring.
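The window aggregation in this example can be sketched in a few lines. The event dictionary keys (`status`, `rule_id`, `mitre`), the output feature names, and the rule identifiers are hypothetical; T1110 (Brute Force) and T1190 (Exploit Public-Facing Application) are real ATT&CK technique IDs, used here only as sample values.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Sequence


def context_features(prior_events: Iterable[Dict[str, Any]],
                     techniques: Sequence[str]) -> Dict[str, int]:
    """Aggregate a window of prior events from one source IP into
    status-class counts, the max per-rule fire count, and technique tallies."""
    status = Counter()
    rule_fires = Counter()
    technique_counts = Counter()
    for event in prior_events:
        code = event.get("status")
        if code is not None:
            status[f"{code // 100}xx"] += 1  # 404 -> "4xx"
        rule_id = event.get("rule_id")
        if rule_id is not None:
            rule_fires[rule_id] += 1
        technique_counts.update(event.get("mitre", []))

    features = {f"status_{k}": status[k] for k in ("2xx", "3xx", "4xx", "5xx")}
    # Highest number of times any single rule fired inside the window.
    features["max_rule_count"] = max(rule_fires.values(), default=0)
    for technique in techniques:
        features[f"mitre_{technique}"] = technique_counts[technique]
    return features


# Hypothetical window: three failed logins followed by one successful request.
window = [
    {"status": 401, "rule_id": "rule_a", "mitre": ["T1110"]},
    {"status": 401, "rule_id": "rule_a", "mitre": ["T1110"]},
    {"status": 401, "rule_id": "rule_a", "mitre": ["T1110"]},
    {"status": 200, "rule_id": "rule_b", "mitre": []},
]
vector = context_features(window, techniques=["T1110", "T1190"])
print(vector)
```

Fixing the technique list up front keeps the vector the same length for every event, so a source IP with no history simply yields all zeros.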

Technical innovations

Definition and use of a per-source-IP behavioural context vector encoding HTTP status distributions, maximum rule activation counts, and cumulative MITRE ATT&CK technique occurrence over the N most recent prior events to enable session-aware detection.
Integration of a two-stage hybrid cascade classifier combining LightGBM for binary attack detection and XGBoost for six-class attack categorization within the Wazuh SIEM platform.
Development of a self-adaptive retraining mechanism driven by accuracy evaluation against an analyst-curated Elasticsearch knowledge base, triggering retraining automatically upon concept drift.
Demonstration that cumulative MITRE ATT&CK technique counts as explicit contextual features outperform other approaches that aggregate raw signal statistics without semantic threat framework grounding.

Datasets

Purpose-built Wazuh security event dataset — 46,454 events — collected from a controlled testbed with OWASP Juice Shop, a Selenium normal-traffic generator, and attack machines running SQLMAP, Acunetix, Burp Suite, an XSS automation tool, and Gobuster

Baselines vs proposed

All tested gradient boosting algorithms without context features: macro F1 ≈ 0.705 (Stage 1 binary), 0.552–0.590 (Stage 2 six-class)
Same algorithms with context features: macro F1 of 0.947–0.967 (Stage 1), 0.876–0.914 (Stage 2)
Hybrid cascade (LightGBM + XGBoost): binary F1 = 0.967 vs baseline without context 0.705
Hybrid cascade: six-class F1 = 0.914 vs baseline without context 0.552–0.590
Wazuh's native rule engine detects 0% of Brute Force and Broken Authentication attacks; Smart-SIEM detects 100% and 98.3% respectively
Self-adaptive retraining recovers F1 from 0.465 (after concept drift) to 0.814 after retraining both stages

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.13337.

Fig 6: Gain-based feature importance for the proposed hybrid: Stage 1 (LightGBM,

Fig 8: Algorithm

Fig 9: Radar chart of the full metrics profile (Stage-1 Precision, Recall, F1; Stage-2

Fig 10: Efficiency vs. accuracy trade-off: macro F1-score against training time (sec-

Fig 11: Confusion matrices for the proposed hybrid cascade (LightGBM Stage 1

Fig 13: Self-adaptive retraining simulation. The initial model encounters Phase 2

Limitations

Data collected from a controlled, single web application testbed (OWASP Juice Shop) with scripted attacks may limit generalizability beyond this environment.
Context features rely on accurate source IP mapping and event ordering; noisy or NATed IPs in real deployments may degrade performance.
Test set includes a single attack session per class, providing an optimistic upper bound on generalization to unseen campaigns.
No evaluation against advanced evasion techniques or adversarial attackers attempting to poison or manipulate context features.
The dataset and code do not appear publicly released, impeding external reproduction.
The self-adaptive retraining relies on analyst labeling, which may incur operational overhead or label lag in practice.

Open questions / follow-ons

How does the approach perform on real-world, large-scale, noisy enterprise data with overlapping IP address reuse, NAT, and encrypted logs?
Can the contextual feature design be generalized to multi-source correlation beyond per-IP sequences, e.g., user accounts or devices?
What is the impact of adversarial manipulation of event features or poisoning attacks on the context vector and model robustness?
Can more advanced sequence models (e.g., transformer-based architectures) further improve detection compared to gradient boosting with engineered features?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this work illustrates the value of using event history and a semantic threat intelligence framework (MITRE ATT&CK) to contextualize individual web security events that are otherwise indistinguishable. The behavioural context vector enables robust detection of automated, low-signal campaigns such as brute force or broken authentication attacks, which rule-based systems often miss. Incorporating such context-aware detection in SIEM pipelines can reduce false negatives and better matches sophisticated bot activity that unfolds over tens or hundreds of requests. The hybrid gradient boosting cascade provides an interpretable and scalable classification method compatible with existing open-source platforms like Wazuh. Additionally, the self-adaptive retraining mechanism addresses concept drift, which matters for bot detection systems facing constantly evolving attacker techniques. Practitioners should note the limitations of IP-based sessionization and consider extending it to user- or session-level identifiers when integrating similar context features.

Cite

bibtex

@article{arxiv2605_13337,
  title={Context-Aware Web Attack Detection in Open-Source SIEM Systems via MITRE ATT\&CK-Enriched Behavioral Profiling},
  author={Badr Alboushy and Assef Jafar and Mohamad Aljnidi and Mohamad Bashar Disoki and Aref Shaheed},
  journal={arXiv preprint arXiv:2605.13337},
  year={2026},
  url={https://arxiv.org/abs/2605.13337}
}
