Multi-Source Cybersecurity Logs: An ATT&CK-Labeled Dataset and SLM Evaluation

Source: arXiv:2606.18190 · Published 2026-06-16 · By Abir Ashab Niloy, Ahmed Ryan, Imamul Hossain Rafi, Md Erfan, Md Rayhanur Rahman

TL;DR

This work addresses a crucial gap in cybersecurity machine learning research: the lack of publicly available, multi-source security log datasets labeled with fine-grained MITRE ATT&CK technique annotations. Existing datasets often cover only network traffic or host telemetry but rarely include browser activity or detailed ATT&CK technique labels. The authors created a new dataset capturing synchronized system, network, and browser logs from 870 sessions (70 attack, 800 benign) on Windows endpoints over a 13-month period, generating approximately 2.3 million events. The attack sessions cover 12 ATT&CK tactics and 53 techniques using real attack tools such as RATs, C2 tunnels, and cloud exfiltration.

To demonstrate the dataset’s utility for cross-source attack detection research, the authors fine-tuned three small language models (Qwen2.5-1.5B, Llama-3.2-3B, Phi-4-Mini) using Low-Rank Adaptation (LoRA) and benchmarked them on two tasks: chunk classification (normal vs suspicious activity) and ATT&CK technique identification. Fine-tuning improved all three models substantially. Chunk classification accuracy rose from about 8% in base models to between 90% and 97%, showing a strong learnable signal. Technique identification remained harder, with a best exact-match accuracy of 42%, but partial match scores and semantic F1 indicate that the models captured much of the relevant attack behavior even when the exact label set was wrong. The dataset and evaluation offer a benchmark for ML research that reasons across host, network, and browser telemetry simultaneously, against realistic multi-stage attacks with fine-grained threat labels.

Key findings

The dataset contains 870 sessions (70 attack, 800 benign) totaling about 2.3 million events captured from system, network, and browser logs simultaneously over 13 months (Jan 2025–Feb 2026).
Attack sessions cover 12 ATT&CK tactics and 53 individual techniques, with 100% coverage in seven tactics such as Initial Access and Execution (Table V).
Attack sessions produce on average 2.2x more system log events and 1.9x more network events than benign sessions.
Dataset integrity is high with 100% session completeness and timestamp consistency; system and browser extraction success rates exceed 99%.
Fine-tuning all three SLMs with LoRA increases chunk classification accuracy from ~8% (base) to 90–97% (fine-tuned) on detecting suspicious activity.
Exact-match accuracy for ATT&CK technique identification improves from 0% to a best of 42% (Phi-4-Mini), with partial match scores above 85%.
Fine-tuning yields consistent improvements across 10 different metrics, including macro and weighted precision, recall, and F1 measurements (Table VII).
Chunk labeling uses sequences of seven temporally ordered events, sufficient to maintain attack context for model input.
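
The seven-event chunking and labeling rule can be sketched as follows. This is a minimal illustration and not the authors' code: the field names (`timestamp`, `malicious`, `technique`), the `chunk_events` helper, and the decision to keep a trailing partial chunk are all assumptions.

```python
from typing import Any

CHUNK_SIZE = 7  # chunk length reported in the paper


def chunk_events(session_id: str, events: list[dict[str, Any]],
                 size: int = CHUNK_SIZE) -> list[dict[str, Any]]:
    """Group one session's events into fixed-size, time-ordered chunks."""
    ordered = sorted(events, key=lambda e: e["timestamp"])
    chunks = []
    for index, start in enumerate(range(0, len(ordered), size)):
        window = ordered[start:start + size]
        # A chunk is 'suspicious' if it contains at least one malicious event.
        flagged = [e for e in window if e.get("malicious")]
        techniques = sorted({e["technique"] for e in flagged if "technique" in e})
        chunks.append({
            "session_id": session_id,
            "chunk_index": index,
            "events": window,
            "label": "suspicious" if flagged else "normal",
            "techniques": techniques,
        })
    return chunks
```

With nine events where only the last is malicious, this yields one full `normal` chunk followed by a two-event `suspicious` chunk carrying that event's technique ID.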

Threat model

The adversary is a realistic attacker executing multi-stage attacks on Windows endpoints that generate system, network, and browser logs. Adversary capabilities include phishing-based initial access, use of Remote Access Trojans, C2 channels, credential dumping, persistence, lateral movement, and data exfiltration. They cannot interfere with the controlled lab environment data collection or escape containment. The defender is assumed to have full multi-source log visibility but must correlate signals across domains to detect and map attacks to specific ATT&CK techniques.

Methodology — deep read

Threat Model & Assumptions: The adversary is a sophisticated attacker executing real-world multi-stage cyberattacks spanning system, network, and browser layers on Windows endpoints. Attacks use genuine offensive tools including Remote Access Trojans (RAT), Command and Control (C2) tunnels, persistence mechanisms, credential harvesting, lateral movement, cloud exfiltration, and ransomware. The attacker can phish victims and execute code but cannot escape the controlled lab environment or tamper with data collection infrastructure. The defender attempts to leverage multi-modal logs to detect and characterize attacks at the ATT&CK technique level.
Data Collection: The dataset was built from 870 sessions recorded January 2025 to February 2026, with 70 attack and 800 benign sessions. Each session is a fixed 20-minute window capturing synchronized system logs (filtered Sysmon events and Windows Event Logs), network traffic (full PCAP using tshark in promiscuous mode), and browser observability (foreground application and URL visits via ActivityWatch). Data was collected on Windows 10/11 machines and VMs isolated from production networks, with controlled internet access only to required services (C2 servers, cloud storage). The attack sessions ran 50 diverse scenarios covering a broad range of ATT&CK tactics, using real tools and multi-stage workflows carried out by a two-person team playing the attacker and victim roles.
Data Processing: Raw logs were converted into JSON with normalized timestamps and a unified schema. Personally identifiable information was anonymized by masking IPs, hashing usernames, and scrubbing sensitive file paths. Events were grouped into chunks of seven sequential log entries with metadata (session ID, chunk index). Chunks containing malicious events were labeled 'suspicious' and annotated with ATT&CK technique IDs; the rest were labeled 'normal'. The dataset was split into training (89,693 chunks), validation (12,427), and testing (10,606) sets. Because of hardware limits, a 40% sample of the training and validation sets was used, and inputs were capped at 2000 tokens.
Model & Training: Three instruction-tuned, decoder-only small language models (Qwen2.5-1.5B, Llama-3.2-3B, and Phi-4-Mini at 3.8B parameters) were fine-tuned with LoRA, which freezes the base weights and trains only small low-rank adapter matrices. The configuration used rank r=16, scaling α=32, a maximum sequence length of 2000 tokens, batch size 4, and up to 3 epochs, updating less than 1% of parameters.
Evaluation: Two tasks were evaluated: (a) chunk classification as normal vs suspicious activity and (b) alert generation producing attack severity and per-event ATT&CK techniques. Metrics for task (a) were accuracy, precision, recall, and F1 (macro and weighted); metrics for task (b) were exact-match accuracy, average partial match, and word-level F1. Base and fine-tuned variants of each SLM were compared on the test split.
Reproducibility: Code and dataset availability are not explicitly stated; dataset derives from controlled lab simulations rather than public sources, and uses standard open-source tools for logging and downstream modeling. This may limit external reproduction until dataset release. The consistent methodology spanning data collection, labeling, preprocessing, and a clear train-validation-test split aids future benchmarking.
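
The normalization and anonymization described under Data Processing might look like the sketch below. The record fields (`ts`, `user`, `message`), the salt value, and the choice of SHA-256 truncated to 12 hex characters are illustrative assumptions; the summary does not specify the paper's schema or hashing scheme.

```python
import hashlib
import re
from datetime import datetime, timezone

# Simple IPv4 pattern; it does not validate octet ranges.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
SALT = "example-salt"  # hypothetical; a real pipeline would keep this secret


def hash_username(name: str) -> str:
    """Replace a username with a stable, salted pseudonym."""
    digest = hashlib.sha256((SALT + name).encode("utf-8")).hexdigest()
    return "user_" + digest[:12]


def normalize_event(raw: dict, source: str) -> dict:
    """Map a raw record onto one schema with a UTC ISO-8601 timestamp."""
    ts = datetime.fromtimestamp(raw["ts"], tz=timezone.utc).isoformat()
    return {
        "timestamp": ts,
        "source": source,  # e.g. "system", "network", or "browser"
        "user": hash_username(raw["user"]) if "user" in raw else None,
        "message": IPV4.sub("x.x.x.x", raw.get("message", "")),
    }
```

Hashing with a fixed salt keeps the same user recognizable across sources within the dataset, which matters for cross-source correlation, while hiding the original name.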

Technical innovations

A dataset (not yet publicly released) of synchronized system, network, and browser logs with per-event ATT&CK technique labeling across 12 tactics and 53 techniques.
Use of real-world attack simulation tools to generate multi-source telemetry data in a controlled and reproducible lab environment.
Application of chunking method grouping 7 sequential multi-source log events into structured inputs for small language model analysis.
Benchmarking instruction-tuned SLMs fine-tuned via LoRA on complex cross-source cybersecurity log classification and ATT&CK technique identification tasks.
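
For intuition on why LoRA touches so few parameters, the sketch below counts adapter weights for a single linear layer and computes the standard scaling factor α/r for the reported r=16 and α=32. The 4096×4096 layer shape is a hypothetical example, not a dimension taken from the paper or from any of the three models.

```python
def lora_scaling(alpha: float, rank: int) -> float:
    """Standard LoRA scaling applied to the low-rank update: alpha / r."""
    return alpha / rank


def lora_adapter_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in adapter matrices A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


def trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Adapter parameters relative to the frozen weight matrix they adapt."""
    return lora_adapter_params(d_in, d_out, rank) / (d_in * d_out)


if __name__ == "__main__":
    d = 4096  # hypothetical layer width, for illustration only
    print(lora_scaling(32, 16))
    print(lora_adapter_params(d, d, 16))
    print(f"{trainable_fraction(d, d, 16):.2%}")
```

The whole-model fraction also depends on which layers receive adapters, a detail this summary does not report, so the per-layer figure is only a rough guide to the "less than 1%" claim.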

Datasets

870-session multi-source log dataset — ~2.3 million events — internal collection from Windows 10/11 endpoints with attack simulations and benign sessions (not yet publicly released)

Baselines vs proposed

Llama-3.2-3B base: chunk classification accuracy = 8.1% vs fine-tuned: 92.8%
Phi-4-Mini base: chunk classification accuracy = 7.2% vs fine-tuned: 97.0%
Qwen2.5-1.5B base: chunk classification accuracy = 7.9% vs fine-tuned: 89.9%
Phi-4-Mini exact-match ATT&CK technique identification accuracy base = 0% vs fine-tuned = 41.6%
Llama-3.2-3B exact-match accuracy base = 0% vs fine-tuned = 29.3%
Qwen2.5-1.5B exact-match accuracy base = 0% vs fine-tuned = 27.8%
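
The technique-identification metrics can be illustrated with a small sketch. Exact match here means the predicted set of technique IDs equals the reference set. Partial match is implemented as Jaccard overlap and word-level F1 as token-overlap F1; both are assumed definitions, since the summary does not give the paper's formulas.

```python
from collections import Counter


def exact_match(pred: set[str], gold: set[str]) -> bool:
    """True only when predicted and reference technique sets are identical."""
    return pred == gold


def partial_match(pred: set[str], gold: set[str]) -> float:
    """Jaccard overlap between technique sets (assumed definition)."""
    if not pred and not gold:
        return 1.0
    return len(pred & gold) / len(pred | gold)


def word_f1(pred_text: str, gold_text: str) -> float:
    """Token-overlap F1 between two free-text outputs (assumed definition)."""
    pred_tokens = pred_text.lower().split()
    gold_tokens = gold_text.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under these definitions a prediction of `{"T1059"}` against a reference of `{"T1059", "T1003"}` fails exact match but scores 0.5 on partial match, which shows how exact-match accuracy can stay modest while partial-match scores are high.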

Limitations

The dataset is currently internal/private, with no explicit mention of a public release, which restricts reproducibility and community benchmarking.
Attack scenarios run in a lab environment may miss nuances of real-world network complexity and user behavior diversity.
Model evaluation focuses on 3 relatively small SLMs; results may differ for other architectures or larger-scale models.
Coverage is uneven across tactics; lateral movement, for example, appears in only 21% of attack sessions.
Technique identification exact-match accuracy remains modest (~42%), indicating open challenges for fine-grained attribution.
Benign sessions were scripted and performed by the authors, so their potentially limited range of activity may not fully represent real benign variability.

Open questions / follow-ons

How well do the trained models generalize to unseen, real-world attack campaigns with broader tactic variability?
Can modeling techniques be improved to boost exact-match ATT&CK technique identification beyond the current best of about 42%?
What gains can be obtained by combining these logs with other telemetry modalities such as endpoint process memory?
How to scale dataset collection and annotation cost-effectively to include more diverse user behaviors and rare attack variants?

Why it matters for bot defense

This dataset and evaluation provide actionable insights for bot-defense and CAPTCHA practitioners by demonstrating the feasibility of integrating multi-source telemetry (system, network, browser) to improve detection of sophisticated multi-stage cyberattacks. The fine-grained ATT&CK technique labeling enables more precise attribution and response prioritization beyond simple malicious/benign flags, which can enhance triage and mitigation workflows in real-world security operations centers. The use of instruction-tuned small language models fine-tuned via parameter-efficient LoRA shows promise as a deployable method to analyze complex event sequences. However, the moderate technique identification accuracy highlights ongoing challenges in automated deep attack understanding from heterogeneous logs, underscoring the continued need for human-in-the-loop review and layered defense mechanisms in bot detection and CAPTCHA efficacy assessment scenarios.

Cite

bibtex

@article{arxiv2606_18190,
  title={Multi-Source Cybersecurity Logs: An ATT\&CK-Labeled Dataset and SLM Evaluation},
  author={Abir Ashab Niloy and Ahmed Ryan and Imamul Hossain Rafi and Md Erfan and Md Rayhanur Rahman},
  journal={arXiv preprint arXiv:2606.18190},
  year={2026},
  url={https://arxiv.org/abs/2606.18190}
}
