A Hybrid Cluster-Based Classification Model for Anomaly Detection in Unbalanced IoT Networks

Source: arXiv:2605.19451 · Published 2026-05-19 · By Hossein Shaemi Barzoki, Amir Hossein Fathi Hafshejani, Ahmadreza Montazerolghaem

TL;DR

This paper tackles the problem of anomaly detection in highly imbalanced and heterogeneous Internet of Things (IoT) network traffic, focusing on the Bot-IoT dataset. Traditional single-model classifiers struggle with diverse attack types and severe class imbalance, leading to suboptimal detection performance. The authors propose a hybrid cluster-based framework that first segments the training data into three homogeneous clusters using K-Means clustering, each containing both normal and attack samples. Then, for each cluster, they train multiple baseline machine learning models independently and select the best-performing simple model per cluster to form a specialized ensemble.

This divide-and-conquer methodology uses data-driven clustering to decompose the complex global classification problem into smaller, more manageable subproblems with consistent traffic profiles. Experimentally, the approach shows near-perfect accuracy on held-out test data, outperforming standard monolithic models on the same data. The cluster-specific models (KNN for cluster 0 and Decision Trees for clusters 1 and 2) achieve over 99.9% accuracy, which the authors present as evidence of improved robustness and efficiency for IoT anomaly detection. The work offers a pragmatic and scalable framework addressing the dual challenges of data heterogeneity and class imbalance inherent in IoT security datasets.

Key findings

The Bot-IoT dataset is highly imbalanced: attack traffic makes up more than 99.9% of flows, leaving normal traffic at only about 0.013%.
K-Means clustering into 3 clusters effectively partitions the data into homogeneous traffic profiles preserving the binary class structure.
Separate models trained and optimized per cluster outperform any single global classifier across the entire dataset.
K-Neighbors Classifier (KNN) achieved 1.0 accuracy on cluster 0 test data, while Decision Tree (entropy) models attained >0.9999 accuracy on clusters 1 and 2.
Combining SMOTE for oversampling the minority class and RandomUnderSampler for the majority class produced a balanced training distribution that improved model accuracy.
Table 1 shows robustness of models under different resampling strategies, with Approach 3 (resample all data) yielding near-perfect scores for key classifiers.
The hybrid cluster-based framework leverages simpler models rather than a single complex ensemble, improving interpretability and computational efficiency.
The clusters capture data-driven traffic profiles without manual feature-based segmentation, exposing structure in the high-dimensional feature space.

Threat model

The adversary is an attacker generating diverse and evolving malicious IoT network traffic, including DoS, DDoS, reconnaissance, and data theft attacks. The adversary attempts to evade detection by exploiting the heterogeneity and class imbalance of IoT traffic. The defender is assumed to have access to labeled training data including normal and attack samples but cannot rely on robust signature-based detection due to the novelty and variety of attacks. The adversary cannot directly manipulate the clustering or training process but attempts to blend attacks within the feature space to evade a single classifier.

Methodology — deep read

Threat Model & Assumptions: The adversary is modeled as a source of diverse attack traffic impacting IoT networks, generating heterogeneous attack types such as DoS, DDoS, reconnaissance, and data theft. The defender’s goal is to detect anomalous traffic against a highly skewed background where normal traffic is extremely rare. The framework assumes access to labeled network traffic flows from the Bot-IoT dataset but faces challenges due to imbalance and heterogeneity. Attacker and attack features are implicitly embedded in flow features; no assumptions are made about adversary knowledge or evasion.
Data: The study uses the Bot-IoT dataset, a public corpus comprising realistic network traffic labeled as normal or various attack types. The raw dataset severely skews toward attack samples (~99.9%). Non-predictive features such as packet numbering IDs were removed. Hexadecimal port values were converted to decimal, and categorical features encoded numerically using LabelEncoder. To address imbalance, the authors applied a combination of SMOTE oversampling for the minority class and RandomUnderSampler on the majority class, producing a more balanced dataset suitable for classification training. The train-test split was 70/30, with test data fully held out from clustering and balancing.
Architecture / Algorithm: The core hybrid framework consists of two stages: (a) Data Clustering using K-Means to segment the balanced training data into k=3 clusters based on feature similarity, preserving the binary-class labels within each cluster, thereby creating homogeneous traffic profiles. The 'Elbow Method' guided the choice of k=3. (b) Cluster-specific Classification: For each cluster, six baseline models (Decision Tree with gini and entropy criteria, Random Forest, Gaussian Naive Bayes, XGBoost, and K-Neighbors Classifier) were trained independently. Each model outputs a binary label of 'Attack' or 'Normal' within its cluster. After evaluation, the best-performing model per cluster was selected (KNN for cluster 0; Decision Tree entropy for clusters 1 and 2). This ensemble of specialized models forms the final detector.
Training Regime: Models were trained on the balanced clusters derived from preprocessing and clustering. Hyperparameters and random seeds were not specified in the source text. Because all candidates are classical ML models, training avoids the epoch, batch-size, and optimizer tuning that deep learning would require, which suggests the pipeline is efficient to run even on a dataset of this size.
Evaluation Protocol: Accuracy was the primary metric, justified by the effective resampling balancing classes within each cluster. Model performance was measured on a held-out test set never involved in clustering or balancing, ensuring realistic generalization estimates. Comparative analysis of resampling strategies (none, per-cluster, global) was presented in Table 1, showing the superiority of the global resampling. Confusion matrices illustrated cluster-specific model effectiveness. No formal statistical tests or cross-validation were reported.
Reproducibility: The paper does not mention public release of code or trained models. The Bot-IoT dataset is publicly accessible, so the setup is generally reproducible with appropriate preprocessing and clustering scripts. Exact hyperparameters, seeds, and implementation details were not fully detailed, limiting full reproducibility. However, the use of standard methods (K-Means, classical ML models) aids reproducibility.
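
The preprocessing described under Data above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' code: the field names (`pkSeqID`, `proto`, `sport`, `dport`), the sample records, and the helper names are assumptions chosen for the example, and the integer mapping simply mirrors the sorted-category behavior of scikit-learn's `LabelEncoder`.

```python
# Hypothetical flow records; field names are illustrative, not the exact Bot-IoT schema.
RAW_FLOWS = [
    {"pkSeqID": 1, "proto": "tcp", "sport": "0x0050", "dport": "443", "label": 1},
    {"pkSeqID": 2, "proto": "udp", "sport": "53", "dport": "0x1f90", "label": 1},
    {"pkSeqID": 3, "proto": "tcp", "sport": "8080", "dport": "80", "label": 0},
]

NON_PREDICTIVE = {"pkSeqID"}      # identifier columns to drop (assumed name)
CATEGORICAL = ["proto"]           # columns to encode as integers
PORT_FIELDS = ["sport", "dport"]  # columns that may hold hexadecimal strings


def port_to_int(value: str) -> int:
    """Parse a port given as decimal ("443") or hexadecimal ("0x1f90")."""
    text = value.strip().lower()
    return int(text, 16) if text.startswith("0x") else int(text)


def fit_label_encoder(values):
    """Map each distinct category to an integer, in sorted order."""
    return {category: index for index, category in enumerate(sorted(set(values)))}


def preprocess(flows):
    """Drop identifier columns, convert ports to integers, encode categoricals."""
    encoders = {f: fit_label_encoder(row[f] for row in flows) for f in CATEGORICAL}
    cleaned = []
    for row in flows:
        out = {k: v for k, v in row.items() if k not in NON_PREDICTIVE}
        for f in PORT_FIELDS:
            out[f] = port_to_int(out[f])
        for f in CATEGORICAL:
            out[f] = encoders[f][out[f]]
        cleaned.append(out)
    return cleaned, encoders


cleaned, encoders = preprocess(RAW_FLOWS)
```

In a real pipeline the encoders would be fitted on the training split only and reused on the test split, so that held-out data does not influence the mapping.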

Example End-to-End: Network flow records from Bot-IoT are first preprocessed (feature reduction, conversion, encoding). Synthetic minority class samples are generated by SMOTE and majority samples reduced by undersampling, producing a balanced dataset. The balanced data is clustered with K-Means into 3 clusters (k=3). For each cluster, six candidate classifiers are trained and evaluated on a test set. The best model per cluster is identified by accuracy. At inference, a new sample is assigned to the nearest cluster by feature distance and classified by the dedicated model for that cluster, yielding a binary anomaly prediction.
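
The end-to-end flow above can be approximated with scikit-learn. The sketch below is an illustration under stated assumptions rather than the paper's implementation: it uses synthetic data from `make_classification` in place of the preprocessed Bot-IoT flows, fixes arbitrary seeds and default hyperparameters (the source does not report them), and trains only four of the six candidate models (Random Forest and XGBoost are omitted to keep it short).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the balanced, preprocessed flows (assumption).
X, y = make_classification(n_samples=3000, n_features=10, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Stage (a): partition the training data into k=3 clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)
train_clusters = kmeans.labels_
test_clusters = kmeans.predict(X_test)  # nearest-centroid assignment


def make_candidates():
    """Fresh, unfitted candidate models for one cluster."""
    return {
        "dt_gini": DecisionTreeClassifier(criterion="gini", random_state=0),
        "dt_entropy": DecisionTreeClassifier(criterion="entropy", random_state=0),
        "gnb": GaussianNB(),
        "knn": KNeighborsClassifier(n_neighbors=5),
    }


# Stage (b): train every candidate per cluster and keep the most accurate one.
best_models = {}
for c in range(kmeans.n_clusters):
    tr, te = train_clusters == c, test_clusters == c
    fitted, scores = {}, {}
    for name, model in make_candidates().items():
        fitted[name] = model.fit(X_train[tr], y_train[tr])
        scores[name] = model.score(X_test[te], y_test[te])
    winner = max(scores, key=scores.get)
    best_models[c] = (winner, fitted[winner])


def predict(samples):
    """Route each sample to its nearest cluster, then to that cluster's model."""
    clusters = kmeans.predict(samples)
    preds = np.empty(len(samples), dtype=int)
    for c, (_, model) in best_models.items():
        mask = clusters == c
        if mask.any():
            preds[mask] = model.predict(samples[mask])
    return preds


preds = predict(X_test)
print({c: name for c, (name, _) in best_models.items()})
print("overall accuracy:", (preds == y_test).mean())
```

Selecting the winner on the test split mirrors the description above; choosing it on a separate validation split instead would give a less optimistic accuracy estimate.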

Technical innovations

Use of unsupervised K-Means clustering to partition imbalanced IoT traffic into homogeneous clusters preserving both normal and attack samples, enabling simpler downstream classification.
A hybrid cluster-specific classification framework that trains independent baseline classifiers on each cluster and selects the optimal model per cluster, rather than relying on a single global complex model.
Integration of synthetic minority oversampling (SMOTE) and random undersampling applied globally prior to clustering to effectively balance classes in highly skewed IoT datasets.
Demonstration that simple interpretable models (KNN and Decision Trees) specialized per cluster outperform monolithic complex models on the Bot-IoT anomaly detection task.
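
The global rebalancing step can be illustrated with a simplified stand-in. The sketch below is not the imbalanced-learn `SMOTE` or `RandomUnderSampler` implementation named in the paper; it is a standard-library approximation of the same two ideas (interpolating between neighboring minority samples, and randomly discarding majority samples). The target class size, `k`, the seeds, and the toy Gaussian data are all assumptions.

```python
import math
import random


def smote_like(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours (SMOTE's core idea)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def undersample(majority, n_keep, rng=None):
    """Keep a uniform random subset of the majority class."""
    rng = rng or random.Random(0)
    return rng.sample(majority, n_keep)


def rebalance(majority, minority, target_per_class, rng=None):
    """Shrink the majority and grow the minority toward target_per_class."""
    rng = rng or random.Random(0)
    kept = undersample(majority, min(target_per_class, len(majority)), rng)
    needed = max(0, target_per_class - len(minority))
    return kept, list(minority) + smote_like(minority, needed, rng=rng)


# Toy data: 1000 "attack" points and 10 "normal" points in two dimensions.
rng = random.Random(42)
attack = [(rng.gauss(5, 1), rng.gauss(5, 1)) for _ in range(1000)]
normal = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(10)]
attack_bal, normal_bal = rebalance(attack, normal, target_per_class=100, rng=rng)
print(len(attack_bal), len(normal_bal))
```

Because every synthetic point lies on a segment between two real minority samples, the oversampled class never leaves the region the original minority data already covers.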

Datasets

Bot-IoT — over 72 million network flow records — publicly available from UNSW Canberra Cyber

Baselines vs proposed

Decision Tree (gini): accuracy cluster 0 = 0.999995 vs hybrid best cluster 0 (KNN) = 1.0
Decision Tree (entropy): accuracy cluster 1 = 0.999996 vs hybrid best cluster 1 (DT entropy) = 0.999996
Random Forest: accuracy cluster 2 = 0.985917 vs hybrid best cluster 2 (DT entropy) = 0.999952
Gaussian Naive Bayes: accuracy cluster 1 = 0.999990 vs hybrid best cluster 1 (DT entropy) = 0.999996
XGBoost: accuracies near 1.0 but not specifically best per cluster in reported results
K-Neighbors Classifier: accuracy cluster 0 = 1.0 vs others lower on that cluster

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.19451.

Fig 1: Overview of the Hybrid Cluster-Based Classification

Fig 2: KNN for cluster 0

Fig 3: Decision Tree (entropy) for cluster 1

Fig 4: Decision Tree (entropy) for cluster 2

Limitations

The paper uses accuracy as the primary metric without reporting precision, recall, F1-score, or ROC-AUC, which are crucial for imbalanced classification.
No explicit adversarial or evasion-resilience evaluation was performed, limiting insight into robustness against adaptive attackers.
Lack of detailed training hyperparameters, seeds, and implementation specifics reduces reproducibility.
The approach relies on offline batch clustering; no evaluation under online, streaming, or concept drift scenarios was presented.
Only three clusters were used; the effect of alternative cluster counts or more granular partitions remains unexplored.
The hybrid model assigns a single best classifier per cluster, but dynamic or ensemble combinations within clusters were not tested.
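
To see why accuracy alone can mislead here, consider a small worked example. It assumes a hypothetical held-out set of 1,000,000 flows that keeps the raw 0.013% share of normal traffic (the summary does not state the test-set class ratio) and treats the rare normal class as the positive class. The detector is a deliberately degenerate one that labels every flow as an attack.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical test set: 0.013% of 1,000,000 flows are normal (130 flows).
TOTAL_FLOWS = 1_000_000
NORMAL_FLOWS = 130

# "Always attack" never predicts the positive (normal) class, so every
# normal flow is a false negative and every attack flow a true negative.
always_attack = binary_metrics(tp=0, fp=0, fn=NORMAL_FLOWS,
                               tn=TOTAL_FLOWS - NORMAL_FLOWS)
print(always_attack)
```

Under these assumptions the degenerate detector scores 99.987% accuracy while its recall and F1 on normal traffic are both zero, which is the gap that per-class metrics would expose.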

Open questions / follow-ons

How would the hybrid cluster-based approach perform under adaptive adversarial attacks or evasion tactics designed to confuse clustering?
Can the framework be extended to an online or streaming setting with dynamic updating of clusters and models in real time?
Would more sophisticated or deep learning models specialized per cluster further improve detection, or do simple models suffice?
What is the impact of varying the number of clusters (k) on classification performance and computational cost?

Why it matters for bot defense

This paper’s approach of decomposing a highly imbalanced and heterogeneous security dataset into homogeneous clusters before applying specialized classifiers offers useful lessons for bot defense and CAPTCHA challenges. Many bot detection problems involve diverse attack methods with skewed distributions, where a one-size-fits-all detector underperforms. Applying unsupervised clustering as a preprocessing step makes it possible to build lightweight, interpretable classifiers optimized per threat profile, improving detection accuracy and efficiency. The reported gains on the Bot-IoT dataset suggest that such hybrid architectures can mitigate class imbalance and data heterogeneity, two common challenges in bot behavior modeling. However, deploying cluster-specialized models requires assigning live traffic to clusters in real time, which may add complexity. The comparison of resampling strategies also shows that where resampling happens relative to partitioning matters: in this paper, resampling globally before clustering gave the best results. Bot-defense engineers might consider integrated clustering-plus-classification pipelines to improve detection robustness beyond monolithic ML models.

Cite

bibtex

@article{arxiv2605_19451,
  title={A Hybrid Cluster-Based Classification Model for Anomaly Detection in Unbalanced IoT Networks},
  author={Hossein Shaemi Barzoki and Amir Hossein Fathi Hafshejani and Ahmadreza Montazerolghaem},
  journal={arXiv preprint arXiv:2605.19451},
  year={2026},
  url={https://arxiv.org/abs/2605.19451}
}
