FAME: Failure-Aware Mixture-of-Experts for Message-Level Log Anomaly Detection

Source: arXiv:2605.22779 · Published 2026-05-21 · By Huanchi Wang, Zihang Huang, Yifang Tian, Kristina Dzeparoska, Hans-Arno Jacobsen, Alberto Leon-Garcia

TL;DR

FAME addresses message-level anomaly detection in logs and targets two limitations of existing detectors: window-level methods lack granularity, and per-line annotation is expensive when labels are scarce. Unlike prior art that either detects anomalies at coarse session/window granularity or relies on costly per-line large language model (LLM) inference, FAME uses a single offline LLM invocation to semantically cluster event templates into failure domains. It then trains lightweight, on-premise mixture-of-experts (MoE) models specialized per failure domain, which reduces both labeling and inference cost.

The key novelty is failure-domain routing with an asymmetric confidence principle in K-shot labeling: domains with only anomalous labels require no expert classifier, while mixed domains use independent classifiers. This decomposition enables higher detection accuracy by modeling coherent failure modes separately and benefits from strong LLM semantic knowledge without per-line overhead at runtime. Experiments on large-scale benchmarks Blue Gene/L (BGL) and Thunderbird show FAME achieves F1 scores of 98.16% and 99.95%, respectively, with only 100 labeled lines per event template, reducing annotation effort by 76× compared to full labeling. FAME also detects 86.3% of anomalies from unseen event types on BGL and processes over 1 million lines per hour on a single 4-GPU node offline, far cheaper than continuous LLM inference.

Key findings

FAME achieves F1 = 98.16 on BGL and 99.95 on Thunderbird at K = 100 labeled lines per event template.
Annotation effort is reduced by 76× on BGL compared to fully labeling all 4.7 million lines.
FAME detects 86.3% of anomalies from unseen event templates on BGL, showing generalization beyond labeled templates.
Direct LLM inference yields similar F1 but at 650–980× higher deployment cost and much lower throughput (about 10K lines/hour vs 1.2M lines/hour for FAME).
FAME improves F1 by over 32 points on BGL over the strongest classical baseline, SBERT+Logistic Regression (98.16 vs 65.68); single BERT classifiers without routing score lower still (29.65 and 38.10).
LLM-based semantic partitioning outperforms a non-LLM TF-IDF grouping by about 5 F1 points on BGL (FAME w/ TF-IDF grouping F1 = 93.01 vs LLM grouping F1 = 98.16).
All-anomalous failure domains require no expert classifier and can be identified by router alone, reducing modeling complexity and training cost.
FAME runs entirely on-premise after one offline LLM call, enabling privacy-sensitive deployments without recurrent LLM API calls or data egress.
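
As a quick sanity check on the headline numbers, the arithmetic below derives a few implied quantities from figures quoted in this summary. It is a back-of-envelope sketch, not a calculation from the paper: it assumes the 76× reduction is measured against the full BGL line count given in the Datasets section, and the variable names are illustrative.

```python
# Figures quoted in this summary.
BGL_LINES = 4_747_963            # total BGL lines (Datasets section)
LABEL_REDUCTION = 76             # reported annotation reduction factor
FAME_LINES_PER_HOUR = 1_200_000  # reported FAME throughput
LLM_LINES_PER_HOUR = 10_000      # reported direct-LLM throughput
FAME_F1, SBERT_LR_F1 = 98.16, 65.68

# Implied labeling budget, ASSUMING the 76x is relative to all BGL lines.
implied_labeled_lines = round(BGL_LINES / LABEL_REDUCTION)

# Throughput advantage and F1 gap over the SBERT+LR baseline.
throughput_ratio = FAME_LINES_PER_HOUR / LLM_LINES_PER_HOUR
f1_gap = round(FAME_F1 - SBERT_LR_F1, 2)

print(implied_labeled_lines, throughput_ratio, f1_gap)
```

Under that assumption the labeling budget is on the order of 62K lines, the throughput advantage is about 120×, and the gap over SBERT+LR is 32.48 F1 points.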

Threat model

The setting is an operator or security team monitoring production system logs with few labeled message-level anomalies. There is no adversary in the usual sense: the difficulty is the operational challenge of detecting rare, heterogeneous anomalies within massive daily log streams without session context. No active manipulation of logs is assumed, only class imbalance, mixed-template anomaly semantics, and label scarcity. The system cannot rely on continuous large-scale LLM access due to privacy, cost, or latency constraints.

Methodology — deep read

Threat Model & Assumptions: The defender is an operator facing production system logs with millions of lines daily, containing rare anomalies (<5% lines). The defender has limited labeled data per event template (K-shot regime, typically K=100) and wants to detect anomalies at the message level. No adversary attacks the detector directly; the challenge is label scarcity and heterogeneous failure modes.
Data: Two log benchmarks are used: Blue Gene/L (BGL), 4.7M lines with 7.3% anomalies, and Thunderbird, 5M lines with 7.6% anomalies. Logs are parsed with Drain3 into event templates (EventIDs). The data is split chronologically into an 85% offline region used for setup and training and a 15% held-out test region. Up to K labeled lines per EventID are sampled from the offline region for supervised training and threshold calibration.
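
The data preparation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the record schema (`event_id`, `text`, `label`), the function names, and the uniform random sampling are assumptions, and it presumes the records are already sorted by time.

```python
import random
from collections import defaultdict

def chronological_split(records, offline_frac=0.85):
    """Split time-ordered records into (offline, test) without shuffling."""
    cut = int(len(records) * offline_frac)
    return records[:cut], records[cut:]

def sample_k_shot(offline, k=100, seed=0):
    """Pick up to k labeled records per EventID from the offline region."""
    by_event = defaultdict(list)
    for rec in offline:
        by_event[rec["event_id"]].append(rec)
    rng = random.Random(seed)
    return {eid: rng.sample(recs, min(k, len(recs)))
            for eid, recs in by_event.items()}

# Toy, time-ordered records using a hypothetical schema (label 1 = anomalous).
records = [
    {"event_id": f"E{i % 3}", "text": f"line {i}", "label": int(i % 7 == 0)}
    for i in range(1000)
]
offline, test = chronological_split(records)
k_shot = sample_k_shot(offline, k=100)
```

Sampling only from the offline region keeps the held-out test region untouched by any labeling or calibration step.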
Architecture / Algorithm:

Offline, an LLM (single invocation) receives per-EventID statistics and representative normal/anomalous examples, then proposes a partition of templates into failure domains (subgroups with shared failure mechanisms).
A deterministic certification step validates that each domain is pure-anomaly (all labeled anomalous) or mixed (contains normal lines), with pure-anomaly domains resolved solely by routing.
A lightweight on-premise sparse Mixture-of-Experts (MoE) model is trained: a DistilBERT gate routes lines to either a domain expert or a universal normal expert.
If a line is routed to a pure-anomaly domain, the router flags it as anomalous immediately.
If a line is routed to a mixed domain, a domain-specific BERT-based classifier predicts its anomaly probability.
Each expert is trained independently with a two-phase regimen: Phase 1 domain-adaptive masked language modeling (MLM) on a large normal-only corpus capped at 200k lines, Phase 2 supervised fine-tuning with K-shot labels and Focal Loss to handle class imbalance.
Router components (gate and selector) are DistilBERT classifiers trained with focal loss and class-weighting for imbalanced classes.
Thresholds for binary anomaly decisions are calibrated on a held-out 20% K-shot calibration subset per expert, maximizing per-domain F1 while enforcing recall ≥ 0.9.
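
The deterministic certification step can be illustrated with a small sketch. It assumes the LLM's partition arrives as a mapping from domain name to EventIDs and that K-shot labels are 0/1 per line; the domain names, the function name, and the choice to treat a domain with no labeled evidence as mixed are illustrative assumptions, not details from the paper.

```python
def certify_domains(partition, k_shot_labels):
    """
    partition: dict of domain name -> list of EventIDs (proposed by the LLM).
    k_shot_labels: dict of EventID -> list of 0/1 labels (1 = anomalous).
    Returns dict of domain name -> "pure_anomaly" or "mixed".
    """
    status = {}
    for domain, event_ids in partition.items():
        labels = [y for eid in event_ids for y in k_shot_labels.get(eid, [])]
        # ASSUMPTION: a domain with no labeled lines is treated as mixed,
        # so it still gets an expert rather than being flagged wholesale.
        is_pure = bool(labels) and all(y == 1 for y in labels)
        status[domain] = "pure_anomaly" if is_pure else "mixed"
    return status

# Hypothetical partition and K-shot evidence.
partition = {"memory_ecc": ["E1", "E2"], "network_link": ["E3"]}
k_shot_labels = {"E1": [1, 1, 1], "E2": [1, 1], "E3": [0, 1, 0]}
status = certify_domains(partition, k_shot_labels)
```

Only domains certified as mixed would need a trained expert; pure-anomaly domains are resolved by routing alone.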

Training Regime:

Router trained first, validated at recall 0.95 threshold.
Experts trained on GPUs, using early stopping based on AUROC.
Experts with small training sets are trained for a fixed 500 gradient steps; larger ones use epoch-based training with early-stopping patience.
Overall hyperparameters: focal loss with γ=2, α=0.75, DistilBERT backbone, MLM masking at 15%.
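
For reference, the standard binary focal loss with the hyperparameters listed above can be written in a few lines. This is the usual formulation, FL = -α_t (1 - p_t)^γ log(p_t), shown per example in plain Python; it assumes α = 0.75 weights the anomalous (positive) class, which the summary does not state explicitly.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.75):
    """Binary focal loss for one example.

    p: predicted probability that the line is anomalous (0 < p < 1).
    y: true label, 1 = anomalous, 0 = normal.
    """
    p_t = p if y == 1 else 1.0 - p                # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha    # ASSUMPTION: alpha weights anomalies
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.95, 1)  # confidently correct anomaly, strongly down-weighted
hard = focal_loss(0.10, 1)  # missed anomaly, dominates the loss
```

The `(1 - p_t) ** gamma` factor shrinks the contribution of well-classified lines, which is why the loss suits heavily imbalanced log data where most lines are easy normals.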

Evaluation Protocol:

Metrics: Precision, Recall, F1, AUROC at message level.
Chronological split: 85% offline region for training/setup, 15% for test (novel unseen lines and templates).
Baselines include classical methods (Drain+Random Forest, TF-IDF+Isolation Forest, SBERT+Logistic Regression), LLM direct inference (various GPT, Claude, Gemini models with few-shot prompting), and several ablations (no routing, no MLM pretraining, single BERT expert).
Threshold calibration on held-out labeled subset.
Results are compared in Table I and reported to two decimal places, given the scale of the datasets.
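
Threshold calibration of the kind described in the methodology (maximize F1 subject to a recall floor of 0.9) can be sketched as a simple scan over candidate thresholds. The function name and the toy scores are illustrative assumptions; in the described setup the inputs would be an expert's scores on its held-out 20% calibration subset.

```python
def calibrate_threshold(scores, labels, min_recall=0.9):
    """Return (threshold, f1) maximizing F1 with recall >= min_recall, else None."""
    best = None
    for t in sorted(set(scores)):
        preds = [s >= t for s in scores]
        tp = sum(p and y == 1 for p, y in zip(preds, labels))
        fp = sum(p and y == 0 for p, y in zip(preds, labels))
        fn = sum((not p) and y == 1 for p, y in zip(preds, labels))
        if tp == 0:
            continue  # precision and F1 undefined or zero
        precision, recall = tp / (tp + fp), tp / (tp + fn)
        if recall < min_recall:
            continue  # violates the recall constraint
        f1 = 2 * precision * recall / (precision + recall)
        if best is None or f1 > best[1]:
            best = (t, f1)
    return best

# Toy calibration set: expert scores and true labels (1 = anomalous).
scores = [0.05, 0.10, 0.20, 0.40, 0.60, 0.70, 0.80, 0.90]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
best = calibrate_threshold(scores, labels)
```

On this toy set the recall constraint forces the threshold down to 0.40 so that every anomaly is caught, accepting one false positive in exchange.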

Reproducibility:

Uses publicly available BGL and Thunderbird datasets.
Drain3 parser with fixed parameters.
Details sufficient to replicate routing, training, and evaluation pipeline.
Code and LLM call details not explicitly described; LLM invocation is one-time offline step.

Example end-to-end usage:

Parse raw logs into EventIDs using Drain3.
Sample K=100 labeled lines per EventID.
Extract binary anomaly signals and representatives.
Prompt LLM once to propose failure domain partitioning.
Certify partition using K-shot evidence.
Train gate and selector DistilBERT routers.
Train domain experts with 2-phase BERT training for mixed domains; none for pure anomaly.
Calibrate expert thresholds.
At inference, route each new log line through router to the expert; assign anomaly binary label and failure domain.
Entire runtime model runs on-premise with no LLM calls.
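
The inference-time routing in the steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the `Domain` class, the `detect` function, the keyword-based `toy_router`, and the domain names are all hypothetical stand-ins for the trained DistilBERT router and BERT experts, and the sketch omits the universal normal expert's fallback scoring.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Domain:
    pure_anomaly: bool
    expert: Optional[Callable[[str], float]] = None  # returns P(anomalous)
    threshold: float = 0.5                            # calibrated per expert

def detect(line, router, domains):
    """Return (is_anomalous, domain_name) for one log line."""
    name = router(line)
    if name == "normal":
        return False, name           # simplification: no fallback scoring
    domain = domains[name]
    if domain.pure_anomaly:
        return True, name            # resolved by routing alone
    return domain.expert(line) >= domain.threshold, name

# Hypothetical stand-ins for the trained router and one mixed-domain expert.
def toy_router(line):
    if "ECC" in line:
        return "memory_ecc"
    if "FATAL" in line:
        return "kernel_fatal"
    return "normal"

domains = {
    "memory_ecc": Domain(pure_anomaly=True),
    "kernel_fatal": Domain(
        pure_anomaly=False,
        expert=lambda s: 0.9 if "panic" in s else 0.1,
    ),
}

print(detect("uncorrectable ECC error", toy_router, domains))
print(detect("FATAL flag cleared", toy_router, domains))
```

Returning the domain name alongside the binary decision mirrors the summary's point that each detection comes with a failure-domain attribution.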

This methodology delivers message-level anomaly detection at reduced labeling cost, with interpretable failure-domain routing grounded in offline LLM semantic clustering rather than continuous, expensive LLM inference.

Technical innovations

One-time offline LLM semantic partitioning of event templates into failure domains for routing, eliminating per-line LLM inference cost.
Asymmetric K-shot sampling confidence principle enabling pure-anomaly domains to be resolved by routing alone without expert classifiers.
Failure-aware Mixture-of-Experts architecture with domain-grounded expert specialization to address heterogeneity in failures.
Two-phase expert training combining domain-adaptive masked language modeling and supervised fine-tuning under severe class imbalance.
On-premise inference pipeline fusing gate and universal normal expert scores for robust fallback detection of misrouted anomalies.

Datasets

Blue Gene/L (BGL) — 4,747,963 lines, 7.3% anomalous, public benchmark
Thunderbird — 5,000,000 lines subset, 7.6% anomalous, public benchmark

Baselines vs proposed

Drain+Random Forest: F1 = 33.59 on BGL vs FAME: F1 = 98.16
TF-IDF+Isolation Forest: F1 = 27.67 on BGL vs FAME: F1 = 98.16
SBERT+Logistic Regression: F1 = 65.68 on BGL vs FAME: F1 = 98.16
FAME w/ TF-IDF grouping: F1 = 93.01 on BGL vs FAME w/ LLM grouping: 98.16
Direct LLM inference GPT-5.4: F1 = 96.01 on BGL vs FAME: 98.16 with 650–980× lower deployment cost
Single BERT (Phase1+2, no routing): F1 = 29.65 on BGL vs FAME: 98.16
Single BERT (Phase2 only): F1 = 38.10 on BGL vs FAME: 98.16
No-Gate (symmetric routing): F1 = 90.99 on BGL vs FAME with gating: 98.16

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.22779.

Fig 5: Case study with ambiguous keyword 'FATAL'.

Fig 2 (page 10).

Limitations

Single offline LLM partitioning is non-deterministic and depends on prompt quality; robustness to LLM changes studied but not exhaustively.
Label-scarcity assumed as K-shot (up to 100 lines/template); performance may degrade with fewer labels, although K-sensitivity was explored.
Pure-anomaly domains require completely anomalous K-shot samples, which may be rare in some production systems.
Evaluation limited to two supercomputer log datasets; generalization to other domains or more diverse logs unclear.
No explicit adversarial robustness evaluation (e.g., obfuscated or crafted anomaly messages).
Threshold calibration relies on small held-out calibration sets, which may introduce instability in low-K regimes.

Open questions / follow-ons

How to improve robustness and stability of the LLM failure-domain partitioning step, potentially via ensemble or deterministic methods.
Extension and evaluation on varied log sources beyond supercomputers, including cloud-native services or IoT device logs.
Adaptation of FAME to incremental learning scenarios where new event templates and anomaly classes emerge after deployment.
Exploring adversarial attacks on routing and expert models, and defenses against evasion or poisoning.

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, the principle of leveraging offline large-scale semantic understanding (via LLMs) to guide the design of lightweight, domain-specialized anomaly detectors is directly relevant. FAME's technique of clustering heterogeneous error classes into failure domains and training specialized experts enables precise, interpretable anomaly attribution at a granular message level with minimal labeled data. This approach addresses key challenges faced in continuous monitoring of system events under tight labeling budgets and operational constraints. The asymmetric K-shot confidence principle also highlights how detection architectures can be optimized based on label distributional properties, a lesson transferable to bot event classification where some attack types may be rare yet distinct.

Moreover, the strict on-premise inference framework without recurring calls to expensive LLM APIs aligns well with privacy, cost, and latency requirements in production bot detection systems. Failure-domain routing could inspire analogous approaches to route suspicious traffic or user events to specialized classifiers corresponding to different attack or abuse categories, improving detection accuracy while controlling computational overhead. The methodology and evaluation rigor provide a valuable blueprint for balancing granularity, cost, and robustness in security anomaly detection pipelines.

Cite

bibtex

@article{arxiv2605_22779,
  title={FAME: Failure-Aware Mixture-of-Experts for Message-Level Log Anomaly Detection},
  author={Huanchi Wang and Zihang Huang and Yifang Tian and Kristina Dzeparoska and Hans-Arno Jacobsen and Alberto Leon-Garcia},
  journal={arXiv preprint arXiv:2605.22779},
  year={2026},
  url={https://arxiv.org/abs/2605.22779}
}
