A Longitudinal Study of Recently Observed Malicious Domains: Characteristics, Infrastructure, and Abuse Patterns

Source: arXiv:2606.11111 · Published 2026-06-09 · By Fathima Mashood, Mohamed Nabeel

TL;DR

This work presents a large-scale, longitudinal empirical study of newly observed malicious domains collected from VirusTotal over a 5-month period in early 2026. The dataset contains approximately 1.52 million domains, each flagged by at least five independent scanners and first seen between January and May 2026. The authors classify each domain as attacker-created or compromised (a legitimate domain taken over) and characterize attacker behavior along several dimensions: temporal patterns, registrar and TLD concentration, hosting infrastructure usage, DNS query volumes, bulk registration events, and brand impersonation. Key results show that attacker-created domains dominate (89.3%), are typically detected soon after registration (median age of 60 days at first detection), concentrate heavily at a few registrars and TLDs, rely extensively on Cloudflare infrastructure, are frequently registered in automated bulk batches of up to thousands per day, and impersonate top global brands such as WhatsApp and Google. The study provides an up-to-date, annotated dataset with associated infrastructure metadata to support deeper analysis and practical mitigation.

Key findings

89.3% (1,357,921) of malicious domains are attacker-created, while 10.7% are compromised legitimate domains.
Median domain age at first VirusTotal detection is 60 days; 7.4% of domains are detected within one day of registration and 29.9% within one week.
Top 4 registrars (GNAME.COM, Dynadot LLC, Namesilo, Namecheap) collectively register over 36% of attacker-created domains.
Top 10 TLDs account for 68.1% of attacker-created domains, with .com dominating at 31.0%.
DNS query volume for attack domains is highly skewed: median 493 queries, 99th percentile 91,597 queries, max >2.1 billion queries.
Eight of the top 10 hosting IP addresses belong to Cloudflare, hosting over 230,000 attack domains each, demonstrating infrastructure concentration.
Over 77% of attack domains with WHOIS data belong to bulk registration batches of ≥5 domains sharing registrar and creation date; the largest single batch includes 2,168 domains.
8.4% of attacker-created domains impersonate known brands; WhatsApp (19,511 domains), Logitech (5,900), and Google (2,302) are most frequently impersonated.
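
The age-at-detection figures in the list above (median, share detected within one day, share within one week) are the kind of statistic that can be derived from a WHOIS creation date and a VirusTotal first-seen date per domain. The following is a minimal sketch under stated assumptions, not the authors' code: the function name `age_stats`, the record layout, and the sample dates are hypothetical, and "within one day" is interpreted here as an age of at most 1 day.

```python
from datetime import date
from statistics import median

def age_stats(records):
    """Summarize domain age at first detection.

    records: iterable of (creation_date, first_seen_date) pairs.
    Pairs with a missing creation date are skipped (assumption).
    """
    ages = [(seen - created).days for created, seen in records if created is not None]
    n = len(ages)
    return {
        "median_days": median(ages),
        "within_1_day": sum(a <= 1 for a in ages) / n,
        "within_7_days": sum(a <= 7 for a in ages) / n,
    }

# Hypothetical toy records, not data from the paper.
sample = [
    (date(2026, 1, 10), date(2026, 1, 10)),  # age 0 days
    (date(2026, 1, 1), date(2026, 1, 5)),    # age 4 days
    (date(2025, 12, 1), date(2026, 1, 30)),  # age 60 days
    (date(2025, 6, 1), date(2026, 2, 1)),    # age 245 days
    (None, date(2026, 3, 1)),                # no WHOIS creation date, skipped
]
stats = age_stats(sample)
print(stats)
```

Skipping records without a creation date mirrors the fact that only part of the dataset has WHOIS creation dates, so any age statistic is computed over that subset.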

Threat model

The adversary is an attacker who registers or compromises domain names to host phishing, malware, fraud, or command-and-control infrastructure while evading detection by security vendors and defenders. The attacker’s capabilities include automated bulk domain registration, abuse of low-cost registrars and TLDs, use of privacy features, and use of major CDN/reverse-proxy providers such as Cloudflare to mask the true hosting origin. The attacker cannot fully hide creation dates or completely avoid detection by multiple VirusTotal scanners, but exploits rapid domain deployment and infrastructure reuse to maximize attack resilience.

Methodology — deep read

Threat model & assumptions: The adversary either compromises legitimate domains or registers attacker-owned domains for malicious purposes (phishing, malware distribution, C2, fraud). The attacker tries to evade detection by registering and deploying domains quickly, registering in bulk, abusing registrars and TLDs, and abusing hosting infrastructure such as CDNs. The adversary is assumed unable to fully hide domain registration metadata but able to exploit privacy and bulk registration mechanisms.
Data provenance, size, labels: The primary dataset consists of 1,520,050 unique domain names collected from VirusTotal between January and May 2026. Domains flagged as malicious by at least 5 independent VT scanning engines and with first-seen date in the study window were included. Annotation was done by enriching VT data with WHOIS records (for registrar, creation date, expiry date, nameservers) and Passive DNS (PDNS) resolution data (for IPs, ASNs, query counts). Around 88.4% of domains had WHOIS creation dates, and 88.1% had associated non-empty PDNS IP sets. The Tranco Top-1M list was used for classification and brand token extraction.
Architecture / algorithm: Domains were classified as compromised or attacker-created using a conservative heuristic: a domain is marked compromised if it appears in the Tranco top 1M with rank ≤500,000 or if its age at the time of detection is ≥3 years; otherwise it is marked attacker-created. Batch registrations were detected by grouping domains that share a registrar and WHOIS creation date and keeping groups of ≥5 domains. Brand impersonation was measured by substring matching of domain names against a filtered set of 7,154 brand tokens derived from the Tranco top-10K, excluding common English words.
Training regime: Not applicable since this study is analytical and empirical; no ML model training reported.
Evaluation protocol: Analysis was descriptive and statistical over the large dataset. Key metrics include counts, fractions, median/percentile domain age, registrar and TLD frequencies, DNS query volumes (used as proxy for harm), IP and ASN hosting concentrations, batch sizes for registrations, and counts of brand impersonations. Visualizations (CDF, CCDF, histograms) and network graphs were used to analyze distributional properties and infrastructure sharing.
Reproducibility: The annotated dataset, including classification labels and metadata, is publicly shared on GitHub at https://github.com/mufimash/malicious_domains. Methodology details and heuristics for classification are provided for reproducibility. External datasets used (VirusTotal scans, WHOIS data, FarSight PDNS, Tranco lists) may have access restrictions or commercial licensing which could limit end-to-end reproduction.
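
The compromised vs attacker-created heuristic described in the methodology can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the function name `classify` and its parameters are hypothetical, "3 years" is approximated as 1,095 days, and a domain with neither a qualifying Tranco rank nor a WHOIS creation date falls through to attacker-created, which is an assumption about how the "otherwise" branch treats missing data.

```python
from datetime import date
from typing import Optional

TRANCO_RANK_CUTOFF = 500_000          # rank threshold stated in the summary
MIN_COMPROMISED_AGE_DAYS = 3 * 365    # "3 years", approximated as 1,095 days (assumption)

def classify(tranco_rank: Optional[int], created: Optional[date], first_seen: date) -> str:
    """Label a malicious domain as 'compromised' or 'attacker-created'.

    tranco_rank: rank in the Tranco top 1M, or None if the domain is unlisted.
    created:     WHOIS creation date, or None if unavailable.
    first_seen:  date the domain was first seen as malicious.
    """
    # Popular domains are assumed to be legitimate sites that were compromised.
    if tranco_rank is not None and tranco_rank <= TRANCO_RANK_CUTOFF:
        return "compromised"
    # Long-lived domains are likewise assumed to predate the attack.
    if created is not None and (first_seen - created).days >= MIN_COMPROMISED_AGE_DAYS:
        return "compromised"
    return "attacker-created"

# Hypothetical inputs for illustration only.
seen = date(2026, 2, 1)
print(classify(1_200, None, seen))                  # popular domain
print(classify(None, date(2020, 1, 1), seen))       # old, unranked domain
print(classify(None, date(2026, 1, 20), seen))      # young, unranked domain
print(classify(750_000, date(2025, 11, 1), seen))   # ranked below the cutoff, young
```

The rule is an OR of two independent signals, so either popularity or age alone is enough to mark a domain as compromised; this is what makes the attacker-created label the conservative one.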

One concrete example: a domain newly registered at a known high-abuse registrar (e.g., GNAME.COM) in early 2026 is detected within days by VirusTotal, with multiple AV engines flagging it as malicious. WHOIS records reveal that over 2,000 similar domains were registered in a batch on the same day. Passive DNS shows that the domain resolves to an IP address associated with Cloudflare infrastructure. Brand substring analysis detects a match on the token “whatsapp.” This domain falls into the attacker-created class and illustrates a typical attacker tactic: rapid deployment, bulk registration, and use of major CDN hosting to mask the origin while impersonating a popular brand for phishing.
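
The brand substring step in the example above can be illustrated as follows. This is a minimal sketch under stated assumptions: the function name `brand_hits`, the three-token set, and the sample domains are hypothetical, and the choice to search everything left of the final label (the TLD) is an assumption, since the summary does not say which part of the name is matched.

```python
def brand_hits(domain: str, brand_tokens: set[str]) -> list[str]:
    """Return the brand tokens that occur as substrings of a domain name."""
    name = domain.lower().rstrip(".")
    labels = name.split(".")
    # Assumption: ignore the final label (TLD) and search the rest of the name.
    searchable = ".".join(labels[:-1]) if len(labels) > 1 else name
    return sorted(token for token in brand_tokens if token in searchable)

# Hypothetical token set; the paper uses 7,154 filtered tokens.
tokens = {"whatsapp", "google", "logitech"}

print(brand_hits("whatsapp-login-verify.top", tokens))
print(brand_hits("secure-google.whatsapp-help.xyz", tokens))
print(brand_hits("example.com", tokens))
```

Plain substring matching over-matches on short or generic tokens, which is presumably why the token list is filtered to exclude common English words.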

Technical innovations

A large-scale longitudinal analysis of 1.52M newly detected malicious domains on VirusTotal over 5 months in 2026, updating threat landscape understanding.
Conservative heuristic combining domain popularity (Tranco ranking) and age to distinguish compromised vs attacker-created malicious domains.
Integration of VirusTotal detections with WHOIS, Passive DNS, and Tranco popularity data to characterize attacker behaviors across registry, hosting, traffic volume, and impersonation dimensions.
Detailed quantification of bulk domain registration events at registrar/day granularity demonstrating widespread automation in attack domain acquisition.
Comprehensive brand impersonation analysis using substring matching against a filtered top-10K brand token list to reveal top globally targeted entities.
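
The registrar/day batch detection listed above amounts to a group-by over WHOIS fields. The sketch below assumes a hypothetical record layout of `(domain, registrar, creation_date)` and a hypothetical function name `find_batches`; lowercasing and trimming the registrar string is an added normalization assumption, not something the summary specifies.

```python
from collections import defaultdict
from datetime import date

MIN_BATCH_SIZE = 5  # batch threshold stated in the summary

def find_batches(records):
    """Group domains by (registrar, creation date) and keep groups of >= 5.

    records: iterable of (domain, registrar, creation_date) tuples.
    Records without WHOIS registrar or creation date are ignored.
    """
    groups = defaultdict(list)
    for domain, registrar, created in records:
        if registrar and created:
            groups[(registrar.strip().lower(), created)].append(domain)
    return {key: doms for key, doms in groups.items() if len(doms) >= MIN_BATCH_SIZE}

# Hypothetical toy records, not data from the paper.
records = [(f"shop-{i}.example", "Example Registrar", date(2026, 2, 3)) for i in range(5)]
records += [
    ("lone-a.example", "Other Registrar", date(2026, 2, 3)),
    ("lone-b.example", "Example Registrar", date(2026, 2, 4)),
    ("no-whois.example", None, None),
]

batches = find_batches(records)
batched = sum(len(doms) for doms in batches.values())
with_whois = sum(1 for _, registrar, created in records if registrar and created)
share = batched / with_whois  # fraction of WHOIS-covered domains that sit in a batch
print(batches.keys(), round(share, 3))
```

The `share` value corresponds to the kind of "fraction of domains with WHOIS data that belong to a batch" statistic quoted in the key findings, computed here over toy data only.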

Datasets

VirusTotal malicious domain dataset — 1,520,050 domains — collected Jan-May 2026 with ≥5 independent detections
WHOIS records — ~88.4% coverage — parsed from public WHOIS
Passive DNS (PDNS) records — ~88.1% domains with IP resolution history — from FarSight/DomainTools
Tranco top-1M domain ranking list — publicly available
Tranco top-10K brand token list — used for brand impersonation analysis

Baselines vs proposed

No ML baselines are reported; the study is a large-scale empirical characterization.
Prior work cited: the classification approach of De Silva et al. (2021) is adapted for the compromised vs attacker-created distinction and scaled to a much larger dataset.
Compared against the insights of Antonakakis et al. (2010) and Lever et al. (2017) on hosting infrastructure reuse; the study confirms continued Cloudflare dominance in hosting attacker domains.
Aligns with the findings of Hao et al. (2011) on rapid deployment of malicious domains after registration, updated with precise median (60 days) and percentile (7.4% within 1 day) metrics.

Limitations

VirusTotal coverage is imperfect and may skew towards domains detected by popular scanners; some attacker-owned domains could be missed if not flagged.
The compromised vs attacker-created classification heuristic may misclassify some borderline or aged attacker domains.
About 11.6% of domains have a missing or malformed WHOIS creation date (88.4% coverage), which reduces the coverage and precision of age-based analyses.
Passive DNS data lacks global completeness, with 11.9% of domains having empty IP sets due to FarSight visibility gaps.
Analysis is observational without active adversarial testing or manipulation-resistant evaluation.
Focus on short-term 5-month window limits insights on longer-term malicious domain lifecycles and attacker adaptation.

Open questions / follow-ons

How can registrar-level interventions such as rate limiting or enhanced vetting impact bulk registration abuse?
What new detection features can differentiate attacker-created from compromised domains beyond age and popularity heuristics?
How can passive DNS monitoring be improved to better capture infrastructure shifts and evasive tactics?
To what extent do attacker behaviors adapt to increased scrutiny of popular CDN infrastructures like Cloudflare?

Why it matters for bot defense

For bot-defense and CAPTCHA engineers, this study provides a broad view of the domain infrastructure attackers currently operate. Attacker-created domains dominate, activate quickly, and rely heavily on a few registrars and on Cloudflare. This suggests three promising defensive directions: detecting bulk registrations early, working with registrars on anti-abuse measures, and identifying malicious domains hosted behind popular CDNs. Because DNS query traffic is highly skewed, prioritizing protection against, or sinkholing of, the highest-volume malicious domains may limit attacker reach. The extent of brand impersonation suggests that risk scoring for login or transaction authentication workflows, including decisions about when to issue a CAPTCHA challenge, should incorporate brand reputation signals to mitigate phishing risks.

Cite

bibtex

@article{arxiv2606_11111,
  title={A Longitudinal Study of Recently Observed Malicious Domains: Characteristics, Infrastructure, and Abuse Patterns},
  author={Fathima Mashood and Mohamed Nabeel},
  journal={arXiv preprint arXiv:2606.11111},
  year={2026},
  url={https://arxiv.org/abs/2606.11111}
}
