Archon: A Unified Multimodal Model for Holistic Digital Human Generation
Source: arXiv:2605.30311 · Published 2026-05-28 · By Chong Bao, Shichen Liu, Lijun Yu, David Futschik, Stylianos Moschoglou, Shefali Srivastava et al.
TL;DR
This paper introduces Archon, a fully pretrained unified multimodal model designed for holistic digital human generation. Unlike prior approaches that rely on specialized expert models for individual modalities (speech, video, animation, text), Archon integrates seven modalities (description, script, speech audio, 3D animation parameters, semantic video, image, and raw video) into a single autoregressive framework pretrained on synchronized modalities and 72 diverse multimodal tasks. All inputs are represented as discrete tokens produced by modality-specific tokenizers, and a novel semantic video reparameterization cuts the video token count by 4× to contain token explosion, which lets Archon model joint cross-modal distributions. A "Thinking in Modality" inference strategy further decomposes complex cross-modal generation into stepwise chains of modality predictions, improving fidelity and controllability. On speech-driven video generation and image-conditioned text-to-speech benchmarks, Archon performs on par with or better than state-of-the-art specialized models despite not being finetuned on those datasets, supporting the unified approach for digital human generation and multimodal understanding.
Key findings
- Archon's semantic video discretization achieves a 4× token reduction over direct RGB video tokenization, enabling efficient training within an 8K token limit.
- Archon is pretrained on a large-scale 6,000-hour synchronized multimodal video dataset covering speech, script, animation, semantic video, and image modalities.
- Compared to expert models AniPortrait, EchoMimic, and Hallo3 on CelebV-HQ and HDTF benchmarks, Archon achieves lower FID (6.818 vs. 15.67/12.78), lower FVD (93.81 vs. 105.5/96.51), and comparable or better lip-sync scores.
- In image-conditioned text-to-speech, compared against the FaceTTS baseline on CelebV-HQ, Archon achieves higher speaker-identity cosine similarity (0.9117 vs. 0.9048) and better identity accuracy (0.6223 vs. 0.6032) despite slightly higher MCD-DTW.
- "Thinking in Modality" inference strategy improves video generation fidelity over direct speech-to-video mapping, reducing blur and distortion (Fig. 3).
- Modality tokenizers include a pretrained MAGVIT-v2 for images, SoundStream (16 kHz audio, 25 fps) for speech, and Residual Vector Quantized VAEs for 3DMM animation, each producing discrete token sequences.
- Archon's language model backbone is a 1B parameter PaLM2 model with a 550K token vocabulary capable of handling all modalities jointly via structured natural language prompts.
Threat model
n/a — the work targets multimodal digital human synthesis for immersive interaction rather than adversarial or security settings. The model assumes access to large datasets of clean, synchronized multimodal digital human data without adversarial interference.
Methodology — deep read
The setting is not adversarial: Archon targets holistic multimodal digital human synthesis, with cross-modal generation among text, audio, motion, semantics, and video for talking avatars. The model assumes access to synchronized multimodal data covering all modalities for supervised training of cross-modal generation tasks. The training data comprises 6,000 hours of aligned monologue videos gathered from publicly available internet sources, with synchronized speech, transcribed scripts, 3D face shape and motion parameters derived via 3D Morphable Models, semantic face segmentation (21 facial part categories) extracted by a backbone segmentation model, and associated images and videos. The test benchmarks CelebV-HQ and HDTF are excluded from the training data to keep the evaluation split strict.
Tokenization is modality-specific. Images are tokenized with a pretrained MAGVIT-v2 (a 3D CNN VQGAN that quantizes 256x256 images into 16x16 tokens with a 2^18 codebook). Speech uses the SoundStream residual vector quantizer, which produces multiple discrete residual levels at 25 fps; the first 4 residual levels are used as tokens. Animation parameters (shape, expression, and pose) are modeled separately with Residual VQVAEs that turn continuous 3DMM coefficients into discrete tokens. Semantic videos use a semantic-driven reparameterization in which a reference image is combined with a lower-resolution semantic label video (21 classes) tokenized into a smaller token set, giving a 4× reduction in token count compared to raw RGB videos. Text uses the T5 encoder tokenizer.
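The token counts implied by these tokenizer settings can be worked out with simple arithmetic. The sketch below is only a back-of-the-envelope estimate: it assumes one token per retained SoundStream residual level per frame and one 16x16 grid per image, and the constant and function names are illustrative rather than taken from the paper, whose exact sequence layout may differ.

```python
# Rough token counts from the tokenizer settings quoted above.
# Assumption: each of the 4 retained residual levels yields one token per frame.

IMAGE_GRID = 16          # MAGVIT-v2 maps a 256x256 image to a 16x16 token grid
SPEECH_FPS = 25          # SoundStream frame rate
SPEECH_LEVELS = 4        # residual levels kept as tokens
CONTEXT_TOKENS = 8_000   # context window quoted in the summary


def image_tokens() -> int:
    """Tokens for one image under the 16x16 grid."""
    return IMAGE_GRID * IMAGE_GRID


def speech_tokens(seconds: float) -> int:
    """Tokens for a speech clip of the given duration."""
    return int(seconds * SPEECH_FPS * SPEECH_LEVELS)


def reduced_video_tokens(rgb_tokens: int, reduction: int = 4) -> int:
    """Token count after the reported 4x semantic-video reduction."""
    return rgb_tokens // reduction


if __name__ == "__main__":
    used = image_tokens() + speech_tokens(10.0)
    print("image tokens:", image_tokens())
    print("10 s speech tokens:", speech_tokens(10.0))
    print("left for other modalities:", CONTEXT_TOKENS - used)
```

Under these assumptions a reference image and ten seconds of speech already take 1,256 tokens, which illustrates why a 4× cut in video tokens matters for fitting a full sample into the context window.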
The core architecture is a monolithic autoregressive language model based on a 1B-parameter PaLM2 decoder with prefix bidirectional attention, trained to model the joint distribution over concatenated modality tokens from an 8,000 token context window. The vocabulary of 550K tokens is partitioned into contiguous blocks representing different modality token types, allowing learned embeddings for each token type. The language model receives prompts structured as natural language key-value pairs enumerating input/output modalities and states, which helps semantic grounding and reduces reliance on special tokens.
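One way to picture a vocabulary partitioned into contiguous per-modality blocks is as a table of offsets. The following is a minimal sketch, not the paper's actual layout: only the image codebook size (2^18) comes from the summary, and every other block size, the block order, and the function names are hypothetical.

```python
from itertools import accumulate

# Hypothetical block sizes; only the image codebook size (2**18) is from the summary.
BLOCK_SIZES = {
    "text": 32_000,           # assumed
    "image": 2 ** 18,         # MAGVIT-v2 codebook size
    "speech": 4 * 1024,       # assumed: 4 residual levels x 1024 codes each
    "animation": 8_192,       # assumed
    "semantic_video": 8_192,  # assumed
}


def build_offsets(sizes):
    """Map each modality to the first global id of its contiguous block."""
    starts = [0, *accumulate(sizes.values())]
    return dict(zip(sizes, starts))


OFFSETS = build_offsets(BLOCK_SIZES)


def to_global(modality, local_id):
    """Convert a tokenizer-local id into an id in the shared vocabulary."""
    if not 0 <= local_id < BLOCK_SIZES[modality]:
        raise ValueError(f"{local_id} is out of range for {modality}")
    return OFFSETS[modality] + local_id


def to_local(global_id):
    """Recover (modality, local id) from a shared-vocabulary id."""
    for modality, start in OFFSETS.items():
        if start <= global_id < start + BLOCK_SIZES[modality]:
            return modality, global_id - start
    raise ValueError(f"{global_id} is outside every block")
```

Because the blocks do not overlap, the language model can treat all modalities as one token stream while each tokenizer still decodes only its own range.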
Training uses a compositional multi-task sampling strategy over 72 diverse cross-modal generation tasks. To balance distributional and difficulty biases, each task is weighted in proportion to its log perplexity divided by the number of tasks sharing its output modality. The PaLM2 model is fine-tuned for 20 days on 256 TPUv6 chips with the Adam optimizer, weight decay 10^-3, and a cosine learning rate schedule. The diffusion video decoder is based on a finetuned WALT (transformer latent diffusion) model trained for 10 days on 128 TPUv6 chips.
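The task-weighting rule can be sketched in a few lines. This is an interpretation of the summary, not the authors' code: it assumes the weight of a task is log(perplexity) divided by the number of tasks that share its output modality, normalized into sampling probabilities. The task names and perplexity values are made up for illustration.

```python
import math
import random
from collections import Counter


def task_weights(tasks):
    """tasks maps name -> (output_modality, perplexity); returns sampling probabilities."""
    per_modality = Counter(modality for modality, _ in tasks.values())
    raw = {}
    for name, (modality, perplexity) in tasks.items():
        if perplexity <= 1.0:
            raise ValueError("perplexity must exceed 1 so the log is positive")
        # Harder tasks (higher perplexity) get more weight; crowded output
        # modalities have that weight split across their tasks.
        raw[name] = math.log(perplexity) / per_modality[modality]
    total = sum(raw.values())
    return {name: weight / total for name, weight in raw.items()}


def sample_task(tasks, rng):
    """Draw one task name according to the normalized weights."""
    weights = task_weights(tasks)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Hypothetical tasks and perplexities, for illustration only.
EXAMPLE_TASKS = {
    "speech->video": ("video", 20.0),
    "image->video": ("video", 20.0),
    "script->speech": ("speech", 20.0),
}

if __name__ == "__main__":
    print(task_weights(EXAMPLE_TASKS))
    print(sample_task(EXAMPLE_TASKS, random.Random(0)))
```

With equal perplexities, the two video tasks split one share while the single speech task keeps a full share, so each output modality receives the same total probability.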
Evaluation is performed without finetuning on the held-out CelebV-HQ and HDTF benchmarks. Metrics include FID and FVD for video quality, Sync-C and Sync-D for lip synchronization, VLM-assessed IQA for video quality, and Mel-Cepstral Distortion and cosine similarity for speech fidelity and identity coherence. Comparisons to state-of-the-art specialized baselines (AniPortrait, EchoMimic, Hallo3, FaceTTS) are reported. Ablations on the "Thinking in Modality" inference strategy demonstrate improved quality over direct single-step cross-modal generation.
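Of these metrics, the cosine similarity used for identity coherence is the simplest to write down. The sketch below assumes it is computed between two speaker-embedding vectors (for example, one from generated speech and one from reference speech); the embedding model itself is not specified in this summary, and the function name is illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Values close to 1 indicate that the two embeddings point in nearly the same direction, which is the sense in which a higher score reflects better identity preservation.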
Reproducibility: Code and pretrained weights are not mentioned as released. The dataset is compiled from public internet sources but is not fully public; thus replication may require reconstruction following the paper's methodology. Exact architectural and hyperparameter details are provided in supplementary sections with examples of prompts for conditioning the model.
Technical innovations
- Memory-efficient semantic video reparameterization that reduces video token count by 4× while preserving dynamics for large-scale training within an 8K token context window.
- A unified autoregressive language model backbone (PaLM2 1B) with modality-specific tokenizers enabling joint cross-modal generation and modeling of holistic digital human signals.
- "Thinking in Modality": an inference strategy that decomposes ambiguous cross-modal tasks into stepwise generation through intermediate modalities, improving output fidelity and controllability without extra training.
- A novel multimodal training sampling strategy balancing task difficulty and modality distribution to reduce training biases when learning 72 diverse cross-modal transformations.
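The stepwise decomposition behind "Thinking in Modality" can be illustrated as a loop that generates one intermediate modality at a time and feeds each result back as context. This is a sketch under stated assumptions: the chain order (speech, then animation, then semantic video, then video) is a plausible example rather than the paper's documented chain, and `fake_generate` is a stand-in for the real autoregressive model.

```python
from typing import Callable, Dict, List

Tokens = List[int]
Generator = Callable[[Dict[str, Tokens], str], Tokens]


def think_in_modality(generate: Generator,
                      inputs: Dict[str, Tokens],
                      chain: List[str]) -> Dict[str, Tokens]:
    """Generate each modality in `chain` in order, conditioning on all earlier results."""
    context = dict(inputs)  # copy so the caller's inputs are not mutated
    for target in chain:
        context[target] = generate(context, target)
    return context


def fake_generate(context: Dict[str, Tokens], target: str) -> Tokens:
    """Stand-in model: emits a single token equal to the number of
    modalities it was conditioned on."""
    return [len(context)]


if __name__ == "__main__":
    # Hypothetical chain; the direct alternative would be chain=["video"].
    result = think_in_modality(
        fake_generate,
        inputs={"speech": [1, 2, 3]},
        chain=["animation", "semantic_video", "video"],
    )
    print(list(result))
```

The design point is that no extra training is involved: the same model is called repeatedly, and only the prompt (which modalities are given and which is requested) changes between steps.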
Datasets
- 6,000 hours synchronized multimodal monologue videos — from public internet sources (not fully public)
- CelebV-HQ — test split: 200 videos — used for evaluation
- HDTF — test split: 200 videos — used for evaluation
Baselines vs proposed
- AniPortrait [52]: CelebV-HQ FID = 39.73 vs proposed Archon = 6.818
- EchoMimic [8]: CelebV-HQ FVD = 236.9 vs Archon = 93.81
- Hallo3 [12]: HDTF FID = 12.78 vs Archon = 5.779
- FaceTTS [29]: CelebV-HQ cosine similarity (C-SIM) = 0.9048 vs Archon = 0.9117
- FaceTTS [29]: CelebV-HQ identity accuracy = 0.6032 vs Archon = 0.6223
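To put the lower-is-better metrics on a common scale, the gaps above can be expressed as relative reductions. The snippet uses only the figures listed in this section; the helper name is illustrative.

```python
def relative_reduction(baseline: float, proposed: float) -> float:
    """Fractional drop of a lower-is-better metric relative to the baseline."""
    return (baseline - proposed) / baseline


# (baseline name, metric, baseline value, Archon value) as listed above.
RESULTS = [
    ("AniPortrait", "CelebV-HQ FID", 39.73, 6.818),
    ("EchoMimic", "CelebV-HQ FVD", 236.9, 93.81),
    ("Hallo3", "HDTF FID", 12.78, 5.779),
]

if __name__ == "__main__":
    for name, metric, baseline, archon in RESULTS:
        drop = relative_reduction(baseline, archon)
        print(f"{metric} vs {name}: {drop:.1%} lower")
```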
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.30311.

Fig 1: Archon. We propose a novel unified multimodal model that performs cross-modal generation among a wide range of modalities,

Fig 2: Pipeline. We use modality tokenizers to tokenize description, script, speech, animation, image and semantic video into discrete

Fig 3 (page 1).

Fig 4 (page 1).

Fig 5 (page 1).

Fig 6 (page 1).

Fig 7 (page 1).

Fig 8 (page 1).
Limitations
- No adversarial robustness evaluation or real-world attack scenarios against hostile manipulation of digital human generation.
- The large unified model requires substantial compute (256 TPUv6 for 20 days), limiting accessibility and reproducibility.
- The dataset is large but aggregated from uncontrolled internet sources, possibly introducing biases and lacking fine-grained annotation standards.
- The evaluation benchmarks cover only speech-to-video and image-to-speech; other modality combinations are qualitatively assessed but lack rigorous quantitative tests.
- No ablation or comparison on different tokenizer designs beyond MAGVIT-v2 and SoundStream residual levels was presented.
- Dependence on pretrained components (e.g., PaLM2, segmentation models) may limit end-to-end interpretability and adaptability.
Open questions / follow-ons
- How robust is the unified model to real-world noise, distribution shifts, or adversarial input perturbations?
- Can the "Thinking in Modality" approach be formalized or optimized as a learnable policy rather than heuristic inference?
- What are the trade-offs in fidelity and computational cost for scaling the semantic video tokenizer to higher resolutions or longer sequences?
- How adaptable is the Archon framework to interactive or real-time digital human generation scenarios?
Why it matters for bot defense
For bot-defense and CAPTCHA systems, Archon exemplifies state-of-the-art unified multimodal generation that can produce highly realistic audiovisual digital humans. This makes synthetic human-like content harder to detect when several modalities are generated together. Bot-defense systems must therefore consider cross-modal coherence and temporal consistency when analyzing potential avatar-based threats. The paper's token-efficient semantic video reparameterization and "Thinking in Modality" progressive inference could inspire defenses that detect subtle semantic or dynamic inconsistencies that single-modal detectors might miss. Conversely, CAPTCHA designers might use an understanding of such unified generative models to create tests that require cross-modal grounding, linking audio, video, and motion cues in ways that challenge current synthetic avatar systems. Overall, Archon signals a trend toward holistic, multimodal generative models producing coherent human content, which calls for multilayer and multimodal verification strategies in bot defense.
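As a toy illustration of what a cross-modal coherence check could look like, the sketch below correlates a per-frame audio energy signal with a per-frame mouth-opening signal and flags weak agreement. It is not a method from the paper: the signals, the function names, and the 0.3 threshold are all hypothetical, and a practical detector would need far richer features.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # a constant signal carries no usable correlation
    return cov / math.sqrt(var_x * var_y)


def looks_incoherent(audio_energy, mouth_opening, threshold=0.3):
    """Flag a clip whose audio and mouth motion are weakly or negatively correlated.

    The threshold is a hypothetical value chosen only for this example.
    """
    return pearson(audio_energy, mouth_opening) < threshold
```

For instance, a clip whose mouth opens exactly when the audio is quiet yields a strongly negative correlation and would be flagged, while one whose mouth motion tracks the audio energy would pass.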
Cite
@article{arxiv2605_30311,
title={Archon: A Unified Multimodal Model for Holistic Digital Human Generation},
author={Chong Bao and Shichen Liu and Lijun Yu and David Futschik and Stylianos Moschoglou and Shefali Srivastava and Ziqian Bai and Feitong Tan and Guofeng Zhang and Zhaopeng Cui and Sean Fanello and Yinda Zhang},
journal={arXiv preprint arXiv:2605.30311},
year={2026},
url={https://arxiv.org/abs/2605.30311}
}