Audio-Visual Intelligence in Large Foundation Models

Source: arXiv:2605.04045 · Published 2026-05-05 · By You Qin, Kai Liu, Shengqiong Wu, Kai Wang, Shijian Deng, Yapeng Tian et al.

TL;DR

This is a survey paper rather than an empirical model paper, so its contribution is conceptual synthesis: it organizes the fast-growing audio-visual intelligence (AVI) literature under a unified framework spanning perception, generation, and interaction. The authors argue that the field has become fragmented across subcommunities with inconsistent task definitions, evaluation protocols, and terminology, which makes it hard to compare methods fairly or identify the main technical bottlenecks. Their core response is a taxonomy that separates tasks by level of abstraction (pixel-level perception, content understanding, logical reasoning), by generation type (conditional, cross-modal, joint), and by interaction setting (conversation, embodiment, agentic systems).

The novelty is not an algorithm but a structured map of the area through the lens of large foundation models, plus an organized review of the enabling techniques that recur across the literature: modality tokenization, cross-modal fusion, autoregressive and diffusion generation, large-scale pretraining, instruction tuning, and preference optimization. The survey also curates datasets, benchmarks, and metrics, and it explicitly calls out unresolved issues such as synchronization, spatial/audio grounding, controllability, safety, watermarking, and governance. The result is a reference framework for thinking about AVI research, not a benchmark claim in the usual sense.
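The three axes of the taxonomy can be written down as a small data structure. The sketch below is only an illustration: the category names come from the summary above, while the `EXAMPLE_TASKS` assignments and the `family` helper are hypothetical and are not taken from the paper.

```python
from enum import Enum


class Perception(Enum):
    PIXEL_LEVEL = "pixel-level perception"
    CONTENT_UNDERSTANDING = "content understanding"
    LOGICAL_REASONING = "logical reasoning"


class Generation(Enum):
    CONDITIONAL = "conditional"
    CROSS_MODAL = "cross-modal"
    JOINT = "joint"


class Interaction(Enum):
    CONVERSATION = "conversation"
    EMBODIMENT = "embodiment"
    AGENTIC = "agentic systems"


# Hypothetical task placements, for illustration only; the paper may place them differently.
EXAMPLE_TASKS = {
    "sound source localization": Perception.PIXEL_LEVEL,
    "video-to-audio (Foley) generation": Generation.CROSS_MODAL,
    "text-to-audio-video generation": Generation.JOINT,
}


def family(category):
    """Return the top-level axis ('perception', 'generation', 'interaction') of a category."""
    return type(category).__name__.lower()


for task, category in EXAMPLE_TASKS.items():
    print(f"{task}: {family(category)} / {category.value}")
```

Keeping the axes as separate enums makes it explicit that a task sits on exactly one axis, and at one level within that axis.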

Key findings

The survey claims to be the first comprehensive review of audio-visual intelligence through the lens of large foundation models, with a taxonomy that covers understanding, generation, and interaction.
Figure 1 traces AVI’s evolution from early alignment systems such as SoundNet, SyncNet, and AVTS to recent foundation-style systems such as MovieGen, Veo 3, Qwen3-Omni, and GPT-4o, showing the field’s shift toward unified multimodal models.
Figure 2 explicitly groups methods into representation learning, contrastive binding, tokenization, diffusion/flow generation, AR/MAR generation, LLM-centric post-training, and policy learning, indicating that AVI is no longer a single-method problem.
The paper identifies synchronization, spatial reasoning, controllability, and safety as the main open technical bottlenecks across AVI task families.
The benchmark map in Figure 2 names distinct dataset clusters for open-domain AV, speech/human reasoning, T2AV/editing, and embodied/XR, implying evaluation is still siloed by subtask rather than unified across AVI.
The survey emphasizes that evaluation for open-ended AV generation remains inconsistent across datasets and metrics; it highlights SyncNet-style synchronization, FID/FVD/FAD, and CLIP/ImageBind alignment as common but incomplete measures.
Section 9.1 argues that current work is still mostly about temporal synchronization, whereas the next step is causal event-source grounding, which is a stronger and less solved problem.
The authors state that all summarized resources, references, and organizational structures will be publicly released via the project homepage/GitHub resource hub.
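To make the synchronization finding concrete, here is a minimal sketch of estimating audio-visual offset. It is a signal-level stand-in, not SyncNet itself: SyncNet-style metrics compare learned embeddings, whereas this example cross-correlates two hypothetical per-frame signals (an audio energy envelope and a visual motion score). The function name `best_offset` and the toy inputs are assumptions.

```python
def best_offset(audio, visual, max_lag):
    """Return the lag (in frames) at which the mean-centred signals correlate best.

    A positive result means the audio trails the video by that many frames.
    """
    a_mean = sum(audio) / len(audio)
    v_mean = sum(visual) / len(visual)
    a = [x - a_mean for x in audio]
    v = [x - v_mean for x in visual]

    def score(lag):
        # Average product over the frames where both signals are defined.
        products = [a[i + lag] * v[i] for i in range(len(v)) if 0 <= i + lag < len(a)]
        return sum(products) / len(products)

    return max(range(-max_lag, max_lag + 1), key=score)


# Toy data: visual motion peaks at frames 2 and 6, audio peaks two frames later.
visual = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
audio = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
print(best_offset(audio, visual, max_lag=3))  # 2
```

A score like this only measures temporal co-occurrence. It cannot tell which object produced the sound, which is the gap that Section 9.1's causal event-source grounding is meant to address.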

Methodology — deep read

This is a literature survey, so the “methodology” is the survey’s design and organization rather than a trainable model. There is no threat model in the security sense; the assumed reader is a researcher trying to navigate a fragmented AVI landscape. The survey assumes that audio and vision are the key complementary modalities, that foundation-model-scale pretraining changes the design space, and that tasks can be meaningfully grouped by what the system is doing: perceiving, generating, or interacting. It also assumes that a useful survey should reconcile task definitions, representation choices, and evaluation practices across otherwise disconnected subfields.

Data-wise, the paper does not introduce a new dataset or split. Instead, it curates representative datasets and benchmarks from the literature and organizes them by task family. In the taxonomy figure and surrounding sections, it names example resources such as AudioSet, VGGSound, Greatest Hits, AVSync15, FoleyBench, VGGSound-Omni, VoxCeleb, VoxCeleb2, LRS2, LRS3, HDTF, MEAD, VOCASET, AIST++, BEAT2, AV-Odyssey, OmniBench, OmniVideoBench, OmniXR, Video-MME, AudioBench, AIR-Bench, MMAU, MMAR, CMI-Bench, JavisBench, Verse-Bench, Harmony-Bench, VABench, T2AV-Compass, PhyAVBench, AvED-Bench, SoundSpaces, SoundSpaces 2.0, SonicVerse, RAF, AVLMaps/MSLMaps, VLABench, and VLA-OS. Because this is a survey, the paper does not report preprocessing pipelines, train/val/test splits, or label construction procedures for a unified dataset; those details are deferred to the cited original works.

The architecture/algorithm part of the survey is a synthesis of common technical building blocks rather than a single proposed model. In Section 2, the authors first define audio representations as raw waveforms, spectrograms, dense embeddings, or discrete token sequences, and visual representations as images/videos, dense feature maps, or discrete tokens. In Section 4, they then organize methods into representation-centric approaches (feature extraction, variational auto-encoding, discrete tokenization), generation-centric approaches (GANs, diffusion, autoregressive, masked autoregressive), and LLM-centric approaches (encoder+LLM, LLM+generator, unified multimodal models, agentic systems, and VLA policies). The novelty is in how these are related: for example, discrete tokenization is framed as the bridge that allows audio and vision to be treated as sequences for language-model-style generation, while diffusion is framed as the high-fidelity synthesis route that is increasingly combined with multimodal conditioning. The paper does not propose a new loss, optimizer, or module; instead it compares the role of these modules across published systems.
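The idea of discrete tokenization as a bridge can be sketched in a few lines. The example below assumes a nearest-neighbour vector-quantization codebook and a simple per-step interleaving of audio and video tokens in one shared vocabulary. Both choices, along with the names `quantize` and `interleave` and the toy numbers, are illustrative assumptions rather than the scheme of any particular system in the survey.

```python
def quantize(frames, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return [min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
            for f in frames]


def interleave(audio_tokens, video_tokens, audio_vocab_size):
    """Shift video ids past the audio vocabulary, then alternate audio/video per step."""
    sequence = []
    for a, v in zip(audio_tokens, video_tokens):
        sequence += [a, audio_vocab_size + v]
    return sequence


# Toy 2-D features and a 3-entry codebook (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
audio_frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8]]

audio_tokens = quantize(audio_frames, codebook)
print(audio_tokens)                             # [0, 1, 2]
print(interleave(audio_tokens, [2, 0, 1], 3))   # [0, 5, 1, 3, 2, 4]
```

Once both modalities are integer sequences in one vocabulary, a language-model-style decoder can be trained on them with ordinary next-token prediction.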

Because there is no training regime for the survey itself, there are no epochs, batch sizes, seed strategies, or hardware specs to report. The evaluation protocol is likewise a synthesis of how the field evaluates itself: SyncNet-style synchronization for alignment, FID/FVD/FAD for generation quality, and CLIP/ImageBind-style similarity metrics for cross-modal correspondence. The survey repeatedly points out that these metrics are heterogeneous and often insufficient for open-ended generation, long-horizon coherence, or causal grounding. When the paper discusses benchmarks, it does so by task family rather than by standardized cross-paper re-evaluation, and it highlights that the lack of shared protocols is a major obstacle to fair comparison. For example, AV question answering, sound localization, and Foley generation each use different metrics and datasets, which prevents direct apples-to-apples comparison.
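FID, FVD, and FAD share one recipe: fit a Gaussian to embeddings of real samples and another to embeddings of generated samples, then take the Fréchet distance between the two. The sketch below assumes diagonal covariances, in which case the distance reduces to the squared difference of means plus the squared difference of standard deviations. The published metrics use full covariance matrices over features from a pretrained network, so treat this as a simplified illustration; the function names and toy embeddings are assumptions.

```python
import math


def mean_and_std(vectors):
    """Per-dimension mean and (population) standard deviation of a list of vectors."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    std = [math.sqrt(sum((v[d] - mean[d]) ** 2 for v in vectors) / n)
           for d in range(dim)]
    return mean, std


def frechet_distance_diag(real, generated):
    """Frechet distance between two Gaussians with diagonal covariance.

    With diagonal covariances the trace term collapses to sum((sd_r - sd_g)^2).
    """
    mu_r, sd_r = mean_and_std(real)
    mu_g, sd_g = mean_and_std(generated)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    spread_term = sum((a - b) ** 2 for a, b in zip(sd_r, sd_g))
    return mean_term + spread_term


# Toy embeddings: the generated set is the real set shifted by 1 along the first axis.
real = [[0.0, 0.0], [2.0, 0.0]]
generated = [[1.0, 0.0], [3.0, 0.0]]
print(frechet_distance_diag(real, generated))  # 1.0
```

The example also shows why such metrics are incomplete for audio-visual generation: the score compares two distributions within one modality and says nothing about whether a given clip's audio matches its video.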

A concrete end-to-end example of the survey’s logic is the audio-visual event localization setting described in Section 3 and Figure 4: given a video of a dog barking, a model must infer that the sound originates from the dog rather than from another object in the frame. In the survey’s framing, the input first passes through modality-specific encoders or tokenizers, then cross-modal fusion aligns temporal audio cues with spatial visual features, and the output is a localized region or segmentation mask. The same representation stack can be reused for related tasks such as AV segmentation or AVQA, but the supervision changes from spatial masks to question-answer pairs or retrieval labels. This example illustrates the survey’s main claim: many AVI tasks differ only in the level of abstraction and the supervision signal, while sharing underlying alignment and tokenization machinery.
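The encode, fuse, and localize steps in this example can be sketched with plain cosine similarity as the fusion step. This is a deliberately minimal stand-in: real systems use learned encoders and attention-based fusion, and the embeddings, the `localize` function, and the 0.5 mask threshold below are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def localize(audio_emb, patch_grid, threshold=0.5):
    """Score every visual patch against the audio embedding.

    Returns the (row, col) of the best-matching patch and a boolean mask
    of patches whose similarity exceeds the threshold.
    """
    sim = [[cosine(audio_emb, patch) for patch in row] for row in patch_grid]
    cells = [(r, c) for r in range(len(sim)) for c in range(len(sim[r]))]
    best = max(cells, key=lambda rc: sim[rc[0]][rc[1]])
    mask = [[s > threshold for s in row] for row in sim]
    return best, mask


# Toy 2x2 grid of patch embeddings; the "dog" patch at (1, 0) aligns with the bark.
bark = [1.0, 0.0]
patches = [[[0.0, 1.0], [0.1, 1.0]],
           [[1.0, 0.1], [0.0, 1.0]]]

best, mask = localize(bark, patches)
print(best)  # (1, 0)
print(mask)  # [[False, False], [True, False]]
```

Swapping the output head is what changes the task: the same similarity map could feed a segmentation mask, as here, or be pooled into features for a question-answering model.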

Reproducibility is addressed at the level of the survey resource rather than experiments. The authors state that summarized resources, references, and organizational structures will be publicly released via the project homepage/GitHub hub, which improves discoverability but does not provide frozen checkpoints or a benchmark suite of their own. Since the paper is a survey, there are no released model weights or ablation tables to reproduce. The survey does explicitly note where the source literature is inconsistent or under-specified; for instance, it flags heterogeneous evaluation practices and missing safety/provenance standards as structural reproducibility issues in the field, not just reporting gaps in individual papers.

Technical innovations

Introduces a unified taxonomy for AVI that spans perception, generation, and interaction under a foundation-model framing.
Synthesizes AVI methods into shared technical primitives: tokenization, cross-modal fusion, autoregressive generation, diffusion generation, instruction alignment, and preference optimization.
Organizes benchmarks and metrics by task family to expose the current fragmentation in AVI evaluation.
Surfaces a research agenda around synchronization, spatial reasoning, controllability, safety, watermarking, and governance rather than treating them as isolated subproblems.
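Of the primitives listed above, preference optimization is the easiest to show in isolation. The summary does not say which objectives the survey covers, so the sketch below uses Direct Preference Optimization (DPO) as one well-known instance, chosen here as an assumption. The inputs are sequence log-probabilities of a preferred and a rejected output under the trained policy and under a frozen reference model; the numbers in the usage lines are made up.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(beta * margin)).

    The margin is how much more the policy prefers the chosen output over the
    rejected one, relative to the reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A policy identical to the reference gives log(2); favouring the chosen output lowers it.
print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

For audio-visual models the "outputs" would be generated clips or responses rather than text alone, but the pairwise form of the objective is unchanged.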

Datasets

The excerpt does not specify a size for any of the datasets below; all are listed as public.

AudioSet
VGGSound
Greatest Hits
AVSync15
FoleyBench
VGGSound-Omni
VoxCeleb
VoxCeleb2
LRS2
LRS3
HDTF
MEAD
VOCASET
AIST++
BEAT2
AV-Odyssey
OmniBench
OmniVideoBench
OmniXR
Video-MME
AudioBench
AIR-Bench
MMAU
MMAR
CMI-Bench
JavisBench
Verse-Bench
Harmony-Bench
VABench
T2AV-Compass
PhyAVBench
AvED-Bench
SoundSpaces
SoundSpaces 2.0
SonicVerse
RAF
AVLMaps/MSLMaps
VLABench
VLA-OS

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.04045.

Fig 1 through Fig 8 (images not included in this text).

Limitations

No new model, dataset, or benchmark is introduced, so the paper cannot provide empirical claims beyond synthesis of prior work.
The excerpt does not give standardized dataset sizes, splits, or re-evaluation numbers for the benchmarks it lists.
The survey acknowledges that evaluation practices are highly heterogeneous, which makes cross-paper comparison weak even when the taxonomy is unified.
Safety, watermarking, licensing, and governance are identified as open concerns, but the paper does not provide a concrete technical solution or formal protocol.
Several very recent systems are referenced by name in the timeline figure, but the survey does not experimentally validate them itself.
Because the paper is a survey, claims about “best” methods or “state of the art” are necessarily inherited from the cited sources and may age quickly.

Open questions / follow-ons

What unified benchmark can measure synchronization, semantic alignment, controllability, and safety in one AVI evaluation suite instead of task-specific metrics only?
How should causal event-source grounding be formalized for audio-visual scenes so that models distinguish correlation from true source attribution?
Can AVI foundation models maintain long-horizon audio-video context memory without drifting in either modality, especially in streaming or agentic settings?
What verification, watermarking, and provenance mechanisms are robust enough for generated audio-video content at foundation-model scale?

Why it matters for bot defense

For bot-defense practitioners, this survey matters less as a direct CAPTCHA method and more as a signal that multimedia spoofing and multimodal interaction are becoming easier and more unified. The paper’s taxonomy shows that AV systems are moving from simple recognition toward generation and agentic interaction, which means attackers can increasingly synthesize convincing audio-video artifacts, imitate synchronized lip motion, or build conversational agents that mimic human-like timing and modality coordination. That raises the bar for defenses that rely on static media checks or single-modality signals.

A bot-defense engineer would mainly use this paper to track where attack surfaces are expanding: audio-visual synchronization, talking-head generation, interactive voice/video agents, and embodied systems. The survey’s emphasis on open challenges like synchronization, controllability, and safety suggests that current evaluation metrics are still too narrow to distinguish real users from high-quality synthetic agents in many settings. In practice, that implies defenses should combine behavioral signals, provenance checks, challenge-response design across modalities, and anomaly detection for temporal consistency rather than depending on one-off image or audio classifiers.

Cite

@article{arxiv2605_04045,
  title={Audio-Visual Intelligence in Large Foundation Models},
  author={You Qin and Kai Liu and Shengqiong Wu and Kai Wang and Shijian Deng and Yapeng Tian and Junbin Xiao and Yazhou Xing and Yinghao Ma and Bobo Li and Roger Zimmermann and Lei Cui and Furu Wei and Jiebo Luo and Hao Fei},
  journal={arXiv preprint arXiv:2605.04045},
  year={2026},
  url={https://arxiv.org/abs/2605.04045}
}
