Quality-Diversity Search in Sound Generation: Investigating Innovation Engines for Audio Exploration

Source: arXiv:2606.09780 · Published 2026-06-08 · By Björn Þór Jónsson, Çağrı Erdem, Stefano Fasciani, Kyrre Glette

TL;DR

This study targets the problem of automated exploration and generation of novel and diverse synthetic sounds for composers and sound designers, overcoming technical skill barriers and enabling serendipitous sonic discoveries. The authors propose a system combining Quality Diversity (QD) search algorithms, specifically MAP-Elites, with a discriminative deep neural network classifier (YAMNet) to define behavioral niches in the sound space. Sounds are generated via evolutionary processes acting on genomes encoding Compositional Pattern Producing Networks (CPPNs) and Digital Signal Processing (DSP) graphs, evolved together to produce diverse, high-quality sonic phenotypes. A novel multi-CPPN approach specializing CPPNs for different frequency ranges yields simpler networks with comparable or improved QD performance compared to single-CPPN setups. The authors analyze evolutionary ‘stepping stones’ via goal switches between musical and non-musical auditory classes, revealing pathways enabling discovery of elite sounds. Extending the behavioral descriptor to include temporal niches uncovers sound specializations across durations. Results demonstrate that this Innovation Engine framework coupled with YAMNet as a fitness signal produces a broad and innovative variety of synthetic sounds across multiple contexts and durations. The outputs are accessible via an online explorer and downloadable files for creative usage.

Key findings

Co-evolved CPPN + DSP graphs achieve higher overall QD scores than CPPN-only genomes, with improved coverage and more gradual discovery of elites (Fig 2, 3, 4).
Using multiple specialized CPPNs for different frequency ranges reduces average CPPN node count from 32.7 to 10.3 and connections from 142.8 to 80.2, while slightly increasing DSP nodes and achieving higher QD scores (1408.6 vs 1427.3) (Table 1).
Allowing each genotype to win multiple cells leads to near full coverage of behavior space; restricting to single-cell wins reduces coverage to 57.4%±3.4% and lowers unique genome counts (Fig 2, 3).
A mean of 21.7±3.6 evolutionary goal switches occur per run, with 63% of new champions originating from different classes, indicating rich evolutionary stepping stones and path diversity.
Temporal niche extension shows specialization in sound phenotypes across various durations, validating the expanded behavior space approach.
Musical versus non-musical context analysis reveals frequent goal switching between categories, mirroring natural evolutionary contextual jumps.
CPPN-only runs show increased network complexity and rendering time compared to co-evolved DSP graphs, indicating DSP aids efficiency and sonic quality.
The system can generate diverse sounds assessed by the YAMNet classifier across 521 classes from AudioSet, though it struggles on categories related to broad musical genres.

Threat model

n/a — The paper does not present a security or adversarial threat model. The adversary could be construed as the evolutionary process exploring the sound space guided by a classifier, but there is no focus on attacks, defenses, or adversarial robustness.

Methodology — deep read

Threat model & assumptions: Although not directly security-focused, the study assumes an evolutionary adversary exploring a large sound space without human-in-the-loop judgement, relying on a pre-trained classifier (YAMNet) to guide search. The system cannot leverage human preference or supervision during evolution.
Data: No explicit training dataset for generation; instead, YAMNet is a pre-trained deep neural network classifier trained on AudioSet (~2 million YouTube clips, 521 classes). Sounds are synthesized from genomes encoding CPPNs and DSP graphs and then classified by YAMNet to produce behavioral descriptors.
Architecture & algorithms: The genome encodes a CPPN and a DSP graph. CPPNs produce periodic waveforms (sine, triangle, square, sawtooth) acting as oscillators and modulators. DSP graphs combine these signals with nodes like wavetable lookup, additive synthesis, delays, filters, etc. Evolution employs NEAT to incrementally complexify both CPPN and DSP structures. The MAP-Elites QD algorithm maintains an archive indexed by YAMNet classifier confidence vectors from the generated sounds to promote diversity and quality across auditory niches. Innovations include splitting CPPN into multiple specialized networks per frequency band.
Training regime: Evolution runs use 10 independent MAP-Elites runs over 300,000 iterations each, batch size 32. Initial 50 seed individuals initialized randomly. Mutation rates tuned via manual search (10% addition, 6% deletion of nodes/connections). Sounds rendered predominantly at 0.5 seconds duration for evaluations. Multi-CPPN experiments use 9375 generations with the same batch size.
Evaluation protocol: Performance measured by QD-score, the sum of classifier confidence scores of elites across cells. Behavioural coverage and genome complexity metrics tracked. Goal switching and phylogenetic analyses quantify evolutionary stepping stones. Multiple configurations tested: CPPN-only vs CPPN+DSP, multi-CPPN vs single CPPN, single-cell-win vs multiple cell wins.
Reproducibility: Source code and datasets, including evolutionary histories and rendered sound files, are publicly available with the paper. The classifier is pre-trained YAMNet, publicly released. Method is fully procedural, requiring no prior data training of the generator.

Example workflow: An individual genome encoding a CPPN and DSP graph undergoes NEAT mutations. The genome is decoded to produce control and audio signals over time, generating a waveform of 0.5s length. This waveform is fed into YAMNet, producing confidence scores over 521 audio classes. These scores define the behavioural niche cells. MAP-Elites archives this individual if it outperforms existing elites in respective cells. Iteratively, this process explores and fills diverse behavioral niches with high-quality elite sounds.

Technical innovations

Integration of MAP-Elites QD algorithm with a pre-trained deep audio classifier (YAMNet) as a multi-class behavioral descriptor for sound synthesis exploration, inspired by Innovation Engines.
Combination of CPPNs encoding periodic waveforms with co-evolving DSP graphs to broaden sonic diversity while balancing complexity.
Development of a novel multi-CPPN approach specializing networks per distinct frequency bands, reducing CPPN complexity without sacrificing generative performance.
Analysis of evolutionary stepping stones via goal switching between musical and non-musical auditory contexts to reveal unusual evolutionary pathways in sound space.

Datasets

AudioSet — ~2 million 10-second YouTube clips with 521 audio event labels — Public Google Dataset
Generated sound objects and evolutionary run data accompanying this paper — Size not explicitly specified — Public with publication

Baselines vs proposed

CPPN-only runs: QD score = lower (exact numbers in Fig 2, 4) vs CPPN+DSP runs: QD score = higher
Single CPPN approach: mean CPPN node count = 32.68 vs Multi-CPPN: 10.30 (Table 1)
Single CPPN approach: QD score = 1408.58 ± 73.10 vs Multi-CPPN: 1427.28 ± 51.92 (Table 1)
Multiple-cell-win setup: behavior coverage near 100% vs single-cell-win setup: 57.4% ± 3.4% coverage (Fig 2)
Goal switches per run: 21.7 ± 3.6 vs reported 17.9% in prior work [12]

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.09780.

Fig 1: The QD algorithm MAP-Elites uses the pre-trained YAMNet DNN classifier to define cells in

Fig 10: (3.7) Evolution runs explorer, where it is possible to scrub through evolution runs, their

Fig 9: Genome intersections through temporal variations; counting the occurrence of genomes

Limitations

Reliance on a pre-trained classifier (YAMNet) to define behavior space restricts exploration to AudioSet label space and biases search towards audible categories it can recognize.
No human-in-the-loop evaluation or subjective listening tests reported beyond informal author sessions, limiting insight on aesthetic quality beyond classifier confidence.
Computationally intensive evolutionary runs with rendering and classification bottlenecks; complex genomes increase evaluation time.
Limited investigation of robustness to different classifiers, alternative behavior descriptors, or real-time interactive use.
Lack of adversarial analysis or attempts to test brittleness of evolved sounds under adversarial conditions (n/a for security focus but relevant for robustness).
Exploration focused on sounds up to 10 seconds; longer temporal structures and musical intentions not deeply modeled.

Open questions / follow-ons

How would the evolutionary system perform if alternative discriminative models or multi-modal classifiers were used instead of YAMNet?
Can human preference or interactive feedback be integrated effectively to further guide and refine the QD search in sound?
What is the impact of scaling the temporal dimension beyond 10 seconds on the specialization of sound artefacts?
How can this evolutionary framework support real-time creative workflows in music production or sound design tools?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this paper presents an interesting application of Quality Diversity algorithms combined with deep discriminative models to explore a high-dimensional, complex behavioral space, in this case of synthetic sounds. The methodology of using a pre-trained classifier to define behavioral niches and reward novelty and quality could inspire analogous approaches in bot detection and CAPTCHA design, where behavior spaces are defined by outputs from deep models or multi-class classifiers.

However, the context differs significantly as sound generation is an open-ended creative domain rather than adversarial bot detection. The demonstrated evolutionary stepping stones and goal switching analyses highlight how diverse and non-linear evolutionary pathways can lead to rich solution spaces, a principle useful for designing defenses that adapt or evolve to complex bot behaviors. Additionally, the challenge of balancing quality and diversity in a large behavior space intersects with challenges in CAPTCHA generation and bot fingerprinting, where diversity is needed to prevent overfitting and maintain robustness.

Practitioners should note the dependence on a fixed pre-trained classifier for guiding diversity, which can bias outcomes and limits novelty. Designing similar pipelines for security would require consideration of classifier robustness and the potential for adversarial exploitation. Overall, this research underscores the merit of coupling discriminative evaluation with diversity-promoting search to explore complex input or behavior spaces, an insight applicable to CAPTCHA and bot-defense innovation.

Cite

bibtex

@article{arxiv2606_09780,
  title={ Quality-Diversity Search in Sound Generation: Investigating Innovation Engines for Audio Exploration },
  author={ Björn Þór Jónsson and Çağrı Erdem and Stefano Fasciani and Kyrre Glette },
  journal={arXiv preprint arXiv:2606.09780},
  year={ 2026 },
  url={https://arxiv.org/abs/2606.09780}
}

Quality-Diversity Search in Sound Generation: Investigating Innovation Engines for Audio Exploration ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Figures from the paper ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​