Vol-Mark: A Watermark for 3D Medical Volume Data Via Cubic Difference Expansion and Contrastive Learning

Source: arXiv:2605.04705 · Published 2026-05-06 · By Jiangnan Zhu, Yuntao Wang, Shengli Pan, Yujie Gu

TL;DR

Vol-Mark targets a practical gap in medical watermarking: most prior methods either operate on 2D slices or sacrifice exact recoverability, even though telemedicine increasingly exchanges full 3D CT/MRI volumes where inter-slice structure matters. The paper’s core idea is to combine a learned zero-watermark feature extractor for ownership binding with a reversible embedding path for direct data integrity protection. That combination is meant to support two checks: exact restoration of the original volume when the data are intact, and statistical ownership verification when the volume has been tampered with or the embedded payload is degraded.

What is new here is the specific pairing of a 3D contrastive feature extractor with a new cubic difference expansion (c-DE) scheme over 2×2×2 voxel cubes in the low-frequency band of a 3D integer wavelet transform. In the reported experiments on MSD Task01/03/07 medical volumes, the method is said to remain robust across conventional, geometric, and hybrid attacks, with average normalized correlation (NC) above 0.90 in most of the noise settings listed below. The paper’s central claim is not just stronger robustness, but that the system preserves reversibility and diagnostic fidelity while providing a second-stage statistical ownership test when direct bitwise recovery fails.

Key findings

The method uses a pretrained 3D ResNet-18 backbone from MedicalNet plus a 3-layer MLP projector to learn volumetric features directly from 128×128×64 volumes, rather than relying on slice averages or 2D descriptors.
Vol-Mark embeds ownership bits only into the LLL sub-band of a single-level 3D integer wavelet transform, using 2×2×2 cubes and difference expansion on three voxel differences per cube.
The extraction path adds majority voting across three expanded differences: the embedded bit is set to 1 if at least two of the three recovered differences are odd, which the authors present as a reliability boost under attack.
For ownership verification, the paper uses a one-sided binomial test with null H0: ξ = 0.5 and significance level α = 10^-6; they report p = 7.46×10^-155 ± 0 for watermarked data versus 0.5264 ± 0.4535 for non-watermarked data on MSD Task01 volumes.
Under Gaussian noise, average NC stays above 0.90 up to 25% intensity in the reported table: 0.9969 at 1%, 0.9831 at 5%, 0.9656 at 10%, 0.9306 at 20%, and 0.9137 at 25%.
Under salt-and-pepper noise, average NC remains 0.9807 at 1%, 0.9627 at 3%, 0.9482 at 5%, 0.9168 at 10%, and 0.8891 at 15%, showing the method degrades more sharply than under Gaussian noise but still often stays near the claimed 0.90 threshold.
The paper also reports very high robustness under JPEG compression at 50%–70% quality, but the numeric row is truncated in the provided text, so the exact values cannot be reliably reconstructed here.
The experiments are conducted on three MSD tasks: Task01 (Brain Tumours, 750 4D MRI scans), Task03 (Liver, 210 3D CT scans), and Task07 (Pancreas, 420 3D CT volumes), with results aggregated across at least Task01 and Task07 in the visible table.

Threat model

The adversary can observe, transmit, store, attack, and potentially attempt to tamper with or remove watermarks from 3D medical volumes via noise, compression, filtering, cropping, rotation, translation, random dropping, and combined hybrid attacks. The adversary is assumed not to know the secret Henon-map initialization/chaotic key, the registered original watermark, or the exact verification parameters, and cannot directly modify the verifier’s stored ownership share. The scheme is designed so that if exact recovery fails, statistical ownership verification can still detect the watermark with high confidence.

Methodology — deep read

Vol-Mark is designed for an adversary who can redistribute, perturb, geometrically transform, or try to remove a watermark from a medical 3D volume, but who does not know the secret keys/initial conditions used to generate the chaotic sequence and does not control the original registered watermark parameters. The threat model implicitly covers conventional signal attacks (noise, filtering, compression), geometric attacks (cropping, rotation, translation, dropping), and hybrid combinations; the paper also frames tampering and watermark-removal attempts as scenarios where bitwise recovery may fail but statistical ownership verification should still work. It assumes the verifier has access to the same watermark registration parameters and the stored ownership share, and that the watermark can be checked after either exact restoration or degraded extraction.

The authors evaluate on the Medical Segmentation Decathlon (MSD) dataset, using Task01 Brain Tumours (750 4D MRI scans), Task03 Liver (210 3D CT scans), and Task07 Pancreas (420 3D CT volumes). The volumes are resized to 128×128×64 and normalized with the global mean and standard deviation computed from the training set. For the contrastive feature extractor, they create positive pairs by applying random augmentations to the same volume: Gaussian noise (variance 0.01–0.25), salt-and-pepper noise (density 0.01–0.15), JPEG compression (quality 50–90), median/average filtering (kernel size 3, 5, or 7), cropping (ratio 0.01–0.20), rotation (1°–30°), translation (0.01–0.20 along a random axis/plane), and random dropping (0.01–0.25). The excerpt does not spell out a train/val/test split protocol for the watermarking tasks, a seed strategy, or whether results are averaged over multiple random initializations.
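The augmentation ranges above can be collected into a small sampler. This is only a sketch: the names `AUGMENTATION_RANGES`, `sample_augmentation`, and `sample_positive_pair_params` are hypothetical, and the excerpt does not say whether one augmentation is applied per view or several are composed, so drawing a single augmentation uniformly at random is an assumption.

```python
import random

# Parameter ranges as listed in the text; the dictionary keys are hypothetical labels.
AUGMENTATION_RANGES = {
    "gaussian_noise_variance": (0.01, 0.25),
    "salt_pepper_density": (0.01, 0.15),
    "jpeg_quality": (50, 90),
    "crop_ratio": (0.01, 0.20),
    "rotation_degrees": (1.0, 30.0),
    "translation_ratio": (0.01, 0.20),
    "random_drop_ratio": (0.01, 0.25),
}
FILTER_KERNEL_SIZES = (3, 5, 7)  # shared by median and average filtering


def sample_augmentation(rng):
    """Draw one augmentation name and one parameter value (uniform draw is assumed)."""
    names = list(AUGMENTATION_RANGES) + ["median_filter_kernel", "average_filter_kernel"]
    name = rng.choice(names)
    if name.endswith("_filter_kernel"):
        return name, rng.choice(FILTER_KERNEL_SIZES)
    low, high = AUGMENTATION_RANGES[name]
    if name == "jpeg_quality":
        return name, rng.randint(low, high)  # integer quality factor
    return name, rng.uniform(low, high)


def sample_positive_pair_params(seed):
    """Two independent draws describe the two augmented views of one volume."""
    rng = random.Random(seed)
    return sample_augmentation(rng), sample_augmentation(rng)
```

Each call returns the settings for two views of the same volume; applying them to actual voxel data is left out of the sketch.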

The architecture has two coupled parts: a learned zero-watermark feature extractor and a reversible embedding scheme. For feature extraction, they take a pretrained 3D ResNet-18 backbone from MedicalNet, remove the classification head, and attach a 3-layer MLP projector with batch normalization to map the backbone output to a feature vector f whose length N matches the watermark bit length. The features are binarized at a zero threshold to give a binary vector fb, where fi > 0 maps to 1 and anything else to 0; the authors justify the zero threshold by the batch normalization in the projector, which keeps the projected features roughly centered around zero. The feature extractor is trained with a contrastive loss that pulls augmented views of the same volume together and pushes other samples in the batch apart; the paper’s formula is a standard InfoNCE-style loss on cosine similarity with temperature τ. The goal is not classification accuracy but stable, discriminative embeddings that remain consistent under benign or attacked transformations.
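To make the loss and the thresholding step concrete, here is a minimal pure-Python sketch of an InfoNCE-style loss for a single anchor, plus the zero-threshold binarization. The function names and the default `tau=0.1` are assumptions for illustration; the excerpt does not give the temperature or the exact batch construction, and a real implementation would operate on batched tensors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss for one anchor: negative log-softmax of the positive's similarity.

    tau=0.1 is a placeholder; the excerpt does not state the temperature.
    """
    logits = [cosine(anchor, positive) / tau]
    logits += [cosine(anchor, n) / tau for n in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


def binarize(features):
    """Zero-threshold rule from the text: f_i > 0 maps to 1, otherwise 0."""
    return [1 if f > 0 else 0 for f in features]
```

The loss is small when the anchor is closer to its augmented view than to other samples, which is the property the zero-watermark relies on: attacked copies of a volume should binarize to nearly the same bits.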

The ownership share is derived by XOR-ing three components: the preset watermark w, a binary chaotic sequence cb produced from a Henon map, and the binarized feature vector fb extracted from the volume. The paper says the Henon map parameters can optionally be set from patient metadata or other values, but the excerpt does not define a concrete key schedule. The result is an ownership share OS = w ⊕ cb ⊕ fb, which is then embedded reversibly into the volume. This coupling is the zero-watermark aspect: the ownership relation is derived from the volume’s features rather than by altering its content, so the registered share depends on both the data content and the secret chaotic key.
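The XOR construction can be sketched in a few lines. The Henon map below uses the classic chaotic parameters a = 1.4 and b = 0.3; thresholding x at zero to obtain bits, the burn-in length, the initial conditions, and the toy 8-bit vectors are all assumptions, since the excerpt does not specify how the paper binarizes the chaotic sequence.

```python
def henon_bits(n, x0, y0, a=1.4, b=0.3, burn_in=100):
    """Binary chaotic sequence from the Henon map x' = 1 - a*x^2 + y, y' = b*x.

    Thresholding x at 0 and discarding a burn-in are assumptions,
    not the paper's stated rule.
    """
    x, y = x0, y0
    bits = []
    for step in range(burn_in + n):
        x, y = 1.0 - a * x * x + y, b * x
        if step >= burn_in:
            bits.append(1 if x > 0 else 0)
    return bits


def xor_bits(*sequences):
    """Element-wise XOR of equal-length bit lists."""
    return [sum(column) % 2 for column in zip(*sequences)]


w = [1, 0, 1, 1, 0, 0, 1, 0]             # preset watermark (toy length)
fb = [0, 1, 1, 0, 1, 0, 0, 1]            # binarized feature vector (toy values)
cb = henon_bits(len(w), x0=0.1, y0=0.3)  # secret key: the initial conditions

ownership_share = xor_bits(w, cb, fb)          # OS = w XOR cb XOR fb
recovered = xor_bits(ownership_share, cb, fb)  # XOR is its own inverse
```

Because XOR is self-inverse, a verifier holding cb and a recomputed fb recovers w exactly when the features are unchanged; each feature bit flipped by an attack flips exactly one bit of the recovered watermark.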

The reversible embedding path uses single-level 3D integer wavelet transform (3D-IWT) and a new cubic difference expansion (c-DE) routine. After decomposing the volume into eight sub-bands, the method only edits the low-frequency LLL band, which preserves perceptually important structure and gives more stable bits. The LLL band is partitioned into 2×2×2 cubes; from each cube, one reference voxel A and three neighbors B, C, D are selected. The method computes three differences dab=A−B, dac=A−C, dad=A−D and an average M=(A+B+C+D)/4. Each difference is expanded as d′=2d, then a watermark bit b is inserted by d′′=d′+b. The cube is reconstructed from the expanded differences and average using integer rounding, with an overflow/underflow check to avoid invalid voxel values. If any reconstructed voxel exceeds the legal intensity range, the cube is skipped and recorded in a location map L. Because the transform is integer-based and the expansion is reversible, the authors claim exact recovery of the original voxel values after extraction.
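The reason an integer wavelet transform keeps the pipeline lossless can be seen in a one-dimensional sketch. The excerpt does not name the wavelet filter the paper uses, so the integer Haar (S-transform) lifting step below is an assumption chosen because it is the simplest integer-to-integer wavelet; the function names are hypothetical.

```python
def haar_forward(a, b):
    """One integer Haar (S-transform) lifting step on a pair of integers."""
    low = (a + b) >> 1   # floor of the average
    high = a - b         # difference
    return low, high


def haar_inverse(low, high):
    """Exact inverse of haar_forward."""
    a = low + ((high + 1) >> 1)
    return a, a - high


def iwt_1d(signal):
    """Single-level transform of an even-length list: (low band, high band)."""
    pairs = [haar_forward(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    return [p[0] for p in pairs], [p[1] for p in pairs]


def iiwt_1d(low_band, high_band):
    """Rebuild the original list from the two bands."""
    out = []
    for low, high in zip(low_band, high_band):
        out.extend(haar_inverse(low, high))
    return out
```

A separable single-level 3D transform applies the same step along each of the three axes in turn, which yields the eight sub-bands; LLL is the output that took the low-pass branch on all three axes.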

Extraction reverses that process. The watermarked volume is transformed with 3D-IWT, the LLL band is partitioned into the same cubes, and any cubes marked in the location map are skipped. The three expanded differences are read back from each cube, and the embedded bit is inferred from their parity by majority vote: if at least two of the three recovered differences are odd, the bit is 1; otherwise it is 0. The original differences are then restored by subtracting the bit and halving, and the original voxel values are reconstructed and passed through the inverse 3D-IWT to recover the original volume.

A concrete end-to-end example: a 3D MRI volume to be registered is resized and normalized, and augmented copies are used to train the contrastive 3D ResNet-18. The extracted feature vector is binarized and XORed with the preset watermark and the Henon bitstream to form OS, and OS is embedded into LLL cubes by c-DE. Later, the clean or attacked volume is run through c-DE extraction, the feature vector is recomputed on the restored data, and the extracted OS is XORed with cb and the recomputed binarized features to recover w for verification.
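The embed and extract steps for a single cube can be sketched as a round trip on the four selected voxels (A, B, C, D). Several details here are assumptions rather than the paper's stated procedure: the integer average is taken with floor rounding, the same bit is written into all three differences (inferred from the majority-vote rule), and the legal range defaults to 0..255 purely for illustration. The function names are hypothetical.

```python
def embed_cube(voxels, bit, lo=0, hi=255):
    """Embed one bit into (A, B, C, D); return None if the cube must be skipped."""
    a, b, c, d = voxels
    mean = (a + b + c + d) // 4                        # integer average M (floor)
    expanded = [2 * (a - v) + bit for v in (b, c, d)]  # d'' = 2d + bit
    new_a = mean - (-sum(expanded) // 4)               # A' = M + ceil(sum(d'') / 4)
    marked = (new_a, *(new_a - e for e in expanded))
    if min(marked) < lo or max(marked) > hi:
        return None                                    # caller records it in the location map
    return marked


def extract_cube(marked):
    """Return (bit, restored voxels) using a parity majority vote."""
    a, b, c, d = marked
    mean = (a + b + c + d) // 4                        # M is unchanged by embedding
    expanded = [a - v for v in (b, c, d)]
    bit = 1 if sum(e % 2 for e in expanded) >= 2 else 0
    original = [(e - bit) // 2 for e in expanded]      # undo d'' = 2d + bit
    old_a = mean - (-sum(original) // 4)
    return bit, (old_a, *(old_a - o for o in original))
```

For example, `embed_cube((100, 98, 101, 97), 1)` gives `(102, 97, 103, 95)`, and `extract_cube` on that result returns the bit 1 together with the original four values. If an attack then changes the second voxel to 98, the recovered differences are 4, -1, and 7; two of the three are still odd, so the vote still returns 1, although exact restoration of the voxels is no longer guaranteed.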

Evaluation is two-layered. First, integrity verification uses the bit error rate (BER = ||w−ŵ||1 / N); BER = 0 indicates perfect recovery, while BER > 0 triggers the ownership-share-based zero-watermark check. Second, ownership verification uses a right-tailed binomial test with H0: ξ=0.5 and H1: ξ>0.5, where ξ is the probability that an extracted bit matches the registered watermark bit and 0.5 is the rate expected by chance. They set α=10^-6 and report a control experiment on MSD Task01 volumes: one watermarked volume and 400 non-watermarked volumes, with extraction performed 400 times. The reported mean p-value for watermarked data is 7.46×10^-155 ± 0, while non-watermarked data gives 0.5264 ± 0.4535, supporting the threshold choice. For robustness, the paper reports PSNR, BER, and normalized correlation (NC) across attack intensities. In the visible table, average NC under Gaussian noise is 0.9969 at 1%, 0.9831 at 5%, 0.9656 at 10%, 0.9306 at 20%, and 0.9137 at 25%; under salt-and-pepper noise it is 0.9807 at 1%, 0.9627 at 3%, 0.9482 at 5%, 0.9168 at 10%, and 0.8891 at 15%. The excerpt also says the method outperforms existing methods by a clear margin across conventional, geometric, and hybrid attacks, but the baseline-by-baseline numbers are not visible in the provided text, so that comparison cannot be reconstructed here.
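Both verification layers are simple to compute. The sketch below implements the BER formula and an exact right-tailed binomial p-value using only the standard library; the function names are hypothetical, and the 512-bit length used in the usage note is an inference, not a figure stated in the excerpt.

```python
from math import comb


def bit_error_rate(w, w_hat):
    """BER = ||w - w_hat||_1 / N for equal-length bit lists."""
    return sum(a != b for a, b in zip(w, w_hat)) / len(w)


def ownership_p_value(matches, n):
    """Right-tailed binomial p-value P(X >= matches) for X ~ Bin(n, 0.5)."""
    tail = sum(comb(n, k) for k in range(matches, n + 1))
    return tail / 2 ** n  # exact integer ratio, converted to float once


def verify_ownership(w, w_hat, alpha=1e-6):
    """Reject H0 (xi = 0.5) when the p-value falls below alpha."""
    matches = sum(a == b for a, b in zip(w, w_hat))
    return ownership_p_value(matches, len(w)) < alpha
```

As a sanity check, `ownership_p_value(512, 512)` equals 2^-512, which is about 7.46×10^-155. That coincides with the reported watermarked p-value, so the figure is consistent with a 512-bit watermark in which every bit matched, although the excerpt does not state the watermark length. With only half the bits matching, `ownership_p_value(256, 512)` is a little above 0.5, in line with the non-watermarked mean.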

Technical innovations

A 3D contrastive feature extractor based on pretrained MedicalNet ResNet-18 is used to build a zero-watermark directly from volumetric structure rather than from 2D slices.
The paper introduces cubic difference expansion (c-DE), which embeds bits into three voxel differences inside 2×2×2 cubes of the 3D-IWT low-frequency band.
A majority-voting rule over three recovered differences is used during extraction to improve bit reliability under attack.
Ownership verification is cast as a binomial hypothesis test with α=10^-6 instead of relying only on raw bit recovery.

Datasets

MSD Task01 (Brain Tumours) — 750 4D MRI scans — Medical Segmentation Decathlon
MSD Task03 (Liver) — 210 3D CT scans — Medical Segmentation Decathlon
MSD Task07 (Pancreas) — 420 3D CT volumes — Medical Segmentation Decathlon

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.04705.

Fig 1: Medical volume data.

Fig 2: The workflow of our proposed Vol-mark method. (a) First, Vol-Mark extracts features from volume data using a …

Fig 3: 3D integer wavelet transform scheme.

Limitations

The provided excerpt does not include a full baseline table with exact numeric comparisons, so the claimed margin over prior methods cannot be independently quantified here.
Training details for the contrastive feature extractor are incomplete in the excerpt: number of epochs, batch size, optimizer, learning rate, and seed strategy are not stated.
The evaluation appears centered on MSD tasks, but the excerpt does not show a clear held-out attacker protocol or cross-dataset generalization test.
The paper reports robustness on several attack classes, but the visible results are partial and truncated for JPEG and any later rows, limiting precise reproduction of all attack-point numbers.
The ownership verification experiment uses 400 volumes from Task01, but the excerpt does not clarify whether all subjects are independent from the feature-extractor training set or whether any leakage is possible.
The security of the chaotic component depends on Henon-map initialization, but the paper excerpt does not fully specify key management or attacker knowledge assumptions beyond general secrecy.

Open questions / follow-ons

How stable is the contrastive 3D feature extractor under out-of-distribution scanners, protocols, or non-MSD medical modalities not seen in training?
What is the actual capacity-distortion tradeoff of c-DE as cube selection becomes more conservative due to overflow/underflow constraints?
How does the method behave against adaptive attackers who know the use of 3D-IWT, majority voting, and binomial testing but not the secret key?
Can the ownership-share generation be simplified or made key-rotation friendly without weakening the zero-watermark binding to the volume content?

Why it matters for bot defense

For bot-defense practitioners, the most relevant lesson is methodological rather than directly deployable: Vol-Mark shows how combining a content-derived signature with a reversible payload can separate integrity checks from ownership checks. That pattern maps well to CAPTCHA or abuse-detection systems that need both a low-friction, content-bound signal and a fallback statistical test when the primary signal is degraded.

The paper also illustrates a useful design principle for adversarial settings: use a learned representation for robustness, but do not let the learned part be the only line of defense. The c-DE path is an example of a deterministic, reversible mechanism that complements the learned features. In CAPTCHA-like systems, a similar hybrid can help: learned embeddings for robustness to natural variation, plus a transparent verification layer that can still reject or flag suspicious behavior when the representation is perturbed. The caveat is that this paper is medical-volume-specific; the exact transform and verification mechanics would not transfer directly, but the layered defense architecture is relevant.

Cite

bibtex

@article{arxiv2605_04705,
  title={Vol-Mark: A Watermark for 3D Medical Volume Data Via Cubic Difference Expansion and Contrastive Learning},
  author={Jiangnan Zhu and Yuntao Wang and Shengli Pan and Yujie Gu},
  journal={arXiv preprint arXiv:2605.04705},
  year={2026},
  url={https://arxiv.org/abs/2605.04705}
}
