PuzzleMark: Implicit Jigsaw Learning for Robust Code Dataset Watermarking in Neural Code Completion Models
Source: arXiv:2604.27677 · Published 2026-04-30 · By Haocheng Huang, Yuchen Chen, Weisong Sun, Peizhuo Lv, Yuan Xiao, Chunrong Fang et al.
TL;DR
PuzzleMark addresses a very practical but underexplored problem: how to watermark high-value code datasets so an owner can later prove unauthorized use by neural code completion models, while avoiding the two failure modes that hurt prior work: easy detection of suspicious carriers and fragile trigger-target co-occurrence patterns. The paper’s main idea is to separate two problems that earlier methods conflated: first, choose code snippets that are less likely to be flagged even before watermarking; second, embed a watermark using a more distributed, less frequency-anomalous association than fixed co-occurrence.
The new pieces are a carrier-suitability filter based on lightweight complexity features plus a PCA-like projection score, and a concatenation-based watermark that rewrites variable names into composite names (for example, prefix+suffix joined in snake_case or camelCase) rather than injecting a fixed trigger-target pair. The authors also add a black-box verification test using Fisher’s exact test. Empirically, they report 100% verification success and 0% false positives across three NCCM settings and two languages, with negligible model-performance impact; they also claim much lower detectability than prior methods, with an average suspicious rate of at most 0.24 in a human study and an average detector recall of at most 30.41% against four detection methods.
Key findings
- The carrier-selection step is motivated by a measured mismatch: on CodeSearchNet Python/Java, >60% of CodeMark watermark carriers were already flagged by KillBadCode even before watermarking, and DeCoMa detected about 100% of the watermarked carriers shown in Fig. 1.
- PuzzleMark reduces carrier detectability by filtering on seven complexity features (cc, nloc, tc, vc, dvc, ec, dec) derived from AST/code analysis rather than selecting carriers randomly or only by SPT constraints.
- The watermark is embedded through variable-name concatenation, replacing the traditional fixed co-occurrence trigger-target pattern used by CoProtector and CodeMark; the paper argues this avoids outlier high-frequency pairs highlighted by DeCoMa.
- The authors report 100% verification success rate and 0% false positive rate under black-box validation using Fisher’s exact test across three NCCM types and two programming languages.
- Human-study imperceptibility is reported as average suspicious rate <= 0.24, indicating participants rarely judged PuzzleMark samples as suspicious.
- Against four state-of-the-art watermark detection methods, PuzzleMark’s watermarks have average recall <= 30.41%, which the authors present as evidence of machine imperceptibility.
- The paper states that model performance impact is negligible, but the excerpt does not provide the exact accuracy/perplexity deltas in the text provided.
- Under two attack scenarios, the paper says verifiability is consistently retained, supporting robustness against removal and dilution attacks.
Threat model
The adversary is a party that steals or reuses the code dataset to train NCCMs and knows a watermark exists, but not its exact construction. They can inspect and pre-process code, run automated detection/removal tools, and attempt to evade post hoc ownership verification. The dataset owner cannot inspect the attacker’s model weights and must rely on black-box API queries for verification. The attacker is assumed not to know the specific trigger design, carrier-selection rule, or embedding details, but may try general detection, removal, and dilution attacks.
Methodology — deep read
Threat model and assumptions: the paper assumes a dataset owner wants to watermark a code dataset before release, but does not know which downstream NCCM architecture or task an attacker will train. The attacker is assumed to know that a watermark exists, but not the exact implementation details, and their goal is to detect and remove watermarked samples so they can train a model and evade later ownership verification. Verification is black-box only: the owner can query the suspicious model through an API, but cannot inspect weights. This is a realistic copyright-forensics setting rather than a poisoning-as-attack setting.
Data and preprocessing: the excerpt explicitly says the authors evaluate on the Python and Java subsets of CodeSearchNet. They parse all code snippets with tree-sitter into ASTs and build source-code/AST pairs. For carrier-selection analysis, they compute function-level complexity features for each snippet. The paper selects 10 candidate features from Klocwork-like static metrics, then prunes this to 7 features after a detection-oriented validation study. The seven retained features are three structural metrics: extended cyclomatic complexity (cc), non-comment non-blank lines of code (nloc), and tokens of code (tc); and four semantic metrics: number of local declarations and parameters (vc), number of distinct local declarations and parameters (dvc), number of expressions (ec), and number of distinct expressions (dec). The feature validation uses suspicious-code labels defined by the union of outputs from KillBadCode and DeCoMa; in Fig. 5, positives are defined as the top-35% of code snippets for each metric separately when evaluating feature usefulness. The excerpt does not give the exact dataset sizes, train/validation/test splits, or how many carriers were ultimately watermarked, so those details are unclear from the provided text.
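To make the structural metrics concrete, here is a minimal sketch of computing two of them (nloc and tc) for a Python snippet. The paper computes features over tree-sitter ASTs; this stand-in uses Python's stdlib tokenize module, so exact values will differ from the authors' implementation, and the function name is my own.

```python
# Hedged sketch: approximate two of the seven complexity features,
# nloc (non-comment non-blank lines of code) and tc (tokens of code),
# using Python's stdlib tokenizer instead of the paper's tree-sitter ASTs.
import io
import tokenize

def nloc_and_tc(source: str) -> tuple[int, int]:
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    # Drop tokens that carry no code content: comments, blank/structural lines.
    skip = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
            tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}
    code_tokens = [t for t in tokens if t.type not in skip]
    tc = len(code_tokens)
    nloc = len({t.start[0] for t in code_tokens})  # distinct lines holding code
    return nloc, tc

snippet = "def add(a, b):\n    # sum two values\n    return a + b\n"
print(nloc_and_tc(snippet))  # the comment-only line is excluded from nloc
```

The comment line contributes neither to nloc nor to tc, matching the "non-comment non-blank" definition.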
Architecture / algorithm: PuzzleMark has two main stages before verification. First, carrier selection computes a suitability score intended to measure how safe a snippet is as a watermark carrier. The authors standardize the 7-dimensional feature vector with z-score normalization, compute the covariance matrix, take the top principal eigenvector w, project each normalized sample onto that direction, and pass the projection through a sigmoid-based transform to produce a suitability score s_i. Snippets below a quantile threshold tau are discarded from carrier candidacy. The intuition is that snippets with high code complexity are easier to detect and therefore poor carriers, even if the watermark syntax is hard to spot. Second, watermark embedding uses variable-name concatenation. If code naturally contains a chosen trigger prefix P, the algorithm picks the first variable name after P in byte-offset order as suffix S, concatenates them using the snippet’s naming convention (snake_case or camelCase) to create a target name T~, and then replaces another variable T with T~. If P is absent, the algorithm first introduces P by renaming an existing variable, then proceeds with the same concatenation embedding. The key novelty is that the trigger-target relation is no longer a globally repeated co-occurrence pair; the suffix and the target are adaptive per sample, which should reduce frequency anomalies. The paper presents two strategies: fixed-trigger watermarking, where prefixes like key/value are manually chosen, and a universal strategy where the trigger prefix is itself adaptive, chosen from a fixed code position (the first variable name in the snippet).
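The carrier-selection step described above can be sketched directly from its ingredients: z-score normalization, top principal eigenvector, projection, sigmoid, quantile threshold. This is a minimal NumPy illustration; the sign convention (high complexity projection maps to low suitability) and the exact sigmoid form are assumptions, since the excerpt does not pin them down.

```python
# Hedged sketch of the carrier-suitability score: standardize 7-d feature
# vectors, project onto the top principal eigenvector of the covariance
# matrix, squash with a sigmoid, then keep snippets above a quantile
# threshold tau. Sign convention and sigmoid form are assumptions.
import numpy as np

def suitability_scores(X: np.ndarray) -> np.ndarray:
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # z-score normalization
    cov = np.cov(Z, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    w = eigvecs[:, -1]                         # top principal eigenvector
    proj = Z @ w
    # Assumed sign: larger projection = higher complexity = lower suitability.
    return 1.0 / (1.0 + np.exp(proj))

def select_carriers(X: np.ndarray, tau: float = 0.5) -> np.ndarray:
    s = suitability_scores(X)
    return np.flatnonzero(s >= np.quantile(s, tau))  # keep the top (1 - tau) fraction

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 7))   # 100 synthetic snippets, 7 complexity features
idx = select_carriers(X, tau=0.5)
print(len(idx))
```

With tau = 0.5, roughly the less-complex half of the candidate pool survives as carriers; the paper's actual threshold is not stated in the excerpt.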
Training regime and evaluation protocol: the excerpt provides limited training details. It says experiments cover three NCCM types and two programming languages, but does not name all models, training epochs, batch sizes, optimizers, random seeds, GPU hardware, or whether multiple seeds were averaged. The evaluation spans four properties: effectiveness, harmlessness, imperceptibility, and robustness. Effectiveness is measured by verification success and false positives, using Fisher’s exact test in a black-box setting to classify suspicious models. Harmlessness is model performance impact, though the exact metric values are not included in the excerpt. Imperceptibility is evaluated in two ways: a human study with 10 participants, and four state-of-the-art detection methods (the excerpt names KillBadCode and DeCoMa in the motivation and says four detectors in the evaluation, but only two are explicitly named in the provided text). Robustness is tested under removal and dilution attacks; the paper says verifiability is retained under both attack scenarios. The carrier-selection section also contains an empirical validation of the seven features using F1-score across Python/Java and two detectors, with positive labels defined by suspicious-code outputs. Fig. 1 and Fig. 2 are used to show that simple complexity measures correlate with suspiciousness and that CodeMark carriers are skewed toward higher NLOC.
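The black-box verification step can be illustrated with a small sketch: query the suspect model with trigger-bearing prompts and neutral control prompts, count how often the composite (watermarked) name appears in completions, and run a one-sided Fisher's exact test on the resulting 2x2 table. The contingency layout, the one-sided alternative, and the alpha threshold here are illustrative assumptions, not the paper's exact protocol.

```python
# Hedged sketch of black-box ownership verification via Fisher's exact test.
# Rows: trigger-bearing prompts vs control prompts; columns: completions that
# do / do not emit the composite watermark name. Layout and alpha are assumed.
from scipy.stats import fisher_exact

def verify_watermark(trigger_hits: int, trigger_total: int,
                     control_hits: int, control_total: int,
                     alpha: float = 0.01):
    table = [[trigger_hits, trigger_total - trigger_hits],
             [control_hits, control_total - control_hits]]
    # One-sided test: is the trigger group's hit rate significantly higher?
    _, p = fisher_exact(table, alternative="greater")
    return p < alpha, p

# Hypothetical query budget of 50 prompts per group.
claimed, p = verify_watermark(trigger_hits=42, trigger_total=50,
                              control_hits=3, control_total=50)
print(claimed)
```

A model trained on clean data should produce the composite name at roughly the control rate in both groups, so the test should fail to reject; this is consistent with the paper's reported 0% false-positive rate, though the real query protocol is not given in the excerpt.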
Reproducibility: the paper states that all code is publicly released. The excerpt does not mention whether datasets are released, whether watermarked datasets or frozen weights are provided, or whether the exact embedding/verification scripts and attack scripts are packaged. Because the excerpt is truncated before the full experimental section, it is unclear whether the authors release exact hyperparameters, seed values, or model checkpoints. A concrete end-to-end example in the text is the Fig. 3 code snippet: CoProtector and CodeMark modify the variable index_and_score into poisoning/protection or best_indexes-style triggers/targets, whereas PuzzleMark would instead choose two existing variable names in the snippet, concatenate them into a compound name, and rename a third variable to that composite form. That illustrates the core mechanism: one variable becomes the prefix, another the suffix, and the selected replacement variable becomes the target, with the exact choice driven by frequency rules to minimize perturbation.
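The prefix/suffix/target mechanism in the Fig. 3 discussion can be sketched as follows. This toy version discovers "variables" with a crude regex over assignment targets rather than a tree-sitter AST, and its tie-breaking (first name after the prefix becomes the suffix, the next becomes the target) is a simplification of the paper's byte-offset and frequency rules.

```python
# Hedged sketch of concatenation-based embedding on a snippet that already
# contains the trigger prefix: the first variable after the prefix becomes
# the suffix, they are joined in snake_case, and a later variable is renamed
# to the composite. Regex-based variable discovery is a simplification.
import re

def embed_watermark(code: str, prefix: str) -> str:
    # Crude assignment-target discovery (the paper uses tree-sitter ASTs).
    names = re.findall(r"\b([a-z_][a-z0-9_]*)\s*=", code)
    after = [n for n in names if n != prefix]
    if prefix not in names or len(after) < 2:
        return code                      # nothing suitable to embed
    suffix, target = after[0], after[1]
    composite = f"{prefix}_{suffix}"     # snake_case concatenation
    return re.sub(rf"\b{target}\b", composite, code)

code = "key = load()\nscore = rank(key)\ntotal = score + 1\n"
print(embed_watermark(code, "key"))
```

Here `key` is the trigger prefix, `score` becomes the suffix, and `total` is renamed to the composite `key_score`; the rewritten snippet stays syntactically valid and semantically unchanged, which is the property the concatenation pattern relies on.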
Technical innovations
- Carrier suitability scoring based on code-complexity features and a PCA-style projection, instead of randomly selecting watermarked snippets or only checking semantics-preserving constraints.
- A concatenation pattern for watermarking code datasets that replaces fixed co-occurrence trigger-target pairs with adaptive variable-name composition.
- Two embedding modes: a fixed-trigger watermark and a universal watermark where the trigger prefix itself is adaptive from code position.
- Black-box verification with Fisher’s exact test for suspicious NCCMs, rather than requiring white-box access or direct watermark extraction.
Datasets
- CodeSearchNet Python subset — size not stated in excerpt — public
- CodeSearchNet Java subset — size not stated in excerpt — public
Baselines vs proposed
- CodeMark (carrier detectability analysis): KillBadCode flags >60% of its watermark carriers even before watermarking, and DeCoMa detects approximately 100% of the watermarked carriers shown in Fig. 1; PuzzleMark’s carrier filtering is designed to remove such detectable snippets before embedding
- Verification (PuzzleMark): 100% success rate, 0% false positive rate
- Human-study imperceptibility (PuzzleMark): average suspicious rate <= 0.24
- Machine imperceptibility (PuzzleMark): average recall <= 30.41% against four watermark detectors
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2604.27677.

Fig 6: A case of PuzzleMark.

Fig 7: Impact of PuzzleMark.
Limitations
- The excerpt does not provide exact dataset sizes, train/validation/test splits, or how many carriers are watermarked, so reproducibility details are incomplete here.
- The robustness evidence is limited to two attack scenarios named as removal and dilution; the excerpt does not show broader adaptive-adversary testing.
- Black-box verification is practical, but Fisher’s exact test still depends on the quality and quantity of queries; the excerpt does not specify query budget sensitivity.
- The human study uses only 10 participants, which is small for a perceptual imperceptibility claim.
- The universal watermark still relies on a deterministic code-position heuristic (first variable name, then next distinct variable name), which may be easier to model or strip than the authors suggest under stronger normalization or refactoring attacks.
- The paper emphasizes CodeSearchNet Python/Java; generalization to other languages, coding styles, or private datasets with different naming conventions remains unproven in the excerpt.
Open questions / follow-ons
- How well does the suitability score generalize to other languages, repositories, or style conventions beyond CodeSearchNet Python/Java?
- What happens under stronger adaptive attacks that normalize variable names, canonicalize formatting, or perform semantic-preserving refactoring before training?
- How sensitive is Fisher-based verification to query budget, class imbalance, and downstream task mismatch?
- Can the carrier-selection step be learned jointly with watermark placement rather than using a fixed PCA-like score and quantile threshold?
Why it matters for bot defense
For bot-defense practitioners, the main takeaway is that stealth is not just about trigger design; the carrier population matters too. PuzzleMark shows that if your “marked” instances are themselves statistically weird, detection tools can find them even when the watermark payload is subtle. In a CAPTCHA or anti-abuse setting, that maps directly to challenge selection: if you choose unusually hard, rare, or structurally distinctive challenges, adversaries can filter them out by anomaly detection before your watermark or signal ever matters.
The second lesson is that fixed co-occurrence patterns are brittle under frequency analysis. If you are designing tamper-evident synthetic challenges, honeytokens, or canary interactions, you want the marker to be distributed across naturally occurring variation rather than repeated as an identical signature. The paper’s black-box verification framing is also relevant: ownership or fraud claims often have to be made from API-level evidence, so statistical tests over query outputs are more realistic than assuming internal model access.
Cite
@article{arxiv2604_27677,
  title={PuzzleMark: Implicit Jigsaw Learning for Robust Code Dataset Watermarking in Neural Code Completion Models},
  author={Haocheng Huang and Yuchen Chen and Weisong Sun and Peizhuo Lv and Yuan Xiao and Chunrong Fang and Yang Liu and Xiaofang Zhang},
  journal={arXiv preprint arXiv:2604.27677},
  year={2026},
  url={https://arxiv.org/abs/2604.27677}
}