Language Models Learn Constructional Semantics, Not To Mention Syntax: Investigating LM Understanding of Paired-Focus Constructions

Source: arXiv:2605.31586 · Published 2026-05-29 · By Wesley Scivetti, Ethan Wilcox, Nathan Schneider, Kanishka Misra, Leonie Weissweiler

TL;DR

This paper investigates how well language models (LMs), especially open-source and medium-sized ones, understand rare English Paired-Focus constructions like “let alone” and “much less,” which join two focused elements with a scalar semantic relationship. Unlike previous work that mostly finds semantic understanding of such constructions only in very large closed-source LLMs, the authors create a novel, large-scale dataset grounded in scalar adjective semantics and world knowledge to directly probe both syntax and semantics of these constructions in a wide variety of models. They discover that several modestly sized models (around 400M parameters) do exhibit sensitivity to both form and meaning, but models trained on “human-scale” datasets fail to capture semantics. By analyzing training dynamics, they demonstrate that syntactic knowledge of Paired-Focus constructions emerges early but deeper semantic understanding arises much later and correlates with broader world knowledge gains. The study highlights a nuanced dissociation between syntax and semantics acquisition and shows relationships between constructional semantics and general knowledge domains, providing novel empirical insights into the acquisition of rare linguistic constructions in LMs.

Key findings

Medium-sized open-source LMs (e.g. ~400M parameters) achieve over 90% accuracy on both syntactic and semantic evaluations of Paired-Focus constructions.
Models below roughly 400M parameters fail semantic evaluations even with increasing training data; parameter count is a stronger predictor of semantic accuracy than training data size (β=6.607 vs β=2.651).
Syntactic knowledge of Paired-Focus constructions emerges early in training and plateaus quickly; semantic understanding emerges much later, mirroring the learning curve of world knowledge benchmarks (EWoK).
Paired-Focus semantic understanding is moderately correlated (ρ=0.48) with performance on the physical relations subset of the EWoK world knowledge benchmark, indicating semantic grounding beyond syntax.
All tested Paired-Focus constructions (LET-ALONE, MUCH-LESS, NOT-TO-MENTION, NEVER-MIND) display highly correlated learning trajectories.
Models trained on human-scale datasets (e.g., BabyLMs with ~100M tokens) fail the semantic evaluation but can learn syntactic patterns, which calls into question their suitability as model systems for linguistic study.
No significant correlation is found between Paired-Focus semantic understanding and purely syntactic benchmarks (e.g., BLiMP), supporting a dissociation between syntax and semantics acquisition.
Open-source masked language models and autoregressive models do not differ significantly in Paired-Focus semantic performance.

Threat model

The adversary is an analyst attempting to probe language models to detect whether they learn and represent the form and semantic functions of rare Paired-Focus linguistic constructions. The adversary has access only to the models’ probabilities for input sentences and no access to internals or training data. The adversary cannot modify training or influence model behavior beyond querying probabilities.

Methodology — deep read

The authors investigate LM understanding of rare Paired-Focus constructions through a comprehensive, multi-step empirical framework:

Threat Model & Assumptions: The evaluation is a passive linguistic probing scenario with no active adversary or attack. The "adversary" is simply the researcher, who seeks to determine whether LMs represent constructional form and semantics and who has access to LM probability outputs but not to internal weights.
Data: The authors focus on four Paired-Focus English constructions: LET-ALONE, MUCH-LESS, NOT-TO-MENTION, and NEVER-MIND, all rare in corpora (see Table 1). They create a novel dataset of ~3,500 example sentence pairs per construction, grounded in scalar adjective scales from Wilkinson and Oates (2016), paired with natural noun phrases. The dataset contains minimal pairs contrasting plausible vs. implausible follow-up sentences that either entail or contradict the scalar semantic relation implied by the construction. Control sentences replace the Paired-Focus phrase with "or" to isolate scalar semantics from simple conjunction.
Architecture / Algorithm: They evaluate 36 models varying in parameter count (small to 12B), pretraining corpus size (human-scale to large), pretraining objective (masked language modeling vs autoregressive), and architecture family (e.g., OPT, Pythia, BERT variants, Ettin). They probe the models by computing surprisal differences on follow-up sentences with and without Paired-Focus constructions as contexts. This probabilistic evaluation directly measures how the presence of the construction shifts likelihood toward semantically consistent continuations.
Training Regime: For three top-performing models (Pythia-12B, Ettin-enc-400M, Ettin-dec-1B), they study learning trajectories by evaluating checkpoints spaced logarithmically through training progress (up to 400B tokens for Ettin, full training for Pythia). Syntactic and semantic evaluations are conducted at each checkpoint to assess order and speed of acquisition.
Evaluation Protocol: The primary metric is an accuracy score measuring whether the model prefers scalar-consistent (entailed) follow-up sentences over contradictory ones significantly more in Paired-Focus than in control "or" constructions (Equation 2). They control for lexical biases and ordering effects by balancing minimal pairs. Syntactic tests adapted from prior literature are also evaluated by comparing grammatical and ungrammatical variants of Paired-Focus forms. They additionally correlate learning trajectories with other linguistic benchmarks: BLiMP for syntax, COMPS for conceptual noun properties, and EWoK for world knowledge.
Reproducibility: The authors release code and the dataset at the provided GitHub link. LM checkpoints are open-source where available; some experiments use publicly released checkpoints. Detailed appendices provide dataset templates, evaluation details, and syntactic test suites. However, not all checkpoint data (e.g., early Ettin checkpoints) is readily available, and some results rely on partially unpublished models.
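The dataset design described under Data can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the sentence frame, the follow-up wording, and the names `MinimalPairItem`, `build_item`, and `build_all` are hypothetical and are not taken from the authors' templates, which are documented in the paper's appendices.

```python
from dataclasses import dataclass
from typing import List

# The four constructions studied in the paper.
CONNECTIVES = ("let alone", "much less", "not to mention", "never mind")


@dataclass(frozen=True)
class MinimalPairItem:
    context: str        # sentence containing the Paired-Focus phrase
    control: str        # same sentence with "or" replacing the phrase
    consistent: str     # follow-up that agrees with the implied scale
    contradictory: str  # follow-up that reverses the implied scale


def build_item(connective: str, noun: str, low: str, high: str) -> MinimalPairItem:
    """Fill one hypothetical frame with a (low, high) scalar adjective pair."""
    frame = "I couldn't lift a {low} {noun}, {link} a {high} one."
    return MinimalPairItem(
        context=frame.format(low=low, noun=noun, link=connective, high=high),
        control=frame.format(low=low, noun=noun, link="or", high=high),
        consistent=f"Lifting a {high} {noun} is harder.",
        contradictory=f"Lifting a {high} {noun} is easier.",
    )


def build_all(noun: str, low: str, high: str) -> List[MinimalPairItem]:
    """Build one item per construction for the same noun and scale."""
    return [build_item(c, noun, low, high) for c in CONNECTIVES]
```

Keeping the control sentence identical apart from the connective is what lets a later comparison attribute any shift in follow-up preference to the construction itself rather than to the lexical content.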

Concrete end-to-end example: Given a sentence with a Paired-Focus construction such as "I couldn't lift a tiny rock, let alone a huge one," the model's surprisal is computed on follow-ups like "Lifting a huge rock is harder" (scalar-consistent, plausible) vs. "Lifting a huge rock is easier" (contradictory, implausible). The surprisal difference between these follow-ups in the Paired-Focus context is then compared with the same difference when the context uses "or" instead of "let alone." A stronger preference for the scalar-consistent follow-up in the Paired-Focus context than in the control indicates semantic understanding.
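The comparison in this example can be sketched as a small scoring routine. This is one plausible reading of the metric, not the paper's Equation 2, which this summary does not reproduce. The `Scorer` callable, which is assumed to return the natural-log probability of a follow-up given a context, and the toy probability table are both hypothetical stand-ins for a real language model.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical interface: log P(follow_up | context), in nats.
Scorer = Callable[[str, str], float]
# (paired_focus_context, or_control_context, consistent, contradictory)
Item = Tuple[str, str, str, str]


def surprisal_bits(logprob_nats: float) -> float:
    """Convert a natural-log probability into surprisal in bits."""
    return -logprob_nats / math.log(2)


def preference(score: Scorer, context: str, consistent: str, contradictory: str) -> float:
    """How much more surprising the contradictory follow-up is than the consistent one."""
    return (surprisal_bits(score(context, contradictory))
            - surprisal_bits(score(context, consistent)))


def accuracy(score: Scorer, items: Sequence[Item]) -> float:
    """Fraction of items where the Paired-Focus context raises the preference
    for the consistent follow-up above the level seen in the "or" control."""
    correct = sum(
        preference(score, pf, good, bad) > preference(score, ctrl, good, bad)
        for pf, ctrl, good, bad in items
    )
    return correct / len(items)


# Toy log-probabilities (made up for illustration only).
TOY_LOGPROBS: Dict[Tuple[str, str], float] = {
    ("pf", "good"): -2.0, ("pf", "bad"): -5.0,
    ("or", "good"): -3.0, ("or", "bad"): -3.5,
}


def toy_scorer(context: str, follow_up: str) -> float:
    return TOY_LOGPROBS[(context, follow_up)]
```

Subtracting the control preference matters because a model may already favor the consistent follow-up from world knowledge alone; only the extra shift under the Paired-Focus context is credited to the construction.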

Technical innovations

Design of a novel large-scale, scalar-semantics-grounded dataset to evaluate LM understanding of rare Paired-Focus constructions in a controlled probabilistic framework.
Use of minimal pair-based probabilistic surprisal differences between Paired-Focus and simple conjunction contexts to isolate constructional semantic knowledge from general world knowledge.
Comprehensive analysis of training dynamics linking acquisition of syntactic form and semantic understanding of rare constructions to broader world knowledge benchmarks.
Demonstration that modestly sized open-source LMs can learn both the form and the meaning of rare constructions well before, or entirely without, massive data scale, challenging assumptions from prior work focused on very large models.

Datasets

Paired-Focus Semantics Dataset — ~3,500 sentence pairs per construction, 4 constructions — Newly constructed by authors, available at https://github.com/WesScivetti/Meaning_Alone
Wilkinson and Oates (2016) scalar adjective scales — used as basis for scalar adjective selection — Public linguistic resource
BLiMP — ~67k minimal pairs for syntactic evaluation — Public benchmark
COMPS — conceptual noun property tasks — Public benchmark
EWoK — world knowledge probing dataset focused on physical/material properties — Public benchmark

Baselines vs proposed

Model size <400M parameters: semantic accuracy at or below chance (~50%) vs model size ≥400M: semantic accuracy reaching above 90% in the best cases
Training data size: Weak correlation with semantic accuracy (β=2.651, p=0.034) vs Parameter count: Stronger correlation (β=6.607, p=0.003)
Masked vs Autoregressive models: No significant difference in Paired-Focus semantic accuracy
Small models (e.g., BERT base): high syntactic accuracy (near top performance) vs low semantic accuracy (near chance)
Pythia-12B semantic accuracy increases gradually over training steps while syntactic accuracy plateaus early
EWoK physical relations benchmark accuracy moderately correlates with Paired-Focus semantic accuracy (ρ=0.48)
BLiMP syntactic benchmark accuracy does not correlate significantly with Paired-Focus semantic performance (ρ near 0)
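The correlations reported above can be reproduced in spirit with a rank correlation over per-checkpoint accuracies. The sketch below assumes ρ denotes Spearman's rank correlation, which this summary does not state explicitly. The function names are illustrative, and it uses only the standard library.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

In use, `x` would hold Paired-Focus semantic accuracy at each checkpoint and `y` the accuracy on a benchmark subset such as EWoK physical relations at the same checkpoints.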

Limitations

Semantic evaluations rely on LM probability distributions and minimal pairs, which may not capture all nuances of human semantic understanding.
Training dynamics analysis is limited to a few open-checkpoint models; results may not generalize across all architectures or pretraining regimes.
Human-scale datasets (e.g., BabyLMs) are relatively small and pretraining recipes differ, limiting conclusions about why they fail semantic tasks.
Potential lexical bias and ordering confounds are controlled for but cannot be completely ruled out as influencing outcomes.
Evaluation focuses exclusively on English Paired-Focus constructions; cross-linguistic generalizability is untested.
The dataset and syntactic tests target only noun phrase focus variants, limiting scope of constructions evaluated.

Open questions / follow-ons

What architectural or training modifications could enable smaller models or human-scale LMs to acquire semantic knowledge of rare constructions?
How do Paired-Focus construction learning dynamics vary across languages with different typological properties?
Can integrating explicit scalar semantics representations in LM architectures improve semantic acquisition of constructions?
What causal relationships exist between acquisition of general world knowledge and constructional semantics in LMs?

Why it matters for bot defense

For bot-defense and CAPTCHA engineers, this paper demonstrates that even relatively small open-source language models can acquire nuanced semantic understanding of subtle, rare linguistic constructions, but only beyond a certain parameter count threshold (~400M). This indicates that modestly sized language models can internalize complex form-meaning mappings relevant for natural language understanding tasks involving compositional semantics and focus-sensitive operators. From a defense perspective, the dissociation between early syntactic learning and later semantic acquisition suggests that probing models for semantic comprehension may help distinguish genuine human language understanding from surface-level pattern recognition often exploited by automated bots. The authors’ methods for constructing scalar-semantics grounded datasets and probabilistic contrastive evaluation might inspire new testing methodologies for chatbot authenticity verification or linguistic challenge design. Their findings also caution that models trained on limited data (human-scale) do not acquire required semantic robustness, an insight relevant when relying on smaller models for security-critical NLP tasks.

Cite

@article{arxiv2605_31586,
  title={Language Models Learn Constructional Semantics, Not To Mention Syntax: Investigating LM Understanding of Paired-Focus Constructions},
  author={Wesley Scivetti and Ethan Wilcox and Nathan Schneider and Kanishka Misra and Leonie Weissweiler},
  journal={arXiv preprint arXiv:2605.31586},
  year={2026},
  url={https://arxiv.org/abs/2605.31586}
}
