Supervision versus Demonstration-Based In-Context Learning for Multiword Expression Classification

Source: arXiv:2606.07479 · Published 2026-06-05 · By Sercan Karakaş, Yusuf Şimşek

TL;DR

This paper addresses the challenging NLP problem of detecting Turkish idiomatic light verb constructions (LVCs), which are multiword expressions with ambiguous surface forms that can have either idiomatic or literal meanings. The authors frame LVC detection as a binary classification task (idiomatic vs. literal) and create a carefully controlled diagnostic dataset of 147 sentences with matched negative controls (literal in-domain and random out-of-domain). They compare a supervised Turkish encoder baseline (BERTurk with a classifier head) against three instruction-tuned large language models (LLMs) under zero-shot, one-shot, and few-shot prompting. The study reveals that zero-shot prompting leads LLMs to strongly underdetect LVCs, showing very low recall but high precision on negatives, whereas one-shot prompting induces substantial model-specific biases (over- or under-predicting LVCs). Rich few-shot prompting, providing multiple positive and negative demonstrations, improves calibration and yields robust and balanced performance for the larger GPT-OSS-20B and Qwen 2.5-14B models. Overall, the supervised BERTurk baseline remains competitive and more balanced across conditions but prompted LLMs can match or exceed it on idiomatic detection with carefully constructed prompts. The work highlights the sensitivity of Turkish metalinguistic classification to prompt design and the complementary strengths of supervised vs. instruction-tuned approaches.

Key findings

Zero-shot prompting of LLMs results in near-total failure to detect LVC positives: Llama3.1-8B achieves 0% accuracy on LVC, GPT-OSS-20B 6.1%, and Qwen 2.5-14B 12.2%, while maintaining high accuracy on negative controls (>85%)
Supervised BERTurk-128K achieves 79.6% accuracy on LVC positives and 98.0% on random negatives in zero-shot (supervised) setting, overall accuracy 86.4%
One-shot prompting shifts error profiles strongly: Llama3.1-8B reaches 87.8% LVC accuracy but drops to 28.6% on NLVC negatives (overpredicting positives), Qwen 2.5 has 49.0% LVC but near perfect (~100%) on negatives (underpredicting positives), GPT-OSS-20B is most balanced (83.7% LVC, 73.5–89.8% negatives)
Few-shot prompting with richer demonstrations improves calibration, yielding high overall accuracy for GPT-OSS-20B (84.4%) and Qwen 2.5 (85.7%), with LVC accuracy of 85.7% and 71.4% respectively, while Llama3.1 remains lower overall due to false positives
Prompt sensitivity is large: regime×correctness tests show significant shifts in error types and magnitudes across zero-shot, one-shot, and few-shot regimes per model (e.g., GPT-OSS LVC accuracy from 6.1% to 85.7%)
Statistical tests indicate no significant difference between GPT-OSS few-shot and BERTurk-128K supervised on LVC accuracy (85.7% vs 79.6%, p=0.42) or overall (84.4% vs 86.4%, p=0.62)
False positives and false negatives trade off differently by model and prompting regime, evidencing calibration shifts rather than uniform performance improvements
Small, controlled diagnostic data (N=147) exposes subtle model behaviors masked by aggregate metrics, underscoring value of contrast sets for metalinguistic tasks

Threat model

The adversary is the NLP model tasked with classifying Turkish verb–nominal sequences as idiomatic (LVC) or literal. Models have access only to input sentences and instructions/prompts but no external oracle or perfect supervision during inference. The threat model does not consider malicious attacks or adversarial inputs; instead, it focuses on classification errors and calibration biases intrinsic to prompting regimens and training data coverage.

Methodology — deep read

The authors study Turkish idiomatic light verb constructions (LVCs) as a binary classification problem to distinguish idiomatic uses from literal verb–nominal pairs. Their threat model assumes adversaries (models) receive sentence input and must classify underlying meaning; the focus is on evaluating state-of-the-art LLMs and a supervised baseline rather than adversarial robustness. Data provenance combines multiple Turkish Universal Dependencies (UD) treebanks (~82K sentences) used to harvest candidate LVCs via dependency labels (compound:lvc, compound), which are manually verified by annotators to finalize a high-quality set of 9,491 LVC instances for BERTurk training. The primary evaluation corpus is a manually authored, controlled diagnostic test of 147 sentences split evenly into three sets: LVC (idiomatic positives), NLVC (literal negatives sharing same verbs), and RANDOM (unrelated negatives). This ensures lexical control and focuses on the literal–idiomatic contrast.

The baseline model is BERTurk (two vocabulary sizes: 32K and 128K) fine-tuned with an added binary classification head over final [CLS] token representations. Training uses stratified 80/20 splits from the UD-derived data with standard hyperparameters (learning rate 2e-5, batch size 32, weight decay 0.01), early stopping on validation loss after up to 10 epochs, dropout 0.2 to reduce overfitting. Evaluation metrics reported include accuracy, F1, loss.

Three instruction-tuned models from different families (Llama 3.1 8B, GPT-OSS 20B, Qwen 2.5 14B) are tested locally via the Ollama inference framework with low temperature (0.1) to stabilize outputs. Their evaluation involves generating either [0] or [1] labels for each sentence under three prompting regimes: zero-shot instructions only, one-shot (instruction + 1 positive + 1 negative demonstration per verb template), and few-shot (instruction + multiple positive and negative demonstrations per verb). Outputs are parsed and scored automatically.

The evaluation analyzes model performance across the three controlled splits (LVC, NLVC, RANDOM) and overall, emphasizing the condition-wise success rates and error profiles (false positives on negatives, false negatives on LVC). Statistical tests (chi-square, Cramer's V, Holm corrections for multiple testing) assess significance of model and regime effects on correctness.

Reproducibility is supported by releasing code, prompts, and evaluation materials, but the main training data (UD treebanks) is publicly known but not necessarily bundled. The diagnostic test dataset is newly created and specialized for this study.

A concrete example: In zero-shot, Llama3.1-8B classifies an LVC sentence with idiomatic meaning as negative [0] because no demonstrations bias it towards positively predicting LVC, yielding 0% recall on LVC class. This demonstrates how zero-shot prompting fails to internalize task-specific linguistic distinctions without examples. Contrastingly, supervised BERTurk trained on 9,491 labeled instances reliably detects these idiomatic cases with >79% accuracy.

Technical innovations

Constructing a manually validated diagnostic dataset of Turkish LVCs with matched in-domain literal and out-of-domain negative controls to isolate idiomatic vs literal classifications.
Systematic comparison of zero-shot, one-shot, and few-shot prompting regimes on instruction-tuned LLMs for Turkish metalinguistic binary classification tasks.
Analysis and quantification of error profile shifts (false positives vs false negatives) induced by different prompting approaches across heterogeneous LLM families.
Demonstrating that carefully constructed few-shot prompts can calibrate large open-weight LLMs (GPT-OSS-20B, Qwen 2.5-14B) to match or exceed supervised model performance on nuanced Turkish MWE detection.

Datasets

Turkish Universal Dependencies (UD) treebanks — 82,884 sentences initially pooled — public UD repository
Diagnostic LVC evaluation dataset — 147 manually authored, controlled sentences with 49 LVC, 49 NLVC, and 49 RANDOM examples — newly created for this study, publicly released with code/prompts

Baselines vs proposed

Zero-shot GPT-OSS-20B: LVC accuracy = 6.1% vs BERTurk-128K supervised: 79.6%
One-shot GPT-OSS-20B: LVC accuracy = 83.7% vs zero-shot GPT-OSS-20B: 6.1%
Few-shot GPT-OSS-20B: Overall accuracy = 84.4% vs BERTurk-128K supervised: 86.4%
Few-shot Qwen 2.5-14B: Overall accuracy = 85.7% vs zero-shot Qwen 2.5-14B: 63.3%
Few-shot Llama 3.1 8B: Overall accuracy = 66.7% due to high false positives despite high LVC recall (87.8%)

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.07479.

Fig 1

Fig 1: Flowchart of the experimental process

Limitations

The diagnostic evaluation dataset is small (N=147), limiting broad generalization to all Turkish LVCs or real-world data distributions.
Supervised BERTurk baseline trains on proxy labels derived from UD treebanks which may not exhaustively or perfectly annotate idiomatic LVC uses.
LLMs are tested only in prompting/in-context learning settings without fine-tuning or parameter updates, so comparison to supervised training is not fully apples-to-apples.
Prompt wording, ordering, and demonstration selection effects on LLM calibration are known but not exhaustively explored here—results can be prompt-sensitive.
No adversarial or out-of-domain robustness evaluation to assess model resilience to obfuscation or linguistic variation beyond the limited diagnostic sets.
Cross-lingual or multilingual generalization beyond Turkish is untested, leaving open applicability to other morphologically rich or idiomatic languages.

Open questions / follow-ons

How does prompt design optimization (number, order, diversity of demonstrations) quantitatively affect calibration and error profile in Turkish LVC detection for LLMs?
Can parameter-efficient fine-tuning or adaptor layers bridge the gap between supervised encoders and prompted LLMs on morphosyntactic multiword classification tasks?
What is the impact of larger or more diverse Turkish idiomatic/literal datasets for supervised training versus prompting-based learning?
How robust are these LVC detection approaches to domain shifts, paraphrases, noise, or non-standard Turkish text genres?

Why it matters for bot defense

For bot-defense engineers, this study illustrates that metaphorical or idiomatic language elements, such as Turkish light verb constructions, pose significant classification challenges even for large LLMs. Reliance on zero-shot prompting may lead to systematic underdetection of nuanced language phenomena, potentially exploitable by sophisticated bots mimicking idiomatic usage. Carefully engineered few-shot prompts can calibrate model decisions to better identify subtle semantic distinctions, but prompt sensitivity implies that in-the-wild performance may vary unpredictably. From a CAPTCHA or language analysis perspective, incorporating supervised or task-specific annotated data to train dedicated classifiers may yield more robust and balanced detection of challenging semi-idiomatic phrases likely to confound simpler pattern-based filters. The findings caution against overreliance on broad instruction-following abilities of LLMs without prompt tuning for metalinguistic tasks relevant to bot differentiation and natural language understanding challenges.

Cite

bibtex

@article{arxiv2606_07479,
  title={ Supervision versus Demonstration-Based In-Context Learning for Multiword Expression Classification },
  author={ Sercan Karakaş and Yusuf Şimşek },
  journal={arXiv preprint arXiv:2606.07479},
  year={ 2026 },
  url={https://arxiv.org/abs/2606.07479}
}

Supervision versus Demonstration-Based In-Context Learning for Multiword Expression Classification ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Figures from the paper ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​