System Report for CCL25-Eval Task 5: New Dataset and LoRA-Fine-Tuned Qwen2.5

Source: arXiv:2606.12392 · Published 2026-06-10 · By Haotao Xie

TL;DR

This paper addresses classical Chinese poetry appreciation with large language models (LLMs), an underexplored area that goes beyond general translation or generation and calls for precise translation and nuanced emotional-semantic interpretation. The authors construct CCPoetry-49K, a new high-quality dataset of 49,404 instruction-response pairs aligned and cleansed from multiple open-source corpora. The dataset supports three subtasks tailored to classical poetry: term interpretation, semantic interpretation, and emotional inference. The authors then fine-tune Qwen2.5-14B on this dataset with Low-Rank Adaptation (LoRA), producing PoetryQwen, a model specialized for this domain. On the CCL25-Eval Task 5 benchmark, PoetryQwen achieves an overall score of 0.757, a 9.7% relative improvement over the Qwen2.5-14B-Instruct baseline, which supports the value of domain-specific adaptation for classical poetry comprehension.

Key findings

CCPoetry-49K consists of 49,404 instruction-response pairs split into 19,323 term interpretation, 21,799 semantic interpretation, and 8,282 emotional inference samples.
PoetryQwen (Qwen2.5-14B fine-tuned with LoRA on CCPoetry-49K) obtains an overall CCL25-Eval Task 5 score of 0.757, a 9.7% relative improvement over the Qwen2.5-14B-Instruct baseline's 0.690.
On semantic interpretation, PoetryQwen achieves a BLEU score of 0.405 vs. 0.169 for baseline, and a BERTScore of 0.909 vs. 0.865, showing large gains in semantic understanding.
PoetryQwen improves emotional inference accuracy to 0.847 compared to 0.832 for Qwen2.5-14B-Instruct and 0.734 for GLM-4-9B.
LoRA fine-tuning targeted seven projection modules (q-proj, k-proj, v-proj, o-proj, gate-proj, up-proj, down-proj) with rank 16 and alpha 32 for efficient adaptation.
The CCPoetry-49K dataset was carefully aligned and cleansed from three major open-source classical Chinese poetry datasets to ensure quality and task relevance.
Post-hoc output alignment using Qwen2.5-14B-Instruct was applied for emotional inference to meet CCL25-Eval Task 5 submission format.
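The subtask split quoted in the findings can be sanity-checked with a few lines of arithmetic. This is a minimal sketch that uses only the counts reported in this summary; the dictionary keys are illustrative names, not identifiers taken from the dataset.

```python
# Subtask sizes as quoted in the key findings.
subtask_counts = {
    "term_interpretation": 19_323,
    "semantic_interpretation": 21_799,
    "emotional_inference": 8_282,
}

total = sum(subtask_counts.values())
# Share of the whole dataset taken by each subtask.
shares = {name: count / total for name, count in subtask_counts.items()}

print(f"total samples: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The three counts add up to the 49,404 total, and semantic interpretation is the largest of the three subtasks.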

Threat model

Not a security-focused paper; the work assumes a standard NLP task setting where the adversary is not explicitly modeled. The main focus is domain adaptation of large language models to classical Chinese poetry understanding without modeling attacks or malicious inputs.

Methodology — deep read

The paper defines no explicit threat model: it frames the work as a domain adaptation task to improve LLM comprehension of classical Chinese poetry, and no adversary is modeled. The focus is on adapting to the linguistic, semantic, and emotional characteristics particular to classical poetry. First, the authors aggregated multiple open-source datasets related to classical Chinese poetry, including Poetry CN, the Chinese ancient poetry translation corpus, and poems-db. These datasets contain metadata, annotations, and partial labels spanning modern translations, term explanations, commentaries, and sentiment tags. The combined raw data was extensively cleansed and aligned to produce CCPoetry-49K, an instruction-response dataset of 49,404 samples divided into three subtasks: term interpretation (explaining key words), semantic interpretation (translating poetic phrases into modern language), and emotional inference (assigning affective or thematic labels). Each subtask's data was formatted as JSON instruction pairs for instruction tuning.

For model development, the authors selected Qwen2.5-14B (a 14.7B-parameter LLM with 128k-token context support) as the base model. They applied Low-Rank Adaptation (LoRA) for parameter-efficient tuning, targeting the projection matrices q-proj, k-proj, v-proj, o-proj, gate-proj, up-proj, and down-proj with rank 16, alpha 32, and dropout 0.1. Training ran for 2 epochs at a learning rate of 2e-4 with a fixed random seed of 42 to aid reproducibility. A separate LoRA adapter was trained for each subtask, so the base Qwen2.5-14B weights stay frozen and each task adds only its own low-rank update.
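The hyperparameters above can be collected in one place to see what they imply. The following is a minimal sketch, not the authors' training code: the `LoraSettings` container and `lora_param_count` helper are hypothetical names, the underscore spellings follow how these layers are named in the Hugging Face Qwen2 implementation, and the 5120 x 5120 matrix shape is an assumed size used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    """Hyperparameters quoted in the summary (container name is hypothetical)."""
    rank: int = 16
    alpha: int = 32
    dropout: float = 0.1
    learning_rate: float = 2e-4
    epochs: int = 2
    seed: int = 42
    target_modules: tuple = (
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    )

    @property
    def scaling(self) -> float:
        # Standard LoRA scales the low-rank update B @ A by alpha / rank.
        return self.alpha / self.rank


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added to one d_out x d_in weight matrix.

    LoRA adds A (rank x d_in) and B (d_out x rank), so rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


settings = LoraSettings()
# Assumed 5120 x 5120 projection, used only to illustrate adapter size.
example_params = lora_param_count(5120, 5120, settings.rank)
print(settings.scaling, len(settings.target_modules), example_params)
```

With rank 16 and alpha 32 the update is scaled by 2, and a square projection of the assumed size gains 163,840 trainable parameters against 26,214,400 frozen ones, which is where the small parameter overhead comes from.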

For emotional inference outputs, generated results were additionally passed through Qwen2.5-14B-Instruct to align predictions with the specific emotion label set format required by the CCL25-Eval Task 5 benchmark.
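One way to picture this alignment step is a second prompt followed by a check on the reply. The sketch below is hypothetical throughout: the label set, the prompt wording, and both function names are placeholders, because this summary gives neither the benchmark's labels nor the prompt the authors used. The call to the instruction-tuned model is left out; only prompt construction and validation are shown.

```python
# Placeholder label set; the real CCL25-Eval Task 5 labels are not listed here.
ALLOWED_LABELS = ("joy", "sorrow", "longing", "serenity")


def build_alignment_prompt(raw_prediction: str, labels=ALLOWED_LABELS) -> str:
    """Ask an instruction-tuned model to map free text onto one allowed label."""
    options = ", ".join(labels)
    return (
        "Map the following emotion description to exactly one label.\n"
        f"Allowed labels: {options}\n"
        f"Description: {raw_prediction}\n"
        "Answer with the label only."
    )


def validate_label(model_reply: str, labels=ALLOWED_LABELS):
    """Return the reply if it is an allowed label, else None so the caller can retry."""
    cleaned = model_reply.strip().lower()
    return cleaned if cleaned in labels else None


prompt = build_alignment_prompt("a quiet sadness at parting from a friend")
print(prompt)
print(validate_label(" Sorrow \n"))
```

Validating the reply against the closed label set keeps a malformed answer from reaching the submission file, which is one of the error sources this extra step can introduce.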

Evaluation was conducted on the CCL25-Eval Task 5 benchmark across subtasks. Metrics were BLEU and BERTScore for the translation-related subtasks (term and semantic interpretation) and accuracy for emotional inference. The baselines were Qwen2.5-7B, Qwen2.5-14B-Instruct, and GLM-4-9B. The analysis compares PoetryQwen with these baselines per subtask, with scores detailed in Table 2, and shows improvements over all of them; it is a baseline comparison rather than an ablation of individual components.
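To make the metrics concrete, the sketch below implements accuracy and the clipped n-gram precision that BLEU is built from. It is a simplification, not the benchmark's scorer: full BLEU also combines several n-gram orders and applies a brevity penalty, and the character-level tokenization used here is an assumption. BERTScore is omitted because it needs a pretrained encoder.

```python
from collections import Counter


def accuracy(predictions, references):
    """Fraction of positions where the predicted label equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def clipped_ngram_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Modified n-gram precision over characters, the building block of BLEU."""
    cand = Counter(candidate[i:i + n] for i in range(len(candidate) - n + 1))
    ref = Counter(reference[i:i + n] for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it appears in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


print(accuracy(["a", "b"], ["a", "c"]))
print(clipped_ngram_precision("abc", "abd", n=1))
```

The clipping is what stops a candidate from scoring well by repeating one matching character many times.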

Reproducibility is partially supported by the release of CCPoetry-49K at https://github.com/XieHaoTao/CCPotery and by the use of publicly available Qwen2.5 base models from Hugging Face. However, the final PoetryQwen weights are not confirmed as publicly released. Evaluation is restricted to the CCL25-Eval benchmark, with no reported adversarial or distribution-shift tests. The detailed LoRA configuration and fixed random seed help independent retraining efforts.

As a concrete example, to handle the semantic interpretation subtask, a JSON instruction providing poem title, content, and target sentences is processed by the semantic LoRA adapter fine-tuned on CCPoetry-49K. The model outputs a modern Chinese paraphrase of classical phrases evaluated by BLEU and BERTScore against references in the benchmark.
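An instruction record of this kind might look like the sketch below. The field names (`instruction`, `input`, `title`, `content`, `target`, `output`) and the instruction wording are assumptions, since the summary does not reproduce the dataset schema. The poem is Li Bai's 静夜思, and the paraphrase in `output` was written for this illustration rather than taken from CCPoetry-49K.

```python
import json

# Hypothetical record layout for the semantic interpretation subtask.
record = {
    "instruction": "Translate the target sentence of the poem into modern Chinese.",
    "input": {
        "title": "静夜思",
        "content": "床前明月光，疑是地上霜。举头望明月，低头思故乡。",
        "target": "举头望明月，低头思故乡。",
    },
    # Illustrative paraphrase, not a reference from the dataset.
    "output": "抬头望着天上的明月，低头思念远方的故乡。",
}

# ensure_ascii=False keeps the Chinese text readable in the serialized line.
line = json.dumps(record, ensure_ascii=False)
restored = json.loads(line)
print(line)
```

Writing one such JSON object per line gives a file that common instruction-tuning pipelines can stream without loading everything at once.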

Technical innovations

Construction of CCPoetry-49K, a 49,404-sample multi-subtask instruction dataset specifically for classical Chinese poetry appreciation, combining term, semantic, and emotional interpretation.
Application of LoRA fine-tuning on Qwen2.5-14B to efficiently adapt a large general LLM to specialized classical poetry understanding tasks with small parameter overhead.
Decomposition of poetry appreciation into three distinct subtasks (term interpretation, semantic interpretation, emotional inference) with tailored instruction tuning.
Post-hoc output alignment using a separate LLM for mapping emotional inference outputs to predefined task-specific label sets.

Datasets

CCPoetry-49K — 49,404 samples — constructed from Poetry CN, Chinese ancient poetry translation, and poems-db open-source datasets

Baselines vs proposed

Qwen2.5-7B: overall score = 0.667 vs PoetryQwen: 0.757
Qwen2.5-14B-Instruct: overall score = 0.690 vs PoetryQwen: 0.757
GLM-4-9B: overall score = 0.628 vs PoetryQwen: 0.757
Semantic Interpretation BLEU: Qwen2.5-14B-Instruct = 0.169 vs PoetryQwen = 0.405
Emotional Inference accuracy: GLM-4-9B = 0.734 vs PoetryQwen = 0.847
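The relative improvements implied by the overall scores above can be recomputed directly. This is a small sketch using only the numbers listed in this section; the helper name `relative_gain` is illustrative.

```python
poetryqwen_score = 0.757
baseline_scores = {
    "Qwen2.5-7B": 0.667,
    "Qwen2.5-14B-Instruct": 0.690,
    "GLM-4-9B": 0.628,
}


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of new over old, as a fraction of old."""
    return (new - old) / old


# Percent gain of PoetryQwen over each baseline, rounded to one decimal place.
gains = {
    name: round(100 * relative_gain(poetryqwen_score, score), 1)
    for name, score in baseline_scores.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain}%")
```

The Qwen2.5-14B-Instruct row reproduces the 9.7% figure quoted in the TL;DR, which confirms that the paper's headline number is a relative gain, not a difference in absolute points.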

Limitations

The evaluation is limited to the CCL25-Eval Task 5 benchmark; the model is not tested on out-of-domain poetry or on other classical literature evaluation tasks.
No explicit adversarial robustness analysis or evaluation under distribution shift is reported.
Final PoetryQwen model weights are not clearly stated as publicly released, impacting reproducibility.
LoRA fine-tuning was conducted for only 2 epochs; longer fine-tuning and alternative hyperparameters were not explored.
Emotional inference requires post-hoc alignment with an additional model, adding complexity and potential error sources.
Dataset construction relies on existing open-source corpora with varying annotation quality, which may propagate biases or noise.

Open questions / follow-ons

How would PoetryQwen perform on broader classical Chinese literature tasks beyond poetry or on cross-genre evaluation?
Can the instruction tuning dataset and LoRA adapters generalize to other niche poetry styles or languages with similar poetic structures?
What is the robustness of the model to adversarial inputs or noisy, ambiguous poetic phrasing?
How effective would end-to-end multi-task or unified fine-tuning be versus separated LoRA adapters for each subtask?

Why it matters for bot defense

This paper offers insight into domain-specialized adaptation of large language models using efficient fine-tuning (LoRA) and carefully curated instruction datasets. For bot-defense and CAPTCHA practitioners, the study underscores the importance of domain-specific data and adaptation strategies for understanding nuanced or subtle text inputs, which can inform challenge designs that exploit specialized semantic and affective comprehension. The structured subtask decomposition could also inspire evaluation frameworks for security-sensitive NLP tasks where fine-grained understanding is critical. LoRA fine-tuning and classical-poetry datasets have little immediate applicability to CAPTCHA text generation or bot interaction, but they illustrate a broader principle: targeted domain adaptation can substantially improve a model's nuance and interpretive accuracy.

Cite

bibtex

@article{arxiv2606_12392,
  title={System Report for CCL25-Eval Task 5: New Dataset and LoRA-Fine-Tuned Qwen2.5},
  author={Haotao Xie},
  journal={arXiv preprint arXiv:2606.12392},
  year={2026},
  url={https://arxiv.org/abs/2606.12392}
}
