LLUMI: Improving LLM Writing Assistance for Mental Health Support with Online Community Feedback

Source: arXiv:2605.30273 · Published 2026-05-28 · By Jiwon Kim, Maya Ajit, Sherry Gong, Soorya Ram Shimgekar, Dong Whi Yoo, Eshwar Chandrasekharan et al.

TL;DR

This paper addresses the challenge of providing AI writing assistance for mental health peer-support interactions using smaller, open-source large language models (LLMs). Existing proprietary LLMs, while powerful, raise privacy concerns and require costly expert annotations and compute resources. The authors introduce LLUMI, a computational framework composed of two models: a generation model (GM) that drafts supportive responses from scratch, and an improvement model (IM) that revises user-written responses to enhance empathy, clarity, and safety. LLUMI leverages naturally occurring preference signals from Reddit mental health communities, particularly r/SuicideWatch, by utilizing upvotes/downvotes to create chosen-rejected response pairs for supervised fine-tuning (SFT) and Direct Preference Optimization (DPO). Additional human evaluations with distinct preference pairs are used for further alignment. Results show that despite relying on smaller 7B-parameter open-source models (Mistral-7B-Instruct-v0.2), LLUMI's DPO2-tuned models achieve human evaluation scores within 1% of proprietary GPT-5-nano baselines across empathy, readability, connection, actionability, and safety metrics. Linguistic analyses confirm improvements in stylistic accommodation, semantic similarity, and diversity. The IM similarly improves user-written comments while preserving original intent and style. This suggests that community-derived preference signals combined with iterative human feedback enable smaller LLMs to provide privacy-conscious, high-quality AI writing assistance for sensitive peer-support contexts.

Key findings

  • LLUMI generation model (DPO2) achieves human evaluation scores within approximately 1% of GPT-5-nano on empathy, readability, connection, actionability, and safety (Tables 2 and 3).
  • DPO1 fine-tuning improves over original Reddit community responses by 22% readability, 49% empathy, 48% connection, 77% actionability, and 42% safety (Table 2).
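  • DPO2 training with human preference data improves over original Reddit community responses by 15% readability, 106% empathy, 109% connection, 135% actionability, and 52% safety (Table 3). These gains are relative to the community baseline rather than to DPO1: the 106% empathy figure matches the Table 3 means of 4.41 for DPO2 and 2.14 for community responses.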
  • Preference dataset for DPO1 contains 4,390 chosen-rejected response pairs derived from Reddit upvote/downvote patterns with minimum score difference of 15.
  • The improvement model (IM) trained with retrieval-augmented generation improves linguistic metrics such as empathy (+12.8%), formality (+18.2%), and readability compared to original comments (Table 4).
  • Human evaluators prefer model-revised responses over original user comments 83.7% of the time, with LLUMI–IM preferred 82% and GPT-5-nano preferred 85.4% (Table 5).
  • IM preserves content and style consistency comparable to GPT-5-nano, with no statistically significant difference in participant ratings (Table 5).
  • Linguistic analyses show LLUMI–GM responses have higher linguistic accommodation, semantic similarity, and diversity than baselines (Table 1).

Threat model

The paper does not model an explicit adversary. Its primary concern is privacy and data governance when deploying LLMs in sensitive mental health contexts, specifically unauthorized leakage or misuse of sensitive user-generated content. The defended scenario assumes the model is hosted in-house to avoid cloud-based privacy risks. Adversarial attacks on model outputs and data poisoning are not addressed.

Methodology — deep read

The work targets mental health peer-support interactions. It assumes that hosting models in-house keeps user data private, and that training labels come from community feedback rather than expert annotation.

Data provenance is from the Reddit mental health community subreddit r/SuicideWatch, collecting over 310,000 post-comment pairs with associated metadata including comment upvotes/downvotes, cleaned to exclude short comments (<20 chars) and author self-replies. This data is used to construct query-response pairs as well as preference pairs for supervised fine-tuning and preference optimization.
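The two cleaning rules above can be sketched as a simple filter. This is a minimal illustration, not the authors' pipeline: the `Pair` record and its field names are hypothetical, and measuring length after stripping whitespace is an assumption.

```python
from dataclasses import dataclass

MIN_COMMENT_CHARS = 20  # comments shorter than this are excluded


@dataclass
class Pair:
    # Hypothetical record layout; the paper's actual schema is not given.
    post_author: str
    comment_author: str
    post_text: str
    comment_text: str
    comment_score: int


def keep_pair(pair: Pair) -> bool:
    """Return True if the comment survives both cleaning rules."""
    # Assumption: length is measured after trimming surrounding whitespace.
    long_enough = len(pair.comment_text.strip()) >= MIN_COMMENT_CHARS
    not_self_reply = pair.comment_author != pair.post_author
    return long_enough and not_self_reply


def clean(pairs):
    """Keep only post-comment pairs that pass the filter."""
    return [p for p in pairs if keep_pair(p)]
```

The same predicate could be applied as a row filter in whatever dataframe or streaming tool holds the raw dump.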

The generation model (GM) is based on Mistral-7B-Instruct-v0.2, a 7B-parameter open-source LLM. First, supervised fine-tuning (SFT) is performed on 42,503 examples of high-scoring comments (score≥2) paired with posts, using Low-Rank Adaptation (LoRA) with the AdamW optimizer, batch size 1 with gradient accumulation. LoRA adapters target attention projection layers. Training uses 1 epoch.
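To give a sense of why LoRA is parameter-efficient here, the sketch below counts trainable adapter weights. For a frozen projection with input size d_in and output size d_out, a rank-r adapter adds r × (d_in + d_out) trainable weights. The layer shapes are assumptions taken from the public Mistral-7B configuration, not from the paper, and targeting all four attention projections is also an assumption, since the summary says only "attention projection layers".

```python
# Assumed Mistral-7B shapes (public model config, not stated in the paper).
HIDDEN = 4096
KV_DIM = 1024    # 8 key/value heads x 128 dims (grouped-query attention)
NUM_LAYERS = 32

# (d_in, d_out) for each attention projection. Adapting all four is an
# assumption made for illustration.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_trainable_params(rank: int) -> int:
    """Count LoRA weights: each adapter adds rank * (d_in + d_out)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in PROJECTIONS.values())
    return per_layer * NUM_LAYERS


print(lora_trainable_params(64))  # 54525952
```

Under these assumptions, rank 64 (the rank reported for the DPO stage) trains about 54.5 million weights, well under 1% of the roughly 7 billion frozen base weights.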

For preference alignment, a Direct Preference Optimization (DPO) approach is applied. The DPO1 dataset draws on the top 2% and bottom 15% of comments within threads, requiring a score difference of at least 15, and combines them with model-generated rejected examples for a total of 4,390 preference pairs. DPO training applies a larger LoRA configuration (rank 64), training for 1 epoch with batch size 2 and a cosine scheduler. The DPO objective is the negative log-sigmoid of the gap between the chosen and rejected responses' policy-to-reference log-probability ratios, scaled by a temperature β.
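For readers unfamiliar with DPO, the standard per-pair objective can be written in a few lines. This is a sketch of the published DPO loss, not the authors' training code. The argument names are illustrative, and `beta=0.1` is an assumed value, since the summary does not report the temperature used.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    beta=0.1 is an assumed temperature, not a value from the paper.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and reference agree, the margin is zero and the loss equals log 2. The loss falls as the policy shifts probability toward the chosen response relative to the reference.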

A second DPO stage (DPO2) further aligns by incorporating human crowdworker ratings collected for 840 responses across readability, empathy, connection, actionability, and safety. All pairwise comparisons among original community (OC), GPT, and DPO1 responses generate 675 additional preference pairs for training. Separate datasets are held out for final evaluation to prevent leakage.
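One plausible way to turn per-response ratings into pairwise preferences is sketched below. Averaging the five dimensions into one score and dropping ties are both assumptions; the summary does not say how the 675 pairs were derived from the 840 rated responses. The function name and the dictionary layout are hypothetical.

```python
from itertools import combinations
from statistics import mean


def pairs_from_ratings(post_id, ratings):
    """Build chosen/rejected pairs for one post.

    ratings maps a source name (e.g. "OC", "GPT", "DPO1") to a dict of
    dimension -> score. Sources are compared on their mean score across
    dimensions (an assumption), and ties yield no pair.
    """
    overall = {src: mean(dims.values()) for src, dims in ratings.items()}
    pairs = []
    for a, b in combinations(sorted(overall), 2):
        if overall[a] == overall[b]:
            continue  # tie: no preference signal
        chosen, rejected = (a, b) if overall[a] > overall[b] else (b, a)
        pairs.append({"post_id": post_id, "chosen": chosen, "rejected": rejected})
    return pairs
```

With three sources per post, this yields at most three pairs per post, and fewer whenever two sources tie.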

For the improvement model (IM), a retrieval-augmented generation approach is used, conditioning on the post, the initial human comment, and retrieved similar improved comments from a nearest-neighbor search (FAISS) on SentenceTransformer embeddings. GPT-5-nano generates synthetic improved comments which form 2K triples (post, initial comment, improved comment) for training.
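The retrieval step can be illustrated with a brute-force nearest-neighbor search. Note the substitution: this sketch replaces FAISS and SentenceTransformer embeddings with plain cosine similarity over hand-made vectors, so it runs without extra dependencies. The prompt layout in `build_prompt` is hypothetical, since the summary does not give the authors' template.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=2):
    """Return the k improved comments whose embeddings are closest to the query.

    index is a list of (embedding, improved_comment) tuples; a real system
    would use an approximate or exact vector index instead of sorting.
    """
    scored = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for _, text in scored[:k]]


def build_prompt(post, draft, examples):
    """Assemble the model input from post, user draft, and retrieved examples.

    The layout below is a hypothetical template for illustration only.
    """
    shots = "\n".join(f"- {ex}" for ex in examples)
    return (f"Post:\n{post}\n\nDraft reply:\n{draft}\n\n"
            f"Similar improved replies:\n{shots}\n\nImproved reply:")
```

In a deployment matching the summary, the query vector would come from embedding the post and draft, and the index would hold embeddings of the training triples' improved comments.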

The IM is fine-tuned from Mistral-7B-Instruct-v0.2 using LoRA adapters on the attention projection and embedding layers. It is trained for 2 epochs with the AdamW optimizer at a learning rate of 2e-4 and a maximum sequence length of 512.

Evaluation combines linguistic and psycholinguistic analyses (metrics like verbosity, readability, empathy, accommodation, semantic similarity) and human evaluations via the Prolific platform, where 210 participants rate responses along five dimensions.

Several baseline models are used for comparison including original Reddit community responses (OC), open-source LLaMA-3, and proprietary GPT-5-nano. Statistical tests include paired t-tests and Kruskal-Wallis H tests. Results are analyzed for significance across metrics.
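As a reminder of what the paired t-test computes, the sketch below derives the test statistic from two sets of ratings given to the same items. It is an illustration using only the standard library, not the authors' analysis script. It stops at the statistic and degrees of freedom, because the p-value requires a t distribution, which a statistics package such as SciPy provides through `scipy.stats.ttest_rel`.

```python
import math
from statistics import mean, stdev


def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 items")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1
```

The Kruskal-Wallis H test, by contrast, compares rank distributions across several independent groups and makes no normality assumption, which suits ordinal rating scales.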

The paper does not explicitly state that code or exact data splits are publicly released, though its descriptions indicate that training, preference, and evaluation sets are kept separate to avoid leakage. The Reddit source data is publicly available, but the cleaning and filtering were done internally by the authors.

Technical innovations

  • Leveraging naturalistic online community feedback (upvotes/downvotes) from Reddit mental health communities as scalable preference supervision for LLM alignment rather than requiring costly expert annotations.
  • Combining supervised fine-tuning with Direct Preference Optimization (DPO) in two stages, including iterative human evaluation feedback (DPO2), to progressively refine supportive response generation.
  • Designing a dual-component LLUMI framework with a generation model for drafting responses and an improvement model using retrieval-augmented generation to refine user-written replies.
  • Applying parameter-efficient LoRA fine-tuning techniques on a 7B open-source instruct-tuned model (Mistral-7B-Instruct-v0.2) to achieve GPT-level performance with substantially lower computational resources.

Datasets

  • Reddit r/SuicideWatch post–comment pairs — 310,000+ pairs — publicly accessible subreddit data
  • SFT training set for GM — 42,503 high-scoring query-response pairs — derived from Reddit
  • DPO1 preference pairs — 4,390 chosen-rejected pairs from Reddit and model-augmented data
  • DPO2 human evaluation preference pairs — 675 pairs from crowdworker ratings
  • IM training triples — 2,000 post-comment-improved comment triples with GPT-generated revisions
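The DPO1 chosen-rejected pairs listed above could be derived along the following lines. This is a sketch under stated assumptions: percentiles are computed within each thread, each bucket holds at least one comment, and every top comment is paired with every bottom comment that clears the score gap. The model-generated rejected examples mentioned in the methodology are not covered, and the function name is hypothetical.

```python
import math


def build_pairs(thread_comments, top_frac=0.02, bottom_frac=0.15, min_gap=15):
    """Build (chosen, rejected) text pairs from one thread.

    thread_comments is a list of (text, score) tuples for a single post.
    Within-thread percentiles and a minimum bucket size of one comment
    are assumptions made for this sketch.
    """
    ranked = sorted(thread_comments, key=lambda c: c[1], reverse=True)
    n = len(ranked)
    if n < 2:
        return []
    n_top = max(1, math.ceil(n * top_frac))
    n_bottom = max(1, math.ceil(n * bottom_frac))
    top = ranked[:n_top]
    # Start the bottom bucket after the top bucket so the two never overlap.
    bottom = ranked[max(n_top, n - n_bottom):]
    return [(c_text, r_text)
            for c_text, c_score in top
            for r_text, r_score in bottom
            if c_score - r_score >= min_gap]
```

The gap threshold does most of the filtering in small threads, where the percentile buckets collapse to a single best and a single worst comment.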

Baselines vs proposed

  • Online Community (OC) responses: baseline human peer replies
  • LLaMA-3: linguistic metrics vary but are generally lower than DPO2 (e.g., readability 1.59 vs 8.54)
  • GPT-5-nano: human eval scores ~4.3-4.6 on empathy, readability
  • LLUMI GM (DPO1): empathy = 4.36 vs OC = 2.93 and vs GPT = 4.27 (Table 2)
  • LLUMI GM (DPO2): empathy = 4.41 vs OC = 2.14 and GPT = 4.34 (Table 3)
  • IM vs OC and GPT baselines: empathy = 0.987 vs 0.978 vs 0.983 in linguistic metrics (Table 4)
  • Human preference IM vs original: 83.7% preferred improved (Table 5)
  • No significant statistical differences between LLUMI and GPT baselines on content and style preservation (Table 5)
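The percentage gains quoted in this summary can be checked against the empathy means listed above. The short calculation below applies the usual relative-change formula; only the helper name is introduced here.

```python
def relative_gain(new, baseline):
    """Percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100


# Empathy means as quoted in the list above.
dpo1_vs_oc = relative_gain(4.36, 2.93)   # Table 2: DPO1 vs community
dpo2_vs_oc = relative_gain(4.41, 2.14)   # Table 3: DPO2 vs community
dpo2_vs_gpt = relative_gain(4.41, 4.34)  # Table 3: DPO2 vs GPT-5-nano

print(round(dpo1_vs_oc, 1))   # 48.8
print(round(dpo2_vs_oc, 1))   # 106.1
print(round(dpo2_vs_gpt, 1))  # 1.6
```

The 49% and 106% empathy figures under Key findings therefore correspond to DPO1 and DPO2 measured against community responses, and on empathy DPO2 sits about 1.6% above the GPT-5-nano mean.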

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.30273.

Fig 1 (page 14).

Limitations

  • Models are trained and tested only on Reddit data, particularly the r/SuicideWatch subreddit, limiting generalizability to other mental health communities or platforms.
  • The preference signals (upvotes/downvotes), while scalable, may be noisy or biased, reflecting community dynamics rather than clinical validation of support quality.
  • Human evaluation data is collected via crowdworkers whose expertise in mental health is not established; clinical expert evaluations are absent.
  • The improvement model relies on synthetic GPT-generated improved comments as ground truth, which may propagate GPT biases and limit model creativity.
  • Code release and detailed data splits are not specified, limiting immediate reproducibility and external benchmarking.
  • Distributional robustness under adversarial inputs, shifts in community norms, or different mental health conditions is not examined.

Open questions / follow-ons

  • How well do LLUMI models generalize to other mental health online communities beyond r/SuicideWatch or different types of support scenarios?
  • What is the impact of incorporating clinical expert feedback or annotations alongside community preference signals for further alignment?
  • Can adversarial robustness or safety be improved by integrating fail-safe mechanisms to detect harmful or risky AI-generated support replies?
  • How does the real-world use of LLUMI in live peer-support systems affect community dynamics, trust, and user outcomes over time?

Why it matters for bot defense

Bot-defense and CAPTCHA practitioners can view LLUMI as an example of responsibly adapting smaller open-source LLMs, aligned with community feedback, to sensitive applications requiring privacy preservation and nuanced, safe output. For CAPTCHA systems that may incorporate text generation or dialogue assistance (e.g., support chatbots), the use of preference-based fine-tuning on naturalistic user feedback could enhance output quality and safety without relying on large proprietary models or massive expert labeling. LLUMI demonstrates a scalable method of leveraging in-domain community signals for alignment that may inspire alignment approaches in other sensitive, trust-critical domains. However, careful attention to dataset representativeness, evaluation rigor, and model auditing would be essential before deployment.

Cite

bibtex
@article{arxiv2605_30273,
  title={LLUMI: Improving LLM Writing Assistance for Mental Health Support with Online Community Feedback},
  author={Jiwon Kim and Maya Ajit and Sherry Gong and Soorya Ram Shimgekar and Dong Whi Yoo and Eshwar Chandrasekharan and Koustuv Saha},
  journal={arXiv preprint arXiv:2605.30273},
  year={2026},
  url={https://arxiv.org/abs/2605.30273}
}

Articles are CC BY 4.0 — feel free to quote with attribution