Learning Critical Testing Literacy Through Puzzles: an Experience Report

Source: arXiv:2606.20129 · Published 2026-06-18 · By Niels Doorn, Bart Th. Knaack, Tanja E. J. Vos, Beatriz Marín

TL;DR

This experience report explores the use of puzzle-based learning activities to develop Critical Testing Literacy (CTL) in software testing education. Recognizing that software testing is difficult to teach and that learners struggle with the sensemaking processes involved, the authors leverage a previously developed pedagogical framework called P4TEST. This framework emphasizes cognitive skills of a critical tester such as experimentation, modelling, communication, and reflection-in-action. Through thirteen workshops involving diverse participant groups including students, teaching professionals, expert testers, and primary school pupils, they studied how puzzles elicited testing-related cognitive behaviors. The puzzles, drawn from a Body of Knowledge specifically designed to evoke testing skills, were domain-agnostic and had multiple valid solutions to promote exploration and critical thinking rather than rote answers. Observations indicated that participants consistently engaged in experimentation, yet differences emerged: students tended to converge quickly on solutions, while professionals kept exploring. Emotions affected behavior but were difficult to surface via written reflections alone; think-aloud sessions helped reveal immediate reasoning processes. The authors highlight that the full learning sequence of puzzle-solving, debriefing, and reflection is essential for fostering CTL, more so than the puzzles themselves. They also developed an open-source web application to support future puzzle-based workshops with analytics. Overall, the work provides valuable qualitative insights into how puzzle-based activities support teaching of complex cognitive testing skills in varied educational contexts.

Key findings

Participants reported high levels of Experimenting with median Likert scores of 4 across puzzles for both students and professionals.
Students tended to converge quickly on solutions and move on, while professionals showed sustained exploration and reflection.
Emotions scored relatively low (median ≤ 3) on self-reports, suggesting that participants found it difficult to articulate in written reflections how their feelings influenced their work.
Think-aloud sessions revealed immediate reasoning and cognitive strategies not captured by written reflection alone.
Solving either the Nine-dots or the ABC-connect puzzle primed participants for the other, suggesting that conceptual priming influenced confidence and strategy.
Primary school pupils showed novel problem-solving behaviors, such as changing their physical perspective and manipulating the paper.
Workshop logistics (room setup, availability of pen and paper) noticeably affected participant engagement and effectiveness.
Debrief and reflection periods were identified as crucial for transforming puzzle activity into meaningful learning experiences.

Threat model

n/a – This is an educational research study focusing on cognitive and pedagogical processes in teaching software testing using puzzles. There is no adversarial or security threat modeled.

Methodology — deep read

The researchers conducted thirteen workshops between December 2024 and December 2025 in diverse settings, including universities, conferences, primary schools, and professional meetups. Participant groups ranged from undergraduate CS students to professional testers and young pupils. The study has no adversarial threat model; its focus was on understanding cognitive engagement during puzzle-solving. It emphasized naturalistic educational contexts and voluntary participation without grade impact.

Workshops followed a semi-structured format: introduction (~2 min), puzzle solving in groups or individually (~10-15 min), plenary debrief (~5 min), and reflection via workbook (~5 min). Six domain-agnostic puzzles designed to evoke key tester cognition skills were used: Next-line, Nine-dots, ABC-connect, Three sons, Weird symbols, and the Dice puzzle.

The data included qualitative observations, participant self-reports via workbooks, Likert-scale reflections on six aspects (Experimenting, Knowledge Curation, Communicating, Disposition, Experience, Emotions), open-ended written reflections, and recorded think-aloud protocols. Two think-aloud sessions with student pairs captured real-time reasoning in detail. Reflections were coded deductively using the P4TEST competencies and inductively to capture emergent themes. Likert data was analyzed descriptively (medians and distributions), while qualitative reflections and think-aloud transcripts underwent reflexive thematic analysis to identify recurring cognitive and emotional patterns. No formal statistical hypothesis testing was reported. Open-source digital tools to replace paper workbooks were developed post hoc. The data and codebooks are shared in anonymized form in accordance with GDPR.

One concrete example: in a student workshop, participants worked on the widely known Nine-dots puzzle to recognize how implicit constraints block solutions. They externalized their thinking on paper, shifted strategy to draw lines extending outside the perceived square, and then reflected in workbook entries and the group debrief.
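
The descriptive treatment of the Likert data (a median and a distribution per reflection aspect) could look roughly like the Python sketch below. It is illustrative only and is not the authors' analysis code: the 5-point scale, the `summarize_likert` helper, and the input layout are assumptions, and the sample responses are invented rather than taken from the published dataset.

```python
from collections import Counter
from statistics import median

# The six reflection aspects named in the methodology.
ASPECTS = ["Experimenting", "Knowledge Curation", "Communicating",
           "Disposition", "Experience", "Emotions"]

def summarize_likert(responses, scale=range(1, 6)):
    """Return {aspect: (median, distribution)} for Likert responses.

    `responses` is a list of dicts mapping aspect name -> score.
    The 1..5 scale is an assumption; the summary does not state it.
    Aspects without any answers are left out of the result.
    """
    summary = {}
    for aspect in ASPECTS:
        scores = [r[aspect] for r in responses if aspect in r]
        if not scores:
            continue
        counts = Counter(scores)
        distribution = {point: counts.get(point, 0) for point in scale}
        summary[aspect] = (median(scores), distribution)
    return summary

# Invented responses for illustration, NOT data from the paper.
example = [
    {"Experimenting": 4, "Emotions": 2},
    {"Experimenting": 5, "Emotions": 3},
    {"Experimenting": 4, "Emotions": 3},
]
summary = summarize_likert(example)
print(summary["Experimenting"])
```

Reporting the full distribution next to the median matters for ordinal data like this, since two groups can share a median while answering very differently.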

Technical innovations

Development and validation of a Body of Knowledge (BoK) of domain-agnostic puzzles designed to evoke cognitive skills essential to software testing.
Application of the previously developed P4TEST pedagogical framework, which structures teaching around experimentation, modelling, communication, disposition, and reflection-in-action.
Integration of mixed qualitative data collection methods (workbooks with structured reflections plus think-aloud protocols) to understand learner cognition during puzzle-solving.
Open-source web application with built-in analytics for delivering and customizing puzzle workshops digitally, reducing reliance on paper.

Datasets

Anonymized workbook Likert scores and open-ended reflection responses collected across 13 workshops — ~20-40 participants per session on average — shared publicly in research dataset [15].
Think-aloud audio and video recordings (2 student pairs): the recordings are unpublished, but anonymized transcripts and artifacts are shared.

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.20129.

Fig 1: P4TEST Pedagogical framework to teach software testing [3]

Fig 2: New solutions found by the primary school children

Fig 3 (page 5).

Limitations

The study is primarily qualitative and exploratory, limiting quantitative generalizability.
Participant samples are relatively small and convenience-based, with heterogeneous backgrounds and voluntary participation.
No controlled experimental comparisons to alternative teaching methods or puzzles were conducted.
Emotions were difficult to measure and articulate via self-report, suggesting further instrumentation is needed.
Limited demographic data was collected on professional participants, constraining subgroup analysis.
Some puzzles (e.g. Three sons) were challenging for younger participants, potentially limiting broad applicability.

Open questions / follow-ons

How can puzzle sequence design (e.g., interleaving versus blocking by puzzle type) be optimized to maximize learning?
What specific emotional states influence critical testing literacy, and how can they be better elicited and supported during learning activities?
Can the P4TEST framework and puzzle-based approach be quantitatively validated at scale with controlled experiments?
How can digital tools further support real-time analytics to adaptively personalize puzzle difficulty and reflection prompts?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners who design puzzle challenges to distinguish humans from automation, this paper offers insight into how puzzles engage complex cognitive processes such as experimentation, hypothesis formation, and reflection-in-action, rather than straightforward problem-solving. It highlights that cognitive and emotional engagement across a puzzle sequence, together with guided debriefing and reflection, is critical to developing the tester's problem-solving skills. CAPTCHA designers could consider puzzles with multiple valid solutions that discourage fixation on a single answer and encourage lateral thinking and skepticism, traits that may be harder for bots to mimic. The observed differences in puzzle engagement between novices and experts may inform adaptive challenge levels. The difficulty of capturing emotional states via self-report also underlines how hard it is to measure human engagement for bot detection. Finally, the emphasis on designing the whole solving-plus-reflection sequence rather than the puzzles alone suggests that multi-stage or meta-cognitive interaction designs for CAPTCHAs could increase robustness against automation by requiring iterative reasoning and active sensemaking.
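
To make the idea of a challenge with multiple valid solutions concrete, here is a minimal Python sketch of a checker for the Nine-dots puzzle (connect nine dots on a 3x3 grid with at most four straight strokes without lifting the pen). It validates a submitted path against the puzzle's constraints instead of comparing it with one stored answer. This is an illustration and not something the paper provides: `on_segment`, `is_valid_nine_dots`, the integer coordinate system, and the path format are all assumptions.

```python
from itertools import product

# The nine dots on a 3x3 integer grid (assumed coordinate system).
DOTS = set(product(range(3), repeat=2))

def on_segment(p, a, b):
    """True if integer point p lies on the closed segment from a to b."""
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    if cross != 0:  # p is not collinear with a and b
        return False
    return (min(a[0], b[0]) <= p[0] <= max(a[0], b[0])
            and min(a[1], b[1]) <= p[1] <= max(a[1], b[1]))

def is_valid_nine_dots(path, max_segments=4):
    """Check a pen path given as a list of (x, y) vertices.

    Consecutive vertices form one continuous polyline, which models
    drawing without lifting the pen. Vertices may lie outside the grid.
    """
    segments = list(zip(path, path[1:]))
    if not 1 <= len(segments) <= max_segments:
        return False
    # Every dot must lie on at least one stroke.
    return all(any(on_segment(dot, a, b) for a, b in segments)
               for dot in DOTS)

# Two different paths that both leave the 3x3 square are accepted.
solution_a = [(0, 0), (0, 3), (3, 0), (0, 0), (2, 2)]
solution_b = [(0, 0), (3, 0), (0, 3), (0, 0), (2, 2)]
# Tracing the square's outline stays "inside the box" and misses the centre dot.
boxed_in = [(0, 0), (2, 0), (2, 2), (0, 2), (0, 0)]

print(is_valid_nine_dots(solution_a),
      is_valid_nine_dots(solution_b),
      is_valid_nine_dots(boxed_in))
```

Checking constraints rather than an answer key means any path that satisfies the rules passes, including ones the designer never anticipated. The exact collinearity test relies on integer coordinates; real pointer input would need a tolerance.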

Cite

@article{arxiv2606_20129,
  title={Learning Critical Testing Literacy Through Puzzles: an Experience Report},
  author={Niels Doorn and Bart Th. Knaack and Tanja E. J. Vos and Beatriz Marín},
  journal={arXiv preprint arXiv:2606.20129},
  year={2026},
  url={https://arxiv.org/abs/2606.20129}
}
