ICCDesign: An R Package for the Design and Analysis of ICC-Based Reliability Studies with Continuous Responses

Source: arXiv:2606.02059 · Published 2026-06-01 · By Ziyu Liu, Ruilin Ma, Yundan Zhang, Chenge Gao

TL;DR

The ICCDesign package consolidates the design and analysis workflow for reliability studies based on the intraclass correlation coefficient (ICC) with continuous responses. The ICC is a fundamental statistic in reliability research across the medical, psychological, and behavioral sciences. In practice, however, researchers face multiple ICC forms defined in the McGraw and Wong (1996) framework and fragmented R tooling for estimation, inference, sample size planning, and reliability evaluation, which forces them to juggle several incompatible packages. ICCDesign addresses these gaps by implementing all ten recognized ICC design combinations with form selection guidance, sample size planning based on the Zou (2012) formulas, automated reliability grading using the Koo and Li (2016) criteria, and an interactive Shiny application that supports the complete study lifecycle from data preprocessing to report generation.

The package integrates ANOVA-based point estimation, confidence intervals, and hypothesis testing for the ICC, unifying otherwise scattered functionality into a coherent interface that enforces consistent design specifications. It gives users decision-framework prompts, clear warnings for use-with-caution ICC forms, and standardized output formats. Sample size estimation supports two planning modes (lower bound assurance and CI half-width) along with inverse assurance (power) calculation. Evaluation includes automated reliability category assignment, with notifications when the confidence interval spans a key threshold. ICCDesign thereby improves analytical rigor and usability and reduces the risk of errors in ICC reliability study workflows.

Key findings

ICCDesign implements all ten ICC forms from McGraw and Wong (1996), including six standard forms and four additional use-with-caution combinations.
The package includes a four-step decision framework to guide users toward the appropriate ICC form based on study design parameters.
It provides ANOVA-based F-distribution confidence intervals and one-tailed hypothesis tests, both against the null of zero ICC and against a user-specified ICC threshold (rho0).
Sample size planning is based on Zou (2012) closed-form formulas with two modes: lower bound assurance requiring P(LowerCI≥rho0)≥gamma and CI half-width assurance, plus inverse assurance calculation.
Automated reliability grading follows Koo and Li (2016) criteria applied to the ICC confidence interval lower bound, with uncertainty notifications when the 0.75 threshold is spanned.
The package offers an interactive Shiny web app that supports data input, ICC analysis, sample size planning, and report download without coding.
Comparison to existing R packages shows ICCDesign uniquely combines full ICC type coverage, decision guidance, sample size calculation, reliability grading, preprocessing utilities, and a consistent interface.
In examples with a simulated 5-subject × 4-rater dataset showing ICC ~0.8, the package correctly classifies "Good" reliability and produces expected results for multiple ICC types.

Methodology — deep read

ICCDesign assumes a standard reliability study setting in which continuous responses are collected in a balanced design with one measurement per subject-rater cell. No adversarial threat model applies; the setting is that of a typical measurement reliability study.

Data input is a wide-format numeric matrix or dataframe with subjects as rows and raters as columns. The dedicated icc_preprocess_data() function validates data structure, ensures numeric entries, removes rows with missing values by default, and returns statistics including valid subject and rater counts.
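The preprocessing behavior described above can be sketched in a few lines. This is a minimal Python illustration of the same idea (validate a wide table, drop incomplete rows, report counts), not the package's icc_preprocess_data() implementation; the function name preprocess_wide, the returned field names, and the example values are assumptions made for this sketch.

```python
import math


def _is_missing(value):
    """Return True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def preprocess_wide(rows):
    """Validate a wide table (subjects as rows, raters as columns).

    Hypothetical helper: rows with any missing entry are dropped,
    mirroring the row-deletion default described in the text.
    """
    if not rows:
        raise ValueError("no rows supplied")
    n_raters = len(rows[0])
    if n_raters < 2:
        raise ValueError("need at least two rater columns")

    complete = []
    for i, row in enumerate(rows):
        if len(row) != n_raters:
            raise ValueError(f"row {i} has {len(row)} columns, expected {n_raters}")
        for value in row:
            numeric = isinstance(value, (int, float)) and not isinstance(value, bool)
            if value is not None and not numeric:
                raise TypeError(f"row {i} contains a non-numeric entry: {value!r}")
        if not any(_is_missing(v) for v in row):
            complete.append([float(v) for v in row])

    return {
        "data": complete,
        "n_subjects": len(complete),
        "n_raters": n_raters,
        "n_dropped": len(rows) - len(complete),
    }


# Made-up ratings: two of the four subjects have a missing value.
raw = [[9, 2, 5, 8], [6, None, 3, 2], [8, 4, 6, 8], [7.0, 1, float("nan"), 6]]
result = preprocess_wide(raw)
print(result["n_subjects"], result["n_raters"], result["n_dropped"])
```

Row deletion keeps the remaining table balanced, which is what the ANOVA formulas require, at the cost of discarding every rating for a subject with a single missing cell.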

ICC estimation follows the ANOVA variance components framework of McGraw and Wong (1996). Three models are supported: one-way random effects (ICC(1,x)) when each subject is rated by a different set of raters; two-way random effects (ICC(2,x)) when the same raters, regarded as a random sample, rate every subject; and two-way mixed effects (ICC(3,x)) when the same fixed raters rate every subject. ICCDesign exposes ten ICC form combinations distinguished by model, unit of measurement (single vs average rating), and agreement definition (absolute agreement vs consistency). It includes use-with-caution ICC combinations that differ in interpretative scope and warns users accordingly.
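In the McGraw and Wong (1996) framework, the two-way random and two-way mixed models share their computing formulas and differ in interpretation, so the ten combinations reduce to six expressions in the ANOVA mean squares. The following Python sketch computes those six expressions for a balanced table. It is an illustration written for this summary, not ICCDesign code; the function names and dictionary keys are assumptions, and the small 6 × 4 table is the example widely used to illustrate ICC computations (Shrout and Fleiss, 1979), not the package's icc_data.

```python
def mean_squares(x):
    """ANOVA mean squares for a balanced n-subjects by k-raters table."""
    n, k = len(x), len(x[0])
    grand = sum(sum(row) for row in x) / (n * k)
    row_means = [sum(row) / k for row in x]
    col_means = [sum(row[j] for row in x) / n for j in range(k)]

    ss_total = sum((v - grand) ** 2 for row in x for v in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)  # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)  # between raters
    ss_err = ss_total - ss_rows - ss_cols                   # residual

    return {
        "n": n,
        "k": k,
        "MSR": ss_rows / (n - 1),
        "MSW": (ss_total - ss_rows) / (n * (k - 1)),
        "MSC": ss_cols / (k - 1),
        "MSE": ss_err / ((n - 1) * (k - 1)),
    }


def icc_estimates(x):
    """Six point-estimate formulas covering the ten model/unit/definition forms."""
    ms = mean_squares(x)
    n, k = ms["n"], ms["k"]
    msr, msw, msc, mse = ms["MSR"], ms["MSW"], ms["MSC"], ms["MSE"]
    return {
        # One-way random effects
        "ICC(1)": (msr - msw) / (msr + (k - 1) * msw),
        "ICC(k)": (msr - msw) / msr,
        # Two-way models, consistency
        "ICC(C,1)": (msr - mse) / (msr + (k - 1) * mse),
        "ICC(C,k)": (msr - mse) / msr,
        # Two-way models, absolute agreement
        "ICC(A,1)": (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n),
        "ICC(A,k)": (msr - mse) / (msr + (msc - mse) / n),
    }


ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
for name, value in icc_estimates(ratings).items():
    print(f"{name}: {value:.3f}")
```

On this table the consistency forms come out far higher than the absolute agreement forms because the raters differ strongly in their mean level, which is exactly the distinction the form choice is meant to capture.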

Point estimates are computed from the mean squares of the ANOVA decomposition using closed-form formulas for each ICC type (see Section 2.3). Confidence intervals are computed via F-distribution based methods: Type 1 intervals for most forms, using direct F ratios with degrees of freedom taken from the corresponding mean squares, and Type 2 intervals, using a Satterthwaite approximation, for the two-way absolute agreement forms. ICCDesign also supports one-tailed F-tests of the null ICC = 0 and, optionally, of a threshold null H0: ICC = rho0. F statistics and p-values are reported.
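As one concrete instance of the F-based inference described above, the one-way single-rating form has the following textbook interval and threshold test (McGraw and Wong, 1996). Here n is the number of subjects, k the number of raters, MSR and MSW the between-subject and within-subject mean squares, and F_p(a, b) the p-th quantile of the F distribution with a and b degrees of freedom. The notation is chosen for this sketch and is not taken from the package, whose exact implementation may differ.

```latex
\begin{aligned}
F_{\mathrm{obs}} &= \frac{MSR}{MSW}, \\
F_L &= \frac{F_{\mathrm{obs}}}{F_{1-\alpha/2}\bigl(n-1,\; n(k-1)\bigr)}, \qquad
F_U = F_{\mathrm{obs}} \cdot F_{1-\alpha/2}\bigl(n(k-1),\; n-1\bigr), \\
\frac{F_L - 1}{F_L + k - 1} &\le \rho \le \frac{F_U - 1}{F_U + k - 1}, \\
F_{\rho_0} &= F_{\mathrm{obs}} \cdot \frac{1-\rho_0}{1+(k-1)\rho_0}
  \sim F\bigl(n-1,\; n(k-1)\bigr) \quad \text{under } H_0:\ \rho = \rho_0 .
\end{aligned}
```

Setting rho0 = 0 reduces the last line to the ordinary F-test of zero ICC, and the upper-tail p-value is the probability that an F(n-1, n(k-1)) variable exceeds the observed statistic.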

Sample size planning employs Zou (2012) closed-form formulas supporting two planning modes: (1) assurance that the lower bound of the ICC confidence interval exceeds a threshold rho0 with probability at least gamma; (2) assurance that the confidence interval half-width does not exceed a specified value omega with probability gamma. An inverse assurance calculation estimates achievable assurance given a fixed sample size.
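To give a feel for how the planning inputs interact, the sketch below uses the familiar large-sample variance approximation for the one-way ICC estimator to find the number of subjects at which the expected CI half-width reaches omega. This is deliberately not the Zou (2012) formula: it omits the assurance probability gamma, so an assurance-based rule would call for more subjects. The function names and the planning values in the example are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def approx_var_icc(rho, k, n):
    """Large-sample variance of the one-way ICC estimator, balanced design."""
    return 2 * (1 - rho) ** 2 * (1 + (k - 1) * rho) ** 2 / (k * (k - 1) * (n - 1))


def n_for_half_width(rho, k, omega, conf_level=0.95):
    """Smallest n with z * sqrt(approx_var_icc) <= omega (no assurance term)."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    n = 1 + 2 * z ** 2 * (1 - rho) ** 2 * (1 + (k - 1) * rho) ** 2 / (
        k * (k - 1) * omega ** 2
    )
    return math.ceil(n)


# Hypothetical planning values: anticipated ICC 0.8, 4 raters, half-width 0.1.
n_subjects = n_for_half_width(rho=0.8, k=4, omega=0.1)
print(n_subjects)
```

The formula shows the usual trade-offs: the required number of subjects grows with the inverse square of the target half-width and shrinks as raters are added.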

ICCDesign's icc_calc() wraps data preprocessing, parameter validation (the four-step decision tree based on same raters, rater effect, rating unit, and agreement type), ICC computation, hypothesis testing, reliability evaluation (including Koo and Li grading and uncertainty messaging), and report generation. Verbose mode produces user alerts for special ICC cases. The interactive Shiny app runs this workflow with GUI controls and visualizations.
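The four-step decision tree can be pictured as a small mapping from design answers to an ICC form. The Python sketch below is a hypothetical rendering, not the package's logic: the function name, argument names, and returned fields are invented for illustration. In particular, the rule for the caution flag (two-way random with consistency, two-way mixed with absolute agreement) is an assumption inferred from the six standard forms plus four use-with-caution combinations mentioned in this summary.

```python
from itertools import product


def select_icc_form(same_raters, raters_random, average_rating, absolute_agreement):
    """Map four design answers to a model, unit, definition, and caution flag."""
    unit = "average" if average_rating else "single"

    # Step 1: different raters per subject means a one-way model; rater
    # effects cannot be separated, so only absolute agreement is defined.
    if not same_raters:
        return {"model": "one-way random", "unit": unit,
                "definition": "absolute agreement", "caution": False}

    definition = "absolute agreement" if absolute_agreement else "consistency"

    # Step 2: the same raters for all subjects, treated as random or fixed.
    if raters_random:
        # Assumed caution rule: consistency under a random-rater model.
        return {"model": "two-way random", "unit": unit,
                "definition": definition, "caution": not absolute_agreement}

    # Assumed caution rule: absolute agreement under a fixed-rater model.
    return {"model": "two-way mixed", "unit": unit,
            "definition": definition, "caution": absolute_agreement}


# Enumerate every answer combination and collect the distinct forms.
forms = {}
for answers in product([False, True], repeat=4):
    form = select_icc_form(*answers)
    forms[(form["model"], form["unit"], form["definition"])] = form["caution"]

print(len(forms), "distinct forms,", sum(forms.values()), "flagged for caution")
```

Under these assumptions the enumeration reproduces the counts stated in the summary, which is a useful sanity check on any implementation of the decision tree.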

For example, running icc_calc() on a 5×4 simulated dataset with ICC ~0.8 returns the ICC point estimate, 95% CI, F-test statistics, ANOVA components, and a "Good" reliability grade. The sample size functions compute required subject counts from target reliability thresholds and CI precision levels. ICCDesign v0.1.0 requires R ≥4.1.0, with core dependencies limited to the base stats package and shiny. The package source code and Shiny app are publicly available on GitHub for reproducibility.
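The grading step can be sketched as a threshold lookup using the Koo and Li (2016) cut-offs (below 0.5 poor, 0.5 to 0.75 moderate, 0.75 to 0.9 good, above 0.9 excellent). The Python below is an illustrative assumption, not the package's code: the function names, the handling of values exactly on a boundary, and the example interval are invented for this sketch. Following the summary, the grade is taken from the lower confidence bound and a flag is raised when the interval spans 0.75.

```python
def koo_li_grade(value):
    """Reliability category for an ICC value (boundary handling is assumed)."""
    if value < 0.5:
        return "Poor"
    if value < 0.75:
        return "Moderate"
    if value < 0.9:
        return "Good"
    return "Excellent"


def evaluate_reliability(lower, upper, threshold=0.75):
    """Grade the CI lower bound and flag an interval that spans the threshold."""
    if lower > upper:
        raise ValueError("lower bound exceeds upper bound")
    return {
        "grade": koo_li_grade(lower),
        "spans_threshold": lower < threshold <= upper,
    }


# Made-up interval: the lower bound is moderate, the upper bound is good.
print(evaluate_reliability(0.60, 0.90))
```

Grading on the lower bound is the conservative choice: a study earns a category only if the whole interval supports it, and the flag tells the reader when more subjects might move the grade.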

While ICCDesign provides a unified analytic workflow, it assumes balanced designs with one continuous rating per subject-rater cell and the normality that underlies ANOVA, and it handles missing data only by row deletion. Hypothesis tests are one-sided (upper tail) only. The code release supports reproducibility, but the examples use a synthetic dataset rather than extensive real-world data.

Technical innovations

A unified R package integrating point estimation, ANOVA-based confidence intervals, hypothesis testing, sample size planning, and reliability grading for all ten ICC design forms defined by McGraw and Wong (1996), addressing fragmentation in existing tools.
A four-step decision framework operationalized within the package to guide users toward the correct ICC form selection based on study design parameters, with user notifications for use-with-caution ICC combinations.
Implementation of two modes of sample size planning and inverse assurance calculations based on Zou (2012) closed-form formulas, providing practical tools for reliability study design.
Automated reliability grading using Koo and Li (2016) thresholds applied to ICC confidence interval lower bounds, including uncertainty notifications when key reliability thresholds are spanned.
An interactive Shiny web application that encapsulates complex ICC reliability study workflows with user-friendly data input, analysis controls, visualization, and report generation without requiring coding skills.

Datasets

icc_data — 5 subjects × 4 raters simulated matrix — included as built-in example dataset in ICCDesign

Baselines vs proposed

irr package: partial ICC type coverage; ICCDesign covers full 10 ICC types
psych package: partial ICC coverage; ICCDesign supports use-with-caution combinations with warnings
ICC.Sample.Size: supports single sample size calculation mode; ICCDesign supports two modes plus inverse assurance
irrNA package: partial support for incomplete data, no design-stage functionality; ICCDesign provides dedicated preprocessing and a design-stage interface
ICCDesign's ANOVA-based ICC estimates closely match those of irr and psych on commonly used ICC forms, subject to differences in preprocessing and missing data handling

Limitations

Assumes balanced data with one continuous rating per subject-rater cell; no support for multiple or missing ratings other than row deletion.
ANOVA-based confidence intervals rely on normality and balanced design assumptions; validity under violations not evaluated.
Hypothesis testing supports only one-tailed upper tests; no two-sided or lower-tail tests implemented.
Use-with-caution ICC types have restricted interpretability and require user awareness; the package provides warnings but cannot fully guarantee correct use.
The sample size formulas are not specific to an ICC form and do not account for form-specific nuances; inverse assurance calculations rely on approximations.
Limited empirical validation on real-world datasets; built-in example is a small simulated dataset only.

Open questions / follow-ons

How do missing values and unbalanced designs affect ICC estimation, inference, and sample size planning within the ICCDesign framework?
What are the impacts of deviations from normality on ANOVA-based confidence intervals and hypothesis testing for ICCs?
Can the decision framework be extended to automate recommendations for study design adjustments in longitudinal or multi-level ICC studies?
How can ICCDesign be adapted or extended to incorporate replicated measurements per subject-rater cell or handle hierarchical reliability structures?

Why it matters for bot defense

Although ICCDesign targets reliability analysis for continuous measurements and has nothing specific to bot defense or CAPTCHA, it offers relevant lessons for practitioners designing user behavior and human validation experiments. Understanding the different ICC forms and planning sample sizes appropriately can help CAPTCHA researchers rigorously assess the reliability of continuous behavioral metrics or the agreement among annotators. ICCDesign’s unified workflow and built-in guidance reduce the risk of analysis errors in reliability studies, and the same approach could be adapted to bot-detection evaluation where inter-rater or test-retest consistency is critical. Its emphasis on explicit model choice, comprehensive inference, and planned assurance levels models practices that can make statistical validation in bot detection and CAPTCHA-related reliability assessments more rigorous. Direct application requires adaptation, however, because CAPTCHA defenses often involve categorical or discrete data rather than continuous responses.

Cite

@article{arxiv2606_02059,
  title={ICCDesign: An R Package for the Design and Analysis of ICC-Based Reliability Studies with Continuous Responses},
  author={Ziyu Liu and Ruilin Ma and Yundan Zhang and Chenge Gao},
  journal={arXiv preprint arXiv:2606.02059},
  year={2026},
  url={https://arxiv.org/abs/2606.02059}
}
