Machine Learning for Coding Retail Product Names to Consumer-Price Categories: A Rule-plus-Bag-of-Words Pipeline with Reliability-Weighted Human-in-the-Loop Labeling

Source: arXiv:2606.02004 · Published 2026-06-01 · By Vladimir Beskorovainyi

TL;DR

This paper addresses the practical challenge faced by statistical agencies of mapping noisy, short, and inconsistently formatted retail product names from transaction and receipt data to standardized consumption categories (e.g., UN COICOP) for consumer price index (CPI) compilation. The author proposes a lightweight, scalable pipeline combining deterministic rule-based pre-classification using tries with per-category binary confirmation classifiers trained on bag-of-words features. To generate labeled data at scale while maintaining quality, a human-in-the-loop weighted consensus labeling protocol is introduced, which dynamically updates annotator reliability scores and folds the model itself into voting for continual fine-tuning. Experimental evaluation on a representative product category (granulated sugar) confirms that simple linear bag-of-words models nearly saturate the task (F1≈0.99), with no gain from adding word order or complex sequence models. Furthermore, a Monte Carlo simulation shows that the proposed additive reliability-weighted voting marginally outperforms majority vote but is inferior to the classic Dawid–Skene latent-ability model. The paper concludes that for official statistics at terabyte scale, simple models and transparent human-in-the-loop protocols can suffice, but points to latent ability estimators and coverage evaluation as important future improvements.

Key findings

A linear classifier over unigram bag-of-words features achieves near-perfect F1 = 0.999 ± 0.003 on a controlled task of confirming product category (granulated sugar) using 711 positive and 978 hard negative examples (Table 1).
Adding explicit word order (bi- and tri-gram features) does not improve F1 beyond 0.999 ± 0.003; character n-gram models perform worse at F1 = 0.972 ± 0.011.
About 67 labeled examples (5% of training split) suffice to reach F1 ≈0.986, showing high data efficiency for the confirmation model.
MLP models with one hidden layer nearly match linear classifiers (F1 = 0.988 ± 0.005) but require more training time (0.7s vs 0.03s).
In a Monte Carlo simulation with 12 annotators of heterogeneous accuracy and 1,500 items, Dawid–Skene label aggregation achieves 0.928–0.960 accuracy vs. 0.847–0.914 for additive reliability-weighted voting and 0.794–0.886 for majority vote (Table 2).
The additive reliability-weighted vote saturates the weights of all better-than-chance annotators, so it degenerates towards majority vote and gains little accuracy over it; this suggests latent-ability models should be preferred where label quality is paramount.
The per-category binary confirmation design allows retraining and deployment of individual classifiers independently, facilitating continual fine-tuning after labeling threshold 𝜃c is reached (Section 5.3 and Algorithm 1).
The trie-based rule pre-classifier sharply reduces model load by filtering most items deterministically, leaving the confirmation model to focus on residual ambiguous or challenging examples (Section 5.2).
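
The trie-based pre-classification described in the findings above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the class name `PhraseTrie`, the function `pre_classify`, the example phrases, and the rule that a matched stop-phrase simply cancels its category are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set


class PhraseTrie:
    """Token-level prefix tree mapping phrases to category labels (hypothetical)."""

    def __init__(self) -> None:
        self.children: Dict[str, "PhraseTrie"] = {}
        self.categories: Set[str] = set()  # categories whose phrase ends at this node

    def add(self, phrase: str, category: str) -> None:
        node = self
        for token in phrase.lower().split():
            node = node.children.setdefault(token, PhraseTrie())
        node.categories.add(category)

    def matches(self, tokens: List[str]) -> Set[str]:
        """Categories of every stored phrase that occurs contiguously in tokens."""
        found: Set[str] = set()
        for start in range(len(tokens)):
            node = self
            for token in tokens[start:]:
                node = node.children.get(token)
                if node is None:
                    break
                found |= node.categories
        return found


def pre_classify(name: str, key_trie: PhraseTrie, stop_trie: PhraseTrie) -> Optional[str]:
    """Return a category only when exactly one survives the stop-phrase exclusions."""
    tokens = name.lower().split()
    candidates = key_trie.matches(tokens) - stop_trie.matches(tokens)
    if len(candidates) == 1:
        return next(iter(candidates))
    return None  # unmatched or ambiguous: left for the later stages


# Illustrative rules only; real key-phrase and stop-phrase sets are not published.
key_trie = PhraseTrie()
key_trie.add("granulated sugar", "sugar")
key_trie.add("sugar", "sugar")
key_trie.add("cola", "soft_drinks")
stop_trie = PhraseTrie()
stop_trie.add("sugar free", "sugar")

print(pre_classify("White granulated sugar 1kg", key_trie, stop_trie))  # sugar
print(pre_classify("Cola sugar free 330ml", key_trie, stop_trie))       # soft_drinks
print(pre_classify("Cola with sugar", key_trie, stop_trie))             # None
```

Matching on whole tokens rather than characters keeps lookups proportional to the length of the item name, which is short for receipt data, and keeps every decision traceable to a specific phrase.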

Threat model

The adversary is implicit: the threats considered are noisy or ambiguous product descriptions, annotators who may submit invalid labels (possibly adversarially), and errors in transaction data. The system assumes annotators vary in reliability but does not explicitly model targeted evasion or hostile attacks on the pipeline. It also does not assume that adversaries have access to confidential data or can insert malicious data at scale.

Methodology — deep read

The paper treats consumer-price product name classification as a three-stage pipeline.

First, short, noisy item names from transaction and receipt line items undergo domain-specific normalization, which removes irrelevant tokens but preserves price-relevant attributes (brand, unit, quantity). The normalized names are then tokenized, with attempts to split concatenated strings; unrecognized tokens are quarantined for human review.

Second, a rule-based pre-classifier uses a prefix tree (trie) to match tokenized item names efficiently against category-specific key-phrase sets (positive triggers) and stop-phrase sets (exclusions). Items uniquely matched by the trie are deterministically assigned the category, reducing the workload for learned models.

Third, items that are unmatched or ambiguously matched pass to per-category binary confirmation models. Each category c has an independent learned model NNc that receives a tokenized item tentatively assigned to c and outputs a binary valid/reject decision. These models are small, shallow classifiers over a binary bag-of-words vector that encodes the presence of tokens from a limited vocabulary fitted only on training data (to avoid leakage). They are trained by standard supervised learning, as logistic regression or one-hidden-layer MLPs, using cross-entropy loss and optimizers such as Adam for a few epochs. The architecture reflects the operational constraint of millions of items processed daily: simple, fast, order-insensitive models are favored because product names are very short and bag-of-words signals suffice.

The ground-truth labels for training come from a human-in-the-loop labeling protocol. Annotators review candidate items and vote valid/reject per category, guided by annotation instructions. Votes are aggregated by a reliability-weighted majority vote, with reliability weights updated dynamically to reflect annotator consistency. The classification model itself is folded in as an additional voter, and items where the weighted vote is exactly tied are deferred to human adjudication. This feedback loop enables continual fine-tuning: confirmation models are retrained once enough new labeled data accumulates. Quality control operates throughout, including input data validation, label coverage statistics, and price anomaly detection (flagged items are sent to retraining as hard negatives).

The evaluation focuses on a single representative category (granulated sugar) with 711 genuine positive and 978 hard negative examples, split 80/20 for training and testing and repeated over 5 random seeds. Reported metrics are accuracy, precision, recall, and F1, with standard deviations. For label aggregation, a Monte Carlo simulation with 1,500 items and 12 annotators compares the label-recovery accuracy of majority, additive reliability-weighted, and Dawid–Skene estimators over several numbers of votes per item. The experiments do not evaluate more complex neural sequence models directly; n-gram logistic regression serves as a proxy for order sensitivity. Reproducibility is supported by a detailed description of the methodology and protocol, but no source code or datasets are released because of confidentiality. Illustrative aggregate results are presented to enable future replication on public data.
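
As an illustration of the confirmation stage, the following sketch trains a logistic regression over binary bag-of-words features in plain Python. It is a toy under stated assumptions, not the paper's code: the function names, the learning rate, the epoch count, plain SGD in place of Adam, and the eight example product names are all invented for illustration. The one detail taken from the description above is that the vocabulary is fitted on training names only.

```python
import math
from typing import Dict, List, Tuple


def fit_vocabulary(train_names: List[str]) -> Dict[str, int]:
    """Build the token-to-index map from training names only (avoids leakage)."""
    vocab: Dict[str, int] = {}
    for name in train_names:
        for token in name.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(name: str, vocab: Dict[str, int]) -> List[int]:
    """Sparse binary bag-of-words: indices of known tokens present in the name."""
    return sorted({vocab[t] for t in name.lower().split() if t in vocab})


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(examples: List[Tuple[str, int]], vocab: Dict[str, int],
          epochs: int = 200, lr: float = 0.5) -> Tuple[List[float], float]:
    """Per-example gradient descent on the cross-entropy loss."""
    weights = [0.0] * len(vocab)
    bias = 0.0
    encoded = [(encode(name, vocab), label) for name, label in examples]
    for _ in range(epochs):
        for active, label in encoded:
            p = sigmoid(bias + sum(weights[i] for i in active))
            grad = p - label  # derivative of cross-entropy w.r.t. the logit
            bias -= lr * grad
            for i in active:
                weights[i] -= lr * grad
    return weights, bias


def confirm(name: str, vocab: Dict[str, int], weights: List[float], bias: float) -> bool:
    """Binary valid/reject decision for one tentatively assigned item."""
    logit = bias + sum(weights[i] for i in encode(name, vocab))
    return sigmoid(logit) >= 0.5


# Hypothetical training data: 1 = genuine granulated sugar, 0 = hard negative.
train_set = [
    ("granulated sugar 1kg", 1), ("white granulated sugar 5kg", 1),
    ("sugar granulated 900g", 1), ("fine granulated sugar 2kg", 1),
    ("sugar free cola 330ml", 0), ("chewing gum sugar free", 0),
    ("cane sugar syrup 250ml", 0), ("sugar cookies 200g", 0),
]
vocab = fit_vocabulary([name for name, _ in train_set])
weights, bias = train(train_set, vocab)

# Token order is irrelevant to the features, so reordered names get the same decision.
print(confirm("1kg sugar granulated", vocab, weights, bias))   # True
print(confirm("cola 330ml sugar free", vocab, weights, bias))  # False
```

Because the feature vector records only token presence, two names with the same tokens in a different order are indistinguishable to the model, which is the order-insensitivity the paper argues is sufficient for very short product names.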

Technical innovations

Hybrid product identification pipeline combining a trie-based key-phrase/stop-phrase pre-classifier with per-category binary bag-of-words confirmation models for scalable, transparent classification of noisy product names.
A reliability-weighted human-in-the-loop labeling protocol with dynamic online updating of annotator reliability weights, folded together with the model’s output into a unified weighted consensus for generating training labels and continual fine-tuning.
Framing the per-category confirmation task as a binary valid/reject decision that aligns machine classifier outputs directly with human assessor judgments, enabling interchangeable aggregation and continual model refinement.
Operational integration of price-level statistical anomaly detection feeding back into labeling and retraining to improve model quality and downstream price quote reliability.
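
The weighted consensus listed above can be sketched as follows. The summary does not give the exact update rule, so everything numeric here is an assumption: the step size `STEP`, the clipping bounds `W_MIN` and `W_MAX`, the starting weights, and the choice to reward agreement with the consensus label. The model is treated as one more voter under the hypothetical key `"model"`.

```python
from typing import Dict, Optional

# Hypothetical constants; the paper's actual values are not given in this summary.
STEP = 0.125
W_MIN, W_MAX = 0.0, 1.0


def consensus(votes: Dict[str, bool], weights: Dict[str, float]) -> Optional[bool]:
    """Weighted valid/reject vote; None signals an exact tie for human adjudication."""
    score = sum(weights[voter] * (1 if is_valid else -1)
                for voter, is_valid in votes.items())
    if score == 0:
        return None
    return score > 0


def update_weights(votes: Dict[str, bool], weights: Dict[str, float],
                   label: bool) -> Dict[str, float]:
    """Additive update: agreeing voters gain STEP, disagreeing voters lose STEP."""
    updated = dict(weights)
    for voter, is_valid in votes.items():
        delta = STEP if is_valid == label else -STEP
        updated[voter] = min(W_MAX, max(W_MIN, updated[voter] + delta))
    return updated


weights = {"ann_a": 0.5, "ann_b": 0.5, "model": 0.5}
votes = {"ann_a": True, "ann_b": False, "model": True}

label = consensus(votes, weights)
print(label)  # True
weights = update_weights(votes, weights, label)
print(weights)  # {'ann_a': 0.625, 'ann_b': 0.375, 'model': 0.625}

# Two equally weighted voters who disagree produce an exact tie.
print(consensus({"ann_a": True, "model": False}, weights))  # None

# Repeated agreement drives a weight to the cap.
saturated = weights
for _ in range(10):
    saturated = update_weights({"ann_a": True}, saturated, True)
print(saturated["ann_a"])  # 1.0
```

The last loop shows the saturation effect reported in the paper: under an additive rule with a cap, any voter who agrees with the consensus more often than not ends up at the same maximum weight, at which point the scheme behaves like an unweighted majority among those voters.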

Datasets

Illustrative dataset on granulated sugar — 1,689 items (711 positive, 978 hard negative) — proprietary transaction/receipt data, not publicly released

Baselines vs proposed

Unigram bag-of-words + Logistic Regression: F1 = 0.999 ± 0.003 vs MLP (256-unit hidden layer): F1 = 0.988 ± 0.005
Word 1-2-gram + Logistic Regression: F1 = 0.999 ± 0.003 vs Word 1-3-gram + Logistic Regression: F1 = 0.999 ± 0.003 (no significant improvement)
Char n-gram (3-5) + Logistic Regression: F1 = 0.972 ± 0.011 (worse than unigram models)
Consensus label-recovery accuracy: Majority vote k=7: 0.886 ± 0.057; Reliability-weighted vote k=7: 0.914 ± 0.054; Dawid-Skene EM k=7: 0.960 ± 0.025
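
To make the label-aggregation comparison concrete, here is a small Monte Carlo sketch contrasting majority vote with a simplified Dawid–Skene estimator. It is not the paper's simulation. The annotator accuracies (evenly spaced from 0.55 to 0.95), the uniform class prior, the add-one smoothing, the iteration count, and the "one-coin" simplification (a single accuracy per annotator instead of a full confusion matrix) are all assumptions, so the numbers it prints will not reproduce Table 2.

```python
import math
import random
from typing import List, Tuple

Item = List[Tuple[int, int]]  # (annotator id, vote in {0, 1}) pairs for one item


def simulate(n_items: int, accuracies: List[float], k: int,
             rng: random.Random) -> Tuple[List[int], List[Item]]:
    """Draw true labels, then k noisy votes per item from randomly chosen annotators."""
    truth = [rng.randint(0, 1) for _ in range(n_items)]
    votes: List[Item] = []
    for z in truth:
        chosen = rng.sample(range(len(accuracies)), k)
        votes.append([(a, z if rng.random() < accuracies[a] else 1 - z)
                      for a in chosen])
    return truth, votes


def majority_vote(votes: List[Item]) -> List[int]:
    return [1 if 2 * sum(v for _, v in item) > len(item) else 0 for item in votes]


def dawid_skene_one_coin(votes: List[Item], n_annotators: int,
                         iterations: int = 30) -> List[int]:
    """EM over latent labels with one accuracy parameter per annotator."""
    # Initialise the posterior P(label = 1) with the raw vote fraction.
    post = [sum(v for _, v in item) / len(item) for item in votes]
    for _ in range(iterations):
        # M-step: expected agreement rate of each annotator (add-one smoothing).
        agree = [1.0] * n_annotators
        total = [2.0] * n_annotators
        for q, item in zip(post, votes):
            for a, v in item:
                agree[a] += q if v == 1 else 1.0 - q
                total[a] += 1.0
        acc = [agree[a] / total[a] for a in range(n_annotators)]
        # E-step: posterior label probability under a uniform class prior.
        post = []
        for item in votes:
            log_odds = 0.0
            for a, v in item:
                w = math.log(acc[a] / (1.0 - acc[a]))
                log_odds += w if v == 1 else -w
            post.append(1.0 / (1.0 + math.exp(-log_odds)))
    return [1 if q >= 0.5 else 0 for q in post]


def accuracy(pred: List[int], truth: List[int]) -> float:
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


rng = random.Random(0)
accuracies = [0.55 + 0.40 * i / 11 for i in range(12)]  # assumed heterogeneity
truth, votes = simulate(1500, accuracies, k=7, rng=rng)

acc_mv = accuracy(majority_vote(votes), truth)
acc_ds = accuracy(dawid_skene_one_coin(votes, len(accuracies)), truth)
print(f"majority vote: {acc_mv:.3f}  one-coin Dawid-Skene: {acc_ds:.3f}")
```

The estimator weights each vote by the log-odds of the annotator's estimated accuracy, so a reliable annotator can outvote several near-chance ones. A capped additive weight cannot express that difference once all weights have saturated.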

Limitations

Empirical classification results are limited to a single product category (granulated sugar); results may not generalize to more ambiguous or polysemous categories.
No direct experiments on order-sensitive models such as CNNs or LSTMs; sequence models only approximated via n-gram logistic regressions due to deployment environment constraints.
Label quality evaluation uses simulated annotator populations rather than real crowd or expert annotators.
Coverage statistics and representativeness of transaction/receipt data versus traditional CPI collection methods are not quantified, leaving open questions about real-world substitution or augmentation.
The additive reliability-weight update used in consensus labeling saturates annotator weights, limiting gains over majority voting and risking over-confidence if adversarial annotators behave differently in practice.
Details on computational infrastructure, exact model hyperparameters, and production throughput constraints are omitted as deployment-specific.

Open questions / follow-ons

How well do bag-of-words classifiers perform across diverse, more ambiguous or polysemous product categories beyond the illustrative granulated sugar example?
Can more sophisticated order-sensitive models (e.g., deep sequence models or pretrained Transformers) provide meaningful improvements when scaled under production constraints?
How to integrate latent-ability label aggregation models like Dawid–Skene operationally without sacrificing auditability and transparency in official statistical settings?
What is the real-world coverage and representativeness of scanner and receipt transaction data compared to traditional CPI collection, and how does that impact index accuracy?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this study shows that simple, lightweight bag-of-words classifiers can saturate performance on noisy, short-text classification tasks when appropriate deterministic pre-filtering is applied. The hybrid design, in which rule-based pre-classifiers reduce the input space before per-category lightweight models run, underscores the value of combining symbolic and statistical methods for interpretable, scalable classification in noisy environments. The human-in-the-loop weighted consensus labeling protocol also offers insight into managing noisy annotations, a common challenge when training ML defenses where labeling cost and quality vary.

The findings also caution that simple additive reliability weighting for label aggregation may not substantially improve on majority voting, and that latent-variable approaches such as Dawid–Skene may be preferable when label accuracy is critical. Bot-defense engineers should weigh similar trade-offs when designing downstream classification stages for suspicious traffic or CAPTCHA challenge categorization, prioritizing transparency, scalability, and annotation workflow design over complex deep models unless significantly harder data justifies them. The framework offers a low-complexity architecture for handling short, ambiguous text labels, relevant beyond retail product coding to other noisy user-generated content classification tasks in bot mitigation pipelines.

Cite

@article{arxiv2606_02004,
  title={Machine Learning for Coding Retail Product Names to Consumer-Price Categories: A Rule-plus-Bag-of-Words Pipeline with Reliability-Weighted Human-in-the-Loop Labeling},
  author={Vladimir Beskorovainyi},
  journal={arXiv preprint arXiv:2606.02004},
  year={2026},
  url={https://arxiv.org/abs/2606.02004}
}
