THE USE OF ACTIVE LEARNING FOR REFINING LABELS IN WEAK SUPERVISION BASED TASKS
DOI: https://doi.org/10.32689/maup.it.2025.2.18

Keywords: pseudo-labels, transition matrix, heuristic rules, oracle, importance weighting, model confidence

Abstract
An iterative method is proposed that uses active learning to detect and correct mislabeled data in a sparsely annotated dataset. The starting point is a dataset in which examples carry only noisy or partially provided annotations. The approach gradually refines the training sample by identifying examples that are highly likely to contain incorrect labels and subjecting them to targeted verification, both by humans and by heuristic rules with a high level of confidence.

The goal of this paper is to improve label quality in a sparsely annotated dataset at minimal manual annotation cost. To this end, we propose an iterative approach based on active learning and selective human validation.

Methodology. An iterative method has been developed in which a discriminative model (DistilBERT) is trained on the current set of labels, estimates the uncertainty of its predictions from the variability of its outputs, and generates pseudo-labels for samples with high uncertainty. These labels are sent to annotators for quick confirmation or correction. Weak annotation is based on four heuristic λ-functions that form the initial labels, and the generative model accounts for the dependencies between these functions through a regularized connection matrix. The effectiveness of the approach was evaluated experimentally on the IMDb Movie Reviews dataset, which contains 50,000 text reviews with a clear division into training, validation, and test samples.

Scientific novelty. Unlike traditional active learning methods, which rely on manual re-annotation of the most uncertain samples, a hybrid strategy is proposed that combines weak supervision, pseudo-annotation, and selective human verification. This allows the target model accuracy to be reached 2–3 times faster than with classical approaches that use no weak labels, owing to the efficient use of limited human resources and of the initial information provided by the heuristic rules.

Conclusions. An iterative strategy that combines sample selection based on prediction divergence, automatic pseudo-annotation, and selective human validation effectively improves label quality in weakly annotated data without complete re-annotation. The proposed method demonstrated competitive results on a real-world IMDb review corpus, delivering high classification accuracy at reduced manual labor cost.
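The weak-annotation step described in the abstract can be sketched as follows. The paper does not list its four λ-functions, so the keyword and pattern rules below are hypothetical stand-ins for IMDb sentiment, and the majority-vote combiner is the simplest possible aggregator, substituting for the regularized generative model that accounts for dependencies between the functions.

```python
# Illustrative sketch only: the paper's four λ-functions are not specified,
# so these keyword/pattern rules are hypothetical examples for IMDb reviews.
ABSTAIN, NEG, POS = -1, 0, 1

def lf_positive_words(text):
    # Vote POS if any strongly positive keyword appears.
    return POS if any(w in text.lower() for w in ("excellent", "wonderful", "great")) else ABSTAIN

def lf_negative_words(text):
    # Vote NEG if any strongly negative keyword appears.
    return NEG if any(w in text.lower() for w in ("awful", "boring", "terrible")) else ABSTAIN

def lf_rating_pattern(text):
    # Vote based on explicit numeric ratings mentioned in the review.
    t = text.lower()
    if "10/10" in t or "9/10" in t:
        return POS
    if "1/10" in t or "2/10" in t:
        return NEG
    return ABSTAIN

def lf_negation(text):
    # A single negation phrase as a weak negative signal.
    return NEG if "not worth" in text.lower() else ABSTAIN

LFS = (lf_positive_words, lf_negative_words, lf_rating_pattern, lf_negation)

def weak_label(text):
    """Combine λ-function votes by simple majority; abstain on ties or no votes."""
    votes = [lf(text) for lf in LFS]
    pos, neg = votes.count(POS), votes.count(NEG)
    if pos > neg:
        return POS
    if neg > pos:
        return NEG
    return ABSTAIN
```

In the full method, this naive majority vote would be replaced by the generative model with the regularized connection matrix, which weights λ-functions by their estimated accuracies and correlations.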
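The uncertainty-driven selection step, which routes the hardest samples to human annotators, might look like the sketch below. Here `predict_proba`, the number of stochastic passes, and the review budget are illustrative assumptions; the entropy of the averaged prediction stands in for the output-variability criterion described above.

```python
import math

def predictive_entropy(prob):
    """Entropy of a class-probability vector; higher means more uncertain."""
    return -sum(p * math.log(p + 1e-12) for p in prob)

def stochastic_passes(predict_proba, text, n_passes=10):
    """Average class probabilities over repeated stochastic forward passes
    (e.g. MC-dropout); `predict_proba` is an assumed model interface."""
    runs = [predict_proba(text) for _ in range(n_passes)]
    k = len(runs[0])
    return [sum(r[c] for r in runs) / n_passes for c in range(k)]

def select_for_review(texts, predict_proba, budget=2):
    """Rank samples by predictive entropy and send the top `budget`
    most uncertain ones to the human oracle for confirmation/correction."""
    scored = [(predictive_entropy(stochastic_passes(predict_proba, t)), t)
              for t in texts]
    scored.sort(reverse=True)
    return [t for _, t in scored[:budget]]
```

Each iteration would then retrain the classifier on the corrected labels and repeat the selection, so the limited annotation budget is always spent on the samples the current model is least sure about.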