A frame-level video annotation tool for dynamic gestures and apply in Vietnamese sign language

Trung-Hieu Nguyen; Dinh-Anh Le; Van-An Duong; Thi Thanh Thuy Pham; Ngoc-Khiem Pham; Quang-Truong Tran; Thi-Hoang Trinh; Huong Giang Doan

doi:10.54939/1859-1043.j.mst.112.2026.167-175

Authors

Nguyen Trung Hieu, Faculty of Control and Automation, Electric Power University
Le Dinh Anh, Faculty of Control and Automation, Electric Power University
Duong Van An, Faculty of Control and Automation, Electric Power University
Pham Thi Thanh Thuy, Faculty of Cybersecurity and High Tech Crime Prevention, Academy of People Security
Pham Ngoc Khiem, School of Electrical and Electronics Engineering, Hanoi University of Science and Technology
Tran Quang Truong, Faculty of Control and Automation, Electric Power University
Trinh Thi Hoang, Faculty of Control and Automation, Electric Power University
Doan Thi Huong Giang (Corresponding Author), Faculty of Control and Automation, Electric Power University

Keywords:

Dynamic action recognition; Deep learning; Video annotation; Temporal IoU; Cohen’s Kappa; Temporal segmentation; Annotation evaluation.

Abstract

Human Action Recognition (HAR) in video is essential for human–computer interaction, particularly in sign language and smart device control. However, model performance depends heavily on accurate temporal annotation of dynamic gestures. This study proposes a frame-level video annotation tool for Vietnamese sign language that enables precise temporal segmentation and structured JSON export for deep learning applications. A dataset of 15 dynamic gesture classes is also constructed using a multi-view acquisition setup. Annotation quality is evaluated using Temporal IoU, Cohen’s Kappa, and annotation time. Results show high inter-annotator agreement, with a Temporal IoU of 0.9470 and a Cohen’s Kappa of 0.9888, demonstrating that the proposed tool supports reliable and efficient gesture annotation.
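
As an illustration of the two agreement metrics named in the abstract, the following minimal Python sketch computes Temporal IoU between two annotators' segments and frame-level Cohen's Kappa, starting from a JSON annotation record. The JSON field names (`video`, `num_frames`, `segments`, `label`, `start`, `end`), the `none` background label, and the sample frame numbers are assumptions made for this example; they are not the tool's actual export schema or data from the study.

```python
import json
from collections import Counter

def temporal_iou(seg_a, seg_b):
    """Temporal IoU of two inclusive frame intervals given as (start, end)."""
    inter = max(min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]) + 1, 0)
    union = (seg_a[1] - seg_a[0] + 1) + (seg_b[1] - seg_b[0] + 1) - inter
    return inter / union

def frame_labels(record, background="none"):
    """Expand a segment list into one label per frame (hypothetical schema)."""
    labels = [background] * record["num_frames"]
    for seg in record["segments"]:
        for frame in range(seg["start"], seg["end"] + 1):
            labels[frame] = seg["label"]
    return labels

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa: (p_o - p_e) / (1 - p_e) over paired frame labels."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:  # both annotators used a single identical label everywhere
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Two annotators label the same gesture in a 100-frame clip (made-up values).
annotator_a = json.loads(json.dumps({
    "video": "sample_clip.mp4", "num_frames": 100,
    "segments": [{"label": "hello", "start": 10, "end": 49}],
}))
annotator_b = {
    "video": "sample_clip.mp4", "num_frames": 100,
    "segments": [{"label": "hello", "start": 12, "end": 51}],
}

seg_a = annotator_a["segments"][0]
seg_b = annotator_b["segments"][0]
tiou = temporal_iou((seg_a["start"], seg_a["end"]), (seg_b["start"], seg_b["end"]))
kappa = cohens_kappa(frame_labels(annotator_a), frame_labels(annotator_b))

print(f"Temporal IoU: {tiou:.4f}")    # Temporal IoU: 0.9048
print(f"Cohen's Kappa: {kappa:.4f}")  # Cohen's Kappa: 0.9167
```

In this sketch Temporal IoU is computed per segment pair, while Kappa is computed over all frames so that agreement on background frames is also counted and corrected for chance. How the study pairs segments and aggregates scores across videos is not stated in the abstract.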

ISSN: 1859-1043