A frame-level video annotation tool for dynamic gestures and apply in Vietnamese sign language

Authors

  • Nguyen Trung Hieu, Faculty of Control and Automation, Electric Power University
  • Le Dinh Anh, Faculty of Control and Automation, Electric Power University
  • Duong Van An, Faculty of Control and Automation, Electric Power University
  • Pham Thi Thanh Thuy, Faculty of Cybersecurity and High Tech Crime Prevention, Academy of People Security
  • Pham Ngoc Khiem, School of Electrical and Electronics Engineering, Hanoi University of Science and Technology
  • Tran Quang Truong, Faculty of Control and Automation, Electric Power University
  • Trinh Thi Hoang, Faculty of Control and Automation, Electric Power University
  • Doan Thi Huong Giang (Corresponding Author), Faculty of Control and Automation, Electric Power University

DOI:

https://doi.org/10.54939/1859-1043.j.mst.112.2026.167-175

Keywords:

Dynamic action recognition; Deep learning; Video annotation; Temporal IoU; Cohen’s Kappa; Temporal segmentation; Annotation evaluation.

Abstract

Human Action Recognition (HAR) in video is essential for human–computer interaction, particularly in sign language and smart device control. However, model performance depends heavily on accurate temporal annotation of dynamic gestures. This study proposes a frame-level video annotation tool for Vietnamese sign language that supports precise temporal segmentation and structured JSON export for deep learning applications. A dataset of 15 dynamic gesture classes is also constructed using a multi-view acquisition setup. Annotation quality is evaluated using Temporal IoU, Cohen’s Kappa, and annotation time. Results show high inter-annotator agreement, with a Temporal IoU of 0.9470 and a Cohen’s Kappa of 0.9888, demonstrating the effectiveness of the proposed tool for reliable and efficient gesture annotation.
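
As an illustration of the two agreement metrics named in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes that each annotation is an inclusive frame interval (start, end, label), that Temporal IoU is computed between two annotators' intervals for the same gesture, and that Cohen's Kappa is computed frame by frame with unlabeled frames assigned a hypothetical "none" class. The function names and the sample segments are invented for the example.

```python
from collections import Counter


def temporal_iou(seg_a, seg_b):
    """Temporal IoU of two inclusive frame intervals given as (start, end)."""
    a_start, a_end = seg_a
    b_start, b_end = seg_b
    # Number of frames shared by both intervals (0 if they do not overlap).
    inter = max(0, min(a_end, b_end) - max(a_start, b_start) + 1)
    union = (a_end - a_start + 1) + (b_end - b_start + 1) - inter
    return inter / union if union else 0.0


def segments_to_frame_labels(segments, num_frames, background="none"):
    """Expand (start, end, label) segments into one label per frame.

    The "none" background class is an assumption made for this sketch.
    """
    labels = [background] * num_frames
    for start, end, label in segments:
        for frame in range(max(0, start), min(end, num_frames - 1) + 1):
            labels[frame] = label
    return labels


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two equally long sequences of nominal labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of frames with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical annotations of one 60-frame clip by two annotators.
annotator_a = [(10, 49, "hello")]
annotator_b = [(12, 51, "hello")]

tiou = temporal_iou(annotator_a[0][:2], annotator_b[0][:2])
kappa = cohens_kappa(
    segments_to_frame_labels(annotator_a, 60),
    segments_to_frame_labels(annotator_b, 60),
)
print(f"Temporal IoU: {tiou:.4f}")   # Temporal IoU: 0.9048
print(f"Cohen's Kappa: {kappa:.4f}")  # Cohen's Kappa: 0.8500
```

In this sketch a two-frame shift at each boundary of a 40-frame gesture already lowers both scores noticeably, which suggests why frame-level boundary control matters for agreement values as high as those reported.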

Published

25-06-2026

How to Cite

[1]
T.-H. Nguyen, “A frame-level video annotation tool for dynamic gestures and apply in Vietnamese sign language”, J. Mil. Sci. Technol., vol. 112, no. 112, pp. 167–175, Jun. 2026.

Section

Information Technology & Applied Mathematics