A frame-level video annotation tool for dynamic gestures and its application to Vietnamese Sign Language

Trung Hiếu Nguyễn; Đình Anh Lê; Van-An Duong; Thị Thanh Thủy Phạm; Ngọc Khiêm Phạm; Quang Trường Trần; Thị Hoàng Trịnh; Huong Giang Doan

doi:10.54939/1859-1043.j.mst.112.2026.167-175

Authors

Nguyen Trung Hieu, Faculty of Control and Automation, Electric Power University
Le Dinh Anh, Faculty of Control and Automation, Electric Power University
Duong Van An, Faculty of Control and Automation, Electric Power University
Pham Thi Thanh Thuy, Faculty of Cybersecurity and High-Tech Crime Prevention, People's Security Academy
Pham Ngoc Khiem, School of Electrical and Electronic Engineering, Hanoi University of Science and Technology
Tran Quang Truong, Faculty of Control and Automation, Electric Power University
Trinh Thi Hoang, Faculty of Control and Automation, Electric Power University
Doan Thi Huong Giang (corresponding author), Faculty of Control and Automation, Electric Power University

DOI:

https://doi.org/10.54939/1859-1043.j.mst.112.2026.167-175

Keywords:

Action recognition; Deep learning; Video annotation; Temporal IoU; Cohen's Kappa; Temporal segmentation; Annotation evaluation.

Abstract

Human action recognition in video plays an important role in human-machine interaction, particularly for sign language and for controlling smart devices. This study proposes a frame-level video annotation tool for Vietnamese Sign Language that supports precise temporal segmentation and exports annotations as JSON for deep learning. A dataset of 15 dynamic gesture classes was also built using a multi-view capture system. Annotation quality is evaluated with Temporal IoU, Cohen's Kappa, and annotation time; the agreement scores reach 0.9470 (Temporal IoU) and 0.9888 (Cohen's Kappa).
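The two agreement measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes each gesture segment is an inclusive pair of frame indices `(start, end)` and that Cohen's Kappa is computed over per-frame labels from two annotators. The function names `temporal_iou` and `cohens_kappa` and the labels `"wave"` and `"none"` are hypothetical, chosen only for illustration.

```python
from collections import Counter


def temporal_iou(seg_a, seg_b):
    """Temporal IoU of two segments given as inclusive (start, end) frame indices.

    Assumption: frame intervals are inclusive on both ends.
    """
    a_start, a_end = seg_a
    b_start, b_end = seg_b
    # Number of frames shared by both segments (0 if they do not overlap).
    inter = max(0, min(a_end, b_end) - max(a_start, b_start) + 1)
    # Frames covered by either segment.
    union = (a_end - a_start + 1) + (b_end - b_start + 1) - inter
    return inter / union


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two equally long sequences of per-frame labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of frames with identical labels.
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label everywhere.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical example: two annotators mark the same gesture in a 60-frame clip.
seg_a, seg_b = (10, 49), (12, 51)
frames_a = ["wave" if seg_a[0] <= i <= seg_a[1] else "none" for i in range(60)]
frames_b = ["wave" if seg_b[0] <= i <= seg_b[1] else "none" for i in range(60)]

print(f"tIoU  = {temporal_iou(seg_a, seg_b):.4f}")
print(f"kappa = {cohens_kappa(frames_a, frames_b):.4f}")
```

In this made-up case the two segments share 38 of 42 covered frames, giving a Temporal IoU of about 0.905, and the per-frame labels disagree on 4 of 60 frames, giving a Kappa of 0.85. Temporal IoU measures how well segment boundaries line up, while Kappa corrects label agreement for chance, so the two numbers are complementary rather than interchangeable.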


ISSN: 1859-1043

Ngôn ngữ

Gửi bài mới

Indexed by

Thông tin

Visitors

GTM