Data augmentation for UAV-captured vessel images in maritime surveillance using multimodal language and diffusion models

Le Thi Thu Hong; Pham Thu Huong; Doan Quang Tu; Nguyen Chi Thanh

doi:10.54939/1859-1043.j.mst.IITE.2025.160-168

Authors

Le Thi Thu Hong, Institute of Information Technology and Electronics, Academy of Military Science and Technology
Pham Thu Huong, Institute of Information Technology and Electronics, Academy of Military Science and Technology
Doan Quang Tu, Institute of Information Technology and Electronics, Academy of Military Science and Technology
Nguyen Chi Thanh (Corresponding Author), Institute of Information Technology and Electronics, Academy of Military Science and Technology

Keywords:

Diffusion; Image synthesis; Data augmentation; Vessel detection.

Abstract

In maritime surveillance, UAV-based vessel detection is essential for ensuring security and safety at sea. However, annotated data that are limited in size and diversity often restrict model performance in complex maritime environments. This study introduces a data augmentation pipeline that uses multimodal generative models to enrich training datasets with realistic synthetic images. Scene descriptions are generated automatically from UAV imagery using Gemma, a lightweight multimodal language model, and then used to guide FLUX, a text-to-image diffusion model, in creating diverse vessel-centric scenes under varying environmental conditions. A hybrid annotation strategy combines YOLO-World for initial object proposals with manual refinement to ensure label accuracy. The augmented dataset is merged with the original data to train a vessel detection model. Experiments on the VESSELImg benchmark show that the proposed approach improves the YOLOv11 detector’s mean average precision (mAP), averaged over IoU thresholds from 0.50 to 0.95 (mAP@0.50:0.95), from 0.775 to 0.805. These results support combining multimodal language models with diffusion models for domain-specific data synthesis to improve the generalization and robustness of UAV-based maritime vessel detection.
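
The pipeline described in the abstract can be sketched as a short orchestration loop. This is a minimal illustration only, not the authors' implementation: the three callables (`describe`, `generate`, `propose_boxes`) are hypothetical stand-ins for the Gemma captioning step, the FLUX generation step, and the YOLO-World proposal step, and the `conditions` list and prompt template are assumptions made for the example. No real model API is called here.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple

# Bounding box as (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class SyntheticSample:
    source_image: str          # real UAV image that seeded the prompt
    prompt: str                # scene description plus the chosen condition
    image: object              # generated image; type depends on the generator
    proposed_boxes: List[Box] = field(default_factory=list)
    needs_review: bool = True  # proposals still await manual refinement


def augment(
    image_paths: Iterable[str],
    describe: Callable[[str], str],                # hypothetical: image path -> scene text
    generate: Callable[[str], object],             # hypothetical: prompt -> synthetic image
    propose_boxes: Callable[[object], List[Box]],  # hypothetical: image -> box proposals
    conditions: Iterable[str] = ("clear daylight", "fog", "dusk", "rough sea"),
) -> List[SyntheticSample]:
    """Create one synthetic sample per (real image, environmental condition)."""
    conditions = tuple(conditions)  # allow reuse across images
    samples: List[SyntheticSample] = []
    for path in image_paths:
        description = describe(path)
        for condition in conditions:
            # Assumed prompt template; the paper's actual prompts are not given here.
            prompt = f"{description}, {condition}, aerial UAV view"
            image = generate(prompt)
            samples.append(
                SyntheticSample(path, prompt, image, propose_boxes(image))
            )
    return samples
```

Every sample is returned with `needs_review=True`, mirroring the hybrid annotation strategy: automatic proposals are treated as drafts until a person has corrected them, and only then are the synthetic images merged with the original training set.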

ISSN: 1859-1043