Data augmentation for UAV-captured vessel images in maritime surveillance using multimodal language and diffusion models

Authors

  • Le Thi Thu Hong, Institute of Information Technology and Electronics, Academy of Military Science and Technology
  • Pham Thu Huong, Institute of Information Technology and Electronics, Academy of Military Science and Technology
  • Doan Quang Tu, Institute of Information Technology and Electronics, Academy of Military Science and Technology
  • Nguyen Chi Thanh (Corresponding Author), Institute of Information Technology and Electronics, Academy of Military Science and Technology

DOI:

https://doi.org/10.54939/1859-1043.j.mst.IITE.2025.160-168

Keywords:

Diffusion; Image synthesis; Data augmentation; Vessel detection.

Abstract

In maritime surveillance, UAV-based vessel detection is essential for security and safety at sea. However, annotated data are often scarce and lack diversity, which limits detector performance in complex maritime environments. This study introduces a novel data augmentation pipeline that uses multimodal generative models to enrich training datasets with realistic synthetic images. Scene descriptions are generated automatically from UAV imagery by Gemma, a lightweight multimodal language model, and then used to guide FLUX, a text-to-image diffusion model, in creating diverse vessel-centric scenes under varying environmental conditions. A hybrid annotation strategy combines YOLO-World for initial object proposals with manual refinement to ensure label accuracy. The synthetic images are then merged with the original data to train a vessel detection model. Experiments on the VESSELImg benchmark show that the proposed approach improves the YOLOv11 detector’s mean average precision (mAP), averaged over IoU thresholds from 0.50 to 0.95 (mAP@0.50:0.95), from 0.775 to 0.805. These results support the effectiveness of combining multimodal language and diffusion models for domain-specific data synthesis, with improved generalization and robustness in UAV-based maritime vessel detection.
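The reported metric averages detection quality over ten IoU thresholds (0.50 to 0.95 in steps of 0.05), so a predicted box must overlap its ground-truth box tightly to count at the stricter thresholds. The following minimal sketch illustrates that matching rule only. It is not the authors' evaluation code, the box coordinates are hypothetical rather than taken from VESSELImg, and the helper names are introduced here for illustration. A full mAP computation would additionally rank predictions by confidence and integrate precision over recall at each threshold.

```python
from typing import List, Tuple

# (x_min, y_min, x_max, y_max) in pixels
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_thresholds() -> List[float]:
    """The ten thresholds 0.50, 0.55, ..., 0.95 used by the 0.50:0.95 average."""
    return [round(0.50 + 0.05 * i, 2) for i in range(10)]


def matched_thresholds(pred: Box, truth: Box) -> List[float]:
    """Thresholds at which the prediction would count as a true positive."""
    overlap = iou(pred, truth)
    return [t for t in iou_thresholds() if overlap >= t]


# Hypothetical vessel boxes (assumption, for illustration only).
truth = (100.0, 100.0, 200.0, 160.0)
pred = (112.0, 100.0, 200.0, 160.0)

print(f"IoU = {iou(pred, truth):.2f}")
print("Counts as a match at thresholds:", matched_thresholds(pred, truth))
```

In this example the prediction is accepted at the looser thresholds and rejected at the strictest ones, which is why small localization gains can move mAP@0.50:0.95 even when mAP at a single 0.50 threshold barely changes.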

Published

30-10-2025

How to Cite

Le Thi Thu Hong, Pham Thu Huong, Doan Quang Tu, and Nguyen Chi Thanh, “Data augmentation for UAV-captured vessel images in maritime surveillance using multimodal language and diffusion models”, J. Mil. Sci. Technol., no. IITE, pp. 160–168, Oct. 2025.

Section

Information Technology