Prompt-Guided AI Analytics for Manufacturing Digital Twins: Adapting YOLO-World to Industrial Object Detection

Yuxuan Tang; Wenjie Han; Min Liu; Lei Zhang

doi:10.63646/jaiaa.2025.030104

Published 2025-03-30

Yuxuan Tang

School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, Heilongjiang, China

Wenjie Han

School of Mechanical Engineering, Shenyang University of Technology, Shenyang 110870, Liaoning, China

Min Liu

School of Information and Mechanical Engineering, Wuhan Polytechnic University, Wuhan 430023, Hubei, China

Lei Zhang*

School of Mechatronic Engineering, Xi'an Polytechnic University, Xi'an 710048, Shaanxi, China
zhanglei@xpu.edu.cn

Abstract

Manufacturing digital twins promise to bridge physical production and data-driven decision making, yet the data side of the bridge remains underdeveloped because vision sensors require large, well-annotated datasets that industrial sites cannot easily provide. This paper investigates how prompt-guided vision–language detection can be adapted to deliver such data with limited supervision. We propose a Sim2Real pipeline that fuses photorealistic synthetic imagery generated from CAD models with a small body of coot-captured controlled imagery, then fine-tunes the open-vocabulary YOLO-World detector on the resulting mixed dataset. The pipeline is validated on a tidal-turbine workstation containing seven assembly states and is benchmarked against closed-vocabulary YOLOv8 baselines and against synthetic-only training. On 1,024 spontaneous production frames, the prompt-guided model attains mAP@0.5 = 0.579 and precision = 0.815 after a brief 943-image spontaneous fine-tune, raising mean average precision by 25.4 percentage points over the synthetic controlled-only configuration and by 18.0 points over a YOLOv8-S baseline trained on the same pool. Open-vocabulary prompts add a further 5–6 mAP points by exposing color and assembly-state attributes that were never explicit class labels at training time. Annotation effort is cut by 77% relative to a fully manual real-data pipeline. The paper concludes with a practitioner-oriented discussion of how the pipeline integrates into manufacturing digital twins and where its limits lie.

Keywords: Digital twin; computer vision; YOLO-World; Sim2Real; open-vocabulary detection; vision–language model; manufacturing analytics; assembly monitoring

This work is licensed under a Creative Commons Attribution 4.0 International License.

How to Cite

Tang, Y., Han, W., Liu, M., & Zhang, L. (2025). Prompt-Guided AI Analytics for Manufacturing Digital Twins: Adapting YOLO-World to Industrial Object Detection. Journal of AI Analytics and Applications, 3(1), 60-79. https://doi.org/10.63646/jaiaa.2025.030104