Prompt-Guided AI Analytics for Manufacturing Digital Twins: Adapting YOLO-World to Industrial Object Detection
Abstract
Manufacturing digital twins promise to bridge physical production and data-driven decision making, yet the data side of that bridge remains underdeveloped because vision-based sensing requires large, well-annotated datasets that industrial sites cannot easily provide. This paper investigates how prompt-guided vision–language detection can be adapted to deliver such data with limited supervision. We propose a Sim2Real pipeline that fuses photorealistic synthetic imagery generated from CAD models with a small body of cobot-captured controlled imagery, then fine-tunes the open-vocabulary YOLO-World detector on the resulting mixed dataset. The pipeline is validated on a tidal-turbine workstation with seven assembly states and is benchmarked against closed-vocabulary YOLOv8 baselines and against synthetic-only training. On 1,024 spontaneous production frames, the prompt-guided model attains mAP@0.5 = 0.579 and precision = 0.815 after a brief fine-tune on 943 spontaneous images, raising mean average precision by 25.4 percentage points over the configuration trained only on synthetic and controlled imagery and by 18.0 points over a YOLOv8-S baseline trained on the same pool. Open-vocabulary prompts add a further 5–6 mAP points by exposing color and assembly-state attributes that were never explicit class labels at training time. Annotation effort is cut by 77% relative to a fully manual real-data pipeline. The paper concludes with a practitioner-oriented discussion of how the pipeline integrates into manufacturing digital twins and where its limits lie.
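The gains above are quoted in percentage points, so the baseline scores they imply can be recovered by simple subtraction. The short sketch below does that arithmetic. It assumes that each gain is an absolute difference in mAP@0.5 measured on the same 1,024-frame evaluation set; the resulting baseline values are derived here for illustration and are not figures stated in the abstract. The function and constant names are hypothetical.

```python
# Figures quoted in the abstract.
FINAL_MAP50 = 0.579                 # prompt-guided model after spontaneous fine-tune
GAIN_OVER_SYN_CONTROLLED_PP = 25.4  # percentage points over synthetic + controlled only
GAIN_OVER_YOLOV8S_PP = 18.0         # percentage points over the YOLOv8-S baseline


def implied_baseline(final_map: float, gain_pp: float) -> float:
    """Baseline mAP implied by an absolute gain given in percentage points.

    Assumption: the gain is an absolute difference on the same evaluation set.
    """
    return round(final_map - gain_pp / 100.0, 3)


def relative_gain(final_map: float, gain_pp: float) -> float:
    """Relative improvement over the implied baseline, as a fraction."""
    baseline = final_map - gain_pp / 100.0
    return (final_map - baseline) / baseline


if __name__ == "__main__":
    for label, gain in [
        ("synthetic + controlled only", GAIN_OVER_SYN_CONTROLLED_PP),
        ("YOLOv8-S", GAIN_OVER_YOLOV8S_PP),
    ]:
        base = implied_baseline(FINAL_MAP50, gain)
        rel = relative_gain(FINAL_MAP50, gain)
        print(f"{label}: implied mAP@0.5 = {base:.3f}, relative gain = {rel:.1%}")
```

Under that assumption, the two comparison configurations would sit at roughly 0.325 and 0.399 mAP@0.5, which shows why percentage points and relative improvement should not be read interchangeably.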
