Learning Conceptual Text Prompts from Visual Regions of Interest for Medical Image Segmentation

Zhu He; Haoran Zhang; Wentao Zhang; Shen Zhao; Qiqi Liu; Xiaohu Wu; Qicheng Lao

doi:10.1016/j.eng.2026.04.006

Engineering 202604006. DOI: 10.1016/j.eng.2026.04.006

research-article

Abstract

Vision–language segmentation models (VLSMs) are effective in medical image segmentation tasks. However, a major limitation of these models is their dependence on manually crafted textual inputs. Studies have used visual question answering to semiautomatically generate textual information. However, these methods encounter challenges such as error accumulation. Herein, we propose a method to learn conceptual text prompts directly from visual regions of interest (ROIs) for facilitating medical image segmentation. We extracted textual conceptual attributes from ROIs using a large multimodal model to derive coarse real-text prompts. A text latent space transformation module accepted the ROI images as input for generating fine-grained pseudo-text prompts to compensate for the lack of image detail perception in the abovementioned real-text prompts. These prompts were encoded into a unified text embedding. Thereafter, we applied a self-adding noise knowledge distillation method to transfer the knowledge from text embedding to the class token of the image encoder, enabling direct text-guided inference during testing while reducing error accumulation. Our approach minimized the need for manual prompt design by leveraging explicit discrete and implicit continuous text prompts to effectively guide visual segmentation. Extensive evaluation across 13 medical image segmentation datasets demonstrated that our model outperformed the state-of-the-art VLSMs and vision-based segmentation models, exhibiting superior segmentation accuracy.
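
The distillation step described in the abstract (transferring knowledge from the text embedding to the class token of the image encoder) can be illustrated with a minimal sketch. The abstract does not specify the loss, the form of the noise, or any dimensions, so everything below is an assumption made for illustration: Gaussian noise stands in for the "self-adding noise" step, cosine distance stands in for the distillation objective, and the names `noisy_teacher` and `distill_loss` are hypothetical. It is not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row of x to (approximately) unit Euclidean length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def noisy_teacher(text_emb, sigma, rng):
    """Assumed noise step: perturb the text embedding with Gaussian noise.

    The abstract only names a 'self-adding noise' method; the Gaussian
    form and the sigma parameter are illustrative assumptions.
    """
    return text_emb + sigma * rng.standard_normal(text_emb.shape)


def distill_loss(cls_token, teacher_emb):
    """Mean cosine distance between student class tokens and teacher embeddings.

    Cosine distance is an assumed choice of distillation objective.
    """
    s = l2_normalize(cls_token)
    t = l2_normalize(teacher_emb)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))


rng = np.random.default_rng(0)
text_emb = rng.standard_normal((4, 16))   # 4 text embeddings, dim 16 (arbitrary sizes)
cls_token = rng.standard_normal((4, 16))  # image-encoder class tokens, same shape

teacher = noisy_teacher(text_emb, sigma=0.1, rng=rng)
print("loss for an unaligned student:", distill_loss(cls_token, teacher))
print("loss when student equals the clean text embedding:",
      distill_loss(text_emb, text_emb))
```

In a training loop, a loss of this kind would be minimized with respect to the image encoder's parameters, so that at test time the class token can stand in for the text embedding and no text prompt needs to be generated.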

Keywords

Conceptual text Prompt learning / Knowledge distillation / Medical image segmentation

Cite this article

Zhu He, Haoran Zhang, Wentao Zhang, Shen Zhao, Qiqi Liu, Xiaohu Wu, Qicheng Lao. Learning Conceptual Text Prompts from Visual Regions of Interest for Medical Image Segmentation. Engineering 202604006 DOI:10.1016/j.eng.2026.04.006
