SIG-Former: monocular surgical instruction generation with transformers.

Zhang, Jinglu; Nie, Yinyu; Chang, Jian; Zhang, Jian J.

Zhang, J., Nie, Y., Chang, J. and Zhang, J. J., 2022. SIG-Former: monocular surgical instruction generation with transformers. International Journal of Computer Assisted Radiology and Surgery, 17, 2203-2210.

Full text available as:

PDF (OPEN ACCESS ARTICLE)
s11548-022-02718-9.pdf - Published Version
Available under License Creative Commons Attribution.
1MB

Copyright to original material in this document is with the original owner(s). Access to this content through BURO is granted on condition that you use it only for research, scholarly or other non-commercial purposes. If you wish to use it for any other purposes, you must contact BU via BURO@bournemouth.ac.uk.

Any third party copyright material in this document remains the property of its respective owner(s). BU grants no licence for further use of that third party material.

DOI: 10.1007/s11548-022-02718-9

Abstract

PURPOSE: Automatic surgical instruction generation is a crucial part of intra-operative surgical assistance. However, understanding surgical activities and translating them into human-like sentences is particularly challenging because of the complexity of the surgical environment and the modality gap between images and natural language. To this end, we introduce SIG-Former, a transformer-backboned generation network that predicts surgical instructions from monocular RGB images.

METHODS: Taking a surgical image as input, we first extract its visual attentive feature map with a fine-tuned ResNet-101 model, then apply transformer attention blocks to model its visual representation, text embedding and visual-textual relational feature, respectively. To tackle the loss-metric inconsistency between training and inference in sequence generation, we additionally apply a self-critical reinforcement learning approach to directly optimize the CIDEr score after regular training.

RESULTS: We validate our proposed method on the DAISI dataset, which contains 290 clinical procedures from diverse medical subjects. Extensive experiments demonstrate that our method outperforms the baselines and achieves promising performance in both quantitative and qualitative evaluations.

CONCLUSION: Our experiments demonstrate that SIG-Former is capable of mapping dependencies between visual features and textual information. Nevertheless, surgical instruction generation is still at a preliminary stage. Future work includes collecting a large clinical dataset, annotating more reference instructions and preparing models pre-trained on medical images.
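
The self-critical reinforcement learning step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the common self-critical formulation in which the reward of a sampled sentence is compared against the reward of the greedy-decoded sentence, and the difference weights the log-probability of the sampled tokens. The names `toy_reward` and `self_critical_loss` are hypothetical, and `toy_reward` is a simple token-overlap stand-in for CIDEr, used only to keep the example self-contained.

```python
from typing import Callable, Sequence


def toy_reward(candidate: Sequence[str], reference: Sequence[str]) -> float:
    """Hypothetical stand-in for CIDEr (assumption, not the real metric):
    the fraction of unique candidate tokens that also occur in the reference."""
    if not candidate:
        return 0.0
    unique = set(candidate)
    return len(unique & set(reference)) / len(unique)


def self_critical_loss(
    sampled_log_probs: Sequence[float],
    sampled_tokens: Sequence[str],
    greedy_tokens: Sequence[str],
    reference: Sequence[str],
    reward_fn: Callable[[Sequence[str], Sequence[str]], float] = toy_reward,
) -> float:
    """Loss for one image: -(r(sampled) - r(greedy)) * sum(log p(sampled tokens)).

    The greedy sentence acts as the baseline, so a sampled sentence that scores
    higher than greedy decoding has its log-probability pushed up when the
    loss is minimized, and one that scores lower is pushed down.
    """
    advantage = reward_fn(sampled_tokens, reference) - reward_fn(greedy_tokens, reference)
    return -advantage * sum(sampled_log_probs)


# Illustrative values only; the sentences and log-probabilities are made up.
reference = "insert the needle through the tissue".split()
sampled = "insert the needle".split()
greedy = "cut the suture".split()
loss = self_critical_loss([-0.5, -0.2, -0.3], sampled, greedy, reference)
```

In an actual training loop the log-probabilities would come from the captioning model and the loss would be averaged over a batch and differentiated by an autograd framework. The sketch only shows how the reward difference enters the objective.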

Item Type:	Article
ISSN:	1861-6410
Additional Information:	Funding Open Access funding enabled and organized by Projekt DEAL.
Uncontrolled Keywords:	Image captioning; Reinforcement learning; Surgical instruction generation; Transformer
Group:	Faculty of Media & Communication (Until 31/07/2025)
ID Code:	37330
Deposited By:	Symplectic RT2
Deposited On:	08 Aug 2022 14:59
Last Modified:	25 Jan 2023 12:52
