Augmenting Efficient Real-time Surgical Instrument Segmentation in Video
with Point Tracking and Segment Anything
Abstract
The Segment Anything Model (SAM) is a powerful vision foundation model
that is revolutionizing the traditional paradigm of segmentation.
However, its reliance on per-frame prompting and its large computational
cost limit its use in robot-assisted surgery. Applications such as
augmented reality guidance require minimal user intervention and
efficient inference to be clinically usable. In this study, we
address these limitations by adopting lightweight SAM variants to meet
the efficiency requirement and employing fine-tuning techniques to
enhance their generalization in surgical scenes. Recent advancements in
Tracking Any Point (TAP) have shown promising results in both accuracy
and efficiency, particularly when points are occluded or leave the field
of view. Inspired by this progress, we present a novel framework that
combines an online point tracker with a lightweight SAM model that is
fine-tuned for surgical instrument segmentation. Sparse points within
the region of interest are tracked and used to prompt SAM throughout the
video sequence, providing temporal consistency. Quantitative results on
EndoVis2015 surpass the state-of-the-art semi-supervised video
segmentation method XMem, reaching 84.8 IoU and 91.0 Dice. In terms of
efficiency, our method runs at over 25/90 FPS on a single 4060/4090 GPU.
Code is available at: https://github.com/wuzijian1997/SIS-PT-SAM
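
As a rough illustration of the track-then-prompt loop summarized above, the sketch below shows how sparse points selected on the first frame could be propagated by an online point tracker and passed to a lightweight SAM predictor as positive point prompts on every frame. This is a minimal sketch, not the released SIS-PT-SAM implementation: the `tracker` and `sam_predictor` objects and their `initialize`, `step`, and `predict` methods are assumed placeholder interfaces.

```python
import numpy as np

def segment_instrument_video(frames, init_points, tracker, sam_predictor):
    """Per-frame track-then-prompt loop (illustrative sketch).

    frames        -- sequence of H x W x 3 RGB frames
    init_points   -- (N, 2) array of (x, y) points inside the instrument
                     region of the first frame (e.g. a few user clicks or
                     points sampled from an initial mask)
    tracker       -- online point tracker with a hypothetical interface:
                     initialize(frame, points) and step(frame), where step
                     returns updated (N, 2) coordinates and an (N,) boolean
                     visibility array
    sam_predictor -- lightweight SAM wrapper with a hypothetical
                     predict(image, point_coords, point_labels) -> binary mask
    """
    # Points are specified only once, on the first frame; the tracker then
    # propagates them through the rest of the sequence.
    tracker.initialize(frames[0], init_points)

    masks = []
    for frame in frames:
        points, visible = tracker.step(frame)            # propagate sparse points
        prompts = points[visible]                        # drop occluded points
        labels = np.ones(len(prompts), dtype=np.int32)   # label 1 = foreground
        mask = sam_predictor.predict(frame, prompts, labels)
        masks.append(mask)
    return masks
```

Because prompts come from the tracker rather than from a user on every frame, the loop needs only the initial point annotation, which is what keeps the pipeline compatible with real-time, low-intervention use.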