Integrating artificial intelligence vision systems into robots has significantly enhanced the adaptability of grasping, but such systems are vulnerable to backdoor threats. Currently, most backdoor attacks focus on image classification and are limited to unimodal information and single-object digital scenarios. In this work, we make the first attempt to realize a backdoor attack on multimodal vision-guided robot grasping in high-clutter scenarios. Specifically, we propose a novel backdoor attack method named Shortcut-enhanced Multimodal Backdoor Attack (SEMBA), which consists of two parts. First, to ensure attack robustness and multimodality, we introduce the Multimodal Shortcut Searching Algorithm (MSSA), which finds the pixel value that deviates most from the mean and standard deviation of the dataset and, for each individual image, the pivotal pixel position. Then, building on MSSA, we devise Multimodal Trigger Generation (MTG) to diversify backdoor triggers and realize attacks in the real world. After being trained on the poisoned dataset, the model is activated to prioritize grasping the trigger-like object within the camera view. We conduct extensive experiments on benchmark datasets and a physical robotic arm, demonstrating the effectiveness of the method in both the digital and the real world.

Note to Practitioners: Robots are typically designed to be safe and reliable. However, integrating artificial intelligence technology into robots can make them unpredictable in certain situations, such as when third-party data or models are used. It is therefore necessary to examine the security of artificial-intelligence-driven robots. In this paper, we address backdoor attacks on robots equipped with an artificial intelligence vision system. Unlike typical backdoor attack methods that focus on the digital world, we pay particular attention to attack robustness, multimodality, and adaptability in complex real-world scenarios. Along these lines, we propose a new backdoor attack method and demonstrate its capability to attack multimodality-guided visual grasping systems in high-clutter environments. Our method opens potential avenues for future research on data-driven security and offers practical insights toward trustworthy visual-learning-based robot grasping systems.
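To make the shortcut-searching step described above concrete, the sketch below illustrates one way such a search could look in code. It is a minimal illustration only, not the authors' MSSA implementation: it assumes grayscale images with values in [0, 1], uses a simple z-score rule to pick the trigger pixel value farthest from the dataset statistics, and uses gradient-magnitude saliency as a stand-in criterion for the per-image pivotal position; the paper's actual selection criteria may differ.

```python
# Illustrative sketch of a shortcut-style trigger search (NOT the authors'
# MSSA). Assumptions: grayscale images in [0, 1]; z-score distance selects
# the trigger value; gradient-magnitude saliency selects the pixel position.
import numpy as np


def find_trigger_value(images: np.ndarray) -> float:
    """Pick the candidate pixel value that deviates most from the
    dataset mean, measured in standard deviations."""
    mean, std = images.mean(), images.std()
    candidates = np.linspace(0.0, 1.0, 256)       # 256 candidate intensities
    z_scores = np.abs(candidates - mean) / (std + 1e-8)
    return float(candidates[np.argmax(z_scores)])


def find_pivotal_position(image: np.ndarray) -> tuple[int, int]:
    """Pick a per-image 'pivotal' pixel position; here, the location of
    maximum local gradient magnitude (an assumed saliency proxy)."""
    gy, gx = np.gradient(image.astype(np.float64))
    saliency = np.hypot(gx, gy)
    row, col = np.unravel_index(np.argmax(saliency), saliency.shape)
    return int(row), int(col)


def stamp_trigger(image: np.ndarray, value: float,
                  pos: tuple[int, int], size: int = 3) -> np.ndarray:
    """Stamp a small square patch of the trigger value centered near pos."""
    poisoned = image.copy()
    r0 = max(0, pos[0] - size // 2)
    c0 = max(0, pos[1] - size // 2)
    poisoned[r0:r0 + size, c0:c0 + size] = value   # numpy clips at borders
    return poisoned


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.random((32, 64, 64))                # toy dataset of 64x64 images
    trigger_value = find_trigger_value(data)
    poisoned_set = np.stack([
        stamp_trigger(img, trigger_value, find_pivotal_position(img))
        for img in data
    ])
    print(f"trigger value: {trigger_value:.3f}, poisoned: {poisoned_set.shape}")
```

The design intuition this sketch captures is that a value far from the data distribution forms an easy "shortcut" feature for the model to latch onto, while placing it at a salient, image-specific position keeps the trigger effective across varied scenes; the multimodal extension and real-world trigger diversification (MTG) are beyond this toy example.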