The human digital twin (HDT) is envisioned as a system interconnecting physical twins (PTs) in the real world with virtual twins (VTs) in the digital world, enabling advanced human-centric applications. Unlike conventional services, where users' quality of experience (QoE) is optimized for single-modal signal transmission, the multi-modal signal transmission required by HDT makes users' QoE difficult to guarantee. To tackle this, we study QoE optimization in multi-modal transmission, focusing on joint visual and haptic feedback transmissions from a VT to its PT, to provide immersive interactions in HDT. To evaluate the combined visual and haptic experience, we design a comprehensive QoE model that accounts for video quality, the video quality switching rate, and the average haptic feedback error. Then, to maximize QoE while guaranteeing synchronization between the visual and haptic signal transmissions, we dynamically optimize the bandwidth allocation, the video bitrate and rendering mode, and the compression threshold of the haptic signal. To this end, we propose a deep reinforcement learning based algorithm, called VisHap. Furthermore, we build an HDT multi-modal interaction platform to collect an authentic dataset, with which we conduct experiments showing that VisHap is not only feasible but also outperforms its counterparts.
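As an illustrative sketch only (the abstract does not specify the model's exact form), such a QoE objective could be expressed as a weighted combination of the three stated terms, for instance

\[
\mathrm{QoE} = \alpha \sum_{t} q(r_t) \;-\; \beta \sum_{t} \bigl| q(r_t) - q(r_{t-1}) \bigr| \;-\; \gamma \, \bar{e}_{\mathrm{hap}},
\]

where \(q(r_t)\) would denote the perceived video quality at the bitrate \(r_t\) chosen in time slot \(t\), the second term would penalize video quality switching, and \(\bar{e}_{\mathrm{hap}}\) would denote the average haptic feedback error. The weights \(\alpha, \beta, \gamma\) and all symbols here are hypothetical placeholders, not the paper's notation.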