Text-guided diffusion models have advanced rapidly in generating and editing high-quality images. Several efforts have attempted to extend this success to video editing by adapting image generation models, yet they achieve only inferior performance. We attribute this to two challenges: 1) unlike static image generation, dynamic video content makes it difficult to maintain temporal fidelity, i.e., motion consistency across frames; 2) the randomness of the frame generation process makes it hard to consistently preserve spatial fidelity to the original detailed features. In this paper, we propose HiFiVEditor, a zero-shot, text-guided, diffusion-based video editing network that performs effective editing while remaining faithful to the detailed and dynamic information of the original video. Specifically, we propose a Spatial-Temporal Fidelity Block (STFB) that restores spatial features by enlarging the spatial receptive field to avoid losing important information, and captures inter-frame dynamics by attending to all frames, thereby preserving temporal consistency and achieving better temporal fidelity. In addition, we introduce Null-Text Embedding optimization, which creates a soft text embedding to guide the noise learning process so that the latent noise is aligned with the prompt. Furthermore, to tune the video style and render it more realistic, we employ a Prior-Guided Perceptual Loss that constrains the predictions from deviating from the original video's style. Extensive experiments demonstrate superior video editing capability compared to existing works.
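To make the "attend to all frames" idea behind the STFB concrete, the following is a minimal sketch, not the authors' implementation: it assumes a PyTorch U-Net whose per-frame spatial self-attention is replaced by attention over the tokens of all frames jointly, and all names here (FullFrameAttention, n_frames, the chosen dimensions) are hypothetical.

```python
# Minimal sketch of full-frame (all-frames) self-attention for temporal
# consistency. Assumptions: tokens arrive as per-frame sequences of shape
# (batch * n_frames, h * w, dim), as in a typical latent-diffusion U-Net.
import torch
import torch.nn as nn


class FullFrameAttention(nn.Module):
    """Self-attention where every token attends to tokens of ALL frames,
    rather than only its own frame, to encourage temporal consistency."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, n_frames: int) -> torch.Tensor:
        # x: (batch * n_frames, h * w, dim) -- per-frame token sequences
        bf, hw, dim = x.shape
        b = bf // n_frames
        # Merge the frame axis into the token axis so attention spans all frames.
        x_all = x.reshape(b, n_frames * hw, dim)
        h = self.norm(x_all)
        out, _ = self.attn(h, h, h)
        # Residual connection, then restore the per-frame layout.
        return (x_all + out).reshape(bf, hw, dim)


if __name__ == "__main__":
    n_frames, hw, dim = 8, 64, 320
    tokens = torch.randn(n_frames, hw, dim)  # batch size 1, 8 frames
    block = FullFrameAttention(dim)
    print(block(tokens, n_frames=n_frames).shape)  # torch.Size([8, 64, 320])
```

Attending over all frames trades memory for consistency: the attention cost grows quadratically with the number of frames, which is why such blocks are usually applied at the lower-resolution levels of the U-Net.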