BEV-based 3D perception with multi-frame image input is crucial for autonomous driving. However, current methods for temporal BEV perception fail to fully exploit long-sequence features, owing either to local fusion or to high computational complexity. Recently, Mamba, a powerful temporal modeling network with linear complexity, has shown exceptional performance in various 2D vision tasks, but its application to 3D perception remains unexplored. Therefore, this paper proposes a general BEV perception backbone named BEVMamba, the first work to leverage State Space Models for 3D perception. Built upon BEVFormer, BEVMamba adapts Mamba for 3D perception in three steps. First, a Hybrid Positional Encoding is added to the BEV features, making the network aware of their spatial-temporal positions. Next, in the Temporal SSM block, the proposed 3D Factorized Scan enriches historical BEV features with global spatial-temporal information. Finally, the Spatial-Temporal Corridor Fusion aggregates all BEV features in a physically meaningful manner, achieving precise feature fusion. The resulting BEV features are used for various perception tasks, including 3D object detection and 3D occupancy prediction. Results on the nuScenes and Occ3D-nuScenes datasets show that BEVMamba outperforms its baseline BEVFormer in both dense and sparse perception tasks and achieves competitive performance against other methods, highlighting the potential of Mamba in 3D perception. The code will be available at https://github.com/Liuxiaoaaa/bevmamba.
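To make the abstract's pipeline concrete, the following is a minimal, illustrative sketch of positional encoding followed by a factorized temporal-spatial scan over multi-frame BEV features. It is not the authors' implementation: the names (HybridPosEnc, TemporalSSMBlock, bev_h, bev_w) are hypothetical, the BEV features are assumed to be shaped [B, T, C, H, W], nn.GRU stands in for the Mamba SSM so the sketch runs without custom CUDA kernels, and the physically grounded Corridor Fusion is approximated here by a simple residual sum.

    # Hypothetical sketch of the BEVMamba-style temporal pipeline.
    # Assumptions: BEV features [B, T, C, H, W]; GRU as an SSM stand-in.
    import torch
    import torch.nn as nn

    class HybridPosEnc(nn.Module):
        """Learnable spatial + temporal encodings added to BEV features."""
        def __init__(self, num_frames, dim, bev_h, bev_w):
            super().__init__()
            self.spatial = nn.Parameter(torch.zeros(1, 1, dim, bev_h, bev_w))
            self.temporal = nn.Parameter(torch.zeros(1, num_frames, dim, 1, 1))

        def forward(self, bev):                       # bev: [B, T, C, H, W]
            return bev + self.spatial + self.temporal

    class TemporalSSMBlock(nn.Module):
        """Factorized scan: run one linear-time sequence model along time
        for every BEV cell, and one along the flattened spatial grid for
        every frame, then fuse both passes with a residual connection."""
        def __init__(self, dim):
            super().__init__()
            self.temporal_scan = nn.GRU(dim, dim, batch_first=True)
            self.spatial_scan = nn.GRU(dim, dim, batch_first=True)

        def forward(self, bev):                       # bev: [B, T, C, H, W]
            b, t, c, h, w = bev.shape
            # Temporal scan: a length-T sequence per BEV cell.
            ts = bev.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
            ts, _ = self.temporal_scan(ts)
            ts = ts.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)
            # Spatial scan: a length-H*W sequence per frame.
            ss = bev.permute(0, 1, 3, 4, 2).reshape(b * t, h * w, c)
            ss, _ = self.spatial_scan(ss)
            ss = ss.reshape(b, t, h, w, c).permute(0, 1, 4, 2, 3)
            return bev + ts + ss                      # residual fusion

    b, t, c, h, w = 2, 4, 32, 25, 25                  # toy sizes
    bev = torch.randn(b, t, c, h, w)
    bev = HybridPosEnc(t, c, h, w)(bev)
    out = TemporalSSMBlock(c)(bev)
    print(out.shape)                                  # [2, 4, 32, 25, 25]

Under these assumptions, the sketch preserves the key property the abstract claims: every output BEV cell can draw on global temporal and spatial context in linear time per scan direction, rather than through purely local fusion.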