The safety and reliability of autonomous driving hinge on the accuracy of the perception and motion prediction pipelines, which in turn depends primarily on the sensors deployed onboard. Slight errors in perception and motion prediction can have catastrophic consequences once misinterpretations propagate to downstream modules. Therefore, researchers have recently devoted considerable effort to developing accurate perception and motion prediction models. To that end, we propose the LIDAR Camera network (LiCaNet), which leverages multi-modal fusion to further enhance the joint perception and motion prediction performance achieved in our earlier work. LiCaNet expands on our previous fusion network by adding the camera image to the fusion of the range-view (RV) image with historical bird's-eye-view (BEV) data, both sourced from a LIDAR sensor. We present a comprehensive evaluation that validates the outstanding performance of LiCaNet compared to the state-of-the-art. Experiments reveal that utilizing a camera sensor yields a substantial perception gain over our previous fusion network and a steep reduction in displacement errors. Moreover, the majority of the improvement falls within camera range, with the largest gains registered for small and distant objects, confirming the significance of incorporating a camera sensor into a fusion network.
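To make the three-stream fusion described above concrete, the following is a minimal, hypothetical sketch of how camera, RV, and historical BEV inputs could be encoded and combined in a shared BEV grid. It is illustrative only: the module names, layer sizes, and the caller-supplied `project_to_bev` warping function are assumptions for exposition and do not reflect the published LiCaNet architecture.

```python
import torch
import torch.nn as nn


class MultiModalFusionSketch(nn.Module):
    """Illustrative three-stream fusion of a camera image, a LIDAR range-view
    (RV) image, and a stack of historical LIDAR bird's-eye-view (BEV) frames.
    All layer widths are placeholders, not the published configuration."""

    def __init__(self, bev_frames: int = 5, fused_channels: int = 64):
        super().__init__()
        # Per-modality encoders producing feature maps of equal channel width.
        self.camera_enc = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.rv_enc = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU())
        self.bev_enc = nn.Sequential(nn.Conv2d(bev_frames, 32, 3, padding=1), nn.ReLU())
        # Fusion head applied once all features are aligned in the BEV grid.
        self.fuse = nn.Conv2d(32 * 3, fused_channels, kernel_size=1)

    def forward(self, camera, rv, bev, project_to_bev):
        # project_to_bev: caller-supplied (hypothetical) callable that warps
        # perspective/RV feature maps onto the BEV grid, e.g. via the LIDAR
        # point cloud geometry.
        cam_feat = project_to_bev(self.camera_enc(camera))
        rv_feat = project_to_bev(self.rv_enc(rv))
        bev_feat = self.bev_enc(bev)
        # Concatenate the aligned features and produce the fused BEV map that
        # a downstream perception and motion prediction head would consume.
        return self.fuse(torch.cat([cam_feat, rv_feat, bev_feat], dim=1))
```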