In this work, we propose a novel deep-learning-based sensor fusion framework that uses both camera and LiDAR sensors in a multi-modal and multi-view setting. To leverage both data streams, we introduce two fusion mechanisms: element-wise multiplication and multi-modal factorized bilinear (MFB) pooling. Compared to previously used fusion operators such as element-wise addition and concatenation of feature maps, our proposed fusion methods significantly increase the bird’s eye view moderate average precision score, by +4.97% and +8.35%, respectively, when evaluated on the KITTI object detection dataset. Furthermore, we provide a detailed study of design choices that contribute to the performance of deep-learning-based sensor fusion frameworks, including data augmentation, multi-task learning, and the design of the convolutional architecture. Finally, we present qualitative results that showcase both success and failure cases of our proposed framework, and we discuss directions for mitigating the failure cases.
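To make the two fusion operators concrete, the sketch below illustrates element-wise multiplication and MFB pooling on per-modality feature vectors. It is a minimal, illustrative example in PyTorch, not the paper's exact implementation; all module names, dimensions, and the factor size are assumptions chosen for clarity.

```python
# Illustrative sketch of the two fusion operators (assumed shapes and names).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ElementwiseMultiplyFusion(nn.Module):
    """Fuse camera and LiDAR features by element-wise multiplication."""
    def forward(self, cam_feat, lidar_feat):
        return cam_feat * lidar_feat  # shapes must match

class MFBFusion(nn.Module):
    """Multi-modal factorized bilinear (MFB) pooling, sketched for vectors."""
    def __init__(self, cam_dim, lidar_dim, out_dim, factor_k=5):
        super().__init__()
        self.factor_k = factor_k
        self.proj_cam = nn.Linear(cam_dim, out_dim * factor_k)
        self.proj_lidar = nn.Linear(lidar_dim, out_dim * factor_k)

    def forward(self, cam_feat, lidar_feat):
        # Project both modalities, fuse by element-wise product,
        # then sum-pool over the factor dimension k.
        joint = self.proj_cam(cam_feat) * self.proj_lidar(lidar_feat)
        joint = joint.view(joint.size(0), -1, self.factor_k).sum(dim=2)
        # Power normalization followed by L2 normalization, as in standard MFB.
        joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-8)
        return F.normalize(joint, dim=1)

# Usage example with placeholder batch size and feature dimensions.
cam = torch.randn(4, 256)
lidar = torch.randn(4, 256)
fused_mul = ElementwiseMultiplyFusion()(cam, lidar)        # (4, 256)
fused_mfb = MFBFusion(256, 256, out_dim=128)(cam, lidar)   # (4, 128)
```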