This work addresses the problem of predicting a pedestrian’s intention to cross the road from visual data captured by a camera. The proposed ROS-based modular architecture consists of four modules: Visual Perception, Intention Prediction, Planning, and Control. The Visual Perception module is divided into three sub-modules. The first, pedestrian detection, detects the pedestrian and analyzes their state using motion and looking classifiers. The second, lane detection, analyzes the structured environment and supports the road state classifiers. The third extracts the curvilinear localization states that are essential for the vehicle’s motion planning and control. The Intention Prediction module captures the pedestrian’s intention to cross the road. Within this module, a comparative study is conducted between three data-driven sequential models, each trained on the JAAD dataset with different features extracted from the Visual Perception module. The proposed GRU model obtains an average F1-score of 86% and can predict a pedestrian’s intention three seconds before crossing. To control the vehicle’s maneuvers, a Proportional-Integral (PI) controller performs longitudinal velocity control, braking the vehicle to avoid collision with the pedestrian, and a Linderoth controller controls the lateral motion of the vehicle. Finally, the work is verified on a 1:4 scaled real vehicle to demonstrate its applicability to real hardware.
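As an illustration of the longitudinal control idea only, the following minimal sketch shows a discrete PI velocity controller that brakes a vehicle to a stop. It is not the implementation used in this work: the class `PIController`, the function `simulate_braking`, the gains, the normalized command limits, the anti-windup rule, and the point-mass vehicle model are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class PIController:
    """Discrete PI controller with a clamped output (hypothetical helper)."""
    kp: float               # proportional gain (assumed value, not from the paper)
    ki: float               # integral gain (assumed value, not from the paper)
    out_min: float = -1.0   # full brake, normalized command
    out_max: float = 1.0    # full throttle, normalized command
    integral: float = 0.0   # accumulated speed error

    def update(self, target_speed: float, measured_speed: float, dt: float) -> float:
        error = target_speed - measured_speed
        candidate = self.integral + error * dt
        raw = self.kp * error + self.ki * candidate
        command = min(max(raw, self.out_min), self.out_max)
        # Simple anti-windup: accumulate the integral only when not saturated.
        if raw == command:
            self.integral = candidate
        return command


def simulate_braking(initial_speed: float = 2.0, max_decel: float = 3.0,
                     dt: float = 0.05, steps: int = 200) -> float:
    """Brake to a target speed of zero with an assumed point-mass model."""
    controller = PIController(kp=1.0, ki=0.2)
    speed = initial_speed
    for _ in range(steps):
        command = controller.update(0.0, speed, dt)
        # Assumed model: acceleration is proportional to the command,
        # and the vehicle cannot move backwards while braking.
        speed = max(0.0, speed + max_decel * command * dt)
    return speed


if __name__ == "__main__":
    print(f"final speed: {simulate_braking():.3f} m/s")
```

In a setup like the one described, the target speed would be set to zero when a crossing intention is predicted; here it is hard-coded for brevity. Clamping the command and freezing the integral while saturated keeps the accumulated error from growing during hard braking.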