Predicting future walking joint kinematics is crucial for assistive device control, especially in variable walking environments. Traditional optical motion capture systems provide kinematics data but require laborious post-processing, whereas IMU-based systems compute kinematics directly but introduce delays from data collection and algorithmic processing. Predicting future kinematics compensates for these delays, enabling real-time operation. Furthermore, the predicted kinematics can serve as target trajectories for assistive devices such as exoskeletal robots and lower-limb prostheses. However, given the complexity of human mobility and environmental factors, this prediction remains challenging. To address this challenge, we propose the Dual-ED-Attention-FAM-Net, a deep learning model comprising two encoders, two decoders, a temporal attention module, and a feature attention module. Our model outperforms the state-of-the-art LSTM model. Specifically, for Dataset A, using IMUs alone and IMUs combined with video, RMSE decreases from 4.45° to 4.22° and from 4.52° to 4.15°, respectively. For Dataset B, using IMUs alone and IMUs combined with pressure insoles, RMSE decreases from 7.09° to 6.66° and from 7.20° to 6.77°, respectively. Additionally, incorporating other modalities alongside IMUs further improves model performance.
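
To make the architecture description concrete, the following is a minimal, illustrative PyTorch sketch of a dual-encoder network with feature attention on the inputs and temporal attention over the fused encoding, forecasting future joint angles autoregressively. It is not the paper's implementation: it collapses the two decoders into a single decoder for brevity, and all layer sizes, sequence lengths, and attention formulations are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class FeatureAttention(nn.Module):
    """Reweights input channels (e.g., IMU axes) with learned per-feature scores."""
    def __init__(self, num_features):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(num_features, num_features), nn.Sigmoid())

    def forward(self, x):                       # x: (batch, time, features)
        weights = self.score(x.mean(dim=1))     # (batch, features)
        return x * weights.unsqueeze(1)


class TemporalAttention(nn.Module):
    """Attends over encoder time steps when producing each decoder output."""
    def __init__(self, hidden):
        super().__init__()
        self.attn = nn.Linear(hidden * 2, 1)

    def forward(self, dec_h, enc_out):          # dec_h: (batch, hidden); enc_out: (batch, time, hidden)
        dec_rep = dec_h.unsqueeze(1).expand(-1, enc_out.size(1), -1)
        scores = self.attn(torch.cat([dec_rep, enc_out], dim=-1))   # (batch, time, 1)
        weights = torch.softmax(scores, dim=1)
        return (weights * enc_out).sum(dim=1)   # context vector: (batch, hidden)


class DualEncoderAttentionSketch(nn.Module):
    """Two modality-specific encoders with feature attention, temporal attention,
    and a single autoregressive decoder (simplified from the paper's two decoders)."""
    def __init__(self, feat_a, feat_b, hidden=64, out_dim=6, horizon=10):
        super().__init__()
        self.horizon = horizon
        self.fam_a, self.fam_b = FeatureAttention(feat_a), FeatureAttention(feat_b)
        self.enc_a = nn.LSTM(feat_a, hidden, batch_first=True)
        self.enc_b = nn.LSTM(feat_b, hidden, batch_first=True)
        self.tam = TemporalAttention(hidden)
        self.dec = nn.LSTMCell(hidden, hidden)
        self.head = nn.Linear(hidden, out_dim)

    def forward(self, x_a, x_b):                # x_*: (batch, time, features)
        enc_a, _ = self.enc_a(self.fam_a(x_a))
        enc_b, _ = self.enc_b(self.fam_b(x_b))
        enc_out = enc_a + enc_b                 # simple additive fusion of the two encodings
        h = enc_out[:, -1]                      # initialise decoder state from last fused step
        c = torch.zeros_like(h)
        preds = []
        for _ in range(self.horizon):           # predict future joint angles step by step
            context = self.tam(h, enc_out)
            h, c = self.dec(context, (h, c))
            preds.append(self.head(h))
        return torch.stack(preds, dim=1)        # (batch, horizon, out_dim)


if __name__ == "__main__":
    # Hypothetical feature counts, e.g., IMU channels plus video-derived features.
    model = DualEncoderAttentionSketch(feat_a=24, feat_b=16)
    imu, video = torch.randn(2, 100, 24), torch.randn(2, 100, 16)
    print(model(imu, video).shape)              # torch.Size([2, 10, 6])
```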