To maximize the autonomy of individuals with upper-limb amputations in daily activities, leveraging forearm muscle information to infer movement intent is a promising research direction. While current prosthetic hand technologies can use forearm muscle data to drive basic movements such as grasping, accurately estimating finger joint angles remains a significant challenge. We therefore propose a Multi-Stage Cascade Convolutional Neural Network with a Long Short-Term Memory network, in which an upsampling module is placed before the downsampling module to enhance model generalization. We also designed a transfer learning framework based on parameter freezing: the pre-trained downsampling module is fixed, and only the upsampling module is updated with a small amount of out-of-distribution data. Furthermore, we compared the performance of unimodal and multimodal models, simultaneously collecting surface electromyography (sEMG) signals, brightness-mode ultrasound (B-mode US) images, and motion capture data. On the validation set, the US-image model achieved the lowest error, whereas on the prediction set the four-channel sEMG model achieved the lowest error; on both sets, the multimodal model's error fell between those of the unimodal models. On the prediction set, the average normalized root mean square error (NRMSE) values for the four-channel sEMG, US-image, and sensor-fusion models across three subjects were 0.170, 0.203, and 0.186, respectively. By combining sensor fusion with transfer learning, our approach reduces the need for extensive data collection and training for new users, making prosthetic control more accessible and adaptable to individual needs.
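To make the cascade structure and the parameter-freezing scheme concrete, the following is a minimal PyTorch sketch. It assumes an illustrative layout (layer counts, channel widths, and module names such as `upsample` and `downsample` are our assumptions, not the authors' exact architecture): an upsampling module precedes a downsampling module, an LSTM models the temporal dynamics, and a linear head regresses finger joint angles. For transfer learning, every parameter outside the upsampling module is frozen, so only that module is updated on the small out-of-distribution set.

```python
import torch
import torch.nn as nn

class CascadeCNNLSTM(nn.Module):
    """Illustrative cascade: upsampling -> downsampling -> LSTM -> joint angles.
    Layer sizes are placeholders, not the paper's exact configuration."""
    def __init__(self, in_channels=4, hidden=64, n_joints=10):
        super().__init__()
        # Upsampling module, introduced before the downsampling module
        self.upsample = nn.Sequential(
            nn.ConvTranspose1d(in_channels, 32, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
        )
        # Downsampling (feature-extraction) module; pre-trained, then frozen
        self.downsample = nn.Sequential(
            nn.Conv1d(32, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_joints)

    def forward(self, x):                 # x: (batch, channels, time)
        z = self.downsample(self.upsample(x))
        z = z.permute(0, 2, 1)            # (batch, time, features) for the LSTM
        out, _ = self.lstm(z)
        return self.head(out[:, -1])      # angles predicted from the last step

model = CascadeCNNLSTM()
# model.load_state_dict(torch.load("pretrained.pt"))  # hypothetical checkpoint

# Parameter freezing: only the upsampling module stays trainable,
# so fine-tuning on out-of-distribution data touches nothing else.
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("upsample")

optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)

# Smoke test: a batch of 8 four-channel windows, 200 samples long
y = model(torch.randn(8, 4, 200))         # -> (8, n_joints)
```

Because the optimizer receives only the trainable parameters, each fine-tuning step updates a small fraction of the network, which is what keeps the per-user adaptation cheap in data and compute.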
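For reference, NRMSE is commonly computed as the root mean square error normalized by the range of the measured signal; the exact normalization used in this work is an assumption here:

$$\mathrm{NRMSE} = \frac{\sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(\hat{\theta}_t - \theta_t\right)^2}}{\theta_{\max} - \theta_{\min}},$$

where $\theta_t$ is the measured joint angle at time $t$, $\hat{\theta}_t$ the predicted angle, and $\theta_{\max} - \theta_{\min}$ the range of the measured angles.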