The rapid development of wearable sensors enables convenient data collection in daily human life. Human Activity Recognition (HAR), a prominent research direction for wearable applications, has made remarkable progress in recent years. However, existing efforts mostly focus on improving recognition accuracy and pay limited attention to a model's functional scalability, specifically its capacity for continual learning, which greatly restricts its application in open-world scenarios. Moreover, due to storage and privacy concerns, it is often impractical to retain the activity data of different users for subsequent tasks, especially egocentric visual data. Furthermore, the imbalance between the vision-based and inertial-measurement-unit (IMU) sensing modalities leads to poor generalization when conventional continual learning techniques are applied. In this paper, we propose a motivational learning scheme that addresses the limited generalization caused by this modality imbalance, enabling foreseeable generalization in a visual-IMU multimodal network. To overcome forgetting, we introduce a robust representation estimation technique and a pseudo-representation generation strategy for continual learning. Experimental results on the egocentric multimodal activity dataset UESTC-MMEA-CL demonstrate the effectiveness of the proposed method. Furthermore, our method effectively leverages the generalization capability of IMU-based modal representations and outperforms both general and state-of-the-art continual learning methods under various task settings.