With the growth of social media and human-computer interaction, perceiving people's emotional states from videos has become essential. In recent years, a large number of studies have tackled emotion recognition based on the three most common modalities in videos: face, speech, and text. Given the lack of review papers covering all three modalities, this paper surveys studies of emotion recognition that use facial, speech, and textual cues based on deep learning techniques. We first introduce widely accepted emotion models in order to clarify how emotion is defined. We then review the state of the art in unimodal emotion recognition, including facial expression recognition, speech emotion recognition, and textual emotion recognition. For multimodal emotion recognition, we summarize feature-level and decision-level fusion methods in detail. In addition, we describe the relevant benchmark datasets, define the evaluation metrics, and report the performance of recent state-of-the-art methods so that readers can readily assess current research progress. Finally, we discuss open research challenges and opportunities to provide researchers with a reference for advancing emotion recognition research.