WiFi-based human activity recognition (HAR) has been extensively studied due to its far-reaching applications in health domains, including elderly monitoring, exercise supervision, and rehabilitation monitoring. Although existing supervised deep learning techniques have achieved remarkable performance on these tasks, they are data-hungry and hence notoriously difficult to apply, given the privacy sensitivity and poor interpretability of WiFi-based HAR data. Existing contrastive learning models, which are mainly designed for computer vision, cannot guarantee their performance on channel state information (CSI) data. To this end, we propose a new dual-stream contrastive learning model that can process and learn from raw WiFi CSI data in a self-supervised manner. More specifically, our proposed method, coined DualConFi, takes raw WiFi CSI data as input and consists of a channel stream and a temporal stream that learn highly discriminative spatiotemporal features under a mutual information constraint using unlabeled data. We demonstrate the efficacy of our model on three publicly available CSI data sets in various experimental settings, including linear evaluation, semi-supervised learning, and transfer learning. We show that DualConFi performs favorably against challenging baselines in each setting. Moreover, by studying the effects of different transform functions on CSI data, we verify the effectiveness of the highly discriminative features.