With the increasing accessibility of remote sensing videos, object tracking in remote sensing video is becoming an active research topic. However, accurately detecting and tracking objects in complex remote sensing scenes remains a challenge. In this paper, we propose a collaborative learning tracking network for remote sensing videos, comprising a consistent receptive field parallel fusion (CRFPF) module, a dual-branch spatial-channel co-attention (DSCA) module, and a geometric constraint re-track (GCRT) strategy. Because general feed-forward networks struggle to extract effective features from the small objects typical of remote sensing scenes, the CRFPF module establishes parallel branches with consistent receptive fields that separately extract features from shallow to deep levels and then fuses the hierarchical features adaptively. Because objects are difficult to distinguish from their background, the DSCA module uses a spatial-channel co-attention mechanism to collaboratively learn the relevant information, which enhances the saliency of the objects and supports the regression of precise bounding boxes. To handle interference from similar objects, we design the GCRT strategy, which uses the estimated motion trajectory to judge whether a false detection has occurred and then recovers the correct object by weakening the feature response of the interfering object. Experimental results on multiple datasets, together with theoretical analysis, demonstrate the feasibility and effectiveness of the proposed method. Code and network are available at https://github.com/Dawn5786/CoCRF-TrackNet.
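The re-track idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the constant-velocity motion model, the distance threshold `max_dist`, the suppression `radius`, and the function names `predict_next`, `argmax2d`, and `retrack` are all assumptions made for illustration, and the interfering response is suppressed entirely rather than merely weakened.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (row, col) center of the tracked object


def predict_next(track: List[Point]) -> Point:
    """Extrapolate the next center from the last two (assumed constant velocity)."""
    (r1, c1), (r2, c2) = track[-2], track[-1]
    return (2 * r2 - r1, 2 * c2 - c1)


def argmax2d(resp: List[List[float]]) -> Tuple[int, int]:
    """Return the (row, col) of the largest value in a 2D response map."""
    best = (0, 0)
    for r, row in enumerate(resp):
        for c, v in enumerate(row):
            if v > resp[best[0]][best[1]]:
                best = (r, c)
    return best


def retrack(resp: List[List[float]], track: List[Point],
            max_dist: float = 2.0, radius: int = 1) -> Tuple[int, int]:
    """Pick the response peak that agrees with the predicted trajectory.

    A peak farther than max_dist from the predicted center is treated as a
    false detection; its neighborhood is suppressed and the search repeats.
    """
    resp = [list(row) for row in resp]  # work on a copy, keep the input intact
    pred = predict_next(track)
    rows, cols = len(resp), len(resp[0])
    for _ in range(rows * cols):  # upper bound on the number of retries
        r, c = argmax2d(resp)
        dist = ((r - pred[0]) ** 2 + (c - pred[1]) ** 2) ** 0.5
        if dist <= max_dist:
            return (r, c)
        # Suppress the response around the suspected interfering object.
        for i in range(max(0, r - radius), min(rows, r + radius + 1)):
            for j in range(max(0, c - radius), min(cols, c + radius + 1)):
                resp[i][j] = float("-inf")
    # No consistent peak found: fall back to the predicted position.
    return (round(pred[0]), round(pred[1]))
```

In this sketch, a strong peak produced by a similar-looking object far from the predicted position is rejected, and a weaker peak that lies on the estimated trajectory is returned instead.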