In brain-computer interface (BCI) applications, imagined speech (IMS) decoding based on electroencephalography (EEG) has emerged as a new neuro-paradigm that offers an intuitive communication tool for physically impaired patients. However, existing IMS-EEG-based BCI systems remain difficult to deploy in practice due to nonstationary EEG signals, suboptimal feature extraction, and limited multi-class scalability. To address these challenges, we present a novel approach using the multivariate swarm-sparse decomposition method (MSSDM) for joint time-frequency (JTF) analysis, and develop a feasible end-to-end framework for imagined speech detection from multichannel IMS-EEG signals. MSSDM employs improved multivariate swarm filtering and sparse spectrum techniques to design optimal filter banks for extracting an ensemble of channel-aligned oscillatory components (CAOCs), significantly enhancing IMS activation-related sub-bands. To exploit channel-aligned information, multivariate JTF images are constructed from the obtained CAOCs using the joint instantaneous frequency and instantaneous amplitude across channels. JTF-based deep features (JTFDF) are then computed using different pretrained neural networks, and the most discriminant features are selected using two well-known feature correlation techniques: canonical correlation analysis and Hellinger distance-based correlation. The proposed method has been tested on the 5-class BCI Competition DB and the 6-class Coretto DB IMS datasets. Cross-subject experimental findings reveal that the novel JTFDF-based classification model, MSSDM-SqueezeNet-JTFDF, achieved the highest classification performance, outperforming all existing state-of-the-art methods in imagined speech recognition.
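Since MSSDM is the authors' novel decomposition and its swarm-filter bank design cannot be reproduced from the abstract alone, the following is only a minimal Python sketch of the downstream JTF image construction step: it substitutes a generic Butterworth filter bank for MSSDM to obtain band-limited components, then derives Hilbert-transform-based instantaneous amplitude and instantaneous frequency and pools them jointly across channels into a multivariate JTF image. All function names, band choices, and the pooling rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def filter_bank_components(eeg, fs, bands):
    """Stand-in for MSSDM: band-limit each channel with a Butterworth
    filter bank, yielding one oscillatory component per (band, channel).
    eeg: (n_channels, n_samples) array."""
    comps = []
    for lo, hi in bands:
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        comps.append(sosfiltfilt(sos, eeg, axis=-1))
    return np.stack(comps)  # (n_bands, n_channels, n_samples)

def joint_tf_image(comps, fs):
    """Build a multivariate JTF image from instantaneous amplitude (IA)
    and instantaneous frequency (IF) of each component, pooled across
    channels per band (an assumed joint representation)."""
    analytic = hilbert(comps, axis=-1)
    ia = np.abs(analytic)                                # instantaneous amplitude
    phase = np.unwrap(np.angle(analytic), axis=-1)
    inst_f = np.diff(phase, axis=-1) * fs / (2 * np.pi)  # instantaneous frequency (Hz)
    ia = ia[..., 1:]                                     # align IA with IF length
    # Amplitude-weighted IF per band, pooled over channels.
    jtf = (ia * inst_f).sum(axis=1) / (ia.sum(axis=1) + 1e-12)
    return np.stack([jtf, ia.mean(axis=1)])  # (2, n_bands, n_samples - 1)

if __name__ == "__main__":
    fs = 256
    rng = np.random.default_rng(0)
    eeg = rng.standard_normal((6, fs * 2))          # 6 channels, 2 s of toy data
    bands = [(4, 8), (8, 13), (13, 30), (30, 45)]   # theta, alpha, beta, gamma
    comps = filter_bank_components(eeg, fs, bands)
    image = joint_tf_image(comps, fs)
    print(image.shape)  # (2, 4, 511)
```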
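For the canonical correlation analysis step, a minimal sketch using scikit-learn's CCA is shown below, assuming two deep-feature views per trial (e.g., JTFDF vectors from two different pretrained backbones). The feature dimensions, number of components, and fusion by concatenation are assumptions for illustration; a Hellinger distance-based sketch follows after this block.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_trials = 120
# Hypothetical JTFDF matrices from two pretrained networks (one row per trial).
feats_a = rng.standard_normal((n_trials, 512))
feats_b = rng.standard_normal((n_trials, 256))

# Project both views onto maximally correlated canonical directions,
# retaining the most correlated (most discriminant) components.
cca = CCA(n_components=32)
za, zb = cca.fit_transform(feats_a, feats_b)

# Fused feature vector per trial for a downstream classifier.
fused = np.hstack([za, zb])
print(fused.shape)  # (120, 64)
```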
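The abstract does not specify how the Hellinger distance-based correlation is computed; the sketch below uses one common formulation as an assumption: scoring each feature by the Hellinger distance, H(P, Q) = (1/sqrt(2)) * ||sqrt(P) - sqrt(Q)||_2, between its class-conditional histograms, so features whose distributions separate the classes rank highest. The binary-class simplification and all names are hypothetical.

```python
import numpy as np

def hellinger_distance(p, q):
    """Hellinger distance between two discrete probability distributions."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def rank_features_by_hellinger(X, y, n_bins=20):
    """Score each feature by the Hellinger distance between its
    class-conditional histograms (binary labels {0, 1}); higher = more
    discriminant. X: (n_trials, n_features), y: (n_trials,)."""
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        edges = np.histogram_bin_edges(X[:, j], bins=n_bins)
        p, _ = np.histogram(X[y == 0, j], bins=edges)
        q, _ = np.histogram(X[y == 1, j], bins=edges)
        scores[j] = hellinger_distance(p / p.sum(), q / q.sum())
    return np.argsort(scores)[::-1]  # indices, most discriminant first

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((200, 50))
    y = rng.integers(0, 2, 200)
    X[y == 1, 3] += 2.0  # make feature 3 clearly discriminant
    print(rank_features_by_hellinger(X, y)[:5])
```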