The meaningful integration of environmental and neuroimaging information made available by multiple research centers is of significant interest for the study of psychiatric disorders, but it is complicated by endogenous and exogenous confounders. To address these open challenges, we developed a pipeline for extracting relevant, confounder-free brain-environment latent information with a multimodal convolutional neural network autoencoder (CNN-AE). In our study, the model was applied to brain functional connectivity (FC) and environmental data from the multi-site PRONIA cohort, composed of healthy controls and individuals with recent-onset depressive (ROD) and psychotic (ROP) disorders. We identified site as a confounder of the brain features, and diagnosis and sex as the outcomes of interest; we therefore designed the multimodal CNN-AE to integrate brain-environment latent representations while incorporating a strategy for mitigating site effects in the multi-site FC data, thereby facilitating the downstream classification of diagnosis and sex. The site harmonization strategy is based on disentangled representation learning, using a site deconfounding block (DB) that leverages a cross-covariance (xcov) penalty. We compared the CNN-AE with the site DB placed i) in the brain-related branch and ii) in the fusion branch against iii) the CNN-AE without a DB, applied after harmonization with the state-of-the-art ComBat model. Our experimental results demonstrate that the CNN-AE with the DB placed directly in the brain-related branch surpassed ComBat in removing site effects and improved the identifiability of sex and diagnosis effects in the fused brain-environment latent space. These findings highlight the potential of embedding deconfounding processes directly into deep learning (DL) analysis pipelines, offering a more effective approach for addressing confounding variables in neuroimaging studies.
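For illustration, the sketch below shows one way a cross-covariance (xcov) penalty of this kind can be expressed, assuming a PyTorch implementation; the function name `xcov_penalty`, the tensors `z` (batch of latent features) and `s` (one-hot site labels), and the weighting term `lambda_xcov` are illustrative placeholders, not identifiers from the study.

```python
import torch

def xcov_penalty(z: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    """Cross-covariance (xcov) penalty between latent features z (N x D)
    and one-hot site labels s (N x K): the sum of squared entries of the
    batch cross-covariance matrix. Driving this term toward zero
    discourages the latent code from (linearly) encoding site."""
    z_c = z - z.mean(dim=0, keepdim=True)   # center latent features
    s_c = s - s.mean(dim=0, keepdim=True)   # center site indicators
    n = z.shape[0]
    cov = z_c.t() @ s_c / n                 # D x K cross-covariance matrix
    return 0.5 * (cov ** 2).sum()

# Hypothetical use inside a training step (placeholders, not the paper's code):
# z = brain_encoder(fc_batch)                              # brain-branch latent code
# loss = recon_loss + lambda_xcov * xcov_penalty(z, site_onehot)
```

In such a setup, the penalty would typically be added to the autoencoder's reconstruction loss at the branch where deconfounding is desired (here, the brain-related branch), so that site information is suppressed before the latent representations are fused.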