EV proteomic data processing and analysis
Quality control of the sequencing data was performed using FastQC and fastp, and reads with a quality score below 20 were discarded. All data analyses were conducted in R (version 4.3.1) within the RStudio environment. The EV ID–protein expression dataset was generated by counting, for each complexTag, the number of distinct moleculeTags associated with each proteinTag. Total protein abundance for each sample, referred to as the raw bulk abundance data, was determined by summing the moleculeTag counts for each protein. Differences in library size between samples were corrected using trimmed mean of M-values (TMM) normalization. For the bulk data, differential expression analysis was performed using the DESeq2 package, and the results were visualized with the pheatmap package.
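The counting step described above can be illustrated with a minimal Python sketch. It is not the code used in this study (the analysis was run in R); the function names, the tuple layout of the input records, and the toy tag values are hypothetical. It also assumes that the bulk abundance of a protein is the sum of its per-EV distinct moleculeTag counts.

```python
from collections import defaultdict


def build_ev_protein_counts(records):
    """Count distinct moleculeTags per (complexTag, proteinTag) pair.

    `records` is an iterable of (complex_tag, protein_tag, molecule_tag)
    tuples; this layout is an assumption made for illustration.
    """
    seen = defaultdict(set)
    for complex_tag, protein_tag, molecule_tag in records:
        # A set keeps each moleculeTag once, so duplicate reads are ignored.
        seen[(complex_tag, protein_tag)].add(molecule_tag)

    counts = defaultdict(dict)
    for (complex_tag, protein_tag), tags in seen.items():
        counts[complex_tag][protein_tag] = len(tags)
    return dict(counts)


def bulk_abundance(ev_counts):
    """Sum per-EV counts into one total per protein (assumed bulk definition)."""
    totals = defaultdict(int)
    for proteins in ev_counts.values():
        for protein, n in proteins.items():
            totals[protein] += n
    return dict(totals)


# Toy records, invented for this example only.
records = [
    ("EV1", "CD63", "m1"),
    ("EV1", "CD63", "m1"),  # duplicate read of the same molecule
    ("EV1", "CD63", "m2"),
    ("EV1", "CD9", "m3"),
    ("EV2", "CD63", "m4"),
]
ev_counts = build_ev_protein_counts(records)
bulk = bulk_abundance(ev_counts)
print(ev_counts)
print(bulk)
```

In this sketch each complexTag plays the role of an EV ID, so `ev_counts` corresponds to the EV ID–protein matrix in sparse form and `bulk` to the raw bulk abundance vector for one sample.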
All single-EV proteomic data analyses were conducted using the Seurat R package. Differential expression analysis between the control and experimental groups, or between clusters, was performed with FindMarkers and FindAllMarkers. Functional enrichment analysis was conducted using the clusterProfiler R package (version 4.2.2) with pAdjustMethod = "BH"; all other parameters were left at their default settings. Dimensionality reduction and clustering were performed using the FindNeighbors function (dims = 1:20) and the FindClusters function (resolution = 0.8), and Harmony was used to remove batch effects during this process. The identified clusters were visualized using uniform manifold approximation and projection (UMAP). The high number of clusters obtained suggested potential over-segmentation. To address this, we manually merged certain clusters based on their enriched and characteristic genes, resulting in fewer but more interpretable groups. Each group was then named according to its characteristic genes (e.g., Immunity Cluster for immune-related proteins). We used the Ro/e statistic (the ratio of observed to expected EV numbers) to quantify the proportions of EVs from the experimental and control groups within individual clusters and sub-clusters, facilitating comparison of group-specific differences in cluster composition. To avoid overestimation or bias caused by highly similar proteomic profiles between clusters, we also used the BayesPrism package, which controls for statistical uncertainty through marginalization and integrates results from the bulk and single-EV analyses. This approach reconciles bulk-level patterns with single-EV heterogeneity, enabling a more comprehensive understanding of the molecular landscape.
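As an illustration of the Ro/e statistic mentioned above, the following Python sketch computes the ratio of observed to expected EV counts for each cluster and group. Expected counts are taken from the contingency table in the usual chi-square manner (row total × column total / grand total). The function name, the nested-dictionary input, and the toy counts are hypothetical and are not taken from this study.

```python
def ro_e(observed):
    """Return observed/expected ratios for a cluster-by-group count table.

    `observed` maps cluster name -> {group name -> EV count}. Expected
    counts assume independence between cluster and group.
    """
    groups = sorted({g for row in observed.values() for g in row})
    row_totals = {c: sum(row.get(g, 0) for g in groups)
                  for c, row in observed.items()}
    col_totals = {g: sum(row.get(g, 0) for row in observed.values())
                  for g in groups}
    grand_total = sum(row_totals.values())
    if grand_total == 0:
        raise ValueError("the count table is empty")

    ratios = {}
    for cluster, row in observed.items():
        ratios[cluster] = {}
        for group in groups:
            expected = row_totals[cluster] * col_totals[group] / grand_total
            # Ratio is undefined when the expected count is zero.
            ratios[cluster][group] = (
                row.get(group, 0) / expected if expected > 0 else float("nan")
            )
    return ratios


# Toy counts, invented for this example only.
counts = {
    "Cluster A": {"control": 30, "experimental": 10},
    "Cluster B": {"control": 20, "experimental": 40},
}
ratios = ro_e(counts)
for cluster, row in ratios.items():
    print(cluster, {g: round(v, 2) for g, v in row.items()})
```

A ratio above 1 indicates that a group is over-represented in a cluster relative to expectation, and a ratio below 1 indicates under-representation.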