Verifying model performance is essential for identifying strengths, weaknesses, and areas for improvement. This paper assesses the performance of four ionospheric models (IRI-2020, TIEGCM, WAM-IPE, and GITM) in estimating the critical frequency of the F2 layer (foF2). The reference dataset was compiled from 40 to 50 ionosondes and COSMIC-2 satellite data. Model performance was quantified using the Root-Mean-Square Error (RMSE), the Prediction Efficiency (PE), the Ratio of the Range (RR), and the Ratio of the maximum Amplitude (RA). The investigation period, from March 8 to April 30, 2023, featured diverse space weather conditions. Our findings indicate that, under conditions with maximum Kp < 7, IRI-2020 achieved better RMSE and PE scores than the other models in low- and mid-latitude regions. IRI-2020 was weakest in the low-latitude post-sunset sector but still outperformed the other models there. WAM-IPE performed better than TIEGCM and GITM; it generally underestimated foF2 peaks in low- and mid-latitude regions but excelled at estimating foF2 ranges, as indicated by a superior RR score. TIEGCM exhibited unusually large foF2 ranges compared to observations in the low- and mid-latitude pre-sunrise and pre-sunset sectors. GITM consistently underestimated foF2 values at low and mid latitudes. For maximum Kp > 7, WAM-IPE showed the highest sensitivity to storm conditions but incorrectly estimated the absolute timing, location, and magnitude of foF2 changes. TIEGCM and GITM responses during disturbed periods were complex due to inherent model issues. At high latitudes, all models performed similarly for maximum Kp < 7; WAM-IPE was the exception in that it maintained a consistent level of performance for maximum Kp > 7.
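
As a rough illustration of the four skill scores named above, the following sketch computes them for paired model and observed foF2 series. The RMSE and PE forms are the standard ones; the RR and RA definitions used here (model range over observed range, and model maximum over observed maximum) are assumptions inferred from the metric names, and all function names are hypothetical rather than taken from the study.

```python
import math


def _check(model, obs):
    # Both series must be paired sample-by-sample and non-empty.
    if len(model) != len(obs) or len(obs) == 0:
        raise ValueError("model and obs must be non-empty and of equal length")


def rmse(model, obs):
    """Root-mean-square error between modeled and observed values."""
    _check(model, obs)
    return math.sqrt(sum((m - o) ** 2 for m, o in zip(model, obs)) / len(obs))


def prediction_efficiency(model, obs):
    """PE = 1 - SSE / (sum of squared deviations of obs from its mean).

    PE = 1 is a perfect prediction; PE = 0 is no better than the observed mean.
    """
    _check(model, obs)
    obs_mean = sum(obs) / len(obs)
    sse = sum((m - o) ** 2 for m, o in zip(model, obs))
    spread = sum((o - obs_mean) ** 2 for o in obs)
    return 1.0 - sse / spread


def ratio_of_range(model, obs):
    """Assumed RR: (max - min) of the model over (max - min) of the observations."""
    _check(model, obs)
    return (max(model) - min(model)) / (max(obs) - min(obs))


def ratio_of_max_amplitude(model, obs):
    """Assumed RA: model maximum over observed maximum."""
    _check(model, obs)
    return max(model) / max(obs)


# Made-up foF2 values in MHz, for illustration only (not data from the study).
obs_fof2 = [4.0, 6.0, 8.0, 6.0]
model_fof2 = [5.0, 6.0, 7.0, 6.0]

scores = {
    "RMSE": rmse(model_fof2, obs_fof2),
    "PE": prediction_efficiency(model_fof2, obs_fof2),
    "RR": ratio_of_range(model_fof2, obs_fof2),
    "RA": ratio_of_max_amplitude(model_fof2, obs_fof2),
}
print(scores)
```

Under these assumed definitions, RR and RA values below 1 would indicate that a model underestimates the observed foF2 range or peak, and values above 1 would indicate overestimation.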