• Differentiable models (DMs) capture extreme events whose magnitudes lie outside the training data better than long short-term memory (LSTM) networks do.
• DMs optimized for better prediction of extremes can still offer good spatial generalization and robust predictions for untrained variables.
• We theorize that the DMs' good extrapolation skill comes from physical constraints such as mass conservation and storage-dependent flow.
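To illustrate what mass conservation and storage-dependent flow mean in practice, here is a minimal sketch of a single linear-reservoir "bucket". It is a toy under stated assumptions, not the differentiable model evaluated in this work: the function name `simulate_bucket`, the recession coefficient `k`, and the input series are all hypothetical. Because outflow is tied to storage and every unit of water is accounted for, the response to an input far larger than any earlier one is still governed by the water balance rather than by patterns learned from past data.

```python
def simulate_bucket(precip, k=0.3, s0=0.0):
    """Toy linear reservoir (illustrative only, not the paper's model).

    Each step: add precipitation to storage, release a fraction k of
    the storage as outflow, and carry the remainder forward.
    """
    storage = s0
    flows = []
    for p in precip:
        storage += p          # water entering the bucket
        q = k * storage       # storage-dependent outflow
        storage -= q          # mass conservation: what leaves is removed
        flows.append(q)
    return flows, storage


# Hypothetical input series: the second one contains a much larger event.
moderate = [1.0, 2.0, 5.0, 0.0, 0.0]
extreme = [1.0, 2.0, 500.0, 0.0, 0.0]

flows_mod, s_mod = simulate_bucket(moderate)
flows_ext, s_ext = simulate_bucket(extreme)

# Water balance: total input equals total outflow plus remaining storage.
balance_error = abs(sum(extreme) - (sum(flows_ext) + s_ext))
print("peak flow (moderate):", max(flows_mod))
print("peak flow (extreme): ", max(flows_ext))
print("water-balance error: ", balance_error)
```

In this sketch the peak outflow grows with the size of the event and the water balance closes to within floating-point error, whatever the input magnitude.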