Machine-learning (ML) parameterizations of subgrid processes (here, turbulence, convection, and radiation) may one day replace conventional parameterizations by emulating high-resolution physics without the cost of explicit simulation. However, their development has been stymied by uncertainty over whether improved offline performance translates to improved online performance (i.e., when coupled to a large-scale general circulation model, GCM). A key barrier has been limited sampling of the online effects of ML design decisions and tuning, owing to the complexity of running large ensembles of hybrid physics-ML climate simulations. Our work examines the coupled behavior of full-physics ML parameterizations using large ensembles of hybrid simulations, 2,970 in total. With this extensive sampling, we statistically confirm that lowering offline error lowers online error (subject to certain constraints). However, we also show that decisions that decrease online error, such as removing dropout, can trade off against hybrid-model stability, and vice versa. Nevertheless, we identify design decisions that unambiguously improve both offline and online performance, namely incorporating memory and training on multiple climates. We also find that converting the moisture input from specific to relative humidity enhances online stability, and that using a mean absolute error (MAE) loss breaks the aforementioned offline/online error relationship. By enabling rapid online experimentation at scale, we empirically answer previously unresolved questions regarding subgrid ML parameterization design.
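To make the humidity input transformation concrete, the following is a minimal sketch of converting a specific-humidity feature to relative humidity before it enters the network. The function names, the Magnus approximation for saturation vapor pressure (Alduchov & Eskridge 1996), and the example values are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

EPS = 0.622  # ratio of molar masses of water vapor and dry air

def saturation_vapor_pressure(temp_k):
    """Saturation vapor pressure (Pa) via the Magnus approximation;
    an assumed formula choice, not necessarily the paper's."""
    temp_c = temp_k - 273.15
    return 610.94 * np.exp(17.625 * temp_c / (temp_c + 243.04))

def specific_to_relative_humidity(q, temp_k, pressure_pa):
    """Convert specific humidity q (kg/kg) to relative humidity (0-1),
    given temperature (K) and pressure (Pa)."""
    e_s = saturation_vapor_pressure(temp_k)
    # Saturation specific humidity over liquid water
    q_sat = EPS * e_s / (pressure_pa - (1.0 - EPS) * e_s)
    return q / q_sat

# Example: a hypothetical mid-troposphere grid cell
rh = specific_to_relative_humidity(q=0.003, temp_k=263.0, pressure_pa=50000.0)
print(f"relative humidity ~ {rh:.2f}")  # ~0.85
```

A transformation like this makes the moisture input approximately climate-invariant (relative humidity distributions shift less across warming scenarios than specific humidity does), which is one plausible reason such a conversion would aid online stability.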