Many recent studies have addressed the detection of negative affective states such as stress and anxiety from physiological signals recorded by body-worn sensors. Typically, machine learning classifiers are applied to features derived from the sensor signals, and several authors have reported high accuracy using a range of signals, including cardiac, skin conductance, and skin temperature. However, the robustness of these models when deployed in the field is rarely addressed. In this paper, we use open data from two large experimental studies to evaluate the generalizability of models derived from cardiac signals, focusing on the detection of stress and anxiety. We choose the cardiac signal because the commonly used heart-rate variability features can be derived from multiple sensor modalities, allowing us to evaluate the robustness of models within, as well as between, experimental settings. We show that consistent classification outside the original experimental setting relies on high-quality training data with minimal artefacts, and that models trained on lower-quality data may often learn proxies within the noise. Our results also underline the importance of including a wide range of emotional states in the training data to minimize erroneous classification in unseen regions of feature space.