This manuscript proposes a new benchmark to assess the quality of visual summaries without requiring human annotators. It is based on the Signature Transform, specifically on the RMSE and MAE of the Signature and Log-Signature, and builds on the assumption that uniform random sampling can provide accurate summaries. First, we introduce a preliminary baseline for automatic video summarization, which has at its core a Vision Transformer, an image-text model pre-trained with contrastive learning (CLIP), and an object detection module. Our baseline leverages video text descriptions to determine the most frequent nouns to use as anchors, and then performs an open-vocabulary image search over the video frames. This enables zero-shot, text-conditioned object detection to select the frames for the final video summary. Despite not requiring any fine-tuning, our approach produces accurate summaries on a wide range of video data. Since few datasets are available for this task, a new dataset consisting of videos from YouTube and their corresponding automatic audio transcriptions is provided. We then propose a state-of-the-art technique based on the harmonic components that the Signature Transform is able to capture, which achieves compelling accuracy and outperforms previous methodologies. The analytical measures are extensively evaluated, and we conclude that they correlate very well with the notion of a good summary.
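For intuition, the following is a minimal sketch of a Signature-based score of the kind described above, not the paper's exact formulation: it assumes frames are grayscale numpy arrays treated as multidimensional paths, compares a candidate summary against a uniform random sample of frames via the RMSE of their mean signatures, and uses the iisignature library. The frame representation, signature level, and aggregation are illustrative assumptions.

```python
# Hedged sketch: RMSE between the mean signature of a candidate summary and that of
# a uniform random sample of frames. Frame encoding, signature depth, and the
# aggregation scheme are assumptions for illustration, not the paper's definition.
import numpy as np
import iisignature  # pip install iisignature


def frame_signature(frame: np.ndarray, level: int = 2) -> np.ndarray:
    """Signature of a frame viewed as a path: rows are time steps, columns are channels."""
    path = frame.astype(np.float64)
    return iisignature.sig(path, level)


def mean_signature(frames: list, level: int = 2) -> np.ndarray:
    """Average the frame-wise signatures of a set of frames."""
    return np.mean([frame_signature(f, level) for f in frames], axis=0)


def rmse_signature(summary: list, all_frames: list, level: int = 2, seed: int = 0) -> float:
    """RMSE between the summary's mean signature and that of a uniform random sample."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(all_frames), size=len(summary), replace=False)
    baseline = [all_frames[i] for i in idx]
    s_summary = mean_signature(summary, level)
    s_baseline = mean_signature(baseline, level)
    return float(np.sqrt(np.mean((s_summary - s_baseline) ** 2)))
```

An MAE variant follows by replacing the squared error with an absolute error, and the Log-Signature variant by substituting `iisignature.logsig` (with a prepared basis from `iisignature.prepare`) for `iisignature.sig`.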