People’s social acceptance of and trust in robots depend directly on their ability to infer and predict the robot’s behaviour. However, there is no clear consensus on how the legibility of a robot’s behaviour and explanations should be assessed. In this work, we build on the construct of Theory of Mind (i.e., the ability to attribute mental states to others) and present a computerised version of the Theory of Mind Picture Sequencing Task. Our tool, called the Human-Robot Interaction Video Sequencing Task (HRIVST), evaluates the legibility of a robot’s behaviour for humans by asking them to arrange short videos into a logical sequence of the robot’s actions. To validate the proposed metrics, we recruited a sample of 86 healthy participants. Results showed that the HRIVST has good psychometric properties and is a valuable tool for assessing the legibility of robot behaviours. We also evaluated the effects of symbolic explanations, the presence of a person during the interaction, and the robot’s humanoid appearance on the accuracy with which participants predicted the robot’s behaviour. Results showed that the interaction condition had no effect on the legibility of the robot’s behaviour, while the combination of a humanoid robot and displayed explanations appears to improve performance on the task.