Challenging behaviors in children with autism are a serious clinical concern, often manifesting as aggression or self-injurious actions. The Family Observation Schedule, 2nd Edition (FOS-II) is an intensive, fine-grained scale used to observe and analyze the behaviors of individuals with autism, facilitating the diagnosis and monitoring of autism severity. Previous AI-based approaches to automated behavior analysis in autism have typically predicted facial expressions and body movements from visual information alone, without generating a clinically meaningful scale. In this study, we propose a deep learning-based algorithm, named AV-FOS, trained on audiovisual multimodal data clinically coded with the FOS-II. The proposed AV-FOS model leverages a transformer-based architecture and self-supervised learning to recognize Interaction Styles (IS) defined in the FOS-II scale from subjects' video recordings, enabling the automatic generation of FOS-II measures with clinically acceptable accuracy. As a baseline for this study, we explore IS recognition using a multimodal large language model, GPT-4V, with prompt engineering that supplies the FOS-II measure definitions, and we compare it with other vision-based deep learning algorithms. We believe this research represents a significant advancement in autism research and clinical accessibility. The proposed AV-FOS model and our FOS-II dataset will serve as a gateway toward the digital health era for future AI models related to autism.