Abstract
Artificial Neural Networks (ANNs) are among the leading computational frameworks for approximating human visual processing across a range of visual tasks. Despite this, a notable gap remains between human and ANN performance on tasks that require social perception and cognition, such as intent prediction and social interaction recognition. Among ANN architectures, vision transformers have recently emerged as particularly promising approximations of human visual processing: they exhibit a stronger shape bias, produce errors more consistent with those of human observers, and develop more globally uniform internal representations. This paper examines how well transformers align with human perception when identifying social interactions in videos in which two agents help, hinder, or interact neutrally with each other in a 2D simulated environment. Using a dataset of videos paired with human gaze recordings captured during the classification of helping and hindering social interactions, we assess human-model alignment through both classification accuracy and visual attention metrics across different models (i.e., TimeSformer, ViViT, V-JEPA 2), different pre-training datasets (e.g., Something-Something V2, Kinetics-400), and different visualization methods (i.e., Grad-CAM variants and integrated gradients). Our findings suggest that TimeSformer and ViViT models have strong potential as cognitive models of social interaction, with both static and dynamic attention patterns that align meaningfully with those of human participants. Meanwhile, despite their higher classification accuracy, V-JEPA 2 models achieve saliency alignment at or below that of convolutional neural network baselines. Notably, classification performance is not a reliable predictor of human-model attention alignment.
Regarding pre-training datasets, larger and more diverse datasets tend to improve saliency alignment for classification-supervised models, though the visualization method that yields the best alignment varies by architecture.