Title: Mapping visual action representations in the brain via LLaVA-derived semantic features
Abstract:
Understanding how the human brain encodes complex, higher-order visual information, such as human actions, remains a key challenge in cognitive neuroscience. Recent advances in neuroimaging have made it possible to study the neural basis of such representations from a spatio-temporal perspective. In particular, the brain's ability to represent semantically similar actions across different contexts, such as grasping a cup versus grasping a hammer, reflects its capacity to recruit shared neural populations in both contexts. Recent advances in AI, especially large multimodal models, have further made it possible to learn object representations by modeling image and textual semantics jointly with ground-truth fMRI. In this study, we present a novel framework that leverages the Large Language and Vision Assistant (LLaVA) model to derive semantically rich textual representations of action videos, which are then aligned with blood-oxygen-level-dependent (BOLD) signals acquired through functional magnetic resonance imaging (fMRI). We hypothesize that LLaVA-generated textual embeddings can serve as interpretable, structured proxies for the cognitive processes underlying action recognition, particularly within the visual and higher-order association networks. Using a cohort of participants exposed to diverse video stimuli, we extract voxel-wise BOLD responses and employ encoding models to map LLaVA-derived representations onto neural activation patterns. Our analysis reveals significant correspondence between LLaVA-informed features and activity in known regions of the dorsal and ventral visual streams, as well as the posterior superior temporal sulcus (pSTS) and lateral occipitotemporal cortex. These findings demonstrate the utility of multimodal AI models in decoding the representational structure of visual cognition and offer a new pathway for bridging high-level machine understanding with biological vision systems.
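As a concrete illustration of the encoding-model step described in the abstract, the sketch below fits a voxel-wise ridge regression from stimulus-level text embeddings to BOLD responses and scores it with held-out Pearson correlations. This is a minimal sketch under stated assumptions: the array shapes, the random stand-in data for the LLaVA caption embeddings and BOLD amplitudes, and the regularization grid are all hypothetical placeholders, not details taken from the study.

```python
# Minimal sketch of a voxel-wise encoding model: ridge regression from
# LLaVA-derived text embeddings to BOLD responses. All shapes, data, and
# hyperparameters below are illustrative assumptions, not study values.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

n_stimuli, n_dims, n_voxels = 200, 768, 2000    # hypothetical sizes
X = rng.standard_normal((n_stimuli, n_dims))    # stand-in for LLaVA caption embeddings
Y = rng.standard_normal((n_stimuli, n_voxels))  # stand-in for voxel-wise BOLD amplitudes

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)

# One multi-output ridge model shared across voxels, with the penalty
# selected by internal cross-validation over a log-spaced grid.
model = RidgeCV(alphas=np.logspace(-2, 4, 13))
model.fit(X_tr, Y_tr)
Y_hat = model.predict(X_te)

# Encoding performance per voxel: Pearson r between predicted and
# held-out BOLD responses.
r = np.array([np.corrcoef(Y_te[:, v], Y_hat[:, v])[0, 1] for v in range(n_voxels)])
print(f"median held-out voxel correlation: {np.median(r):.3f}")
```

A single multi-output RidgeCV keeps the example compact; selecting the penalty separately per voxel (e.g., RidgeCV's alpha_per_target option) is a common refinement in fMRI encoding analyses.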