Title: Enhancing Alzheimer’s diagnosis with vision transformers: A deep learning approach to structural MRI analysis via global self-attention mechanisms
Abstract:
Alzheimer’s Disease (AD) is a progressive neurodegenerative disorder marked by subtle, non-linear structural atrophy that often appears before clinical symptoms. Early diagnosis using structural Magnetic Resonance Imaging (sMRI) is essential, yet current deep learning standards—primarily Convolutional Neural Networks (CNNs)—are limited by local receptive fields that do not capture long-range spatial dependencies across the brain.
This study introduces a Vision Transformer (ViT) framework to overcome these limitations by using multi-head self-attention to model global context. Using a composite dataset derived from ADNI, OASIS-3, and IXI, T1-weighted MRI volumes were preprocessed into normalized 128 × 128 axial slices and tokenized into 16 × 16 patches. A ViT architecture with 12 transformer encoder layers was trained and benchmarked against state-of-the-art convolutional baselines, including ResNet18 and EfficientNet-B0.
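The tokenization step described above can be sketched in a few lines: a 128 × 128 slice partitioned into non-overlapping 16 × 16 patches yields 64 tokens, each a 256-dimensional vector. The helper below is a hypothetical illustration of that reshaping (in practice a ViT applies a learned linear projection to each flattened patch); the function name and use of NumPy are assumptions, not the paper's implementation.

```python
import numpy as np

def patchify(slice_2d: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split a 2-D MRI slice into flattened, non-overlapping patch tokens.

    Illustrative sketch of ViT tokenization; the paper's actual pipeline
    (e.g. a learned patch-embedding projection) may differ.
    """
    h, w = slice_2d.shape
    assert h % patch == 0 and w % patch == 0, "slice must tile evenly"
    # (H/p, p, W/p, p) -> (H/p, W/p, p, p) -> (num_patches, p*p)
    return (slice_2d.reshape(h // patch, patch, w // patch, patch)
                    .transpose(0, 2, 1, 3)
                    .reshape(-1, patch * patch))

slice_2d = np.random.rand(128, 128).astype(np.float32)
tokens = patchify(slice_2d)
print(tokens.shape)  # (64, 256): 64 tokens of dimension 256
```

Each token would then be linearly projected to the model dimension, concatenated with a class token, and passed through the 12 transformer encoder layers.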
In binary classification (Cognitively Normal vs. AD), the ViT model achieved 98.6% accuracy, a 97.9% F1-score, and a ROC-AUC of 0.992. In the more difficult three-way stratification (Cognitively Normal vs. Mild Cognitive Impairment vs. AD), the model remained robust, reaching 95.4% accuracy and outperforming both CNN baselines (87.8% and 89.1%).
Interpretability analysis using Grad-CAM and Attention Rollout showed that the model consistently focused on the hippocampus and temporal cortices, aligning its attention with established neuropathological biomarkers. These findings indicate that attention-based architectures provide a more sensitive, interpretable, and globally aware approach for automated neuroimaging diagnosis than traditional local-feature extraction methods.
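Attention Rollout, one of the interpretability methods named above, traces how attention propagates through the stack by averaging each layer's attention over heads, folding in the residual connection, and multiplying the resulting matrices across layers. The sketch below is a generic NumPy rendering of that procedure (following Abnar & Zuidema, 2020) on synthetic attention maps; the head count of 12 and the 64-patch + class-token layout are assumptions for illustration, not details confirmed by the abstract.

```python
import numpy as np

def attention_rollout(attn_per_layer):
    """Propagate attention through all layers (Attention Rollout).

    attn_per_layer: list of (heads, tokens, tokens) row-stochastic maps.
    Returns a (tokens, tokens) matrix of accumulated attention.
    """
    rollout = None
    for attn in attn_per_layer:
        a = attn.mean(axis=0)                       # average over heads
        a = 0.5 * a + 0.5 * np.eye(a.shape[-1])     # add residual connection
        a = a / a.sum(axis=-1, keepdims=True)       # re-normalize rows
        rollout = a if rollout is None else a @ rollout
    return rollout

rng = np.random.default_rng(0)
# Hypothetical setup: 12 layers, 12 heads, 64 patch tokens + 1 class token
layers = [rng.random((12, 65, 65)) for _ in range(12)]
layers = [a / a.sum(axis=-1, keepdims=True) for a in layers]
rollout = attention_rollout(layers)
cls_to_patches = rollout[0, 1:]  # class-token attention over the 64 patches
```

Reshaping `cls_to_patches` to an 8 × 8 grid and upsampling gives the spatial saliency map that, per the findings above, concentrated on the hippocampus and temporal cortices.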

