Improving Spatial Transcriptomics Prediction With Hybrid State Space-Vision Transformer Backbone in Pathology Vision Foundation Models

Motivation Spatial Transcriptomics (ST) links gene expression profiles to spatial locations within a tissue section, typically visualized through a co-registered Hematoxylin & Eosin (H&E) Whole Slide Image (WSI). While ST holds immense potential for precision medicine, its clinical deployment is hindered by high costs and a lack of standardized protocols. Consequently, recent works leverage pathology Vision Foundation Models (VFMs) to train downstream ST prediction models from H&E WSIs. However, current VFMs based on Vision Transformers (ViT) exhibit suboptimal performance, particularly on colorectal cancer (CRC) biomarker prediction.
Method & Architecture To address this limitation, we analyze the frequency biases of different backbones and investigate whether changing the backbone architecture can improve biomarker prediction. We propose MV Hybrid, a novel architecture built on a hybrid State Space-Vision Transformer backbone. Working from the assumption that biomarkers hidden in H&E WSIs are predominantly low frequency, we use State Space Models (SSMs), whose negative real eigenvalues bias the model toward learning diverse low-frequency features. MV Hybrid combines elements such as MambaVision blocks, EinFFT, and attention mechanisms in its sequence-mixing and channel-mixing layers. For a controlled comparison, all models were pretrained on the same dataset of 756k CRC patches with the DINOv2 self-supervised learning method, and were evaluated on the HEST-Benchmark and HEST-Extended datasets using Ridge Regression.
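The link between negative real eigenvalues and a low-frequency bias can be illustrated with a textbook toy case. The sketch below is not the analysis performed in this work: it assumes a single-state, continuous-time linear SSM x'(t) = lam * x(t) + u(t), y(t) = x(t), whose frequency response is H(i*omega) = 1 / (i*omega - lam). The function name `ssm_gain` and the eigenvalues chosen are hypothetical, and the selective, discretized SSMs used in Mamba-style blocks are more involved than this.

```python
import numpy as np

def ssm_gain(omega, lam):
    """Magnitude |H(i*omega)| of the scalar SSM x' = lam*x + u, y = x.

    For a real eigenvalue lam < 0, |1 / (i*omega - lam)| equals
    1 / sqrt(omega**2 + lam**2), which decreases as omega grows.
    """
    return 1.0 / np.sqrt(omega**2 + lam**2)

# Frequency grid and a few hypothetical negative real eigenvalues.
omega = np.linspace(0.0, 50.0, 501)
eigenvalues = [-0.5, -2.0, -8.0]
gains = {lam: ssm_gain(omega, lam) for lam in eigenvalues}

for lam, g in gains.items():
    dc_gain = g[0]                       # gain at omega = 0
    cutoff = abs(lam)                    # gain falls to dc_gain / sqrt(2) here
    high_freq_ratio = g[-1] / dc_gain    # attenuation at the top of the grid
    print(f"lam={lam:5.1f}  cutoff={cutoff:4.1f}  "
          f"gain(50)/gain(0)={high_freq_ratio:.4f}")
```

In this toy setting each state acts as a first-order low-pass filter whose cutoff is set by the magnitude of its eigenvalue, so a bank of states with different eigenvalues covers a range of low-frequency bands.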
Results & Discussion MV Hybrid predicted gene expression more accurately and more robustly than the conventional ViT and other SSM-based baselines (such as Vim and Hydra). In a qualitative example on a CRC H&E WSI, predicting expression of CLCA1 (a tumor suppressor downregulated in malignant polyps), MV Hybrid achieved a Pearson Correlation Coefficient (PCC) of 0.422, compared with 0.264 for the ViT baseline.
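The evaluation protocol named in this abstract (Ridge Regression from model embeddings to gene expression, scored by PCC) can be sketched as follows. This is a minimal illustration under stated assumptions, not the HEST-Benchmark code: the embeddings and expression values are synthetic, the regularization strength `alpha` is arbitrary, and the helper names `fit_ridge` and `pearson_per_gene` are hypothetical.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept.

    X: (n_spots, embed_dim) patch embeddings.
    Y: (n_spots, n_genes) expression targets.
    """
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    # Solve (Xc^T Xc + alpha * I) W = Xc^T Yc for the weights W.
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ Yc)
    b = y_mean - x_mean @ W
    return W, b

def pearson_per_gene(Y_true, Y_pred):
    """Pearson correlation between truth and prediction, one value per gene."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / denom

# Synthetic stand-ins for embeddings and expression (assumption, not real data).
rng = np.random.default_rng(0)
n_train, n_test, dim, n_genes = 400, 100, 32, 5
W_true = rng.normal(size=(dim, n_genes))
X = rng.normal(size=(n_train + n_test, dim))
Y = X @ W_true + rng.normal(scale=2.0, size=(n_train + n_test, n_genes))

# Fit on the training spots, score on held-out spots.
W, b = fit_ridge(X[:n_train], Y[:n_train], alpha=1.0)
Y_pred = X[n_train:] @ W + b
pcc = pearson_per_gene(Y[n_train:], Y_pred)

print("per-gene PCC:", np.round(pcc, 3))
print("mean PCC:", round(float(pcc.mean()), 3))
```

Because the downstream predictor is linear and identical for every backbone, differences in PCC under this kind of protocol reflect the quality of the embeddings rather than the capacity of the prediction head.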
Conclusion Compared to ViT, MV Hybrid improves the quality of VFM embeddings by using a backbone whose frequency bias favors the low-frequency features assumed to carry biomarker information, thereby improving both the accuracy and the robustness of CRC biomarker prediction. Because many clinically relevant tasks, including survival prediction, subtype classification, and multimodal models, rely heavily on VFM embeddings, they stand to benefit from the MV Hybrid framework.