Bidirectional Hybrid State Space Transformer Architecture for Colorectal Cancer Histopathology Whole Slide Images

Object Vision foundation models (VFMs) in pathology have relied predominantly on Vision Transformer (ViT) architectures, leaving potentially superior alternatives underexplored. State space models (SSMs) offer linear-time sequence modeling, and Hydra in particular adds bidirectional processing through quasiseparable matrices. We investigate whether hybrid SSM-ViT architectures can surpass pure ViT models in pathology VFM pretraining.
Methods Hydra_Hybrid is a 24-layer model (12 Hydra layers + 12 ViT layers). Its Hydra layers use quasiseparable matrices, which provide true bidirectional context at O(N) complexity in the sequence length N. Unlike causal SSMs such as Mamba, this suits tissue interactions, which have no inherent scan direction. Hydra layers employ EinFFT for stable Fourier-based channel mixing and for handling periodic patterns. Both Hydra_Hybrid and the pure-ViT baseline ViT_24 were pretrained with DINOv2 on 756,000 colorectal cancer patches. Evaluation covered microsatellite instability (MSI) status prediction with CLAM, polyp classification (hyperplastic polyp, HP, vs sessile serrated adenoma, SSA), and six-class adenoma grading under multiple protocols.
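The bidirectional, linear-time mixing described above can be illustrated with a toy example. The sketch below is not the Hydra implementation: it assumes a scalar hidden state and a single channel, and all names (`causal_scan`, `quasiseparable_mix`, `as_matrix`) and parameter values are hypothetical. It combines a forward scan, a backward scan, and a diagonal term, then materializes the resulting matrix only to show that every position receives input from both earlier and later positions, whereas a single causal scan yields a lower-triangular matrix.

```python
import numpy as np

def causal_scan(x, a, b, c):
    """Scalar-state SSM: h_t = a_t*h_{t-1} + b_t*x_t, y_t = c_t*h_t. One pass, O(N)."""
    h = 0.0
    y = np.zeros_like(x, dtype=float)
    for t in range(len(x)):
        h = a[t] * h + b[t] * x[t]
        y[t] = c[t] * h
    return y

def shift(y):
    """Delay by one step so position t sees only strictly earlier inputs."""
    out = np.zeros_like(y)
    out[1:] = y[:-1]
    return out

def quasiseparable_mix(x, fwd, bwd, d):
    """Forward scan + backward scan + diagonal term, each costing O(N)."""
    lower = shift(causal_scan(x, *fwd))              # contributions from j < i
    upper = shift(causal_scan(x[::-1], *bwd))[::-1]  # contributions from j > i
    return lower + upper + d * x                     # d weights j == i

def as_matrix(fn, n):
    """Materialize the linear map fn as a dense n x n matrix (inspection only)."""
    return np.stack([fn(e) for e in np.eye(n)], axis=1)

# Illustrative random parameters (assumed values, not learned weights).
rng = np.random.default_rng(0)
n = 6
fwd = tuple(rng.uniform(0.5, 0.9, n) for _ in range(3))  # (a, b, c) forward
bwd = tuple(rng.uniform(0.5, 0.9, n) for _ in range(3))  # (a, b, c) backward
d = rng.uniform(0.5, 0.9, n)

M_causal = as_matrix(lambda v: causal_scan(v, *fwd), n)
M_bidir = as_matrix(lambda v: quasiseparable_mix(v, fwd, bwd, d), n)

print("causal mixer is lower-triangular:", np.allclose(np.triu(M_causal, 1), 0))
print("bidirectional mixer has nonzero upper part:", bool(np.any(np.triu(M_bidir, 1))))
```

In this toy setting the dense matrix is never needed to compute the output. The two scans touch each token once, which is where the O(N) cost comes from.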
Results Hydra_Hybrid consistently outperformed ViT_24. For MSI classification, it achieved higher AUC (0.763±0.103 vs 0.746±0.134) and accuracy (0.718±0.102 vs 0.696±0.086). The HP vs SSA task showed larger gains: under linear probing, AUC improved by a relative 6.3% (0.855 vs 0.804) and balanced accuracy (BAcc) by a relative 7.5% (0.758 vs 0.705), and k-nearest-neighbor (KNN) probing confirmed better feature separation (weighted F1, WF1: 0.699 vs 0.667; BAcc: 0.656 vs 0.624). Few-shot learning also improved (WF1: 0.562±0.047 vs 0.542±0.057), indicating stronger generalization. Hydra's quasiseparable matrices capture bidirectional tissue interactions in O(N) time, whereas ViT self-attention scales as O(N²); this permits longer sequences and broader receptive fields. Combined with ViT layers, Hydra provides hierarchical feature extraction: global bidirectional patterns refined by local attention. EinFFT proved critical, as a Hydra variant that used MLPs for channel mixing failed to converge.
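The percentage gains quoted for linear probing are relative improvements over the ViT_24 score, not absolute differences. The short check below reproduces them from the reported values; the helper name `relative_gain_pct` is hypothetical and introduced only for illustration.

```python
def relative_gain_pct(new, base):
    """Relative improvement of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base

# Linear probing, HP vs SSA (values taken from the Results above).
auc_gain = relative_gain_pct(0.855, 0.804)
bacc_gain = relative_gain_pct(0.758, 0.705)

print(f"AUC: +{auc_gain:.1f}%  BAcc: +{bacc_gain:.1f}%")
# prints "AUC: +6.3%  BAcc: +7.5%"
```

The corresponding absolute differences are 0.051 (AUC) and 0.053 (balanced accuracy).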
Conclusions Hybrid SSM-ViT architectures represent a promising advance for pathology VFMs. Hydra_Hybrid's consistent improvements validate that bidirectional processing through quasiseparable matrices enhances feature learning beyond pure attention mechanisms. This architectural innovation opens new avenues for specialized designs tailored to pathology's spatial characteristics.