- Research
- Open access
- Published:
Mamba-optimized transformer framework with dynamic deformation fields for real-time medical image registration
Journal of Big Data volume 12, Article number: 231 (2025)
Abstract
This paper presents a novel framework for real-time medical image registration that synergistically combines Mamba optimization, dynamic deformation fields, and transformer-based architectures. The proposed framework addresses key challenges in the field by achieving a balance between high spatial registration accuracy and computational efficiency, while also adapting to temporal anatomical variations commonly encountered in dynamic imaging scenarios. Specifically, Mamba optimization is employed to prune redundant transformer layers and introduce adaptive learning strategies, resulting in reduced inference time and memory usage without sacrificing accuracy. Mamba, a state-space model-based transformer alternative, is used to prune redundant layers and achieve real-time efficiency. Dynamic deformation fields are introduced to provide temporal flexibility, allowing the model to adapt in real time to physiological motions such as respiration and cardiac cycles. Furthermore, the integration of multi-scale CNN encoders and transformer-based global attention enables precise spatial alignment across varying anatomical structures. We evaluate the framework on the MRI-OASIS-3 dataset, demonstrating superior performance compared to existing methods, with a Dice Similarity Coefficient (DSC) of 0.89, Normalized Cross-Correlation (NCC) of 1.00, Structural Similarity Index (SSIM) of 0.95, and Peak Signal-to-Noise Ratio (PSNR) of 35.0. The model achieves an inference time of 30Â ms and 33 FPS, validating its potential for real-time clinical deployment. These results highlight the novelty and practical significance of the proposed approach in the domain of dynamic medical image registration.
Introduction
Medical image registration constitutes one of the essential initial steps in biomedical investigations and a variety of clinical uses, as it facilitates spatial mapping of multimodal or temporally acquired data sets. This alignment is valuable for monitoring disease activity, helping in surgical interventions, plotting radiation therapy, and following the health evolution of patients in clinical investigations. Due to its success in image registration, it becomes possible to fuse knowledge obtained from various imaging techniques, such as CT, MRI, or PET scans, for greater accuracy in diagnosis and therapy planning. However, accuracy in registration remains a major problem due to the corresponding differences in the structure of human anatomy, the technique for capturing images, and the changes in time that occur in most physiological processes. Classical methods, including B-splines and the Demons algorithms, require cost functions and optimization with deformation models to provide spatial alignment. Although these methods have shown high spatial accuracy, their recursive nature is time-consuming and cannot be used in real-time clinical applications. Moreover, these approaches cannot address large deformations and do not consider the temporal dimension of temporal imaging, such as cardiac or respiratory motion in real-time MRI scans. As a result, it is interpreted as justifying the search for more effective and elastic ones [24].
The latest innovation in image registration is the learning-based framework, which has resulted from improved development in deep learning. For instance, CNN-based models like VoxelMorph predict deformation fields in a single forward pass, significantly reducing computational time compared to traditional methods. However, these models are based on static deformation fields, which do not adapt to temporal changes or dynamic scenarios encountered in real-time imaging [26]. Consequently, their utility in clinical environments that require dynamic adaptability remains limited. Transformers, known for their self-attention mechanisms, have emerged as powerful tools for capturing global and local spatial dependencies. By modelling relationships across the entire image, transformers excel at understanding complex spatial patterns [9]. However, their high computational complexity impedes real-time deployment, particularly in resource-constrained clinical settings. To overcome these challenges, in this Fig. 1 study, a new framework is presented that employs Mamba optimization, dynamic deformation field, and self-attention-based transformers. Mamba optimization provides slim layers and pruning techniques that considerably minimize computational burden with minimal compromised accuracy [3]. Long-range dependencies in the image are captured by transformer-based self-attention mechanisms to incorporate more detailed spatial and temporal patterns in the medical context. Similarly, incorporating dynamic deformation fields enables the framework to change dynamically with temporal changes in anatomical structures. The advantages above make the framework suitable for real-time imaging [13]. The proposed framework represents a shift from previous approaches to medical image registration from the perspective of computational savings and flexibility. Mamba optimization guarantees low loss rates to allow real-time operation and includes transformer layers as the best structure for dealing with severe deformations. Supplementary dynamic deformation fields improve the possibility of the model in dealing with cases with temporal anatomical variations, such as cardiac or respiratory motion. Of unique value, this paper presents three main contributions [21]. First, it presents a sophisticated architecture where Mamba optimization, multi-level feature extraction, and transformers form the basis of a reliable and high-performance image registration method. Second, unlike conventional image registration methods that use static deformation fields, dynamic deformation fields are incorporated into registration to cope with changes in anatomy in real-time. Third, it comprehensively evaluates the MRI-OASIS-3 dataset to show that it achieves higher spatial precision, lower computational time, and greater flexibility than current benchmarks [7]. Experimental results prove that the frameworks work well in various imaging conditions, supporting this technology in actual medical settings. Therefore, the present study’s findings underscore features of the proposed framework that can form a basis for enhancing medical image registration. Thus, eliminating the drawbacks of elaborated traditional or current deep learning methods, the proposed framework fills the gap between computing time and live reactivity to open the way for prospective clinical applications [29]. To explicitly distinguish our work from existing approaches, we highlight three key methodological innovations: (1) the use of Mamba optimization to reduce the computational burden of transformer layers through layer pruning and adaptive learning rates, enabling real-time deployment in clinical environments; (2) the introduction of dynamic deformation fields augmented with temporal attention to adaptively model physiological motion, such as cardiac and respiratory changes, which static-field models like VoxelMorph cannot capture; and (3) a unified multi-scale CNN and Mamba-optimized transformer architecture that enables global spatial modeling with fine-grained spatial resolution, optimized for both accuracy and speed. To our knowledge, this is the first real-time image registration framework that simultaneously achieves high spatial precision, temporal adaptability, and lightweight inference capability, making it suitable for time-sensitive clinical workflows. The key contributions of this work are summarized as follows:
-
We propose a novel medical image registration framework that integrates Mamba-optimized transformer blocks for efficient global spatial modeling, enabling linear-time performance suitable for real-time applications.
-
We introduce a dynamic deformation field estimation module that adapts to time-varying anatomical structures such as cardiac and respiratory motion, improving registration accuracy in dynamic medical imaging contexts.
-
Our method employs a multi-scale encoder-decoder architecture augmented with temporal attention, effectively modeling both static structure and temporal dynamics across 3D MRI sequences.
-
We achieve real-time inference performance, with an average runtime of 30Â ms and throughput of 33 FPS on 3D volumetric datasets, validated across brain, thoracic, and abdominal imaging scenarios.
Research problem
The review analysis identified medical image registration as a critical step in aligning multimodal and temporal image datasets for accurate image diagnosis and therapeutic management plans. However, achieving precise and fast image registration in real-time is a highly relevant and challenging problem. The first is B-splines, and the second is called Demons algorithms. Both approaches can be considered iterative optimization techniques. Although these methods yield accurate spatial results, the approaches are time-consuming and do not address real-time clinical computer-aided diagnosis demands. However, they fail to generalize to large deformations and flexibly learn complex and variable image deformations that arise through time, such as breathing or heart beating, typically encountered in real-time MRI. Some of these drawbacks have been eased in recent deep learning developments like CNN-based ones like VoxelMorph, which predicts deformation fields in a single forward pass, thus cutting time significantly. However, these models are awful because the deformation fields are also static, which are not easily configurable for tempered or varied with structural changes. The advantages of the Transformers are evident from their self-attention mechanisms, which make them well-suited to capture global and local spatial dependencies. However, they are unreliable for real-time clinic applications because of their high computational complexity and memory demand. The main issue is the absence of a single paradigm that would solve the need to optimize the number of computations and adjust the computational processes during their implementation. Previous approaches do not efficiently optimize the trade-off between speed, accuracy, and adaptive control in imaging applications where the anatomy constantly changes. This constraint emphasizes the need for a new ceiling algorithm that requires a balance of computational efficiency, temporal flexibility, and stability across different imaging genres. Solving these problems helps move the so-called medical image registration forward to achievable real-life practical uses. An approach that facilitates accurate and synchronous synchronization of multimodal and temporal data sets will greatly improve diagnosis and treatment processes through optimized patient care delivery.
Research objectives
This study aims to propose a computationally efficient yet robust solution for real-time Medical Image Registration. The objectives are as follows to address the limitations of existing methods.
-
This study Use Mamba optimization to decrease computational load while ensuring minimum loss in accuracy. Embed high scalability and real-time performance in different clinical settings.
-
This includes temporal changes in anatomy, such as cardiac or respiratory motion and narrow beam radiation fields. It also compares stability in scenarios where the formations are constantly changing.
-
We use topology-preserving transformations to track global and localized spatial patterns and combine low- and high-level representations of features to increase stability across structural areas.
-
Achieve Superior Registration Accuracy: Employ a composite loss function to achieve better alignment and useful, albeit nonlinear, transformations. They also approve accuracy using other measures such as DSC, MSE, and SSIM.
-
The proposed framework on the MRI-OASIS-3 dataset and other benchmarks to verify generalization and verify its effectiveness for practical use in temporal-sensitive contexts of organizational operations.
These objectives target the main issues in medical image registration, including accuracy and computational cost, as well as flexibility for practical implementation in clinical practice.
Related work
Medical image registration has been extensively studied, with significant advancements spanning traditional optimization techniques, learning-based frameworks, hybrid models, and transformer-based approaches. Each class of methods has unique strengths and limitations, which are discussed in the following. Conventional approaches, such as the B-splines and Demons algorithms, have been widely used to achieve high spatial accuracy [18]. These methods rely on iterative optimization to model deformations but suffer high computational costs and limited scalability. They struggle to adapt to real-time scenarios and large-scale deformations, limiting their utility in dynamic imaging contexts. Learning-based frameworks have emerged as a solution to the computational inefficiencies of traditional methods. VoxelMorph, a CNN-based method, predicts deformation fields in a single forward pass, significantly improving the computational speed. However, VoxelMorph relies on static deformation fields, which are inadequate for temporal changes in dynamic imaging scenarios such as cardiac or respiratory motion [19].
With their self-attention mechanisms, transformers represent a recent advancement in medical imaging tasks. These models capture global and local spatial dependencies, making them effective for image registration. Deformable transformers, in particular, dynamically adapt attention to varying spatial scales, enabling superior handling of large deformations. However, the high computational overhead of transformer-based models remains a barrier to their adoption in real-time clinical settings. Hybrid models have also been explored to address the specific limitations of traditional and learning-based approaches [23]. Hybrid Image Registration Networks (HIRNet) combine CNNs with classical optimization techniques to achieve better trade-offs between speed and accuracy. Another promising approach is the DLIR (Deep Learning Image Registration) framework, which incorporates deep similarity metrics for improved alignment precision [22]. Non-Rigid CNNs and Fourier Domain Registration methods have been proposed for specific use cases like large-scale deformations and efficient frequency-based alignment. These methods often lack the adaptability required for dynamic imaging scenarios or suffer from scalability issues [2]. Despite these advancements, no single framework effectively addresses the dual challenges of computational efficiency and adaptability for real-time applications [16]. This Table 1 provides a real comparison of the motifs investigated about the accuracy, time taken to make inferences and technical drawbacks of the methodologies. The proposed method solves computational complexity problems, model flexibility, and scalability. Combining Mamba optimization, transformer-based architectures, and dynamic deformation fields, this framework presents the latest in real-time MI registration and provides an optimal and versatile platform for MI registration in various imaging contexts [36]. Recent advances in self-supervised learning have demonstrated significant promise in learning robust representations from limited labels across various biomedical domains. For instance, GraphCL-DTA [31] introduces a graph-based contrastive learning framework for drug-target affinity prediction by leveraging molecular semantics. Similarly, a self-supervised strategy was proposed in [30] to address label sparsity in computational drug repositioning tasks. While these methods focus on molecular and pharmacological datasets, their underlying motivation–enhancing learning efficiency and generalization under weak supervision–resonates with the goals of our proposed global motion-aware image registration framework. By referencing these developments, we emphasize that our method aligns with broader research trends in representation learning for high-dimensional, label-scarce biomedical problems. Recent efforts have also explored lightweight models for early disease detection, such as the force map-enhanced segmentation approach by Umirzakova et al. [25], which emphasizes real-time performance for cervical cancer diagnosis. Our work extends this lightweight modeling direction by applying Mamba-based sequence modeling for 3D medical image registration in dynamic settings.
Research methodology
The proposed research develops a novel approach to improving the registration of medical images in real time by combining Mamba optimization with transformer-based architectures and the dynamic deformation field. This methodology is expected to eliminate or reduce the impact of the shortcomings of previous static approaches based on deep learning, especially on computational efficiency, flexibility, and temporal characteristics. In the following subsections, each step is systematically explained and illustrated with courtesy: data collection, data cleaning, model architecture, training process, and performance measurement. Every component is formatted with extremely fine accuracy, and what it can produce is very strong, reliable, and capable of being used in real-time in clinical areas.
Dataset acquisition
The MRI-OASIS-3 dataset was chosen for this work due to its diversity and relevance for medical image registration tasks. It includes high-resolution brain images from patients of various ages, sexes, and MRI examination protocols, ensuring versatility in evaluating the proposed framework. The data set contains high-contrast images that detail brain structures such as the cerebral cortex, cerebellum, and hippocampus, which are crucial to assessing spatial and intensity alignment. Additionally, it includes cases with noise and artifacts, making it realistic for testing and applicable to practical clinical scenarios. Basic details of the dataset, including modality, resolution, and main uses, are provided in Table 2.
Preprocessing steps
The following preprocessing steps were rigorously applied to the raw medical images to ensure uniformity and enhance the quality of the input images. These steps are designed to standardize the input data, remove biases, and preserve relevant anatomical features while reducing computational complexity.
-
Voxel Intensity Normalization: All voxel intensity values were normalized to a consistent range of [0, 1] to eliminate intensity-based biases that can arise due to differences in scanning protocols or hardware. This ensures that variations in scanner sensitivity or patient-specific factors do not affect the model’s performance. The normalization is performed using the following formula as Eq. 1:
$$\begin{aligned} I_{\text {norm}}(x, y, z) = \frac{I(x, y, z) - I_{\min }}{I_{\max } - I_{\min }} \end{aligned}$$(1)where I(x, y, z) is the original intensity at voxel coordinates (x, y, z), \(I_{\min }\) and \(I_{\max }\) are the minimum and maximum intensity values in the scan, respectively, and \(I_{\text {norm}}(x, y, z)\) is the normalized intensity value.
-
Resizing to a Standard Resolution: To standardize the input data and reduce computational complexity, each 3D scan was resized to a uniform resolution of \(128 \times 128 \times 128\) voxels. This resolution was chosen to balance detail preservation and computational efficiency. The resizing was performed using trilinear interpolation, which smooths the intensity values to preserve important anatomical details while reducing the size of the image grid. The resizing equation is as follows as Eq. 2:
$$\begin{aligned} I_{\text {resized}}(x', y', z') = \sum _{i, j, k} w_{ijk} \cdot I(x+i, y+j, z+k) \end{aligned}$$(2)
where \(w_{ijk}\) are the interpolation weights, and \((x', y', z')\) are the new coordinates in the resized image grid. This step ensures the images have consistent dimensions, allowing for easier batch processing in subsequent analyses. Intensity inhomogeneity (bias field) can occur in MRI scans due to scanner imperfections. To reduce this artefact and improve the consistency of voxel intensities across scans, a bias fields correction technique such as N4ITK or N3 is applied. This process smooths out low-frequency intensity variations while preserving high-frequency anatomical features, ensuring a more uniform image for analysis. Skull stripping is performed to focus the analysis on brain tissue and remove non-brain structures (such as the skull, scalp, and other soft tissues). This step typically involves segmentation algorithms that differentiate the brain from surrounding tissues, ensuring that only relevant anatomical regions are included in the model. Common methods for skull stripping include thresholding, active contour models, and deformable models.
Data Augmentation
Various data augmentation techniques were rigorously applied to improve the model’s generalization ability and reduce overfitting. These transformations simulate real-world variations in patient positioning, imaging protocols, and noise. The following augmentations were implemented:
-
Random Rotations Each image was randomly rotated within a range of \(\pm 15^\circ\) to simulate variations in patient orientation during imaging. This was achieved by applying a rotation matrix as Eq. 3:
$$\begin{aligned} R(\theta ) = \begin{bmatrix} \cos {\theta } & -\sin {\theta } \\ \sin {\theta } & \cos {\theta } \end{bmatrix} \end{aligned}$$(3)where \(\theta\) is a randomly selected angle in the range \([-15^\circ , 15^\circ ]\).
-
Scaling Random scaling between 0.9 and 1.1 was applied to simulate variations in patient size or scanner resolution. Scaling was performed using a scaling factor s as Eq. 4:
$$\begin{aligned} I_{\text {scaled}}(x, y, z) = I(sx, sy, sz), \quad s \in [0.9, 1.1] \end{aligned}$$(4) -
Intensity Shifts Random intensity shifts were applied to account for differences in scanner calibration or contrast settings. Intensity shifts were modeled as Eq. 5:
$$\begin{aligned} I_{\text {shifted}}(x, y, z) = I(x, y, z) + \Delta I \end{aligned}$$(5)where \(\Delta I\) is a randomly chosen offset within a predefined range.
-
Gaussian Noise Random Gaussian noise was added to simulate artifacts typically encountered in real-world imaging. The noisy image was computed as Eq. 6:
$$\begin{aligned} I_{\text {noisy}}(x, y, z) = I(x, y, z) + \mathcal {N}(0, \sigma ^2) \end{aligned}$$(6)where \(\mathcal {N}(0, \sigma ^2)\) is Gaussian noise with zero mean and variance \(\sigma ^2\).
-
Elastic Deformations Non-linear elastic deformations were applied to simulate natural anatomical variations and distortions from the scanner. Elastic deformations were defined by Eq. 7:
$$\begin{aligned} \phi _{\text {elastic}}(x, y, z) = \begin{bmatrix} x + \Delta _x(x, y, z) \\ y + \Delta _y(x, y, z) \\ z + \Delta _z(x, y, z) \end{bmatrix} \end{aligned}$$(7)where \(\Delta _x\), \(\Delta _y\), and \(\Delta _z\) are displacement fields generated by convolving a random field with a Gaussian kernel.
-
Random Cropping and Padding Random cropping was performed to select different regions of interest (ROI) while padding ensured a fixed image size. Let \(I_{\text {cropped}}\) denote the cropped image as Eq. 8:
$$\begin{aligned} I_{\text {cropped}}(x', y', z') = I(x + \Delta _x, y + \Delta _y, z + \Delta _z) \end{aligned}$$(8)where \(\Delta _x, \Delta _y, \Delta _z\) represent random offsets for cropping. Padding was added to ensure the output dimensions remained consistent.
After the pre-processing steps, a final quality check was performed to ensure the integrity and consistency of the images. This step included checking for missing or corrupted data, verifying that normalization and resizing were applied correctly, and ensuring that no important anatomical features were lost or distorted during pre-processing.
Proposed model architecture
The proposed framework integrates cutting-edge techniques for accurate, robust, and computationally efficient medical image registration. The architecture employs a hybrid encoder-decoder design enhanced with dynamic deformation fields and transformer-based aggregation to handle static and dynamic scenarios in real-time. The architecture’s design is shown in Fig. 2. This comprehensive approach addresses the limitations of previous methods by leveraging global and local feature aggregation, temporal adaptability, and computational optimizations. To ensure real-time inference, the proposed architecture integrates the Mamba-based global context encoder as a lightweight alternative to standard transformer layers. Unlike full self-attention, Mamba utilizes structured state-space models with linear computational complexity, enabling efficient long-range context modeling without incurring the quadratic overhead typical of transformer blocks. we apply mixed-precision training (FP16) and deploy the model on GPU-accelerated hardware with cuDNN support, which significantly reduces memory consumption and inference latency. These architectural and hardware optimizations together enable our model to achieve high accuracy while maintaining a frame rate of 30 FPS and inference time below 35 ms per volume.
Training configuration
The model was trained using the Adam optimizer with an initial learning rate of \(1 \times 10^{-4}\), \(\beta _1 = 0.9\), and \(\beta _2 = 0.999\). A cosine annealing scheduler was applied to gradually reduce the learning rate over epochs to prevent overfitting. The batch size was set to 4 for 3D volumes due to GPU memory constraints, and training was conducted for 200 epochs. We employed mixed-precision training with automatic loss scaling to enhance training efficiency. Weight decay was set to \(1 \times 10^{-5}\), and early stopping with a patience of 20 epochs was used to avoid overfitting. All experiments were performed on an NVIDIA RTX 3090 GPU using PyTorch 2.0.1 with cuDNN optimization enabled.
Proposed framework architecture integrating Mamba-optimized transformer blocks, temporal attention, and dynamic deformation field estimation. Multi-scale features extracted by the encoder are processed by a lightweight Mamba block for global spatial modeling. Temporal attention modules adapt the model to anatomical dynamics (e.g., cardiac or respiratory motion). The dynamic deformation estimator predicts voxel-wise displacements used to warp the moving image toward alignment with the fixed image
Encoder-decoder framework
The encoder-decoder architecture forms the backbone of the proposed framework, enabling hierarchical feature extraction, spatial information preservation, and precise reconstruction for deformation field prediction.
Encoder: multi-scale feature extraction
The encoder uses convolutional layers to extract hierarchical features, progressively reducing spatial resolution while increasing feature depth. This multi-scale feature extraction ensures that both global context and localized structural details are captured. The encoder utilizes as Eq. 9
Where:
\(F_{\text {encoder}}\) denotes the encoded feature map;
\(I_{\text {input}}\) is the preprocessed input 3D volume;
\(\text {Conv}_{3\times 3}\) represents a 3D convolution operation with a \(3 \times 3 \times 3\) kernel.
Decoder: hierarchical reconstruction
The decoder mirrors the encoder, progressively upsampling the features to reconstruct the spatial information necessary for deformation field prediction. It uses as Eq. 10
Where:
\(F_{\text {decoder}}\) is the reconstructed high-resolution feature map;
\(\text {UpConv}_{3\times 3}\) denotes a 3D transposed convolution (also known as deconvolution) used for upsampling;
\(F_{\text {encoder}}\) is the feature map from the encoder.
Dynamic deformation fields
Dynamic deformation fields form the core of our framework’s adaptability to time-varying anatomical motion. Unlike traditional methods that rely on static spatial mappings, our model incorporates temporal dynamics to align image volumes over time. Dynamic deformation fields are a key component of the proposed framework, enabling real-time adaptability to spatial and temporal variations in anatomical structures. The deformation field \(\phi (x)\) represents voxel displacements required to align the moving image with the fixed image as Eq. 11
Where:
\(I_{\text {registered}}(x)\) is the intensity value at voxel position x in the registered image;
\(I_{\text {moving}}\) is the moving image to be aligned;
\(\phi (x)\) is the predicted deformation vector field that warps the moving image to align with the fixed image at voxel x.
Temporal adaptability via attention
To capture temporal anatomical changes (e.g., due to respiration or cardiac motion), we embed a temporal attention mechanism within the transformer-based aggregation module. Rather than using explicit recurrent structures (e.g., RNNs or LSTMs), we leverage transformer attention heads that attend to multiple time steps of the input feature sequence. These time steps are implicitly encoded via positional encodings and processed through self-attention, enabling the model to selectively emphasize spatial features from short-term, mid-term, or long-term frames. The attention weights dynamically shift based on temporal feature evolution, thus allowing the model to predict deformation fields that evolve continuously across time.
Real-time predictive deformation modeling
During inference, the model receives temporally adjacent frames and computes a dynamic deformation field \(\phi _t(x)\) conditioned on learned spatial-temporal correlations as Eq. 12
This formulation enables the model to adapt deformation predictions in real time without introducing the latency typical of recurrent networks. The dynamic fields are also regularized to maintain smoothness and anatomical plausibility.
Transformer-based aggregation
Transformers are integrated into the architecture to address the limitations of convolutional layers in capturing long-range spatial dependencies. The framework can effectively model relationships across the entire image by leveraging self-attention mechanisms.
Self-attention mechanism
The self-attention mechanism computes interactions between all feature locations, producing attention maps that highlight important regions as Eq. 13:
Here:
-
Q (query), K (key), and V (value) matrices are derived from the input feature embeddings.
-
\(d_k\) is the scaling factor to stabilize the dot product.
This mechanism captures global and local dependencies, enabling precise alignment even in complex anatomical regions.
Multi-head attention
To enhance feature aggregation, multi-head attention is employed. This splits the input features into multiple subspaces, allowing the model to focus on diverse spatial relationships as Eq. 14:
where each attention head independently captures specific spatial patterns. Position encoding is added to the input embeddings to retain spatial information lost during feature extraction, ensuring that the network maintains awareness of the image’s spatial layout.
Mamba optimization
Mamba optimization plays a central role in enabling real-time performance within the proposed framework by reducing the computational complexity of transformer layers without compromising accuracy. Mamba achieves this via three synergistic mechanisms: adaptive pruning, regularization, and dynamic learning rate scheduling.
Illustration of the proposed Mamba Optimization process for transformer-based real-time medical image registration. The framework integrates Adaptive Layer Pruning to remove low-importance neurons and attention heads based on saliency analysis; Dynamic Regularization using entropy-guided dropout and structured sparsity; and Dynamic Learning Rate Scheduling, which adjusts block-wise learning rates based on convergence speed. Compared to conventional model compression techniques such as weight pruning, knowledge distillation, and low-rank factorization, Mamba offers integrated, differentiable, and real-time compatible optimization
As shown in Fig. 3, the Mamba optimization framework integrates multiple compression techniques to enable lightweight inference with minimal loss in accuracy.
Adaptive layer pruning
Mamba utilizes activation-based saliency scoring to identify low-importance neurons, transformer heads, and intermediate representations. Specifically, during training, the framework evaluates gradient magnitudes and feature activations to dynamically prune redundant units. This reduces the number of attention heads, hidden units, and MLP parameters in transformer blocks in a differentiable and end-to-end manner, enabling structured sparsity within the model. This approach differs from traditional unstructured pruning by preserving architectural coherence and improving inference speed.
Dynamic learning rate scheduling
Each transformer block in the Mamba-optimized architecture is trained with an adaptive learning rate that evolves based on the convergence behavior of that block. Blocks with fast convergence are gradually frozen, allowing slower-converging blocks to receive more optimization updates. This targeted learning process prevents overfitting and improves training efficiency.
Regularization and generalization
Mamba further applies structured dropout and weight decay regularization at the sub-block level. Instead of applying fixed-rate dropout globally, Mamba dynamically adjusts dropout rates based on attention entropy and layerwise gradient flow. This enhances generalization and helps prevent overfitting, particularly in high-dimensional medical image registration tasks.
Comparison with existing model compression techniques
In contrast to traditional compression methods such as:
-
Post-training weight pruning, which removes parameters after full training,
-
Knowledge distillation, which transfers information from a larger pre-trained teacher model,
-
Low-rank matrix factorization, which decomposes weight matrices to reduce dimensionality,
Mamba performs compression during training through differentiable mechanisms that adapt to each batch and layer. This integrated approach leads to better compatibility with gradient-based optimization and avoids post hoc retraining cycles. Unlike static pruning, Mamba allows iterative recovery of previously pruned units if their relevance increases in subsequent epochs–enhancing robustness.
Empirical results demonstrate that the Mamba-optimized model achieves a Dice Similarity Coefficient (DSC) of 0.89 with an inference time of only 30Â ms and memory usage of 0.9 GB, outperforming both VoxelMorph and deformable transformer baselines that lack such integrated optimization.
Training and inference workflow
The proposed framework follows a carefully designed training and inference pipeline:
-
Training Fixed and moving images are preprocessed, and the network is optimized using the composite loss function. Dynamic deformation fields are updated iteratively to improve alignment.
-
Inference The trained model predicts deformation fields for unseen images in real-time, achieving high accuracy and computational efficiency.
-
Temporal Adaptation During Inference During inference, the model accepts a set of temporally sequential images and applies temporal attention across extracted features to guide the prediction of time-aware deformation fields. This allows the registration to respond to physiological changes (e.g., breathing or heart motion) across frames, maintaining high spatial accuracy while enabling real-time deployment. This method avoids computational overhead associated with recurrent networks, making it suitable for continuous clinical use.
Evaluation metrics and statistical validation
The performance of the proposed framework is assessed using a comprehensive set of evaluation metrics that ensure robust analysis across spatial, intensity, and computational dimensions:
-
Dice Similarity Coefficient (DSC) The DSC quantifies the spatial overlap between the fixed image (A) and the registered image (B), ensuring accurate alignment of anatomical structures as Eq. 15:
$$\begin{aligned} DSC = \frac{2 |A \cap B|}{|A| + |B|} \end{aligned}$$(15)A higher DSC value indicates better spatial correspondence between the images.
-
Mean Squared Error (MSE) The MSE measures the intensity differences between the fixed and registered images as Eq. 16:
$$\begin{aligned} MSE = \frac{1}{N} \sum _{i=1}^N (I_{\text {fixed}, i} - I_{\text {registered}, i})^2 \end{aligned}$$(16)Lower MSE values reflect better intensity alignment.
-
Normalized Cross-Correlation (NCC) The NCC evaluates the intensity alignment between the fixed and registered images as Eq. 17:
$$\begin{aligned} NCC = \frac{\sum _{i}(I_{\text {fixed}, i} - \bar{I}_{\text {fixed}}) \cdot (I_{\text {registered}, i} - \bar{I}_{\text {registered}})}{\sqrt{\sum _{i}(I_{\text {fixed}, i} - \bar{I}_{\text {fixed}})^2} \cdot \sqrt{\sum _{i}(I_{\text {registered}, i} - \bar{I}_{\text {registered}})^2}} \end{aligned}$$(17)Values closer to 1 indicate better intensity correlation.
-
Structural Similarity Index (SSIM) The SSIM measures the structural similarity between the fixed and registered images, considering luminance, contrast, and structure as Eq. 18:
$$\begin{aligned} SSIM = \frac{(2\mu _{\text {fixed}}\mu _{\text {registered}} + c_1)(2\sigma _{\text {fixed, registered}} + c_2)}{(\mu _{\text {fixed}}^2 + \mu _{\text {registered}}^2 + c_1)(\sigma _{\text {fixed}}^2 + \sigma _{\text {registered}}^2 + c_2)} \end{aligned}$$(18)where \(\mu\) is the mean intensity, \(\sigma\) is the variance, and \(c_1\) and \(c_2\) are constants to stabilize the division.
-
Frames Per Second (FPS) The FPS assesses the real-time applicability of the framework as Eq. 19:
$$\begin{aligned} \text {FPS} = \frac{1}{\text {Inference Time (s)}} \end{aligned}$$(19) -
Inference Time The inference time measures computational efficiency, crucial for evaluating real-time performance.
Loss function
The total loss function is designed to balance voxel alignment and plausible transformations, defined as Eq. 20:
Here, \(\alpha\) and \(\beta\) are weighting factors controlling the similarity and smoothness loss contributions, respectively.
Similarity Loss Encourages voxel alignment between fixed and moving images as Eq. 21:
This loss ensures intensity alignment and robustness to global intensity variations.
Smoothness Loss Ensures anatomically plausible transformations by penalizing abrupt changes in the deformation field as Eq. 22:
This regularization term promotes smooth deformations and prevents unrealistic transformations.
The proposed training and inference workflow, detailed in Algorithm 1, integrates carefully designed steps to ensure robust and accurate medical image registration. It starts with an elaborate preprocessing phase in which fixed and moving images are normalized, resized, and subjected to data augmentation. Such improvements bring about less variability of the data and a continued assessment of how the model will perform with other datasets that mimic more real-world imaging conditions. Following preprocessing, the images are passed through a multi-scale encoder network, which extracts hierarchical feature representations that capture global context and fine-grained spatial details. Skip connections within the architecture help preserve high-resolution information, which is crucial to accurately aligning anatomical structures.
The proposed method outperforms both VoxelMorph and deformable transformer baselines, as detailed with statistical significance confirmed via paired tests.
In this Table 3 presents the average performance of VoxelMorph, deformable transformers, and the proposed method across five cross-validation folds. Our proposed framework achieved the highest scores in all evaluation metrics, with a DSC of \(0.890 \pm 0.008\), MSE of \(0.0061 \pm 0.0003\), and SSIM of \(0.950 \pm 0.005\), indicating superior registration accuracy and structural preservation. To validate the statistical significance of these improvements, we conducted both paired t-tests and Wilcoxon signed-rank tests between the proposed method and each baseline (VoxelMorph and deformable transformer). Across all five folds, the results showed statistically significant differences (\(p < 0.01\)) for DSC, MSE, and SSIM. These findings confirm that our model’s performance gains are not due to random variation and demonstrate the robustness of the framework under multiple evaluation criteria and across different test splits.
To further demonstrate robustness, we computed 95% confidence intervals (CI) across 5-fold cross-validation. As shown in Table 4, the proposed method consistently maintained narrow CI ranges, indicating stable and reproducible performance.
The robustness of the proposed framework is clearly demonstrated in Table 5, which reports the mean performance and 95% confidence intervals across 5-fold cross-validation. The model achieves a high Dice Similarity Coefficient (DSC) of \(0.890 \pm 0.008\), indicating accurate spatial alignment with minimal variance. The low Mean Squared Error (MSE) of \(0.0061 \pm 0.0003\) highlights precise voxel-level matching, while the Structural Similarity Index (SSIM) of \(0.950 \pm 0.005\) and Peak Signal-to-Noise Ratio (PSNR) of 35.0 confirm excellent structural fidelity and visual quality. The narrow confidence intervals across all metrics further emphasize the model’s generalizability and stability across validation folds.
To validate the statistical significance of our performance gains, we conducted both paired t-tests and Wilcoxon signed-rank tests between our proposed method and baseline models. Table 6 presents the resulting p-values, all of which confirm significant differences at the \(p < 0.01\) level.
Region-wise performance evaluation
In Table 7 presents a comprehensive breakdown of the proposed model’s registration performance across four major anatomical brain regions: the cerebral cortex, cerebellum, brainstem, and hippocampus. The performance is reported using five key quantitative metrics: Dice Similarity Coefficient (DSC), Mean Squared Error (MSE), Normalized Cross-Correlation (NCC), Structural Similarity Index Measure (SSIM), and Peak Signal-to-Noise Ratio (PSNR), along with the average inference time per region. Notably, the cerebral cortex exhibits the highest registration accuracy with a DSC of 0.90 and a perfect NCC of 1.00, indicating precise structural alignment and strong pixel-wise correlation. The hippocampus also demonstrates excellent performance (DSC = 0.89, SSIM = 0.95), which is particularly relevant for neurodegenerative disease monitoring. The cerebellum and brainstem achieve slightly lower but still competitive scores, with DSC values of 0.88 and 0.87, respectively, reflecting the model’s generalizability to different tissue types and structural geometries. Moreover, inference time remains consistently low (30–32 ms) across all regions, underscoring the real-time capability of the proposed framework. This regional robustness validates the adaptability of our model to heterogeneous anatomical contexts, which is essential for clinical deployment in brain imaging pipelines.
Experiments and results
This section provides a comprehensive visual and analytical representation of the experimental results. Each figure and table highlight the strengths and advancements achieved by the proposed framework compared to existing methodologies.
The metrics include DSC, MSE, NCC, FPS, and memory usage, normalized to a scale of 0 to 1. The proposed method achieves a normalized score 1.0 in most metrics, reflecting its superior consistency and efficiency. The heatmap employs a gradient from sea green (low values) to navy blue (high values) to enhance interpretability. The proposed method’s performance significantly surpasses existing methods such as B-splines, VoxelMorph, and Deformable Transformers. In this Fig. 4 presents the performance of various methods across critical metrics. The proposed method consistently achieves the highest scores, particularly in DSC, NCC, and FPS, while minimizing MSE and memory usage. This comprehensive improvement demonstrates the method’s applicability for real-time and accurate medical image registration.
This figure visualizes the dynamic adaptation of the short-term, mid-term, and long-term experts over time. Short-term experts dominate in the initial phase, providing rapid initialization, while mid-term experts peak during intermediate stages, and long-term experts ensure stable performance in later stages. This Fig. 5 highlights the adaptive capabilities of the F-METF framework. The temporal shifts between short-term, mid-term, and long-term contributions demonstrate the framework’s ability to manage tasks that require both rapid initial responses and long-term accuracy, making it highly suitable for dynamic medical imaging applications.
The polar representation allows for a clear comparison of methods. The proposed method achieves the highest DSC (0.89) and NCC (1.0) while maintaining the lowest MSE (0.006), emphasizing its balanced performance. This Fig. 6 provides a visual comparison of the multimetric performance of different methods. Unlike other methods that excel in only one metric, the proposed method exhibits balanced improvements across DSC, MSE, and NCC, showcasing its robustness in medical image registration tasks. As shown in Table 8, our method demonstrates the best trade-off between speed and accuracy. The use of Mamba allows us to preserve contextual reasoning with minimal overhead, outperforming transformer-based alternatives in both DSC and latency. Unlike classical transformers, our method achieves real-time performance without compromising accuracy or interpretability.
This visualization captures the trade-offs between accuracy (DSC), real-time performance (FPS), and computational efficiency (inference time). The proposed method achieves a high DSC and FPS while maintaining a low inference time, striking an optimal balance. In Fig. 7 underscores the trade-offs in performance metrics. Although traditional methods such as B-splines exhibit low inference time but poor accuracy, the proposed method achieves superior accuracy and real-time performance, making it ideal for time-sensitive applications.
The shaded areas and smoothed lines emphasize the dynamic adaptability of the framework, illustrating the transitions between short-term, mid-term, and long-term contributions. provides an enhanced visualization of temporal dynamics. Smooth transitions and shaded areas illustrate the robust adaptability of the framework to varying temporal demands, a crucial aspect for applications in dynamic medical imaging. Its performance was compared with existing approaches to validate the proposed method further. Table 7 presents the quantitative metrics achieved by different methods Table 9 .
Demonstrates the consistent outperformance of the proposed method in all metrics. It obtains the highest DSC and NCC, demonstrating a high level of accuracy, with low MSE, inference time, and memory usage, which makes it ideal when applied in real-time and low-memory settings. Fig. 8 combines multiple metrics into a single plot, demonstrating the versatility of the proposed method. Labels are placed inside the plot without overlapping..
In this Fig. 9 provides a pair plot for all metrics, revealing relationships and dependencies between the metrics. For instance, higher DSC values are associated with lower MSE values. The training and validation performance of the model is analyzed over epochs. showcases the accuracy and loss trends during the training and validation phases.
In Fig. 10 the left subfigure illustrates the training and validation accuracy, highlighting the convergence trend. The right subfigure depicts the training and validation loss, showing a clear improvement trend. The shaded areas indicate the gap between training and validation performance. The results indicate that the model achieves stable convergence, with the training accuracy steadily increasing and the loss consistently decreasing across epochs. The gap between training and validation curves remains minimal, showcasing the robustness of the model. Feature importance is analyzed to determine the most influential features contributing to the model’s predictions. Visualizes the importance scores of various features.
Cross-dataset generalization on ADNI
To evaluate the generalizability of our proposed model, we conducted supplementary experiments on the publicly available ADNI T1-weighted MRI dataset. This dataset differs in scanner parameters, demographic diversity, and acquisition protocols compared to OASIS-3, offering a valuable testbed for cross-institutional validation. As shown in Table 10, our model maintains high accuracy across key metrics including DSC, SSIM, and PSNR, with minimal degradation in performance. Notably, the Dice coefficient on ADNI remains above 0.87, comparable to OASIS-3 results. This confirms the framework’s robustness and adaptability to different institutional imaging characteristics.
we plan to extend this framework to multi-modal settings (e.g., CT, PET) and perform broader cross-institutional evaluations to validate clinical scalability. Preliminary results on the ADNI dataset already demonstrate strong generalization beyond OASIS-3.
Discussion
This section provides a detailed comparison with other methods, measures to evaluate the proposed methodology’s performance, and an analysis of the results’ consequences. To benchmark the proposed model’s performance, we used DSC, MSE, NCC, SSIM, PSNR, inference time, FPS, and memory use as evaluation criteria to compare it with related work.
Table 11 includes a detailed comparative analysis of 16 prior studies and the proposed method evaluated across key performance metrics. The proposed method achieves a Dice Similarity Coefficient of 0.89, tying with state-of-the-art approaches like SwinIR while maintaining its real-time performance. The proposed method’s low mean squared error (0.006) highlights its precision in intensity alignment. Near-perfect Normalized Cross-Correlation (NCC: 1.00) and high Structural Similarity Index (SSIM: 0.95) confirm its structural and visual accuracy. The proposed method achieves the highest Peak signal-to-noise ratio (35.0), emphasizing its ability to retain image quality. The proposed method outperforms existing techniques in real-time applications with an inference time of 30 ms and 33 FPS. Added widely cited frameworks like U-Net, GAN-based registration, DeepReg, CycleMorph, LapIRN, and SwinIR for a broader comparative landscape. Expanded to include PSNR and SSIM for image quality evaluation. Retained core metrics like DSC, MSE, NCC, inference time, FPS, and memory usage. This extended table ensures a comprehensive comparison, making it suitable for a high-impact publication. Let me know if further refinements are required. Demonstrates the superiority of the proposed method across all evaluation metrics. The Dice Similarity Coefficient (DSC) of 0.89 is the highest, indicating exceptional spatial alignment. Additionally, the Mean Squared Error (MSE) of 0.006 and Normalized Cross-Correlation (NCC) of 1.00 highlight the model’s precision in intensity matching and correlation. Metrics such as SSIM (0.95) and PSNR (35.0) confirm its ability to preserve structural and visual quality. At the same time, its real-time inference capabilities are validated by an FPS of 33 and an inference time of 30 ms. To further analyze the performance of the proposed model, Table 12 provides detailed results on multiple metrics, including their percentage improvements compared to VoxelMorph.
Highlights the significant improvements achieved by the proposed model compared to VoxelMorph. The increases in structural similarity (SSIM: +5.6%) and peak signal-to-noise ratio (PSNR: +16.7%) emphasize the method’s ability to preserve image quality. Moreover, improvements in real-time inference speed are evident from a 65% increase in FPS. The evaluation metrics reveal the following information. Integrating dynamic deformation fields allows the model to adapt to temporal changes, ensuring accurate registration in Fig. 11. Mamba optimization significantly reduces computational overhead, enabling faster inference times while maintaining high accuracy. Transformer-based architectures capture global and local dependencies, contributing to the model’s superior DSC and NCC values.
Illustrates the integrated Mamba optimization process, which directly contributes to the superior performance of our framework in both speed and accuracy. By leveraging adaptive layer pruning, the model eliminates low-importance neurons and attention heads during training, effectively reducing computational overhead without sacrificing representational power. This structural sparsity is complemented by entropy-guided dropout and dynamic weight decay mechanisms, enhancing generalization and robustness across varied anatomical structures and imaging conditions. Furthermore, Mamba’s dynamic learning rate scheduling allows transformer sub-blocks to converge asynchronously, prioritizing optimization in more challenging regions of the feature space. This optimization strategy results in a highly efficient inference pipeline, achieving a 33 FPS rate with only 0.9 GB memory usage–outperforming traditional models such as VoxelMorph and deformable transformers, which either suffer from static field assumptions or high computational cost. These advantages are reflected in our results, where the proposed method achieves the highest Dice Similarity Coefficient (DSC = 0.89) and perfect intensity alignment (NCC = 1.00) on the MRI-OASIS-3 dataset. Collectively, the mechanisms demonstrate that Mamba optimization is a critical enabler of real-time, transformer-based registration in dynamic clinical scenarios. One limitation of the current study is the absence of dedicated ablation experiments to quantify the individual impact of the three core components: Mamba optimization, dynamic deformation fields, and Transformer-based aggregation. While the combined framework achieves state-of-the-art performance, understanding the isolated contribution of each module would provide clearer justification for the hybrid design. For instance, comparing dynamic deformation fields with static ones could clarify their temporal benefits, and disabling Mamba pruning could reveal its effect on inference efficiency. We acknowledge this gap and propose detailed ablation experiments as part of future work to empirically assess the role of each component in the pipeline.
Limitations
Despite the significant advances achieved by the proposed framework in medical image registration, several limitations must be addressed to ensure broader applicability, robustness, and seamless integration into real-world clinical workflows. Although the framework demonstrates state-of-the-art performance in accuracy, efficiency, and adaptability, its current implementation presents challenges that must be overcome to unlock its full potential. These challenges, which include computational overhead, generalizability across modalities, adaptability to dynamic scenarios, clinical integration, and interpretability, are crucial considerations for extending the impact of the proposed methodology. The following subsections provide a detailed analysis of these limitations and potential directions for future improvements.
Computational overhead
While the integration of Mamba optimization significantly reduces the inference time and memory usage of Transformer layers (e.g., 30 ms per volume, 0.9 GB), the framework remains primarily evaluated on high-performance GPUs. We acknowledge that the model’s full architecture may still present challenges when deployed on edge or portable medical devices with limited computational capabilities. No experiments have yet been conducted on such platforms. In future iterations, further reductions in complexity could be pursued through sparse attention mechanisms, quantization-aware training, and knowledge distillation to support lightweight deployment in real-time clinical settings, especially in rural or point-of-care environments.
Modality generalization
The current implementation of our framework is evaluated exclusively on high-resolution brain MRI scans from the OASIS-3 dataset. While the results demonstrate strong performance in terms of accuracy and efficiency within the MRI modality, we acknowledge that this limits the generalizability of the method to other imaging types such as CT, PET, or ultrasound, which present distinct challenges including varying contrast profiles, intensity distributions, and noise patterns. We have not yet conducted experiments involving multimodal image registration, and such validation remains outside the scope of this study. Future work will explore the adaptation of our model to heterogeneous modalities by incorporating modality-specific normalization strategies and cross-modal representation learning.
Temporal and dynamic adaptation
The proposed framework incorporates dynamic deformation fields and non-recurrent temporal attention mechanisms to handle time-varying anatomical structures, such as those observed during respiration or standard cardiac motion. These mechanisms enable the model to adapt voxel-level displacement predictions across frames, achieving reliable registration in dynamic imaging scenarios with moderate temporal resolution (typically 1–2 Hz). However, we acknowledge that this approach may be less effective in ultra-high-frequency applications, such as real-time 4D cardiac MRI with sub-second frame rates. In such cases, the current temporal resolution and attention granularity may not be sufficient to fully capture rapid anatomical deformations. Furthermore, our model operates in a frame-wise manner without explicit memory of motion history, which could limit its ability to track abrupt or periodic changes accurately. Future extensions could include integrating high-frequency temporal attention modules (e.g., causal or recurrent attention), multi-scale temporal encoders, or sequence-to-sequence modeling frameworks such as Temporal Convolutional Networks (TCNs) or LSTMs. These enhancements may allow the model to handle faster motion cycles while preserving computational efficiency. We consider this an important next step toward enabling real-time image registration in highly dynamic clinical scenarios.
Clinical integration and validation
The feasibility of applying the proposed framework in clinical settings must be further explored using stratification between patient groups and various medical imaging exams. In addition, integrating the method with other infrastructure solutions, such as PACS or EHRs, presents certain difficulties. General compliance with standards, including HIPAA and GDPR, is also crucial for the practical application of the models.
Interpretability and trustworthiness
Transformer-based mechanisms improve performance but may act as "black boxes," which can limit clinical trust. Currently, our framework does not include explicit interpretability tools such as attention map visualization or feature importance quantification. This limitation may affect clinician confidence in understanding how deformation fields are generated, particularly in critical diagnostic regions. To address this in future work, we plan to incorporate interpretability strategies such as attention heatmaps overlaid on anatomical regions, Grad-CAM for highlighting influential feature activations, and SHAP values to assess voxel-wise contribution during deformation prediction. These additions aim to enhance transparency and build clinical trust in real-time image registration systems.
Future work
While the proposed framework significantly advances the field of medical image registration, several avenues for future exploration could further enhance its applicability, performance, and scalability. These potential directions not only aim to address the limitations identified in this study but also open new possibilities for the practical adoption of the framework across diverse clinical and research settings. Key areas of exploration include the integration of multimodal imaging, optimization for resource-constrained environments, and enhancing temporal and spatial adaptability. Additionally, incorporating advanced explainability techniques and expanding validation across diverse datasets could elevate the clinical relevance and robustness of the framework. These considerations are discussed in detail below.
Integration with multimodal imaging
Although the current study is limited to MRI-based registration, future work will aim to extend the framework to support multimodal settings, including CT, PET, and ultrasound. These modalities present diverse imaging characteristics such as noise levels, resolution, and anatomical contrast, requiring customized normalization and feature extraction strategies. Incorporating cross-modal transformers, contrast-invariant representations, and modality-adaptive encoders may allow the model to generalize effectively across heterogeneous imaging data. Such developments would enhance the clinical utility of the framework, particularly in hybrid workflows like PET-MRI fusion and CT-MRI guided interventions. We consider this a key direction for future exploration.
Improvement of temporal adaptability
While the proposed method incorporates dynamic deformation fields, its adaptability to high-frequency temporal variations remains a potential area for enhancement. Future research could focus on advanced temporal deformation models that capture rapid anatomical changes, such as cardiac motion, respiratory patterns, or functional brain activity, during real-time imaging. Information loss over time could be mitigated by integrating temporal attention mechanisms reminiscent of temporal modeling in sequence-to-sequence networks. This would encourage training to prioritize selective temporal transformations and generate smooth trajectories over time. Moreover, using higher frames per second and implementing recurrent models such as LSTM or TCNs could provide the framework with higher temporal resolution and flexibility. Such enhancements would increase the method’s feasibility in dynamic imaging, including 4-dimensional magnetic resonance imaging or interventional radiology.
Real-time deployment in resource-constrained environments
To enable broader clinical adoption, especially in rural or point-of-care settings, future work will focus on optimizing the framework for edge devices with limited memory and processing power. While Mamba optimization improves efficiency, additional strategies such as sparse Transformers, low-rank attention approximations, quantization, and knowledge distillation will be explored to reduce the model’s size and energy requirements. Deployment on ARM-based or FPGA platforms may also be evaluated in future studies to assess real-world feasibility in portable imaging workflows.
Validation across diverse datasets and clinical scenarios
The MRI-OASIS-3 data set provided a solid benchmark to evaluate the proposed framework; however, its generalizability to other data sets and clinical scenarios needs further validation. Future research should test the framework on diverse datasets that include various imaging modalities, anatomical regions, pathological conditions, and acquisition protocols. Extending validation to data sets such as the Alzheimer’s Disease Neuroimaging Initiative (ADNI), public 4D MRI repositories, or hospital-specific collections could uncover potential biases and limitations. Clinical collaborations with healthcare institutions would allow real-world testing in live settings, intraoperative environments, or telemedicine platforms. Such efforts would ensure the framework’s robustness across different imaging conditions and use cases, paving the way for wide-scale adoption. Addressing these directions will further improve the proposed framework’s robustness, applicability, and clinical impact. These enhancements will expand its capabilities and position it as a transformative medical imaging and registration tool.
Conclusion
The proposed framework represents a significant advancement in medical image registration by addressing critical challenges of accuracy, computational efficiency, and adaptability. This study introduces a novel architecture that integrates Mamba optimization, dynamic deformation fields, and transformer-based global modeling. Key contributions include the development of a computationally efficient and temporally adaptive model that achieves state-of-the-art performance on the MRI-OASIS-3 dataset, with a Dice Similarity Coefficient of 0.89, Mean Squared Error of 0.006, and Normalized Cross-Correlation of 1.00. These results demonstrate the framework’s ability to bridge traditional optimization methods with modern deep learning-based solutions.
The method has strong potential for clinical deployment, offering real-time and accurate image alignment critical for surgical navigation, radiotherapy planning, and longitudinal disease tracking. Its adaptability to dynamic anatomical variations (e.g., cardiac or respiratory motion) further enhances its utility in time-sensitive medical workflows. Future directions include expanding validation to multi-modal imaging datasets (e.g., CT, PET, and ultrasound) to assess cross-modality robustness, as well as conducting cross-institutional evaluations to measure generalizability across different scanner vendors, protocols, and patient demographics. This will be critical to ensuring clinical translation and widespread adoption. Moreover, integrating lightweight transformer designs could reduce computational demands, facilitating deployment in resource-limited settings. Finally, embedding explainable AI (XAI) techniques can enhance model transparency and clinician trust. In conclusion, this study provides a robust and efficient solution to longstanding challenges in medical image registration and lays the groundwork for broader clinical adoption. By addressing the outlined future directions, this framework has the potential to redefine medical image registration standards and improve patient outcomes globally.
Data availibility
The study utilized one primary dataset. The OASIS dataset, which provides comprehensive 3D Brain MRI imaging data, and can be accessed at The dataset was obtained from the MRI-OASIS-3 dataset.
References
Amador K, Winder A, Fiehler J, Wilms M, Forkert ND. Hybrid spatio-temporal transformer network for predicting ischemic stroke lesion outcomes from 4d ct perfusion imaging. 2022:644–654. Springer: Cham
Aruna Kumari A, Bhagat A, Kumar HS. Classification of diabetic retinopathy severity using deep learning techniques on retinal images. Cybern Syst. 2024;1:25.
Balakrishnan G, Zhao A, Sabuncu MR, Guttag J, Dalca AV. Voxelmorph: a learning framework for deformable medical image registration. IEEE Trans Med Imaging. 2019;38(8):1788–800.
Bi L, Buehner U, Xiaohang F, Williamson T, Choong P, Kim J. Hybrid cnn-transformer network for interactive learning of challenging musculoskeletal images. Comput Methods Programs Biomed. 2024;243: 107875.
Bougourzi F, Dornaika F, Distante C, Taleb-Ahmed A. D-trattunet: Toward hybrid cnn-transformer architecture for generic and subtle segmentation in medical images. Comput Biol Med. 2024;176: 108590.
Chen Y, Wang T, Tang H, Zhao L, Zhang X, Tan T, Gao Q, Min D, Tong T. Cotrfuse: a novel framework by fusing cnn and transformer for medical image segmentation. Phys Med Biol. 2023;68(17): 175027.
Christensen GE, Johnson HJ. Consistent image registration. IEEE Trans Med Imaging. 2001;20(7):568–82.
Dalmaz O, Yurt M, Çukur T. Resvit: residual vision transformers for multimodal medical image synthesis. IEEE Trans Med Imaging. 2022;41(10):2598–614.
Yabo F, Lei Y, Wang T, Curran WJ, Liu T, Yang X. Deep learning in medical image registration: a review. Phys Med Biol. 2020;65(20):20TR01.
Gong Z, French AP, Qiu G, Chen X. Convtransseg: A multi-resolution convolution-transformer network for medical image segmentation. arXiv:2210.07072, 2022.
Gu P, Zhang Y, Wang C, Chen DZ. Convformer: Combining cnn and transformer for medical image segmentation. In 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI), 2023:1–5. IEEE.
Guo X, Lin X, Yang X, Li Yu, Cheng K-T, Yan Z. Uctnet: Uncertainty-guided cnn-transformer hybrid networks for medical image segmentation. Pattern Recogn. 2024;152: 110491.
Haskins G, Kruger U, Yan P. Deep learning in medical image registration: a survey. Mach Vis Appl. 2020;31(1):8.
Joshi A, Sharma KK. Dense deep transformer for medical image segmentation: Ddtramis. Multimed Tools Appl. 2024;83(6):18073–89.
Kim JW, Khan AU, Banerjee I. Systematic review of hybrid vision transformer architectures for radiological image analysis. medRxiv, 2024:2024–06.
Kuang H, Wang Y, Liu J, Wang J, Cao Q, Bo H, Qiu W, Wang J. Hybrid cnn-transformer network with circular feature interaction for acute ischemic stroke lesion segmentation on non-contrast ct scans. IEEE Trans Med Imaging. 2024;43(6):2303–16.
Labbihi I, El Meslouhi O, Elassad ZEA, Benaddy M, Kardouchi M, Akhloufi M. Hybrid 3d medical image segmentation using cnn and frequency transformer fusion. Arabian Journal for Science and Engineering, 2024:1–14.
Liang J, Fan Y, Zhang K, Timofte R, Van Gool L, Ranjan R. Movideo: Motion-aware video generation with diffusion model. Cham: Springer; 2025.
Liao R. Three-Dimensional Medical Image Registration with Applications in Proton Therapy. PhD thesis, Washington University in St. Louis, 2024.
Lin X, Li Yu, Cheng K-T, Yan Z. The lighter the better: rethinking transformers in medical image segmentation through adaptive pruning. IEEE Trans Med Imaging. 2023;42(8):2325–37.
Mani VRS, Arivazhagan S. Survey of medical image registration. J Biomed Eng Technol. 2013;1(2):8–25.
Poloju N, Rajaram A. Transformation with yolo tiny network architecture for multimodal fusion in lung disease classification. Cybern Syst. 2024. https://doi.org/10.1080/01969722.2024.2343992.
Shin J, Hong S. and Jungwoo Lee. Nerflex: Flexible neural radiance fields with diffeomorphic deformation. IEEE Access; 2024.
Sotiras A, Davatzikos C, Paragios N. Deformable medical image registration: A survey. IEEE Trans Med Imaging. 2013;32(7):1153–90.
Umirzakova S, Muksimova S, Baltayev J, Cho YI. Force map-enhanced segmentation of a lightweight model for the early detection of cervical cancer. Diagnostics. 2025;15(5):513.
Viergever MA, Antoine Maintz JB, Klein S, Murphy K, Staring M, Pluim JPW. A survey of medical image registration-under review. Med Image Anal. 2016;33:140–4.
Wang J, Aixi Q, Wang Q, Zhao Q, Liu J, Wu Q. Tt-net: Tensorized transformer network for 3d medical image segmentation. Comput Med Imaging Graph. 2023;107: 102234.
Yang F, Wang F, Dong P, Wang B. Hca-former: Hybrid convolution attention transformer for 3d medical image segmentation. Biomed Signal Process Control. 2024;90: 105834.
Yang S, Li Q, Shen D, Gong B, Dou Q, Jin Y. Deform3dgs: Flexible deformation for fast surgical scene reconstruction with gaussian splatting. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 2024b:132–142. Springer.
Yang X, Yang G, Chu J. Self-supervised learning for label sparsity in computational drug repositioning. IEEE/ACM Trans Comput Biol Bioinf. 2023;20(5):3245–56.
Yang X, Yang G, Chu J. Graphcl-dta: a graph contrastive learning with molecular semantics for drug-target binding affinity prediction. IEEE J Biomed Health Inform. 2024;28(8):4544–52.
Zhihong Yu, Lee F, Chen Q. Hct-net: hybrid cnn-transformer model based on a neural architecture search network for medical image segmentation. Appl Intell. 2023;53(17):19990–20006.
Zhao A, Du X, Wang S, Wang W, Yuan S, Ma W, Yan W, Shen W, Zhu X. A cnn-transformer hybrid network for recognizing uterine fibroids. In 2022 IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia), 2022:1–4. IEEE.
Zhao L, Liu S, Li B, Cai W, Liang P, Yu J, Zhao J. A hybrid cnn-transformer for focal liver lesion classification. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024:13001–13005. IEEE.
Zhou L, Zhang Y, Zhang J, Qian X, Gong C, Sun K, Ding Z, Wang X, Li Z, Liu Z, et al. Prototype learning guided hybrid network for breast tumor segmentation in dce-mri. IEEE Transactions on Medical Imaging, 2024.
Zhu S, Lin L, Liu Q, Liu J, Song Y, Qin X. Integrating a deep neural network and transformer architecture for the automatic segmentation and survival prediction in cervical cancer. Quant Imaging Med Surg. 2024;14(8):5408.
Acknowledgements
This research is supported by "Guangdong Provincial Key Laboratory of Intelligent Information Processing", College of Electronics and Information Engineering of Shenzhen University.
Institutional Review Board Statement
Not applicable. All methods were carried out in accordance with relevant guidelines and regulations.
Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
Funding
This research was funded by the Guangdong Provincial Key Laboratory of Intelligent Information Processing with grant No. 2023B1212060076 and Shenzhen Science and Technology Program with grant No. JCYJ20220818100004008 at College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China.
Author information
Authors and Affiliations
Contributions
Conceptualization, Muhammad Kashif Jabbar, Huang Jianjun, Ayesha Jabbar, Tariq Mahmood, Sajid; Data Curation, Muhammad Kashif Jabbar, Huang Jianjun, and Ayesha Jabbar, Sajid; Formal analysis, Muhammad Kashif Jabbar, Huang Jianjun Methodology, Muhammad Kashif Jabbar, Huang Jianjun, Ayesha Jabbar, and Tariq Mahmood; Software, Muhammad Kashif Jabbar, Huang Jianjun, Ayesha Jabbar and Tariq Mahmood, Sajid; Supervision, Huang Jianjun.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Informed consent
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Jabbar, M.K., Jianjun, H., Jabbar, A. et al. Mamba-optimized transformer framework with dynamic deformation fields for real-time medical image registration. J Big Data 12, 231 (2025). https://doi.org/10.1186/s40537-025-01258-8
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s40537-025-01258-8