Introduction

Minimally invasive surgery refers to the use of modern medical technologies and equipment, such as endoscopes and laparoscopes1, to enter the body through small incisions or natural orifices. High-definition cameras transmit the surgical area to a display monitor, allowing surgeons to clearly view the operative site and perform precise procedures to treat disease. Compared to traditional open surgery, this approach offers less trauma, faster recovery, and fewer complications. A monocular medical endoscope is an imaging device used in minimally invasive surgery; it primarily provides an access channel and illumination and allows surgeons to observe and operate on body cavities, hollow organs, and bodily conduits. It typically consists of a flexible or rigid tube (capable of entering the body), an optical imaging system, an image processing system, a display system, and a light source system. The image processing system digitizes the captured images. With the continuous advancement of artificial intelligence technology, medical monocular endoscopic image processing systems are becoming more intelligent2. Through built-in AI algorithms, these systems can automatically recognize and mark diseased tissues, reducing the time required for diagnosis and the likelihood of errors by doctors. The image processing systems in medical monocular endoscopes are evolving towards higher resolution, greater intelligence, portability, and real-time interactivity3.

Depth estimation in medical endoscopic videos4 is the task of estimating the depth value (the distance relative to the camera) for each pixel in a given monocular RGB endoscopic image. This task has broad applications in the medical field, including disease diagnosis and assessment, surgical navigation and assistance, patient monitoring and rehabilitation, as well as telemedicine and education. It offers significant potential by providing doctors with more accurate, intuitive, and comprehensive diagnostic information, as well as more detailed scene depth data to assist in operations, ultimately enhancing the efficiency and quality of medical services. Traditional monocular depth estimation (MDE)5 approaches rely on depth cues to predict depth. Various deep learning techniques have demonstrated their potential in addressing this traditionally ill-posed problem. However, the challenges associated with medical endoscopic videos, such as uneven illumination, unclear textures, and non-rigid structures, pose significant difficulties for conventional methods. Additionally, the clinical surgical environment demands real-time processing and strict adaptability to dynamic changes, necessitating intelligent surgical assistance methods that can accommodate the deformations and complex geometrical structures within the human body.

To address the aforementioned challenges, this paper aims to efficiently and accurately extract depth information from monocular medical endoscopic videos using artificial intelligence and self-supervised neural network techniques, providing a more reliable data basis for subsequent tasks. The contributions of this paper are as follows:

  1. Design of a Window-Adaptive Asymmetric Dual-Branch Siamese Network for Monocular Medical Endoscope Depth Extraction: The dual-branch design clearly delineates tasks, contributing to improved modularity and maintainability of the network.

  2. Development of an Improved SE Attention Module: A lightweight adaptive attention module suitable for both global and local feature extraction is introduced. The Siamese network architecture leverages self-supervised learning to adaptively learn from windows, facilitating more effective extraction of endoscopic scene images with inconsistent illumination and weak textures, even in the absence of labeled data.

  3. Proposal of a Lightweight Cross-Attention Feature Fusion Module: This module facilitates cross-branch feature interaction through channel fusion techniques, enhancing the overall feature representation capability of the network. This ensures that the network meets the near real-time requirements of clinical endoscopic surgery scenarios.

  4. Model Training and Validation on the Hamlyn, EAD2019, M2caiSeg, and UCL Synthetic Datasets: The model's generalization performance is further validated on the NYU Depth V2 dataset. The superiority of the proposed method is demonstrated through ablation studies and comparative experiments.

Related work

Depth estimation in medical endoscopic images is a task within 3D reconstruction and scene understanding6, aiming to infer the geometric information of 3D organs from 2D images acquired by endoscopes, including the distance of each pixel from the endoscope. Since the rapid development of artificial intelligence technologies, many researchers have applied deep learning-based depth estimation methods to clinical medical endoscopy scenarios. This review summarizes the recent developments from three perspectives: classic depth estimation methods for medical endoscopic images, typical self-attention mechanisms, and typical adaptive window methods.

Overview of classic depth estimation methods for medical endoscopic images

Depth estimation in medical endoscopy encompasses various techniques, each with distinct principles, advantages, and limitations. These methods can be broadly categorized into traditional computer vision techniques, deep learning-based approaches, and imaging-based depth measurement methods.

Traditional computer vision methods

Shape from Shading (SfS)7 estimates 3D shape from illumination variations, making it particularly useful for endoscopic light source control. It is computationally efficient but highly dependent on accurate illumination models, making it susceptible to noise. SfS has been applied in endoscopic surgery and tissue surface analysis since 2003. By capturing images under different lighting directions, Photometric Stereo8 reconstructs surface normals to enhance depth estimation accuracy. While it provides high-precision depth maps, it requires multiple illuminations and complex setups, limiting its practicality. Since 2009, it has been employed in endoscopic surgery and tissue surface analysis.

Deep learning-based methods

Deep Learning Approaches9 such as Convolutional Neural Networks (CNNs) and U-Net architectures have been increasingly used to predict depth directly from endoscopic images. These methods offer automated, data-driven depth estimation suitable for a wide range of applications, including endoscopic navigation and surgical assistance. However, they require large labeled datasets and extensive computational resources for training. Their adoption has been growing, particularly since 2021. Shakhnoza et al. proposed RL-CancerNet, a novel artificial intelligence model that enhances cervical cancer screening by analyzing cytology images with advanced computational techniques. RL-CancerNet integrates EfficientNetV2 for detailed image analysis, Vision Transformers for contextual understanding, and Reinforcement Learning to focus on rare but critical features indicative of early-stage cancer. Yang et al. proposed a lightweight network with a tight coupling of a convolutional neural network (CNN) and a Transformer for depth estimation. Unlike other methods that use the CNN and Transformer to extract features separately and then fuse them at the deepest layer, their method uses CNN and Transformer modules to extract features at different scales in the encoder. This hierarchical structure leverages the advantages of CNNs in texture perception and of Transformers in shape extraction: at each feature-extraction scale, the CNN acquires local features while the Transformer encodes global information. Rau et al. trained a conditional generative adversarial network, pix2pix, to transform monocular endoscopic images into depth maps, which could serve as a building block in a navigation pipeline or be used to measure the size of polyps during colonoscopy.

Yin et al. proposed a lightweight dynamic convolution network (LDCNet) that matches the segmentation performance of state-of-the-art (SOTA) medical image segmentation networks while running at the speed of a lightweight convolutional neural network. Wang et al. proposed a learning-based two-branch U-Net imaging architecture, named DHU-Net, for the accurate and sharp reconstruction of EIT images. Specifically, deformable convolution layers are introduced to improve the representation of shape and spatial information with a small convolutional kernel, while SE attention recalibrates the channel-wise features according to global distributions; on the other hand, an implicit hyper-convolutional network with coordinate attention (CA) models the relationship between the spatial coordinates of the convolutional kernel and the corresponding weights, so that the large convolutional kernel for conductivity recovery has relatively few parameters while retaining better convolutional robustness. In other self-supervised work, a depth estimation network and a camera ego-motion estimation network are first constructed to obtain the depth and pose information of the sequence, respectively; the model is then trained in a self-supervised manner by using the multi-scale structural similarity with L1 norm (MS-SSIM + L1) loss between the target frame and the reconstructed image as part of the training loss. The MS-SSIM + L1 loss preserves high-frequency information and maintains invariance to brightness and color. A multi-frame depth model with multi-scale feature fusion strengthens texture and spatial-temporal features, improving the robustness of depth estimation between frames with large camera ego-motion. A dynamic object detection method with geometric explainability excludes detected dynamic objects during training, which preserves the static-environment assumption and relieves the accuracy degradation of multi-frame depth estimation. Finally, robust knowledge distillation with a consistent teacher network and a reliability guarantee improves multi-frame depth estimation without increasing computational complexity at test time.

These studies collectively demonstrate the ongoing innovation and exploration in medical imaging analysis, particularly in the domains of multi-modal data fusion, lightweight network design, and self-/adversarial training. Some methods leverage advanced network architectures like Reinforcement Learning combined with modern backbones (e.g., EfficientNetV2 and Vision Transformers) to spotlight critical but rare features. Others adopt multi-branch or dynamic convolution networks to strike a balance between computational efficiency and accuracy. Additionally, self-supervised or multi-frame fusion strategies incorporate knowledge distillation and geometric explainability to tackle data scarcity and capture temporal consistency. Overall, these approaches provide more robust and efficient solutions for medical image analysis and depth estimation, while offering valuable insights for addressing complex non-rigid structures and limited annotations in real-world clinical settings.

Imaging-based depth measurement techniques

Optical Coherence Tomography (OCT)10 utilizes interferometric imaging to generate high-resolution tomographic images. While it offers precise depth measurement, its high cost and complex equipment requirements limit accessibility. Since 1999, OCT has been widely used in clinical tissue imaging and pathological analysis. Endoscopic Ultrasound (EUS)11 combines ultrasound transmission with endoscopy, allowing penetration through tissues to obtain depth information in real time. Despite its effectiveness, it suffers from relatively low resolution and sensitivity to noise. Since 2011, EUS has been applied in endoscopic surgery and pathological analysis.

Typical self-attention mechanisms

Self-attention mechanisms play a crucial role in neural networks, particularly in tasks involving sequential data, such as text and image processing. These mechanisms have evolved to address various computational challenges while improving the ability to capture dependencies at different scales.

Among these approaches, global attention effectively models long-range dependencies by considering relationships between all elements within a sequence. This capability makes it highly suitable for tasks such as machine translation and document summarization, where contextual understanding across the entire sequence is essential. However, the computational cost of global attention scales significantly with sequence length, posing efficiency challenges for real-time applications. To mitigate this computational burden, local attention constrains the attention scope to a limited region, significantly reducing processing costs while maintaining efficiency. This approach is particularly advantageous for real-time tasks such as object detection and speech recognition, where rapid processing is prioritized. However, the restricted scope of local attention may result in the loss of critical global dependencies, which can affect performance in scenarios requiring a broader contextual understanding.

Another widely adopted technique, self-attention, computes dependencies between elements based on their similarities, enabling the extraction of intricate relationships within the input. This method provides a high degree of flexibility and adaptability, making it well-suited for applications such as natural language processing (NLP) and image segmentation. However, its effectiveness is contingent on large-scale datasets and considerable computational resources, which can be limiting factors in certain applications. An extension of this approach, multi-head attention, enhances feature extraction by employing multiple attention heads to capture diverse aspects of the data. This mechanism improves sensitivity to different features while optimizing computational efficiency. Despite its advantages, it exhibits limitations in effectively modeling spatial dependencies, which can restrict its performance in vision-related tasks such as image segmentation and object detection.

In domains where spatial and temporal relationships are fundamental, spatial and temporal attention mechanisms have been introduced to enhance the model’s ability to capture variations across space and time. These mechanisms are particularly valuable for video analysis and dynamic object detection, although they often entail increased computational demands.

More advanced forms of attention mechanisms have also emerged to further enhance contextual understanding and multi-modal processing capabilities. Relative attention integrates positional information within sequences, improving contextual coherence, whereas memory-augmented attention incorporates external memory networks to facilitate long-term dependency modeling. Additionally, cross-modality attention enables the fusion of information across different modalities, proving instrumental in tasks such as multi-modal sentiment analysis, visual question answering, and long-text generation.

To address the challenge of computational efficiency, several alternative mechanisms, including sparse, joint, dual, and stacked attention, have been developed. These techniques focus on optimizing computational resources while maintaining robust feature representation. Sparse attention leverages selective processing of key elements to reduce redundancy, while joint attention integrates multiple attention modules to improve feature fusion. Dual attention processes complementary information streams in parallel, and stacked attention employs hierarchical layers to refine representations progressively. These approaches are particularly effective in handling high-dimensional data, such as in video understanding and complex scene analysis.

Through continuous advancements in self-attention mechanisms, researchers have been able to strike a balance between computational efficiency and the ability to model intricate dependencies, paving the way for their application in increasingly complex tasks across a wide range of domains.

Typical adaptive window methods

Adaptive window methods dynamically adjust window parameters in response to local data characteristics, enhancing precision and efficiency in various data processing tasks.

Among these approaches, adaptive convolution and deformable convolution modify kernel sizes or learn spatial offsets to better accommodate shape and geometric variations. While they improve feature extraction accuracy, they also introduce higher computational complexity, making them suitable for image classification and pose estimation. Similarly, adaptive pooling and attention mechanisms adjust pooling window sizes or focus areas within attention modules, ensuring consistent feature mapping and model depth enhancement. However, these methods may lead to feature loss or increased computational demands, particularly in high-detail regions or complex scenarios.

In signal processing, adaptive Fourier transform and correlation analysis refine signal analysis by optimizing window size and shape, improving robustness in feature matching. These techniques, though computationally intensive, are widely applied in video analysis, signal processing, and 3D matching. Additionally, adaptive clustering, filtering, and LSTM-based methods dynamically adjust window sizes for clustering, filtering, and sequential data analysis, increasing adaptability and accuracy while requiring substantial computational resources. As a result, they are commonly used in image segmentation and sequence prediction tasks.

For applications requiring optimized data compression and multiscale feature extraction, adaptive block compression, multiscale spatial analysis, and window learning adjust block sizes or window parameters to balance precision and computational cost. These methods are particularly effective in handling data with varying densities and scales, proving beneficial in video encoding and real-time data analysis. Meanwhile, adaptive window smoothing, weighting, and transmission fine-tune data processing based on variance or signal characteristics, enhancing filtering accuracy and search efficiency. Despite their computational demands, these techniques play a crucial role in optimization and anomaly detection.

Ultimately, the selection of an appropriate adaptive window method depends on the specific application context and computational constraints. For high-precision depth estimation in complex environments, binocular or multi-view depth estimation is often preferred. When operating under resource-limited conditions, monocular depth estimation presents a more efficient alternative. Video-based depth estimation is particularly advantageous for processing continuous frames. In medical endoscopy, challenges such as uneven illumination, unclear textures, and non-rigid structures pose additional complexities. To address these issues, this paper proposes an adaptive window-based monocular endoscopic depth estimation network that leverages multiple attention mechanisms to enhance performance.

Methods

A Siamese Network is a network structure composed of two or more subnetworks that share the same parameters. These subnetworks are typically used to process two or more inputs and output their similarity or other relationships. Although an asymmetric dual-branch network is conceptually similar to a Siamese network, the “asymmetric” nature means that the two branches do not share identical structures and parameters. The asymmetric dual-branch Siamese network proposed in this paper is illustrated in Fig. 1. One branch focuses on processing global information from the image, compressing the image to a lower spatial resolution through a series of convolution operations to obtain a more abstract global representation. This is combined with a global pooling layer to further reduce the number of parameters. Additionally, an improved lightweight Squeeze-and-Excitation (SE) module is added to the final layer of this branch to dynamically adjust the inter-channel weights through self-attention. The other branch focuses on local details, such as textures and edges, using shallower convolutional layers and deformable convolution to adapt to features of different scales, ensuring that sufficient local details are captured. A lightweight cross-attention feature fusion module is introduced between the two branches of the Siamese network to facilitate cross-branch feature interaction. This is achieved through channel fusion techniques, allowing information exchange between the two branches, thereby enhancing the overall feature representation capability of the network. The framework of the asymmetric dual-branch Siamese network is detailed in section “Asymmetric dual-branch siamese network”. The improved lightweight attention SE module is discussed in section “Advanced lightweight SE self-attention module—ASE module”. The cross-branch fusion module (FM) is detailed in section  “Fusion module (purple section)”. The depth estimation-related techniques implemented by this network are described in section “Dense depth estimation”. The loss function for network training is described in section “Loss function”.

Fig. 1 Network architecture.

Data augmentation

Since the designed Siamese network is a self-supervised neural network model, it does not require traditional manually labeled data. By leveraging the intrinsic geometric relationships and consistency between images, effective depth estimation can be achieved. Continuous video frames are extracted from the M2caiSeg dataset, and affine transformations are applied to generate paired images for the EAD2019 dataset. Both the M2caiSeg and UCL synthetic datasets consist of continuous video frames, enabling the generation of multi-view images and depth information. Data augmentation techniques, such as rotation, flipping, scaling, random cropping, and color adjustments, are employed to randomly generate images with variations in angles, lighting conditions, and camera zoom, thereby increasing the diversity and quantity of the data. This enhances the model's robustness and generalization ability. Since endoscopic images typically have high noise levels and uneven lighting conditions, we perform image preprocessing before training the model, including denoising and contrast enhancement. Additionally, synthetic endoscopic datasets are incorporated to enrich the training set and further improve the model's generalization capabilities. On the other hand, since the proposed model is a self-supervised learning network, the self-supervised signals are generated by performing view transformations on the input images; the model is required to predict the images under new viewpoints, which constitutes a form of self-supervision. The global branch learns the overall structure and lighting information of the image, while the local branch focuses on learning the detailed information within the image. The input signals for the loss function during network training (such as images with different viewpoints and lighting variations) are automatically generated, eliminating the need for additional data labeling.
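As a rough illustration of such a pipeline, the sketch below (assuming torchvision-style transforms; the transform ranges are placeholders rather than the values used in this work) applies the same geometric augmentation to a pair of consecutive frames, so that their geometric relationship is preserved, while photometric jitter varies the illumination seen by the photometric loss:

```python
# Illustrative augmentation for self-supervised training pairs (a sketch, not the
# exact pipeline of this paper): shared geometric transforms keep the pair
# geometrically consistent, while photometric jitter varies illumination.
import random
from torchvision import transforms
from torchvision.transforms import functional as TF

photometric = transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.2)

def augment_pair(frame_t, frame_t1):
    angle = random.uniform(-10, 10)      # placeholder ranges, not from the paper
    scale = random.uniform(0.9, 1.1)
    flip = random.random() < 0.5

    def geo(img):
        img = TF.affine(img, angle=angle, translate=(0, 0), scale=scale, shear=0.0)
        return TF.hflip(img) if flip else img

    # same geometry for both frames, independent photometric jitter
    return photometric(geo(frame_t)), photometric(geo(frame_t1))
```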

Asymmetric dual-branch siamese network

Traditional depth estimation networks mostly use single or symmetric branches, making it difficult to account for both local details and the overall structure. We propose an asymmetric dual-branch Siamese network in which one branch focuses on processing global information of the image, such as brightness gradients and overall shapes. The asymmetric design allows the two branches to differ in network depth, convolution type, and attention usage, handling uneven lighting and blurred details more efficiently. The global branch employs deeper convolutional layers to capture the global features of the image, including overall shapes and lighting information. The network receives image sequences from endoscopic videos as input. It belongs to the family of self-supervised depth estimation methods, which do not require ground truth (GT) depth maps but instead predict depth maps by constructing geometric constraints. Through a series of convolution operations, the image is compressed to a lower spatial resolution to obtain a more abstract global representation. A global pooling layer is combined to further reduce the number of parameters. Additionally, an improved lightweight Squeeze-and-Excitation (SE) module is added to the last layer of this branch to dynamically adjust the inter-channel weights through self-attention. The other branch focuses on local details, such as textures and edges, using shallower convolutional layers and deformable convolutions to adapt to features of different scales, ensuring that sufficient local detail is captured.

This branch maintains a higher spatial resolution, allowing for better integration of global and local information, thereby enhancing the robustness of depth estimation. The details of the encoder-decoder are shown in Fig. 2. The upsampled feature maps from both the global and local branches are connected to a fusion module, indicating that the outputs of the two branches enter the fusion module where the features are integrated. The fused features are then passed to the output, ultimately generating the depth estimation map.
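A minimal PyTorch sketch of the two encoder branches described above is given below; the layer widths, depths, and the use of torchvision's DeformConv2d are illustrative assumptions rather than the exact configuration of our network:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class SqueezeExcite(nn.Module):
    """Plain SE channel re-weighting; stands in here for the improved ASE module."""
    def __init__(self, c, r=16):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(c, c // r), nn.ReLU(inplace=True),
                                nn.Linear(c // r, c), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))
        return x * w[:, :, None, None]

class GlobalBranch(nn.Module):
    """Deeper strided convolutions + global pooled context + channel attention."""
    def __init__(self, c=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, c, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, 2 * c, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(2 * c, 4 * c, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.se = SqueezeExcite(4 * c)

    def forward(self, x):
        f = self.body(x)
        return self.se(f + self.pool(f))   # broadcast global context, re-weight channels

class LocalBranch(nn.Module):
    """Shallower layers + deformable convolution for textures and edges."""
    def __init__(self, c=32):
        super().__init__()
        self.conv = nn.Conv2d(3, c, 3, padding=1)
        self.offset = nn.Conv2d(c, 2 * 3 * 3, 3, padding=1)   # x/y offsets for a 3x3 kernel
        self.deform = DeformConv2d(c, c, 3, padding=1)

    def forward(self, x):
        f = torch.relu(self.conv(x))
        return torch.relu(self.deform(f, self.offset(f)))     # local branch keeps full resolution
```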

Fig. 2 Encoders of the Global Branch and Local Branch.

Global branch (blue section)

The convolutional layers serve as the primary module for extracting global image features, typically utilizing deeper layers to capture high-level representations. To further integrate global context, a global pooling layer aggregates spatial information, enhancing feature abstraction. Additionally, the SE module, a lightweight attention mechanism, dynamically reweights features to refine their importance. In the decoding stage, upsampling restores low-resolution feature maps to higher resolutions, facilitating precise reconstruction.

Local branch (green section)

The shallow convolutional layers focus on capturing fine-grained local details, utilizing shallower structures to preserve texture and edge information. To enhance feature representation across different scales, multi-scale convolution is employed, enabling the extraction of local context at varying receptive fields. Additionally, deformable convolution adapts to non-rigid deformations, improving flexibility in feature learning. In the decoding stage, upsampling restores local feature maps to higher resolutions, ensuring the preservation of detailed spatial information.

Fusion module (purple section)

This module is located after the global and local branches and is used to fuse the features from both branches. The fusion method can be a simple weighted sum, concatenation, or achieved through more complex attention mechanisms. The upsampled feature maps from both the global and local branches are connected to the fusion module via arrows. This indicates that the outputs of the two branches will enter the fusion module where the features are integrated.

Connection relationships explained

The global and local branches are linked to the fusion module through upsampled feature maps, ensuring the integration of global and local information. This connection facilitates the combination of high-level contextual features with fine-grained local details. Subsequently, the fusion module transmits the refined features to the output module, where they are processed to generate the final depth estimation map.

A lightweight cross-attention feature fusion module is introduced between the two branches of the Siamese network to facilitate cross-branch feature interaction. This interaction is achieved through channel fusion techniques, allowing information exchange between the two branches and enhancing the overall feature representation capability of the network. The fusion module is placed after the last convolutional layer of the encoder, at the interface between the encoder and decoder. The fusion module combines the feature maps from the global and local branches, and the fused feature map is then fed into the decoder. The fusion module can integrate features through a weighted sum.
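One plausible instantiation of this lightweight cross-attention channel fusion is sketched below (an interpretation of the description above, not the exact module): each branch produces channel descriptors that gate the other branch's channels, and the gated maps are merged with a learned weighted sum.

```python
import torch
import torch.nn as nn

class CrossChannelFusion(nn.Module):
    """Lightweight cross-branch channel attention + weighted-sum fusion (illustrative)."""
    def __init__(self, c_global, c_local, c_out):
        super().__init__()
        self.proj_g = nn.Conv2d(c_global, c_out, 1)
        self.proj_l = nn.Conv2d(c_local, c_out, 1)
        # each branch predicts channel gates for the *other* branch
        self.gate_g = nn.Sequential(nn.Linear(c_out, c_out), nn.Sigmoid())
        self.gate_l = nn.Sequential(nn.Linear(c_out, c_out), nn.Sigmoid())
        self.alpha = nn.Parameter(torch.tensor(0.5))   # learned fusion weight

    def forward(self, f_global, f_local):
        g = self.proj_g(f_global)
        l = self.proj_l(f_local)
        if g.shape[-2:] != l.shape[-2:]:               # local branch keeps higher resolution
            g = nn.functional.interpolate(g, size=l.shape[-2:], mode="bilinear",
                                          align_corners=False)
        gd = g.mean(dim=(2, 3))                        # channel descriptors
        ld = l.mean(dim=(2, 3))
        g = g * self.gate_l(ld)[:, :, None, None]      # local descriptors gate global channels
        l = l * self.gate_g(gd)[:, :, None, None]      # global descriptors gate local channels
        return self.alpha * g + (1 - self.alpha) * l   # weighted-sum fusion
```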

Self-attention mechanism and window adaptation

The adaptive attention mechanism can dynamically adjust the size and scope of the attention window. The attention mechanism selectively focuses on important regions by using adaptive weights while ignoring less important areas. By leveraging a data-driven approach, the model automatically learns during training to adaptively adjust the window size, sparsity, channel selection, and the complexity of the attention mechanism. The combination of local attention and global attention is used to achieve adaptive window adjustment. This section primarily introduces the attention mechanism used to implement adaptive windows in the model, with the basic principles illustrated in Fig. 3.

Fig. 3 Schematic diagram of the attention mechanism used by the model to implement window adaptation.

Figure 4 shows the window-adaptive calculation process combined with attention-based channel weighting and specifies the calculation with formulas. In the general window-adaptive attention process, the input feature map F1 is first divided into several sub-blocks (window partition with adaptive size), sparse selection is then performed, and finally attention (Q, K, V) is computed on the selected key sub-blocks to obtain the enhanced feature map F′. The window size \(W\) is selected based on local brightness/texture information (e.g., variance, gradient):

$$ W = \max (W_{\min } ,\min (W_{\max } ,\alpha \cdot \sigma_{local} )) $$

where \(\sigma_{local}\) represents the local variance or texture intensity and \(\alpha\) is a hyperparameter. If \(\sigma_{local}\) is too small (weak texture), a larger window is used to gather more context; conversely, when the texture is rich, a smaller window can be used to highlight local details. The sparse selection module then filters the input by window mean or variance:

$$ Select(F_{1i} ) = \begin{cases} 1, & \text{if } Var(F_{1i} ) \ge \tau , \\ 0, & \text{otherwise} \end{cases} $$

where \(\tau\) is the threshold; only windows scoring above the threshold are retained, reducing computation in uninformative areas. The attention based on ASE is then calculated:

$$ F_{ase} = \sigma (W_{2} \cdot {\text{Re}} LU(W_{1} \cdot pool(F_{1i} ))) \cdot F_{1i} $$
Fig. 4 Window-based adaptive attention operation diagram.

Key feature channels are highlighted by channel weighting.

In weak-texture areas, the SE module adaptively enlarges the window or merges context and up-weights low-contrast channels. In very bright or dark areas, the model automatically reduces the effect of uninformative blank regions through sparse selection or attention, focusing computing resources on relatively information-rich areas.

The network employs the concept of window adaptation and various attention mechanisms at multiple stages. During the window adaptation phases—ranging from small local patches to broader global regions—the system applies dynamic weighting to address uneven lighting and sparse-texture areas, automatically adjusting its focus scope. Sparse Attention reduces processing overhead for irrelevant regions, thereby improving efficiency. Meanwhile, the ASE module adaptively enhances critical channel weights to strengthen channel-level features. These attention mechanisms not only balance feature learning between brightly lit and dark areas but also emphasize edges or residual textures in low-texture regions. Through window adaptation, the network more effectively identifies tissue contours and local details during depth estimation.
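The window-size rule and sparse selection above translate almost directly into code; the sketch below follows the two formulas, with w_min, w_max, alpha, and tau as unspecified hyperparameters and the local variance estimated from the feature map (an assumption about how sigma_local is measured):

```python
import torch
import torch.nn.functional as F

def adaptive_window_size(feat, w_min=4, w_max=32, alpha=8.0):
    """W = max(W_min, min(W_max, alpha * sigma_local)), with sigma_local taken here
    as the mean per-pixel channel standard deviation of the feature map (assumption)."""
    sigma_local = feat.std(dim=1, keepdim=True).mean().item()
    return int(max(w_min, min(w_max, alpha * sigma_local)))

def sparse_window_mask(feat, window, tau=1e-3):
    """Select(F_1i) = 1 if Var(F_1i) >= tau, else 0, evaluated per window, so that
    attention is only computed on windows that carry enough information."""
    patches = F.unfold(feat, kernel_size=window, stride=window)  # (B, C*W*W, L)
    return patches.var(dim=1) >= tau                             # (B, L) boolean mask
```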

Advanced lightweight SE self-attention module—ASE module

The core idea of the lightweight attention Squeeze-and-Excitation (SE) module12 is to adaptively recalibrate the weights of each channel in the feature map through the operations of “squeeze” and “excitation,” enabling the network to focus more on features that are useful for the task at hand.

The squeeze operation compresses spatial information within each channel into a single global descriptor using global average pooling, effectively capturing global channel features. Following this, the excitation operation learns the importance of each channel through fully connected layers, applying nonlinear scaling via a sigmoid function. The resulting weights are then re-applied to the original feature map, enhancing the most relevant features while suppressing less important ones.

To address the challenges of uneven illumination, low texture, and non-rigid conditions in endoscopic clinical surgery scenarios, we propose an improved SE module (hereafter referred to as ASE). The ASE module uses a local–global self-attention mechanism to guide the fusion of global and local features. To reduce computational burden, attention can be applied to only a subset of channels or layers. This allows the retention of some advantages of the attention mechanism at a lower computational cost, making it suitable for tasks that require flexible feature fusion. Specifically, based on the CMMCAN framework13, the ASE module with globally shared parameters is embedded at the end of each ASFF block in the model. This means that regardless of the input image size, the number of parameters in the ASE module remains fixed. This design reduces redundant parameters in the model and improves computational efficiency. The ASE module is lightweight, with reduced computational complexity achieved by decreasing the size of the fully connected layers (such as lowering the embedding dimension). Additionally, the ASE module can be applied only to specific layers or channels to further reduce overall overhead.

Assume the input feature map is \(X \in {\mathbb{R}}^{H \times W \times C}\), where H and W are the spatial dimensions and C is the number of channels. The computation process of the ASE module is as follows: a window of size k × k is defined on the feature map, where k is typically much smaller than the feature map dimensions and adapts with the attention learned during model training and evaluation. Attention is computed within the window for the features:

$$ {\text{Attention}}_{i} = \frac{{\exp (Q_{i} \cdot K_{i} )}}{{\sum\nolimits_{{j \in {\text{Window}}}} {\exp \left( {Q_{i} \cdot K_{j} } \right)} }} $$
(1)

where Q is the query matrix, K is the key matrix, and i, j are the pixel indices within the window. Let V be the value matrix; a new feature map is then generated as follows:

$$ Y_{i} = \sum\limits_{{j \in {\text{ Window }}}} {{\text{ Attention }}_{ij} } \cdot V_{j} $$
(2)
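A compact sketch of the ASE channel-recalibration idea is shown below; the reduced bottleneck dimension and the option to re-weight only a subset of channels follow the description above, while the exact ratios are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ASEBlock(nn.Module):
    """Lightweight SE-style recalibration: squeeze (global average pooling),
    excitation through a small bottleneck, sigmoid channel gates, optionally
    restricted to a subset of channels to reduce cost (illustrative sketch)."""
    def __init__(self, channels, reduction=16, active_ratio=1.0):
        super().__init__()
        self.active = max(1, int(channels * active_ratio))  # channels that get re-weighted
        hidden = max(4, self.active // reduction)            # reduced embedding dimension
        self.fc = nn.Sequential(
            nn.Linear(self.active, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, self.active), nn.Sigmoid())

    def forward(self, x):
        head, tail = x[:, :self.active], x[:, self.active:]
        s = head.mean(dim=(2, 3))                 # squeeze
        w_ch = self.fc(s)[:, :, None, None]       # excitation -> channel weights
        return torch.cat([head * w_ch, tail], dim=1)
```

Because the module's parameters depend only on the (fixed) number of active channels, a single instance can be reused at several points in the network, consistent with the globally shared-parameter design described above.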

Local–global attention

To comprehensively and efficiently capture features, the model uses Local-Global Attention in the early and late stages to capture detailed feature representations, while Sparse and Efficient Attention are employed in the middle layers to reduce computational load. This combination design allows the model to strike a balance between computational cost and performance, yielding better results than a single attention mechanism. The computational load is close to that of a single attention mechanism, but it provides better outcomes in terms of detail capture and global information integration.

Adaptive window partitioning is used to dynamically adjust the window size based on image complexity or feature sparsity. On the basis of local attention, a small amount of global attention is retained, mixing global and local attention to help capture long-range dependencies. In the middle layers of the encoder, particularly when processing high-resolution feature maps, Sparse Attention is introduced after each convolution operation.

At the global level, features from different local regions are integrated, capturing long-range dependencies through a global self-attention mechanism. This enables the model to comprehend the overall structure of the image, such as brightness gradients and shapes. By capturing variations in lighting patterns, global attention enhances the network’s adaptability to uneven illumination conditions. Following this process, sparse attention further refines global feature integration, preserving structural information while reducing computational complexity.

To emphasize fine-grained details, local attention is first applied to individual regions of the image. The input is divided into multiple patches, and self-attention is computed within each patch, efficiently capturing local features while maintaining computational efficiency. By dynamically adjusting the position and size of the attention window, this mechanism flexibly captures local deformations, particularly in areas with significant structural variations. Even in low-texture regions, local attention enhances feature extraction, ensuring detailed representation.

While local attention is effective in handling deformations within specific regions, integrating it with global features ensures structural consistency. The fusion of local and global features enhances the model’s resilience to non-rigid deformations, which is particularly beneficial for tasks such as classification and depth estimation. This combined approach provides greater flexibility, making it well-suited for complex medical imaging applications.

Dense depth estimation

The training objective of the depth estimation branch shown in Fig. 5 is to predict the depth value of each pixel, making it as close as possible to the true depth value. Both branches share the same architecture, which is divided into two primary stages: an encoder (left side) and a decoder (right side). The encoder extracts features from the input image at multiple scales and progressively increases the number of feature channels. The decoder reconstructs a higher-resolution representation (in this case, a depth map) from the encoded features. Between the encoder and decoder sits a Fusion Module, where key features from multiple branches are integrated before being upsampled and decoded into the final output; it combines the separate or parallel feature streams (global features, local features) into a single, enriched representation. Four attention modules each serve a distinct purpose within the network: LGA (Local–Global Attention) simultaneously captures both local textures and overarching structure, helping the model handle uneven lighting or subtle texture changes. SA (Sparse Attention) focuses on the most discriminative regions to reduce redundant computation, making it well-suited for real-time applications. ASE (Advanced SE Module) improves upon the classic squeeze-and-excitation framework by adaptively reweighting channels, highlighting critical features. Meanwhile, EA (Edge/Early Attention) emphasizes boundaries and early-stage feature cues, providing a more refined foundation for subsequent layers. Together, these modules ensure that the network not only maintains precise local detail but also obtains a robust global understanding, enabling it to tackle the complex, variable nature of endoscopic imagery.

Fig. 5 Single Branch Deep Synthesis Network Architecture. "k" represents the kernel size, "s" stands for the stride, and "c" denotes the number of channels. For simplicity, we do not illustrate the convolutional layers that follow each conv and deconv layer, nor the other branch, which have the same kernel and channel size as the previous layers but with a stride of 1. In particular, modules LGA, SA, and ASE act on this branch, while modules EA, SA, and ASE act on the other branch at the same level.

During the inference stage, a confidence map is used for post-processing the depth map. Low-confidence depth values are removed, and depth completion techniques are employed to fill in low-confidence areas, thereby increasing the density of the depth map. Specifically, the completion of low-confidence areas is achieved by combining RGB guidance maps with depth information, improving the density of the depth map. Confidence information from the frame sequence is used to dynamically adjust each frame: high-confidence areas undergo less redundant computation, while low-confidence areas receive focused computation and completion, reducing computational overhead and enhancing real-time performance. Depth information from the time series is combined using confidence-weighted fusion and smoothing, reducing noise and sudden changes. Techniques such as Kalman filtering or sliding window are used to smooth the depth estimation results in the time series. By performing these operations, the confidence information can be utilized to enhance the real-time performance and density of depth estimation. A multi-scale feature fusion module14 is introduced to combine features at different scales, improving the density and accuracy of depth estimation. An appropriate self-supervised loss function (such as reprojection error or disparity consistency) is used to optimize the network, further enhancing the density and accuracy of depth estimation.
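One concrete reading of the confidence-weighted temporal fusion and low-confidence completion is sketched below (the sliding-window variant; the threshold and the uniform neighbourhood averaging used for completion are simplifying assumptions, whereas the completion described above is RGB-guided):

```python
import torch
import torch.nn.functional as F

def temporal_fuse(depths, confidences):
    """Confidence-weighted fusion of a short sliding window of per-frame depth
    maps (T, H, W) with matching confidence maps (T, H, W)."""
    w = confidences.clamp(min=1e-6)
    return (depths * w).sum(dim=0) / w.sum(dim=0)

def fill_low_confidence(depth, confidence, tau=0.5, iters=50):
    """Fill pixels whose confidence is below tau by iteratively averaging
    already-trusted neighbours; an RGB-guided completion would additionally
    weight neighbours by colour similarity (not shown)."""
    mask = confidence >= tau
    filled = depth.clone()
    kernel = torch.ones(1, 1, 3, 3, dtype=depth.dtype, device=depth.device)
    for _ in range(iters):
        num = F.conv2d((filled * mask)[None, None], kernel, padding=1)
        den = F.conv2d(mask.to(depth.dtype)[None, None], kernel, padding=1)
        candidate = (num / den.clamp(min=1e-6))[0, 0]
        filled = torch.where(mask, filled, candidate)
        mask = mask | (den[0, 0] > 0)   # grow the trusted region outward
    return filled
```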

The disparity calculation derives a disparity map from the estimated depth map, where disparity is inversely proportional to depth. Given a focal length f and a baseline length B, the disparity d is computed as:

$$ d = \frac{f \cdot B}{Z} $$

where Z represents the depth value at a given point.

Following this, pixel reprojection is performed using the camera model, mapping pixels from the source image onto the target image based on the computed disparity. The reprojection process follows the corresponding transformation equations, ensuring accurate spatial alignment between views.
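The transformation is not reproduced above; a standard form of this pixel reprojection under a pinhole camera model, consistent with the symbols defined below (here \(\tilde{p}\) denotes the homogeneous coordinate of p, \(Z(p)\) the estimated depth at p, and the result is normalized by its last component; this notation is ours), is:

$$ p^{\prime } \simeq K\left( {R\,Z(p)\,K^{ - 1} \tilde{p} + t} \right) $$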

where p and p′ are the pixel points in the source and target images, respectively, K is the camera intrinsic matrix, and R and t are the rotation and translation matrices.

Error calculation: the reprojection error E is calculated by measuring the Euclidean distance between the actual pixel position and the projected position. Assuming \(\hat{p}\) is the actual corresponding pixel position, the confidence of the depth information is then evaluated through disparity consistency, \(C\left( x \right) = 1/\left( {1 + \left\| {p^{\prime } - \hat{p}} \right\|} \right)\).

Specifically, by running the model multiple times during inference and introducing randomness, the uncertainty in depth prediction can be estimated. In each forward pass, a portion of the neurons is randomly dropped out, and the variance of the multiple prediction results is then calculated. Assuming T predictions are made, the predicted values are denoted as \(\hat{y}_{1} ,\hat{y}_{2} , \ldots ,\hat{y}_{T}\):

$$ \hat{y} = \frac{1}{T}\sum\limits_{t = 1}^{T} {\hat{y}_{t} } $$
(3)
$$ \sigma^{2} = \frac{1}{T}\sum\limits_{t = 1}^{T} {\left( {\hat{y}_{t} - \hat{y}} \right)^{2} } $$
(4)

The variance of the predicted values can be used as a measure of uncertainty, with confidence being inversely proportional to uncertainty. The Softmax function converts logits into a probability distribution, where the highest probability represents the model’s confidence. The output of the confidence branch can be used to weight the output of the depth estimation branch, thereby generating the final dense depth map. In the depth map, low-confidence areas can be specially treated, such as by applying blurring or marking them for further analysis.
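A minimal sketch of this Monte Carlo dropout procedure (Eqs. (3)–(4)) is given below; the number of passes T and the mapping from variance to confidence are illustrative choices:

```python
import torch

@torch.no_grad()
def mc_dropout_depth(model, image, T=10):
    """Run T stochastic forward passes with dropout active, then return the
    mean depth (Eq. 3), its variance (Eq. 4), and a confidence map that is
    inversely related to the variance (one simple choice: 1 / (1 + sigma^2))."""
    model.train()   # keeps dropout stochastic; in practice only dropout layers
                    # should be switched to train mode (e.g., keep BatchNorm in eval)
    preds = torch.stack([model(image) for _ in range(T)], dim=0)
    mean = preds.mean(dim=0)
    var = preds.var(dim=0, unbiased=False)
    confidence = 1.0 / (1.0 + var)
    model.eval()
    return mean, var, confidence
```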

Overall, our proposed method tackles the challenges of uneven illumination and weak textures through three main strategies: multi-branch and adaptive windows, multi-type attention fusion, and interactive feature fusion. In regions with limited texture or sudden changes in brightness, the local branch employs multi-scale and deformable convolutions to capture fine-grained features. Meanwhile, the global branch maintains an overview of the illumination distribution and uses global pooling to mitigate interference from overly bright or dark areas. This setup leverages Local–Global Attention to coordinate local and global information, Sparse Attention to reduce computational costs in irrelevant regions, and an Advanced SE module to dynamically activate the most relevant channels—thus preserving key features in uneven lighting conditions. By merging features of different scales and attention mechanisms in the Fusion Module, our approach maximally utilizes both local and global information, ensuring stable depth estimation even under low-texture or high-contrast lighting scenarios.

Compared with traditional single-branch, symmetric, or non-adaptive depth estimation networks, the “Window-Adaptive Asymmetric Dual-Branch Siamese Network” integrates branch partitioning, attention mechanisms, and adaptive window design to achieve greater robustness in the presence of uneven illumination and weak textures. Additionally, in processing multi-frame or multi-view data (when temporal or Siamese inputs are used), it provides more flexible adaptability, thereby improving both the accuracy and stability of depth estimation.

Loss function

Given the characteristics of endoscopic images, such as uneven illumination, weak textures, and non-rigid structures, we designed specialized loss functions tailored for endoscopic images. These functions are categorized into two main types: structure-aware loss and depth consistency loss. The structure-aware loss function specifically focuses on medically significant structural features in endoscopic images, such as blood vessels and tissue edges. By weighting these structural features more heavily in the loss calculation, the model’s accuracy in depth estimation for key areas is improved. The depth consistency loss function ensures that the network maintains depth estimation consistency even when processing regions with varying brightness levels. Moreover, when dealing with scenarios of uneven illumination, indistinct textures, non-rigid structures, and limited computational resources, the designed loss functions can adaptively adjust their weights under the influence of the attention mechanism to better suit these complex environments. We also considered the choice of optimizer during model training to further enhance the performance.

Photometric invariance loss

Standard reconstruction loss Lrecon may perform poorly in scenarios with uneven illumination, as directly comparing images with inconsistent brightness can increase errors, thereby affecting the accuracy of depth estimation. To address this, we designed a Photometric Invariance Loss Lphoto. By incorporating photometric invariance, we use brightness normalization or normalized reconstruction loss, along with a brightness deviation compensation term ∆L to correct for brightness discrepancies and reduce the impact of uneven illumination. This can be achieved by normalizing both the input image and the reconstructed image:

$$ L_{{{\text{photo}}}} = \frac{1}{N}\sum\limits_{i = 1}^{N} {\left\| {\frac{{I_{i} }}{{{\text{mean}}(I_{i} )}} - \frac{{\hat{I}_{i} }}{{{\text{mean}}(\hat{I}_{i} )}}} \right\|} + \Delta_{L} $$
(5)

Gradient-based loss

In scenarios with indistinct textures, reconstruction loss and view transformation loss may fail to capture sufficient image details, leading to unstable depth estimation. Therefore, a Gradient-based Loss is introduced to specifically focus on the edges and texture variations in the image. This helps the network to pay more attention to critical edge information in areas where textures are not prominent:

$$ L_{{{\text{grad}}}} = \frac{1}{N}\sum\limits_{i = 1}^{N} {\left\| {\nabla I_{i} - \nabla \hat{I}_{i} } \right\|} $$
(6)

Deformable image matching loss

Object deformation in non-rigid scenarios can lead to mismatches in the view transformation loss, affecting depth estimation. To address this, a deformable image matching loss is introduced, which uses deformable convolutions or flow-field estimation to compensate for the effects of deformation. Assume Td represents the pixel displacement based on the deformation field; a deformation field is introduced to handle non-rigid object deformations:

$$ L_{{{\text{deform}}}} = \frac{1}{N}\sum\limits_{i = 1}^{N} {\left\| {I_{i} - \hat{I}_{i} \left( {T_{d} } \right)} \right\|_{1} } $$
(7)

Loss function construction

We define a total loss function \(L_{total}\) that combines multiple strategies to effectively train our networks for accurate, smooth, and realistic depth.

The components of the total loss are dynamically weighted under the influence of the AdamW optimizer combined with the Lookahead mechanism. The dynamic weight terms α, β, γ are calculated dynamically based on the content of the feature maps, making them suitable for scenarios with complex and variable feature maps. The initial values of the parameters used in model training are listed in Table 1. These weights change dynamically during the model’s training process under the adaptive attention mechanism until convergence. To maintain computational complexity within a manageable range while ensuring the optimization of key loss components, the number of scales for multi-scale and gradient computations is reduced, and the weights of different loss components are adjusted dynamically. This approach keeps the loss function calculation lightweight. In scenarios with limited computational resources, we employ the AdamW optimizer combined with the Lookahead mechanism. This optimizer accelerates training by periodically looking back at better weights during the optimization process, making it well-suited for improving training stability and efficiency in resource-constrained environments.

$$ Loss_{total} = \alpha \cdot Loss_{photo} + \beta \cdot Loss_{grad} + \gamma \cdot Loss_{deform} $$
(8)
Table 1 Initial values of model training parameters.
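A simplified PyTorch sketch of the loss terms in Eqs. (5)–(8) follows; the gradient operator, the warp used for the deformation field, and the fixed weights are placeholders for the dynamically weighted versions described above:

```python
import torch
import torch.nn.functional as F

def photometric_invariance_loss(img, recon, delta_l=0.0):
    """Eq. (5): compare brightness-normalised images plus a compensation term."""
    norm = lambda x: x / x.mean(dim=(1, 2, 3), keepdim=True).clamp(min=1e-6)
    return (norm(img) - norm(recon)).abs().mean() + delta_l

def gradient_loss(img, recon):
    """Eq. (6): L1 difference of horizontal and vertical image gradients."""
    dx = lambda x: x[..., :, 1:] - x[..., :, :-1]
    dy = lambda x: x[..., 1:, :] - x[..., :-1, :]
    return (dx(img) - dx(recon)).abs().mean() + (dy(img) - dy(recon)).abs().mean()

def deformable_matching_loss(img, recon, flow):
    """Eq. (7): L1 error after warping the reconstruction with a deformation
    field `flow` (B, 2, H, W) in normalised coordinates, applied via grid_sample."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w),
                            indexing="ij")
    base = torch.stack((xs, ys), dim=-1).expand(b, h, w, 2).to(img)
    warped = F.grid_sample(recon, base + flow.permute(0, 2, 3, 1), align_corners=True)
    return (img - warped).abs().mean()

def total_loss(img, recon, flow, weights=(1.0, 0.5, 0.5)):
    """Eq. (8): weighted sum; in this work the weights are adapted dynamically."""
    a, b_, g = weights
    return (a * photometric_invariance_loss(img, recon)
            + b_ * gradient_loss(img, recon)
            + g * deformable_matching_loss(img, recon, flow))
```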

Results

Implementation details and datasets

The hardware configuration for the model training and experiments includes an RTX 4090 GPU with 24.0 GB VRAM, a 14-core AMD EPYC 7453 CPU, 64.4 GB of RAM, and a 451.0 GB hard drive. The software environment consists of PyTorch 2.0.1, TensorFlow 2.13.0, and Python 3.10.12. Since medical imaging typically involves a smaller amount of data, it is essential to design effective data augmentation strategies to expand the dataset. The model training and experiments utilized five public datasets, both medical and non-medical: the Hamlyn datasets15, EAD201916, M2caiSeg17, the UCL synthetic dataset18, and NYU Depth V219. The model was trained using the Hamlyn datasets and the UCL synthetic dataset. Continuous video frames were randomly extracted from M2caiSeg, and paired images were generated through affine transformations from EAD2019. The training, validation, and testing sets were split randomly in a 6:2:2 ratio to ensure the stability and reliability of the evaluation. The NYU Depth V2 dataset was used for generalization testing of the model. The input image size was fixed, training ran for 50 epochs, and the batch size was 32. The evaluation metrics used in the experiments are detailed in Table 2.

Table 2 The error and accuracy metrics for depth evaluation.
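For reference, the error and accuracy metrics of Table 2 are commonly computed as follows (a standard formulation evaluated on valid ground-truth pixels; the exact metric set reported in Table 2 may differ slightly):

```python
import torch

def depth_metrics(pred, gt, eps=1e-6):
    """Common monocular depth metrics: AbsRel, RMSE and threshold accuracies
    delta < 1.25^k, computed only where ground truth is valid."""
    mask = gt > eps
    pred, gt = pred[mask], gt[mask]
    abs_rel = ((pred - gt).abs() / gt).mean()
    rmse = torch.sqrt(((pred - gt) ** 2).mean())
    ratio = torch.max(pred / gt, gt / pred)
    return {
        "abs_rel": abs_rel,
        "rmse": rmse,
        "delta1": (ratio < 1.25).float().mean(),
        "delta2": (ratio < 1.25 ** 2).float().mean(),
        "delta3": (ratio < 1.25 ** 3).float().mean(),
    }
```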

Ablation studies

We perform ablation studies on the model in this section. Specifically, we investigate the role of each branch in the proposed dual-branch asymmetric Siamese network, as well as the effect of adding different numbers of convolutional layers (with a 3 × 3 kernel) N (ranging from 1 to 3 layers) after fixing the first three layers of the encoder for scenarios of varying complexity. This analysis tests the impact of different model structures on the results. To quantitatively assess the performance of our proposed depth prediction network and compare it with previous work, we use several error and accuracy metrics listed in Table 3 to evaluate the methods in both the ablation studies and comparative experiments. Additionally, we compare the parameter count, accuracy, recall, and runtime across models with different degrees of ablation. The results of the ablation experiments are presented in Tables 3 and 4.

Table 3 Quantitative analysis of ablation experiment results.
Table 4 Comparison of parameters, accuracy, and runtime of model configurations of different sizes in the ablation experiments.

The first three rows of the table pertain to the study of the characteristics of the model encoder’s first three fixed layers. The first row examines the performance of the Local Branch only, the second row assesses the performance of the Global Branch only, and the third row represents the performance of the most streamlined model with the first three fixed layers of the encoder intact and no additional convolutional layers. The last three rows of the table focus on the performance of models where 1–3 additional convolutional layers are added on top of the encoder’s first three fixed layers. The data indicate that the dual-branch Siamese network model outperforms either branch alone in terms of depth estimation capability. Additionally, as more convolutional layers are added to the encoder, accuracy improves, but the model size also increases, leading to longer training times. However, the trend of performance improvement diminishes as more layers are added. The results demonstrate that our proposed framework achieves optimal performance in depth estimation. Moreover, the model can be effectively adapted to specific application scenarios by adjusting its size, thereby accommodating the complexity of the images being processed, the hardware resource configuration, and the requirements for accuracy and speed.

Comparative studies

Comparison across different datasets

The model was trained and tested on the Hamlyn datasets, EAD2019, M2caiSeg, UCL synthetic dataset, and NYU Depth V2 dataset. Key step results were recorded and observed for qualitative comparison. The first column shows the original sample images from the datasets, the second column represents the extracted local features, and the third and fourth columns display different global features, including frequency features. The fifth column shows the feature maps after the dual-branch network has encoded and fused the extracted features. Colored circles are used to indicate feature points, with the size of the circles representing the confidence or intensity of the feature points. The qualitative results of the model at key steps across different datasets are shown in Fig. 6.

Fig. 6 Visualization of the refined frames, feature extraction, and depth maps.

In Fig. 6, the first column shows sample images from the dataset. The second column displays the features extracted by the encoder, where each circle represents a feature point. The size and color of the circles indicate the scale and orientation of the feature points, with more and denser circles representing a greater number of detected feature points. The position and size of each circle correspond to the location and scale of the feature point. Dense sampling allows for the detection of more subtle features. The third column shows the confidence map generated based on the response values of the feature points. Each point represents the confidence level of the feature point at that location, with brighter areas indicating more prominent feature points (in this context, confidence is represented qualitatively by the density of points, rather than by discernible brightness variations to the naked eye). The distribution of confidence across these images helps to understand the prominence of feature points in different regions and is used in the proposed network to guide the dual-branch Siamese network’s depth extraction branch in extracting more accurate depth information.

Comparison of depth estimation methods

We selected six different types of monocular depth estimation methods for comparative experiments on the HyperKvasir, EAD2019, M2caiSeg + CVC-ClinicDB + Kvasir-SEG, and UCL synthetic datasets. The methods include the traditional depth estimation method DFF, depth estimation methods based on classical and recent deep learning models, and the most recent state-of-the-art self-supervised depth estimation methods IndoorDepth20, MonoIndoor21, and DistDepth22, all compared with the method proposed in this paper. Quantitative evaluation metrics are presented as heatmaps in Fig. 7, including Absolute Relative Error (Abs Rel), which better reflects relative errors across different scales; Root Mean Square Error (RMSE), which is more sensitive to large errors; and Accuracy, which provides an intuitive measure of how accurately the model predicts. These metrics are used to evaluate the results of the model training and comparative experiments.

Fig. 7
figure 7

Comparison of accuracy rates of depth estimation methods on four datasets.

All methods perform well on the UCL dataset because its characteristics are consistent with the design assumptions of the models. Different datasets typically have different distribution characteristics, noise levels, and complexity, leading to differences in training results. At the same time, the proposed method (denoted WADSN) achieves the best metric performance among the compared methods on every dataset.

As seen in the qualitative comparison in Fig. 8, the depth maps extracted by our method exhibit more distinct depth information and clearer target contours, capturing more texture details than the other methods, and therefore demonstrate superior performance.

Fig. 8
figure 8

Qualitative depth map comparison with different methods on Datasets EAD2019, Hamlyn, M2caiSeg, UCL-SYN and NYU Depth V2.

Qualitative observation of fusion with CT

Finally, to qualitatively demonstrate the application prospects of our depth estimation task in clinical scenarios, we selected the best-performing methods from the comparative experiments in Sect. 4.4 (IndoorDepth, DFF, MonoIndoor, and DistDepth) and compared them with our method for registration with clinical CT scans. For the same case, we selected a CT image slice at a fixed depth and registered the depth information extracted by the different methods with the CT scan. As shown in Fig. 9, the depth features extracted by our method align densely and accurately with most of the organ features in the CT scan, indicating that our method can provide better data support for subsequent tasks such as registration and multimodal fusion of medical information.
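The sketch below only illustrates the geometric step of placing depth-derived points onto a fixed-depth CT slice, assuming known pinhole intrinsics K and a pre-computed rigid camera-to-CT transform T_cam_to_ct; the registration procedure actually used for Fig. 9 is not reproduced here.

```python
# Hedged sketch: back-project depth pixels to 3D camera coordinates and keep the
# points that fall near a chosen CT slice depth, assuming intrinsics K (3x3) and
# a rigid 4x4 camera-to-CT transform are already known. Illustrative only.
import numpy as np


def depth_points_on_ct_slice(depth, K, T_cam_to_ct, slice_z, tol=1.0):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    # Back-project pixels to 3D points in the camera frame.
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1).reshape(-1, 4)
    # Map into CT coordinates with the assumed rigid transform.
    pts_ct = (T_cam_to_ct @ pts_cam.T).T[:, :3]
    # Keep only the points lying close to the chosen fixed-depth CT slice.
    return pts_ct[np.abs(pts_ct[:, 2] - slice_z) < tol]
```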

Fig. 9
figure 9

CT with feature and depth key points overlaid.

As the above comparisons show, performance varies across datasets with their distribution characteristics, noise levels, and complexity, yet WADSN consistently achieves the best metric performance among the compared methods on every dataset.

Table 5 lists the improvement rates of WADSN in RMSE, AbsRel, SSIM, and the four accuracy indicators on the four datasets; the results show that WADSN achieves maximum improvements of 4.56%, 17.95%, 0.18%, and 0.36% over the second-best method.

Table 5 Performance improvement rates of the accuracy indicators (%).
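An improvement rate of this kind is typically computed relative to the second-best method, with the sign convention depending on whether the metric is an error (lower is better) or an accuracy/similarity score (higher is better); the exact formulation behind Table 5 is assumed in the short sketch below.

```python
# Hedged sketch of an "improvement rate over the second-best method":
# for error metrics (RMSE, AbsRel) lower is better; for accuracy and SSIM higher is better.
def improvement_rate(ours: float, second_best: float, lower_is_better: bool) -> float:
    if lower_is_better:
        return (second_best - ours) / second_best * 100.0
    return (ours - second_best) / second_best * 100.0
```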

Figure 8 compares the predictions of multiple depth estimation methods in a set of complex scenarios. The WADSN method performs best in depth coherence and local detail characterization, and the inferred scene geometry is smoother and more accurate. The recent AF-SFMLearner and EndoDAC also show relatively strong depth recovery, maintaining good consistency even on non-rigid surfaces and in unevenly lit regions; however, compared with the method in the bottom row, they remain slightly blurred or exhibit a few pseudo-depth regions at local texture details and edge transitions. Overall, the WADSN method is superior in global smoothness and local detail fidelity, indicating a stronger understanding of and adaptability to scene geometry and a higher level of monocular depth prediction performance.

Discussion

The experimental results show that the proposed monocular endoscopic image depth estimation method based on a window-adaptive asymmetric dual-branch Siamese network, tested and validated on the HyperKvasir, EAD2019, M2caiseg + CVC-ClinicDB + Kvasir-SEG, and UCL datasets, improves accuracy by up to 0.50% over the baseline methods, with the largest improvements observed on the M2caiseg + CVC-ClinicDB + Kvasir-SEG and UCL datasets. A possible reason is the relatively lenient accuracy criterion, together with the presence of synthetic images in these datasets, which align better with the model’s expectations and therefore yield the best outcomes. Additionally, RMSE improved by 5.1% and AbsRel by 7.7%, both achieving optimal results.

Conclusion

In this study, we proposed a dual-branch asymmetric Siamese network for monocular depth estimation, designed to address the specific challenges of endoscopic imaging, such as uneven illumination, weak textures, and non-rigid structures. The network integrates both local and global feature extraction branches, enhancing the ability to capture detailed and global contextual information. Extensive experiments conducted on multiple datasets, including Hamlyn datasets, EAD2019, M2caiSeg, UCL Synthetic Dataset, and NYU Depth V2, demonstrated that our method outperforms existing approaches in both qualitative and quantitative evaluations. One of the key innovations in our work is the design of the ASE (Adaptive Squeeze-and-Excitation) module. This module effectively enhances the expression of useful features while suppressing irrelevant ones, making it well-suited for scenarios with relatively low computational overhead, such as embedded systems and mobile devices. However, it is important to note that in particularly large models, the ASE module may still introduce additional computational burden, especially when dealing with high-dimensional feature maps. The proposed global–local attention mechanism also proves highly flexible and useful in processing complex medical images, particularly in laparoscopic scenarios characterized by uneven illumination, weak textures, and non-rigid structures. This flexibility allows the network to effectively resist challenges posed by such environments, ensuring more accurate depth estimation.
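As background, the sketch below shows a standard squeeze-and-excitation block, the structure on which the ASE module builds; the adaptive components of the ASE module itself are not described in enough detail here to reproduce and are therefore omitted.

```python
# Minimal sketch of a standard squeeze-and-excitation (SE) block, the baseline
# that the proposed ASE module extends (the adaptive additions are not reproduced).
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)               # squeeze: global channel context
        self.fc = nn.Sequential(                          # excitation: per-channel weights
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                      # reweight useful channels, suppress the rest
```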

While the dual-branch asymmetric architecture has shown significant promise, the impact of the asymmetry between the branches on the model’s generalization ability remains an area for further investigation. Understanding this relationship could lead to further optimizations and improvements in model performance across various datasets and applications. Additionally, through ablation studies, we explored the contributions of each branch and the impact of varying convolutional layers within the encoder. Our results indicated that the dual-branch architecture consistently provided superior depth estimation accuracy compared to single-branch models. However, the performance gains diminished as additional layers were added, suggesting a balance between complexity and accuracy. Finally, the potential of our method in clinical applications was demonstrated by registering the extracted depth maps with clinical CT scans. Our approach provided dense and accurate depth features that aligned well with organ structures in CT images, highlighting its utility in tasks such as multimodal image registration and fusion.

In summary, our proposed framework not only achieves state-of-the-art performance in depth estimation tasks but also shows great potential for integration into clinical workflows, particularly in resource-constrained environments like embedded and mobile devices. Future work will focus on further optimizing the model for real-time applications, exploring the impact of branch asymmetry on generalization, and expanding its use to other medical imaging modalities.