MAFNet: A novel adaptive multi-scale model for fine-grained grading of diabetic retinopathy - Scientific Reports



Additionally, single receptive fields or fixed-scale network structures exhibit obvious limitations when faced with multi-scale, multi-type lesion features, making it difficult to account for the subtle differences between lesions. Lacking refined modeling of complex or gradually changing lesion morphologies, such methods still struggle to distinguish grade boundaries: lesion features of adjacent severity levels are easily confused, leading to non-negligible classification errors in actual grading.

Although the introduction of deep learning models has brought significant progress in the automatic classification of diabetic retinopathy, existing research still faces several limitations, particularly in model generalization, multi-scale information capture, modeling of spatial relationships among lesions, modeling of subtle lesion differences between adjacent stages, handling of data distribution imbalance, and refined modeling of complex or gradual lesion morphology. These limitations restrict model performance in practical applications and highlight the need for further research and improvement.

In this study, we propose a multi-scale adaptive fine-grained network (MAFNet) to address the shortcomings of existing DR grading methods in cross-scale lesion feature extraction and spatial relationship modeling. Its structure is shown in Fig. 1. Traditional single-receptive-field methods struggle to simultaneously handle the localization and global analysis of multi-scale lesions such as microaneurysms and hemorrhages. To this end, the hierarchical global context module (HGCM) is designed to enhance cross-scale feature expression through multi-scale context aggregation and dynamic feature fusion. For the complex spatial dependencies among lesions, the multi-scale adaptive attention module (MSAM) is proposed to capture spatial location correlations, and the relational multi-head attention module (RMA) is introduced to parse complex relationships in feature channels and higher-order interactions through a multi-head attention mechanism. Based on the continuous nature of DR progression, a regression-classification dual-task learning framework is constructed to ensure grading accuracy while preserving disease progression continuity.

Owing to the excellent performance of ConvNeXt, particularly in handling complex image features, MAFNet adopts this recently introduced high-performance network as its backbone: its optimized convolutional structure enhances the expressiveness of the model while maintaining good computational efficiency and strong generalization across datasets. Building on this, the model incorporates a Shuffle Attention module at specific layers to further strengthen the capture of critical information.

The Hierarchical Global Context Module (HGCM) is an innovative multi-scale feature extraction component proposed in this study, designed to overcome the limitations of traditional convolutional neural networks in capturing fine-grained pathological features with a single receptive field. Its structure is shown in Fig. 2. HGCM employs multi-scale context aggregation, an attention-guiding mechanism, and a dynamic feature fusion strategy to systematically address the insufficient characterization of lesion information at different scales. Compared with existing methods, HGCM not only effectively captures global semantic information through multi-scale aggregation but also preserves subtle local structural features in the spatial dimension, ensuring a balance between large-scale context and precise local responses. By introducing an attention-guided dynamic weighting mechanism, HGCM further enhances the network's ability to adaptively perceive key lesion regions, emphasizing important features while suppressing redundant information. The HGCM module thus provides an efficient and flexible solution for fine-grained lesion recognition and substantially improves the network's performance in lesion detection.

The HGCM module aims to achieve more precise feature representation through efficient channel compression, multi-scale information fusion, and adaptive attention mechanisms. First, a 3 × 3 convolution reduces the number of channels in the input feature map to lower the computational cost while introducing a basic local receptive field. Next, multi-scale adaptive average pooling extracts context information across scales, from local to global. The pooled features are then refined with a 1 × 1 convolution and upsampled back to the original spatial dimensions to ensure alignment with the original features. Subsequently, a 3 × 3 convolution followed by a Sigmoid function generates an attention map that guides the network to focus on key areas, enhancing the signal representation of lesion regions. Finally, the enhanced features from all scales are concatenated along the channel dimension and fused through a 1 × 1 convolution that compresses the channel count and constructs the final multi-scale feature representation. This process allows the network to capture lesion information across different scales more effectively.
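The pipeline above (channel reduction, multi-scale pooling, per-scale refinement and attention, then fused concatenation) can be sketched in PyTorch as follows. The intermediate channel count and pooling scales are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HGCM(nn.Module):
    """Sketch of the Hierarchical Global Context Module described above.
    mid_ch and the pooling scales are assumed values for illustration."""
    def __init__(self, in_ch, mid_ch=64, scales=(1, 2, 4, 8)):
        super().__init__()
        self.scales = scales
        # 3x3 convolution: channel compression plus a basic local receptive field
        self.reduce = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1),
            nn.BatchNorm2d(mid_ch), nn.PReLU())
        # per-scale 1x1 refinement and 3x3 attention branches
        self.refine = nn.ModuleList([nn.Conv2d(mid_ch, mid_ch, 1) for _ in scales])
        self.attn = nn.ModuleList(
            [nn.Conv2d(mid_ch, mid_ch, 3, padding=1) for _ in scales])
        # 1x1 fusion back to the input channel count
        self.fuse = nn.Conv2d(mid_ch * len(scales), in_ch, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        f = self.reduce(x)
        outs = []
        for s, refine, attn in zip(self.scales, self.refine, self.attn):
            p = F.adaptive_avg_pool2d(f, s)                 # multi-scale context
            p = F.interpolate(refine(p), size=(h, w),
                              mode='bilinear', align_corners=False)  # back to H x W
            a = torch.sigmoid(attn(p))                      # attention map in [0, 1]
            outs.append(p * a)                              # attention-enhanced feature
        return self.fuse(torch.cat(outs, dim=1))
```

The sketch keeps the output shape equal to the input shape so the module can be dropped between backbone stages.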

In this context, Conv2d3 × 3 represents a 3 × 3 convolution operation, while BN refers to batch normalization. ϕ denotes the PReLU activation function, and σ represents the Sigmoid function, which maps the attention-map values to the range [0, 1]. pi is the learned parameter vector, and ⊙ indicates element-wise multiplication. For the i-th scale, an attention map is generated that highlights regions with critical lesion features, and the pooled feature map is upsampled to restore the original spatial resolution (H × W). The attention-enhanced feature at scale i is obtained by fusing this upsampled feature with its attention map.

The Hierarchical Global Context Module (HGCM) effectively builds a multi-scale representation mechanism by integrating multi-scale feature extraction, adaptive attention enhancement, and dynamic weighted fusion to complement global and local information. This provides accurate and powerful support for the fine-grained grading of diabetic retinopathy (DR).

To further improve the model's ability to recognize fine-grained lesion regions in DR images, this study proposes an innovative module, the Multi-scale Adaptive Attention Module (MSAM), as shown in Fig. 3. By introducing dynamic convolution kernels and an efficient channel attention mechanism, MSAM can accurately capture the complex dependencies between different spatial locations in the feature map. This design not only greatly improves the spatial sensitivity of the features, enabling the model to capture the details of local lesions more accurately, but also enhances the robustness and generalization ability of the model through diversified feature representations. MSAM dynamically adjusts feature weights to ensure efficient extraction and representation across a variety of complex lesion situations, thus significantly improving the model's accuracy in fine-grained lesion recognition.

The MSAM module captures local and global information simultaneously through multi-scale partitioning and dynamic convolution kernels. It divides the input feature map into S groups along the channel dimension, allowing each sub-feature map to extract features at a different scale using a correspondingly sized convolution kernel; grouped dynamic convolution further enhances feature diversity and fine-grained representation. After the convolutions at each scale, an SE (Squeeze-and-Excitation) block is applied to highlight key channels and suppress redundancy: it generates channel attention through adaptive pooling and fully connected layers, reallocating weights per channel to adaptively emphasize important lesion features. Next, the multi-scale enhanced feature maps are stacked along a new dimension, and the features of each scale are weighted and fused using dynamic weights generated by globally pooling the original features and applying a 1 × 1 convolution. Finally, the fused features are concatenated back to the original dimension, followed by a 1 × 1 convolution and batch normalization to complete channel compression and information integration. The result is a multi-scale, attention-adaptive aggregated feature map ready for further processing in the model.
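The steps above (channel grouping, per-scale convolution, SE re-weighting, then softmax-weighted fusion) can be sketched as follows. The kernel sizes per group and the SE reduction ratio are illustrative assumptions rather than the paper's reported hyper-parameters.

```python
import torch
import torch.nn as nn

class MSAM(nn.Module):
    """Sketch of the Multi-scale Adaptive Attention Module described above.
    Kernel sizes, group counts, and the SE reduction ratio are assumed values."""
    def __init__(self, channels, s=4, reduction=4):
        super().__init__()
        assert channels % s == 0
        self.s, ch = s, channels // s
        kernels = [3, 5, 7, 9][:s]
        # one grouped convolution per scale, with increasing kernel size
        self.convs = nn.ModuleList(
            [nn.Conv2d(ch, ch, k, padding=k // 2, groups=max(ch // 2, 1))
             for k in kernels])
        # SE-style channel attention applied independently at each scale
        self.se = nn.ModuleList([nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, max(ch // reduction, 1), 1), nn.ReLU(),
            nn.Conv2d(max(ch // reduction, 1), ch, 1), nn.Sigmoid())
            for _ in kernels])
        # dynamic per-scale fusion weights from the globally pooled input
        self.weight = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                    nn.Conv2d(channels, s, 1))
        self.proj = nn.Sequential(nn.Conv2d(channels, channels, 1),
                                  nn.BatchNorm2d(channels))

    def forward(self, x):
        groups = torch.chunk(x, self.s, dim=1)       # split into S channel groups
        feats = []
        for g, conv, se in zip(groups, self.convs, self.se):
            f = conv(g)
            feats.append(f * se(f))                  # SE-reweighted scale feature
        w = torch.softmax(self.weight(x), dim=1)     # (B, S, 1, 1) fusion weights
        feats = [f * w[:, i:i + 1] for i, f in enumerate(feats)]
        return self.proj(torch.cat(feats, dim=1))    # fuse back to input width
```

As with HGCM, the output shape matches the input shape, so the module is a drop-in feature refiner.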

In this context, Conv2d represents the standard 2D convolution operation, with grouped convolution used to enhance the diversity and fine-grained expression of features. ϕ denotes the ReLU activation function, W represents the weights of the fully connected layers, and σ is the Sigmoid activation function, which constrains the weights to the range [0, 1]. ⊙ indicates element-wise multiplication. S denotes the number of channel groups; for each scale i, the convolved feature is passed through channel attention to produce the enhanced output, and Ws is the scale-specific weight vector generated via softmax normalization.

The MSAM module, proposed in this study, significantly enhances diabetic retinopathy grading performance by synergistically combining multi-scale feature extraction and adaptive attention mechanisms. This design not only improves perception of complex lesion structures but also provides an efficient framework for medical image feature processing.

In order to further enhance the model's capability to model complex feature relationships in the DR grading task, this study proposes an innovative module, Relational Multi-head Attention (RMA). The RMA module aims to deeply explore and capture the complex inter-feature relationships within feature maps, thereby greatly improving the model's ability to recognize fine-grained lesion features and enhance its classification performance. By utilizing multiple attention heads in different subspaces, RMA provides a more comprehensive understanding of the relationships between features, which is essential for improving classification performance (Fig. 4).

The core of the RMA module is to establish and reinforce class-relevant feature representations. First, the original feature map is projected into a k × classes space using a 1 × 1 convolution, thereby capturing the fine-grained semantic cues necessary for disease grading. Subsequently, a channel attention mechanism is introduced to adaptively model the importance of each channel, amplifying the most discriminative channel signals. On this basis, the multi-head attention mechanism characterizes the complex relationships among different feature subspaces, exploring potential semantic relationships across multiple dimensions through parallel scaled dot-product attention and concatenating the outputs of each head to obtain a global representation. Next, the model generates an attention map over the class dimension and weights the channel features accordingly, forming a class-level weighted output that sharpens the depiction of lesion differences and semantic relationships. Finally, a 1 × 1 convolution combined with element-wise multiplication remaps the multi-scale, class-aware attention information back into the input feature map, completing the semantic weighting process. This approach preserves the original spatial structure while emphasizing the semantic contributions of key lesion areas.
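The flow above (projection to a k × classes space, channel attention, multi-head scaled dot-product attention, and remapping back into the input) can be sketched as follows. The values of k, the head count, and the exact channel-attention design are assumptions for illustration.

```python
import torch
import torch.nn as nn

class RMA(nn.Module):
    """Sketch of the Relational Multi-head Attention module described above.
    k, heads, and the gating design are assumed values, not the paper's exact ones."""
    def __init__(self, in_ch, classes=5, k=8, heads=4):
        super().__init__()
        dim = k * classes                             # k x classes projection space
        self.proj = nn.Sequential(nn.Conv2d(in_ch, dim, 1),
                                  nn.BatchNorm2d(dim), nn.ReLU())
        # channel attention: amplify the most discriminative channels
        self.chan_attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(dim, dim, 1), nn.Sigmoid())
        # multi-head scaled dot-product attention over spatial positions
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.back = nn.Conv2d(dim, in_ch, 1)          # remap to input channels

    def forward(self, x):
        b, _, h, w = x.shape
        f = self.proj(x)
        f = f * self.chan_attn(f)                     # channel re-weighting
        seq = f.flatten(2).transpose(1, 2)            # (B, H*W, dim)
        rel, _ = self.mha(seq, seq, seq)              # relational attention
        rel = rel.transpose(1, 2).reshape(b, -1, h, w)
        # element-wise multiplication preserves the original spatial structure
        return x * torch.sigmoid(self.back(rel))
```

Gating the input with a sigmoid of the remapped relational features keeps the residual spatial layout intact while re-weighting lesion-relevant positions.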

In this context, k represents the feature dimension for each class, and classes denotes the number of classes. BN stands for batch normalization, and ReLU is the activation function. σ is the Sigmoid activation function, and ⊙ indicates element-wise multiplication. Q, K, V correspond to the Query, Key, and Value vectors respectively, while d represents the dimensionality of the key vector.

The Relational Multi-head Attention (RMA) module effectively enhances the model's capacity to understand and represent complex feature relationships in DR images. Not only does it improve the identification of fine-grained lesion features, but it also boosts overall model performance and robustness through flexible feature fusion strategies.

In order to better characterize the gradual progression of DR between disease stages, we adopt a multi-task learning framework in this study. On the one hand, the severity of DR is treated as a regression task to capture the continuum of the disease; on the other hand, a classification task is incorporated to discretize regression outputs into DR grades. The regression task predicts continuous scores associated with images, while the classification task converts these continuous values into final predictions over five predefined categories. To facilitate the joint optimization of these tasks, we design a weighted composite loss function that integrates the cross-entropy loss (CE) and the mean squared error loss (MSE) and introduces a regularization term to prevent overfitting.

The goal of the image classification task is to accurately classify the input image into five predefined categories. To achieve this, this paper adopts the Cross-Entropy Loss (CE), which is one of the most commonly used and effective loss functions in classification tasks. The cross-entropy loss can effectively measure the difference between the predicted probability distribution and the true label distribution. Its mathematical expression is as follows:
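With m inputs, k categories, t the one-hot true labels, and prob the post-activation predicted probabilities (as defined below), the standard cross-entropy loss referred to here can be reconstructed as:

```latex
L_{CE} = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} t_{ij}\,\log\left(\mathrm{prob}_{ij}\right)
```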

m denotes the number of inputs, k represents the number of categories, t indicates whether the true label of a sample is the j-th category, and prob denotes the predicted probability for that category after the activation function. Using the cross-entropy loss alone in multi-task learning may ignore the optimization requirements of the regression task, leading to conflicts between tasks.

The regression task is the core of this study, and its goal is to predict continuous severity scores for input fundus images. To achieve this, this paper adopts the Mean Squared Error Loss (MSE) to measure the discrepancy between the predicted values and the true values, with the mathematical expression as follows:
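With y the predicted severity score and ŷ the ground-truth value over m samples, the standard mean squared error referred to here can be reconstructed as:

```latex
L_{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2
```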

y represents the output score. The MSE loss primarily considers the difference between the predicted and true labels. However, the MSE loss function on its own suffers from optimization-efficiency issues.

To overcome the aforementioned issues, this paper proposes an innovative dynamic multi-task loss weighting strategy. Unlike traditional fixed-weight approaches, this study designs a performance feedback-based dynamic weight adjustment mechanism, with its composite loss function defined as:
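Combining the two task losses with the time-varying weights α(t) and β(t) described below, and assuming an L2 penalty for the regularization term mentioned above (the paper does not specify its exact form), the composite loss can be written as:

```latex
L_{total}(t) = \alpha(t)\,L_{CE} + \beta(t)\,L_{MSE} + \lambda\,\lVert\theta\rVert_2^2
```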

Here, α(t) and β(t) are weight coefficients that are dynamically adjusted during the training process. These weights are adaptively updated based on the model's real-time performance in terms of classification accuracy and regression performance (Kappa coefficient). When there is a larger relative improvement in classification accuracy, the weight α(t) for the classification loss is increased; conversely, when the Kappa coefficient shows a greater relative improvement, the weight β(t) for the regression loss is increased.
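A minimal sketch of this performance-feedback rule follows. The step size, clipping bounds, and the choice to renormalize the weights to sum to 1 are all assumptions for illustration; only the feedback direction (favor the loss whose metric improved more) comes from the description above.

```python
def update_loss_weights(alpha, beta, d_acc, d_kappa, step=0.05,
                        lo=0.1, hi=0.9):
    """Sketch of the dynamic weight adjustment described above.
    d_acc / d_kappa are the relative improvements in classification accuracy
    and Kappa coefficient since the last evaluation; step, lo, and hi are
    assumed hyper-parameters."""
    if d_acc > d_kappa:        # classification improved more: raise alpha(t)
        alpha = alpha + step
    elif d_kappa > d_acc:      # regression improved more: raise beta(t)
        beta = beta + step
    # clip both weights to a sane range, then renormalize to sum to 1
    alpha = max(lo, min(hi, alpha))
    beta = max(lo, min(hi, beta))
    total = alpha + beta
    return alpha / total, beta / total
```

For example, starting from equal weights, a run where accuracy improves more than Kappa shifts weight toward the classification loss while keeping the two weights normalized.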

In this study, to enhance the feature representation capacity of deep neural networks in the DR grading task, the Shuffle Attention module is introduced. The Shuffle Attention module effectively captures important channels and key spatial regions in the feature map by combining channel attention and spatial attention mechanisms. Additionally, the channel shuffle operation enables cross-group feature interaction, thereby enhancing the diversity of feature representations and the model's generalization ability (Fig. 5).

The Shuffle Attention module combines channel and spatial attention mechanisms to effectively capture key channels and important spatial regions in the feature map. The channel shuffle operation facilitates the exchange of information between groups, enhances the diversity of feature representations and the expressiveness of the model, and demonstrates significant advantages in DR hierarchical tasks.
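A compact sketch of a Shuffle Attention block in the style described above follows: channels are split into groups, one half of each group receives channel attention and the other half spatial attention, and a channel shuffle then mixes information across groups. The shared gating parameters and group count here are illustrative simplifications.

```python
import torch
import torch.nn as nn

class ShuffleAttention(nn.Module):
    """Sketch of the Shuffle Attention block described above.
    The group count and shared gating parameters are assumed simplifications."""
    def __init__(self, channels, groups=8):
        super().__init__()
        self.groups = groups
        ch = channels // (2 * groups)                 # per-branch channels
        self.cw = nn.Parameter(torch.zeros(1, ch, 1, 1))  # channel-branch scale
        self.cb = nn.Parameter(torch.ones(1, ch, 1, 1))   # channel-branch bias
        self.sw = nn.Parameter(torch.zeros(1, ch, 1, 1))  # spatial-branch scale
        self.sb = nn.Parameter(torch.ones(1, ch, 1, 1))   # spatial-branch bias
        self.gn = nn.GroupNorm(ch, ch)                # spatial statistics

    def forward(self, x):
        b, c, h, w = x.shape
        x = x.reshape(b * self.groups, -1, h, w)      # split into groups
        xc, xs = x.chunk(2, dim=1)                    # channel / spatial halves
        # channel attention: global pooling then sigmoid gating
        attn_c = torch.sigmoid(self.cw * xc.mean((2, 3), keepdim=True) + self.cb)
        # spatial attention: GroupNorm statistics then sigmoid gating
        attn_s = torch.sigmoid(self.sw * self.gn(xs) + self.sb)
        out = torch.cat([xc * attn_c, xs * attn_s], dim=1)
        out = out.reshape(b, c, h, w)
        # channel shuffle: exchange information across groups
        return out.reshape(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)
```

The final reshape-transpose-reshape is the standard channel-shuffle trick: it interleaves channels so that subsequent layers see features from every group.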
