The U-Net–based dual-encoder ensemble was trained for 50 epochs to establish robust baseline results on the dataset. Training used Stochastic Gradient Descent (SGD) with Nesterov momentum, which provided stable updates and helped accelerate convergence compared to standard SGD. A OneCycleLR scheduler adjusted the learning rate dynamically during training, enabling rapid exploration of the parameter space in the initial phases and fine-tuning in later epochs. To improve training efficiency and reduce GPU memory consumption, mixed-precision training was adopted, allowing the model to compute certain operations in half precision without compromising accuracy. A combination of Dice Loss and Binary Cross-Entropy (BCE) was used as the hybrid objective function. This design kept the model robust to the class imbalance common in crack segmentation tasks while maintaining strong pixel-level discrimination. Model performance was evaluated using six widely adopted metrics, shown in Fig. 6: Dice Score, Intersection over Union (IoU), Accuracy, Precision, Recall, and F1-Score, providing a holistic assessment of segmentation quality. To prevent overfitting, early stopping was applied based on stagnation of the validation Dice score, and the best-performing model checkpoint was saved for subsequent testing. This strategy allowed the segmentation model to capture both fine-grained damage boundaries and broader contextual cues while ensuring reproducibility. The results of the best ensemble model on the held-out test set are shown in Table 5.
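The hybrid objective described above can be illustrated with a minimal, dependency-free sketch. It is not the training code used in this study: the equal weighting (`dice_weight=0.5`), the smoothing constant, and all function names are assumptions chosen for illustration, and the inputs are flattened lists of per-pixel crack probabilities and binary labels.

```python
import math

def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flattened pixel probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)

def soft_dice_loss(probs, targets, smooth=1.0):
    """1 - soft Dice coefficient; `smooth` keeps empty masks well defined."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denom + smooth)

def hybrid_loss(probs, targets, dice_weight=0.5):
    """Weighted sum of Dice and BCE; the 0.5 weight is an assumption."""
    dice = soft_dice_loss(probs, targets)
    bce = bce_loss(probs, targets)
    return dice_weight * dice + (1.0 - dice_weight) * bce

# Toy example: four pixels, two of which are crack pixels.
labels = [1, 0, 1, 0]
good_prediction = [1.0, 0.0, 1.0, 0.0]
bad_prediction = [0.0, 1.0, 0.0, 1.0]
print(hybrid_loss(good_prediction, labels), hybrid_loss(bad_prediction, labels))
```

The Dice term depends on the overlap with the crack pixels rather than on the total pixel count, which is why it is less affected by the dominant background class, while the BCE term supplies a per-pixel gradient signal.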
The dual-encoder attention-guided U-Net model was evaluated on both the Indian Footpath Damage Segmentation Dataset (our dataset) and the widely used CRACK500 benchmark dataset to assess its generalizability and robustness (see Table 6). On the Indian dataset, the model achieved a Dice Score of 0.6819 and an IoU of 0.6712, demonstrating strong pixel-level segmentation performance under diverse environmental and structural conditions. The relatively high accuracy (92.77%) reflects the model’s ability to discriminate between damaged and intact footpath regions, while the balanced precision (71.16%) and recall (69.78%) indicate stable performance across varying crack severities and sizes. On CRACK500, the model achieved superior results, with a Dice Score of 0.7588, an IoU of 0.7493, and an accuracy of 97.49%, highlighting improved robustness when trained and evaluated on a large-scale standardized crack dataset. The higher recall (78.11%) compared to the Indian dataset suggests that the model is more sensitive in detecting fine cracks in benchmark data, whereas the slightly lower precision reflects an increased tendency to label ambiguous regions as cracks. The cross-dataset evaluation confirms that the model not only generalizes well to benchmark datasets but also adapts effectively to real-world pedestrian infrastructure conditions, which are often more heterogeneous and challenging.
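For readers reproducing these figures, the following sketch shows how the reported segmentation metrics can be derived from a predicted and a ground-truth binary mask. It is an illustrative sketch, not the evaluation script used here: the function names and the toy masks are assumptions, and the choice of how per-image scores are aggregated over a dataset is left open.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN over two flattened binary masks."""
    tp = fp = fn = tn = 0
    for p, t in zip(pred, truth):
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif not p and t:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def segmentation_metrics(pred, truth):
    """Dice, IoU, accuracy, precision and recall for the crack class."""
    tp, fp, fn, tn = confusion_counts(pred, truth)

    def ratio(num, den):
        return num / den if den else 0.0  # guard against empty masks

    return {
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "iou": ratio(tp, tp + fp + fn),
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
    }

# Toy masks with 8 pixels (1 = crack, 0 = intact surface).
predicted_mask = [1, 1, 1, 0, 0, 0, 0, 0]
ground_truth = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = segmentation_metrics(predicted_mask, ground_truth)
print(metrics)
```

For a single binary mask, the Dice score and the F1-score of the crack class reduce to the same expression, 2TP / (2TP + FP + FN), so the sketch reports it once.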
The ablation analysis presented in Table 7 systematically evaluates the contribution of each architectural component of the segmentation framework. Starting from the baseline U-Net, which achieved a Dice score of 0.6123 and an IoU of 0.5981, incremental modifications were introduced to assess their impact on performance. When the vanilla encoder was replaced with EfficientNet-B3, the Dice score increased to 0.6445, highlighting the advantage of multi-scale hierarchical feature extraction. Similarly, substituting the encoder with ResNeXt-50 improved Dice to 0.6368, showing that grouped convolutions are effective in capturing structural context, though somewhat less effective than EfficientNet-B3 in handling fine-grained features. Importantly, combining both encoders in a dual-encoder ensemble further boosted Dice to 0.6651 and IoU to 0.6523, confirming the complementary nature of EfficientNet-B3 and ResNeXt-50.
Attention mechanisms were then evaluated independently. A baseline U-Net with attention showed modest gains (Dice 0.6281), indicating that attention alone is insufficient without strong feature extraction. However, coupling attention with advanced encoders led to more substantial improvements. The EfficientNet + Attention model achieved a Dice score of 0.6572, while the ResNeXt-50 + Attention variant achieved 0.6494, demonstrating consistent gains in both cases. Finally, the full model, which integrates dual encoders with attention-guided fusion, achieved the best results across all metrics (Dice 0.6819, IoU 0.6712, Accuracy 92.77%). This configuration not only improved pixel-level segmentation but also enhanced robustness by emphasizing damage-relevant features while suppressing background noise.
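The Dice scores quoted in the ablation discussion can be tabulated as gains over the baseline U-Net. The short sketch below only rearranges the values stated in the text; the configuration labels are shorthand introduced for this example and may not match the row names in Table 7.

```python
# Dice scores as quoted in the ablation discussion (labels are shorthand).
ABLATION_DICE = {
    "Baseline U-Net": 0.6123,
    "U-Net + Attention": 0.6281,
    "ResNeXt-50 encoder": 0.6368,
    "EfficientNet-B3 encoder": 0.6445,
    "ResNeXt-50 + Attention": 0.6494,
    "EfficientNet-B3 + Attention": 0.6572,
    "Dual encoder": 0.6651,
    "Dual encoder + Attention (full)": 0.6819,
}

def gains_over_baseline(scores, baseline_key="Baseline U-Net"):
    """Absolute Dice improvement of each configuration over the baseline."""
    base = scores[baseline_key]
    return {
        name: round(score - base, 4)
        for name, score in scores.items()
        if name != baseline_key
    }

gains = gains_over_baseline(ABLATION_DICE)
for name, gain in gains.items():
    print(f"{name:<34} +{gain:.4f}")
```

Read this way, attention alone adds about 0.016 Dice to the baseline, each pretrained encoder adds between 0.024 and 0.033, and the full model adds about 0.070.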
The severity prediction model, based on EfficientNet-B3 with transfer learning, was trained for 30 epochs with a batch size of 32 and an initial learning rate of 0.001, using the Adam optimizer. Pretrained ImageNet weights were used to initialize the backbone, significantly improving convergence speed and reducing the risk of overfitting, especially given the limited dataset size. During training, the convolutional backbone extracted deep hierarchical features from each image, which were pooled via a Global Average Pooling (GAP) layer. To enhance generalization, Dropout regularization (rate = 0.3–0.5) was applied after pooling. The dense output layer then mapped the feature vector to three nodes corresponding to the severity classes (low, medium, high). The final Softmax layer produced normalized probability distributions for class assignment. The model was trained using categorical cross-entropy loss, which is well-suited for multi-class classification tasks, and evaluated using Accuracy, Precision, Recall, and F1-score. To further monitor learning dynamics, training and validation loss/accuracy curves were analyzed across epochs. This not only provided insights into overfitting trends but also demonstrated the effectiveness of transfer learning for small but well-curated datasets. Figure 7 shows the training and validation performance of the severity classification model.
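The classification head described above (GAP, dense layer, Softmax, categorical cross-entropy) can be sketched in plain Python as follows. This is a hedged illustration of the forward pass at inference time, not the model implementation: the tiny feature maps, weights, and function names are made up for the example, and Dropout is omitted because it is only active during training.

```python
import math

def global_average_pool(feature_maps):
    """Collapse each H x W channel map to its mean: one value per channel."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]

def dense(features, weights, biases):
    """Fully connected layer producing one logit per severity class."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]

def softmax(logits):
    """Numerically stable softmax over the class logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(probs, true_index, eps=1e-12):
    """Negative log-probability assigned to the true class."""
    return -math.log(max(probs[true_index], eps))

CLASSES = ["low", "medium", "high"]

# Hypothetical backbone output: 2 channels of size 2 x 2.
feature_maps = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [2.0, 2.0]]]
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # 3 classes x 2 channels
biases = [0.0, 0.0, 0.0]

pooled = global_average_pool(feature_maps)
probs = softmax(dense(pooled, weights, biases))
predicted = CLASSES[probs.index(max(probs))]
print(pooled, probs, predicted)
```

The loss is small when the Softmax output concentrates on the true class and grows as probability mass shifts to the other two classes.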
Figure 7 shows some fluctuation in both the training and validation curves, but accuracy increases steadily overall, indicating effective learning during the training phase. Performance rises sharply during the initial epochs for both training and validation and stabilizes around the 5th epoch. The final results show a training accuracy of 98% and a validation accuracy of 95%, and the model achieves 95% accuracy on the test dataset, demonstrating strong generalization with no significant evidence of overfitting. This is also evident in the confusion matrix in Fig. 8 and the ROC curve in Fig. 9.
Table 8 presents the classification results for severity prediction across three categories: low, medium, and high. The model demonstrates strong overall performance, achieving an accuracy of 95% on the test set. Precision and recall values remain consistently high across all classes, with low severity attaining the best F1-score (0.97), indicating reliable detection of minor cracks. The high severity class achieved perfect recall (1.00), meaning all severe cracks were correctly identified, though its precision (0.91) indicates a small number of false positives. Conversely, the medium severity class shows perfect precision (1.00) but a slightly reduced recall (0.90), indicating that some medium cracks were misclassified. The confusion matrix in Fig. 8 shows the number of correctly and incorrectly classified samples for the low, medium, and high severity categories. The macro-average F1-score of 0.96 and weighted-average F1-score of 0.95 confirm the model’s balanced performance across classes despite differences in class distribution. These results highlight the robustness of the severity prediction pipeline and its potential for supporting practical decision-making in footpath maintenance and pedestrian safety.
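The per-class and averaged scores discussed above follow directly from a confusion matrix. The sketch below shows the computation on a purely hypothetical 3 x 3 matrix (rows are true classes, columns are predicted classes); the counts are invented for illustration and are not the values behind Table 8 or Fig. 8.

```python
CLASSES = ["low", "medium", "high"]

def per_class_report(cm):
    """Precision, recall, F1 and support per class from a confusion matrix."""
    n = len(cm)
    report = {}
    for k in range(n):
        tp = cm[k][k]
        predicted = sum(cm[r][k] for r in range(n))  # column sum
        actual = sum(cm[k])                          # row sum (support)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[CLASSES[k]] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": actual,
        }
    return report

def averaged_f1(report):
    """Macro (unweighted) and support-weighted averages of per-class F1."""
    total = sum(r["support"] for r in report.values())
    macro = sum(r["f1"] for r in report.values()) / len(report)
    weighted = sum(r["f1"] * r["support"] for r in report.values()) / total
    return macro, weighted

# Hypothetical counts only, NOT the study's confusion matrix.
example_cm = [
    [19, 0, 1],   # true low
    [1, 18, 1],   # true medium
    [0, 0, 20],   # true high
]
report = per_class_report(example_cm)
macro_f1, weighted_f1 = averaged_f1(report)
print(report, macro_f1, weighted_f1)
```

Macro and weighted averages coincide when every class has the same number of test samples, as in this toy matrix; they diverge when the class distribution is uneven, which is why both are reported.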
Usage and scope
The Indian Footpath Damage Segmentation Dataset is intended to serve as a benchmark resource for advancing research in automated infrastructure monitoring and pedestrian safety. By providing high-resolution images with pixel-level annotations and severity labels, the dataset enables the training, validation, and evaluation of deep learning models for both segmentation and severity classification tasks. From a research perspective, the dataset facilitates benchmarking of existing and novel semantic segmentation architectures. Researchers can employ it to explore improvements in encoder–decoder networks, attention mechanisms, ensemble approaches, and transformer-based architectures. The inclusion of severity labels extends its applicability beyond binary segmentation, enabling multi-class classification and severity-aware modeling. The baseline results presented in this study establish a reference point for future algorithmic improvements, providing an initial performance benchmark against which subsequent work can be compared. From a practical standpoint, the dataset offers significant potential in supporting data-driven decision-making for urban infrastructure management. Severity-aware segmentation outputs provide actionable insights for maintenance teams, allowing them to prioritize repairs based on crack intensity and potential pedestrian safety risks. This makes the dataset directly relevant to smart-city monitoring platforms, decision-support systems, and predictive maintenance frameworks. By integrating severity prediction with geospatial metadata, the dataset can further support urban planners and policymakers in developing strategies for safe, accessible, and sustainable pedestrian environments. Finally, as an open-access resource hosted on Zenodo, the dataset encourages reproducibility, collaboration, and extensibility. 
Researchers are invited to expand the dataset to diverse geographic contexts, incorporate multi-sensor modalities, and explore advanced learning paradigms such as self-supervised or domain-adaptive approaches. The resource is thus designed not only as a dataset but also as a foundation for future innovation in pedestrian safety and infrastructure monitoring.
Limitations
Although the Indian Footpath Damage Segmentation Dataset provides high-quality annotations and serves as a valuable resource for advancing pedestrian safety research, certain limitations should be acknowledged. First, the dataset is geographically restricted to Pune, India, and therefore may not fully capture the variability of pedestrian infrastructure across different cities, climates, or construction practices. Extending the dataset to include diverse geographic and cultural contexts would improve its generalizability. Second, the images were captured using 2D imaging techniques with handheld cameras. As a result, the dataset does not incorporate depth information or three-dimensional (3D) data, which could provide richer structural insights into crack depth and surface unevenness. The absence of 3D data may limit the accuracy of severity estimation in certain scenarios, particularly where surface depth plays a critical role in pedestrian fall risk. Finally, the dataset focuses exclusively on visible footpath damage (cracks, spalling, and irregularities) and does not account for other safety-related factors such as pedestrian density, material friction coefficients, or contextual hazards (e.g., obstructions or lighting conditions). Incorporating these broader environmental factors would enable a more comprehensive understanding of pedestrian fall risks. These limitations highlight opportunities for future work, including the expansion of data collection across multiple cities, the integration of multi-sensor modalities (e.g., LiDAR or stereo vision), and the inclusion of contextual metadata to support more robust safety assessment and predictive analytics.
Recommendations for future work
There are several directions for future research. First, expanding data collection to multiple cities and diverse geographic contexts would enhance generalizability and allow comparative studies across different infrastructure types. Second, integrating multi-modal data, such as LiDAR, IMU, or weather sensors, could provide richer context for severity estimation and fall-risk analysis. Third, exploring advanced deep learning architectures, including transformer-based segmentation models and self-supervised pretraining, may yield improved accuracy and robustness over the baseline methods reported in this study. Fourth, coupling visual severity predictions with geospatial data would support the development of decision-support systems for city planners and municipal authorities. Finally, the dataset may serve as a foundation for predictive modeling frameworks that anticipate infrastructure deterioration and prioritize preventive maintenance strategies, thereby contributing to safer and more sustainable pedestrian environments.