1 Introduction

Driven by the continuing push toward informatization and digitization in China, intelligent construction and smart tunneling are becoming the mainstream development direction in the tunnel engineering field (Fang et al., 2025; Aston et al., 1988; Xie et al., 2025), particularly for shield tunnels (Barton, 2012; Fang et al., 2023; Zheng et al., 2016). Shield tunnels, constructed using tunnel boring machines (TBMs), are equipped with numerous sensors that monitor and record mechanical, electrical, and environmental parameters in real time at fixed frequencies during construction, establishing a foundation for data-driven intelligent TBM construction. Furthermore, with the rapid advancement of computer technology, artificial intelligence (AI) has been widely applied across fields such as finance, healthcare, transportation, and manufacturing, and the emergence of open-source machine learning libraries (e.g., Scikit-learn, PyTorch, TensorFlow) has given tunneling researchers more accessible channels for learning and applying AI algorithms.

Currently, AI technologies based on big data are increasingly being introduced into tunnel engineering to assist TBM construction. Key research focuses include TBM efficiency optimization, intelligent perception of surrounding rock grades, tunneling performance prediction, and adverse geological condition forecasting (Tang et al., 2024; Chen et al., 2021; Li et al., 2023a; Li et al., 2023b). In the domain of surrounding rock grade classification, existing studies have employed various machine learning algorithms (Afradi and Ebrahimabadi, 2020; Guo et al., 2022a; Guo et al., 2022b; Ghorbani and Yagiz, 2024; Feng et al., 2021; Hou et al., 2022; Kohestani et al., 2017; Liu et al., 2020; Xiong, 2014; Zhu et al., 2021), including ensemble methods (Stacking, Random Forest, AdaBoost/AdaCost, decision trees, GBDT), classical classifiers (KNN, SVM), and deep learning models (MLP, DNN), to develop intelligent surrounding rock grade perception models based on TBM tunneling parameters. These models incorporate parameters such as gripper pressure, gear seal pressure, advance displacement, cutterhead power, shield pressure, and rolling force (Zhu et al., 2020; Mao et al., 2021; Zhang et al., 2019; Wu et al., 2021; Liu et al., 2021; Yin et al., 2022; Prechelt, 2002; Chen et al., 2015). However, owing to data imbalance, these models exhibit poor prediction accuracy for Grade II and Grade V surrounding rocks, and although the SMOTE oversampling method has been applied to balance datasets, its effectiveness remains limited (Wu et al., 2021; Liu et al., 2021; Yin et al., 2022; Prechelt, 2002; Chen et al., 2015). Current feature selection methods generally fall into two categories: "data-driven" and "knowledge-driven." Data-driven approaches analyze correlations between input tunneling parameters and the target variable using techniques such as Random Forest importance, Pearson correlation analysis, PCA, and XGBoost, and select modeling parameters based on correlation strength. Knowledge-driven methods rely on personal expertise and experience to choose input parameters, which introduces subjective bias. While modeling on averaged stable-phase data can improve accuracy to some extent, it conflicts with real-time prediction requirements, rendering such models inadequate for guiding actual construction. Moreover, although AI technologies have advanced in TBM applications, existing models lack universality because tunneling parameters vary across TBM types (e.g., cutterhead diameter, tool configuration) and geological conditions.

Against this background, this study is based on the Luotian Reservoir-Tiegang Reservoir Water Diversion Tunnel Project (hereafter the Luotie Project) and uses comprehensive geological data and tunneling parameters from its TBM1. First, feature parameters were selected by combining data-driven and knowledge-driven criteria according to the data characteristics and prediction target. The raw data were then cleaned using boxplots and partitioned into training (70%), testing (20%), and validation (10%) sets. Subsequently, three data imbalance mitigation strategies were applied to construct surrounding rock classification models with the XGBoost, RF, CatBoost, and LightGBM algorithms. This work enables rapid, efficient, and intelligent perception of surrounding rock grades, guiding shield drivers in adjusting tunneling parameters and thus supporting safer and more efficient tunneling.

3 Data processing

3.1 Data cleaning

Invalid data were removed from the collected tunneling parameters by retaining only non-zero entries: any data row containing a zero value in any parameter was deleted according to the criterion defined in Equation 1. After this preliminary cleaning, boxplots were used for a second round of outlier removal. A boxplot flags abnormal data using the median, the lower quartile (Q1), the upper quartile (Q3), and the whisker bounds Q3 + 1.5(Q3 − Q1) and Q1 − 1.5(Q3 − Q1); values outside these bounds are discarded. In this study, the stable excavation phase was specifically extracted to analyze tunneling parameter variations. Notably, even during stable excavation, manual operational decisions introduce fluctuations in two critical parameters, thrust and rotation speed, which subsequently affect other operational indicators. To address this, a refined segmentation of the stable excavation phase was implemented using a stopping criterion method (Breiman, 2001), as defined in Equation 2. The processed distributions of four key tunneling parameters are shown in Figure 3, where distinct unimodal distributions under different surrounding rock grades demonstrate the effectiveness of the data cleaning methodology.
$f_F \cdot f_T \cdot f_N \cdot f_v = 0 \qquad (1)$



$P_k(t) := 1000 \cdot \left( \dfrac{\sum_{t_0 = t-k+1}^{t} E(t_0)}{k \cdot \min_{t_0 = t-k+1}^{t} E(t_0)} - 1 \right) \qquad (2)$
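A minimal sketch of how the zero-value filter (Equation 1) and the boxplot-based outlier removal might be implemented with pandas is given below; the column names are hypothetical and the toy data are for illustration only.

```python
import pandas as pd

def clean_tunneling_data(df: pd.DataFrame, params: list) -> pd.DataFrame:
    """Drop rows with any zero-valued parameter (Eq. 1), then boxplot outliers (1.5*IQR rule)."""
    # Equation 1: a row is invalid if the product of its key parameters is zero
    df = df[(df[params] != 0).all(axis=1)].copy()
    # Secondary cleaning: keep values within the boxplot whisker bounds per parameter
    for col in params:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        df = df[df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
    return df

# Toy example with hypothetical column names (thrust, torque, rotation speed, penetration rate)
raw = pd.DataFrame({"thrust": [0, 8100, 8200, 8300], "torque": [900, 950, 940, 930],
                    "rs": [6.1, 6.2, 0, 6.0], "pr": [9.5, 9.8, 9.7, 9.6]})
print(clean_tunneling_data(raw, ["thrust", "torque", "rs", "pr"]))
```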


Kernel density curves of different feature parameters under the corresponding surrounding rock grades: (a) Thrust (10³ kN); (b) Torque (10³ kN·m); (c) Penetration rate (mm/r); (d) Rotation speed (r/min).



3.2 Feature parameter selection and dataset splitting

Pearson correlation analysis was first conducted to assess the linear relationships between the tunneling parameters and the surrounding rock classification. The objective was to quantify the strength of correlation between each parameter and the surrounding rock grade and thereby identify the most relevant features. Based on these results, and using a hybrid "knowledge-driven" and "data-driven" approach, eight feature parameters were selected: Thrust, Rotation Speed (RS), Torque, Penetration Rate (PR), 1# Middle Shield Retracting Gripper Shoe Pressure (1#SSP), 2# Middle Shield Retracting Gripper Shoe Pressure (2#SSP), Foam Pressure (1#FP), and Surrounding Rock Classification (SR). The Pearson correlation coefficients for these parameters are presented in Figure 4. The absolute correlation coefficients between the six input features other than Thrust and the surrounding rock grade were all greater than 0.15, indicating a significant linear relationship. In contrast, Thrust exhibited a correlation coefficient of only 0.05, reflecting a weak association with the surrounding rock grade; however, because Thrust is a critical control parameter in TBM tunneling, it was retained in the final feature set. Following data cleaning and feature selection, the dataset was partitioned into training, testing, and validation sets at a 7:2:1 ratio. The data samples were first grouped by surrounding rock grade (III, IV, V); within each grade group, 70% of the samples were randomly assigned to the training set, 20% to the testing set, and the remaining 10% to the validation set.
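A sketch of the correlation screening and the grade-stratified 7:2:1 split is shown below, assuming a cleaned DataFrame df from the previous step with hypothetical column names for the seven input features and the grade column SR.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# df: cleaned DataFrame from Section 3.1; column names here are hypothetical
features = ["Thrust", "RS", "Torque", "PR", "SSP1", "SSP2", "FP1"]

# Pearson correlation of each candidate feature with the surrounding rock grade
corr = df[features + ["SR"]].corr(method="pearson")["SR"].drop("SR")
print(corr.sort_values(key=abs, ascending=False))

# 70/20/10 split, stratified by grade so each grade keeps its share in every subset
train, rest = train_test_split(df, test_size=0.30, stratify=df["SR"], random_state=42)
test, val = train_test_split(rest, test_size=1/3, stratify=rest["SR"], random_state=42)
```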


Pearson correlation analysis results of tunneling parameters.






3.3 Model training methods

Based on the selected feature parameters and processed data, modeling is performed using the following three methods:

Method (1): The cleaned raw data are used directly, with default hyperparameters and no additional processing.

Method (2): Hyperparameter optimization is applied to the models, without any resampling of the data.

Method (3): Hyperparameter optimization is combined with SMOTE oversampling. SMOTE is employed specifically to address the severe class imbalance among the Grade III, IV, and V surrounding rock samples (a minimal sketch of this resampling step is given below).
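As a minimal sketch of the resampling step in Method (3), the example below uses the SMOTE implementation from the imbalanced-learn package (an assumption; the paper does not name a specific library) and applies it to the training split only, which is the usual practice; X_train and y_train are assumed arrays from the earlier split.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

# X_train: feature matrix of the training split; y_train: grade labels (3, 4, 5)
smote = SMOTE(random_state=42)                      # synthesizes minority-grade samples
X_res, y_res = smote.fit_resample(X_train, y_train)
print("before:", Counter(y_train), "after:", Counter(y_res))
```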

Surrounding rock grade classification models were developed using the XGBoost, CatBoost, RF, and LightGBM algorithms under these three modeling methods. Model performance was evaluated using three key metrics, Precision, Recall, and F1_score, to assess the effectiveness of each strategy. The formulas for these metrics are defined in Equations 3–5.
$PRE = \dfrac{TP}{TP + FP} \qquad (3)$



$REC = \dfrac{TP}{TP + FN} \qquad (4)$



$F1 = \dfrac{2 \times PRE \times REC}{PRE + REC} \qquad (5)$

In the equations, TP denotes the number of true positive samples (correctly predicted positive instances), FP represents false positive samples (negative instances incorrectly predicted as positive), and FN indicates false negative samples (positive instances incorrectly predicted as negative).
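For reference, these per-grade metrics can be computed directly with scikit-learn; in the sketch below, y_val and y_pred are assumed arrays of true and predicted grade labels.

```python
from sklearn.metrics import precision_recall_fscore_support, classification_report

# Per-grade Precision (Eq. 3), Recall (Eq. 4) and F1 (Eq. 5) for grades III, IV, V (labels 3, 4, 5)
prec, rec, f1, support = precision_recall_fscore_support(y_val, y_pred, labels=[3, 4, 5])
print(classification_report(y_val, y_pred, labels=[3, 4, 5], digits=3))
```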

4 Machine learning algorithms

4.1 eXtreme gradient boosting (XGBoost)

The Extreme Gradient Boosting (XGBoost) algorithm, proposed by Chen and Guestrin (Zhang et al., 2022), is a composite algorithm that combines base functions and weights to achieve superior data fitting. It belongs to the family of Gradient Boosting Decision Trees (GBDT). The core principle of GBDT is to combine multiple weak learners (decision trees) into a stronger predictive model: during training, GBDT iteratively adds new decision trees to correct the errors of the previous ones until convergence or until a predefined iteration limit is reached (Prokhorenkova et al., 2017). Unlike traditional GBDT, XGBoost augments the objective loss function with regularization terms. To address difficulties in computing derivatives of certain loss functions, XGBoost approximates the loss with a second-order Taylor expansion, improving computational precision. Additionally, XGBoost employs shrinkage and feature subsampling to prevent overfitting and introduces a sparsity-aware algorithm to handle missing data. It also supports both greedy and approximate algorithms for node splitting in the tree models.

Due to its efficiency in processing large-scale data and complex models, as well as its robustness against overfitting, XGBoost has gained widespread attention and application since its inception.
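A minimal sketch of a multiclass XGBoost classifier for the three grade labels is shown below; the hyperparameter values and the label remapping are illustrative assumptions, not the configuration used in this study.

```python
from xgboost import XGBClassifier

# XGBClassifier expects labels 0..n_classes-1, so grades 3/4/5 are remapped to 0/1/2 here
model = XGBClassifier(objective="multi:softprob", n_estimators=500,
                      max_depth=6, learning_rate=0.05, subsample=0.9,
                      colsample_bytree=0.9, eval_metric="mlogloss")
model.fit(X_train, y_train - 3)                 # shift labels {3,4,5} -> {0,1,2}
pred = model.predict(X_val) + 3                 # shift predictions back to grade labels
```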



4.2 Random Forest (RF)

Random Forest (RF), proposed by Breiman in 2001 (Hancock and Khoshgoftaar, 2020), is an ensemble learning method based on decision tree algorithms. The RF model employs a Bagging aggregation strategy, utilizing bootstrapping (a random sampling method with replacement) for data sampling. These techniques effectively mitigate overfitting risks during model construction. By integrating the results of multiple base estimators, RF achieves superior predictive performance compared to single-estimator models, demonstrating enhanced generalization capability and robustness. Additionally, the model reduces classification errors in imbalanced datasets and exhibits high training efficiency. Consequently, it has been widely applied to both regression and classification problems (Bentéjac et al., 2021).
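A corresponding Random Forest sketch using scikit-learn, again with illustrative settings rather than those of this study:

```python
from sklearn.ensemble import RandomForestClassifier

# Bagging of decision trees: each tree sees a bootstrap sample and a random feature subset
rf = RandomForestClassifier(n_estimators=500, max_depth=None,
                            min_samples_split=3, min_samples_leaf=3,
                            bootstrap=True, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
print(rf.score(X_val, y_val))                   # mean accuracy on the validation split
```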



4.3 Categorical boosting (CatBoost)

The CatBoost algorithm, proposed by Prokhorenkova et al. (2017) (see also Ke et al., 2017; Shen et al., 2025a), is an advanced gradient boosting decision tree (GBDT) algorithm. Building upon GBDT, CatBoost introduces two key enhancements, adaptive learning rates and categorical feature processing, which enable superior performance in both classification and regression tasks. The adaptive learning rate optimizes the contribution of the decision trees in each iteration, thereby improving overall model accuracy; its calculation is given in Equations 6 and 7. Categorical feature processing employs the Ordered Target Statistics encoding technique to convert categorical features into numerical representations. This method applies hash encoding to categorical values and maps the resulting hash values to numerical equivalents, effectively capturing the influence of categorical features.

Compared to XGBoost, CatBoost exhibits the following advantages.

1. Higher Model Accuracy: CatBoost often achieves high precision without requiring extensive hyperparameter tuning.

2. Faster Training Speed: Outperforms XGBoost in training efficiency.

3. Superior Prediction Speed: Delivers significantly faster inference times than XGBoost.

4. Lower Memory Consumption: Requires less memory on the computational hardware.

5. Native Categorical Feature Support: Unlike XGBoost, which relies on OneHot encoding for categorical features, CatBoost directly handles string-type categorical features without preprocessing.


$\eta_t = \dfrac{1}{1 + t} \qquad (6)$



$\alpha_t = \dfrac{\sum_{i=1}^{t} \eta_i}{t} \qquad (7)$

In the equations, t denotes the iteration number, $\eta_t$ represents the learning rate at the t-th iteration, and $\alpha_t$ is the average of the learning rates over the first t iterations.
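A minimal CatBoost sketch for the three-grade task is given below; the parameter values are illustrative placeholders, not those reported in Table 2.

```python
from catboost import CatBoostClassifier

# Multiclass CatBoost model; categorical columns could be passed via cat_features if present
cat = CatBoostClassifier(iterations=1000, depth=6, learning_rate=0.05,
                         loss_function="MultiClass", verbose=100, random_seed=42)
cat.fit(X_train, y_train, eval_set=(X_val, y_val))
pred = cat.predict(X_val).ravel()               # CatBoost returns a column vector of labels
```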



4.4 Light Gradient Boosting Machine (LightGBM)

Light Gradient Boosting Machine (LightGBM), proposed by Ke et al. (2017), Shen et al. (2025b), is an efficient framework for implementing GBDT algorithms. Compared to XGBoost, LightGBM significantly reduces time complexity by converting sample-wise traversal into bin-wise calculations via the histogram algorithm (optimizing from sample-level to bin-level traversal). Additionally, LightGBM employs Gradient-based One-Side Sampling (GOSS), which retains samples with large gradients while randomly selecting a subset of low-gradient samples, thereby reducing computational load while maintaining gradient distribution stability. Furthermore, its leaf-wise growth strategy prioritizes splitting leaf nodes that yield the greatest loss reduction during the tree-building process, generating deeper asymmetric tree structures to minimize redundant computations. LightGBM also combines optimized feature parallelism and data parallelism methods to accelerate training and introduces the Exclusive Feature Bundling (EFB) algorithm, which merges sparse and mutually exclusive features into single composite features to reduce dimensionality and memory consumption. These innovations enable LightGBM to achieve high efficiency and low resource utilization when processing large-scale, high-dimensional data.
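A comparable LightGBM sketch with illustrative settings (the multiclass objective is inferred from the labels during fitting):

```python
from lightgbm import LGBMClassifier

# Histogram-based GBDT with leaf-wise growth; num_leaves controls tree complexity
lgbm = LGBMClassifier(n_estimators=1000, num_leaves=63, learning_rate=0.05,
                      subsample=0.9, colsample_bytree=0.9, random_state=42)
lgbm.fit(X_train, y_train, eval_set=[(X_val, y_val)])
pred = lgbm.predict(X_val)
```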

5 Predictive models and results

5.1 Surrounding rock classification model

Surrounding rock classification models were developed using the XGBoost, CatBoost, RF, and LightGBM algorithms. First, the eight feature parameters, Thrust, Rotation Speed (RS), Torque, Penetration Rate (PR), 1#SSP, 2#SSP, 1#FP, and SR, were selected through the hybrid "data-driven" and "knowledge-driven" approach. The data were cleaned according to Equation 1 and denoised using boxplots, and the preprocessed dataset was partitioned into training, testing, and validation sets at a 7:2:1 ratio. The three imbalanced-data processing methods were then applied, with the seven tunneling parameters serving as the model inputs and the surrounding rock grades (III, IV, V), labeled 3, 4, and 5, as the output.

Before model training, the training set was standardized using Equation 8. The Optuna hyperparameter optimizer was employed for automated hyperparameter tuning, with the optimization cycle set to 100 iterations, yielding final hyperparameters for models based on the three imbalanced data processing methods. Training was then conducted using these optimized hyperparameters. Finally, the trained models were validated using the testing set to obtain the final surrounding rock classification results.
$x_{new} = \dfrac{x - \mu}{\sigma} \qquad (8)$

In the equation, $x_{new}$ represents the standardized data, x denotes the raw data, μ is the mean of the sample data, and σ is the standard deviation of the sample data.
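A compressed sketch of this training pipeline, combining the Equation 8 standardization with a 100-trial Optuna search, is shown below for CatBoost; the search space, metric, and split variables are assumptions, not the study's exact setup.

```python
import optuna
from catboost import CatBoostClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score

scaler = StandardScaler()                         # implements Equation 8: (x - mu) / sigma
X_tr = scaler.fit_transform(X_train)              # fit the scaler on the training set only
X_te = scaler.transform(X_test)

def objective(trial):
    # Hypothetical search space; the study's actual ranges are not reported here
    params = {
        "iterations": trial.suggest_int("iterations", 200, 2000),
        "depth": trial.suggest_int("depth", 4, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "l2_leaf_reg": trial.suggest_float("l2_leaf_reg", 1e-3, 10.0, log=True),
    }
    model = CatBoostClassifier(loss_function="MultiClass", verbose=0, **params)
    model.fit(X_tr, y_train)
    return f1_score(y_test, model.predict(X_te).ravel(), average="macro")

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)           # 100-trial optimization cycle
print(study.best_params)
```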



5.2 Model performance evaluation

Surrounding rock classification models under the three data processing methods were established using the XGBoost, CatBoost, RF, and LightGBM ensemble learning algorithms. The prediction results were evaluated using three key metrics, Precision, Recall, and F1_score, as shown in Figure 5, and the optimal hyperparameters obtained via Optuna optimization are summarized in Table 2. The results indicate the following. Method (1) (raw data only): all four models exhibited poor performance, with the RF model performing worst for Grade V surrounding rock (Precision: 0.67, Recall: 0.53, F1: 0.89). Method (2) (hyperparameter optimization): accuracy improved significantly compared with Method (1); the CatBoost model satisfied all evaluation criteria, indicating reliable predictions, whereas the RF and XGBoost models still underperformed. Method (3) (hyperparameter optimization + SMOTE): all models showed further accuracy gains, with the CatBoost model achieving the best prediction performance.


Evaluation metrics for intelligent diagnostic models: (a) Precision; (b) Recall; (c) F1_score.




The hyperparameter optimization results of the intelligent diagnostic model.

Unbalanced data handling method (2):
XGBoost: n_estimators = 1855, max_depth = 9, learning_rate = 0.0468, subsample = 0.8663, colsample_bytree = 0.8882, gamma = 0.5033, min_child_weight = 5, reg_alpha = 0.1713, reg_lambda = 2.6573
RF: n_estimators = 1559, max_depth = 18, min_samples_split = 3, min_samples_leaf = 3
CatBoost: iterations = 1142, depth = 4, learning_rate = 5.38e-2, l2_leaf_reg = 0.0188, bagging_temperature = 0.2057, random_strength = 2.3258, border_count = 252
LightGBM: n_estimators = 1241, max_depth = 0, learning_rate = 0.1681, subsample = 0.7184, colsample_bytree = 0.9625, num_leaves = 55, min_child_samples = 35, reg_alpha = 0.0187, reg_lambda = 1.1492

Unbalanced data handling method (3):
XGBoost: n_estimators = 455, max_depth = 18, learning_rate = 8.36e-3, subsample = 0.9832, colsample_bytree = 0.9238, gamma = 0.3049, min_child_weight = 9, reg_alpha = 1.003e-4, reg_lambda = 1.181e-4
RF: n_estimators = 1023, max_depth = 11, min_samples_split = 3, min_samples_leaf = 3
CatBoost: iterations = 542, depth = 6, learning_rate = 2.63e-2, l2_leaf_reg = 0.0610, bagging_temperature = 0.5349, random_strength = 3.6250, border_count = 35
LightGBM: n_estimators = 1511, max_depth = 14, learning_rate = 9.30e-3, subsample = 0.9086, colsample_bytree = 0.8780, num_leaves = 130, min_child_samples = 4, reg_alpha = 0.0024, reg_lambda = 5.2011

Performance analysis revealed varying degrees of classification confusion among all models using raw data, as demonstrated by the confusion matrices of the validation set in Figure 6. The RF model exhibited cross-grade misclassification between Grades III and V at a rate of 1.92% (1/52), and its misclassification rate between Grades IV and V reached 10.71% (6/56). The XGBoost and LightGBM models showed improved performance, reducing the IV-V misclassification rates to 7.14% (4/56) and 7.27% (4/55), respectively. The CatBoost model outperformed the others, achieving a IV-V misclassification rate of 3.57% (2/56), with prediction accuracies of 100% for Grade III, 100% for Grade IV, and 86% for Grade V. These results confirm that the CatBoost model delivers the best predictive performance under raw data conditions.
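The confusion matrices discussed in this and the following paragraphs can be generated with scikit-learn, as in the sketch below; y_val is assumed to hold the true validation labels and pred the corresponding model predictions.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Rows: true grade, columns: predicted grade, for grades III, IV, V (labels 3, 4, 5)
cm = confusion_matrix(y_val, pred, labels=[3, 4, 5])
ConfusionMatrixDisplay(cm, display_labels=["III", "IV", "V"]).plot(cmap="Blues")
plt.show()
```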


Classification prediction results with default raw parameters: (a) XGBoost; (b) RF; (c) CatBoost; (d) LightGBM.



After hyperparameter optimization, the confusion matrices of the different models on the validation set are shown in Figure 7. The RF and XGBoost models exhibited no misclassification between Grades III and V, while their misclassification rate between Grades IV and V decreased to 7.14% (4/56). The LightGBM model further reduced the IV-V misclassification rate to 3.57% (2/56). The CatBoost model achieved zero misclassification, with 100% prediction accuracy for Grades III, IV, and V, demonstrating its superior effectiveness.


Classification prediction results with hyperparameter optimization: (a) XGBoost; (b) RF; (c) CatBoost; (d) LightGBM.



When SMOTE is combined with hyperparameter optimization, the classification results on the validation set are as shown in the confusion matrices of Figure 8. The XGBoost model achieved further improvement, reducing the misclassification rate between Grades IV and V to 5.36% (3/56). The RF and LightGBM models increased the prediction accuracy for Grade V to 93.3% (14/15) with no cross-grade misclassifications. These results indicate that SMOTE-enhanced hyperparameter optimization further improves model accuracy. The CatBoost model, which already achieved excellent performance with hyperparameter optimization alone, showed negligible differences after SMOTE integration.


Classification prediction results with SMOTE and hyperparameter optimization: (a) XGBoost; (b) RF; (c) CatBoost; (d) LightGBM.



Therefore, considering prediction accuracy comprehensively, the CatBoost model integrated with SMOTE and hyperparameter optimization was selected as the optimal intelligent diagnostic model in this study. To further check the behavior of the CatBoost model, its convergence curve was plotted, as shown in Figure 9. The results indicate that the model exhibits only slight overfitting and that its overall performance remains acceptable.
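A convergence curve like the one in Figure 9 could be reproduced from CatBoost's recorded evaluation history, for example as in the sketch below; the split variables follow the earlier assumptions, and 'learn'/'validation_*' are CatBoost's default result keys.

```python
import matplotlib.pyplot as plt
from catboost import CatBoostClassifier

# Fit with explicit eval sets so the loss history of each split is recorded
model = CatBoostClassifier(loss_function="MultiClass", iterations=500, verbose=0)
model.fit(X_train, y_train, eval_set=[(X_val, y_val), (X_test, y_test)])

history = model.get_evals_result()            # keys: 'learn', 'validation_0', 'validation_1'
for name, curves in history.items():
    plt.plot(curves["MultiClass"], label=name)
plt.xlabel("Iteration")
plt.ylabel("MultiClass loss")
plt.legend()
plt.show()
```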


Convergence curve of CatBoost model.


Line chart showing loss versus epochs for training, validation, and test sets. Loss decreases sharply initially and then flattens out as epochs increase. The training, validation, and test sets are represented by blue, orange, and green lines, respectively.



