In Fig. 7, the values estimated by each algorithm are compared against the individually measured 28-day CS values. This comparison is presented through graphs employing the a20-index metric, which reveal that the majority of points fall within the \(x = 1.20y\) and \(x = 0.80y\) lines, signifying good predictive accuracy across all ML algorithms. The a20-index was selected because it is widely recognized in civil engineering and materials science as a clear and trustworthy gauge of predictive competence, especially when forecasting concrete behavior. The a20-index measures the fraction of modelled values that lie within ±20% of the corresponding measured values, giving a concrete error benchmark that engineering practitioners regard as tolerable. In contrast to summary statistics such as R² and RMSE, which aggregate over the entire dataset, the a20-index focuses exclusively on the proportion of predictions that satisfy a practically meaningful tolerance cut-off, which matters when the stakes include safety margins and the inherent variability of construction materials. Additionally, when placed alongside the other aα indices (notably a10 and a30), the a20 threshold represents a widely accepted compromise: a10 is often considered too strict, penalizing models for deviations of little practical consequence, while a30 is frequently dismissed as too lenient, allowing models to appear trustworthy even when significant inaccuracies go unnoticed. The a20-index thus occupies a sound, practical middle ground, revealing a model's dependability in contexts where engineering judgement is paramount.
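For clarity, the a20-index can be computed directly from paired measurement and prediction vectors. A minimal sketch follows; the sample values are hypothetical and serve only to illustrate the ±20% band:

```python
import numpy as np

def a20_index(y_true, y_pred):
    """Fraction of predictions falling within +/-20% of the measured values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ratio = y_pred / y_true                      # prediction-to-measurement ratio
    return float(np.mean((ratio >= 0.80) & (ratio <= 1.20)))

# Example: three of four predictions lie within the 0.80-1.20 band.
measured  = np.array([30.0, 45.0, 52.0, 60.0])   # hypothetical 28-day CS values (MPa)
predicted = np.array([28.5, 49.0, 67.0, 58.0])
print(a20_index(measured, predicted))            # -> 0.75
```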
According to Fig. 7, the a20-index values span a range of 0.64 to 0.97, with the DTR algorithm exhibiting the lowest accuracy and the GPR and MLPR algorithms the highest. Consequently, based on the a20-index results, all models except DTR exhibit satisfactory performance in estimating concrete CS.
The assessed performance of an ML model can differ depending on which metric is chosen as the focal point of the evaluation. Each metric captures a different element of performance, such as the magnitude of error, the proportion of variance explained, or the degree of robustness. A multi-criteria scoring scheme is therefore preferred to provide a more balanced outcome; the metrics selected here are R², MAPE, RMSE, VAF, and the a20-index. Each ML model was ranked per metric based on its raw performance: for example, the model with the highest R² received a score of 12 (indicating first place out of 12 models), the next best received 11, and so forth down to the model with the lowest R², which received a score of 1. No weighting bias was introduced, so the final score is simply the sum of the individual metric scores. The ranking score in the final column of Table 4 reflects this aggregate performance and avoids bias that would arise from championing a single performance metric. For instance, while SVR excelled in R² (0.9647), MAPE (0.04), and RMSE (2.85), it also ranked highly in VAF (98.2%) and the a20-index (0.96), giving it a cumulative score of 51, the highest among all contenders. Similarly, GPR and NuSVR demonstrated consistently high scores across metrics, securing strong overall rankings.
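This rank-sum scheme is straightforward to reproduce. The sketch below illustrates it with three models and illustrative metric values rather than the full twelve-model table; with all twelve models, the best score per metric would be 12 instead of 3:

```python
import pandas as pd

# Illustrative metric values for three of the twelve models; higher is
# better for R2, VAF, and a20, lower is better for MAPE and RMSE.
metrics = pd.DataFrame({
    "R2":   [0.9647, 0.9600, 0.9500],
    "MAPE": [0.04,   0.05,   0.06],
    "RMSE": [2.85,   3.10,   3.40],
    "VAF":  [98.2,   97.5,   96.8],
    "a20":  [0.96,   0.97,   0.95],
}, index=["SVR", "GPR", "NuSVR"])

higher_is_better = {"R2": True, "MAPE": False, "RMSE": False, "VAF": True, "a20": True}

# Per-metric rank scores: the best of n models receives n points, the worst 1.
# The final column is the unweighted sum across all metrics.
scores = pd.DataFrame({
    col: metrics[col].rank(ascending=higher_is_better[col], method="min")
    for col in metrics.columns
})
scores["total"] = scores.sum(axis=1)
print(scores.sort_values("total", ascending=False))
```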
In Fig. 8, a schematic depiction presents the total scores for each algorithm based on the comprehensive evaluation criteria. These overall results position the SVR algorithm as the frontrunner among its counterparts. However, it is important to note that these estimates rest on test datasets, and the algorithms' performance awaits confirmation through rigorous testing on new, unseen datasets to ensure sustained accuracy.
Evaluating the ML algorithms' accuracy in predicting the concrete CS using the a20-index. ANN: Artificial neural network; SVR: Support vector regression; GPR: Gaussian process regression; ETR: Extra tree regressor; DTR: Decision tree regressor; GBR: Gradient boosting regressor; HGBR: Histogram-based gradient boosting regressor; XGBoost: Extreme gradient boosting; NuSVR: Nu support vector regression; VR: Voting regressor; RF: Random forest; MLPR: Multilayer perceptron regression.
Ranking of the ML models using the hold-out validation method. ANN: Artificial neural network; SVR: Support vector regression; GPR: Gaussian process regression; ETR: Extra tree regressor; DTR: Decision tree regressor; GBR: Gradient boosting regressor; HGBR: Histogram-based gradient boosting regressor; XGBoost: Extreme gradient boosting; NuSVR: Nu support vector regression; VR: Voting regressor; RF: Random forest; MLPR: Multilayer perceptron regression.
K-fold cross-validation is a widely accepted method for validating the performance of ML models. The entire dataset is partitioned into K equally sized subsets, known as folds. In every iteration, a single fold serves as the holdout test set, while the remaining K-1 folds are concatenated to train the model. This rotation is carried out K times, guaranteeing that every fold is designated as the test set exactly once. The performance metrics from each cycle are then averaged, yielding a composite score that mitigates the influence of any one particular split. This procedure confirms the model's capacity to generalize, as every observation is tested while also contributing to the training pool across the K passes. K-fold cross-validation also plays a crucial role in discouraging overfitting: instead of having the model latch onto the idiosyncrasies of a single training set, it forces the model to encounter varied subsets, compelling it to learn patterns that hold across the entire dataset. Averaging performance scores across multiple folds yields a reliability that a lone train-test split cannot match, which is especially beneficial when the dataset is small, as every observation is used both for training and for validation. K-fold cross-validation also streamlines model selection, offering a fair basis for comparing multiple algorithms or tuned hyperparameters. The result is a clearer, more detailed picture of how well a model is likely to perform on unseen data.
We applied 5-fold cross-validation (K = 5) to rigorously evaluate the ML models. The complete dataset was divided into five equal parts; in each fold, one part served as the test set while the remaining four were combined to form the training set. By rotating the test set across all five parts, we guaranteed that every observation contributed to both the training and the validation process. This practice produces a robust, dependable estimate of how well the models can predict the CS of concrete incorporating RHA. The choice of five folds strikes a good balance, granting reliable performance metrics without excessively prolonging training times, thereby sharpening our understanding of each model's capacity to generalize to unseen samples.
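A minimal sketch of this protocol using scikit-learn is shown below; the feature matrix and target are placeholders standing in for the study's 500-sample dataset, and the SVR pipeline is illustrative rather than the tuned configuration:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Placeholder data with the study's dimensions: seven mix-design features
# (e.g. W/B, C, RHA, W, SP, FA, CA) and the 28-day CS target.
rng = np.random.default_rng(0)
X = rng.random((500, 7))
y = rng.random(500) * 80.0

model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# One R2 score per fold; the mean summarizes generalization across splits.
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print(scores, scores.mean())
```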
Table 5 summarizes the comparative performance of the ML approaches assessed using three primary evaluation metrics (R², RMSE, and VAF). The metrics are computed for every fold and then used to rank the models. Reviewing the table, the SVR model consistently records the highest R² scores across every fold, indicating a strong capability to predict concrete CS. For instance, in the first fold, SVR attains an R² of 0.9518, outperforming every competitor. In contrast, the DTR model consistently appears at the foot of the R² hierarchy, evidencing weaker predictive quality; in Fold 1, DTR earns an R² of merely 0.3312, well below any alternative model considered. For RMSE, the SVR model records the smallest values, pointing to the least prediction error. In Fold 1, it settles at 2.98, underscoring the model's accuracy. The DTR model, by contrast, shows the highest RMSE values, especially in Fold 1, where it reaches 13.81, indicating predictions farther from the true values than those of any other model. Turning to VAF, SVR again leads, explaining the greatest proportion of variance: 97.2% in Fold 1, the highest among all contenders. DTR posts the lowest VAF, 58.31% in the same fold, a reading that shows it captures only a fraction of the data's underlying variation.
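Since VAF is less common than R² and RMSE, its computation is sketched below under the usual definition, VAF = 100 × (1 − Var(y − ŷ)/Var(y)); this assumes the standard formula rather than the authors' exact implementation:

```python
import numpy as np

def vaf(y_true, y_pred):
    """Variance accounted for, in percent: 100 * (1 - Var(residuals) / Var(y))."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 100.0 * (1.0 - np.var(y_true - y_pred) / np.var(y_true))

def rmse(y_true, y_pred):
    """Root-mean-square error in the units of y (here, MPa)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```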
The model rankings in Table 5 are derived from the overall scores computed over the complete set of five cross-validation folds; in this system, a greater cumulative score indicates superior performance. The SVR model outperforms every other candidate in each individual fold and attains the highest total score of 180, underlining its consistency and stability across the validation process. A fold-by-fold analysis confirms SVR's dominant position. In Fold 1, the model records the best R², the lowest RMSE, and the leading VAF, which together secure its first rank. Fold 2 sees SVR again on top, with similarly strong R² and VAF values, though the RMSE rises by a small margin. The same pattern persists in Fold 3, where R² and VAF remain high and RMSE stays comparatively low. Fold 4 shows a similar picture: top R² and VAF with a slight RMSE increase. Finally, Fold 5 again delivers the best R² and VAF, paired with the lowest RMSE, reaffirming SVR's overall superiority.
The SVR, NuSVR, and GPR models outperformed the other methods in this work for three mutually reinforcing reasons matched to the problem's conditions. First, their architectures suit moderate-sized datasets (such as the 500 samples here), where deep learners, including ANNs, risk overfitting without heavy, and sometimes poorly balanced, regularization. Second, they employ kernel functions (specifically the radial basis function) that map input features into high-dimensional spaces where nonlinear trends can be effectively captured. Lastly, all three methods embed regularization: SVR and NuSVR impose it via penalty parameters, while GPR incorporates it through Bayesian priors. Together, these design choices supported reliable generalization and strong predictive accuracy throughout every evaluation phase.
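The sketch below shows how these three kernel methods expose their regularization controls in scikit-learn; the hyperparameter values are illustrative defaults, not the tuned settings from this study:

```python
from sklearn.svm import SVR, NuSVR
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel

# Epsilon-SVR and Nu-SVR: the penalty parameter C regularizes margin
# violations, and gamma sets the RBF kernel width.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma="scale")
nusvr = NuSVR(kernel="rbf", C=10.0, nu=0.5, gamma="scale")

# GPR: the RBF kernel captures smooth nonlinearity, while the WhiteKernel
# noise term and the Bayesian prior act as built-in regularization.
gpr = GaussianProcessRegressor(
    kernel=ConstantKernel() * RBF(length_scale=1.0) + WhiteKernel(),
    normalize_y=True,
)
```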
In summary, the SVR model leads in every validation fold, showing the highest predictive accuracy. Its strength across all five partitions supports the model’s reliability for estimating the CS of RHA concrete. Conversely, the DTR model places at the bottom in each measure, underscoring its relative unsuitability for this application. The 5-fold cross-validation adopted here enhances the credibility of the results by preventing reliance on a single data split; instead, it confirms model behavior through a thorough evaluation over multiple data segments. This multi-partition method delivers a robust and consistent basis for judging the model’s potential to generalize and produce precise forecasts.
To ensure a robust evaluation of the trained algorithms in estimating concrete CS, we employ previously unused datasets from prior publications as validation datasets. Initially, we scrutinize the models' performance on the 24 data points presented in Bui et al.15, outlined in Table 6. These data points share identical parameters with our study, encompassing the same considerations in concrete sample preparation and testing methodology for determining the 28-day CS. The external data points were used exclusively for generalization assessment; none were used for training or hyperparameter adjustment. This step gauges how the trained models perform on independent data produced under differing experimental arrangements, strengthening the credibility of the models and addressing robustness and transferability, which are especially important in ML applications to concrete materials, where variability in raw materials and test conditions is common.
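The protocol amounts to fitting on the study data alone and scoring the untouched external set. A minimal sketch with placeholder arrays follows; in practice the external inputs would be the digitized mixes from Bui et al.:

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.svm import SVR

# Placeholder arrays: X_train/y_train stand in for the study's dataset and
# X_ext/y_ext for the 24 external mixes from the literature.
rng = np.random.default_rng(1)
X_train, y_train = rng.random((500, 7)), rng.random(500) * 80.0
X_ext, y_ext = rng.random((24, 7)), rng.random(24) * 80.0

# The model is fitted on the study data only; the external set is touched
# solely by predict(), never by training or hyperparameter tuning.
model = SVR(kernel="rbf").fit(X_train, y_train)
print("External R2:", r2_score(y_ext, model.predict(X_ext)))
```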
In Fig. 9, the outcomes predicted by each algorithm on these data points are juxtaposed with the experimental results. While most models behave similarly to the experimental outcomes, not all algorithms yield acceptable accuracy, as reflected in R² values spanning 0.46 to 0.94. Notably, the SVR, GPR, and NuSVR models, which demonstrated superior accuracy on the test dataset in our study, also show the best performance on these data points, attesting to their sound training. The MLPR and ANN algorithms secure the fourth and fifth positions, achieving R² values of 0.84 and 0.82, respectively. Conversely, the remaining algorithms exhibit subpar performance, registering R² values in the range of 0.46 to 0.71, with the DTR algorithm delivering the least accuracy, at an R² of 0.46.
Evaluating the ML algorithm predictions against the results of the tests conducted by Bui et al.15.
In this analytical phase, the performance of each trained ML model was investigated using an additional set of six data points, which underwent CS testing as detailed by Chao-Lung et al.61. These data points, presented in Table 7, deviate solely in the geometric configuration of the samples, transitioning from cubic to cylindrical. The primary objective of this comparative analysis was to assess the adaptability of the ML models developed in our study to diverse structural forms of concrete samples. We acknowledge that the differing shapes (cube vs. cylinder) can influence CS results because each geometry redistributes stress and triggers failure in distinct patterns. Nonetheless, the goal of this comparison was to assess how well the ML models generalize and remain robust when applied to datasets with mildly different specimen geometries, even in the absence of direct geometric normalization.
Figure 10 illustrates the correlation between the CS values estimated by each ML algorithm and the corresponding values obtained from the laboratory tests conducted by Chao-Lung et al.61. The R² values derived from these ML algorithms span from 0.50 to 0.98. A standout performer is the SVR model, showing exceptional accuracy with an R² of 0.98. The NuSVR and GPR models also exhibit noteworthy precision, achieving R² values of 0.95 and 0.93, respectively. Conversely, models such as DTR, XGBoost, RF, GBR, HGBR, and VR, with R² values below 0.80, demonstrate comparatively lower accuracy, while the MLPR, ANN, and ETR models show acceptable accuracy, with R² above 0.80. It is worth highlighting that the SVR and DTR models record the highest and lowest accuracies, with R² values of 0.98 and 0.50, respectively, echoing trends observed in previous evaluations.
A comprehensive examination of these results reveals the proficiency of the SVR model in accurately estimating the concrete CS, particularly within the context of the dataset utilized in this study. This finding not only underscores the robustness of the SVR model but also prompts further exploration into the factors contributing to its superior predictive performance in this specific application. Additionally, these insights into the comparative accuracies of various ML models provide valuable guidance for selecting appropriate models in similar contexts, contributing to the ongoing refinement of predictive methodologies in the domain of concrete CS estimation.
Evaluating the ML algorithm predictions against the results of the tests conducted by Chao-Lung et al.61.
The strong performance of the SVR model in estimating concrete CS, as evidenced by the comprehensive evaluation in this study, underscores its efficacy as a robust predictive tool. The successful application of the SVR model to the dataset employed herein attests to its capacity to capture the various parameters and their intricate relationships with the model output (CS). This positions the SVR model as a valuable asset for predictive modeling in concrete engineering.
Motivated by the proficiency of the SVR model, an exploration was initiated to unravel the influence of the RHA parameter in the concrete mixing plan on CS. This investigation was conducted using three distinct datasets, each comprising 20 data points, as novel test datasets, formed by systematically varying the RHA parameter within its range (0 to 190 kg/m³) in 10 kg/m³ increments while holding the other parameters constant, according to Table 8. The predictions, shown in Fig. 11, clearly demonstrate a parabolic pattern in which CS increases with increasing RHA content up to an optimal level (around 80–100 kg/m³); beyond this, further increases in RHA content result in a gradual reduction in strength. This behavior illustrates a saturation effect typically attributed to the pozzolanic reactivity of RHA, which improves strength up to a certain replacement level and then degrades it as the replacement level increases, owing to excessive RHA diluting the cementitious materials and aggravating workability problems. It should be noted that the observed optimal RHA value is bound to this specific dataset and to the chemical and physical properties of the RHA used in this study, such as particle size and burning conditions, along with the complete mixture design, including the W/B, SP, and aggregate size distribution. For example, finer RHA particles with greater amorphous silica content are more reactive, shifting the optimal dosage to a higher value, whereas coarser, less reactive RHA shifts the optimum to a lower value. Thus, the optimal range described above should be considered relevant only in the context of the experimental materials and proportions; extrapolation to other contexts would necessitate recalibration or retraining of the models with localized material properties and mix designs to maintain accurate strength and optimal-dosage targets.
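Such a one-factor sweep is simple to script. The sketch below uses illustrative baseline values and a stand-in SVR fitted on random data in place of the study's tuned pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.svm import SVR

cols = ["W/B", "C", "RHA", "W", "SP", "FA", "CA"]

# Stand-in model fitted on random data; in practice this would be the
# study's tuned SVR pipeline.
rng = np.random.default_rng(2)
model = SVR(kernel="rbf").fit(rng.random((500, 7)), rng.random(500) * 80.0)

# Baseline mix held constant (illustrative values) while RHA sweeps
# 0-190 kg/m3 in 10 kg/m3 steps, yielding the 20-point test set.
base = {"W/B": 0.30, "C": 468.0, "W": 165.0, "SP": 6.1, "FA": 543.0, "CA": 1267.0}
rha_values = np.arange(0, 191, 10)
grid = pd.DataFrame([{**base, "RHA": float(r)} for r in rha_values], columns=cols)

cs_pred = model.predict(grid.to_numpy())
print("Predicted optimum RHA dosage:", rha_values[np.argmax(cs_pred)], "kg/m3")
```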
The examination firmly establishes that the addition of RHA to the concrete mixture holds the potential to enhance its CS, contingent upon the intricate interdependence of various parameters. This nuanced insight contributes to the ongoing discourse on optimizing concrete mix designs for superior performance.
GUI for practical deployment
To enable seamless integration of the ML models into everyday engineering workflows, a dedicated standalone GUI was developed using the PyQt5 toolkit in Python. This user-friendly desktop application, illustrated in Fig. 12, prompts the user for seven critical mix design parameters: W/B, C, RHA, total W, SP, FA, and CA. Users can swiftly obtain the predicted 28-day CS of concrete mixtures incorporating RHA by entering these values. Additionally, the interface permits selection from twelve pre-trained ML models, with the underlying models serialized via the joblib library to guarantee rapid initialization and optimal computational performance.
The GUI functions seamlessly across Windows, Linux, and macOS, making it easy to access and use at critical on-site decision points. Built-in input checks confirm that the mix design parameters stay within the empirically grounded ranges of the authors' dataset; for instance, if a user enters an unrealistically high W/B or RHA value, the tool instantly flags it for revision. The application serves two main audiences: practitioners can quickly test different mix designs without incurring the expense and delay of full laboratory testing, while researchers can vary parameters systematically to produce synthetic datasets for simulation or optimization studies.
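A condensed sketch of the validation-plus-prediction core that such a PyQt5 front end would call is given below; the range bounds and model file path are illustrative assumptions, not the exact limits of the authors' dataset:

```python
import joblib  # the pre-trained models are assumed to be serialized .joblib files

# Illustrative admissible ranges; the real bounds come from the authors'
# training dataset, not from these assumed values.
VALID_RANGES = {
    "W/B": (0.25, 0.60), "C": (250.0, 650.0), "RHA": (0.0, 190.0),
    "W": (120.0, 230.0), "SP": (0.0, 12.0), "FA": (400.0, 900.0),
    "CA": (900.0, 1400.0),
}
FEATURE_ORDER = ("W/B", "C", "RHA", "W", "SP", "FA", "CA")

def validate_mix(params):
    """Flag any mix-design value that falls outside its admissible range."""
    bad = {k: v for k, v in params.items()
           if not VALID_RANGES[k][0] <= v <= VALID_RANGES[k][1]}
    if bad:
        raise ValueError(f"Revise out-of-range inputs: {bad}")

def predict_cs(params, model_path="models/svr.joblib"):
    """Validate the inputs, load a serialized model, and return CS in MPa."""
    validate_mix(params)
    model = joblib.load(model_path)  # hypothetical file path
    features = [[params[k] for k in FEATURE_ORDER]]
    return float(model.predict(features)[0])
```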
Figure 12 presents the interface returning a predictive CS value derived from the chosen ML model. In the current scenario, the SVR model estimates a CS of 75.40 MPa. Though this level may appear surprising for concretes incorporating RHA, the forecast is supported by a carefully optimized set of parameters: W/B = 0.3, C = 468 kg/m³, RHA = 82 kg/m³, SP = 6.1 kg/m³, FA = 543 kg/m³, and CA = 1267 kg/m³. Together, these variables encourage a compact microstructure and improved pozzolanic reactivity. The result underlines the interface's ability to quantify the nonlinear and synergistic interactions that govern strength gain.
It should be emphasized that the suitability of the predictions and their applicability in practice depend on the materials and data used in model building. For instance, the RHA used in this study had particular characteristics, such as a high amorphous silica content, a low loss on ignition (LOI), and a fine particle size of around 15 µm, obtained through grinding and burning at 650–750 °C. Hydrothermally processed RHA, or RHA with coarser particles, increased crystalline content, or high LOI, can drastically modify the pozzolanic activity, hydration kinetics, and resulting CS in ways the existing models do not account for. Users working with different sources or grades of RHA are therefore advised to retrain the models with datasets most relevant to their materials. Other approaches, such as transfer learning or the implementation of corrective factors based on material property testing, can make the models more flexible. The reliability of the models for various applications can be further enhanced by adding metadata on RHA properties in later versions of the GUI.