Evaluating machine learning models is essential for identifying the solution that achieves the best accuracy, reliability, and efficiency for a given task. For predicting the CS of SFRC, an intricate problem involving multiple input variables with nonlinear relationships, it is neither logically sound nor practical to assume that any specific algorithm will outperform all others without exhaustive benchmarking. Detailed benchmarking is therefore necessary to determine which of the many available algorithms best estimates the CS.
By evaluating various algorithms, researchers not only discover gaps in performance but also learn how best to optimize each model. Figure 9 presents a comparative graphical analysis of the prediction accuracy of the CS of SFRC for six machine learning models. Each subplot shows the measured versus predicted CS values for the 120-instance test dataset together with the corresponding prediction errors, shown as absolute (top) and signed (bottom) bar plots. This combination enables the prediction errors of all models to be evaluated in both directions simultaneously.
The KNN model shows the most irregular prediction errors relative to the measured CS values, indicating very poor predictive performance. KNN fails to capture the strongly nonlinear pattern embedded in the data and deviates substantially from the measured values, especially in the middle and later parts of the test set. These results indicate that KNN, as a locally driven method, is ineffective for regression problems with multi-dimensional interactions: its poor generalization ability leaves it unable to handle complex, high-dimensional problems.
In contrast to KNN, SVR achieved moderate predictive performance with lower error amplitudes, although it struggled in regions with sharp transitions in CS. The SVR trendlines were smoother than the measured data, a sign of underfitting caused by the model's rigidity. Consequently, SVR offered only limited capacity to model the full response spectrum of SFRC compressive strength.
Among all the algorithms tested, GPR performed the best. Its prediction line followed the measured values across the entire test set with low error bars and minimal deviation, and it reproduced sharp peaks and valleys in the trend, demonstrating good generalization. GPR's probabilistic framework and kernel structure enable it to capture both local and global variability. The model's strong performance across all examined test instances makes it the best candidate for reliable SFRC strength prediction.
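As a concrete illustration of the probabilistic framework and kernel structure described above, the following sketch fits a scikit-learn GaussianProcessRegressor with an RBF-plus-noise kernel. The feature matrix and target here are synthetic stand-ins, not the study's SFRC dataset:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
# Synthetic stand-in for a mix-design matrix (rows = specimens, cols = features).
X = rng.uniform(size=(120, 4))
y = 30 + 10 * X[:, 0] + 5 * np.sin(3 * X[:, 1]) + rng.normal(scale=0.5, size=120)

# The RBF kernel captures smooth nonlinear trends; the WhiteKernel term
# absorbs measurement noise so the mean prediction is not forced through it.
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.25)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
gpr.fit(X, y)

# Probabilistic output: a mean prediction and a standard deviation per point.
mean, std = gpr.predict(X[:5], return_std=True)
```

The per-point standard deviation is what distinguishes GPR from the other five models: it quantifies confidence alongside each CS estimate rather than returning a bare point prediction.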
XGBR also performed admirably, showing a high degree of accuracy and error consistency across the dataset. Although less precise than GPR, XGBR was still able to approximate the nonlinear behavior of the CS with good graphical fidelity. A few small error spikes most likely stemmed from local overfitting, a common trait of boosting models. Overall, GPR remains better, but not by much, making XGBR a strong alternative, especially where computational cost is a concern.
The predictive accuracy of RFR was reasonable but below that of GPR and XGBR, with a greater spread of errors, particularly in the central and outer regions of the test set. The ensemble nature of RFR yielded stable but less accurate predictions, especially during rapid shifts in the CS. Its averaging, while useful for lowering variance, oversmoothed crucial nonlinear response regions and was therefore counterproductive precisely where sensitivity is most needed.
ANN, in contrast, delivered the weakest performance among the tested models. The prediction line was often misaligned with measured values, and the associated error bars were highly erratic and pronounced. This suggests insufficient model training or suboptimal hyperparameter configuration. Additionally, the model’s inability to generalize to new, complex patterns in the test data reinforces concerns regarding its appropriateness in this context, particularly without extensive architecture tuning or data augmentation.
Rigorously defined, quantifiable performance measures such as RMSE, VAF, R², and the a10-index enable accurate assessment of model performance and generalization. Such evaluation helps ensure that a model does not merely perform well on a specific dataset but also withstands variations in its distribution, thus preventing overfitting or underfitting. When multiple models are analyzed together, the interpretability and usefulness of the machine learning results and predictions become significantly more precise, offering clearer, more trustworthy, and actionable insights for a variety of engineering problems.
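These metrics can be computed directly from measured and predicted values. The definitions below follow the standard formulations (VAF as one minus the variance ratio of residuals, a10-index as the fraction of predictions within ±10% of the measured value); the study may state them slightly differently:

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root-mean-square error in the units of CS (MPa).
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def r2(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1 - ss_res / ss_tot)

def vaf(y_true, y_pred):
    # Variance accounted for: 1 - Var(residuals) / Var(measured).
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(1 - np.var(y_true - y_pred) / np.var(y_true))

def a10_index(y_true, y_pred):
    # Fraction of predictions falling within +/-10% of the measured CS.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs(y_pred - y_true) <= 0.10 * np.abs(y_true)))
```

Using the four together guards against one-sided conclusions: RMSE penalizes absolute error, R² and VAF measure relative fit, and the a10-index measures agreement within a fixed engineering tolerance.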
For every model, the assessment starts with the holdout cross-validation technique, applied uniformly in the first iteration so that all models share a consistent evaluation setup. In this technique, the dataset is randomly divided: 80% is set aside for training, allowing the models to learn intricate patterns, and the remaining 20% is used for testing, giving an unbiased estimate of how well the models generalize to new, unseen data. This approach minimizes overfitting and provides a realistic evaluation of each model's predictive capability for CS prediction.
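With scikit-learn, the 80/20 holdout above can be sketched as follows. The 600-row size is an assumption made here only because a 20% split of 600 yields the 120-instance test set mentioned earlier; the fixed random_state guarantees every model is evaluated on the identical partition:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the SFRC feature matrix and CS targets.
X = np.arange(600 * 4, dtype=float).reshape(600, 4)
y = np.arange(600, dtype=float)

# 80% for training, 20% held out for an unbiased test of generalization.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
```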
Figure 10 offers a holistic comparison of the machine learning models for predicting the CS of SFRC, evaluated graphically through a scatter plot of predicted vs. actual values (Fig. 10a) and through quantitative statistics (Fig. 10b). The score shown in Fig. 10(b) acts as an ordinal scale for benchmarking the models within the framework of the a10-index, RMSE, R², and VAF. On each metric, the six models receive a rank from 1 (worst) to 6 (best), and their cumulative scores determine a relative ranking. This fosters a straightforward evaluation of models across different aspects of performance, such as absolute error (RMSE), relative fit (R²), and agreement within a prescribed margin of error (a10-index). Such a scoring system is needed to address cases where models perform inconsistently across metrics; for example, one model might have a high R² but also a very high RMSE. Because the total score encompasses several measurements, it acts as a multi-criteria decision-support framework that keeps the evaluation objective, reproducible, and free of personal bias, enabling identification of the best model.
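The per-metric ranking can be sketched as follows. Only the values quoted in the text (e.g. GPR's R² of 0.93 and RMSE of 1.34 MPa) are real; the remaining entries are illustrative placeholders:

```python
import numpy as np

def rank_scores(values, higher_is_better=True):
    """Rank n models on one metric: 1 = worst, n = best."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values if higher_is_better else -values)
    scores = np.empty(len(values), dtype=int)
    scores[order] = np.arange(1, len(values) + 1)
    return scores

# Models: GPR, SVR, XGBR, RFR, KNN, ANN.
# Entries not quoted in the text are illustrative placeholders.
r2_vals   = [0.93, 0.89, 0.84, 0.81, 0.72, 0.75]
rmse_vals = [1.34, 1.65, 2.23, 2.05, 2.60, 2.77]  # MPa; lower is better

total = rank_scores(r2_vals) + rank_scores(rmse_vals, higher_is_better=False)
# Summing over all four metrics (a10-index and VAF handled the same way)
# gives the cumulative score used for the overall ranking.
```

Note the direction flip for RMSE: lower error must map to a higher rank, which the `higher_is_better=False` branch handles by negating before sorting.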
Among the machine learning models, GPR outperforms all others, possessing the best statistical parameters and the highest score (24) in the ranking matrix, with an a10-index of 0.98, an RMSE of 1.34 MPa, an R² of 0.93, and a VAF of 0.96. GPR's outputs are not only precise but consistent across the metrics, proving the model's trustworthiness on complex, nonlinear SFRC datasets.
SVR, with an a10-index of 0.96, an RMSE of 1.65, an R² of 0.89, a VAF of 0.94, and an overall score of 20, is the second-best model, and its scatter plot shows tight clustering around the 1:1 line, confirming its high predictive capability. XGBR and RFR achieved moderately acceptable performance, with total scores of 15 and 13. Although both show a reasonable fit, their slightly higher RMSE values (2.23 and 2.05) and lower R² values (0.84 and 0.81) suggest sufficient but not optimal performance, falling short of fully capturing the nonlinearity in the data.
In contrast, KNN and ANN showed significantly weaker performance. The lowest R² of 0.72 and an a10-index of 0.78 for the KNN model indicate predictions that deviate markedly from the actual values. ANN, a model that performs well in many settings, also underperformed here, recording the lowest overall score of 5. Its scatter plot lies far from the ideal line, with an a10-index of 0.77 and an RMSE of 2.77 MPa, which may reflect inadequate learning from the data, suboptimal hyperparameter tuning, excess sensitivity to preprocessing steps, or overfitting.
To conclude, for the purposes of this study, the algorithm that best predicts the CS of SFRC is GPR. It provides the lowest error and highest accuracy alongside consistently superior generalization across all analyzed metrics, validating the selection of nonlinear algorithms capable of handling the unrefined and intricate feature interactions found in material science datasets.
To further strengthen the evaluation, a 5-fold cross-validation method is used to improve the accuracy and validity of the results. The dataset is randomly divided into five equal parts, or folds. In each cycle, four folds are used to build the model while the remaining fold is used to validate it, ensuring that every data point contributes to both training and validation. Averaging results over all folds yields a more comprehensive, unbiased estimate, improving the reliability and precision of the performance evaluation. Unlike the K-fold method, the holdout method evaluates performance on a single arbitrary subset and is therefore prone to split-dependent results. Repeated exposure to different data splits gives a better assessment of accuracy and generalization capability. Such resilience to data-partition assumptions is especially beneficial when data are limited, as it makes optimal use of the data and reduces overfitting, providing accurate and trustworthy evaluations of the algorithms.
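The fold rotation can be sketched with scikit-learn's KFold; an ordinary least-squares fit stands in here for any of the six models, and the data are synthetic, not the study's:

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
X = rng.uniform(size=(100, 4))
y = X @ np.array([3.0, 1.0, 0.5, 2.0])  # noiseless synthetic target

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_rmse = []
for train_idx, val_idx in kf.split(X):
    # Fit on four folds, validate on the held-out fifth.
    w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    err = X[val_idx] @ w - y[val_idx]
    fold_rmse.append(float(np.sqrt(np.mean(err ** 2))))

mean_rmse = float(np.mean(fold_rmse))  # average over the five folds
```

Every observation appears in exactly one validation fold, so the averaged score is not tied to any single split of the data.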
Figure 11 depicts the performance of the six machine learning algorithms under 5-fold cross-validation using three key statistical parameters. Each metric receives a score for each fold, and the total score reflects the model's dependability and consistency. Based on the results in Fig. 11, GPR scored the highest of all algorithms, with a total score of 81, proving its accuracy and stability across all folds. GPR shows the best R² (0.87 to 0.91), the lowest RMSE (14.5 to 18.6), and the highest VAF (0.94 to 0.97), indicating the best accuracy and generalization. SVR was second with a score of 65, followed by XGBR with 57 and RFR with 44. KNN and ANN scored significantly lower, at 15 and 34 respectively, showing a lack of predictive competence.
The importance of these outcomes rests in the thorough validation process: 5-fold cross-validation alleviates the risk of overfitting by ensuring that a model’s accuracy is not tied to a particular split of data, thereby enhancing confidence in generalizability. GPR’s superiority in every fold illustrates the precision with which it captured intricate, nonlinear relationships subsumed within the dataset, with little error and substantial explanatory strength. There is no other algorithm more consistently accurate than GPR when predicting CS within this context, which justifies its use in practical engineering scenarios that require strong predictive reliability.
To interpret the predictions made by the trained GPR model, Shapley additive explanations (SHAP) analysis was used. SHAP analyzes model predictions from a game-theory perspective, measuring the contribution of each feature to the prediction, and supports both global and local analysis. For predicting the CS of SFRC, this level of interpretability is important for rationalizing the influence of material properties and mix parameters and for justifying reliance on a machine learning model integrated into engineering practice.
The SHAP analysis in this study was conducted with the KernelExplainer, which approximates SHAP values by sampling from the training set. This model-agnostic approach is required because GPR is not compatible with tree-based explainers. The analysis was performed with the trained GPR model on a dataset that includes several mix-design parameters, such as fiber content, fiber length, curing time, and silica fume content. SHAP values were computed for each test instance, and the results were interpreted using multiple visualizations, including a summary (beeswarm) plot, a bar plot of mean absolute SHAP values, and dependence plots showing interactions among features (see Fig. 12).
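Since KernelExplainer only approximates the game-theoretic Shapley values, it can help to see the exact definition on a toy model. The from-scratch sketch below (illustrative code, not the study's pipeline) enumerates all feature coalitions, replacing "absent" features with the background mean; for a linear model the exact Shapley value of feature i reduces to w_i * (x_i - mean_i):

```python
import itertools
import math
import numpy as np

def exact_shapley(predict, x, background):
    """Exact Shapley values by enumerating all feature coalitions.
    'Absent' features are set to the background-data mean."""
    n = len(x)
    base = background.mean(axis=0)

    def coalition_value(subset):
        z = base.copy()
        idx = list(subset)
        z[idx] = x[idx]           # features in the coalition keep their value
        return float(predict(z[None, :])[0])

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for s in itertools.combinations(others, k):
                # Classic Shapley weight |S|! (n - |S| - 1)! / n!
                w = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
                phi[i] += w * (coalition_value(s + (i,)) - coalition_value(s))
    return phi

# Toy linear "model": each contribution is exactly w_i * (x_i - mean_i).
weights = np.array([2.0, -1.0, 0.5])
predict = lambda X: X @ weights
background = np.array([[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]])  # mean = [1, 1, 1]
x = np.array([3.0, 1.0, 2.0])
phi = exact_shapley(predict, x, background)
```

The efficiency property holds: the SHAP values sum to the difference between the prediction at x and the prediction at the background mean, which is exactly the quantity the beeswarm and bar plots in Fig. 12 decompose feature by feature.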
Figure 12(a) shows the distribution of SHAP values for all features over the test dataset. Every observation is represented by a single dot, where the color reflects the feature's value (low to high) and the position on the x-axis shows the magnitude and direction of the feature's impact on the predicted CS. The plot unambiguously shows that fiber content is the dominant feature, with high values improving CS. Fiber length and curing time also influence the results significantly, although their contributions are more mixed, suggesting nonlinear relationships. Other features, such as the w/c ratio and fiber diameter, appear insignificant, as evidenced by SHAP values tightly clustered around zero. The beeswarm plot also hints at potential interaction effects, which are important when analyzing SFRC behavior because of its complex dependencies.
Figure 12(b) presents a bar chart of the features ranked by their mean absolute SHAP value, which measures each feature's global importance regardless of whether its contribution is positive or negative. As in the beeswarm plot, fiber content emerges as the most dominant feature, followed by fiber length, curing time, and silica fume content. These rankings corroborate the hypothesis that fiber parameters are central to the CS of SFRC. Notably, superplasticizer, fiber diameter, and the w/c ratio show relatively low importance on their own, gaining influence mainly in the presence of the more important features.
The SHAP dependence plot for fiber length in Fig. 12(c), color-coded by fiber type, illustrates pronounced nonlinear relationships. Mid-range lengths contribute positively to the predicted CS, whereas fibers that are either very short or excessively long yield negative SHAP values. This behavior defines a preferred fiber-length window in which crack bridging and the redistribution of stress work most effectively. The plot also demonstrates that the fiber material alters the position and steepness of the response: basalt and steel fibers maintain a narrower band of positive SHAP values throughout the mid-range, while polypropylene fibers scatter over a wider area, occasionally dipping into negative contributions. The data therefore indicate that fiber length alone cannot dictate the outcome; its impact is shaped by the fiber type itself. Practical optimization must account for fiber material and geometry simultaneously to realize target mechanical properties.
Figure 12(d) displays a SHAP dependence plot for fiber content with color mapping by fiber length. The plot reveals a distinct positive correlation, indicating that higher fiber content consistently elevates the CS predictions. This primary trend, however, is magnified with longer fibers, suggesting a synergistic interplay between the amount of fiber and its geometry. The slope of the SHAP values varying with fiber length confirms that the impact of larger fiber volume fractions becomes more pronounced when fibers are elongated. This points to a mechanism in which volume fraction and aspect ratio together bolster the matrix's capacity for energy dissipation and resistance to crack propagation. Such a nonlinear interaction implies that peak mechanical performance is unattainable if fiber content or length is optimized in isolation; fine-tuning the interplay of both parameters is essential.
From an engineering standpoint, the observed interaction effects underscore key refinements for optimizing the material blend. First, they reinforce the argument for simultaneously adjusting multiple variables when designing performance-based SFRC, since univariate adjustments might miss beneficial synergies among variables. Second, the results show that the CS is particularly responsive to both the fiber dosage and the geometry, a response that varies distinctly among different fiber materials. Third, the documented interaction trends can be encoded into either rule-based heuristics or more formal optimization routines, paving the way for custom-tailored, high-performance FRC that precisely meets predefined mechanical performance metrics.
The generalization capability of the developed machine learning models is now evaluated on new, unseen data points. For this purpose, four new test data points, provided in Table 3 with different fiber types (Basalt, Glass, Steel, and Polypropylene) and all other input parameters held constant to isolate the effect of fiber variation, are used to evaluate the trained models. These samples have not been encountered during training and present a significant challenge, testing the models' ability to predict accurately on novel data. This step is essential for verifying whether a model has overfitted the training data or has instead learned the system's fundamental behavior. Models that respond well to this test support valid claims regarding their generalization potential, dependability, and usefulness in engineering practice, where material property variations are common. This methodology builds confidence in the reliability of the models' CS predictions and ensures their utility across diverse SFRC compositions.
Table 4 and the associated graphs in Fig. 13 compare each model's predicted CS with the laboratory values for all four fiber types. SVR performs satisfactorily, staying close to the laboratory results and following the measured trend across all fibers. GPR performs equally well, deviating little from the actual values, which suggests good generalization. XGBR also provides strong predictions, particularly for Basalt and Polypropylene, following the laboratory trends across the board. RFR reproduces the same pattern as the laboratory results but predicts slightly lower values than expected. The KNN model does not provide acceptable performance, and ANN delivers the lowest performance of all: its values are steeply lower than expected, producing a response curve that fails to capture the variation in peak strength.
Among the models assessed, GPR exhibits the highest accuracy and consistency, reliably tracking the laboratory data across all fiber types and the intricacies of the CS. XGBR and RFR are also quite strong and can be considered dependable options. ANN, on the other hand, cannot generalize to these new inputs, and KNN likewise struggles with underprediction. Hence, GPR remains the strongest and most reliable model for predicting the CS of SFRC with different fiber types.
The generalization ability of the models is assessed further with a new set of seven test data points (Table 5) in which fiber content is the only varying parameter. This isolates the effect of fiber content (varied from 0.5 to 2.0% by volume) on CS and evaluates how well the trained models generalize this particular behavior of SFRC. The laboratory tests (Table 6) show a clear increasing trend in actual CS with fiber content, from 26.93 MPa at 0.5% to 43.65 MPa at 2.0%, clearly demonstrating the reinforcing effect of fiber content on the mechanical performance of concrete.
The CS predictions from the six machine learning models are provided in Table 6, and Fig. 14 compares the predicted values against the laboratory measurements. Each subgraph shows how well the models capture the increasing trend in CS with rising fiber content. Once again, GPR is the most accurate model, aligning best with the experimental data over the entire span. XGBR, SVR, and RFR also perform well, following the increase, though they tend to underpredict across most of the range. KNN follows the correct trend but underpredicts throughout. The ANN approach is by far the worst performer.
The results depicted in Fig. 14 highlight the importance of strong generalization when machine learning models face physically plausible novel situations. A model's ability to predict CS accurately as fiber content varies, a crucial and frequently adjusted parameter in SFRC design, determines its real-world applicability. GPR's close agreement with the laboratory values confirms its effectiveness in predicting the nonlinear, complex phenomena associated with these materials, even outside the training distribution. SVR, XGBR, and RFR also exhibit strong adaptability and generalization despite some systematic underprediction, while ANN's persistent underperformance indicates that it has learned only a shallow representation of the underlying behavior.
Attention is now directed toward the impact of the w/c ratio on the CS of SFRC, using seven additional test data points given in Table 7. For a controlled evaluation, all other factors are held constant while the w/c ratio is varied from 0.30 to 0.60, defining a baseline for analyzing how well the different machine learning models generalize and predict CS within this range. The laboratory results in Table 8 corroborate expectations for concrete behavior: CS rises with the w/c ratio up to an inflection point (approximately 0.50) and then tapers off, a well-documented phenomenon reflecting the compromise between workability and strength in cement-based materials.
Table 8 also lists the CS values estimated by the six trained models across the w/c ratios, each compared visually against the experimental results in Fig. 14. Again, GPR performs impressively, closely matching the laboratory results throughout. It tracks the strength increase as w/c rises from 0.30 to 0.50 and also captures the slight drop after 0.50, consistent with the true behavior of concrete. SVR also does quite well, although it tends to overpredict at the higher w/c ratios (0.55–0.60), possibly due to its sensitivity to small changes. XGBR follows the trend fairly well but loses accuracy at the end of the range (w/c = 0.60), demonstrating limited ability to generalize in regions without linear predictability. RFR performs reasonably overall but miscalculates mid-range values, smoothing out peaks in the response curve. KNN follows the pattern fairly well but tends to underpredict both the peak and post-peak values, likely because of its reliance on local smoothing. The weakest performer is ANN, with significant differences between predicted and actual results at all w/c ratios, suggesting the model did not capture the genuine nonlinear relationship between strength and water content.
These findings continue to emphasize the importance of effective data-driven predictive models, particularly for sensitive mix-design variables such as the w/c ratio. For structural applications requiring durable, strong, and cost-efficient materials, it is critical that the nearly parabolic relationship between the w/c ratio and CS is predicted accurately. Among the models, GPR once again proves the most accurate and reliable, capturing both the ascending and descending trends very well. SVR and XGBR are solid performers as well, though inaccurate in some regions. In contrast, ANN's performance is too poor to warrant consideration, underscoring its inability to capture this relationship. These results further reinforce GPR's capability to model intricate physical phenomena and enhance its credibility for predictive simulation in SFRC design.
Next, 11 additional data points given in Table 9 are used to assess the effect of aggregate size on the CS of SFRC while the other parameters are held constant. According to Table 10, the experimentally determined CS values range between 33.05 MPa and 35.68 MPa, indicating a nonlinear relationship with maximum strength occurring at an aggregate size of 17 mm.
The CS values predicted by the six machine learning models are displayed in Table 10 and compared directly against the laboratory results to evaluate how faithfully the models generalize. The graphs in Figs. 14 and 15 compare the laboratory results with the model predictions. KNN follows the trends fairly well but underpredicts the CS for all but the smallest aggregate sizes and misses the 17 mm peak because of local data-density effects. SVR shows the best results here: although it also fails to capture the 17 mm peak, it closely tracks the nonlinear trend of the test data, demonstrating good scope for capturing complex dependencies. GPR continues to deliver strong results, deviating only slightly from the experimental values; as a probabilistic model, it can represent uncertainty and complex nonlinear trends effectively, and it captures the peak behavior. In contrast, XGBR and RFR appear to regress toward the mean, predicting nearly constant values around 35 MPa for all aggregate sizes. This lack of variability indicates that these algorithms captured only the dominant trends rather than the more delicate relationship between aggregate size and CS. ANN had the most difficulty of all the models in replicating the laboratory trends: it fails to capture the experimental peak and the overall variation, producing a flat or declining pattern after 14 mm, which might stem from insufficient diversity in the training data or improper tuning.
The nonlinear dependence of CS on aggregate size indicates that aggregate size strongly affects the internal packing density, interfacial transition zones, and load-transfer efficiency of the concrete matrix. The increase in strength with aggregate size up to 17 mm may be attributed to more optimal packing and stress distribution, while the subsequent slight drop may be due to increased voids or weaker bonding interfaces. The GPR model outperforms the others by capturing this peak behavior and the nonlinearity of the trend, highlighting its suitability for modeling complex material behavior. As this example demonstrates, GPR is consistently able to learn nonlinear relationships and even the subtle interactions that arise from a change in a single parameter.
As Fig. 16 summarizes, GPR performed best in all scenarios, showing the strongest capacity for capturing the complex nonlinear interactions and subtle parameter interdependencies in the behavior of SFRC. SVR also performed quite well, particularly for smoother trends with less pronounced peaks. XGBR and RFR tended to oversimplify the problem, while ANN was inaccurate for the more complex, nonlinear cases. These observations underline the care required in selecting appropriate machine learning models for material science case studies, especially for composite materials such as SFRC that are subject to multi-factor influences.
It should be noted that the maximum CS of SFRC found in this research (as a function of fiber type, fiber content, w/c ratio, and aggregate size) was achieved under the specific laboratory conditions in which the other parameters were held constant. SFRC is a composite material whose constituents interact mechanically, creating complex behavior: no parameter acts alone, and each is positively or negatively influenced by the others. For example, the optimal fiber content obtained in this research assumes glass fibers with specific physical characteristics (length, diameter, type). Changing the fiber from glass to steel or polypropylene alters the interfacial bonding, crack bridging, and dispersion behavior within the matrix, shifting the volumetric ratio needed to attain optimal performance. Likewise, the best w/c ratio depends on the characteristics of the aggregate and fibers and on the amount of admixture, all of which influence workability, hydration, and microstructure evolution. Aggregate size affects packing density and particle distribution, interacting with the other factors, and also partially governs stress distribution and the interfacial transition zone.
Consequently, the best parameter values obtained in this work must be viewed as local optima for the given circumstances, likely to change with different baseline conditions. Moreover, machine learning predictions are highly dependent on the associated training dataset: the models derive patterns and relationships only from the data they have been trained on, so any change in the range, distribution, or representativeness of the data may produce markedly different predictions. For example, if a database does not sufficiently cover the interactions between fiber content and w/c ratio, or omits extreme or boundary cases, the models are likely to mispredict, yielding suboptimal values. This occurs most frequently for models such as KNN, which rely strongly on local data density, or ANN, which requires extensive training and tuning to deliver robust results. Models like GPR, in contrast, capture nonlinear trends and peak behaviors better because their probabilistic reasoning allows them to model features such as the CS peak at the 17 mm aggregate size more accurately. Even such models, however, remain ultimately bounded by the data they are trained on.
Furthermore, it is important to note that the SFRC design parameters obtained from the machine learning models are not final; they are valid only within the scope and quality of the underlying data. Further enrichment of the training dataset and broadening of the parameter ranges are therefore required to build effective, broadly applicable, and scientifically credible models for concrete technology.
The trained models can be incorporated into intuitive design interfaces or as enhancements to existing software, enabling engineers to quickly estimate the CS of SFRC during the critical early phases of mix design or ongoing quality control. This embedding would streamline scenario evaluations, lessen dependence on lengthy lab tests, and promote a more performance-driven approach to concrete design. Moreover, the predictions can be merged with uncertainty quantification techniques (such as GPR) to generate confidence intervals that further refine decision-making for structural applications.
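A sketch of the uncertainty quantification mentioned above: a GPR fitted to a synthetic, near-parabolic w/c-ratio sweep (illustrative data, not the study's) yields a per-point standard deviation via `return_std=True` that converts directly into a 95% confidence interval for design decisions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(3)
X = rng.uniform(0.30, 0.60, size=(40, 1))                    # w/c-ratio sweep
# Synthetic parabolic strength curve peaking near w/c = 0.50, plus noise.
y = 50 - 60 * (X[:, 0] - 0.50) ** 2 + rng.normal(scale=0.3, size=40)

gpr = GaussianProcessRegressor(
    kernel=1.0 * RBF(length_scale=0.1) + WhiteKernel(noise_level=0.1),
    normalize_y=True, random_state=0)
gpr.fit(X, y)

X_new = np.array([[0.45], [0.55]])
mean, std = gpr.predict(X_new, return_std=True)
# 95% interval per prediction, usable directly in structural decision-making.
lower, upper = mean - 1.96 * std, mean + 1.96 * std
```

In an engineering interface, reporting `[lower, upper]` alongside the point estimate lets the designer apply a safety margin informed by the model's own confidence rather than a fixed blanket factor.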