Dataset description

This study focuses on three regulators situated within the Delta Irrigation District, Delta Barrages, Egypt: the Al-Tawfiki, Al-Menoufi, and Abasi regulators. These structures are designed for automatic operation under submerged flow conditions. The Al-Tawfiki regulator, positioned 965 km downstream of the Aswan Old Dam (AOD) and upstream of the Damietta Barrage, has a capacity of up to 20 million cubic meters per day (MCM/d) and serves 1.60 million acres of farmland. It comprises six radial gates, each 5 m wide, separated by five piers, each 2 m wide. The Al-Menoufi regulator, also located 965 km downstream of the AOD and upstream of the Damietta Barrage, has a capacity of up to 25 MCM/d and serves 750,000 acres of farmland. It features nine radial gates, each 5 m wide, separated by eight piers, each 2 m wide. Lastly, the Abasi regulator, positioned 1054.7 km downstream of the AOD and upstream of the Zefta Barrage, has a capacity of up to 30 MCM/d and serves 800,000 acres of farmland. Comprising eight 5-m-wide radial gates separated by seven 2-m-wide piers, the Abasi regulator plays a vital role in water management within the region. Table 1 presents the hydraulic characteristics of these three regulators.

Table 1 Hydraulic characteristics of the employed three radial gates under submerged flow conditions46.

Input parameters

Figure 1 shows a sketch of a radial gate under submerged flow conditions together with the variables that affect the bed configurations downstream of the gate, which can be expressed in the following functional form:

$$f\left(Q,\ y_{1},\ y_{3},\ w,\ r,\ a,\ \theta,\ b,\ B,\ Q_{th},\ g,\ \rho,\ \mu,\ \delta\right) = 0$$

(1)

where the variables are the discharge flowing under the gate (Q), upstream depth (y1), downstream depth (y3), canal width (B), total width of open gates (b), gate opening (w), theoretical discharge (Qth), gate leaf angle (θ), gate radius (r), pinion height (a), water density (ρ), dynamic viscosity (µ), gravitational acceleration (g), and contraction coefficient (δ). The following dimensionless groups are derived to represent the phenomenon based on dimensional analysis using Buckingham’s π theorem:

$$f\left(\frac{y_{1}}{w},\ \frac{y_{3}}{w},\ \frac{r}{w},\ \frac{Q}{Q_{th}} = C_{d},\ \frac{a}{w},\ \frac{b}{w},\ \frac{B}{w},\ \frac{Q\rho}{B\mu} = R_{e},\ \theta,\ \frac{y_{3}}{y_{1}},\ \delta\right) = 0$$

(2)

where Cd is the discharge coefficient and Re is the Reynolds number. In this study, b, B, g, ρ, and µ are maintained constant, so their exclusion is justified by site uniformity and data availability, while a possible bias is acknowledged as a limitation; moreover, in open-channel flow the effect of the viscous force is negligible relative to the inertia force. The equation therefore omits these terms and retains only the parameters affecting the discharge coefficient. According to27, Cd is defined as:

$$C_{d} = f\left(\frac{y_{1}}{w},\ \frac{y_{3}}{w},\ \frac{r}{w},\ \frac{a}{w},\ \theta,\ \frac{y_{3}}{y_{1}},\ \delta\right)$$

(3)
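To make Eq. (3) concrete, the dimensionless groups can be assembled from raw gate and flow measurements as in the following Python sketch; the numeric values are hypothetical, not drawn from the study's dataset:

```python
def dimensionless_features(y1, y3, r, a, w, theta_deg, delta):
    """Build the dimensionless input groups of Eq. (3) from raw
    measurements: depths y1, y3, radius r, pinion height a, gate
    opening w, leaf angle theta, and contraction coefficient delta."""
    return {
        "y1/w": y1 / w,
        "y3/w": y3 / w,
        "r/w": r / w,
        "a/w": a / w,
        "theta": theta_deg,
        "y3/y1": y3 / y1,
        "delta": delta,
    }

# illustrative (hypothetical) gate measurements
feats = dimensionless_features(y1=3.2, y3=2.4, r=5.0, a=4.1, w=0.8,
                               theta_deg=35.0, delta=0.62)
```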

Fig. 1

Sketch of radial gates under submerged flow conditions.

Table 2 presents the descriptive statistical analysis of the input parameters and Cd. In addition, Fig. 2 illustrates a correlogram depicting the linear relationships between inputs and output using the Pearson correlation coefficient. This analysis helps pinpoint the key input parameters for estimating Cd.

Table 2 Descriptive statistics for input parameters46.
Fig. 2

The correlation matrix between all dimensionless input parameters46.

The results revealed that y3/y1 has the highest magnitude of Pearson correlation coefficient (Pc = 0.7209). Therefore, this parameter emerges as the most effective feature for modelling Cd in submerged radial gates. Moreover, Fig. 3 displays histograms of the dimensionless input parameters. These histograms offer concise summaries of the frequency distributions, providing insight into the factors controlling flow discharge in radial gates. Examining them, we observe that y3/y1, θ, and δ have relatively even frequency distributions (skewness = 0.2465, −0.2228, and 0.2941). Conversely, the frequency distribution of r/w displays positive skewness (skewness = 2.0183), with the majority of samples falling within the range of 5 to 15. Similarly, y1/w and a/w exhibit asymmetric frequency distributions (skewness = 1.9685 and 1.9291).
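The skewness values quoted above can be reproduced for any sample with the moment-based (Fisher-Pearson) coefficient; a minimal Python helper, shown here as a sketch rather than the statistical routine actually used:

```python
def skewness(xs):
    """Fisher-Pearson moment coefficient of skewness (population form):
    third central moment divided by the 1.5 power of the variance."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5
```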

Fig. 3

Histogram of dimensionless input variables for submerged flow.

Ensemble learning

Ensemble learning stands out as one of the most effective approaches for enhancing system performance. It involves integrating diverse individual models that work in concert to improve the stability and predictive power of the overall model47. We implemented a stacking technique within our system, referred to as the ensemble. This study trained multiple algorithms on the same dataset, which, coupled with stacking, yielded noteworthy system performance. The proposed approach elevates prediction accuracy beyond the capabilities of the individual algorithms by exploiting the diverse problem-solving abilities of multiple regression models. Specifically, our ensemble incorporated four base models: GPR, SVM, LSBoost, and ANN. The aim was to extract meta-features from each model; these meta-features were then fed into a meta-model, culminating in the final prediction step. Figure 4 displays the proposed ensemble model in this study. The following sections deliver a brief illustration of its components.

Fig. 4

The developed ensemble model architecture.

Gaussian process regression

GPR is a potent statistical tool in the field of data-driven modelling. GPR methods offer a non-parametric approach that proves invaluable for tackling a wide range of regression and modelling challenges. Rooted in the Bayesian framework, GPR can be conceptualized as a random process, enabling regression through Gaussian processes32. For predictive tasks, GPR stands out as a preferred choice due to its inherent flexibility in representing uncertainty. Considering the function space f(x) = ϕ(x)ᵀw and constructing a variable set that conforms to the Gaussian distribution as [f(x1), f(x2), …, f(xn)], the GPR model takes the formulation f(x) ≈ GP(m(x), k(x, x′)), where k(x, x′) is the covariance function between inputs from the training set (x) and the testing set (x′), calculated as k(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))], and m(x) is the mean function, m(x) = E[f(x)]. The GPR model output is obtained as y = f(x) + ϵ, ϵ ≈ N(0, σn²). In this context, x represents the input vector, y the output vector, f the function values within GPR, and ϵ the noise component. GPR models seek to capture patterns within the data by relying on a pivotal component, the covariance function41. Various covariance functions can be harnessed as fundamental elements within the GPR framework, including matern32, exponential, ardmatern32, rational quadratic, and ardexponential. To effectively train a GPR model, selecting the appropriate covariance function (k) is a critical step, as it plays a pivotal role in shaping the actual behaviour of the GPR model; notably, the geometric structure of the training samples becomes embedded within this function12. Ardexponential is utilized in this work.
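As an illustrative sketch of the GPR posterior mean, the following uses a 1-D isotropic exponential kernel rather than the study's ardexponential (which adds a separate length scale per input dimension); data and hyperparameters are toy values:

```python
import math

def exp_kernel(x, xp, sigma_f=1.0, length=1.0):
    # exponential covariance: k(x, x') = sigma_f^2 * exp(-|x - x'| / length)
    return sigma_f ** 2 * math.exp(-abs(x - xp) / length)

def solve(A, b):
    """Naive Gaussian elimination with partial pivoting (stands in for
    the Cholesky solve of a real GPR implementation)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def gpr_predict(xs, ys, x_star, noise=1e-6):
    # posterior mean with zero prior mean: m(x*) = k*^T (K + noise*I)^-1 y
    K = [[exp_kernel(a, b2) + (noise if i == j else 0.0)
          for j, b2 in enumerate(xs)] for i, a in enumerate(xs)]
    alpha = solve(K, ys)
    return sum(exp_kernel(x_star, xi) * ai for xi, ai in zip(xs, alpha))
```

With near-zero noise the posterior mean interpolates the training points, which makes the sketch easy to sanity-check.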

Support vector machines

SVM is renowned as one of the most formidable kernel-based models in the prediction domain. This attribute endows SVM with superior generalization and insensitivity to over-training. Through the ingenious use of kernel functions, SVM transforms intricate nonlinear input factors into a high-dimensional space, effectively converting complex nonlinear problems into tractable linear ones. The selection of the kernel function therefore significantly influences the performance of the SVM model. The core of SVM is the quest for an optimal hyperplane serving as the decision boundary that maximizes the separation margin between cases. Notably, the kernel function elegantly addresses nonlinear predictive difficulties within the high-dimensional space, standing as one of SVM's principal strengths. Within the scope of the present study, we delved into the predictive performance of SVM employing a Radial Basis Function (RBF) kernel, and the SVM hyper-parameters were optimized by methodically conducting a grid search28.
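The grid search described here can be sketched generically; the `fake_rmse` scorer below is a hypothetical stand-in for training and validating an RBF-kernel SVM at each configuration (the parameter names mirror those reported later in the paper):

```python
import itertools

def grid_search(param_grid, score_fn):
    """Exhaustive grid search returning the configuration with the
    lowest validation score (e.g., RMSE)."""
    names = list(param_grid)
    best_cfg, best_score = None, float("inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        cfg = dict(zip(names, values))
        s = score_fn(cfg)
        if s < best_score:
            best_cfg, best_score = cfg, s
    return best_cfg, best_score

# stand-in validation RMSE; a real run would train an SVM per configuration
def fake_rmse(cfg):
    return abs(cfg["BoxConstraint"] - 100) / 100 + abs(cfg["KernelScale"] - 2.2)

grid = {"BoxConstraint": [1, 10, 50, 100, 200, 500],
        "KernelScale": [0.1, 0.5, 1, 2.2, 5, 10]}
best, _ = grid_search(grid, fake_rmse)
```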

Least-squares boosting

LSBoost is another ensemble method that involves a large set of weak learners in addition to a meta-learner that assigns a weight to each learner. The algorithm follows a sequential process of training individual weak learners; predictions from these learners are combined to improve predictive performance. In particular, the loss criterion in LSBoost is least squares. At each iteration, a new learner is fitted to the training data to predict the difference between the observed target value and the cumulative predictions of all previously grown learners, attempting to correct the errors they induce. Learners are added until the assigned maximum number is reached or the optimal training error is attained. This gradient-boosting methodology helps build a durable regression model45. In this study, a grid-search strategy was applied to the LSBoost method to optimize several parameters, such as the learning rate, the number of learning cycles, the minimum parent size, the maximum number of splits, and the minimum leaf size, to achieve optimal results. A decision tree was exploited as the weak learner in LSBoost.
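A minimal least-squares boosting loop illustrates the residual-fitting sequence; regression stumps stand in for the decision trees used in the study, and the data are toy values:

```python
def fit_stump(xs, ys):
    """Best single-split regression stump (the weak learner) under a
    least-squares criterion."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((y - lm) ** 2 for y in left)
               + sum((y - rm) ** 2 for y in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x, t=t, lm=lm, rm=rm: lm if x <= t else rm

def lsboost(xs, ys, n_rounds=50, lr=0.3):
    """Each new stump fits the residual between the targets and the
    running ensemble prediction, shrunk by the learning rate."""
    base = sum(ys) / len(ys)
    pred = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        resid = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, resid)
        stumps.append(stump)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return lambda x: base + lr * sum(s(x) for s in stumps)

model = lsboost([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 3.0, 3.0])
```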

Artificial neural network

The foundation of the ANN model is rooted in emulation of human brain function. Much like biological networks, it possesses the capacity to learn and extrapolate from that learning. A pivotal element of biological networks is the neuron, the fundamental building block of the human nervous system. In neural networks, each entity comprises a multitude of nodes connected by directional links, forming a structured arrangement of layers, namely input, hidden, and output. Among the various manifestations of neural networks, the back-propagation network is the most prevalent, and the multi-layer perceptron (MLP) is a widely adopted variety within ANN research. The interplay between error signals and input signals fine-tunes the parameters of the MLP. Choosing the number of hidden layers and the neurons they contain is among the most crucial considerations in ANN modelling. For optimal model performance, the tangent sigmoid function was employed for the hidden layer and the linear function was adopted for the output layer, with training facilitated by the Levenberg-Marquardt algorithm43.
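The forward pass of such an MLP (one tanh hidden layer, linear output) reduces to a few lines; the weights below are illustrative placeholders, not trained values, and Levenberg-Marquardt training is omitted:

```python
import math

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass: tanh (tansig) hidden layer followed by a linear
    output layer, matching the layout described above."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]
```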

Long short-term memory

LSTM architecture comprises input, hidden, and output layers. The distinctive feature of LSTM hidden layers is the inclusion of memory cells, each equipped with three gates: input, forget, and output48. These gates have access to both the current input and the previous output. The forget gate assumes the paramount role of determining how much of the memory cell contents should be retained or discarded. The forget gate operation is controlled by a neural network with a designated activation function, fτ = σ(W[xτ, hτ−1, Cτ−1] + bf). The input gate, denoted iτ, plays a pivotal role in determining what information gets stored in the memory cell. Conversely, the output gate, denoted oτ, decides when the stored information is transmitted to the output. The following equations compute the operations associated with the input and output gates.

$$i_{\tau} = \sigma\left(W\left[x_{\tau},\ h_{\tau-1},\ C_{\tau-1}\right] + b_{i}\right)$$

(4)

$$C_{\tau} = f_{\tau} * C_{\tau-1} + i_{\tau} * \tanh\left(W\left[x_{\tau},\ h_{\tau-1},\ C_{\tau-1}\right] + b_{c}\right)$$

(5)

$$o_{\tau} = \sigma\left(W\left[x_{\tau},\ h_{\tau-1},\ C_{\tau-1}\right] + b_{o}\right)$$

(6)

$$h_{\tau} = \tanh\left(C_{\tau}\right) * o_{\tau}$$

(7)

where b, Cτ−1, σ, and W denote the bias vector, previous LSTM memory, sigmoid function, and weights for each input, respectively. As a result, memory cells operate by modulating information through the three controllers: the input, forget, and output gates. The forget gate activation function is applied to the previous memory to gauge its relevance in the current memory cell; if the activation output is zero, the memory cell discards the prior information. Consequently, LSTM cells can retain information over arbitrary durations, selectively controlling the influence of previous time steps49.
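Equations (4)-(7) can be sketched for a single scalar memory cell as follows; the weight triples and biases are illustrative placeholders, and each gate sees the current input, previous output, and previous memory as described above:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM cell step following Eqs. (4)-(7); W[g] holds the weight
    triple for [x, h_prev, c_prev] and b[g] the bias of gate g."""
    z = {g: W[g][0] * x + W[g][1] * h_prev + W[g][2] * c_prev + b[g]
         for g in ("f", "i", "c", "o")}
    f = sigmoid(z["f"])                       # forget gate
    i = sigmoid(z["i"])                       # input gate, Eq. (4)
    c = f * c_prev + i * math.tanh(z["c"])    # memory update, Eq. (5)
    o = sigmoid(z["o"])                       # output gate, Eq. (6)
    h = math.tanh(c) * o                      # hidden output, Eq. (7)
    return h, c

# a saturated-off forget and input gate should clear the memory cell
W0 = {g: (0.0, 0.0, 0.0) for g in ("f", "i", "c", "o")}
h, c = lstm_step(1.0, 0.0, 5.0, W0, {"f": -100.0, "i": -100.0, "c": 0.0, "o": 0.0})
```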

The proposed LSTM architecture consists of several key components: a sequence input layer for inputting sequences of a specific size, two LSTM layers with 100 and 20 hidden units, flatten layer, attention mechanism and two dense layers with 256 and 1 units using ReLU activation function. The architecture includes a regression layer responsible for producing the predictions.

Table 3 offers detailed properties for each layer. Following the final LSTM layer, the model employs a spatial attention mechanism to enhance its spatial awareness. This is accomplished through a multi-head self-attention mechanism, which helps the model direct its focus towards pertinent spatial regions. Multi-head self-attention partitions the input data into numerous heads, each of which represents a unique subspace within the feature space. Within each head, the input is transformed into three distinct spaces: Key (K), Query (Q), and Value (V). Attention scores (A) are then computed using a scaled dot-product attention mechanism, A = SoftMax( \(\:\frac{Q*\:{K}^{T}}{\sqrt{{d}_{k}}}\) ), where \(\:{d}_{k}\) is the dimensionality of the key vectors. Next, the scaled attention scores are used to derive the attention output, O = A × V. The outputs from all attention heads are concatenated and subjected to a linear transformation to generate the final attention result. Subsequently, the resulting feature maps are flattened and passed through two fully connected layers with ReLU activation.
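The scaled dot-product attention step above can be sketched for one head as follows; lists of vectors stand in for the tensor operations (and the head-splitting and output projection) of a real implementation:

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_attention(Q, K, V):
    """A = softmax(Q K^T / sqrt(d_k)); O = A * V for a single head."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```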

Table 3 Details of the proposed meta model.

Stacking ensemble model for discharge coefficient prediction

A two-stage stacking ensemble framework was developed to accurately predict the discharge coefficient \(\:\left({C}_{d}\right)\:\)based on a set of dimensionless hydraulic and geometric input features, including \(\:{y}_{3}/{y}_{1}\), \(\:{y}_{1}/w\), \(\:{y}_{3}/w\), \(\:r/w\), \(\:a/w\), \(\:\theta\:\), and \(\:\delta\:\).

In the first stage, multiple heterogeneous base learners—namely Gaussian Process Regression (GPR), Support Vector Machine (SVM), Least Squares Boosting (LSBoost), and Artificial Neural Network (ANN)—were independently trained using the same input feature set. Each base model generated an individual prediction of the discharge coefficient, which collectively formed a meta-feature vector. This vector encapsulates the diverse learning behaviors and predictive strengths of the underlying models.

In the second stage, the meta-feature vector was treated as a short sequential input and passed to a meta-learner designed to perform adaptive model fusion. The meta-learner integrates a multi-head attention mechanism with Long Short-Term Memory (LSTM) layers, followed by fully connected layers. The attention mechanism dynamically assigns weights to the base model predictions using scaled dot-product attention, thereby emphasizing more informative predictions under varying hydraulic conditions. The resulting attention-weighted representation was then processed by the LSTM to capture inter-model dependencies before being mapped to the final discharge coefficient estimate through fully connected layers.
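The two-stage flow can be summarized in a few lines; the base models and the meta-learner below are toy stand-in functions, not the trained GPR/SVM/LSBoost/ANN models or the attention-LSTM meta-learner:

```python
def stack_predict(base_models, meta_model, x):
    """Two-stage stacking: base predictions form the meta-feature
    vector, which the meta-learner fuses into the final estimate."""
    meta_features = [m(x) for m in base_models]
    return meta_model(meta_features)

# hypothetical stand-ins for the four base learners
bases = [lambda x: 0.60, lambda x: 0.62, lambda x: 0.58, lambda x: 0.64]
# simple averaging stands in for the attention-LSTM fusion
meta = lambda feats: sum(feats) / len(feats)

cd_hat = stack_predict(bases, meta, x={"y3/y1": 0.75})
```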

This hierarchical learning strategy enables both robust feature extraction and dynamic weighting of base learners, leading to improved predictive accuracy and generalization capability. The final output of the framework is the predicted discharge coefficient \(\:{\widehat{C}}_{d}\).

Model development and evaluation

In this investigation, five-fold cross validation was used to gauge the performance of the implemented methods. Input data was partitioned into 80/20 subsets for the purposes of training and testing. Precisely, 80% of the dataset was designated for training, leaving the remaining 20% for assessing the model performance. To guarantee an impartial analysis and dependable outcomes, we conducted each experiment five times, deriving the prediction average from these iterations. The development of the proposed models was executed within MATLAB (2022b), and all experimental runs were carried out on a computing device equipped with an Intel Core i7 CPU (3.20 GHz), 16 GB of RAM, and a GTX 1660 Ti GPU. The evaluation metrics employed for model performance assessment encompass root mean square error (RMSE), correlation coefficient (R), Nash Sutcliffe efficiency (NS), Willmott’s agreement index (WI) and mean absolute percentage error (MAPE). These indicators facilitate a comprehensive comparison and analysis of simulation results across different models. These measures are calculated using the following mathematical equations:
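A shuffled k-fold index split of the kind used here can be sketched as follows (the seed value is arbitrary, chosen only to make the sketch deterministic):

```python
import random

def kfold_indices(n, k=5, seed=42):
    """Shuffle sample indices and split them into k roughly equal
    folds; each fold serves once as the test set, the rest as training."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(f)), sorted(f)) for f in folds]
```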

$$\text{RMSE}=\:\sqrt{\frac{1}{N}\sum\:_{i=1}^{N}{({p}_{i}-{o}_{i})}^{2}}$$

(8)

$$\text{R} = \frac{\sum\:_{i=1}^{N}\left({p}_{i} - \overline{p}\right)\left({o}_{i} - \overline{o}\right)}{\sqrt{\sum\:_{i=1}^{N}{({p}_{i}-\overline{p})}^{2}\:\sum\:_{i=1}^{N}{({o}_{i}-\overline{o}\:)}^{2}}}\quad -1\:\le\:R\le\:1$$

(9)

$$\text{NS} = 1 - \frac{\sum\:_{i=1}^{N}{({p}_{i}-{o}_{i})}^{2}}{\sum\:_{i=1}^{N}{({o}_{i}-\overline{o}\:)}^{2}}\quad -{\infty\:}\:\le\:\text{NS}\:\le\:1$$

(10)

$$\:\text{M}\text{A}\text{P}\text{E}\:=\:\frac{100}{N}\sum\:_{i=1}^{N}\frac{\left|{o}_{i}-{p}_{i}\right|}{{o}_{i}}$$

(11)

where \(\:{o}_{i}\) represents the observed values, \(\:{p}_{i}\) signifies the predicted values, \(\overline{o}\) and \(\overline{p}\) stand for the averages of the observed and predicted values, respectively, and N denotes the total number of observations.
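Equations (8)-(11) translate directly into code; a Pearson correlation helper is included to match Eq. (9). This is a minimal sketch of the metric computations, not the evaluation harness used in the study:

```python
import math

def rmse(o, p):
    # Eq. (8): root mean square error
    return math.sqrt(sum((pi - oi) ** 2 for pi, oi in zip(p, o)) / len(o))

def pearson_r(o, p):
    # Eq. (9): correlation coefficient
    ob, pb = sum(o) / len(o), sum(p) / len(p)
    num = sum((pi - pb) * (oi - ob) for pi, oi in zip(p, o))
    den = math.sqrt(sum((pi - pb) ** 2 for pi in p)
                    * sum((oi - ob) ** 2 for oi in o))
    return num / den

def nash_sutcliffe(o, p):
    # Eq. (10): Nash-Sutcliffe efficiency
    ob = sum(o) / len(o)
    return 1 - (sum((pi - oi) ** 2 for pi, oi in zip(p, o))
                / sum((oi - ob) ** 2 for oi in o))

def mape(o, p):
    # Eq. (11): mean absolute percentage error
    return 100 / len(o) * sum(abs(oi - pi) / oi for oi, pi in zip(o, p))
```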

This study introduced several robust approaches, including four base machine learning models and a meta deep-learning model, which together form the ensemble model. Moreover, four longstanding regression models were exploited to validate the performance of the proposed model. Table 4 provides a comprehensive list of all hyperparameters for these models; regarding the LSTM model, Table 3 details each layer. To enhance reproducibility, all predictive models were tuned using a deterministic grid-search procedure. For each repetition, the dataset was evaluated using five-fold cross-validation. Inside each fold, the training subset was further split into 90% for training and 10% for validation to monitor overfitting and select hyperparameters. Hyperparameter selection was based on the lowest mean validation RMSE; when multiple configurations achieved statistically similar RMSE (difference ≤ 1% of the best RMSE), the simpler configuration was selected to improve generalization and reduce computational cost. For the SVM model, the grid search explored BoxConstraint (1, 10, 50, 100, 200, 500) and KernelScale (0.1, 0.5, 1, 2.2, 5, 10), with standardization enabled; the final selected values were BoxConstraint = 100 and KernelScale = 2.2. For LSBoost, the tuned parameters included the number of learning cycles (100, 200, 300, 500, 800), learning rate (0.05, 0.10, 0.20, 0.30, 0.40), maximum number of splits (50, 100, 200, 300), and minimum leaf size (1, 5, 10, 20); the final configuration was Learning Cycles = 500, Learning Rate = 0.3, MaxNumSplits = 300, and Leaf Size = 10.

For the ANN model, we tested one hidden layer with neurons (20, 40, 60, 80, 100, 120), learning rate (10−4, 5 × 10−4, 10−3, 5 × 10−3), and epochs (50, 100, 150, 200). The final configuration used 80 neurons, LR = 0.001, and 100 epochs.

For GPR, several kernel functions were evaluated, including (exponential, squared exponential, matern32, matern52, rationalquadratic, ardexponential), and the kernel yielding the best validation RMSE (ardexponential) was retained as the final kernel.

For the LSTM model, we performed a two-stage grid search (coarse, then refined around the best region) to avoid an excessively large search space. The explored parameters were: hidden units in the first LSTM layer (50, 100, 150), hidden units in the second LSTM layer (10, 20, 30), mini-batch size (64, 128, 256), learning rate (10−4, 5 × 10−4, 10−3, 2 × 10−3), attention heads (4, 8), and attention channel size (32, 64, 128). Training employed Adam optimization with a maximum of 1000 epochs. The final architecture used 100 and 20 hidden units, mini-batch size = 256, learning rate = 0.001, and multi-head attention with 8 heads and 128 channels.

On the stated workstation, the average training time per fold was approximately: SVM 0.3–0.7 s, GPR 0.8–1.6 s, LSBoost 2–5 s, ANN 3–8 s, and the LSTM-attention meta-model 8–20 s. Overall, the complete training and evaluation of the stacked ensemble across 5 folds required approximately 6–12 min, which was considered acceptable given the improvement in predictive accuracy and robustness. Importantly, configurations with higher complexity (e.g., > 150 LSTM units or > 8 attention heads) yielded negligible RMSE gains while increasing training time by ~ 25–60%, and therefore were not adopted.

Table 4 Optimal configurations for all the models.


