Runoff Modeling in Ungauged Catchments Using Machine Learning Algorithm-Based Model Parameters Regionalization Methodology

Houfa Wu; Jianyun Zhang; Zhenxin Bao; Guoqing Wang; Wensheng Wang; Yanqing Yang; Jie Wang

Engineering, 2023, 28(9): 93-104. DOI: 10.1016/j.eng.2021.12.014

Research

Article



Abstract

Model parameter estimation is a pivotal issue for runoff modeling in ungauged catchments. The nonlinear relationship between model parameters and catchment descriptors is a major obstacle for parameter regionalization, which is the most widely used approach. Runoff modeling was studied in 38 catchments located in the Yellow-Huai-Hai River Basin (YHHRB). The values of the Nash-Sutcliffe efficiency coefficient (NSE), coefficient of determination (R²), and percent bias (PBIAS) indicated the acceptable performance of the soil and water assessment tool (SWAT) model in the YHHRB. Nine descriptors belonging to the categories of climate, soil, vegetation, and topography were used to express the catchment characteristics related to the hydrological processes. The quantitative relationships between the parameters of the SWAT model and the catchment descriptors were analyzed by six regression-based models: linear regression (LR) equations, support vector regression (SVR), random forest (RF), k-nearest neighbor (kNN), decision tree (DT), and radial basis function (RBF). Each of the 38 catchments was assumed to be an ungauged catchment in turn, and the parameters in each target catchment were estimated by regression models constructed from the remaining 37 donor catchments. Furthermore, a similarity-based regionalization scheme was used for comparison with the regression-based approach. The results indicated that, in ungauged catchments, the SVR-based scheme modeled runoff with the highest accuracy. Compared with the traditional LR-based approach, the machine learning algorithms improved the accuracy of runoff modeling in ungauged catchments because of their outstanding capability to deal with nonlinear relationships. The performances of the different approaches were similar in humid regions, while the advantages of the machine learning techniques were more evident in arid regions. When the study area contained nested catchments, the best result was obtained with the similarity-based parameter regionalization scheme because of the high catchment density and short spatial distance. These findings could improve flood forecasting and water resources planning in regions that lack observed data.


Keywords

Parameters estimation / Ungauged catchments / Regionalization scheme / Machine learning algorithms / Soil and water assessment tool model


1. Introduction

Hydrological models are popular tools for hydrological process modeling, and they have been extensively applied in flood forecasting, water resources management, and the assessment of climate change impact in recent decades [1], [2]. With the improvement of computer technology and the application of multiple interdisciplinary subjects, hydrological models can now describe hydrological processes more accurately. Hydrological models have developed from the original conceptual models (Tank and Sacramento) and lumped models (Xin’anjiang and simplified hydrology model (SIMHYD)) to the currently popular semi-distributed models (TOPMODEL and soil and water assessment tool (SWAT)) and distributed models (variable infiltration capacity (VIC)) [3], [4], [5]. The accuracy of runoff simulation depends directly on the accurate estimation of model parameters. Generally, the model parameters are optimized and calibrated against observed streamflow data at the outlet of a basin. However, numerous catchments are limited by geographical or economic conditions and lack adequate observed data to calibrate model parameters [1]. Therefore, runoff modeling in ungauged catchments has become a focus for researchers [6], [7]. This problem is termed “prediction in ungauged basins” (PUB) in hydrology. To tackle the PUB problem, various regionalization approaches are widely used to simulate runoff in ungauged catchments by transferring the model parameters from similar gauged catchments to ungauged ones [8], [9].

The three widely used parameter regionalization approaches are regression-based, physical similarity-based, and spatial proximity-based. The regression analysis method is the most popular and widely studied approach [10], [11]. Its key steps are to establish regression equations between model parameters and catchment descriptors in gauged catchments, and then to estimate model parameters in ungauged catchments with the constructed regression relationship [12], [13]. However, some studies have reported that the relationships between model parameters and catchment descriptors are often complex, so that estimation in ungauged catchments usually leads to large errors [14].

The physical similarity approach assumes that catchments with the same physical attributes (such as climate, vegetation, and topography) have similar modes of runoff generation and confluence processes [1], [15]. The spatial proximity method selects the donor catchments according to the spatial distance between the neighboring gauged and ungauged catchments, and the parameters of the donor catchment are transferred to the target catchment [16]. The advantage of these two approaches over the regression analysis method is that they do not make linear assumptions, and they have been widely used in recent years [17], [18]. However, the spatial proximity approach is not suitable when adjacent basins differ greatly from one another [19], and the physical similarity approach is limited by how reasonably the catchment characteristics are selected [20]. Some researchers have compared and evaluated the three approaches. In most cases, the spatial proximity and the physical similarity approaches are the most effective [21]. In addition, some researchers have combined the physical similarity and spatial proximity approaches to estimate the model parameters of ungauged catchments. The results showed that the integrated similarity-based approach performed slightly better than either the spatial proximity-based or the physical similarity-based approach alone [22].

Catchment descriptors and model parameters are interdependent, and their relationship may be nonlinear [23], [24]. Furthermore, a hydrological model is a generalized description of the catchment hydrological process and inevitably exhibits equifinality, which makes it challenging to identify a unique optimal parameter set through calibration. Estimating the model parameters with the traditional multiple regression scheme may therefore result in large errors. With the development of data mining and artificial intelligence technology, machine learning techniques have been successfully applied in flood forecasting, earth science modeling, and remote sensing owing to their good performance in dealing with nonlinear relationships [25]. In the last decade, some machine learning models have received increasing attention in the field of model parameter regionalization, including support vector machine (SVM), random forest (RF), and decision tree (DT). For example, Saadi et al. [23] investigated the potential of RF algorithms in the regionalization of the hourly hydrological model parameters. Hao et al. [26] used an RF model to regionalize the parameters of the mountain flood prediction model. Jafarzadegan et al. [14] estimated the parameters of an environmental model in data-scarce regions with the SVM technique. Ragettli et al. [27] used the splitting rules of classification and regression trees (CART) to regionalize the parameters of 35 catchments in China. The results showed that machine learning algorithms could generally provide accurate predictions. However, most existing research focuses on the comparative analysis between a single machine learning algorithm and the traditional regionalization approach. The relative applicability of different machine learning algorithms in parameter regionalization has not been assessed.

The main objective of this paper is to evaluate different machine learning techniques for parameter regionalization in the Yellow-Huai-Hai River Basin (YHHRB), analyzing their advantages and limitations. The performances of the five classical machine learning-based approaches were compared with linear regression (LR)-based and similarity-based schemes (combining the physical similarity and spatial proximity). The performances of the different parameter regionalization approaches in various climate regions were further compared. The sections of this paper are organized as follows: Section 2 describes the study area and the datasets; Section 3 introduces the methodology used; Sections 4 and 5 describe and discuss the regionalization results, and the conclusions are summarized in Section 6.

2. Study area and datasets

Located in northern China (95°E-123°E, 30°N-43°N), the YHHRB is the general name of the three first-class basins (Yellow River Basin, Huai River Basin, and Hai River Basin) in China. The YHHRB covers 16 provinces with a total area of 1 445 000 km². The Yellow River Basin, Huai River Basin, and Hai River Basin have drainage areas of 795 000 km², 330 000 km², and 320 000 km², respectively. The population and the gross domestic product (GDP) of the YHHRB account for about 35% and 32% of the national total, respectively. The eastern plains in the YHHRB are a substantial agricultural production base in China, and the areas of cultivated land and grain output account for 20.4% and 23.6% of the country’s total, respectively [28]. Thirty-eight typical catchments with different hydrologic and climatic conditions in the YHHRB were selected as the study areas in this study (Fig. 1(a)), including 22 catchments in relatively humid regions (aridity index φ < 1.7), and 16 catchments in relatively arid regions (φ > 1.7). The detailed information for the 38 catchments is presented in Table 1.

The monthly mean streamflow of the 38 catchments was obtained from China’s Hydrological Yearbook, published by the Hydrological Bureau of the Ministry of Water Resources, China. The daily rainfall, temperature, wind speed, relative humidity, and solar radiation data for 1961-2015 were extracted from the gridded daily observation dataset over the China region (CN05.1), published by the National Climate Center of the China Meteorological Administration [29]. The digital elevation model (DEM) of the YHHRB was derived from the Shuttle Radar Topography Mission (SRTM) data provided by the Geospatial Data Cloud Platform, with a resolution of 30 m, and these data were used to generate the river network of the hydrological model. The land use data for 1980, with a spatial resolution of 1 km, were obtained from the Resources and Environment Data Cloud Platform of the Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences (Fig. 1(b)). The soil data were extracted from the Harmonized World Soil Database (HWSD), constructed by the Food and Agriculture Organization (FAO) and the International Institute for Applied Systems Analysis (IIASA), with a spatial resolution of 1 km (Fig. 1(c)).

3. Methodology

The observed streamflow data were used to calibrate the SWAT model parameters of the 38 typical catchments, and the sensitivity and applicability of these parameters were analyzed. Then, each of the 38 catchments was assumed in turn to be an ungauged catchment, that is, the target catchment. The model parameters of the target catchment were estimated with the machine learning techniques, the LR-based method, and the similarity-based approach. Finally, the estimated model parameters were input into the SWAT model to simulate the runoff process in the target catchment. Based on the results of the parameter regionalization and the runoff simulation in the 38 catchments, the performance of the various regionalization approaches was evaluated.
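The leave-one-out procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `leave_one_out` and `donor_mean`, the toy descriptor values, and the placeholder estimator are all hypothetical, and the final step of running SWAT with the estimated parameters is omitted.

```python
def leave_one_out(descriptors, parameters, estimate):
    """Treat each catchment in turn as ungauged and estimate its parameter
    from the remaining donor catchments."""
    predictions = []
    for target in range(len(descriptors)):
        # Every catchment except the target acts as a donor.
        donors_x = [x for i, x in enumerate(descriptors) if i != target]
        donors_y = [y for i, y in enumerate(parameters) if i != target]
        predictions.append(estimate(donors_x, donors_y, descriptors[target]))
    return predictions


def donor_mean(donors_x, donors_y, target_x):
    # Placeholder estimator (assumption): ignores the descriptors and
    # returns the mean of the donors' calibrated parameter values.
    return sum(donors_y) / len(donors_y)


# Hypothetical data: 4 catchments, 2 descriptors each, 1 calibrated parameter.
X = [[0.2, 1.0], [0.4, 0.8], [0.6, 0.5], [0.9, 0.1]]
y = [60.0, 66.0, 72.0, 78.0]
print(leave_one_out(X, y, donor_mean))
```

Any regression model can be substituted for `donor_mean`, as long as it is fitted only on the donors passed to it, so that the target catchment never informs its own estimate.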

3.1. SWAT model

The SWAT model is a physically based, semi-distributed, and continuous hydrological model developed by the Agricultural Research Service of the US Department of Agriculture (USDA) [30]. The model can simultaneously consider meteorological conditions, soil types, land use patterns, and various water conservancy engineering conditions, and it has been widely used to simulate hydrological processes at the watershed scale [31]. The detailed steps of model construction can be found in the literature [32]. After the model is built and run, its parameters need to be calibrated to achieve the optimal simulation effect. The sequential uncertainty fitting version 2 (SUFI2) algorithm in the SWAT calibration and uncertainty programs (SWAT-CUP) software is used to calibrate and validate the model parameters, and the simulation results are evaluated with three indexes: Nash-Sutcliffe efficiency coefficient (NSE), coefficient of determination (R²), and percent bias (PBIAS), which can be expressed as follows:

(1)$NSE = 1 - \frac{\sum_{i=1}^{n} {(Q_{s} - Q_{o})}^{2}}{\sum_{i=1}^{n} {(Q_{o} - \bar{Q}_{o})}^{2}}$

(2)$R^{2} = \frac{{\left[ \sum_{i=1}^{n} (Q_{o} - \bar{Q}_{o})(Q_{s} - \bar{Q}_{s}) \right]}^{2}}{\sum_{i=1}^{n} {(Q_{o} - \bar{Q}_{o})}^{2} \sum_{i=1}^{n} {(Q_{s} - \bar{Q}_{s})}^{2}}$

(3)$PBIAS = \frac{\sum_{i=1}^{n} (Q_{o} - Q_{s})}{\sum_{i=1}^{n} Q_{o}}$

where $Q_{o}$ and $Q_{s}$ are the observed and simulated streamflow (m³·s⁻¹), $\bar{Q}_{o}$ and $\bar{Q}_{s}$ are the mean observed and simulated streamflow (m³·s⁻¹), and n is the number of measured data points. Existing studies have demonstrated that the model simulation results are credible when NSE > 0.5, R² > 0.5, and −25% < PBIAS < 25%, and that simulation results with NSE above 0.75 are considered very good [33].
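As a minimal sketch of Eqs. (1)-(3), the three indexes can be computed with plain Python. The function names and the sample streamflow values are hypothetical. Because Eq. (3) is written as a ratio, `pbias` returns a fraction, and scaling it by 100 before comparing it with the ±25% threshold is an assumption made here.

```python
def nse(obs, sim):
    """Nash-Sutcliffe efficiency coefficient, Eq. (1)."""
    mean_obs = sum(obs) / len(obs)
    num = sum((s - o) ** 2 for o, s in zip(obs, sim))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den


def r_squared(obs, sim):
    """Coefficient of determination, Eq. (2)."""
    mean_obs = sum(obs) / len(obs)
    mean_sim = sum(sim) / len(sim)
    cov = sum((o - mean_obs) * (s - mean_sim) for o, s in zip(obs, sim))
    var_obs = sum((o - mean_obs) ** 2 for o in obs)
    var_sim = sum((s - mean_sim) ** 2 for s in sim)
    return cov ** 2 / (var_obs * var_sim)


def pbias(obs, sim):
    """Percent bias, Eq. (3), returned as a fraction."""
    return sum(o - s for o, s in zip(obs, sim)) / sum(obs)


def is_credible(obs, sim):
    # Thresholds quoted in the text; the factor of 100 is an assumption.
    return (nse(obs, sim) > 0.5
            and r_squared(obs, sim) > 0.5
            and abs(100.0 * pbias(obs, sim)) < 25.0)


# Hypothetical monthly streamflow values (m3/s).
obs = [10.0, 20.0, 30.0, 40.0]
sim = [12.0, 18.0, 33.0, 35.0]
print(round(nse(obs, sim), 3), round(r_squared(obs, sim), 3), round(pbias(obs, sim), 3))
print(is_credible(obs, sim))
```

With this sign convention, a positive PBIAS means the simulation underestimates the observed volume on average.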

3.2. Regression-based methods

Six regression-based models were introduced to estimate model parameters, including the LR equations, support vector regression (SVR), RF, k-nearest neighbor (kNN), DT, and radial basis function (RBF). Based on the constructed models, the model parameters were modeled with nine catchment descriptors, including the catchment area (Area), mean catchment elevation (Ele), mean catchment slope (Slope), soil sand content (Sand), soil clay content (Clay), annual precipitation (P), annual mean temperature (T), normalized difference vegetation index (NDVI), and φ. The regression model can be expressed as follows:

(4)$y = f(x, u)$

where y and x are the SWAT model parameters and the catchment characteristic values, respectively, and $u$ is the vector of the regression model's own parameters.

Since the LR analysis cannot describe the nonlinear relationship between the model parameters and the catchment descriptors, more complex algorithms were also used: SVR, RF, kNN, DT, and RBF. As a supervised learning method, the SVR can describe the nonlinear relationship between variables by using a kernel function to map the inputs into a high-dimensional space [34], [35]. The RF is stable and resistant to overfitting because each regression tree is built from a random subset of the training samples. It also has good robustness compared with other algorithms [36]. The kNN is a non-parametric estimation method that makes predictions from the distances between the feature values of samples, and it does not require assumptions about the input data [37]. The DT does not depend on the distribution of the sample data in model construction and sample prediction, making the estimated results more stable [27]. The RBF is a widely applied type of feedforward neural network, and it can approximate any continuous nonlinear function to arbitrary accuracy [38]. Based on these advantages, the above five classical machine learning algorithms were applied to the parameter regionalization as a supplement to the traditional LR approach.
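To make the distance-based idea behind kNN concrete, the sketch below estimates a parameter for a target catchment as the mean of its k nearest donors in descriptor space. It is an illustration only: the function name `knn_estimate`, the choice of Euclidean distance, the value k = 3, and the toy data are assumptions, and the study's actual kNN configuration is not given in the text.

```python
import math


def knn_estimate(donors_x, donors_y, target_x, k=3):
    """Mean parameter value of the k donors closest to the target."""
    # Rank donors by Euclidean distance in descriptor space (assumption).
    ranked = sorted(zip(donors_x, donors_y),
                    key=lambda pair: math.dist(pair[0], target_x))
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Hypothetical donors: two descriptors each, one calibrated parameter value.
donors_x = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
donors_y = [10.0, 20.0, 30.0, 100.0]
target_x = [0.1, 0.1]

print(knn_estimate(donors_x, donors_y, target_x, k=3))
print(knn_estimate(donors_x, donors_y, target_x, k=1))
```

Because the result depends on distances, descriptors measured on very different scales would normally be standardized first. Using principal component scores as inputs, as in Section 4.2, is one way to obtain comparable axes.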

A Taylor diagram was used to quantify the similarity of the model parameters between two patterns (calibration and estimation). It contains three indicators: standard deviations (STDs), root mean squared error (RMSE), and correlation coefficient (r) [39].
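The three Taylor diagram statistics can be computed as in the following sketch. One assumption is labeled explicitly: Taylor diagrams conventionally use the centered RMS difference (computed after removing each pattern's mean), and that is what is calculated here, whereas the text refers to it simply as RMSE. The function name and the sample values are hypothetical.

```python
import math


def taylor_stats(reference, estimate):
    """Return (std_ref, std_est, r, centered_rmse) for two patterns.

    Both patterns must have non-zero variance, otherwise r is undefined.
    """
    n = len(reference)
    mean_ref = sum(reference) / n
    mean_est = sum(estimate) / n
    dev_ref = [v - mean_ref for v in reference]
    dev_est = [v - mean_est for v in estimate]
    std_ref = math.sqrt(sum(d * d for d in dev_ref) / n)
    std_est = math.sqrt(sum(d * d for d in dev_est) / n)
    r = sum(a * b for a, b in zip(dev_ref, dev_est)) / (n * std_ref * std_est)
    crmse = math.sqrt(sum((b - a) ** 2 for a, b in zip(dev_ref, dev_est)) / n)
    return std_ref, std_est, r, crmse


# Hypothetical calibrated vs. estimated values of one parameter.
calibrated = [1.0, 2.0, 3.0, 4.0]
estimated = [1.5, 2.5, 2.5, 4.5]
std_ref, std_est, r, crmse = taylor_stats(calibrated, estimated)

# The diagram relies on the law-of-cosines identity linking the statistics.
identity = std_ref ** 2 + std_est ** 2 - 2.0 * std_ref * std_est * r
print(abs(crmse ** 2 - identity) < 1e-12)
```

This identity is what allows all three statistics to be read from a single point on the diagram.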

3.3. Similarity-based method

The similarity-based approach integrated consideration of both the physical similarity and the spatial proximity, which were combined according to their respective weights. Two options were considered to combine the information from the donor catchments: parameter weighted averaging (PA) and output weighted averaging (OA). The PA method involved combining the model parameters of the donor catchments according to their corresponding weights, and then substituting the integrated parameters into the SWAT model to simulate the runoff of the target catchment (${{Q}_{1j}}$).

(5)$Q_{1j} = Q\left( j,\ \sum_{i=1}^{k} (w_{i} \times X_{i}) \right)$

where k is the number of donor catchments, and j is the time step.

In the OA method, the model parameters of the donor catchment were substituted into the SWAT model to simulate the runoff, and then the simulation results were combined according to their corresponding weights to estimate the runoff of the target catchment (${{Q}_{2j}}$).

(6)$Q_{2j} = \sum_{i=1}^{k} w_{i} \times Q\left( j,\ X_{i} \right)$

where ${{X}_{i}}$ is the model parameter set of donor catchment i, and ${{w}_{i}}$ is its integrated weight from the spatial proximity and the physical similarity methods. The calculation method of ${{w}_{i}}$ can be found in the literature [40].
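The difference between Eqs. (5) and (6) can be illustrated with a toy example. Everything here is a hypothetical stand-in: `toy_runoff` is a one-parameter nonlinear function used in place of SWAT, each $X_i$ is reduced to a single number, and the weights are assumed to sum to 1.

```python
def toy_runoff(rainfall_j, x):
    # Hypothetical nonlinear stand-in for the SWAT model Q(j, X).
    return rainfall_j * x ** 2


def pa_runoff(rain, donors, weights):
    """Parameter averaging, Eq. (5): average parameters, then simulate once."""
    x_avg = sum(w * x for w, x in zip(weights, donors))
    return [toy_runoff(p, x_avg) for p in rain]


def oa_runoff(rain, donors, weights):
    """Output averaging, Eq. (6): simulate with each donor, then average."""
    return [sum(w * toy_runoff(p, x) for w, x in zip(weights, donors))
            for p in rain]


rain = [10.0, 20.0]      # hypothetical forcing at two time steps
donors = [0.2, 0.6]      # hypothetical donor parameter values
weights = [0.5, 0.5]     # assumed to sum to 1

print(pa_runoff(rain, donors, weights))
print(oa_runoff(rain, donors, weights))
# With a single donor the two schemes give the same series.
print(pa_runoff(rain, [0.6], [1.0]) == oa_runoff(rain, [0.6], [1.0]))
```

The two series differ because the toy model is nonlinear in its parameter. For a model that is linear in its parameters, averaging before or after simulation would give the same result.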

4. Results

4.1. Runoff modeling and parameter sensitivity analysis of typical catchment

Eleven parameters related to runoff in the SWAT model were selected for calibration and sensitivity testing. The physical meaning and the original range of the parameters are summarized in Table 2. These parameters could be divided into four groups: parameters that control water movement between soil aquifers (ALPHA_BF, GW_DELAY, GWQMN, GW_REVAP, and REVAPMN), soil hydraulic characteristics (SOL_AWC, SOL_K, and ESCO), hydraulic channel parameters (CH_K2 and ALPHA_BNK) and the Soil Conservation Service (SCS) curve number (CN2). The t-stat and the p-value were used to represent the sensitivity of the model parameter. The higher the absolute value of the t-stat was and the lower the p-value was, the more sensitive the parameter was. Generally, the most sensitive parameters were CN2, ALPHA_BNK, ESCO, and GWQMN. The result was generally consistent with previous studies [41], [42].

The runoff simulation accuracy of the SWAT model in the 38 catchments, indicated by NSE, R², and PBIAS, was quantitatively assessed on the monthly scale. To reduce the influence of the initial conditions of operation, the first year of the calibration period was used as the warm-up period of the model. The calibration and validation results of the SWAT model are shown in Fig. 2. During the simulation periods, the values of NSE and R² for all catchments were greater than 0.5, and the PBIAS values were less than 25%. The values of NSE and R² were not as high as expected during the calibration and validation periods, mainly because the constructed model was not perfect for some basins. Because the efficiency in the simulation period was acceptable, this part of the error was believed to be acceptable. The efficiency of the model in the validation period was usually inferior to that of the calibration period, because the model parameters were not adjusted to match the observed data during the validation period [5]. The SWAT model performed better in humid regions than in arid regions. For example, the 50th percentile values of NSE (R²) in the humid and arid regions in the calibration period were 0.85 (0.87) and 0.78 (0.81), respectively. The runoff simulation in arid areas was still a challenge for hydrology [43]. The performance of the SWAT model in the simulation period indicated that the constructed model had good applicability in the study area, and its calibrated parameters were reliable for parameter regionalization.

4.2. Model parameters regionalization based on regression-based method

The degree of correlation between the model parameters and the catchment descriptors is illustrated in Fig. 3(a). The result indicated that CN2, SOL_K, ESCO, and GW_REVAP were correlated with multiple descriptors. Taking CN2 as an example, the absolute values of its correlation coefficients with Slope, Clay, P, and φ were all greater than 0.5, indicating that these descriptors were relatively crucial to CN2. The sensitivity of REVAPMN, CH_K2, and ALPHA_BF in the calibration period was low (Table 2), so optimal values of these parameters were difficult to identify, resulting in low correlation between these parameters and the catchment descriptors [40]. Although ALPHA_BNK and GWQMN were sensitive parameters of the model, the correlation coefficients between these parameters and the catchment descriptors were low, mainly because the physical meaning of these parameters had little connection with the descriptors.

The heat map of the correlation coefficients among the nine catchment descriptors is plotted in Fig. 3(b). The result indicated that six descriptor pairs (Area and Ele, T and Ele, P and NDVI, φ and NDVI, T and P, and φ and P) had strong correlations, with absolute values of the correlation coefficients greater than 0.7, thus indicating the poor independence of the variables. The variance inflation factor (VIF) values of Ele, P, and T were greater than 10 (Fig. 4), indicating severe multicollinearity among the catchment descriptors. Hence, principal component analysis (PCA) was used to reduce the dimensions of the nine descriptors and resolve the collinearity problem. Based on the criterion that the cumulative variance contribution rate should exceed 85%, four principal components were selected, and the final variables were calculated according to the principal component coefficients.
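The two screening rules in this paragraph can be sketched as follows. The VIF of a descriptor is 1/(1 − R²), where R² comes from regressing that descriptor on the others, so a VIF above 10 corresponds to R² above 0.9. The component count is the smallest number of principal components whose cumulative variance share exceeds the threshold. The function names and the eigenvalues below are hypothetical illustrations, not values from the study.

```python
def vif(r_squared):
    """Variance inflation factor for a descriptor whose regression on the
    remaining descriptors has coefficient of determination r_squared."""
    return 1.0 / (1.0 - r_squared)


def components_needed(eigenvalues, threshold=0.85):
    """Smallest number of principal components whose cumulative variance
    contribution rate exceeds the threshold."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for count, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += value / total
        if cumulative > threshold:
            return count
    return len(eigenvalues)


# Hypothetical eigenvalues of a 9 x 9 descriptor correlation matrix.
eigenvalues = [3.6, 2.1, 1.3, 0.9, 0.5, 0.3, 0.2, 0.07, 0.03]
print(components_needed(eigenvalues))
print(vif(0.95) > 10.0)
```

In practice the eigenvalues would come from an eigendecomposition of the standardized descriptor matrix, which is omitted here to keep the sketch dependency-free.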

The four principal component variables were used as the inputs of the six regression-based models, and the models were evaluated with the leave-one-out method. The correlation diagrams of the estimated and calibrated values of the highly sensitive parameters (CN2, SOL_AWC, SOL_K, GW_DELAY, ESCO, and GW_REVAP) are shown in Fig. 5. The correlation coefficients between the estimated and calibrated values of CN2 and SOL_K were greater than 0.5, indicating high estimation accuracy. The remaining five calibrated model parameters were difficult to estimate using the catchment descriptors due to their low correlation with the descriptors and low sensitivity. In comparing the estimation results of the six regression-based models, SVR performed better than the other models, and the estimation effect of DT was relatively poor. After the estimated values of the model parameters were obtained, they were input into the SWAT model of the ungauged catchment for runoff simulation.

4.3. Model parameters regionalization based on similarity-based method

The number of donor catchments directly affects the simulation accuracy of the target catchment for the similarity-based approach. Therefore, donor catchment numbers from 1 to 38 were tested, and the relationship between the number of donor catchments and the model evaluation criteria (NSE and PBIAS) was analyzed (Fig. 6). The results indicated that one donor catchment was the most suitable when the PA and the OA methods were adopted. For example, the 50th percentile values of NSE were the highest when one donor catchment was used, and the 50th percentile values of PBIAS were low. One donor catchment meant that the catchment closest to the target catchment was used. In this case, the results of the two methods were consistent. The number of donor catchments obtained was smaller than that used by Bao et al. [40] and Oudin et al. [44]. Compared with these studies, multiple nested catchments were used in this study (Fig. 1(a)), including four catchments within the Heishiguan basin, seven catchments within the Huaibin basin, and one catchment within the Xianyang basin. The hydrometeorological conditions in the nested catchments were similar, leading to the excellent performance of the given regionalization methodology. To investigate the suitable number of donor catchments after the nested catchments were excluded, numbers of donor catchments from 1 to 26 were tested. In this case, one donor catchment was the most suitable when the PA method was used, and three donor catchments were the most suitable when the OA method was used. The regionalization performance decreased significantly when the nested catchments were excluded.
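The selection rule used above, picking the number of donors with the highest 50th percentile NSE, can be sketched as follows. The function name and the NSE values are hypothetical, and this simplified version ignores PBIAS, which the study also examined.

```python
from statistics import median


def best_donor_count(nse_by_k):
    """Return the donor count whose NSE values have the highest median."""
    return max(nse_by_k, key=lambda k: median(nse_by_k[k]))


# Hypothetical NSE values across target catchments for k = 1, 2, 3 donors.
nse_by_k = {
    1: [0.70, 0.80, 0.60],
    2: [0.65, 0.70, 0.60],
    3: [0.50, 0.60, 0.55],
}
print(best_donor_count(nse_by_k))
```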

4.4. Results of regionalization approaches

Based on the calibration results indicated by NSE and PBIAS, the runoff simulation accuracy of the assumed ungauged catchments under the regression-based schemes and the similarity-based approach was compared (Fig. 7). Across the 38 catchments, the simulation accuracy of the SVR-based and RBF-based regionalization approaches was higher than that of the LR-based method. However, the runoff simulation result based on the similarity-based approach (Si) was more accurate than those of the six regression-based methods. The 50th percentile values of NSE and PBIAS of the Si method were 0.71 and 12.5%, respectively, and its accuracy was significantly higher than that of the regression-based approaches. This might have been caused by the nested catchments. Therefore, the nested catchments in the study area were excluded, and the accuracy of the different regionalization methods was compared in the remaining 26 catchments (Fig. 7). In this case, the 50th percentile values of NSE (PBIAS) were 0.68 (29%), 0.66 (28.15%), and 0.62 (39.05%) for the SVR, Si, and LR, respectively. The results indicated that the runoff simulation accuracy of the SVR-based approach was higher than that of the Si and LR-based methods in the ungauged catchments.

According to Fig. 8, the most successful regression-based schemes were distributed differently across the 38 catchments. The LR-based method performed poorly: it produced a higher NSE value than the machine learning techniques in only 2 of the 38 catchments, and the lowest PBIAS value in only 6 of the 38 catchments. The number of times each regionalization approach ranked among the three best-performing approaches in a catchment was also investigated. The results showed that the performances of SVR, RBF, and kNN were significantly better than those of the other methods in the 38 catchments.

As presented in Fig. 9, all regionalization approaches performed better in humid regions than in arid regions. In humid regions, the 50th percentile values of NSE for different methods were all greater than 0.7. In contrast, the 50th percentile values of PBIAS varied in these regions. Generally, the regionalization effect order in humid regions from good to poor was kNN, Si, RF, SVR, RBF, DT, and LR. In arid regions, the 50th percentile values of NSE for different methodologies varied greatly. The 50th percentile values of PBIAS with the regression-based schemes were greater than 30%. Generally, the accuracy of the regionalization methods in arid regions, from high to low, was in the following order: Si, SVR, RF, RBF, kNN, LR, and DT.

5. Discussion

5.1. Application of machine learning algorithm in parameter regionalization

Catchment descriptors and model parameters are interdependent, and their relationship is complex and nonlinear. The machine learning technique is therefore an attractive modeling structure for parameter regionalization, because it can capture the intrinsic relationships between the input and output variables, regardless of their internal physical links. This might be a reliable and robust solution to the PUB problem. Booker and Snelder [45] and Golian et al. [8] also found that complex modeling techniques were superior to the linear method in predicting hydrologic properties. These techniques produced improved performances and a high degree of flexibility in capturing nonlinear and complex relationships between the model parameters and catchment descriptors [46], [47]. Unlike previous research, which used a single machine learning algorithm, this study compared the potential of multiple methods for regionalization, and the result showed that the SVR-based method performed better than the other algorithms. SVR seeks to minimize structural risk in the modeling process, which gives it additional generalization capability. Patel and Ramachandran [48] also found that SVR provided superior performance in modeling discharge time series data. The performance of the different machine learning algorithms varied significantly among climate regions. The input data might have had a greater impact on the performance of some models than the algorithm itself. It was difficult to determine whether the machine learning model was the best solution to all problems.

The parameter regionalization error of the regression-based schemes was larger than that of the similarity-based approach, because there were few training samples and the nonlinear relationship between the catchment descriptors and the model parameters was not fully learned. When the parameter regionalization results of the various methods were compared with the calibration results, the performances of the six regression-based schemes were significantly inferior to that of the calibration method. Regardless of the strength of the correlation between the model parameters and the catchment descriptors, estimating the model parameters from the descriptors alone led to a decline in regionalization performance, indicating that there is still considerable room for improvement in model parameterization.

5.2. Donor catchment selection

As an important parameter regionalization methodology, the similarity-based approach involved the selection of donor catchment using the spatial distance and the physical similarity of the catchment. The number of donor catchments was related to the study area, basin density, and the approach (physical similarity or spatial proximity) used [1]. The regionalization performance was best when one donor catchment was selected in this study, whether the PA method or the OA method was adopted (Fig. 6). One reason for this was that the 38 catchments included multiple nested catchments. The similarity degree of the hydrometeorological conditions in the nested catchments was relatively high, resulting in a better performance of parameter regionalization. When the donor catchments did not contain nested catchments, the accuracy of the parameter regionalization decreased significantly. In addition to the impact of the nested catchments, the hydrologic and climatic conditions of some typical catchments had a large span, resulting in low similarity among catchments. For example, the Huangfu, Linjiaping, Daning, and Hejin catchments were significantly different from other catchments regardless of the spatial distance or physical similarity. In terms of the physical attributes, the existence of a gauged catchment adjacent to ungauged catchments was more important than the similarity between gauged and ungauged catchments [49].

Whether or not the study area contained nested catchments, the regionalization performance of the OA method was significantly better than that of the PA method (Fig. 6). The OA method directly applied the model parameters from each donor catchment to the ungauged catchment without modification, and thus used all the information in the calibrated model parameters. In contrast, the PA method weighted and averaged the model parameters of the donor catchments and then applied the averaged parameters to the ungauged catchment. There is strong interdependence among hydrological model parameters, which is weakened when the parameters are averaged [44]. Therefore, the PA method is commonly used when the correlation among the hydrological model parameters is weak.
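The difference between the two transfer schemes can be shown with a small sketch. It assumes that PA denotes parameter averaging and OA denotes output averaging, and it replaces SWAT with a hypothetical two-parameter threshold model (`toy_runoff`) so that the example stays self-contained. All parameter values are illustrative.

```python
def toy_runoff(precip, params):
    """Hypothetical threshold model (not SWAT): runoff = c * max(P - s, 0)."""
    s, c = params
    return [c * max(p - s, 0.0) for p in precip]

def parameter_average(donor_params, weights):
    """PA: weight and average each parameter across donors, then run the model once."""
    total = sum(weights)
    n_par = len(donor_params[0])
    return tuple(
        sum(w * p[i] for w, p in zip(weights, donor_params)) / total
        for i in range(n_par)
    )

def output_average(precip, donor_params, weights):
    """OA: run the model with each donor's unmodified parameter set, then average the outputs."""
    total = sum(weights)
    runs = [toy_runoff(precip, p) for p in donor_params]
    return [
        sum(w * r[t] for w, r in zip(weights, runs)) / total
        for t in range(len(precip))
    ]

precip = [5.0, 20.0, 40.0]           # illustrative precipitation series
donors = [(10.0, 0.8), (30.0, 0.2)]  # (storage threshold s, runoff coefficient c)
weights = [1.0, 1.0]                 # equal weights; similarity-based weights also work

q_pa = toy_runoff(precip, parameter_average(donors, weights))
q_oa = output_average(precip, donors, weights)
print("PA:", q_pa)
print("OA:", q_oa)
```

In this toy case the averaged parameter set (s = 20, c = 0.5) matches neither donor's calibrated combination, so the second time step produces no runoff under PA although one donor does produce runoff there. With a single donor the two schemes give identical results.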

5.3. Descriptor importance in parameter regionalization

The application of regression-based schemes to parameter regionalization assumed that the selected catchment descriptors could describe the hydrological behavior of a basin well. Therefore, the selection of descriptors is crucial to the success of parameter regionalization. Although the selection of suitable catchment descriptors in hydrological parameterization studies has been widely discussed, no universally accepted selection criteria exist [14], [50]. Merz and Blöschl [51] mentioned that the selected catchment descriptors should be factors that drive the watershed hydrological response. Mwakalila [52] proposed that the catchment descriptors should have both geographical and parameter spatial significance. In previous studies, the selection of catchment descriptors was mainly based on geography, meteorology, hydrology, and soil [53]. Other descriptors have occasionally been used, such as land use, drainage density [49], and other meteorological data (mean annual evaporation) [47]. In addition, the selection of appropriate catchment descriptors also depends on the physical significance of the model parameters. For example, the CN2 in the SWAT model depends on the soil and land use characteristics of a catchment [18], so descriptors of these characteristics should be considered. The regional climate, soil types, and vegetation make regionalization special, and the SWAT model parameters are mostly related to these factors [49]. The nine catchment descriptors selected in this study covered these factors, but statistically significant multicollinearity existed among the descriptors. Saadi et al. [23] only retained the catchment descriptors with low correlation values to each other. Penas et al. [47] selected predictor variables based on the combination of scatter plots (hydrological indices versus environmental variables) and parametric correlations.
The PCA method used in this study reduced the dimensionality of the catchment descriptors while retaining as much of the information in the original variables as possible.
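How PCA condenses collinear descriptors can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: it extracts only the first principal component by power iteration on the correlation matrix, and the descriptor values and function names are hypothetical. A full analysis would retain as many components as needed to reach a chosen share of explained variance.

```python
import math

def standardize(rows):
    """Scale each descriptor column to zero mean and unit (population) variance."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - mu) ** 2 for v in c) / n) for c, mu in zip(cols, means)]
    return [[(v - mu) / s for v, mu, s in zip(r, means, stds)] for r in rows]

def correlation_matrix(z):
    """Correlation matrix of already standardized data."""
    n, m = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / n for j in range(m)] for i in range(m)]

def leading_component(corr, iters=200):
    """Power iteration: leading eigenvector and eigenvalue of a symmetric matrix."""
    m = len(corr)
    # Start from the first axis; assumes the first descriptor is not orthogonal
    # to the leading component.
    v = [1.0] + [0.0] * (m - 1)
    for _ in range(iters):
        w = [sum(corr[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigval = sum(v[i] * corr[i][j] * v[j] for i in range(m) for j in range(m))
    return v, eigval

# Hypothetical, strongly collinear descriptors (illustrative values only):
# mean annual precipitation (mm), mean slope (%), forest fraction
rows = [
    (950.0, 8.0, 0.40),
    (980.0, 7.5, 0.42),
    (520.0, 15.0, 0.10),
    (560.0, 14.0, 0.12),
    (700.0, 11.0, 0.25),
]

z = standardize(rows)
corr = correlation_matrix(z)
pc1, eigval = leading_component(corr)
explained = eigval / len(corr)  # trace of a correlation matrix equals the number of descriptors
scores = [sum(a * b for a, b in zip(r, pc1)) for r in z]  # one PC1 score per catchment

print("loadings:", [round(x, 3) for x in pc1])
print("explained variance ratio:", round(explained, 3))
```

The component scores, rather than the original correlated descriptors, would then serve as predictors in the regression models.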

5.4. Influence of hydrologic and climatic conditions on regionalization

The study area included two climatic regions, namely, a relatively humid region and a relatively arid region. The humid areas were mainly located in the Huai River Basin, where the mean annual precipitation was 976.23 mm (Table 1), and the mean value of NSE in the calibration period was 0.82 (Fig. 2). The arid regions were distributed in the Yellow River Basin and the Hai River Basin, where the mean annual precipitation was 561.01 mm (Table 1), and the mean value of NSE in the calibration period was 0.77 (Fig. 2). Given the influence of the precipitation distribution, the runoff simulation accuracy of the SWAT model in the humid regions was better than that in the arid regions. According to the results of the parameter regionalization (Fig. 9), the regionalization performance in the humid areas was better than that in the arid areas, which was consistent with the runoff simulation results. Therefore, the results of hydrological model parameter regionalization largely depended on the accuracy of the runoff simulation in the donor catchments. Only when sufficiently accurate parameters were obtained in the donor catchments could reliable simulation results be obtained in the ungauged catchments through parameter regionalization. In arid regions, the economy was usually undeveloped, the monitoring stations were few, and the hydrological data were relatively scarce. Moreover, the runoff simulation was more sensitive to the model parameters in these regions than in the humid regions. For the same parameter error, the deviation of the simulation results in arid regions was greater than that in humid regions [40]. Parajka et al. [54] and Yang et al. [55] also pointed out the impact of climate conditions on regionalization performance.

The most successful regionalization methodology in humid catchments may differ from that in arid catchments. The performance of a particular approach varies between different studies more often than between methods tested in a single study [53], [56]. As shown in Fig. 9, the SVR showed better regionalization performance in arid areas, while the kNN had higher regionalization accuracy in humid areas. Different data inputs may have a greater impact on the performance of some models than the algorithm itself, and determining which machine learning model is the best solution to a given problem is difficult [57]. To improve the accuracy of parameter regionalization, the next challenge to consider is the introduction of more machine learning techniques.

5.5. Uncertainty and limitation of the results

The uncertainty of this study came from three aspects. First, the selected catchment descriptors imposed limitations on the interpretation of some ungauged catchments. This also illustrated a fundamental challenge of parameter regionalization, that is, the number or quality of the selected catchment descriptors was insufficient to represent the catchment heterogeneity [58]. Second, the nonlinear relationship between the model parameters and the catchment descriptors is difficult to express perfectly with statistical models. Third, although the SWAT model performs well in runoff simulation, uncertainty still exists [59], [60] for the following reasons: ① parameter uncertainty, in which the inconsistency of the model inputs and parameters in space and time leads to error in model parameter values; ② data uncertainty, in which the variability of natural conditions, limitation of measurement conditions, and the uncertainty of measurement methods all affect the accuracy of the model input data; and ③ model uncertainty, in which hydrological models generalize hydrological processes and cannot accurately represent the actual physical processes of a watershed. Additionally, most of the parameters in the SWAT model adopted the default values, which deviate from the actual values and thus also affect the accuracy of the model simulation.

6. Conclusions

The regionalization approach is a crucial method for solving the problem of runoff modeling in ungauged catchments. Different regionalization methods were used to estimate SWAT model parameters in this study, and runoff simulation was studied in 38 catchments located in the YHHRB. Because the LR-based method copes poorly with nonlinear relationships, five machine learning algorithms (SVR, RF, kNN, DT, and RBF) were used to describe the quantitative relationships between the model parameters and the catchment descriptors to improve the parameter regionalization performance. We found that the SVR-based regression scheme had the highest simulation accuracy in ungauged catchments, indicating that its performance was better than that of the traditional LR-based and similarity-based approaches. The performance of different regionalization methods was similar in humid regions due to the relatively simple hydrometeorological processes and easy runoff simulation. However, the runoff simulation results in arid areas were more sensitive to the model parameters, and the advantages of the machine learning techniques were more evident in these regions. The regionalization performance of the SVR-, RBF-, and RF-based methods was better than that of the traditional LR techniques in arid regions. When the study area contained nested catchments, the best parameter regionalization performance was derived through similarity-based methods because of the high basin density and similarity among catchments. The study results enrich the methods of parameter regionalization and provide a reference for future water resources planning and management in ungauged catchments.

Acknowledgments

This research is funded by the National Key Research and Development Program of China (2017YFA0605002, 2017YFA0605004, and 2016YFA0601501); the National Natural Science Foundation of China (41961124007, 51779145, and 41830863); and “Six top talents” in Jiangsu Province (RJFW-031).

Compliance with ethics guidelines

Houfa Wu, Jianyun Zhang, Zhenxin Bao, Guoqing Wang, Wensheng Wang, Yanqing Yang, and Jie Wang declare that they have no conflict of interest or financial conflicts to disclose.

References


[1]	Y. Guo, Y. Zhang, L. Zhang, Z. Wang. Regionalization of hydrological modeling for predicting streamflow in ungauged catchments: a comprehensive review. Wiley Interdiscip Rev Water, 8 (1) (2021), p. e1487

[2]	Q. Yang, J.E. Almendinger, X. Zhang, M. Huang, X. Chen, G. Leng, et al. Enhancing SWAT simulation of forest ecosystems for water resource assessment: a case study in the St. Croix River basin. Ecol Eng, 120 (2018), pp. 422-431

[3]	K.J. Beven, M.J. Kirkby, J.E. Freer, R. Lamb. A history of TOPMODEL. Hydrol Earth Syst Sci, 25 (2) (2021), pp. 527-549. DOI: 10.5194/hess-25-527-2021

[4]	J. Gong, C. Yao, Z. Li, Y. Chen, Y. Huang, B. Tong. Improving the flood forecasting capability of the Xinanjiang model for small- and medium-sized ungauged catchments in South China. Nat Hazards, 106 (3) (2021), pp. 2077-2109. DOI: 10.1007/s11069-021-04531-0

[5]	S.Y. Woo, S.J. Kim, J.W. Lee, S.H. Kim, Y.W. Kim. Evaluating the impact of interbasin water transfer on water quality in the recipient river basin with SWAT. Sci Total Environ, 776 (2021), p. 145984

[6]	G.E. Clark, K.H. Ahn, R.N. Palmer. Assessing a regression-based regionalization approach to ungauged sites with various hydrologic models in a forested catchment in the northeastern United States. J Hydrol Eng, 22 (12) (2017), p. 05017027. DOI: 10.1061/(ASCE)HE.1943-5584.0001582

[7]	G.Q. Wang, J.Y. Zhang, J.L. Jin, Y.L. Liu, R.M. He, Z.X. Bao, et al. Regional calibration of a water balance model for estimating stream flow in ungauged areas of the Yellow River Basin. Quat Int, 336 (2014), pp. 65-72

[8]	S. Golian, C. Murphy, H. Meresa. Regionalization of hydrological models for flow estimation in ungauged catchments in Ireland. J Hydrol Reg Stud, 36 (2021), p. 100859

[9]	J. Samuel, P. Coulibaly, R.A. Metcalfe. Estimation of continuous streamflow in Ontario ungauged basins: comparison of regionalization methods. J Hydrol Eng, 16 (5) (2011), pp. 447-459

[10]	R.R. Knight, W.S. Gain, W.J. Wolfe. Modelling ecological flow regime: an example from the Tennessee and Cumberland River basins. Ecohydrology, 5 (5) (2012), pp. 613-627. DOI: 10.1002/eco.246

[11]	X. Yang, J. Magnusson, C.Y. Xu. Transferability of regionalization methods under changing climate. J Hydrol, 568 (2019), pp. 67-81

[12]	H.E. Beck, A. I.J.M. van Dijk, A. de Roo, D.G. Miralles, T.R. McVicar, J. Schellekens, et al. Global-scale regionalization of hydrologic model parameters. Water Resour Res, 52 (5) (2016), pp. 3599-3622

[13]	W. Boughton, F. Chiew. Estimating runoff in ungauged catchments from rainfall, PET and the AWBM model. Environ Model Softw, 22 (4) (2007), pp. 476-487

[14]	K. Jafarzadegan, V. Merwade, H. Moradkhani. Combining clustering and classification for the regionalization of environmental model parameters: application to floodplain mapping in data-scarce regions. Environ Modell Softw, 125 (2020), p. 104613

[15]	L. Oudin, A. Kay, V. Andreassian, C. Perrin. Are seemingly physically similar catchments truly hydrologically similar? Water Resour Res, 46 (11) (2010), p. W11558

[16]	I.A. Guiamel, H.S. Lee. Watershed modelling of the Mindanao River Basin in the Philippines using the SWAT for water resource management. Civ Eng J, 6 (4) (2020), pp. 626-648. DOI: 10.28991/cej-2020-03091496

[17]	J.P.C. Reichl, A.W. Western, N.R. McIntyre, F.H.S. Chiew. Optimization of a similarity measure for estimating ungauged streamflow. Water Resour Res, 45 (10) (2009), p. W10423

[18]	H. Sellami, I. La Jeunesse, S. Benabdallah, N. Baghdadi, M. Vanclooster. Uncertainty analysis in model parameters regionalization: a case study involving the SWAT model in Mediterranean catchments (Southern France). Hydrol Earth Syst Sci, 18 (6) (2014), pp. 2393-2413. DOI: 10.5194/hess-18-2393-2014

[19]	S. Ly, C. Charles, A. Degre. Different methods for spatial interpolation of rainfall data for operational hydrology and hydrological modeling at watershed scale: a review. Biotechnol Agron Soc, 17 (2) (2013), pp. 392-406

[20]	S. Heng, T. Suetsugi. Comparison of regionalization approaches in parameterizing sediment rating curve in ungauged catchments for subsequent instantaneous sediment yield prediction. J Hydrol, 512 (2014), pp. 240-253

[21]	C.M.M. Kittel, A.L. Arildsen, S. Dybkjær, E.R. Hansen, I. Linde, E. Slott, et al. Informing hydrological models of poorly gauged river catchments—a parameter regionalization and calibration approach. J Hydrol, 587 (2020), p. 124999

[22]	Y.Q. Zhang, F.H.S. Chiew. Relative merits of different methods for runoff predictions in ungauged catchments. Water Resour Res, 45 (7) (2009), p. W07412

[23]	M. Saadi, L. Oudin, P. Ribstein. Random forest ability in regionalizing hourly hydrological model parameters. Water, 11 (8) (2019), p. 1540. DOI: 10.3390/w11081540

[24]	P. Soni, S. Tripathi, R. Srivastava. A comparison of regionalization methods in monsoon dominated tropical river basins. J Water Clim Chang, 12 (5) (2021), pp. 1975-1996. DOI: 10.2166/wcc.2021.298

[25]	D.J. Lary, A.H. Alavi, A.H. Gandomi, A.L. Walker. Machine learning in geosciences and remote sensing. Geosci Front, 7 (1) (2016), pp. 3-10

[26]	S. Hao, Q. Ma, X. Zhai, G. Lyu, S. Fan, W. Wang. A new machine learning approach for parameter regionalization of flash flood modelling in Henan Province, China. S. Stanciu, K. Kassmi, G. Shmavonyan (Eds.), Proceedings of the 2021 2nd International Conference on Energy, Power and Environmental System Engineering; 2021 Jul 4-5; Shanghai, China, EDP Science, Les Ulis (2021), p. 02010. DOI: 10.1051/e3sconf/202130002010

[27]	S. Ragettli, J. Zhou, H. Wang, C. Liu, L. Guo. Modeling flash floods in ungauged mountain catchments of China: a decision tree learning approach for parameter regionalization. J Hydrol, 555 (2017), pp. 330-346

[28]	Ministry of Water Resources People’s Republic of China. China water resources bulletin 2019. China Water & Power Press, Beijing (2020)

[29]	J. Wu, X.J. Gao. A gridded daily observation dataset over China region and comparison with the other datasets. Chin J Geophys, 56 (4) (2013), pp. 1102-1111 [Chinese].

[30]	D.R. Samal, S. Gedam. Assessing the impacts of land use and land cover change on water resources in the Upper Bhima River Basin, India. Environ Chall, 5 (2021), p. 100251

[31]	M.L. Tan, P.W. Gassman, X. Yang, J. Haywood. A review of SWAT applications, performance and future needs for simulation of hydro-climatic extremes. Adv Water Resour, 143 (2020), p. 103662

[32]	J.G. Arnold, D.N. Moriasi, P.W. Gassman, K.C. Abbaspour, M.J. White, R. Srinivasan, et al. SWAT: model use, calibration, and validation. Trans ASABE, 55 (4) (2012), pp. 1491-1508

[33]	C. Li, H. Fang. Assessment of climate change impacts on the streamflow for the Mun River in the Mekong Basin, Southeast Asia: using SWAT model. Catena, 201 (2021), p. 105199

[34]	B. Mohammadi, S. Mehdizadeh. Modeling daily reference evapotranspiration via a novel approach based on support vector regression coupled with whale optimization algorithm. Agric Water Manage, 237 (2020), p. 106145

[35]	S.Y. Park, M. Park, W.Y. Lee, C.Y. Lee, J.H. Kim, S. Lee, et al. Machine learning-based prediction of Sasang constitution types using comprehensive clinical information and identification of key features for diagnosis. Integr Med Res, 10 (3) (2021), p. 100668

[36]	K.G. Liakos, P. Busato, D. Moshou, S. Pearson, D. Bochtis. Machine learning in agriculture: a review. Sensors, 18 (8) (2018), p. 2674. DOI: 10.3390/s18082674

[37]	K. Feng, A. González, M. Casero. A kNN algorithm for locating and quantifying stiffness loss in a bridge from the forced vibration due to a truck crossing at low speed. Mech Syst Signal Proc, 154 (2021), p. 107599

[38]	A.H. Mary, A.H. Miry, T. Kara, M.H. Miry. Nonlinear state feedback controller combined with RBF for nonlinear underactuated overhead crane system. J Eng Res, 9 (3A) (2021), pp. 197-208

[39]	Z. Hu, X. Chen, Q. Zhou, D. Chen, J. Li. DISO: a rethink of Taylor diagram. Int J Climatol, 39 (5) (2019), pp. 2825-2832. DOI: 10.1002/joc.5972

[40]	Z. Bao, J. Zhang, J. Liu, G. Fu, G. Wang, R. He, et al. Comparison of regionalization approaches based on regression and similarity for predictions in ungauged catchments under multiple hydro-climatic conditions. J Hydrol, 466-467 (2012), pp. 37-46

[41]	M. Ligaray, H. Kim, S. Sthiannopkao, S. Lee, K.H. Cho, J.H. Kim. Assessment on hydrologic response by climate change in the Chao Phraya River Basin, Thailand. Water, 7 (12) (2015), pp. 6892-6909. DOI: 10.3390/w7126665

[42]	D. Yu, P. Xie, X. Dong, X. Hu, J. Liu, Y. Li, et al. Improvement of the SWAT model for event-based flood simulation on a sub-daily timescale. Hydrol Earth Syst Sci, 22 (9) (2018), pp. 5001-5019. DOI: 10.5194/hess-22-5001-2018

[43]	M. Samimi, A. Mirchi, D. Moriasi, S. Ahn, S. Alian, S. Taghvaeian, et al. Modeling arid/semi-arid irrigated agricultural watersheds with SWAT: applications, challenges, and solution strategies. J Hydrol, 590 (2020), p. 125418

[44]	L. Oudin, V. Andreassian, C. Perrin, C. Michel, N. Le Moine. Spatial proximity, physical similarity, regression and ungaged catchments: a comparison of regionalization approaches based on 913 French catchments. Water Resour Res, 44 (2008), p. W03413

[45]	D.J. Booker, T.H. Snelder. Comparing methods for estimating flow duration curves at ungauged sites. J Hydrol, 434-435 (2012), pp. 78-94

[46]	J. Elith, C.H. Graham, R.P. Anderson, M. Dudík, S. Ferrier, A. Guisan, et al. Novel methods improve prediction of species’ distributions from occurrence data. Ecography, 29 (2) (2006), pp. 129-151. DOI: 10.1111/j.2006.0906-7590.04596.x

[47]	F.J. Penas, J. Barquin, C. Alvarez. A comparison of modeling techniques to predict hydrological indices in ungauged rivers. Limnetica, 37 (1) (2018), pp. 145-158

[48]	S.S. Patel, P. Ramachandran. A comparison of machine learning techniques for modeling river flow time series: the case of Upper Cauvery River Basin. Water Resour Manage, 29 (2) (2015), pp. 589-602. DOI: 10.1007/s11269-014-0705-0

[49]	J.B. Swain, K.C. Patra. Streamflow estimation in ungauged catchments using regionalization techniques. J Hydrol, 554 (2017), pp. 420-433

[50]	L. Boscarello, G. Ravazzani, A. Cislaghi, M. Mancini. Regionalization of flow-duration curves through catchment classification with streamflow signatures and physiographic-climate indices. J Hydrol Eng, 21 (3) (2016), p. 05015027. DOI: 10.1061/(ASCE)HE.1943-5584.0001307

[51]	R. Merz, G. Blöschl. Regionalisation of catchment model parameters. J Hydrol, 287 (1-4) (2004), pp. 95-123

[52]	S. Mwakalila. Estimation of stream flows of ungauged catchments for river basin management. Phys Chem Earth, 28 (20-27) (2003), pp. 935-942

[53]	T. Razavi, P. Coulibaly. Streamflow prediction in ungauged basins. Review of regionalization methods. J Hydrol Eng, 18 (8) (2013), pp. 958-975

[54]	J. Parajka, A. Viglione, M. Rogger, J.L. Salinas, M. Sivapalan, G. Blöschl. Comparative assessment of predictions in ungauged basins-part 1: runoff-hydrograph studies. Hydrol Earth Syst Sci, 17 (5) (2013), pp. 1783-1795. DOI: 10.5194/hess-17-1783-2013

[55]	X. Yang, J. Magnusson, S. Huang, S. Beldring, C.Y. Xu. Dependence of regionalization methods on the complexity of hydrological models in multiple climatic regions. J Hydrol, 582 (2020), p. 124357

[56]	S. Pool, M. Vis, J. Seibert. Regionalization for ungauged catchments—lessons learned from a comparative large‐sample study. Water Resour Res, 57 (10) (2021), p. WR030437

[57]	Z. Abdulelah Al-Sudani, S.Q. Salih, A. Sharafati, Z.M. Yaseen. Development of multivariate adaptive regression spline integrated with differential evolution model for streamflow simulation. J Hydrol, 573 (2019), pp. 1-12

[58]	B. Choubin, K. Solaimani, F. Rezanezhad, M.H. Roshan, A. Malekian, S. Shamshirband. Streamflow regionalization using a similarity approach in ungauged basins: application of the geo-environmental signatures in the Karkheh River Basin. Catena, 182 (2019), p. 104128

[59]	T. Abbas, F. Hussain, G. Nabi, M.W. Boota, R.S. Wu. Uncertainty evaluation of SWAT model for snowmelt runoff in a Himalayan watershed. Terr Atmos Ocean Sci, 30 (2) (2019), pp. 265-279

[60]	Y. Wang, R. Jiang, J. Xie, Y. Zhao, D. Yan, S. Yang. Soil and water assessment tool (SWAT) model: a systemic review. J Coast Res, 93 (SI) (2019), pp. 22-30