TY - JOUR
T1 - Comparing the use of all data or specific subsets for training machine learning models in hydrology
T2 - A case study of evapotranspiration prediction
AU - Shi, Haiyang
AU - Luo, Geping
AU - Hellwich, Olaf
AU - He, Xiufeng
AU - Xie, Mingjuan
AU - Zhang, Wenqiang
AU - Ochege, Friday U.
AU - Ling, Qing
AU - Zhang, Yu
AU - Gao, Ruixiang
AU - Kurban, Alishir
AU - De Maeyer, Philippe
AU - Van de Voorde, Tim
N1 - Funding Information:
This research has been supported by the Tianshan Talent Cultivation (grant no. 2022TSYCLJ0001), the Key Projects of the Natural Science Foundation of Xinjiang Autonomous Region (grant no. 2022D01D01), the National Natural Science Foundation of China (grant no. U1803243), the Strategic Priority Research Program of the Chinese Academy of Sciences (grant no. XDA20060302), and the High-End Foreign Experts project of China. We would like to thank the two reviewers for their insightful comments.
Publisher Copyright:
© 2023 Elsevier B.V.
PY - 2023/12
Y1 - 2023/12
AB - Machine learning has been widely used in hydrological modeling. However, the question of whether to train on all data or only on a specific subset, and the implications of that choice, is rarely investigated explicitly. As a case study, combining evapotranspiration (ET) observations from 168 flux stations with meteorological and biophysical variables, we used Random Forests to separately construct an 'All data' model trained with all data and six 'plant functional type (PFT) specific' models trained with specific PFT data (i.e., Forest, Grassland, Cropland, Shrubland, Savannah, Wetland). We found that ET simulations are transferable between different PFTs. The 'All data' model captured ET better and had a higher R-squared at 94 of 168 sites, especially in Wetland, Shrubland, Cropland, and Grassland types. Compared to the 'All data' model, the 'PFT specific' model can further improve accuracy at high R-squared grassland sites by reducing confusion with other PFTs and constraining the variance of the training data. When shifting from the 'All data' model to the 'PFT specific' model, the increase in the degree of encapsulation of the training set into the prediction set leads to a decrease in the R-squared. Accuracy pre-evaluation may be necessary before applying models trained from either all data or subset data.
KW - Evapotranspiration
KW - FLUXNET
KW - Hydrological model
KW - Machine learning
KW - Plant functional type
KW - Random forests
UR - http://www.scopus.com/inward/record.url?scp=85177596617&partnerID=8YFLogxK
U2 - 10.1016/j.jhydrol.2023.130399
DO - 10.1016/j.jhydrol.2023.130399
M3 - Article
AN - SCOPUS:85177596617
SN - 0022-1694
VL - 627
JO - Journal of Hydrology
JF - Journal of Hydrology
M1 - 130399
ER -