How was the hydrology model trained?
The hydrology model is trained on streamflow data covering the period 1980–2023. We use training data from approximately 16,000 streamflow gauges, illustrated in the figure below. These locations represent watersheds ranging from 2 km² to over 4,000,000 km², and we use data from watersheds of all sizes to train a single model.
Locations of streamflow gauges used for model training.
Note that not all input data is available for the whole training period (1980–2023). Where a given data product is missing, we either mask or impute the missing values. ECMWF HRES forecast data is imputed with the corresponding ERA5-Land variables for the period 1980–2016, and missing IMERG or CPC data is masked, meaning that the missing values are simply ignored by the model.
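As a rough sketch of these two strategies (the column names below are hypothetical and the real pipeline is more involved), imputation substitutes the corresponding ERA5-Land value where HRES is unavailable, while masking records a flag so the model can ignore the missing value:

```python
import numpy as np
import pandas as pd

def prepare_forcings(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative handling of missing inputs (hypothetical column names):
    impute HRES with ERA5-Land, and mask (flag) missing IMERG/CPC values."""
    out = df.copy()

    # Imputation: where HRES is unavailable (e.g. before 2016), fall back to
    # the corresponding ERA5-Land variable.
    out["hres_precip"] = out["hres_precip"].fillna(out["era5land_precip"])

    # Masking: keep a binary missingness flag and use a neutral fill value so
    # that the model never reads the missing number itself.
    for col in ["imerg_precip", "cpc_precip"]:
        out[f"{col}_missing"] = out[col].isna().astype(np.float32)
        out[col] = out[col].fillna(0.0)

    return out
```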
The model is trained using a negative log-likelihood loss function. The model's other hyperparameters, which are numerous, are available on request. No part of the model architecture or training procedure is proprietary.
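For illustration only, the snippet below sketches a negative log-likelihood loss assuming a Gaussian predictive distribution; the predictive distribution actually used by the model is not specified here, so treat this as a sketch of the general idea rather than the operational implementation:

```python
import numpy as np

def gaussian_nll(y_true, mu, sigma, eps=1e-6):
    """Mean negative log-likelihood of observations under a Gaussian
    predictive distribution N(mu, sigma^2). Illustrative only."""
    sigma = np.maximum(sigma, eps)  # guard against zero predicted spread
    return np.mean(0.5 * np.log(2.0 * np.pi * sigma**2)
                   + 0.5 * ((y_true - mu) / sigma) ** 2)

# The loss penalizes both errors in the predicted mean and badly calibrated
# uncertainty (sigma too small or too large).
loss = gaussian_nll(y_true=np.array([1.2, 3.4]),
                    mu=np.array([1.0, 3.0]),
                    sigma=np.array([0.5, 0.8]))
```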
How was the hydrology model evaluated?
Various versions of LSTM-based streamflow models have been benchmarked extensively in peer-reviewed publications. For example, this paper compares the performance of an LSTM-based simulation (not forecast) model over watersheds in the continental US with several commonly used research models. This paper compares a simulation LSTM against the two hydrology models that are run operationally for flood forecasting in the United States.
The most up-to-date peer-reviewed model performance statistics can be found in this paper. That paper compares the skill of the Global Flood Awareness System (GloFAS) with Google’s hydrology model at predicting the timing of extreme events (1-, 2-, 5-, and 10-year return period events). The Google model shows better performance, on average, on all continents, and its five-day forecasts are either better than or statistically indistinguishable from what GloFAS achieves with nowcasts (zero-day lead times).
Our objective is to understand the reliability of forecasts of extreme events, so we report precision, recall, and F1 scores (the harmonic mean of precision and recall) over events of different return periods.
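As a minimal sketch (ignoring details such as event-matching windows), these scores can be computed for one gauge and one return period from binary predicted/observed event indicators:

```python
import numpy as np

def event_scores(predicted: np.ndarray, observed: np.ndarray):
    """Precision, recall, and F1 for binary event hits. Inputs are boolean
    arrays: True where a flow above the chosen return-period threshold was
    predicted / observed. Illustrative; real evaluation also handles event
    timing tolerances."""
    tp = np.sum(predicted & observed)
    fp = np.sum(predicted & ~observed)
    fn = np.sum(~predicted & observed)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```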
What are the inputs to the hydrological model?
Most of the data that we currently use to train and test the hydrology model is publicly available, which means that, in principle, all of our modeling results are reproducible and verifiable. The meteorological forecast data that we use comes from the ECMWF Integrated Forecast System and GraphCast; obtaining this data in real time requires a paid license from ECMWF.
Target data: The hydrology model is trained on publicly available historical data from the Global Runoff Data Centre, the Caravan dataset, and BANDAS gauge data from Mexico.
Static attributes: Geophysical and geographical data are provided to the model as inputs. These come largely from the HydroATLAS dataset, which is part of the HydroSHEDS project, and include variables describing climate, land use and land cover, soils, and human impacts. A full list of the static attributes that we use is available on request. These data are scalar (single) values representing an aggregation (fractional coverage, mean, mode, max, etc.) over a given (sub)watershed.
Meteorological data: For the vast majority of our forecasts we rely on a variety of publicly available weather products, including ECMWF forecasts, IMERG precipitation, NOAA precipitation data, CPC rain gauge measurements, and GraphCast forecasts by Google. A sketch of how these inputs are combined with the static attributes is shown after the list of data sources below.
In a small number of cases we rely on local historical and real time data provided by the following governments:
- Bangladesh: BWDB - Bangladesh Water Development Board
- India: CWC - Central Water Commission
- Sri Lanka: Department of Irrigation
- Brazil: ANA - Agência Nacional de Águas e Saneamento Básico, Civil Defense, and SGB/CPRM - Serviço Geológico do Brasil
Unless otherwise stated, the models do not use real-time data provided by governmental entities, nor is the Google Flood Forecasting project affiliated with, sponsored by, or endorsed by any governmental entity.
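The sketch below illustrates, with hypothetical names and shapes, how the static attributes and meteorological time series described above can be combined into a single model input; it is a toy example, not the actual pipeline:

```python
import numpy as np

def build_model_inputs(static_attrs: np.ndarray, forcings: np.ndarray) -> np.ndarray:
    """Combine catchment inputs (hypothetical shapes):
    static_attrs: (num_static,) scalar catchment descriptors
    forcings:     (timesteps, num_forcings) meteorological time series
    Returns (timesteps, num_forcings + num_static), with the static attributes
    repeated at every timestep alongside the forcings."""
    tiled_static = np.tile(static_attrs, (forcings.shape[0], 1))
    return np.concatenate([forcings, tiled_static], axis=1)

# Example: one year of daily data, 5 forcing variables, 20 static attributes.
x = build_model_inputs(np.zeros(20), np.zeros((365, 5)))  # shape (365, 25)
```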
What is the level of confidence in the model?
Various versions of LSTM-based streamflow models have been benchmarked extensively in peer-reviewed publications. For example, this paper compares the performance of an LSTM-based simulation (not forecast) model over watersheds in the continental US with several commonly used research models. This paper compares a simulation LSTM against the two hydrology models that are run operationally for flood forecasting in the United States.
The most up-to-date peer-reviewed model performance statistics can be found in this paper. That paper compares the skill of the Global Flood Awareness System (GloFAS) with Google’s hydrology model at predicting the timing of extreme events (1-, 2-, 5-, and 10-year return period events). The Google model shows better performance, on average, on all continents, and its five-day forecasts are either better than or statistically indistinguishable from what GloFAS achieves with nowcasts (zero-day lead times) – see figure below.
How reliable are predictions in ungauged basins?
We recently published our paper “Global prediction of extreme floods in ungauged watersheds” in Nature, showing that AI-generated global flood forecasts outperform existing models.
A longstanding challenge in hydrology is the problem of Prediction in Ungauged Basins (PUB), which refers to making streamflow predictions in river reaches where there is no streamflow data for calibrating models. Machine-learning hydrology models are able to transfer learned information between different watersheds and are significantly more reliable in ungauged catchments than other types of hydrology models. This is one of the main reasons why we use an ML-based modeling approach in the Google flood forecasting system. More information, including benchmarks against two hydrology models used operationally in the United States, can be found in this paper.
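One standard way ungauged performance is assessed in the research literature is cross-validation over basins: the gauges in each test fold are withheld from training entirely and then treated as ungauged at evaluation time. The sketch below shows only this gauge splitting (model training and metric computation are omitted), and is not necessarily the exact protocol used in the cited papers:

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical gauge identifiers; the real training set has ~16,000 gauges.
gauge_ids = np.array([f"gauge_{i:05d}" for i in range(16000)])

# Gauges in each test fold are never seen during training, so evaluating on
# them measures how well the model transfers to "ungauged" basins.
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kfold.split(gauge_ids)):
    train_gauges = gauge_ids[train_idx]    # used to fit the model
    ungauged_gauges = gauge_ids[test_idx]  # held out, evaluated as ungauged
    print(fold, len(train_gauges), len(ungauged_gauges))
```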
How accurate is the model in different areas compared to the traditional models?
The AI model has higher scores than GloFAS on all continents and return periods (p < 1e-2, 0.10 < d < 0.68), with three exceptions where there is no statistical difference: Africa over 1-year return period events (p = 0.07, d = 0.03) and Asia over 5-year (p = 0.04, d = 0.12) and 10-year (p = 0.18, d = 0.12) return period events.
Over 5-year return period events, GloFAS has a 54% difference between mean F1 scores in the lowest-scoring continent (South America: F1 = 0.15) and the highest-scoring continent (Europe: F1 = 0.32), meaning that, proportionally, true positive predictions are on average about twice as likely in the highest-scoring continent. The AI model also has a 54% difference between mean F1 scores in the lowest-scoring continent (South America: F1 = 0.21) and the highest-scoring continent (South West Pacific: F1 = 0.46), which is due mostly to a large increase in skill in the South West Pacific relative to GloFAS (d = 0.68).
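For reference, the effect size d quoted above is commonly defined as Cohen's d between two sets of per-gauge scores; the sketch below is illustrative and may differ in detail from the statistical procedure used in the paper:

```python
import numpy as np

def cohens_d(scores_a: np.ndarray, scores_b: np.ndarray) -> float:
    """Cohen's d: the difference of means divided by the pooled standard
    deviation. Here scores_a and scores_b would be per-gauge F1 scores for
    the two models being compared (illustrative definition)."""
    n_a, n_b = len(scores_a), len(scores_b)
    pooled_var = (((n_a - 1) * scores_a.var(ddof=1)
                   + (n_b - 1) * scores_b.var(ddof=1))
                  / (n_a + n_b - 2))
    return (scores_a.mean() - scores_b.mean()) / np.sqrt(pooled_var)
```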
How does the hydrological model deal with...
Lumped Catchment Modeling
Our river forecast model is a lumped catchment model, meaning that it directly predicts streamflow (discharge) at a given river reach. We do not use any type of routing model to route water downstream or through a river network. The main limitation of this approach is that we are currently unable to assimilate near-real-time streamflow data to improve downstream predictions in real time. We are currently developing a graph modeling approach that simulates both rainfall-runoff processes and routing processes.
Dams and Reservoirs
We do not currently model dam and reservoir operations explicitly. In gauged locations, our ML models are able to learn some of the dam and reservoir operation signal that is present in downstream gauge data; however, we do not expect that learned patterns of regular operations will necessarily resemble operational procedures during times of flooding. This is an area of ongoing research, and we are considering adding an explicit dam and reservoir model in the future.
Physical Realism
A common concern with ML-based hydrology modeling is that ML models are not physically constrained, so their predictions might not be physically realistic. After many years of working with these models in both research and operational environments, we have not observed cases of physically unrealistic predictions.
Mass balance or other physical constraints
We have two peer-reviewed publications on introducing mass balance constraints into the ML models to ensure that water is neither lost nor created within the modeled system. We approached this question by developing an ML model that is explicitly constrained to conserve mass.
This paper looks at the effect of enforcing mass balance constraints on peak flow estimates, and found that ML models without mass balance constraints have less bias when predicting peak flow events. Similarly, this paper looks at the effect of enforcing mass balance constraints on biases in the overall water balance, and again found that ML models without explicit mass balance constraints produce less biased predictions. The reason is that ML models are able to learn local, heterogeneous, and heteroscedastic biases in precipitation data, and can mitigate these input biases to produce less biased streamflow forecasts.
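As a rough illustration of what a bias in the overall water balance means in practice, the percent bias of simulated versus observed discharge over a long record is a standard diagnostic; the sketch below shows that generic metric, not the specific analysis from those papers:

```python
import numpy as np

def percent_bias(simulated: np.ndarray, observed: np.ndarray) -> float:
    """Percent bias (PBIAS) of simulated vs. observed streamflow over a long
    record. Values near zero indicate a closed long-term water balance;
    positive values indicate systematic over-prediction."""
    return 100.0 * np.sum(simulated - observed) / np.sum(observed)
```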
Snow accumulation and snowmelt
Using explainable AI techniques, we found that our LSTM-based streamflow models are able to model snow accumulation and snowmelt processes effectively without being trained on snow data explicitly. The LSTM models use certain cell states (portions of the model's recurrent state vector) to track snow accumulation and snowmelt as a function of temperature: certain cell states correlate strongly with snow accumulation, and the model learns to release water from those states when temperatures rise above freezing, all without the model ever seeing snow data explicitly. We published a peer-reviewed book chapter on this, and a subsequent paper from researchers at the University of Oxford confirmed these findings.
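A minimal sketch of this kind of probing analysis, with hypothetical variable names and simplified relative to the published studies: record the cell states during a simulation and correlate each one with an independent snow water equivalent (SWE) series to identify "snow-like" states.

```python
import numpy as np

def snow_like_states(cell_states: np.ndarray, swe: np.ndarray, threshold: float = 0.8):
    """Identify LSTM cell states that behave like snow storage.
    cell_states: (timesteps, num_states) recorded during a model run
    swe:         (timesteps,) reference snow water equivalent series
    Returns indices of states whose absolute Pearson correlation with SWE
    exceeds the threshold, plus the full correlation vector."""
    correlations = np.array([
        np.corrcoef(cell_states[:, i], swe)[0, 1]
        for i in range(cell_states.shape[1])
    ])
    return np.where(np.abs(correlations) > threshold)[0], correlations
```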
Changes in Climate or Land Use and Land Cover
Although land use and land cover (LULC) is changing rapidly, we approximate land cover in watersheds as static. We are exploring the use of satellite imagery as input data in place of static catchment attributes related to LULC. However, we have done extensive research on this and found that using dynamic land cover indices does not generally improve the quality of streamflow predictions.
We retrain our models frequently, and climate change occurs at much longer timescales than our retraining cycles.