Developing reliable hourly electricity demand data through screening and imputation


Electricity usage (demand) data are used by utilities, governments, and academics to model electric grids for a variety of planning (e.g., capacity expansion and system operation) purposes. The U.S. Energy Information Administration collects hourly demand data from all balancing authorities (BAs) in the contiguous United States. As of September 2019, we find 2.2% of the demand data in their database are missing. Additionally, 0.5% of reported quantities are either negative values or are otherwise identified as outliers. With the goal of attaining non-missing, continuous, and physically plausible demand data to facilitate analysis, we developed a screening process to identify anomalous values. We then applied a Multiple Imputation by Chained Equations (MICE) technique to impute replacements for missing and anomalous values. We conduct cross-validation on the MICE technique by marking subsets of plausible data as missing, and using the remaining data to predict this “missing” data. The mean absolute percentage error of imputed values is 3.5% across all BAs. The cleaned data are published and available open access:

In Scientific data