Project Overview
This project focuses on engineering a large-scale data pipeline to generate rough heating and cooling load estimates for approximately 130 million U.S. residential building records provided by the research partner. To achieve this, the property records are aligned with the National Laboratory of the Rockies ResStock dataset, a residential building stock energy model used to represent the U.S. housing stock. By matching shared characteristics, peak heating and cooling loads (kW) derived from ResStock's EnergyPlus simulations are transferred to the property records.
Technical Methodology & Software Stack
Data Harmonization (Databricks / SQL): Variables overlapping between the two datasets are identified and inventoried. Units are standardized and categorical values are harmonized to ensure structural consistency across the database.
Stratification
 A key stratification scheme is defined for downstream estimation. Sample coverage is verified to ensure each stratum has a sufficient number of ResStock samples.
Load Transfer Strategy (Python)
Multiple load transfer methods are evaluated, utilizing a hybrid approach. Records are first matched within key strata. Then, a predictive Python model is deployed within each stratum to improve accuracy and prevent unrealistic extrapolation.
Imputation Strategy
Missing information in the property dataset is addressed using ResStock's conditional distributions. This imputation is executed by sampling based on the property’s observed characteristics.
Personal Contributions
Reviewing datasets, harmonization of the two sets, Stratification of combined set. These tasks were performed in Databricks using SQL and Python in collaboration with AI to help write scripts.

Back to Top