Predicting Water and Power Usage for the city of Los Angeles, California
An example project for PHY178/CSC171
M. J. Madsen
I present here an example final project. This project is not intended to be a carbon-copy template for your project, but rather it should give you and idea of the general shape and style of work that I am looking for. Note that I am following the general outline and instructions for this sample project.
Preliminary Data Exploration
I am using data from the city of Los Angeles originally found from this website.
The dataset has four columns of interest:
The month in which the data was recorded
The zip code (geographical location) for the data
The water use for that zip code (measured in HCF)
The power use for that zip code (measure in kWh)
The raw data has another column (Value Date) that we will not be using. I first examine the raw data.
Text Date | Value Date | Zip Code | Water Use | Power Use | |
---|---|---|---|---|---|
0 | Mar_2008 | Mar-08 | 90230\n(33.99506171100046, -118.39500957899969) | 16.70 | 396 |
1 | Jul_2011 | Jul-11 | 90272\n(34.04886156900045, -118.53572692799969) | 35.73 | 1013 |
Before moving forward, there are some obvious data cleaning steps that need to be taken. I change the date from text to a datetime object and extract the 5-digit zip code from the Zip Code column. Finally, I drop the unneeded columns from the dataframe.
Water Use | Power Use | Date | Zip | |
---|---|---|---|---|
0 | 16.70 | 396 | 2008-03-01 | 90230 |
1 | 35.73 | 1013 | 2011-07-01 | 90272 |
The goal of this project is to predict the water and power use. I plot the totals for each date.
Data Cleaning
There is an extra point at the end that is separated from the rest of the dataset. I look specifically for times after Jan 1, 2013.
Date | Water Use | Power Use | Zip | |
---|---|---|---|---|
90 | 2013-02-01 | 1349.45 | 54265 | 13136208 |
91 | 2013-03-01 | 1640.86 | 55289 | 13136208 |
92 | 2013-04-01 | 1788.75 | 52367 | 13136208 |
93 | 2013-05-01 | 2037.96 | 57791 | 13136208 |
94 | 2013-06-01 | 2112.38 | 62299 | 13136208 |
95 | 2013-12-01 | 1398.03 | 64430 | 13136208 |
There are missing data points between July 2013 and November 2013. I cut the data off at June 30, 2013.
Before moving forward, it looks like there was a major change in water usage around the mid-point of 2012. However, when I look at the data grouped by zip code, I see that the drop is due to a change in the number of reported zip codes that happened in the middle of 2012.
Because my fits will be by zip code, this should not be an issue. However, there are a handfull of possible outliers in the water use data.
Outlier Detection
I utilize the Mahalanobis distance to see how far from the average of the data the potential outliers are.
There are about 8 points that stand out as being very far from the rest of the 12,000 points (over 15 standard deviations). All of the points occur in two short time periods in the same zip code.
Water Use | Power Use | Date | Zip | M_dist | |
---|---|---|---|---|---|
683 | 2512.2 | 0 | 2008-03-01 | 91350 | 18.930970 |
10991 | 2407.0 | 0 | 2008-04-01 | 91350 | 18.121215 |
8 | 2694.8 | 0 | 2008-05-01 | 91350 | 20.336495 |
4406 | 3231.6 | 0 | 2009-07-01 | 91350 | 24.468400 |
10328 | 3235.6 | 0 | 2009-08-01 | 91350 | 24.499190 |
12452 | 3110.4 | 0 | 2009-09-01 | 91350 | 23.535489 |
7301 | 3202.4 | 0 | 2009-10-01 | 91350 | 24.243640 |
3227 | 3074.8 | 0 | 2009-11-01 | 91350 | 23.261465 |
There may have been something unusual happening during those time periods. I drop these points from the dataset as error outliers.
Water Use | Power Use | Date | Zip | |
---|---|---|---|---|
0 | 16.70 | 396 | 2008-03-01 | 90230 |
1 | 35.73 | 1013 | 2011-07-01 | 90272 |
Data Enrichment
There is a definite month/year periodicity in the data. I add month and year data features to the model. I visualize the data to and see that the periodicity is clearly visible in both the water use and the power use data.
Both the water and the power usage go up in the summer months, which makes sense. There is also a power spike in January that may be related to heating on cold days.
Enriching Data with other Datasets
I hypothesize that the water and power useage depends on the following additional factors:
Population of each zip code
Economic status of each zip code (weath distribution)
Weather data
I join the original dataset with three additonal datasets to add these features.
Adding Population Data
The population data was retrieved from the US Census Bureau.
https://www.census.gov/geo/maps-data/data/zcta_rel_download.html
The dataset includes information about the population, the land area, and the total area for each zip code. There are a number of other columns of data that I will not use. I convert the zip code column (ZCTA5) to an integer in order to match the datatype of my power/water usage dataframe zip code.
ZCTA5 | STATE | COUNTY | GEOID | POPPT | HUPT | AREAPT | AREALANDPT | ZPOP | ZHU | ... | COAREA | COAREALAND | ZPOPPCT | ZHUPCT | ZAREAPCT | ZAREALANDPCT | COPOPPCT | COHUPCT | COAREAPCT | COAREALANDPCT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 601 | 72 | 1 | 72001 | 18465 | 7695 | 165132671 | 164333375 | 18570 | 7744 | ... | 173777444 | 172725651 | 99.43 | 99.37 | 98.61 | 98.6 | 94.77 | 94.71 | 95.03 | 95.14 |
1 | 601 | 72 | 141 | 72141 | 105 | 49 | 2326414 | 2326414 | 18570 | 7744 | ... | 298027589 | 294039825 | 0.57 | 0.63 | 1.39 | 1.4 | 0.32 | 0.35 | 0.78 | 0.79 |
2 rows × 24 columns
Because the data in this set are intended to merge other datasets, there are multiple entries for each zip code. The "Z" columns are totals for each zip code, so I only need one of them - I use the max
aggregation. I cut the unneeded columns, keeping only the zip code, population, total area, and land area columns.
ZCTA5 | ZPOP | ZAREA | ZAREALAND | |
---|---|---|---|---|
0 | 601 | 18570 | 167459085 | 166659789 |
1 | 602 | 41520 | 83734431 | 79288158 |
To get a sense of the dataset, I look at the relationship between the land area and the population.
There is an interesting anti-correlation here. But perhaps that isn't too surprising, as larger population densities will have more zip codes associated with them, so most of the data would be in small land area, large populations.
Now I merge this will the resource use database, keeping only the columns that I need after the join.
Water Use | Power Use | Date | Zip | Month | Year | ZPOP | ZAREA | ZAREALAND | |
---|---|---|---|---|---|---|---|---|---|
0 | 16.70 | 396 | 2008-03-01 | 90230 | 3 | 2008 | 31766 | 11785759 | 11672688 |
1 | 17.59 | 407 | 2005-12-01 | 90230 | 12 | 2005 | 31766 | 11785759 | 11672688 |
There were a number of rows lost in the merge- these were zip codes that were not in the US Census database.
I graph the new inputs to see how the Water and Power use depend on them.
It looks like the power use is more strongly correlation with the census and land area data.
Adding Economic Data
I add the economic data which is based on IRS tax records:
https://www.irs.gov/uac/soi-tax-stats-individual-income-tax-statistics-zip-code-data-soi
This dataset has the following variables I am interested in:
Nreturns: number of filed tax returns
AGI: adjusted gross income (in thousands of $)
SW: Salary and Wages (in thousands of $)
EIC: total earned income tax credit (in thousands of $)
However, the dataset is originally an Excel spreadsheet with multiple sheet names. I created a short function to retrieve the tax data I want.
Zip Code | Nreturns | AGI | SW | EIC | Year | |
---|---|---|---|---|---|---|
0 | 90001 | 17313 | 406784 | 358687.0 | 16162.0 | 2005 |
1 | 90002 | 14712 | 331533 | 294916.0 | 16130.0 | 2005 |
I visualize the economic data in the following plot.
I merge the IRS dataset with the population and usage dataset.
Water Use | Power Use | Date | Zip | Month | Year | ZPOP | ZAREA | ZAREALAND | Nreturns | AGI | SW | EIC | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16.70 | 396 | 2008-03-01 | 90230 | 3 | 2008 | 31766 | 11785759 | 11672688 | 15572 | 1008925 | 765127.0 | 3537.0 |
1 | 20.95 | 368 | 2008-07-01 | 90230 | 7 | 2008 | 31766 | 11785759 | 11672688 | 15572 | 1008925 | 765127.0 | 3537.0 |
There is an additional data loss in this merge. However, I still have most of my original dataset.
I visualize the combined dataset.
Again, there aren't strong correlations here, but the correlations that exist may improve the model. I am also interested in how closely this IRS dataset tracks the US Census dataset. The number of tax returns in a zip code should be close to the number of people living in that area. I plot the number of returns versus the census population to check this correlation.
The correlation is as expected. Howver, there is also a trend based on the year which probably corresponds to the growth and decline in populations.
Adding Weather Data
I add historical weather data to my input features. This dataset comes from:
The dataset has the following variables:
AWND: average wind speed (m/s)
CLDD: cooling degree days
HTDD: heating degree days
PRCP: preciptiation (cm)
TSUN: daily total sunshine
TAVG: Average daily temperature (celsius) [Not Used]
The weather data is essentially the same for all of the zip codes in this dataset- they are all in the city of Los Angeles. Any small variation in the rain or temperature will be negligible because I only have the average over the full month of data.
AWND | CLDD | HTDD | PRCP | TAVG | TSUN | Date | |
---|---|---|---|---|---|---|---|
0 | 3.2 | 4.8 | 135.2 | 174.5 | 14.1 | 0.0 | 2005-01-01 |
1 | 3.5 | 0.0 | 106.5 | 176.6 | 14.5 | 0.0 | 2005-02-01 |
I merge this last dataset with my other features.
Water Use | Power Use | Date | Zip | Month | Year | ZPOP | ZAREA | ZAREALAND | Nreturns | AGI | SW | EIC | AWND | CLDD | HTDD | PRCP | TAVG | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16.70 | 396 | 2008-03-01 | 90230 | 3 | 2008 | 31766 | 11785759 | 11672688 | 15572 | 1008925 | 765127.0 | 3537.0 | 3.2 | 6.7 | 91.5 | 0.8 | 15.6 |
1 | 30.18 | 970 | 2008-03-01 | 90272 | 3 | 2008 | 22986 | 60557886 | 59133992 | 11165 | 3569670 | 1750980.0 | 355.0 | 3.2 | 6.7 | 91.5 | 0.8 | 15.6 |
There were weather data for all of the existing rows, so there wasn't any data lost in this step. I visualize the weather data and the water/power use data together to look for correlations.
It certainly looks like there are correlations here - at least between the Water Use/Precepitation and the Power Use/Average Temp datasets.
Data Transformations
Categorical Data Transformation
I have three columns that are categorical in nature: Zip, Month, and Year. The other columns are continuous variables. I re-define these three variables as dummy variables.
Water Use | Power Use | ZPOP | ZAREA | ZAREALAND | Nreturns | AGI | SW | EIC | AWND | ... | M_12 | Y_2005 | Y_2006 | Y_2007 | Y_2008 | Y_2009 | Y_2010 | Y_2011 | Y_2012 | Y_2013 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16.70 | 396 | 31766 | 11785759 | 11672688 | 15572 | 1008925 | 765127.0 | 3537.0 | 3.2 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | 30.18 | 970 | 22986 | 60557886 | 59133992 | 11165 | 3569670 | 1750980.0 | 355.0 | 3.2 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
2 rows × 169 columns
Data test/training split
I split my data into 80% training and 20% testing data sets. I do this before doing any further data transformations in order to avoid "data snooping".
Data Scaling
Almost all of my data columns are on different scales. I regularize the non-categorical data before fitting. Because only some of the columns need to be regularized, I split the data, regularize part of it, then re-join the training dataset. I do the same transformation to the test dataset.
I now separate out the features and the targets.
PCA
I tested reducing the dimensionality of the features using the PCA. However, this did not improve either the model performance or the time it takes to fit the models. So I will not be using PCA for this analysis.
Model Testing
My first model is to simply use the "prior year" data in order to predict the usage. I am using the Root-Mean-Square (RMS) as my metric for model performance. The RMS value gives me an estimate of the prediction error in the same units as the Water/Power use data.
Linear Regression
I next create a baseline predictive model using a linear regression using the same metrics
The linear regression model is already performing better than the 'last year' model. I follow this with further regression models.
Random Forest Regression
I next apply the random forest regression using the same train/test data.
The random forest has better performance than the linear model, though it took significantly longer.
Adaboost Regression
I look at two other models. First, the Adaboost ensemble method using the Decision Tree Regressor as the base model.
The performance and timing are both slightly better than the Random forest model. One final model:
Gradient Boosting Regressor
Model Optimization
The Adaboost model has the best performance for the Water Use data, but the Gradient Boosting regressor has the best performance for the Power Use model. Because they are two separate datasets, I use the best performing model for each one and tune the hyperparameters separately.
Water Use Model Hyperparameter Tuning
I tune the adaboost model hyperparameters, focusing on the minimum number of samples per leaf for the underlying decision tree regressor. I increased the number of estimators to reduce the noise in the trend data.
The best model performance comes around a minimum leaf size of 20. The model is overfitting with a smaller minimum leaf size and is underfitting with a larger minimum leaf size. Therefore, I run one final model for the water use data.
Power Use Model Optimization
I change to the Gradient Booting regression to optimize a model for predicting the power use. I again focus on tuning the minimum number of leaves hyperparameter.
Again, there is evidence that, for a minimum leaf size of less than 30, the model is overfitting. There is a soft minimum around 40 before the model underfits the data, so I use that parameter to train a final Power Use model.
Model Limitations
I investigate the limitations of each model - where do the models disagree from the actual data? I plot the residuals for the Water Use model first.
I investigate all points with a residual greater than 150.
Water Use | Power Use | Date | Zip | Month | Year | ZPOP | ZAREA | ZAREALAND | Nreturns | AGI | SW | EIC | AWND | CLDD | HTDD | PRCP | TAVG | Model Predictions | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9691 | 945.00 | 0 | 2011-07-01 | 91350 | 7 | 2011 | 33348 | 43948937 | 43886336 | 15077 | 1093595 | 908039.0 | 2387.0 | 3.6 | 57.9 | 0.8 | 0.0 | 20.2 | 574.852632 |
3839 | 908.40 | 0 | 2012-01-01 | 91350 | 1 | 2012 | 33348 | 43948937 | 43886336 | 15400 | 1158979 | 958999.0 | 2508.0 | 2.2 | 5.3 | 102.1 | 30.2 | 15.2 | 394.325714 |
18 | 415.00 | 0 | 2008-03-01 | 91201 | 3 | 2008 | 22781 | 6143646 | 6116160 | 10115 | 494601 | 369000.0 | 3797.0 | 3.2 | 6.7 | 91.5 | 0.8 | 15.6 | 583.333333 |
2110 | 277.00 | 0 | 2005-10-01 | 91350 | 10 | 2005 | 33348 | 43948937 | 43886336 | 13548 | 877019 | 743564.0 | 1370.0 | 2.7 | 29.1 | 29.4 | 25.7 | 18.3 | 434.602727 |
508 | 574.48 | 0 | 2008-02-01 | 90071 | 2 | 2008 | 15 | 319155 | 319155 | 639 | 364405 | 75762.0 | 33.0 | 3.0 | 2.0 | 134.7 | 55.2 | 13.8 | 730.676563 |
2942 | 0.00 | 0 | 2012-11-01 | 90211 | 11 | 2012 | 8434 | 1813980 | 1813980 | 5240 | 977691 | 477587.0 | 947.0 | 2.8 | 13.4 | 52.1 | 33.4 | 17.0 | 198.000000 |
682 | 501.00 | 0 | 2008-04-01 | 91201 | 4 | 2008 | 22781 | 6143646 | 6116160 | 10115 | 494601 | 369000.0 | 3797.0 | 3.4 | 40.9 | 79.6 | 0.8 | 17.1 | 742.842105 |
8095 | 634.50 | 0 | 2007-01-01 | 91350 | 1 | 2007 | 33348 | 43948937 | 43886336 | 15114 | 1073120 | 870460.0 | 1539.0 | 2.5 | 3.0 | 184.2 | 9.9 | 12.5 | 256.880656 |
8928 | 622.00 | 0 | 2007-02-01 | 90265 | 2 | 2007 | 18116 | 298041072 | 279184812 | 7969 | 2185356 | 748069.0 | 480.0 | 3.4 | 5.9 | 122.7 | 20.9 | 14.2 | 853.484516 |
8893 | 88.60 | 0 | 2007-11-01 | 91350 | 11 | 2007 | 33348 | 43948937 | 43886336 | 15114 | 1073120 | 870460.0 | 1539.0 | 2.4 | 13.7 | 72.2 | 12.7 | 16.4 | 278.685714 |
10655 | 0.00 | 0 | 2011-11-01 | 91201 | 11 | 2011 | 22781 | 6143646 | 6116160 | 10399 | 525338 | 394901.0 | 5178.0 | 2.5 | 4.5 | 102.2 | 43.0 | 15.1 | 156.121212 |
10522 | 0.00 | 0 | 2011-12-01 | 91201 | 12 | 2011 | 22781 | 6143646 | 6116160 | 10399 | 525338 | 394901.0 | 5178.0 | 2.0 | 0.0 | 172.2 | 17.0 | 12.8 | 156.883721 |
6233 | 652.80 | 0 | 2010-08-01 | 91350 | 8 | 2010 | 33348 | 43948937 | 43886336 | 14655 | 1054878 | 876876.0 | 2352.0 | 3.3 | 25.6 | 11.6 | 0.0 | 18.8 | 460.494444 |
10356 | 115.00 | 0 | 2011-06-01 | 91350 | 6 | 2011 | 33348 | 43948937 | 43886336 | 15077 | 1093595 | 908039.0 | 2387.0 | 3.4 | 6.2 | 26.7 | 0.5 | 17.6 | 797.615227 |
284 | 825.00 | 0 | 2008-12-01 | 90211 | 12 | 2008 | 8434 | 1813980 | 1813980 | 5438 | 894757 | 471144.0 | 780.0 | 2.6 | 0.0 | 174.7 | 63.8 | 12.7 | 625.862069 |
551 | 679.00 | 0 | 2008-01-01 | 90265 | 1 | 2008 | 18116 | 298041072 | 279184812 | 7837 | 1771931 | 682034.0 | 533.0 | 3.2 | 0.8 | 158.0 | 118.5 | 13.3 | 380.111111 |
152 | 854.67 | 0 | 2008-07-01 | 90265 | 7 | 2008 | 18116 | 298041072 | 279184812 | 7837 | 1771931 | 682034.0 | 533.0 | 3.3 | 83.0 | 0.0 | 0.0 | 21.0 | 1060.796800 |
Most of these points where the model does a poor job fitting the data come from only a handful of zip codes: 91350, 91201, 91265, and 90211. There may be other events or other unique problems with these locations that caused the model predictions to not match the actuals.
In the power model case, I investigate points where the residuals are greater than 200.
Water Use | Power Use | Date | Zip | Month | Year | ZPOP | ZAREA | ZAREALAND | Nreturns | AGI | SW | EIC | AWND | CLDD | HTDD | PRCP | TAVG | Model Predictions | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
520 | 30.68 | 343 | 2008-02-01 | 90731 | 2 | 2008 | 59662 | 40481090 | 23803092 | 23793 | 1149101 | 940785.0 | 9937.0 | 3.0 | 2.0 | 134.7 | 55.2 | 13.8 | 577.653810 |
2899 | 0.00 | 0 | 2012-08-01 | 91355 | 8 | 2012 | 32605 | 40487220 | 40079320 | 15080 | 1280204 | 950721.0 | 2222.0 | 3.2 | 124.7 | 0.0 | 0.0 | 22.4 | 482.980702 |
9089 | 31.76 | 1107 | 2007-06-01 | 91307 | 6 | 2007 | 24474 | 20860392 | 20836532 | 11961 | 1019116 | 688451.0 | 1508.0 | 3.0 | 13.1 | 16.9 | 0.0 | 18.2 | 903.294856 |
10879 | 16.83 | 626 | 2011-02-01 | 91355 | 2 | 2011 | 32605 | 40487220 | 40079320 | 15155 | 1363894 | 941778.0 | 2212.0 | 2.9 | 0.3 | 151.0 | 37.4 | 12.9 | 158.648852 |
7196 | 31.10 | 2002 | 2013-01-01 | 90210 | 1 | 2013 | 21741 | 26383345 | 26226245 | 9770 | 4939447 | 1504798.0 | 955.0 | 2.5 | 2.5 | 149.2 | 33.0 | 13.6 | 1764.628705 |
3089 | 0.00 | 1081 | 2012-09-01 | 91205 | 9 | 2012 | 37810 | 4898344 | 4897491 | 15790 | 594462 | 488257.0 | 9300.0 | 2.9 | 120.5 | 0.0 | 0.0 | 22.4 | 848.008836 |
2762 | 0.00 | 0 | 2012-03-01 | 91210 | 3 | 2012 | 328 | 113997 | 113997 | 200 | 31641 | 19477.0 | 0.0 | 3.4 | 4.0 | 139.3 | 45.3 | 14.0 | 268.884486 |
9594 | 0.00 | 0 | 2011-07-01 | 90245 | 7 | 2011 | 16654 | 15142094 | 14148053 | 9338 | 913373 | 669719.0 | 1191.0 | 3.6 | 57.9 | 0.8 | 0.0 | 20.2 | 291.448270 |
11008 | 30.00 | 1060 | 2011-05-01 | 91210 | 5 | 2011 | 328 | 113997 | 113997 | 215 | 35619 | 20750.0 | 0.0 | 3.9 | 16.1 | 60.7 | 13.5 | 16.9 | 538.781944 |
842 | 68.58 | 885 | 2008-10-01 | 91302 | 10 | 2008 | 25709 | 69273356 | 69143126 | 12150 | 2315019 | 1324983.0 | 890.0 | 2.6 | 79.6 | 12.8 | 0.0 | 20.5 | 1097.152344 |
10172 | 5.00 | 254 | 2011-01-01 | 90275 | 1 | 2011 | 41804 | 43402847 | 35004660 | 19040 | 2716592 | 1498530.0 | 1706.0 | 2.2 | 3.4 | 103.6 | 20.6 | 15.1 | 513.265376 |
3186 | 39.22 | 1477 | 2012-09-01 | 91436 | 9 | 2012 | 14372 | 14891780 | 14751608 | 9180 | 3782435 | 1814704.0 | 499.0 | 2.9 | 120.5 | 0.0 | 0.0 | 22.4 | 1706.908344 |
2167 | 47.71 | 1025 | 2005-07-01 | 91324 | 7 | 2005 | 27669 | 11437330 | 11373551 | 11136 | 715075 | 438407.0 | 2909.0 | 3.5 | 65.1 | 2.5 | 0.0 | 20.4 | 744.429354 |
8838 | 0.00 | 765 | 2007-11-01 | 90403 | 11 | 2007 | 24525 | 3944526 | 3697303 | 15272 | 1707337 | 956655.0 | 836.0 | 2.4 | 13.7 | 72.2 | 12.7 | 16.4 | 477.825001 |
10343 | 34.50 | 1224 | 2011-06-01 | 91210 | 6 | 2011 | 328 | 113997 | 113997 | 215 | 35619 | 20750.0 | 0.0 | 3.4 | 6.2 | 26.7 | 0.5 | 17.6 | 604.413090 |
5954 | 32.00 | 1494 | 2010-10-01 | 91210 | 10 | 2010 | 328 | 113997 | 113997 | 187 | 23612 | 15682.0 | 0.0 | 2.9 | 27.6 | 17.7 | 39.5 | 18.6 | 1163.261399 |
3871 | 68.34 | 2481 | 2006-07-01 | 90210 | 7 | 2006 | 21741 | 26383345 | 26226245 | 10497 | 5238507 | 1638406.0 | 433.0 | 3.6 | 159.3 | 0.0 | 2.5 | 23.5 | 2273.446871 |
1768 | 49.21 | 1055 | 2005-08-01 | 91324 | 8 | 2005 | 27669 | 11437330 | 11373551 | 11136 | 715075 | 438407.0 | 2909.0 | 3.5 | 79.2 | 0.0 | 0.0 | 20.9 | 837.734290 |
2513 | 22.42 | 310 | 2012-05-01 | 90731 | 5 | 2012 | 59662 | 40481090 | 23803092 | 25280 | 1240957 | 981814.0 | 14301.0 | 3.4 | 2.8 | 36.5 | 0.3 | 17.3 | 592.611504 |
9815 | 12.00 | 648 | 2011-04-01 | 91355 | 4 | 2011 | 32605 | 40487220 | 40079320 | 15155 | 1363894 | 941778.0 | 2212.0 | 3.7 | 9.5 | 63.5 | 0.0 | 16.5 | 323.366802 |
3896 | 50.47 | 1147 | 2006-07-01 | 91324 | 7 | 2006 | 27669 | 11437330 | 11373551 | 11373 | 705898 | 457689.0 | 2932.0 | 3.6 | 159.3 | 0.0 | 2.5 | 23.5 | 890.208032 |
The points where the power model fails are not as clearly grouped into a few locations, sizes, or any other obvious grouping. There may be a pattern where the HTDD or the CTDD are larger than typical, but that is not definitive.
Model Feature Importances
Finally, I look at the feature importances for both the water and power use models.
The most important features for determining water use are: the number of people, the level of low-income houses, overall economic status, the size of the zip code (more importantly the land area), and the amount of precipitation. Those all agree with my original hypothesis that population, economics, and weather all play important roles in determining water use.
The most important features for the power use model are population, economics, land area, then heating degree days, cooling degree dyas, and average temperature. Again, these agree with my hypothesis.
I calculated the accuracy of both the water use and the power use models.
Conclusion
I built machine learning models to predict the water use (in HCF) and power use (in kWh) for the city of Los Angeles. The models were based on data from July 2005 to June 2013. They cover the water and power usage for 145 different zip codes housing around 4.7 million people.
I compared a very simple initial model (that used the data from the previous year to predict the current month's usage) to machine learning models that utilize:
The zip code
The month and year
US Census population and land area data
IRS income tax return data (AGI, Salaries and Wages, and EIC)
NOAA weather data (Heating- and cooling-degree days, average temperature, wind, and precipitation).
I measured the model performace using the root-mean-squared of the differece between the model's predictions and the actual usage for approximately 20% of the dataset, reserved as a set of test data.
The initial model had the following RMS errors:
Water Use prediction RMS Error: 62.052 for the 'Last-year' Model
Power Use prediction RMS Error: 139.855 for the 'Last-year' Model
The best machine learning models had significantly improved meaures:
Water Use prediction RMS Error: 31.016 using an Adaboost Regressor
Power Use prediction RMS Error: 45.757 using a Gradient Boosting Regressor
Finally, I measured the relative error of the predictions of the machine learning models. This is calculated as
.
The power use model performed very well with 68% of the power use predictions scoring better than a 95% accuracy. The water use model did not perform quite as well; however, 72% of the water use predictions had better than a 60% accuracy.