CoCalc Public FilesSampleProject / Madsen_sample_project_draft_v4.ipynbOpen with one click!
Author: Thomas Isola
Views : 65
Description: Jupyter notebook SampleProject/Madsen_sample_project_draft_v4.ipynb
Compute Environment: Ubuntu 18.04 (Deprecated)

Predicting Water and Power Usage for the city of Los Angeles, California

An example project for PHY178/CSC171

M. J. Madsen

I present here an example final project. This project is not intended to be a carbon-copy template for your project, but rather it should give you and idea of the general shape and style of work that I am looking for. Note that I am following the general outline and instructions for this sample project.

In [61]:
## Library Imports import pandas as pd import matplotlib.pyplot as plt import numpy as np from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression from sklearn.tree import DecisionTreeRegressor from sklearn.ensemble import RandomForestRegressor from sklearn.ensemble import AdaBoostRegressor from sklearn.ensemble import GradientBoostingRegressor from sklearn.covariance import EmpiricalCovariance from sklearn.preprocessing import StandardScaler from sklearn.decomposition import PCA from time import process_time import seaborn as sns sns.set_style("white")

Preliminary Data Exploration

I am using data from the city of Los Angeles originally found from this website.

https://data.lacity.org/A-Livable-and-Sustainable-City/Water-and-Electric-Usage-from-2005-2013/asvq-scwp

The dataset has four columns of interest:

  1. The month in which the data was recorded
  2. The zip code (geographical location) for the data
  3. The water use for that zip code (measured in HCF)
  4. The power use for that zip code (measure in kWh)

The raw data has another column (Value Date) that we will not be using. I first examine the raw data.

In [2]:
dfv1 = pd.read_csv("Water_and_Electric_Usage_from_2005_-_2013.csv") dfv1.head(2)
Text Date Value Date Zip Code Water Use Power Use
0 Mar_2008 Mar-08 90230\n(33.99506171100046, -118.39500957899969) 16.70 396
1 Jul_2011 Jul-11 90272\n(34.04886156900045, -118.53572692799969) 35.73 1013

Before moving forward, there are some obvious data cleaning steps that need to be taken. I change the date from text to a datetime object and extract the 5-digit zip code from the Zip Code column. Finally, I drop the unneeded columns from the dataframe.

In [3]:
dfv1['Date'] = pd.to_datetime(dfv1["Text Date"], format="%b_%Y") dfv1['Zip'] = dfv1['Zip Code'].str.extract("(.*)\n", expand=True).astype('int') dfv2 = dfv1.drop(['Text Date', 'Value Date','Zip Code'],axis=1) dfv2.head(2)
Water Use Power Use Date Zip
0 16.70 396 2008-03-01 90230
1 35.73 1013 2011-07-01 90272

The goal of this project is to predict the water and power use. I plot the totals for each date.

In [4]:
dfsum = dfv2.groupby('Date').sum() dfsum.reset_index(inplace=True) f, (ax1, ax2) = plt.subplots(2, 1, sharex=True) ax1.plot(dfsum['Date'], dfsum['Water Use'], marker='o', linestyle='', ms=8) ax1.set_ylabel('Water Use (HCF)') ax2.plot(dfsum['Date'], dfsum['Power Use'], marker='o', linestyle='', ms=8) plt.xlabel('date') ax2.set_ylabel('Power Use (kWh)') plt.tight_layout()

Data Cleaning

There is an extra point at the end that is separated from the rest of the dataset. I look specifically for times after Jan 1, 2013.

In [5]:
dfsum[dfsum['Date'] > pd.Timestamp('2013-01-01')]
Date Water Use Power Use Zip
90 2013-02-01 1349.45 54265 13136208
91 2013-03-01 1640.86 55289 13136208
92 2013-04-01 1788.75 52367 13136208
93 2013-05-01 2037.96 57791 13136208
94 2013-06-01 2112.38 62299 13136208
95 2013-12-01 1398.03 64430 13136208

There are missing data points between July 2013 and November 2013. I cut the data off at June 30, 2013.

In [6]:
dfv3 = dfv2[dfv2['Date']<pd.Timestamp('2013-07-01')] dfsum = dfv3.groupby('Date').sum() dfsum.reset_index(inplace=True)

Before moving forward, it looks like there was a major change in water usage around the mid-point of 2012. However, when I look at the data grouped by zip code, I see that the drop is due to a change in the number of reported zip codes that happened in the middle of 2012.

In [7]:
# Group data by zip code groups = dfv3.groupby('Zip') # Plot f, (ax1, ax2) = plt.subplots(2, 1, sharex=True) ax1.margins(0.05) # Optional, just adds 5% padding to the autoscaling # The next step is to cycle through the groups (based on our categories) and plot each one on the same axis. for name, group in groups: ax1.plot(group['Date'], group['Water Use'], marker='o', linestyle='', ms=8, label=name) #ax.set_aspect(1) #break #ax.legend(bbox_to_anchor=(1,0.5)) ax1.set_ylabel('Water Use') ax2.margins(0.05) # Optional, just adds 5% padding to the autoscaling # The next step is to cycle through the groups (based on our categories) and plot each one on the same axis. for name, group in groups: ax2.plot(group['Date'], group['Power Use'], marker='o', linestyle='', ms=8, label=name) #ax.set_aspect(1) #break #ax.legend(bbox_to_anchor=(1,0.5)) plt.xlabel('date') ax2.set_ylabel('Power Use') plt.tight_layout()

Because my fits will be by zip code, this should not be an issue. However, there are a handfull of possible outliers in the water use data.

Outlier Detection

I utilize the Mahalanobis distance to see how far from the average of the data the potential outliers are.

In [8]:
# Measure the mahalanobis distance. X=dfv3[['Water Use']].values emp_cov = EmpiricalCovariance().fit(X) mahal_dist = np.sqrt(emp_cov.mahalanobis(X)) # Visualize the results plt.plot(mahal_dist,marker='o',linestyle='') plt.ylabel('Mahalanobis Distance')
<matplotlib.text.Text at 0x7f3e7c6c1160>

There are about 8 points that stand out as being very far from the rest of the 12,000 points (over 15 standard deviations). All of the points occur in two short time periods in the same zip code.

In [9]:
dfv3["M_dist"] = mahal_dist dfv3[dfv3["M_dist"]>15].sort_values('Date')
/projects/anaconda3/lib/python3.5/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy if __name__ == '__main__':
Water Use Power Use Date Zip M_dist
683 2512.2 0 2008-03-01 91350 18.930970
10991 2407.0 0 2008-04-01 91350 18.121215
8 2694.8 0 2008-05-01 91350 20.336495
4406 3231.6 0 2009-07-01 91350 24.468400
10328 3235.6 0 2009-08-01 91350 24.499190
12452 3110.4 0 2009-09-01 91350 23.535489
7301 3202.4 0 2009-10-01 91350 24.243640
3227 3074.8 0 2009-11-01 91350 23.261465

There may have been something unusual happening during those time periods. I drop these points from the dataset as error outliers.

In [10]:
dfv4 = dfv3[dfv3["M_dist"]<15].drop('M_dist',axis=1).reset_index(drop=True) dfv4.head(2)
Water Use Power Use Date Zip
0 16.70 396 2008-03-01 90230
1 35.73 1013 2011-07-01 90272

Data Enrichment

There is a definite month/year periodicity in the data. I add month and year data features to the model. I visualize the data to and see that the periodicity is clearly visible in both the water use and the power use data.

In [11]:
dfv4['Month'] = dfv4['Date'].apply(lambda x: x.month) dfv4['Year'] = dfv4['Date'].apply(lambda x: x.year) dfsum = dfv4.groupby('Month').sum() dfsum.reset_index(inplace=True) f, (ax1, ax2) = plt.subplots(2, 1, sharex=True) ax1.plot(dfsum['Month'], dfsum['Water Use'], marker='o', linestyle='', ms=8) ax1.set_ylabel('Water Use (HCF)') ax2.plot(dfsum['Month'], dfsum['Power Use'], marker='o', linestyle='', ms=8) plt.xlabel('month') ax2.set_ylabel('Power Use (kWh)') plt.tight_layout()

Both the water and the power usage go up in the summer months, which makes sense. There is also a power spike in January that may be related to heating on cold days.

Enriching Data with other Datasets

I hypothesize that the water and power useage depends on the following additional factors:

  1. Population of each zip code
  2. Economic status of each zip code (weath distribution)
  3. Weather data

I join the original dataset with three additonal datasets to add these features.

Adding Population Data

The population data was retrieved from the US Census Bureau.

https://www.census.gov/geo/maps-data/data/zcta_rel_download.html

The dataset includes information about the population, the land area, and the total area for each zip code. There are a number of other columns of data that I will not use. I convert the zip code column (ZCTA5) to an integer in order to match the datatype of my power/water usage dataframe zip code.

In [12]:
dfzip = pd.read_csv("zcta_county_rel_10.txt",dtype={'ZCTA5':'object'}) dfzip['ZCTA5'] = dfzip['ZCTA5'].astype(int) dfzip.head(2)
ZCTA5 STATE COUNTY GEOID POPPT HUPT AREAPT AREALANDPT ZPOP ZHU ... COAREA COAREALAND ZPOPPCT ZHUPCT ZAREAPCT ZAREALANDPCT COPOPPCT COHUPCT COAREAPCT COAREALANDPCT
0 601 72 1 72001 18465 7695 165132671 164333375 18570 7744 ... 173777444 172725651 99.43 99.37 98.61 98.6 94.77 94.71 95.03 95.14
1 601 72 141 72141 105 49 2326414 2326414 18570 7744 ... 298027589 294039825 0.57 0.63 1.39 1.4 0.32 0.35 0.78 0.79

2 rows × 24 columns

Because the data in this set are intended to merge other datasets, there are multiple entries for each zip code. The "Z" columns are totals for each zip code, so I only need one of them - I use the max aggregation. I cut the unneeded columns, keeping only the zip code, population, total area, and land area columns.

In [13]:
dfzipgroups = (dfzip[['ZCTA5','ZPOP','ZAREA','ZAREALAND']].groupby("ZCTA5").max()) dfzipgroups.reset_index(inplace=True) dfzipgroups.head(2)
ZCTA5 ZPOP ZAREA ZAREALAND
0 601 18570 167459085 166659789
1 602 41520 83734431 79288158

To get a sense of the dataset, I look at the relationship between the land area and the population.

In [14]:
dfzipgroups.plot.scatter(x='ZAREALAND',y='ZPOP')
<matplotlib.axes._subplots.AxesSubplot at 0x7f3e74bedcf8>

There is an interesting anti-correlation here. But perhaps that isn't too surprising, as larger population densities will have more zip codes associated with them, so most of the data would be in small land area, large populations.

Now I merge this will the resource use database, keeping only the columns that I need after the join.

In [15]:
dfv5 = pd.merge(dfv4,dfzipgroups,left_on="Zip",right_on="ZCTA5") print("{} rows lost in data merge.".format(len(dfv4.index)-len(dfv5.index))) dfv5.drop(['ZCTA5'],axis=1,inplace=True) dfv5.head(2)
950 rows lost in data merge.
Water Use Power Use Date Zip Month Year ZPOP ZAREA ZAREALAND
0 16.70 396 2008-03-01 90230 3 2008 31766 11785759 11672688
1 17.59 407 2005-12-01 90230 12 2005 31766 11785759 11672688

There were a number of rows lost in the merge- these were zip codes that were not in the US Census database.

I graph the new inputs to see how the Water and Power use depend on them.

In [16]:
f, (ax1, ax2) = plt.subplots(2, 3, sharey=True) ax1[0].plot(dfv5['ZPOP'], dfv5['Water Use'], marker='o', linestyle='', ms=8) ax1[0].set_ylabel('Water Use (HCF)') ax2[0].plot(dfv5['ZPOP'], dfv5['Power Use'], marker='o', linestyle='', ms=8) ax2[0].set_ylabel('Power Use (kWh)') ax2[0].set_xlabel('Population') ax1[1].plot(dfv5['ZAREA'], dfv5['Water Use'], marker='o', linestyle='', ms=8) ax2[1].plot(dfv5['ZAREA'], dfv5['Power Use'], marker='o', linestyle='', ms=8) ax2[1].set_xlabel('Area') ax1[2].plot(dfv5['ZAREALAND'], dfv5['Water Use'], marker='o', linestyle='', ms=8) ax2[2].plot(dfv5['ZAREALAND'], dfv5['Power Use'], marker='o', linestyle='', ms=8) ax2[2].set_xlabel('Land Area') plt.tight_layout()

It looks like the power use is more strongly correlation with the census and land area data.

Adding Economic Data

I add the economic data which is based on IRS tax records:

https://www.irs.gov/uac/soi-tax-stats-individual-income-tax-statistics-zip-code-data-soi

This dataset has the following variables I am interested in:

  • Nreturns: number of filed tax returns
  • AGI: adjusted gross income (in thousands of $)
  • SW: Salary and Wages (in thousands of $)
  • EIC: total earned income tax credit (in thousands of $)

However, the dataset is originally an Excel spreadsheet with multiple sheet names. I created a short function to retrieve the tax data I want.

In [17]:
def getTaxYear(year): # This function reads in a single sheet from the Excel document, then returns it as a dataframe. dftemp = pd.read_excel("allCAtaxdata2005-2013.xlsx",sheetname='{}'.format(year),thousands=",",na_values=["*","* ",".","-"]) dftemp['Zip Code'] = dftemp['Zip Code'].astype('int') dftemp['Year'] = year return dftemp dftaxes = pd.concat([getTaxYear(x) for x in range(2005,2014)]) # Drop NaNs from data dftaxes.dropna(inplace=True) dftaxes.head(2)
Zip Code Nreturns AGI SW EIC Year
0 90001 17313 406784 358687.0 16162.0 2005
1 90002 14712 331533 294916.0 16130.0 2005

I visualize the economic data in the following plot.

In [18]:
dfgroup = dftaxes.groupby('Year').sum() dfgroup.reset_index(inplace=True) plt.plot(dfgroup['Year'],dfgroup['AGI'],label='AGI') plt.plot(dfgroup['Year'],dfgroup['SW'],label='SW') plt.plot(dfgroup['Year'],dfgroup['EIC'],label='EIC') plt.legend(bbox_to_anchor=(1.2,0.5)) plt.gca().set_yscale('log') plt.gca().get_xaxis().get_major_formatter().set_useOffset(False) plt.gcf().autofmt_xdate() plt.ylabel('IRS Data (1000$)')
<matplotlib.text.Text at 0x7f3e74ac0860>

I merge the IRS dataset with the population and usage dataset.

In [19]:
dfv6 = pd.merge(dfv5,dftaxes,how="inner",left_on=["Zip","Year"],right_on=["Zip Code","Year"]) print("{} rows lost in data merge.".format(len(dfv5.index)-len(dfv6.index))) dfv6.drop('Zip Code',axis=1,inplace=True) dfv6.head(2)
190 rows lost in data merge.
Water Use Power Use Date Zip Month Year ZPOP ZAREA ZAREALAND Nreturns AGI SW EIC
0 16.70 396 2008-03-01 90230 3 2008 31766 11785759 11672688 15572 1008925 765127.0 3537.0
1 20.95 368 2008-07-01 90230 7 2008 31766 11785759 11672688 15572 1008925 765127.0 3537.0

There is an additional data loss in this merge. However, I still have most of my original dataset.

In [20]:
print("Remaining data (from original dataset): {0:.1f}%".format(float(len(dfv6.index))/(len(dfv1.index))*100))
Remaining data (from original dataset): 90.7%

I visualize the combined dataset.

In [21]:
# Plot usage vs Adjusted Gross Income f, (ax1, ax2) = plt.subplots(2, 2, sharey=True) ax1[0].scatter(dfv6['AGI'],dfv6['Water Use']) ax1[0].set_ylabel("Water Use") ax2[0].scatter(dfv6['AGI'],dfv6['Power Use']) ax2[0].set_ylabel("Power Use") ax2[0].set_xlabel("Adjusted Gross Income") ax2[0].set_xlim(0,max(dfv6['AGI'])) ax1[1].scatter(dfv6['EIC'],dfv6['Water Use']) ax2[1].scatter(dfv6['EIC'],dfv6['Power Use']) ax2[1].set_xlabel("Earned Income Credit") ax2[1].set_xlim(0,max(dfv6['EIC']))
(0, 46370.0)

Again, there aren't strong correlations here, but the correlations that exist may improve the model. I am also interested in how closely this IRS dataset tracks the US Census dataset. The number of tax returns in a zip code should be close to the number of people living in that area. I plot the number of returns versus the census population to check this correlation.

In [22]:
#We'll use a different tool to plot the data now that we know how to group the data by a category. This will help us make better combined plots later on. groups = dfv6.groupby('Year') # Plot trainfig, ax = plt.subplots() #ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling # The next step is to cycle through the groups (based on our categories) and plot each one on the same axis. colors = iter(plt.cm.rainbow(np.linspace(0, 1, len(groups)))) for name, group in groups: ax.plot(group['ZPOP'],group['Nreturns'],'o',label=name,color=next(colors)) ax.legend(bbox_to_anchor=(1,0.5)) ax.set_ylabel('N Returns') ax.set_xlabel('2010 Cenus Population')
<matplotlib.text.Text at 0x7f3e76862780>

The correlation is as expected. Howver, there is also a trend based on the year which probably corresponds to the growth and decline in populations.

Adding Weather Data

I add historical weather data to my input features. This dataset comes from:

https://www.ncdc.noaa.gov/cdo-web/search?datasetid=GSOM

The dataset has the following variables:

  • AWND: average wind speed (m/s)
  • CLDD: cooling degree days
  • HTDD: heating degree days
  • PRCP: preciptiation (cm)
  • TSUN: daily total sunshine
  • TAVG: Average daily temperature (celsius) [Not Used]

The weather data is essentially the same for all of the zip codes in this dataset- they are all in the city of Los Angeles. Any small variation in the rain or temperature will be negligible because I only have the average over the full month of data.

In [23]:
weatherdf = pd.read_csv("LAX_weather.csv",parse_dates=[2]) weatherdf.fillna(0.0, inplace=True) weatherdf['Date']=pd.to_datetime(weatherdf['DATE']) weatherdf.drop(['STATION', 'NAME','DATE'],axis=1, inplace=True) weatherdf.head(2)
AWND CLDD HTDD PRCP TAVG TSUN Date
0 3.2 4.8 135.2 174.5 14.1 0.0 2005-01-01
1 3.5 0.0 106.5 176.6 14.5 0.0 2005-02-01
In [24]:
# Plot weather data f, (ax1, ax2) = plt.subplots(2, 3, sharex=True) ax1[0].plot(weatherdf['Date'],weatherdf['AWND']) ax1[0].set_ylabel("AWND") ax2[0].plot(weatherdf['Date'],weatherdf['CLDD']) ax2[0].set_ylabel("CLDD") ax1[1].plot(weatherdf['Date'],weatherdf['HTDD']) ax1[1].set_ylabel("HTDD") ax2[1].plot(weatherdf['Date'],weatherdf['PRCP']) ax2[1].set_ylabel("PRCP") ax1[2].plot(weatherdf['Date'],weatherdf['TAVG']) ax1[2].set_ylabel("TAVG") ax2[2].plot(weatherdf['Date'],weatherdf['TSUN']) ax2[2].set_ylabel("TSUN") plt.tight_layout() plt.gcf().autofmt_xdate()

I merge this last dataset with my other features.

In [25]:
dfv7 = pd.merge(dfv6, weatherdf, on='Date') print("{} rows lost in data merge.".format(len(dfv6.index)-len(dfv7.index))) dfv7.drop(['TSUN'],axis=1,inplace=True) dfv7.head(2)
0 rows lost in data merge.
Water Use Power Use Date Zip Month Year ZPOP ZAREA ZAREALAND Nreturns AGI SW EIC AWND CLDD HTDD PRCP TAVG
0 16.70 396 2008-03-01 90230 3 2008 31766 11785759 11672688 15572 1008925 765127.0 3537.0 3.2 6.7 91.5 0.8 15.6
1 30.18 970 2008-03-01 90272 3 2008 22986 60557886 59133992 11165 3569670 1750980.0 355.0 3.2 6.7 91.5 0.8 15.6

There were weather data for all of the existing rows, so there wasn't any data lost in this step. I visualize the weather data and the water/power use data together to look for correlations.

In [26]:
# Plot usage vs Adjusted Gross Income f, (ax1, ax2) = plt.subplots(2, 2, sharey=True) ax1[0].scatter(dfv7['TAVG'],dfv7['Water Use']) ax1[0].set_ylabel("Water Use") ax2[0].scatter(dfv7['TAVG'],dfv7['Power Use']) ax2[0].set_ylabel("Power Use") ax2[0].set_xlabel("Average Temp") ax1[1].scatter(dfv7['PRCP'],dfv7['Water Use']) ax1[1].set_xlim(0,max(dfv7['PRCP'])) ax2[1].scatter(dfv7['PRCP'],dfv7['Power Use']) ax2[1].set_xlabel("Precipitation") ax2[1].set_xlim(0,max(dfv7['PRCP']))
(0, 224.19999999999999)

It certainly looks like there are correlations here - at least between the Water Use/Precepitation and the Power Use/Average Temp datasets.

Data Transformations

Categorical Data Transformation

I have three columns that are categorical in nature: Zip, Month, and Year. The other columns are continuous variables. I re-define these three variables as dummy variables.

In [27]:
zipdummy = pd.get_dummies(dfv7['Zip'],prefix='Z') monthdummy = pd.get_dummies(dfv7['Month'],prefix='M') yeardummy = pd.get_dummies(dfv7['Year'],prefix='Y') dfv8 = dfv7.join(zipdummy).join(monthdummy).join(yeardummy) dfv8.drop(['Date','Zip','Month','Year'],inplace=True,axis=1) dfv8.head(2)
Water Use Power Use ZPOP ZAREA ZAREALAND Nreturns AGI SW EIC AWND ... M_12 Y_2005 Y_2006 Y_2007 Y_2008 Y_2009 Y_2010 Y_2011 Y_2012 Y_2013
0 16.70 396 31766 11785759 11672688 15572 1008925 765127.0 3537.0 3.2 ... 0 0 0 0 1 0 0 0 0 0
1 30.18 970 22986 60557886 59133992 11165 3569670 1750980.0 355.0 3.2 ... 0 0 0 0 1 0 0 0 0 0

2 rows × 169 columns

Data test/training split

I split my data into 80% training and 20% testing data sets. I do this before doing any further data transformations in order to avoid "data snooping".

In [28]:
train, test = train_test_split(dfv8, test_size=0.2, random_state=23)

Data Scaling

Almost all of my data columns are on different scales. I regularize the non-categorical data before fitting. Because only some of the columns need to be regularized, I split the data, regularize part of it, then re-join the training dataset. I do the same transformation to the test dataset.

In [29]:
# Fit the scaler std_scaler = StandardScaler().fit(train.ix[:,2:14]) # Apply the transformation to the train and test datasets train_std = pd.DataFrame(std_scaler.transform(train.ix[:,2:14]),columns=train.columns[2:14]) test_std = pd.DataFrame(std_scaler.transform(test.ix[:,2:14]),columns=test.columns[2:14]) # Recombine the scaled datasets trainToMerge=train.reset_index(drop=True) train_scaled=trainToMerge.ix[:,0:2].merge(train_std.merge(trainToMerge.ix[:,15:],left_index=True,right_index=True),left_index=True,right_index=True) testToMerge=test.reset_index(drop=True) test_scaled=testToMerge.ix[:,0:2].merge(test_std.merge(testToMerge.ix[:,15:],left_index=True,right_index=True),left_index=True,right_index=True)

I now separate out the features and the targets.

In [30]:
train_features = train_scaled.ix[:,2:].values test_features = test_scaled.ix[:,2:].values water_target = train_scaled['Water Use'].values water_actual = test_scaled['Water Use'].values power_target = train_scaled['Power Use'].values power_actual = test_scaled['Power Use'].values

PCA

I tested reducing the dimensionality of the features using the PCA. However, this did not improve either the model performance or the time it takes to fit the models. So I will not be using PCA for this analysis.

Model Testing

My first model is to simply use the "prior year" data in order to predict the usage. I am using the Root-Mean-Square (RMS) as my metric for model performance. The RMS value gives me an estimate of the prediction error in the same units as the Water/Power use data.

In [188]:
dfoffset = dfv4 dfoffset['Prior Year'] = dfoffset['Year'].apply(lambda year: year - 1) df_new = dfv4.merge(dfoffset, left_on=['Prior Year', 'Month','Zip'], right_on=['Year','Month','Zip'],suffixes=['','_Prior']) df_new.drop(['Year_Prior','Prior Year_Prior','Prior Year'],inplace=True,axis=1) print("{} rows lost in data merge.".format(len(dfv4.index)-len(df_new.index)))
1748 rows lost in data merge.
In [189]:
print("Water RMS Error: {0:.3f} for 'Last-year' Model".format( np.sqrt(np.mean( (df_new['Water Use'].values - df_new['Water Use_Prior'].values)**2)) )) print("Power RMS Error: {0:.3f} for 'Last-year' Model".format( np.sqrt(np.mean( (df_new['Power Use'].values - df_new['Power Use_Prior'].values)**2)) ))
Water RMS Error: 62.052 for 'Last-year' Model Power RMS Error: 139.855 for 'Last-year' Model

Linear Regression

I next create a baseline predictive model using a linear regression using the same metrics

In [31]:
def fitAndPlot(train_features, test_features, model, **extraArgs): # Create linear regression objects for water and power water_model= model(**extraArgs) power_model= model(**extraArgs) # Timing start = process_time() # Fit the data water_model.fit(train_features,water_target) power_model.fit(train_features,power_target) fit_time = process_time() - start start = process_time() # Get the predictions wpredictions = water_model.predict(test_features) ppredictions = power_model.predict(test_features) predict_time = process_time() - start # Plot the actuals and the predictions f, (ax1, ax2) = plt.subplots(2, 1) ax1.scatter(water_actual, wpredictions, color='black') ax1.set_ylabel('Water Model Predictions') ax1.set_xlabel('Actuals') ax1.set_xlim(0,max(water_actual)) #Plot the slope=1 line for reference X=np.linspace(ax1.get_xlim()[0], ax1.get_xlim()[1], 100) ax1.plot(X,X,linestyle='--') ax2.scatter(power_actual, ppredictions, color='black') ax2.set_ylabel('Power Model Predictions') ax2.set_xlabel('Actuals') ax2.set_xlim(0,max(power_actual)) #Plot the slope=1 line for reference X=np.linspace(ax2.get_xlim()[0], ax2.get_xlim()[1], 100) ax2.plot(X,X,linestyle='--') plt.tight_layout() # Get the RMS values print("Water RMS Error: {0:.3f} for {1}".format( np.sqrt(np.mean((wpredictions - water_actual) ** 2)),model.__name__)) print("Power RMS Error: {0:.3f} for {1}".format( np.sqrt(np.mean((ppredictions - power_actual) ** 2)),model.__name__)) print("Fit Time: {} seconds".format(fit_time)) print("Predict Time: {} seconds".format(predict_time)) return (water_model, power_model) water_model,power_model = fitAndPlot(train_features, test_features,LinearRegression)
Water RMS Error: 44.844 for LinearRegression Power RMS Error: 85.875 for LinearRegression Fit Time: 1.3261661869999983 seconds Predict Time: 0.01653146099999958 seconds

The linear regression model is already performing better than the 'last year' model. I follow this with further regression models.

Random Forest Regression

I next apply the random forest regression using the same train/test data.

In [32]:
water_model,power_model = fitAndPlot(train_features,test_features, RandomForestRegressor, n_estimators=300, min_samples_leaf=10,random_state=32)
Water RMS Error: 32.742 for RandomForestRegressor Power RMS Error: 66.054 for RandomForestRegressor Fit Time: 82.216455096 seconds Predict Time: 0.30887336899999696 seconds

The random forest has better performance than the linear model, though it took significantly longer.

Adaboost Regression

I look at two other models. First, the Adaboost ensemble method using the Decision Tree Regressor as the base model.

In [59]:
water_model2,power_model2 = fitAndPlot(train_features, test_features, AdaBoostRegressor, base_estimator=DecisionTreeRegressor(min_samples_leaf=10), n_estimators=300,random_state=32,loss='square')
Water RMS Error: 31.656 for AdaBoostRegressor Power RMS Error: 57.677 for AdaBoostRegressor Fit Time: 74.75545528799921 seconds Predict Time: 0.4766953270009253 seconds

The performance and timing are both slightly better than the Random forest model. One final model:

Gradient Boosting Regressor

In [36]:
water_model3,power_model3 = fitAndPlot(train_features, test_features, GradientBoostingRegressor,min_samples_leaf=10, n_estimators=500,random_state=32,max_depth=200,verbose=1)
Iter Train Loss Remaining Time 1 9026.1864 5.67m 2 7532.4162 5.65m 3 6313.0380 5.65m 4 5318.0338 5.74m 5 4502.7267 5.72m 6 3838.5703 5.78m 7 3295.1466 5.66m 8 2843.7231 5.65m 9 2476.8513 5.72m 10 2171.9484 5.71m 20 879.0091 5.94m 30 584.7577 6.20m 40 440.3867 6.37m 50 333.8454 6.53m 60 262.5081 6.76m 70 208.6896 6.76m 80 167.7968 6.71m 90 138.7181 6.69m 100 116.2387 6.59m 200 22.5917 5.50m 300 5.2009 3.83m 400 1.5141 1.97m 500 0.5305 0.00s Iter Train Loss Remaining Time 1 72742.7815 3.30m 2 59565.0999 3.37m 3 48890.1988 3.48m 4 40206.0674 3.48m 5 33137.9319 3.52m 6 27406.6225 3.55m 7 22723.0439 3.53m 8 18899.7597 3.52m 9 15794.2527 3.56m 10 13260.4569 3.69m 20 3323.1769 3.83m 30 1711.0881 4.23m 40 1234.3053 4.42m 50 984.0259 4.96m 60 797.3781 5.32m 70 651.6153 5.83m 80 538.6646 6.00m 90 451.1444 6.03m 100 384.7462 6.10m 200 108.3785 5.43m 300 38.9381 3.91m 400 17.0833 1.99m 500 9.3090 0.00s Water RMS Error: 34.926 for GradientBoostingRegressor Power RMS Error: 47.818 for GradientBoostingRegressor Fit Time: 1173.0020852920002 seconds Predict Time: 0.5419542839999849 seconds

Model Optimization

The Adaboost model has the best performance for the Water Use data, but the Gradient Boosting regressor has the best performance for the Power Use model. Because they are two separate datasets, I use the best performing model for each one and tune the hyperparameters separately.

Water Use Model Hyperparameter Tuning

I tune the adaboost model hyperparameters, focusing on the minimum number of samples per leaf for the underlying decision tree regressor. I increased the number of estimators to reduce the noise in the trend data.

In [73]:
min_leaf_grid = range(1,41,4) rms_scores = [] for min_leaf in min_leaf_grid: # Create linear regression objects for water and power print("Working min_leaf {}".format(min_leaf)) water_model= AdaBoostRegressor(base_estimator=DecisionTreeRegressor(min_samples_leaf=min_leaf), n_estimators=1200,random_state=32,loss='square') # Timing start = process_time() # Fit the data water_model.fit(train_features,water_target) fit_time = process_time() - start start = process_time() # Get the predictions wpredictions = water_model.predict(test_features) predict_time = process_time() - start rms_score = np.sqrt(np.mean((wpredictions - water_actual) ** 2)) rms_scores.append(rms_score) plt.plot(min_leaf_grid, rms_scores,marker='o') plt.xlabel("Minimum Number of Leaves") plt.ylabel("RMS Error")
<matplotlib.text.Text at 0x7f3e7409a208>

The best model performance comes around a minimum leaf size of 20. The model is overfitting with a smaller minimum leaf size and is underfitting with a larger minimum leaf size. Therefore, I run one final model for the water use data.

In [75]:
final_water_model= AdaBoostRegressor(base_estimator=DecisionTreeRegressor(min_samples_leaf=20), n_estimators=1200,random_state=32,loss='square') # Timing start = process_time() # Fit the data final_water_model.fit(train_features,water_target) fit_time = process_time() - start start = process_time() # Get the predictions wpredictions = final_water_model.predict(test_features) predict_time = process_time() - start rms_score = np.sqrt(np.mean((wpredictions - water_actual) ** 2)) plt.scatter(water_actual, wpredictions, color='black') ax1 = plt.gca() ax1.set_ylabel('Water Model Predictions') ax1.set_xlabel('Actuals') ax1.set_xlim(0,max(water_actual)) #Plot the slope=1 line for reference X=np.linspace(ax1.get_xlim()[0], ax1.get_xlim()[1], 100) ax1.plot(X,X,linestyle='--') # Get the RMS values print("Water RMS Error: {0:.3f} for Adaboost Regressor".format( np.sqrt(np.mean((wpredictions - water_actual) ** 2)))) print("Fit Time: {} seconds".format(fit_time)) print("Predict Time: {} seconds".format(predict_time))
Water RMS Error: 31.016 for Adaboost Regressor Fit Time: 90.67291085299985 seconds Predict Time: 0.774193978000767 seconds

Power Use Model Optimization

I change to the Gradient Booting regression to optimize a model for predicting the power use. I again focus on tuning the minimum number of leaves hyperparameter.

In [90]:
min_leaf_grid = range(1,81,10) rms_scores = [] for min_leaf in min_leaf_grid: # Create linear regression objects for water and power print("Working min_leaf {}".format(min_leaf)) power_model= GradientBoostingRegressor(min_samples_leaf=min_leaf, n_estimators=500,random_state=32,max_depth=200) # Timing start = process_time() # Fit the data power_model.fit(train_features,power_target) fit_time = process_time() - start start = process_time() # Get the predictions ppredictions = power_model.predict(test_features) predict_time = process_time() - start rms_score = np.sqrt(np.mean((ppredictions - power_actual) ** 2)) rms_scores.append(rms_score) plt.plot(min_leaf_grid, rms_scores,marker='o') plt.xlabel("Minimum Number of Leaves") plt.ylabel("RMS Error") plt.plot(min_leaf_grid, rms_scores,marker='o') plt.xlabel("Minimum Number of Leaves") plt.ylabel("RMS Error")
<matplotlib.text.Text at 0x7f3e6ff86128>