Welcome to Cask Studies, where you can properly age your skills without getting old. Even sour grapes can become fine wines here.
Electric Bill Savings (Regression – Binary, Subsampling)
A real case study by Gregory Taketa. Non-data managers can skim this document to see how data analysis helps in hidden ways. Meanwhile, regression practitioners can practice using binary and control variables with an MS Excel data set to approach a realistic problem.
A household of four in the San Francisco Bay Area is in the highest tier of electricity usage, and its members want to reduce their consumption to lower the utility bill.
On the advice of Pacific Gas and Electric Company (PG&E), the utility supplier, one member decides to unplug a number of appliances. The rationale is that many appliances still draw standby power while plugged in, even when they are switched off. In general, expected savings are 5%, or about $60/year.
Unfortunately, implementation is not so simple. While one member unplugs many appliances, the other members are disgruntled at having yet another (3-second) chore: plugging their appliances back in (e.g. the portable TV).
Because of this tradeoff, the energy conserver of the house has decided that if unplugging does not decrease consumption by a statistically significant amount, then the savings are too immaterial to justify the inconvenience.
PG&E has made the following six years of electricity consumption data available to the household (some information, e.g. account information, has been omitted to maintain the confidentiality of the household). The author has also added data, including the number of people in the household in any given month and the months of unplugging (3 months total):
Click Here to Download Data (MS Excel 2010+)
The relevant variables given in the monthly Source Data are as follows:
- kWh: the electric energy consumption measured in kilowatt-hours (output variable)
- People: the number of people living in the house in each month
- Unplug: a binary variable: “0” for months when the Energy Conserver did NOT unplug the appliances, and “1” for months when the Conserver did (again, 3 months total).
You may consider other variables to add to this data set.
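For readers who prefer code to a spreadsheet, here is a minimal sketch of a regression with a control variable (People) and a binary variable (Unplug). It uses only the Python standard library. The numbers are synthetic and invented purely for illustration; they are not the household's data. The helper names `invert` and `ols` are hypothetical, as is the assumption that the three unplugged months fall at the end of the sample.

```python
import math
import random

def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def ols(X, y):
    """Ordinary least squares. Each row of X must start with 1.0 for the intercept.
    Returns (coefficients, t_statistics)."""
    n, k = len(X), len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [yi - sum(b * x for b, x in zip(beta, r)) for r, yi in zip(X, y)]
    s2 = sum(e * e for e in resid) / (n - k)          # residual variance
    se = [math.sqrt(s2 * inv[i][i]) for i in range(k)]
    return beta, [b / s for b, s in zip(beta, se)]

# Synthetic stand-in for the monthly data (NOT the real data set).
random.seed(0)
people = [random.choice([3, 4]) for _ in range(77)]
unplug = [0] * 74 + [1] * 3                           # assumed: last 3 months unplugged
kwh = [300 + 100 * p - 40 * u + random.gauss(0, 50) for p, u in zip(people, unplug)]

X = [[1.0, float(p), float(u)] for p, u in zip(people, unplug)]
beta, t_stats = ols(X, kwh)
for name, b, t in zip(["Intercept", "People", "Unplug"], beta, t_stats):
    print(f"{name:9s} coef = {b:8.2f}   t = {t:6.2f}")
```

A common rule of thumb treats a coefficient as statistically significant at roughly the 5% level when its t-statistic exceeds about 2 in absolute value. Any additional control variables you think of can be added as extra columns in `X`.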
The author has run a regression with 77 observations (more than sufficient) and has discovered that the experimental variable, UNPLUG, is not statistically significant.
Do you tell the Energy Conserver to give up, or is there more to the story?
EPILOGUE: The author is convinced that the Energy Conserver IS saving a material amount of energy and money; the data are not as representative as the Energy Conserver initially believed.
Cask Questions:
- What is a variable or set of variables you think needs to be included before you run a regression off the source data?
- Run your own regression model using all the data. Do the coefficients of your own variables make sense? Did the coefficient of any variable surprise you?
- You will likely find, as the author did, that Unplug is not a statistically significant variable. Given that a whopping 77 data points exist in the model, a conventional regression analyst could argue that the data are sufficiently representative and that the experimental variable is not meaningful. What might make you doubt this argument?
- Run a regression using 75 data points, from End Date June 25, 2008, to End Date August 25, 2014.
- Run a regression using 76 data points, adding End Date September 24, 2014.
- What do you notice about Unplug as you marginally add 1 data point?
- What is your advice to the household? How does the above exercise buttress your argument?
- How will you apply this to your own analyses in your career?
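As a hint for the questions on adding one data point at a time, the sketch below looks at the simplest possible case: a regression with only an intercept and a 0/1 variable. There, the standard error of the binary coefficient equals the residual standard deviation times sqrt(1/n1 + 1/n0), where n1 and n0 are the counts of "1" and "0" rows. The function name `dummy_se_factor` is hypothetical, and the counts used (74 ordinary months plus one, two, or three unplugged months) are an assumption about how the 75-, 76- and 77-point samples are composed; check them against the data set. A regression with control variables will not match this formula exactly.

```python
import math

def dummy_se_factor(n_treated, n_control):
    """Multiplier on the residual standard deviation that gives the standard
    error of a 0/1 coefficient in an intercept-plus-dummy regression."""
    return math.sqrt(1.0 / n_treated + 1.0 / n_control)

# Assumed composition: 74 ordinary months, then 1, 2, or 3 unplugged months.
for n_treated in (1, 2, 3):
    factor = dummy_se_factor(n_treated, 74)
    print(f"{74 + n_treated} observations, {n_treated} unplugged: SE factor = {factor:.3f}")
```

The total sample size barely changes, yet the factor shrinks noticeably with each unplugged month, because it is driven almost entirely by the handful of "1" rows rather than by the 74 "0" rows.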
Zum Wohl!