This is the first of a series of notebooks that dive into the problem of predicting hourly electricity demand through several time series analysis methods.
Predicting electricity demand is crucial for utilities companies as well as other organizations in the electric sector in order to properly plan and operate the generation and distribution of energy, as well as to devise future expansions of the power system.
This notebook serves as an initial Exploratory Data Analysis of PG&E's (a Californian utility company) Hourly Electricity Load, which will help understand the data before developing forecasting models later on. We will analyze a single year of electricity demand, 2016, although other years will be included during the model fitting stages.
All the analysis presented here was developed using Jupyter Lab. The original Jupyter notebook, with all the necessary code, can be found on my GitHub. I have cleaned the original notebook by removing most of the code cells and keeping the text and plots.
First, we need to get our data. We will be using the 2016 California EMS Hourly Load. The historical data is stored in an Excel file which can be found and downloaded from here. The first few records are shown below. The dataset includes hourly electricity loads, in megawatts (MW), for each of the main electricity distributors in California:
For this analysis, we will only be interested in PG&E's electricity demand.
| | Dates | Date | HE | PGE | SCE | SDGE | VEA | CAISO Total |
---|---|---|---|---|---|---|---|---|
0 | 2016-01-01 00:00:00 | 2016-01-01 | 1 | 10554 | 10543 | 2176 | 98 | 23370 |
1 | 2016-01-01 01:00:00 | 2016-01-01 | 2 | 10210 | 10215 | 2073 | 99 | 22598 |
2 | 2016-01-01 02:00:00 | 2016-01-01 | 3 | 9941 | 9954 | 2009 | 101 | 22005 |
3 | 2016-01-01 03:00:00 | 2016-01-01 | 4 | 9793 | 9815 | 1980 | 105 | 21692 |
4 | 2016-01-01 04:00:00 | 2016-01-01 | 5 | 9802 | 9822 | 1991 | 108 | 21724 |
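As a minimal sketch of how the `Dates` column in the table relates to `Date` and `HE`, the snippet below rebuilds the timestamp from a small hand-typed sample. It assumes `HE` is an "hour ending" label (so hour 1 covers 00:00 to 01:00), which is consistent with the rows shown; the variable names `raw` and `pge_load` are illustrative and not taken from the original notebook.

```python
import pandas as pd

# Hand-typed sample mirroring the first rows of the table (not read from the Excel file).
raw = pd.DataFrame({
    "Date": pd.to_datetime(["2016-01-01"] * 3),
    "HE": [1, 2, 3],
    "PGE": [10554, 10210, 9941],
})

# Assumption: HE is the hour-ending label, so HE=1 maps to the 00:00 timestamp.
raw["Dates"] = raw["Date"] + pd.to_timedelta(raw["HE"] - 1, unit="h")

# Keep only PG&E's load as a time-indexed series.
pge_load = raw.set_index("Dates")["PGE"]
print(pge_load)
```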
In order to better understand the behavior of PG&E's electricity load throughout 2016, we will proceed to visualize our data in several ways, starting with a simple time plot of the hourly demand for the entire duration of 2016. As can be seen in the plot below, electricity load shows multiple seasonality patterns (e.g. daily, weekly, yearly) and is generally higher during the Summer.
Let's explore the seasonality of our data further. For that purpose, we add several useful fields to the original time series, such as Hour, Day of Week and Month, which we can use to segment and aggregate our data.
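The calendar fields described above could be derived from a datetime index roughly as follows. This is a sketch on synthetic data: the load values are random placeholders, and the column names `Hour`, `DayOfWeek` and `Month` are assumptions rather than the names used in the original notebook.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series standing in for the PG&E load (values are made up).
idx = pd.date_range("2016-01-01", "2016-12-31 23:00", freq="h")
rng = np.random.default_rng(0)
df = pd.DataFrame({"PGE": rng.normal(11000, 1500, len(idx))}, index=idx)

# Calendar fields used later to segment and aggregate the data.
df["Hour"] = df.index.hour            # 0 ... 23
df["DayOfWeek"] = df.index.dayofweek  # Monday=0 ... Sunday=6
df["Month"] = df.index.month          # 1 ... 12

print(df.head())
```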
The plot below shows hourly load averages by month. There is a clear daily seasonality, with higher demand in the afternoon and evening and lower demand overnight and around dawn in most months. The daily pattern changes from Winter to Summer months: the former show an approximately bimodal curve (with peaks at around 8am and 6pm), while the latter show a roughly unimodal pattern centered around 6pm.
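One way to compute the hour-by-month averages behind a plot like this is a pivot table. The sketch below uses a deterministic synthetic load (a made-up function of hour and month) so the result is easy to check by eye; the real analysis would use the PG&E series instead.

```python
import pandas as pd

idx = pd.date_range("2016-01-01", "2016-12-31 23:00", freq="h")

# Made-up load that depends only on hour and month, for illustration.
df = pd.DataFrame(
    {
        "PGE": 10000 + 100 * idx.hour + 10 * idx.month,
        "Hour": idx.hour,
        "Month": idx.month,
    },
    index=idx,
)

# Rows are hours of the day, columns are months, cells are mean load.
hourly_by_month = df.pivot_table(
    values="PGE", index="Hour", columns="Month", aggfunc="mean"
)
print(hourly_by_month.round(0))
```

Each column of `hourly_by_month` is then one monthly curve, so calling `hourly_by_month.plot()` would draw twelve daily profiles on the same axes.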
The next plot helps us study the load distributions at different times of the day and year. As can be seen, hourly load in February is lower and shows less variance than in July for several times of day. Indeed, the distributions plotted below are generally more "spread-out" in July than in February.
It is also worth exploring weekly seasonality patterns. The boxplots below clearly show how consumption during non-work days is on average lower, as we would expect.
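A simple way to compare work days and non-work days is to average the load per day and group by day type. The sketch below is only an approximation of "non-work days": it flags Saturdays and Sundays and ignores holidays, and the synthetic load has an artificial weekend reduction built in.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2016-01-01", "2016-12-31 23:00", freq="h")
rng = np.random.default_rng(7)

# Made-up load: 1000 MW lower on weekends, plus noise.
is_weekend_hour = idx.dayofweek >= 5
load = pd.Series(
    11000 - 1000 * is_weekend_hour + rng.normal(0, 300, len(idx)),
    index=idx,
    name="PGE",
)

# Daily mean load, labelled as weekday or weekend.
daily = load.resample("D").mean()
day_type = np.where(daily.index.dayofweek >= 5, "weekend", "weekday")

summary = daily.groupby(day_type).agg(["mean", "median", "count"])
print(summary)
```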
To give a clearer view of how average demand changes throughout the year, the plot below shows monthly load averages (and an interval representing the standard deviation), maxima and minima. As mentioned before, the Summer months see a higher average load and stronger variability than the Winter months.
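The monthly statistics described above (mean, standard deviation, maximum and minimum) can be obtained with a single grouped aggregation. The following is a sketch on random placeholder data; the `lower` and `upper` columns are hypothetical names for the one-standard-deviation band.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2016-01-01", "2016-12-31 23:00", freq="h")
rng = np.random.default_rng(42)

# Random placeholder values standing in for the hourly PG&E load.
load = pd.Series(rng.normal(11000, 1500, len(idx)), index=idx, name="PGE")

# One row per month with the four summary statistics.
monthly = load.groupby(load.index.month).agg(["mean", "std", "max", "min"])
monthly.index.name = "Month"

# Band of one standard deviation around the monthly mean.
monthly["lower"] = monthly["mean"] - monthly["std"]
monthly["upper"] = monthly["mean"] + monthly["std"]
print(monthly.round(1))
```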
Apart from seasonal and time-of-year effects (which could also be considered seasonal for series longer than one year), it is interesting to explore the changes in electricity demand during holidays.
The holidays taken into account in our analysis correspond to the typical US Federal Holidays Calendar:
The plot below shows the daily load curves for several US Federal holidays, as well as for the days before and after each holiday, compared with the daily curves for the remaining days of the month in which the holiday falls. Certain holidays, such as Christmas Day, show noticeably lower energy demand than most other days in December.
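A holiday flag, together with flags for the day before and after, could be built with the federal holiday calendar that ships with pandas. This is a sketch and not necessarily the calendar used in the original analysis. Note that `USFederalHolidayCalendar` returns observed dates, so Christmas 2016, which fell on a Sunday, appears as Monday 26 December; an analysis of the actual calendar date would need to adjust for that.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Observed US federal holidays in 2016, according to pandas' built-in rules.
holidays = USFederalHolidayCalendar().holidays(start="2016-01-01", end="2016-12-31")

# One row per calendar day with three boolean flags.
days = pd.date_range("2016-01-01", "2016-12-31", freq="D")
flags = pd.DataFrame(index=days)
flags["is_holiday"] = days.isin(holidays)
flags["day_before"] = days.isin(holidays - pd.Timedelta(days=1))
flags["day_after"] = days.isin(holidays + pd.Timedelta(days=1))

print(flags[flags["is_holiday"]])
```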
Finally, given that we are dealing with time series data, it is worth analyzing the correlation between observations at different lags. To remove non-stationary seasonal and non-seasonal effects, we first take a seasonal difference (with a period of 24 hours) and then a non-seasonal first difference, and compute the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) of the differenced series.
The PACF shows significant correlations at lags 24, 48, and so on, as well as some short-term correlations (mainly at lag 1). This suggests that autoregressive (AR) terms at those lags could model the series well.
The ACF also shows significant correlations, but with a less clear structure.
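The differencing step can be sketched with pandas alone. The example below applies a seasonal difference at lag 24 followed by a first difference to a synthetic series (trend, 24-hour cycle and noise, all made up), then computes the sample autocorrelation at a few lags with `Series.autocorr`. It illustrates the transformation only and does not reproduce the plots.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2016-01-01", periods=24 * 60, freq="h")
rng = np.random.default_rng(0)
t = np.arange(len(idx))

# Synthetic load: linear trend + 24-hour cycle + noise (made-up values).
y = pd.Series(
    10000 + 0.5 * t + 1500 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 100, len(t)),
    index=idx,
)

# Seasonal difference (lag 24) followed by a first difference (lag 1):
# y_diff[t] = y[t] - y[t-1] - y[t-24] + y[t-25]
y_diff = y.diff(24).diff(1).dropna()

# Sample autocorrelation of the differenced series at a few lags.
acf = {lag: y_diff.autocorr(lag=lag) for lag in (1, 24, 48)}
print(acf)
```

In practice, ACF and PACF plots are often drawn with `plot_acf` and `plot_pacf` from `statsmodels.graphics.tsaplots`, applied to a differenced series such as `y_diff`.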
As we have seen, purely temporal information (time of day, day of week, time of year, holidays and past observations) can tell us a lot about the expected hourly electricity load. However, it is well-known that weather, and most notably temperature, is correlated with electricity demand.
Let us now take a look at hourly temperature in 2016 for two cities in California, Los Angeles and San Francisco, and at how it correlates with energy consumption. Temperature data was obtained from this Kaggle dataset, which in turn was compiled using Weather API.
Note that the original temperature dataset uses Israel Standard Time. To fix that, we simply shift the index to match the US Pacific timezone, which is the one used in our energy load dataset.
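One possible way to do this realignment is a timezone conversion rather than a fixed shift. The sketch below assumes the timestamps are naive local times in the `Asia/Jerusalem` zone and converts them to `America/Los_Angeles`; the readings and the variable names are illustrative only, and the original notebook may have used a plain hour offset instead.

```python
import pandas as pd

# Made-up kelvin readings with naive timestamps, assumed to be Israel local time.
temps = pd.DataFrame(
    {"San Francisco": [275.8, 275.5, 275.4]},
    index=pd.date_range("2016-01-01 10:00", periods=3, freq="h"),
)

# Attach the source timezone, convert to US Pacific, then drop the tz info
# so the index lines up with a naive Pacific-time load index.
temps_pacific = temps.copy()
temps_pacific.index = (
    temps.index.tz_localize("Asia/Jerusalem")
    .tz_convert("America/Los_Angeles")
    .tz_localize(None)
)
print(temps_pacific)
```

Over a full year, `tz_localize` can raise errors at daylight-saving transitions; its `ambiguous` and `nonexistent` parameters control how those timestamps are handled.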
Below are the first few temperature observations for LA and SF during the year 2016, according to this dataset. The values are evidently in kelvin (275.79 K is about 2.6 °C).
Time | San Francisco | Los Angeles |
---|---|---|
2016-01-01 00:00:00 | 275.790000 | 278.970000 |
2016-01-01 01:00:00 | 275.460000 | 278.830000 |
2016-01-01 02:00:00 | 275.371332 | 278.617968 |
2016-01-01 03:00:00 | 275.170000 | 278.140000 |
2016-01-01 04:00:00 | 274.690000 | 278.080000 |
It is also useful to visualize the observed temperature for the entire duration of 2016, just as we did with the energy demand at the beginning of the analysis.
In addition to this yearly pattern we would also expect temperature to follow a daily seasonality. The boxplots below represent the distribution of temperature in San Francisco by hour of day. Highest temperatures occur, obviously, between the late morning and early afternoon.
Finally, let's explore the relationship between temperature and energy demand by visualizing a scatterplot of both variables. As can be seen below, hourly load has a roughly quadratic relationship with both San Francisco and Los Angeles temperature. Load rises more steeply with temperature at higher temperatures and flattens out at lower ones. In fact, when temperatures fall below a certain level, energy demand seems to increase slightly again. Note that, given the warm temperatures in these two cities, we are not able to fully see the typical "U"-shaped pattern between energy demand and temperature.
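To make the "roughly quadratic" description concrete, a second-degree polynomial can be fitted by least squares. The sketch below uses synthetic data with a made-up convex relationship (minimum near 15 °C), not the actual load and temperature series, and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up temperatures (°C) and a convex load response with noise.
temp_c = rng.uniform(5, 35, 500)
load = 10000 + 12 * (temp_c - 15) ** 2 + rng.normal(0, 200, temp_c.size)

# Least-squares quadratic fit: load ≈ a*T^2 + b*T + c
a, b, c = np.polyfit(temp_c, load, deg=2)

# Temperature at which the fitted parabola reaches its minimum.
t_min = -b / (2 * a)
print(f"a={a:.2f}, b={b:.2f}, c={c:.2f}, vertex at {t_min:.1f} °C")
```

A positive `a` corresponds to the convex shape described above, and `t_min` estimates the temperature below which demand starts to rise again.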
In this notebook we have performed an Exploratory Data Analysis of the hourly electricity load during 2016 for the Californian utility company PG&E. By visualizing energy demand aggregated at different time scales we have gathered a few key take-aways:
These take-aways will be helpful when deciding which features to include when developing a forecasting model for PG&E's hourly energy demand. The next step in this process is precisely developing these models: