Daily electricity price forecasting (report)



Introduction to the project

Project description

Goal

The goal is to forecast the average daily spot price of electricity. The forecasting horizon (the maximal time segment over which the forecast error does not exceed a given value) is supposed to be one month.

Motivation

For example, one needs a precise forecast of electricity consumption for each hour of the next day to avoid transactions on the balancing market.

Data

Daily time series from 1/1/2003 until now. The time series are weather (average, low, and high daily temperature; relative humidity; precipitation; wind speed; heating and cooling degree days) and the average daily price of electricity. There are no data on electricity consumption or energy unit prices. The data on sunrise and sunset come from the internet.

Quality

The time series is split into two parts: the whole history except the last month, and the last month itself. The model is built on the first part and tested on the second. The procedure must be repeated for each month of the last year. The target function is MAPE (mean absolute percentage error) for the given month.

Requirements

The monthly error of the obtained model must not exceed the error of the existing model currently used by the customer, which is LASSO with some modifications.

Feasibility

The average daily price of electricity contains peaks, and it seems there is no information about these peaks in the given data. There is no visible correlation between the data and the responses.

Methods

The model to be generated is a linear combination of selected features. Each primitive feature (with index j) is the set of (j + nk)-th samples of a time series, where k is the period. The set of features includes the primitive features and their superpositions.

Problem definition

We have a matrix of variables X and a vector of responses Y for this matrix; the data form a time series. Our goal is to recover the regression \hat{Y} for a matrix of variables \hat{X} that immediately follows the initial data in time. We seek a vector \beta of linear coefficients relating \hat{X} and \hat{Y}, so that Y = X\beta^T. As the quality functional we use MAPE (mean absolute percentage error):

Q(\hat{Y}) = \frac{1}{n} \sum_{i=1}^n \frac{|y_i - \hat{y}_i|}{|y_i|}.
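
For clarity, here is this quality functional in Python (a minimal sketch; the function name is ours, not from the report):

 import numpy as np

 def mape(y_true, y_pred):
     """Q(Y-hat): mean absolute percentage error."""
     y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
     return np.mean(np.abs(y_true - y_pred) / np.abs(y_true))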


Algorithm description

State of the art

The main task of this subsection is to describe approaches to daily electricity price forecasting and the kinds of data this task requires. Since we deal with the German electricity market, a brief survey of it is also given.

Let us start with methods for forecasting the daily electricity price. There are many ways to approach this problem: ARIMA models or autoregression [1], [4]; artificial neural networks, usually combined with substantial improvements such as wavelet techniques or ARIMA models [2]. SVMs can be used for the closely related problem of price spike forecasting [2]. Models of noise and jumps can be constructed in other ways as well [3].

The data sets used can be rather different. For neural networks it can be the time series alone [1], but most works require additional data. Weather is an important factor in price forecasting [5]: it can be the median daily temperature, HDD, CDD [4], or wind speed [6]. Dates of sunsets and sunrises can be useful too. Energy consumption and system load have an important impact on the daily electricity price [4]. An interesting option is predicting \log(P_t) instead of P_t, the electricity price in €.

Our goal is to forecast the daily electricity price for the German electricity market EEX, so let us present some information about it. Germany has a free electricity market, so models for free electricity markets can be applied to it. The structure of energy production changes every year; the main goals are phasing out nuclear energy and creating new renewable-source facilities. Germany is one of the largest consumers of energy in the world. In 2008, it consumed energy from the following sources: oil (34.8%), coal including lignite (24.2%), natural gas (22.1%), nuclear (11.6%), renewables (1.6%), and other (5.8%), whereas renewable energy is far more present in produced energy, since Germany imports about two thirds of its energy. Germany is the world's largest operator of non-hydro renewables capacity, including the world's largest operator of wind generation [7].

References

[1] Hsiao-Tien Pao. A Neural Network Approach to m-Daily-Ahead Electricity Price Prediction

[2] Wei Wu, Jianzhong Zhou, Li Mo and Chengjun Zhu. Forecasting Electricity Market Price Spikes Based on Bayesian Expert with Support Vector Machines

[3] S. Borovkova, J. Permana. Modelling electricity prices by the potential jump-functions

[4] R. Weron, A. Misiorek. Forecasting spot electricity prices: A comparison of parametric and semiparametric time series models

[5] J. Cherry, H. Cullen, M. Vissbeck, A. Small and C. Uvo. Impacts of the North Atlantic Oscillation on Scandinavian Hydropower Production and Energy Markets

[6] Yuji Yamada. Optimal Hedging of Prediction Errors Using Prediction Errors

[7] http://en.wikipedia.org/wiki/Energy_in_Germany

Basic hypotheses and estimations

In the case of linear regression we assume that the vector of responses Y is a linear combination of the columns of the modified feature matrix \tilde{X}, which we denote by X:

Y = X\beta^T.

This modified matrix is obtained from the initial data set by superposition, smoothing, and autoregression. LARS [8] seems suitable for this kind of problem, so we use it in our work. The state of the art and the data suggest some additional hypotheses. The data set has two periodicities: weekly and yearly. A possible way to exploit this is to generate new features and to create a separate model for each day of the week.

Mathematic algorithm description

LARS overview

LARS (least angle regression [9]) is a recent (2002) model selection algorithm. It has several advantages:

  • It takes fewer steps than the older LASSO and Forward Selection.
  • After a simple modification it gives the same results as LASSO and Forward Selection.
  • It establishes a connection between LASSO and Forward Selection.


Let's describe LARS. We have an initial data set X [ n \times m ], where m is the number of variables and n is the number of objects; Y is the vector of responses. Our goal is to recover the regression and estimate \beta, where \hat{\mu} = X\hat{\beta}'. At each step the algorithm adds a new variable to the model and produces parameter estimates for the current subset of variables. Finally we get a set of parameter estimates \{\beta_1, \beta_2, \dots, \beta_m\}. The algorithm works under the condition that the vectors \{x_1, \dots, x_m\} = X are linearly independent.

Let us describe a single step of the algorithm. To the subset of vectors with the largest correlations we add one more vector and obtain the index set A. Then


X_A = (\cdots\ s_j x_j\ \cdots)_{j \in A},

where s_j = \pm 1. Introduce


G_A = X'_A X_A,

A_A = (1'_A G_A^{-1} 1_A)^{-1/2},

where 1_A is the vector of ones of length |A|. Then calculate the equiangular vector for X_A:

u_A = X_A \omega_A, \qquad \omega_A = A_A G_A^{-1} 1_A.

The vector u_A has equal angles with all vectors from X_A.
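
To make the construction concrete, here is a minimal numpy sketch of computing u_A from X_A (our own illustration, not code from the report):

 import numpy as np

 def equiangular_vector(X_A):
     """u_A = X_A w_A, where w_A = A_A G_A^{-1} 1_A."""
     G_A = X_A.T @ X_A                        # G_A = X_A' X_A
     ones = np.ones(X_A.shape[1])
     G_inv_ones = np.linalg.solve(G_A, ones)  # G_A^{-1} 1_A
     A_A = (ones @ G_inv_ones) ** -0.5        # A_A = (1' G^{-1} 1)^{-1/2}
     return X_A @ (A_A * G_inv_ones)          # unit vector equiangular to all columns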

Having introduced all the necessary notation, we can describe LARS. Like Stagewise, the algorithm starts from the response estimate \hat{\mu}_0 = 0 and then makes consecutive steps along equiangular directions. At each step we have \hat{\mu}_A with the correlation estimate

\hat{c} = X'(y - \hat{\mu}_A).

The active index set A is obtained from

\hat{C} = \max_j |\hat{c}_j|, \qquad A = \{ j \mid |\hat{c}_j| = \hat{C} \},

and s_j = \mathrm{sign}(\hat{c}_j), j \in A. Introduce the additional vector:

a = X' u_A.

Write the next approximation for the estimate \hat{\mu}_A:

\hat{\mu}_{A_+} = \hat{\mu}_A + \hat{\gamma} u_A,

where


\hat{\gamma} = \min_{j \in A^c}{}^{+} \left( \frac{\hat{C} - \hat{c}_j}{A_A - a_j},\ \frac{\hat{C} + \hat{c}_j}{A_A + a_j} \right),

the minimum being taken over positive values only. This performs a minimization step of \hat{c} for all variables with indices in A. Moreover, \hat{\gamma} is the least positive \gamma such that a new index is added to A at the next step: A_+ = A \cup \{j\}.

The algorithm takes only m steps. The experiments in [9] confirm that LARS performs better than LASSO and Forward Selection and requires fewer iterations. The estimate \beta_m at the last step gives the least-squares solution.

Apply LARS

We apply the algorithm in three steps.

  • Modify the initial data set X and the set to forecast \hat{X}, obtaining \tilde{X}, \hat{\tilde{X}}.
  • Apply the LARS algorithm:

- Split the initial sample into two parts: \{X^1, Y^1\} for learning and \{X^2, Y^2\} for control.

- Apply the spike-removal procedure to the learning part of the sample.

- From \{X^1, Y^1\} obtain a set of weight vectors \{\beta_1, \beta_2, \dots, \beta_m\}, one for each step of the LARS algorithm.

- Using the control sample \{X^2, Y^2\}, choose the \beta_i with the best MAPE.

  • Calculate \hat{Y} = \hat{X}\beta^T.
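
A minimal sketch of this pipeline in Python, assuming scikit-learn's Lars implementation and the mape function defined earlier (the split, fit_intercept=False, and all variable names are our assumptions, not the report's):

 import numpy as np
 from sklearn.linear_model import Lars

 def best_lars_beta(X1, Y1, X2, Y2):
     """Fit LARS on the learning sample and pick the step whose
     coefficients give the best MAPE on the control sample."""
     model = Lars(fit_intercept=False, n_nonzero_coefs=np.inf).fit(X1, Y1)
     best_beta, best_err = None, np.inf
     for beta in model.coef_path_.T:   # one coefficient vector per LARS step
         err = mape(Y2, X2 @ beta)
         if err < best_err:
             best_beta, best_err = beta, err
     return best_beta

 # forecast: Y_hat = X_hat @ best_lars_beta(X1, Y1, X2, Y2)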


Description of Fedorova's local algorithm

This algorithm is a variation of k-means for time series. We have a time series \tilde{f} = (f_1, \dots, f_n). Suppose the continuation of the time series depends on the preceding events F_n = (f_{n-l+1}, \dots, f_n). Select from our set the subsets

  F_r = (f_{r-l+1}, \dots, f_r), \qquad r = \overline{l, n - \max(l, t)}.

Introduce a way to calculate the distance \rho(F_i, F_j) between F_i and F_j.

Define the set A of linear transformations of F:

A(F) = \{ \check{F}^{a,b} \mid \check{F}^{a,b} = aF + b,\ a, b \in \mathbb{R} \}.

Note that for our problem the importance of elements increases towards the end of the vector F_i, because we forecast the vector F_{i+l} that immediately follows it. Introduce a parameter \lambda < 1. Then our distance \rho_{u,v} is

\rho(F_u, F_v) = \min_{a,b} \sum_{i=1}^l \lambda_i^2 \big( f_{u+i} - \check{f}_{v+i}^{a,b} \big)^2,

where


\lambda_i = \lambda^{l-i+1}, \qquad \check{F}_v^{a,b} = \{ \check{f}_{v+1}^{a,b}, \check{f}_{v+2}^{a,b}, \dots, \check{f}_{v+l}^{a,b} \}.

Using this distance, find the k elements closest to F_n. We use this set of elements \{F_{(1)}, \dots, F_{(k)}\}, sorted in ascending order of \rho_{n,(i)}, to create the forecast by the equation


\hat{f}_{n+l+i+1} = \hat{f}_{n+l+i} + \frac{\sum_{j=1}^k \omega_j \big( f_{(j)+l+i+1}^{a,b} - f_{(j)+l+i}^{a,b} \big)}{\sum_{j=1}^k \omega_j},

for each i \in \{0, 1, 2, \dots, l-1\}, where the weight \omega_j is determined by the distance \rho_{n,(j)}:

 \omega_j=\Big( 1-\frac{\rho_{n,(j)}}{\rho_{n,(k+1)}} \Big)^2.

So we use an integrated locally averaged method to create the forecast. The algorithm needs optimization over several parameters and is not useful without this procedure [11]. Our model requires \{k, \lambda\} (l is fixed and equals one month's length). To define them we use a two-step iterative process starting from \{k_0, \lambda_0\} = \{5, 0.5\}:

  • Optimization over \lambda.
  • Optimization over k.

It is necessary to keep in mind that we use a specialized variant of the general algorithm from [11]. The distance function and the weights for the elements of \{F_{(i)}\} can be chosen from other basic assumptions.
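
For illustration, here is a minimal numpy sketch of this distance (our own code; the weighted affine fit of a, b uses np.polyfit, whose weights enter the residuals as w^2 (y - \hat{y})^2, matching \lambda_i^2 above):

 import numpy as np

 def local_distance(F_u, F_v, lam=0.5):
     """rho(F_u, F_v): weighted squared error after the best
     affine transform a*F_v + b; weights grow towards the end."""
     l = len(F_u)
     w = lam ** (l - np.arange(1, l + 1) + 1)  # lambda_i = lam**(l - i + 1)
     a, b = np.polyfit(F_v, F_u, deg=1, w=w)   # weighted least-squares fit of a, b
     return np.sum((w * (F_u - (a * F_v + b))) ** 2)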

Applying Fedorova's local algorithm

For Fedorova's local algorithm we use only the response vector Y. To forecast future responses we have to initialize and optimize the parameters of the algorithm (see the section above):

  • Optimize the parameters k, \lambda of the algorithm.
  • Apply the algorithm to the last l elements of the response set Y, as sketched below.
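
A sketch of the forecasting step itself, assuming local_distance from above (for brevity we use the neighbours' raw increments, omitting their affine transforms; names are ours):

 import numpy as np

 def fedorova_forecast(f, l, k, lam=0.5):
     """Continue series f by l steps from its k nearest windows."""
     f = np.asarray(f, float)
     F_n = f[-l:]
     starts = np.arange(len(f) - 2 * l)   # windows with l known continuation points
     dists = np.array([local_distance(F_n, f[s:s + l], lam) for s in starts])
     order = np.argsort(dists)[:k + 1]    # k nearest, plus one for normalization
     w = (1 - dists[order[:k]] / dists[order[k]]) ** 2
     preds, prev = np.empty(l), f[-1]
     for i in range(l):
         # weighted average of the neighbours' one-step increments
         incr = np.array([f[s + l + i] - f[s + l + i - 1] for s in starts[order[:k]]])
         prev += np.dot(w, incr) / w.sum()
         preds[i] = prev
     return preds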

Set of algorithm variants and modifications

Numerical operations

The first component consists of simple numerical operations, for example taking a square root or cubing. We also add arithmetic operations such as addition and multiplication. Superpositions of these functions can be used in our new variables matrix X'.
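
A small sketch of such feature generation (the particular set of operations is an illustrative choice of ours):

 import numpy as np

 def numeric_features(X):
     """Extend X with simple transforms and pairwise products."""
     n = X.shape[1]
     transforms = [X, np.sqrt(np.abs(X)), X ** 3]   # square root (of |x|), cube
     products = [X[:, [i]] * X[:, [j]] for i in range(n) for j in range(i + 1, n)]
     return np.hstack(transforms + products)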

Smoothing function

The second component consists of smoothing functions. We use Parzen window smoothing with two parameters: the width of the Parzen window h and the period t. For each element of the sample we calculate

{\xi'}_i^k = \sum_{j=-h}^h \omega_j \xi_{i+tj}^l,

for each object, where k, l are indexes of elements of the data set. The only condition on \{\omega_j\}_{j=-h}^h is \sum_{j=-h}^h \omega_j = 1. In our case we use one particular kernel function, the Epanechnikov kernel: \omega_{\pm j} = \frac{3}{4k}(1 - (\frac{j}{h})^2), where k is the normalization coefficient, k = \sum_{j=-h}^h \omega_j. There is a periodicity with step 7 in the data set, so for the response series it seems reasonable to apply kernel smoothing several times, for windows with different periods t \in \{1, 7\}.
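
An illustrative numpy sketch of this smoothing (clipping indices at the series boundaries is our assumption):

 import numpy as np

 def parzen_smooth(x, h, t=1):
     """Epanechnikov-kernel smoothing with half-width h >= 1 and period t."""
     x = np.asarray(x, float)
     j = np.arange(-h, h + 1)
     w = 1 - (j / h) ** 2          # Epanechnikov kernel, up to normalization
     w /= w.sum()                  # enforce: weights sum to 1
     out = np.empty(len(x))
     for i in range(len(x)):
         idx = np.clip(i + t * j, 0, len(x) - 1)   # clip at the boundaries
         out[i] = np.dot(w, x[idx])
     return out

Applying it twice, with t = 1 and then t = 7, reproduces the suggested treatment of the response series.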

Autoregression

In our model we add autoregression to the set of primitive functions of the variables and responses. For each added variable we use a parameter h \in H, H \subset \mathbb{N}, the shift of the data:


{\xi'}_j^k=\xi_{j-h}^l,

for each object, where k, l are indexes from the data matrix. For objects in the sample to be classified we have to calculate it step by step for the responses. If there is no value for \xi_{i-h}^l, i.e. i < h, we assign it the value zero. This decreases the correlation value of this autoregressive variable in the LARS algorithm. But our sample has about three thousand elements, so for small values of h (h < 10) this factor decreases the correlation by only about 0.1-1% or less on average.

As an alternative we use another way to create the autoregression matrix. There are periodicities in our data set. To reshape our response matrix we create a set of new variables from the responses. For a period of length t choose H = \{1, 2, \dots, t-1\}. Then create a model for each phase of the period. From the full set of row indexes I_t select I_{k,t} = \{k, k+t, k+2t, \dots\}. For each k \in \{1, 2, \dots, t\}, LARS builds its own linear regression model on the corresponding matrix. In other words, we modify our variables matrix according to the scheme

\begin{pmatrix}
k & \xi_{k-1}^l & \xi_{k-2}^l & \cdots & \xi_{k-(t-1)}^l \\
k+t & \xi_{k+t-1}^l & \xi_{k+t-2}^l & \cdots & \xi_{k+t-(t-1)}^l \\
k+2t & \xi_{k+2t-1}^l & \xi_{k+2t-2}^l & \cdots & \xi_{k+2t-(t-1)}^l \\
\cdots & \cdots & \cdots & \cdots & \cdots
\end{pmatrix}

The first column contains the indexes that we keep in our variables matrix; to the right of them are the variables that we add to the new variables matrix.
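
A sketch of the first (zero-filled shift) variant of lagged features (names are ours):

 import numpy as np

 def lag_features(y, lags):
     """Autoregression matrix: column h is y shifted by h,
     zero-filled where no history exists (i < h)."""
     y = np.asarray(y, float)
     cols = []
     for h in lags:
         col = np.zeros(len(y))
         col[h:] = y[:-h]
         cols.append(col)
     return np.column_stack(cols)

 # e.g. for the weekly period t = 7: X_ar = lag_features(y, range(1, 7))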

Removing spikes

Another primitive function in our set is the procedure of removing spikes from the sample. This function uses two parameters: the maximum rates of errors of the first and second kind, r_1 and r_2. At the first step the algorithm classifies the whole sample using the initial sample X. At the second step we remove an object from the sample if the ratio of its error to the true value is

\frac{|y_i-\hat{y}_i|}{|y_i|} > r_1

or if the ratio of its error to the mean response value is

\frac{|y_i-\hat{y}_i|}{\overline{|y_i|}} > r_2.

We get r_1, r_2 by cross-validation: 12 times we split the sample into learning and control parts. From the learning sample we create a forecast for the control sample. Optimizing over r_1, r_2, we minimize the mean MAPE on the control set.
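
A sketch of the filtering step, assuming a forecast y_hat has already been produced for the learning sample (r1, r2 come from the cross-validation just described):

 import numpy as np

 def remove_spikes(X, y, y_hat, r1, r2):
     """Drop objects whose relative error exceeds r1, or whose error
     relative to the mean absolute response exceeds r2."""
     err = np.abs(y - y_hat)
     keep = (err / np.abs(y) <= r1) & (err / np.mean(np.abs(y)) <= r2)
     return X[keep], y[keep]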


Normalization

For some variables we need normalization; this primitive function is necessary for our sample. It can be

{\xi'}_j^k = \frac{\xi_j^l - \xi_{\min}^l}{\xi_{\max}^l - \xi_{\min}^l},

for each \xi_j^l from \xi^l. This normalization gives nearly equal initial conditions for each variable in LARS, which is what we need. This primitive is useful for both responses and variables.
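
The corresponding min-max scaling in numpy, shown for completeness:

 import numpy as np

 def min_max_normalize(X):
     """Scale each column of X to [0, 1]."""
     X_min, X_max = X.min(axis=0), X.max(axis=0)
     return (X - X_min) / (X_max - X_min)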

Additional notes

It is necessary to make some additional notes. Some functions can be useful only for a subset of the variables, and we have to take this fact into account. Another thing to pay attention to is the possible use of superpositions of primitives.

All these statements need additional research and computational experiments, reported in other parts of this article, in order to create a good model for our problem.

System description

  • Link to the file system.docs
  • Link to the system files

Report on computational experiments

Visual analysis of the algorithm's performance

Analysis of the algorithm's quality

Analysis of the algorithm's dependence on its parameters

Report on the obtained results

Bibliography

This article is a student assignment that has not yet been checked.
Student: Alexey Zaytsev
Advisor: V. V. Strizhov
Due date: 15 December 2009



