Any way to do a grouped forecast with multivariate time series? (python)

I was trying a VAR multivariate forecast for an economics analysis.
I have quarterly financial data for 20 banks from 2010 to 2021, together with the corresponding quarterly macro data.
I tried a VAR multivariate forecast, but it did not work, since I have 20 observations (one per bank) on every single date.
I tried grouped / hierarchical auto-ARIMA, but it did not work, since I need a multivariate forecast.
The code below corresponds to the first attempt. I prefer a VAR forecast but can change models if something else is more appropriate.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from statsmodels.tsa.api import VAR
from statsmodels.tsa.stattools import adfuller
from statsmodels.tools.eval_measures import rmse, aic

data = pd.read_csv('last.csv', parse_dates=['quarter'], index_col='quarter')
dff = data[["npl", "nom_gdp", "kospi", "cpi", "avgex", "interest", "m2", "lnasset", "bis", "nim", "lnlend"]]
df = data[["npl", "nom_gdp", "kospi", "cpi", "avgex", "interest", "m2", "nim"]]

nobs = 10
df_train, df_test = df[0:-nobs], df[-nobs:]
df_differenced = df_train.diff().dropna()

model = VAR(df_differenced)
model_fitted = model.fit(4)
model_fitted.summary()

# Forecast from the last lag_order observations of the differenced series
lag_order = model_fitted.k_ar
forecast_input = df_differenced.values[-lag_order:]
fc = model_fitted.forecast(y=forecast_input, steps=nobs)
df_forecast = pd.DataFrame(fc, index=df.index[-nobs:], columns=df.columns + '_1q')
df_forecast
def invert_transformation(df_train, df_forecast, second_diff=False):
    """Revert the differencing to bring the forecast back to the original scale."""
    df_fc = df_forecast.copy()
    columns = df_train.columns
    for col in columns:
        if second_diff:
            # Roll back a second difference first, if one was applied
            df_fc[str(col) + '_1q'] = (df_train[col].iloc[-1] - df_train[col].iloc[-2]) + df_fc[str(col) + '_1q'].cumsum()
        df_fc[str(col) + '_forecast'] = df_train[col].iloc[-1] + df_fc[str(col) + '_1q'].cumsum()
    return df_fc

# The series above was differenced once, so second_diff stays False
df_results = invert_transformation(df_train, df_forecast, second_diff=False)
df_results.loc[:, ['npl_forecast', 'nom_gdp_forecast', 'kospi_forecast', 'cpi_forecast',
                   'avgex_forecast', 'interest_forecast', 'm2_forecast', 'nim_forecast']]
This code runs without errors and returns numbers, but since multiple observations (one per bank) exist in the same period, the numbers are not correct. Is there any way I can forecast these figures properly? I want a forecast per bank as well as a total forecast, plus forecasts for more than one period ahead, e.g., four more quarters.
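One pragmatic route (a sketch, not a definitive answer) is to fit one VAR per bank and sum the bank-level forecasts for the total. The snippet below assumes the raw file contains a bank identifier column named bank; that column name and the loop structure are illustrative, not taken from the code above:

import pandas as pd
from statsmodels.tsa.api import VAR

data = pd.read_csv('last.csv', parse_dates=['quarter'])
cols = ["npl", "nom_gdp", "kospi", "cpi", "avgex", "interest", "m2", "nim"]
steps = 4  # forecast horizon: four quarters

bank_forecasts = {}
for bank, g in data.groupby('bank'):  # assumes a 'bank' column exists
    g = g.set_index('quarter').sort_index()[cols]
    diffed = g.diff().dropna()  # difference once for stationarity
    fitted = VAR(diffed).fit(4)
    fc = fitted.forecast(diffed.values[-fitted.k_ar:], steps=steps)
    # Undo the differencing: cumulative sum plus the last observed level
    fc = pd.DataFrame(fc, columns=cols).cumsum() + g.iloc[-1]
    bank_forecasts[bank] = fc

# Total across banks; only meaningful for bank-level variables such as npl,
# since the macro series (gdp, cpi, ...) are shared by all banks
total_forecast = sum(bank_forecasts.values())

Note that with only around 46 quarterly observations per bank, an 8-variable VAR(4) is heavily over-parameterized, so a smaller lag order or fewer variables per bank may be necessary.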

Related

Calculate mean from only one variable in pandas dataframe and netcdf

I am aiming to calculate a daily climatology from a dataset, i.e. obtain the sea surface temperature (SST) for each day of the year by averaging across all the years (for example, for January 1st, the average SST of all January 1sts from 1982 to 2018). To do so, I took the following steps:
DATA PREPARATION STEPS
Here is a Drive link to both datasets to make the code reproducible:
link to datasets
First, I load two datasets:
import numpy as np
import xarray as xr

ds1 = xr.open_dataset('./anomaly_dss/archive_to2018.nc')    # from 1982 to 2018
ds2 = xr.open_dataset('./anomaly_dss/realtime_from2018.nc') # from 2018 to present
Then I convert them to pandas dataframes and merge both into one:
ds1 = ds1.where(ds1.time > np.datetime64('1982-01-01'), drop=True) # Grab all data since 1/1/1982
ds2 = ds2.where(ds2.time > ds1.time.max(), drop=True) # Grab all data since the end of the archive
# Convert to Pandas Dataframe
df1 = ds1.to_dataframe().reset_index().set_index('time')
df2 = ds2.to_dataframe().reset_index().set_index('time')
# Merge these datasets
df = df1.combine_first(df2)
So far, this is what my dataframe looks like. Note that lat runs from 35 to 37.7 and lon from -10 to -5, and this must remain so.
ANOMALY CALCULATION STEPS
# Anomaly calculation
def standardize(x):
    return (x - x.mean()) / x.std()

# Calculate a daily average
df_daily = df.resample('1D').mean()
# Calculate the anomaly for each day of the year
df_daily['anomaly'] = df_daily['analysed_sst'].groupby(df_daily.index.dayofyear).transform(standardize)
I obtain the following dataframe:
As you can see, I obtain the mean values of all three variables.
QUESTION
As I want to plot the climatology data on a map, I do NOT want the lat/lon variables to be averaged down to one point. I need the anomaly at every lat/lon point, and I don't really know how to achieve that.
Any help would be very appreciated!!
I think you can do all that in a simpler and more straightforward way without converting your dataarray to a dataframe:
import os
import xarray as xr

# Will open and combine the two datasets automatically
DS = xr.open_mfdataset(os.path.join('./anomaly_dss', '*.nc'))
da = DS.analysed_sst

# Resample to daily means
da = da.resample(time='1D').mean()

# Anomaly calculation
def standardize(x):
    return (x - x.mean()) / x.std()

da_anomaly = da.groupby(da.time.dt.dayofyear).apply(standardize)
Then you can plot the anomaly for any day with:
da_anomaly[da_anomaly.dayofyear == 1].plot()
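If what you want on the map is the day-of-year climatology itself rather than the standardized anomaly, a grouped mean also keeps the lat/lon dimensions intact (a small sketch reusing the da from above):

# Day-of-year climatology: average over the years, lat/lon preserved
da_clim = da.groupby(da.time.dt.dayofyear).mean('time')
# Map of the January 1st climatology
da_clim.sel(dayofyear=1).plot()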

Forecasting time series with multiple seasonaliy by using auto_arima(SARIMAX) and Fourier terms

I am trying to forecast a time series in Python by using auto_arima and adding Fourier terms as exogenous features. The data come from Kaggle's Store Item Demand Forecasting Challenge. It consists of long-format time series for 10 stores and 50 items, resulting in 500 time series stacked on top of each other. The specificity of this dataset is that it is daily data with both weekly and annual seasonalities.
In order to capture these two levels of seasonality, I first used TBATS, as recommended by Rob J Hyndman in Forecasting with daily data, which actually worked pretty well.
I also followed this Medium article posted by the creator of the TBATS Python library, who compared it with SARIMAX + Fourier terms (also recommended by Hyndman).
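For context, the TBATS fit mentioned above looks roughly like this; a sketch assuming the tbats package and the same train split (y_to_train) as in the code further down:

from tbats import TBATS

# Two seasonal periods: weekly (7) and annual (365.25)
estimator = TBATS(seasonal_periods=[7, 365.25])
tbats_model = estimator.fit(y_to_train)
y_tbats_forecast = tbats_model.forecast(steps=365)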
But now, when I tried to use the second approach with pmdarima's auto_arima and Fourier terms as exogenous features, I get unexpected results.
In the following code, I only used the train.csv file that I split into train and test data (last year used for forecasting) and set the maximum order of Fourier terms K = 2.
My problem is that I obtain a smoothed forecast (see the image below) that does not seem to capture the weekly seasonality, which is different from the result at the end of that article.
Is there something wrong with my code?
Complete code:
# imports
import pandas as pd
from pmdarima.preprocessing import FourierFeaturizer
from pmdarima import auto_arima
import matplotlib.pyplot as plt
# Load the data, which consist of multiple long-format time series stacked on top of each other
# There are 10 (stores) * 50 (items) = 500 time series
train_data = pd.read_csv('train.csv', index_col='date', parse_dates=True)
# Select only one time series for store 1 and item 1 for the purpose of the example
train_data = train_data.query('store == 1 and item == 1').sales
# Prepare the fourier terms to add as exogenous features to auto_arima
# Annual seasonality covered by fourier terms
four_terms = FourierFeaturizer(365.25, 2)
y_prime, exog = four_terms.fit_transform(train_data)
exog['date'] = y_prime.index # is exactly the same as manual calculation in the above cells
exog = exog.set_index(exog['date'])
exog.index.freq = 'D'
exog = exog.drop(columns=['date'])
# Split the time series as well as exogenous features data into train and test splits
y_to_train = y_prime.iloc[:(len(y_prime)-365)]
y_to_test = y_prime.iloc[(len(y_prime)-365):] # last year for testing
exog_to_train = exog.iloc[:(len(exog)-365)]
exog_to_test = exog.iloc[(len(exog)-365):]
# Fit model
# Weekly seasonality covered by SARIMAX
arima_exog_model = auto_arima(y=y_to_train, exogenous=exog_to_train, seasonal=True, m=7)
# Forecast
y_arima_exog_forecast = arima_exog_model.predict(n_periods=365, exogenous=exog_to_test)
y_arima_exog_forecast = pd.DataFrame(y_arima_exog_forecast, index=pd.date_range(start='2017-01-01', end='2017-12-31'))
# Plots
plt.plot(y_to_test, label='Actual data')
plt.plot(y_arima_exog_forecast, label='Forecast')
plt.legend()
Thanks in advance for your answers!
Here's the answer in case someone's interested: the fix was to force a seasonal difference with D=1 (and reduce the Fourier order to K=1).
Thanks again Flavia Giammarino.
# imports
import pandas as pd
from pmdarima.preprocessing import FourierFeaturizer
from pmdarima import auto_arima
import matplotlib.pyplot as plt
# Load the data, which consist of multiple long-format time series stacked on top of each other
# There are 10 (stores) * 50 (items) = 500 time series
train_data = pd.read_csv('train.csv', index_col='date', parse_dates=True)
# Select only one time series for store 1 and item 1 for the purpose of the example
train_data = train_data.query('store == 1 and item == 1').sales
# Prepare the fourier terms to add as exogenous features to auto_arima
# Annual seasonality covered by fourier terms
four_terms = FourierFeaturizer(365.25, 1)
y_prime, exog = four_terms.fit_transform(train_data)
exog['date'] = y_prime.index # is exactly the same as manual calculation in the above cells
exog = exog.set_index(exog['date'])
exog.index.freq = 'D'
exog = exog.drop(columns=['date'])
# Split the time series as well as exogenous features data into train and test splits
y_to_train = y_prime.iloc[:(len(y_prime)-365)]
y_to_test = y_prime.iloc[(len(y_prime)-365):] # last year for testing
exog_to_train = exog.iloc[:(len(exog)-365)]
exog_to_test = exog.iloc[(len(exog)-365):]
# Fit model
# Weekly seasonality covered by SARIMAX
arima_exog_model = auto_arima(y=y_to_train, D=1, exogenous=exog_to_train, seasonal=True, m=7)
# Forecast
y_arima_exog_forecast = arima_exog_model.predict(n_periods=365, exogenous=exog_to_test)
y_arima_exog_forecast = pd.DataFrame(y_arima_exog_forecast, index=pd.date_range(start='2017-01-01', end='2017-12-31'))
# Plots
plt.plot(y_to_test, label='Actual data')
plt.plot(y_arima_exog_forecast, label='Forecast')
plt.legend()
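For reference, the annual Fourier pair that FourierFeaturizer produces for K=1 can also be built by hand; a sketch in which the column names are illustrative rather than pmdarima's exact output:

import numpy as np
import pandas as pd

# Hand-rolled K=1 Fourier pair for an annual period of 365.25 days
t = np.arange(len(train_data))
exog_manual = pd.DataFrame({
    'sin_annual_1': np.sin(2 * np.pi * t / 365.25),
    'cos_annual_1': np.cos(2 * np.pi * t / 365.25),
}, index=train_data.index)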

Summing time series after k-means clustering

I am trying out different values of K in K-means clustering on a set of time series data.
For each experiment I want to sum up the time series for each cluster label and perform predictions on them.
So for example:
If I cluster the time series into 3 clusters, I want to sum all the time series (column-wise) belonging to cluster 1, all the time series belonging to cluster 2, and likewise for cluster 3. After that I will make predictions on each aggregated time-series cluster, but I do not need help with the prediction part.
I was thinking of adding the cluster labels to the original dataframe and then using .loc and a loop to extract the time series corresponding to the same clusters. But I am wondering if there is a more efficient way?
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

# Create a dataframe with 20 random hourly time series
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(date_rng, columns=['date'])
for i in range(20):
    df['ts' + str(i)] = np.random.randint(0, 100, size=(len(date_rng)))

# One row per time series, one column per timestamp
df_pivot = df.set_index('date').T

# Cluster for several values of K
K = range(1, 10, 2)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(df_pivot)
    print(km.labels_)
    # sum/aggregate all ts in each cluster column-wise
    # forecast the next step for each cluster (don't need help with this part)
You can access the data points of every cluster and then sum their values.
Something like this:
labels = km.labels_
centroids = km.cluster_centers_

cluster_sums_dict = {}  # cluster number -> column-wise sum of its member series
for i in range(k):
    # Select the rows (time series) assigned to cluster i
    temp_cluster = df_pivot[labels == i]
    # Aggregate the member series column-wise into one series
    cluster_sums_dict[i] = temp_cluster.sum(axis=0)
Also, on a side note: instead of aggregating the cluster values, could you use the centroid of each cluster for prediction?
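A more compact alternative (an assumption about what "more efficient" means here) is to let pandas aggregate all clusters at once by grouping the rows on the label array:

# One row per cluster, one column per timestamp: the sum of all member series
cluster_sums = df_pivot.groupby(km.labels_).sum()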

Can we cluster Multivariate Time Series dataset in Python

I have a dataset with many financial signal values for different stocks at different times. For example:

StockName   Date     Signal1   Signal2
--------------------------------------
Stock1      1/1/20   a         b
Stock1      1/2/20   c         d
...
Stock2      1/1/20   e         f
Stock2      1/2/20   g         h
...
I would like to build a time series table like the one below and cluster the stocks based on both Signal1 and Signal2 (two variables):

StockName   1/1/20   1/2/20   ........   Cluster#
--------------------------------------------------
Stock1      [a,b]    [c,d]               0
Stock2      [e,f]    [g,h]               1
Stock3      ......   .....               0
...
1) Are there any ways to do this (clustering stocks based on multiple variables of the time series data)? I tried to search online, but the results were all about clustering time series based on one variable.
2) Also, are there any ways to cluster different stocks at different times as well? (So maybe Stock1 at time1 ends up in the same cluster as Stock2 at time3.)
I am revising my answer here, based on the new information that you last posted.
from utils import *
import time
import numpy as np
from mxnet import nd, autograd, gluon
from mxnet.gluon import nn, rnn
import mxnet as mx
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
# %matplotlib inline
from sklearn.decomposition import PCA
import math
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
import xgboost as xgb
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings("ignore")
context = mx.cpu()
model_ctx = mx.cpu()
mx.random.seed(1719)
# Note: The purpose of this section (3. The Data) is to show the data preprocessing and to give rationale for using different sources of data, hence I will only use a subset of the full data (that is used for training).
def parser(x):
    return datetime.datetime.strptime(x, '%Y-%m-%d')
# dataset_ex_df = pd.read_csv('data/panel_data_close.csv', header=0, parse_dates=[0], date_parser=parser)
import yfinance as yf
# Get the data for the stock AAPL
start = '2018-01-01'
end = '2020-04-22'
data = yf.download('GS', start, end)
data = data.reset_index()
data
data.dtypes
# re-name field from 'Adj Close' to 'Adj_Close'
data = data.rename(columns={"Adj Close": "Adj_Close"})
data
num_training_days = int(data.shape[0]*.7)
print('Number of training days: {}. Number of test days: {}.'.format(num_training_days, data.shape[0]-num_training_days))
# TECHNICAL INDICATORS
#def get_technical_indicators(dataset):
# Create 7 and 21 days Moving Average
data['ma7'] = data['Adj_Close'].rolling(window=7).mean()
data['ma21'] = data['Adj_Close'].rolling(window=21).mean()
# Create exponential weighted moving average
data['26ema'] = data['Adj_Close'].ewm(span=26).mean()
data['12ema'] = data['Adj_Close'].ewm(span=12).mean()
data['MACD'] = (data['12ema']-data['26ema'])
# Create Bollinger Bands
data['20sd'] = data['Adj_Close'].rolling(window=20).std()
data['upper_band'] = data['ma21'] + (data['20sd']*2)
data['lower_band'] = data['ma21'] - (data['20sd']*2)
# Create Exponential moving average
data['ema'] = data['Adj_Close'].ewm(com=0.5).mean()
# Create momentum as the one-day change in price
data['momentum'] = data['Adj_Close'] - data['Adj_Close'].shift(1)
dataset_TI_df = data
dataset = data
def plot_technical_indicators(dataset, last_days):
    plt.figure(figsize=(16, 10), dpi=100)
    shape_0 = dataset.shape[0]
    xmacd_ = shape_0 - last_days
    dataset = dataset.iloc[-last_days:, :]
    x_ = range(3, dataset.shape[0])
    x_ = list(dataset.index)

    # Plot first subplot
    plt.subplot(2, 1, 1)
    plt.plot(dataset['ma7'], label='MA 7', color='g', linestyle='--')
    plt.plot(dataset['Adj_Close'], label='Closing Price', color='b')
    plt.plot(dataset['ma21'], label='MA 21', color='r', linestyle='--')
    plt.plot(dataset['upper_band'], label='Upper Band', color='c')
    plt.plot(dataset['lower_band'], label='Lower Band', color='c')
    plt.fill_between(x_, dataset['lower_band'], dataset['upper_band'], alpha=0.35)
    plt.title('Technical indicators for Goldman Sachs - last {} days.'.format(last_days))
    plt.ylabel('USD')
    plt.legend()

    # Plot second subplot
    plt.subplot(2, 1, 2)
    plt.title('MACD')
    plt.plot(dataset['MACD'], label='MACD', linestyle='-.')
    plt.hlines(15, xmacd_, shape_0, colors='g', linestyles='--')
    plt.hlines(-15, xmacd_, shape_0, colors='g', linestyles='--')
    # plt.plot(dataset['log_momentum'], label='Momentum', color='b', linestyle='-')
    plt.legend()
    plt.show()
plot_technical_indicators(dataset_TI_df, 400)
This will give you some signals to work with. Of course, these features can be anything you want. I'm sure you know this is technical analysis, and not fundamental analysis. Now, you can do your clustering, and whatever else you want, at this point.
Here is a good link for clustering.
https://www.pythonforfinance.net/2018/02/08/stock-clusters-using-k-means-algorithm-in-python/
Good material to read (Title: Time Series Clustering and Dimensionality Reduction)
https://towardsdatascience.com/time-series-clustering-and-dimensionality-reduction-5b3b4e84f6a3
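For the clustering step itself, one concrete option (a suggestion of mine, not covered by the links above) is tslearn, whose TimeSeriesKMeans accepts a 3-D array of shape (n_series, n_timesteps, n_features), which matches the table in the question with two signals per stock. A minimal sketch, assuming X has already been built in that shape:

from tslearn.clustering import TimeSeriesKMeans

# X: array of shape (n_stocks, n_timesteps, 2) holding [Signal1, Signal2] per date
model = TimeSeriesKMeans(n_clusters=3, metric="euclidean", random_state=0)
cluster_labels = model.fit_predict(X)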

facebook prophet produces constant forecasts for time series with less than 25 observations

Prophet time series modeling from fbprophet produces constant or linear fitted values and forecasts for time series with fewer than 25 observations. I am wondering what causes this behavior and whether there is a way to override it.
from fbprophet import Prophet
import pandas
import numpy
training_length = 24
forecast_length = 5
training_endog = numpy.random.randint(50,150,training_length)
training_dates = pandas.date_range('2017-05-31', periods=training_length, freq='M')
df = pandas.DataFrame({'ds':training_dates, 'y':training_endog})
prophet_model = Prophet(
    holidays=None,
    daily_seasonality=False,
    weekly_seasonality=False,
).fit(df)
future = prophet_model.make_future_dataframe(periods=5, freq='M', include_history=True)
prophet_model_predictions = prophet_model.predict(future)['yhat'].clip(lower=0).round()
y_long = numpy.concatenate([training_endog, numpy.zeros(5)])
future['y'] = y_long
future['yhat'] = prophet_model_predictions
future.plot(x='ds', y=['y','yhat'])

For the above example, adding yearly_seasonality=True resolved the issue.
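In other words, the instantiation becomes the same call as above with one extra argument. With yearly_seasonality left on 'auto', Prophet enables yearly seasonality only when the history spans at least two years, so this short monthly series was being fit with the trend alone:

# Enabling yearly seasonality gives the model seasonal terms to fit,
# so the forecast is no longer constant for this short monthly series
prophet_model = Prophet(
    holidays=None,
    daily_seasonality=False,
    weekly_seasonality=False,
    yearly_seasonality=True,
).fit(df)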
