I am trying to run an OLS regression of Y against X on a daily basis, as depicted below. The period runs from Jan 2008 to Dec 2011. The idea is to obtain daily parameters from which volatility will be inferred. I have managed to run the regression for the first day, but the other days seem to take on the values from the first day. Since the number of observations per day is not the same, this is difficult for me as a newbie.
Kindly help me rectify the code below.
import statsmodels.api as sm
from statsmodels.formula.api import ols

## group by index date to enable recursive regressions by day
Kempf_etal2015_model = np.zeros((1, 5))
intercept = np.zeros((1, 1))
betas = np.zeros((1, 4))

for idx, day in df_fp.groupby(df_fp.index.date):
    # for i in (day):
    for i in range(len(day)):
        for j in range(len(df_fp)):
            spreadlag1 = df_fp['relative_spread'].shift(-1)
            c_spreeeed = df_fp['relative_spread'] - df_fp['relative_spread'].shift(-1)
            c_spreeeed_lag1 = df_fp['relative_spread'] - df_fp['relative_spread'].shift(-2)
            c_spreeeed_lag2 = df_fp['relative_spread'] - df_fp['relative_spread'].shift(-3)
            c_spreeeed_lag3 = df_fp['relative_spread'] - df_fp['relative_spread'].shift(-4)
            Kempf_etal2015_model = ols('c_spreeeed ~ spreadlag1 + c_spreeeed_lag1 + c_spreeeed_lag2 + c_spreeeed_lag3', df_fp).fit()
            intercept = Kempf_etal2015_model.params[:0]
            betas = Kempf_etal2015_model.params[1:]
            print(Kempf_etal2015_model.params)
I end up getting the results below:
Intercept 0.000102
spreadlag1 -0.104292
c_spreeeed_lag1 0.430733
c_spreeeed_lag2 0.020808
c_spreeeed_lag3 0.072575
dtype: float64
Intercept 0.000102
spreadlag1 -0.104292
c_spreeeed_lag1 0.430733
c_spreeeed_lag2 0.020808
c_spreeeed_lag3 0.072575
dtype: float64
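The parameters repeat because the ols call inside the loops is fit on the full df_fp rather than on the day group, so the same full-sample estimates are printed for every day. Below is a minimal sketch of one way to get day-specific parameters: build the lagged series within each day group and fit the regression on that group. The names c_spread and daily_params, and the use of shift(1)/diff() for the lags, are my own illustrative choices; check them against the Kempf et al. (2015) specification you are following.
import pandas as pd
from statsmodels.formula.api import ols

daily_params = {}  # one set of fitted parameters per calendar day

for day_date, day in df_fp.groupby(df_fp.index.date):
    tmp = pd.DataFrame(index=day.index)
    tmp['spreadlag1'] = day['relative_spread'].shift(1)    # lagged spread level
    tmp['c_spread'] = day['relative_spread'].diff()        # change in spread
    tmp['c_spread_lag1'] = tmp['c_spread'].shift(1)
    tmp['c_spread_lag2'] = tmp['c_spread'].shift(2)
    tmp['c_spread_lag3'] = tmp['c_spread'].shift(3)
    tmp = tmp.dropna()
    if len(tmp) <= 5:                                      # too few observations for 5 parameters
        continue
    fit = ols('c_spread ~ spreadlag1 + c_spread_lag1 + c_spread_lag2 + c_spread_lag3', tmp).fit()
    daily_params[day_date] = fit.params                    # Intercept first, then the four slopes

daily_params = pd.DataFrame(daily_params).T                # rows: days, columns: parameters
Because each day is fit only on its own rows, days with different numbers of observations are handled automatically (days with too few rows are skipped). Note also that params[:0] in the original returns an empty Series; the intercept is params['Intercept'] (or the first element) and the slopes are the rest.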
Related
I am aiming to calculate the daily climatology from a dataset, i.e. obtain the sea surface temperature (SST) for each day of the year by averaging over all years (for example, for January 1st, the average SST of every January 1st from 1982 to 2018). To do so, I took the following steps:
DATA PREPARATION STEPS
Here is a Drive link to both datasets to make the code reproducible:
link to datasets
First, I load two datasets:
ds1 = xr.open_dataset('./anomaly_dss/archive_to2018.nc') #from 1982 to 2018
ds2 = xr.open_dataset('./anomaly_dss/realtime_from2018.nc') #from 2018 to present
Then I convert both to pandas dataframes and merge them into one:
ds1 = ds1.where(ds1.time > np.datetime64('1982-01-01'), drop=True) # Grab all data since 1/1/1982
ds2 = ds2.where(ds2.time > ds1.time.max(), drop=True) # Grab all data since the end of the archive
# Convert to Pandas Dataframe
df1 = ds1.to_dataframe().reset_index().set_index('time')
df2 = ds2.to_dataframe().reset_index().set_index('time')
# Merge these datasets
df = df1.combine_first(df2)
So far, this is what my dataframe looks like:
Note that lat goes from 35 to 37.7 and lon from -10 to -5; this must remain like that.
ANOMALY CALCULATION STEPS
# Anomaly calculation
def standardize(x):
    return (x - x.mean())/x.std()

# Calculate a daily average
df_daily = df.resample('1D').mean()

# Calculate the anomaly for each day of the year
df_daily['anomaly'] = df_daily['analysed_sst'].groupby(df_daily.index.dayofyear).transform(standardize)
I obtain the following dataframe:
As you can see, I obtain the mean values of all three variables.
QUESTION
As I want to plot the climatology data on a map, I DO NOT want the lat/lon variables to be averaged down to a single point. I need the anomaly at all the lat/lon points, and I don't really know how to achieve that.
Any help would be very appreciated!!
I think you can do all that in a simpler and more straightforward way without converting your DataArray to a dataframe:
import os
import xarray as xr

# Open and combine the two datasets automatically
DS = xr.open_mfdataset(os.path.join('./anomaly_dss', '*.nc'))
da = DS.analysed_sst

# Resample to daily means
da = da.resample(time='1D').mean()

# Anomaly calculation
def standardize(x):
    return (x - x.mean())/x.std()

da_anomaly = da.groupby(da.time.dt.dayofyear).apply(standardize)
Then you can plot the anomaly for any day with:
da_anomaly[da_anomaly.dayofyear == 1].plot()
I have seen many answers on how to calculate the monthly mean from daily data across multiple years.
But what I want to do is to calculate the monthly mean from daily data for each year in my xarray separately. So, I want to end up with a mean for Jan 2020, Feb 2020 ... Dec 2024 for each lon/lat gridpoint.
My xarray has the dimensions Frozen({'time': 1827, 'lon': 180, 'lat': 90})
I tried using
var_resampled = var_diff.resample(time='1M').mean()
but this calculates the mean across all years (i.e. one mean for Jan 2020-2024).
I also tried
def mon_mean(x):
    return x.groupby('time.month').mean('time')

# group by year, then apply the function:
var_diff_mon = var_diff.groupby('time.year').apply(mon_mean)
This seems to do what I want, but I end up with different dimensions (i.e. "month" and "year" instead of the original "time" dimension).
Is there a different way to calculate the monthly mean from daily data for each year separately, or is there a way for the groupby code above to retain the same time dimension as before, just with year and month now?
P.S. I also tried "cdo monmean", but as far as I understand this also just gives the monthly mean across all years.
Thanks!
Solution
I found a way using
def mon_mean(x):
    return x.groupby('time.month').mean('time')

# group by year, then apply the function:
var_diff_mon = var_diff.groupby('time.year').apply(mon_mean)
and then using
var_diff_mon.stack(time=("year", "month"))
to get my original time dimension back
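One detail worth noting: after the stack, the new "time" dimension carries a (year, month) MultiIndex rather than real timestamps. Here is a small sketch of one way to turn it back into a datetime coordinate (the first-of-month convention is my own assumption):
import pandas as pd

stacked = var_diff_mon.stack(time=("year", "month"))
# build real timestamps (first of each month) from the (year, month) MultiIndex
times = pd.to_datetime([f"{y}-{m:02d}-01" for y, m in stacked.indexes["time"]])
stacked = stacked.reset_index("time", drop=True).assign_coords(time=times)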
Is var_diff.resample(time='M') (or time='MS') doing what you expect?
Let's create a toy dataset like yours:
import numpy as np
import pandas as pd
import xarray as xr
dims = ('time', 'lat', 'lon')
time = pd.date_range("2021-01-01T00", "2023-12-31T23", freq="H")
lat = [0, 1]
lon = [0, 1]
coords = (time, lat, lon)
ds = xr.DataArray(data=np.random.randn(len(time), len(lat), len(lon)), coords=coords, dims=dims).rename("my_var")
ds = ds.to_dataset()
ds
Let's resample it:
ds.resample(time="MS").mean()
The dataset now has 36 time steps, one for each of the 36 months in the original dataset.
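As a quick sanity check (using the toy variable names above), selecting one month shows that each year keeps its own monthly mean:
monthly = ds.resample(time="MS").mean()
monthly.sel(time="2022-07-01")   # the July 2022 mean only, distinct from July 2021 and July 2023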
I am trying to forecast a time series in Python using auto_arima with Fourier terms added as exogenous features. The data come from Kaggle's Store Item Demand Forecasting Challenge. It consists of a long-format time series for 10 stores and 50 items, resulting in 500 time series stacked on top of each other. The particularity of these series is that they are daily data with weekly and annual seasonalities.
In order to capture these two levels of seasonality, I first used TBATS, as recommended by Rob J Hyndman in Forecasting with daily data, which actually worked pretty well.
I also followed this Medium article posted by the creator of the TBATS Python library, who compared it with SARIMAX + Fourier terms (also recommended by Hyndman).
But now, when I try the second approach with pmdarima's auto_arima and Fourier terms as exogenous features, I get unexpected results.
In the following code, I only used the train.csv file, which I split into train and test data (the last year is used for forecasting), and set the maximum order of the Fourier terms to K = 2.
My problem is that I obtain a smoothed forecast (see the image below) that does not seem to capture the weekly seasonality, which is different from the result at the end of this article.
Is there something wrong with my code?
Complete code :
# imports
import pandas as pd
from pmdarima.preprocessing import FourierFeaturizer
from pmdarima import auto_arima
import matplotlib.pyplot as plt
# Load the data, which consists of a long-format time series of multiple TS stacked on top of each other
# There are 10 (stores) * 50 (items) = 500 time series
train_data = pd.read_csv('train.csv', index_col='date', parse_dates=True)
# Select only one time series for store 1 and item 1 for the purpose of the example
train_data = train_data.query('store == 1 and item == 1').sales
# Prepare the fourier terms to add as exogenous features to auto_arima
# Annual seasonality covered by fourier terms
four_terms = FourierFeaturizer(365.25, 2)
y_prime, exog = four_terms.fit_transform(train_data)
exog['date'] = y_prime.index # is exactly the same as manual calculation in the above cells
exog = exog.set_index(exog['date'])
exog.index.freq = 'D'
exog = exog.drop(columns=['date'])
# Split the time series as well as exogenous features data into train and test splits
y_to_train = y_prime.iloc[:(len(y_prime)-365)]
y_to_test = y_prime.iloc[(len(y_prime)-365):] # last year for testing
exog_to_train = exog.iloc[:(len(exog)-365)]
exog_to_test = exog.iloc[(len(exog)-365):]
# Fit model
# Weekly seasonality covered by SARIMAX
arima_exog_model = auto_arima(y=y_to_train, exogenous=exog_to_train, seasonal=True, m=7)
# Forecast
y_arima_exog_forecast = arima_exog_model.predict(n_periods=365, exogenous=exog_to_test)
y_arima_exog_forecast = pd.DataFrame(y_arima_exog_forecast , index = pd.date_range(start='2017-01-01', end= '2017-12-31'))
# Plots
plt.plot(y_to_test, label='Actual data')
plt.plot(y_arima_exog_forecast, label='Forecast')
plt.legend()
Thanks in advance for your answers !
Here's the answer in case someone's interested. Compared with the code in the question, the Fourier order is reduced from K = 2 to K = 1 and auto_arima is forced to apply one seasonal difference with D=1. Thanks again Flavia Giammarino.
# imports
import pandas as pd
from pmdarima.preprocessing import FourierFeaturizer
from pmdarima import auto_arima
import matplotlib.pyplot as plt
# Load the data, which consists of a long-format time series of multiple TS stacked on top of each other
# There are 10 (stores) * 50 (items) = 500 time series
train_data = pd.read_csv('train.csv', index_col='date', parse_dates=True)
# Select only one time series for store 1 and item 1 for the purpose of the example
train_data = train_data.query('store == 1 and item == 1').sales
# Prepare the fourier terms to add as exogenous features to auto_arima
# Annual seasonality covered by fourier terms
four_terms = FourierFeaturizer(365.25, 1)
y_prime, exog = four_terms.fit_transform(train_data)
exog['date'] = y_prime.index # is exactly the same as manual calculation in the above cells
exog = exog.set_index(exog['date'])
exog.index.freq = 'D'
exog = exog.drop(columns=['date'])
# Split the time series as well as exogenous features data into train and test splits
y_to_train = y_prime.iloc[:(len(y_prime)-365)]
y_to_test = y_prime.iloc[(len(y_prime)-365):] # last year for testing
exog_to_train = exog.iloc[:(len(exog)-365)]
exog_to_test = exog.iloc[(len(exog)-365):]
# Fit model
# Weekly seasonality covered by SARIMAX
arima_exog_model = auto_arima(y=y_to_train, D=1, exogenous=exog_to_train, seasonal=True, m=7)
# Forecast
y_arima_exog_forecast = arima_exog_model.predict(n_periods=365, exogenous=exog_to_test)
y_arima_exog_forecast = pd.DataFrame(y_arima_exog_forecast , index = pd.date_range(start='2017-01-01', end= '2017-12-31'))
# Plots
plt.plot(y_to_test, label='Actual data')
plt.plot(y_arima_exog_forecast, label='Forecast')
plt.legend()
I am doing some time series modeling (or trying to). I'm doing just a simple auto AR but am having issues with the output (so probably the input as well). The data are pretty straightforward: there are times, days since start, and the values I want to model and forecast.
import statsmodels.api as sm
from statsmodels.tsa.ar_model import AutoReg, ar_select_order
train7 = p7.iloc[:len(p7)-1728]
test7 = p7.iloc[len(p7)-1728:]
mod = AutoReg(train7['Gv'], 5, old_names=False)
res = mod.fit()
print(res.summary())
pred = res.predict(start=len(train7), end=(len(p7)-1), dynamic=False)
from matplotlib import pyplot
pyplot.plot(pred)
pyplot.plot(test7['Gv'], color='red')
This produces a crazy chart, seen below.
The prediction's x-axis values are way outside the range they should be in, i.e. pred covers values outside the x-axis range of the test data, and they are not in proper order, which seems to be reflected in the chart below.
The p7 data looks like:
rec1; Gv:100; time: 23:50:39; days from start: 0
rec2; Gv:150; time: 23:55:39; days from start: 0
rec3; Gv:190; time: 00:00:39; days from start: 1
I'm at a loss. Clearly I've skipped or missed something important. This is my first time modeling a time series in Python.
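A minimal sketch of one likely fix, under the assumption that p7 currently has a plain integer index: giving the series a regular DatetimeIndex with an explicit frequency lets AutoReg label its forecasts with the test dates, so the prediction lines up with test7 on the x axis. The start date and the 5-minute spacing below are assumptions taken from the sample records above.
import pandas as pd
from statsmodels.tsa.ar_model import AutoReg
from matplotlib import pyplot

# Assumed: observations are evenly spaced every 5 minutes with no gaps,
# starting at the (hypothetical) first timestamp below.
p7 = p7.copy()
p7.index = pd.date_range(start='2020-01-01 23:50:39', periods=len(p7), freq='5min')

train7 = p7.iloc[:len(p7) - 1728]
test7 = p7.iloc[len(p7) - 1728:]

res = AutoReg(train7['Gv'], 5).fit()

# With a date index carrying a frequency, predict() accepts dates
# and returns a Series indexed by those dates.
pred = res.predict(start=test7.index[0], end=test7.index[-1], dynamic=False)

pyplot.plot(pred)
pyplot.plot(test7['Gv'], color='red')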
I've been trying to get a prediction for future values in a model I've created. I have tried both OLS in pandas and statsmodels. Here is what I have in statsmodels:
import statsmodels.api as sm
endog = pd.DataFrame(dframe['monthly_data_smoothed8'])
smresults = sm.OLS(dframe['monthly_data_smoothed8'], dframe['date_delta']).fit()
sm_pred = smresults.predict(endog)
sm_pred
The length of the array returned is equal to the number of records in my original dataframe but the values are not the same. When I do the following using pandas I get no values returned.
from pandas.stats.api import ols
res1 = ols(y=dframe['monthly_data_smoothed8'], x=dframe['date_delta'])
res1.predict
(Note that there is no .fit function for OLS in pandas.) Could somebody shed some light on how I might get future predictions from my OLS model in either pandas or statsmodels? I realize I must not be using .predict properly, and I've read the multiple other problems people have had, but they do not seem to apply to my case.
Edit: I believe 'endog' as defined is incorrect; I should be passing the values for which I want to predict. I have therefore created a date range of 12 periods past the last recorded value, but I am still missing something, as I am getting the error:
matrices are not aligned
Edit: here is a snippet of the data; the last column of numbers is the date delta, which is the difference in months from the first date:
month monthly_data monthly_data_smoothed5 monthly_data_smoothed8 monthly_data_smoothed12 monthly_data_smoothed3 date_delta
0 2011-01-31 3.711838e+11 3.711838e+11 3.711838e+11 3.711838e+11 3.711838e+11 0.000000
1 2011-02-28 3.776706e+11 3.750759e+11 3.748327e+11 3.746975e+11 3.755084e+11 0.919937
2 2011-03-31 4.547079e+11 4.127964e+11 4.083554e+11 4.059256e+11 4.207653e+11 1.938438
3 2011-04-30 4.688370e+11 4.360748e+11 4.295531e+11 4.257843e+11 4.464035e+11 2.924085
I think your issue here is that statsmodels doesn't add an intercept by default, so your model doesn't achieve much of a fit. To solve it in your code, you would do something like this:
dframe = pd.read_clipboard() # your sample data
dframe['intercept'] = 1
X = dframe[['intercept', 'date_delta']]
y = dframe['monthly_data_smoothed8']
smresults = sm.OLS(y, X).fit()
dframe['pred'] = smresults.predict()
Also, for what it's worth, I think the statsmodels formula API is much nicer to work with when dealing with DataFrames, and it adds an intercept by default (add a - 1 to the formula to remove it). See below; it should give the same answer.
import statsmodels.formula.api as smf
smresults = smf.ols('monthly_data_smoothed8 ~ date_delta', dframe).fit()
dframe['pred'] = smresults.predict()
Edit:
To predict future values, just pass new data to .predict(). For example, using the first model:
In [165]: smresults.predict(pd.DataFrame({'intercept': 1,
'date_delta': [0.5, 0.75, 1.0]}))
Out[165]: array([ 2.03927604e+11, 2.95182280e+11, 3.86436955e+11])
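The same works with the formula-API model (a small sketch; smf_results is my own variable name): since the intercept is handled automatically, the new data only needs a date_delta column.
import pandas as pd
import statsmodels.formula.api as smf

# refit with the formula API (same model as above); the intercept is implicit
smf_results = smf.ols('monthly_data_smoothed8 ~ date_delta', dframe).fit()
# predict future values by passing only the regressor column
smf_results.predict(pd.DataFrame({'date_delta': [0.5, 0.75, 1.0]}))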
On the intercept: there's nothing special encoded in the number 1; it's just based on the math of OLS (an intercept is perfectly analogous to a regressor that always equals 1), so you can pull the value right off the summary. Looking at the statsmodels docs, an alternative way to add an intercept would be:
X = sm.add_constant(X)