statsmodels, time series predict out-of-sample error - python

My data:
ipdb> dta
plays
2015-03-01 401.0
2015-03-02 350.0
2015-03-03 448.0
2015-03-04 490.0
... ...
2015-08-23 655.0
2015-08-24 731.0
2015-08-25 684.0
2015-08-26 774.0
2015-08-27 808.0
2015-08-28 732.0
2015-08-29 694.0
2015-08-30 798.0
The data runs from 2015-03-01 to 2015-08-30, and I want to predict future values.
My code snippet:
arma_mod30 = sm.tsa.ARMA(dta, (d_level, 0)).fit()
result = arma_mod30.predict('2015-09-01', '2015-10-30', dynamic=True)
But the predict function returned the following error:
ValueError: date 2015-09-01 00:00:00 not in date index. Try giving a date that is in the dates index or use an integer
How can I predict future dates that are not in the date index? Thanks!
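One way around this with the legacy sm.tsa.ARMA results object is to request out-of-sample values directly, either via forecast(steps=...) or by passing integer positions past the end of the sample, as the error message itself suggests. A minimal sketch (the n_ahead name is just for illustration; newer statsmodels versions replaced ARMA with ARIMA/AutoReg):
n_ahead = 60  # days to forecast past 2015-08-30
# forecast returns point forecasts, standard errors, and confidence intervals
forecast, stderr, conf_int = arma_mod30.forecast(steps=n_ahead)
# equivalently, integer positions relative to the start of the sample:
result = arma_mod30.predict(start=len(dta), end=len(dta) + n_ahead - 1)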

Related

python compute pct_change() from last value of the previous month

I am trying to compute returns from the last value of the previous month. Here is a sample DataFrame with daily values. I can't figure out how to achieve this with the pct_change() function.
Sample df
date value
31/07/2020 141.793,00
03/08/2020 145.401,00
04/08/2020 124.534,00
05/08/2020 147.562,00
06/08/2020 131.043,00
07/08/2020 132.556,00
10/08/2020 140.874,00
11/08/2020 128.603,00
01/09/2020 131.451,00
02/09/2020 137.862,00
03/09/2020 130.439,00
04/09/2020 124.608,00
07/09/2020 133.674,00
08/09/2020 126.454,00
09/09/2020 136.488,00
Goal
I need to compute the current monthly cumulated return for each day. The return value for the day should be the return from the last value of the previous month. Something like this:
date value monthly
31/07/2020 141.793,00 NaN
03/08/2020 145.401,00 0,025445544
04/08/2020 124.534,00 -0,12171969
05/08/2020 147.562,00 0,040686071
06/08/2020 131.043,00 -0,075814744
07/08/2020 132.556,00 -0,06514426
10/08/2020 140.874,00 -0,006481279
11/08/2020 128.603,00 -0,093022928
01/09/2020 131.451,00 0,022145673
02/09/2020 137.862,00 0,071996765
03/09/2020 130.439,00 0,014276494
04/09/2020 124.608,00 -0,031064594
07/09/2020 133.674,00 0,039431429
08/09/2020 126.454,00 -0,016710341
09/09/2020 136.488,00 0,061312722
I believe you can get what you need with the following. Use str.replace to remove the commas, convert to float, and then apply pct_change(), assigning the result to a new column (removing the decimal comma rescales the values, but pct_change() works on ratios, so the percentage changes are unaffected):
df['monthly'] = df['value'].str.replace(',','').astype(float).pct_change()
which prints:
date value monthly
0 31/07/2020 141.793,00 NaN
1 2020-03-08 00:00:00 145.401,00 0.025446
2 2020-04-08 00:00:00 124.534,00 -0.143513
3 2020-05-08 00:00:00 147.562,00 0.184913
4 2020-06-08 00:00:00 131.043,00 -0.111946
5 2020-07-08 00:00:00 132.556,00 0.011546
6 2020-10-08 00:00:00 140.874,00 0.062751
7 2020-11-08 00:00:00 128.603,00 -0.087106
8 2020-01-09 00:00:00 131.451,00 0.022146
9 2020-02-09 00:00:00 137.862,00 0.048771
10 2020-03-09 00:00:00 130.439,00 -0.053844
11 2020-04-09 00:00:00 124.608,00 -0.044703
12 2020-07-09 00:00:00 133.674,00 0.072756
13 2020-08-09 00:00:00 126.454,00 -0.054012
14 2020-09-09 00:00:00 136.488,00 0.079349
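Note, though, that a plain pct_change() compares each row to the previous row, not to the last value of the previous month, and the default date parsing above has read the day-first dates month-first. A sketch of the month-anchored version (assuming df holds the raw strings from the sample, with day-first dates and European number formatting):
import pandas as pd

# parse day-first dates and convert '141.793,00'-style strings to floats
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df['value'] = (df['value']
               .str.replace('.', '', regex=False)   # drop the thousands separator
               .str.replace(',', '.', regex=False)  # decimal comma -> decimal point
               .astype(float))

# last value of each month, then anchor each row to the previous month's last value
month = df['date'].dt.to_period('M')
last_per_month = df.groupby(month)['value'].last()
df['monthly'] = df['value'] / (month - 1).map(last_per_month) - 1
For 03/08/2020 this gives 145401 / 141793 - 1 ≈ 0.02545, matching the desired output; the July row maps to June, which is absent, so it stays NaN.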

rolling function does not print all values [duplicate]

This question already has answers here:
How do I expand the output display to see more columns of a Pandas DataFrame?
(22 answers)
Closed 3 years ago.
I am trying to understand the rolling function in pandas. Here is my example code:
# importing pandas as pd
import pandas as pd
# By default the "date" column is read as strings;
# parse_dates=["date"] converts it to datetime format.
# Rolling over time-series data needs a date index,
# so index_col="date" makes the "date" column the index.
df = pd.read_csv("apple.csv", parse_dates=["date"], index_col="date")
print (df.close.rolling(3).sum())
print (df.close.rolling(3, win_type ='triang').sum())
The CSV input file has 255 entries, but I only get a few entries in the output: there is a "..." between 2018-10-04 and 2017-12-26. I verified the input file; it has many more valid entries between these dates.
date
2018-11-14 NaN
2018-11-13 NaN
2018-11-12 578.63
2018-11-09 590.87
2018-11-08 607.13
2018-11-07 622.91
2018-11-06 622.21
2018-11-05 615.31
2018-11-02 612.84
2018-11-01 631.29
2018-10-31 648.56
2018-10-30 654.38
2018-10-29 644.40
2018-10-26 641.84
2018-10-25 648.34
2018-10-24 651.19
2018-10-23 657.62
2018-10-22 658.47
2018-10-19 662.69
2018-10-18 655.98
2018-10-17 656.52
2018-10-16 659.36
2018-10-15 660.70
2018-10-12 661.62
2018-10-11 653.92
2018-10-10 652.92
2018-10-09 657.68
2018-10-08 667.00
2018-10-05 674.93
2018-10-04 676.05
...
2017-12-26 512.25
2017-12-22 516.18
2017-12-21 520.59
2017-12-20 524.37
2017-12-19 523.90
2017-12-18 525.31
2017-12-15 524.93
2017-12-14 522.61
2017-12-13 518.46
2017-12-12 516.19
2017-12-11 516.64
2017-12-08 513.74
2017-12-07 511.36
2017-12-06 507.70
2017-12-05 507.97
2017-12-04 508.45
2017-12-01 510.49
2017-11-30 512.70
2017-11-29 512.38
2017-11-28 514.40
2017-11-27 516.64
2017-11-24 522.13
2017-11-22 524.02
2017-11-21 523.07
2017-11-20 518.08
2017-11-17 513.27
2017-11-16 511.23
2017-11-15 510.33
2017-11-14 511.52
2017-11-13 514.39
Name: close, Length: 254, dtype: float64
thank you for your help ...
The ... just means that pandas isn't showing you all the rows; that's where the 'missing' ones are.
To display all rows:
with pd.option_context("display.max_rows", None):
    print(df.close.rolling(3, win_type='triang').sum())
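If you'd rather flip the setting globally instead of inside a with block:
pd.set_option("display.max_rows", None)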

Plot each column mean grouped by specific date range

I have 7 columns of data indexed by datetime (30-minute frequency), starting from 2017-05-31 and ending on 2018-05-25. I want to plot the mean of specific date ranges (seasons). I have been trying groupby, but I can't get it to group by a specific range, and I get wrong results if I do df.groupby(df.date.dt.month).mean().
A few lines from the dataset (date range is from 2017-05-31 to 2018-05-25)
50 51 56 58
date
2017-05-31 00:00:00 200.213542 276.929198 242.879051 NaN
2017-05-31 00:30:00 200.215478 276.928229 242.879051 NaN
2017-05-31 01:00:00 200.215478 276.925324 242.878083 NaN
2017-06-01 01:00:00 200.221288 276.944691 242.827729 NaN
2017-06-01 01:30:00 200.221288 276.944691 242.827729 NaN
2017-08-31 09:00:00 206.961886 283.374453 245.041349 184.358250
2017-08-31 09:30:00 206.966727 283.377358 245.042317 184.360187
2017-12-31 09:00:00 212.925877 287.198416 247.455413 187.175144
2017-12-31 09:30:00 212.926846 287.196480 247.465097 187.179987
2018-03-31 23:00:00 213.304498 286.933093 246.469647 186.887548
2018-03-31 23:30:00 213.308369 286.938902 246.468678 186.891422
2018-04-30 23:00:00 215.496812 288.342024 247.522230 188.104749
2018-04-30 23:30:00 215.497781 288.340086 247.520294 188.103780
I have created these variables (these are the ranges I need):
increment_rates_winter = df['2017-08-30'].mean() - df['2017-06-01'].mean()
increment_rates_spring = df['2017-11-30'].mean() - df['2017-09-01'].mean()
increment_rates_summer = df['2018-02-28'].mean() - df['2017-12-01'].mean()
increment_rates_fall = df['2018-05-24'].mean() - df['2018-03-01'].mean()
Concatenated them:
df_seasons = pd.concat([increment_rates_winter, increment_rates_spring, increment_rates_summer, increment_rates_fall], axis=1)
After plotting, I got the wrong result. However, I've been trying to get this:
df_seasons
Out[664]:
Winter Spring Summer Fall
50 6.697123 6.948447 -1.961549 7.662622
51 6.428329 4.760650 -2.188402 5.927087
52 5.580953 6.667529 1.136889 12.939295
53 6.406259 2.506279 -2.105125 6.964549
54 4.332826 3.678492 -2.574769 6.569398
56 2.222032 3.359607 -2.694863 5.348258
58 NaN 1.388535 -0.035889 4.213046
That is, the seasons on the x-axis with the means plotted for each column. The season ranges are:
Winter = df['2017-06-01':'2017-08-30']
Spring = df['2017-09-01':'2017-11-30']
Summer = df['2017-12-01':'2018-02-28']
Fall = df['2018-03-01':'2018-05-30']
Thank you in advance!
We can get a specific date range in the following way; you can then define the ranges however you want and take the mean:
import pandas as pd
df = pd.read_csv('test.csv')
df['date'] = pd.to_datetime(df['date'])
start_date = "2017-12-31 09:00:00"
end_date = "2018-04-30 23:00:00"
mask = (df['date'] > start_date) & (df['date'] <= end_date)
f_df = df.loc[mask]
This gives the output
date 50 ... 58
8 2017-12-31 09:30:00 212.926846 ... 187.179987
9 2018-03-31 23:00:00 213.304498 ... 186.887548
10 2018-03-31 23:30:00 213.308369 ... 186.891422
11 2018-04-30 23:00:00 215.496812 ... 188.104749
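From here, f_df.mean() gives the per-column mean over the selected range; repeating this with each season's start and end dates yields one set of means per season.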
Hope this helps
How about transposing it:
df_seasons.T.plot()
Output: a plot with the seasons on the x-axis and one line per column.
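Putting it together, a sketch (not from the original answers) that builds df_seasons directly from the season boundaries listed in the question, assuming df has a DatetimeIndex:
import pandas as pd

# season boundaries as given in the question
seasons = {
    'Winter': ('2017-06-01', '2017-08-30'),
    'Spring': ('2017-09-01', '2017-11-30'),
    'Summer': ('2017-12-01', '2018-02-28'),
    'Fall': ('2018-03-01', '2018-05-30'),
}

# one value per column per season: mean on the end date minus mean on the
# start date, mirroring the increment_rates_* variables above
df_seasons = pd.concat(
    {name: df.loc[end].mean() - df.loc[start].mean()
     for name, (start, end) in seasons.items()},
    axis=1,
)
df_seasons.T.plot()  # seasons on the x-axis, one line per column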

Pandas DataFrame.resample monthly offset from particular day of month

I have a DataFrame df with sporadic daily business day rows (i.e., there is not always a row for every business day.)
For each row in df I want to create a historical resampled mean dfm going back one month at a time. For example, if I have a row for 2018-02-22 then I want rolling means for rows in the following date ranges:
2018-01-23 : 2018-02-22
2017-12-23 : 2018-01-22
2017-11-23 : 2017-12-22
etc.
But I can't see a way to keep this pegged to the particular day of the month using conventional offsets. For example, if I do:
dfm = df.resample('30D').mean()
Then we see two problems:
It references the beginning of the DataFrame. In fact, I can't find a way to force .resample() to peg itself to the end of the DataFrame – even if I have it operate on df_reversed = df.loc[:'2018-02-22'].iloc[::-1]. Is there a way to "peg" the resampling to something other than the earliest date in the DataFrame? (And ideally pegged to each particular row as I run some lambda on the associated historical resampling from each row's date?)
It will drift over time, because not every month is 30 days long. So as I go back in time I will find that the interval 12 "months" prior ends 2017-02-27, not 2017-02-22 like I want.
Knowing that I want to resample by non-overlapping "months," the second problem can be well-defined for month days 29-31: For example, if I ask to resample for '2018-03-31' then the date ranges would end at the end of each preceding month:
2018-03-01 : 2018-03-31
2018-02-01 : 2018-02-28
2018-01-01 : 2018-01-31
etc.
Though again, I don't know: is there a good or easy way to do this in pandas?
tl;dr:
Given something like the following:
someperiods = 20 # this can be a number of days covering many years
somefrequency = '8D' # this can vary from 1D to maybe 10D
rng = pd.date_range('2017-01-03', periods=someperiods, freq=somefrequency)
df = pd.DataFrame({'x': rng.day}, index=rng) # x in practice is exogenous data
from pandas.tseries.offsets import DateOffset
df['MonthPrior'] = df.index + DateOffset(months=-1)
Now:
For each row in df: calculate df['PreviousMonthMean'] = rolling average of all df.x in range [df.MonthPrior, df.index). In this example the resulting DataFrame would be:
Index x MonthPrior PreviousMonthMean
2017-01-03 3 2016-12-03 NaN
2017-01-11 11 2016-12-11 3
2017-01-19 19 2016-12-19 7
2017-01-27 27 2016-12-27 11
2017-02-04 4 2017-01-04 19
2017-02-12 12 2017-01-12 16.66666667
2017-02-20 20 2017-01-20 14.33333333
2017-02-28 28 2017-01-28 12
2017-03-08 8 2017-02-08 20
2017-03-16 16 2017-02-16 18.66666667
2017-03-24 24 2017-02-24 17.33333333
2017-04-01 1 2017-03-01 16
2017-04-09 9 2017-03-09 13.66666667
2017-04-17 17 2017-03-17 11.33333333
2017-04-25 25 2017-03-25 9
2017-05-03 3 2017-04-03 17
2017-05-11 11 2017-04-11 15
2017-05-19 19 2017-04-19 13
2017-05-27 27 2017-04-27 11
2017-06-04 4 2017-05-04 19
If we can get that far, then I need to find an efficient way to iterate this so that, for each row in df, I can aggregate consecutive but non-overlapping df['PreviousMonthMean'] values going back one calendar month at a time from the given DatetimeIndex.
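For the first step, a brute-force sketch that reproduces the PreviousMonthMean column above by evaluating the window per row (fine for small frames, though quadratic in the number of rows):
# mean of x over [MonthPrior, current index), evaluated row by row
df['PreviousMonthMean'] = [
    df.loc[(df.index >= prior) & (df.index < ts), 'x'].mean()
    for ts, prior in zip(df.index, df['MonthPrior'])
]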

How to properly set start/end params of statsmodels.tsa.ar_model.AR.predict function

I have a DataFrame of project costs from an irregularly spaced time series that I would like to fit with the statsmodels AR model.
This is a sample of the data in its DataFrame:
cost
date
2015-07-16 35.98
2015-08-11 25.00
2015-08-11 43.94
2015-08-13 26.25
2015-08-18 15.38
2015-08-24 77.72
2015-09-09 40.00
2015-09-09 20.00
2015-09-09 65.00
2015-09-23 70.50
2015-09-29 59.00
2015-11-03 19.25
2015-11-04 19.97
2015-11-10 26.25
2015-11-12 19.97
2015-11-12 23.97
2015-11-12 21.88
2015-11-23 23.50
2015-11-23 33.75
2015-11-23 22.70
2015-11-23 33.75
2015-11-24 27.95
2015-11-24 27.95
2015-11-24 27.95
...
2017-03-31 21.93
2017-04-06 22.45
2017-04-06 26.85
2017-04-12 60.40
2017-04-12 37.00
2017-04-12 20.00
2017-04-12 66.00
2017-04-12 60.00
2017-04-13 41.95
2017-04-13 25.97
2017-04-13 29.48
2017-04-19 41.00
2017-04-19 58.00
2017-04-19 78.00
2017-04-19 12.00
2017-04-24 51.05
2017-04-26 21.88
2017-04-26 50.05
2017-04-28 21.00
2017-04-28 30.00
I am having a hard time understanding how to use start and end in the predict function.
According to the docs:
start : int, str, or datetime
    Zero-indexed observation number at which to start forecasting, i.e., the first forecast is start. Can also be a date string to parse or a datetime type.
end : int, str, or datetime
    Zero-indexed observation number at which to end forecasting, i.e., the last forecast is end. Can also be a date string to parse or a datetime type.
I create a DataFrame that has an empty daily time series, add my irregularly spaced time series data to it, and then try to apply the model:
import pandas as pd
import statsmodels.api as sm
from datetime import datetime

data = pd.read_csv('data.csv', index_col=1, parse_dates=True)
df = pd.DataFrame(index=pd.date_range(start=datetime(2015, 1, 1), end=datetime(2017, 12, 31), freq='d'))
df = df.join(data)
df.cost.interpolate(inplace=True)
ar_model = sm.tsa.AR(df, missing='drop', freq='D')
ar_res = ar_model.fit(maxlag=9, method='mle', disp=-1)
pred = ar_res.predict(start='2016', end='2016')
The predict function results in an error of pandas.tslib.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 605-12-31 00:00:00
If I try to use a more specific date, I get the same type of error:
pred = ar_res.predict(start='2016-01-01', end='2016-06-01')
If I try to use integers, I get a different error:
pred = ar_res.predict(start=0, end=len(data))
Wrong number of items passed 202, placement implies 197
If I actually use a datetime, I get an error that reads no rule for interpreting end.
I am hitting a wall so hard here I am thinking there must be something I am missing.
Ultimately, I would like to use the model to get out-of-sample predictions (such as a prediction for next quarter).
This works if you pass a datetime (rather than a date):
from datetime import datetime
...
pred = ar_res.predict(start=datetime(2015, 1, 1), end=datetime(2017,12,31))
In [21]: pred.head(2) # my dummy numbers from data
Out[21]:
2015-01-01 35
2015-01-02 23
Freq: D, dtype: float64
In [22]: pred.tail(2)
Out[22]:
2017-12-30 44
2017-12-31 44
Freq: D, dtype: float64
So I was creating a daily index to account for the equally spaced time series requirement, but it still remained non-unique (comment by #user333700).
I added a groupby function to sum duplicate dates together, and could then run the predict function using datetime objects (answer by #andy-hayden).
df = df.groupby(pd.TimeGrouper(freq='D')).sum()  # pd.Grouper(freq='D') in newer pandas
...
ar_res.predict(start=min(df.index), end=datetime(2018,12,31))
With the predict function providing a result, I am now able to analyze the results and tweak the params to get something useful.
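For reference, the sm.tsa.AR class used here was later deprecated; a sketch of the same fit with its replacement, statsmodels.tsa.ar_model.AutoReg, forecasting roughly a quarter ahead (this assumes the daily, aggregated df built above):
from statsmodels.tsa.ar_model import AutoReg

# fit an AR(9) on the daily series and forecast ~90 days out of sample
auto_reg = AutoReg(df['cost'], lags=9)
res = auto_reg.fit()
pred = res.predict(start=len(df), end=len(df) + 90)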
