VAR model with pandas + statsmodels in Python

I am an avid user of R, but recently switched to Python for a few different reasons. However, I am struggling a little to run the vector AR model in Python from statsmodels.
Q#1. I get an error when I run this, and I have a suspicion it has something to do with the type of my vector.
import numpy as np
import statsmodels.tsa.api
from statsmodels import datasets
import datetime as dt
import pandas as pd
from pandas import Series
from pandas import DataFrame
import os
df = pd.read_csv('myfile.csv')
speedonly = DataFrame(df['speed'])
results = statsmodels.tsa.api.VAR(speedonly)
Traceback (most recent call last):
File "<pyshell#14>", line 1, in <module>
results = statsmodels.tsa.api.VAR(speedonly)
File "C:\Python27\lib\site-packages\statsmodels\tsa\vector_ar\var_model.py", line 336, in __init__
super(VAR, self).__init__(endog, None, dates, freq)
File "C:\Python27\lib\site-packages\statsmodels\tsa\base\tsa_model.py", line 40, in __init__
self._init_dates(dates, freq)
File "C:\Python27\lib\site-packages\statsmodels\tsa\base\tsa_model.py", line 54, in _init_dates
raise ValueError("dates must be of type datetime")
ValueError: dates must be of type datetime
Now, interestingly, when I run the VAR example from here https://github.com/statsmodels/statsmodels/blob/master/docs/source/vector_ar.rst#id5, it works fine.
I try the VAR model with a third, shorter vector, ts, from Wes McKinney's "Python for Data Analysis," page 293 and it doesn't work.
Okay, so now I'm thinking it's because the vectors are different types:
>>> speedonly.head()
speed
0 559.984
1 559.984
2 559.984
3 559.984
4 559.984
>>> type(speedonly)
<class 'pandas.core.frame.DataFrame'> #DOESN'T WORK
>>> type(data)
<type 'numpy.ndarray'> #WORKS
>>> ts
2011-01-02 -0.682317
2011-01-05 1.121983
2011-01-07 0.507047
2011-01-08 -0.038240
2011-01-10 -0.890730
2011-01-12 -0.388685
>>> type(ts)
<class 'pandas.core.series.TimeSeries'> #DOESN'T WORK
So I convert speedonly to an ndarray... and it still doesn't work. But this time I get another error:
>>> nda_speedonly = np.array(speedonly)
>>> results = statsmodels.tsa.api.VAR(nda_speedonly)
Traceback (most recent call last):
File "<pyshell#47>", line 1, in <module>
results = statsmodels.tsa.api.VAR(nda_speedonly)
File "C:\Python27\lib\site-packages\statsmodels\tsa\vector_ar\var_model.py", line 345, in __init__
self.neqs = self.endog.shape[1]
IndexError: tuple index out of range
Any suggestions?
Q#2. I have exogenous feature variables in my data set that appear to be useful for predictions. Is the above model from statsmodels even the best one to use?

When you give a pandas object to a time-series model, it expects the index to contain dates. The error message is improved in the current source (to be released soon).
ValueError: Given a pandas object and the index does not contain dates
In the second case, you're giving a single 1-D series to a VAR. VARs are used when you have more than one series, which is why you get the shape error: the model expects a second dimension in your array. We could probably improve the error message here. For a single-series AR model with exogenous variables, you probably want to use sm.tsa.ARMA. Note that there is a known bug in ARMA.predict for models with exogenous variables, to be fixed soon. If you could provide a test case for this, it would be helpful.

Related

Why does a pandas DataFrame allow setting a column using a too-large Series?

Is there a reason why pandas raises a ValueError when setting a DataFrame column using a list but does not do the same when using a Series? As a result, superfluous Series values are silently ignored (e.g. 7 in the example below).
>>> import pandas as pd
>>> df = pd.DataFrame([[1],[2]])
>>> df
0
0 1
1 2
>>> df[0] = [5,6,7]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\Python310\lib\site-packages\pandas\core\frame.py", line 3655, in __setitem__
self._set_item(key, value)
File "D:\Python310\lib\site-packages\pandas\core\frame.py", line 3832, in _set_item
value = self._sanitize_column(value)
File "D:\Python310\lib\site-packages\pandas\core\frame.py", line 4529, in _sanitize_column
com.require_length_match(value, self.index)
File "D:\Python310\lib\site-packages\pandas\core\common.py", line 557, in require_length_match
raise ValueError(
ValueError: Length of values (3) does not match length of index (2)
>>>
>>> df[0] = pd.Series([5,6,7])
>>> df
0
0 5
1 6
Tested using python 3.10.6 and pandas 1.5.3 on Windows 10.
You are right that the behaviour differs between a list and a Series, but it is expected.
If you take a look at the source code in the frame.py module, you will see that if the value is a list, its length is checked against the index (the require_length_match call in your traceback). A Series is not length-checked; it is aligned to the DataFrame's index by label instead, so, as you observed, values whose labels are not in the index are dropped.
NOTE: The details of the truncation are here
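The difference between label alignment and a plain length check can be reproduced with a short sketch. The column name shifted is a hypothetical name used only for this illustration.

```python
import pandas as pd

df = pd.DataFrame([[1], [2]])

# A list carries no labels, so its length must match the index exactly.
try:
    df[0] = [5, 6, 7]
    list_error = None
except ValueError as exc:
    list_error = str(exc)

# A Series is aligned on index labels: label 2 has no row in df and is dropped.
df[0] = pd.Series([5, 6, 7])
aligned = df[0].tolist()

# Labels that are missing from the Series become NaN instead of raising.
df["shifted"] = pd.Series([5, 6, 7], index=[1, 2, 3])

# Passing the raw values of the Series restores the strict length check.
try:
    df[0] = pd.Series([5, 6, 7]).to_numpy()
    array_error = None
except ValueError as exc:
    array_error = str(exc)

print(aligned, list_error, array_error)
```

Converting with to_numpy() is one way to opt out of alignment when a length mismatch should be treated as an error.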

Extract quarter information from numpy datetime64 object

I have the numpy datetime64 object below
import numpy as np
date_time = np.datetime64('2012-05-01T01:00:00.000000+0100')
I would like to express this as a year-quarter string, i.e. '2012Q2'. Is there any method available to do this? I tried the pandas Timestamp method, but it raises an error:
import pandas as pd
>>> pd.Timestamp(date_time).dt.quarter
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'Timestamp' object has no attribute 'dt'
Any pointers would be very helpful.
There are various ways that one can achieve that, depending on the desired output type.
If one wants the type pandas._libs.tslibs.period.Period, then one can use:
pandas.Period as follows
year_quarter = pd.Period(date_time, freq='Q')
[Out]: 2012Q2
pandas.Timestamp, as user7864386 mentioned, as follows
year_quarter = pd.Timestamp(date_time).to_period('Q')
[Out]: 2012Q2
Alternatively, if one wants the final output to be a string, one can call the Period's strftime method with the format '%YQ%q' (the %q directive is specific to periods), such as
year_quarter = pd.Period(date_time, freq='Q').strftime('%YQ%q')
# or
year_quarter = pd.Timestamp(date_time).to_period('Q').strftime('%YQ%q')
Notes:
date_time = np.datetime64('2012-05-01T01:00:00.000000+0100') gives a
DeprecationWarning: parsing timezone aware datetimes is deprecated; this will raise an error in the future
To check the variable year_quarter type, one can do the following
print(type(year_quarter))
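The options above can be put together in one runnable sketch. It assumes a timezone-naive datetime64 value (the offset from the question is dropped to avoid the deprecation warning mentioned in the notes); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Timezone-naive value; an assumption made to sidestep the deprecation warning.
date_time = np.datetime64("2012-05-01T01:00:00")

ts = pd.Timestamp(date_time)

# A scalar Timestamp exposes quarter directly; .dt exists only on Series.
quarter_number = ts.quarter

# Period output.
period = ts.to_period("Q")

# String output, either via the Period or built from Timestamp attributes.
label_from_period = period.strftime("%YQ%q")
label_from_attrs = f"{ts.year}Q{ts.quarter}"

print(quarter_number, period, label_from_period, label_from_attrs)
```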

Lifelines boolean index in Python did not match indexed array along dimension 0; dimension is 88 but corresponding boolean dimension is 76

This very simple piece of code,
# imports...
from lifelines import CoxPHFitter
import pandas as pd
src_file = "Pred.csv"
df = pd.read_csv(src_file, header=0, delimiter=',')
df = df.drop(columns=['score'])
cph = CoxPHFitter()
cph.fit(df, duration_col='Length', event_col='Status', show_progress=True)
produces an error:
Traceback (most recent call last):
File "C:/Users/.../predictor.py", line 11, in <module>
cph.fit(df, duration_col='Length', event_col='Status', show_progress=True)
File "C:\Users\...\AppData\Local\conda\conda\envs\hrpred\lib\site-packages\lifelines\fitters\coxph_fitter.py", line 298, in fit
self._check_values(df)
File "C:\Users\...\AppData\Local\conda\conda\envs\hrpred\lib\site-packages\lifelines\fitters\coxph_fitter.py", line 323, in _check_values
cols = str(list(X.columns[low_var]))
File "C:\Users\...\AppData\Local\conda\conda\envs\hrpred\lib\site-packages\pandas\core\indexes\base.py", line 1754, in __getitem__
result = getitem(key)
IndexError: boolean index did not match indexed array along dimension 0; dimension is 88 but corresponding boolean dimension is 76
However, when I print df itself, everything's all right. As you can see, everything is inside the library. And the library's examples work fine.
Without knowing what your data look like, I can only say that I had the same error, and it was resolved when I removed everything except the duration, event and covariate column(s) from the pandas df I was using. That is, I had a lot of extra columns in the df that were confusing the Cox PH fitter, since you don't actually specify which covariates you want to include as an argument to cph.fit().
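A minimal sketch of that clean-up step, using pandas only. The helper select_model_columns, the covariate age and the extra column notes are hypothetical names for illustration; Length and Status follow the question.

```python
import pandas as pd


def select_model_columns(df, duration_col, event_col, covariates):
    """Return a copy of df holding only the columns the model should see."""
    wanted = [duration_col, event_col, *covariates]
    missing = [c for c in wanted if c not in df.columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return df.loc[:, wanted].copy()


# Hypothetical data standing in for Pred.csv.
raw = pd.DataFrame(
    {
        "Length": [5, 8, 12, 3],
        "Status": [1, 0, 1, 1],
        "age": [61, 54, 70, 48],
        "notes": ["a", "b", "c", "d"],  # extra column that should not be fitted
    }
)

model_df = select_model_columns(raw, "Length", "Status", ["age"])
print(model_df.columns.tolist())

# The trimmed frame is what would then be passed to the fitter, e.g.
# CoxPHFitter().fit(model_df, duration_col="Length", event_col="Status")
```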

Memory error in numpy

I am trying to build this converter for one of my personal projects using numpy, and I am getting a MemoryError. I am new to Python. The code works fine for small data but breaks when I give it 5 MB of data as input (data attached). Here is the code. Could experts point out where the memory is blowing up? Link to data can be found here
import numpy as np
import gc as gc
"""
USAGE: convert(data,cols)
data - numpy array of data
cols - tuple of columns to process. These columns should be categorical columns.
IMP: Indexing of columns in data starts with 0. You can't index the last column.
Ex: you want to index the second col here, then
data
a b c
a b c
x y z
cols=(1,)
if you want to index the 1st and second, then
cols=(0,1)
All 3
cols=(0,1,2)
You can also skip a numeric column, which you don't want to encode, like
cols=(0,2) will skip 1 col
"""
def lookupBuilder(strArray):
    # Map each unique value to an integer code starting at 1.
    a=np.arange(len(strArray))+1
    lookups={k:v for (k,v) in zip(strArray,a)}
    return lookups
def convert(data,cols):
    for ix,i in enumerate(cols):
        # Basic slicing returns a view, so writes to col also change data.
        col=data[:,i:i+1]
        lookup_data=lookupBuilder(np.unique(col))
        for idx,value in enumerate(col):
            col[idx]=lookup_data[value[0]]
        # np.delete and np.insert each build and return a new array;
        # the results are discarded here, so data itself is not modified by them.
        np.delete(data,i,1)
        gc.collect()
        np.insert(data,i,col,axis=1)
    return data
if __name__=="__main__":
    pass
Error
Traceback (most recent call last):
File "C:\MLDatabases\python_scripts\MLP.py", line 230, in <module>
data=cc.convert(data,(1,2,3,4,5,6,7,8,9,13,19))
File "C:\MLDatabases\python_scripts\categorical_converter.py", line 49, in convert
np.insert(data,i,col,axis=1)
File "C:\python\lib\site-packages\numpy\lib\function_base.py", line 4906, in insert
new = empty(newshape, arr.dtype, arrorder)
MemoryError
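One possible reading of the traceback, offered as a sketch rather than a confirmed diagnosis: np.delete and np.insert do not work in place, so each call allocates a full copy of data, and the failing allocation is inside np.insert. Because a column slice is a view, the codes can be written straight back into data without either call. The function name encode_columns is hypothetical, and the sketch assumes data has dtype=object so that integer codes can be stored next to the other values.

```python
import numpy as np


def encode_columns(data, cols):
    """Replace categorical columns of an object array with integer codes, in place."""
    for i in cols:
        # return_inverse gives, for each row, the position of its value in the
        # sorted unique values; adding 1 makes the codes start at 1.
        _, inverse = np.unique(data[:, i], return_inverse=True)
        data[:, i] = inverse + 1
    return data


# Small illustrative array: two categorical columns and one numeric column.
data = np.array(
    [["a", "b", 1.5], ["a", "b", 2.5], ["x", "y", 3.5]],
    dtype=object,
)
encode_columns(data, (0, 1))
print(data)
```

Only one temporary column-sized array exists at a time, and the numeric column is left untouched.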

How to use the mean method on a pandas TimeSeries with Decimal type values?

I need to store Python decimal type values in a pandas TimeSeries/DataFrame object. Pandas gives me an error when using the "groupby" and "mean" on the TimeSeries/DataFrame. The following code based on floats works well:
[0]: by = lambda x: lambda y: getattr(y, x)
[1]: rng = date_range('1/1/2000', periods=40, freq='4h')
[2]: rnd = np.random.randn(len(rng))
[3]: ts = TimeSeries(rnd, index=rng)
[4]: ts.groupby([by('year'), by('month'), by('day')]).mean()
2000 1 1 0.512422
2 0.447235
3 0.290151
4 -0.227240
5 0.078815
6 0.396150
7 -0.507316
But I get an error if I do the same using decimal values instead of floats:
[5]: rnd = [Decimal(x) for x in rnd]
[6]: ts = TimeSeries(rnd, index=rng, dtype=Decimal)
[7]: ts.groupby([by('year'), by('month'), by('day')]).mean() #Crash!
Traceback (most recent call last):
File "C:\Users\TM\Documents\Python\tm.py", line 100, in <module>
print ts.groupby([by('year'), by('month'), by('day')]).mean()
File "C:\Python27\lib\site-packages\pandas\core\groupby.py", line 293, in mean
return self._cython_agg_general('mean')
File "C:\Python27\lib\site-packages\pandas\core\groupby.py", line 365, in _cython_agg_general
raise GroupByError('No numeric types to aggregate')
pandas.core.groupby.GroupByError: No numeric types to aggregate
The error message is "GroupByError('No numeric types to aggregate')". Is there any chance to use the standard aggregations like sum, mean, and quantile on a TimeSeries or DataFrame containing Decimal values?
Why doesn't it work, and is there an equally fast alternative if it is not possible?
EDIT: I just realized that most of the other functions (min, max, median, etc.) work fine, but not the mean function that I desperately need :-(.
import numpy as np
ts.groupby([by('year'), by('month'), by('day')]).apply(np.mean)
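As an alternative sketch in current pandas syntax, the per-group mean can be computed with plain Decimal arithmetic so the values are never routed through a float aggregation. The helper decimal_mean and the four sample values are assumptions for illustration, and the grouping is done by calendar day instead of the by('year'), by('month'), by('day') helpers. This is likely slower than the built-in float aggregations, since it runs a Python function per group.

```python
from decimal import Decimal

import pandas as pd


def decimal_mean(values):
    """Arithmetic mean computed entirely with Decimal arithmetic."""
    values = list(values)
    return sum(values, Decimal(0)) / len(values)


# Illustrative data: two observations on each of two days.
index = pd.to_datetime(
    [
        "2000-01-01 00:00",
        "2000-01-01 12:00",
        "2000-01-02 00:00",
        "2000-01-02 12:00",
    ]
)
ts = pd.Series(
    [Decimal("1.0"), Decimal("2.0"), Decimal("4.0"), Decimal("6.0")],
    index=index,
    dtype=object,
)

# Group by calendar day (timestamps normalized to midnight) and apply the helper.
daily = ts.groupby(ts.index.normalize()).apply(decimal_mean)
print(daily)
```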
