Error while implementing ARIMA in Python on Quandl Data

df = quandl.get('NSE/TATAMOTORS', start_date='2000-01-01', end_date='2018-05-10')
df=df.drop(['Last','Total Trade Quantity','Turnover (Lacs)'], axis=1)
df.head(10)
OUTPUT -
Open High Low Close
Date
2003-12-26 435.80 440.50 431.65 438.60
2003-12-29 441.00 449.70 441.00 447.80
2003-12-30 450.00 451.90 430.10 442.40
2003-12-31 446.00 459.30 443.55 452.05
2004-01-01 453.25 457.90 451.50 454.45
2004-01-02 458.00 460.35 454.05 456.40
2004-01-05 458.00 465.00 450.60 454.85
2004-01-06 460.00 465.00 448.50 454.45
2004-01-07 451.40 454.70 438.10 446.45
2004-01-08 449.00 466.95 449.00 464.75
-
from statsmodels.tsa.arima_model import ARIMA
model = ARIMA(df, order=(5,1,0))
OUTPUT -
Traceback (most recent call last):
File "<ipython-input-90-799de8e60d6f>", line 1, in <module>
model = ARIMA(df, order=(5,1,0))
File "D:\A\lib\site-packages\statsmodels\tsa\arima_model.py", line 1000, in __new__
mod.__init__(endog, order, exog, dates, freq, missing)
File "D:\A\lib\site-packages\statsmodels\tsa\arima_model.py", line 1024, in __init__
self.data.ynames = 'D.' + self.endog_names
TypeError: must be str, not list
So I moved the date index into a regular column with:
df = df.reset_index()
df.head(10)
Out[92]:
Date Open High Low Close
0 2003-12-26 435.80 440.50 431.65 438.60
1 2003-12-29 441.00 449.70 441.00 447.80
2 2003-12-30 450.00 451.90 430.10 442.40
3 2003-12-31 446.00 459.30 443.55 452.05
4 2004-01-01 453.25 457.90 451.50 454.45
5 2004-01-02 458.00 460.35 454.05 456.40
6 2004-01-05 458.00 465.00 450.60 454.85
7 2004-01-06 460.00 465.00 448.50 454.45
8 2004-01-07 451.40 454.70 438.10 446.45
9 2004-01-08 449.00 466.95 449.00 464.75
Then, when I run the same lines again:
from statsmodels.tsa.arima_model import ARIMA
model = ARIMA(df, order=(5,1,0))
OUTPUT -
Traceback (most recent call last):
File "<ipython-input-94-799de8e60d6f>", line 1, in <module>
model = ARIMA(df, order=(5,1,0))
File "D:\A\lib\site-packages\statsmodels\tsa\arima_model.py", line 1000, in __new__
mod.__init__(endog, order, exog, dates, freq, missing)
File "D:\A\lib\site-packages\statsmodels\tsa\arima_model.py", line 1015, in __init__
super(ARIMA, self).__init__(endog, (p, q), exog, dates, freq, missing)
File "D:\A\lib\site-packages\statsmodels\tsa\arima_model.py", line 452, in __init__
super(ARMA, self).__init__(endog, exog, dates, freq, missing=missing)
File "D:\A\lib\site-packages\statsmodels\tsa\base\tsa_model.py", line 43, in __init__
super(TimeSeriesModel, self).__init__(endog, exog, missing=missing)
File "D:\A\lib\site-packages\statsmodels\base\model.py", line 212, in __init__
super(LikelihoodModel, self).__init__(endog, exog, **kwargs)
File "D:\A\lib\site-packages\statsmodels\base\model.py", line 63, in __init__
**kwargs)
File "D:\A\lib\site-packages\statsmodels\base\model.py", line 88, in _handle_data
data = handle_data(endog, exog, missing, hasconst, **kwargs)
File "D:\A\lib\site-packages\statsmodels\base\data.py", line 630, in handle_data
**kwargs)
File "D:\A\lib\site-packages\statsmodels\base\data.py", line 76, in __init__
self.endog, self.exog = self._convert_endog_exog(endog, exog)
File "D:\A\lib\site-packages\statsmodels\base\data.py", line 471, in _convert_endog_exog
raise ValueError("Pandas data cast to numpy dtype of object. "
ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).
HELP?

ARIMA expects a one-dimensional array-like object. Pass a single column (a Series or 1-D array) instead of the whole 2-D DataFrame and it will work.
Try:
ARIMA(df['Close'].values, order=(5,1,0))
where df has a DatetimeIndex and you select one column:
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10 entries, 2003-12-26 to 2004-01-08
Data columns (total 4 columns):
Open 10 non-null float64
High 10 non-null float64
Low 10 non-null float64
Close 10 non-null float64
dtypes: float64(4)
memory usage: 400.0 bytes
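The error message itself suggests checking the input with np.asarray(data). A minimal sketch of that check, assuming a small hand-typed stand-in for the Quandl frame (the three rows are copied from the output in the question; the variable names are illustrative only):

```python
import numpy as np
import pandas as pd

# Hand-typed stand-in for the first rows of the Quandl frame (not a live download).
df = pd.DataFrame(
    {
        "Open": [435.80, 441.00, 450.00],
        "High": [440.50, 449.70, 451.90],
        "Low": [431.65, 441.00, 430.10],
        "Close": [438.60, 447.80, 442.40],
    },
    index=pd.to_datetime(["2003-12-26", "2003-12-29", "2003-12-30"]),
)
df.index.name = "Date"

# After reset_index() the dates sit next to the floats, so the array is object-typed.
with_date_column = np.asarray(df.reset_index())

# A single column is a one-dimensional float array, the shape a univariate model needs.
close = df["Close"].to_numpy()

print(with_date_column.dtype)   # object
print(close.ndim, close.dtype)  # 1 float64
```

The object dtype in the first print is what triggers the second ValueError in the question; the single Close column avoids it.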

Related

Pandas apply getting KeyError: [duplicate]

I was looking at this answer by Roman Pekar for using apply. I initially copied the code exactly and it worked fine. Then I used it on my df3, which is created from a CSV file, and got a KeyError. I checked the dtypes: the columns I am using are int64, and there are no nulls. Once this works I will make the function more complex. How do I get this working?
def fxy(x, y):
    return x * y
df3 = pd.read_csv(path + 'test_data.csv', usecols=[0,1,2])
print(df3.dtypes)
df3['Area'] = df3.apply(lambda x: fxy(x['Len'], x['Width']))
Traceback:
Traceback (most recent call last):
File "f:\...\my_file.py", line 54, in <module>
df3['Area'] = df3.apply(lambda x: fxy(x['Len'], x['Width']))
File "C:\...\frame.py", line 8833, in apply
return op.apply().__finalize__(self, method="apply")
File "C:\...\apply.py", line 727, in apply
return self.apply_standard()
File "C:\...\apply.py", line 851, in apply_standard
results, res_index = self.apply_series_generator()
File "C:\...\apply.py", line 867, in apply_series_generator
results[i] = self.f(v)
File "f:\...\my_file.py", line 54, in <lambda>
df3['Area'] = df3.apply(lambda x: fxy(x['Len'], x['Width']))
File "C:\...\series.py", line 958, in __getitem__
return self._get_value(key)
File "C:\...\series.py", line 1069, in _get_value
loc = self.index.get_loc(label)
File "C:\...\range.py", line 389, in get_loc
raise KeyError(key)
KeyError: 'Len'
I don't see a way to attach the CSV file. Below is a sample of df3; if I save it from Excel as "CSV (Comma delimited) (*.csv)", I get the same results.
ID Len Width
A 170 4
B 362 5
C 12 15
D 42 7
E 15 3
F 46 49
G 71 74

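I think you are missing axis=1 in apply. Without it, apply passes each column (not each row) to the lambda, so x['Len'] is looked up among the row labels and raises the KeyError: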
df3['Area'] = df3.apply(lambda x: fxy(x['Len'], x['Width']), axis=1)
But in your case, you can just do:
df3['Area'] = df3['Len'] * df3['Width']
print(df3)
# Output
ID Len Width Area
0 A 170 4 680
1 B 362 5 1810
2 C 12 15 180
3 D 42 7 294
4 E 15 3 45
5 F 46 49 2254
6 G 71 74 5254
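The difference between the two calls can be seen on a small frame. This is a sketch using the first three rows of the sample data; it only illustrates which object apply hands to the lambda in each case:

```python
import pandas as pd

df3 = pd.DataFrame({"ID": ["A", "B", "C"], "Len": [170, 362, 12], "Width": [4, 5, 15]})

def fxy(x, y):
    return x * y

# Default axis=0: the lambda receives each column, whose labels are 0, 1, 2.
try:
    df3.apply(lambda col: fxy(col["Len"], col["Width"]))
    raised = False
except KeyError:
    raised = True

# axis=1: the lambda receives each row, whose labels are the column names.
by_row = df3.apply(lambda row: fxy(row["Len"], row["Width"]), axis=1)

# Column arithmetic gives the same values without calling Python once per row.
vectorized = df3["Len"] * df3["Width"]

print(raised)           # True
print(by_row.tolist())  # [680, 1810, 180]
```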

statsmodels ols from formula with groupby pandas

I have a dataframe of the type:
date TICKER x1 x2 ... Z Y month x3
0 1999-12-31 A UN Equity 52.1330 51.9645 ... 0.0052 NaN 12 NaN
1 1999-12-31 AA UN Equity 92.9415 92.8715 ... 0.0052 NaN 12 NaN
2 1999-12-31 ABC UN Equity 3.6843 3.6539 ... 0.0052 NaN 12 NaN
3 1999-12-31 ABF UN Equity 22.0625 21.9375 ... 0.0052 NaN 12 NaN
4 1999-12-31 ABM UN Equity 10.2188 10.1250 ... 0.0052 NaN 12 NaN
I would like to run an OLS regression with the formula 'Y ~ x1 + x2:x3' for each group of ['TICKER','year','month'] (year is a column that does not appear here), using statsmodels.formula.api imported as smf. I therefore use:
data.groupby(['TICKER','year','month']).apply(lambda x: smf.ols(formula='Y ~ x1 + x2:x3', data=x))
However, I get the following error:
IndexError: tuple index out of range
Any idea why?
The full traceback is:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\pandas\core\groupby\groupby.py", line 894, in apply
result = self._python_apply_general(f, self._selected_obj)
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\pandas\core\groupby\groupby.py", line 928, in _python_apply_general
keys, values, mutated = self.grouper.apply(f, data, self.axis)
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\pandas\core\groupby\ops.py", line 238, in apply
res = f(group)
File "<input>", line 1, in <lambda>
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\model.py", line 195, in from_formula
mod = cls(endog, exog, *args, **kwargs)
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\regression\linear_model.py", line 872, in __init__
super(OLS, self).__init__(endog, exog, missing=missing,
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\regression\linear_model.py", line 703, in __init__
super(WLS, self).__init__(endog, exog, missing=missing,
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\regression\linear_model.py", line 190, in __init__
super(RegressionModel, self).__init__(endog, exog, **kwargs)
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\model.py", line 237, in __init__
super(LikelihoodModel, self).__init__(endog, exog, **kwargs)
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\model.py", line 77, in __init__
self.data = self._handle_data(endog, exog, missing, hasconst,
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\model.py", line 101, in _handle_data
data = handle_data(endog, exog, missing, hasconst, **kwargs)
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\data.py", line 672, in handle_data
return klass(endog, exog=exog, missing=missing, hasconst=hasconst,
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\data.py", line 71, in __init__
arrays, nan_idx = self.handle_missing(endog, exog, missing,
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\data.py", line 247, in handle_missing
if combined_nans.shape[0] != nan_mask.shape[0]:
IndexError: tuple index out of range

Your Y column has a lot of NaNs, so you need to make sure each subgroup has enough complete observations for the regression to work.
Take some example data:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
np.random.seed(123)
# 30 random rows: three grouping keys plus the four regression variables
data = pd.concat([
    pd.DataFrame({'TICKER': np.random.choice(['A','B','C'], 30),
                  'year': np.random.choice([2000,2001], 30),
                  'month': np.random.choice([1,2], 30)}),
    pd.DataFrame(np.random.normal(0, 1, (30,4)), columns=['Y','x1','x2','x3'])
], axis=1)
data.loc[:6,'Y'] = np.nan
If I run your code on the data frame above, I get the same error.
If we instead keep only the rows that are complete in the regression variables:
complete_ix = data[['Y','x1','x2','x3']].dropna().index
data.loc[complete_ix].groupby(['TICKER','year','month']).apply(lambda x: smf.ols(formula='Y ~ x1 + x2:x3', data=x))
It works:
TICKER year month
A 2000 2 <statsmodels.regression.linear_model.OLS objec...
2001 1 <statsmodels.regression.linear_model.OLS objec...
2 <statsmodels.regression.linear_model.OLS objec...
B 2000 1 <statsmodels.regression.linear_model.OLS objec...
2 <statsmodels.regression.linear_model.OLS objec...
2001 1 <statsmodels.regression.linear_model.OLS objec...
C 2000 1 <statsmodels.regression.linear_model.OLS objec...
2 <statsmodels.regression.linear_model.OLS objec...
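The "enough observations" condition can also be enforced before fitting. The sketch below uses pandas only and hand-made data; MIN_OBS is a hypothetical threshold chosen for illustration, not a value from the answer:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "TICKER": ["A", "A", "A", "A", "B", "B", "B", "C"],
    "year": [2000] * 8,
    "month": [1] * 8,
    "Y": [np.nan, 1.0, 2.0, 3.0, np.nan, np.nan, np.nan, 0.5],
    "x1": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    "x2": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x3": [0.5, 0.4, 0.3, 0.2, 0.1, 0.0, 0.1, 0.2],
})

keys = ["TICKER", "year", "month"]
model_cols = ["Y", "x1", "x2", "x3"]
MIN_OBS = 3  # hypothetical minimum number of complete rows per group

# Drop rows with a missing value in any variable the formula uses.
complete = data.dropna(subset=model_cols)

# Keep only groups that still have at least MIN_OBS rows.
usable = complete.groupby(keys).filter(lambda g: len(g) >= MIN_OBS)

print(usable.groupby(keys).size())
```

Here group B disappears entirely (all Y values are missing) and group C is dropped for having a single row, so only group A would be passed on to smf.ols.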

How to convert label data with None values to OneHot using LabelBinarizer sklearn

I have label data in which some of the values are np.nan.
I want to convert the data to one-hot vectors using LabelBinarizer, with np.nan mapped to an all-zero row.
But I get an error. I did manage to convert the data with get_dummies from pandas.
I can't use get_dummies because the train and test data come from different files at different times. I want to use an sklearn model so that I can save it and use it later.
Example code:
In[11]: df = pd.DataFrame({'CITY':['London','NYC','Manchester',np.nan],'Country':['UK','US','UK','AUS']})
In[12]: df
Out[12]:
CITY Country
0 London UK
1 NYC US
2 Manchester UK
3 NaN AUS
In[13]: pd.get_dummies(df['CITY'])
Out[13]:
London Manchester NYC
0 1 0 0
1 0 0 1
2 0 1 0
3 0 0 0
In[14]: from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
In[15]: lb.fit_transform(df['CITY'])
Traceback (most recent call last):
File "/home/oshrib/.conda/envs/on_target/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-16-d0afb38b2695>", line 1, in <module>
lb.fit_transform(df['CITY'])
File "/home/oshrib/.conda/envs/on_target/lib/python3.5/site-packages/sklearn/preprocessing/label.py", line 307, in fit_transform
return self.fit(y).transform(y)
File "/home/oshrib/.conda/envs/on_target/lib/python3.5/site-packages/sklearn/preprocessing/label.py", line 276, in fit
self.y_type_ = type_of_target(y)
File "/home/oshrib/.conda/envs/on_target/lib/python3.5/site-packages/sklearn/utils/multiclass.py", line 288, in type_of_target
if (len(np.unique(y)) > 2) or (y.ndim >= 2 and len(y[0]) > 1):
File "/home/oshrib/.conda/envs/on_target/lib/python3.5/site-packages/numpy/lib/arraysetops.py", line 223, in unique
return _unique1d(ar, return_index, return_inverse, return_counts)
File "/home/oshrib/.conda/envs/on_target/lib/python3.5/site-packages/numpy/lib/arraysetops.py", line 283, in _unique1d
ar.sort()
TypeError: unorderable types: float() < str()
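No answer is recorded for this question. One possible workaround, which is not from the original thread and uses pandas rather than LabelBinarizer, is to fix the category list from the training data and reuse it for later batches. The test series and the one_hot helper below are hypothetical names introduced for illustration:

```python
import numpy as np
import pandas as pd

train = pd.Series(["London", "NYC", "Manchester", np.nan], name="CITY")
test = pd.Series(["NYC", np.nan, "Paris"], name="CITY")  # hypothetical later batch

# Learn the categories once from the training data; this list is what would be saved.
categories = sorted(train.dropna().unique())
city_dtype = pd.CategoricalDtype(categories=categories)

def one_hot(values):
    # NaN and values outside the saved categories become all-zero rows.
    return pd.get_dummies(values.astype(city_dtype), dtype=int)

train_encoded = one_hot(train)
test_encoded = one_hot(test)

print(train_encoded)
print(test_encoded)
```

Because the columns come from the saved category list, both frames have the same three columns even though the batches contain different cities.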

Query hdf5 datetime column

I have a hdf5 file that contains a table where the column time is in datetime64[ns] format.
I want to get all the rows that are older than thresh. How can I do that? This is what I've tried:
thresh = pd.datetime.strptime('2018-03-08 14:19:41','%Y-%m-%d %H:%M:%S').timestamp()
hdf = pd.read_hdf(STORE, 'gh1', where = 'time>thresh' )
I get the following error:
Traceback (most recent call last):
File "<ipython-input-80-fa444735d0a9>", line 1, in <module>
runfile('/home/joao/github/control_panel/controlpanel/controlpanel/reading_test.py', wdir='/home/joao/github/control_panel/controlpanel/controlpanel')
File "/home/joao/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 705, in runfile
execfile(filename, namespace)
File "/home/joao/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/home/joao/github/control_panel/controlpanel/controlpanel/reading_test.py", line 15, in <module>
hdf = pd.read_hdf(STORE, 'gh1', where = 'time>thresh' )
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 370, in read_hdf
return store.select(key, auto_close=auto_close, **kwargs)
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 717, in select
return it.get_result()
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 1457, in get_result
results = self.func(self.start, self.stop, where)
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 710, in func
columns=columns, **kwargs)
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 4141, in read
if not self.read_axes(where=where, **kwargs):
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 3340, in read_axes
self.selection = Selection(self, where=where, **kwargs)
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 4706, in __init__
self.condition, self.filter = self.terms.evaluate()
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 556, in evaluate
self.condition = self.terms.prune(ConditionBinOp)
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 118, in prune
res = pr(left.value, right.value)
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 113, in pr
encoding=self.encoding).evaluate()
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 327, in evaluate
values = [self.convert_value(v) for v in rhs]
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 327, in <listcomp>
values = [self.convert_value(v) for v in rhs]
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 185, in convert_value
v = pd.Timestamp(v)
File "pandas/_libs/tslib.pyx", line 390, in pandas._libs.tslib.Timestamp.__new__
File "pandas/_libs/tslib.pyx", line 1549, in pandas._libs.tslib.convert_to_tsobject
File "pandas/_libs/tslib.pyx", line 1735, in pandas._libs.tslib.convert_str_to_tsobject
ValueError: could not convert string to Timestamp

Demo:
Create a sample DataFrame (100,000 rows):
In [9]: N = 10**5
In [10]: dates = pd.date_range('1980-01-01', freq='99T', periods=N)
In [11]: df = pd.DataFrame({'date':dates, 'val':np.random.rand(N)})
In [12]: df
Out[12]:
date val
0 1980-01-01 00:00:00 0.985215
1 1980-01-01 01:39:00 0.452295
2 1980-01-01 03:18:00 0.780096
3 1980-01-01 04:57:00 0.004596
4 1980-01-01 06:36:00 0.515051
... ... ...
99995 1998-10-27 15:45:00 0.509954
99996 1998-10-27 17:24:00 0.046636
99997 1998-10-27 19:03:00 0.026678
99998 1998-10-27 20:42:00 0.660652
99999 1998-10-27 22:21:00 0.839426
[100000 rows x 2 columns]
Write it to an HDF5 file in table format, with date as a queryable data column:
In [13]: df.to_hdf('d:/temp/test.h5', 'test', format='t', data_columns=['date'])
Read from the HDF5 file with a condition on the date column:
In [14]: x = pd.read_hdf('d:/temp/test.h5', 'test', where="date > '1998-10-27 15:00:00'")
In [15]: x
Out[15]:
date val
99995 1998-10-27 15:45:00 0.509954
99996 1998-10-27 17:24:00 0.046636
99997 1998-10-27 19:03:00 0.026678
99998 1998-10-27 20:42:00 0.660652
99999 1998-10-27 22:21:00 0.839426
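Applied to the question's threshold, the same quoted-date form of the where string can be built from a Timestamp. This is a sketch: the read_hdf call is left as a comment because it needs PyTables and the asker's store, and the small in-memory frame is a hypothetical stand-in used only to show the equivalent comparison:

```python
import pandas as pd

thresh = pd.Timestamp("2018-03-08 14:19:41")

# Query string in the same form as the demo: column name, operator, quoted date.
where = f"time > '{thresh}'"
print(where)  # time > '2018-03-08 14:19:41'

# With PyTables installed and 'time' stored as a data column, the call would be:
# hdf = pd.read_hdf(STORE, 'gh1', where=where)

# In-memory stand-in for the same condition, on hypothetical rows.
df = pd.DataFrame({
    "time": pd.to_datetime(["2018-03-08 14:00:00", "2018-03-08 15:00:00"]),
    "val": [1.0, 2.0],
})
selected = df[df["time"] > thresh]
print(selected)
```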

How to convert netCDF4 file in an ASCII format file for a special grid point in Python?

I am trying to extract the data for one particular grid point (68,21) from a large netCDF file and write it to an ASCII file. When I ran the following:
from pylab import *
from netCDF4 import Dataset
import pandas as pd
nc = Dataset("/home/python/PBLH_Exp_08_Jul_2006.nc")
PBLH = nc.variables['PBLH'][:,:,:]
Times = nc.variables['Times'][:,:]
d={}
d['Times'] = Times[:,0]
d['PBLH'] = PBLH[:,:,1]
df=pd.DataFrame(d)
df.to_csv('Produkt/PBLH_Exp_08_Jul_2006.csv')
I got the error message:
Traceback (most recent call last):
File "/home/python/wrf_map.py", line 62, in <module>
df=pd.DataFrame(d)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 226, in __init__
mgr = self._init_dict(data, index, columns, dtype=dtype)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 363, in _init_dict
dtype=dtype)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 5163, in _arrays_to_mgr
arrays = _homogenize(arrays, index, dtype)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 5477, in _homogenize
raise_cast_failure=False)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/series.py", line 2885, in _sanitize_array
raise Exception('Data must be 1-dimensional')
Exception: Data must be 1-dimensional
What can I do to solve this issue? And how can I extract the data for my grid point? By the way, here is a part of the header of my netCDF file:
<xarray.Dataset>
Dimensions: (Time: 744, south_north: 140, west_east: 140)
Coordinates:
* Time (Time) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ...
* south_north (south_north) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ...
* west_east (west_east) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
Data variables:
Times (Time) |S19 '2006-07-01_01:00:00' '2006-07-01_02:00:00' ...
PBLH (Time, south_north, west_east) float32 59.9834 59.8574 ...
Thanks for any help!!

I solved a part of my problem. I used xarray to get the data I wanted (for the position with grid points 21,68):
import numpy as np
import xarray as xr  # the package was formerly named xray
from pylab import *
data = xr.open_dataset("/home/python/PBLH_Exp_08_jul_2006.nc")
d = xr.DataArray(data.variables['PBLH'])
print(d[:,21,68])
But I still couldn't save my data to an ASCII file...
EDIT: Got it! To save my data I used the csv module:
import csv
df = d[:,21,68]
with open('/home/python/output.txt','w') as fout:
    writer = csv.writer(fout)
    writer.writerows(df)
It doesn't look pretty, but I can handle it! So this should work for anyone with a similar issue! :)
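The original "Data must be 1-dimensional" error came from putting a 2-D slice (PBLH[:,:,1]) into the DataFrame; indexing both spatial axes gives a 1-D series that pandas can write directly. A minimal sketch, assuming a synthetic array in place of the netCDF variable and only four time steps (the array contents and the in-memory buffer are illustrative):

```python
import io

import numpy as np
import pandas as pd

# Synthetic stand-in for nc.variables['PBLH'][:, :, :] with shape (Time, south_north, west_east).
pblh = np.arange(4 * 140 * 140, dtype="float32").reshape(4, 140, 140)
times = [
    "2006-07-01_01:00:00",
    "2006-07-01_02:00:00",
    "2006-07-01_03:00:00",
    "2006-07-01_04:00:00",
]

# Fixing both grid indices leaves one value per time step.
point = pblh[:, 21, 68]

df = pd.DataFrame({"Times": times, "PBLH": point})

# Written to a buffer here; passing a file path to to_csv writes the same text to disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
text = buffer.getvalue()
print(text)
```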
