df = quandl.get('NSE/TATAMOTORS', start_date='2000-01-01', end_date='2018-05-10')
df=df.drop(['Last','Total Trade Quantity','Turnover (Lacs)'], axis=1)
df.head(10)
OUTPUT -
Open High Low Close
Date
2003-12-26 435.80 440.50 431.65 438.60
2003-12-29 441.00 449.70 441.00 447.80
2003-12-30 450.00 451.90 430.10 442.40
2003-12-31 446.00 459.30 443.55 452.05
2004-01-01 453.25 457.90 451.50 454.45
2004-01-02 458.00 460.35 454.05 456.40
2004-01-05 458.00 465.00 450.60 454.85
2004-01-06 460.00 465.00 448.50 454.45
2004-01-07 451.40 454.70 438.10 446.45
2004-01-08 449.00 466.95 449.00 464.75
-
from statsmodels.tsa.arima_model import ARIMA
model = ARIMA(df, order=(5,1,0))
OUTPUT -
Traceback (most recent call last):
File "<ipython-input-90-799de8e60d6f>", line 1, in <module>
model = ARIMA(df, order=(5,1,0))
File "D:\A\lib\site-packages\statsmodels\tsa\arima_model.py", line 1000, in __new__
mod.__init__(endog, order, exog, dates, freq, missing)
File "D:\A\lib\site-packages\statsmodels\tsa\arima_model.py", line 1024, in __init__
self.data.ynames = 'D.' + self.endog_names
TypeError: must be str, not list
So i converted the index column containing dates to proper column
by -
df = df.reset_index()
df.head(10)
Out[92]:
Date Open High Low Close
0 2003-12-26 435.80 440.50 431.65 438.60
1 2003-12-29 441.00 449.70 441.00 447.80
2 2003-12-30 450.00 451.90 430.10 442.40
3 2003-12-31 446.00 459.30 443.55 452.05
4 2004-01-01 453.25 457.90 451.50 454.45
5 2004-01-02 458.00 460.35 454.05 456.40
6 2004-01-05 458.00 465.00 450.60 454.85
7 2004-01-06 460.00 465.00 448.50 454.45
8 2004-01-07 451.40 454.70 438.10 446.45
9 2004-01-08 449.00 466.95 449.00 464.75
then when i run this line -
from statsmodels.tsa.arima_model import ARIMA
model = ARIMA(df, order=(5,1,0))
OUTPUT -
Traceback (most recent call last):
File "<ipython-input-94-799de8e60d6f>", line 1, in <module>
model = ARIMA(df, order=(5,1,0))
File "D:\A\lib\site-packages\statsmodels\tsa\arima_model.py", line 1000, in __new__
mod.__init__(endog, order, exog, dates, freq, missing)
File "D:\A\lib\site-packages\statsmodels\tsa\arima_model.py", line 1015, in __init__
super(ARIMA, self).__init__(endog, (p, q), exog, dates, freq, missing)
File "D:\A\lib\site-packages\statsmodels\tsa\arima_model.py", line 452, in __init__
super(ARMA, self).__init__(endog, exog, dates, freq, missing=missing)
File "D:\A\lib\site-packages\statsmodels\tsa\base\tsa_model.py", line 43, in __init__
super(TimeSeriesModel, self).__init__(endog, exog, missing=missing)
File "D:\A\lib\site-packages\statsmodels\base\model.py", line 212, in __init__
super(LikelihoodModel, self).__init__(endog, exog, **kwargs)
File "D:\A\lib\site-packages\statsmodels\base\model.py", line 63, in __init__
**kwargs)
File "D:\A\lib\site-packages\statsmodels\base\model.py", line 88, in _handle_data
data = handle_data(endog, exog, missing, hasconst, **kwargs)
File "D:\A\lib\site-packages\statsmodels\base\data.py", line 630, in handle_data
**kwargs)
File "D:\A\lib\site-packages\statsmodels\base\data.py", line 76, in __init__
self.endog, self.exog = self._convert_endog_exog(endog, exog)
File "D:\A\lib\site-packages\statsmodels\base\data.py", line 471, in _convert_endog_exog
raise ValueError("Pandas data cast to numpy dtype of object. "
ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).
HELP?
ARIMA is expected a array-like object, if we instead of using a 2D array(dataframe) and use a 1D array(Series) and this will work.
Try:
ARIMA(df['Close'].values, order=(5,1,0))
where df has a datetime in index and you select one column:
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10 entries, 2003-12-26 to 2004-01-08
Data columns (total 4 columns):
Open 10 non-null float64
High 10 non-null float64
Low 10 non-null float64
Close 10 non-null float64
dtypes: float64(4)
memory usage: 400.0 bytes
Related
This question already has an answer here:
Why do I get a KeyError when using pandas apply?
(1 answer)
Closed 13 days ago.
I was looking at this answer by Roman Pekar for using apply. I initially copied the code exactly and it worked fine. Then I used it on my df3 that is created from a csv file and I got a KeyError. I checked datatypes the columns I was using are int64, so that is okay. I don't have nulls. If I can get this working then I will make the function more complex. How do I get this working?
def fxy(x, y):
return x * y
df3 = pd.read_csv(path + 'test_data.csv', usecols=[0,1,2])
print(df3.dtypes)
df3['Area'] = df3.apply(lambda x: fxy(x['Len'], x['Width']))
Trace back
Traceback (most recent call last):
File "f:\...\my_file.py", line 54, in <module>
df3['Area'] = df3.apply(lambda x: fxy(x['Len'], x['Width']))
File "C:\...\frame.py", line 8833, in apply
return op.apply().__finalize__(self, method="apply")
File "C:\...\apply.py", line 727, in apply
return self.apply_standard()
File "C:\...\apply.py", line 851, in apply_standard
results, res_index = self.apply_series_generator()
File "C:\...\apply.py", line 867, in apply_series_generator
results[i] = self.f(v)
File "f:\...\my_file.py", line 54, in <lambda>
df3['Area'] = df3.apply(lambda x: fxy(x['Len'], x['Width']))
File "C:\...\series.py", line 958, in __getitem__
return self._get_value(key)
File "C:\...\series.py", line 1069, in _get_value
loc = self.index.get_loc(label)
File "C:\...\range.py", line 389, in get_loc
raise KeyError(key)
KeyError: 'Len'
I don't see a way to attach the csv file. Below is Sample df3 if I save the below with excel as "CSV (Comma delimited)(*.csv) I get the same results.
ID
Len
Width
A
170
4
B
362
5
C
12
15
D
42
7
E
15
3
F
46
49
G
71
74
I think you miss the axis=1 on apply:
df3['Area'] = df3.apply(lambda x: fxy(x['Len'], x['Width']), axis=1)
But in your case, you can just do:
df3['Area'] = df3['Len'] * df3['Width']
print(df3)
# Output
ID Len Width Area
0 A 170 4 680
1 B 362 5 1810
2 C 12 15 180
3 D 42 7 294
4 E 15 3 45
5 F 46 49 2254
6 G 71 74 5254
I have a dataframe of the type:
date TICKER x1 x2 ... Z Y month x3
0 1999-12-31 A UN Equity 52.1330 51.9645 ... 0.0052 NaN 12 NaN
1 1999-12-31 AA UN Equity 92.9415 92.8715 ... 0.0052 NaN 12 NaN
2 1999-12-31 ABC UN Equity 3.6843 3.6539 ... 0.0052 NaN 12 NaN
3 1999-12-31 ABF UN Equity 22.0625 21.9375 ... 0.0052 NaN 12 NaN
4 1999-12-31 ABM UN Equity 10.2188 10.1250 ... 0.0052 NaN 12 NaN
I would like to run an OLS regression from the formula 'Y ~ x1 + x2:x3' by the group ['TICKER','year','month'] (year is a column which does not appear here) from statsmodels.formula.api as smf. I therefore use:
data.groupby(['TICKER','year','month']).apply(lambda x: smf.ols(formula='Y ~ x1 + x2:x3', data=x))
However, I get the following error:
IndexError: tuple index out of range
Any idea why?
The full tracebakc is
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\pandas\core\groupby\groupby.py", line 894, in apply
result = self._python_apply_general(f, self._selected_obj)
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\pandas\core\groupby\groupby.py", line 928, in _python_apply_general
keys, values, mutated = self.grouper.apply(f, data, self.axis)
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\pandas\core\groupby\ops.py", line 238, in apply
res = f(group)
File "<input>", line 1, in <lambda>
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\model.py", line 195, in from_formula
mod = cls(endog, exog, *args, **kwargs)
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\regression\linear_model.py", line 872, in __init__
super(OLS, self).__init__(endog, exog, missing=missing,
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\regression\linear_model.py", line 703, in __init__
super(WLS, self).__init__(endog, exog, missing=missing,
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\regression\linear_model.py", line 190, in __init__
super(RegressionModel, self).__init__(endog, exog, **kwargs)
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\model.py", line 237, in __init__
super(LikelihoodModel, self).__init__(endog, exog, **kwargs)
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\model.py", line 77, in __init__
self.data = self._handle_data(endog, exog, missing, hasconst,
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\model.py", line 101, in _handle_data
data = handle_data(endog, exog, missing, hasconst, **kwargs)
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\data.py", line 672, in handle_data
return klass(endog, exog=exog, missing=missing, hasconst=hasconst,
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\data.py", line 71, in __init__
arrays, nan_idx = self.handle_missing(endog, exog, missing,
File "C:\Users\xxxx\PycharmProjects\non_parametric\venv\lib\site-packages\statsmodels\base\data.py", line 247, in handle_missing
if combined_nans.shape[0] != nan_mask.shape[0]:
IndexError: tuple index out of range
I see that your Y columns has a lot of NaNs, so you need to ensure that the subgroup has enough observations, so that the regression can work.
So if I use an example data:
import statsmodels.formula.api as smf
np.random.seed(123)
data = pd.concat([
pd.DataFrame({'TICKER':np.random.choice(['A','B','C'],30),
'year':np.random.choice([2000,2001],30),
'month':np.random.choice([1,2],30)}),
pd.DataFrame(np.random.normal(0,1,(30,4)),columns=['Y','x1','x2','x3'])
],axis=1)
data.loc[:6,'Y'] = np.nan
If I run your code on the data frame above, I get the same error.
So if we use only complete data (relevant for your regression):
complete_ix = data[['Y','x1','x2','x3']].dropna().index
data.loc[complete_ix].groupby(['TICKER','year','month']).apply(lambda x: smf.ols(formula='Y ~ x1 + x2:x3', data=x))
It works:
TICKER year month
A 2000 2 <statsmodels.regression.linear_model.OLS objec...
2001 1 <statsmodels.regression.linear_model.OLS objec...
2 <statsmodels.regression.linear_model.OLS objec...
B 2000 1 <statsmodels.regression.linear_model.OLS objec...
2 <statsmodels.regression.linear_model.OLS objec...
2001 1 <statsmodels.regression.linear_model.OLS objec...
C 2000 1 <statsmodels.regression.linear_model.OLS objec...
2 <statsmodels.regression.linear_model.OLS objec...
I have label data that same of the values are np.nan.
I want to convert the data to OneHot vector using LabelBinarizer, and the np.nan will convert to zero-array.
But I get an error. I success to convert the data with get_dummies from pandas model.
I can't use the get_dummies function because the train and the test data coming with different files and different time. I want to use sklearn model, for save it, and us the model latter.
Code for example:
In[11]: df = pd.DataFrame({'CITY':['London','NYC','Manchester',np.nan],'Country':['UK','US','UK','AUS']})
In[12]: df
Out[12]:
CITY Country
0 London UK
1 NYC US
2 Manchester UK
3 NaN AUS
In[13]: pd.get_dummies(df['CITY'])
Out[13]:
London Manchester NYC
0 1 0 0
1 0 0 1
2 0 1 0
3 0 0 0
In[14]: from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
In[15]: lb.fit_transform(df['CITY'])
Traceback (most recent call last):
File "/home/oshrib/.conda/envs/on_target/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-16-d0afb38b2695>", line 1, in <module>
lb.fit_transform(df['CITY'])
File "/home/oshrib/.conda/envs/on_target/lib/python3.5/site-packages/sklearn/preprocessing/label.py", line 307, in fit_transform
return self.fit(y).transform(y)
File "/home/oshrib/.conda/envs/on_target/lib/python3.5/site-packages/sklearn/preprocessing/label.py", line 276, in fit
self.y_type_ = type_of_target(y)
File "/home/oshrib/.conda/envs/on_target/lib/python3.5/site-packages/sklearn/utils/multiclass.py", line 288, in type_of_target
if (len(np.unique(y)) > 2) or (y.ndim >= 2 and len(y[0]) > 1):
File "/home/oshrib/.conda/envs/on_target/lib/python3.5/site-packages/numpy/lib/arraysetops.py", line 223, in unique
return _unique1d(ar, return_index, return_inverse, return_counts)
File "/home/oshrib/.conda/envs/on_target/lib/python3.5/site-packages/numpy/lib/arraysetops.py", line 283, in _unique1d
ar.sort()
TypeError: unorderable types: float() < str()
I have a hdf5 file that contains a table where the column time is in datetime64[ns] format.
I want to get all the rows that are older than thresh. How can I do that? This is what I've tried:
thresh = pd.datetime.strptime('2018-03-08 14:19:41','%Y-%m-%d %H:%M:%S').timestamp()
hdf = pd.read_hdf(STORE, 'gh1', where = 'time>thresh' )
I get the following error:
Traceback (most recent call last):
File "<ipython-input-80-fa444735d0a9>", line 1, in <module>
runfile('/home/joao/github/control_panel/controlpanel/controlpanel/reading_test.py', wdir='/home/joao/github/control_panel/controlpanel/controlpanel')
File "/home/joao/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 705, in runfile
execfile(filename, namespace)
File "/home/joao/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/home/joao/github/control_panel/controlpanel/controlpanel/reading_test.py", line 15, in <module>
hdf = pd.read_hdf(STORE, 'gh1', where = 'time>thresh' )
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 370, in read_hdf
return store.select(key, auto_close=auto_close, **kwargs)
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 717, in select
return it.get_result()
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 1457, in get_result
results = self.func(self.start, self.stop, where)
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 710, in func
columns=columns, **kwargs)
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 4141, in read
if not self.read_axes(where=where, **kwargs):
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 3340, in read_axes
self.selection = Selection(self, where=where, **kwargs)
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 4706, in __init__
self.condition, self.filter = self.terms.evaluate()
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 556, in evaluate
self.condition = self.terms.prune(ConditionBinOp)
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 118, in prune
res = pr(left.value, right.value)
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 113, in pr
encoding=self.encoding).evaluate()
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 327, in evaluate
values = [self.convert_value(v) for v in rhs]
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 327, in <listcomp>
values = [self.convert_value(v) for v in rhs]
File "/home/joao/anaconda3/lib/python3.6/site-packages/pandas/core/computation/pytables.py", line 185, in convert_value
v = pd.Timestamp(v)
File "pandas/_libs/tslib.pyx", line 390, in pandas._libs.tslib.Timestamp.__new__
File "pandas/_libs/tslib.pyx", line 1549, in pandas._libs.tslib.convert_to_tsobject
File "pandas/_libs/tslib.pyx", line 1735, in pandas._libs.tslib.convert_str_to_tsobject
ValueError: could not convert string to Timestamp
Demo:
creating sample DF (100.000 rows):
In [9]: N = 10**5
In [10]: dates = pd.date_range('1980-01-01', freq='99T', periods=N)
In [11]: df = pd.DataFrame({'date':dates, 'val':np.random.rand(N)})
In [12]: df
Out[12]:
date val
0 1980-01-01 00:00:00 0.985215
1 1980-01-01 01:39:00 0.452295
2 1980-01-01 03:18:00 0.780096
3 1980-01-01 04:57:00 0.004596
4 1980-01-01 06:36:00 0.515051
... ... ...
99995 1998-10-27 15:45:00 0.509954
99996 1998-10-27 17:24:00 0.046636
99997 1998-10-27 19:03:00 0.026678
99998 1998-10-27 20:42:00 0.660652
99999 1998-10-27 22:21:00 0.839426
[100000 rows x 2 columns]
writing it to HDF5 file (index date column):
In [13]: df.to_hdf('d:/temp/test.h5', 'test', format='t', data_columns=['date'])
read HDF5 conditionally by index:
In [14]: x = pd.read_hdf('d:/temp/test.h5', 'test', where="date > '1998-10-27 15:00:00'")
In [15]: x
Out[15]:
date val
99995 1998-10-27 15:45:00 0.509954
99996 1998-10-27 17:24:00 0.046636
99997 1998-10-27 19:03:00 0.026678
99998 1998-10-27 20:42:00 0.660652
99999 1998-10-27 22:21:00 0.839426
I try to convert my data from a big netCDF file to an ascii format file for a special point (68,21). When I tried to run the following:
from pylab import *
from netCDF4 import Dataset
import pandas as pd
nc = Dataset("/home/python/PBLH_Exp_08_Jul_2006.nc")
PBLH = nc.variables['PBLH'][:,:,:]
Times = nc.variables['Times'][:,:]
d={}
d['Times'] = Times[:,0]
d['PBLH'] = PBLH[:,:,1]
df=pd.DataFrame(d)
df.to_csv('Produkt/PBLH_Exp_08_Jul_2006.csv')
I got the error message:
Traceback (most recent call last):
File "/home/python/wrf_map.py", line 62, in <module>
df=pd.DataFrame(d)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 226, in __init__
mgr = self._init_dict(data, index, columns, dtype=dtype)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 363, in _init_dict
dtype=dtype)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 5163, in _arrays_to_mgr
arrays = _homogenize(arrays, index, dtype)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 5477, in _homogenize
raise_cast_failure=False)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/series.py", line 2885, in _sanitize_array
raise Exception('Data must be 1-dimensional')
Exception: Data must be 1-dimensional
What can I do to solve this issue? And how can I extract the data for my grid point? By the way, here is a part of the header of my netCDF file:
<xarray.Dataset>
Dimensions: (Time: 744, south_north: 140, west_east: 140)
Coordinates:
* Time (Time) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ...
* south_north (south_north) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ...
* west_east (west_east) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
Data variables:
Times (Time) |S19 '2006-07-01_01:00:00' '2006-07-01_02:00:00' ...
PBLH (Time, south_north, west_east) float32 59.9834 59.8574 ...
Thanks for any help!!
I solved a part of my problem. I used xarray to get the data I wanted (for the position with grid points 21,68):
import numpy as np
import xray as xr
from pylab import *
data = xr.open_dataset("/home/python/PBLH_Exp_08_jul_2006.nc")
d = xr.DataArray(data.variables['PBLH'])
print(d[:,21,68])
But I still can't save my data in a ASCII file...
EDIT: Got it! To save my data I used import csv. Then I wrote:
df = d[:,21,68]
with open ('/home/python/output.txt','w') as fout:
writer = csv.writer(fout)
writer.writerows(df)
It looks not so beautiful but I can handle it! So this should work for everybody with a similar issue! :)