How to apply If Else statements in Pandas Dataframe? - python

I have a Pandas DataFrame and I am trying to add a new column where the value in the new column depends on some conditions in the existing DataFrame. The DataFrame I have is as follows:
Date / Time          Open     High     Low      Close    Volume
2020-06-02 16:30:00  1.25566  1.25696  1.25439  1.25634  2720
2020-06-02 17:00:00  1.25638  1.25683  1.25532  1.25614  2800
2020-06-02 17:30:00  1.25615  1.25699  1.25520  1.25565  2827
2020-06-02 18:00:00  1.25565  1.25598  1.25334  1.25341  2993
2020-06-02 18:30:00  1.25341  1.25385  1.25272  1.25287  1899
2020-07-03 07:00:00  1.24651  1.24673  1.24596  1.24603  600
2020-07-03 07:30:00  1.24601  1.24641  1.24568  1.24594  487
2020-07-03 08:00:00  1.24593  1.24618  1.24580  1.24612  455
2020-07-03 08:30:00  1.24612  1.24667  1.24603  1.24666  552
2020-07-03 09:00:00  1.24666  1.24785  1.24623  1.24765  922
I would like to add a new column called 'Signal', whose value depends on the following if/else conditions, which I have written into a function:
def BullEngulf(df):
    if (df['Open'] <= df['Close'].shift(1)) and (df['Close'] > df['Open'].shift(1)) and (df['Close'].shift(1) < df['Open'].shift(1)):
        return 'Open'
    else:
        return '0'
I have then tried to apply the function as follows:
df['Signal'] = df.apply(BullEngulf)
df
When I run the code, the following error message occurs:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-62-f87af322a959> in <module>
5 return '0'
6
----> 7 df['Signal'] = df.apply(BullEngulf)
8 df
~\anaconda3\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
6876 kwds=kwds,
6877 )
-> 6878 return op.get_result()
6879
6880 def applymap(self, func) -> "DataFrame":
~\anaconda3\lib\site-packages\pandas\core\apply.py in get_result(self)
184 return self.apply_raw()
185
--> 186 return self.apply_standard()
187
188 def apply_empty_result(self):
~\anaconda3\lib\site-packages\pandas\core\apply.py in apply_standard(self)
293
294 try:
--> 295 result = libreduction.compute_reduction(
296 values, self.f, axis=self.axis, dummy=dummy, labels=labels
297 )
pandas\_libs\reduction.pyx in pandas._libs.reduction.compute_reduction()
pandas\_libs\reduction.pyx in pandas._libs.reduction.Reducer.get_result()
<ipython-input-62-f87af322a959> in BullEngulf(df)
1 def BullEngulf(df):
----> 2 if (df['Open'] <= df['Close'].shift(1)) and (df['Close'] > df['Open'].shift(1)) and (df['Close'].shift(1) < df['Open'].shift(1)):
3 return 'Open'
4 else:
5 return '0'
~\anaconda3\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
869 key = com.apply_if_callable(key, self)
870 try:
--> 871 result = self.index.get_value(self, key)
872
873 if not is_scalar(result):
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_value(self, series, key)
4403 k = self._convert_scalar_indexer(k, kind="getitem")
4404 try:
-> 4405 return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
4406 except KeyError as e1:
4407 if len(self) > 0 and (self.holds_integer() or self.is_boolean()):
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index_class_helper.pxi in pandas._libs.index.Int64Engine._check_type()
KeyError: 'Open'
Please could somebody explain what is wrong with the code above?
Thanks

`df.apply(BullEngulf)` passes each *column* of the frame to the function one at a time (the default is `axis=0`), so inside `BullEngulf` the expression `df['Open']` tries to look up the label `'Open'` within a single column of prices, hence the `KeyError`. Even with `axis=1`, `and` is ambiguous for Series; use the element-wise `&`. The whole thing can be vectorized with `np.where`:
df['Signal'] = np.where((df['Open'] <= df['Close'].shift(1)) &
                        (df['Close'] > df['Open'].shift(1)) &
                        (df['Close'].shift(1) < df['Open'].shift(1)),
                        'Open', '0')
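As a quick check, here is a minimal, self-contained sketch of the same vectorized pattern on toy prices (the numbers are made up, not the asker's data):

```python
import numpy as np
import pandas as pd

# Toy OHLC data, three bars.
df = pd.DataFrame({
    "Open":  [1.0, 0.9, 1.1],
    "Close": [0.95, 1.2, 1.0],
})

# Bullish engulfing: opens at or below the prior close, closes above the
# prior open, and the prior bar was bearish (close below open).
cond = (
    (df["Open"] <= df["Close"].shift(1))
    & (df["Close"] > df["Open"].shift(1))
    & (df["Close"].shift(1) < df["Open"].shift(1))
)
df["Signal"] = np.where(cond, "Open", "0")
print(df["Signal"].tolist())  # ['0', 'Open', '0']
```

The first row compares against `NaN` (no prior bar), and `NaN` comparisons are False, so it falls through to `'0'` automatically.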

Related

Python Pandas Style to every nth row

I'm working on a Python project with Pandas and looking to apply a style to every Nth row. I've been able to select every Nth row using iloc, but I cannot get the style to work with a basic function. Here's my example in context:
data = [[1,2,3],[2,3,4],[3,4,5],[4,5,6]]
df = pd.DataFrame(data)
df
0 1 2
0 1 2 3
1 2 3 4
2 3 4 5
3 4 5 6
df.iloc[1::2, :]
0 1 2
1 2 3 4
3 4 5 6
At this point everything returns as normal, but when applying the function below, I receive a "Too many indexers" error, which I can't seem to resolve:
def highlight_everyother(s):
    if s.iloc[1::2, :]:
        return ['background-color: yellow']*3

df.style.apply(highlight_everyother, axis=1)
ERROR:
IndexingError Traceback (most recent call last)
~\Anaconda3\lib\site-packages\IPython\core\formatters.py in __call__(self, obj)
343 method = get_real_method(obj, self.print_method)
344 if method is not None:
--> 345 return method()
346 return None
347 else:
~\Anaconda3\lib\site-packages\pandas\io\formats\style.py in _repr_html_(self)
180 Hooks into Jupyter notebook rich display system.
181 """
--> 182 return self.render()
183
184 #Appender(
~\Anaconda3\lib\site-packages\pandas\io\formats\style.py in render(self, **kwargs)
535 * table_attributes
536 """
--> 537 self._compute()
538 # TODO: namespace all the pandas keys
539 d = self._translate()
~\Anaconda3\lib\site-packages\pandas\io\formats\style.py in _compute(self)
610 r = self
611 for func, args, kwargs in self._todo:
--> 612 r = func(self)(*args, **kwargs)
613 return r
614
~\Anaconda3\lib\site-packages\pandas\io\formats\style.py in _apply(self, func, axis, subset, **kwargs)
618 data = self.data.loc[subset]
619 if axis is not None:
--> 620 result = data.apply(func, axis=axis, result_type="expand", **kwargs)
621 result.columns = data.columns
622 else:
~\Anaconda3\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
6876 kwds=kwds,
6877 )
-> 6878 return op.get_result()
6879
6880 def applymap(self, func) -> "DataFrame":
~\Anaconda3\lib\site-packages\pandas\core\apply.py in get_result(self)
184 return self.apply_raw()
185
--> 186 return self.apply_standard()
187
188 def apply_empty_result(self):
~\Anaconda3\lib\site-packages\pandas\core\apply.py in apply_standard(self)
311
312 # compute the result using the series generator
--> 313 results, res_index = self.apply_series_generator()
314
315 # wrap results
~\Anaconda3\lib\site-packages\pandas\core\apply.py in apply_series_generator(self)
339 else:
340 for i, v in enumerate(series_gen):
--> 341 results[i] = self.f(v)
342 keys.append(v.name)
343
<ipython-input-49-a5b996f8d6c8> in highlight_everyother(s)
11
12 def highlight_everyother(s):
---> 13 if s.iloc[1::2, :]:
14 return ['background-color: yellow']*3
15
~\Anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
1760 except (KeyError, IndexError, AttributeError):
1761 pass
-> 1762 return self._getitem_tuple(key)
1763 else:
1764 # we by definition only have the 0th axis
~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
2065 def _getitem_tuple(self, tup: Tuple):
2066
-> 2067 self._has_valid_tuple(tup)
2068 try:
2069 return self._getitem_lowerdim(tup)
~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _has_valid_tuple(self, key)
699 for i, k in enumerate(key):
700 if i >= self.ndim:
--> 701 raise IndexingError("Too many indexers")
702 try:
703 self._validate_key(k, i)
IndexingError: Too many indexers
Any help would be appreciated. Thank you.
I would apply on axis=0 in case df is not indexed by a RangeIndex:
def highlight_everyother(s):
    return ['background-color: yellow; color:blue' if x % 2 == 1 else ''
            for x in range(len(s))]

df.style.apply(highlight_everyother)
Output:
You are passing one row at a time to highlight_everyother. That's why you were getting the error. The below should work.
def highlight_everyother(s):
    if s.name % 2 == 1:
        return ['background-color: yellow']*3
    else:
        return ['background-color: white']*3

df.style.apply(highlight_everyother, axis=1)
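For reference, a minimal runnable sketch of the row-wise idea: with `axis=1` each call receives one row as a Series, and `s.name` is that row's index label. This assumes the default RangeIndex, as in the question's toy frame:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]])

# Each call receives one row; s.name is the row's index label,
# so odd-labelled rows get the highlight.
def highlight_everyother(s):
    if s.name % 2 == 1:
        return ["background-color: yellow"] * len(s)
    return [""] * len(s)

# Apply to the whole frame (rendering the Styler needs jinja2 installed):
# styled = df.style.apply(highlight_everyother, axis=1)

print(highlight_everyother(df.iloc[1]))  # ['background-color: yellow'] * 3
```

Returning a list of the same length as the row for every branch is what keeps the Styler from raising shape errors.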

Filtering a dataframe with datetime index using .loc (rows & columns) [duplicate]

This question already has answers here:
How to slice a Pandas Dataframe based on datetime index
(3 answers)
Closed 3 years ago.
Trying to slice both rows and columns of a dataframe using the .loc method, but I am having trouble slicing the rows of the df (it has a datetime index).
The dataframe I am working with has 537 rows and 10 columns. The first date is 2018-01-01, but I want to slice it so that it only shows dates for 2019.
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 536 entries, 2018-01-01 00:00:00 to 2019-06-20 00:00:00
Data columns (total 10 columns):
link_clicks 536 non-null int64
customer_count 536 non-null int64
transaction_count 536 non-null int64
customers_per_click 536 non-null float64
transactions_per_click 536 non-null float64
14_day_ma 523 non-null float64
14_day_std 523 non-null float64
Upper14 523 non-null float64
Lower14 523 non-null float64
lower_flag 536 non-null bool
dtypes: bool(1), float64(6), int64(3)
memory usage: 42.4+ KB
df.loc['2019-01-01':'2019-06-01', ['customers_per_click', '14_day_ma', 'Upper14', 'Lower14']]
Expected result is to return a filtered data frame within that date range. However when I execute that line of code it gives me the following error:
(clearly it is an issue with the index, but I am just not sure what the proper syntax is and am having trouble finding a solution online.)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_slice_bound(self, label, side, kind)
4804 try:
-> 4805 return self._searchsorted_monotonic(label, side)
4806 except ValueError:
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in _searchsorted_monotonic(self, label, side)
4764
-> 4765 raise ValueError('index must be monotonic increasing or decreasing')
4766
ValueError: index must be monotonic increasing or decreasing
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-599-5bdb485482ff> in <module>
----> 1 merge2.loc['2019-11-01':'2019-02-01', ['customers_per_click', '14_day_ma', 'Upper14', 'Lower14']].plot(figsize=(15,5))
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
1492 except (KeyError, IndexError, AttributeError):
1493 pass
-> 1494 return self._getitem_tuple(key)
1495 else:
1496 # we by definition only have the 0th axis
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
886 continue
887
--> 888 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
889
890 return retval
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
1865 if isinstance(key, slice):
1866 self._validate_key(key, axis)
-> 1867 return self._get_slice_axis(key, axis=axis)
1868 elif com.is_bool_indexer(key):
1869 return self._getbool_axis(key, axis=axis)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexing.py in _get_slice_axis(self, slice_obj, axis)
1531 labels = obj._get_axis(axis)
1532 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop,
-> 1533 slice_obj.step, kind=self.name)
1534
1535 if isinstance(indexer, slice):
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in slice_indexer(self, start, end, step, kind)
4671 """
4672 start_slice, end_slice = self.slice_locs(start, end, step=step,
-> 4673 kind=kind)
4674
4675 # return a slice
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in slice_locs(self, start, end, step, kind)
4870 start_slice = None
4871 if start is not None:
-> 4872 start_slice = self.get_slice_bound(start, 'left', kind)
4873 if start_slice is None:
4874 start_slice = 0
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_slice_bound(self, label, side, kind)
4806 except ValueError:
4807 # raise the original KeyError
-> 4808 raise err
4809
4810 if isinstance(slc, np.ndarray):
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_slice_bound(self, label, side, kind)
4800 # we need to look up the label
4801 try:
-> 4802 slc = self._get_loc_only_exact_matches(label)
4803 except KeyError as err:
4804 try:
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in _get_loc_only_exact_matches(self, key)
4770 get_slice_bound.
4771 """
-> 4772 return self.get_loc(key)
4773
4774 def get_slice_bound(self, label, side, kind):
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2657 return self._engine.get_loc(key)
2658 except KeyError:
-> 2659 return self._engine.get_loc(self._maybe_cast_indexer(key))
2660 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
2661 if indexer.ndim > 1 or indexer.size > 1:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: '2019-11-01'
If your index is of type "datetime", try:
from datetime import datetime

df.loc[(df.index >= datetime(2019, 1, 1)) & (df.index <= datetime(2019, 6, 1)),
       ['customers_per_click', '14_day_ma', 'Upper14', 'Lower14']]
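Note that `df.info()` above reports a plain `Index`, not a `DatetimeIndex`, and the traceback complains the index is not monotonic, so converting the index with `pd.to_datetime` and sorting it may be the actual fix. A minimal sketch on synthetic data (column name borrowed from the question):

```python
import pandas as pd

# Synthetic daily data spanning the asker's date range.
idx = pd.date_range("2018-01-01", "2019-06-20", freq="D")
df = pd.DataFrame({"customers_per_click": range(len(idx))}, index=idx)

# String slicing with .loc requires a sorted (monotonic) DatetimeIndex;
# if the index is object dtype, convert first: df.index = pd.to_datetime(df.index)
sub = df.sort_index().loc["2019-01-01":"2019-06-01", ["customers_per_click"]]
print(sub.index.min(), sub.index.max())
```

Label-based slicing is inclusive on both ends, so this returns the 152 days from 2019-01-01 through 2019-06-01.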
Without all the details I propose the following code:
index = pd.date_range('1/1/2018', periods=1100)
ts = pd.Series(np.random.normal(0.5, 2, 1100), index)
grouped = ts.groupby(lambda x: x.year)
grouped.size()
2018 365
2019 365
2020 366
2021 4
dtype: int64
You can select a year (a group) using:
grouped.get_group(2019)
len(grouped.get_group(2019))
365
Do you need something more specific?

Impute missing values using apply and lambda functions

I am trying to impute the missing values in the "Item_Weight" variable by taking the average of the variable according to the different "Item_Types", as per the code below. But when I run it, I get the KeyError shown below. Is it the pandas version that does not allow this, or is something wrong with the code?
Item_Weight_Average = train.dropna(subset=['Item_Weight']).pivot_table(values='Item_Weight', index='Item_Type')
missing = train['Item_Weight'].isnull()
train.loc[missing, 'Item_Weight'] = train.loc[missing, 'Item_Type'].apply(lambda x: Item_Weight_Average[x])
KeyError Traceback (most recent call last)
C:\Users\m1013523\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2441 try:
-> 2442 return self._engine.get_loc(key)
2443 except KeyError:
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5280)()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5126)()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:20523)()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:20477)()
KeyError: 'Snack Foods'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-25-c9971d0bdaf7> in <module>()
1 Item_Weight_Average = train.dropna(subset=['Item_Weight']).pivot_table(values='Item_Weight',index='Item_Type')
2 missing = train['Item_Weight'].isnull()
----> 3 train.loc[missing,'Item_Weight'] = train.loc[missing,'Item_Type'].apply(lambda x: Item_Weight_Average[x])
C:\Users\m1013523\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
2353 else:
2354 values = self.asobject
-> 2355 mapped = lib.map_infer(values, f, convert=convert_dtype)
2356
2357 if len(mapped) and isinstance(mapped[0], Series):
pandas\_libs\src\inference.pyx in pandas._libs.lib.map_infer (pandas\_libs\lib.c:66645)()
<ipython-input-25-c9971d0bdaf7> in <lambda>(x)
1 Item_Weight_Average = train.dropna(subset=['Item_Weight']).pivot_table(values='Item_Weight',index='Item_Type')
2 missing = train['Item_Weight'].isnull()
----> 3 train.loc[missing,'Item_Weight'] = train.loc[missing,'Item_Type'].apply(lambda x: Item_Weight_Average[x])
C:\Users\m1013523\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
1962 return self._getitem_multilevel(key)
1963 else:
-> 1964 return self._getitem_column(key)
1965
1966 def _getitem_column(self, key):
C:\Users\m1013523\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\frame.py in _getitem_column(self, key)
1969 # get column
1970 if self.columns.is_unique:
-> 1971 return self._get_item_cache(key)
1972
1973 # duplicate columns & possible reduce dimensionality
C:\Users\m1013523\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item)
1643 res = cache.get(item)
1644 if res is None:
-> 1645 values = self._data.get(item)
1646 res = self._box_item_values(item, values)
1647 cache[item] = res
C:\Users\m1013523\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py in get(self, item, fastpath)
3588
3589 if not isnull(item):
-> 3590 loc = self.items.get_loc(item)
3591 else:
3592 indexer = np.arange(len(self.items))[isnull(self.items)]
C:\Users\m1013523\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2442 return self._engine.get_loc(key)
2443 except KeyError:
-> 2444 return self._engine.get_loc(self._maybe_cast_indexer(key))
2445
2446 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5280)()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5126)()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:20523)()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:20477)()
KeyError: 'Snack Foods'
Any ideas or workarounds for this one?
If I understand what you're trying to do, then there's an easier way to solve your problem. Instead of making a new series of averages, you can calculate the average item_weight by item_type using groupby, transform, and np.mean(), and fill in the missing spots in item_weight using fillna().
# Setting up some toy data
import pandas as pd
import numpy as np

df = pd.DataFrame({'item_type': [1, 1, 1, 2, 2, 2],
                   'item_weight': [2, 4, np.nan, 10, np.nan, np.nan]})

# The solution
df.item_weight.fillna(df.groupby('item_type').item_weight.transform(np.mean), inplace=True)
The result:
item_type item_weight
0 1 2.0
1 1 4.0
2 1 3.0
3 2 10.0
4 2 10.0
5 2 10.0
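For completeness: the asker's original `KeyError` happens because `pivot_table` returns a DataFrame, so `Item_Weight_Average[x]` looks `x` up in the *columns* rather than the rows. A sketch of the pivot-table route with that fixed, on toy data with hypothetical values:

```python
import numpy as np
import pandas as pd

# Toy data mimicking the question's column names.
train = pd.DataFrame({
    "Item_Type": ["Snack Foods", "Snack Foods", "Dairy", "Dairy"],
    "Item_Weight": [2.0, np.nan, 10.0, np.nan],
})

# pivot_table returns a DataFrame; select its 'Item_Weight' column to get
# a Series keyed by Item_Type, then map missing rows through it.
avg = train.dropna(subset=["Item_Weight"]).pivot_table(values="Item_Weight", index="Item_Type")
missing = train["Item_Weight"].isnull()
train.loc[missing, "Item_Weight"] = train.loc[missing, "Item_Type"].map(avg["Item_Weight"])
print(train["Item_Weight"].tolist())  # [2.0, 2.0, 10.0, 10.0]
```

`Series.map` with a Series argument does the per-type lookup without a lambda, which also avoids the exception-per-row cost of `apply`.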

Basic json and pandas DataFrame build

I am very new to Python and learning my way up. My task is to crawl data from the web and fill an xlsx file using json and pandas (among others). I have been researching examples of converting a JSON dict to a pandas DataFrame, and I can't seem to find the one I need.
I'm guessing this is very basic, but help me out.
so below is my code
js ='{"startDate":"2017-01-01","endDate":"2017-10-31","timeUnit":"month","results":
[{"title":"fruit","keywords":["apple","banana"],"data":
[{"period":"2017-01-01","ratio":19.35608},
{"period":"2017-02-01","ratio":17.33902},
{"period":"2017-03-01","ratio":22.30411},
{"period":"2017-04-01","ratio":20.94646},
{"period":"2017-05-01","ratio":23.8557},
{"period":"2017-06-01","ratio":22.38169},
{"period":"2017-07-01","ratio":27.38557},
{"period":"2017-08-01","ratio":19.16214},
{"period":"2017-09-01","ratio":32.07913},
{"period":"2017-10-01","ratio":41.89293}]},
{"title":"veg","keywords":["carrot","onion"],"data":
[{"period":"2017-01-01","ratio":100.0},
{"period":"2017-02-01","ratio":80.41117},
{"period":"2017-03-01","ratio":89.29402},
{"period":"2017-04-01","ratio":74.32118},
{"period":"2017-05-01","ratio":69.82156},
{"period":"2017-06-01","ratio":66.52444},
{"period":"2017-07-01","ratio":67.84328},
{"period":"2017-08-01","ratio":74.43754},
{"period":"2017-09-01","ratio":65.82621},
{"period":"2017-10-01","ratio":65.55469}]}]}'
I have tried the following:
df = pd.DataFrame.from_dict(json_normalize(js), orient='columns')
df
and
df = pd.read_json(js)
results = df['results'].head()
dd = results['data']
results.to_json(orient='split')
and
data = json.loads(js)
data["results"]
data["startDate"]
data2 = json.loads(data["results"])
data2["data"]
And I want my DataFrame to be like below
Date Fruit Veg
0 2017-01-01 19.35608 100.0
1 2017-02-01 17.33902 80.41117
2 2017-03-01 22.30411 89.29402
3 2017-04-01 20.94646 74.32118
4 2017-05-01 23.8557 69.82156
Edit:
The code (from @COLDSPEED) worked perfectly until one point. I used it in my new crawler ("Crawler: Combining DataFrame per each loop Python"), and it ran perfectly until my DNA reached 170. The error message is below:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2441 try:
-> 2442 return self._engine.get_loc(key)
2443 except KeyError:
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'period'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-30-2a1de403b285> in <module>()
47 d = json.loads(js)
48 lst = [pd.DataFrame.from_dict(r['data']).set_index('period').rename(columns={'ratio' : r['title']})
---> 49 for r in d['results']]
50 df = pd.concat(lst, 1)
51 dfdfdf = Data.join(df)
<ipython-input-30-2a1de403b285> in <listcomp>(.0)
47 d = json.loads(js)
48 lst = [pd.DataFrame.from_dict(r['data']).set_index('period').rename(columns={'ratio' : r['title']})
---> 49 for r in d['results']]
50 df = pd.concat(lst, 1)
51 dfdfdf = Data.join(df)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in set_index(self, keys, drop, append, inplace, verify_integrity)
2828 names.append(None)
2829 else:
-> 2830 level = frame[col]._values
2831 names.append(col)
2832 if drop:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
1962 return self._getitem_multilevel(key)
1963 else:
-> 1964 return self._getitem_column(key)
1965
1966 def _getitem_column(self, key):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in _getitem_column(self, key)
1969 # get column
1970 if self.columns.is_unique:
-> 1971 return self._get_item_cache(key)
1972
1973 # duplicate columns & possible reduce dimensionality
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item)
1643 res = cache.get(item)
1644 if res is None:
-> 1645 values = self._data.get(item)
1646 res = self._box_item_values(item, values)
1647 cache[item] = res
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in get(self, item, fastpath)
3588
3589 if not isnull(item):
-> 3590 loc = self.items.get_loc(item)
3591 else:
3592 indexer = np.arange(len(self.items))[isnull(self.items)]
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2442 return self._engine.get_loc(key)
2443 except KeyError:
-> 2444 return self._engine.get_loc(self._maybe_cast_indexer(key))
2445
2446 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'period'
I found out that the error occurs when the js has no value in "data", as shown below (please disregard the Korean title):
{"startDate":"2016-01-01","endDate":"2017-12-03","timeUnit":"date","results":[{"title":"황금뿔나팔버섯","keywords":["황금뿔나팔버섯"],"data":[]}]}
So I want to check whether "data" is present before using your code. Please take a look below and tell me what is wrong with it:
if ([pd.DataFrame.from_dict(r['data']) for r in d['results']] == []):
    # want to put only the column name as 'title' and move on
else:
    lst = [pd.DataFrame.from_dict(r['data']).set_index('period').rename(columns={'ratio': r['title']})
           for r in d['results']]
    df = pd.concat(lst, 1)
Assuming your structure is consistent, use a list comprehension and then concatenate -
import json
d = json.loads(js)
lst = [
    pd.DataFrame.from_dict(r['data'])
        .set_index('period')
        .rename(columns={'ratio': r['title']})
    for r in d['results']
]
df = pd.concat(lst, 1)
df
fruit veg
period
2017-01-01 19.35608 100.00000
2017-02-01 17.33902 80.41117
2017-03-01 22.30411 89.29402
2017-04-01 20.94646 74.32118
2017-05-01 23.85570 69.82156
2017-06-01 22.38169 66.52444
2017-07-01 27.38557 67.84328
2017-08-01 19.16214 74.43754
2017-09-01 32.07913 65.82621
2017-10-01 41.89293 65.55469
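Regarding the edit: an `if` branch containing only a comment is a `SyntaxError` in Python, and the `KeyError: 'period'` comes from calling `set_index('period')` on an empty frame. One way to handle it is to filter empty results out of the comprehension and add their titles back as all-NaN columns afterwards. A sketch under those assumptions, with a hypothetical minimal payload:

```python
import json
import pandas as pd

# Hypothetical payload where one result has an empty "data" list,
# as in the asker's edit.
js = ('{"results":[{"title":"fruit","data":[{"period":"2017-01-01","ratio":1.0}]},'
      '{"title":"empty","data":[]}]}')
d = json.loads(js)

# Only build frames for results that actually have rows, so set_index("period")
# never sees a frame without that column.
lst = [
    pd.DataFrame(r["data"]).set_index("period").rename(columns={"ratio": r["title"]})
    for r in d["results"] if r["data"]
]
df = pd.concat(lst, axis=1)

# Put the empty titles back as NaN columns so nothing is silently dropped.
for r in d["results"]:
    if not r["data"]:
        df[r["title"]] = float("nan")
print(list(df.columns))  # ['fruit', 'empty']
```

The `if r["data"]` guard inside the comprehension is the whole fix; the trailing loop is optional bookkeeping.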

datetime.date creating many problems with set_index, groupby, and apply in Pandas 0.8.1

I'm using Pandas 0.8.1 in an environment where it is not possible to upgrade for bureaucratic reasons.
You may want to skip down to the "simplified problem" section below, before reading all about the initial problem and my goal.
My goal: group a DataFrame by a categorical column "D", and then for each group, sort by a date column "dt", set the index to "dt", perform a rolling OLS regression, and return the DataFrame beta of regression coefficients indexed by date.
The end result would hopefully be a bunch of stacked beta frames, each one unique to some specific categorical variable, so that the final index would be two levels, one for category ID and one for date.
If I do something like
my_dataframe.groupby("D").apply(some_wrapped_OLS_caller)
then I often get frustratingly uninformative KeyError: 0 errors, and the traceback seems to be choking on datetime issues:
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/frame.pyc in set_index(self, keys, drop, inplace, verify_integrity)
2287 arrays.append(level)
2288
-> 2289 index = MultiIndex.from_arrays(arrays, names=keys)
2290
2291 if verify_integrity and not index.is_unique:
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/index.pyc in from_arrays(cls, arrays, sortorder, names)
1505 if len(arrays) == 1:
1506 name = None if names is None else names[0]
-> 1507 return Index(arrays[0], name=name)
1508
1509 cats = [Categorical.from_array(arr) for arr in arrays]
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/index.pyc in __new__(cls, data, dtype, copy, name)
102 if dtype is None:
103 if (lib.is_datetime_array(subarr)
--> 104 or lib.is_datetime64_array(subarr)
105 or lib.is_timestamp_array(subarr)):
106 from pandas.tseries.index import DatetimeIndex
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.is_datetime64_array (pandas/src/tseries.c:90291)()
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/series.pyc in __getitem__(self, key)
427 def __getitem__(self, key):
428 try:
--> 429 return self.index.get_value(self, key)
430 except InvalidIndexError:
431 pass
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/index.pyc in get_value(self, series, key)
639 """
640 try:
--> 641 return self._engine.get_value(series, key)
642 except KeyError, e1:
643 if len(self) > 0 and self.inferred_type == 'integer':
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.IndexEngine.get_value (pandas/src/tseries.c:103842)()
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.IndexEngine.get_value (pandas/src/tseries.c:103670)()
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.IndexEngine.get_loc (pandas/src/tseries.c:104379)()
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.Int64HashTable.get_item (pandas/src/tseries.c:15547)()
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.Int64HashTable.get_item (pandas/src/tseries.c:15501)()
KeyError: 0
If I perform the regression steps manually on each group in the group-by object, one by one, everything works without a hitch.
Code:
import numpy as np
import pandas
import datetime
from dateutil.relativedelta import relativedelta as drr
def foo(zz):
    zz1 = zz.sort("dt", ascending=True).set_index("dt")
    r1 = pandas.ols(y=zz1["y1"], x=zz1["x"], window=60, min_periods=12)
    return r1.beta

dfrm_test = pandas.DataFrame({"x": np.random.rand(731),
                              "y1": np.random.rand(731),
                              "y2": np.random.rand(731),
                              "z": np.random.rand(731)})
dfrm_test['d'] = np.random.randint(0, 2, size=(len(dfrm_test),))
dfrm_test['dt'] = [datetime.date(2000, 1, 1) + drr(days=i)
                   for i in range(len(dfrm_test))]
Now here is what happens when I try to work with these using groupby and apply:
In [102]: dfrm_test.groupby("d").apply(foo)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-102-345a8d45df50> in <module>()
----> 1 dfrm_test.groupby("d").apply(foo)
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/groupby.pyc in apply(self, func, *args, **kwargs)
267 applied : type depending on grouped object and function
268 """
--> 269 return self._python_apply_general(func, *args, **kwargs)
270
271 def aggregate(self, func, *args, **kwargs):
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/groupby.pyc in _python_apply_general(self, func, *args, **kwargs)
402 group_axes = _get_axes(group)
403
--> 404 res = func(group, *args, **kwargs)
405
406 if not _is_indexed_like(res, group_axes):
<ipython-input-101-8b9184c63365> in foo(zz)
1 def foo(zz):
----> 2 zz1 = zz.sort("dt", ascending=True).set_index("dt")
3 r1 = pandas.ols(y=zz1["y1"], x=zz1["x"], window=60, min_periods=12)
4 return r1.beta
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/frame.pyc in set_index(self, keys, drop, inplace, verify_integrity)
2287 arrays.append(level)
2288
-> 2289 index = MultiIndex.from_arrays(arrays, names=keys)
2290
2291 if verify_integrity and not index.is_unique:
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/index.pyc in from_arrays(cls, arrays, sortorder, names)
1505 if len(arrays) == 1:
1506 name = None if names is None else names[0]
-> 1507 return Index(arrays[0], name=name)
1508
1509 cats = [Categorical.from_array(arr) for arr in arrays]
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/index.pyc in __new__(cls, data, dtype, copy, name)
102 if dtype is None:
103 if (lib.is_datetime_array(subarr)
--> 104 or lib.is_datetime64_array(subarr)
105 or lib.is_timestamp_array(subarr)):
106 from pandas.tseries.index import DatetimeIndex
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.is_datetime64_array (pandas/src/tseries.c:90291)()
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/series.pyc in __getitem__(self, key)
427 def __getitem__(self, key):
428 try:
--> 429 return self.index.get_value(self, key)
430 except InvalidIndexError:
431 pass
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/index.pyc in get_value(self, series, key)
639 """
640 try:
--> 641 return self._engine.get_value(series, key)
642 except KeyError, e1:
643 if len(self) > 0 and self.inferred_type == 'integer':
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.IndexEngine.get_value (pandas/src/tseries.c:103842)()
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.IndexEngine.get_value (pandas/src/tseries.c:103670)()
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.IndexEngine.get_loc (pandas/src/tseries.c:104379)()
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.Int64HashTable.get_item (pandas/src/tseries.c:15547)()
/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.Int64HashTable.get_item (pandas/src/tseries.c:15501)()
KeyError: 0
If I save the groupby object and attempt to apply foo myself in the straightforward way, this also fails:
In [103]: grps = dfrm_test.groupby("d")
In [104]: for grp in grps:
   .....:     foo(grp[1])
   .....:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-104-f215ff55c12b> in <module>()
      1 for grp in grps:
----> 2     foo(grp[1])
      3

<ipython-input-101-8b9184c63365> in foo(zz)
      1 def foo(zz):
----> 2     zz1 = zz.sort("dt", ascending=True).set_index("dt")
      3     r1 = pandas.ols(y=zz1["y1"], x=zz1["x"], window=60, min_periods=12)
      4     return r1.beta

[... remaining frames (set_index -> MultiIndex.from_arrays -> Index.__new__ -> get_value) identical to the traceback above ...]

KeyError: 0
But if I store off one of the group DataFrames and then call foo on it, it works just fine ... ??
In [105]: for grp in grps:
   .....:     x = grp[1]
   .....:
In [106]: x.head()
Out[106]:
x y1 y2 z dt d
0 0.240858 0.235135 0.196027 0.940180 2000-01-01 1
1 0.115784 0.802576 0.870014 0.482418 2000-01-02 1
2 0.081640 0.939411 0.344041 0.846485 2000-01-03 1
5 0.608413 0.100349 0.306595 0.739987 2000-01-06 1
6 0.429635 0.678575 0.449520 0.362761 2000-01-07 1
In [107]: foo(x)
Out[107]:
<class 'pandas.core.frame.DataFrame'>
Index: 360 entries, 2000-01-17 to 2001-12-29
Data columns:
x 360 non-null values
intercept 360 non-null values
dtypes: float64(2)
What's going on here? Does it have to do with tripping the logic that tries to convert the index to date/time types? How can I work around it?
Simplified Problem
I can reduce the problem to just the set_index call within the applied function. But this is getting really weird. Here's an example with a simpler test DataFrame, using set_index alone.
In [154]: tdf = pandas.DataFrame(
{"dt":([datetime.date(2000,1,i+1) for i in range(12)] +
[datetime.date(2001,3,j+1) for j in range(13)]),
"d":np.random.randint(1,4,(25,)),
"x":np.random.rand(25)})
In [155]: tdf
Out[155]:
d dt x
0 1 2000-01-01 0.430667
1 3 2000-01-02 0.159652
2 1 2000-01-03 0.719015
3 1 2000-01-04 0.175328
4 3 2000-01-05 0.233810
5 3 2000-01-06 0.581176
6 1 2000-01-07 0.912615
7 1 2000-01-08 0.534971
8 3 2000-01-09 0.373345
9 1 2000-01-10 0.182665
10 1 2000-01-11 0.286681
11 3 2000-01-12 0.054054
12 3 2001-03-01 0.861348
13 1 2001-03-02 0.093717
14 2 2001-03-03 0.729503
15 1 2001-03-04 0.888558
16 1 2001-03-05 0.263055
17 1 2001-03-06 0.558430
18 3 2001-03-07 0.064216
19 3 2001-03-08 0.018823
20 3 2001-03-09 0.207845
21 2 2001-03-10 0.735640
22 2 2001-03-11 0.908427
23 2 2001-03-12 0.819994
24 2 2001-03-13 0.798267
set_index works fine here, no date changing or anything.
In [156]: tdf.set_index("dt")
Out[156]:
d x
dt
2000-01-01 1 0.430667
2000-01-02 3 0.159652
2000-01-03 1 0.719015
2000-01-04 1 0.175328
2000-01-05 3 0.233810
2000-01-06 3 0.581176
2000-01-07 1 0.912615
2000-01-08 1 0.534971
2000-01-09 3 0.373345
2000-01-10 1 0.182665
2000-01-11 1 0.286681
2000-01-12 3 0.054054
2001-03-01 3 0.861348
2001-03-02 1 0.093717
2001-03-03 2 0.729503
2001-03-04 1 0.888558
2001-03-05 1 0.263055
2001-03-06 1 0.558430
2001-03-07 3 0.064216
2001-03-08 3 0.018823
2001-03-09 3 0.207845
2001-03-10 2 0.735640
2001-03-11 2 0.908427
2001-03-12 2 0.819994
2001-03-13 2 0.798267
set_index within groupby.apply fails, though (note it errors before hitting any unpacking problems with incongruent group sizes; it simply cannot set the index at all).
In [157]: tdf.groupby("d").apply(lambda x: x.set_index("dt"))
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-157-cf2d3964f4d3> in <module>()
----> 1 tdf.groupby("d").apply(lambda x: x.set_index("dt"))

<ipython-input-157-cf2d3964f4d3> in <lambda>(x)
----> 1 tdf.groupby("d").apply(lambda x: x.set_index("dt"))

[... remaining frames (set_index -> MultiIndex.from_arrays -> Index.__new__ -> get_value) identical to the first traceback above ...]

KeyError: 0
Very weird part
Here I save off the group objects and try to call set_index on them manually. This doesn't work, even when I pull out the specific DataFrame element from the group.
In [159]: grps = tdf.groupby("d")
In [160]: grps
Out[160]: <pandas.core.groupby.DataFrameGroupBy at 0x7600bd0>
In [161]: grps_list = [(x,y) for x,y in grps]
In [162]: grps_list[2][1].set_index("dt")
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-162-77f985a6e063> in <module>()
----> 1 grps_list[2][1].set_index("dt")

[... remaining frames (set_index -> MultiIndex.from_arrays -> Index.__new__ -> get_value) identical to the first traceback above ...]

KeyError: 0
But if I construct a direct manual copy of the group's DataFrame, set_index works just fine on the reconstruction??
In [163]: grps_list[2][1]
Out[163]:
d dt x
1 3 2000-01-02 0.159652
4 3 2000-01-05 0.233810
5 3 2000-01-06 0.581176
8 3 2000-01-09 0.373345
11 3 2000-01-12 0.054054
12 3 2001-03-01 0.861348
18 3 2001-03-07 0.064216
19 3 2001-03-08 0.018823
20 3 2001-03-09 0.207845
In [165]: recreation = pandas.DataFrame(
{"d":[3,3,3,3,3,3,3,3,3],
"dt":[datetime.date(2000,1,2), datetime.date(2000,1,5), datetime.date(2000,1,6),
datetime.date(2000,1,9), datetime.date(2000,1,12), datetime.date(2001,3,1),
datetime.date(2001,3,7), datetime.date(2001,3,8), datetime.date(2001,3,9)],
"x":[0.159, 0.233, 0.581, 0.3733, 0.054, 0.861, 0.064, 0.0188, 0.2078]})
In [166]: recreation
Out[166]:
d dt x
0 3 2000-01-02 0.1590
1 3 2000-01-05 0.2330
2 3 2000-01-06 0.5810
3 3 2000-01-09 0.3733
4 3 2000-01-12 0.0540
5 3 2001-03-01 0.8610
6 3 2001-03-07 0.0640
7 3 2001-03-08 0.0188
8 3 2001-03-09 0.2078
In [167]: recreation.set_index("dt")
Out[167]:
d x
dt
2000-01-02 3 0.1590
2000-01-05 3 0.2330
2000-01-06 3 0.5810
2000-01-09 3 0.3733
2000-01-12 3 0.0540
2001-03-01 3 0.8610
2001-03-07 3 0.0640
2001-03-08 3 0.0188
2001-03-09 3 0.2078
As the pirates might say in the first few episodes of Archer Season 3: What hell damn guy?
It turns out this stems from something groupby does internally that changes the groups' indices into a MultiIndex.
Adding a call to reset the index inside the function passed to apply gets rid of the problem:
def foo(zz):
    zz1 = zz.sort("dt", ascending=True).reset_index().set_index("dt")
    r1 = pandas.ols(y=zz1["y1"], x=zz1["x"], window=60, min_periods=12)
    return r1.beta
and this at least provides a workaround.
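For anyone landing here on a recent pandas: the KeyError itself was specific to pandas 0.8.x, but the reset-then-set_index pattern above still applies cleanly. Below is a minimal sketch of that pattern on modern pandas; the `tdf`, `per_group`, and column names are illustrative, not from the original data.

```python
import datetime
import pandas as pd

# Tiny stand-in for the question's DataFrame (illustrative values).
tdf = pd.DataFrame({
    "d":  [1, 3, 1, 3],
    "dt": [datetime.date(2000, 1, i) for i in range(1, 5)],
    "x":  [0.10, 0.20, 0.30, 0.40],
})

def per_group(g):
    # Drop whatever row-label index groupby handed us before
    # re-indexing each group on "dt".
    return g.reset_index(drop=True).set_index("dt")

# Selecting the columns explicitly keeps the grouping column "d"
# out of each group; group_keys=True prepends "d" as an outer
# index level on the result.
out = tdf.groupby("d", group_keys=True)[["dt", "x"]].apply(per_group)

print(out)
```

The result is indexed by a ("d", "dt") MultiIndex, which is usually what you want when each group is re-indexed by a column.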
