I am running Python 3.6.x with pandas 0.19.2. I am trying to create a list for each row of a dataframe, as below. This example works:
df = pd.DataFrame({'names':['a', 'b', 'c'], 'year_min':[2001, 2010, 2005], 'year_max':[2018, 2019, 2017]})
start_year = 2017
df['years'] = df.apply(lambda x: list(range(max(x['year_min'],start_year), x['year_max']+1)), axis=1)
df
Out[37]:
names year_max year_min years
0 a 2018 2001 [2017, 2018]
1 b 2019 2010 [2017, 2018, 2019]
2 c 2017 2005 [2017]
Unfortunately, when I try the same line of code on the dataframe in this pickle file, I get an error, despite the dtypes of the two columns still being int64. No doubt I have messed up some part of this dataframe, but I have no clue what the problem is. Any ideas?
players = pd.read_pickle("players_2017_2019.p")
start_year = 2017
players['years'] = players.apply(lambda x: list(range(max(x['year_min'],start_year), x['year_max']+1)), axis=1)
Traceback (most recent call last):
File "...\python36\win64\431\lib\site-packages\pandas\core\internals.py", line 4262, in create_block_manager_from_arrays
blocks = form_blocks(arrays, names, axes)
File "...\python36\win64\431\lib\site-packages\pandas\core\internals.py", line 4339, in form_blocks
int_blocks = _multi_blockify(int_items)
File "...\python36\win64\431\lib\site-packages\pandas\core\internals.py", line 4408, in _multi_blockify
values, placement = _stack_arrays(list(tup_block), dtype)
File "...\python36\win64\431\lib\site-packages\pandas\core\internals.py", line 4453, in _stack_arrays
stacked[i] = _asarray_compat(arr)
ValueError: could not broadcast input array from shape (2) into shape (3)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "...\python36\win64\431\lib\site-packages\IPython\core\interactiveshell.py", line 2881, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-2-7fc8712b01b0>", line 1, in <module>
players.apply(lambda x: list(range(max(x['year_min'],start_year), x['year_max']+1)), axis=1)
File "...\python36\win64\431\lib\site-packages\pandas\core\frame.py", line 4152, in apply
return self._apply_standard(f, axis, reduce=reduce)
File "...\python36\win64\431\lib\site-packages\pandas\core\frame.py", line 4265, in _apply_standard
result = self._constructor(data=results, index=index)
File "...\python36\win64\431\lib\site-packages\pandas\core\frame.py", line 266, in __init__
mgr = self._init_dict(data, index, columns, dtype=dtype)
File "...\python36\win64\431\lib\site-packages\pandas\core\frame.py", line 402, in _init_dict
return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "...\python36\win64\431\lib\site-packages\pandas\core\frame.py", line 5408, in _arrays_to_mgr
return create_block_manager_from_arrays(arrays, arr_names, axes)
File "...\python36\win64\431\lib\site-packages\pandas\core\internals.py", line 4267, in create_block_manager_from_arrays
construction_error(len(arrays), arrays[0].shape, axes, e)
File "...\python36\win64\431\lib\site-packages\pandas\core\internals.py", line 4231, in construction_error
raise ValueError("Empty data passed with indices specified.")
ValueError: Empty data passed with indices specified.
EDIT:
The issue was solved when I updated pandas to 0.23.0.
The issue appears to be linked to https://github.com/pandas-dev/pandas/issues/17892.
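For anyone who cannot upgrade, a possible workaround (a sketch, assuming year_min and year_max really are plain int64 columns) is to build the column without apply, which sidesteps the block-manager code path shown in the traceback:
start_year = 2017
# build the list column with a plain comprehension instead of DataFrame.apply
players['years'] = [list(range(max(lo, start_year), hi + 1))
                    for lo, hi in zip(players['year_min'], players['year_max'])]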
data.groupby(by="DAY").agg({"CLOSING_DATE": min})
How come that when I tried to groupby my dataframe to get the oldest date for a sparse column (CLOSING_DATE is mostly empty) I get the following error?
Traceback (most recent call last):
File "<ipython-input-23-37f9fe161304>", line 1, in <module>
data[:10000].groupby(by="DAY").agg({"CLOSING_DATE": min})
File "/home/user/miniconda3/envs/churn/lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 951, in aggregate
result, how = self._aggregate(func, *args, **kwargs)
File "/home/user/miniconda3/envs/py_env/lib/python3.8/site-packages/pandas/core/base.py", line 416, in _aggregate
result = _agg(arg, _agg_1dim)
File "/home/user/miniconda3/envs/py_env/lib/python3.8/site-packages/pandas/core/base.py", line 383, in _agg
result[fname] = func(fname, agg_how)
File "/home/user/miniconda3/envs/py_env/lib/python3.8/site-packages/pandas/core/base.py", line 367, in _agg_1dim
return colg.aggregate(how)
File "/home/user/miniconda3/envs/py_env/lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 252, in aggregate
return getattr(self, cyfunc)()
File "/home/user/miniconda3/envs/py_env/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 1553, in min
return self._agg_general(
File "/home/user/miniconda3/envs/py_env/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 1000, in _agg_general
result = self._cython_agg_general(
File "/home/user/miniconda3/envs/py_env/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 1035, in _cython_agg_general
result, agg_names = self.grouper.aggregate(
File "/home/user/miniconda3/envs/py_env/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 591, in aggregate
return self._cython_operation(
File "/home/user/miniconda3/envs/py_env/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 471, in _cython_operation
raise NotImplementedError(f"{values.dtype} dtype not supported")
NotImplementedError: Sparse[float64, nan] dtype not supported
This is a bug in pandas, related to a recent refactor of the cython-optimized groupby code:
https://github.com/pandas-dev/pandas/issues/38980
You have two choices:
1. Downgrade pandas to 1.1.4 and wait for the bug to be fixed (maybe ~4-6 weeks).
2. Convert the sparse column to a dense one before the groupby, via the Series.sparse.to_dense() accessor.
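For example (a sketch; it assumes CLOSING_DATE is the only sparse column involved):
# densify the sparse column first, then group as before
data["CLOSING_DATE"] = data["CLOSING_DATE"].sparse.to_dense()
data.groupby(by="DAY").agg({"CLOSING_DATE": min})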
I have a dataframe from which I have to retrieve the unique values in order to create some partitioning. I have that part, and I can get a small dataframe with each row being a certain partition. The challenge is that I then need to filter the original dataframe down to only the appropriate data (without modifying the original frame, so I can filter for all the values) so I can send it to S3.
I am having trouble filtering the dataframe based on the rows of the small dataframe.
Here is my code:
df_partitions = df.groupby(['grid_id', 'case_id', 'snapshot_year', 'snapshot_month', 'snapshot_day']).size().reset_index()
df_parts = df_partitions[['grid_id', 'case_id', 'snapshot_year', 'snapshot_month', 'snapshot_day']]
for index, row in df_parts.iterrows():
    dest_key_name = '/rec/{}/{}/{}/{}/{}/{}/{}'.format(row['grid_id'], row['case_id'],
                                                       row['snapshot_year'], row['snapshot_month'],
                                                       row['snapshot_day'], file_partition_time,
                                                       'df.csv')
    df_test = df
    filter_df = df_test[(df_test['grid_id'] == row['grid_id'] &
                         df_test['case_id'] == row['case_id'] &
                         df_test['snapshot_year'] == row['snapshot_year'] &
                         df_test['snapshot_month'] == row['snapshot_month'] &
                         df_test['snapshot_day'] == row['snapshot_day'])]
    print(filter_df)
Here is the error:
Traceback (most recent call last):
File "<input>", line 8, in <module>
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/ops.py", line 954, in wrapper
na_op(self.values, other),
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/ops.py", line 924, in na_op
raise TypeError(msg)
TypeError: cannot compare a dtyped [object] array with a scalar of type [bool]
I also tried:
filters_df = df[row]
Here is the error:
KeyError: "['pjm' 'base' 2020 2 21] not in index"
and
df_test = df
i1 = df_test.set_index(row).index
i2 = df_parts.set_index(row).index
filter_df = df_test[~i1.isin(i2)]
Here is the error:
Traceback (most recent call last):
File "<input>", line 7, in <module>
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/frame.py", line 3164, in set_index
frame.index = index
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/generic.py", line 3627, in __setattr__
return object.__setattr__(self, name, value)
File "pandas/_libs/properties.pyx", line 69, in pandas._libs.properties.AxisProperty.__set__
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/generic.py", line 559, in _set_axis
self._data.set_axis(axis, labels)
File "/local/workspace/FinBIPortal/env/RenewableEnergyValuationLambda-1.0/runtime/lib/python3.6/site-packages/pandas/core/internals.py", line 3074, in set_axis
(old_len, new_len))
ValueError: Length mismatch: Expected axis has 130 elements, new values have 5 elements
Very simple solution here. The format for filtering on multiple criteria is df[(...) & (...)], while you are trying df[(... & ...)]. Since & binds more tightly than ==, each comparison needs its own parentheses. Close out those parentheses where you're setting filter_df.
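With each comparison wrapped in its own parentheses, the filter from the question becomes:
filter_df = df_test[(df_test['grid_id'] == row['grid_id']) &
                    (df_test['case_id'] == row['case_id']) &
                    (df_test['snapshot_year'] == row['snapshot_year']) &
                    (df_test['snapshot_month'] == row['snapshot_month']) &
                    (df_test['snapshot_day'] == row['snapshot_day'])]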
I am looping through the rows of a pandas df with loop index i.
I am able to assign several columns using the ix function, with the loop index as the first parameter and the column name as the second.
However, when I try to retrieve/print using this method,
print(df.ix[i,"Run"])
I get the following TypeError: 'str' object cannot be interpreted as an integer,
which seems somehow related to KeyError: 'Run'.
Not quite sure why this is occurring, as Run is indeed a column in the dataframe.
Any suggestions?
Traceback (most recent call last):
File \!"C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexes\base.py\!", line 3124, in get_value
return libindex.get_value_box(s, key)
File \!"pandas\_libs\index.pyx\!", line 55, in pandas._libs.index.get_value_box
File \!"pandas\_libs\index.pyx\!", line 63, in pandas._libs.index.get_value_box
TypeError: 'str' object cannot be interpreted as an integer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File \!"C:\...", line 365, in <module>
print(df.ix[i,\!"Run\!"])
File \!"C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexing.py\!", line 116, in __getitem__
return self._getitem_tuple(key)
File \!"C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexing.py\!", line 870, in _getitem_tuple
return self._getitem_lowerdim(tup)
File \!"C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexing.py\!", line 1027, in _getitem_lowerdim
return getattr(section, self.name)[new_key]
File \!"C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexing.py\!", line 122, in __getitem__
return self._getitem_axis(key, axis=axis)
File \!"C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexing.py\!", line 1116, in _getitem_axis
return self._get_label(key, axis=axis)
File \!"C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexing.py\!", line 136, in _get_label
return self.obj[label]
File \!"C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\series.py\!", line 767, in __getitem__
result = self.index.get_value(self, key)
File \!"C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexes\base.py\!", line 3132, in get_value
raise e1
File \!"C:\WPy-3670\python-3.6.7.amd64\lib\site-packages\pandas\core\indexes\base.py\!", line 3118, in get_value
tz=getattr(series.dtype, 'tz', None))
File \!"pandas\_libs\index.pyx\!", line 106, in pandas._libs.index.IndexEngine.get_value
File \!"pandas\_libs\index.pyx\!", line 114, in pandas._libs.index.IndexEngine.get_value
File \!"pandas\_libs\index.pyx\!", line 162, in pandas._libs.index.IndexEngine.get_loc
File \!"pandas\_libs\hashtable_class_helper.pxi\!", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File \!"pandas\_libs\hashtable_class_helper.pxi\!", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Run'
"
Upon changing the column I print to any other column, it does work correctly. Earlier in the code, I "compressed" the rows (there were multiple rows per unique string in the 'Run' column) using the following:
df = df.groupby('Run').max()
Did this last line somehow remove the column/column name from the table?
ix has been deprecated. ix has always been ambiguous: does ix[10] refer to the row with the label 10, or the row at position 10?
Use loc or iloc instead:
df.loc[i,"Run"] = ... # by label
df.iloc[i]["Run"] = ... # by position
As for the groupby removing Run: it moves Run to the index of the data frame. To get it back as a column, call reset_index:
df = df.groupby('Run').max().reset_index()
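To see the effect on a small, hypothetical frame:
df = pd.DataFrame({'Run': ['a', 'a', 'b'], 'val': [1, 2, 3]})
g = df.groupby('Run').max()
print(g.columns)    # Index(['val'], dtype='object') -- 'Run' is now the index, not a column
g = g.reset_index()
print(g.columns)    # Index(['Run', 'val'], dtype='object') -- 'Run' is a column again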
Differences between indexing by label and position:
Suppose you have a series like this:
s = pd.Series(['a', 'b', 'c', 'd', 'e'], index=np.arange(0, 9, 2))  # assumes numpy imported as np
0 a
2 b
4 c
6 d
8 e
The first column is the labels (aka the index). The second column is the values of the series.
Label based indexing:
s.loc[2] --> b
s.loc[3] --> error. The label doesn't exist
Position based indexing:
s.iloc[2] --> c, since `a` has position 0, `b` has position 1, and so on
s.iloc[3] --> d
According to the documentation, s.ix[3] would have returned d, since it first searches for the label 3 and, when that fails, falls back to position 3. On my machine (pandas 0.24.2), it raises an error along with a deprecation warning, so I guess the developers changed it to behave like loc.
If you want to use mixed indexing, you have to be explicit about that:
s.loc[3] if 3 in s.index else s.iloc[3]
I want to delete any rows containing a specific string from a dataframe.
Specifically, I want to delete rows with an abnormal email address (one ending in .jpg).
Here's my code; what's wrong with it?
df = pd.DataFrame({'email':['abc#gmail.com', 'cde#gmail.com', 'ghe#ss.jpg', 'sldkslk#sss.com']})
df
email
0 abc#gmail.com
1 cde#gmail.com
2 ghe#ss.jpg
3 sldkslk#sss.com
for i, r in df.iterrows():
    if df.loc[i, 'email'][-3:] == 'com':
        df.drop(df.index[i], inplace=True)
Traceback (most recent call last):
File "<ipython-input-84-4f12d22e5e4c>", line 2, in <module>
if df.loc[i,'email'][-3:] == 'com':
File "C:\Anaconda\lib\site-packages\pandas\core\indexing.py", line 1472, in __getitem__
return self._getitem_tuple(key)
File "C:\Anaconda\lib\site-packages\pandas\core\indexing.py", line 870, in _getitem_tuple
return self._getitem_lowerdim(tup)
File "C:\Anaconda\lib\site-packages\pandas\core\indexing.py", line 998, in _getitem_lowerdim
section = self._getitem_axis(key, axis=i)
File "C:\Anaconda\lib\site-packages\pandas\core\indexing.py", line 1911, in _getitem_axis
self._validate_key(key, axis)
File "C:\Anaconda\lib\site-packages\pandas\core\indexing.py", line 1798, in _validate_key
error()
File "C:\Anaconda\lib\site-packages\pandas\core\indexing.py", line 1785, in error
axis=self.obj._get_axis_name(axis)))
KeyError: 'the label [2] is not in the [index]'
IIUC, you can do this rather than iterating through your frame with iterrows:
df = df[df.email.str.endswith('.com')]
which returns:
>>> df
email
0 abc#gmail.com
1 cde#gmail.com
3 sldkslk#sss.com
Or, for larger dataframes, it's sometimes faster to skip the str methods provided by pandas and do it in a plain list comprehension with Python's built-in string methods:
df = df[[i.endswith('.com') for i in df.email]]
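If the goal is specifically to drop the .jpg addresses rather than keep only the .com ones, the same idea works with negation (a sketch, assuming .jpg is the only abnormal ending):
# keep every row whose email does NOT end in '.jpg'
df = df[~df.email.str.endswith('.jpg')]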
This is a cut-down version of a program that converts trade data into OHLCV format.
import pandas as pd
data = pd.DataFrame({ 'time' : [pd.Timestamp('2017-12-26 16:01:04.628431600')], 'price': [100.0], 'size': [0.06] })
data.set_index('time', inplace=True)
data = data.resample('1s').apply({ 'price' : 'ohlc', 'size': 'sum' })
I'm getting the following error:
Traceback (most recent call last):
File "/home/jun/.local/lib/python3.5/site-packages/pandas/core/common.py", line 1404, in _asarray_tuplesafe
result[:] = values
ValueError: could not broadcast input array from shape (4) into shape (1)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/tmp/tmp.py", line 5, in <module>
data = data.resample('1s').apply({ 'price' : 'ohlc', 'size': 'sum' })
File "/home/jun/.local/lib/python3.5/site-packages/pandas/tseries/resample.py", line 293, in aggregate
result, how = self._aggregate(arg, *args, **kwargs)
File "/home/jun/.local/lib/python3.5/site-packages/pandas/core/base.py", line 560, in _aggregate
result = DataFrame(result)
File "/home/jun/.local/lib/python3.5/site-packages/pandas/core/frame.py", line 224, in __init__
mgr = self._init_dict(data, index, columns, dtype=dtype)
File "/home/jun/.local/lib/python3.5/site-packages/pandas/core/frame.py", line 360, in _init_dict
return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "/home/jun/.local/lib/python3.5/site-packages/pandas/core/frame.py", line 5236, in _arrays_to_mgr
arrays = _homogenize(arrays, index, dtype)
File "/home/jun/.local/lib/python3.5/site-packages/pandas/core/frame.py", line 5546, in _homogenize
raise_cast_failure=False)
File "/home/jun/.local/lib/python3.5/site-packages/pandas/core/series.py", line 2922, in _sanitize_array
subarr = _asarray_tuplesafe(data, dtype=dtype)
File "/home/jun/.local/lib/python3.5/site-packages/pandas/core/common.py", line 1407, in _asarray_tuplesafe
result[:] = [tuple(x) for x in values]
ValueError: cannot copy sequence with size 4 to array axis with dimension 1
This doesn't make sense. IIUC, the assignment on line 5 is just assigning a new DataFrame to a variable, so there's nothing to broadcast. Stranger still, this failure is non-deterministic: running the script sometimes results in the error, but sometimes it doesn't.
Am I doing something wrong, or is this a bug in pandas? Where's the source of non-determinism in this program? I'm using Python 3.5.3 with pandas 0.18.1.
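One plausible source of the non-determinism (my reading, not from the original thread): in Python 3.5, dict iteration order is arbitrary and changes between runs due to hash randomization, so {'price': 'ohlc', 'size': 'sum'} is processed in a different order on each run, and only some orders trigger the broadcasting failure. A workaround sketch that avoids passing a dict to apply is to compute the two aggregations separately and join them:
import pandas as pd

data = pd.DataFrame({'time': [pd.Timestamp('2017-12-26 16:01:04.628431600')],
                     'price': [100.0], 'size': [0.06]})
data.set_index('time', inplace=True)

# aggregate each column separately, then combine into one frame
ohlc = data['price'].resample('1s').ohlc()        # open/high/low/close columns
size = data['size'].resample('1s').sum().rename('size')
result = pd.concat([ohlc, size], axis=1)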