Can someone help me understand why the following is giving an error?:
import numpy as np
from pydataset import data
mtcars = data('mtcars')
mtcars.apply(['mean', lambda x: max(x)-min(x), lambda x: np.percentile(x, 0.15)])
I am trying to create a data frame with the mean, max-min and 15th Percentile for all columns of the dataset mtcars.
Error Message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/aggregation.py in agg_list_like(obj, arg, _axis)
674 try:
--> 675 return concat(results, keys=keys, axis=1, sort=False)
676 except TypeError as err:
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
284 """
--> 285 op = _Concatenator(
286 objs,
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/reshape/concat.py in __init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
369 )
--> 370 raise TypeError(msg)
371
TypeError: cannot concatenate object of type '<class 'float'>'; only Series and DataFrame objs are valid
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
<ipython-input-645-51b8f1de1855> in <module>
----> 1 mtcars.apply(['mean', lambda x: max(x)-min(x), lambda x: np.percentile(x, 0.15)])
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
7766 kwds=kwds,
7767 )
-> 7768 return op.get_result()
7769
7770 def applymap(self, func, na_action: Optional[str] = None) -> DataFrame:
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/apply.py in get_result(self)
145 # pandas\core\apply.py:144: error: "aggregate" of "DataFrame" gets
146 # multiple values for keyword argument "axis"
--> 147 return self.obj.aggregate( # type: ignore[misc]
148 self.f, axis=self.axis, *self.args, **self.kwds
149 )
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py in aggregate(self, func, axis, *args, **kwargs)
7576 result = None
7577 try:
-> 7578 result, how = self._aggregate(func, axis, *args, **kwargs)
7579 except TypeError as err:
7580 exc = TypeError(
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py in _aggregate(self, arg, axis, *args, **kwargs)
7607 result = result.T if result is not None else result
7608 return result, how
-> 7609 return aggregate(self, arg, *args, **kwargs)
7610
7611 agg = aggregate
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/aggregation.py in aggregate(obj, arg, *args, **kwargs)
584 # we require a list, but not an 'str'
585 arg = cast(List[AggFuncTypeBase], arg)
--> 586 return agg_list_like(obj, arg, _axis=_axis), None
587 else:
588 result = None
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/aggregation.py in agg_list_like(obj, arg, _axis)
651 colg = obj._gotitem(col, ndim=1, subset=selected_obj.iloc[:, index])
652 try:
--> 653 new_res = colg.aggregate(arg)
654 except (TypeError, DataError):
655 pass
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/series.py in aggregate(self, func, axis, *args, **kwargs)
3972 func = dict(kwargs.items())
3973
-> 3974 result, how = aggregate(self, func, *args, **kwargs)
3975 if result is None:
3976
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/aggregation.py in aggregate(obj, arg, *args, **kwargs)
584 # we require a list, but not an 'str'
585 arg = cast(List[AggFuncTypeBase], arg)
--> 586 return agg_list_like(obj, arg, _axis=_axis), None
587 else:
588 result = None
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/aggregation.py in agg_list_like(obj, arg, _axis)
683 result = Series(results, index=keys, name=obj.name)
684 if is_nested_object(result):
--> 685 raise ValueError(
686 "cannot combine transform and aggregation operations"
687 ) from err
ValueError: cannot combine transform and aggregation operations
But, the following works:
mtcars.apply(['mean', lambda x: max(x)-min(x)])
Both type(mtcars.apply(lambda x: np.percentile(x, 0.15))) and type(mtcars.apply(lambda x: max(x)-min(x))) give a pandas Series. Then why is the problem happening only with the percentile?
Thanks
Reading the answer by @James, my guess is that you need to write the custom function such that the function is applied to the series as a whole and not to each element. Maybe someone else who is more familiar with the underlying pandas code can chip in:
def min_max(x):
return max(x)-min(x)
def perc(x):
return x.quantile(0.15)
mtcars.agg(['mean',min_max,perc])
mpg cyl disp hp drat wt qsec vs am gear carb
mean 20.090625 6.1875 230.721875 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125
min_max 23.500000 4.0000 400.900000 283.0000 2.170000 3.91100 8.40000 1.0000 1.00000 2.0000 7.0000
perc 14.895000 4.0000 103.485000 82.2500 3.070000 2.17900 16.24300 0.0000 0.00000 3.0000 1.0000
Related
I have a column in the form of a data-frame that contains the ratio of some numbers.
On that df col, I want to apply hurst function using df.apply() method.
I don't know if the error is with the df.apply or with the hurst_function.
Consider the code which calculates hurst exponent on a col using the df.apply method:
import hurst
def hurst_function(df_col_slice):
display(df_col_slice)
return hurst.compute_Hc(df_col_slice)
def func(df_col):
results = round(df_col.rolling(101).apply(hurst_function)[100:],1)
return results
func(df_col)
I get the error:
Input In [73], in func(df_col)
---> 32 results = round(df_col.rolling(101).apply(hurst_function)[100:],1)
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\window\rolling.py:1843, in Rolling.apply(self, func, raw, engine, engine_kwargs, args, kwargs)
1822 #doc(
1823 template_header,
1824 create_section_header("Parameters"),
(...)
1841 kwargs: dict[str, Any] | None = None,
1842 ):
-> 1843 return super().apply(
1844 func,
1845 raw=raw,
1846 engine=engine,
1847 engine_kwargs=engine_kwargs,
1848 args=args,
1849 kwargs=kwargs,
1850 )
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\window\rolling.py:1315, in RollingAndExpandingMixin.apply(self, func, raw, engine, engine_kwargs, args, kwargs)
1312 else:
1313 raise ValueError("engine must be either 'numba' or 'cython'")
-> 1315 return self._apply(
1316 apply_func,
1317 numba_cache_key=numba_cache_key,
1318 numba_args=numba_args,
1319 )
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\window\rolling.py:590, in BaseWindow._apply(self, func, name, numba_cache_key, numba_args, **kwargs)
587 return result
589 if self.method == "single":
--> 590 return self._apply_blockwise(homogeneous_func, name)
591 else:
592 return self._apply_tablewise(homogeneous_func, name)
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\window\rolling.py:442, in BaseWindow._apply_blockwise(self, homogeneous_func, name)
437 """
438 Apply the given function to the DataFrame broken down into homogeneous
439 sub-frames.
440 """
441 if self._selected_obj.ndim == 1:
--> 442 return self._apply_series(homogeneous_func, name)
444 obj = self._create_data(self._selected_obj)
445 if name == "count":
446 # GH 12541: Special case for count where we support date-like types
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\window\rolling.py:431, in BaseWindow._apply_series(self, homogeneous_func, name)
428 except (TypeError, NotImplementedError) as err:
429 raise DataError("No numeric types to aggregate") from err
--> 431 result = homogeneous_func(values)
432 return obj._constructor(result, index=obj.index, name=obj.name)
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\window\rolling.py:582, in BaseWindow._apply.<locals>.homogeneous_func(values)
579 return func(x, start, end, min_periods, *numba_args)
581 with np.errstate(all="ignore"):
--> 582 result = calc(values)
584 if numba_cache_key is not None:
585 NUMBA_FUNC_CACHE[numba_cache_key] = func
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\window\rolling.py:579, in BaseWindow._apply.<locals>.homogeneous_func.<locals>.calc(x)
571 start, end = window_indexer.get_window_bounds(
572 num_values=len(x),
573 min_periods=min_periods,
574 center=self.center,
575 closed=self.closed,
576 )
577 self._check_window_bounds(start, end, len(x))
--> 579 return func(x, start, end, min_periods, *numba_args)
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\window\rolling.py:1342, in RollingAndExpandingMixin._generate_cython_apply_func.<locals>.apply_func(values, begin, end, min_periods, raw)
1339 if not raw:
1340 # GH 45912
1341 values = Series(values, index=self._on)
-> 1342 return window_func(values, begin, end, min_periods)
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\_libs\window\aggregations.pyx:1315, in pandas._libs.window.aggregations.roll_apply()
TypeError: must be real number, not tuple
What can I do to solve this?
Edit: display(df_col_slice) is giving the following output:
0 0.282043
1 0.103355
2 0.537766
3 0.491976
4 0.535050
...
96 0.022696
97 0.438995
98 -0.131486
99 0.248250
100 1.246463
Length: 101, dtype: float64
hurst.compute_Hc function returns a tuple of 3 values:
H, c, vals = compute_Hc(df_col_slice)
where H is the Hurst exponent , and c - is some constant.
But pandas._libs.window.aggregations.roll_apply() expects its argument (a function) to return a single (scalar) value, which is the reduced result of a rolling window.
That's why your hurst_function function needs to return a specific value from vals.
Although I have used another method, i.e. astype(datetime[ns]), and it worked, I still want to know why this code gives an error.
'''transaction['transaction_date'] = transaction['transaction_date'].apply([lambda x :datetime.strptime(x,"%y-%m-%d, %H:%M:%S")])'''
it gives this error message
ValueError Traceback (most recent call last)
<ipython-input-76-17fc8b725c4b> in <module>
----> 1 transaction['transaction_date'] = transaction['transaction_date'].apply([lambda x :datetime.strptime(x,"%y-%m-%d, %H:%M:%S")])
~\Anacondas\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
3821 # dispatch to agg
3822 if isinstance(func, (list, dict)):
-> 3823 return self.aggregate(func, *args, **kwds)
3824
3825 # if we are a string, try to dispatch
~\Anacondas\lib\site-packages\pandas\core\series.py in aggregate(self, func, axis, *args, **kwargs)
3686 # Validate the axis parameter
3687 self._get_axis_number(axis)
-> 3688 result, how = self._aggregate(func, *args, **kwargs)
3689 if result is None:
3690
~\Anacondas\lib\site-packages\pandas\core\base.py in _aggregate(self, arg, *args, **kwargs)
484 elif is_list_like(arg):
485 # we require a list, but not an 'str'
--> 486 return self._aggregate_multiple_funcs(arg, _axis=_axis), None
487 else:
488 result = None
~\Anacondas\lib\site-packages\pandas\core\base.py in _aggregate_multiple_funcs(self, arg, _axis)
549 # if we are empty
550 if not len(results):
--> 551 raise ValueError("no results")
552
553 try:
ValueError: no results
I have the following dataframe created by using dataframe.from_delayed method tha has the following columns
_id hour_timestamp http_method total_hits username hour weekday.
Some details on the source dataframe:
hits_rate_stats._meta.dtypes
_id object
hour_timestamp datetime64[ns]
http_method object
total_hits object
username object
hour int64
weekday int64
dtype: object
meta index:
RangeIndex(start=0, stop=0, step=1)
When I execute the following code
my_df_grouped = my_df.groupby(['username', 'http_method', 'weekday', 'hour'])
my_df_grouped.total_hits.sum().reset_index().compute()
I get the following exception:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-27-b24b24fc86db> in <module>()
----> 1 hits_rate_stats_grouped.total_hits.sum().reset_index().compute()
/home/avlach/virtualenvs/enorasys_sa_v2/local/lib/python2.7/site-packages/dask/base.pyc in compute(self, **kwargs)
141 dask.base.compute
142 """
--> 143 (result,) = compute(self, traverse=False, **kwargs)
144 return result
145
/home/avlach/virtualenvs/enorasys_sa_v2/local/lib/python2.7/site-packages/dask/base.pyc in compute(*args, **kwargs)
390 postcomputes = [a.__dask_postcompute__() if is_dask_collection(a)
391 else (None, a) for a in args]
--> 392 results = get(dsk, keys, **kwargs)
393 results_iter = iter(results)
394 return tuple(a if f is None else f(next(results_iter), *a)
/home/avlach/virtualenvs/enorasys_sa_v2/local/lib/python2.7/site-packages/distributed/client.pyc in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, **kwargs)
2039 secede()
2040 try:
-> 2041 results = self.gather(packed, asynchronous=asynchronous)
2042 finally:
2043 for f in futures.values():
/home/avlach/virtualenvs/enorasys_sa_v2/local/lib/python2.7/site-packages/distributed/client.pyc in gather(self, futures, errors, maxsize, direct, asynchronous)
1476 return self.sync(self._gather, futures, errors=errors,
1477 direct=direct, local_worker=local_worker,
-> 1478 asynchronous=asynchronous)
1479
1480 #gen.coroutine
/home/avlach/virtualenvs/enorasys_sa_v2/local/lib/python2.7/site-packages/distributed/client.pyc in sync(self, func, *args, **kwargs)
601 return future
602 else:
--> 603 return sync(self.loop, func, *args, **kwargs)
604
605 def __repr__(self):
/home/avlach/virtualenvs/enorasys_sa_v2/local/lib/python2.7/site-packages/distributed/utils.pyc in sync(loop, func, *args, **kwargs)
251 e.wait(10)
252 if error[0]:
--> 253 six.reraise(*error[0])
254 else:
255 return result[0]
/home/avlach/virtualenvs/enorasys_sa_v2/local/lib/python2.7/site-packages/distributed/utils.pyc in f()
235 yield gen.moment
236 thread_state.asynchronous = True
--> 237 result[0] = yield make_coro()
238 except Exception as exc:
239 logger.exception(exc)
/home/avlach/virtualenvs/enorasys_sa_v2/local/lib/python2.7/site-packages/tornado/gen.pyc in run(self)
1053
1054 try:
-> 1055 value = future.result()
1056 except Exception:
1057 self.had_exception = True
/home/avlach/virtualenvs/enorasys_sa_v2/local/lib/python2.7/site-packages/tornado/concurrent.pyc in result(self, timeout)
236 if self._exc_info is not None:
237 try:
--> 238 raise_exc_info(self._exc_info)
239 finally:
240 self = None
/home/avlach/virtualenvs/enorasys_sa_v2/local/lib/python2.7/site-packages/tornado/gen.pyc in run(self)
1061 if exc_info is not None:
1062 try:
-> 1063 yielded = self.gen.throw(*exc_info)
1064 finally:
1065 # Break up a reference to itself
/home/avlach/virtualenvs/enorasys_sa_v2/local/lib/python2.7/site-packages/distributed/client.pyc in _gather(self, futures, errors, direct, local_worker)
1354 six.reraise(type(exception),
1355 exception,
-> 1356 traceback)
1357 if errors == 'skip':
1358 bad_keys.add(key)
/home/avlach/virtualenvs/enorasys_sa_v2/local/lib/python2.7/site-packages/dask/dataframe/core.pyc in apply_and_enforce()
3354 return meta
3355 c = meta.columns if isinstance(df, pd.DataFrame) else meta.name
-> 3356 return _rename(c, df)
3357 return df
3358
/home/avlach/virtualenvs/enorasys_sa_v2/local/lib/python2.7/site-packages/dask/dataframe/core.pyc in _rename()
3391 # deep=False doesn't doesn't copy any data/indices, so this is cheap
3392 df = df.copy(deep=False)
-> 3393 df.columns = columns
3394 return df
3395 elif isinstance(df, (pd.Series, pd.Index)):
/home/avlach/virtualenvs/enorasys_sa_v2/local/lib/python2.7/site-packages/pandas/core/generic.pyc in __setattr__()
3625 try:
3626 object.__getattribute__(self, name)
-> 3627 return object.__setattr__(self, name, value)
3628 except AttributeError:
3629 pass
pandas/_libs/properties.pyx in pandas._libs.properties.AxisProperty.__set__()
/home/avlach/virtualenvs/enorasys_sa_v2/local/lib/python2.7/site-packages/pandas/core/generic.pyc in _set_axis()
557
558 def _set_axis(self, axis, labels):
--> 559 self._data.set_axis(axis, labels)
560 self._clear_item_cache()
561
/home/avlach/virtualenvs/enorasys_sa_v2/local/lib/python2.7/site-packages/pandas/core/internals.pyc in set_axis()
3072 raise ValueError('Length mismatch: Expected axis has %d elements, '
3073 'new values have %d elements' %
-> 3074 (old_len, new_len))
3075
3076 self.axes[axis] = new_labels
ValueError: Length mismatch: Expected axis has 5 elements, new values have 2 elements
When I do my_df_grouped.count().reset_index().compute() it works as it should and when I do my_df_grouped.sum().reset_index().compute() I get
/home/avlach/virtualenvs/enorasys_sa_v2/local/lib/python2.7/site-packages/pandas/core/groupby.pyc in _get_grouper()
2830 raise ValueError('No group keys passed!')
2831 else:
-> 2832 raise ValueError('multiple levels only valid with '
2833 'MultiIndex')
2834
ValueError: multiple levels only valid with MultiIndex
Reproducing locally with dummy data doesn't give me these errors. What could be going wrong?
EDIT:
It seems it is losing the multi index. If I do this:
total_hits = my_df_grouped.total_hits.sum()
total_hits._meta.index = pd.MultiIndex(levels=[[],[],[],[],], labels=[[],[],[],[]], names=['username', 'http_method', 'weekday', 'hour'])
I have a large time series data set which I want to process with Dask.
apart from a few other columns, there is a column called 'id' which identifies individuals and a column transc_date which identifies the date and a column transc_time identifying the time when an individual made a transaction.
The data is sorted using:
df = df.map_partitions(lambda x: x.sort_values(['id', 'transc_date', 'transc_time'], ascending=[True, True, True]))
transc_time is of type int and transc_date is of type datetime64.
I want to create a new column which gives me for each individual the number of days since the last transaction. For this I created the following function:
def get_diff_since_last_trans(df, plot=True):
df['diff_last'] = df.map_overlap(lambda x: x.groupby('id')['transc_date'].diff(), before=10, after=10)
diffs = df[['id', 'diff_last']].groupby(['id']).agg('max')['diff_last'].dt.days.compute()
if plot:
sns.distplot(diffs.values, kde = False, rug = False)
return diffs
When I try this function on a small subset of the data (200k rows) it works as intended. But when I use it on the full data set I get the ValueError below.
I dropped all ids which have fewer than 10 occurrences first. transc_date does not contain nans, it only contains datetime64 entries.
Any idea what's going wrong?
ValueError Traceback (most recent call last)
<ipython-input-12-551d7256f328> in <module>()
1 a = get_diff_first_last_trans(df, plot=False)
----> 2 b = get_diff_since_last_trans(df, plot=False)
3 plot_trans_diff(a,b)
<ipython-input-10-8f83d4571659> in get_diff_since_last_trans(df, plot)
12 def get_diff_since_last_trans(df, plot=True):
13 df['diff_last'] = df.map_overlap(lambda x: x.groupby('id')['transc_date'].diff(), before=10, after=10)
---> 14 diffs = df[['id', 'diff_last']].groupby(['id']).agg('max')['diff_last'].dt.days.compute()
15 if plot:
16 sns.distplot(diffs.values, kde = False, rug = False)
~/venv/lib/python3.6/site-packages/dask/base.py in compute(self, **kwargs)
133 dask.base.compute
134 """
--> 135(result,)= compute(self, traverse=False,**kwargs) 136return result
137
~/venv/lib/python3.6/site-packages/dask/base.py in compute(*args, **kwargs)
331 postcomputes = [a.__dask_postcompute__() if is_dask_collection(a)
332 else (None, a) for a in args]
--> 333 results = get(dsk, keys, **kwargs)
334 results_iter = iter(results)
335 return tuple(a if f is None else f(next(results_iter), *a)
~/venv/lib/python3.6/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, **kwargs)
1997 secede()
1998 try:
-> 1999 results = self.gather(packed, asynchronous=asynchronous)
2000 finally:
2001 for f in futures.values():
~/venv/lib/python3.6/site-packages/distributed/client.py in gather(self, futures, errors, maxsize, direct, asynchronous)
1435 return self.sync(self._gather, futures, errors=errors,
1436 direct=direct, local_worker=local_worker,
-> 1437 asynchronous=asynchronous)
1438
1439 #gen.coroutine
~/venv/lib/python3.6/site-packages/distributed/client.py in sync(self, func, *args, **kwargs)
590 return future
591 else:
--> 592return sync(self.loop, func,*args,**kwargs) 593 594def __repr__(self):
~/venv/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
252 e.wait(1000000)
253 if error[0]:
--> 254 six.reraise(*error[0])
255 else:
256 return result[0]
~/venv/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
691 if value.__traceback__ is not tb:
692 raise value.with_traceback(tb)
--> 693raise value
694finally: 695 value =None
~/venv/lib/python3.6/site-packages/distributed/utils.py in f()
236 yield gen.moment
237 thread_state.asynchronous = True
--> 238 result[0] = yield make_coro()
239 except Exception as exc:
240 logger.exception(exc)
~/venv/lib/python3.6/site-packages/tornado/gen.py in run(self)
1053
1054 try:
-> 1055 value = future.result()
1056 except Exception:
1057 self.had_exception = True
~/venv/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
236 if self._exc_info is not None:
237 try:
--> 238 raise_exc_info(self._exc_info)
239 finally:
240 self = None
~/venv/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)
~/venv/lib/python3.6/site-packages/tornado/gen.py in run(self)
1061 if exc_info is not None:
1062 try:
-> 1063 yielded = self.gen.throw(*exc_info)
1064 finally:
1065 # Break up a reference to itself
~/venv/lib/python3.6/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
1313 six.reraise(type(exception),
1314 exception,
-> 1315 traceback)
1316 if errors == 'skip':
1317 bad_keys.add(key)
~/venv/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
690 value = tp()
691 if value.__traceback__ is not tb:
--> 692raise value.with_traceback(tb) 693raise value
694finally:
~/venv/lib/python3.6/site-packages/dask/dataframe/rolling.py in overlap_chunk()
30 parts = [p for p in (prev_part, current_part, next_part) if p is not None]
31 combined = pd.concat(parts)
---> 32 out = func(combined, *args, **kwargs)
33 if prev_part is None:
34 before = None
<ipython-input-10-8f83d4571659> in <lambda>()
11
12 def get_diff_since_last_trans(df, plot=True):
---> 13 df['diff_last'] = df.map_overlap(lambda x: x.groupby('id')['transc_date'].diff(), before=10, after=10)
14 diffs = df[['id', 'diff_last']].groupby(['id']).agg('max')['diff_last'].dt.days.compute()
15 if plot:
~/venv/lib/python3.6/site-packages/pandas/core/groupby.py in wrapper()
737 *args, **kwargs)
738 except (AttributeError):
--> 739raise ValueError
740 741return wrapper
ValueError:
I get a KeyError when I try to plot a slice of a pandas DataFrame column with datetimes in it. Does anybody know what could cause this?
I managed to reproduce the error in a little self contained example (which you can also view here: http://nbviewer.ipython.org/3714142/):
import numpy as np
from pandas import DataFrame
import datetime
from pylab import *
test = DataFrame({'x' : [datetime.datetime(2012,9,10) + datetime.timedelta(n) for n in range(10)],
'y' : range(10)})
Now if I plot:
plot(test['x'][0:5])
there is no problem, but when I plot:
plot(test['x'][5:10])
I get the KeyError below (and the error message is not very helpfull to me). This only happens with datetime columns, not with other columns (as far as I experienced). E.g. plot(test['y'][5:10]) is not a problem.
The error message:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-7-aa076e3fc4e0> in <module>()
----> 1 plot(test['x'][5:10])
C:\Python27\lib\site-packages\matplotlib\pyplot.pyc in plot(*args, **kwargs)
2456 ax.hold(hold)
2457 try:
-> 2458 ret = ax.plot(*args, **kwargs)
2459 draw_if_interactive()
2460 finally:
C:\Python27\lib\site-packages\matplotlib\axes.pyc in plot(self, *args, **kwargs)
3846 lines = []
3847
-> 3848 for line in self._get_lines(*args, **kwargs):
3849 self.add_line(line)
3850 lines.append(line)
C:\Python27\lib\site-packages\matplotlib\axes.pyc in _grab_next_args(self, *args, **kwargs)
321 return
322 if len(remaining) <= 3:
--> 323 for seg in self._plot_args(remaining, kwargs):
324 yield seg
325 return
C:\Python27\lib\site-packages\matplotlib\axes.pyc in _plot_args(self, tup, kwargs)
298 x = np.arange(y.shape[0], dtype=float)
299
--> 300 x, y = self._xy_from_xy(x, y)
301
302 if self.command == 'plot':
C:\Python27\lib\site-packages\matplotlib\axes.pyc in _xy_from_xy(self, x, y)
215 if self.axes.xaxis is not None and self.axes.yaxis is not None:
216 bx = self.axes.xaxis.update_units(x)
--> 217 by = self.axes.yaxis.update_units(y)
218
219 if self.command!='plot':
C:\Python27\lib\site-packages\matplotlib\axis.pyc in update_units(self, data)
1277 neednew = self.converter!=converter
1278 self.converter = converter
-> 1279 default = self.converter.default_units(data, self)
1280 #print 'update units: default=%s, units=%s'%(default, self.units)
1281 if default is not None and self.units is None:
C:\Python27\lib\site-packages\matplotlib\dates.pyc in default_units(x, axis)
1153 'Return the tzinfo instance of *x* or of its first element, or None'
1154 try:
-> 1155 x = x[0]
1156 except (TypeError, IndexError):
1157 pass
C:\Python27\lib\site-packages\pandas\core\series.pyc in __getitem__(self, key)
374 def __getitem__(self, key):
375 try:
--> 376 return self.index.get_value(self, key)
377 except InvalidIndexError:
378 pass
C:\Python27\lib\site-packages\pandas\core\index.pyc in get_value(self, series, key)
529 """
530 try:
--> 531 return self._engine.get_value(series, key)
532 except KeyError, e1:
533 if len(self) > 0 and self.inferred_type == 'integer':
C:\Python27\lib\site-packages\pandas\_engines.pyd in pandas._engines.IndexEngine.get_value (pandas\src\engines.c:1479)()
C:\Python27\lib\site-packages\pandas\_engines.pyd in pandas._engines.IndexEngine.get_value (pandas\src\engines.c:1374)()
C:\Python27\lib\site-packages\pandas\_engines.pyd in pandas._engines.DictIndexEngine.get_loc (pandas\src\engines.c:2498)()
C:\Python27\lib\site-packages\pandas\_engines.pyd in pandas._engines.DictIndexEngine.get_loc (pandas\src\engines.c:2460)()
KeyError: 0
HYRY explained why you get the KeyError.
To plot with slices using matplotlib you can do:
In [157]: plot(test['x'][5:10].values)
Out[157]: [<matplotlib.lines.Line2D at 0xc38348c>]
In [158]: plot(test['x'][5:10].reset_index(drop=True))
Out[158]: [<matplotlib.lines.Line2D at 0xc37e3cc>]
x, y plotting in one go with 0.7.3
In [161]: test[5:10].set_index('x')['y'].plot()
Out[161]: <matplotlib.axes.AxesSubplot at 0xc48b1cc>
Instead of calling plot(test["x"][5:10]), you can call the plot method of Series object:
test["x"][5:10].plot()
The reason: test["x"][5:10] is a Series object with an integer index from 5 to 10. plot() tries to get index 0 of it, which causes the error.
I encountered this error with pd.groupby in Pandas 0.14.0 and solved it with df = df[df['col']!= 0].reset_index()