This is an extension of my previous question:
python pandas rolling function with two arguments.
How do I perform the same by group? Let's say that the 'C' column below is used for grouping.
I am struggling to:
Group by column 'C'
Within each group, sort by 'A'
Within each group, apply a rolling function taking two arguments, like kendalltau, to columns 'A' and 'B'.
The expected result would be a DataFrame like the one below:
I have been trying the 'pass an index' workaround described in the link above, but the complexity of this case is beyond my skills :-( . This is a toy example, not far from what I am actually working with; for simplicity I used randomly generated data.
import numpy as np
import pandas as pd

rand = np.random.RandomState(1)
dff = pd.DataFrame({'A': np.arange(20),
                    'B': rand.randint(100, 120, 20),
                    'C': rand.randint(0, 2, 20)})
def my_tau_indx(indx):
    x = dff.iloc[indx, 0]
    y = dff.iloc[indx, 1]
    tau = sp.stats.mstats.kendalltau(x, y)[0]
    return tau

dff['tau'] = dff.sort_values(['C', 'A']).groupby('C').rolling(window=5).apply(my_tau_indx, args=([dff.index.values]))
Every fix I make creates yet another bug...
The above issue was solved by Nickil Maveli, and his answer works with numpy 1.11.0, pandas 0.18.1, scipy 0.17.1, and conda 4.1.4. It generates some warnings, but it works.
On another machine with the latest and greatest numpy 1.12.0, pandas 0.19.2, scipy 0.18.1, conda 3.10.0, and BLAS/LAPACK, it does not work, and I get the traceback below. This seems version related: after I upgraded the first machine, it stopped working there too... In the name of science... ;-)
As Nickil suggested, this was due to an incompatibility between numpy 1.11 and 1.12. Downgrading numpy helped. Since I had BLAS/LAPACK on a Windows machine, I installed numpy 1.11.3+mkl from http://www.lfd.uci.edu/~gohlke/pythonlibs/ .
Traceback (most recent call last):
File "<ipython-input-4-bbca2c0e986b>", line 16, in <module>
t = grp.apply(func)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\core\groupby.py", line 651, in apply
return self._python_apply_general(f)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\core\groupby.py", line 655, in _python_apply_general
self.axis)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\core\groupby.py", line 1527, in apply
res = f(group)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\core\groupby.py", line 647, in f
return func(g, *args, **kwargs)
File "<ipython-input-4-bbca2c0e986b>", line 15, in <lambda>
func = lambda x: pd.Series(pd.rolling_apply(np.arange(len(x)), 5, my_tau_indx), x.index)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\stats\moments.py", line 584, in rolling_apply
kwargs=kwargs)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\stats\moments.py", line 240, in ensure_compat
result = getattr(r, name)(*args, **kwds)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\core\window.py", line 863, in apply
return super(Rolling, self).apply(func, args=args, kwargs=kwargs)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\core\window.py", line 621, in apply
center=False)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\core\window.py", line 560, in _apply
result = calc(values)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\core\window.py", line 555, in calc
return func(x, window, min_periods=self.min_periods)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\core\window.py", line 618, in f
kwargs)
File "pandas\algos.pyx", line 1831, in pandas.algos.roll_generic (pandas\algos.c:51768)
File "<ipython-input-4-bbca2c0e986b>", line 8, in my_tau_indx
x = dff.iloc[indx, 0]
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\core\indexing.py", line 1294, in __getitem__
return self._getitem_tuple(key)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\core\indexing.py", line 1560, in _getitem_tuple
retval = getattr(retval, self.name)._getitem_axis(key, axis=axis)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\core\indexing.py", line 1614, in _getitem_axis
return self._get_loc(key, axis=axis)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\core\indexing.py", line 96, in _get_loc
return self.obj._ixs(key, axis=axis)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\core\frame.py", line 1908, in _ixs
label = self.index[i]
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\indexes\range.py", line 510, in __getitem__
return super_getitem(key)
File "C:\Apps\Anaconda\v2_1_0_x64\envs\python35\lib\site-packages\pandas\indexes\base.py", line 1275, in __getitem__
result = getitem(key)
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
The final check:
One way to achieve this is to iterate through every group and use pd.rolling_apply on each such group.
import scipy.stats as ss

def my_tau_indx(indx):
    x = dff.iloc[indx, 0]
    y = dff.iloc[indx, 1]
    tau = ss.mstats.kendalltau(x, y)[0]
    return tau

grp = dff.sort_values(['A', 'C']).groupby('C', group_keys=False)
func = lambda x: pd.Series(pd.rolling_apply(np.arange(len(x)), 5, my_tau_indx), x.index)
t = grp.apply(func)
dff.reindex(t.index).assign(tau=t)
EDIT:
def my_tau_indx(indx):
    x = dff.ix[indx, 0]
    y = dff.ix[indx, 1]
    tau = ss.mstats.kendalltau(x, y)[0]
    return tau

grp = dff.sort_values(['A', 'C']).groupby('C', group_keys=False)
t = grp.rolling(5).apply(my_tau_indx).get('A')
grp.head(dff.shape[0]).reindex(t.index).assign(tau=t)
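Note that pd.rolling_apply and .ix used above are deprecated and have been removed in newer pandas releases. As a hedged, version-agnostic sketch of the same idea (my own rewrite, not part of Nickil's answer): iterate over the groups explicitly and compute the trailing-window tau with plain slicing.

import numpy as np
import pandas as pd
import scipy.stats as ss

def rolling_tau(group, window=5):
    # Kendall's tau of 'A' vs. 'B' over a trailing window; NaN until the
    # window is full, mirroring rolling(window=5).
    out = pd.Series(np.nan, index=group.index)
    for i in range(window - 1, len(group)):
        chunk = group.iloc[i - window + 1:i + 1]
        out.iloc[i] = ss.mstats.kendalltau(chunk['A'], chunk['B'])[0]
    return out

sorted_df = dff.sort_values(['C', 'A'])
tau = pd.concat(rolling_tau(g) for _, g in sorted_df.groupby('C'))
result = sorted_df.assign(tau=tau)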
Related
I get an error when running the following MWE:
import xarray as xr
import numpy as np
from numpy.linalg import pinv
import dask

data = np.random.randn(4, 4, 3, 2)
da = xr.DataArray(data=data, dims=("x", "y", "i", "j"))
da = da.chunk(x=1, y=1)

da_inv = xr.apply_ufunc(pinv, da,
                        input_core_dims=[["i", "j"]],
                        output_core_dims=[["i", "j"]],
                        exclude_dims=set(("i", "j")),
                        dask="parallelized",
                        )
This throws the following error:
Traceback (most recent call last):
File "/glade/scratch/tomasc/tracer_inversion2/mwe.py", line 14, in <module>
da_inv = xr.apply_ufunc(pinv, da,
File "/glade/u/home/tomasc/miniconda3/envs/py310/lib/python3.10/site-packages/xarray/core/computation.py", line 1204, in apply_ufunc
return apply_dataarray_vfunc(
File "/glade/u/home/tomasc/miniconda3/envs/py310/lib/python3.10/site-packages/xarray/core/computation.py", line 315, in apply_dataarray_vfunc
result_var = func(*data_vars)
File "/glade/u/home/tomasc/miniconda3/envs/py310/lib/python3.10/site-packages/xarray/core/computation.py", line 771, in apply_variable_ufunc
result_data = func(*input_data)
File "/glade/u/home/tomasc/miniconda3/envs/py310/lib/python3.10/site-packages/xarray/core/computation.py", line 747, in func
res = da.apply_gufunc(
File "/glade/u/home/tomasc/miniconda3/envs/py310/lib/python3.10/site-packages/dask/array/gufunc.py", line 489, in apply_gufunc
core_output_shape = tuple(core_shapes[d] for d in ocd)
File "/glade/u/home/tomasc/miniconda3/envs/py310/lib/python3.10/site-packages/dask/array/gufunc.py", line 489, in <genexpr>
core_output_shape = tuple(core_shapes[d] for d in ocd)
KeyError: 'dim0'
And yet, when using dask.array.map_blocks directly, things seem to work right out of the box:
data_inv = dask.array.map_blocks(pinv, da.data).compute() # works!
What am I missing here?
(Same question answered on the xarray repository here.)
You were almost there; you just needed to add the sizes of the new output dimensions by including the kwarg
dask_gufunc_kwargs={'output_sizes': {'i': 2, 'j': 3}}
It does sort of say this in the docstring for apply_ufunc but it could definitely be clearer!
That's a very unhelpful error, but it's ultimately being thrown because the keys 'i' and 'j' don't exist in the dict of expected sizes of the output (because you didn't provide them).
The actual error message has been improved in xarray version v2023.2.0.
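Putting that together with the MWE above, the corrected call would look roughly like this (a sketch; the sizes 2 and 3 come from pinv turning each (3, 2) core block into a (2, 3) one):

da_inv = xr.apply_ufunc(
    pinv, da,
    input_core_dims=[["i", "j"]],
    output_core_dims=[["i", "j"]],
    exclude_dims=set(("i", "j")),
    dask="parallelized",
    dask_gufunc_kwargs={"output_sizes": {"i": 2, "j": 3}},
)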
I have a pandas data frame that consists of a single column of numpy arrays. I can use the numpy.mean function to calculate the mean of the arrays.
import numpy
import pandas
f = pandas.DataFrame({"a":[numpy.array([1.0, 2.0]), numpy.array([3.0, 4.0])]})
numpy.mean(f["a"]) # returns array([2., 3.])
I want to do the same thing in Dask.
import dask.dataframe
import dask.array
g = dask.dataframe.from_pandas(f, npartitions=1)
dask.array.mean(g["a"], dtype="float64")
(You have to specify the dtype, otherwise you get a TypeError: unsupported operand type(s) for /: 'NoneType' and 'int' exception.)
The call to dask.array.mean returns the following, which looks correct.
dask.array<mean_agg-aggregate, shape=(), dtype=float64, chunksize=(), chunktype=numpy.ndarray>
However, when I run dask.array.mean(g["a"], dtype="float64").compute() to get the final value I get a ValueError: setting an array element with a sequence. exception. The full stack is as follows.
Traceback (most recent call last):
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
exec(exp, global_vars, local_vars)
File "<input>", line 1, in <module>
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/dask/base.py", line 165, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/dask/base.py", line 436, in compute
results = schedule(dsk, keys, **kwargs)
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/dask/threaded.py", line 81, in get
**kwargs
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/dask/local.py", line 486, in get_async
raise_exception(exc, tb)
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/dask/local.py", line 316, in reraise
raise exc
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/dask/local.py", line 222, in execute_task
result = _execute_task(task, data)
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/dask/core.py", line 118, in _execute_task
args2 = [_execute_task(a, cache) for a in args]
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/dask/core.py", line 118, in <listcomp>
args2 = [_execute_task(a, cache) for a in args]
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/dask/core.py", line 119, in _execute_task
return func(*args2)
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/dask/optimization.py", line 982, in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/dask/core.py", line 149, in get
result = _execute_task(task, cache)
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/dask/core.py", line 119, in _execute_task
return func(*args2)
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/dask/utils.py", line 29, in apply
return func(*args, **kwargs)
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/dask/array/reductions.py", line 539, in mean_chunk
total = sum(x, dtype=dtype, **kwargs)
File "<__array_function__ internals>", line 6, in sum
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 2229, in sum
initial=initial, where=where)
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 90, in _wrapreduction
return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: setting an array element with a sequence.
Is it possible to perform the equivalent Dask operation?
It would be good if Dask Dataframe handled this case, but it doesn't today. It's not actually that surprising given the situation.
Your dataframe is a bit odd, in that elements of that dataframe are themselves Numpy arrays.
>>> f
a
0 [1.0, 2.0]
1 [3.0, 4.0]
As a result, Pandas thinks that this is an object dtype dataframe
>>> f.dtypes
a object
dtype: object
Because Dask Dataframe is lazy, it doesn't actually keep track of all of the data at any given point; it only knows the dtypes, which in this case are pretty non-informative. Dask Dataframe doesn't really know what to do with a mean computation on these complex elements. It doesn't know whether your elements are numpy arrays, strings, custom Python objects, etc.
So it raises an error, and you need to provide a data type explicitly.
The full solution to this problem is probably for Pandas to develop a much more complex dtype hierarchy, but that's probably unlikely in the near term.
Ideally Dask Dataframe would give a better error message here encouraging you to specify a dtype manually. If you wanted to raise an issue, that would be welcome.
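In the meantime, a hedged workaround sketch (my own, not an official Dask recipe): stack the per-row arrays into an ordinary 2-D numeric block first, so the dtype becomes informative, and reduce that with dask.array.

import numpy as np
import pandas as pd
import dask.array as darr

f = pd.DataFrame({"a": [np.array([1.0, 2.0]), np.array([3.0, 4.0])]})

# Stack the object column into one (2, 2) float array, wrap it as a dask
# array, then average along the rows.
stacked = darr.from_array(np.stack(f["a"].values), chunks=(1, 2))
print(stacked.mean(axis=0).compute())  # [2. 3.]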
While trying to create an optimisation algorithm for work, I found a particular problem.
Here is some basic information about the code:
LZ is a nested list.
M is a numpy array converted from the nested list.
Here is the code:
for i in range(len(LZ)):
    for j in range(len(LZ[i])):
        constraints1 = lambda MA, i=i, j=j: MAXQ - abs(M[i][j] - MA[i][j])
        print(M[i][j])
        if j < len(LZ[i]) - 1:
            constraints2 = lambda MA, i=i, j=j: PENTEMAX + ((MA[i][j] - MA[i][j+1]) / LL[i][j])
            constraints3 = lambda MA, i=i, j=j: PENTEMAX - ((MA[i][j] - MA[i][j+1]) / LL[i][j])
        cons.append({'type': 'ineq', 'fun': constraints1})
        cons.append({'type': 'ineq', 'fun': constraints2})
        cons.append({'type': 'ineq', 'fun': constraints3})

x0 = M
sol = minimize(objective, x0, method='SLSQP', constraints=cons)
I run the code, and here is what I get.
It prints the M[i][j] values just fine (the printout is long, so I didn't copy it here):
Traceback (most recent call last):
File "D:/Opti Assainissement/VOIRIE5.py", line 118, in <module>
sol = minimize(objective,x0,method='SLSQP',constraints=cons)
File "C:\Users\Asus\AppData\Local\Programs\Python\Python37\lib\site-packages\scipy\optimize\_minimize.py", line 611, in minimize
constraints, callback=callback, **options)
File "C:\Users\Asus\AppData\Local\Programs\Python\Python37\lib\site-packages\scipy\optimize\slsqp.py", line 315, in _minimize_slsqp
for c in cons['ineq']]))
File "C:\Users\Asus\AppData\Local\Programs\Python\Python37\lib\site-packages\scipy\optimize\slsqp.py", line 315, in <listcomp>
for c in cons['ineq']]))
File "D:/Opti Assainissement/VOIRIE5.py", line 101, in <lambda>
constraints1 = lambda MA, i=i,j=j: cdt(MA,i,j)
File "D:/Opti Assainissement/VOIRIE5.py", line 98, in cdt
return MAXQ - abs(M[i][j]-MA[i][j])
IndexError: invalid index to scalar variable.
My first guess was that SciPy doesn't recognise MA as an array, but I can't tell whether this is related to SciPy, to the lambda construction, or to my lack of knowledge in the matter. I'd be glad to get some help from the community!
I have the following
from sympy import Symbol, integrate

x = Symbol('x', commutative=False)
y = Symbol('y', commutative=False)
expr = 2*x + 87*x*y + 7*y
Now, this works
integrate(expr,y,manual=True)
because it gives
2*x*y + 87*x*y**2/2 + 7*y**2/2
but the same exact thing with x fails:
integrate(expr,x,manual=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/sympy/integrals/integrals.py", line 1295, in integrate
risch=risch, manual=manual)
File "/usr/local/lib/python2.7/dist-packages/sympy/integrals/integrals.py", line 486, in doit
conds=conds)
File "/usr/local/lib/python2.7/dist-packages/sympy/integrals/integrals.py", line 774, in _eval_integral
poly = f.as_poly(x)
File "/usr/local/lib/python2.7/dist-packages/sympy/core/basic.py", line 706, in as_poly
poly = Poly(self, *gens, **args)
File "/usr/local/lib/python2.7/dist-packages/sympy/polys/polytools.py", line 113, in __new__
opt = options.build_options(gens, args)
File "/usr/local/lib/python2.7/dist-packages/sympy/polys/polyoptions.py", line 731, in build_options
return Options(gens, args)
File "/usr/local/lib/python2.7/dist-packages/sympy/polys/polyoptions.py", line 154, in __init__
preprocess_options(args)
File "/usr/local/lib/python2.7/dist-packages/sympy/polys/polyoptions.py", line 152, in preprocess_options
self[option] = cls.preprocess(value)
File "/usr/local/lib/python2.7/dist-packages/sympy/polys/polyoptions.py", line 293, in preprocess
raise GeneratorsError("non-commutative generators: %s" % str(gens))
sympy.polys.polyerrors.GeneratorsError: non-commutative generators: (x,)
Why is SymPy so weird here? How can I fix this?
You seem satisfied with
integrate(2*x + 87*x*y + 7*y, y, manual=True)
returning
2*x*y + 87*x*y**2/2 + 7*y**2/2
But the first term of this answer could also be 2*y*x. Or x*y + y*x. And these are all different answers. So, is the notion of an integral with noncommutative symbols well-defined to begin with? Maybe it's not that SymPy is weird, but the question you are asking it is.
The concrete reason for this behavior is that manual integration is based on matching certain patterns. Such as "constant times something" pattern:
coeff, f = integrand.as_independent(symbol)
The method as_independent splits the product as independent * possibly_dependent, in this order. So,
(x*y).as_independent(y) # returns (x, y)
(x*y).as_independent(x) # returns (1, x*y)
As a result, constant factors are recognized only in front of the expression, when the product is noncommutative.
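For reference, a minimal self-contained check of that splitting behaviour (the same calls as above, just with the import spelled out):

from sympy import Symbol

x = Symbol('x', commutative=False)
y = Symbol('y', commutative=False)

# With noncommutative symbols, as_independent only peels factors off the front:
print((x*y).as_independent(y))  # (x, y)
print((x*y).as_independent(x))  # (1, x*y)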
I don't think this can be fixed without rewriting one of the core methods, as_independent, to support noncommutative products (possibly returning independent * dependent * independent2), which looks like a lot of work to me. Before doing that work, I'd want to know whether the objective (an antiderivative with noncommuting variables) is well defined.
I was in the process of working through this tutorial: http://ahmedbesbes.com/how-to-score-08134-in-titanic-kaggle-challenge.html
And it went with no problems, until I got to the last section of the middle section:
As you can see, the features range in different intervals. Let's normalize all of them in the unit interval. All of them except the PassengerId that we'll need for the submission
In [48]:
def scale_all_features():
    global combined
    features = list(combined.columns)
    features.remove('PassengerId')
    combined[features] = combined[features].apply(lambda x: x/x.max(), axis=0)
    print 'Features scaled successfully !'

In [49]:
scale_all_features()
Features scaled successfully !
and despite typing it word for word in my python script:
#Cell 48
GreatDivide.split()
def scale_all_features():
    global combined
    features = list(combined.columns)
    features.remove('PassengerId')
    combined[features] = combined[features].apply(lambda x: x/x.max(), axis=0)
    print 'Features scaled successfully !'

#Cell 49
GreatDivide.split()
scale_all_features()
It keeps giving me an error:
--------------------------------------------------48--------------------------------------------------
--------------------------------------------------49--------------------------------------------------
Traceback (most recent call last):
File "KaggleTitanic[2-FE]--[01].py", line 350, in <module>
scale_all_features()
File "KaggleTitanic[2-FE]--[01].py", line 332, in scale_all_features
combined[features] = combined[features].apply(lambda x: x/x.max(), axis=0)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 4061, in apply
return self._apply_standard(f, axis, reduce=reduce)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 4157, in _apply_standard
results[i] = func(v)
File "KaggleTitanic[2-FE]--[01].py", line 332, in <lambda>
combined[features] = combined[features].apply(lambda x: x/x.max(), axis=0)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 651, in wrapper
return left._constructor(wrap_results(na_op(lvalues, rvalues)),
File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 592, in na_op
result[mask] = op(x[mask], y)
TypeError: ("unsupported operand type(s) for /: 'str' and 'str'", u'occurred at index Ticket')
What's the problem here? All of the previous 49 sections ran with no problem, so if I was getting an error it would have shown by now, right?
You can ensure that the math transformation only occurs on numeric columns with the following.
numeric_cols = combined.columns[combined.dtypes != 'object']
combined.loc[:, numeric_cols] = combined[numeric_cols] / combined[numeric_cols].max()
There is no need for that apply function.
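For illustration, a minimal self-contained sketch of that approach on a toy frame (the column names besides Ticket are made up here; the tutorial's combined frame is not reproduced):

import pandas as pd

combined = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282"],
})

# Only the numeric columns are divided by their maxima; the string
# 'Ticket' column is left untouched, so no str/str division occurs.
numeric_cols = combined.columns[combined.dtypes != 'object']
combined.loc[:, numeric_cols] = combined[numeric_cols] / combined[numeric_cols].max()
print(combined)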