Python Pandas: Passing arguments to a function in agg()

I am trying to reduce data in a pandas DataFrame by using different kinds of functions and argument values. However, I did not manage to change the default arguments of the aggregation functions. Here is an example:
>>> df = pd.DataFrame({'x': [1,np.nan,2,1],
... 'y': ['a','a','b','b']})
>>> df
x y
0 1.0 a
1 NaN a
2 2.0 b
3 1.0 b
Here is an aggregation function, for which I would like to test different values of b:
>>> def translate_mean(x, b=10):
... y = [elem + b for elem in x]
... return np.mean(y)
In the following code, I can use this function with the default b value, but I would like to pass other values:
>>> df.groupby('y').agg(translate_mean)
x
y
a NaN
b 11.5
Any ideas?

Just pass them as arguments to agg (this works with apply, too).
df.groupby('y').agg(translate_mean, b=4)
Out:
x
y
a NaN
b 5.5
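Since the answer mentions this works with apply too, here is the same call after selecting the column first (a small sketch using the question's translate_mean; SeriesGroupBy.apply forwards extra keyword arguments):
df.groupby('y')['x'].apply(translate_mean, b=4)   # a -> NaN, b -> 5.5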

Maybe you can try using apply in this case:
df.groupby('y').apply(lambda x: translate_mean(x['x'], 20))
Now the result is:
y
a NaN
b 21.5

In case you have multiple columns and you want to apply different functions and different parameters to each column, you can use lambda functions with agg.
For example:
>>> df = pd.DataFrame({'x': [1,np.nan,2,1],
...                    'y': ['a','a','b','b'],
...                    'z': [0.1,0.2,0.3,0.4]})
>>> df
     x  y    z
0  1.0  a  0.1
1  NaN  a  0.2
2  2.0  b  0.3
3  1.0  b  0.4
>>> def translate_mean(x, b=10):
... y = [elem + b for elem in x]
... return np.mean(y)
To group by column 'y' and apply the function translate_mean with b=10 for column 'x' and b=25 for column 'z', you can try this:
df_res = df.groupby(by='y').agg({
    'x': lambda x: translate_mean(x, 10),
    'z': lambda x: translate_mean(x, 25)})
Hopefully this helps.
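Alternatively, functools.partial can replace the lambdas and keeps the parameter values visible; a sketch under the same setup (partial is my substitution, not part of the original answer):
from functools import partial

df_res = df.groupby(by='y').agg({
    'x': partial(translate_mean, b=10),
    'z': partial(translate_mean, b=25)})
A partial of a module-level function also pickles, unlike a lambda, which can matter with parallel tooling.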

Related

Applying/Composing a function N times to a pandas column, N being different for each row

Suppose we have this simple pandas.DataFrame:
import pandas as pd
df = pd.DataFrame(
    columns=['quantity', 'value'],
    data=[[1, 12.5], [3, 18.0]]
)
>>> print(df)
quantity value
0 1 12.5
1 3 18.0
I would like to create a new column, say modified_value, that applies a function N times to the value column, N being the quantity column.
Suppose that function is new_value = round(value/2, 1), the expected result would be:
   quantity  value  modified_value
0         1   12.5             6.2  # applied 1 time
1         3   18.0             2.2  # applied 3 times: 18.0 -> 9.0 -> 4.5 -> 2.2
What would be an elegant/vectorized way to do so?
You can write a custom repeat function, then use apply:
def repeat(func, x, n):
    ret = x
    for i in range(int(n)):
        ret = func(ret)
    return ret

def my_func(val):
    return round(val / 2, 1)

df['new_col'] = df.apply(lambda x: repeat(my_func, x['value'], x['quantity']),
                         axis=1)
# or without apply:
# df['new_col'] = [repeat(my_func, v, n) for v, n in zip(df['value'], df['quantity'])]
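For reference, running this on the example frame should give (note that Python's round uses banker's rounding, so round(2.25, 1) == 2.2):
   quantity  value  new_col
0         1   12.5      6.2
1         3   18.0      2.2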
Use reduce:
from functools import reduce
def repeated(f, n):
    def rfun(p):
        return reduce(lambda x, _: f(x), range(n), p)
    return rfun

def myfunc(value):
    return round(value / 2, 1)

df['modified_value'] = df.apply(lambda x: repeated(myfunc,
                                                   int(x['quantity']))(x['value']),
                                axis=1)
We can also use a list comprehension instead of apply:
df['modified_value'] = [repeated(myfunc, int(quantity))(value)
                        for quantity, value in zip(df['quantity'], df['value'])]
Output
   quantity  value  modified_value
0         1   12.5             6.2
1         3   18.0             2.2
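For this particular halving function there is also a semi-vectorized route: transform the whole column max(quantity) times and mask the rows that are already finished. This is my own sketch, not from the answers above:
import numpy as np
import pandas as pd

df = pd.DataFrame({'quantity': [1, 3], 'value': [12.5, 18.0]})

out = df['value'].to_numpy(dtype=float).copy()
for i in range(int(df['quantity'].max())):
    mask = df['quantity'].to_numpy() > i    # rows that still need another application
    out[mask] = np.round(out[mask] / 2, 1)  # vectorized version of round(v/2, 1)
df['modified_value'] = out
This loops max(quantity) times instead of once per row, which helps when the quantities are small and the frame is large.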

Why does this pandas DataFrame assign raise a TypeError?

Environment:
Python 3.6.4
pandas 0.23.4
My code is below.
from math import sqrt
import pandas as pd
df = pd.DataFrame({'x':[1,2,3], 'y':[4,5,6]})
df = df.assign(d = lambda z: sqrt(z.x**2 + z.y**2))
The last line raises a TypeError like below.
...
TypeError: cannot convert the series to <class 'float'>
Without sqrt, it works.
df = df.assign(d2 = lambda z: z.x**2 + z.y**2)
df
Out[6]:
x y d2
0 1 4 17
1 2 5 29
2 3 6 45
And apply also works.
df['d3'] = df.apply(lambda z: sqrt(z.x**2 + z.y**2), axis=1)
df
Out[8]:
x y d2 d3
0 1 4 17 4.123106
1 2 5 29 5.385165
2 3 6 45 6.708204
What's the matter with the first?
Use numpy.sqrt - it also works with 1d arrays, while sqrt from math works only with scalars:
import numpy as np
df = df.assign(d = lambda z: np.sqrt(z.x**2 + z.y**2))
Another solution is to use **(1/2):
df = df.assign(d = lambda z: (z.x**2 + z.y**2)**(1/2))
print (df)
x y d
0 1 4 4.123106
1 2 5 5.385165
2 3 6 6.708204
Your solution works because apply with axis=1 operates on scalars row by row, but as @jpp mentioned, apply should be avoided as it involves a Python-level row-wise loop:
df.apply(lambda z: print(z.x), axis=1)
1
2
3
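As an aside, numpy also ships a dedicated vectorized function for exactly this computation; something like the following should work (my addition, not part of the original answer):
import numpy as np

df = df.assign(d = lambda z: np.hypot(z.x, z.y))  # sqrt(x**2 + y**2), element-wise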
A pandas Series object is like a numpy array: you cannot pass it to a math module function, which expects a single scalar and not a Series. The default arithmetic operations are valid, but not functions that do not work on arrays/Series. What you can do is:
df = df.assign(d = lambda z: (z.x**2 + z.y**2)**0.5)
or
df['d'] = (df.x**2 + df.y**2)**0.5
which uses only pandas' standard element-wise operations.

Pandas groupby mean() not ignoring NaNs

If I calculate the mean of a groupby object and one of the groups contains NaN(s), the NaNs are ignored. Even when applying np.mean it still returns just the mean of all valid numbers. I would expect NaN to be returned as soon as one NaN is within the group. Here is a simplified example of the behaviour:
import pandas as pd
import numpy as np
c = pd.DataFrame({'a':[1,np.nan,2,3],'b':[1,2,1,2]})
c.groupby('b').mean()
a
b
1 1.5
2 3.0
c.groupby('b').agg(np.mean)
a
b
1 1.5
2 3.0
I want to receive following result:
a
b
1 1.5
2 NaN
I am aware that I can replace NaNs beforehand and that I could probably write my own aggregation function to return NaN as soon as a NaN is within the group. This function wouldn't be optimized, though.
Do you know of an argument to achieve the desired behaviour with the optimized functions?
Btw, I think the desired behaviour was implemented in a previous version of pandas.
By default, pandas skips NaN values. You can make it include them by specifying skipna=False:
In [215]: c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
Out[215]:
a
b
1 1.5
2 NaN
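If there are several numeric columns, the same idea works without naming each one; a small variation of the answer above (my addition):
c.groupby('b').agg(lambda x: x.mean(skipna=False))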
Note: there is mean(skipna=False), but it does not work directly on the groupby object; see the next answer for details.
GroupBy aggregation methods (min, max, mean, median, etc.) have the skipna parameter, which is meant for this exact task, but it seems that currently (May 2020) there is a bug (issue opened in March 2020) which prevents it from working correctly.
Quick workaround
Complete working example based on these comments: @Serge Ballesta, @RoelAdriaans
>>> import pandas as pd
>>> import numpy as np
>>> c = pd.DataFrame({'a':[1,np.nan,2,3],'b':[1,2,1,2]})
>>> c.fillna(np.inf).groupby('b').mean().replace(np.inf, np.nan)
a
b
1 1.5
2 NaN
For additional information and updates follow the link above.
Use the skipna option -
c.groupby('b').apply(lambda g: g.mean(skipna=False))
Another approach would be to use a value that is not ignored by default, for example np.inf:
>>> c = pd.DataFrame({'a':[1,np.inf,2,3],'b':[1,2,1,2]})
>>> c.groupby('b').mean()
a
b
1 1.500000
2 inf
There are three different methods for it:
Slowest:
c.groupby('b').apply(lambda g: g.mean(skipna=False))
Faster than apply but slower than the default mean:
c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
Fastest but needs more code:
method3 = c.groupby('b').mean()
nan_groups = c.loc[c['a'].isna(), 'b'].tolist()
method3.loc[method3.index.isin(nan_groups)] = np.nan
I landed here in search of a fast (vectorized) way of doing this, but did not find it. Also, in the case of complex numbers, groupby behaves a bit strangely: it doesn't like mean(), and with sum() it will convert groups where all values are NaN into 0+0j.
So, here is what I came up with:
Setup:
df = pd.DataFrame({
    'a': [1, 2, 1, 2],
    'b': [1, np.nan, 2, 3],
    'c': [1, np.nan, 2, np.nan],
    'd': np.array([np.nan, np.nan, 2, np.nan]) * 1j,
})
gb = df.groupby('a')
Default behavior:
gb.sum()
Out[]:
b c d
a
1 3.0 3.0 0.000000+2.000000j
2 3.0 0.0 0.000000+0.000000j
A single NaN kills the group:
cnt = gb.count()
siz = gb.size()
mask = siz.values[:, None] == cnt.values
gb.sum().where(mask)
Out[]:
b c d
a
1 3.0 3.0 NaN
2 NaN NaN NaN
Only NaN if all values in group are NaN:
cnt = gb.count()
gb.sum() * (cnt / cnt)
Out[]:
b c d
a
1 3.0 3.0 0.000000+2.000000j
2 3.0 NaN NaN
Corollary: mean of complex:
cnt = gb.count()
gb.sum() / cnt
Out[]:
b c d
a
1 1.5 1.5 0.000000+2.000000j
2 3.0 NaN NaN
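The pieces above can be folded into a small reusable helper (my own sketch; the name nan_preserving_mean is made up):
def nan_preserving_mean(df, by):
    # Group means that become NaN as soon as the group contains any NaN.
    gb = df.groupby(by)
    cnt = gb.count()                                 # non-NaN counts per group/column
    mask = gb.size().values[:, None] == cnt.values   # True where the group has no NaN
    return (gb.sum() / cnt).where(mask)
On the frame c from the question, nan_preserving_mean(c, 'b') returns 1.5 for group 1 and NaN for group 2.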

Use .apply to recode nan rows into a different value

I am trying to create a new groupid based on the original groupid, which has the values 0 and 1. I used the following code, but it failed to recode the NaN rows into 2.
final['groupid2'] = final['groupid'].apply(lambda x: 2 if x == np.nan else x)
I also tried the following code, but it gave an AttributeError:
final['groupid2'] = final['groupid'].apply(lambda x: 2 if x.isnull() else x)
Could someone please explain why this is the case? Thanks
Use pd.isnull to check scalars if you need to use apply:
final = pd.DataFrame({'groupid': [1, 0, np.nan],
                      'B': [400, 500, 600]})
final['groupid2'] = final['groupid'].apply(lambda x: 2 if pd.isnull(x) else x)
print (final)
groupid B groupid2
0 1.0 400 1.0
1 0.0 500 0.0
2 NaN 600 2.0
Details:
The value x in the lambda function is a scalar, because Series.apply loops over each value of the column, so the Series method isnull() fails on it.
For better inspection it is possible to rewrite the lambda function as:
def f(x):
    print (x)
    print (pd.isnull(x))
    return 2 if pd.isnull(x) else x

final['groupid2'] = final['groupid'].apply(f)
1.0
False
0.0
False
nan
True
But better is Series.fillna:
final['groupid2'] = final['groupid'].fillna(2)
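An equivalent vectorized alternative, if you want to keep the recoding explicit (my addition, not from the answer):
import numpy as np

final['groupid2'] = np.where(final['groupid'].isna(), 2, final['groupid'])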

Nested dictionary of namedtuples to pandas dataframe

I have namedtuples defined as follows:
In[37]: from collections import namedtuple
Point = namedtuple('Point', 'x y')
The nested dictionary has the following format:
In[38]: d
Out[38]:
{1: {None: {1: Point(x=1.0, y=5.0), 2: Point(x=4.0, y=8.0)}},
2: {None: {1: Point(x=45324.0, y=24338.0), 2: Point(x=45.0, y=38.0)}}}
I am trying to create a pandas dataframe from the dictionary d without having to do for loops.
I have succeeded in creating the dataframe from a subset of the dictionary by doing this:
In[40]: df=pd.DataFrame(d[1][None].values())
In[41]: df
Out[41]:
x y
0 1 5
1 4 8
But I want to be able to create the dataframe from the entire dictionary.
I want the dataframe to output the following (I am using MultiIndex notation):
In[42]: df
Out[42]:
Subcase Step ID x y
1 None 1 1.0 5.0
2 4.0 8.0
2 None 1 45324.0 24338.0
2 45.0 38.0
The from_dict method of DataFrame only supports up to two levels of nesting, so I was not able to use it. I am also considering modifying the structure of the d dictionary to achieve my goal. Furthermore, maybe it does not have to be a dictionary.
Thank you.
There are already several answers to similar questions on SO (here, here, or here). These solutions can be adapted to this problem as well. However, none of them is general enough to run on an arbitrary dict, so I decided to write something more universal.
This is a function that can be run on any dict. The dict has to have the same number of levels (depth) in all of its branches, otherwise it will most probably raise an error.
def frame_from_dict(dic, depth=None, **kwargs):
    def get_dict_depth(dic):
        if not isinstance(dic, dict):
            return 0
        for v in dic.values():
            return get_dict_depth(v) + 1

    if depth is None:
        depth = get_dict_depth(dic)
    if depth == 0:
        return pd.Series(dic)
    elif depth > 0:
        keys = []
        vals = []
        for k, v in dic.items():
            keys.append(k)
            vals.append(frame_from_dict(v, depth - 1))
        try:
            # sort keys and values together so they stay aligned
            keys, vals = map(list, zip(*sorted(zip(keys, vals),
                                               key=lambda kv: kv[0])))
        except TypeError:
            # unorderable types
            pass
        return pd.concat(vals, axis=1, keys=keys, **kwargs)
    raise ValueError("depth should be a nonnegative integer or None")
I sacrificed the namedtuple case from this question for generality, but it can be tweaked if needed.
In this particular case, it can be applied as follows:
df = frame_from_dict(d, names=['Subcase', 'Step', 'ID']).T
df.columns = ['x', 'y']
df
Out[115]:
x y
Subcase Step ID
1 NaN 1 1.0 5.0
2 4.0 8.0
2 NaN 1 45324.0 24338.0
2 45.0 38.0
I decided to flatten the keys into a tuple (tested using pandas 0.18.1):
In [5]: from collections import namedtuple
In [6]: Point = namedtuple('Point', 'x y')
In [11]: from collections import OrderedDict
In [14]: d=OrderedDict()
In [15]: d[(1,None,1)]=Point(x=1.0, y=5.0)
In [16]: d[(1,None,2)]=Point(x=4.0, y=8.0)
In [17]: d[(2,None,1)]=Point(x=45324.0, y=24338.0)
In [18]: d[(2,None,2)]=Point(x=45.0, y=38.0)
Finally,
In [7]: import pandas as pd
In [8]: df=pd.DataFrame(d.values(), index=pd.MultiIndex.from_tuples(d.keys(), names=['Subcase','Step','ID']))
In [9]: df
Out[9]:
x y
Subcase Step ID
1 NaN 1 1.0 5.0
2 4.0 8.0
2 NaN 1 45324.0 24338.0
2 45.0 38.0
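The manually built OrderedDict above can also be produced straight from the original nested dict d with a comprehension; a sketch using the same column names:
flat = {(k1, k2, k3): point
        for k1, lvl2 in d.items()
        for k2, lvl3 in lvl2.items()
        for k3, point in lvl3.items()}

df = pd.DataFrame(list(flat.values()),
                  index=pd.MultiIndex.from_tuples(list(flat.keys()),
                                                  names=['Subcase', 'Step', 'ID']))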
