Python pandas custom agg function

Dataframe:
one two
a 1 x
b 1 y
c 2 y
d 2 z
e 3 z
grp = DataFrame.groupby('one')
grp.agg(lambda x: ???) #or equivalent function
Desired output from grp.agg:
one two
1 x|y
2 y|z
3 z
My agg function before integrating DataFrames was "|".join(sorted(set(x))). Ideally I want to have any number of columns in the group, with agg returning "|".join(sorted(set(x))) for each column, as for column two above. I also tried np.char.join().
I love pandas; it has taken me from an 800-line complicated program to a 400-line walk in the park that zooms. Thank you :)

You were so close:
In [1]: df.groupby('one').agg(lambda x: "|".join(x.tolist()))
Out[1]:
two
one
1 x|y
2 y|z
3 z
Expanded answer to handle sorting and take only the set:
In [1]: df = DataFrame({'one':[1,1,2,2,3], 'two':list('xyyzz'), 'three':list('eecba')}, index=list('abcde'), columns=['one','two','three'])
In [2]: df
Out[2]:
one two three
a 1 x e
b 1 y e
c 2 y c
d 2 z b
e 3 z a
In [3]: df.groupby('one').agg(lambda x: "|".join(x.sort_values().unique().tolist()))
Out[3]:
two three
one
1 x|y e
2 y|z b|c
3 z a
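For reference, the aggregation the asker started with, "|".join(sorted(set(x))), can also be passed to agg directly, since each column of each group arrives as a Series. A minimal sketch, assuming a recent pandas and reusing the sample frame from this answer (the name join_sorted_unique is only illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1, 1, 2, 2, 3],
     "two": list("xyyzz"),
     "three": list("eecba")},
    index=list("abcde"),
)

def join_sorted_unique(x):
    # x is one column of one group (a Series); set() drops duplicates
    # and sorted() orders the remaining values before joining.
    return "|".join(sorted(set(x)))

result = df.groupby("one").agg(join_sorted_unique)
print(result)
```

This handles any number of non-grouping columns, because agg applies the function to each of them in turn.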

Just an elaboration on the accepted answer:
df.groupby('one').agg(lambda x: "|".join(x.tolist()))
Note that df.groupby('one') is a DataFrameGroupBy object, and agg is defined on that type. According to its documentation, agg accepts a function that works on a Series: the function is called once per column for each group, so x in the lambda above is a Series.
Another note: the aggregation function does not have to be a lambda. If it is complex, it can be defined separately as a regular function, as below. The only constraint is that x must be a Series (or compatible with one):
def myfun1(x):
    return "|".join(x.tolist())
and then:
df.groupby('one').agg(myfun1)
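The types involved can be inspected directly. The following is a small sketch, assuming a recent pandas and a two-column version of the question's frame; join_values is an illustrative name, not part of pandas:

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 1, 2, 2, 3], "two": list("xyyzz")})

grouped = df.groupby("one")
frame_groupby_type = type(grouped).__name__          # grouping a whole DataFrame
series_groupby_type = type(grouped["two"]).__name__  # after selecting one column

def join_values(x):
    # Called once per column per group; x holds that group's values.
    return "|".join(x.tolist())

frame_result = grouped.agg(join_values)          # DataFrame, one column per input column
series_result = grouped["two"].agg(join_values)  # Series for the selected column

print(frame_groupby_type, series_groupby_type)
print(frame_result)
```

Selecting a single column before calling agg returns a Series instead of a one-column DataFrame, which can be more convenient downstream.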

There is a better way to concatenate strings, described in the pandas documentation, so I prefer this way:
In [1]: df.groupby('one').agg(lambda x: x.str.cat(sep='|'))
Out[1]:
two
one
1 x|y
2 y|z
3 z
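One practical difference between the two approaches shows up with missing values. The sketch below assumes a frame with a missing entry in column two (made-up data, not from the question): str.cat called with only sep omits missing values, while a plain "|".join raises a TypeError on them.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 1, 1, 2], "two": ["x", None, "y", "z"]})

# str.cat with only `sep` skips the missing value in group 1.
result = df.groupby("one").agg(lambda x: x.str.cat(sep="|"))

# The same values joined with str.join fail on the missing entry.
values = df.loc[df["one"] == 1, "two"].tolist()
try:
    "|".join(values)
    join_failed = False
except TypeError:
    join_failed = True

print(result)
print(join_failed)
```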

Related

Pandas dataframe: convert a numeric value to 2 to the power of that value

How do I get this 2^ value in another column of a df?
I need to calculate 2^Value.
Is there an easy way to do this?
Value  2^Value
0      1
1      2
You can use numpy.power:
import numpy as np
df["2^Value"] = np.power(2, df["Value"])
Or simply 2 ** df["Value"], as suggested by @B Remmelzwaal.
Output:
print(df)
Value 2^Value
0 0 1
1 1 2
2 3 8
3 4 16
Using rpow:
df['2^Value'] = df['Value'].rpow(2)
Output:
Value 2^Value
0 0 1
1 1 2
2 2 4
3 3 8
4 4 16
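The vectorized options mentioned so far can be checked against each other. A minimal sketch, assuming non-negative integer values in a column named Value as in the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Value": [0, 1, 2, 3, 4]})

via_numpy = np.power(2, df["Value"])   # NumPy ufunc applied to the column
via_operator = 2 ** df["Value"]        # Python power operator on a Series
via_rpow = df["Value"].rpow(2)         # "reverse power": 2 ** Value

df["2^Value"] = via_operator
print(df)
```

All three operate on the whole column at once, so none of them needs an explicit Python-level loop.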
You can use .apply with a lambda function:
df["new_column"] = df["Value"].apply(lambda x: 2**x)
In Python, the power operator is **.
You can apply a function to each row in a DataFrame by using the df.apply method. See the pandas documentation for DataFrame.apply to learn how the method is used. Here is some untested code to get you started.
# a simple function that takes a number and returns
# 2^n of that number
def calculate_2_n(n):
    return 2**n
# use the df.apply method to apply that function to each of the
# cells in the 'Value' column of the DataFrame
df['2_n_value'] = df.apply(lambda row : calculate_2_n(row['Value']), axis = 1)
This code is a modified version of the code from this G4G example

KeyError: in Pandas

I am trying to run a groupby with multiple columns, an aggregate column, and an aggregate operator.
I receive all of these as parameters to a method. I have to do the groupby:
result = df.groupby([groupByColumns])[aggColumn].agg(aggOperation)
Here
groupByColumns: clientId,state,branchId
aggColumn: amount
aggOperation: sum
But I am getting this error:
KeyError: ''
I am not experienced with pandas. How can I correct the statement above?
If groupByColumns is already a list, remove [] in groupby:
groupByColumns = ['clientId', 'state', 'branchId']
aggColumn = 'amount'
aggOperation = 'sum'  # the string name selects pandas' own groupby sum
out = df.groupby(groupByColumns)[aggColumn].agg(aggOperation)
# OR
out = df.groupby(['clientId', 'state', 'branchId'])['amount'].sum()
print(out)
# Output
clientId state branchId
A M X 3
N Y 3
B M X 9
N Y 6
Name: amount, dtype: int64
Setup:
df = pd.DataFrame({'clientId': list('AAABBB'),
                   'state': list('MMNMMN'),
                   'branchId': list('XXYXXY'),
                   'amount': range(1, 7)})
print(df)
# Output
clientId state branchId amount
0 A M X 1
1 A M X 2
2 A N Y 3
3 B M X 4
4 B M X 5
5 B N Y 6
groupby expects a flat (1D) list of column names. In your case, groupByColumns is already ['clientId', 'state', 'branchId'], and wrapping it in another pair of brackets in the groupby call turns it into a nested list of length 1. This is what effectively runs in your case:
df.groupby([['clientId', 'state', 'branchId']])['amount'].sum()
Solution
As answered by @Corralien, use the same command without the extra brackets, so that groupby receives a flat list, and it should work.
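Since the question says these values arrive as method parameters, one way to avoid the nesting problem is to normalize the grouping columns before calling groupby. The helper below is a hypothetical sketch: the name aggregate_by and the assumption that the columns may arrive either as a list or as a comma-separated string are mine, not from the question.

```python
import pandas as pd

def aggregate_by(df, group_by_columns, agg_column, agg_operation):
    # Hypothetical helper: accept either "a,b,c" or ["a", "b", "c"].
    if isinstance(group_by_columns, str):
        group_by_columns = [c.strip() for c in group_by_columns.split(",")]
    # Always pass a flat list of column names to groupby.
    return df.groupby(list(group_by_columns))[agg_column].agg(agg_operation)

df = pd.DataFrame({"clientId": list("AAABBB"),
                   "state": list("MMNMMN"),
                   "branchId": list("XXYXXY"),
                   "amount": range(1, 7)})

from_string = aggregate_by(df, "clientId,state,branchId", "amount", "sum")
from_list = aggregate_by(df, ["clientId", "state", "branchId"], "amount", "sum")
print(from_string)
```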

Stripping string values at different positions

Suppose I have the following dataframe:
df = pd.DataFrame({'X':['AB_123_CD','EF_123CD','XY_Z'],'Y':[1,2,3]})
X Y
0 AB_123_CD 1
1 EF_123CD 2
2 XY_Z 3
I want to use strip method to get rid of the first prefix such that I get
X Y
0 123_CD 1
1 123CD 2
2 Z 3
I tried df.X.str.split('_').str[-1].str.strip(), but since the positions of the underscores differ, it returns a result different from the one desired above. How can I address this issue?
You're close, you can split once (n=1) from the left and keep the second one (str[1]):
df.X = df.X.str.split("_", n=1).str[1]
to get
>>> df
X Y
0 123_CD 1
1 123CD 2
2 Z 3
Try this instead:
df["X"] = df["X"].apply(lambda x: x[x.find("_")+1:])
>>> df
X Y
0 123_CD 1
1 123CD 2
2 Z 3
This keeps the entire string after the first occurrence of _.
The following code could do the job:
df['X'] = df.X.apply(lambda x: '_'.join(x.split('_')[1:]))
Your solution is very close. With some minor changes, it should work:
df.X.str.split('_').str[1:].str.join('_')
0 123_CD
1 123CD
2 Z
Name: X, dtype: object
You can define maxsplit in the str.split() function. It sounds like you just want to split with maxsplit 1 and take the last element:
df['X'] = df['X'].apply(lambda x: x.split('_',1)[-1])
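The answers above agree on the question's data but differ when a value contains no underscore at all. A small sketch, assuming an extra made-up value "PLAIN" to show the difference between taking element 1 and element -1 after a single split:

```python
import pandas as pd

s = pd.Series(["AB_123_CD", "EF_123CD", "XY_Z", "PLAIN"])

parts = s.str.split("_", n=1)  # split once, from the left

second = parts.str[1]   # missing (NaN) where there was no underscore
last = parts.str[-1]    # falls back to the whole string in that case

print(second)
print(last)
```

Which behavior is preferable depends on whether a value without a prefix should be kept as-is or flagged as missing.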

Pandas dataframe: creating a new column that is a custom function using 2 other columns

Consider the following data set stored in a pandas DataFrame dfX:
A B
1 2
4 6
7 9
I have a function that is:
def someThingSpecial(x, y):
    # z = do something special with x, y
    return z
I now want to create a new column in dfX that holds the computed z value.
Looking at other SO examples, I've tried several variants including:
dfX['C'] = dfX.apply(lambda x: someThingSpecial(x=x['A'], y=x['B']), axis=1)
Which returns errors. What is the right way to do this?
This seems to work for me on v0.21. Take a look -
df
A B
0 1 2
1 4 6
2 7 9
def someThingSpecial(x,y):
    return x + y
df.apply(lambda x: someThingSpecial(x.A, x.B), 1)
0 3
1 10
2 16
dtype: int64
You might want to try upgrading your pandas version to the latest stable release (0.21 as of now).
Here's another option. You can vectorise your function.
v = np.vectorize(someThingSpecial)
v now accepts arrays, but operates on each pair of elements individually. Note that this just hides the loop, as apply does, but is much cleaner. Now, you can compute C as so -
df['C'] = v(df.A, df.B)
If your function only needs one column's value, then do this instead of coldspeed's answer:
dfX['A'].apply(your_func)
to store it:
dfX['C'] = dfX['A'].apply(your_func)
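Putting the approaches from this thread side by side, here is a self-contained sketch. It assumes the placeholder body x + y used in the first answer; a real someThingSpecial may not support whole columns, in which case only the first two variants apply.

```python
import numpy as np
import pandas as pd

dfX = pd.DataFrame({"A": [1, 4, 7], "B": [2, 6, 9]})

def someThingSpecial(x, y):
    # Placeholder body, as in the answer above.
    return x + y

# Row-wise: the lambda receives one row at a time.
row_wise = dfX.apply(lambda row: someThingSpecial(row["A"], row["B"]), axis=1)

# np.vectorize: still loops in Python, but pairs the elements for you.
vectorized = np.vectorize(someThingSpecial)(dfX["A"], dfX["B"])

# Direct call on whole columns: only works if the function supports Series.
direct = someThingSpecial(dfX["A"], dfX["B"])

dfX["C"] = direct
print(dfX)
```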

Better way to apply function to every combination of two columns in Pandas.DataFrame

I want to implement something like DataFrame.corr() that applies a function to pairs of columns.
E.g., I have a function:
def func(x, y):
    pass
I want to apply func to every combination of two columns in a_pd (a pandas.DataFrame). I have figured out a way by creating a new function wap_func to wrap func:
def wap_func(x):
    for i in range(len(x)):
        for j in range(i+1, len(x)):
            func(x.iloc[i], x.iloc[j])
res = a_pd.apply(wap_func, axis=1)
Although this seems to solve the problem, it isn't convenient. It would be better if it could be done like a_pd.corr().
Have you considered using itertools.combinations?
import pandas as pd
from itertools import combinations
df = pd.DataFrame([[1,2,3], [2,3,4], [3,5,7]], columns = ['A', 'B', 'C'])
print(df)
A B C
0 1 2 3
1 2 3 4
2 3 5 7
Define your function slightly differently so that you can use apply more seamlessly:
def func(xy):
    x, y = xy
    return x+y
Use itertools.combinations to get all the column combinations you want, go through each combination in turn, and apply the function defined earlier:
for combi in combinations(df.columns, 2):
    df['_'.join(combi)] = df[list(combi)].apply(func, axis=1)
print(df)
A B C A_B A_C B_C
0 1 2 3 3 4 5
1 2 3 4 5 6 7
2 3 5 7 8 10 12
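If the goal is a square result like the one a_pd.corr() returns, with one scalar per column pair, the same combinations idea can fill a matrix instead of adding columns. The sketch below is an assumption-laden illustration: pairwise_apply and dot are hypothetical names, and the function is assumed to be symmetric and to return a single number. For comparison it also calls DataFrame.corr with a callable method, which receives two 1D arrays and always puts 1 on the diagonal.

```python
from itertools import combinations

import numpy as np
import pandas as pd

def pairwise_apply(frame, func):
    # Hypothetical helper: square matrix of func(col_a, col_b) values.
    cols = frame.columns
    out = pd.DataFrame(np.nan, index=cols, columns=cols)
    for a, b in combinations(cols, 2):
        value = func(frame[a], frame[b])
        out.loc[a, b] = value
        out.loc[b, a] = value  # assumes func(x, y) == func(y, x)
    return out

def dot(x, y):
    # Example pairwise function: sum of element-wise products.
    return float((x * y).sum())

df = pd.DataFrame([[1, 2, 3], [2, 3, 4], [3, 5, 7]], columns=["A", "B", "C"])

matrix = pairwise_apply(df, dot)   # diagonal left as NaN
via_corr = df.corr(method=dot)     # diagonal forced to 1.0 by pandas
print(matrix)
print(via_corr)
```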
