How to do aggregation in Pandas with different aggregation functions within groups? - python

I want to do group_by into aggregation, but for each group I want to use a function based on values from a special column which stores which function needed to be used. Easier to show on example:
id
group
val
func
0
0
0
"avg"
1
0
2
"avg"
2
0
2
"avg"
3
1
0
"med"
4
1
2
"med"
So in that example expected behaviour would be "avg" aggregation for group 0 and "median" for group 1. How can I make agg to choose function based on "func" column values? I know that I can calculate each agg function for each group and then use func as mask for choosing right values, but that isn't that great since I'd do a lot of not needed calculations, there should be a better approach...
P.S. It's guaranteed that func is the same within each group so I don't have to worry about that.
I've written my own solution for my specific case and I'll add that in question, but answer below is fine too.
So, my approach was:
Use dict to transform from table-provided format to proper pandas names as suggested in answer:
func_dict = {"avg": "mean", "med": "median", "min": "min","max": "max", "rnk": "first"}
I wrote a custom function to pass to apply later:
def pick_price(subframe: pd.DataFrame) -> float:
func_name = subframe["agg"].iloc[0]
func_name = func_dict[func_name]
# this picks from first line in subframe a name and get real name from dict
# and next "if" block applies them among subframe
if func_name != "first":
ans = subframe["comp_price"].agg(func_name)
return 1.0 * ans
else:
idx = subframe["rank"].idxmin()
return 1.0 * subframe["comp_price"].loc[idx]
That function takes subframe with group with one same function to apply, and well, apply it.
3. Finally, use that function. First, group by groups where we need to apply different functions, and just apply with apply() method:
grouped = X.groupby("sku")
grouped.apply(pick_price)

I would use a dictionary of group: function:
f = {0: 'mean', 1: 'median'}
df['out'] = df.groupby('group')['val'].transform(lambda s: s.agg(f.get(s.name)))
Output:
id group val out
0 0 0 0 1.333333
1 1 0 2 1.333333
2 2 0 2 1.333333
3 3 1 0 1.000000
4 4 1 2 1.000000
variant using a column as source
NB. it's a bit hacky, I prefer the dictionary. It extract the function name from the first rows of the group. The names must be valid, like mean/meadian, not avg/med.
df['out'] = (df.groupby('group')['val']
.transform(lambda s: s.agg(df.loc[s.index[0], 'func']))
)
Output:
id group val func out
0 0 0 0 mean 1.333333
1 1 0 2 mean 1.333333
2 2 0 2 mean 1.333333
3 3 1 0 median 1.000000
4 4 1 2 median 1.000000

Related

Column are missing in the result of groupby

I have dataframe like this:
source_ip dest_ip dest_ip_usage dest_ip_count
0 4:107:27:41 1:23:54:114 2028544 2
1 4:107:27:41 2:112:41:134 3145639 1
2 4:107:27:41 2:112:41:178 4145639 1
3 1:192:221:145 32:107:27:134 6358000 1
4 1:192:344:161 3:243:82:204 6341359 1
I am using syntax: df1 = df.groupby(['source_ip','dest_ip'])['dest_ip_usage'].nlargest(2)
But I am not getting indexes and getting result:
0 2028544
1 3145639
2 4145639
3 6358000
4 6341359
This is not possible when using groupby on multiple columns.
If you want to find nlargest with groupby on multiple columns you must use apply method on that specific column on which you are trying to find nlargest.
df.groupby(['source_ip'])['dest_ip','dest_ip_usage'].apply(lambda x: x.nlargest(2, columns=['dest_ip_usage']))

pandas groupby where and else

I have a dataframe like this:
col1 col2
0 a 100
1 a 200
2 a 150
3 b 1000
4 c 400
5 c 200
what I want to do is group by col1 and count the number of occurrences and if count is equal or greater than 2, then calculate mean of col2 for those rows and if not apply another function. The output should be:
col1 mean
0 a 150
1 b whatever aggregator function returns
2 c 300
I followed #ansev solution in here pandas groupby count and then conditional mean however I don't want to replace them with NaN and actually want to replace it a value that returns from another function like this:
def aggregator(col1, col2):
return col1+col2
Please keep in mind that the actual aggregator function is more complicated and dependent to other tables and this is for just simplifying the question.
I'm not sure this is what you need, but you can resolve to apply:
def aggregator(x):
if len(x)==1:
return pd.Series( (x['col1'] + x['col2'].astype(str)).values)
else: return pd.Series(x['col2'].mean())
df.groupby('col1').apply(aggregator)
Output:
0
col1
a 150
b b1000
c 300

python: use agg with more than one customized function

I have a data frame like this.
mydf = pd.DataFrame({'a':[1,1,3,3],'b':[np.nan,2,3,6],'c':[1,3,3,9]})
a b c
0 1 NaN 1
1 1 2.0 3
2 3 3.0 3
3 3 6.0 9
I would like to have a resulting dataframe like this.
myResults = pd.concat([mydf.groupby('a').apply(lambda x: (x.b/x.c).max()), mydf.groupby('a').apply(lambda x: (x.b/x.c).min())], axis =1)
myResults.columns = ['max','min']
max min
a
1 0.666667 0.666667
3 1.000000 0.666667
Basically i would like to have max and min of ratio of column b and column c for each group (grouped by column a)
If it possible to achieve this by agg?
I tried mydf.groupby('a').agg([lambda x: (x.b/x.c).max(), lambda x: (x.b/x.c).min()]). It will not work, and seems column name b and c will not be recognized.
Another way i can think of is to add the ratio column first to mydf. i.e. mydf['ratio'] = mydf.b/mydf.c, and then use agg on the updated mydf like mydf.groupby('a')['ratio'],agg[max,min].
Is there a better way to achieve this through agg or other function? In summary, I would like to apply customized function to grouped DataFrame, and the customized function needs to read multiple columns from original DataFrame.
You can use a customized function to acheive this.
You can create any number of new columns using any input columns using the below function.
def f(x):
t = {}
t['max'] = (x['b']/x['c']).max()
t['min'] = (x['b']/x['c']).min()
return pd.Series(t)
mydf.groupby('a').apply(f)
Output:
max min
a
1 0.666667 0.666667
3 1.000000 0.666667

Apply different functions to different columns with a singe pandas groupby command

My data is stored in df. I have multiple users per group. I want to group df by group and apply different functions to different columns. The twist is that I would like to assign custom names to the new columns during this process.
np.random.seed(123)
df = pd.DataFrame({"user":range(4),"group":[1,1,2,2],"crop":["2018-01-01","2018-01-01","2018-03-01","2018-03-01"],
"score":np.random.randint(400,1000,4)})
df["crop"] = pd.to_datetime(df["crop"])
print(df)
user group crop score
0 0 1 2018-01-01 910
1 1 1 2018-01-01 765
2 2 2 2018-03-01 782
3 3 2 2018-03-01 722
I want to get the mean of score, and the min and max values of crop grouped by group and assign custom names to each new column. The desired output should look like this:
group mean_score min_crop max_crop
0 1 837.5 2018-01-01 2018-01-01
1 2 752.0 2018-03-01 2018-03-01
I don't know how to do this in a one-liner in Python. In R, I would use data.table and get the following:
df[, list(mean_score = mean(score),
max_crop = max(crop),
min_crop = min(crop)), by = group]
I know I could group the data and use .agg combined with a dictionary. Is there an alternative way where I can custom-name each column in this process?
Try creating a function with the required operations using groupby().apply():
def f(x):
d = {}
d['mean_score'] = x['score'].mean()
d['min_crop'] = x['crop'].min()
d['max_crop'] = x['crop'].max()
return pd.Series(d, index=['mean_score', 'min_crop', 'max_crop'])
data = df.groupby('group').apply(f)

How to keep only the top n% rows of each group of a pandas dataframe?

I have seen a variant of this question asked that keeps the top n rows of each group in a pandas dataframe and the solutions use n as an absolute number rather than a percentage here Pandas get topmost n records within each group. However, in my dataframe, each group has different numbers of rows in it and I want to keep the top n% rows of each group. How would I approach this problem?
You can construct a Boolean series of flags and filter before you groupby. First let's create an example dataframe and look at the number of row for each unique value in the first series:
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (10, 3)))
print(df[0].value_counts())
0 6
1 4
Name: 0, dtype: int64
Then define a fraction, e.g. 50% below, and construct a Boolean series for filtering:
n = 0.5
g = df.groupby(0)
flags = (g.cumcount() + 1) <= g[1].transform('size') * n
Then apply the condition, set the index as the first series and (if required) sort the index:
df = df.loc[flags].set_index(0).sort_index()
print(df)
1 2
0
0 1 1
0 1 1
0 1 0
1 1 1
1 1 0
As you can see, the resultant dataframe only has 3 0 indices and 2 1 indices, in each case half the number in the original dataframe.
Here is another option which builds on some of the answers in the post you mentioned
First of all here is a quick function to either round up or round down. If we want the top 30% of rows of a dataframe 8 rows long then we would try to take 2.4 rows. So we will need to either round up or down.
My preferred option is to round up. This is because, for eaxample, if we were to take 50% of the rows, but had one group which only had one row, we would still keep that one row. I kept this separate so that you can change the rounding as you wish
def round_func(x, up=True):
'''Function to round up or round down a float'''
if up:
return int(x+1)
else:
return int(x)
Next I make a dataframe to work with and set a parameter p to be the fraction of the rows from each group that we should keep. Everything follows and I have commented it so that hopefully you can follow.
import pandas as pd
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})
p = 0.30 # top fraction to keep. Currently set to 80%
df_top = df.groupby('id').apply( # group by the ids
lambda x: x.reset_index()['value'].nlargest( # in each group take the top rows by column 'value'
round_func(x.count().max()*p))) # calculate how many to keep from each group
df_top = df_top.reset_index().drop('level_1', axis=1) # make the dataframe nice again
df looked like this
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
df_top looks like this
id value
0 1 3
1 2 4
2 2 3
3 3 1
4 4 1

Categories

Resources