Pandas: Group by and aggregation with function - python

Assuming that I have a dataframe with the following values:
name start end description
0 ag 20 30 None
1 bgb 21 111 'a'
2 cdd 31 101 None
3 bgb 17 19 'Bla'
4 ag 20 22 None
I want to group by name and then get the average of the (end - start) values.
I can use mean (df.groupby(['name'], as_index=False).mean()),
but how can I feed the mean function the subtraction of two columns (end - start)?
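For anyone reproducing the examples below, the sample dataframe can be built like this (a minimal sketch based on the table above):
import pandas as pd

df = pd.DataFrame({'name': ['ag', 'bgb', 'cdd', 'bgb', 'ag'],
                   'start': [20, 21, 31, 17, 20],
                   'end': [30, 111, 101, 19, 22],
                   'description': [None, 'a', None, 'Bla', None]})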

You can subtract the columns and then group by df['name']:
df1 = df['end'].sub(df['start']).groupby(df['name']).mean().reset_index(name='diff')
print (df1)
name diff
0 ag 6
1 bgb 46
2 cdd 70
Another idea is to create a new diff column with assign:
df1 = (df.assign(diff = df['end'].sub(df['start']))
         .groupby('name', as_index=False)['diff']
         .mean())
print (df1)
name diff
0 ag 6
1 bgb 46
2 cdd 70
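An equivalent, if slower, per-group version (a sketch using groupby.apply rather than the vectorized approaches above):
df1 = (df.groupby('name')
         .apply(lambda g: (g['end'] - g['start']).mean())
         .reset_index(name='diff'))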

Related

Pandas: finding mean in the dataframe based on condition included in another

I have two dataframes, just like below.
Dataframe1:
country  type  start_week  end_week
      1     a          12        13
      2     b          13        14
Dataframe2:
country  type  week  value
      1     a    12   1000
      1     a    13    900
      1     a    14    800
      2     b    12   1000
      2     b    13    900
      2     b    14    800
I want to add a column to the first dataframe containing the mean value from the second dataframe, matched on the key (country + type) and restricted to weeks between start_week and end_week.
I want the desired output to look like this:
country  type  start_week  end_week  avg
      1     a          12        13  950
      2     b          13        14  850
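For reproducibility, the two dataframes can be constructed like this (a sketch matching the tables above):
import pandas as pd

df1 = pd.DataFrame({'country': [1, 2],
                    'type': ['a', 'b'],
                    'start_week': [12, 13],
                    'end_week': [13, 14]})
df2 = pd.DataFrame({'country': [1, 1, 1, 2, 2, 2],
                    'type': ['a', 'a', 'a', 'b', 'b', 'b'],
                    'week': [12, 13, 14, 12, 13, 14],
                    'value': [1000, 900, 800, 1000, 900, 800]})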
Here is one way:
combined = df1.merge(df2, on=['country', 'type'])
combined = combined.loc[(combined.start_week <= combined.week) & (combined.week <= combined.end_week)]
output = combined.groupby(['country', 'type', 'start_week', 'end_week'])['value'].mean().reset_index()
Output:
country type start_week end_week value
0 1 a 12 13 950.0
1 2 b 13 14 850.0
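A small variant of the same merge-and-filter approach (a sketch) uses Series.between for the week filter and renames the result to match the desired avg column:
combined = df1.merge(df2, on=['country', 'type'])
mask = combined['week'].between(combined['start_week'], combined['end_week'])
output = (combined[mask]
          .groupby(['country', 'type', 'start_week', 'end_week'], as_index=False)['value']
          .mean()
          .rename(columns={'value': 'avg'}))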
You can use pd.melt and comparison of numpy arrays.
# melt df1
melted_df1 = df1.melt(id_vars=['country', 'type'], value_name='week')[['country', 'type', 'week']]

# for loop to compare the two dataframes' arrays
result = []
for i in df2.values:
    for j in melted_df1.values:
        if (j == i[:3]).all():
            result.append(i)
            break

# Computing mean of the result dataframe
result_df = pd.DataFrame(result, columns=df2.columns).groupby('type').mean().reset_index()['value']

# Assigning result_df to df1
df1['avg'] = result_df
country type start_week end_week avg
0 1 a 12 13 950.0
1 2 b 13 14 850.0

sum duplicate row with condition using pandas

I have a dataframe that looks like this:
Name rent sale
0 A 180 2
1 B 1 4
2 M 12 1
3 O 10 1
4 A 180 5
5 M 2 19
I want to apply the following condition: if a name is duplicated and its value in a column is also duplicated, keep only one of those values (without summing). For example:
the duplicated name A has the duplicated value 180 in the rent column, so 180 is kept as is.
Otherwise, sum the values. For example, the duplicated name A has the different values 2 and 5 in the sale column, and the duplicated name M has different values in both the rent and sale columns, so those get summed.
Expected output:
Name rent sale
0 A 180 7
1 B 1 4
2 M 14 20
3 O 10 1
I tried this code, but it's not working as I want:
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'M', 'O', 'A', 'M'],
                   'rent': [180, 1, 12, 10, 180, 2],
                   'sale': [2, 4, 1, 1, 5, 19]})
df2 = df.drop_duplicates().groupby('Name', sort=False, as_index=False).agg(Name=('Name', 'first'),
                                                                           rent=('rent', 'sum'),
                                                                           sale=('sale', 'sum'))
print(df2)
I got this output
Name rent sale
0 A 360 7
1 B 1 4
2 M 14 20
3 O 10 1
You can try summing only the unique values per group:
def sum_unique(s):
    return s.unique().sum()

df2 = df.groupby('Name', sort=False, as_index=False).agg(
    Name=('Name', 'first'),
    rent=('rent', sum_unique),
    sale=('sale', sum_unique)
)
df2:
Name rent sale
0 A 180 7
1 B 1 4
2 M 14 20
3 O 10 1
You can first group by Name and rent, and then just by Name:
df2 = df.groupby(['Name', 'rent'], as_index=False).sum().groupby('Name', as_index=False).sum()
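Run against the sample dataframe above, this also yields the expected result (a quick check):
df2 = df.groupby(['Name', 'rent'], as_index=False).sum().groupby('Name', as_index=False).sum()
print(df2)
#   Name  rent  sale
# 0    A   180     7
# 1    B     1     4
# 2    M    14    20
# 3    O    10     1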

dynamic concatenation of columns for finding max

Here's my data -
ID,Pay1,Pay2,Pay3,Low,High,expected_output
1,12,21,23,1,2,21
2,21,34,54,1,3,54
3,74,56,76,1,1,74
The goal is to calculate the max Pay of each row as per the Pay column index specified in Low and High columns.
For example, for row 1, calculate the max of Pay1 and Pay2 columns as Low and High are 1 and 2.
I have tried building a dynamic string and then using the eval function, which is not performing well.
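For reference, the sample data can be loaded like this (a sketch using io.StringIO; the expected_output column is dropped so the answers below can recreate it):
import io
import pandas as pd

csv_data = '''ID,Pay1,Pay2,Pay3,Low,High,expected_output
1,12,21,23,1,2,21
2,21,34,54,1,3,54
3,74,56,76,1,1,74'''

df = pd.read_csv(io.StringIO(csv_data)).drop(columns=['expected_output'])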
The idea is to filter only the Pay columns, then use numpy broadcasting to select columns according to the Low and High columns, pass the mask to DataFrame.where, and finally take the max:
df1 = df.filter(like='Pay')
m1 = np.arange(len(df1.columns)) >= df['Low'].to_numpy()[:, None] - 1
m2 = np.arange(len(df1.columns)) <= df['High'].to_numpy()[:, None] - 1
df['expected_output'] = df1.where(m1 & m2, 0).max(axis=1)
print (df)
ID Pay1 Pay2 Pay3 Low High expected_output
0 1 12 21 23 1 2 21
1 2 21 34 54 1 3 54
2 3 74 56 76 1 1 74
An alternative; I expect #jezrael's solution to be faster as it is within numpy and pd.wide_to_long is not particularly fast:
grouping = (
    pd.wide_to_long(df.filter(regex="^Pay|Low|High"),
                    i=["Low", "High"],
                    stubnames="Pay",
                    j="num")
      # keep every Pay column whose index falls within the Low..High range
      .query("Low <= num <= High")
      .groupby(level=["Low", "High"])
      .Pay.max()
)
grouping
Low High
1 1 74
2 21
3 54
Name: Pay, dtype: int64
df.join(grouping.rename("expected_output"), on=["Low", "High"])
ID Pay1 Pay2 Pay3 Low High expected_output
0 1 12 21 23 1 2 21
1 2 21 34 54 1 3 54
2 3 74 56 76 1 1 74

use groupby and custom agg in a dataframe pandas

I have this dataframe:
id start end
1 1 2
1 13 27
1 30 35
1 36 40
2 2 5
2 8 10
2 25 30
I want to group by id and merge consecutive rows where the difference between the end of row n-1 and the start of row n is less than 10, for example. I already found a way using a loop, but it's far too slow with over a million rows.
So the expected outcome would be:
id start end
1 1 2
1 13 40
2 2 10
2 25 30
First, I can get the required difference with df['diff'] = df['start'].shift(-1) - df['end']. How can I group the rows based on this condition within each id?
Thanks!
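For reference, the sample dataframe can be built like this (a minimal sketch of the table above):
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2],
                   'start': [1, 13, 30, 36, 2, 8, 25],
                   'end': [2, 27, 35, 40, 5, 10, 30]})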
I believe you can create groups by subtracting the per-id shifted end (DataFrameGroupBy.shift) from start, testing whether the gap is greater than 10, taking the cumulative sum, and passing the result to GroupBy.agg:
g = df['start'].sub(df.groupby('id')['end'].shift()).gt(10).cumsum()
df = (df.groupby(['id', g])
        .agg({'start': 'first', 'end': 'last'})
        .reset_index(level=1, drop=True)
        .reset_index())
print (df)
id start end
0 1 1 2
1 1 13 40
2 2 2 10
3 2 25 30
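To see how the grouping key g behaves, here is a quick illustration on the original sample dataframe (before the aggregation step); gap is just a helper name for the intermediate difference:
gap = df['start'].sub(df.groupby('id')['end'].shift())
g = gap.gt(10).cumsum()
print(df.assign(gap=gap, g=g))
#    id  start  end   gap  g
# 0   1      1    2   NaN  0
# 1   1     13   27  11.0  1
# 2   1     30   35   3.0  1
# 3   1     36   40   1.0  1
# 4   2      2    5   NaN  1
# 5   2      8   10   3.0  1
# 6   2     25   30  15.0  2
A new group starts whenever the gap to the previous row within the same id exceeds 10, which is why rows 1-3 of id 1 collapse into the single (13, 40) interval.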

Ranking with no duplicates

I am trying to rank a large dataset using python. I do not want duplicates and rather than using the 'first' method, I would instead like it to look at another column and rank it based on that value.
It should only look at the second column if the rank in the first column has duplicates.
Name CountA CountB
Alpha 15 3
Beta 20 52
Delta 20 31
Gamma 45 43
I would like the ranking to end up like this:
Name CountA CountB Rank
Alpha 15 3 4
Beta 20 52 2
Delta 20 31 3
Gamma 45 43 1
Currently, I am using df.rank(ascending=False, method='first')
Maybe use sort and pull out the index:
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'], 'CountA': [15, 20, 20, 45], 'CountB': [3, 52, 31, 43]})
# assign 1..n in sorted order, then align back to the original rows by index
order = df.sort_values(['CountA', 'CountB'], ascending=False).index
df['rank'] = pd.Series(range(1, len(df) + 1), index=order)
Name CountA CountB rank
0 A 15 3 4
1 B 20 52 2
2 C 20 31 3
3 D 45 43 1
You can take the counts of the values in CountA and then filter the DataFrame rows based on the count of CountA being greater than 1. Where the count is greater than 1, take CountB, otherwise CountA.
df = pd.DataFrame([[15, 3], [20, 52], [20, 31], [45, 43]], columns=['CountA', 'CountB'])
colAcount = df['CountA'].value_counts()
# then take the indices where colAcount > 1 and use them in a where
df['final'] = df['CountA'].where(~df['CountA'].isin(colAcount[colAcount > 1].index), df['CountB'])
df = df.sort_values(by='final', ascending=False).reset_index(drop=True)
# the rank is the (zero-based) index of the sorted frame
CountA CountB final
0 20 52 52
1 45 43 45
2 20 31 31
3 15 3 15
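If an explicit 1-based rank column is wanted after the sort above, the zero-based index only needs an offset (a small sketch):
# the sorted, reset index is zero-based, so the rank is just index + 1
df['Rank'] = df.index + 1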
