Select top n items in a pandas groupby and calculate the mean

Select top n items in a pandas groupby and calculate the mean - python

I have the following dataframe:
df = pd.DataFrame({'Value': [0, 1, 2,3, 4,5,6,7,8,9],'Name': ['John', 'Jim', 'John','Jim', 'John','Jim','Jim','John','Jim','John']})
df
Value Name
0 0 John
1 1 Jim
2 2 John
3 3 Jim
4 4 John
5 5 Jim
6 6 Jim
7 7 John
8 8 Jim
9 9 John
I would like to select the top n items by Name and find the mean from the Value column.
I have tried this:
df['Top2Mean'] = df.groupby(['Name'])['Value'].nlargest(2).transform('mean')
But the following error:
ValueError: transforms cannot produce aggregated results
My expected result is a new column called Top2Mean with a 8 next to John and 7 next to Jim.
Thanks in advance!

Let us calculate mean on level=0, then map the calculated mean value to the Name column to broadcast the aggregated results.
top2 = df.groupby('Name')['Value'].nlargest(2).mean(level=0)
df['Top2Mean'] = df['Name'].map(top2)
If we need to group on multiple columns for example Name and City then we have to take mean on level=[Name, City] and map the calculated mean values using MultiIndex.map
c = ['Name', 'City']
top2 = df.groupby(c)['Value'].nlargest(2).mean(level=c)
df['Top2Mean'] = df.set_index(c).index.map(top2)
Alternative approach with groupby and transform using a custom lambda function
df['Top2Mean'] = df.groupby('Name')['Value']\
.transform(lambda v: v.nlargest(2).mean())
Value Name Top2Mean
0 0 John 8
1 1 Jim 7
2 2 John 8
3 3 Jim 7
4 4 John 8
5 5 Jim 7
6 6 Jim 7
7 7 John 8
8 8 Jim 7
9 9 John 8

Related

How do I create a new column of max values of a column(corresponding to specific name) using pandas?

I'm wondering if it is possible to use Pandas to create a new column for the max values of a column (corresponding to different names, so that each name will have a max value).
For an example:
name value max
Alice 1 9
Linda 1 1
Ben 3 5
Alice 4 9
Alice 9 9
Ben 5 5
Linda 1 1
So for Alice, we are picking the max of 1, 4, and 9, which is 9. For Linda max(1,1) = 1, and for Ben max(3,5) = 5.
I was thinking of using .loc to select the name == "Alice", then get the max value of these rows, then create the new column. But since I'm dealing with a large dataset, this does not seem like a good option. Is there a smarter way to do this so that I don't need to know what specific names?

groupby and taking a max gives the max by name, which is then merged with the original df
df.merge(df.groupby(['name'])['value'].max().reset_index(),
on='name').rename(
columns={'value_x' : 'value',
'value_y' : 'max'})
name value max
0 Alice 1 9
1 Alice 4 9
2 Alice 9 9
3 Linda 1 1
4 Linda 1 1
5 Ben 3 5
6 Ben 5 5

You could use transform or map
df['max'] = df.groupby('name')['value'].transform('max')
or
df['max'] = df['name'].map(df.groupby('name')['value'].max())

How to find common values in groupby groups?

I have a df of this format, my goal is to find users who participate in more than one tournament and ultimately set their 'val' value to the one they first appear with. Initially, I was thinking I need to groupby 'tour' but then it needs some intersection but I'm not sure how to proceed. Alternatively, I can do pd.crosstab(df.user, df.tour) but I'm not sure how to proceed either.
df = pd.DataFrame(data = [['jim','1','1', 10],['john','1','1', 12], ['jack','2', '1', 14],['jim','2','1', 10],
['mel','3','2', 20],['jim','3','2', 10],['mat','4','2', 14],['nick','4','2', 20],
['tim','5','3', 16],['john','5','3', 10],['lin','6','3', 16],['mick','6','3', 20]],
columns = ['user', 'game', 'tour', 'val'])

Since df is already sorted by tour, we could use groupby + first:
df['val'] = df.groupby('user')['val'].transform('first')
Output:
user game tour val
0 jim 1 1 10
1 john 1 1 12
2 jack 2 1 14
3 jim 2 1 10
4 mel 3 2 20
5 jim 3 2 10
6 mat 4 2 14
7 nick 4 2 20
8 tim 5 3 16
9 john 5 3 12
10 lin 6 3 16
11 mick 6 3 20

You can groupby on 'user' and filter out groups with only 1 element, and then select the first one, like so:
df.groupby(['user']).filter(lambda g:len(g)>1).groupby('user').head(1)
output
user game tour val
0 jim 1 1 10
1 john 1 1 12

Use values from list in order to create few new Dataframe based on existing one

My current DF looks like below
x y z x c name status
1 2 3 2 5 Jon Work
1 2 5 4 5 Adam Work
9 7 3 9 5 Adam Holiday
3 2 3 4 5 Anna Work
1 4 6 8 5 Anna Work
4 1 6 8 5 Kate Off
2 1 6 1 5 Jon Off
My lists with specific values looks like below:
name = [Jon, Adam]
status = [Off, Work]
I need using those lists create new dataframes for all unique elements in "status" list. So it should looks like below:
df_off:
x y z x c name status
2 1 6 1 5 Jon Off
there is only one values, because name Kate in not in the list name
df_Work:
x y z x c name status
1 2 3 2 5 Jon Work
1 2 5 4 5 Adam Work
In second DF there is no "Anna" because she is not in list "name".
I hope it is clear. Do you have any idea how can I solve this issue?
Regard
Tomasz

First part, filter your data using:
name = ['Jon', 'Adam']
status = ['Off', 'Work']
df[df['name'].isin(name)&df['status'].isin(status)]
Then use groupby and transform the output to dictionary:
conditions = df['name'].isin(name)&df['status'].isin(status)
dfs = {'df_%s' % k:v for k,v in df[conditions].groupby('status')}
Then access your dataframes using:
>>> dfs['df_Work']
x y z x.1 c name status
0 1 2 3 2 5 Jon Work
1 1 2 5 4 5 Adam Work
You can even use multiple groups:
dfs = {'df_%s_%s' % k:v for k,v in df.groupby(['name', 'status'])}
dfs['df_Adam_Work']
If you goal is to save the subframes:
for groupname, df in df[conditions].groupby('status'):
df.to_excel(f'df_{groupname}.xlsx')

How to know the occurrence of a text in row of data frame pandas python

C1
0 John
1 John
2 John
3 Michale
4 Michale
5 Newton
6 Newton
7 John
8 John
9 John
I want to know how many time John occurred row wise. Suppose John occurred from 0 to 2 In result i want from 0 to 2 John. from 3 to 4 Michel from 5 to 6 Newton
Result I want in this format:
Start End Name
0 2 John
3 4 Michale
5 6 newton
7 9 John

Use
In [163]: df.reset_index().groupby('C1')['index'].agg(['min', 'max']).rename(
columns={'min': 'start', 'max': 'end'})
Out[163]:
start end
C1
John 0 2
Michale 3 4
Newton 5 6

#Zero: Would adding the below to your code help ..?? :)
df_new = df.reset_index().groupby('C1')['index'].agg(['min', 'max']).rename(
columns={'min': 'start', 'max': 'end'})
df_new.reset_index().rename(columns={'C1':'Name'})
Edit: Maybe something like this..? I am still learning but there is no harm trying. :)
labels = (df.C1 != df.C1.shift()).cumsum()
df1 = pd.concat([df,labels],axis = 1,names = 'label')
df1.columns = ['C1','label']
df_new = df1.reset_index().groupby(['label','C1']).agg(['min', 'max']).rename(
columns={'min': 'start', 'max': 'end'}).reset_index().rename(columns={'C1':'Name'})
df_new

Pandas, how to make matrix

I have a question about pandas and if someone could help me, I would be grateful for that very much.
I have a dataframe
df1 = pd.DataFrame( {'Name': ['A', 'B','A','A']})
df1
I want to do groupby for this.
x=df1.groupby("Name").size()
x
I also have another dataframe
df2 = pd.DataFrame( {'Name2': ['Jon',Maria','Maria','Mike','Mike','Mike']})
df2
For this one, I do groupby as well.
y= df2.groupby("Name2").size()
And then I want to make matrix whose column is x and row is y, and want to multiply the values.
I want the matrix like this.
Jon Maria Mike
A 3 6 9
B 1 2 3
If you could tell me how to do that, I would greatly appreciate it.

You could perform a dot product:
x.to_frame().dot(y.to_frame().T)
Name2 Jon Maria Mike
Name
A 3 6 9
B 1 2 3
If you want to remove the axis labels, use rename_axis:
x.to_frame().dot(y.to_frame().T)\
.rename_axis(None).rename_axis(None, 1)
Jon Maria Mike
A 3 6 9
B 1 2 3
Alternatively, assign in-place:
v = x.to_frame().dot(y.to_frame().T)
v.index.name = v.columns.name = None
v
Jon Maria Mike
A 3 6 9
B 1 2 3

In [35]: (pd.DataFrame(y[:,None].dot(x[:,None].T).T, columns=y.index, index=x.index)
.rename_axis(None)
.rename_axis(None,1))
Out[35]:
Jon Maria Mike
A 3 6 9
B 1 2 3

Or we can using np.multiply.outer
pd.DataFrame(np.multiply.outer(x.values,y.values),columns=y.index,index=x.index)
Out[344]:
Name2 Jon Maria Mike
Name
A 3 6 9
B 1 2 3

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Select top n items in a pandas groupby and calculate the mean - python

Related

How do I create a new column of max values of a column(corresponding to specific name) using pandas?

How to find common values in groupby groups?

Use values from list in order to create few new Dataframe based on existing one

How to know the occurrence of a text in row of data frame pandas python

Pandas, how to make matrix

Categories

Resources