Select top n items in a pandas groupby and calculate the mean - python

I have the following dataframe:
df = pd.DataFrame({'Value': [0, 1, 2,3, 4,5,6,7,8,9],'Name': ['John', 'Jim', 'John','Jim', 'John','Jim','Jim','John','Jim','John']})
df
Value Name
0 0 John
1 1 Jim
2 2 John
3 3 Jim
4 4 John
5 5 Jim
6 6 Jim
7 7 John
8 8 Jim
9 9 John
I would like to select the top n items by Name and find the mean from the Value column.
I have tried this:
df['Top2Mean'] = df.groupby(['Name'])['Value'].nlargest(2).transform('mean')
But the following error:
ValueError: transforms cannot produce aggregated results
My expected result is a new column called Top2Mean with a 8 next to John and 7 next to Jim.
Thanks in advance!

Let us calculate mean on level=0, then map the calculated mean value to the Name column to broadcast the aggregated results.
top2 = df.groupby('Name')['Value'].nlargest(2).mean(level=0)
df['Top2Mean'] = df['Name'].map(top2)
If we need to group on multiple columns for example Name and City then we have to take mean on level=[Name, City] and map the calculated mean values using MultiIndex.map
c = ['Name', 'City']
top2 = df.groupby(c)['Value'].nlargest(2).mean(level=c)
df['Top2Mean'] = df.set_index(c).index.map(top2)
Alternative approach with groupby and transform using a custom lambda function
df['Top2Mean'] = df.groupby('Name')['Value']\
.transform(lambda v: v.nlargest(2).mean())
Value Name Top2Mean
0 0 John 8
1 1 Jim 7
2 2 John 8
3 3 Jim 7
4 4 John 8
5 5 Jim 7
6 6 Jim 7
7 7 John 8
8 8 Jim 7
9 9 John 8

Related

How do I create a new column of max values of a column(corresponding to specific name) using pandas?

I'm wondering if it is possible to use Pandas to create a new column for the max values of a column (corresponding to different names, so that each name will have a max value).
For an example:
name value max
Alice 1 9
Linda 1 1
Ben 3 5
Alice 4 9
Alice 9 9
Ben 5 5
Linda 1 1
So for Alice, we are picking the max of 1, 4, and 9, which is 9. For Linda max(1,1) = 1, and for Ben max(3,5) = 5.
I was thinking of using .loc to select the name == "Alice", then get the max value of these rows, then create the new column. But since I'm dealing with a large dataset, this does not seem like a good option. Is there a smarter way to do this so that I don't need to know what specific names?
groupby and taking a max gives the max by name, which is then merged with the original df
df.merge(df.groupby(['name'])['value'].max().reset_index(),
on='name').rename(
columns={'value_x' : 'value',
'value_y' : 'max'})
name value max
0 Alice 1 9
1 Alice 4 9
2 Alice 9 9
3 Linda 1 1
4 Linda 1 1
5 Ben 3 5
6 Ben 5 5
You could use transform or map
df['max'] = df.groupby('name')['value'].transform('max')
or
df['max'] = df['name'].map(df.groupby('name')['value'].max())

How to find common values in groupby groups?

I have a df of this format, my goal is to find users who participate in more than one tournament and ultimately set their 'val' value to the one they first appear with. Initially, I was thinking I need to groupby 'tour' but then it needs some intersection but I'm not sure how to proceed. Alternatively, I can do pd.crosstab(df.user, df.tour) but I'm not sure how to proceed either.
df = pd.DataFrame(data = [['jim','1','1', 10],['john','1','1', 12], ['jack','2', '1', 14],['jim','2','1', 10],
['mel','3','2', 20],['jim','3','2', 10],['mat','4','2', 14],['nick','4','2', 20],
['tim','5','3', 16],['john','5','3', 10],['lin','6','3', 16],['mick','6','3', 20]],
columns = ['user', 'game', 'tour', 'val'])
Since df is already sorted by tour, we could use groupby + first:
df['val'] = df.groupby('user')['val'].transform('first')
Output:
user game tour val
0 jim 1 1 10
1 john 1 1 12
2 jack 2 1 14
3 jim 2 1 10
4 mel 3 2 20
5 jim 3 2 10
6 mat 4 2 14
7 nick 4 2 20
8 tim 5 3 16
9 john 5 3 12
10 lin 6 3 16
11 mick 6 3 20
You can groupby on 'user' and filter out groups with only 1 element, and then select the first one, like so:
df.groupby(['user']).filter(lambda g:len(g)>1).groupby('user').head(1)
output
user game tour val
0 jim 1 1 10
1 john 1 1 12

Use values from list in order to create few new Dataframe based on existing one

My current DF looks like below
x y z x c name status
1 2 3 2 5 Jon Work
1 2 5 4 5 Adam Work
9 7 3 9 5 Adam Holiday
3 2 3 4 5 Anna Work
1 4 6 8 5 Anna Work
4 1 6 8 5 Kate Off
2 1 6 1 5 Jon Off
My lists with specific values looks like below:
name = [Jon, Adam]
status = [Off, Work]
I need using those lists create new dataframes for all unique elements in "status" list. So it should looks like below:
df_off:
x y z x c name status
2 1 6 1 5 Jon Off
there is only one values, because name Kate in not in the list name
df_Work:
x y z x c name status
1 2 3 2 5 Jon Work
1 2 5 4 5 Adam Work
In second DF there is no "Anna" because she is not in list "name".
I hope it is clear. Do you have any idea how can I solve this issue?
Regard
Tomasz
First part, filter your data using:
name = ['Jon', 'Adam']
status = ['Off', 'Work']
df[df['name'].isin(name)&df['status'].isin(status)]
Then use groupby and transform the output to dictionary:
conditions = df['name'].isin(name)&df['status'].isin(status)
dfs = {'df_%s' % k:v for k,v in df[conditions].groupby('status')}
Then access your dataframes using:
>>> dfs['df_Work']
x y z x.1 c name status
0 1 2 3 2 5 Jon Work
1 1 2 5 4 5 Adam Work
You can even use multiple groups:
dfs = {'df_%s_%s' % k:v for k,v in df.groupby(['name', 'status'])}
dfs['df_Adam_Work']
If you goal is to save the subframes:
for groupname, df in df[conditions].groupby('status'):
df.to_excel(f'df_{groupname}.xlsx')

How to know the occurrence of a text in row of data frame pandas python

C1
0 John
1 John
2 John
3 Michale
4 Michale
5 Newton
6 Newton
7 John
8 John
9 John
I want to know how many time John occurred row wise. Suppose John occurred from 0 to 2 In result i want from 0 to 2 John. from 3 to 4 Michel from 5 to 6 Newton
Result I want in this format:
Start End Name
0 2 John
3 4 Michale
5 6 newton
7 9 John
Use
In [163]: df.reset_index().groupby('C1')['index'].agg(['min', 'max']).rename(
columns={'min': 'start', 'max': 'end'})
Out[163]:
start end
C1
John 0 2
Michale 3 4
Newton 5 6
#Zero: Would adding the below to your code help ..?? :)
df_new = df.reset_index().groupby('C1')['index'].agg(['min', 'max']).rename(
columns={'min': 'start', 'max': 'end'})
df_new.reset_index().rename(columns={'C1':'Name'})
Edit: Maybe something like this..? I am still learning but there is no harm trying. :)
labels = (df.C1 != df.C1.shift()).cumsum()
df1 = pd.concat([df,labels],axis = 1,names = 'label')
df1.columns = ['C1','label']
df_new = df1.reset_index().groupby(['label','C1']).agg(['min', 'max']).rename(
columns={'min': 'start', 'max': 'end'}).reset_index().rename(columns={'C1':'Name'})
df_new

Pandas, how to make matrix

I have a question about pandas and if someone could help me, I would be grateful for that very much.
I have a dataframe
df1 = pd.DataFrame( {'Name': ['A', 'B','A','A']})
df1
I want to do groupby for this.
x=df1.groupby("Name").size()
x
I also have another dataframe
df2 = pd.DataFrame( {'Name2': ['Jon',Maria','Maria','Mike','Mike','Mike']})
df2
For this one, I do groupby as well.
y= df2.groupby("Name2").size()
And then I want to make matrix whose column is x and row is y, and want to multiply the values.
I want the matrix like this.
Jon Maria Mike
A 3 6 9
B 1 2 3
If you could tell me how to do that, I would greatly appreciate it.
You could perform a dot product:
x.to_frame().dot(y.to_frame().T)
Name2 Jon Maria Mike
Name
A 3 6 9
B 1 2 3
If you want to remove the axis labels, use rename_axis:
x.to_frame().dot(y.to_frame().T)\
.rename_axis(None).rename_axis(None, 1)
Jon Maria Mike
A 3 6 9
B 1 2 3
Alternatively, assign in-place:
v = x.to_frame().dot(y.to_frame().T)
v.index.name = v.columns.name = None
v
Jon Maria Mike
A 3 6 9
B 1 2 3
In [35]: (pd.DataFrame(y[:,None].dot(x[:,None].T).T, columns=y.index, index=x.index)
.rename_axis(None)
.rename_axis(None,1))
Out[35]:
Jon Maria Mike
A 3 6 9
B 1 2 3
Or we can using np.multiply.outer
pd.DataFrame(np.multiply.outer(x.values,y.values),columns=y.index,index=x.index)
Out[344]:
Name2 Jon Maria Mike
Name
A 3 6 9
B 1 2 3

Categories

Resources