Count occurrence of a max value within aggregation

Count occurrence of a max value within aggregation - python

I have a table like this:
Column1  Column2
John     2
John     8
John     8
John     8
Robert   5
Robert   5
Robert   1
Carl     8
Carl     7
Now I want to group this DataFrame by Column1, get the max value of Column2 for each group, and count how many times that max value occurs within the group.
So the output should look like this:
Column1  Max  Count_of_Max
John     8    3
Robert   5    2
Carl     8    1
I've been trying to do something like this:
def Max_Count(x):
    a = df.loc[x.index]
    return a.loc[a['Column2'] == a['Column2'].max(), 'Column2'].count()

df.groupby(["Column1"]).agg({'Column2': ["max", Max_Count]}).reset_index()
But it's not really working :(
What would be the way to get the desired result?

df.groupby('Column1').agg({
    'Column2': [max, lambda x: (x == max(x)).sum()]
}).rename(columns={'max': 'Max', '<lambda_0>': 'Count_of_Max'})
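A possible alternative, sketched here under the assumption that a pandas version with named aggregation (0.25 or later) is available: naming the output columns directly avoids the rename on the auto-generated '<lambda_0>' label and produces flat column names. The sample frame is rebuilt from the question's table; sort=False is an added choice to keep the groups in their original order.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "Column1": ["John", "John", "John", "John",
                "Robert", "Robert", "Robert", "Carl", "Carl"],
    "Column2": [2, 8, 8, 8, 5, 5, 1, 8, 7],
})

# Named aggregation: each keyword becomes an output column,
# given as (source column, aggregation).
result = (
    df.groupby("Column1", sort=False)
      .agg(
          Max=("Column2", "max"),
          # Count the rows in the group that equal the group's max.
          Count_of_Max=("Column2", lambda s: int((s == s.max()).sum())),
      )
      .reset_index()
)
print(result)
```

The result has the three flat columns Column1, Max and Count_of_Max, with one row per name.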

Related

How do I create a new column of max values of a column(corresponding to specific name) using pandas?

I'm wondering if it is possible to use Pandas to create a new column for the max values of a column (corresponding to different names, so that each name will have a max value).
For an example:
name value max
Alice 1 9
Linda 1 1
Ben 3 5
Alice 4 9
Alice 9 9
Ben 5 5
Linda 1 1
So for Alice, we are picking the max of 1, 4, and 9, which is 9. For Linda max(1,1) = 1, and for Ben max(3,5) = 5.
I was thinking of using .loc to select the rows where name == "Alice", get the max value of these rows, and then create the new column. But since I'm dealing with a large dataset, this does not seem like a good option. Is there a smarter way to do this, so that I don't need to know the specific names in advance?

Grouping by name and taking the max gives the max per name, which is then merged back into the original df:
df.merge(df.groupby(['name'])['value'].max().reset_index(),
         on='name').rename(
    columns={'value_x': 'value',
             'value_y': 'max'})
name value max
0 Alice 1 9
1 Alice 4 9
2 Alice 9 9
3 Linda 1 1
4 Linda 1 1
5 Ben 3 5
6 Ben 5 5

You could use transform or map
df['max'] = df.groupby('name')['value'].transform('max')
or
df['max'] = df['name'].map(df.groupby('name')['value'].max())
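As a quick check that the two one-liners above agree, here is a small sketch on the sample data from the question. The column names max_transform and max_map are made up for the comparison and are not part of the original answer.

```python
import pandas as pd

# Sample data from the question (without the expected "max" column).
df = pd.DataFrame({
    "name": ["Alice", "Linda", "Ben", "Alice", "Alice", "Ben", "Linda"],
    "value": [1, 1, 3, 4, 9, 5, 1],
})

# transform broadcasts each group's max back onto the rows of that group.
df["max_transform"] = df.groupby("name")["value"].transform("max")

# map looks up each row's name in the per-name max Series.
per_name_max = df.groupby("name")["value"].max()
df["max_map"] = df["name"].map(per_name_max)

print(df)
```

Both approaches keep the original row order, unlike the merge, which may reorder rows depending on the pandas version.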

Pandas: filter the row according to the value of another column in different group (two columns in aggregate)

I have a dataset like below in pandas dataframe:
Name Shift Data Type
Peter 0 12 A
Peter 0 13 A
Peter 0 14 B
Sam 1 12 A
Sam 1 15 A
Sam 1 16 B
Sam 1 17 B
Mary 2 20 A
Mary 2 21 A
Mary 2 12 A
Could anyone suggest how to produce the result below? The logic is: if Shift is 0, pick the 1st item within each group of "Name" and "Type"; if Shift is 1, pick the 2nd item within the group, and so on. I have thought of nth(x), but I don't know how to pass a variable as x in this case. Any other workaround that generates the same result is fine. Thank you.
Name Shift Data Type
Peter 0 12 A
Peter 0 14 B
Sam 1 15 A
Sam 1 17 B
Mary 2 12 A

You can use groupby.cumcount()
Assuming your data is in a DataFrame called df, I think this should work for you:
df = df[df.groupby(['Name','Type']).cumcount()==df['Shift']]
cumcount numbers the rows within each (Name, Type) group starting from 0, so comparing it to the Shift column keeps exactly the row whose position in its group equals Shift.
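To make the comparison concrete, here is a minimal sketch on the data from the question. The intermediate variable position is introduced only for illustration.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Name": ["Peter"] * 3 + ["Sam"] * 4 + ["Mary"] * 3,
    "Shift": [0] * 3 + [1] * 4 + [2] * 3,
    "Data": [12, 13, 14, 12, 15, 16, 17, 20, 21, 12],
    "Type": ["A", "A", "B", "A", "A", "B", "B", "A", "A", "A"],
})

# Zero-based position of each row within its (Name, Type) group.
position = df.groupby(["Name", "Type"]).cumcount()

# Keep only the rows whose position matches their Shift value.
result = df[position == df["Shift"]]
print(result)
```

A group with fewer rows than Shift + 1 simply contributes no row, since no position in it can match.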

Select top n items in a pandas groupby and calculate the mean

I have the following dataframe:
df = pd.DataFrame({'Value': [0, 1, 2,3, 4,5,6,7,8,9],'Name': ['John', 'Jim', 'John','Jim', 'John','Jim','Jim','John','Jim','John']})
df
Value Name
0 0 John
1 1 Jim
2 2 John
3 3 Jim
4 4 John
5 5 Jim
6 6 Jim
7 7 John
8 8 Jim
9 9 John
I would like to select the top n items by Name and find the mean from the Value column.
I have tried this:
df['Top2Mean'] = df.groupby(['Name'])['Value'].nlargest(2).transform('mean')
But I get the following error:
ValueError: transforms cannot produce aggregated results
My expected result is a new column called Top2Mean with an 8 next to John and a 7 next to Jim.
Thanks in advance!

Let us take the two largest values per Name, average them on index level 0 (the Name level), then map the result onto the Name column to broadcast the aggregated values. Note that Series.mean(level=0) was removed in pandas 2.0; groupby(level=0).mean() is the equivalent and works on older versions too.
top2 = df.groupby('Name')['Value'].nlargest(2).groupby(level=0).mean()
df['Top2Mean'] = df['Name'].map(top2)
If we need to group on multiple columns, for example Name and City, we take the mean over both index levels and map the calculated values using MultiIndex.map
c = ['Name', 'City']
top2 = df.groupby(c)['Value'].nlargest(2).groupby(level=c).mean()
df['Top2Mean'] = df.set_index(c).index.map(top2)
Alternative approach with groupby and transform using a custom lambda function
df['Top2Mean'] = df.groupby('Name')['Value'] \
    .transform(lambda v: v.nlargest(2).mean())
Value Name Top2Mean
0 0 John 8
1 1 Jim 7
2 2 John 8
3 3 Jim 7
4 4 John 8
5 5 Jim 7
6 6 Jim 7
7 7 John 8
8 8 Jim 7
9 9 John 8
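The two approaches can be compared side by side on the question's data. This is a sketch that assumes a pandas version where grouping on an index level is available; the column names Top2Mean_map and Top2Mean_transform are made up for the comparison.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Value": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    "Name": ["John", "Jim", "John", "Jim", "John",
             "Jim", "Jim", "John", "Jim", "John"],
})

# Approach 1: two largest values per name, averaged per name, then mapped back.
top2 = df.groupby("Name")["Value"].nlargest(2).groupby(level=0).mean()
df["Top2Mean_map"] = df["Name"].map(top2)

# Approach 2: transform with a lambda that returns one scalar per group.
df["Top2Mean_transform"] = (
    df.groupby("Name")["Value"].transform(lambda v: v.nlargest(2).mean())
)

print(df)
```

The original attempt failed because nlargest(2) had already shrunk each group to two rows, so there was nothing of the original length left for transform to broadcast onto.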

How to create a new column based on information in another column?

I am trying to create a new column in a pandas DataFrame. I have names in one column, and I want to assign numbers to them in a new column. If a name is repeated in consecutive rows, those rows get the same number; if it appears again after different names, it should get a new number.
For example, my df is like
Name/
Stephen
Stephen
Mike
Carla
Carla
Stephen
my new column should be
Numbers/
0
0
1
2
2
3
Sorry, I couldn't paste my dataframe here.

Try:
df['Numbers'] = (df['Name'] != df['Name'].shift()).cumsum() - 1
Output:
Name Numbers
0 Stephen 0
1 Stephen 0
2 Mike 1
3 Carla 2
4 Carla 2
5 Stephen 3
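The one-liner can be read in two steps, sketched below on the question's data. The intermediate name starts_new_run is introduced only for illustration.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Name": ["Stephen", "Stephen", "Mike", "Carla", "Carla", "Stephen"],
})

# True wherever a row's name differs from the previous row's name.
# The first row is compared against NaN, so it is always True.
starts_new_run = df["Name"] != df["Name"].shift()

# A running count of run starts gives 1, 1, 2, 3, 3, 4;
# subtracting 1 makes the numbering start at 0.
df["Numbers"] = starts_new_run.cumsum() - 1
print(df)
```

Because only neighbouring rows are compared, the second block of "Stephen" gets its own number rather than reusing 0.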

Conditional drop of identical pairs of columns in pandas

I have a somewhat big pandas dataframe (100,000x9). The first two columns are a combination of names associated with a value (in both sides). I want to delete the lower value associated with a given combination.
I haven't tried anything yet, because I'm not sure how to tackle this problem. My first impression is that I need to use the apply function over the data frame, but I need to select each combination of 'first' and 'second', compare them and then delete that row.
df = pd.DataFrame(np.array([['John','Mary',5],['John','Mark',1], ['Mary','John',2], ['Mary','Mark',1], ['Mark','John',3], ['Mark','Mary',5]]), columns=['first','second','third'])
df
first second third
0 John Mary 5
1 John Mark 1
2 Mary John 2
3 Mary Mark 1
4 Mark John 3
5 Mark Mary 5
My objective is to get this data frame
df_clean = pd.DataFrame(np.array([['John','Mary',5], ['Mark','John',3], ['Mark','Mary',5]]), columns=['first','second','third'])
df_clean
first second third
0 John Mary 5
1 Mark John 3
2 Mark Mary 5
Any ideas?

First we use np.sort to sort horizontally, then we use groupby with max function to get the highest value per unique value of first, second:
df[['first', 'second']] = np.sort(df[['first', 'second']], axis=1)
print(df.groupby(['first', 'second']).third.max().reset_index())
first second third
0 John Mark 3
1 John Mary 5
2 Mark Mary 5
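One detail worth checking: building the frame through np.array, as in the question, turns every value into a string, so max would compare 'third' as text. The sketch below assumes the intent is a numeric comparison and builds the frame from a plain list instead, which keeps 'third' as integers.

```python
import numpy as np
import pandas as pd

# Same rows as the question, built from a list so 'third' stays numeric.
df = pd.DataFrame(
    [["John", "Mary", 5], ["John", "Mark", 1], ["Mary", "John", 2],
     ["Mary", "Mark", 1], ["Mark", "John", 3], ["Mark", "Mary", 5]],
    columns=["first", "second", "third"],
)

# Sort each (first, second) pair alphabetically within its row,
# so that (Mary, John) and (John, Mary) become the same key.
df[["first", "second"]] = np.sort(df[["first", "second"]].to_numpy(), axis=1)

# Keep the highest 'third' value for each unordered pair.
df_clean = df.groupby(["first", "second"], as_index=False)["third"].max()
print(df_clean)
```

Note that the names in each surviving row appear in alphabetical order, so the pair with value 3 comes out as (John, Mark) rather than (Mark, John) as in the question's desired frame.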
