getting the maximum value of each group after groupby pandas [duplicate] - python

How do I find all rows in a pandas DataFrame which have the max value for count column, after grouping by ['Sp','Mt'] columns?
Example 1: the following DataFrame:
Sp Mt Value count
0 MM1 S1 a **3**
1 MM1 S1 n 2
2 MM1 S3 cb **5**
3 MM2 S3 mk **8**
4 MM2 S4 bg **10**
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 2
8 MM4 S2 uyi **7**
Expected output is to get the result rows whose count is max in each group, like this:
Sp Mt Value count
0 MM1 S1 a **3**
2 MM1 S3 cb **5**
3 MM2 S3 mk **8**
4 MM2 S4 bg **10**
8 MM4 S2 uyi **7**
Example 2:
Sp Mt Value count
4 MM2 S4 bg 10
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 8
8 MM4 S2 uyi 8
Expected output:
Sp Mt Value count
4 MM2 S4 bg 10
7 MM4 S2 cb 8
8 MM4 S2 uyi 8

Firstly, we can get the max count for each group like this:
In [1]: df
Out[1]:
Sp Mt Value count
0 MM1 S1 a 3
1 MM1 S1 n 2
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 2
8 MM4 S2 uyi 7
In [2]: df.groupby(['Sp', 'Mt'])['count'].max()
Out[2]:
Sp Mt
MM1 S1 3
S3 5
MM2 S3 8
S4 10
MM4 S2 7
Name: count, dtype: int64
To get the indices of the original DF you can do:
In [3]: idx = df.groupby(['Sp', 'Mt'])['count'].transform(max) == df['count']
In [4]: df[idx]
Out[4]:
Sp Mt Value count
0 MM1 S1 a 3
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
8 MM4 S2 uyi 7
Note that if you have multiple max values per group, all will be returned.
Update
On a Hail Mary chance that this is what the OP is requesting:
In [5]: df['count_max'] = df.groupby(['Sp', 'Mt'])['count'].transform(max)
In [6]: df
Out[6]:
Sp Mt Value count count_max
0 MM1 S1 a 3 3
1 MM1 S1 n 2 3
2 MM1 S3 cb 5 5
3 MM2 S3 mk 8 8
4 MM2 S4 bg 10 10
5 MM2 S4 dgd 1 10
6 MM4 S2 rd 2 7
7 MM4 S2 cb 2 7
8 MM4 S2 uyi 7 7

You can sort the dataFrame by count and then remove duplicates. I think it's easier:
df.sort_values('count', ascending=False).drop_duplicates(['Sp','Mt'])

Easy solution would be to apply the idxmax() function to get indices of rows with max values.
This would filter out all the rows with max value in the group.
In [367]: df
Out[367]:
sp mt val count
0 MM1 S1 a 3
1 MM1 S1 n 2
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
5 MM2 S4 dgb 1
6 MM4 S2 rd 2
7 MM4 S2 cb 2
8 MM4 S2 uyi 7
# Apply idxmax() and use .loc() on dataframe to filter the rows with max values:
In [368]: df.loc[df.groupby(["sp", "mt"])["count"].idxmax()]
Out[368]:
sp mt val count
0 MM1 S1 a 3
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
8 MM4 S2 uyi 7
# Just to show what values are returned by .idxmax() above:
In [369]: df.groupby(["sp", "mt"])["count"].idxmax().values
Out[369]: array([0, 2, 3, 4, 8])

You may not need to do groupby(), but use both sort_values + drop_duplicates
df.sort_values('count').drop_duplicates(['Sp', 'Mt'], keep='last')
Out[190]:
Sp Mt Value count
0 MM1 S1 a 3
2 MM1 S3 cb 5
8 MM4 S2 uyi 7
3 MM2 S3 mk 8
4 MM2 S4 bg 10
Also almost same logic by using tail
df.sort_values('count').groupby(['Sp', 'Mt']).tail(1)
Out[52]:
Sp Mt Value count
0 MM1 S1 a 3
2 MM1 S3 cb 5
8 MM4 S2 uyi 7
3 MM2 S3 mk 8
4 MM2 S4 bg 10

Having tried the solution suggested by Zelazny on a relatively large DataFrame (~400k rows) I found it to be very slow. Here is an alternative that I found to run orders of magnitude faster on my data set.
df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})
df_grouped = df.groupby(['sp', 'mt']).agg({'count':'max'})
df_grouped = df_grouped.reset_index()
df_grouped = df_grouped.rename(columns={'count':'count_max'})
df = pd.merge(df, df_grouped, how='left', on=['sp', 'mt'])
df = df[df['count'] == df['count_max']]

Use groupby and idxmax methods:
transfer col date to datetime:
df['date'] = pd.to_datetime(df['date'])
get the index of max of column date, after groupyby ad_id:
idx = df.groupby(by='ad_id')['date'].idxmax()
get the wanted data:
df_max = df.loc[idx,]
ad_id price date
7 22 2 2018-06-11
6 23 2 2018-06-22
2 24 2 2018-06-30
3 28 5 2018-06-22

For me, the easiest solution would be keep value when count is equal to the maximum. Therefore, the following one line command is enough :
df[df['count'] == df.groupby(['Mt'])['count'].transform(max)]

Summarizing, there are many ways, but which one is faster?
import pandas as pd
import numpy as np
import time
df = pd.DataFrame(np.random.randint(1,10,size=(1000000, 2)), columns=list('AB'))
start_time = time.time()
df1idx = df.groupby(['A'])['B'].transform(max) == df['B']
df1 = df[df1idx]
print("---1 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df2 = df.sort_values('B').groupby(['A']).tail(1)
print("---2 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df3 = df.sort_values('B').drop_duplicates(['A'],keep='last')
print("---3 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df3b = df.sort_values('B', ascending=False).drop_duplicates(['A'])
print("---3b) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df4 = df[df['B'] == df.groupby(['A'])['B'].transform(max)]
print("---4 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
d = df.groupby('A')['B'].nlargest(1)
df5 = df.iloc[[i[1] for i in d.index], :]
print("---5 ) %s seconds ---" % (time.time() - start_time))
And the winner is...
--1 ) 0.03337574005126953 seconds ---
--2 ) 0.1346898078918457 seconds ---
--3 ) 0.10243558883666992 seconds ---
--3b) 0.1004343032836914 seconds ---
--4 ) 0.028397560119628906 seconds ---
--5 ) 0.07552886009216309 seconds ---

Try using nlargest on the groupby object. The advantage is that it returns the rows where "the nlargest item(s)" were fetched from, and we can get their index.
In this case, we want n=1 for the max and keep='all' to include duplicate maxes.
Note: we slice the last (-1) element of our index since our index in this case consist of tuples (e.g. ('MM1', 'S1', 0)).
df = pd.DataFrame({
'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'Val': ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count': [3, 2, 5, 8, 10, 1, 2, 2, 7]
})
d = df.groupby(['Sp', 'Mt'])['count'].nlargest(1, keep='all')
df.loc[[i[-1] for i in d.index]]
Sp Mt Val count
0 MM1 S1 a 3
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
8 MM4 S2 uyi 7

Realizing that "applying" "nlargest" to groupby object works just as fine:
Additional advantage - also can fetch top n values if required:
In [85]: import pandas as pd
In [86]: df = pd.DataFrame({
...: 'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
...: 'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
...: 'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
...: 'count' : [3,2,5,8,10,1,2,2,7]
...: })
## Apply nlargest(1) to find the max val df, and nlargest(n) gives top n values for df:
In [87]: df.groupby(["sp", "mt"]).apply(lambda x: x.nlargest(1, "count")).reset_index(drop=True)
Out[87]:
count mt sp val
0 3 S1 MM1 a
1 5 S3 MM1 cb
2 8 S3 MM2 mk
3 10 S4 MM2 bg
4 7 S2 MM4 uyi

If you sort your DataFrame that ordering will be preserved in the groupby. You can then just grab the first or last element and reset the index.
df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})
df.sort_values("count", ascending=False).groupby(["sp", "mt"]).first().reset_index()

Many of these are great answers, but to help show scalability, on 2.8 million rows with varying amount of duplicates shows some startling differences. The fastest for my data was the sort by then drop duplicate (drop all but last marginally faster than sort descending and drop all but first)
Sort Ascending, Drop duplicate keep last (2.22 s)
Sort Descending, Drop Duplicate keep First (2.32 s)
Transform Max within the loc function (3.73 s)
Transform Max storing IDX then using loc select as second step (3.84 s)
Groupby using Tail (8.98 s)
IDMax with groupby and then using loc select as second step (95.39 s)
IDMax with groupby within the loc select (95.74 s)
NLargest(1) then using iloc select as a second step (> 35000 s ) - did not finish after running overnight
NLargest(1) within iloc select (> 35000 s ) - did not finish after running overnight
As you can see Sort is 1/3 faster than transform and 75% faster than groupby. Everything else is up to 40x slower. In small datasets, this may not matter by much, but as you can see, this can significantly impact large datasets.

df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})
df.groupby(['sp', 'mt']).apply(lambda grp: grp.nlargest(1, 'count'))

I've been using this functional style for many group operations:
df = pd.DataFrame({
'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'Val': ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'Count': [3, 2, 5, 8, 10, 1, 2, 2, 7]
})
(df.groupby(['Sp', 'Mt'])
.apply(lambda group: group[group['Count'] == group['Count'].max()])
.reset_index(drop=True))
Sp Mt Val Count
0 MM1 S1 a 3
1 MM1 S3 cb 5
2 MM2 S3 mk 8
3 MM2 S4 bg 10
4 MM4 S2 uyi 7
.reset_index(drop=True) gets you back to the original index by dropping the group-index.

df.loc[df.groupby('mt')['count'].idxmax()]
if the df index isn't unique you may need this step df.reset_index(inplace=True) first.

Related

Drop duplicated and take the highest value based on condition [duplicate]

How do I find all rows in a pandas DataFrame which have the max value for count column, after grouping by ['Sp','Mt'] columns?
Example 1: the following DataFrame:
Sp Mt Value count
0 MM1 S1 a **3**
1 MM1 S1 n 2
2 MM1 S3 cb **5**
3 MM2 S3 mk **8**
4 MM2 S4 bg **10**
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 2
8 MM4 S2 uyi **7**
Expected output is to get the result rows whose count is max in each group, like this:
Sp Mt Value count
0 MM1 S1 a **3**
2 MM1 S3 cb **5**
3 MM2 S3 mk **8**
4 MM2 S4 bg **10**
8 MM4 S2 uyi **7**
Example 2:
Sp Mt Value count
4 MM2 S4 bg 10
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 8
8 MM4 S2 uyi 8
Expected output:
Sp Mt Value count
4 MM2 S4 bg 10
7 MM4 S2 cb 8
8 MM4 S2 uyi 8
Firstly, we can get the max count for each group like this:
In [1]: df
Out[1]:
Sp Mt Value count
0 MM1 S1 a 3
1 MM1 S1 n 2
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 2
8 MM4 S2 uyi 7
In [2]: df.groupby(['Sp', 'Mt'])['count'].max()
Out[2]:
Sp Mt
MM1 S1 3
S3 5
MM2 S3 8
S4 10
MM4 S2 7
Name: count, dtype: int64
To get the indices of the original DF you can do:
In [3]: idx = df.groupby(['Sp', 'Mt'])['count'].transform(max) == df['count']
In [4]: df[idx]
Out[4]:
Sp Mt Value count
0 MM1 S1 a 3
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
8 MM4 S2 uyi 7
Note that if you have multiple max values per group, all will be returned.
Update
On a Hail Mary chance that this is what the OP is requesting:
In [5]: df['count_max'] = df.groupby(['Sp', 'Mt'])['count'].transform(max)
In [6]: df
Out[6]:
Sp Mt Value count count_max
0 MM1 S1 a 3 3
1 MM1 S1 n 2 3
2 MM1 S3 cb 5 5
3 MM2 S3 mk 8 8
4 MM2 S4 bg 10 10
5 MM2 S4 dgd 1 10
6 MM4 S2 rd 2 7
7 MM4 S2 cb 2 7
8 MM4 S2 uyi 7 7
You can sort the dataFrame by count and then remove duplicates. I think it's easier:
df.sort_values('count', ascending=False).drop_duplicates(['Sp','Mt'])
Easy solution would be to apply the idxmax() function to get indices of rows with max values.
This would filter out all the rows with max value in the group.
In [367]: df
Out[367]:
sp mt val count
0 MM1 S1 a 3
1 MM1 S1 n 2
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
5 MM2 S4 dgb 1
6 MM4 S2 rd 2
7 MM4 S2 cb 2
8 MM4 S2 uyi 7
# Apply idxmax() and use .loc() on dataframe to filter the rows with max values:
In [368]: df.loc[df.groupby(["sp", "mt"])["count"].idxmax()]
Out[368]:
sp mt val count
0 MM1 S1 a 3
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
8 MM4 S2 uyi 7
# Just to show what values are returned by .idxmax() above:
In [369]: df.groupby(["sp", "mt"])["count"].idxmax().values
Out[369]: array([0, 2, 3, 4, 8])
You may not need to do groupby(), but use both sort_values + drop_duplicates
df.sort_values('count').drop_duplicates(['Sp', 'Mt'], keep='last')
Out[190]:
Sp Mt Value count
0 MM1 S1 a 3
2 MM1 S3 cb 5
8 MM4 S2 uyi 7
3 MM2 S3 mk 8
4 MM2 S4 bg 10
Also almost same logic by using tail
df.sort_values('count').groupby(['Sp', 'Mt']).tail(1)
Out[52]:
Sp Mt Value count
0 MM1 S1 a 3
2 MM1 S3 cb 5
8 MM4 S2 uyi 7
3 MM2 S3 mk 8
4 MM2 S4 bg 10
Having tried the solution suggested by Zelazny on a relatively large DataFrame (~400k rows) I found it to be very slow. Here is an alternative that I found to run orders of magnitude faster on my data set.
df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})
df_grouped = df.groupby(['sp', 'mt']).agg({'count':'max'})
df_grouped = df_grouped.reset_index()
df_grouped = df_grouped.rename(columns={'count':'count_max'})
df = pd.merge(df, df_grouped, how='left', on=['sp', 'mt'])
df = df[df['count'] == df['count_max']]
Use groupby and idxmax methods:
transfer col date to datetime:
df['date'] = pd.to_datetime(df['date'])
get the index of max of column date, after groupyby ad_id:
idx = df.groupby(by='ad_id')['date'].idxmax()
get the wanted data:
df_max = df.loc[idx,]
ad_id price date
7 22 2 2018-06-11
6 23 2 2018-06-22
2 24 2 2018-06-30
3 28 5 2018-06-22
For me, the easiest solution would be keep value when count is equal to the maximum. Therefore, the following one line command is enough :
df[df['count'] == df.groupby(['Mt'])['count'].transform(max)]
Summarizing, there are many ways, but which one is faster?
import pandas as pd
import numpy as np
import time
df = pd.DataFrame(np.random.randint(1,10,size=(1000000, 2)), columns=list('AB'))
start_time = time.time()
df1idx = df.groupby(['A'])['B'].transform(max) == df['B']
df1 = df[df1idx]
print("---1 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df2 = df.sort_values('B').groupby(['A']).tail(1)
print("---2 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df3 = df.sort_values('B').drop_duplicates(['A'],keep='last')
print("---3 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df3b = df.sort_values('B', ascending=False).drop_duplicates(['A'])
print("---3b) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df4 = df[df['B'] == df.groupby(['A'])['B'].transform(max)]
print("---4 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
d = df.groupby('A')['B'].nlargest(1)
df5 = df.iloc[[i[1] for i in d.index], :]
print("---5 ) %s seconds ---" % (time.time() - start_time))
And the winner is...
--1 ) 0.03337574005126953 seconds ---
--2 ) 0.1346898078918457 seconds ---
--3 ) 0.10243558883666992 seconds ---
--3b) 0.1004343032836914 seconds ---
--4 ) 0.028397560119628906 seconds ---
--5 ) 0.07552886009216309 seconds ---
Try using nlargest on the groupby object. The advantage is that it returns the rows where "the nlargest item(s)" were fetched from, and we can get their index.
In this case, we want n=1 for the max and keep='all' to include duplicate maxes.
Note: we slice the last (-1) element of our index since our index in this case consist of tuples (e.g. ('MM1', 'S1', 0)).
df = pd.DataFrame({
'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'Val': ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count': [3, 2, 5, 8, 10, 1, 2, 2, 7]
})
d = df.groupby(['Sp', 'Mt'])['count'].nlargest(1, keep='all')
df.loc[[i[-1] for i in d.index]]
Sp Mt Val count
0 MM1 S1 a 3
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
8 MM4 S2 uyi 7
Realizing that "applying" "nlargest" to groupby object works just as fine:
Additional advantage - also can fetch top n values if required:
In [85]: import pandas as pd
In [86]: df = pd.DataFrame({
...: 'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
...: 'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
...: 'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
...: 'count' : [3,2,5,8,10,1,2,2,7]
...: })
## Apply nlargest(1) to find the max val df, and nlargest(n) gives top n values for df:
In [87]: df.groupby(["sp", "mt"]).apply(lambda x: x.nlargest(1, "count")).reset_index(drop=True)
Out[87]:
count mt sp val
0 3 S1 MM1 a
1 5 S3 MM1 cb
2 8 S3 MM2 mk
3 10 S4 MM2 bg
4 7 S2 MM4 uyi
If you sort your DataFrame that ordering will be preserved in the groupby. You can then just grab the first or last element and reset the index.
df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})
df.sort_values("count", ascending=False).groupby(["sp", "mt"]).first().reset_index()
Many of these are great answers, but to help show scalability, on 2.8 million rows with varying amount of duplicates shows some startling differences. The fastest for my data was the sort by then drop duplicate (drop all but last marginally faster than sort descending and drop all but first)
Sort Ascending, Drop duplicate keep last (2.22 s)
Sort Descending, Drop Duplicate keep First (2.32 s)
Transform Max within the loc function (3.73 s)
Transform Max storing IDX then using loc select as second step (3.84 s)
Groupby using Tail (8.98 s)
IDMax with groupby and then using loc select as second step (95.39 s)
IDMax with groupby within the loc select (95.74 s)
NLargest(1) then using iloc select as a second step (> 35000 s ) - did not finish after running overnight
NLargest(1) within iloc select (> 35000 s ) - did not finish after running overnight
As you can see Sort is 1/3 faster than transform and 75% faster than groupby. Everything else is up to 40x slower. In small datasets, this may not matter by much, but as you can see, this can significantly impact large datasets.
df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})
df.groupby(['sp', 'mt']).apply(lambda grp: grp.nlargest(1, 'count'))
I've been using this functional style for many group operations:
df = pd.DataFrame({
'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'Val': ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'Count': [3, 2, 5, 8, 10, 1, 2, 2, 7]
})
(df.groupby(['Sp', 'Mt'])
.apply(lambda group: group[group['Count'] == group['Count'].max()])
.reset_index(drop=True))
Sp Mt Val Count
0 MM1 S1 a 3
1 MM1 S3 cb 5
2 MM2 S3 mk 8
3 MM2 S4 bg 10
4 MM4 S2 uyi 7
.reset_index(drop=True) gets you back to the original index by dropping the group-index.
df.loc[df.groupby('mt')['count'].idxmax()]
if the df index isn't unique you may need this step df.reset_index(inplace=True) first.

How can i group colums and get the latest time? [duplicate]

How do I find all rows in a pandas DataFrame which have the max value for count column, after grouping by ['Sp','Mt'] columns?
Example 1: the following DataFrame:
Sp Mt Value count
0 MM1 S1 a **3**
1 MM1 S1 n 2
2 MM1 S3 cb **5**
3 MM2 S3 mk **8**
4 MM2 S4 bg **10**
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 2
8 MM4 S2 uyi **7**
Expected output is to get the result rows whose count is max in each group, like this:
Sp Mt Value count
0 MM1 S1 a **3**
2 MM1 S3 cb **5**
3 MM2 S3 mk **8**
4 MM2 S4 bg **10**
8 MM4 S2 uyi **7**
Example 2:
Sp Mt Value count
4 MM2 S4 bg 10
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 8
8 MM4 S2 uyi 8
Expected output:
Sp Mt Value count
4 MM2 S4 bg 10
7 MM4 S2 cb 8
8 MM4 S2 uyi 8
Firstly, we can get the max count for each group like this:
In [1]: df
Out[1]:
Sp Mt Value count
0 MM1 S1 a 3
1 MM1 S1 n 2
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 2
8 MM4 S2 uyi 7
In [2]: df.groupby(['Sp', 'Mt'])['count'].max()
Out[2]:
Sp Mt
MM1 S1 3
S3 5
MM2 S3 8
S4 10
MM4 S2 7
Name: count, dtype: int64
To get the indices of the original DF you can do:
In [3]: idx = df.groupby(['Sp', 'Mt'])['count'].transform(max) == df['count']
In [4]: df[idx]
Out[4]:
Sp Mt Value count
0 MM1 S1 a 3
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
8 MM4 S2 uyi 7
Note that if you have multiple max values per group, all will be returned.
Update
On a Hail Mary chance that this is what the OP is requesting:
In [5]: df['count_max'] = df.groupby(['Sp', 'Mt'])['count'].transform(max)
In [6]: df
Out[6]:
Sp Mt Value count count_max
0 MM1 S1 a 3 3
1 MM1 S1 n 2 3
2 MM1 S3 cb 5 5
3 MM2 S3 mk 8 8
4 MM2 S4 bg 10 10
5 MM2 S4 dgd 1 10
6 MM4 S2 rd 2 7
7 MM4 S2 cb 2 7
8 MM4 S2 uyi 7 7
You can sort the dataFrame by count and then remove duplicates. I think it's easier:
df.sort_values('count', ascending=False).drop_duplicates(['Sp','Mt'])
Easy solution would be to apply the idxmax() function to get indices of rows with max values.
This would filter out all the rows with max value in the group.
In [367]: df
Out[367]:
sp mt val count
0 MM1 S1 a 3
1 MM1 S1 n 2
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
5 MM2 S4 dgb 1
6 MM4 S2 rd 2
7 MM4 S2 cb 2
8 MM4 S2 uyi 7
# Apply idxmax() and use .loc() on dataframe to filter the rows with max values:
In [368]: df.loc[df.groupby(["sp", "mt"])["count"].idxmax()]
Out[368]:
sp mt val count
0 MM1 S1 a 3
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
8 MM4 S2 uyi 7
# Just to show what values are returned by .idxmax() above:
In [369]: df.groupby(["sp", "mt"])["count"].idxmax().values
Out[369]: array([0, 2, 3, 4, 8])
You may not need to do groupby(), but use both sort_values + drop_duplicates
df.sort_values('count').drop_duplicates(['Sp', 'Mt'], keep='last')
Out[190]:
Sp Mt Value count
0 MM1 S1 a 3
2 MM1 S3 cb 5
8 MM4 S2 uyi 7
3 MM2 S3 mk 8
4 MM2 S4 bg 10
Also almost same logic by using tail
df.sort_values('count').groupby(['Sp', 'Mt']).tail(1)
Out[52]:
Sp Mt Value count
0 MM1 S1 a 3
2 MM1 S3 cb 5
8 MM4 S2 uyi 7
3 MM2 S3 mk 8
4 MM2 S4 bg 10
Having tried the solution suggested by Zelazny on a relatively large DataFrame (~400k rows) I found it to be very slow. Here is an alternative that I found to run orders of magnitude faster on my data set.
df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})
df_grouped = df.groupby(['sp', 'mt']).agg({'count':'max'})
df_grouped = df_grouped.reset_index()
df_grouped = df_grouped.rename(columns={'count':'count_max'})
df = pd.merge(df, df_grouped, how='left', on=['sp', 'mt'])
df = df[df['count'] == df['count_max']]
Use groupby and idxmax methods:
transfer col date to datetime:
df['date'] = pd.to_datetime(df['date'])
get the index of max of column date, after groupyby ad_id:
idx = df.groupby(by='ad_id')['date'].idxmax()
get the wanted data:
df_max = df.loc[idx,]
ad_id price date
7 22 2 2018-06-11
6 23 2 2018-06-22
2 24 2 2018-06-30
3 28 5 2018-06-22
For me, the easiest solution would be keep value when count is equal to the maximum. Therefore, the following one line command is enough :
df[df['count'] == df.groupby(['Mt'])['count'].transform(max)]
Summarizing, there are many ways, but which one is faster?
import pandas as pd
import numpy as np
import time
df = pd.DataFrame(np.random.randint(1,10,size=(1000000, 2)), columns=list('AB'))
start_time = time.time()
df1idx = df.groupby(['A'])['B'].transform(max) == df['B']
df1 = df[df1idx]
print("---1 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df2 = df.sort_values('B').groupby(['A']).tail(1)
print("---2 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df3 = df.sort_values('B').drop_duplicates(['A'],keep='last')
print("---3 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df3b = df.sort_values('B', ascending=False).drop_duplicates(['A'])
print("---3b) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df4 = df[df['B'] == df.groupby(['A'])['B'].transform(max)]
print("---4 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
d = df.groupby('A')['B'].nlargest(1)
df5 = df.iloc[[i[1] for i in d.index], :]
print("---5 ) %s seconds ---" % (time.time() - start_time))
And the winner is...
--1 ) 0.03337574005126953 seconds ---
--2 ) 0.1346898078918457 seconds ---
--3 ) 0.10243558883666992 seconds ---
--3b) 0.1004343032836914 seconds ---
--4 ) 0.028397560119628906 seconds ---
--5 ) 0.07552886009216309 seconds ---
Try using nlargest on the groupby object. The advantage is that it returns the rows where "the nlargest item(s)" were fetched from, and we can get their index.
In this case, we want n=1 for the max and keep='all' to include duplicate maxes.
Note: we slice the last (-1) element of our index since our index in this case consist of tuples (e.g. ('MM1', 'S1', 0)).
df = pd.DataFrame({
'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'Val': ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count': [3, 2, 5, 8, 10, 1, 2, 2, 7]
})
d = df.groupby(['Sp', 'Mt'])['count'].nlargest(1, keep='all')
df.loc[[i[-1] for i in d.index]]
Sp Mt Val count
0 MM1 S1 a 3
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
8 MM4 S2 uyi 7
Realizing that "applying" "nlargest" to groupby object works just as fine:
Additional advantage - also can fetch top n values if required:
In [85]: import pandas as pd
In [86]: df = pd.DataFrame({
...: 'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
...: 'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
...: 'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
...: 'count' : [3,2,5,8,10,1,2,2,7]
...: })
## Apply nlargest(1) to find the max val df, and nlargest(n) gives top n values for df:
In [87]: df.groupby(["sp", "mt"]).apply(lambda x: x.nlargest(1, "count")).reset_index(drop=True)
Out[87]:
count mt sp val
0 3 S1 MM1 a
1 5 S3 MM1 cb
2 8 S3 MM2 mk
3 10 S4 MM2 bg
4 7 S2 MM4 uyi
If you sort your DataFrame that ordering will be preserved in the groupby. You can then just grab the first or last element and reset the index.
df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})
df.sort_values("count", ascending=False).groupby(["sp", "mt"]).first().reset_index()
Many of these are great answers, but to help show scalability, on 2.8 million rows with varying amount of duplicates shows some startling differences. The fastest for my data was the sort by then drop duplicate (drop all but last marginally faster than sort descending and drop all but first)
Sort Ascending, Drop duplicate keep last (2.22 s)
Sort Descending, Drop Duplicate keep First (2.32 s)
Transform Max within the loc function (3.73 s)
Transform Max storing IDX then using loc select as second step (3.84 s)
Groupby using Tail (8.98 s)
IDMax with groupby and then using loc select as second step (95.39 s)
IDMax with groupby within the loc select (95.74 s)
NLargest(1) then using iloc select as a second step (> 35000 s ) - did not finish after running overnight
NLargest(1) within iloc select (> 35000 s ) - did not finish after running overnight
As you can see Sort is 1/3 faster than transform and 75% faster than groupby. Everything else is up to 40x slower. In small datasets, this may not matter by much, but as you can see, this can significantly impact large datasets.
df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})
df.groupby(['sp', 'mt']).apply(lambda grp: grp.nlargest(1, 'count'))
I've been using this functional style for many group operations:
df = pd.DataFrame({
'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'Val': ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'Count': [3, 2, 5, 8, 10, 1, 2, 2, 7]
})
(df.groupby(['Sp', 'Mt'])
.apply(lambda group: group[group['Count'] == group['Count'].max()])
.reset_index(drop=True))
Sp Mt Val Count
0 MM1 S1 a 3
1 MM1 S3 cb 5
2 MM2 S3 mk 8
3 MM2 S4 bg 10
4 MM4 S2 uyi 7
.reset_index(drop=True) gets you back to the original index by dropping the group-index.
df.loc[df.groupby('mt')['count'].idxmax()]
if the df index isn't unique you may need this step df.reset_index(inplace=True) first.

How to find maximum within dataset from two grouped columns [duplicate]

How do I find all rows in a pandas DataFrame which have the max value for count column, after grouping by ['Sp','Mt'] columns?
Example 1: the following DataFrame:
Sp Mt Value count
0 MM1 S1 a **3**
1 MM1 S1 n 2
2 MM1 S3 cb **5**
3 MM2 S3 mk **8**
4 MM2 S4 bg **10**
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 2
8 MM4 S2 uyi **7**
Expected output is to get the result rows whose count is max in each group, like this:
Sp Mt Value count
0 MM1 S1 a **3**
2 MM1 S3 cb **5**
3 MM2 S3 mk **8**
4 MM2 S4 bg **10**
8 MM4 S2 uyi **7**
Example 2:
Sp Mt Value count
4 MM2 S4 bg 10
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 8
8 MM4 S2 uyi 8
Expected output:
Sp Mt Value count
4 MM2 S4 bg 10
7 MM4 S2 cb 8
8 MM4 S2 uyi 8
Firstly, we can get the max count for each group like this:
In [1]: df
Out[1]:
Sp Mt Value count
0 MM1 S1 a 3
1 MM1 S1 n 2
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 2
8 MM4 S2 uyi 7
In [2]: df.groupby(['Sp', 'Mt'])['count'].max()
Out[2]:
Sp Mt
MM1 S1 3
S3 5
MM2 S3 8
S4 10
MM4 S2 7
Name: count, dtype: int64
To get the indices of the original DF you can do:
In [3]: idx = df.groupby(['Sp', 'Mt'])['count'].transform(max) == df['count']
In [4]: df[idx]
Out[4]:
Sp Mt Value count
0 MM1 S1 a 3
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
8 MM4 S2 uyi 7
Note that if you have multiple max values per group, all will be returned.
Update
On a Hail Mary chance that this is what the OP is requesting:
In [5]: df['count_max'] = df.groupby(['Sp', 'Mt'])['count'].transform(max)
In [6]: df
Out[6]:
Sp Mt Value count count_max
0 MM1 S1 a 3 3
1 MM1 S1 n 2 3
2 MM1 S3 cb 5 5
3 MM2 S3 mk 8 8
4 MM2 S4 bg 10 10
5 MM2 S4 dgd 1 10
6 MM4 S2 rd 2 7
7 MM4 S2 cb 2 7
8 MM4 S2 uyi 7 7
You can sort the dataFrame by count and then remove duplicates. I think it's easier:
df.sort_values('count', ascending=False).drop_duplicates(['Sp','Mt'])
Easy solution would be to apply the idxmax() function to get indices of rows with max values.
This would filter out all the rows with max value in the group.
In [367]: df
Out[367]:
sp mt val count
0 MM1 S1 a 3
1 MM1 S1 n 2
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
5 MM2 S4 dgb 1
6 MM4 S2 rd 2
7 MM4 S2 cb 2
8 MM4 S2 uyi 7
# Apply idxmax() and use .loc() on dataframe to filter the rows with max values:
In [368]: df.loc[df.groupby(["sp", "mt"])["count"].idxmax()]
Out[368]:
sp mt val count
0 MM1 S1 a 3
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
8 MM4 S2 uyi 7
# Just to show what values are returned by .idxmax() above:
In [369]: df.groupby(["sp", "mt"])["count"].idxmax().values
Out[369]: array([0, 2, 3, 4, 8])
You may not need to do groupby(), but use both sort_values + drop_duplicates
df.sort_values('count').drop_duplicates(['Sp', 'Mt'], keep='last')
Out[190]:
Sp Mt Value count
0 MM1 S1 a 3
2 MM1 S3 cb 5
8 MM4 S2 uyi 7
3 MM2 S3 mk 8
4 MM2 S4 bg 10
Also almost same logic by using tail
df.sort_values('count').groupby(['Sp', 'Mt']).tail(1)
Out[52]:
Sp Mt Value count
0 MM1 S1 a 3
2 MM1 S3 cb 5
8 MM4 S2 uyi 7
3 MM2 S3 mk 8
4 MM2 S4 bg 10
Having tried the solution suggested by Zelazny on a relatively large DataFrame (~400k rows) I found it to be very slow. Here is an alternative that I found to run orders of magnitude faster on my data set.
df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})
df_grouped = df.groupby(['sp', 'mt']).agg({'count':'max'})
df_grouped = df_grouped.reset_index()
df_grouped = df_grouped.rename(columns={'count':'count_max'})
df = pd.merge(df, df_grouped, how='left', on=['sp', 'mt'])
df = df[df['count'] == df['count_max']]
Use groupby and idxmax methods:
transfer col date to datetime:
df['date'] = pd.to_datetime(df['date'])
get the index of max of column date, after groupyby ad_id:
idx = df.groupby(by='ad_id')['date'].idxmax()
get the wanted data:
df_max = df.loc[idx,]
ad_id price date
7 22 2 2018-06-11
6 23 2 2018-06-22
2 24 2 2018-06-30
3 28 5 2018-06-22
For me, the easiest solution would be keep value when count is equal to the maximum. Therefore, the following one line command is enough :
df[df['count'] == df.groupby(['Mt'])['count'].transform(max)]
Summarizing, there are many ways, but which one is faster?
import pandas as pd
import numpy as np
import time
df = pd.DataFrame(np.random.randint(1,10,size=(1000000, 2)), columns=list('AB'))
start_time = time.time()
df1idx = df.groupby(['A'])['B'].transform(max) == df['B']
df1 = df[df1idx]
print("---1 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df2 = df.sort_values('B').groupby(['A']).tail(1)
print("---2 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df3 = df.sort_values('B').drop_duplicates(['A'],keep='last')
print("---3 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df3b = df.sort_values('B', ascending=False).drop_duplicates(['A'])
print("---3b) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df4 = df[df['B'] == df.groupby(['A'])['B'].transform(max)]
print("---4 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
d = df.groupby('A')['B'].nlargest(1)
df5 = df.iloc[[i[1] for i in d.index], :]
print("---5 ) %s seconds ---" % (time.time() - start_time))
And the winner is...
--1 ) 0.03337574005126953 seconds ---
--2 ) 0.1346898078918457 seconds ---
--3 ) 0.10243558883666992 seconds ---
--3b) 0.1004343032836914 seconds ---
--4 ) 0.028397560119628906 seconds ---
--5 ) 0.07552886009216309 seconds ---
Try using nlargest on the groupby object. The advantage is that it returns the rows where "the nlargest item(s)" were fetched from, and we can get their index.
In this case, we want n=1 for the max and keep='all' to include duplicate maxes.
Note: we slice the last (-1) element of our index since our index in this case consist of tuples (e.g. ('MM1', 'S1', 0)).
df = pd.DataFrame({
'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'Val': ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count': [3, 2, 5, 8, 10, 1, 2, 2, 7]
})
d = df.groupby(['Sp', 'Mt'])['count'].nlargest(1, keep='all')
df.loc[[i[-1] for i in d.index]]
Sp Mt Val count
0 MM1 S1 a 3
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
8 MM4 S2 uyi 7
Realizing that "applying" "nlargest" to groupby object works just as fine:
Additional advantage - also can fetch top n values if required:
In [85]: import pandas as pd
In [86]: df = pd.DataFrame({
...: 'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
...: 'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
...: 'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
...: 'count' : [3,2,5,8,10,1,2,2,7]
...: })
## Apply nlargest(1) to find the max val df, and nlargest(n) gives top n values for df:
In [87]: df.groupby(["sp", "mt"]).apply(lambda x: x.nlargest(1, "count")).reset_index(drop=True)
Out[87]:
count mt sp val
0 3 S1 MM1 a
1 5 S3 MM1 cb
2 8 S3 MM2 mk
3 10 S4 MM2 bg
4 7 S2 MM4 uyi
If you sort your DataFrame that ordering will be preserved in the groupby. You can then just grab the first or last element and reset the index.
df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})
df.sort_values("count", ascending=False).groupby(["sp", "mt"]).first().reset_index()
Many of these are great answers, but to help show scalability, on 2.8 million rows with varying amount of duplicates shows some startling differences. The fastest for my data was the sort by then drop duplicate (drop all but last marginally faster than sort descending and drop all but first)
Sort Ascending, Drop duplicate keep last (2.22 s)
Sort Descending, Drop Duplicate keep First (2.32 s)
Transform Max within the loc function (3.73 s)
Transform Max storing IDX then using loc select as second step (3.84 s)
Groupby using Tail (8.98 s)
IDMax with groupby and then using loc select as second step (95.39 s)
IDMax with groupby within the loc select (95.74 s)
NLargest(1) then using iloc select as a second step (> 35000 s ) - did not finish after running overnight
NLargest(1) within iloc select (> 35000 s ) - did not finish after running overnight
As you can see Sort is 1/3 faster than transform and 75% faster than groupby. Everything else is up to 40x slower. In small datasets, this may not matter by much, but as you can see, this can significantly impact large datasets.
df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})
df.groupby(['sp', 'mt']).apply(lambda grp: grp.nlargest(1, 'count'))
I've been using this functional style for many group operations:
df = pd.DataFrame({
'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'Val': ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'Count': [3, 2, 5, 8, 10, 1, 2, 2, 7]
})
(df.groupby(['Sp', 'Mt'])
.apply(lambda group: group[group['Count'] == group['Count'].max()])
.reset_index(drop=True))
Sp Mt Val Count
0 MM1 S1 a 3
1 MM1 S3 cb 5
2 MM2 S3 mk 8
3 MM2 S4 bg 10
4 MM4 S2 uyi 7
.reset_index(drop=True) gets you back to the original index by dropping the group-index.
df.loc[df.groupby('mt')['count'].idxmax()]
if the df index isn't unique you may need this step df.reset_index(inplace=True) first.

Filter groupby by max and recombine into new df [duplicate]

How do I find all rows in a pandas DataFrame which have the max value for count column, after grouping by ['Sp','Mt'] columns?
Example 1: the following DataFrame:
Sp Mt Value count
0 MM1 S1 a **3**
1 MM1 S1 n 2
2 MM1 S3 cb **5**
3 MM2 S3 mk **8**
4 MM2 S4 bg **10**
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 2
8 MM4 S2 uyi **7**
Expected output is to get the result rows whose count is max in each group, like this:
Sp Mt Value count
0 MM1 S1 a **3**
2 MM1 S3 cb **5**
3 MM2 S3 mk **8**
4 MM2 S4 bg **10**
8 MM4 S2 uyi **7**
Example 2:
Sp Mt Value count
4 MM2 S4 bg 10
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 8
8 MM4 S2 uyi 8
Expected output:
Sp Mt Value count
4 MM2 S4 bg 10
7 MM4 S2 cb 8
8 MM4 S2 uyi 8
Firstly, we can get the max count for each group like this:
In [1]: df
Out[1]:
Sp Mt Value count
0 MM1 S1 a 3
1 MM1 S1 n 2
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 2
8 MM4 S2 uyi 7
In [2]: df.groupby(['Sp', 'Mt'])['count'].max()
Out[2]:
Sp Mt
MM1 S1 3
S3 5
MM2 S3 8
S4 10
MM4 S2 7
Name: count, dtype: int64
To get the indices of the original DF you can do:
In [3]: idx = df.groupby(['Sp', 'Mt'])['count'].transform(max) == df['count']
In [4]: df[idx]
Out[4]:
Sp Mt Value count
0 MM1 S1 a 3
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
8 MM4 S2 uyi 7
Note that if you have multiple max values per group, all will be returned.
Update
On a Hail Mary chance that this is what the OP is requesting:
In [5]: df['count_max'] = df.groupby(['Sp', 'Mt'])['count'].transform(max)
In [6]: df
Out[6]:
Sp Mt Value count count_max
0 MM1 S1 a 3 3
1 MM1 S1 n 2 3
2 MM1 S3 cb 5 5
3 MM2 S3 mk 8 8
4 MM2 S4 bg 10 10
5 MM2 S4 dgd 1 10
6 MM4 S2 rd 2 7
7 MM4 S2 cb 2 7
8 MM4 S2 uyi 7 7
You can sort the dataFrame by count and then remove duplicates. I think it's easier:
df.sort_values('count', ascending=False).drop_duplicates(['Sp','Mt'])
Easy solution would be to apply the idxmax() function to get indices of rows with max values.
This would filter out all the rows with max value in the group.
In [367]: df
Out[367]:
sp mt val count
0 MM1 S1 a 3
1 MM1 S1 n 2
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
5 MM2 S4 dgb 1
6 MM4 S2 rd 2
7 MM4 S2 cb 2
8 MM4 S2 uyi 7
# Apply idxmax() and use .loc() on dataframe to filter the rows with max values:
In [368]: df.loc[df.groupby(["sp", "mt"])["count"].idxmax()]
Out[368]:
sp mt val count
0 MM1 S1 a 3
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
8 MM4 S2 uyi 7
# Just to show what values are returned by .idxmax() above:
In [369]: df.groupby(["sp", "mt"])["count"].idxmax().values
Out[369]: array([0, 2, 3, 4, 8])
You may not need to do groupby(), but use both sort_values + drop_duplicates
df.sort_values('count').drop_duplicates(['Sp', 'Mt'], keep='last')
Out[190]:
Sp Mt Value count
0 MM1 S1 a 3
2 MM1 S3 cb 5
8 MM4 S2 uyi 7
3 MM2 S3 mk 8
4 MM2 S4 bg 10
Also almost same logic by using tail
df.sort_values('count').groupby(['Sp', 'Mt']).tail(1)
Out[52]:
Sp Mt Value count
0 MM1 S1 a 3
2 MM1 S3 cb 5
8 MM4 S2 uyi 7
3 MM2 S3 mk 8
4 MM2 S4 bg 10
Having tried the solution suggested by Zelazny on a relatively large DataFrame (~400k rows) I found it to be very slow. Here is an alternative that I found to run orders of magnitude faster on my data set.
df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})
df_grouped = df.groupby(['sp', 'mt']).agg({'count':'max'})
df_grouped = df_grouped.reset_index()
df_grouped = df_grouped.rename(columns={'count':'count_max'})
df = pd.merge(df, df_grouped, how='left', on=['sp', 'mt'])
df = df[df['count'] == df['count_max']]
Use groupby and idxmax methods:
transfer col date to datetime:
df['date'] = pd.to_datetime(df['date'])
get the index of max of column date, after groupyby ad_id:
idx = df.groupby(by='ad_id')['date'].idxmax()
get the wanted data:
df_max = df.loc[idx,]
ad_id price date
7 22 2 2018-06-11
6 23 2 2018-06-22
2 24 2 2018-06-30
3 28 5 2018-06-22
For me, the easiest solution would be keep value when count is equal to the maximum. Therefore, the following one line command is enough :
df[df['count'] == df.groupby(['Mt'])['count'].transform(max)]
Summarizing, there are many ways, but which one is faster?
import pandas as pd
import numpy as np
import time
df = pd.DataFrame(np.random.randint(1,10,size=(1000000, 2)), columns=list('AB'))
start_time = time.time()
df1idx = df.groupby(['A'])['B'].transform(max) == df['B']
df1 = df[df1idx]
print("---1 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df2 = df.sort_values('B').groupby(['A']).tail(1)
print("---2 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df3 = df.sort_values('B').drop_duplicates(['A'],keep='last')
print("---3 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df3b = df.sort_values('B', ascending=False).drop_duplicates(['A'])
print("---3b) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df4 = df[df['B'] == df.groupby(['A'])['B'].transform(max)]
print("---4 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
d = df.groupby('A')['B'].nlargest(1)
df5 = df.iloc[[i[1] for i in d.index], :]
print("---5 ) %s seconds ---" % (time.time() - start_time))
And the winner is...
--1 ) 0.03337574005126953 seconds ---
--2 ) 0.1346898078918457 seconds ---
--3 ) 0.10243558883666992 seconds ---
--3b) 0.1004343032836914 seconds ---
--4 ) 0.028397560119628906 seconds ---
--5 ) 0.07552886009216309 seconds ---
Try using nlargest on the groupby object. The advantage is that it returns the rows where "the nlargest item(s)" were fetched from, and we can get their index.
In this case, we want n=1 for the max and keep='all' to include duplicate maxes.
Note: we slice the last (-1) element of our index since our index in this case consist of tuples (e.g. ('MM1', 'S1', 0)).
df = pd.DataFrame({
'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'Val': ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count': [3, 2, 5, 8, 10, 1, 2, 2, 7]
})
d = df.groupby(['Sp', 'Mt'])['count'].nlargest(1, keep='all')
df.loc[[i[-1] for i in d.index]]
Sp Mt Val count
0 MM1 S1 a 3
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
8 MM4 S2 uyi 7
Realizing that "applying" "nlargest" to groupby object works just as fine:
Additional advantage - also can fetch top n values if required:
In [85]: import pandas as pd
In [86]: df = pd.DataFrame({
...: 'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
...: 'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
...: 'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
...: 'count' : [3,2,5,8,10,1,2,2,7]
...: })
## Apply nlargest(1) to find the max val df, and nlargest(n) gives top n values for df:
In [87]: df.groupby(["sp", "mt"]).apply(lambda x: x.nlargest(1, "count")).reset_index(drop=True)
Out[87]:
count mt sp val
0 3 S1 MM1 a
1 5 S3 MM1 cb
2 8 S3 MM2 mk
3 10 S4 MM2 bg
4 7 S2 MM4 uyi
If you sort your DataFrame that ordering will be preserved in the groupby. You can then just grab the first or last element and reset the index.
df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})
df.sort_values("count", ascending=False).groupby(["sp", "mt"]).first().reset_index()
Many of these are great answers, but to help show scalability, on 2.8 million rows with varying amount of duplicates shows some startling differences. The fastest for my data was the sort by then drop duplicate (drop all but last marginally faster than sort descending and drop all but first)
Sort Ascending, Drop duplicate keep last (2.22 s)
Sort Descending, Drop Duplicate keep First (2.32 s)
Transform Max within the loc function (3.73 s)
Transform Max storing IDX then using loc select as second step (3.84 s)
Groupby using Tail (8.98 s)
IDMax with groupby and then using loc select as second step (95.39 s)
IDMax with groupby within the loc select (95.74 s)
NLargest(1) then using iloc select as a second step (> 35000 s ) - did not finish after running overnight
NLargest(1) within iloc select (> 35000 s ) - did not finish after running overnight
As you can see Sort is 1/3 faster than transform and 75% faster than groupby. Everything else is up to 40x slower. In small datasets, this may not matter by much, but as you can see, this can significantly impact large datasets.
df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})
df.groupby(['sp', 'mt']).apply(lambda grp: grp.nlargest(1, 'count'))
I've been using this functional style for many group operations:
df = pd.DataFrame({
'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'Val': ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'Count': [3, 2, 5, 8, 10, 1, 2, 2, 7]
})
(df.groupby(['Sp', 'Mt'])
.apply(lambda group: group[group['Count'] == group['Count'].max()])
.reset_index(drop=True))
Sp Mt Val Count
0 MM1 S1 a 3
1 MM1 S3 cb 5
2 MM2 S3 mk 8
3 MM2 S4 bg 10
4 MM4 S2 uyi 7
.reset_index(drop=True) gets you back to the original index by dropping the group-index.
df.loc[df.groupby('mt')['count'].idxmax()]
if the df index isn't unique you may need this step df.reset_index(inplace=True) first.

how to display matching row in pandas groupby [duplicate]

How do I find all rows in a pandas DataFrame which have the max value for count column, after grouping by ['Sp','Mt'] columns?
Example 1: the following DataFrame:
Sp Mt Value count
0 MM1 S1 a **3**
1 MM1 S1 n 2
2 MM1 S3 cb **5**
3 MM2 S3 mk **8**
4 MM2 S4 bg **10**
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 2
8 MM4 S2 uyi **7**
Expected output is to get the result rows whose count is max in each group, like this:
Sp Mt Value count
0 MM1 S1 a **3**
2 MM1 S3 cb **5**
3 MM2 S3 mk **8**
4 MM2 S4 bg **10**
8 MM4 S2 uyi **7**
Example 2:
Sp Mt Value count
4 MM2 S4 bg 10
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 8
8 MM4 S2 uyi 8
Expected output:
Sp Mt Value count
4 MM2 S4 bg 10
7 MM4 S2 cb 8
8 MM4 S2 uyi 8
Firstly, we can get the max count for each group like this:
In [1]: df
Out[1]:
Sp Mt Value count
0 MM1 S1 a 3
1 MM1 S1 n 2
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
5 MM2 S4 dgd 1
6 MM4 S2 rd 2
7 MM4 S2 cb 2
8 MM4 S2 uyi 7
In [2]: df.groupby(['Sp', 'Mt'])['count'].max()
Out[2]:
Sp Mt
MM1 S1 3
S3 5
MM2 S3 8
S4 10
MM4 S2 7
Name: count, dtype: int64
To get the indices of the original DF you can do:
In [3]: idx = df.groupby(['Sp', 'Mt'])['count'].transform(max) == df['count']
In [4]: df[idx]
Out[4]:
Sp Mt Value count
0 MM1 S1 a 3
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
8 MM4 S2 uyi 7
Note that if you have multiple max values per group, all will be returned.
Update
On a Hail Mary chance that this is what the OP is requesting:
In [5]: df['count_max'] = df.groupby(['Sp', 'Mt'])['count'].transform(max)
In [6]: df
Out[6]:
Sp Mt Value count count_max
0 MM1 S1 a 3 3
1 MM1 S1 n 2 3
2 MM1 S3 cb 5 5
3 MM2 S3 mk 8 8
4 MM2 S4 bg 10 10
5 MM2 S4 dgd 1 10
6 MM4 S2 rd 2 7
7 MM4 S2 cb 2 7
8 MM4 S2 uyi 7 7
You can sort the dataFrame by count and then remove duplicates. I think it's easier:
df.sort_values('count', ascending=False).drop_duplicates(['Sp','Mt'])
Easy solution would be to apply the idxmax() function to get indices of rows with max values.
This would filter out all the rows with max value in the group.
In [367]: df
Out[367]:
sp mt val count
0 MM1 S1 a 3
1 MM1 S1 n 2
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
5 MM2 S4 dgb 1
6 MM4 S2 rd 2
7 MM4 S2 cb 2
8 MM4 S2 uyi 7
# Apply idxmax() and use .loc() on dataframe to filter the rows with max values:
In [368]: df.loc[df.groupby(["sp", "mt"])["count"].idxmax()]
Out[368]:
sp mt val count
0 MM1 S1 a 3
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
8 MM4 S2 uyi 7
# Just to show what values are returned by .idxmax() above:
In [369]: df.groupby(["sp", "mt"])["count"].idxmax().values
Out[369]: array([0, 2, 3, 4, 8])
You may not need to do groupby(), but use both sort_values + drop_duplicates
df.sort_values('count').drop_duplicates(['Sp', 'Mt'], keep='last')
Out[190]:
Sp Mt Value count
0 MM1 S1 a 3
2 MM1 S3 cb 5
8 MM4 S2 uyi 7
3 MM2 S3 mk 8
4 MM2 S4 bg 10
Also almost same logic by using tail
df.sort_values('count').groupby(['Sp', 'Mt']).tail(1)
Out[52]:
Sp Mt Value count
0 MM1 S1 a 3
2 MM1 S3 cb 5
8 MM4 S2 uyi 7
3 MM2 S3 mk 8
4 MM2 S4 bg 10
Having tried the solution suggested by Zelazny on a relatively large DataFrame (~400k rows) I found it to be very slow. Here is an alternative that I found to run orders of magnitude faster on my data set.
df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})
df_grouped = df.groupby(['sp', 'mt']).agg({'count':'max'})
df_grouped = df_grouped.reset_index()
df_grouped = df_grouped.rename(columns={'count':'count_max'})
df = pd.merge(df, df_grouped, how='left', on=['sp', 'mt'])
df = df[df['count'] == df['count_max']]
Use groupby and idxmax methods:
transfer col date to datetime:
df['date'] = pd.to_datetime(df['date'])
get the index of max of column date, after groupyby ad_id:
idx = df.groupby(by='ad_id')['date'].idxmax()
get the wanted data:
df_max = df.loc[idx,]
ad_id price date
7 22 2 2018-06-11
6 23 2 2018-06-22
2 24 2 2018-06-30
3 28 5 2018-06-22
For me, the easiest solution would be keep value when count is equal to the maximum. Therefore, the following one line command is enough :
df[df['count'] == df.groupby(['Mt'])['count'].transform(max)]
Summarizing, there are many ways, but which one is faster?
import pandas as pd
import numpy as np
import time
df = pd.DataFrame(np.random.randint(1,10,size=(1000000, 2)), columns=list('AB'))
start_time = time.time()
df1idx = df.groupby(['A'])['B'].transform(max) == df['B']
df1 = df[df1idx]
print("---1 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df2 = df.sort_values('B').groupby(['A']).tail(1)
print("---2 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df3 = df.sort_values('B').drop_duplicates(['A'],keep='last')
print("---3 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df3b = df.sort_values('B', ascending=False).drop_duplicates(['A'])
print("---3b) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
df4 = df[df['B'] == df.groupby(['A'])['B'].transform(max)]
print("---4 ) %s seconds ---" % (time.time() - start_time))
start_time = time.time()
d = df.groupby('A')['B'].nlargest(1)
df5 = df.iloc[[i[1] for i in d.index], :]
print("---5 ) %s seconds ---" % (time.time() - start_time))
And the winner is...
--1 ) 0.03337574005126953 seconds ---
--2 ) 0.1346898078918457 seconds ---
--3 ) 0.10243558883666992 seconds ---
--3b) 0.1004343032836914 seconds ---
--4 ) 0.028397560119628906 seconds ---
--5 ) 0.07552886009216309 seconds ---
Try using nlargest on the groupby object. The advantage is that it returns the rows where "the nlargest item(s)" were fetched from, and we can get their index.
In this case, we want n=1 for the max and keep='all' to include duplicate maxes.
Note: we slice the last (-1) element of our index since our index in this case consist of tuples (e.g. ('MM1', 'S1', 0)).
df = pd.DataFrame({
'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'Val': ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count': [3, 2, 5, 8, 10, 1, 2, 2, 7]
})
d = df.groupby(['Sp', 'Mt'])['count'].nlargest(1, keep='all')
df.loc[[i[-1] for i in d.index]]
Sp Mt Val count
0 MM1 S1 a 3
2 MM1 S3 cb 5
3 MM2 S3 mk 8
4 MM2 S4 bg 10
8 MM4 S2 uyi 7
Realizing that "applying" "nlargest" to groupby object works just as fine:
Additional advantage - also can fetch top n values if required:
In [85]: import pandas as pd
In [86]: df = pd.DataFrame({
...: 'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
...: 'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
...: 'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
...: 'count' : [3,2,5,8,10,1,2,2,7]
...: })
## Apply nlargest(1) to find the max val df, and nlargest(n) gives top n values for df:
In [87]: df.groupby(["sp", "mt"]).apply(lambda x: x.nlargest(1, "count")).reset_index(drop=True)
Out[87]:
count mt sp val
0 3 S1 MM1 a
1 5 S3 MM1 cb
2 8 S3 MM2 mk
3 10 S4 MM2 bg
4 7 S2 MM4 uyi
If you sort your DataFrame that ordering will be preserved in the groupby. You can then just grab the first or last element and reset the index.
df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})
df.sort_values("count", ascending=False).groupby(["sp", "mt"]).first().reset_index()
Many of these are great answers, but to help show scalability, on 2.8 million rows with varying amount of duplicates shows some startling differences. The fastest for my data was the sort by then drop duplicate (drop all but last marginally faster than sort descending and drop all but first)
Sort Ascending, Drop duplicate keep last (2.22 s)
Sort Descending, Drop Duplicate keep First (2.32 s)
Transform Max within the loc function (3.73 s)
Transform Max storing IDX then using loc select as second step (3.84 s)
Groupby using Tail (8.98 s)
IDMax with groupby and then using loc select as second step (95.39 s)
IDMax with groupby within the loc select (95.74 s)
NLargest(1) then using iloc select as a second step (> 35000 s ) - did not finish after running overnight
NLargest(1) within iloc select (> 35000 s ) - did not finish after running overnight
As you can see Sort is 1/3 faster than transform and 75% faster than groupby. Everything else is up to 40x slower. In small datasets, this may not matter by much, but as you can see, this can significantly impact large datasets.
df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})
df.groupby(['sp', 'mt']).apply(lambda grp: grp.nlargest(1, 'count'))
I've been using this functional style for many group operations:
df = pd.DataFrame({
'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'Val': ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'Count': [3, 2, 5, 8, 10, 1, 2, 2, 7]
})
(df.groupby(['Sp', 'Mt'])
.apply(lambda group: group[group['Count'] == group['Count'].max()])
.reset_index(drop=True))
Sp Mt Val Count
0 MM1 S1 a 3
1 MM1 S3 cb 5
2 MM2 S3 mk 8
3 MM2 S4 bg 10
4 MM4 S2 uyi 7
.reset_index(drop=True) gets you back to the original index by dropping the group-index.
df.loc[df.groupby('mt')['count'].idxmax()]
if the df index isn't unique you may need this step df.reset_index(inplace=True) first.

Categories

Resources