I want to group my values together so that the max sum of two values comes to a certain value (here 6).
For example, I want to put together (1+5), (3+3), (4+1) and leave the rest by themselves. For this, I need to be able to search for a row that satisfies a combination of conditions, and skip it if no such row exists. In the "Grouped" column I keep track of whether a row has already been grouped; if so, it should be left alone, since an index can only be grouped once.
I have:
df_1 = pd.DataFrame({'Rest_after_division': [1, 2, 3, 4, 5, 5, 5],
                     'Grouped_with_index': ["-", "-", "-", "-", "-", "-", "-"],
                     'Grouped': [0, 0, 0, 0, 0, 0, 0]})
Rest_after_division Grouped_with_index Grouped
0 1 - 0
1 2 - 0
2 3 - 0
3 4 - 0
4 5 - 0
5 5 - 0
6 5 - 0
I want:
Rest_after_division Grouped_with_index Grouped
0 1 4 1
1 2 3 1
2 3 - 0
3 4 1 1
4 5 0 1
5 5 - 0
6 5 - 0
I have example 2:
df_1 = pd.DataFrame({'Rest_after_division': [1, 1, 1, 4, 5, 5, 5],
                     'Grouped_with_index': ["-", "-", "-", "-", "-", "-", "-"],
                     'Grouped': [0, 0, 0, 0, 0, 0, 0]})
Rest_after_division Grouped_with_index Grouped
0 1 - 0
1 1 - 0
2 1 - 0
3 4 - 0
4 5 - 0
5 5 - 0
6 5 - 0
I want example 2:
Rest_after_division Grouped_with_index Grouped
0 1 4 1
1 1 5 1
2 1 6 1
3 4 - 0
4 5 0 1
5 5 1 1
6 5 2 1
I have tried (I know I need to loop this eventually, but I can't get the index):
df_1 = df_1.sort_values('Grouped')
index_group_buddy= df_1[df_1['Rest_after_division']==5].head(1).index[0]
print(index_group_buddy)
This almost works, but not when no row matches the condition. How do I skip that case? I also think it will be problematic once all rows are grouped.
I have also tried:
#index_group_buddy = df_1.loc[((df_1['Rest_after_division'] == 5) & (df_1['Grouped'] != 1)) ].idxmin(axis=1)
#index_group_buddy =df_1.query("Rest_after_division==5 and Grouped!=1")
index_group_buddy = df_1[(df_1['Rest_after_division']==5) & (df_1['Grouped']!=1)].index[0]
df_1.at[index_group_buddy, 'Grouped'] = 1
df_1.at[index_group_buddy, 'Grouped_with_index'] = index_group_buddy
print(index_group_buddy)
I want to find the first index that has the right conditions.
Rework your df_1 to map each unique "Rest_after_division" value to its first index.
Map the complement to 6 onto those keys.
Compute which rows should be grouped (the complement is not the row itself, and the row is the first of its value group).
Insert the values using a mask.
keys = (df_1['Rest_after_division']
        .drop_duplicates()
        .reset_index()
        .set_index('Rest_after_division')
        ['index']
        )
compl_index = (6 - df_1['Rest_after_division']).map(keys)
df_1['Grouped'] = (compl_index.ne(df_1.index)
                   & df_1.groupby('Rest_after_division').cumcount().eq(0)
                   ).astype(int)
df_1['Grouped_with_index'] = compl_index.where(df_1['Grouped'].eq(1),
                                               df_1['Grouped_with_index'])
output:
Rest_after_division Grouped_with_index Grouped
0 1 4 1
1 2 3 1
2 3 - 0
3 4 1 1
4 5 0 1
5 5 - 0
6 5 - 0
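If each row may only be paired once and several rows share the same value (as in example 2), a plain loop may be easier to reason about than the vectorized mapping. The following is only a minimal sketch under that assumption: the helper name pair_up and its target parameter are made up for illustration, and it pairs each ungrouped row with the first still-ungrouped row holding the exact complement to 6, skipping the row when no such partner exists.

```python
import pandas as pd


def pair_up(values, target=6):
    """Hypothetical helper: greedily pair rows whose values sum to `target`."""
    df = pd.DataFrame({
        'Rest_after_division': values,
        'Grouped_with_index': ['-'] * len(values),
        'Grouped': [0] * len(values),
    })
    for i in df.index:
        if df.at[i, 'Grouped'] == 1:
            continue  # this row already has a partner
        wanted = target - df.at[i, 'Rest_after_division']
        # Candidate partners: right value, not grouped yet, not the row itself.
        mask = (
            (df['Rest_after_division'] == wanted)
            & (df['Grouped'] == 0)
            & (df.index != i)
        )
        candidates = df[mask].index
        if candidates.empty:
            continue  # no partner available, leave the row ungrouped
        j = int(candidates[0])  # first matching index
        df.loc[[i, j], 'Grouped'] = 1
        df.at[i, 'Grouped_with_index'] = j
        df.at[j, 'Grouped_with_index'] = i
    return df


print(pair_up([1, 1, 1, 4, 5, 5, 5]))
```

The candidates.empty check is what avoids the IndexError that .index[0] raises when nothing matches. The loop is quadratic in the number of rows, so it is meant for small frames.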
I am trying to handle the following dataframe
df = pd.DataFrame({'ID':[1,1,2,2,3,3,3,4,4,4,4],
'sum':[1,2,1,2,1,2,3,1,2,3,4,]})
Now I want to find the difference from the last row by each ID.
Specifically, I tried this code.
df['diff'] = df.groupby('ID')['sum'].diff(-1)
df
However, this computes the difference from the next row rather than from the last row of each group.
Is there any way to compute the difference from the last row of each group with groupby?
Thank you for your help.
You can use transform('last') to get the last value per group:
df['diff'] = df['sum'].sub(df.groupby('ID')['sum'].transform('last'))
or using groupby.apply:
df['diff'] = df.groupby('ID', group_keys=False)['sum'].apply(lambda x: x - x.iloc[-1])
output:
ID sum diff
0 1 1 -1
1 1 2 0
2 2 1 -1
3 2 2 0
4 3 1 -2
5 3 2 -1
6 3 3 0
7 4 1 -3
8 4 2 -2
9 4 3 -1
10 4 4 0
I'm using groupby on a pandas dataframe to drop all rows that don't have the minimum of a specific column. Something like this:
df1 = df.groupby("item", as_index=False)["diff"].min()
However, if I have more than those two columns, the other columns (e.g. otherstuff in my example) get dropped. Can I keep those columns using groupby, or am I going to have to find a different way to drop the rows?
My data looks like:
item diff otherstuff
0 1 2 1
1 1 1 2
2 1 3 7
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
and should end up like:
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
but what I'm getting is:
item diff
0 1 1
1 2 -6
2 3 0
I've been looking through the documentation and can't find anything. I tried:
df1 = df.groupby(["item", "otherstuff"], as_index=False)["diff"].min()
df1 = df.groupby("item", as_index=False)["diff"].min()["otherstuff"]
df1 = df.groupby("item", as_index=False)["otherstuff", "diff"].min()
But none of those work (I realized with the last one that the syntax is meant for aggregating after a group is created).
Method #1: use idxmin() to get the indices of the elements of minimum diff, and then select those:
>>> df.loc[df.groupby("item")["diff"].idxmin()]
item diff otherstuff
1 1 1 2
6 2 -6 2
7 3 0 0
[3 rows x 3 columns]
Method #2: sort by diff, and then take the first element in each item group:
>>> df.sort_values("diff").groupby("item", as_index=False).first()
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
[3 rows x 3 columns]
Note that the resulting indices are different even though the row content is the same.
You can use DataFrame.sort_values with DataFrame.drop_duplicates:
df = df.sort_values(by='diff').drop_duplicates(subset='item')
print (df)
item diff otherstuff
6 2 -6 2
7 3 0 0
1 1 1 2
If a group can have multiple minimal values and you want all of the min rows, use boolean indexing with transform to get the minimal value per group:
print (df)
item diff otherstuff
0 1 2 1
1 1 1 2 <-multiple min
2 1 1 7 <-multiple min
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
print (df.groupby("item")["diff"].transform('min'))
0 1
1 1
2 1
3 -6
4 -6
5 -6
6 -6
7 0
8 0
Name: diff, dtype: int64
df = df[df.groupby("item")["diff"].transform('min') == df['diff']]
print (df)
item diff otherstuff
1 1 1 2
2 1 1 7
6 2 -6 2
7 3 0 0
The above answer works great if there is, or you only want, one min. In my case there could be multiple mins and I wanted all rows equal to the min, which .idxmin() doesn't give you. This worked:
def filter_group(dfg, col):
    return dfg[dfg[col] == dfg[col].min()]
df = pd.DataFrame({'g': ['a'] * 6 + ['b'] * 6, 'v1': (list(range(3)) + list(range(3))) * 2, 'v2': range(12)})
df.groupby('g',group_keys=False).apply(lambda x: filter_group(x,'v1'))
As an aside, .filter() is also relevant to this question but didn't work for me.
I tried everyone's method and I couldn't get it to work properly. Instead I did the process step-by-step and ended up with the correct result.
df.sort_values(by='diff', inplace=True, ignore_index=True)
df.drop_duplicates(subset='item', inplace=True, ignore_index=True)
df.sort_values(by=['item'], inplace=True, ignore_index=True)
For a little more explanation:
Sort the rows by the column whose minimum you want (diff)
Drop the duplicates of the grouping column (item), which keeps the first, i.e. minimal, row per item
Re-sort by item, because the data is still sorted by the minimum values
If you know that every one of your "items" has more than one record, you can sort and then use duplicated:
df.sort_values(by='diff').duplicated(subset='item', keep='first')
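Note that duplicated returns a boolean mask rather than the filtered rows. A small sketch of how that mask might be used to keep one minimal row per item, rebuilding the question's sample data (the names ordered, is_repeat and result are only illustrative):

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'item': [1, 1, 1, 2, 2, 2, 2, 3, 3],
    'diff': [2, 1, 3, -1, 1, 4, -6, 0, 2],
    'otherstuff': [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

ordered = df.sort_values(by='diff')
# True for every row of an item except the first (smallest diff) one.
is_repeat = ordered.duplicated(subset='item', keep='first')
# Invert the mask to keep only the minimal row per item, then restore index order.
result = ordered[~is_repeat].sort_index()
print(result)
```

This keeps the original row labels (1, 6, 7), like the idxmin approach.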
Suppose I have this DF:
s1 = pd.Series([1,1,2,2,2,3,3,3,4])
s2 = pd.Series([10,20,10,5,10,7,7,3,10])
s3 = pd.Series([0,0,0,0,1,1,0,2,0])
df = pd.DataFrame([s1,s2,s3]).transpose()
df.columns = ['id','qual','nm']
df
id qual nm
0 1 10 0
1 1 20 0
2 2 10 0
3 2 5 0
4 2 10 1
5 3 7 1
6 3 7 0
7 3 3 2
8 4 10 0
I want to get a new DF in which there are no duplicate ids, so there should be 4 rows with ids 1,2,3,4. The row that should be kept should be chosen based on the following criteria: take the one with smallest nm, if equal, take the one with largest qual, if still equal, just choose one.
I figure that my code should look something like:
df.groupby('id').apply(lambda x: ???)
And it should return:
id qual nm
0 1 20 0
1 2 10 0
2 3 7 0
3 4 10 0
But not sure what my function should take and return.
Or possibly there is an easier way?
Thanks!
Use boolean indexing with GroupBy.transform to keep the rows with the minimum nm per group, then again for the maximum qual, and finally, if duplicates still remain, remove them with DataFrame.drop_duplicates:
#get minimal nm
df1 = df[df['nm'] == df.groupby('id')['nm'].transform('min')]
#get maximal qual
df1 = df1[df1['qual'] == df1.groupby('id')['qual'].transform('max')]
#if still dupes get first id
df1 = df1.drop_duplicates('id')
print (df1)
id qual nm
1 1 20 0
2 2 10 0
6 3 7 0
8 4 10 0
Use -
grouper = df.groupby(['id'])
df.loc[(grouper['nm'].transform('min') == df['nm']) & (grouper['qual'].transform('max') == df['qual']), :].drop_duplicates(subset=['id'])
Output
id qual nm
1 1 20 0
2 2 10 0
6 3 7 0
8 4 10 0
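Another way to express the same criteria, as a sketch rather than a drop-in replacement: sort so that the preferred row comes first within each id (smallest nm, then largest qual) and keep the first row per id. The variable name best is just for illustration.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'id':   [1, 1, 2, 2, 2, 3, 3, 3, 4],
    'qual': [10, 20, 10, 5, 10, 7, 7, 3, 10],
    'nm':   [0, 0, 0, 0, 1, 1, 0, 2, 0],
})

best = (
    # Within each id: ascending nm first, then descending qual as tie-breaker.
    df.sort_values(['id', 'nm', 'qual'], ascending=[True, True, False])
      # The first row per id is now the preferred one.
      .drop_duplicates('id')
      .reset_index(drop=True)
)
print(best)
```

Because the tie-breaking order is encoded in the sort, the maximum qual is always taken among the rows that already have the minimal nm.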
with below example:
df = pd.DataFrame({'signal':[1,0,0,1,0,0,0,0,1,0,0,1,0,0],'product':['A','A','A','A','A','A','A','B','B','B','B','B','B','B'],'price':[1,2,3,4,5,6,7,1,2,3,4,5,6,7],'price2':[1,2,1,2,1,2,1,2,1,2,1,2,1,2]})
I have a function "fill_price" to create a new column 'Price_B' based on 'signal' and 'price'. For every 'product' subgroup, Price_B equals price if 'signal' is 1, and equals the previous row's Price_B if 'signal' is 0. If the subgroup starts with a 0 'signal', then 'Price_B' is kept at 0 until 'signal' turns 1.
Currently I have:
def fill_price(df, signal, price_A):
    p = df[price_A].where(df[signal] == 1)
    return p.ffill().fillna(0).astype(df[price_A].dtype)
this is then applied using:
df['Price_B'] = fill_price(df,'signal','price')
However, I want to use df.groupby('product').apply() to apply this fill_price function to each 'product' subgroup separately, and also apply it to both the 'price' and 'price2' columns. Could someone help with that?
I basically want to do:
df.groupby('product', group_keys=False).apply(fill_price, 'signal', 'price2')
IIUC, you can use this syntax:
df['Price_B'] = df.groupby('product').apply(lambda x: fill_price(x,'signal','price2')).reset_index(level=0, drop=True)
Output:
price price2 product signal Price_B
0 1 1 A 1 1
1 2 2 A 0 1
2 3 1 A 0 1
3 4 2 A 1 2
4 5 1 A 0 2
5 6 2 A 0 2
6 7 1 A 0 2
7 1 2 B 0 0
8 2 1 B 1 1
9 3 2 B 0 1
10 4 1 B 0 1
11 5 2 B 1 2
12 6 1 B 0 2
13 7 2 B 0 2
You can write this much more simply without the extra function.
df['Price_B'] = (df.groupby('product', as_index=False)
                 .apply(lambda x: x['price2'].where(x.signal == 1).ffill().fillna(0))
                 .reset_index(level=0, drop=True))
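To cover both 'price' and 'price2' without apply, one option is to mask first and then forward-fill within each product using a grouped ffill. This is only a sketch: the helper name fill_price_by_group and the output column names 'price_B' and 'price2_B' are assumptions made for the example, not names from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'signal': [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
    'product': ['A'] * 7 + ['B'] * 7,
    'price': [1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7],
    'price2': [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
})


def fill_price_by_group(df, signal, price, group):
    """Hypothetical variant of fill_price that forward-fills per group."""
    # Keep the price only where the signal fires; everything else becomes NaN.
    p = df[price].where(df[signal] == 1)
    # Forward-fill inside each group so values never leak across products,
    # then use 0 for rows before the first signal of a group.
    return p.groupby(df[group]).ffill().fillna(0).astype(df[price].dtype)


for col in ['price', 'price2']:
    df[f'{col}_B'] = fill_price_by_group(df, 'signal', col, 'product')

print(df)
```

The grouped ffill returns a Series aligned with the original index, so no reset_index is needed before assigning it back.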