I have the following data frame:
order_id amount records
1 2 1
2 5 10
3 20 5
4 1 3
I want to remove rows where the amount is greater than the records; the output should be:
order_id amount records
2 5 10
4 1 3
Here is what I've attempted:
df = df.drop(
    df[df.amount > df.records].index, inplace=True)
This is removing all rows; any suggestions are welcome.
Your attempt fails because DataFrame.drop with inplace=True modifies df in place and returns None, so the assignment df = ... overwrites df with None. Simply filter instead:
df = df[df['amount'] <= df['records']]
(<= is the exact complement of >, so rows where amount equals records are kept.)
and you get the desired results:
   order_id  amount  records
1         2       5       10
3         4       1        3
df.loc[~df.amount.gt(df.records)]
   order_id  amount  records
1         2       5       10
3         4       1        3
Explanation: comparisons return a boolean Series:
~df.amount.gt(df.records)
0 False
1 True
2 False
3 True
dtype: bool
This returns values where amount is not greater than records.
You can use this boolean to index into the dataframe to get your desired values.
Alternatively, you could use the code below as well, without having to apply the negation (~):
df.loc[df.amount.le(df.records)]
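Putting it together, a minimal self-contained sketch (the frame is rebuilt here from the question's sample data):
import pandas as pd

# Rebuild the sample frame from the question.
df = pd.DataFrame({'order_id': [1, 2, 3, 4],
                   'amount':   [2, 5, 20, 1],
                   'records':  [1, 10, 5, 3]})

# Keep rows where amount is not greater than records;
# both forms produce the same result.
kept = df[df['amount'].le(df['records'])]
same = df.loc[~df['amount'].gt(df['records'])]
assert kept.equals(same)
print(kept)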
Related
I have a pandas DataFrame with 3 columns :
id product_id is_opt
1 1 False
1 2 False
1 3 True
1 4 True
2 5 False
2 6 False
2 7 False
3 8 False
3 9 False
3 10 True
I want to transform this DataFrame this way :
For a set of rows that shares the same id, if all rows are is_opt = False, then the set of rows stays unchanged. For example, the rows with id = 2 do not change.
For a set of rows that shares the same id, if at least one row is is_opt = True, then we apply this transformation:
All rows that are is_opt = True stay unchanged.
All rows that are is_opt = False take at the end of their product_id all the product_ids of the rows that are is_opt = True. If there are n rows with is_opt = True, then 1 row with is_opt = False gives n rows. For example, the first row [1, 1, False] gives 2 rows [1, 1-3, False] and [1, 1-4, False].
The expected output for the example is:
id product_id
1 1-3
1 1-4
1 2-3
1 2-4
1 3
1 4
2 5
2 6
2 7
3 8-10
3 9-10
3 10
The is_opt column has been dropped in the expected result.
Can you help me find a way to get this result with an efficient set of operations? It is straightforward with some for loops, but I would like something efficient because the DataFrames in production are huge.
You can use a custom function and itertools.product:
from itertools import product
import pandas as pd

def combine(df):
    # At least one is_opt row in the group: pair every non-opt
    # product_id with every opt product_id, then append the opt rows.
    if df['is_opt'].any():
        a = df.loc[~df['is_opt'], 'product_id']
        b = df.loc[df['is_opt'], 'product_id']
        l = ['-'.join(map(str, p)) for p in product(a, b)]
        return pd.Series(l + b.tolist())
    # No is_opt row: keep the group unchanged.
    return df['product_id']

out = df.groupby('id').apply(combine).droplevel(1).reset_index(name='product_id')
output:
id product_id
0 1 1-3
1 1 1-4
2 1 2-3
3 1 2-4
4 1 3
5 1 4
6 2 5
7 2 6
8 2 7
9 3 8-10
10 3 9-10
11 3 10
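For reference, a minimal sketch that rebuilds the sample input from the question, so the snippet above can be run end to end (df is otherwise assumed to exist already):
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'id':         [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    'product_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'is_opt':     [False, False, True, True, False, False, False,
                   False, False, True],
})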
I'm using groupby on a pandas dataframe to drop all rows that don't have the minimum of a specific column. Something like this:
df1 = df.groupby("item", as_index=False)["diff"].min()
However, if I have more than those two columns, the other columns (e.g. otherstuff in my example) get dropped. Can I keep those columns using groupby, or am I going to have to find a different way to drop the rows?
My data looks like:
item diff otherstuff
0 1 2 1
1 1 1 2
2 1 3 7
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
and should end up like:
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
but what I'm getting is:
item diff
0 1 1
1 2 -6
2 3 0
I've been looking through the documentation and can't find anything. I tried:
df1 = df.groupby(["item", "otherstuff"], as_index=False)["diff"].min()
df1 = df.groupby("item", as_index=False)["diff"].min()["otherstuff"]
df1 = df.groupby("item", as_index=False)["otherstuff", "diff"].min()
But none of those work (I realized with the last one that the syntax is meant for aggregating after a group is created).
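For reference, the sample frame can be rebuilt like this (a minimal sketch of the data in the question, so the answers below can be run as-is):
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    'item':       [1, 1, 1, 2, 2, 2, 2, 3, 3],
    'diff':       [2, 1, 3, -1, 1, 4, -6, 0, 2],
    'otherstuff': [1, 2, 7, 0, 3, 9, 2, 0, 9],
})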
Method #1: use idxmin() to get the indices of the elements of minimum diff, and then select those:
>>> df.loc[df.groupby("item")["diff"].idxmin()]
item diff otherstuff
1 1 1 2
6 2 -6 2
7 3 0 0
[3 rows x 3 columns]
Method #2: sort by diff, and then take the first element in each item group:
>>> df.sort_values("diff").groupby("item", as_index=False).first()
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
[3 rows x 3 columns]
Note that the resulting indices are different even though the row content is the same.
You can use DataFrame.sort_values with DataFrame.drop_duplicates:
df = df.sort_values(by='diff').drop_duplicates(subset='item')
print (df)
item diff otherstuff
6 2 -6 2
7 3 0 0
1 1 1 2
If there can be multiple minimal values per group and you want all of the min rows, use boolean indexing with transform to compute the minimum per group:
print (df)
item diff otherstuff
0 1 2 1
1 1 1 2 <-multiple min
2 1 1 7 <-multiple min
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
print (df.groupby("item")["diff"].transform('min'))
0 1
1 1
2 1
3 -6
4 -6
5 -6
6 -6
7 0
8 0
Name: diff, dtype: int64
df = df[df.groupby("item")["diff"].transform('min') == df['diff']]
print (df)
item diff otherstuff
1 1 1 2
2 1 1 7
6 2 -6 2
7 3 0 0
The above answer works great if there is one min (or you only want one). In my case there could be multiple mins, and I wanted all rows equal to the min, which .idxmin() doesn't give you. This worked:
import pandas as pd

def filter_group(dfg, col):
    # Keep only the rows where col equals the group minimum.
    return dfg[dfg[col] == dfg[col].min()]

df = pd.DataFrame({'g': ['a'] * 6 + ['b'] * 6,
                   'v1': (list(range(3)) + list(range(3))) * 2, 'v2': range(12)})
df.groupby('g', group_keys=False).apply(lambda x: filter_group(x, 'v1'))
As an aside, .filter() is also relevant to this question, but it didn't work for me.
I tried everyone's method and I couldn't get it to work properly. Instead I did the process step-by-step and ended up with the correct result.
df.sort_values(by='diff', inplace=True, ignore_index=True)
df.drop_duplicates(subset='item', inplace=True, ignore_index=True)
df.sort_values(by='item', inplace=True, ignore_index=True)
For a little more explanation:
Sort by the value column ('diff') so the minimum row of each item comes first
Drop duplicates on the grouping column ('item'), which keeps that first (minimum) row per item
Re-sort by 'item', because the data is still ordered by the value column
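A quick check of this sequence against the sample data (a sketch, assuming the frame built near the top of the answers; a copy is used so df itself is not mutated):
out = df.copy()
out.sort_values(by='diff', inplace=True, ignore_index=True)
out.drop_duplicates(subset='item', inplace=True, ignore_index=True)
out.sort_values(by='item', inplace=True, ignore_index=True)
print(out)
#    item  diff  otherstuff
# 0     1     1           2
# 1     2    -6           2
# 2     3     0           0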
If you know that all of your "items" have more than one record, you can sort by diff and then use duplicated to build a mask; negate it to keep the first (minimal) row per item:
s = df.sort_values(by='diff')
out = s[~s.duplicated(subset='item', keep='first')]
I have a dataframe with user_id and order_number columns. order_number indicates the nth order placed by a user. I want to select the users who have placed a certain number of orders.
Sample DataFrame:
user_id order_number
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 3 1
6 3 2
7 3 3
Output: [1,3]
The output should be user_id [1, 3], because those users have placed 3 orders, while user 2 has placed only 2.
I am trying:
(df.groupby(['user_id'])['order_number'].max()==3)
This gives me a boolean Series, but how do I select only the index values that are True?
A general way of doing this is by using df.loc[] or df.query:
df.groupby(['user_id'], as_index=False)['order_number'].max().query("order_number == 3")
# or
(df.groupby(['user_id'], as_index=False)['order_number'].max()
   .loc[lambda x: x['order_number'] == 3])
For this example you don't have to take the max of another column; you can just count the rows, as @Steven suggests:
df.groupby('user_id', as_index=False).count().query("order_number == 3")
Or, as @Wen suggests:
df['user_id'].value_counts().loc[lambda x: x == 3]
   user_id  order_number
0        1             3
2        3             3
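To answer the literal question (selecting the index values where a boolean Series is True), a minimal sketch rebuilding the sample data:
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({'user_id': [1, 1, 1, 2, 2, 3, 3, 3],
                   'order_number': [1, 2, 3, 1, 2, 1, 2, 3]})

# Index a Series with its own boolean mask, then take the index.
mask = df.groupby('user_id')['order_number'].max() == 3
users = mask[mask].index.tolist()
print(users)  # [1, 3]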
I have a pandas dataframe like this:
a b c
0 1 1 1
1 1 1 0
2 2 4 1
3 3 5 0
4 3 5 0
where the first two columns ('a' and 'b') are IDs and the last one ('c') is a validation flag (0 = negative, 1 = positive). I know how to remove duplicates based on the values of the first two columns; however, in this case I would also like to get rid of inconsistent data, i.e. duplicates validated both as positive and negative. For example, the first two rows are duplicated but inconsistent, so I should remove both records entirely, while the last two rows are duplicated and consistent, so I'd keep one of them. The expected result should be:
a b c
0 2 4 1
1 3 5 0
The real dataframe can have more than two duplicates per group, and as you can see the index has also been reset. Thanks.
First filter the rows with GroupBy.transform and SeriesGroupBy.nunique, using boolean indexing to keep only the groups whose c values are all identical, then apply DataFrame.drop_duplicates:
df = (df[df.groupby(['a','b'])['c'].transform('nunique').eq(1)]
        .drop_duplicates(['a','b'])
        .reset_index(drop=True))
print (df)
a b c
0 2 4 1
1 3 5 0
Detail:
print (df.groupby(['a','b'])['c'].transform('nunique'))
0 2
1 2
2 1
3 1
4 1
Name: c, dtype: int64
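A minimal, runnable sketch that rebuilds the sample frame from the question and applies the pipeline above:
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 3, 3],
                   'b': [1, 1, 4, 5, 5],
                   'c': [1, 0, 1, 0, 0]})

# Keep only (a, b) groups whose 'c' values all agree, then deduplicate.
consistent = df.groupby(['a', 'b'])['c'].transform('nunique').eq(1)
out = (df[consistent]
       .drop_duplicates(['a', 'b'])
       .reset_index(drop=True))
print(out)
#    a  b  c
# 0  2  4  1
# 1  3  5  0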