I have a dataframe with user_id and order_number columns, where order_number gives the nth order by a user. I want to select users who have placed a certain number of orders.
Sample DataFrame:
user_id order_number
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 3 1
6 3 2
7 3 3
Output: [1,3]
The output should be user_id [1,3] because those users have placed 3 orders each, while user 2 has placed only 2 orders.
I am trying:
(df.groupby(['user_id'])['order_number'].max()==3)
This gives me a boolean series, but how do I select only the indices with True values?
A general way of doing this is with df.loc[] or df.query():
df.groupby(['user_id'], as_index=False)['order_number'].max().query("order_number == 3")
# or
df.groupby(['user_id'], as_index=False)['order_number'].max().loc[
    lambda x: x['order_number'] == 3]
For this example you don't have to take the max of another column; you can just count the rows, as @Steven suggests:
df.groupby('user_id',as_index=False).count().query("order_number==3")
Or, as @Wen suggests:
df['user_id'].value_counts().loc[lambda x: x==3]
user_id order_number
0 1 3
2 3 3
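To answer the literal question (how to select the index where a boolean series is True), you can index the series with itself. A minimal sketch using the sample data above:

import pandas as pd

df = pd.DataFrame({'user_id': [1, 1, 1, 2, 2, 3, 3, 3],
                   'order_number': [1, 2, 3, 1, 2, 1, 2, 3]})

# Boolean series keyed by user_id: True where the max order_number is 3
mask = df.groupby('user_id')['order_number'].max() == 3

# Indexing the series with itself keeps only the True rows;
# the index of what remains is the list of matching user_ids
users = mask[mask].index.tolist()
print(users)  # [1, 3]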
I have the following data frame:
order_id amount records
1 2 1
2 5 10
3 20 5
4 1 3
I want to remove rows where the amount is greater than the records; the output should be:
order_id amount records
2 5 10
4 1 3
Here is what I've attempted:
df = df.drop(
df[df.amount > df.records].index, inplace=True)
This removes all rows; any suggestions are welcome.
Simply filter, keeping the rows where amount is not greater than records:
df = df[df['amount'] <= df['records']]
and you get the desired results:
order_id amount records
1 2 5 10
3 4 1 3
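As an aside, the original attempt fails because drop(..., inplace=True) modifies the frame in place and returns None, so the assignment binds df to None. Use one form or the other, not both:

# Either mutate in place...
df.drop(df[df.amount > df.records].index, inplace=True)
# ...or reassign the returned copy
df = df.drop(df[df.amount > df.records].index)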
df.loc[~df.amount.gt(df.records)]
order_id amount records
1 2 5 10
3 4 1 3
Explanation: comparisons return a boolean Series:
~df.amount.gt(df.records)
0 False
1 True
2 False
3 True
dtype: bool
This returns values where amount is not greater than records.
You can use this boolean to index into the dataframe to get your desired values.
Alternatively, you could use the code below, without the negation (~):
df.loc[df.amount.le(df.records)]
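For reference, a self-contained sketch of the .le approach on the sample data:

import pandas as pd

df = pd.DataFrame({'order_id': [1, 2, 3, 4],
                   'amount': [2, 5, 20, 1],
                   'records': [1, 10, 5, 3]})

# Keep rows where amount is not greater than records
print(df.loc[df.amount.le(df.records)])
#    order_id  amount  records
# 1         2       5       10
# 3         4       1        3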
My dataframe looks like this:
Country Code Duration
A 1 0
A 1 1
A 1 2
A 1 3
A 2 0
A 2 1
A 1 0
A 1 1
A 1 2
I need to get the max values from the "Duration" column: not just the single maximum value, but the maximum of each consecutive sequence of numbers in the column. The output might look like this:
Country Code Duration
A 1 3
A 2 1
A 1 2
I could have grouped by "Code", but its values are often repeating, so that's probably not an option. Any help or tips would be much appreciated.
Use idxmax after creating another group key with diff and cumsum:
df.loc[df.groupby([df.Country,df.Code.diff().ne(0).cumsum()]).Duration.idxmax()]
Country Code Duration
3 A 1 3
5 A 2 1
8 A 1 2
First we create a mask to mark the sequences, then we group by it to create the wanted output:
m = (~df['Code'].eq(df['Code'].shift())).cumsum()
df.groupby(m).agg({'Country': 'first',
                   'Code': 'first',
                   'Duration': 'max'}).reset_index(drop=True)
Country Code Duration
0 A 1 3
1 A 2 1
2 A 1 2
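To see what the mask does, here is a quick check on the sample data above (the Code column is [1, 1, 1, 1, 2, 2, 1, 1, 1]):

import pandas as pd

df = pd.DataFrame({'Country': ['A'] * 9,
                   'Code': [1, 1, 1, 1, 2, 2, 1, 1, 1],
                   'Duration': [0, 1, 2, 3, 0, 1, 0, 1, 2]})

# The cumulative sum ticks up each time Code differs from the previous row,
# so every consecutive run of the same Code gets its own group id
m = (~df['Code'].eq(df['Code'].shift())).cumsum()
print(m.tolist())  # [1, 1, 1, 1, 2, 2, 3, 3, 3]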
The problem is slightly unclear. However, assuming that order is important, we can move toward a solution.
import pandas as pd

d = pd.read_csv('data.csv')
s = d.Code
d['series'] = s.ne(s.shift()).cumsum()
print(d.groupby(['Country', 'Code', 'series'])['Duration'].max().reset_index())
Returns:
Country Code series Duration
0 A 1 1 3
1 A 1 3 2
2 A 2 2 1
You can then drop the series column.
You might want to check this link; it might be the answer you're looking for: pandas groupby where you get the max of one column and the min of another column. It goes as:
result = df.groupby(['Code', 'Country']).agg({'Duration':'max'})[['Duration']].reset_index()
I have 4 columns in my dataframe: user, abcisse, ordonnee, temps.
For each user I want to keep only the last occurrence of each duplicated row, where two rows count as duplicates if they have the same abcisse and ordonnee.
I was thinking of using the df.duplicated function, but I don't know how to combine it with groupby.
entry = pd.DataFrame([[1,0,0,1],[1,3,-2,2],[1,2,1,3],[1,3,1,4],[1,3,-2,5],[2,1,3,1],[2,1,3,2]],columns=['user','abcisse','ordonnee','temps'])
output = pd.DataFrame([[1,0,0,1],[1,2,1,3],[1,3,1,4],[1,3,-2,5],[2,1,3,2]],columns=['user','abcisse','ordonnee','temps'])
Use drop_duplicates:
print(entry.drop_duplicates(['user', 'abcisse', 'ordonnee'], keep='last'))
user abcisse ordonnee temps
0 1 0 0 1
2 1 2 1 3
3 1 3 1 4
4 1 3 -2 5
6 2 1 3 2
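If you do want to use duplicated as in the question, the equivalent is to negate its mask with keep='last'; no groupby is needed, because user is part of the duplicate key:

# duplicated(keep='last') marks every occurrence except the last one;
# negating it keeps only the last row of each (user, abcisse, ordonnee)
print(entry[~entry.duplicated(['user', 'abcisse', 'ordonnee'], keep='last')])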
I have the following data:
userid itemid
1 1
1 1
1 3
1 4
2 1
2 2
2 3
I want to drop userids that have viewed the same itemid two or more times.
For example, userid=1 has viewed itemid=1 twice, so I want to drop all records for userid=1. Since userid=2 hasn't viewed any item twice, userid=2 stays as is.
So I want my data to be like the following:
userid itemid
2 1
2 2
2 3
Can someone help me?
import pandas as pd
df = pd.DataFrame({'userid': [1, 1, 1, 1, 2, 2, 2],
                   'itemid': [1, 1, 3, 4, 1, 2, 3]})
You can use duplicated to determine the row level duplicates, then perform a groupby on 'userid' to determine 'userid' level duplicates, then drop accordingly.
To drop without a threshold:
df = df[~df.duplicated(['userid', 'itemid']).groupby(df['userid']).transform('any')]
To drop with a threshold, use keep=False in duplicated, and sum over the Boolean column and compare against your threshold. For example, with a threshold of 3:
df = df[~df.duplicated(['userid', 'itemid'], keep=False).groupby(df['userid']).transform('sum').ge(3)]
The resulting output for no threshold:
userid itemid
4 2 1
5 2 2
6 2 3
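To make the first expression more concrete, here is how the mask is built up on the sample data:

# Row-level duplicates: True only for the second (userid=1, itemid=1) row
dup = df.duplicated(['userid', 'itemid'])
# dup: [False, True, False, False, False, False, False]

# Broadcast per user: if any of a user's rows is a duplicate, flag them all
flag = dup.groupby(df['userid']).transform('any')
# flag: [True, True, True, True, False, False, False]

df = df[~flag]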
filter
was made for this. You pass a function that returns a boolean, and it determines whether each group passes the filter.
filter and value_counts
Most generalizable and intuitive:
df.groupby('userid').filter(lambda x: x.itemid.value_counts().max() < 2)
filter and is_unique
A special case when looking for n < 2:
df.groupby('userid').filter(lambda x: x.itemid.is_unique)
userid itemid
4 2 1
5 2 2
6 2 3
Group the dataframe by users and items:
views = df.groupby(['userid','itemid'])['itemid'].count()
#userid itemid
#1 1 2 <=== The offending row
# 3 1
# 4 1
#2 1 1
# 2 1
# 3 1
#Name: itemid, dtype: int64
Find out which users never viewed any item at or above the threshold:
THRESHOLD = 2
viewed = ~(views.unstack() >= THRESHOLD).any(axis=1)
#userid
#1 False
#2 True
#dtype: bool
Combine the results and keep the 'good' rows:
combined = df.merge(pd.DataFrame(viewed).reset_index())
combined[combined[0]][['userid','itemid']]
# userid itemid
#4 2 1
#5 2 2
#6 2 3
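For reference, the intermediate unstack step on this data looks like this (NaN where a user never viewed an item):

import pandas as pd

df = pd.DataFrame({'userid': [1, 1, 1, 1, 2, 2, 2],
                   'itemid': [1, 1, 3, 4, 1, 2, 3]})

views = df.groupby(['userid', 'itemid'])['itemid'].count()
print(views.unstack())
# itemid    1    2    3    4
# userid
# 1       2.0  NaN  1.0  1.0
# 2       1.0  1.0  1.0  NaN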
# group by userid and itemid and get a count
df2 = df.groupby(by=['userid', 'itemid']).apply(lambda x: len(x)).reset_index()
# keep only users whose every userid-itemid count is less than 2
df2 = df2[~df2.userid.isin(df2[df2.iloc[:, -1] > 1]['userid'])][df.columns]
print(df2)
userid itemid
3 2 1
4 2 2
5 2 3
If you want to drop at a certain threshold, just replace the comparison:
df2.iloc[:, -1] > threshold
I do not know whether there is a function available in Pandas to do this task. However, I tried to make a workaround to deal with your problem.
Here is the full code.
import pandas as pd

dictionary = {'userid': [1, 1, 1, 1, 2, 2, 2],
              'itemid': [1, 1, 3, 4, 1, 2, 3]}
df = pd.DataFrame(dictionary, columns=['userid', 'itemid'])

selected_user = []
for user in df['userid'].drop_duplicates().tolist():
    items = df.loc[df['userid'] == user]['itemid'].tolist()
    # keep the user only if no itemid repeats
    if len(items) != len(set(items)):
        continue
    selected_user.append(user)

result = df.loc[df['userid'].isin(selected_user)]
This code produces the following output.
userid itemid
4 2 1
5 2 2
6 2 3
Hope it helps.
I have data like this:
df = pd.DataFrame({
    'ID': [1, 1, 2, 3, 3, 3, 4],
    'SOME_NUM': [8, 10, 2, 4, 0, 5, 1]
})
df
ID SOME_NUM
0 1 8
1 1 10
2 2 2
3 3 4
4 3 0
5 3 5
6 4 1
And I want to group by the ID column while retaining the maximum value of SOME_NUM as a separate column. This would be easy in SQL:
SELECT ID,
MAX(SOME_NUM)
FROM DF
GROUP BY ID;
But I'm having trouble finding the equivalent Python code. Seems like this should be easy. Anyone have a solution?
Desired result:
new_df
ID SOME_NUM
0 1 10
1 2 2
2 3 5
6 4 1
Seeing as how you are using pandas, use the groupby functionality baked in:
df.groupby("ID").max()