I am trying to add a new column called ordered_1day_ago to my df.
DataFrame currently looks like this:
itemID  orderedDate  qty
1       12/2/21      3
2       12/3/21      2
1       12/3/21      2
1       12/4/21      3
I want it to look like this:
itemID  orderedDate  qty  ordered_1day_ago
1       12/2/21      3    0
2       12/3/21      2    0
1       12/3/21      2    3
1       12/4/21      3    2
itemID and orderedDate must be used to carry a row's qty forward to that item's next orderedDate if it falls within one day; if it does not, ordered_1day_ago is 0.
How can we use pandas for this?
Here is a template for the solution:
import pandas as pd
# a dict to create the dataframe
d = {
    'itemID': [1, 2, 1, 1],
    'orderedDate': ['12/2/21', '12/3/21', '12/3/21', '12/4/21'],
    'qty': [3, 2, 2, 3]
}
# the old dataframe
df = pd.DataFrame(d)
print(df)
# a function applied row by row to compute the new column
def some_function(row):
    # your logic goes here; this is just an example
    z = row['itemID'] + row['qty']
    return z
# add the new column given the function above
df['ordered_1day_ago'] = df.apply(some_function, axis=1)
# the new dataframe with the extra column
print(df)
This is the original df:
itemID orderedDate qty
0 1 12/2/21 3
1 2 12/3/21 2
2 1 12/3/21 2
3 1 12/4/21 3
This is the new df with the added (example) column:
itemID orderedDate qty ordered_1day_ago
0 1 12/2/21 3 4
1 2 12/3/21 2 4
2 1 12/3/21 2 3
3 1 12/4/21 3 4
You can amend the function to contain whatever criteria you wish such that the new column ordered_1day_ago contains the results that you wish.
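For the specific ordered_1day_ago logic in the question, one possible sketch (not the only approach) is a self-merge on itemID and the previous day's date. It assumes orderedDate parses as month/day/year and that each item has at most one order per date:

import pandas as pd

d = {
    'itemID': [1, 2, 1, 1],
    'orderedDate': ['12/2/21', '12/3/21', '12/3/21', '12/4/21'],
    'qty': [3, 2, 2, 3]
}
df = pd.DataFrame(d)

# parse the dates so they can be shifted by one calendar day
df['orderedDate'] = pd.to_datetime(df['orderedDate'], format='%m/%d/%y')

# build a lookup of (itemID, date + 1 day) -> qty and merge it back on
prev = df[['itemID', 'orderedDate', 'qty']].copy()
prev['orderedDate'] = prev['orderedDate'] + pd.Timedelta(days=1)
prev = prev.rename(columns={'qty': 'ordered_1day_ago'})

out = df.merge(prev, on=['itemID', 'orderedDate'], how='left')
out['ordered_1day_ago'] = out['ordered_1day_ago'].fillna(0).astype(int)
print(out)
#    itemID orderedDate  qty  ordered_1day_ago
# 0       1  2021-12-02    3                 0
# 1       2  2021-12-03    2                 0
# 2       1  2021-12-03    2                 3
# 3       1  2021-12-04    3                 2

If an item could have several orders on the same date, you would aggregate qty per (itemID, orderedDate) before building the lookup.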
Related
I have a dataset as below:
import pandas as pd
dict = {"A":[1,1,1,1,5],"B":[1,1,2,4,1]}
dt = pd.DataFrame(data=dict)
so, it is as below:
A B
1 1
1 1
1 2
1 4
5 1
I need to apply a groupby based on A and B and count how many records each group has.
I have applied the solution below:
dtSize = dt.groupby(by=["A","B"], as_index=False).size()
dtSize.to_csv("./datasets/Final DT/dtSize.csv", sep=',', encoding='utf-8', index=False)
I have 2 problems:
When I open the saved file, it only contains the last column with the number of elements in each group, but it does not include the group columns.
When I print the final dtSize, it looks as if some similar records in A are missing.
My desired output in the .csv file is as below:
A B Number of elements in group
1 1 2
1 2 1
1 4 1
5 1 1
Actually, data from A isn't missing. GroupBy.size returns a Series, so A and B are used as a MultiIndex. Due to this, repeated values for A in the first three rows aren't printed.
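For example, printing the raw size() result shows the MultiIndex layout, where the repeated A value is only printed once:

print(dt.groupby(['A', 'B']).size())
# A  B
# 1  1    2
#    2    1
#    4    1
# 5  1    1
# dtype: int64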
You're close. You need to reset the index and, optionally, name the result:
dt.groupby(['A', 'B']).size().reset_index(name='Size')
The result is:
A B Size
0 1 1 2
1 1 2 1
2 1 4 1
3 5 1 1
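Putting it together with the CSV export from the question (a minimal sketch; the path is the one used above and is assumed to exist):

import pandas as pd

dt = pd.DataFrame({"A": [1, 1, 1, 1, 5], "B": [1, 1, 2, 4, 1]})

# reset_index turns the group keys back into ordinary columns,
# so A and B are written to the CSV alongside the sizes
dtSize = dt.groupby(['A', 'B']).size().reset_index(name='Size')
dtSize.to_csv("./datasets/Final DT/dtSize.csv", sep=',', encoding='utf-8', index=False)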
My dataframe looks like this:
Country Code Duration
A 1 0
A 1 1
A 1 2
A 1 3
A 2 0
A 2 1
A 1 0
A 1 1
A 1 2
I need to get max values from a "Duration" column - not just a maximum value, but a list of maximum values for each sequence of numbers in this column. The output might look like this:
Country Code Duration
A 1 3
A 2 1
A 1 2
I could have grouped by "Code", but its values are often repeating, so that's probably not an option. Any help or tips would be much appreciated.
Use idxmax after creating another group key with diff and cumsum:
df.loc[df.groupby([df.Country,df.Code.diff().ne(0).cumsum()]).Duration.idxmax()]
Country Code Duration
3 A 1 3
5 A 2 1
8 A 1 2
First we create a mask to mark the sequences. Then we groupby to create the wanted output:
m = (~df['Code'].eq(df['Code'].shift())).cumsum()
df.groupby(m).agg({'Country': 'first',
                   'Code': 'first',
                   'Duration': 'max'}).reset_index(drop=True)
Country Code Duration
0 A 1 3
1 A 2 1
2 A 1 2
The problem is slightly unclear. However, assuming that order is important, we can move toward a solution.
import pandas as pd
d = pd.read_csv('data.csv')
s = d.Code
d['series'] = s.ne(s.shift()).cumsum()
print(pd.DataFrame(d.groupby(['Country','Code','series'])['Duration'].max().reset_index()))
Returns:
Country Code series Duration
0 A 1 1 3
1 A 1 3 2
2 A 2 2 1
You can then drop the series column, as shown in the sketch below.
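For instance, continuing from the snippet above (a small sketch; sorting by series first restores the original sequence order, which is optional):

result = (d.groupby(['Country', 'Code', 'series'])['Duration']
            .max()
            .reset_index()
            .sort_values('series')
            .drop(columns='series'))
print(result)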
You might want to check this link, it might be the answer you're looking for:
pandas groupby where you get the max of one column and the min of another column. It goes as:
result = df.groupby(['Code', 'Country']).agg({'Duration':'max'})[['Duration']].reset_index()
I have a data frame df1 with data that looks like this:
Item Store Sales Dept
0 1 1 5 A
1 1 2 3 A
2 1 3 4 A
3 2 1 3 A
4 2 2 3 A
I then want to use group by to see the total sales by item:
df2 = df1.groupby(['Item']).agg({'Item':'first','Sales':'sum'})
Which gives me:
Item Sales
0 1 12
1 2 6
And then I add a column with the rank of the item in terms of number of sales:
df2['Item Rank'] = df2['Sales'].rank(ascending=False,method='min').astype(int)
So that I get:
Item Sales Item Rank
0 1 12 1
1 2 6 2
I now want to add the Dept column to df2, so that I have
Item Sales Item Rank Dept
0 1 12 1 A
1 2 6 2 A
But everything I have tried has failed.
I either get an empty column when I try to add the column in from the beginning, or a df of the wrong size when I try to concatenate the new df with the column from the original df.
df1.groupby(['Item']).agg({'Item': 'first', 'Sales': 'sum', 'Dept': 'first'}).\
    assign(Itemrank=lambda d: d.Sales.rank(ascending=False, method='min').astype(int))
Out[64]:
      Item  Sales Dept  Itemrank
Item
1        1     12    A         1
2        2      6    A         2
It is a little unusual, but you can add the Dept column when you're doing the groupby itself:
A simple option is just to hard code the value if you already know what it needs to be:
df2 = df1.groupby(['Item']).agg({'Item': 'first',
                                 'Sales': 'sum',
                                 'Dept': lambda x: 'A'})
Or you could take it from the dataframe itself:
df2 = df1.groupby(['Item']).agg({'Item': 'first',
                                 'Sales': 'sum',
                                 'Dept': lambda x: df1['Dept'].iloc[0]})
I have 4 columns in my dataframe: user, abcisse, ordonnee, time.
For each user, I want to drop earlier rows that duplicate a later row of that user, where a duplicate means two rows with the same abcisse and ordonnee.
I was thinking of using df.duplicated, but I don't know how to combine it with groupby.
entry = pd.DataFrame([[1,0,0,1],[1,3,-2,2],[1,2,1,3],[1,3,1,4],[1,3,-2,5],[2,1,3,1],[2,1,3,2]],columns=['user','abcisse','ordonnee','temps'])
output = pd.DataFrame([[1,0,0,1],[1,2,1,3],[1,3,1,4],[1,3,-2,5],[2,1,3,2]],columns=['user','abcisse','ordonnee','temps'])
Use drop_duplicates:
print (entry.drop_duplicates(['user', 'abcisse', 'ordonnee'], keep='last'))
user abcisse ordonnee temps
0 1 0 0 1
2 1 2 1 3
3 1 3 1 4
4 1 3 -2 5
6 2 1 3 2
I have the following data:
userid itemid
1 1
1 1
1 3
1 4
2 1
2 2
2 3
I want to drop userIDs who have viewed the same itemID two or more times.
For example, userid=1 has viewed itemid=1 twice, and thus I want to drop the entire record of userid=1. However, since userid=2 hasn't viewed the same item twice, I will leave userid=2 as it is.
So I want my data to be like the following:
userid itemid
2 1
2 2
2 3
Can someone help me?
import pandas as pd
df = pd.DataFrame({'userid': [1, 1, 1, 1, 2, 2, 2],
                   'itemid': [1, 1, 3, 4, 1, 2, 3]})
You can use duplicated to determine the row level duplicates, then perform a groupby on 'userid' to determine 'userid' level duplicates, then drop accordingly.
To drop without a threshold:
df = df[~df.duplicated(['userid', 'itemid']).groupby(df['userid']).transform('any')]
To drop with a threshold, use keep=False in duplicated, and sum over the Boolean column and compare against your threshold. For example, with a threshold of 3:
df = df[~df.duplicated(['userid', 'itemid'], keep=False).groupby(df['userid']).transform('sum').ge(3)]
The resulting output for no threshold:
userid itemid
4 2 1
5 2 2
6 2 3
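To see why the no-threshold version works, here is the intermediate Boolean mask computed on the example df above:

dup = df.duplicated(['userid', 'itemid'])           # True on repeated (userid, itemid) rows
mask = dup.groupby(df['userid']).transform('any')   # lift the row-level flags to user level
print(mask)
# 0     True
# 1     True
# 2     True
# 3     True
# 4    False
# 5    False
# 6    False
# dtype: bool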
filter
Was made for this. You can pass a function that returns a boolean that determines if the group passed the filter or not.
filter and value_counts
Most generalizable and intuitive:
df.groupby('userid').filter(lambda x: x.itemid.value_counts().max() < 2)
filter and is_unique
A special case when looking for n < 2:
df.groupby('userid').filter(lambda x: x.itemid.is_unique)
userid itemid
4 2 1
5 2 2
6 2 3
Group the dataframe by users and items:
views = df.groupby(['userid','itemid'])['itemid'].count()
#userid itemid
#1 1 2 <=== The offending row
# 3 1
# 4 1
#2 1 1
# 2 1
# 3 1
#Name: itemid, dtype: int64
Find the users who never viewed any item THRESHOLD or more times:
THRESHOLD = 2
viewed = ~(views.unstack() >= THRESHOLD).any(axis=1)
#userid
#1 False
#2 True
#dtype: bool
Combine the results and keep the 'good' rows:
combined = df.merge(pd.DataFrame(viewed).reset_index())
combined[combined[0]][['userid','itemid']]
# userid itemid
#4 2 1
#5 2 2
#6 2 3
# group userid and itemid and get a count
df2 = df.groupby(by=['userid','itemid']).apply(lambda x: len(x)).reset_index()
#Extract rows where the max userid-itemid count is less than 2.
df2 = df2[~df2.userid.isin(df2[df2.iloc[:, -1] > 1]['userid'])][df.columns]
print(df2)
   userid  itemid
3       2       1
4       2       2
5       2       3
If you want to drop at a certain threshold, just change the condition to
df2.iloc[:, -1] > threshold
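As a rough sketch of the same idea in current pandas (threshold here is just an illustrative variable name):

threshold = 1  # maximum allowed views of the same item per user

counts = df.groupby(['userid', 'itemid']).size().reset_index(name='n')
bad_users = counts.loc[counts['n'] > threshold, 'userid']
df2 = df[~df['userid'].isin(bad_users)]
print(df2)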
I do not know whether there is a function available in Pandas to do this task. However, I tried to make a workaround to deal with your problem.
Here is the full code.
import pandas as pd
dictionary = {'userid': [1, 1, 1, 1, 2, 2, 2],
              'itemid': [1, 1, 3, 4, 1, 2, 3]}
df = pd.DataFrame(dictionary, columns=['userid', 'itemid'])
selected_user = []
for user in df['userid'].drop_duplicates().tolist():
    items = df.loc[df['userid'] == user]['itemid'].tolist()
    if len(items) != len(set(items)):
        continue
    else:
        selected_user.append(user)
result = df.loc[(df['userid'].isin(selected_user))]
This code produces the following output:
userid itemid
4 2 1
5 2 2
6 2 3
Hope it helps.