I have the following data:
userid itemid
1 1
1 1
1 3
1 4
2 1
2 2
2 3
I want to drop userIDs who have viewed the same itemID two or more times.
For example, userid=1 has viewed itemid=1 twice, so I want to drop every record for userid=1. Since userid=2 hasn't viewed the same item twice, I will leave userid=2 as it is.
So I want my data to be like the following:
userid itemid
2 1
2 2
2 3
Can someone help me?
import pandas as pd
df = pd.DataFrame({'userid':[1,1,1,1, 2,2,2],
'itemid':[1,1,3,4, 1,2,3] })
You can use duplicated to flag row-level duplicates, then perform a groupby on 'userid' to propagate those flags to the 'userid' level, and drop accordingly.
To drop without a threshold:
df = df[~df.duplicated(['userid', 'itemid']).groupby(df['userid']).transform('any')]
To drop with a threshold, use keep=False in duplicated, sum the Boolean column per user, and compare against your threshold. For example, with a threshold of 3:
df = df[~df.duplicated(['userid', 'itemid'], keep=False).groupby(df['userid']).transform('sum').ge(3)]
The resulting output for no threshold:
userid itemid
4 2 1
5 2 2
6 2 3
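To see what the mask is doing, here is an illustrative breakdown of the intermediate steps on the sample df defined above (the printed comments show the expected values for this data):
# row-level duplicate flags: only the second (1, 1) row is a duplicate
mask = df.duplicated(['userid', 'itemid'])
print(mask.tolist())
# [False, True, False, False, False, False, False]

# propagate to the user level: True for every row of a user that has any duplicate
user_has_dup = mask.groupby(df['userid']).transform('any')
print(user_has_dup.tolist())
# [True, True, True, True, False, False, False]

# keep only users without duplicates
print(df[~user_has_dup])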
filter was made for this. You can pass a function that returns a boolean determining whether each group passes the filter.
filter and value_counts (the most generalizable and intuitive approach):
df.groupby('userid').filter(lambda x: x.itemid.value_counts().max() < 2)
filter and is_unique (a special case when looking for n < 2):
df.groupby('userid').filter(lambda x: x.itemid.is_unique)
userid itemid
4 2 1
5 2 2
6 2 3
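If you need a threshold other than 2, the value_counts variant generalizes directly; a small sketch (the threshold name is introduced here for illustration):
threshold = 2  # drop users who viewed any single item this many times or more
df.groupby('userid').filter(lambda x: x.itemid.value_counts().max() < threshold)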
Group the dataframe by users and items:
views = df.groupby(['userid','itemid'])['itemid'].count()
#userid itemid
#1 1 2 <=== The offending row
# 3 1
# 4 1
#2 1 1
# 2 1
# 3 1
#Name: itemid, dtype: int64
Find the users who never viewed any item at or above the threshold:
THRESHOLD = 2
viewed = ~(views.unstack() >= THRESHOLD).any(axis=1)
#userid
#1 False
#2 True
#dtype: bool
Combine the results and keep the 'good' rows:
combined = df.merge(pd.DataFrame(viewed).reset_index())
combined[combined[0]][['userid','itemid']]
# userid itemid
#4 2 1
#5 2 2
#6 2 3
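The same idea can be written more compactly, without the unstack/merge round trip; a hedged alternative sketch using the question's df:
THRESHOLD = 2
views = df.groupby(['userid', 'itemid'])['itemid'].count()
good_users = views.groupby(level='userid').max().lt(THRESHOLD)  # highest per-item count per user
df[df['userid'].map(good_users)]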
# group userid and itemid and get a count
df2 = df.groupby(by=['userid','itemid']).apply(lambda x: len(x)).reset_index()
#Extract rows where the max userid-itemid count is less than 2.
df2 = df2[~df2.userid.isin(df2[df2.iloc[:,-1]>1]['userid'])][df.columns]
print(df2)
   userid  itemid
3       2       1
4       2       2
5       2       3
If you want to drop at a certain threshold, just compare the count column against it instead, e.g.
df2.iloc[:,-1] > threshold
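Put together as a self-contained sketch, using .size() instead of apply/len and a configurable threshold (names like counts and bad_users are introduced here for illustration):
import pandas as pd

df = pd.DataFrame({'userid': [1, 1, 1, 1, 2, 2, 2],
                   'itemid': [1, 1, 3, 4, 1, 2, 3]})

threshold = 1  # drop users whose view count for any single item exceeds this

# count views per (userid, itemid) pair
counts = df.groupby(['userid', 'itemid']).size().reset_index(name='views')

# users that exceed the threshold for at least one item
bad_users = counts.loc[counts['views'] > threshold, 'userid'].unique()

print(df[~df['userid'].isin(bad_users)])
#    userid  itemid
# 4       2       1
# 5       2       2
# 6       2       3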
I do not know whether there is a built-in Pandas function for this task, but here is a workaround for your problem.
Here is the full code.
import pandas as pd
dictionary = {'userid':[1,1,1,1,2,2,2],
'itemid':[1,1,3,4,1,2,3]}
df = pd.DataFrame(dictionary, columns=['userid', 'itemid'])
selected_user = []
for user in df['userid'].drop_duplicates().tolist():
    items = df.loc[df['userid']==user]['itemid'].tolist()
    if len(items) != len(set(items)): continue
    else: selected_user.append(user)
result = df.loc[(df['userid'].isin(selected_user))]
This code produces the following output.
userid itemid
4 2 1
5 2 2
6 2 3
Hope it helps.
I am trying to add a new column called ordered_1day_ago to my df.
DataFrame currently looks like this:
itemID  orderedDate  qty
1       12/2/21      3
2       12/3/21      2
1       12/3/21      2
1       12/4/21      3
I want it to look like this:
itemID  orderedDate  qty  ordered_1day_ago
1       12/2/21      3    0
2       12/3/21      2    0
1       12/3/21      2    3
1       12/4/21      3    2
itemID and orderedDate must be used together: if the same item was ordered exactly one day earlier, ordered_1day_ago should hold that previous day's qty; otherwise ordered_1day_ago is 0.
How can we use pandas for this?
Here is a complete template solution:
import pandas as pd
# a dict to create th dataframe
d = {
'itemID':[1,2,1,1],
'orderedDate':['12/2/21', '12/3/21', '12/3/21', '12/4/21'],
'qty':[3,2,2,3]
}
# the old dataframe
df = pd.DataFrame(d)
print(df)
# some function to do what you want to based on rows
def some_function(row):
    # code goes here
    z = row['itemID'] + row['qty']
    return z
# add the new column given the function above
df['ordered_1day_ago'] = df.apply(some_function, axis=1)
# the new dataframe with the extra column
print(df)
This is the original df:
itemID orderedDate qty
0 1 12/2/21 3
1 2 12/3/21 2
2 1 12/3/21 2
3 1 12/4/21 3
This is the new df with the added (example) column:
itemID orderedDate qty ordered_1day_ago
0 1 12/2/21 3 4
1 2 12/3/21 2 4
2 1 12/3/21 2 3
3 1 12/4/21 3 4
You can amend the function to apply whatever criteria you need so that the new column ordered_1day_ago contains the desired results.
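For the specific one-day-lag logic in the question, one possible approach is a self-merge on a shifted date. A minimal sketch, assuming orderedDate parses as month/day/year and there is at most one order per item per day:
import pandas as pd

df = pd.DataFrame({'itemID': [1, 2, 1, 1],
                   'orderedDate': ['12/2/21', '12/3/21', '12/3/21', '12/4/21'],
                   'qty': [3, 2, 2, 3]})

# parse the dates so they can be shifted by one day
df['orderedDate'] = pd.to_datetime(df['orderedDate'], format='%m/%d/%y')

# shift each order forward one day and treat its qty as "ordered one day ago"
prev = df.assign(orderedDate=df['orderedDate'] + pd.Timedelta(days=1))
prev = prev.rename(columns={'qty': 'ordered_1day_ago'})

# merge back on (itemID, orderedDate); no match means nothing was ordered the day before
out = df.merge(prev[['itemID', 'orderedDate', 'ordered_1day_ago']],
               on=['itemID', 'orderedDate'], how='left')
out['ordered_1day_ago'] = out['ordered_1day_ago'].fillna(0).astype(int)
print(out)
#    itemID orderedDate  qty  ordered_1day_ago
# 0       1  2021-12-02    3                 0
# 1       2  2021-12-03    2                 0
# 2       1  2021-12-03    2                 3
# 3       1  2021-12-04    3                 2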
I have two dataframes A and B with a common column 'label'. I would like to create a new column 'Map' in dataframe A which consists of the corresponding mapping from dataframe B.
Required conditions:
With every mapping, increment a count variable by 1 (which is compared against the 'Capacity' column in dataframe B).
The mapping of the 'label' column should be based on the higher value of the 'Num' column in dataframe B. If the count would exceed 'Capacity' for the next assignment, assign the second-best 'Num' mapping, and so on.
If there is no available mapping, or the 'Capacity' of every available mapping is full, set 'Map' to None.
Dataframe A
Id label
1 1
2 1
3 1
4 2
5 2
6 3
7 3
Dataframe B
label Capacity Map Num
1 1 A 0.1
1 2 B 0.2
2 2 C 0.3
3 1 D 0.2
Expected Output Dataframe
Id label Map
1 1 B
2 1 B
3 1 A
4 2 C
5 2 C
6 3 D
7 3 None
Is there a Pythonic way to do this? I would appreciate some explanation of the code.
Assuming your initial dataframes are:
>>> dfa
Id label
0 1 1
1 2 1
2 3 1
3 4 2
4 5 2
5 6 3
6 7 3
>>> dfb
label Capacity Map Num
0 1 1 A 0.1
1 1 2 B 0.2
2 2 2 C 0.3
3 3 1 D 0.2
First, reshape the dataframes a bit. We compute a cumulative count (cumcount) for dfa and a cumulative Capacity (cumsum) for dfb, sorted by Num in descending order. This tells us, for each label, how many rows each Map can absorb when filled in order of decreasing Num.
dfa['count'] = dfa.groupby('label').cumcount()+1
dfb.sort_values(by='Num', ascending=False, inplace=True)
dfb['count'] = dfb.groupby('label')['Capacity'].cumsum()
Then we define a custom function to do the mapping. The try/except block handles the case where no row is available to map; in that case the function returns None.
def custom_map(s):
    try:
        return (dfb[dfb['label'].eq(s['label']) &   # same label
                    dfb['count'].ge(s['count'])     # within capacity
                    ].iloc[0]['Map'])               # take first element
    except IndexError:
        pass
Finally, we map the values using:
dfa['Map'] = dfa.apply(custom_map, axis=1)
dfa = dfa.drop('count', axis=1)
output:
Id label Map
0 1 1 B
1 2 1 B
2 3 1 A
3 4 2 C
4 5 2 C
5 6 3 D
6 7 3 None
I have tried to reproduce the dataframes mentioned above. My approach is to sort the "B" dataframe first by "num" and then by "cap". Then, looping over the "A" dataframe, I select the correct "map" label and decrement the available capacity as I go.
import pandas as pd
dfA = pd.DataFrame()
dfA["Id"] = [1,2,3,4,5,6,7
]
dfA["label"] = [1,1,1,2,2,3,3]
dfB = pd.DataFrame()
dfB["label"] = [1,1,2,3]
dfB["cap"] = [1,2,2,1]
dfB["map"] = ["A","B","C","D"]
dfB["num"] = [0.1,0.2,0.3,0.2]
test = dfB.copy()
test = test.sort_values(by = ['num', "cap"], ascending = [False, False], na_position = 'first')
map_list = []
for index, row in dfA.iterrows():
    currLabel = row["label"]
    x = test.loc[test['label'] == currLabel]
    if len(x):
        foundMap = False
        for i, r in x.iterrows():
            if r["cap"] > 0:
                test.at[i, "cap"] = r["cap"] - 1
                map_list.append(r["map"])
                foundMap = True
                break
        if not foundMap:
            map_list.append(None)
    else:
        map_list.append(None)
dfA["map"] = map_list
Instead of creating a copy of dfB, you could also add a new column to dfB that keeps track of the remaining capacity in real time.
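A hedged sketch of that variation, reusing the dfA and dfB built above (the 'remaining' column name is introduced here for illustration):
# keep a live counter in dfB instead of working on a sorted copy
dfB['remaining'] = dfB['cap']

map_list = []
for _, row in dfA.iterrows():
    # candidate rows for this label, with capacity left, best num first
    cand = dfB[(dfB['label'] == row['label']) & (dfB['remaining'] > 0)]
    cand = cand.sort_values('num', ascending=False)
    if len(cand):
        i = cand.index[0]
        dfB.at[i, 'remaining'] -= 1
        map_list.append(dfB.at[i, 'map'])
    else:
        map_list.append(None)

dfA['map'] = map_list   # B, B, A, C, C, D, None for the sample data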
I have a dataset as below:
import pandas as pd
dict = {"A":[1,1,1,1,5],"B":[1,1,2,4,1]}
dt = pd.DataFrame(data=dict)
so it looks like this:
A B
1 1
1 1
1 2
1 4
5 1
I need to apply a groupby based on A and B and count how many records each group has.
I have applied the solution below:
dtSize = dt.groupby(by=["A","B"], as_index=False).size()
dtSize.to_csv("./datasets/Final DT/dtSize.csv", sep=',', encoding='utf-8', index=False)
I have 2 problems:
When I open the saved file, it only contains the last column (the number of elements in each group) but not the group columns A and B.
When I print the final dtSize, the repeated values in A appear to be missing.
My desired output, as a .csv file, is as below:
A B Number of elements in group
1 1 2
1 2 1
1 4 1
5 1 1
Actually, data from A isn't missing. GroupBy.size returns a Series, so A and B are used as a MultiIndex. Due to this, repeated values for A in the first three rows aren't printed.
You're close. You need to reset the index and, optionally, name the result:
dt.groupby(['A', 'B']).size().reset_index(name='Size')
The result is:
A B Size
0 1 1 2
1 1 2 1
2 1 4 1
3 5 1 1
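To also fix the CSV problem from the question, write the reset-index result instead of the raw Series; a short sketch using the question's dt (the column name and path are just examples):
dt_size = dt.groupby(['A', 'B']).size().reset_index(name='Number of elements in group')
dt_size.to_csv('dtSize.csv', sep=',', encoding='utf-8', index=False)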
I have a pandas dataframe like this:
a b c
0 1 1 1
1 1 1 0
2 2 4 1
3 3 5 0
4 3 5 0
where the first two columns ('a' and 'b') are IDs and the last one ('c') is a validation flag (0 = neg, 1 = pos). I know how to remove duplicates based on the values of the first two columns; however, in this case I would also like to get rid of inconsistent data, i.e. duplicated rows validated both as positive and negative. For example, the first two rows are duplicated but inconsistent, so I should remove the entire record, while the last two rows are duplicated and consistent, so I'd keep one of them. The expected result should be:
a b c
0 2 4 1
1 3 5 0
The real dataframe can have more than two duplicates per group, and as you can see, the index has also been reset. Thanks.
First, filter rows using GroupBy.transform with SeriesGroupBy.nunique to keep only groups with a single unique value of c (via boolean indexing), then apply DataFrame.drop_duplicates:
df = (df[df.groupby(['a','b'])['c'].transform('nunique').eq(1)]
.drop_duplicates(['a','b'])
.reset_index(drop=True))
print (df)
a b c
0 2 4 1
1 3 5 0
Detail:
print (df.groupby(['a','b'])['c'].transform('nunique'))
0 2
1 2
2 1
3 1
4 1
Name: c, dtype: int64
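An equivalent variant, arguably more readable though usually slower on large frames, uses GroupBy.filter on the question's df; a hedged sketch:
df = (df.groupby(['a', 'b'])
        .filter(lambda g: g['c'].nunique() == 1)   # keep only consistent groups
        .drop_duplicates(['a', 'b'])
        .reset_index(drop=True))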
I have a dataframe on which I'm using pandas.groupby on a specific column and then running aggregate statistics (mean, median, count) on the resulting groups. I want certain column values to be treated as members of the same group rather than each distinct value forming its own group, and I'm looking for how to accomplish this.
For example:
>> my_df
ID SUB_NUM ELAPSED_TIME
1 1 1.7
2 2 1.4
3 2 2.1
4 4 3.0
5 6 1.8
6 6 1.2
So instead of the typical behavior:
>> my_df.groupby(['SUB_NUM']).agg(['count'])
ID SUB_NUM Count
1 1 1
2 2 2
4 4 1
5 6 2
I want certain values (SUB_NUM in [1, 2]) to be treated as one group, so that something like the following is produced instead:
>> # Some mystery pandas function calls
ID SUB_NUM Count
1 1, 2 3
4 4 1
5 6 2
Any help would be much appreciated, thanks!
This works for me:
#to join the values, convert them to strings
df['SUB_NUM'] = df['SUB_NUM'].astype(str)
#create mapping dict by dict comprehension
L = ['1','2']
d = {x: ','.join(L) for x in L}
print (d)
{'2': '1,2', '1': '1,2'}
#replace values by dict
a = df['SUB_NUM'].replace(d)
print (a)
0 1,2
1 1,2
2 1,2
3 4
4 6
5 6
Name: SUB_NUM, dtype: object
#group by the mapping column, aggregating `first` and `size`
print (df.groupby(a)
.agg({'ID':'first', 'ELAPSED_TIME':'size'})
.rename(columns={'ELAPSED_TIME':'Count'})
.reset_index())
SUB_NUM ID Count
0 1,2 1 3
1 4 4 1
2 6 5 2
See also: What is the difference between size and count in pandas?
You can create another column mapping the SUB_NUM values to actual groups and then group by it.
my_df['SUB_GROUP'] = my_df['SUB_NUM'].apply(lambda x: 1 if x < 3 else x)
my_df.groupby(['SUB_GROUP']).agg(['count'])
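A self-contained, hedged version of that idea using named aggregation (which assumes pandas >= 0.25; the SUB_GROUP column and the cutoff at 3 are illustrative):
import pandas as pd

my_df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                      'SUB_NUM': [1, 2, 2, 4, 6, 6],
                      'ELAPSED_TIME': [1.7, 1.4, 2.1, 3.0, 1.8, 1.2]})

# fold SUB_NUM 1 and 2 into a single group, leave the rest unchanged
my_df['SUB_GROUP'] = my_df['SUB_NUM'].apply(lambda x: 1 if x < 3 else x)

out = (my_df.groupby('SUB_GROUP')
            .agg(ID=('ID', 'first'), Count=('SUB_NUM', 'size'))
            .reset_index())
print(out)
#    SUB_GROUP  ID  Count
# 0          1   1      3
# 1          4   4      1
# 2          6   5      2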