I have 2 dataframes that need to be compared, with duplicates (if any) removed from both
Daily = DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
Accumulated = DataFrame({'col1':[4,2,5,6], 'col2':[6,3,5,6]})
Out[4]:
col1 col2
0 1 2
1 2 3
2 3 4
col1 col2
0 4 6
1 2 3
2 5 5
3 6 6
What I am trying to achieve is to remove duplicates, if there are any, from both DFs and get the count of the remaining entries in the daily DF.
Expected output:
col1 col2
0 1 2
2 3 4
col1 col2
0 4 6
2 5 5
3 6 6
Count = 2
How can I do it?
Both or either DF can be empty, and Daily can have more entries than Accumulated and vice versa.
Why not just concat both into one df and drop the duplicates completely?
s = (pd.concat([Daily.assign(source="Daily"),
                Accumulated.assign(source="Accumulated")])
     .drop_duplicates(["col1","col2"], keep=False))

print (s[s["source"].eq("Daily")])
   col1  col2 source
0     1     2  Daily
2     3     4  Daily

print (s[s["source"].eq("Accumulated")])
   col1  col2       source
0     4     6  Accumulated
2     5     5  Accumulated
3     6     6  Accumulated
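If you also need the count of remaining daily entries that the question asks for, it can be read straight off the same frame:
count = s["source"].eq("Daily").sum()
print (count)   # 2 for the sample data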
You can try the code below:
## For the 1st DataFrame: collect the matching rows first, then drop them
rows_to_drop = []
for i in range(len(df1)):
    for j in range(len(df2)):
        if df1.iloc[i].to_list() == df2.iloc[j].to_list():
            rows_to_drop.append(df1.index[i])
            break
df1 = df1.drop(index=rows_to_drop)
You can do the same for the second dataframe.
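One caveat: the second pass should compare against the rows of the first dataframe as they were before any were dropped. A minimal sketch, with df1_original as an illustrative name for such a copy:
## For the 2nd DataFrame: compare against an untouched copy of the first
df1_original = df1.copy()   # take this copy before running the first loop above
rows_to_drop = []
for j in range(len(df2)):
    for i in range(len(df1_original)):
        if df2.iloc[j].to_list() == df1_original.iloc[i].to_list():
            rows_to_drop.append(df2.index[j])
            break
df2 = df2.drop(index=rows_to_drop)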
I would do it the following way:
import pandas as pd
daily = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
accumulated = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
daily['isdaily'] = True
accumulated['isdaily'] = False
together = pd.concat([daily, accumulated])
without_dupes = together.drop_duplicates(['col1','col2'],keep=False)
daily_count = sum(without_dupes['isdaily'])
I added an isdaily column of Trues and Falses to the dataframes so the daily rows can be counted with a simple sum at the end.
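If you also need the two deduplicated frames themselves (the question's expected output shows both), a small follow-up using the same flag; the names daily_clean and accumulated_clean are just illustrative:
daily_clean = without_dupes[without_dupes['isdaily']].drop(columns='isdaily')
accumulated_clean = without_dupes[~without_dupes['isdaily']].drop(columns='isdaily')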
If I understood correctly, you need to keep both tables separate.
You can concatenate them, keeping track of which table each row came from, and then recreate them:
Daily = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
Daily["Table"] = "Daily"
Accumulated = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
Accumulated["Table"] = "Accum"
df = pd.concat([Daily, Accumulated]).reset_index(drop=True)
not_dup = df[["col1", "col2"]].drop_duplicates(keep=False)
not_dup = df.loc[not_dup.index, :]
Daily = not_dup[not_dup["Table"] == "Daily"][["col1","col2"]]
Accumulated = not_dup[not_dup["Table"] == "Accum"][["col1","col2"]]
print(Daily)
print(Accumulated)
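To also get the count the question asks for (the number of remaining entries in the daily frame), a one-line addition using the frames produced above:
count = len(Daily)
print(count)   # 2 for the sample data in the question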
Follow these steps:
Concatenate the 2 data-frames
Drop all the duplicates
For each data-frame, find its intersection with the deduplicated concatenated data-frame
Find the count with len
Daily = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
Accumulated = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
df = pd.concat([Daily, Accumulated]) # step 1
df = df.drop_duplicates(keep=False) # step 2
Daily = pd.merge(df, Daily, how='inner', on=['col1','col2']) #step 3
Accumulated = pd.merge(df, Accumulated, how='inner', on=['col1','col2']) #step 3
count = len(Daily) #step 4
Related
I'm using a pandas dataframe to read a .csv file. I would like to insert rows whenever a specific column's value changes from one value to another. My data is shown as follows:
Id type
1 car
1 track
2 train
2 plane
3 car
I need to add a row, with Id empty and type set to the number 4, after every change in the Id column's value. My desired output should look like this:
Id type
1 car
1 track
4
2 train
2 plane
4
3 car
How do I do this?
You could use groupby to split into groups, append the separator row to each group in a list comprehension, and then merge again with concat:
df2 = pd.concat([d.append(pd.Series([None, 4], index=['Id', 'type']), ignore_index=True)
                 for _, d in df.groupby('Id')], ignore_index=True).iloc[:-1]
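Note that DataFrame.append was removed in pandas 2.0; on newer versions the same one-liner can be written with concat alone (a sketch with the same logic):
df2 = pd.concat([pd.concat([d, pd.DataFrame([{'Id': None, 'type': 4}])], ignore_index=True)
                 for _, d in df.groupby('Id')], ignore_index=True).iloc[:-1]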
If the index is sorted, another option is to find the index of the last item per group and use it to generate the new rows:
# get index of last item per group (except last)
idx = df.index.to_series().groupby(df['Id']).last().values[:-1]
# craft a DataFrame with the new rows
d = pd.DataFrame([[None, 4]]*len(idx), columns=df.columns, index=idx)
# concatenate and reorder
pd.concat([df, d]).sort_index().reset_index(drop=True)
output:
Id type
0 1.0 car
1 1.0 track
2 NaN 4.0
3 2.0 train
4 2.0 plane
5 NaN 4.0
6 3.0 car
You can do this:
df = pd.read_csv('input.csv', sep=";")
Id type
0 1 car
1 1 track
2 2 train
3 2 plane
4 3 car
mask = df['Id'].ne(df['Id'].shift(-1))
# new rows at fractional positions, right after each change in Id
df1 = pd.DataFrame('4', index=mask.index[mask] + .5, columns=df.columns)
# aligning on df's integer index leaves NaN in the Id column of the new rows
df1['Id'] = df['Id'].replace({'4': ' '})
df = pd.concat([df, df1]).sort_index().reset_index(drop=True).iloc[:-1]
which gives:
Id type
0 1.0 car
1 1.0 track
2 NaN 4
3 2.0 train
4 2.0 plane
5 NaN 4
6 3.0 car
You can do:
In [244]: grp = df.groupby('Id')
In [256]: res = pd.DataFrame()
In [257]: for x,y in grp:
...: if y['type'].count() > 1:
...: tmp = y.append(pd.DataFrame({'Id': [''], 'type':[4]}))
...: res = res.append(tmp)
...: else:
...: res = res.append(y)
...:
In [258]: res
Out[258]:
Id type
0 1 car
1 1 track
0 4
2 2 train
3 2 plane
0 4
4 3 car
Please find the solution below, using the index:
# Create a shift variable to compare the index
df['idshift'] = df['Id'].shift(1)
# Where the shifted Id does not match Id, the index marks a change
change_index = df.index[df['idshift'] != df['Id']].tolist()
change_index
# Loop through all the change indexes and insert a row at each one
for i in change_index[1:]:
    line = pd.DataFrame({"Id": ' ', "rate": 4}, index=[(i-1)+.5])
    df = df.append(line, ignore_index=False)
# finally sort the index
df = df.sort_index().reset_index(drop=True)
Input DataFrame:
df = pd.DataFrame({'Id': [1,1,2,2,3,3,3,4], 'rate': [1,2,3,10,12,16,10,12]})
Output results from the code:
I want to append a list of data to a dataframe such that the list will appear in a column, i.e.:
#Existing dataframe:
[A, 20150901, 20150902
1 4 5
4 2 7]
#list of data to append to column A:
data = [8,9,4]
#Required dataframe
[A, 20150901, 20150902
1 4 5
4 2 7
8 0 0
9 0 0
4 0 0]
I am using the following:
df_new = df.copy(deep=True)
# I am copying and then deleting the data because the column names are of type Timestamp, which makes them easier to reuse
df_new.drop(df_new.index, inplace=True)
for item in data_list:
    df_new = df_new.append([{'A':item}], ignore_index=True)
df_new.fillna(0, inplace=True)
df = pd.concat([df, df_new], axis=0, ignore_index=True)
But doing this in a loop is inefficient plus I get this warning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
Any ideas on how to avoid this warning and append the 2 dataframes in one go?
I think you need to concat the new DataFrame with column A, then reindex if you want the same order of columns, and finally replace missing values with fillna:
data = [8,9,4]
df_new = pd.DataFrame({'A':data})
df = (pd.concat([df, df_new], ignore_index=True)
.reindex(columns=df.columns)
.fillna(0, downcast='infer'))
print (df)
A 20150901 20150902
0 1 4 5
1 4 2 7
2 8 0 0
3 9 0 0
4 4 0 0
I think you could do something like this.
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame({'A':[8,9,4]})
df.append(df2).fillna(0)
A B
0 1 2.0
1 3 4.0
0 8 0.0
1 9 0.0
2 4 0.0
Maybe you can do it this way:
import numpy as np

new = pd.DataFrame(np.zeros((3, 3)), columns=existing_dataframe.columns)  # create a new all-zero dataframe with matching columns
new['A'] = [8, 9, 4]            # add the values to column A
existing_dataframe.append(new)  # and merge both dataframes
I would like to add elements to specific groups in a Pandas DataFrame in a selective way. In particular, I would like to add zeros so that all groups have the same number of elements. The following is a simple example:
import pandas as pd
df = pd.DataFrame([[1,1], [2,2], [1,3], [2,4], [2,5]], columns=['key', 'value'])
df
key value
0 1 1
1 2 2
2 1 3
3 2 4
4 2 5
I would like to have the same number of elements per group (where grouping is by the key column). Group 2 has the most elements: three. However, group 1 has only two elements, so a zero should be added as follows:
key value
0 1 1
1 2 2
2 1 3
3 2 4
4 2 5
5 1 0
Note that the index does not matter.
You can create a new level of MultiIndex with cumcount and then add the missing values with unstack/stack or reindex:
df = (df.set_index(['key', df.groupby('key').cumcount()])['value']
.unstack(fill_value=0)
.stack()
.reset_index(level=1, drop=True)
.reset_index(name='value'))
Alternative solution:
df = df.set_index(['key', df.groupby('key').cumcount()])
mux = pd.MultiIndex.from_product(df.index.levels, names = df.index.names)
df = df.reindex(mux, fill_value=0).reset_index(level=1, drop=True).reset_index()
print (df)
key value
0 1 1
1 1 3
2 1 0
3 2 2
4 2 4
5 2 5
If the order of values is important:
df1 = df.set_index(['key', df.groupby('key').cumcount()])
mux = pd.MultiIndex.from_product(df1.index.levels, names = df1.index.names)
#get appended values
miss = mux.difference(df1.index).get_level_values(0)
#create helper df and add 0 to all columns of original df
df2 = pd.DataFrame({'key':miss}).reindex(columns=df.columns, fill_value=0)
#append to original df
df = pd.concat([df, df2], ignore_index=True)
print (df)
key value
0 1 1
1 2 2
2 1 3
3 2 4
4 2 5
5 1 0
I have a list of dataframes. Each dataframe contains numerical data, and all of them are shaped identically with 21 rows and 5 columns. The first column is an index (index 0 to index 20). I want to compute the average (mean) values into a single dataframe, then export that dataframe to Excel.
Here's a simplified version of my existing code:
#look to concatenate the dataframes together all at once
#dataFrameList is the given list of dataFrames
concatenatedDataframes = pd.concat(dataFrameList, axis = 1)
#grouping the dataframes by the index, which is the same across all of the dataframes
groupedByIndex = concatenatedDataframes.groupby(level = 0)
#take the mean
meanDataFrame = groupedByIndex.mean()
# Create a Pandas Excel writer using openpyxl as the engine.
writer = pd.ExcelWriter(filepath, engine='openpyxl')
meanDataFrame.to_excel(writer)
However, when I open the excel file, I see what looks like EVERY dataframe is copied into the sheet and the average/mean values are not shown. A simplified example is shown below (cutting most of the rows and dataframes)
Dataframe 1 Dataframe 2 Dataframe 3
Index Col2 Col3 Col4 Col5 Col2 Col3 Col4 Col5 Col2 Col3 Col4 Col5
0 Data Data Data Data Data Data Data Data Data Data Data Data
1 Data Data Data Data Data Data Data Data Data Data Data Data
2 Data Data Data Data Data Data Data Data Data Data Data Data
....
I'm looking for something more like:
Averaged DF
Index Col2 Col3 Col4
0 Mean Index0,Col2 across DFs Mean Index0,Col3 across DFs Mean Index0,Col4 across DFs
1 Mean Index1,Col2 across DFs Mean Index1,Col3 across DFs Mean Index1,Col4 across DFs
2 Mean Index2,Col2 across DFs Mean Index2,Col3 across DFs Mean Index2,Col4 across DFs
...
I have also already seen this answer:
Get the mean across multiple Pandas DataFrames
If possible, I'm looking for a clean solution, not one which would simply involve looping through each dataFrame value by value. Any suggestions?
Perhaps I misunderstood what you asked
The solution is simple. You just need to concat along the correct axis
dummy data
rows, columns = 3, 2
df1 = pd.DataFrame(index=range(rows), columns=range(columns), data=[[10 + i * j for j in range(columns)] for i in range(rows)])
df2 = pd.DataFrame(index=range(rows), columns=range(columns), data=[[i + j for j in range(columns)] for i in range(rows)])
ps. this should be your job as OP
pd.concat
df_concat0 = pd.concat((df1, df2), axis=1)
puts all the dataframes next to each other:
0 1 0 1
0 10 10 0 1
1 10 11 1 2
2 10 12 2 3
If we want to do a groupby now, we first need to stack, then groupby, then unstack again:
df_concat0.stack().groupby(level=[0,1]).mean().unstack()
0 1
0 5.0 5.5
1 5.5 6.5
2 6.0 7.5
If we do
df_concat = pd.concat((df1, df2))
This puts all the dataframes on top of each other:
0 1
0 10 10
1 10 11
2 10 12
0 0 1
1 1 2
2 2 3
Now we just need to groupby the index, like you did:
result = df_concat.groupby(level=0).mean()
result
0 1
0 5.0 5.5
1 5.5 6.5
2 6.0 7.5
and then use ExcelWriter as a context manager:
with pd.ExcelWriter(filepath, engine='openpyxl') as writer:
    result.to_excel(writer)
or just plain
result.to_excel(filepath, engine='openpyxl')
if you can overwrite whatever is at filepath.
I suppose you need the mean of all rows against each column.
Concatenating a list of data frames with the same index will add the columns from the other data frames to the right of the first data frame, as below:
col1 col2 col3 col1 col2 col3
0 1 2 3 2 3 4
1 2 3 4 3 4 5
2 3 4 5 4 5 6
3 4 5 6 5 6 7
Try appending the data frames and then group by and take the mean to get the desired result.
##creating data frames
df1= pd.DataFrame({'col1':[1,2,3,4],
'col2':[2,3,4,5],
'col3':[3,4,5,6]})
df2= pd.DataFrame({'col1':[2,3,4,5],
'col2':[3,4,5,6],
'col3':[4,5,6,7]})
## list of data frames
dflist = [df1,df2]
## empty data frame to use for appending
df=pd.DataFrame()
#looping through each item in list and appending to empty data frame
for i in dflist:
    df = df.append(i)
# group by and calculating mean on index
data_mean=df.groupby(level=0).mean()
Then write to file as you are already doing.
Alternatively:
Instead of appending with a for loop, you can also specify the axis along which you want to concatenate the data frames. In your case you want to concatenate along the index (axis=0) to put the data frames on top of each other, as below:
col1 col2 col3
0 1 2 3
1 2 3 4
2 3 4 5
3 4 5 6
0 2 3 4
1 3 4 5
2 4 5 6
3 5 6 7
##creating data frames
df1= pd.DataFrame({'col1':[1,2,3,4],
'col2':[2,3,4,5],
'col3':[3,4,5,6]})
df2= pd.DataFrame({'col1':[2,3,4,5],
'col2':[3,4,5,6],
'col3':[4,5,6,7]})
## list of data frames
dflist = [df1,df2]
#concat the dflist along axis 0 to put the data frames on top of each other
df_concat=pd.concat(dflist,axis=0)
# group by and calculating mean on index
data_mean=df_concat.groupby(level=0).mean()
Then write to file as you are already doing.
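For completeness, a short sketch of that write step, reusing the filepath and openpyxl engine from the question's own code:
with pd.ExcelWriter(filepath, engine='openpyxl') as writer:
    data_mean.to_excel(writer)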
I have the following data:
userid itemid
1 1
1 1
1 3
1 4
2 1
2 2
2 3
I want to drop userIDs who have viewed the same itemID two or more times.
For example, userid=1 has viewed itemid=1 twice, and thus I want to drop all records of userid=1. However, since userid=2 hasn't viewed the same item twice, I will leave userid=2 as it is.
So I want my data to be like the following:
userid itemid
2 1
2 2
2 3
Can someone help me?
import pandas as pd
df = pd.DataFrame({'userid':[1,1,1,1, 2,2,2],
'itemid':[1,1,3,4, 1,2,3] })
You can use duplicated to determine the row level duplicates, then perform a groupby on 'userid' to determine 'userid' level duplicates, then drop accordingly.
To drop without a threshold:
df = df[~df.duplicated(['userid', 'itemid']).groupby(df['userid']).transform('any')]
To drop with a threshold, use keep=False in duplicated, and sum over the Boolean column and compare against your threshold. For example, with a threshold of 3:
df = df[~df.duplicated(['userid', 'itemid'], keep=False).groupby(df['userid']).transform('sum').ge(3)]
The resulting output for no threshold:
userid itemid
4 2 1
5 2 2
6 2 3
filter
Was made for this. You can pass a function that returns a boolean that determines if the group passed the filter or not.
filter and value_counts
Most generalizable and intuitive
df.groupby('userid').filter(lambda x: x.itemid.value_counts().max() < 2)
filter and is_unique
special case when looking for n < 2
df.groupby('userid').filter(lambda x: x.itemid.is_unique)
userid itemid
4 2 1
5 2 2
6 2 3
Group the dataframe by users and items:
views = df.groupby(['userid','itemid'])['itemid'].count()
#userid itemid
#1 1 2 <=== The offending row
# 3 1
# 4 1
#2 1 1
# 2 1
# 3 1
#Name: itemid, dtype: int64
Find out which users never viewed any item THRESHOLD or more times:
THRESHOLD = 2
viewed = ~(views.unstack() >= THRESHOLD).any(axis=1)
#userid
#1 False
#2 True
#dtype: bool
Combine the results and keep the 'good' rows:
combined = df.merge(pd.DataFrame(viewed).reset_index())
combined[combined[0]][['userid','itemid']]
# userid itemid
#4 2 1
#5 2 2
#6 2 3
# group userid and itemid and get a count
df2 = df.groupby(by=['userid','itemid']).apply(lambda x: len(x)).reset_index()
# Extract rows where the max userid-itemid count is less than 2.
df2 = df2[~df2.userid.isin(df2[df2.iloc[:,-1]>1]['userid'])][df.columns]
print(df2)
itemid userid
3 1 2
4 2 2
5 3 2
If you want to drop at a certain threshold, just change the condition to
df2.iloc[:,-1] > threshold
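i.e., a sketch of the full filter with a configurable threshold, applied to the grouped frame from the first step (the threshold name is introduced here for illustration):
threshold = 1   # '> 1' reproduces the filter used above; raise it to be more lenient
df2 = df.groupby(by=['userid', 'itemid']).apply(lambda x: len(x)).reset_index()
df2 = df2[~df2.userid.isin(df2[df2.iloc[:, -1] > threshold]['userid'])][df.columns]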
I do not know whether there is a function available in Pandas to do this task. However, I tried to make a workaround to deal with your problem.
Here is the full code.
import pandas as pd
dictionary = {'userid':[1,1,1,1,2,2,2],
'itemid':[1,1,3,4,1,2,3]}
df = pd.DataFrame(dictionary, columns=['userid', 'itemid'])
selected_user = []
for user in df['userid'].drop_duplicates().tolist():
    items = df.loc[df['userid']==user]['itemid'].tolist()
    if len(items) != len(set(items)): continue
    else: selected_user.append(user)
result = df.loc[(df['userid'].isin(selected_user))]
This code will produce the following outcome.
userid itemid
4 2 1
5 2 2
6 2 3
Hope it helps.