Find row differences between two dataframes [duplicate] - python

This question already has answers here:
Anti-Join Pandas
(7 answers)
I have two dataframes that have the same structure/indexes.
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'column_a': [5, 4, 3, 2, 1],
    'column_b': [5, 4, 3, 2, 1],
    'column_c': [5, 4, 3, 2, 1]
})
df1.set_index('id', drop=False, inplace=True)
and
df2 = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'column_a': [5, 4, 3, 2, 1],
    'column_b': [5, 4, 3, 2, 1],
    'column_c': [5, 4, 10, 2, 1]
})
df2.set_index('id', drop=False, inplace=True)
And I would like to get this result:
expected = pd.DataFrame({'id': [3], 'column_a': [3], 'column_b': [3], 'column_c': [10]})
I tried using a for-loop, but I need to handle a large amount of data and that approach was not performant enough.

Try with merge, filtering on the indicator:
>>> (df2.reset_index(drop=True)
        .merge(df1.reset_index(drop=True), indicator="Exist", how="left")
        .query("Exist == 'left_only'")
        .drop("Exist", axis=1)
    )
   id  column_a  column_b  column_c
2   3         3         3        10
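Note that reset_index(drop=True) discards the id index, so the result above is indexed by position. A minimal follow-up sketch, assuming you want the id index back on the result (id is still available as a column because of drop=False earlier):

out = (df2.reset_index(drop=True)
          .merge(df1.reset_index(drop=True), indicator="Exist", how="left")
          .query("Exist == 'left_only'")
          .drop("Exist", axis=1)
          .set_index('id', drop=False))  # restore the id index on the anti-join result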

What you're asking for could possibly be answered here.
Using the drop_duplicates example from that thread,
pd.concat([df1,df2]).drop_duplicates(keep=False)
you can end up with the following DataFrame.
    id  column_a  column_b  column_c
id
3    3         3         3         3
3    3         3         3        10
Note that this approach retrieves the differing rows from both DataFrames.
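Since both DataFrames share the same index and columns, a simpler sketch (my addition, not from the linked thread) is an elementwise comparison that keeps only the rows of df2 that differ from df1:

# Boolean mask: True where any column differs between df2 and df1 (aligned on the id index)
diff_mask = df2.ne(df1).any(axis=1)
print(df2[diff_mask])
#     id  column_a  column_b  column_c
# id
# 3    3         3         3        10

This assumes the two frames really do have identical indexes and columns; for frames with extra or missing rows, the merge/indicator or concat approaches above are safer.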

Related

how to detect rows are subset of other rows and delete them in pandas series

I have a large pandas series that each row in it, is a list of numbers.
I want to detect rows that are subset of other rows and delete them from series.
My solution uses two for-loops, but it is very slow. Can anyone suggest a faster way to do this?
For example, in the sample below rows 2 and 4 must be deleted because they are subsets of rows 1 and 3 respectively.
import pandas as pd
cycles = pd.Series([[1, 2, 3, 4], [3, 4], [5, 6, 9, 7], [5, 9]])
First, you could sort each list (the elements are numbers) and convert it to a string. Then, for every string, simply check whether it is a substring of any of the other rows; if so, it is a subset. Since everything is sorted, the order of the numbers will not affect this step.
Finally, filter out only the ones that are not identified as a subset.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'cycles': [[9, 5, 4, 3], [9, 5, 4], [2, 4, 3], [2, 3]],
    'members': [4, 3, 3, 2]
})
print(df)
         cycles  members
0  [9, 5, 4, 3]        4
1     [9, 5, 4]        3
2     [2, 4, 3]        3
3        [2, 3]        2
df['cycles'] = df['cycles'].map(np.sort)
df['cycles_str'] = [','.join(map(str, c)) for c in df['cycles']]
# Here we check if matches are >1, because it will match with itself once!
df['is_subset'] = [df['cycles_str'].str.contains(c_str).sum() > 1 for c_str in df['cycles_str']]
df = df.loc[df['is_subset'] == False]
df = df.drop(['cycles_str', 'is_subset'], axis=1)
         cycles  members
0  [3, 4, 5, 9]        4
2     [2, 3, 4]        3
Edit: the above doesn't work for cases like [1, 2, 4] and [1, 2, 3, 4].
Rewrote the code. This version uses two loops and set.issubset inside a list comprehension:
# check if >1 True, as it will match with itself once!
df['is_subset'] = [[set(y).issubset(set(x)) for x in df['cycles']].count(True)>1 for y in df['cycles']]
df = df.loc[df['is_subset'] == False]
df = df.drop('is_subset', axis=1)
print(df)
         cycles  members
0  [9, 5, 4, 3]        4
2     [2, 4, 3]        3
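If the quadratic pairwise check is still too slow, here is a minimal sketch of a cheaper variant (my assumption, not from the original answer): visit rows longest-first, so each row only has to be compared against already-kept, longer-or-equal sets, and dropped rows never need to be checked against later ones.

import pandas as pd

cycles = pd.Series([[1, 2, 3, 4], [3, 4], [5, 6, 9, 7], [5, 9]])

# Visit rows longest-first; a row can only be a subset of a row at least as long.
order = cycles.str.len().sort_values(ascending=False).index
kept = []  # (original index, frozenset) pairs that are not subsets of anything kept so far
for idx in order:
    s = frozenset(cycles[idx])
    if not any(s <= other for _, other in kept):
        kept.append((idx, s))

result = cycles[sorted(i for i, _ in kept)]
print(result)
# 0    [1, 2, 3, 4]
# 2    [5, 6, 9, 7]

The worst case is still O(n^2) set comparisons, but early elimination usually reduces the work considerably.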

How do I summarize a data frame as a list combined with an ID?

I have a buyer (buyerid) and this buyer can buy several different cars (carid).
I would like to list which cars he has bought.
Here I would like to summarize all cars for each buyer and save them as a list.
For example, buyer 1 bought the car with ID 1 and ID 2. This list should now contain [1,2].
How do I make such a list?
If I call .values.tolist(), I get each row as a list, but I want the Carid values grouped per buyer.
import pandas as pd
d = {'Buyerid': [1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5],
     'Carid': [1, 2, 3, 4, 4, 1, 2, 4, 1, 3, 5],
     'Carid2': [1, 2, 3, 4, 4, 1, 2, 4, 1, 3, 5]}
df = pd.DataFrame(data=d)
print(df)
ls = df.values.tolist()
print(ls)
    Buyerid  Carid  Carid2
0         1      1       1
1         1      2       2
2         2      3       3
3         2      4       4
4         3      4       4
5         3      1       1
6         3      2       2
7         4      4       4
8         5      1       1
9         5      3       3
10        5      5       5
[[1, 1, 1], [1, 2, 2], [2, 3, 3], [2, 4, 4], [3, 4, 4], [3, 1, 1], [3, 2, 2], [4, 4, 4], [5, 1, 1], [5, 3, 3], [5, 5, 5]]
# What I want as list
[[1,2],[3,4],[4,1,2],[4],[1,3,5]]
If you need to select specific columns for processing, use GroupBy.apply with np.unique (np.unique also sorts, so this is only suitable if order is not important):
import numpy as np

L = (df.groupby(['Buyerid'])[['Carid', 'Carid2']]
       .apply(lambda x: np.unique(x).tolist())
       .tolist())
Or, if you need to process all columns except Buyerid, use:
L = (df.set_index('Buyerid')
       .groupby('Buyerid')
       .apply(lambda x: np.unique(x).tolist())
       .tolist())
print (L)
[[1, 2], [3, 4], [1, 2, 4], [4], [1, 3, 5]]
If ordering is important, use DataFrame.melt to unpivot, then remove duplicates with DataFrame.drop_duplicates:
L1 = (df.melt('Buyerid')
        .drop_duplicates(['Buyerid', 'value'])
        .groupby('Buyerid')['value']
        .agg(list)
        .tolist())
print (L1)
[[1, 2], [3, 4], [4, 1, 2], [4], [1, 3, 5]]
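A small variant, in case a dict keyed by Buyerid is more convenient than a plain list of lists (my assumption about the desired output shape):

d = (df.melt('Buyerid')
       .drop_duplicates(['Buyerid', 'value'])
       .groupby('Buyerid')['value']
       .agg(list)
       .to_dict())
print(d)
# {1: [1, 2], 2: [3, 4], 3: [4, 1, 2], 4: [4], 5: [1, 3, 5]}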

pandas groupby & lambda function to return nlargest(2)

Please see pandas df:
df = pd.DataFrame({'id': [1, 1, 2, 2, 2, 3],
                   'pay_date': ['Jul1', 'Jul2', 'Jul8', 'Aug5', 'Aug7', 'Aug22'],
                   'id_ind': [1, 2, 1, 2, 3, 1]})
I am trying to groupby 'id' and 'pay_date'. I only want to keep df['id_ind'].nlargest(2) in the dataframe after grouping by 'id' and 'pay_date'. Here is my code:
df = pd.DataFrame(df.groupby(['id', 'pay_date'])['id_ind']
                    .apply(lambda x: x.nlargest(2))
                    .reset_index())
This does not work, as the new df returns all the records. If it worked, 'id'==2 would only appear twice in the df, as there are 3 records and I only want the 2 largest by 'id_ind'.
My desired output:
pd.DataFrame({'id': [1, 1, 2, 2, 3],
              'pay_date': ['Jul1', 'Jul2', 'Aug5', 'Aug7', 'Aug22'],
              'id_ind': [1, 2, 2, 3, 1]})
Sort on id_ind, then take groupby(...).tail(2):
df_final = (df.sort_values('id_ind')
              .groupby('id').tail(2)
              .sort_index()
              .reset_index(drop=True))
Out[29]:
   id  id_ind pay_date
0   1       1     Jul1
1   1       2     Jul2
2   2       2     Aug5
3   2       3     Aug7
4   3       1    Aug22
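An alternative sketch that keeps the OP's nlargest idea but groups only by id (grouping by both id and pay_date makes every group a single row, which is why the original attempt returned everything):

out = (df.groupby('id', group_keys=False)
         .apply(lambda g: g.nlargest(2, 'id_ind'))  # top 2 rows per id by id_ind
         .sort_index()
         .reset_index(drop=True))
print(out)
#    id pay_date  id_ind
# 0   1     Jul1       1
# 1   1     Jul2       2
# 2   2     Aug5       2
# 3   2     Aug7       3
# 4   3    Aug22       1

This apply-based version is usually slower than sort_values + tail on large frames.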

how to limit the duplicate to 5 in pandas data frames?

import pandas as pd

col1 = ['A', 'B', 'A', 'C', 'A', 'B', 'A', 'C', 'A', 'C', 'A', 'A', 'A']
col2 = [1, 1, 4, 2, 4, 5, 6, 3, 1, 5, 2, 1, 1]
df = pd.DataFrame({'col1': col1, 'col2': col2})
For 'A' we have [1, 4, 4, 6, 1, 2, 1, 1], 8 items, but I want to limit the size to 5 while converting the DataFrame to a dict/list.
Output:
Dict = {'A':[1,4,4,6,1],'B':[1,5],'C':[2,3,5]}
Use pandas.DataFrame.groupby with apply:
df.groupby('col1')['col2'].apply(lambda x:list(x.head(5))).to_dict()
Output:
{'A': [1, 4, 4, 6, 1], 'B': [1, 5], 'C': [2, 3, 5]}
Use DataFrame.groupby with a lambda function: convert each group to a list, keep the first 5 values by slicing, and finally convert to a dictionary with Series.to_dict:
d = df.groupby('col1')['col2'].apply(lambda x: x.tolist()[:5]).to_dict()
print (d)
{'A': [1, 4, 4, 6, 1], 'B': [1, 5], 'C': [2, 3, 5]}
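A variant without apply, using GroupBy.head to keep the first 5 rows per group before aggregating (a sketch under the assumption that the original row order is what you want to keep):

d = (df.groupby('col1').head(5)          # first 5 rows per col1 value, in original order
       .groupby('col1')['col2']
       .agg(list)
       .to_dict())
print(d)
# {'A': [1, 4, 4, 6, 1], 'B': [1, 5], 'C': [2, 3, 5]}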

Maximum of an array constituting a pandas dataframe cell

I have a pandas dataframe in which a column is formed by arrays. So every cell is an array.
Say there is a column A in dataframe df, such that
A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9],
     ...]
I want to operate on each array and get, e.g., the maximum of each array, and store it in another column.
In the example, I would like to obtain another column
B = [3,
     6,
     9,
     ...]
I have tried these approaches so far, but none of them gives what I want:
df['B'] = np.max(df['A'])
df.applymap(lambda B: A.max())
df['B'] = df.applymap(lambda B: np.max(np.array(df['A'].tolist()), 0))
How should I proceed? And is this the best way to have my dataframe organized?
You can just apply(max). It doesn't matter if the values are lists or np.array.
df = pd.DataFrame({'a': [[1, 2, 3], [4, 5, 6], [7, 8, 9]]})
df['b'] = df['a'].apply(max)
print(df)
Outputs
           a  b
0  [1, 2, 3]  3
1  [4, 5, 6]  6
2  [7, 8, 9]  9
Here is one way without apply:
df['B'] = np.max(df['A'].values.tolist(), axis=1)
           A  B
0  [1, 2, 3]  3
1  [4, 5, 6]  6
2  [7, 8, 9]  9
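Note that the vectorized np.max(...) version assumes every list has the same length (so the lists stack into a rectangular array); for ragged lists, apply(max) or a plain list comprehension still works:

# Works even when the lists have different lengths
df['B'] = [max(row) for row in df['A']]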
