Comparing dataframes in pandas - python

I have two separate pandas dataframes (df1 and df2) which have multiple columns with some common columns.
I would like to find every row in df2 that does not have a match in df1. Match between df1 and df2 is defined as having the same values in two different columns A and B in the same row.
df1
A B C text
45 2 1 score
33 5 2 miss
20 1 3 score
df2
A B D text
45 3 1 shot
33 5 2 shot
10 2 3 miss
20 1 4 miss
Result df (only rows 1 and 3 of df2 are returned; rows 2 and 4 are excluded because their A and B values match a row in df1)
A B D text
45 3 1 shot
10 2 3 miss
Is it possible to use the isin method in this scenario?

This works:
# use the key columns A and B as the index
df1 = df1.set_index(['A','B'])
df2 = df2.set_index(['A','B'])
# now .isin can compare the (A, B) index pairs directly
df2[~df2.index.isin(df1.index)].reset_index()
A B D text
0 45 3 1 shot
1 10 2 3 miss
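An alternative that avoids the index round-trip is a left merge with `indicator=True`, which marks each row of df2 as matched or not; this is a sketch using the question's sample data:

```python
import pandas as pd

# Sample frames from the question
df1 = pd.DataFrame({'A': [45, 33, 20], 'B': [2, 5, 1],
                    'C': [1, 2, 3], 'text': ['score', 'miss', 'score']})
df2 = pd.DataFrame({'A': [45, 33, 10, 20], 'B': [3, 5, 2, 1],
                    'D': [1, 2, 3, 4], 'text': ['shot', 'shot', 'miss', 'miss']})

# Left-merge on the key columns; the _merge column marks rows
# that were found only in df2 ('left_only') vs. in both frames
marked = df2.merge(df1[['A', 'B']], on=['A', 'B'], how='left', indicator=True)
result = (marked[marked['_merge'] == 'left_only']
          .drop(columns='_merge')
          .reset_index(drop=True))
print(result)
```

This keeps df2's original columns and needs no `set_index`/`reset_index` pair.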

Related

Pandas delete all duplicate rows in one column if values in another column is higher than a threshold

I have a dataframe where there are duplicate values in column A that have different values in column B.
I want to delete all rows sharing a column A value if any of that group's column B values is higher than 15.
Original Dataframe
A Column B Column
1 10
1 14
2 10
2 20
3 5
3 10
Desired dataframe
A Column B Column
1 10
1 14
3 5
3 10
This works:
dfnew = df.groupby('A Column').filter(lambda x: x['B Column'].max() <= 15)
dfnew.reset_index(drop=True, inplace=True)
dfnew = dfnew[['A Column','B Column']]
print(dfnew)
output:
A Column B Column
0 1 10
1 1 14
2 3 5
3 3 10
Here is another way using groupby() and transform()
df.loc[~df['B Column'].gt(15).groupby(df['A Column']).transform('any')]
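A self-contained version of the transform('any') approach above, using the question's data, can be sketched as:

```python
import pandas as pd

# Frame from the question
df = pd.DataFrame({'A Column': [1, 1, 2, 2, 3, 3],
                   'B Column': [10, 14, 10, 20, 5, 10]})

# For each group in 'A Column', flag whether any 'B Column' value
# exceeds 15, then keep only the groups where none does
mask = df['B Column'].gt(15).groupby(df['A Column']).transform('any')
result = df.loc[~mask].reset_index(drop=True)
print(result)
```

Unlike `filter`, which calls a Python lambda per group, `transform('any')` stays vectorized, which tends to matter on large frames.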

Dropping multiple columns in a pandas dataframe between two columns based on column names

A super simple question, for which I cannot find an answer.
I have a dataframe with 1000+ columns and cannot drop by column number, since I do not know the numbers. I want to drop all columns between two columns, based on their names.
foo = foo.drop(columns = ['columnWhatever233':'columnWhatever826'])
does not work. I tried several other options, but do not see a simple solution. Thanks!
You can use .loc with column range. For example if you have this dataframe:
A B C D E
0 1 3 3 6 0
1 2 2 4 9 1
2 3 1 5 8 4
Then to delete columns B to D:
df = df.drop(columns=df.loc[:, "B":"D"].columns)
print(df)
Prints:
A E
0 1 0
1 2 1
2 3 4
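A runnable version of the snippet above, including construction of the sample frame (the column names are placeholders for the real ones):

```python
import pandas as pd

# Sample frame from the answer
df = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 2, 1], 'C': [3, 4, 5],
                   'D': [6, 9, 8], 'E': [0, 1, 4]})

# Select the contiguous label range B..D with .loc, then drop
# those columns by name
to_drop = df.loc[:, 'B':'D'].columns
df = df.drop(columns=to_drop)
print(df)
```

Note that `.loc` label slices are inclusive on both ends, so column D is dropped too.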

Python how to merge two dataframes with multiple columns while preserving row order in each column?

My data is contained within two dataframes. Within each dataframe, the entries are sorted in each column. I want to now merge the two dataframes while preserving row order. For example, suppose I have this:
The first dataframe "A1" looks like this:
index a b c
0 1 4 1
3 2 7 3
5 5 8 4
6 6 10 8
...
and the second dataframe "A2" looks like this (A1 and A2 are the same size):
index a b c
1 3 1 2
2 4 2 5
4 7 3 6
7 8 5 7
...
I want to merge both of these dataframes to get the final dataframe "data":
index a b c
0 1 4 1
1 3 1 2
2 4 2 5
3 2 7 3
...
Here is what I have tried:
data = A1.merge(A2, how='outer', left_index=True, right_index=True)
But I keep getting strange results. I don't even know if this works if you have multiple columns whose row order you need to preserve. I find that some of the entries become NaNs for some reason. I don't know how to fix it. I also tried data.join(A1, A2), but it raised an error saying it couldn't join these two dataframes.
import pandas as pd
# Create DataFrames df and df1
df = pd.DataFrame({'a':[1,2,3,4],'b':[5,6,7,8],'c':[9,0,11,12]}, index=[0,3,5,6])
df1 = pd.DataFrame({'a':[13,14,15,16],'b':[17,18,19,20],'c':[21,22,23,24]}, index=[1,2,4,7])
# Concatenate df and df1, then sort by index
# (DataFrame.append was removed in pandas 2.0; use pd.concat instead)
df2 = pd.concat([df, df1])
print(df2.sort_index())
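The same idea applied to the question's own A1/A2 frames, with the shown row labels used as the index, can be sketched as:

```python
import pandas as pd

# The question's frames, with the original row labels as the index
A1 = pd.DataFrame({'a': [1, 2, 5, 6], 'b': [4, 7, 8, 10], 'c': [1, 3, 4, 8]},
                  index=[0, 3, 5, 6])
A2 = pd.DataFrame({'a': [3, 4, 7, 8], 'b': [1, 2, 3, 5], 'c': [2, 5, 6, 7]},
                  index=[1, 2, 4, 7])

# Stack the two frames, then restore the global row order via the index
data = pd.concat([A1, A2]).sort_index()
print(data)
```

Because the indexes interleave (0, 1, 2, 3, ...), sorting on the index recovers exactly the merged order the question asks for; no column-wise merge is needed.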

Compare columns in Pandas between two unequal size Dataframes for condition check

I have two pandas DF. Of unequal sizes. For example :
Df1
id value
a 2
b 3
c 22
d 5
Df2
id value
c 22
a 2
Now I want to extract from DF1 those rows which have the same id as in DF2. My first approach is to run 2 for loops, something like:
x = []
for i in range(len(DF2)):
    for j in range(len(DF1)):
        if DF2['id'][i] == DF1['id'][j]:
            x.append(DF1.iloc[j])
This works, but for two files of 400,000 lines in one and 5,000 in the other, I need an efficient, Pythonic pandas approach.
import pandas as pd
data1={'id':['a','b','c','d'],
'value':[2,3,22,5]}
data2={'id':['c','a'],
'value':[22,2]}
df1=pd.DataFrame(data1)
df2=pd.DataFrame(data2)
finaldf=pd.concat([df1,df2],ignore_index=True)
Output after concat
id value
0 a 2
1 b 3
2 c 22
3 d 5
4 c 22
5 a 2
Final Output
finaldf.drop_duplicates()
id value
0 a 2
1 b 3
2 c 22
3 d 5
You can concat the dataframes, check which rows are duplicated, then drop_duplicates to keep just the first occurrence:
m = pd.concat((df1,df2))
m[m.duplicated('id',keep=False)].drop_duplicates()
id value
0 a 2
2 c 22
You can try this:
df = df1[df1.set_index(['id']).index.isin(df2.set_index(['id']).index)]
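When matching on a single column, the set_index step above isn't needed; `Series.isin` works directly on the id column. A minimal sketch with the question's data:

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['a', 'b', 'c', 'd'], 'value': [2, 3, 22, 5]})
df2 = pd.DataFrame({'id': ['c', 'a'], 'value': [22, 2]})

# Boolean mask: True where df1's id also appears in df2's id column
result = df1[df1['id'].isin(df2['id'])]
print(result)
```

This is a vectorized membership test, so it scales to the 400,000-row case far better than the nested loops.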

Pandas left join returning multiple rows

I am using python to merge two dataframe:
join=pd.merge(df1,df2,on=["A","B"],how="left")
Table 1:
A B
a 1
b 2
c 3
Table 2:
A B Flag C
a 1 0 20
b 2 1 40
c 3 0 60
a 1 1 80
b 2 0 10
The result that I get after left join is:
A B Flag C
a 1 0 20
a 1 1 80
b 2 1 40
b 2 0 10
c 3 0 60
Here we see the rows for (a, 1) and (b, 2) each appear twice because of table 2. I want to keep just one row per key based on the Flag column, preferring the row whose Flag value is 1.
So Final Expected output is:
A B Flag C
a 1 1 80
b 2 1 40
c 3 0 60
Is there any pythonic way to do it?
# raise preferred lines to the top
df2 = df2.sort_values(by='Flag', ascending=False)
# deduplicate
df2 = df2.drop_duplicates(subset=['A','B'], keep='first')
# merge
pd.merge(df1, df2, on=['A','B'])
A B Flag C
0 a 1 1 80
1 b 2 1 40
2 c 3 0 60
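A self-contained version of the sort-then-deduplicate approach above, using the question's tables, can be sketched as:

```python
import pandas as pd

# Tables 1 and 2 from the question
df1 = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': [1, 2, 3]})
df2 = pd.DataFrame({'A': ['a', 'b', 'c', 'a', 'b'], 'B': [1, 2, 3, 1, 2],
                    'Flag': [0, 1, 0, 1, 0], 'C': [20, 40, 60, 80, 10]})

# Raise the Flag == 1 rows to the top, keep one row per (A, B) key,
# then left-merge back onto table 1
dedup = (df2.sort_values('Flag', ascending=False)
             .drop_duplicates(subset=['A', 'B'], keep='first'))
result = df1.merge(dedup, on=['A', 'B'], how='left')
print(result)
```

Deduplicating df2 before the merge is what prevents the one-to-many join from multiplying rows in the first place.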
The concept is similar to what you would do in SQL: build a separate table with the selection criteria (in this case the maximums for Flag), keeping enough columns to match each observation in the joined table.
join = pd.merge(df1, df2, how="left").reset_index()
maximums = join.groupby(by='A').max()
join = pd.merge(join, maximums, on=['Flag', 'A'])
Try using this join (merging on the row index; note that pd.merge does not accept `on` together with `left_index`/`right_index`):
join = pd.merge(df1, df2, how="left", left_index=True, right_index=True)
print(join)
