I have two separate pandas dataframes (df1 and df2) which have multiple columns with some common columns.
I would like to find every row in df2 that does not have a match in df1. Match between df1 and df2 is defined as having the same values in two different columns A and B in the same row.
df1
A B C text
45 2 1 score
33 5 2 miss
20 1 3 score
df2
A B D text
45 3 1 shot
33 5 2 shot
10 2 3 miss
20 1 4 miss
Result df (only rows 1 and 3 of df2 are returned, because rows 2 and 4 of df2 have the same A and B values as a row in df1)
A B D text
45 3 1 shot
10 2 3 miss
Is it possible to use the isin method in this scenario?
This works:
# move the key columns A and B into the index so rows can be compared on both at once
df1 = df1.set_index(['A','B'])
df2 = df2.set_index(['A','B'])
# now .isin will work
df2[~df2.index.isin(df1.index)].reset_index()
A B D text
0 45 3 1 shot
1 10 2 3 miss
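If you would rather leave the indexes untouched, the same anti-join can be sketched with merge and its indicator flag. This is a minimal sketch, assuming the frames look like the ones in the question; the names keys, merged and result are only illustrative.

```python
import pandas as pd

# Frames rebuilt from the question
df1 = pd.DataFrame({"A": [45, 33, 20], "B": [2, 5, 1],
                    "C": [1, 2, 3], "text": ["score", "miss", "score"]})
df2 = pd.DataFrame({"A": [45, 33, 10, 20], "B": [3, 5, 2, 1],
                    "D": [1, 2, 3, 4], "text": ["shot", "shot", "miss", "miss"]})

# Keep only the key columns of df1 so no extra columns leak into the result
keys = df1[["A", "B"]].drop_duplicates()

# indicator=True adds a "_merge" column saying where each row was found
merged = df2.merge(keys, on=["A", "B"], how="left", indicator=True)

# "left_only" rows are the df2 rows with no (A, B) match in df1
result = (merged.loc[merged["_merge"] == "left_only"]
                .drop(columns="_merge")
                .reset_index(drop=True))
print(result)
```

Dropping duplicate keys first keeps the left merge from repeating df2 rows when df1 contains the same (A, B) pair more than once.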
I have a dataframe where there are duplicate values in column A that have different values in column B.
I want to delete rows if one of column A duplicated values has values higher than 15 in column B.
Original dataframe
A Column B Column
1 10
1 14
2 10
2 20
3 5
3 10
Desired dataframe
A Column B Column
1 10
1 14
3 5
3 10
This works:
dfnew = df.groupby('A Column').filter(lambda x: x['B Column'].max()<=15 )
dfnew.reset_index(drop=True, inplace=True)
dfnew = dfnew[['A Column','B Column']]
print(dfnew)
output:
A Column B Column
0 1 10
1 1 14
2 3 5
3 3 10
Here is another way using groupby() and transform()
df.loc[~df['B Column'].gt(15).groupby(df['A Column']).transform('any')]
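For reference, here is the transform approach as a self-contained sketch on the data from the question. The intermediate name has_high is introduced here only for readability.

```python
import pandas as pd

# Data rebuilt from the question
df = pd.DataFrame({"A Column": [1, 1, 2, 2, 3, 3],
                   "B Column": [10, 14, 10, 20, 5, 10]})

# True for every row whose "A Column" group contains at least one value above 15
has_high = df["B Column"].gt(15).groupby(df["A Column"]).transform("any")

# Keep only the rows from groups without such a value
dfnew = df.loc[~has_high].reset_index(drop=True)
print(dfnew)
```

Because transform returns one boolean per original row, the mask lines up with df directly and no per-group lambda is needed.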
A super simple question, for which I cannot find an answer.
I have a dataframe with 1000+ columns and cannot drop by column number because I do not know the numbers. I want to drop all columns between two columns, based on their names.
foo = foo.drop(columns = ['columnWhatever233':'columnWhatever826'])
does not work. I tried several other options, but do not see a simple solution. Thanks!
You can use .loc with column range. For example if you have this dataframe:
A B C D E
0 1 3 3 6 0
1 2 2 4 9 1
2 3 1 5 8 4
Then to delete columns B to D:
df = df.drop(columns=df.loc[:, "B":"D"].columns)
print(df)
Prints:
A E
0 1 0
1 2 1
2 3 4
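Another possible route is to look up the positions of the two boundary columns and slice the column index by position. This is a sketch that assumes the column names are unique, so that get_loc returns a single integer; start, stop and trimmed are illustrative names.

```python
import pandas as pd

# Example frame from the answer above
df = pd.DataFrame({"A": [1, 2, 3], "B": [3, 2, 1], "C": [3, 4, 5],
                   "D": [6, 9, 8], "E": [0, 1, 4]})

# Integer positions of the boundary columns (assumes unique column names)
start = df.columns.get_loc("B")
stop = df.columns.get_loc("D")

# Positional slices exclude the end, so add 1 to drop "D" as well
trimmed = df.drop(columns=df.columns[start:stop + 1])
print(trimmed)
```

Having the positions in hand makes it easy to keep one of the boundary columns by shifting start or stop by one.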
My data is contained within two dataframes. Within each dataframe, the entries are sorted in each column. I want to now merge the two dataframes while preserving row order. For example, suppose I have this:
The first dataframe "A1" looks like this:
index a b c
0 1 4 1
3 2 7 3
5 5 8 4
6 6 10 8
...
and the second dataframe "A2" looks like this (A1 and A2 are the same size):
index a b c
1 3 1 2
2 4 2 5
4 7 3 6
7 8 5 7
...
I want to merge both of these dataframes to get the final dataframe "data":
index a b c
0 1 4 1
1 3 1 2
2 4 2 5
3 2 7 3
...
Here is what I have tried:
data = A1.merge(A2, how='outer', left_index=True, right_index=True)
But I keep getting strange results: some of the entries become NaN. I don't know whether this even works when there are multiple columns whose row order must be preserved, and I don't know how to fix it. I also tried data.join(A1, A2), but that raised an error saying the two dataframes could not be joined.
import pandas as pd
#Create Data Frame df and df1
df = pd.DataFrame({'a':[1,2,3,4],'b':[5,6,7,8],'c':[9,0,11,12]},index=[0,3,5,6])
df1 = pd.DataFrame({'a':[13,14,15,16],'b':[17,18,19,20],'c':[21,22,23,24]},index=[1,2,4,7])
#Concatenate df and df1 and sort by index (DataFrame.append was removed in pandas 2.0).
df2 = pd.concat([df, df1])
print(df2.sort_index())
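Applied to the A1 and A2 frames from the question, stacking the rows and sorting by index could look like the sketch below. The outer merge in the question places the two frames side by side instead of stacking them, which is presumably where the NaNs come from: each index label exists in only one of the frames. Only the rows shown in the question are rebuilt here.

```python
import pandas as pd

# The visible rows of A1 and A2 from the question
A1 = pd.DataFrame({"a": [1, 2, 5, 6], "b": [4, 7, 8, 10], "c": [1, 3, 4, 8]},
                  index=[0, 3, 5, 6])
A2 = pd.DataFrame({"a": [3, 4, 7, 8], "b": [1, 2, 3, 5], "c": [2, 5, 6, 7]},
                  index=[1, 2, 4, 7])

# Stack the rows, then restore the original row order via the index
data = pd.concat([A1, A2]).sort_index()
print(data)
```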
I have two pandas DF. Of unequal sizes. For example :
Df1
id value
a 2
b 3
c 22
d 5
Df2
id value
c 22
a 2
Now I want to extract from DF1 those rows which have the same id as a row in DF2. My first approach is to run 2 for loops, with something like:
x=[]
for i in range(len(DF2)):
    for j in range(len(DF1)):
        if DF2['id'][i] == DF1['id'][j]:
            x.append(DF1.iloc[j])
This is okay, but for 2 files of 400,000 lines in one and 5,000 in the other, I need an efficient Pythonic + pandas way.
import pandas as pd
data1={'id':['a','b','c','d'],
'value':[2,3,22,5]}
data2={'id':['c','a'],
'value':[22,2]}
df1=pd.DataFrame(data1)
df2=pd.DataFrame(data2)
finaldf=pd.concat([df1,df2],ignore_index=True)
Output after concat
id value
0 a 2
1 b 3
2 c 22
3 d 5
4 c 22
5 a 2
Final Output
finaldf.drop_duplicates()
id value
0 a 2
1 b 3
2 c 22
3 d 5
You can concat the dataframes, keep only the rows whose id appears more than once, then use drop_duplicates to keep just the first occurrence of each:
m = pd.concat((df1,df2))
m[m.duplicated('id',keep=False)].drop_duplicates()
id value
0 a 2
2 c 22
You can try this:
df = df1[df1.set_index(['id']).index.isin(df2.set_index(['id']).index)]
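Since only a single column is being compared, the same filter can probably be written without set_index by calling isin on the column itself. A minimal sketch with the data from the question; mask and matched are illustrative names.

```python
import pandas as pd

# Frames rebuilt from the question
df1 = pd.DataFrame({"id": ["a", "b", "c", "d"], "value": [2, 3, 22, 5]})
df2 = pd.DataFrame({"id": ["c", "a"], "value": [22, 2]})

# True for each df1 row whose id also occurs somewhere in df2
mask = df1["id"].isin(df2["id"])

# Boolean indexing keeps the matching rows in df1's original order
matched = df1[mask]
print(matched)
```

This is a vectorized membership test, so it should scale far better than the nested loops for frames of 400,000 and 5,000 rows.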
I am using Python to merge two dataframes:
join=pd.merge(df1,df2,on=["A","B"],how="left")
Table 1:
A B
a 1
b 2
c 3
Table 2:
A B Flag C
a 1 0 20
b 2 1 40
c 3 0 60
a 1 1 80
b 2 0 10
The result that I get after left join is:
A B Flag C
a 1 0 20
a 1 1 80
b 2 1 40
b 2 0 10
c 3 0 60
Here we see that the rows for a and b each appear twice because of the duplicates in table 2. I want to keep just one row per (A, B) pair based on the Flag column: of the two duplicate rows, keep the one whose Flag value is 1.
So Final Expected output is:
A B Flag C
a 1 1 80
b 2 1 40
c 3 0 60
Is there any pythonic way to do it?
# raise preferred lines to the top
df2 = df2.sort_values(by='Flag', ascending=False)
# deduplicate
df2 = df2.drop_duplicates(subset=['A','B'], keep='first')
# merge
pd.merge(df1, df2, on=['A','B'])
A B Flag C
0 a 1 1 80
1 b 2 1 40
2 c 3 0 60
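A variation on the same idea picks the preferred row per key with groupby and idxmax instead of sorting. This is a sketch on the tables from the question, assuming that a higher Flag should always win; best_rows and result are illustrative names.

```python
import pandas as pd

# Tables rebuilt from the question
df1 = pd.DataFrame({"A": ["a", "b", "c"], "B": [1, 2, 3]})
df2 = pd.DataFrame({"A": ["a", "b", "c", "a", "b"],
                    "B": [1, 2, 3, 1, 2],
                    "Flag": [0, 1, 0, 1, 0],
                    "C": [20, 40, 60, 80, 10]})

# Index label of the row with the largest Flag within each (A, B) group
best_rows = df2.groupby(["A", "B"])["Flag"].idxmax()

# Select those rows, then join; each key now matches at most one row
result = pd.merge(df1, df2.loc[best_rows], on=["A", "B"], how="left")
print(result)
```

When several rows in a group share the maximum Flag, idxmax returns the first of them, so ties resolve to the earliest row in df2.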
The concept is similar to what you would do in SQL: build a separate table with the selection criteria (in this case the maximum Flag per key), keeping enough columns to match each observation in the joined table.
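# left join on the shared key columns A and B
join = pd.merge(df1, df2, on=['A', 'B'], how="left")
# maximum Flag for each (A, B) key
maximums = join.groupby(['A', 'B'], as_index=False)['Flag'].max()
# keep only the rows whose Flag equals the maximum for their key
join = pd.merge(join, maximums, on=['A', 'B', 'Flag'])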
Try using this join:
join=pd.merge(df1,df2,on=["A","B"],how="left", left_index=True, right_index=True)
print(join)