Pandas: Match values from two dataframes - Many to One - python

I have two dataframes I need to match by row. Where a match occurs I need to increment the value of a field in df1 by 1. df2 has multiple matches to df1. I don't want to merge the dataframes, just update df1 based on a match to df2.
The basic logic in my head is read the first row of df1, then try to match TRANID to each row of df2. When a match occurs, add +1 to the NUMINSTS value. Then loop back and do the same for the next row on df1. I'm just not sure how to approach this in Python/Pandas.
I'm an old COBOL programmer and am just learning Python/Pandas so any help is greatly appreciated.
Input Data
df1:
TRANID NUMINSTS
60000022 22
60000333 6
70000001 15
70000233 60
df2:
TRANID
60000333
70000233
70000233
Output
df3:
TRANID NUMINSTS
60000022 22
60000333 7 #incremented by 1
70000001 15
70000233 62 #incremented by 2

We can filter based on the values in df2 and keep adding or changing values in df1.
import pandas as pd
df1 = pd.DataFrame({"TRANID": ["60000022", "60000333", "70000001", "70000233"], "NUMINSTS": [22, 6, 15, 60]})
df2 = pd.DataFrame({"TRANID": ["60000333", "70000233", "70000233"]})
def add_num(df1, df2):
    # Add 1 to NUMINSTS once for every occurrence of a TRANID in df2
    for tran_id in df2["TRANID"]:
        df1.loc[df1["TRANID"] == tran_id, "NUMINSTS"] += 1
    return df1
# Note: df1 is modified in place, so df3 refers to the same object as df1
df3 = add_num(df1, df2)
print(df3)

You want two cases:
Tranid exists in df1
Tranid doesn't exist in df1.
Here is your code:
import pandas as pd
df1 = pd.DataFrame({'tranid': [1, 2, 3], 'numinst': [2, 4, 6]})
df2 = pd.DataFrame({'tranid': [1, 2, 4]})
for tranid in df2['tranid']:
    # Compare against the values of the column, not the index
    mask = df1['tranid'] == tranid
    if mask.any():
        # tranid exists in df1: increment its count
        df1.loc[mask, 'numinst'] += 1
    else:
        # tranid doesn't exist in df1: append it with a count of 1
        df1.loc[len(df1.index)] = [tranid, 1]
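The same two cases can also be handled without an explicit loop. The following is only a sketch, assuming each tranid appears at most once in df1; the names counts and totals are made up for illustration. It counts the ids in df2 and adds the counts to df1 with alignment on tranid, treating ids missing from either side as 0:

```python
import pandas as pd

df1 = pd.DataFrame({"tranid": [1, 2, 3], "numinst": [2, 4, 6]})
df2 = pd.DataFrame({"tranid": [1, 2, 4]})

# Count how often each tranid occurs in df2
counts = df2["tranid"].value_counts()

# Align on tranid; ids missing from one side contribute 0
# (assumes tranid is unique in df1)
totals = (
    df1.set_index("tranid")["numinst"]
    .add(counts, fill_value=0)
    .astype(int)
    .rename_axis("tranid")
    .reset_index(name="numinst")
)
print(totals)
```

Unlike the loop, this builds a new dataframe rather than updating df1 in place.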

You may try:
df1 = pd.DataFrame({'TRANID':[60000022, 60000333, 70000001, 70000233],
'NUMINSTS':[22,6,15,60]})
df1:
TRANID NUMINSTS
0 60000022 22
1 60000333 6
2 70000001 15
3 70000233 60
df2 = pd.DataFrame({'TRANID':[60000333, 70000233, 70000233]})
df2:
TRANID
0 60000333
1 70000233
2 70000233
Build a dictionary of counts of TRANID values from df2:
d = df2['TRANID'].value_counts().to_dict()
Copy df1 into df3 and update the NUMINSTS column: if the TRANID is in the dictionary above, increment by its count; otherwise keep the value unchanged:
df3 = df1.copy()
df3['NUMINSTS'] = df3.apply(
    lambda row:
        row['NUMINSTS'] + d[row['TRANID']] if row['TRANID'] in d else row['NUMINSTS'], axis=1)
If you don't want the rows that don't match, you could set their value to None as below and then drop the rows with None values:
df3['NUMINSTS'] = df3.apply(
    lambda row:
        row['NUMINSTS'] + d[row['TRANID']] if row['TRANID'] in d else None, axis=1)
df3.dropna(subset=['NUMINSTS'], inplace=True)
df3['NUMINSTS'] = df3['NUMINSTS'].astype(int)
df3.reset_index(inplace=True, drop=True)
Output df3:
TRANID NUMINSTS
0 60000333 7
1 70000233 62
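The counting idea above can also be written without a row-wise apply. This is a minimal sketch, assuming the same df1 and df2 as in the question; it maps each TRANID to its count in df2 and treats unmatched ids as 0:

```python
import pandas as pd

df1 = pd.DataFrame({"TRANID": [60000022, 60000333, 70000001, 70000233],
                    "NUMINSTS": [22, 6, 15, 60]})
df2 = pd.DataFrame({"TRANID": [60000333, 70000233, 70000233]})

# Number of occurrences of each TRANID in df2
counts = df2["TRANID"].value_counts()

df3 = df1.copy()
# map() looks up each TRANID in counts; ids absent from df2 become NaN, then 0
df3["NUMINSTS"] = df3["NUMINSTS"] + df3["TRANID"].map(counts).fillna(0).astype(int)
print(df3)
```

df1 is left untouched here, and the increment for each row equals the number of matching rows in df2.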

Related

Pandas: Search and match based on two conditions

I am using the code below to make a search on a .csv file and match a column in both files and grab a different column I want and add it as a new column. However, I am trying to make the match based on two columns instead of one. Is there a way to do this?
import pandas as pd
df1 = pd.read_csv("matchone.csv")
df2 = pd.read_csv("comingfrom.csv")
def lookup_prod(ip):
    for row in df2.itertuples():
        if ip in row[1]:
            return row[3]
    else:
        return '0'
df1['want'] = df1['name'].apply(lookup_prod)
df1[df1.want != '0']
print(df1)
#df1.to_csv('file_name.csv')
The code above searches the 'name' column in both files and gets the column I request ([3]) from df2. I want the code to match on both the 'name' column and another column, 'price', and take the value in ([3]) only if both columns match in df1 and df2.
df 1 :
name price value
a 10 35
b 10 21
c 10 33
d 10 20
e 10 88
df 2 :
name price want
a 10 123
b 5 222
c 10 944
d 10 104
e 5 213
When the code is run (asking for the want column from df2, based only on df1 name = df2 name), the produced result is:
name price value want
a 10 35 123
b 10 21 222
c 10 33 944
d 10 20 104
e 10 88 213
However, what I want is if both df1 name = df2 name and df1 price = df2 price, then take the column df2 want, so the desired result is:
name price value want
a 10 35 123
b 10 21 0
c 10 33 944
d 10 20 104
e 10 88 0
You need to use pandas.DataFrame.merge() method with multiple keys:
df1.merge(df2, on=['name','price'], how='left').fillna(0)
The merge represents missing values as NaN, so the column's dtype changes to float64, but you can change it back after filling the missing values with 0.
Also please be aware that duplicated combinations of name and price in df2 will appear several times in the result.
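As a small sketch of the dtype point above, using the sample data from the question (the variable name result is only illustrative), the want column can be filled and cast back to integers after the merge:

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["a", "b", "c", "d", "e"],
                    "price": [10, 10, 10, 10, 10],
                    "value": [35, 21, 33, 20, 88]})
df2 = pd.DataFrame({"name": ["a", "b", "c", "d", "e"],
                    "price": [10, 5, 10, 10, 5],
                    "want": [123, 222, 944, 104, 213]})

# Left join on both keys; rows of df1 without a match get NaN in 'want'
result = df1.merge(df2, on=["name", "price"], how="left")

# Fill only the looked-up column, then restore the integer dtype
result["want"] = result["want"].fillna(0).astype(int)
print(result)
```

Filling only the want column, rather than the whole frame, avoids overwriting genuine missing values in other columns.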
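If you are matching the two dataframes row by row on the name and the price, you can use Series.where and DataFrame.isin. Note that isin with a DataFrame argument compares values at the same index and column labels, so this only works when the rows of df1 and df2 are aligned: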
df1['want'] = df2['want'].where(df1[['name','price']].isin(df2).all(axis=1)).fillna('0')
df1
name price value want
0 a 10 35 123.0
1 b 10 21 0
2 c 10 33 944.0
3 d 10 20 104.0
4 e 10 88 0
Expanding on https://stackoverflow.com/a/73830294/20110802:
You can add the validate option to the merge in order to avoid duplication on one side (or both):
pd.merge(df1, df2, on=['name','price'], how='left', validate='1:1').fillna(0)
Also, if the float conversion is a problem for you, one option is to do an inner join first and then pd.concat the result with the "leftover" df1 where you already added a constant valued column. Would look something like:
df_inner = pd.merge(df1, df2, on=['name', 'price'], how='inner', validate='1:1')
merged_pairs = set(zip(df_inner.name, df_inner.price))
# Build the key pairs on df1's own index so the boolean mask aligns with df1
pairs = pd.Series(list(zip(df1.name, df1.price)), index=df1.index)
# .copy() avoids a SettingWithCopyWarning when the column is added below
df_anti = df1.loc[~pairs.isin(merged_pairs)].copy()
df_anti['want'] = 0
df_result = pd.concat([df_inner, df_anti]) # perhaps ignore_index=True ?
Looks complicated, but should be quite performant because it filters by set. I think it might be possible to set name and price as the index, merge on the index and then filter by index to avoid the zip-set shenanigans, but I'm no expert on MultiIndex handling.
# Try this code, it will give you the expected results
import pandas as pd
df1 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                    'price': [10, 10, 10, 10, 10],
                    'value': [35, 21, 33, 20, 88]})
df2 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                    'price': [10, 5, 10, 10, 5],
                    'want': [123, 222, 944, 104, 213]})
new = pd.merge(df1, df2, how='left', left_on=['name', 'price'], right_on=['name', 'price'])
print(new.fillna(0))

Delete row indices based on common columns in a Dataframe

I have following two dataframes df1 and df2
final raw st
abc 12 10
abc 17 15
abc 14 17
and
final raw
abc 12
abc 14
My expected output is
final raw st
abc 17 15
I would like to delete rows based on common column value.
My try:
df1.isin(df2)
This is giving me Boolean result. Another thing, I tried
df3 = pd.merge(df1, df2, on = ['final', 'raw'], how = 'inner') so that we get all the common columns for df1 and df3.
You are close with merge; you just need an extra step. First perform an outer join to keep all rows from both dataframes and enable the merge indicator, then filter on this indicator to keep the rows that appear only in df1 ('left_only'). Finally, keep only the columns from df1:
df3 = pd.merge(df1, df2, on = ['final', 'raw'], how='outer', indicator=True) \
        .query("_merge == 'left_only'")[df1.columns]
print(df3)
# Output
final raw st
1 abc 17 15
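For reference, a self-contained sketch of the same anti-join on the sample data. It is a slight variation rather than the exact code above: it uses how='left' instead of 'outer', since rows that exist only in df2 are discarded by the filter anyway.

```python
import pandas as pd

df1 = pd.DataFrame({"final": ["abc", "abc", "abc"],
                    "raw": [12, 17, 14],
                    "st": [10, 15, 17]})
df2 = pd.DataFrame({"final": ["abc", "abc"],
                    "raw": [12, 14]})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left join
merged = df1.merge(df2, on=["final", "raw"], how="left", indicator=True)

# Keep the rows of df1 that have no partner in df2, with df1's columns only
df3 = merged.loc[merged["_merge"] == "left_only", list(df1.columns)]
print(df3)
```

Because the comparison uses both final and raw, a row is dropped only when the whole key pair occurs in df2.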
You need to refer to the correct column when using isin.
result = df1[~df1['raw'].isin(df2['raw'])]

Compare 2 columns and merge rows on match?

New to coding here and trying to make a project. I want to compare two DF, and if any of the rows in the product column matches, I want to copy it over to a new DF. The rows in DF1 and DF2 will not be in the same position. Like I want to compare row 1 DF1 against the entire column in DF2. Is there an easy solution to this?
Take a look at this: https://cmdlinetips.com/2018/02/how-to-subset-pandas-dataframe-based-on-values-of-a-column/
You can try:
df3 = df1[df1['Product'].isin(set(df2['Product']))]
Which gives:
>>> df1 = pd.DataFrame({'prod':[1,2], 'ean':[5,6]})
>>> df1
prod ean
0 1 5
1 2 6
>>> df2 = pd.DataFrame({'prod':[3,2]})
>>> df2
prod
0 3
1 2
>>> df1[df1['prod'].isin(set(df2['prod']))]
prod ean
1 2 6
To explain:
df1[...] is to filter the rows of df1 based on criterion ...
I'm using a set() here so it is fast to check whether a row in df1 is in df2's "Product" column

using pandas, extract data from long format df and add it to wide format df

I have two dataframes, df1 and df2. df1 has repeat observations arranged in wide format, and df2 in long format.
import pandas as pd
df1 = pd.DataFrame({"ID":[1,2,3],"colA_1":[1,2,3],"date1":["1.1.2001", "2.1.2001","3.1.2001"],"colA_2":[4,5,6],"date2":["1.1.2002", "2.1.2002","3.1.2002"]})
df2 = pd.DataFrame({"ID":[1,1,2,2,3,3,3],"col1":[1,1.5,2,2.5,3,3.5,4],"date":["1.1.2001", "1.1.2002","2.1.2001","2.1.2002","3.1.2001","3.1.2002","4.1.2002"], "col3":[11,12,13,14,15,16,17],"col4":[21,22,23,24,25,26,27]})
df1 looks like:
ID colA_1 date1 colA_2 date2
0 1 1 1.1.2001 4 1.1.2002
1 2 2 2.1.2001 5 2.1.2002
2 3 3 3.1.2001 6 3.1.2002
df2 looks like:
ID col1 date col3 col4
0 1 1.0 1.1.2001 11 21
1 1 1.5 1.1.2002 12 22
2 2 2.0 2.1.2001 13 23
3 2 2.5 2.1.2002 14 24
4 3 3.0 3.1.2001 15 25
5 3 3.5 3.1.2002 16 26
6 3 4.0 4.1.2002 17 27
I want to take a given column from df2, "col3", and then:
(1) if the columns "ID" and "date" in df2 match with the columns "ID" and "date1" in df1, I want to put the value in a new column in df1 called "colB_1".
(2) else if the columns "ID" and "date" in df2 match with the columns "ID" and "date2" in df1, I want to put the value in a new column in df1 called "colB_2".
(3) else if the columns "ID" and "date" in df2 have no match with either ("ID" and "date1") or ("ID" and "date2"), I want to ignore these rows.
So, the output of this output dataframe, df3, should look like this:
ID colA_1 date1 colA_2 date2 colB_1 colB_2
0 1 1 1.1.2001 4 1.1.2002 11 12
1 2 2 2.1.2001 5 2.1.2002 13 14
2 3 3 3.1.2001 6 3.1.2002 15 16
What is the best way to do this?
I found this link, but the answer doesn't work for my case. I would like a really explicit way to specify column matching. I think it's possible that df.mask might be able to help me, but I am not sure how to implement it.
e.g.: the following code
df3 = df1.copy()
df3["colB_1"] = ""
df3["colB_2"] = ""
filter1 = (df1["ID"] == df2["ID"]) & (df1["date1"] == df2["date"])
filter2 = (df1["ID"] == df2["ID"]) & (df1["date2"] == df2["date"])
df3["colB_1"] = df.mask(filter1, other=df2["col3"])
df3["colB_2"] = df.mask(filter2, other=df2["col3"])
gives the error
ValueError: Can only compare identically-labeled Series objects
I asked this question previously, and it was marked as closed; my question was marked as a duplicate of this one. However, this is not the case. The answers in the linked question suggest the use of either map or df.merge. Map does not work with multiple conditions (in my case, ID and date). And df.merge (the answer given for matching multiple columns) does not work in my case when one of the column names in df1 and df2 that are to be merged are different ("date" and "date1", for example).
For example, the below code:
df3 = df1.merge(df2[["ID","date","col3"]], on=['ID','date1'], how='left')
fails with a KeyError.
Also noteworthy is that I will be dealing with many different files, with many different column naming schemes, and I will need a different subset each time. This is why I would like an answer that explicitly names the columns and conditions.
Any help with this would be much appreciated.
You can use pd.wide_to_long after removing the underscores from the column names. This unpivots the dataframe, which you can then merge with df2 and pivot back using unstack:
m = df1.rename(columns=lambda x: x.replace('_',''))
unpiv = pd.wide_to_long(m,['colA','date'],'ID','v').reset_index()
merge_piv = (unpiv.merge(df2[['ID','date','col3']],on=['ID','date'],how='left')
             .set_index(['ID','v'])['col3'].unstack().add_prefix('colB_'))
final = df1.merge(merge_piv,left_on='ID',right_index=True)
ID colA_1 date1 colA_2 date2 colB_1 colB_2
0 1 1 1.1.2001 4 1.1.2002 11 12
1 2 2 2.1.2001 5 2.1.2002 13 14
2 3 3 3.1.2001 6 3.1.2002 15 16
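If explicitly named columns are preferred, a possible alternative is one left merge per date column, using left_on and right_on to pair the differently named keys. This is only a sketch: it assumes each (ID, date) pair occurs at most once in df2 (enforced here with validate), and the names lookup and matched are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3], "colA_1": [1, 2, 3],
                    "date1": ["1.1.2001", "2.1.2001", "3.1.2001"],
                    "colA_2": [4, 5, 6],
                    "date2": ["1.1.2002", "2.1.2002", "3.1.2002"]})
df2 = pd.DataFrame({"ID": [1, 1, 2, 2, 3, 3, 3],
                    "date": ["1.1.2001", "1.1.2002", "2.1.2001", "2.1.2002",
                             "3.1.2001", "3.1.2002", "4.1.2002"],
                    "col3": [11, 12, 13, 14, 15, 16, 17]})

lookup = df2[["ID", "date", "col3"]]
df3 = df1.copy()

# One merge per (date column in df1, new column name) pair
for date_col, new_col in [("date1", "colB_1"), ("date2", "colB_2")]:
    matched = df1[["ID", date_col]].merge(
        lookup,
        left_on=["ID", date_col],
        right_on=["ID", "date"],
        how="left",        # keeps every row of df1, in df1's order
        validate="m:1",    # raises if (ID, date) is duplicated in df2
    )
    # Row count and order match df1, so assign by position
    df3[new_col] = matched["col3"].to_numpy()

print(df3)
```

Rows of df2 without a partner in df1, such as the 4.1.2002 entry, are simply ignored by the left join.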

Vectorized code to iterate one panda dataframe and compare values to a second dataframe

I have two DataFrames:
df_1:
name value
foo 5
baz 5
df_2:
name value1 value2
foo 3 7
bar 12 15
baz 2 3
fuz 4 9
And I need to compare each row in df_1 to each row in df_2 to see if both:
The names match
The value in df_1, column 1 is within the range of the two values in df_2
Hits:
foo 5
The code below so far:
for row in df_1.iterrows():
    mat_idx = (df_2.iloc[:,0] == row[1][1]) & (df_2.iloc[:,1] <= row[1][2]) & (df_2.iloc[:,2] >= row[1][2])
This works but is not fully vectorized, I would like to go without iterating through df_1, especially for << row DataFrames. Thanks.
Assuming df_1.columns=['name', 'value'] and df_2.columns=['name', 'value1', 'value2'], you could:
combined = df_2.merge(df_1, on='name', how='left')
keep = combined[(combined.value1<=combined.value) & (combined.value2>=combined.value)]
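A runnable sketch of this idea on the sample data from the question. It is a variation rather than the exact code above: it uses an inner join from df_1 and Series.between for the range check, and the name hits is only illustrative.

```python
import pandas as pd

df_1 = pd.DataFrame({"name": ["foo", "baz"], "value": [5, 5]})
df_2 = pd.DataFrame({"name": ["foo", "bar", "baz", "fuz"],
                     "value1": [3, 12, 2, 4],
                     "value2": [7, 15, 3, 9]})

# Pair each df_1 row with the df_2 rows that share its name
combined = df_1.merge(df_2, on="name", how="inner")

# between() is inclusive on both ends: value1 <= value <= value2
in_range = combined["value"].between(combined["value1"], combined["value2"])
hits = combined.loc[in_range, ["name", "value"]]
print(hits)
```

Only foo survives, because baz's value of 5 lies outside its range of 2 to 3.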
