Looking to map the highest-matching row values from Dataframe2 to Dataframe1 using conditions. We also need to check that df1['sal_date'] lies between df2['from'] and df2['to'].
I want to compare each row (df1['id_number']) from df1 with the whole column (df2['identity_No']) of df2. I have applied a partial match for that condition and it's working,
but how can I check which of the df2['from'] / df2['to'] ranges df1['sal_date'] falls into?
df1
score id_number company_name company_code sal_date action_reqd
20 IN2231D AXN pvt Ltd IN225 2019-12-22 Yes
45 UK654IN Aviva Intl Ltd IN115 2018-10-10 No
65 SL1432H Ship Incorporations CZ555 2015-08-19 Yes
35 LK0678G Oppo Mobiles pvt ltd PQ795 2018-06-26 Yes
59 NG5678J Nokia Inc RS885 2020-12-28 No
20 IN2231D AXN pvt Ltd IN215 2020-12-08 Yes
df2
OR_score identity_No comp_name comp_code dte_from dte_to
51 UK654IN Aviva Int.L Ltd IN515 2017-12-05 2018-10-13
25 SL6752J Ship Inc Traders CZ555 2013-08-07 2022-06-21
79 NG5678K Nokia Inc RS005 2018-10-13 2019-12-15
51 UK654IN Aviva Int.L Ltd IN525 2018-12-15 2020-12-24
20 IN22312 AXN pvt Ltd IN255 2019-12-10 2022-06-21
79 NG5678K Nokia Inc RS055 2019-06-08 2024-12-30
38 LK0665G Oppo Mobiles ltd PQ895 2016-10-10 2022-12-08
20 IN22312 AXN pvt Ltd IN275 2017-08-17 2018-10-13
75 NG5678K Nokia Inc RS055 2013-06-08 2016-12-30
df1.id_number needs to be compared with df2.identity_No, and df1.sal_date must be between df2.from and df2.to.
For example: row 1 of df1['id_number'] is matched against all rows of df2['identity_No']; its highest match percentage is with row 4 of df2['identity_No'], that match is above 80%, and df1.sal_date is between df2.from and df2.to,
so the respective values from row 4 of df2 are copied into row 1 of df1.
The same is to be applied for each row of df1.
Expected Output:
score id_number company_name company_code sal_date action_reqd
20 IN22312 AXN pvt Ltd IN255 2019-12-22 Yes
51 UK654IN Aviva Int.L Ltd IN515 2018-10-10 No
25 SL1432H Ship Incorporations CZ555 2015-08-19 Yes
38 LK0665G Oppo Mobiles ltd PQ795 2018-06-26 Yes
79 NG5678K Nokia Inc RS055 2020-12-28 No
20 IN22312 AXN pvt Ltd IN255 2020-12-08 Yes
I have tried this now:
from fuzzywuzzy import fuzz

for index, row in df1.iterrows():
    for index2, config2 in df2.iterrows():
        variable = fuzz.partial_ratio(row['id_number'], config2['identity_No'])
        if variable >= 80:
            df1['id_number'][index] = config2['identity_No']
            df1['company_name'][index] = config2['comp_name']
            df1['company_code'][index] = config2['comp_code']
            df1['score'][index] = config2['OR_Score']
How can I execute the remaining code so that the condition checks both variable >= 80 and that df1.sal_date is between df2.from and df2.to?
Please suggest how it can be done.
Your code has two main flaws:
Going by your description of the problem (below), whether df1['sal_date'] is between dte_from and dte_to is a necessary condition and should therefore be checked first. The second step is returning the highest possible match. Since you want to force a 1:1 mapping, whether the match is >= 80 doesn't matter; you simply return the highest one.
Looking to map highest matching row values from Dataframe2 to Dataframe1 using conditions. We also need to check df1['sal_date'] between df2['from'] and df2['to'].
Your code doesn't actually return the row from df2 with the highest match percentage over 80%; it returns the last one. Every time the condition variable >= 80 is met, the current row in df1 is overwritten.
Also, the name of column 1 in df2 is inconsistent: in df2 it's called OR_score with a lowercase s, but in the code it's called OR_Score with a capital S.
I changed your code a little. I added highest_match, which keeps track of the variable of the highest match so far and only overwrites when a new match's variable is higher. This resets for each row of df1.
I don't use >=, so it keeps the first match when variable is equal. If you want to keep your >= 80 condition, you can initialize highest_match = 80; however, this code won't warn you if no match >= 80 is found for a row of df1, in which case the row just stays as it was.
The code also only proceeds if the date condition is met first.
from fuzzywuzzy import fuzz

for index, row in df1.iterrows():
    highest_match = 0
    for index2, config2 in df2.iterrows():
        cond1 = df1['sal_date'][index] <= config2['dte_to']
        cond2 = df1['sal_date'][index] >= config2['dte_from']
        if cond1 and cond2:
            variable = fuzz.partial_ratio(row['id_number'], config2['identity_No'])
            if variable > highest_match:
                df1['id_number'][index] = config2['identity_No']
                df1['company_name'][index] = config2['comp_name']
                df1['company_code'][index] = config2['comp_code']
                df1['score'][index] = config2['OR_score']
                highest_match = variable
This code is not optimized for time complexity; it just does what you were trying to accomplish, or at least it produces your expected output. Adding the >= 80 constraint might improve the runtime, but then you'll need some logic for what should happen when no match is >= 80.
Next time, please also add the code that creates the tables, not just their printed output. That makes reproducing your problem much easier, and more people will be willing to help. Thanks.
EDIT:
If you want to leave rows with a missing sal_date untouched, simply skip them:
import pandas as pd
from fuzzywuzzy import fuzz

for index, row in df1.iterrows():
    if pd.isna(row['sal_date']):
        continue
    highest_match = 0
    for index2, config2 in df2.iterrows():
        cond1 = df1['sal_date'][index] <= config2['dte_to']
        cond2 = df1['sal_date'][index] >= config2['dte_from']
        if cond1 and cond2:
            variable = fuzz.partial_ratio(row['id_number'], config2['identity_No'])
            if variable > highest_match:
                df1['id_number'][index] = config2['identity_No']
                df1['company_name'][index] = config2['comp_name']
                df1['company_code'][index] = config2['comp_code']
                df1['score'][index] = config2['OR_score']
                highest_match = variable
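On the time-complexity point: the double loop can also be vectorized with a cross join, filtering on the date range first and then keeping the best score per df1 row. Below is a minimal sketch with toy data; difflib.SequenceMatcher from the standard library stands in for fuzz.partial_ratio here, so the scores are on a different scale than fuzz's:

```python
import difflib

import pandas as pd

df1 = pd.DataFrame({'id_number': ['UK654IN'],
                    'sal_date': pd.to_datetime(['2018-10-10'])})
df2 = pd.DataFrame({'identity_No': ['UK654IN', 'NG5678K'],
                    'dte_from': pd.to_datetime(['2017-12-05', '2018-10-13']),
                    'dte_to': pd.to_datetime(['2018-10-13', '2019-12-15'])})

# cross join via a constant helper column, then keep only pairs
# whose sal_date falls inside [dte_from, dte_to]
cross = df1.assign(t=1).merge(df2.assign(t=1), on='t').drop(columns='t')
cross = cross[(cross['sal_date'] >= cross['dte_from']) &
              (cross['sal_date'] <= cross['dte_to'])].copy()

# score every surviving pair; scaled to 0..100 like fuzz, but a plain ratio
cross['match'] = cross.apply(
    lambda r: difflib.SequenceMatcher(None, r['id_number'],
                                      r['identity_No']).ratio() * 100,
    axis=1)

# keep the highest-scoring df2 row per df1 id_number
best = cross.sort_values('match').groupby('id_number').tail(1)
print(best[['id_number', 'identity_No', 'match']])
```

From best you can then copy the df2 columns back onto df1 with a single merge instead of assigning cell by cell.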
Related
I know this question has been asked in some form, so apologies. I'm trying to fuzzy match list 1 (sample_name) to list 2 (actual_name). actual_name has significantly more names than list 1, and I keep running into fuzzy matching not working well. I've tried several fuzzy match methods (partial, token_set) but keep running into issues, since there are many names in list 2 that are very similar to each other. Is there any way to improve the matching here? Ideally I want a new dataframe with list 1 in column 1, the matched name from list 2 in column 2, and the match score in column 3. Any help would be much appreciated. Thanks.
Have used this so far:
from fuzzywuzzy import fuzz

df1 = sample_df['sample_name'].to_list()
df2 = actual_df['actual_name'].to_list()
response = {}
for name_to_find in df1:
    for name_master in df2:
        if fuzz.partial_ratio(name_to_find, name_master) > 90:
            response[name_to_find] = name_master
            break
for key, value in response.items():
    print('sample name: ' + key + ' actual_name: ' + value)
sample_name          actual_name
jtsports             JT Sports LLC
tombaseball          Tom Baseball Inc.
context express      Context Express LLC
zb sicily            ZB Sicily LLC
lightening express   Lightening Express LLC
fire roads           Fire Road Express
N/A                  Earth Treks
N/A                  TS Sports LLC
N/A                  MM Baseball Inc.
N/A                  Contact Express LLC
N/A                  AB Sicily LLC
N/A                  Lightening Roads LLC
Not sure if this is your expected output (you may need to adjust the threshold), but I think this is what you are looking for:
import pandas as pd
from fuzzywuzzy import process

threshold = 50
list1 = ['jtsports', 'tombaseball', 'context express', 'zb sicily',
         'lightening express', 'fire roads']
list2 = ['JT Sports LLC', 'Tom Baseball Inc.', 'Context Express LLC',
         'ZB Sicily LLC', 'Lightening Express LLC', 'Fire Road Express',
         'Earth Treks', 'TS Sports LLC', 'MM Baseball Inc.', 'Contact Express LLC',
         'AB Sicily LLC', 'Lightening Roads LLC']
response = []
for name_to_find in list1:
    resp_match = process.extractOne(name_to_find, list2)
    if resp_match[1] > threshold:
        row = {'sample_name': name_to_find, 'actual_name': resp_match[0], 'score': resp_match[1]}
        response.append(row)
        print(row)
results = pd.DataFrame(response)

# If you need all the 'actual_name' values to be in the dataframe, continue below.
# Otherwise don't include these last 2 lines of code.
unmatched = pd.DataFrame([x for x in list2 if x not in list(results['actual_name'])], columns=['actual_name'])
results = pd.concat([results, unmatched], sort=False).reset_index(drop=True)
Output:
print(results)
sample_name actual_name score
0 jtsports JT Sports LLC 79.0
1 tombaseball Tom Baseball Inc. 81.0
2 context express Context Express LLC 95.0
3 zb sicily ZB Sicily LLC 95.0
4 lightening express Lightening Express LLC 95.0
5 fire roads Fire Road Express 86.0
6 NaN Earth Treks NaN
7 NaN TS Sports LLC NaN
8 NaN MM Baseball Inc. NaN
9 NaN Contact Express LLC NaN
10 NaN AB Sicily LLC NaN
11 NaN Lightening Roads LLC NaN
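As an aside (not part of the answer above): if you want to experiment without installing fuzzywuzzy, the standard library's difflib.get_close_matches does a similar best-match lookup. It scores on a 0..1 scale and is case-sensitive, so its numbers won't line up with fuzz scores:

```python
import difflib

list2 = ['JT Sports LLC', 'Tom Baseball Inc.', 'Context Express LLC']

# n=1 returns at most one candidate scoring above cutoff, best first;
# the result is [] if nothing clears the cutoff
matches = difflib.get_close_matches('context express', list2, n=1, cutoff=0.5)
print(matches)
```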
It won't be the most efficient way to do it, since it compares every name on the left against every name on the right, but you could calculate the Levenshtein distance between each pair and then match based on the closest match.
That is how a lot of naive spell-check systems work.
I'm suggesting that you run this calculation for each of the correct names and return the match with the lowest score.
Adjusting the code you posted, I would do something like the following. Bear in mind that for Levenshtein distance lower means closer, so that would need some adjusting; the function you are using treats higher as closer, so the following should work with it as-is.
from fuzzywuzzy import fuzz

df1 = sample_df['sample_name'].to_list()
df2 = actual_df['actual_name'].to_list()
response = {}
for name_to_find in df1:
    highest_so_far = ("", 0)
    for name_master in df2:
        score = fuzz.partial_ratio(name_to_find, name_master)
        if score > highest_so_far[1]:
            highest_so_far = (name_master, score)
    response[name_to_find] = highest_so_far[0]
for key, value in response.items():
    print('sample name: ' + key + ' actual_name: ' + value)
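For reference, the Levenshtein distance mentioned above can be computed with a short dynamic-programming routine. This is a stdlib-only sketch; in practice a dedicated library is much faster:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insert, delete,
    substitute) needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete from a
                            curr[j - 1] + 1,             # insert into a
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

print(levenshtein('kitten', 'sitting'))  # 3
```

Matching with this metric then means picking the candidate with the smallest distance rather than the largest score.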
Looking to map values from dataframe2 to dataframe1 based on a conditional statement. I need to map the values from df2 to df1 where the match percentage between df1['id_number'] and df2['identity_No'] values is highest.
For example: row 1 from df1 is matched against all rows of df2 on a specific column; it has the highest match percentage with row 4 of df2, and that match is more than 75%, so the respective data is copied to df1.
Dataframe1
score id_number company_name company_code match_acc action_reqd
20 IN2231D AXN pvt Ltd IN225 Yes
45 UK654IN Aviva Intl Ltd IN115 No
65 SL1432H Ship Incorporations CZ555 Yes
35 LK0678G Oppo Mobiles pvt ltd PQ795 Yes
59 NG5678J Nokia Inc RS885 No
20 IN2231D AXN pvt Ltd IN215 Yes
Dataframe2
OR_score identity_No comp_name comp_code
51 UK654IN Aviva Int.L Ltd IN515
25 SL6752J Ship Inc Traders CZ555
79 NG5678K Nokia Inc RS005
20 IN22312 AXN pvt Ltd IN255
38 LK0665G Oppo Mobiles ltd PQ895
I need to check the match percentage: e.g. row 1 from df1 ('id_number') is compared with each row from df2 ('identity_No'), and whichever row from df2 has the highest match percentage will have its values mapped from df2 to df1. The same continues for each row of df1.
Expected output:
score id_number company_name company_code match_acc action_reqd
20 IN22312 AXN pvt Ltd IN225 90 Yes
51 UK654IN Aviva Int.L Ltd IN115 100 No
25 SL1432H Ship Incorporations CZ555 30 Yes
38 LK0665G Oppo Mobiles ltd PQ795 80 Yes
79 NG5678K Nokia Inc RS885 85 No
Code I have been trying:
cross = df1[['id_number']].assign(tmp=0).merge(df2[['identity_No']].assign(tmp=0), how='outer', on='tmp').drop(columns='tmp')
cross['match'] = cross.apply(lambda x: fuzz.ratio(x.id_number, x.identity_No), axis=1)
df1['match_acc'] = df1.id_number.map(cross.groupby('id_number').match.max())
for index, row in df1.iterrows():
    for index2, config2 in df2.iterrows():
        if row['match_acc'] >= 75:
            df1['id_number'][index] = config2['identity_No']
            df1['company_name'][index] = config2['comp_name']
            df1['company_code'][index] = config2['comp_code']
            df1['score'][index] = config2['OR_score']
Not getting the expected answer: it copies one row from df2 into every row of df1 where match_acc is >= 75.
I am looking to merge two columns using a cross merge, which I need later for further analysis.
Input Data:
id_number company_name match_acc
IN2231D AXN pvt Ltd
UK654IN Aviva Intl Ltd
SL1432H Ship Incorporations
LK0678G Oppo Mobiles pvt ltd
NG5678J Nokia Inc
identity_no Pincode company_name
IN2231 110030 AXN pvt Ltd
UK654IN 897653 Aviva Intl Ltd
SL1432 07658 Ship Incorporations
LK0678G 120988 Oppo Mobiles Pvt Ltd
I am looking to merge the column id_number with identity_no.
Code I am using so far:
cross = df1[['id_number']].merge(df2[['identity_no']], how='cross')
But I'm getting the error:
pandas.errors.MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False
Output what i need:
# id_number identity_no
# 0 IN2231D IN2231
# 1 IN2231D UK654IN
# 2 IN2231D SL1432
# ...
# 17 NG5678J UK654IN
# 18 NG5678J SL1432
# 19 NG5678J LK0678G
Please suggest.
how='cross'
This was a feature introduced in pd.__version__ == '1.2.0', so if you have an older version of pandas it will not work. If for some reason you cannot upgrade, you can accomplish the same with a helper column that holds the same constant in both DataFrames and that you then drop.
import pandas as pd
df1 = pd.DataFrame({'x': [1,2]})
df2 = pd.DataFrame({'y': ['a', 'b']})
# For versions >=1.2.0
df1.merge(df2, how='cross')
# x y
#0 1 a
#1 1 b
#2 2 a
#3 2 b
# For older versions assign a constant you merge on.
df1.assign(t=1).merge(df2.assign(t=1), on='t').drop(columns='t')
# x y
#0 1 a
#1 1 b
#2 2 a
#3 2 b
Working in python, in a Jupyter notebook. I am given this dataframe
congress chamber state party
80 house TX D
80 house TX D
80 house NJ D
80 house TX R
80 senate KY R
of every congressperson since the 80th congressional term, with a bunch of information; I've narrowed it down to what's needed for this question. I want to alter the dataframe so that I have a single row for every unique combination of congressional term, chamber, state, and party affiliation, plus a new column with the number of rows of that party divided by the number of rows where everything else besides party is the same. For example, this
congress chamber state party perc
80 house TX D 0.66
80 house NJ D 1
80 house TX R 0.33
80 senate KY R 1
is what I'd want my result to look like. The perc column is the percentage of, for example, democrats elected to congress in TX in the 80th congressional election.
I've tried a few different methods I've found on here, but most of them divide the number of rows by the number of rows in the entire dataframe, rather than by just the rows that meet the 3 given criteria. Here's the latest thing I've tried:
term=80
newdf = pd.crosstab(index=df['party'], columns=df['state']).stack()/len(df[df['congress']==term])
I define term because I'll only care about one term at a time for each dataframe.
A method I tried using groupby involved the following:
newdf = df.groupby(['congress', 'chamber','state']).agg({'party': 'count'})
state_pcts = newdf.groupby('party').apply(lambda x:
100 * x / float(x.sum()))
And it does group by term, chamber, state, but it returns a number that doesn't mean anything to me, when I check what the actual results should be.
Basically, you can do the following using value_counts for each group:
def func(f):
    return f['party'].value_counts(normalize=True)

df = (df
      .groupby(['congress', 'chamber', 'state'])
      .apply(func)
      .reset_index()
      .rename(columns={'party': 'perc', 'level_3': 'party'}))
print(df)
congress chamber state party perc
0 80 house NJ D 1.000000
1 80 house TX D 0.666667
2 80 house TX R 0.333333
3 80 senate KY R 1.000000
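The same percentages can also be computed without apply, by counting rows per (congress, chamber, state, party) combination and dividing by a groupby transform over the group totals. An alternative sketch, not part of the answer above:

```python
import pandas as pd

df = pd.DataFrame({'congress': [80, 80, 80, 80, 80],
                   'chamber': ['house', 'house', 'house', 'house', 'senate'],
                   'state': ['TX', 'TX', 'NJ', 'TX', 'KY'],
                   'party': ['D', 'D', 'D', 'R', 'R']})

# one row per unique combination, with its row count
counts = (df.groupby(['congress', 'chamber', 'state', 'party'])
            .size()
            .reset_index(name='n'))

# divide each count by the total of its (congress, chamber, state) group
counts['perc'] = counts['n'] / counts.groupby(
    ['congress', 'chamber', 'state'])['n'].transform('sum')
print(counts.drop(columns='n'))
```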
I have an Excel file with product names. The first row holds the categories (A1: Water, B1: Soft Drinks, etc.) and each cell below is a product in that category (A2: Sparkling, A3: Still, B2: Coca Cola, B3: Orange Juice, B4: Lemonade, etc.). I want to keep this list in a viewable format (not comma-separated etc.), as that makes it very easy for anybody to update the product names (a second person runs the script without understanding it).
If it helps, I can also have the Excel file in CSV format, and I can also move the categories from the top row to the first column.
I would like to replace the cells of a dataframe (df) with the product categories. For example, Coca Cola would become Soft Drinks. If the product is not in the Excel file, it should not be replaced (e.g. Cookie).
print(df)
Product Quantity
0 Coca Cola 1234
1 Cookie 4
2 Still 333
3 Chips 88
Expected Outcome:
print (df1)
Product Quantity
0 Soft Drinks 1234
1 Cookie 4
2 Water 333
3 Snacks 88
Use DataFrame.melt with DataFrame.dropna (or DataFrame.stack) to build a helper Series, then use Series.replace:
s = df1.melt().dropna().set_index('value')['variable']
Alternative:
s = df1.stack().reset_index(name='v').set_index('v')['level_1']
df['Product'] = df['Product'].replace(s)
#if performance is important
#df['Product'] = df['Product'].map(s).fillna(df['Product'])
print (df)
Product Quantity
0 Soft Drinks 1234
1 Cookie 4
2 Water 333
3 Snacks 88
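Put end to end with the example data, a runnable sketch; df1 here stands for the category sheet as read from Excel, with NaN padding where the columns have different lengths:

```python
import numpy as np
import pandas as pd

# category sheet: headers are the categories, cells below are products
df1 = pd.DataFrame({'Water': ['Sparkling', 'Still'],
                    'Soft Drinks': ['Coca Cola', 'Orange Juice'],
                    'Snacks': ['Chips', np.nan]})
df = pd.DataFrame({'Product': ['Coca Cola', 'Cookie', 'Still', 'Chips'],
                   'Quantity': [1234, 4, 333, 88]})

# product -> category lookup built from the melted sheet
s = df1.melt().dropna().set_index('value')['variable']

# map known products to their category, keep unknown ones as-is
df['Product'] = df['Product'].map(s).fillna(df['Product'])
print(df)
```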