I am looking to execute the script only when a condition is satisfied.
If Column1 is not blank, only then should the script below run; otherwise it should print a message. I have tried several ways but couldn't get it to work.
Sheet1
id_number company_name match_acc
IN2231D AXN pvt Ltd
UK654IN Aviva Intl Ltd
Ship Incorporations
LK0678G Oppo Mobiles pvt ltd
NG5678J Nokia Inc
Sheet2
identity_no Pincode company_name
IN2231 110030 AXN pvt Ltd
UK654IN 897653 Aviva Intl Ltd
SL1432 07658 Ship Incorporations
LK0678G 120988 Oppo Mobiles Pvt Ltd
Script I have been using:
df1 = pd.read_excel(open(r'input.xlsx', 'rb'), sheet_name='sheet1')
df2 = pd.read_excel(open(r'input.xlsx', 'rb'), sheet_name='sheet2')
if df1[['id_number']] is not NaN:
    cross = df1[['id_number']].merge(df2[['identity_no']], how='cross')
    cross['match_acc'] = cross.apply(lambda x: fuzz.ratio(x.id_number, x.identity_no), axis=1)
    df1['match_acc'] = df1.id_number.map(cross.groupby('id_number').match_acc.max())
How about
filtered = df1[df1['id_number'] != ""]
then using the filtered variable for the rest of your code?
The id_number may just be an empty string and not necessarily NaN. I usually resort to this when checking for an empty column:
df[ (df[column_name].notnull()) & (df[column_name]!='') ]
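Putting both checks together, here is a minimal sketch (using a small inline frame in place of the Excel file) that drops NaN and empty id_number rows first and prints a message when nothing is left to match:

```python
import pandas as pd

# Stand-in for df1 as read from Sheet1; the third row has a blank id_number.
df1 = pd.DataFrame({
    'id_number': ['IN2231D', 'UK654IN', None],
    'company_name': ['AXN pvt Ltd', 'Aviva Intl Ltd', 'Ship Incorporations'],
})

# Keep only rows where id_number is neither NaN nor an empty string.
filtered = df1[df1['id_number'].notnull() & (df1['id_number'] != '')]

if filtered.empty:
    print('No id_number values to match')
else:
    # Run the cross-merge / fuzz.ratio logic on `filtered` instead of df1.
    print(filtered)
```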
My dataframe is jumbled up with special characters and some company extensions that I'm trying to get rid of.
---
df
--
Microsoft inc
google INC
Apple Pvt Ltd
orc~l PvT ltd
Am##zon Pvt Ltd
Expected output
--
df
--
Microsoft
google
Apple
oracl
Amazon
What I tried:
word_list= ['inc','INC','Pvt Ltd', 'PvT ltd']
df1= ''.join([repl if idx in word_list else idx for idx in df])
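One answer-style sketch (an assumption, not the asker's method): strip the extension words case-insensitively with a regex instead of the word_list join, then drop any remaining non-alphanumeric characters. Note a regex can only remove the stray symbols (orc~l becomes orcl, Am##zon becomes Amzon); restoring the missing letters, as in the expected output, would need fuzzy matching against a list of known company names.

```python
import pandas as pd

names = pd.Series(['Microsoft inc', 'google INC', 'Apple Pvt Ltd',
                   'orc~l PvT ltd', 'Am##zon Pvt Ltd'])

cleaned = (
    names
    # Drop the company-extension words, ignoring case.
    .str.replace(r'\b(inc|pvt\s+ltd|ltd)\b', '', case=False, regex=True)
    # Drop anything that is not a letter, digit, or space, then trim.
    .str.replace(r'[^A-Za-z0-9 ]', '', regex=True)
    .str.strip()
)
print(cleaned.tolist())
# ['Microsoft', 'google', 'Apple', 'orcl', 'Amzon']
```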
I am looking to find the percentage difference between two dataframes. I have tried using fuzzywuzzy but am not getting the expected output.
Suppose I have two dataframes with 3 columns each; I want to find the match percentage between them.
df1
score id_number company_name company_code
200 IN2231D AXN pvt Ltd IN225
450 UK654IN Aviva Intl Ltd IN115
650 SL1432H Ship Incorporations CZ555
350 LK0678G Oppo Mobiles pvt ltd PQ795
590 NG5678J Nokia Inc RS885
250 IN2231D AXN pvt Ltd IN215
df2
QR_score Identity_No comp_name comp_code match_acc
200.00 IN2231D AXN pvt Inc IN225
420.0 UK655IN Aviva Intl Ltd IN315
350.35 SL2252H Ship Inc CK555
450.0 LK9978G Oppo Mobiles pvt ltd PRS95
590.5 NG5678J Nokia Inc RS885
250.0 IN5531D AXN pvt Ltd IN215
Code I am using:
df1 = df[['score','id_number','company_code']]
df2 = df[['QR_score','identity_No','comp_code']]
for idx, row1 in df1.iterrows():
    for idx2, row2 in df2.iterrows():
        df2['match_acc'] =
Suppose the first row in both dataframes matches by 75%; then 75 will be listed in the df2['match_acc'] column. The same is to be followed for each row.
IIUC, rename the columns to match, then use eq + mean on axis=1:
df1.columns = df2.columns
df2['match_acc'] = df1.eq(df2).mean(axis=1) * 100
df2:
QR_score Identity_No comp_name comp_code match_acc
0 200.00 IN2231D AXN pvt Inc IN225 75.0
1 420.00 UK655IN Aviva Intl Ltd IN315 25.0
2 350.35 SL2252H Ship Inc CK555 0.0
3 450.00 LK9978G Oppo Mobiles pvt ltd PRS95 25.0
4 590.50 NG5678J Nokia Inc RS885 75.0
5 250.00 IN5531D AXN pvt Ltd IN215 75.0
Complete Working Example
import pandas as pd
df1 = pd.DataFrame({
'score': [200, 450, 650, 350, 590, 250],
'id_number': ['IN2231D', 'UK654IN', 'SL1432H', 'LK0678G', 'NG5678J',
'IN2231D'],
'company_name': ['AXN pvt Ltd', 'Aviva Intl Ltd', 'Ship Incorporations',
'Oppo Mobiles pvt ltd', 'Nokia Inc', 'AXN pvt Ltd'],
'company_code': ['IN225', 'IN115', 'CZ555', 'PQ795', 'RS885', 'IN215']
})
df2 = pd.DataFrame({
'QR_score': [200.00, 420.0, 350.35, 450.0, 590.5, 250.0],
'Identity_No': ['IN2231D', 'UK655IN', 'SL2252H', 'LK9978G', 'NG5678J',
'IN5531D'],
'comp_name': ['AXN pvt Inc', 'Aviva Intl Ltd', 'Ship Inc',
'Oppo Mobiles pvt ltd', 'Nokia Inc', 'AXN pvt Ltd'],
'comp_code': ['IN225', 'IN315', 'CK555', 'PRS95', 'RS885', 'IN215']
})
df1.columns = df2.columns
df2['match_acc'] = df1.eq(df2).mean(axis=1) * 100
print(df2)
If cell-by-cell similarity should instead be assessed by something like fuzzywuzzy, vectorize the chosen fuzzywuzzy function, apply it to all cells, and create a new dataframe from the results. fuzzywuzzy only works with strings, so object-type columns and non-object columns must be handled separately.
import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz
# Make Column Names Match
df1.columns = df2.columns
# Select string (object) columns
t1 = df1.select_dtypes(include='object')
t2 = df2.select_dtypes(include='object')
# Apply fuzz.ratio to every cell of both frames
obj_similarity = pd.DataFrame(np.vectorize(fuzz.ratio)(t1, t2),
columns=t1.columns,
index=t1.index)
# Use non-object similarity with eq
other_similarity = df1.select_dtypes(exclude='object').eq(
df2.select_dtypes(exclude='object')) * 100
# Merge Similarities together and take the average per row
total_similarity = pd.concat((
obj_similarity, other_similarity
), axis=1).mean(axis=1)
df2['match_acc'] = total_similarity
df2:
QR_score Identity_No comp_name comp_code match_acc
0 200.00 IN2231D AXN pvt Inc IN225 93.25
1 420.00 UK655IN Aviva Intl Ltd IN315 66.50
2 350.35 SL2252H Ship Inc CK555 49.00
3 450.00 LK9978G Oppo Mobiles pvt ltd PRS95 57.75
4 590.50 NG5678J Nokia Inc RS885 75.00
5 250.00 IN5531D AXN pvt Ltd IN215 92.75
Looking to map the highest-matching row values from Dataframe2 to Dataframe1 using conditions. We also need to check that df1['sal_date'] is between df2['dte_from'] and df2['dte_to'].
I want to compare each row (df1['id_number']) of df1 with the whole column (df2['identity_No']) of df2. I have applied a partial match for the extraction condition and it's working,
but how can I check which df2['dte_from'] / df2['dte_to'] range df1['sal_date'] falls into?
df1
score id_number company_name company_code sal_date action_reqd
20 IN2231D AXN pvt Ltd IN225 2019-12-22 Yes
45 UK654IN Aviva Intl Ltd IN115 2018-10-10 No
65 SL1432H Ship Incorporations CZ555 2015-08-19 Yes
35 LK0678G Oppo Mobiles pvt ltd PQ795 2018-06-26 Yes
59 NG5678J Nokia Inc RS885 2020-12-28 No
20 IN2231D AXN pvt Ltd IN215 2020-12-08 Yes
df2
OR_score identity_No comp_name comp_code dte_from dte_to
51 UK654IN Aviva Int.L Ltd IN515 2017-12-05 2018-10-13
25 SL6752J Ship Inc Traders CZ555 2013-08-07 2022-06-21
79 NG5678K Nokia Inc RS005 2018-10-13 2019-12-15
51 UK654IN Aviva Int.L Ltd IN525 2018-12-15 2020-12-24
20 IN22312 AXN pvt Ltd IN255 2019-12-10 2022-06-21
79 NG5678K Nokia Inc RS055 2019-06-08 2024-12-30
38 LK0665G Oppo Mobiles ltd PQ895 2016-10-10 2022-12-08
20 IN22312 AXN pvt Ltd IN275 2017-08-17 2018-10-13
75 NG5678K Nokia Inc RS055 2013-06-08 2016-12-30
df1.id_number needs to be compared with df2.identity_No, and df1.sal_date must be between df2.dte_from and df2.dte_to.
Looking to match row by row: row 1 of df1['id_number'] is matched across all rows of df2['identity_No']. If it has the highest match percentage with, say, row 4 of df2['identity_No'], that percentage is more than 80%, and df1.sal_date is between df2.dte_from and df2.dte_to,
then the respective values from row 4 of df2 are copied to row 1 of df1.
The same is to be applied for each row of df1.
Expected Output:
score id_number company_name company_code sal_date action_reqd
20 IN22312 AXN pvt Ltd IN255 2019-12-22 Yes
51 UK654IN Aviva Int.L Ltd IN515 2018-10-10 No
25 SL1432H Ship Incorporations CZ555 2015-08-19 Yes
38 LK0665G Oppo Mobiles ltd PQ795 2018-06-26 Yes
79 NG5678K Nokia Inc RS055 2020-12-28 No
20 IN22312 AXN pvt Ltd IN255 2020-12-08 Yes
I have tried this now:
for index, row in df1.iterrows():
    for index2, config2 in df2.iterrows():
        variable = fuzz.partial_ratio(row['id_number'], config2['identity_No'])
        if variable >= 80:
            df1['id_number'][index] = config2['identity_No']
            df1['company_name'][index] = config2['comp_name']
            df1['company_code'][index] = config2['comp_code']
            df1['score'][index] = config2['OR_Score']
How can I execute the remaining code after the if condition so that it runs only when variable >= 80 and df1.sal_date is between df2.dte_from and df2.dte_to?
Please suggest how it can be done.
Your code has two main flaws:
Going by your description of the problem (below), whether or not df1['sal_date'] is between dte_from and dte_to is the necessary condition and thus should be checked first. The second step is returning the highest possible match. Since you want to force a 1:1 mapping, the match being >= 80 doesn't matter; you simply return the highest one.
Looking to map the highest-matching row values from Dataframe2 to Dataframe1 using conditions. We also need to check df1['sal_date'] between df2['dte_from'] and df2['dte_to'].
Your code doesn't actually return the row from df2 with the highest match percentage over 80%; it returns the last one. Every time the condition variable >= 80 is met, the current row in df1 is overwritten.
Also, the name of column 1 in df2 is inconsistent: in df2 it's called OR_score with a lowercase s, but in the code it's called OR_Score with a capital S.
I changed your code a little. I added highest_match, which keeps track of the score of the best match so far and only overwrites if the new match's score is higher. This resets for each row of df1.
I don't use >=, so it keeps the first match if scores are equal. If you want to keep your >= 80 condition, you can initialize highest_match = 80; however, this code won't warn you if, for some row of df1, no match >= 80 is found and the row thus just stays as it was.
The code also only proceeds if the date condition is met first.
from fuzzywuzzy import fuzz
for index, row in df1.iterrows():
    highest_match = 0
    for index2, config2 in df2.iterrows():
        cond1 = row['sal_date'] <= config2['dte_to']
        cond2 = row['sal_date'] >= config2['dte_from']
        if cond1 and cond2:
            variable = fuzz.partial_ratio(row['id_number'], config2['identity_No'])
            if variable > highest_match:
                # .loc avoids chained-indexing assignment, which can raise
                # SettingWithCopyWarning or silently fail to write.
                df1.loc[index, 'id_number'] = config2['identity_No']
                df1.loc[index, 'company_name'] = config2['comp_name']
                df1.loc[index, 'company_code'] = config2['comp_code']
                df1.loc[index, 'score'] = config2['OR_score']
                highest_match = variable
This code is not optimized for time complexity; it just does what you were trying to accomplish, or at least it produces your expected output. Adding the >= 80 constraint might improve runtime, but then you'll need to add some logic for what should happen if no match is >= 80.
Please also add the code that creates the tables next time, not just the output. That makes recreating your problem much easier, and more people will be willing to help. Thanks.
EDIT:
If you want to keep rows with a missing sal_date, simply skip them:
from fuzzywuzzy import fuzz
for index, row in df1.iterrows():
    if pd.isna(row['sal_date']):
        continue
    highest_match = 0
    for index2, config2 in df2.iterrows():
        cond1 = row['sal_date'] <= config2['dte_to']
        cond2 = row['sal_date'] >= config2['dte_from']
        if cond1 and cond2:
            variable = fuzz.partial_ratio(row['id_number'], config2['identity_No'])
            if variable > highest_match:
                # .loc avoids chained-indexing assignment.
                df1.loc[index, 'id_number'] = config2['identity_No']
                df1.loc[index, 'company_name'] = config2['comp_name']
                df1.loc[index, 'company_code'] = config2['comp_code']
                df1.loc[index, 'score'] = config2['OR_score']
                highest_match = variable
Looking to map values from dataframe2 to dataframe1 based on a conditional statement. I need to map the values from df2 to df1 where the matching percentage between df1['id_number'] and df2['identity_No'] is highest.
For example: row 1 of df1 is matched across all rows of df2 on a specific column; if it has the highest match percentage with row 4 of df2, and that percentage is more than 75%, the respective data is copied to df1.
Dataframe1
score id_number company_name company_code match_acc action_reqd
20 IN2231D AXN pvt Ltd IN225 Yes
45 UK654IN Aviva Intl Ltd IN115 No
65 SL1432H Ship Incorporations CZ555 Yes
35 LK0678G Oppo Mobiles pvt ltd PQ795 Yes
59 NG5678J Nokia Inc RS885 No
20 IN2231D AXN pvt Ltd IN215 Yes
Dataframe2
OR_score identity_No comp_name comp_code
51 UK654IN Aviva Int.L Ltd IN515
25 SL6752J Ship Inc Traders CZ555
79 NG5678K Nokia Inc RS005
20 IN22312 AXN pvt Ltd IN255
38 LK0665G Oppo Mobiles ltd PQ895
I need to check the matching accuracy percentage: for example, row 1 of df1 ("id_number") is matched against each row of df2 ("identity_No"), and whichever row of df2 has the highest matching percentage will have its values mapped from df2 to df1. The same continues for each row of df1.
Expected output:
score id_number company_name company_code match_acc action_reqd
20 IN22312 AXN pvt Ltd IN225 90 Yes
51 UK654IN Aviva Int.L Ltd IN115 100 No
25 SL1432H Ship Incorporations CZ555 30 Yes
38 LK0665G Oppo Mobiles ltd PQ795 80 Yes
79 NG5678K Nokia Inc RS885 85 No
Code I have been trying:
cross = df1[['id_number']].assign(tmp=0).merge(df2[['identity_no']].assign(tmp=0), how='outer', on='tmp').drop(columns='tmp')
cross['match'] = cross.apply(lambda x: fuzz.ratio(x.id_number, x.identity_no), axis=1)
df1['match_acc'] = df1.id_number.map(cross.groupby('id_number').match.max())
for index, row in df1.iterrows():
    for index2, config2 in df2.iterrows():
        if row['match_acc'] >= 75:
            df1['id_number'][index] = config2['identity_No']
            df1['company_name'][index] = config2['comp_name']
            df1['company_code'][index] = config2['comp_code']
            df1['score'][index] = config2['OR_Score']
I am not getting the expected answer; it copies row 1 of df2 into every row of df1 where match_acc >= 75.
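One way to get the described behaviour is to score each df1 row against the whole identity_No column, take the index of the best score with idxmax, and copy that single best df2 row when the score is at least 75. A minimal sketch with two inline sample rows per frame (an illustration, not the accepted answer; the ratio function falls back to the standard library's difflib, which uses the same 0-100 scale as fuzz.ratio, in case fuzzywuzzy is not installed):

```python
import pandas as pd

try:
    from fuzzywuzzy import fuzz
    ratio = fuzz.ratio
except ImportError:
    # Standard-library stand-in for fuzz.ratio (same 0-100 scale).
    from difflib import SequenceMatcher
    def ratio(a, b):
        return int(round(SequenceMatcher(None, a, b).ratio() * 100))

# Two sample rows per frame, borrowed from the tables above.
df1 = pd.DataFrame({
    'score': [20, 45],
    'id_number': ['IN2231D', 'UK654IN'],
    'company_name': ['AXN pvt Ltd', 'Aviva Intl Ltd'],
    'company_code': ['IN225', 'IN115'],
})
df2 = pd.DataFrame({
    'OR_score': [51, 20],
    'identity_No': ['UK654IN', 'IN22312'],
    'comp_name': ['Aviva Int.L Ltd', 'AXN pvt Ltd'],
    'comp_code': ['IN515', 'IN255'],
})

for idx, row in df1.iterrows():
    # Score this id_number against every identity_No and keep the best row.
    scores = df2['identity_No'].apply(lambda s: ratio(row['id_number'], s))
    best = scores.idxmax()
    df1.loc[idx, 'match_acc'] = scores[best]
    if scores[best] >= 75:
        # Copy the single best-matching df2 row into df1, not every row >= 75.
        df1.loc[idx, ['score', 'id_number', 'company_name', 'company_code']] = [
            df2.loc[best, 'OR_score'], df2.loc[best, 'identity_No'],
            df2.loc[best, 'comp_name'], df2.loc[best, 'comp_code'],
        ]

print(df1)
```

The key difference from the double iterrows loop above is that the copy happens once per df1 row, after the best score is known, rather than on every df2 row that clears the threshold.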
I am looking to merge two columns using a cross merge, which I need later for further analysis.
Input data:
id_number company_name match_acc
IN2231D AXN pvt Ltd
UK654IN Aviva Intl Ltd
SL1432H Ship Incorporations
LK0678G Oppo Mobiles pvt ltd
NG5678J Nokia Inc
identity_no Pincode company_name
IN2231 110030 AXN pvt Ltd
UK654IN 897653 Aviva Intl Ltd
SL1432 07658 Ship Incorporations
LK0678G 120988 Oppo Mobiles Pvt Ltd
I am looking to merge the column id_number with identity_no.
Code I am using so far:
cross = df1[['id_number']].merge(df2[['identity_no']], how='cross')
But I am getting the error:
pandas.errors.MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False
Output what i need:
# id_number identity_no
# 0 IN2231D IN2231
# 1 IN2231D UK654IN
# 2 IN2231D SL1432
# ...
# 17 NG5678J UK654IN
# 18 NG5678J SL1432
# 19 NG5678J LK0678G
Please suggest.
how='cross'
This feature was introduced in pandas 1.2.0 (pd.__version__ == '1.2.0'), so if you have an older version of pandas it will not work. If for some reason you cannot upgrade, you can accomplish the same with a helper column that holds the same constant in both DataFrames and is dropped afterwards.
import pandas as pd
df1 = pd.DataFrame({'x': [1,2]})
df2 = pd.DataFrame({'y': ['a', 'b']})
# For versions >=1.2.0
df1.merge(df2, how='cross')
# x y
#0 1 a
#1 1 b
#2 2 a
#3 2 b
# For older versions assign a constant you merge on.
df1.assign(t=1).merge(df2.assign(t=1), on='t').drop(columns='t')
# x y
#0 1 a
#1 1 b
#2 2 a
#3 2 b