FuzzyWuzzy on 2 col from different DataFrames - python

I have a question that sounds easy but is not so simple (to me at least!).
I have 2 DFs:
df1:
Account_Name
samsung
tesla
microsoft
df2:
Company_name
samsung electronics
samsung Ltd
tesla motors
Microsoft corporation
All I am trying to do is find, for every row in df1, the best match from df2, and also add an extra column with the similarity score for that best match.
I already have code that compares the 2 columns and produces a similarity score, but I have no clue how to iterate through df2 to find the best match for the row in question from df1.
The similarity-score code is below just in case, but I don't think it is relevant to this question:
import numpy as np
import pandas as pd
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

for col in ['Account_Name']:
    df[f"{col}_score"] = df.apply(
        lambda x: similar(x["Company_name"], x[col]) * 100 if pd.notna(x[col]) else np.nan,
        axis=1,
    )
The main issue is finding the best similarity match when the data is in 2 separate DFs.
Help please!

Here is a proposition based on an answer I made a few days ago:
from difflib import get_close_matches, SequenceMatcher

def match(word, l):
    m = get_close_matches(word, l, n=1, cutoff=0.4)
    if m:
        closest_match = m[0]
        score = SequenceMatcher(None, word, closest_match).ratio()
        return closest_match, score
    return None, 0.0

cross = df1.merge(df2, how="cross")

l_matches = [match(x, list(cross["Company_name"])) for x in cross["Account_Name"]]

out = (
    cross
    .join(pd.DataFrame(l_matches, columns=["Company_name (match)", "Company_name (Score)"]))
    .drop("Company_name", axis=1)
    .groupby("Account_Name", as_index=False)
    .max()
)
Output:
print(out)
  Account_Name   Company_name (match)  Company_name (Score)
0    microsoft  Microsoft corporation              0.533333
1      samsung            samsung Ltd              0.777778
2        tesla           tesla motors              0.588235
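As a lighter alternative, here is a minimal per-row sketch (assuming the df1/df2 from the question; best_match is a hypothetical helper) that skips the cross join and scans df2's names once per df1 row:

import pandas as pd
from difflib import get_close_matches, SequenceMatcher

def best_match(name, candidates):
    # get_close_matches returns at most n candidates above the cutoff
    m = get_close_matches(name, candidates, n=1, cutoff=0.4)
    if not m:
        return None, 0.0
    return m[0], SequenceMatcher(None, name, m[0]).ratio()

candidates = df2["Company_name"].tolist()
matches = df1["Account_Name"].apply(lambda x: best_match(x, candidates))
df1["Company_name (match)"] = matches.apply(lambda t: t[0])
df1["Company_name (Score)"] = matches.apply(lambda t: t[1])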

Related

LEFT ON Case When in Pandas

I wanted to ask: in SQL I can do JOIN ON CASE WHEN; is there a way to do this in Pandas?
disease = [
    {"City": "CH", "Case_Recorded": 5300, "Recovered": 2839, "Deaths": 2461},
    {"City": "NY", "Case_Recorded": 1311, "Recovered": 521, "Deaths": 790},
    {"City": "TX", "Case_Recorded": 1991, "Recovered": 1413, "Deaths": 578},
    {"City": "AT", "Case_Recorded": 3381, "Recovered": 3112, "Deaths": 269},
    {"City": "TX", "Case_Recorded": 3991, "Recovered": 2810, "Deaths": 1311},
    {"City": "LA", "Case_Recorded": 2647, "Recovered": 2344, "Deaths": 303},
    {"City": "LA", "Case_Recorded": 4410, "Recovered": 3344, "Deaths": 1066}
]
region = {"North": ["AT"], "West": ["TX", "LA"]}
So what I have is 2 dummy dicts, which I have already converted to DataFrames; the first is the cities with their cases, and I'm trying to figure out which region each city belongs to:
Region|City
North|AT
West|TX
West|LA
None|NY
None|CH
What I thought of in SQL was a LEFT JOIN ON CASE WHEN: if the result is null when joining with the North region, then join with the West region.
But if there are 15 or 30 regions in some country, I think that would be a problem.
Use:
# get City without duplicates
df1 = pd.DataFrame(disease)[['City']].drop_duplicates()

# create DataFrame from the region dictionary
region = {"North": ["AT"], "West": ["TX", "LA"]}
df2 = pd.DataFrame([(k, x) for k, v in region.items() for x in v],
                   columns=['Region', 'City'])

# append the not-matched cities to df2
out = pd.concat([df2, df1[~df1['City'].isin(df2['City'])]])
print(out)
  Region City
0  North   AT
1   West   TX
2   West   LA
0    NaN   CH
1    NaN   NY
If order is not important:
out = df2.merge(df1, how='right')
print(out)
  Region City
0    NaN   CH
1    NaN   NY
2   West   TX
3  North   AT
4   West   LA
I'm sorry, I'm not exactly sure what your expected result is; could you elaborate? If your expected result is just getting each city's region, there is no need for conditional joining. For example, you can transform the city-region table into one row per city and region and join it directly with the main df:
disease = [
    {"City": "CH", "Case_Recorded": 5300, "Recovered": 2839, "Deaths": 2461},
    {"City": "NY", "Case_Recorded": 1311, "Recovered": 521, "Deaths": 790},
    {"City": "TX", "Case_Recorded": 1991, "Recovered": 1413, "Deaths": 578},
    {"City": "AT", "Case_Recorded": 3381, "Recovered": 3112, "Deaths": 269},
    {"City": "TX", "Case_Recorded": 3991, "Recovered": 2810, "Deaths": 1311},
    {"City": "LA", "Case_Recorded": 2647, "Recovered": 2344, "Deaths": 303},
    {"City": "LA", "Case_Recorded": 4410, "Recovered": 3344, "Deaths": 1066}
]
region = [
    {'City': 'AT', 'Region': "North"},
    {'City': 'TX', 'Region': "West"},
    {'City': 'LA', 'Region': "West"}
]
df = pd.DataFrame(disease)
df_reg = pd.DataFrame(region)
df.merge(df_reg, on='City', how='left')
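For reference, a quick sketch of what that left merge yields with the dummy data above (one Region per row, NaN where the city has no region):

out = df.merge(df_reg, on='City', how='left')
print(out)
  City  Case_Recorded  Recovered  Deaths Region
0   CH           5300       2839    2461    NaN
1   NY           1311        521     790    NaN
2   TX           1991       1413     578   West
3   AT           3381       3112     269  North
4   TX           3991       2810    1311   West
5   LA           2647       2344     303   West
6   LA           4410       3344    1066   West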

How to get sum and average of values from a pandas dataframe using values in multiple lists?

I have the following dataframe:
df1

Location    Office            ROP
Barcelona   Head Office       4.3%
Bengaluru   Corporate Office  9.6%
Chicago     Head Office       12.5%
Luxembourg  Corporate Office  14.1%
Paris       Head Office       12.7%
Toronto     Head Office       11.5%
Berlin      Corporate Office  14.3%
Bengaluru   Head Office       4.6%
Luxembourg  Head Office       7.1%
Berlin      Head Office       5.3%
Luxembourg  Virtual Center    10.1%
Berlin      Virtual Center    12.3%
Paris       Virtual Center    9.7%
...         ...               ...
a = ['Berlin','Paris','Luxembourg',...]
b = ['Head Office','Corporate Office',..]
Given multiple values in lists a and b, how do I find the sum and average of ROP over the rows whose Location is in a and whose Office is in b?
Example:
Say we have the data from the above-mentioned dataframe in 'df2'.
df2 has just the 13 visible rows from dataframe 'df1'.
a = ['Berlin','Paris','Luxembourg']
b = ['Head Office','Corporate Office']
Expected output:
Sum: 14.3%+5.3%+12.7%+7.1%+14.1% = 53.5%
Average: (14.3%+5.3%+12.7%+7.1%+14.1%)/5 = 10.7%
Try:
# convert ROP column to float:
df["ROP_int"] = df["ROP"].str.strip("%").astype(float)
a = ["Berlin", "Paris", "Luxembourg"]
b = ["Head Office", "Corporate Office"]
# create a mask
m = df["Location"].isin(a) & df["Office"].isin(b)
# compute sum and average from the mask and ROP_int column:
s = df.loc[m, "ROP_int"].sum()
avg = df.loc[m, "ROP_int"].mean()
print(s)
print(avg)
Prints:
53.49999999999999
10.7
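As a small follow-up, the trailing ...9999 in the printed sum is ordinary floating-point noise from summing binary fractions; round for display:

print(round(s, 1))    # 53.5
print(round(avg, 1))  # 10.7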

Fuzzy-compare two dataframes of addresses and copy info from 1 to another

I have this data set: df1 = 70,000 rows and df2 = ~30 rows. I want to match the addresses to see if df2 appears in df1, and if it does I want to show the match and also pull info from df1 to create a new df3. Sometimes the address info is off by a bit, for example (road = rd, street = st, etc.). Here's an example:
df1 =
address unique key (and more columns)
123 nice road Uniquekey1
150 spring drive Uniquekey2
240 happy lane Uniquekey3
80 sad parkway Uniquekey4
etc
df2 =
address (and more columns)
123 nice rd
150 spring dr
240 happy lane
80 sad parkway
etc
And this is what I'd want as a new dataframe:
df3 =
address (from df2)   address matched (from df1)   unique key (from df1)   (and more columns)
123 nice rd 123 nice road Uniquekey1
150 spring dr 150 spring drive Uniquekey2
240 happy lane 240 happy lane Uniquekey3
80 sad parkway 80 sad parkway Uniquekey4
etc
Here's what I've tried so far using difflib:
df1['key'] = df1['address']
df2['key'] = df2['address']
df2['key'] = df2['key'].apply(lambda x: difflib.get_close_matches(x, df1['key'], n=1))
This returns what looks like a list (the answer is wrapped in []), so I then convert df2['key'] into a string using df2['key'] = df2['key'].apply(str).
Then I try to merge using df2.merge(df1, on='key'), and no address is matching.
I'm not sure what it could be, but any help would be greatly appreciated. I am also playing around with the fuzzywuzzy package.
My answer is similar to one of your old questions that I answered.
I slightly modified your dataframe:
>>> df1
address unique key
0 123 nice road Uniquekey1
1 150 spring drive Uniquekey2
2 240 happy lane Uniquekey3
3 80 sad parkway Uniquekey4
>>> df2 # shuffle rows
address
0 80 sad parkway
1 240 happy lane
2 150 winter dr # change the season :-)
3 123 nice rd
Use the extractOne function from fuzzywuzzy.process:
from fuzzywuzzy import process

THRESHOLD = 90
best_match = df2['address'].apply(
    lambda x: process.extractOne(x, df1['address'], score_cutoff=THRESHOLD))
The output of extractOne is a (match, score, index) tuple, or None when no candidate clears the cutoff:
>>> best_match
0 (80 sad parkway, 100, 3)
1 (240 happy lane, 100, 2)
2 None
3 (123 nice road, 92, 0)
Name: address, dtype: object
Now you can merge your 2 dataframes. Note that best_match is aligned with df2, so df2 is the frame whose index should be replaced by the matched df1 positions:
df3 = pd.merge(df2.set_index(best_match.apply(pd.Series)[2]), df1,
               left_index=True, right_index=True,
               how='left').reset_index(drop=True)

>>> df3
        address_x       address_y  unique key
0  80 sad parkway  80 sad parkway  Uniquekey4
1  240 happy lane  240 happy lane  Uniquekey3
2   150 winter dr             NaN         NaN
3     123 nice rd   123 nice road  Uniquekey1
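If you also want the score in df3, a small sketch reusing the best_match series from above (unmatched rows get None):

df3['score'] = best_match.apply(lambda t: t[1] if t else None).values

The .values strips best_match's index so the scores attach positionally, in df2's row order.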
This answer is longer, but I'll post it because it may be easier to follow along, as you can see the steps as they happen.
Set up the frames:
import pandas as pd
# pip install fuzzywuzzy
# pip install python-Levenshtein
from fuzzywuzzy import fuzz, process

# matching threshold. may need altering between 45-95 etc.; higher is stricter,
# and being too strict means things aren't matched. fiddle as required
threshold = 75

df1 = pd.DataFrame({'address': {0: '123 nice road',
                                1: '150 spring drive',
                                2: '240 happy lane',
                                3: '80 sad parkway'},
                    'unique key (and more columns)': {0: 'Uniquekey1',
                                                      1: 'Uniquekey2',
                                                      2: 'Uniquekey3',
                                                      3: 'Uniquekey4'}})
df2 = pd.DataFrame({'address': {0: '123 nice rd',
                                1: '150 spring dr',
                                2: '240 happy lane',
                                3: '80 sad parkway'},
                    'unique key (and more columns)': {0: 'Uniquekey1',
                                                      1: 'Uniquekey2',
                                                      2: 'Uniquekey3',
                                                      3: 'Uniquekey4'}})
Then the main code:
# function used for fuzzywuzzy matching
def match_addresses(add, list_add, min_score=0):
    max_score = -1
    max_add = ''
    for x in list_add:
        score = fuzz.ratio(add, x)
        if score > min_score and score > max_score:
            max_add = x
            max_score = score
    return (max_add, max_score)

# return the fuzzywuzzy score
def scoringMatches(x, s):
    o = process.extractOne(x, s, score_cutoff=threshold)
    if o is not None:
        return o[1]

# creating two lists from the address column of both dataframes
df1_addresses = list(df1.address.unique())
df2_addresses = list(df2.address.unique())

# via fuzzywuzzy matching and match_addresses() above,
# build a dictionary of addresses where there is a match
names = []
for x in df1_addresses:
    match = match_addresses(x, df2_addresses, threshold)
    if match[1] >= threshold:
        name = (str(x), str(match[0]))
        names.append(name)
name_dict = dict(names)

# create a new frame from the fuzzywuzzy address-match dictionary
match_df = pd.DataFrame(name_dict.items(), columns=['df1_address', 'df2_address'])

# create the new frame
df3 = pd.concat([df1, match_df], axis=1)
del df3['df1_address']

# move the matched address column next to the original address of df1
c = df3.columns.tolist()
c.insert(1, c.pop(c.index('df2_address')))
df3 = df3.reindex(columns=c)

# add the fuzzywuzzy score as a new column
df3['fuzzywuzzy_score'] = df3.apply(lambda x: scoringMatches(x['address'], df2['address']), axis=1)
print(df3)
Output:
address df2_address unique key (and more columns) fuzzywuzzy_score
0 123 nice road 123 nice rd Uniquekey1 92
1 150 spring drive 150 spring dr Uniquekey2 90
2 240 happy lane 240 happy lane Uniquekey3 100
3 80 sad parkway 80 sad parkway Uniquekey4 100
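As an aside, fuzzywuzzy is no longer maintained; rapidfuzz provides a much faster, near drop-in API. A minimal sketch (assuming the same df1 as above):

from rapidfuzz import process, fuzz

# like fuzzywuzzy's extractOne: (match, score, index) above the cutoff, else None
best = process.extractOne('123 nice rd', df1['address'], scorer=fuzz.ratio, score_cutoff=75)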

Fuzzy match for 2 lists with very similar names

I know this question has been asked in some form, so apologies. I'm trying to fuzzy match list 1 (sample_name) to list 2 (actual_name). actual_name has significantly more names than list 1, and I keep running into fuzzy match not working well. I've tried multiple fuzzy match methods (partial_ratio, token_set_ratio) but keep running into issues, since there are many names in list 2 that are very similar. Is there any way to improve the matching here? Ideally I want a new dataframe with list 1, the matched name from list 2, and the match score in a third column. Any help would be much appreciated. Thanks.
I have used this so far:
df1 = sample_df['sample_name'].to_list()
df2 = actual_df['actual_name'].to_list()

response = {}
for name_to_find in df1:
    for name_master in df2:
        if fuzz.partial_ratio(name_to_find, name_master) > 90:
            response[name_to_find] = name_master
            break

for key, value in response.items():
    print('sample name: ' + key + ' actual_name: ' + value)
sample_name         actual_name
jtsports            JT Sports LLC
tombaseball         Tom Baseball Inc.
context express     Context Express LLC
zb sicily           ZB Sicily LLC
lightening express  Lightening Express LLC
fire roads          Fire Road Express
N/A                 Earth Treks
N/A                 TS Sports LLC
N/A                 MM Baseball Inc.
N/A                 Contact Express LLC
N/A                 AB Sicily LLC
N/A                 Lightening Roads LLC
Not sure if this is your expected output (and you may need to adjust the threshold), but I think this is what you are looking for:
import pandas as pd
from fuzzywuzzy import process

threshold = 50
list1 = ['jtsports', 'tombaseball', 'context express', 'zb sicily',
         'lightening express', 'fire roads']
list2 = ['JT Sports LLC', 'Tom Baseball Inc.', 'Context Express LLC',
         'ZB Sicily LLC', 'Lightening Express LLC', 'Fire Road Express',
         'Earth Treks', 'TS Sports LLC', 'MM Baseball Inc.', 'Contact Express LLC',
         'AB Sicily LLC', 'Lightening Roads LLC']

response = []
for name_to_find in list1:
    resp_match = process.extractOne(name_to_find, list2)
    if resp_match[1] > threshold:
        row = {'sample_name': name_to_find, 'actual_name': resp_match[0], 'score': resp_match[1]}
        response.append(row)
        print(row)

results = pd.DataFrame(response)

# If you need all the 'actual_name' values to be in the dataframe, continue below;
# otherwise don't include these last 2 lines of code
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
unmatched = pd.DataFrame([x for x in list2 if x not in list(results['actual_name'])],
                         columns=['actual_name'])
results = pd.concat([results, unmatched], sort=False).reset_index(drop=True)
Output:
print(results)
sample_name actual_name score
0 jtsports JT Sports LLC 79.0
1 tombaseball Tom Baseball Inc. 81.0
2 context express Context Express LLC 95.0
3 zb sicily ZB Sicily LLC 95.0
4 lightening express Lightening Express LLC 95.0
5 fire roads Fire Road Express 86.0
6 NaN Earth Treks NaN
7 NaN TS Sports LLC NaN
8 NaN MM Baseball Inc. NaN
9 NaN Contact Express LLC NaN
10 NaN AB Sicily LLC NaN
11 NaN Lightening Roads LLC NaN
It won't be the most efficient way to do it, being of order O(n·m) in the sizes of the two lists, but you could calculate the Levenshtein distance between the left and right and then match based on the closest match.
That is how a lot of naive spell-check systems work.
I'm suggesting that you run this calculation for each of the correct names and return the match with the lowest score.
Adjusting the code you have posted, I would do something like the following. Bear in mind that for Levenshtein distance lower is closer, so it would need some adjusting; for the function you are using, higher is closer, and so the following should work with that.
df1 = sample_df['sample_name'].to_list()
df2 = actual_df['actual_name'].to_list()

response = {}
for name_to_find in df1:
    highest_so_far = ("", 0)
    for name_master in df2:
        score = fuzz.partial_ratio(name_to_find, name_master)
        if score > highest_so_far[1]:
            highest_so_far = (name_master, score)
    response[name_to_find] = highest_so_far[0]

for key, value in response.items():
    print('sample name: ' + key + ' actual_name: ' + value)
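When list 2 is full of near-duplicates, it can also help to inspect the top few candidates per name before settling on a scorer and threshold. A small sketch using fuzzywuzzy's process.extract (assuming list2 from the answer above):

from fuzzywuzzy import process, fuzz

for name in ['jtsports', 'lightening express']:
    # top 3 (candidate, score) pairs, so near-duplicates are visible side by side
    print(name, process.extract(name, list2, scorer=fuzz.token_set_ratio, limit=3))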

Removing non-alphanumeric symbols in dataframe

How do I remove non-alphanumeric characters from the values in the dataframe? I have only managed to convert everything to lower case:
def doubleAwardList(self):
    dfwinList = pd.DataFrame()
    dfloseList = pd.DataFrame()
    dfwonandLost = pd.DataFrame()

    # self.dfWIN and self.dfLOSE hold the files chosen by the user
    groupby_name = self.dfWIN.groupby("name")
    groupby_nameList = self.dfLOSE.groupby("name _List")
    list4 = []
    list5 = []
    notAwarded = "na"

    for x, group in groupby_name:
        if x != notAwarded:
            list4.append(str.lower(str(x)))
    dfwinList = pd.DataFrame(list4)

    for x, group in groupby_nameList:
        list5.append(str.lower(str(x)))
    dfloseList = pd.DataFrame(list5)
Data sample: basically I mainly need to remove the full stops and hyphens, as I will need to compare the values to another file; the naming isn't very consistent, so I had to remove the non-alphanumeric characters for a much more accurate result:
creative-3
smart tech pte. ltd.
nutritive asia
asia's first
desired result:
creative 3
smart tech pte ltd
nutritive asia
asia s first
Use DataFrame.replace, and include the space in the pattern so existing spaces are kept:
df = df.replace('[^a-zA-Z0-9 ]', '', regex=True)
If it is one column - a Series:
df = pd.DataFrame({'col': ['creative-3', 'smart tech pte. ltd.',
                           'nutritive asia', "asia's first"],
                   'col2': range(4)})
print(df)
                    col  col2
0            creative-3     0
1  smart tech pte. ltd.     1
2        nutritive asia     2
3          asia's first     3

df['col'] = df['col'].replace('[^a-zA-Z0-9 ]', '', regex=True)
print(df)
                  col  col2
0           creative3     0
1  smart tech pte ltd     1
2      nutritive asia     2
3         asias first     3
EDIT:
If there are multiple columns, it is possible to select only the object (i.e. string) columns, and cast to string if necessary:
cols = df.select_dtypes('object').columns
print(cols)
Index(['col'], dtype='object')

df[cols] = df[cols].astype(str).replace('[^a-zA-Z0-9 ]', '', regex=True)
print(df)
                  col  col2
0           creative3     0
1  smart tech pte ltd     1
2      nutritive asia     2
3         asias first     3
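Note that the desired result in the question keeps a space where the punctuation was ('creative 3', 'asia s first'). A minimal variant under that assumption: substitute a space rather than the empty string, then collapse any repeated whitespace:

df['col'] = (df['col']
             .replace('[^a-zA-Z0-9 ]', ' ', regex=True)
             .str.split().str.join(' '))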
Why not just the below (I did make it lower case, btw)? If df is a single column (a Series), this works directly:
df = df.replace('[^a-zA-Z0-9 ]', '', regex=True).str.lower()
Then:
print(df)
will give the desired dataframe.
Update:
For a whole DataFrame, try:
df = df.apply(lambda x: x.str.replace('[^a-zA-Z0-9 ]', '', regex=True).str.lower(), axis=0)
If it is only one column, do:
df['your col'] = df['your col'].str.replace('[^a-zA-Z0-9 ]', '', regex=True).str.lower()
