why does Fuzzywuzzy python script take forever to generate results?

why does Fuzzywuzzy python script take forever to generate results? - python

To give an idea, I have an excel file(.xlsx format) within which I am working with 2 sheets at a time.
I am interested in 'entity name' from sheet a and 'name' from sheet b.
Sheet b has 'name' column written 7times.
my sheet a looks like this.
Isin Entity Name
DE0005545503 1&1 AG
US68243Q1067 1-800-Flowers.Com Inc
US68269G1076 1Life Healthcare Inc
US3369011032 1st Source Corp
while my sheet b looks like this
name company_id name company_id name company_id name company_id name company_id name company_id name
LIVERPOOL PARTNERS MICROCAP GROWTH FUND MANAGER PTY LTD 586056 FERRARI NADIA 1000741 DORSET COMMUNITY RADIO LTD 1250023 Hunan Guangtongsheng Communication Service Co., Ltd. 1500335 Steffes PrÃ¼f- und Messtechnik GmbH, 1550006 CHL SRL 2000320 Qu Star, Inc.
BISCUIT AVENUE PTY LTD 586474 D AMBROSIO MARIA 1000382 LUCKY WORLD PRODUCTIONS LIMITED 1250024 Zhuzhou Wanlian Telecommunication Co., Ltd. 1500354 e42 II GmbH 1550510 EGGTRONIC SPA 2000023 Molly Shaheen, L.L.C.
CL MAY1212 PTY LTD 586475 TORIJA ZANE LUCIA LUCIA 1000389 FYLDE COAST MEDIA LTD 1250034 Zhongyi Tietong Co., Ltd. Yanling Xiayang Broadband TV Service Center 1500376 Valorem Capital UG (haftungsbeschrÃ¤nkt) 1550539 MARACAIBA INVEST SRL 2000139 Truptisudhir Pharmacy Inc
alternatively you can find the sheet b here:
Here's my code
import pandas as pd
from fuzzywuzzy import fuzz
filename = 'C:/Users/Downloads/SUniverse.xlsx'
dataframe1 = pd.read_excel(filename, sheet_name='A')
dataframe2 = pd.read_excel(filename, sheet_name='B')
# print(dataframe1.head())
# print(dataframe2.head())
# Clean customers lists
A_cleaned = [df1 for df1 in dataframe1["Entity Name"] if not(pd.isnull(df1))]
B_cleaned = [df2 for df2 in dataframe2["name"].unique() if not(pd.isnull(df2))]
print(A_cleaned)
print(B_cleaned)
# Perform fuzzy string matching
tuples_list = [max([(fuzz.token_set_ratio(i,j),j) for j in B_cleaned]) for i in A_cleaned]
print(tuples_list)
# Unpack list of tuples into two lists
similarity_score, fuzzy_match = map(list,zip(*tuples_list))
# Create pandas DataFrame
df = pd.DataFrame({"I_Entity_Name":A_cleaned, "I_Name": fuzzy_match, "similarity score":similarity_score})
df.to_excel("C:/Users/Downloads/fuz-match-output.xlsx", sheet_name="Fuzzy String Matching", index=False)
print('done!')
The code takes forever to generate results. It has been over 20hours and the script is still running. My excel input file is going over 50mbs in size(just wanna say that it contains millions of records).
How do I ensure that my script runs at a faster pace and generates the result? I want the output to be this:
Entity Name Name fuzzy score
apple APPLE 100
.
.
.

Related

Trying to rearrange multiple columns in a dataframe based on ranking row values

I'm working on a matching company names and I have a dataframe that returns output in the format below.
The table has an original name and for each original name, there could be N number of matches. For each match, there are 3 columns, match_name_0, score_0, match_index_0 and so on up to match_name_N.
I'm trying to figure out a way to return a new dataframe that sorts the columns after the original_name by the highest match scores. Essentially, if match_score_2 was the highest then match_score_0 followed by match_score_1 the columns would be
original_score, match_name_2, match_score_2, match_index_2, match_name_0, match_score_0, match_index_0, match_name_2, match_score_2, match_index_2,
In the event of a tie, the leftmost match should be ranked higher. I should note that sometimes they will be in the correct order but 30-40% of the times, they are not.
I've been staring at my screen for 2 hours and totally stumped so any help is greatly appreciated
index
original_name
match_name_0
score_0
match_index_0
match_name_1
score_1
match_index_1
match_name_2
score_2
match_index_2
match_name_3
score_3
match_index_3
match_name_4
score_4
match_index_4
0
aberdeen asset management plc
aberdeen asset management sa
100
2114
aberdeen asset management plc esop
100
2128
aberdeen asset management inc
100
2123
aberdeen asset management spain
71.18779356
2132
aberdeen asset management ireland
69.50514818
2125
2
agi partners llc
agi partners llc
100
5274
agi partners llc
100
5273
agr partners llc
57.51100704
5378
aci partners llc
53.45090217
3097
avi partners llc
53.45090217
17630
3
alberta investment management corporation
alberta investment management corporation
100
6754
alberta investment management corporation pension arm
100
6755
anchor investment management corporation
17.50748486
10682
cbc investment management corporation
11.79760839
36951
harvest investment management corporation
31.70316571
85547

I am assuming you want to impose the ordering of matches first by score and then by match_number individually for each original_name.
Wide datasets are usually difficult to deal with, including this case. I suggest to reshape to a long dataset, where you can easily impose your required ordering by
sort_values(by=['original_name','score','match_number'], ascending=[True,False,True])
Finally, you can reshape it back to a wide dataset.
import pandas as pd
from io import StringIO
# sample data
df = """
original_name,match_name_0,score_0,match_index_0,match_name_1,score_1,match_index_1,match_name_2,score_2,match_index_2,match_name_3,score_3,match_index_3,match_name_4,score_4,match_index_4
aberdeen asset management plc,aberdeen asset management sa,100,2114,aberdeen asset management plc esop,100,2128,aberdeen asset management inc,100,2123,aberdeen asset management spain,71.18779356,2132,aberdeen asset management ireland,69.50514818,2125
agi partners llc,agi partners llc,100,5274,agi partners llc,100,5273,agr partners llc,57.51100704,5378,aci partners llc,53.45090217,3097,avi partners llc,53.45090217,17630
alberta investment management corporation,alberta investment management corporation,100,6754,alberta investment management corporation pension arm,100,6755,anchor investment management corporation,17.50748486,10682,cbc investment management corporation,11.79760839,36951,harvest investment management corporation,31.70316571,85547
"""
df= pd.read_csv(StringIO(df.strip()), sep=',', engine='python')
# wide to long
result = pd.wide_to_long(df, ['match_name','score','match_index'], i='original_name', j='match_number', sep='_').reset_index()
# sort matches as per requirement
result = result.sort_values(by=['original_name','score','match_number'], ascending=[True,False,True])
# overwrite ranking imposed by previous sort
# this ensures that the order is maintained once it is
# reshaped back to a wide dataset
result['match_number'] = result.groupby('original_name').cumcount()
# reshape long to wide
result = result.set_index(['original_name','match_number']).unstack()
# tidy up to match expected result
result = result.swaplevel(axis=1).sort_index(axis=1)
result = result.reindex(['match_name','score','match_index'], axis=1, level=1)
result.columns = [f'{col[1]}_{col[0]}' for col in result.columns]
As a result, for example, previous match 4 of alberta investment management corporation is now match 2 (based on score). The order of matches 3 and 4 for agi partners llc remain the same because they have the same score.

Setting Character Limit on Pandas DataFrame Column

Background:
Given the following pandas df -
Holding Account
Model Type
Entity ID
Direct Owner ID
WF LLC | 100 Jones Street 26th Floor San Francisco Ca Ltd Liability - Income Based Gross USA Only (486941515)
51364633
4564564
5646546
RF LLC | Neuberger | LLC | Aukai Services LLC-Neuberger Smid - Income Accuring Net of Fees Worldwide Fund (456456218)
46256325
1645365
4926654
The ask:
What is the most pythonic way to enforce a 80 character limit to the Holding Account column (dtype = object) values?
Context: I am writing df to a .csv and then subsequently uploading to a system with an 80-character limit. The values of Holding Account column are unique, so I just want to sacrifice those characters that take the string over 80-characters.
My attempt:
This is what I attempted - df['column'] = df['column'].str[:80]

Why not just use .str, like you were doing?
df['Holding Account'] = df['Holding Account'].str[:80]
Output:
>>> df
Holding Account Model Type Entity ID Direct Owner ID
0 WF LLC | 100 Jones Street 26th Floor San Francisco Ca Ltd Liability - Income Bas 51364633 4564564 5646546
1 RF LLC | Neuberger | LLC | Aukai Services LLC-Neuberger Smid - Income Accuring N 46256325 1645365 4926654

Using slice will loss some information, I will suggest create a mapping table after get the factorized. This also save the storage space for server or db
s = df['Holding Account'].factorize()[0]
df['Holding Account'] = df['Holding Account'].factorize()[0]
d = dict(zip(s, df['Holding Account']))
If you would like get the databank just do
df['new'] = df['Holding Account'] .map(d)

Pandas: Remove all words from specific list within dataframe strings in large dataset

So I have three pandas dataframes(train, test). Overall it is about 700k lines. And I would like to remove all cities from a cities list - common_cities. But tqdm in notebook cell suggests that it would take about 24 hrs to replace all from a list of 33000 cities.
dataframe example (train_original):
id
name_1
name_2
0
sun blinds decoration paris inc.
indl de cuautitlan sa cv
1
eih ltd. dongguan wei shi
plastic new york product co., ltd.
2
jsh ltd. (hk) mexico city
arab shipbuilding seoul and repair yard madrid c
common_cities list example
common_cities = ['moscow', 'madrid', 'san francisco', 'mexico city']
what is supposed to be output:
id
name_1
name_2
0
sun blinds decoration inc.
indl de sa cv
1
eih ltd. wei shi
plastic product co., ltd.
2
jsh ltd. (hk)
arab shipbuilding and repair yard c
My solution in such case worked well on small filter words list, but when it is large, the performance is low.
%%time
for city in tqdm(common_cities):
train_original.replace(re.compile(fr'\b({city})\b'), '', inplace=True)
train_augmented.replace(re.compile(fr'\b({city})\b'), '', inplace=True)
test.replace(re.compile(fr'\b({city})\b'), '', inplace=True)
P.S: I presume it's not great to use list comprehension while splitting string and substituting city name, because city name could be > 2 words.
Any suggestions, ideas on approach to make a quick replacement on Pandas Dataframes in such situations?

Instead of iterating over the huge dfs for reach pass, remember that pandas replace accepts dictionaries with all the replacements to be done in a single go.
Therefore we can start by creating the dictionary and then using it with replace:
replacements = {x:'' for x in common_cities}
train_original = train_original.replace(replacements)
train_augmented = train_augmented.replace(replacements)
test = test.replace(replacements)
Edit: Reading the documentation it might be even easier, because it also accept lists of values to be replaced:
train_original = train_original.replace(common_cities,'')
train_augmented = train_augmented.replace(common_cities,'')
test = test.replace(common_cities,'')

Finding company matches in a list of financial news

I have a dataframe with company ticker("ticker"), full name ("longName) and short name ("unofficial_name") - this abridged name is created from the long name by removing inc., plc...
I also have a seperate datefame with company news: date ("date" ) of the news, headline ("name"), news text ("text") and sentiment analysis.
I am trying to find company name matches in the list of articles and create a new dataframe with unique company-article matches (i.e. if one article mentions more than one company, this article would have more rows depending on the number of companies mentioned).
I tried to execute the matching based on the "unofficial_name" with the following code:
dict=[]
for n, c in zip(df_news["text"], sp500_names["unofficial_name"]):
if c in n:
x = {"text":n, "unofficial_name":c}
dict.append(x)
print(dict)
But I get an empty list returned. Any ideas how to solve it?

sp500_names
ticker longName unofficial_name
0 A Agilent Technologies, Inc. Agilent Technologies
1 AAL American Airlines Group Inc. American Airlines Group
df_news
name date text neg neu pos compound
0 Asian stock markets reverse losses on global p... 2020-03-01 [By Tom Westbrook and Swati Pandey SINGAPORE (... 0.086 0.863 0.051 -0.9790
1 Energy & Precious Metals - Weekly Review and C... 2020-03-01 [By Barani Krishnan Investing.com - How much ... 0.134 0.795 0.071 -0.9982
Thank you!

Pandas : Create a new dataframe from 2 different dataframes using fuzzy matching [duplicate]

I have two data frames with each having a different number of rows. Below is a couple rows from each data set
df1 =
Company City State ZIP
FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101
CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102
GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102
LACKEY SHEET METAL St. Louis MO 63102
and
df2 =
FDA Company FDA City FDA State FDA ZIP
LACKEY SHEET METAL St. Louis MO 63102
PRIMUS STERILIZER COMPANY LLC Great Bend KS 67530
HELGET GAS PRODUCTS INC Omaha NE 68127
ORTHOQUEST LLC La Vista NE 68128
I joined them side by side using combined_data = pandas.concat([df1, df2], axis = 1). My next goal is to compare each string under df1['Company'] to each string under in df2['FDA Company'] using several different matching commands from the fuzzy wuzzy module and return the value of the best match and its name. I want to store that in a new column. For example if I did the fuzz.ratio and fuzz.token_sort_ratio on LACKY SHEET METAL in df1['Company'] to df2['FDA Company'] it would return that the best match was LACKY SHEET METAL with a score of 100 and this would then be saved under a new column in combined data. It results would look like
combined_data =
Company City State ZIP FDA Company FDA City FDA State FDA ZIP fuzzy.token_sort_ratio match fuzzy.ratio match
FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101 LACKEY SHEET METAL St. Louis MO 63102 LACKEY SHEET METAL 100 LACKEY SHEET METAL 100
CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102 PRIMUS STERILIZER COMPANY LLC Great Bend KS 67530
GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102 HELGET GAS PRODUCTS INC Omaha NE 68127
LACKEY SHEET METAL St. Louis MO 63102 ORTHOQUEST LLC La Vista NE 68128
I tried doing
combined_data['name_ratio'] = combined_data.apply(lambda x: fuzz.ratio(x['Company'], x['FDA Company']), axis = 1)
But got an error because the lengths of the columns are different.
I am stumped. How I can accomplish this?

I couldn't tell what you were doing. This is how I would do it.
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
Create a series of tuples to compare:
compare = pd.MultiIndex.from_product([df1['Company'],
df2['FDA Company']]).to_series()
Create a special function to calculate fuzzy metrics and return a series.
def metrics(tup):
return pd.Series([fuzz.ratio(*tup),
fuzz.token_sort_ratio(*tup)],
['ratio', 'token'])
Apply metrics to the compare series
compare.apply(metrics)
There are bunch of ways to do this next part:
Get closest matches to each row of df1
compare.apply(metrics).unstack().idxmax().unstack(0)
Get closest matches to each row of df2
compare.apply(metrics).unstack(0).idxmax().unstack(0)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

why does Fuzzywuzzy python script take forever to generate results? - python

Related

Trying to rearrange multiple columns in a dataframe based on ranking row values

Setting Character Limit on Pandas DataFrame Column

Pandas: Remove all words from specific list within dataframe strings in large dataset

Finding company matches in a list of financial news

Pandas : Create a new dataframe from 2 different dataframes using fuzzy matching [duplicate]

Categories

Resources