I'm working on matching company names, and I have a dataframe that returns output in the format below.
The table has an original name, and for each original name there can be up to N matches. Each match has 3 columns: match_name_0, score_0, match_index_0, and so on up to match_name_N.
I'm trying to figure out a way to return a new dataframe that sorts the columns after original_name by the highest match score. Essentially, if score_2 was the highest, then score_0, followed by score_1, the columns would be:
original_name, match_name_2, score_2, match_index_2, match_name_0, score_0, match_index_0, match_name_1, score_1, match_index_1
In the event of a tie, the leftmost match should be ranked higher. I should note that sometimes the matches are already in the correct order, but 30-40% of the time they are not.
I've been staring at my screen for 2 hours and am totally stumped, so any help is greatly appreciated.
index | original_name | match_name_0 | score_0 | match_index_0 | match_name_1 | score_1 | match_index_1 | match_name_2 | score_2 | match_index_2 | match_name_3 | score_3 | match_index_3 | match_name_4 | score_4 | match_index_4
0 | aberdeen asset management plc | aberdeen asset management sa | 100 | 2114 | aberdeen asset management plc esop | 100 | 2128 | aberdeen asset management inc | 100 | 2123 | aberdeen asset management spain | 71.18779356 | 2132 | aberdeen asset management ireland | 69.50514818 | 2125
2 | agi partners llc | agi partners llc | 100 | 5274 | agi partners llc | 100 | 5273 | agr partners llc | 57.51100704 | 5378 | aci partners llc | 53.45090217 | 3097 | avi partners llc | 53.45090217 | 17630
3 | alberta investment management corporation | alberta investment management corporation | 100 | 6754 | alberta investment management corporation pension arm | 100 | 6755 | anchor investment management corporation | 17.50748486 | 10682 | cbc investment management corporation | 11.79760839 | 36951 | harvest investment management corporation | 31.70316571 | 85547
I am assuming you want to impose the ordering of matches first by score and then by match_number, individually for each original_name.
Wide datasets are usually difficult to deal with, and this case is no exception. I suggest reshaping to a long dataset, where you can easily impose your required ordering with
sort_values(by=['original_name','score','match_number'], ascending=[True,False,True])
Finally, you can reshape it back to a wide dataset.
import pandas as pd
from io import StringIO
# sample data
df = """
original_name,match_name_0,score_0,match_index_0,match_name_1,score_1,match_index_1,match_name_2,score_2,match_index_2,match_name_3,score_3,match_index_3,match_name_4,score_4,match_index_4
aberdeen asset management plc,aberdeen asset management sa,100,2114,aberdeen asset management plc esop,100,2128,aberdeen asset management inc,100,2123,aberdeen asset management spain,71.18779356,2132,aberdeen asset management ireland,69.50514818,2125
agi partners llc,agi partners llc,100,5274,agi partners llc,100,5273,agr partners llc,57.51100704,5378,aci partners llc,53.45090217,3097,avi partners llc,53.45090217,17630
alberta investment management corporation,alberta investment management corporation,100,6754,alberta investment management corporation pension arm,100,6755,anchor investment management corporation,17.50748486,10682,cbc investment management corporation,11.79760839,36951,harvest investment management corporation,31.70316571,85547
"""
df = pd.read_csv(StringIO(df.strip()), sep=',', engine='python')
# wide to long
result = pd.wide_to_long(df, ['match_name','score','match_index'], i='original_name', j='match_number', sep='_').reset_index()
# sort matches as per requirement
result = result.sort_values(by=['original_name','score','match_number'], ascending=[True,False,True])
# overwrite ranking imposed by previous sort
# this ensures that the order is maintained once it is
# reshaped back to a wide dataset
result['match_number'] = result.groupby('original_name').cumcount()
# reshape long to wide
result = result.set_index(['original_name','match_number']).unstack()
# tidy up to match expected result
result = result.swaplevel(axis=1).sort_index(axis=1)
result = result.reindex(['match_name','score','match_index'], axis=1, level=1)
result.columns = [f'{col[1]}_{col[0]}' for col in result.columns]
As a result, for example, previous match 4 of alberta investment management corporation is now match 2 (based on score). The order of matches 3 and 4 for agi partners llc remains the same because they have the same score.
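If you want a quick sanity check of the new layout, you can look at one company's row after the reshape (a minimal sketch; it assumes the result dataframe produced by the code above):

import pandas as pd  # already imported above

# Inspect the reordered match columns for a single original_name;
# match 0 should now carry the highest score
print(result.loc['alberta investment management corporation'])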
To give an idea, I have an Excel file (.xlsx format) in which I am working with 2 sheets at a time.
I am interested in 'entity name' from sheet a and 'name' from sheet b.
Sheet b has the 'name' column repeated 7 times.
My sheet a looks like this:
Isin Entity Name
DE0005545503 1&1 AG
US68243Q1067 1-800-Flowers.Com Inc
US68269G1076 1Life Healthcare Inc
US3369011032 1st Source Corp
while my sheet b looks like this:
name | company_id | name | company_id | name | company_id | name | company_id | name | company_id | name | company_id | name
LIVERPOOL PARTNERS MICROCAP GROWTH FUND MANAGER PTY LTD | 586056 | FERRARI NADIA | 1000741 | DORSET COMMUNITY RADIO LTD | 1250023 | Hunan Guangtongsheng Communication Service Co., Ltd. | 1500335 | Steffes Prüf- und Messtechnik GmbH, | 1550006 | CHL SRL | 2000320 | Qu Star, Inc.
BISCUIT AVENUE PTY LTD | 586474 | D AMBROSIO MARIA | 1000382 | LUCKY WORLD PRODUCTIONS LIMITED | 1250024 | Zhuzhou Wanlian Telecommunication Co., Ltd. | 1500354 | e42 II GmbH | 1550510 | EGGTRONIC SPA | 2000023 | Molly Shaheen, L.L.C.
CL MAY1212 PTY LTD | 586475 | TORIJA ZANE LUCIA LUCIA | 1000389 | FYLDE COAST MEDIA LTD | 1250034 | Zhongyi Tietong Co., Ltd. Yanling Xiayang Broadband TV Service Center | 1500376 | Valorem Capital UG (haftungsbeschränkt) | 1550539 | MARACAIBA INVEST SRL | 2000139 | Truptisudhir Pharmacy Inc
Alternatively, you can find sheet b here:
Here's my code:
import pandas as pd
from fuzzywuzzy import fuzz
filename = 'C:/Users/Downloads/SUniverse.xlsx'
dataframe1 = pd.read_excel(filename, sheet_name='A')
dataframe2 = pd.read_excel(filename, sheet_name='B')
# print(dataframe1.head())
# print(dataframe2.head())
# Clean the name lists (drop missing values)
A_cleaned = [df1 for df1 in dataframe1["Entity Name"] if not(pd.isnull(df1))]
B_cleaned = [df2 for df2 in dataframe2["name"].unique() if not(pd.isnull(df2))]
print(A_cleaned)
print(B_cleaned)
# Perform fuzzy string matching
tuples_list = [max([(fuzz.token_set_ratio(i,j),j) for j in B_cleaned]) for i in A_cleaned]
print(tuples_list)
# Unpack list of tuples into two lists
similarity_score, fuzzy_match = map(list,zip(*tuples_list))
# Create pandas DataFrame
df = pd.DataFrame({"I_Entity_Name":A_cleaned, "I_Name": fuzzy_match, "similarity score":similarity_score})
df.to_excel("C:/Users/Downloads/fuz-match-output.xlsx", sheet_name="Fuzzy String Matching", index=False)
print('done!')
The code takes forever to generate results. It has been over 20 hours and the script is still running. My Excel input file is over 50 MB in size (it contains millions of records).
How do I make my script run faster and actually produce the result? I want the output to be this:
Entity Name | Name | fuzzy score
apple | APPLE | 100
...
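One common way to speed this up is to replace the pure-Python double loop with a library that scores strings in compiled code. Below is a minimal sketch using rapidfuzz (a faster, largely API-compatible alternative to fuzzywuzzy); it assumes the A_cleaned and B_cleaned lists built by the code above and that rapidfuzz is installed:

import pandas as pd
from rapidfuzz import fuzz, process

# process.extractOne scans all of B_cleaned in optimized compiled code and
# returns (best_match, score, index) for each query string
matches = [process.extractOne(name, B_cleaned, scorer=fuzz.token_set_ratio)
           for name in A_cleaned]

df = pd.DataFrame({
    "I_Entity_Name": A_cleaned,
    "I_Name": [m[0] for m in matches],
    "similarity score": [m[1] for m in matches],
})
df.to_excel("C:/Users/Downloads/fuz-match-output.xlsx",
            sheet_name="Fuzzy String Matching", index=False)

Deduplicating B_cleaned (as the original code already does with .unique()) keeps the search space small, and for very large inputs rapidfuzz.process.cdist can score whole blocks of names at once.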
Given the following pandas df -
Holding Account | Entity ID | Holding Account Number | % Ownership | Entity ID % | Account # % | Ownership Audit Note
11 West Summit Drive LLC (80008660955) | 3423435 | 54353453454 | 100 | 100 | 100 | NaN
110 Goodwill LLC (91928475) | 7653453 | 65464565 | 50 | 50 | 50 | Partial Ownership [50.00%]
1110 Webbers St LLC (14219739) | 1235734 | 12343535 | 100 | 100 | 100 | NaN
120 Goodwill LLC (30271633) | 9572953 | 96839592 | 55 | 55 | 55 | Inactive Client [10.00%]
Objective - I am trying to create an Exceptions Report and only include rows based on the following logic:
1. % Ownership != 100%, OR
2. (Ownership Audit Note == "-") & (Account # % == 100% OR Entity ID % == 100%)
Attempt - I am able to produce the components which make up my required logic, but I can't seem to bring them together:
# This gets me rows which meet 1.
df = df[df['% Ownership'].eq(100)==False]
# Something 'like' this would get me 2.
df = df[df['Ownership Audit Note'] == "-"] & df[df['Account # %'|'Entity ID %'] == "None"]
I am looking for some hints/tips to help me bring all this together in the most pythonic way.
Use:
df = df[df['% Ownership'].ne(100) | (df['Ownership Audit Note'].eq("-") & (df['Account # %'].eq(100) | df['Entity ID %'].eq(100)))]
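If it helps readability, the same filter can be assembled from named boolean masks first (a sketch assuming the column names shown above):

# Condition 1: ownership is not exactly 100%
not_fully_owned = df['% Ownership'].ne(100)

# Condition 2: no audit note, but either percentage column still reports 100%
no_audit_note = df['Ownership Audit Note'].eq("-")
pct_is_100 = df['Account # %'].eq(100) | df['Entity ID %'].eq(100)

exceptions = df[not_fully_owned | (no_audit_note & pct_is_100)]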
I have a dataframe with company ticker ("ticker"), full name ("longName") and short name ("unofficial_name") - the abridged name is created from the long name by removing Inc., plc, etc.
I also have a separate dataframe with company news: date of the news ("date"), headline ("name"), news text ("text") and sentiment analysis scores.
I am trying to find company name matches in the list of articles and create a new dataframe with unique company-article matches (i.e. if one article mentions more than one company, that article gets one row per company mentioned).
I tried to execute the matching based on the "unofficial_name" with the following code:
dict = []
for n, c in zip(df_news["text"], sp500_names["unofficial_name"]):
    if c in n:
        x = {"text": n, "unofficial_name": c}
        dict.append(x)
print(dict)
But I get an empty list returned. Any ideas how to solve it?
sp500_names
ticker longName unofficial_name
0 A Agilent Technologies, Inc. Agilent Technologies
1 AAL American Airlines Group Inc. American Airlines Group
df_news
name date text neg neu pos compound
0 Asian stock markets reverse losses on global p... 2020-03-01 [By Tom Westbrook and Swati Pandey SINGAPORE (... 0.086 0.863 0.051 -0.9790
1 Energy & Precious Metals - Weekly Review and C... 2020-03-01 [By Barani Krishnan Investing.com - How much ... 0.134 0.795 0.071 -0.9982
Thank you!
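For reference, zip only pairs the i-th article with the i-th company, so most combinations are never checked. A minimal sketch of the full cross-check described above (assuming the df_news and sp500_names dataframes shown) could look like:

import pandas as pd

# Compare every article against every company name, not just positionally aligned pairs
matches = []
for _, article in df_news.iterrows():
    # the "text" column appears to hold a list of strings; join it if so
    text = " ".join(article["text"]) if isinstance(article["text"], list) else str(article["text"])
    for name in sp500_names["unofficial_name"]:
        if name in text:
            matches.append({"date": article["date"], "unofficial_name": name, "text": article["text"]})

matches_df = pd.DataFrame(matches)  # one row per company-article match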
I have two data frames, each with a different number of rows. Below are a couple of rows from each data set.
df1 =
Company City State ZIP
FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101
CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102
GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102
LACKEY SHEET METAL St. Louis MO 63102
and
df2 =
FDA Company FDA City FDA State FDA ZIP
LACKEY SHEET METAL St. Louis MO 63102
PRIMUS STERILIZER COMPANY LLC Great Bend KS 67530
HELGET GAS PRODUCTS INC Omaha NE 68127
ORTHOQUEST LLC La Vista NE 68128
I joined them side by side using combined_data = pandas.concat([df1, df2], axis = 1). My next goal is to compare each string under df1['Company'] to each string under df2['FDA Company'] using several different matching commands from the fuzzywuzzy module, and return the value of the best match and its name, stored in a new column. For example, if I ran fuzz.ratio and fuzz.token_sort_ratio on LACKEY SHEET METAL in df1['Company'] against df2['FDA Company'], it would return that the best match was LACKEY SHEET METAL with a score of 100, and this would then be saved under a new column in combined_data. The result would look like:
combined_data =
Company City State ZIP FDA Company FDA City FDA State FDA ZIP fuzzy.token_sort_ratio match fuzzy.ratio match
FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101 LACKEY SHEET METAL St. Louis MO 63102 LACKEY SHEET METAL 100 LACKEY SHEET METAL 100
CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102 PRIMUS STERILIZER COMPANY LLC Great Bend KS 67530
GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102 HELGET GAS PRODUCTS INC Omaha NE 68127
LACKEY SHEET METAL St. Louis MO 63102 ORTHOQUEST LLC La Vista NE 68128
I tried doing
combined_data['name_ratio'] = combined_data.apply(lambda x: fuzz.ratio(x['Company'], x['FDA Company']), axis = 1)
But I got an error because the lengths of the columns are different.
I am stumped. How can I accomplish this?
I couldn't tell what you were doing. This is how I would do it.
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
Create a series of tuples to compare:
compare = pd.MultiIndex.from_product([df1['Company'],
                                      df2['FDA Company']]).to_series()
Create a special function to calculate fuzzy metrics and return a series.
def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])
Apply metrics to the compare series
compare.apply(metrics)
There are a bunch of ways to do this next part:
Get closest matches to each row of df1
compare.apply(metrics).unstack().idxmax().unstack(0)
Get closest matches to each row of df2
compare.apply(metrics).unstack(0).idxmax().unstack(0)
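As one possible follow-up (a sketch reusing compare and metrics from above; the intermediate names are mine), you can keep only the best token_sort_ratio match per df1 company and join it back onto the original frame:

scores = compare.apply(metrics)          # index: (Company, FDA Company); columns: ratio, token

# For each df1 Company, the (Company, FDA Company) pair with the highest token score
best_pairs = scores['token'].groupby(level=0).idxmax()

best = pd.DataFrame({
    'fuzzy.token_sort_ratio match': best_pairs.str[1],
    'token score': scores['token'].loc[list(best_pairs)].values,
})

# Attach the best match and its score to df1 by company name
df1_matched = df1.join(best, on='Company')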