I'm working on matching company names, and I have a dataframe that returns output in the format below.
The table has an original name and, for each original name, there can be N matches. For each match there are 3 columns: match_name_0, score_0, match_index_0, and so on up to match_name_N.
I'm trying to figure out a way to return a new dataframe that sorts the columns after original_name by the highest match scores. Essentially, if score_2 was the highest, followed by score_0 and then score_1, the columns would be
original_name, match_name_2, score_2, match_index_2, match_name_0, score_0, match_index_0, match_name_1, score_1, match_index_1
In the event of a tie, the leftmost match should be ranked higher. I should note that sometimes the matches are already in the correct order, but 30-40% of the time they are not.
I've been staring at my screen for 2 hours and am totally stumped, so any help is greatly appreciated.
index | original_name | match_name_0 | score_0 | match_index_0 | match_name_1 | score_1 | match_index_1 | match_name_2 | score_2 | match_index_2 | match_name_3 | score_3 | match_index_3 | match_name_4 | score_4 | match_index_4
0 | aberdeen asset management plc | aberdeen asset management sa | 100 | 2114 | aberdeen asset management plc esop | 100 | 2128 | aberdeen asset management inc | 100 | 2123 | aberdeen asset management spain | 71.18779356 | 2132 | aberdeen asset management ireland | 69.50514818 | 2125
2 | agi partners llc | agi partners llc | 100 | 5274 | agi partners llc | 100 | 5273 | agr partners llc | 57.51100704 | 5378 | aci partners llc | 53.45090217 | 3097 | avi partners llc | 53.45090217 | 17630
3 | alberta investment management corporation | alberta investment management corporation | 100 | 6754 | alberta investment management corporation pension arm | 100 | 6755 | anchor investment management corporation | 17.50748486 | 10682 | cbc investment management corporation | 11.79760839 | 36951 | harvest investment management corporation | 31.70316571 | 85547
I am assuming you want to order the matches first by score and then by match_number, separately for each original_name.
Wide datasets are usually difficult to work with, and this case is no exception. I suggest reshaping to a long dataset, where you can easily impose the required ordering with
sort_values(by=['original_name','score','match_number'], ascending=[True,False,True])
Finally, you can reshape it back to a wide dataset.
import pandas as pd
from io import StringIO
# sample data
df = """
original_name,match_name_0,score_0,match_index_0,match_name_1,score_1,match_index_1,match_name_2,score_2,match_index_2,match_name_3,score_3,match_index_3,match_name_4,score_4,match_index_4
aberdeen asset management plc,aberdeen asset management sa,100,2114,aberdeen asset management plc esop,100,2128,aberdeen asset management inc,100,2123,aberdeen asset management spain,71.18779356,2132,aberdeen asset management ireland,69.50514818,2125
agi partners llc,agi partners llc,100,5274,agi partners llc,100,5273,agr partners llc,57.51100704,5378,aci partners llc,53.45090217,3097,avi partners llc,53.45090217,17630
alberta investment management corporation,alberta investment management corporation,100,6754,alberta investment management corporation pension arm,100,6755,anchor investment management corporation,17.50748486,10682,cbc investment management corporation,11.79760839,36951,harvest investment management corporation,31.70316571,85547
"""
df= pd.read_csv(StringIO(df.strip()), sep=',', engine='python')
# wide to long
result = pd.wide_to_long(df, ['match_name','score','match_index'], i='original_name', j='match_number', sep='_').reset_index()
# sort matches as per requirement
result = result.sort_values(by=['original_name','score','match_number'], ascending=[True,False,True])
# overwrite ranking imposed by previous sort
# this ensures that the order is maintained once it is
# reshaped back to a wide dataset
result['match_number'] = result.groupby('original_name').cumcount()
# reshape long to wide
result = result.set_index(['original_name','match_number']).unstack()
# tidy up to match expected result
result = result.swaplevel(axis=1).sort_index(axis=1)
result = result.reindex(['match_name','score','match_index'], axis=1, level=1)
result.columns = [f'{col[1]}_{col[0]}' for col in result.columns]
As a result, for example, the previous match 4 of alberta investment management corporation is now match 2 (based on score). The order of matches 3 and 4 for agi partners llc remains the same because they have the same score.
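For comparison, the same ordering can be sketched one row at a time without reshaping, relying on the fact that Python's sort is stable, so tied scores keep their left-to-right order. This is an alternative illustration rather than the approach above; the helper name reorder_matches is made up for the example, and the row is the alberta row from the sample data written as a plain dict.

```python
def reorder_matches(row, n_matches):
    """Renumber the match triples of one row by descending score.

    `row` maps the question's column names (match_name_i, score_i,
    match_index_i) to values. Hypothetical helper for illustration.
    """
    triples = [
        (row[f"match_name_{i}"], row[f"score_{i}"], row[f"match_index_{i}"])
        for i in range(n_matches)
    ]
    # sorted() is stable: triples with equal scores keep their original
    # order, so the leftmost match wins a tie.
    triples = sorted(triples, key=lambda t: -t[1])

    out = {"original_name": row["original_name"]}
    for rank, (name, score, idx) in enumerate(triples):
        out[f"match_name_{rank}"] = name
        out[f"score_{rank}"] = score
        out[f"match_index_{rank}"] = idx
    return out


row = {
    "original_name": "alberta investment management corporation",
    "match_name_0": "alberta investment management corporation",
    "score_0": 100, "match_index_0": 6754,
    "match_name_1": "alberta investment management corporation pension arm",
    "score_1": 100, "match_index_1": 6755,
    "match_name_2": "anchor investment management corporation",
    "score_2": 17.50748486, "match_index_2": 10682,
    "match_name_3": "cbc investment management corporation",
    "score_3": 11.79760839, "match_index_3": 36951,
    "match_name_4": "harvest investment management corporation",
    "score_4": 31.70316571, "match_index_4": 85547,
}
reordered = reorder_matches(row, 5)
```

On a DataFrame, something like df.apply(lambda r: pd.Series(reorder_matches(r, 5)), axis=1) should give the same wide layout, though the reshape approach is likely to scale better on large tables.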
To give an idea, I have an Excel file (.xlsx format) in which I am working with 2 sheets at a time.
I am interested in 'Entity Name' from sheet A and 'name' from sheet B.
Sheet B has the 'name' column repeated 7 times.
My sheet A looks like this:
Isin Entity Name
DE0005545503 1&1 AG
US68243Q1067 1-800-Flowers.Com Inc
US68269G1076 1Life Healthcare Inc
US3369011032 1st Source Corp
while my sheet b looks like this
name | company_id | name | company_id | name | company_id | name | company_id | name | company_id | name | company_id | name
LIVERPOOL PARTNERS MICROCAP GROWTH FUND MANAGER PTY LTD | 586056 | FERRARI NADIA | 1000741 | DORSET COMMUNITY RADIO LTD | 1250023 | Hunan Guangtongsheng Communication Service Co., Ltd. | 1500335 | Steffes Prüf- und Messtechnik GmbH, | 1550006 | CHL SRL | 2000320 | Qu Star, Inc.
BISCUIT AVENUE PTY LTD | 586474 | D AMBROSIO MARIA | 1000382 | LUCKY WORLD PRODUCTIONS LIMITED | 1250024 | Zhuzhou Wanlian Telecommunication Co., Ltd. | 1500354 | e42 II GmbH | 1550510 | EGGTRONIC SPA | 2000023 | Molly Shaheen, L.L.C.
CL MAY1212 PTY LTD | 586475 | TORIJA ZANE LUCIA LUCIA | 1000389 | FYLDE COAST MEDIA LTD | 1250034 | Zhongyi Tietong Co., Ltd. Yanling Xiayang Broadband TV Service Center | 1500376 | Valorem Capital UG (haftungsbeschränkt) | 1550539 | MARACAIBA INVEST SRL | 2000139 | Truptisudhir Pharmacy Inc
alternatively you can find the sheet b here:
Here's my code
import pandas as pd
from fuzzywuzzy import fuzz
filename = 'C:/Users/Downloads/SUniverse.xlsx'
dataframe1 = pd.read_excel(filename, sheet_name='A')
dataframe2 = pd.read_excel(filename, sheet_name='B')
# print(dataframe1.head())
# print(dataframe2.head())
# Clean customers lists
A_cleaned = [df1 for df1 in dataframe1["Entity Name"] if not(pd.isnull(df1))]
B_cleaned = [df2 for df2 in dataframe2["name"].unique() if not(pd.isnull(df2))]
print(A_cleaned)
print(B_cleaned)
# Perform fuzzy string matching
tuples_list = [max([(fuzz.token_set_ratio(i,j),j) for j in B_cleaned]) for i in A_cleaned]
print(tuples_list)
# Unpack list of tuples into two lists
similarity_score, fuzzy_match = map(list,zip(*tuples_list))
# Create pandas DataFrame
df = pd.DataFrame({"I_Entity_Name":A_cleaned, "I_Name": fuzzy_match, "similarity score":similarity_score})
df.to_excel("C:/Users/Downloads/fuz-match-output.xlsx", sheet_name="Fuzzy String Matching", index=False)
print('done!')
The code takes forever to generate results. It has been over 20 hours and the script is still running. My Excel input file is over 50 MB in size (it contains millions of records).
How do I make my script run faster and actually produce the result? I want the output to be this:
Entity Name Name fuzzy score
apple APPLE 100
.
.
.
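One way to cut the runtime, sketched here under assumptions: the list comprehension scores every entity name against every candidate name, so the work grows with len(A_cleaned) * len(B_cleaned). Names that already agree once case and whitespace are normalised can be resolved with a dictionary lookup, leaving only the remainder for the expensive scorer. The helper names and sample lists below are hypothetical, and difflib.SequenceMatcher from the standard library stands in for fuzz.token_set_ratio so the sketch runs without extra packages; its scores are not the same as fuzzywuzzy's.

```python
from difflib import SequenceMatcher


def normalise(name):
    # Lowercase and collapse runs of whitespace.
    return " ".join(str(name).lower().split())


def stand_in_score(a, b):
    # Stand-in scorer on a 0-100 scale; swap in fuzz.token_set_ratio in real use.
    return round(100 * SequenceMatcher(None, a, b).ratio())


def match_names(entities, candidates):
    # Map each normalised candidate to its first original spelling.
    lookup = {}
    for cand in candidates:
        lookup.setdefault(normalise(cand), cand)

    results = []
    for entity in entities:
        key = normalise(entity)
        if key in lookup:
            # Exact hit after normalisation: no pairwise scoring needed.
            results.append((entity, lookup[key], 100))
            continue
        # Otherwise score against every distinct candidate and keep the best.
        best = max(lookup, key=lambda cand_key: stand_in_score(key, cand_key))
        results.append((entity, lookup[best], stand_in_score(key, best)))
    return results


# Invented sample lists for illustration only.
entities = ["apple", "1st Source Corp"]
candidates = ["APPLE", "FIRST SOURCE CORPORATION", "CHL SRL"]
matches = match_names(entities, candidates)
```

The fallback loop is still quadratic for names without an exact hit, so a faster scorer (the rapidfuzz package offers a token_set_ratio of its own) may be worth trying for that part. Separately, pandas renames repeated column headers to name.1, name.2 and so on when reading a sheet, so dataframe2["name"] covers only the first of the seven name columns.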
I have a huge txt file from which I want to exclude every page number, piece of tabular data and heading. The only differentiator I can think of is that the text I need to keep is at least two lines long.
The data looks like this (as an example):
1 C o mp a n y
2 C o mb in ed ma na g emen t
r ep o r t
Total equity and liabilities
6,130.3
100.0%
5,930.0
100.0%
200.3
Additionally, there is bodytext, which I want to keep:
The total assets of ZALANDO SE rose by 3.4% primarily due to a further increase in financial
assets. The assets of ZALANDO SE mainly consist of financial and current assets, specifically
securities and cash, shares in affiliated companies as well as inventories and receivables.
Equity and liabilities comprise equity and current and non-current liabilities and provisions.
I did try to write:
myvariable = textstring.replace(\n.*\n," ") but it does not do anything.
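A note on the attempt above: str.replace treats its first argument as a literal string rather than a pattern, and the unquoted \n.*\n is not valid Python, so pattern-based filtering needs a different tool. A minimal sketch under a stated assumption: body text is taken to be any run of at least two consecutive lines that are each at least min_chars characters long. The function name keep_body_text and both thresholds are hypothetical and would need tuning on the real file.

```python
from itertools import groupby


def keep_body_text(text, min_chars=60, min_lines=2):
    """Keep runs of >= min_lines consecutive lines, each >= min_chars long."""
    kept = []
    lines = text.splitlines()
    # Group consecutive lines by whether they are long enough to be body text.
    for is_long, run in groupby(lines, key=lambda line: len(line.strip()) >= min_chars):
        run = list(run)
        if is_long and len(run) >= min_lines:
            # Join each surviving run into a single paragraph.
            kept.append(" ".join(line.strip() for line in run))
    return "\n".join(kept)


sample = "\n".join([
    "1 C o mp a n y",
    "2 C o mb in ed ma na g emen t",
    "r ep o r t",
    "Total equity and liabilities",
    "6,130.3",
    "100.0%",
    "The total assets of ZALANDO SE rose by 3.4% primarily due to a further increase in financial",
    "assets. The assets of ZALANDO SE mainly consist of financial and current assets, specifically",
    "securities and cash, shares in affiliated companies as well as inventories and receivables.",
    "Equity and liabilities comprise equity and current and non-current liabilities and provisions.",
])
body = keep_body_text(sample)
```

A length threshold will also drop a genuine paragraph whose last line is short, so the heuristic is a starting point rather than a complete filter.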
I have a dataframe with company ticker ("ticker"), full name ("longName") and short name ("unofficial_name"). The short name is created from the long name by removing inc., plc and so on.
I also have a separate dataframe with company news: the date of the news ("date"), headline ("name"), news text ("text") and sentiment analysis scores.
I am trying to find company name matches in the list of articles and create a new dataframe with unique company-article matches (i.e. if one article mentions more than one company, it would get one row per company mentioned).
I tried to execute the matching based on the "unofficial_name" with the following code:
dict=[]
for n, c in zip(df_news["text"], sp500_names["unofficial_name"]):
    if c in n:
        x = {"text":n, "unofficial_name":c}
        dict.append(x)
print(dict)
But I get an empty list returned. Any ideas how to solve it?
sp500_names
ticker longName unofficial_name
0 A Agilent Technologies, Inc. Agilent Technologies
1 AAL American Airlines Group Inc. American Airlines Group
df_news
name date text neg neu pos compound
0 Asian stock markets reverse losses on global p... 2020-03-01 [By Tom Westbrook and Swati Pandey SINGAPORE (... 0.086 0.863 0.051 -0.9790
1 Energy & Precious Metals - Weekly Review and C... 2020-03-01 [By Barani Krishnan Investing.com - How much ... 0.134 0.795 0.071 -0.9982
Thank you!
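A likely cause, offered as a sketch rather than a confirmed diagnosis: zip pairs the first article with the first company, the second with the second, and so on, so each article is checked against a single name instead of all of them. The sample output also suggests that each "text" cell holds a list rather than a plain string, in which case c in n tests for a list element equal to the name. The example below uses plain lists with made-up article text to show a loop over every article and every name; find_mentions is a hypothetical helper.

```python
def find_mentions(texts, names):
    """Return one record per (article, company) pair where the name occurs."""
    records = []
    for text in texts:
        # Join list-valued cells into one string before searching.
        if isinstance(text, (list, tuple)):
            text = " ".join(str(part) for part in text)
        for name in names:
            # Plain, case-sensitive substring test.
            if name in text:
                records.append({"text": text, "unofficial_name": name})
    return records


# Invented sample data for illustration only.
texts = [
    ["Agilent Technologies and American Airlines Group both reported results."],
    "Markets were quiet today.",
    "Shares of American Airlines Group fell.",
]
names = ["Agilent Technologies", "American Airlines Group"]
mentions = find_mentions(texts, names)
```

With the dataframes from the question, the call would presumably be find_mentions(df_news["text"], sp500_names["unofficial_name"]), and pd.DataFrame(mentions) would turn the records into the desired dataframe.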
This one has been relatively tricky for me. I am trying to extract an embedded table sourced from Google Sheets in Python.
Here is the link
I do not own the sheet, but it is publicly available.
Here is my code so far. When I print the headers, it shows me "". The end goal is to convert this table into a pandas DataFrame. Any help would be greatly appreciated. Thanks guys
import lxml.html as lh
import pandas as pd
import requests
url = 'https://docs.google.com/spreadsheets/u/0/d/e/2PACX-1vQ--HR_GTaiv2dxaVwIwWYzY2fXTSJJN0dugyQe_QJnZEpKm7bu5o7eh6javLIk2zj0qtnvjJPOyvu2/pubhtml/sheet?headers=false&gid=1503072727'
page = requests.get(url)
doc = lh.fromstring(page.content)
tr_elements = doc.xpath('//tr')
col = []
i = 0
for t in tr_elements[0]:
    i +=1
    name = t.text_content()
    print('%d:"%s"'%(i,name))
    col.append((name,[]))
If you would like to get the data into a DataFrame, you could load it directly with pd.read_html:
df = pd.read_html('https://docs.google.com/spreadsheets/u/0/d/e/2PACX-1vQ--HR_GTaiv2dxaVwIwWYzY2fXTSJJN0dugyQe_QJnZEpKm7bu5o7eh6javLIk2zj0qtnvjJPOyvu2/pubhtml/sheet?headers=false&gid=1503072727',
header=1)[0]
df.drop(columns='1', inplace=True) # remove unnecessary index column called "1"
This will give you:
Target Ticker Acquirer \
0 Acacia Communications Inc Com ACIA Cisco Systems Inc Com
1 Advanced Disposal Services Inc Com ADSW Waste Management Inc Com
2 Allergan Plc Com AGN Abbvie Inc Com
3 Ak Steel Holding Corp Com AKS Cleveland Cliffs Inc Com
4 Td Ameritrade Holding Corp Com AMTD Schwab (Charles) Corp Com
Ticker.1 Current Price Take Over Price Price Diff % Diff Date Announced \
0 CSCO $68.79 $70.00 $1.21 1.76% 7/9/2019
1 WM $32.93 $33.15 $0.22 0.67% 4/15/2019
2 ABBV $197.05 $200.22 $3.17 1.61% 6/25/2019
3 CLF $2.98 $3.02 $0.04 1.34% 12/3/2019
4 SCHW $49.31 $51.27 $1.96 3.97% 11/25/2019
Deal Type
0 Cash
1 Cash
2 C&S
3 Stock
4 Stock
Note that read_html returns a list of DataFrames. In this case there is only one DataFrame, so we can take the first and only element with [0].