Finding company matches in a list of financial news - python

I have a dataframe with company ticker("ticker"), full name ("longName) and short name ("unofficial_name") - this abridged name is created from the long name by removing inc., plc...
I also have a seperate datefame with company news: date ("date" ) of the news, headline ("name"), news text ("text") and sentiment analysis.
I am trying to find company name matches in the list of articles and create a new dataframe with unique company-article matches (i.e. if one article mentions more than one company, this article would have more rows depending on the number of companies mentioned).
I tried to execute the matching based on the "unofficial_name" with the following code:
dict=[]
for n, c in zip(df_news["text"], sp500_names["unofficial_name"]):
if c in n:
x = {"text":n, "unofficial_name":c}
dict.append(x)
print(dict)
But I get an empty list returned. Any ideas how to solve it?

sp500_names
ticker longName unofficial_name
0 A Agilent Technologies, Inc. Agilent Technologies
1 AAL American Airlines Group Inc. American Airlines Group
df_news
name date text neg neu pos compound
0 Asian stock markets reverse losses on global p... 2020-03-01 [By Tom Westbrook and Swati Pandey SINGAPORE (... 0.086 0.863 0.051 -0.9790
1 Energy & Precious Metals - Weekly Review and C... 2020-03-01 [By Barani Krishnan Investing.com - How much ... 0.134 0.795 0.071 -0.9982
Thank you!

Related

Trying to rearrange multiple columns in a dataframe based on ranking row values

I'm working on a matching company names and I have a dataframe that returns output in the format below.
The table has an original name and for each original name, there could be N number of matches. For each match, there are 3 columns, match_name_0, score_0, match_index_0 and so on up to match_name_N.
I'm trying to figure out a way to return a new dataframe that sorts the columns after the original_name by the highest match scores. Essentially, if match_score_2 was the highest then match_score_0 followed by match_score_1 the columns would be
original_score, match_name_2, match_score_2, match_index_2, match_name_0, match_score_0, match_index_0, match_name_2, match_score_2, match_index_2,
In the event of a tie, the leftmost match should be ranked higher. I should note that sometimes they will be in the correct order but 30-40% of the times, they are not.
I've been staring at my screen for 2 hours and totally stumped so any help is greatly appreciated
index
original_name
match_name_0
score_0
match_index_0
match_name_1
score_1
match_index_1
match_name_2
score_2
match_index_2
match_name_3
score_3
match_index_3
match_name_4
score_4
match_index_4
0
aberdeen asset management plc
aberdeen asset management sa
100
2114
aberdeen asset management plc esop
100
2128
aberdeen asset management inc
100
2123
aberdeen asset management spain
71.18779356
2132
aberdeen asset management ireland
69.50514818
2125
2
agi partners llc
agi partners llc
100
5274
agi partners llc
100
5273
agr partners llc
57.51100704
5378
aci partners llc
53.45090217
3097
avi partners llc
53.45090217
17630
3
alberta investment management corporation
alberta investment management corporation
100
6754
alberta investment management corporation pension arm
100
6755
anchor investment management corporation
17.50748486
10682
cbc investment management corporation
11.79760839
36951
harvest investment management corporation
31.70316571
85547
I am assuming you want to impose the ordering of matches first by score and then by match_number individually for each original_name.
Wide datasets are usually difficult to deal with, including this case. I suggest to reshape to a long dataset, where you can easily impose your required ordering by
sort_values(by=['original_name','score','match_number'], ascending=[True,False,True])
Finally, you can reshape it back to a wide dataset.
import pandas as pd
from io import StringIO
# sample data
df = """
original_name,match_name_0,score_0,match_index_0,match_name_1,score_1,match_index_1,match_name_2,score_2,match_index_2,match_name_3,score_3,match_index_3,match_name_4,score_4,match_index_4
aberdeen asset management plc,aberdeen asset management sa,100,2114,aberdeen asset management plc esop,100,2128,aberdeen asset management inc,100,2123,aberdeen asset management spain,71.18779356,2132,aberdeen asset management ireland,69.50514818,2125
agi partners llc,agi partners llc,100,5274,agi partners llc,100,5273,agr partners llc,57.51100704,5378,aci partners llc,53.45090217,3097,avi partners llc,53.45090217,17630
alberta investment management corporation,alberta investment management corporation,100,6754,alberta investment management corporation pension arm,100,6755,anchor investment management corporation,17.50748486,10682,cbc investment management corporation,11.79760839,36951,harvest investment management corporation,31.70316571,85547
"""
df= pd.read_csv(StringIO(df.strip()), sep=',', engine='python')
# wide to long
result = pd.wide_to_long(df, ['match_name','score','match_index'], i='original_name', j='match_number', sep='_').reset_index()
# sort matches as per requirement
result = result.sort_values(by=['original_name','score','match_number'], ascending=[True,False,True])
# overwrite ranking imposed by previous sort
# this ensures that the order is maintained once it is
# reshaped back to a wide dataset
result['match_number'] = result.groupby('original_name').cumcount()
# reshape long to wide
result = result.set_index(['original_name','match_number']).unstack()
# tidy up to match expected result
result = result.swaplevel(axis=1).sort_index(axis=1)
result = result.reindex(['match_name','score','match_index'], axis=1, level=1)
result.columns = [f'{col[1]}_{col[0]}' for col in result.columns]
As a result, for example, previous match 4 of alberta investment management corporation is now match 2 (based on score). The order of matches 3 and 4 for agi partners llc remain the same because they have the same score.

why does Fuzzywuzzy python script take forever to generate results?

To give an idea, I have an excel file(.xlsx format) within which I am working with 2 sheets at a time.
I am interested in 'entity name' from sheet a and 'name' from sheet b.
Sheet b has 'name' column written 7times.
my sheet a looks like this.
Isin Entity Name
DE0005545503 1&1 AG
US68243Q1067 1-800-Flowers.Com Inc
US68269G1076 1Life Healthcare Inc
US3369011032 1st Source Corp
while my sheet b looks like this
name company_id name company_id name company_id name company_id name company_id name company_id name
LIVERPOOL PARTNERS MICROCAP GROWTH FUND MANAGER PTY LTD 586056 FERRARI NADIA 1000741 DORSET COMMUNITY RADIO LTD 1250023 Hunan Guangtongsheng Communication Service Co., Ltd. 1500335 Steffes Prüf- und Messtechnik GmbH, 1550006 CHL SRL 2000320 Qu Star, Inc.
BISCUIT AVENUE PTY LTD 586474 D AMBROSIO MARIA 1000382 LUCKY WORLD PRODUCTIONS LIMITED 1250024 Zhuzhou Wanlian Telecommunication Co., Ltd. 1500354 e42 II GmbH 1550510 EGGTRONIC SPA 2000023 Molly Shaheen, L.L.C.
CL MAY1212 PTY LTD 586475 TORIJA ZANE LUCIA LUCIA 1000389 FYLDE COAST MEDIA LTD 1250034 Zhongyi Tietong Co., Ltd. Yanling Xiayang Broadband TV Service Center 1500376 Valorem Capital UG (haftungsbeschränkt) 1550539 MARACAIBA INVEST SRL 2000139 Truptisudhir Pharmacy Inc
alternatively you can find the sheet b here:
Here's my code
import pandas as pd
from fuzzywuzzy import fuzz
filename = 'C:/Users/Downloads/SUniverse.xlsx'
dataframe1 = pd.read_excel(filename, sheet_name='A')
dataframe2 = pd.read_excel(filename, sheet_name='B')
# print(dataframe1.head())
# print(dataframe2.head())
# Clean customers lists
A_cleaned = [df1 for df1 in dataframe1["Entity Name"] if not(pd.isnull(df1))]
B_cleaned = [df2 for df2 in dataframe2["name"].unique() if not(pd.isnull(df2))]
print(A_cleaned)
print(B_cleaned)
# Perform fuzzy string matching
tuples_list = [max([(fuzz.token_set_ratio(i,j),j) for j in B_cleaned]) for i in A_cleaned]
print(tuples_list)
# Unpack list of tuples into two lists
similarity_score, fuzzy_match = map(list,zip(*tuples_list))
# Create pandas DataFrame
df = pd.DataFrame({"I_Entity_Name":A_cleaned, "I_Name": fuzzy_match, "similarity score":similarity_score})
df.to_excel("C:/Users/Downloads/fuz-match-output.xlsx", sheet_name="Fuzzy String Matching", index=False)
print('done!')
The code takes forever to generate results. It has been over 20hours and the script is still running. My excel input file is going over 50mbs in size(just wanna say that it contains millions of records).
How do I ensure that my script runs at a faster pace and generates the result? I want the output to be this:
Entity Name Name fuzzy score
apple APPLE 100
.
.
.

Setting Character Limit on Pandas DataFrame Column

Background:
Given the following pandas df -
Holding Account
Model Type
Entity ID
Direct Owner ID
WF LLC | 100 Jones Street 26th Floor San Francisco Ca Ltd Liability - Income Based Gross USA Only (486941515)
51364633
4564564
5646546
RF LLC | Neuberger | LLC | Aukai Services LLC-Neuberger Smid - Income Accuring Net of Fees Worldwide Fund (456456218)
46256325
1645365
4926654
The ask:
What is the most pythonic way to enforce a 80 character limit to the Holding Account column (dtype = object) values?
Context: I am writing df to a .csv and then subsequently uploading to a system with an 80-character limit. The values of Holding Account column are unique, so I just want to sacrifice those characters that take the string over 80-characters.
My attempt:
This is what I attempted - df['column'] = df['column'].str[:80]
Why not just use .str, like you were doing?
df['Holding Account'] = df['Holding Account'].str[:80]
Output:
>>> df
Holding Account Model Type Entity ID Direct Owner ID
0 WF LLC | 100 Jones Street 26th Floor San Francisco Ca Ltd Liability - Income Bas 51364633 4564564 5646546
1 RF LLC | Neuberger | LLC | Aukai Services LLC-Neuberger Smid - Income Accuring N 46256325 1645365 4926654
Using slice will loss some information, I will suggest create a mapping table after get the factorized. This also save the storage space for server or db
s = df['Holding Account'].factorize()[0]
df['Holding Account'] = df['Holding Account'].factorize()[0]
d = dict(zip(s, df['Holding Account']))
If you would like get the databank just do
df['new'] = df['Holding Account'] .map(d)

Create new column based on value of another column

I have a solution below to give me a new column as a universal identifier, but what if there is additional data in the NAME column, how can I tweak the below to account for a wildcard like search term?
I want to basically have so if German/german or Mexican/mexican is in that row value then to give me Euro or South American value in new col
df["Identifier"] = (df["NAME"].str.lower().replace(
to_replace = ['german', 'mexican'],
value = ['Euro', 'South American']
))
print(df)
NAME Identifier
0 German Euro
1 german Euro
2 Mexican South American
3 mexican South American
Desired output
NAME Identifier
0 1990 German Euro
1 german 1998 Euro
2 country Mexican South American
3 mexican city 2006 South American
Based on an answer in this post:
r = '(german|mexican)'
c = dict(german='Euro', mexican='South American')
df['Identifier'] = df['NAME'].str.lower().str.extract(r, expand=False).map(c)
Another approach would be using np.where with those two conditions, but probably there is a more ellegant solution.
below code will work. i tried it using apply function but somehow can't able to get it. probably in sometime. meanwhile workable code below
df3['identifier']=''
js_ref=[{'german':'Euro'},{'mexican':'South American'}]
for i in range(len(df3)):
for l in js_ref:
for k,v in l.items():
if k.lower() in df3.name[i].lower():
df3.identifier[i]=v
break

How python can read and modify the CSV

**iShares Russell 3000 ETF
Inception Date May 22, 2000
Fund Holdings as of 31-oct-16
Total Net Assets $ 6,308,266,677
Shares Outstanding 48,550,000
Stock -
Bond -
Cash -
Other -**
 
Ticker Name Asset Class Weight (%) Price Shares Market Value Notional Value Sector SEDOL ISIN Exchange
AAPL APPLE INC Equity 2.8074 113.54 1,521,794 $ 172,784,491 172,784,490.76 Information Technology 2046251 US0378331005 NASDAQ
MSFT MICROSOFT CORP Equity 2.0474 59.92 2,103,008 $ 126,012,239 126,012,239.36 Information Technology 2588173 US5949181045 NASDAQ
XOM EXXON MOBIL CORP Equity 1.5675 83.32 1,157,835 $ 96,470,812 96,470,812.20 Energy 2326618 US30231G1022 New York Stock Exchange Inc.
JNJ JOHNSON & JOHNSON Equity 1.4378 115.99 762,927 $ 88,491,903 88,491,902.73 Health Care 2475833 US4781601046 New York Stock Exchange Inc.
I am trying to delete the basic info and read just the structured data below, Pandas read csv is throwing error.
Any help will be really helpful.
I assume that you want to discard the non-csv header, and that your file is tab separated afterwards (it cannot be space or it isn't a parsable csv file due to the lack of quotes and presence of space in data)
In that case you could iterate until you find the title, and then create a csv reader on the file, like this:
import csv
with open("input.csv") as f:
while(True):
l=next(f) # read input file lines
if not l or l.startswith("Ticker"): # title found (or end of file, safety)
break
cr = csv.reader(f,delimiter="\t")
for row in cr:
print(row) # get valid rows

Categories

Resources