I have a nested dictionary like the one below. I would like to replace a suffix in a string using the inner key/value pairs: if an inner key occurs at the end of the string, replace it with its value, but only when the row's country code equals the outer dictionary key (not the inner key).
country_dict = {
    'IND': {' PVT. LTD.': ' Pvt. Ltd.',
            ' pvt. Ltd': ' Pvt. Ltd.',
            ' PVT LTD': ' Pvt. Ltd.',
            ' L.L.P.': ' LLP',
            ' LTD.': ' Ltd.',
            ' LLP.': ' LLP',
            ' ltd': ' Ltd.',
            ' llp': ' LLP'},
    'GBR': {' P.L.C.': ' PLC',
            ' C.I.C.': ' CIC',
            ' p.l.c': ' PLC',
            ' c.i.c': ' CIC',
            ' s.e.': ' SE',
            ' PLC.': ' PLC'},
    'USA': {' LTD. CO.': ' Ltd. Co.',
            ' L.L.L.P.': ' LLLP',
            ' ltd. Co': ' Ltd. Co.',
            ' l.l.l.p': ' LLLP',
            ' L.L.P.': ' LLP',
            ' L.L.C.': ' LLC',
            ' l.l.p': ' LLP',
            ' l.l.c': ' LLC'}
}
My dataframe has two columns, Name (the legal name) and Reg Country Code:
Name               Reg Country Code
NexPoint LTD. CO.  USA
Silverplay P.L.C.  GBR
ALLOYS PVT. LTD.   IND
GALLIUM ltd.       IND
ELLIOTT s.e.       GBR
I used the code below. It replaces the string whenever the legal name contains an inner key, but it does not check the country code against the outer key. Can someone please suggest a fix? (I have a big list.)
for i in range(len(df)):
    for k1 in country_dict.items():
        if df.loc[i, 'Reg Country Code'] == k1:
            for k2, v2 in country_dict[k1].items():
                df.loc[df['Reg Country Code'] == k1, 'Name'] = [re.sub(k2, v, x) if x.endswith(k2) else x for x in df.loc[df['Reg Country Code'] == k1, 'Name']]
My output should be:
Name               Reg Country Code
NexPoint Ltd. Co.  USA
Silverplay PLC     GBR
ALLOYS Pvt. Ltd.   IND
GALLIUM Ltd.       IND
ELLIOTT SE         GBR
You can group the df by country code and run the replacement per group:
df['NAME'] = df.groupby('REG COUNTRY CODE')['NAME'].apply(lambda x: x.replace(d[x.name], regex = True))
                NAME REG COUNTRY CODE
0  NexPoint Ltd. Co.              USA
1     Silverplay PLC              GBR
2   ALLOYS Pvt. Ltd.              IND
3      GALLIUM Ltd..              IND
4         ELLIOTT SE              GBR
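Explanation:
Inside apply, x.name is the name of the current group (the country code in this case).
With d[x.name] (d being the nested dictionary), we look up the inner dictionary that belongs to that country code.
Setting regex=True makes replace substitute matching substrings instead of requiring the whole value to match. The keys are not anchored to the end of the string, so the key ' ltd' matches inside 'GALLIUM ltd.' and leaves the trailing period, which is why row 3 comes out as 'GALLIUM Ltd..'.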
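If the replacement must only happen at the very end of the name, one option is to skip regex altogether and compare suffixes with str.endswith. The following is only a sketch under a few assumptions: normalize_suffix is a hypothetical helper, the dictionary is shortened, the ' ltd.' key and the 'Acme P.L.C.' row are made up for illustration and do not come from the question.

```python
# Shortened version of the nested dictionary from the question.
country_dict = {
    'IND': {' PVT. LTD.': ' Pvt. Ltd.', ' LTD.': ' Ltd.', ' ltd': ' Ltd.',
            ' ltd.': ' Ltd.'},  # ' ltd.' is a hypothetical extra key
    'GBR': {' P.L.C.': ' PLC', ' s.e.': ' SE'},
    'USA': {' LTD. CO.': ' Ltd. Co.', ' L.L.C.': ' LLC'},
}


def normalize_suffix(name, code):
    """Replace a trailing suffix of name using only the inner dict for code."""
    suffixes = country_dict.get(code, {})
    # Check longer keys first so ' PVT. LTD.' is tried before ' LTD.'.
    for old in sorted(suffixes, key=len, reverse=True):
        if name.endswith(old):
            return name[:-len(old)] + suffixes[old]
    return name


rows = [
    ('NexPoint LTD. CO.', 'USA'),
    ('Silverplay P.L.C.', 'GBR'),
    ('ALLOYS PVT. LTD.', 'IND'),
    ('GALLIUM ltd.', 'IND'),
    ('ELLIOTT s.e.', 'GBR'),
    ('Acme P.L.C.', 'USA'),  # hypothetical row: ' P.L.C.' is not a USA key
]
cleaned = [normalize_suffix(name, code) for name, code in rows]
```

With a DataFrame, the same helper could be applied in a list comprehension over zip(df['Name'], df['Reg Country Code']). The last row stays unchanged because the suffix only exists under 'GBR'.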
['AAR CORP. FQ1 2011 EARNINGS CALL | SEP 16, 2010 ',
'Copyright © 2019 S&P Global Market Intelligence, a division of S&P Global Inc. All Rights reserved. ',
'spglobal.com/marketintelligence 3 ',
' ',
' ',
'Call Participants ',
'EXECUTIVES ',
' ',
'David P. Storch ',
'Chairman of the Board ',
' ',
'Richard J. Poulton ',
'Former Chief Financial Officer, Vice ',
'President and Treasurer ',
' ',
'Timothy J. Romenesko ',
'Former Vice Chairman ',
'Tom Udovich ',
'ANALYSTS ',
'Arnold Ursaner ',
'CJS Securities, Inc. ',
' ',
'Eric Charles Hugel ',
'Stephens Inc., Research Division ',
' ',
'J. B. Groh ',
'D.A. Davidson & Co., Research ',
'Division ',
' ',
'Jonathan Paul Braatz ',
'Kansas City Capital Associates ',
'Joseph DeNardi ',
'Kenneth George Herbert ',
'Wedbush Securities Inc., Research ',
'Division ',
'Thomas Lewis ',
'Tyler Hojo ',
'Sidoti & Company, LLC ']
The above is how the text data is available.
I would like to restructure it into the layout shown in the screenshot attached to the original post (image not included here).
Can anyone please guide me on how to do this in Python?
I'm trying to scrape a list from EDGAR.
The information I need is in "td" elements with classes such as "entity-name". However, the code I currently have doesn't return anything. I would appreciate any help. Thanks in advance!
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
s = Service('/PATH/chromedriver')
driver = webdriver.Chrome(service=s)
driver.get("https://www.sec.gov/edgar/search/#/q=%2522cyber%2520insurance%2522&dateRange=custom&category=form-cat1&startdt=2011-01-01&enddt=2022-03-12&filter_forms=10-K")
try:
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'entity-name')))
except TimeoutException:
    print('Page timed out after 10 secs.')
page = BeautifulSoup(driver.page_source,'html.parser')
print(page)
To extract the text from the entity-name column, instead of presence_of_element_located() you have to induce WebDriverWait for visibility_of_all_elements_located(). You can use either of the following locator strategies:
Using CSS_SELECTOR and text attribute:
driver.get('https://www.sec.gov/edgar/search/#/q=%2522cyber%2520insurance%2522&dateRange=custom&category=form-cat1&startdt=2011-01-01&enddt=2022-03-12&filter_forms=10-K')
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "td.entity-name")))])
Using XPATH and get_attribute("innerHTML"):
driver.get('https://www.sec.gov/edgar/search/#/q=%2522cyber%2520insurance%2522&dateRange=custom&category=form-cat1&startdt=2011-01-01&enddt=2022-03-12&filter_forms=10-K')
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//td[@class='entity-name']")))])
Console Output:
['Excel Corp ', 'PROGRESSIVE CORP/OH/ (PGR) ', 'Electromed, Inc. (ELMD) ', 'HOOKER FURNITURE CORP (HOFT) ', 'HOOKER FURNITURE CORP (HOFT) ', 'SOUTHERN CO (SO, SOJA, SOJB, SOJC, SOJD, SOLN) <br> ALABAMA POWER CO (ALPVN, APRCP, APRDM, APRDN, APRDO, APRDP, ALP-PQ) <br> GEORGIA POWER CO (GPJA) <br> MISSISSIPPI POWER CO <br> SOUTHERN Co GAS <br> SOUTHERN POWER CO ', 'HOOKER FURNITURE CORP (HOFT) ', 'SOUTHERN CO (SO, SOJA, SOJB, SOJC, SOJD, SOLN) <br> ALABAMA POWER CO (ALPVN, APRCP, APRDM, APRDN, APRDO, APRDP, ALP-PQ) <br> GEORGIA POWER CO (GPJA) <br> MISSISSIPPI POWER CO <br> SOUTHERN Co GAS <br> SOUTHERN POWER CO ', 'BENCHMARK ELECTRONICS INC (BHE) ', 'MARRIOTT INTERNATIONAL INC /MD/ (MAR) ', 'Sprouts Farmers Market, Inc. (SFM) ', 'CF BANKSHARES INC. (CFBK) ', 'Repay Holdings Corp (RPAY) ', 'Sprouts Farmers Market, Inc. (SFM) ', 'MARRIOTT INTERNATIONAL INC /MD/ (MAR) ', 'Sprouts Farmers Market, Inc. (SFM) ', 'Albertsons Companies, Inc. (ACI) ', 'MARRIOTT INTERNATIONAL INC /MD/ (MAR) ', 'MARRIOTT INTERNATIONAL INC /MD/ (MAR) ', 'HENNESSY ADVISORS INC (HNNA) ', 'Repay Holdings Corp (RPAY, RPAYW) ', 'Repay Holdings Corp (RPAY, RPAYW, TBRGU) ', 'Arlo Technologies, Inc. (ARLO) ', 'Repay Holdings Corp (RPAY, RPAYW) ', 'NATIONAL HEALTH INVESTORS INC (NHI) ', 'MOTORCAR PARTS AMERICA INC (MPAA) ', 'RGC RESOURCES INC (RGCO) ', 'Arlo Technologies, Inc. (ARLO) ', 'CRYOLIFE INC (CRY) ', 'Mimecast Ltd (MIME) ', 'RGC RESOURCES INC (RGCO) ', 'MOTORCAR PARTS AMERICA INC (MPAA) ', 'NOODLES & Co (NDLS) ', 'PAPA JOHNS INTERNATIONAL INC (PZZA) ', 'MOTORCAR PARTS AMERICA INC (MPAA) ', 'MOTORCAR PARTS AMERICA INC (MPAA) ', 'PAPA JOHNS INTERNATIONAL INC (PZZA) ', 'MOTORCAR PARTS AMERICA INC (MPAA) ', 'Sprouts Farmers Market, Inc. (SFM) ', 'MOTORCAR PARTS AMERICA INC (MPAA) ', 'GARMIN LTD (GRMN) ', 'Sprouts Farmers Market, Inc. (SFM) ', 'nDivision Inc. (NDVN) ', 'nDivision Inc. (NDVN) ', 'nDivision Inc. 
(NDVN) ', 'WEYCO GROUP INC (WEYS) ', 'DiamondRock Hospitality Co (DRH) ', 'Pebblebrook Hotel Trust (PEB, PEB-PC, PEB-PD, PEB-PE, PEB-PF) ', 'Sprouts Farmers Market, Inc. (SFM) ', 'MYR GROUP INC. (MYRG) ', 'Chatham Lodging Trust (CLDT, CLDT-PA) ', 'WEYCO GROUP INC (WEYS) ', 'INFINITE GROUP INC (IMCI) ', 'DiamondRock Hospitality Co (DRH) ', 'Pebblebrook Hotel Trust (PEB, PEB-PC, PEB-PD, PEB-PE, PEB-PF) ', 'DiamondRock Hospitality Co (DRH, DRH-PA) ', 'Pebblebrook Hotel Trust (PEB, PEB-PC, PEB-PD, PEB-PE, PEB-PF) ', 'DLH Holdings Corp. (DLHC) ', 'Summit Hotel Properties, Inc. (INN) ', 'BOYD GAMING CORP (BYD) ', 'Summit Hotel Properties, Inc. (INN) ', 'DiamondRock Hospitality Co (DRH, DRH-PA) ', 'CINCINNATI FINANCIAL CORP (CINF) ', 'Summit Hotel Properties, Inc. (INN) ', 'Pebblebrook Hotel Trust (PEB, PEB-PC, PEB-PD, PEB-PE, PEB-PF) ', 'ARTIVION, INC. (AORT) ', 'STAR GROUP, L.P. (SGU) ', 'Pebblebrook Hotel Trust (PEB, PEB-PE, PEB-PF, PEB-PG, PEB-PH) ', 'RGC RESOURCES INC (RGCO) ', 'INFINITE GROUP INC (IMCI) ', 'LEGGETT & PLATT INC (LEG) ', 'RGC RESOURCES INC (RGCO) ', 'COSTCO WHOLESALE CORP /NEW (COST) ', 'DLH Holdings Corp. (DLHC) ', 'CANTERBURY PARK HOLDING CORP ', 'WEYCO GROUP INC (WEYS) ', 'DLH Holdings Corp. (DLHC) ', 'WEYCO GROUP INC (WEYS) ', 'Canterbury Park Holding Corp (CPHC) ', 'RGC RESOURCES INC (RGCO) ', 'IEC ELECTRONICS CORP (IEC) ', 'INFINITE GROUP INC (IMCI) ', 'Canterbury Park Holding Corp (CPHC) ', 'WEYCO GROUP INC (WEYS) ', 'Canterbury Park Holding Corp (CPHC) ', 'AMERICAN STATES WATER CO (AWR) <br> Golden State Water CO ', 'LEGGETT & PLATT INC (LEG) ', 'Vy Global Growth (VYGG, VYGG-UN, VYGG-WT) ', 'Summit Hotel Properties, Inc. (INN) ', 'Vy Global Growth (VYGG, VYGG-UN, VYGG-WT) ', 'Sunstone Hotel Investors, Inc. (SHO, SHO-PE, SHO-PF) ', 'CRYOLIFE INC (CRY) ', 'BOYD GAMING CORP (BYD) ', 'Sunstone Hotel Investors, Inc. (SHO, SHO-PE, SHO-PF) ', 'Summit Hotel Properties, Inc. (INN, INN-PE, INN-PF) ', 'Green Bancorp, Inc. 
(GNBC) ', 'TELKONET INC (TKOI) ', 'COHEN & STEERS INC (CNS) ', 'Sunstone Hotel Investors, Inc. (SHO, SHO-PE, SHO-PF) ', 'Green Bancorp, Inc. (GNBC) ']
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
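Some cells in the console output above hold several filers separated by <br>. As a possible follow-up step, the strings could be split into one name per entry. This is a minimal sketch that works on plain strings copied from that output; split_entities is a hypothetical helper, not part of Selenium.

```python
import re


def split_entities(cell_html):
    """Split one cell on <br>, <br/> or <br /> and trim each name."""
    parts = re.split(r'\s*<br\s*/?>\s*', cell_html)
    return [part.strip() for part in parts if part.strip()]


# Two cells taken from the console output above.
cells = [
    'Excel Corp ',
    'AMERICAN STATES WATER CO (AWR) <br> Golden State Water CO ',
]
names = [name for cell in cells for name in split_entities(cell)]
```

The same function could be applied to the list returned by the get_attribute("innerHTML") variant.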
I have a data frame with a column of countries that reviewers are from. I want to replace all of the countries that are NOT in my nationalities list with "other".
I wrote the following code, but it will not run. I get this error:
ValueError: ('Lengths must match to compare', (51575,), (9,))
nationalities = ['United Kingdom', 'United States of America', 'Australia', 'Ireland', 'United Arab Emirates', 'Saudi Arabia', 'Netherlands', 'Germany', 'Canada' ]
sample_hotel_df['Reviewer_Nationality'] = sample_hotel_df['Reviewer_Nationality'].replace(np.where(sample_hotel_df['Reviewer_Nationality'] != nationalities), 'Other')
Sample Input:
sample_hotel_df['Reviewer_Nationality'] = np.array([' Latvia ', ' Israel ', ' Lebanon ', ' Azerbaijan ',
' Kazakhstan ', ' Iraq ', ' Thailand ', ' Denmark ', ' Bulgaria ',
' Luxembourg ', ' Jordan ', ' Kenya ', ' Iceland ', ' Estonia ',
' Serbia ', ' Malta ', ' Cyprus ', ' Greece ', ' South Africa ',
' Croatia ', ' Oman ', ' Bahrain ', ' Finland ', ' Singapore ',
' Malaysia ', ' Portugal ', ' Yemen ', ' Bangladesh ', ' Sudan ',
' Libya ', ' Palestinian Territory ', ' Lithuania ',
' Philippines ', ' Hong Kong ', ' ', ' Dominican Republic ',
' Armenia ', ' Slovakia ', ' Tunisia ', ' Chile ', ' Mauritius ',
' Nepal ', ' Peru ', ' Ghana ', ' Montenegro ', ' Jersey ',
' Morocco ', ' Andorra ', ' Sri Lanka ', ' Argentina ',
' Puerto Rico ', ' Honduras ', ' Indonesia ', ' Abkhazia Georgia ',
' Ukraine ', ' Mongolia ', ' Taiwan ', ' Georgia ',
' Bosnia and Herzegovina ', ' Montserrat ', ' Uruguay ', ' Syria ',
' Jamaica ', ' Angola ', ' Gibraltar ', ' Zambia '])
Output:
sample_hotel_df['Reviewer_Nationality'] = np.array(['United Kingdom',
'United States of America',
'Australia', 'Ireland',
'United Arab Emirates',
'Saudi Arabia',
'Netherlands', 'Germany',
'Canada', 'Other'
])
I can run a for loop but it's computationally heavy. Any suggestions?
Thanks!
You do not need str.replace for this.
sample_hotel_df.loc[~sample_hotel_df['Reviewer_Nationality'].isin(nationalities), 'Reviewer_Nationality'] = "other"
Let's say this is your CSV file (data.csv):
Reviewer_Nationality
Latvia
Israel
United States of America
Lebanon
United Kingdom
Australia
You can read it by using pandas:
>>> import pandas as pd
>>> rev_nat = pd.read_csv('data.csv')['Reviewer_Nationality'].to_list()
Then you can filter the nationalities in this way:
>>> nat = ['United Kingdom', 'United States of America', 'Australia', 'Ireland']
>>> result = list(set(n if n in nat else 'Other' for n in rev_nat))
The final result is
['Other', 'United States of America', 'Australia']
The simplest is probably :
ser = sample_hotel_df['Reviewer_Nationality']
sample_hotel_df['Reviewer_Nationality'] = ser.where(ser.isin(nationalities), 'Other')
In any case, use ser.isin(lst) and ~ser.isin(lst) in your filters instead of == and !=; that is why you got the error.
== and != compare element by element (or against a single value), so comparing a Series with a list of a different length fails.
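To make the difference concrete, here is a small sketch with made-up values (the three-element Series and the two-element list are assumptions for illustration only):

```python
import pandas as pd

ser = pd.Series(['Latvia', 'Canada', 'Israel'])
nationalities = ['Canada', 'Germany']

# != tries an element-by-element comparison, which needs equal lengths.
try:
    ser != nationalities
    raised = False
except ValueError:
    raised = True

# isin tests membership of each element in the list instead.
mask = ser.isin(nationalities)
```

Here raised ends up True, while mask marks only the 'Canada' row.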
Edit :
Yes it works :)
But :
Your sample Series has no country that shouldn't be 'Other' according to your list...
All your countries have trailing and leading spaces. So you should clean it with .str.strip()
So this should work, even with your data :
ser = sample_hotel_df['Reviewer_Nationality'].str.strip()
sample_hotel_df['Reviewer_Nationality'] = ser.where(ser.isin(nationalities), 'Other')
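A quick way to check this on a small frame, assuming a shortened nationalities list and a handful of padded values like the ones in the question:

```python
import pandas as pd

nationalities = ['United Kingdom', 'United States of America', 'Australia']
sample_hotel_df = pd.DataFrame({
    'Reviewer_Nationality': [' Latvia ', ' United Kingdom ', ' Australia ', ' ', ' Israel ']
})

# Strip the padding first, otherwise ' United Kingdom ' never matches the list.
ser = sample_hotel_df['Reviewer_Nationality'].str.strip()
# Keep values found in the list; everything else becomes 'Other'.
sample_hotel_df['Reviewer_Nationality'] = ser.where(ser.isin(nationalities), 'Other')
result = sample_hotel_df['Reviewer_Nationality'].tolist()
```

The blank entry ' ' becomes an empty string after stripping and therefore also ends up as 'Other'.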
I've a DataFrame with a Company column.
Company
-------------------------------
Tundra Corporation Art Limited
Desert Networks Incorporated
Mount Yellowhive Security Corp
Carter, Rath and Mueller Limited (USD/AC)
Barrows corporation /PACIFIC
Corporation, Mounted Security
I've a dictionary with regexes to normalize the company entities.
(^|\s)corporation(\s|$); Corp
(^|\s)Limited(\s|$); LTD
(^|\s)Incorporated(\s|$); INC
...
I need to normalize only the last occurrence. This is my desired output.
Company
-------------------------------
Tundra Corporation Art LTD
Desert Networks INC
Mount Yellowhive Security Corp
Carter, Rath and Mueller LTD (USD/AC)
Barrows Corp /PACIFIC
Corp, Mounted Security
(Only normalize Limited and not Corporation for : Tundra Corporation Art Limited)
My code:
for k, v in entity_dict.items():
    df['Company'].replace(regex=True, inplace=True, to_replace=re.compile(k,re.I), value=v)
Is it possible to change only the last occurrence of an entity (do I need to change my regex)?
Change (\s|$) to ($) to match only at the end of the string:
entity_dict = {r'(^|\s)corporation($)': ' Corp',
               r'(^|\s)Limited($)': ' LTD',
               r'(^|\s)Incorporated($)': ' INC'}
for k, v in entity_dict.items():
    df['Company'].replace(regex=True, inplace=True, to_replace=re.compile(k,re.I), value=v)
print (df)
Company
0 Tundra Corporation Art LTD
1 Desert Networks INC
2 Mount Yellowhive Security Corp
EDIT: You can simplify the dictionary to plain strings (no regex) and build a lowercase-keyed copy of it. Then use Series.str.findall to collect all matches, take the last one by indexing with str[-1], map it through the lowercase dictionary with Series.map, and finally do the replacement in a list comprehension:
entity_dict = {'corporation': 'Corp',
'Limited': 'LTD',
'Incorporated': 'INC'}
lower = {k.lower():v for k, v in entity_dict.items()}
s1 = df['Company'].str.findall('|'.join(lower.keys()), flags=re.I).str[-1].fillna('')
s2 = s1.str.lower().map(lower).fillna('')
df['Company'] = [a.replace(b, c) for a, b, c in zip(df['Company'], s1, s2)]
print (df)
Company
0 Tundra Corporation Art LTD
1 Desert Networks INC
2 Mount Yellowhive Security Corp
3 Carter, Rath and Mueller LTD (USD/AC)
4 Barrows Corp /PACIFIC
5 Corp, Mounted Security
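Note that a.replace(b, c) in the list comprehension replaces every occurrence of the matched text, so a name that repeats the same entity with the same casing would be changed in more than one place. An alternative sketch that touches only the last occurrence relies on a greedy (.*) prefix in a single regex. The helper name normalize_last and the lowercase-keyed dictionary are assumptions for illustration:

```python
import re

entity_dict = {'corporation': 'Corp', 'limited': 'LTD', 'incorporated': 'INC'}

# The greedy (.*) consumes as much as possible, so group 2 is the LAST entity;
# \b keeps the match on whole words only.
pattern = re.compile(
    r'(.*)\b(' + '|'.join(map(re.escape, entity_dict)) + r')\b',
    flags=re.I | re.S,
)


def normalize_last(name):
    """Replace only the last entity word found in name."""
    return pattern.sub(
        lambda m: m.group(1) + entity_dict[m.group(2).lower()], name, count=1
    )


companies = [
    'Tundra Corporation Art Limited',
    'Desert Networks Incorporated',
    'Mount Yellowhive Security Corp',
    'Carter, Rath and Mueller Limited (USD/AC)',
    'Barrows corporation /PACIFIC',
    'Corporation, Mounted Security',
]
normalized = [normalize_last(c) for c in companies]
```

On a DataFrame this could be applied with df['Company'].map(normalize_last).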
I have a dictionary (key, value) and a dataframe using pandas.
mydict = {'KULAR LUMPUR' : 'MY',
'SINGAPORE' : 'SG',
'HONG KONG' : 'HK',
'VIETNAM': 'VN'}
and a dataframe with column ['Address']
Address
0 234 JALAN ST KULAR LUMPUR MALAYSIA
1 123 BUILDING STREET SINGAPORE
2 67 CANNING VALE, HONG KONG
How do I search each address and get the value from the dictionary when one of the dictionary keys occurs as a substring of the address?
e.g.
Address Code
0 234 JALAN ST KULAR LUMPUR MALAYSIA MY
1 123 BUILDING STREET SINGAPORE SG
2 67 CANNING VALE, HONG KONG HK
Use str.extract with a regex built from the dictionary keys, then map the extracted key to its value:
df = pd.DataFrame({'Address': ['234 JALAN ST KULAR LUMPUR MALAYSIA',
'123 BUILDING STREET SINGAPORE',
'67 CANNING VALE, HONG KONG']})
print (df)
Address
0 234 JALAN ST KULAR LUMPUR MALAYSIA
1 123 BUILDING STREET SINGAPORE
2 67 CANNING VALE, HONG KONG
mydict = {'KULAR LUMPUR' : 'MY',
'SINGAPORE' : 'SG',
'HONG KONG' : 'HK',
'VIETNAM': 'VN'}
pat = '|'.join(r"\b{}\b".format(x) for x in mydict.keys())
df['Code'] = df['Address'].str.extract('('+ pat + ')', expand=False).map(mydict)
print (df)
Address Code
0 234 JALAN ST KULAR LUMPUR MALAYSIA MY
1 123 BUILDING STREET SINGAPORE SG
2 67 CANNING VALE, HONG KONG HK
Explanation:
print (pat)
\bKULAR LUMPUR\b|\bSINGAPORE\b|\bHONG KONG\b|\bVIETNAM\b
\b is a word boundary, so each key only matches as whole words
| is the regex OR (alternation)
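To see what the word boundaries buy you, here is a small sketch with made-up addresses (the three rows are assumptions; re.escape is added as a precaution in case a key ever contains regex metacharacters):

```python
import re

import pandas as pd

mydict = {'KULAR LUMPUR': 'MY', 'SINGAPORE': 'SG', 'HONG KONG': 'HK', 'VIETNAM': 'VN'}
df = pd.DataFrame({'Address': ['123 BUILDING STREET SINGAPORE',
                               '1 MAIN ROAD, SINGAPOREAN CLUB',  # no whole-word key
                               '5 HANOI ST VIETNAM']})

pat = '|'.join(r'\b{}\b'.format(re.escape(k)) for k in mydict)
# extract returns the matched key, or NaN when no key is found.
matched = df['Address'].str.extract('(' + pat + ')', expand=False)
df['Code'] = matched.map(mydict)
codes = df['Code'].tolist()
```

The second row gets NaN because SINGAPORE is followed by another letter there, so the trailing \b does not match.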