I'm trying to scrape a list from EDGAR.
The information I need is in the "td" elements (for example, those with class "entity-name"). However, the code I currently have doesn't return anything. I would appreciate any help. Thanks in advance!
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
s = Service('/PATH/chromedriver')
driver = webdriver.Chrome(service=s)
driver.get("https://www.sec.gov/edgar/search/#/q=%2522cyber%2520insurance%2522&dateRange=custom&category=form-cat1&startdt=2011-01-01&enddt=2022-03-12&filter_forms=10-K")
try:
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'entity-name')))
except TimeoutException:
    print('Page timed out after 10 secs.')
page = BeautifulSoup(driver.page_source,'html.parser')
print(page)
To extract the text from the entity-name column, induce WebDriverWait for visibility_of_all_elements_located() instead of presence_of_element_located(). You can use either of the following locator strategies:
Using CSS_SELECTOR and text attribute:
driver.get('https://www.sec.gov/edgar/search/#/q=%2522cyber%2520insurance%2522&dateRange=custom&category=form-cat1&startdt=2011-01-01&enddt=2022-03-12&filter_forms=10-K')
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "td.entity-name")))])
Using XPATH and get_attribute("innerHTML"):
driver.get('https://www.sec.gov/edgar/search/#/q=%2522cyber%2520insurance%2522&dateRange=custom&category=form-cat1&startdt=2011-01-01&enddt=2022-03-12&filter_forms=10-K')
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//td[@class='entity-name']")))])
Console Output:
['Excel Corp ', 'PROGRESSIVE CORP/OH/ (PGR) ', 'Electromed, Inc. (ELMD) ', 'HOOKER FURNITURE CORP (HOFT) ', 'HOOKER FURNITURE CORP (HOFT) ', 'SOUTHERN CO (SO, SOJA, SOJB, SOJC, SOJD, SOLN) <br> ALABAMA POWER CO (ALPVN, APRCP, APRDM, APRDN, APRDO, APRDP, ALP-PQ) <br> GEORGIA POWER CO (GPJA) <br> MISSISSIPPI POWER CO <br> SOUTHERN Co GAS <br> SOUTHERN POWER CO ', 'HOOKER FURNITURE CORP (HOFT) ', 'SOUTHERN CO (SO, SOJA, SOJB, SOJC, SOJD, SOLN) <br> ALABAMA POWER CO (ALPVN, APRCP, APRDM, APRDN, APRDO, APRDP, ALP-PQ) <br> GEORGIA POWER CO (GPJA) <br> MISSISSIPPI POWER CO <br> SOUTHERN Co GAS <br> SOUTHERN POWER CO ', 'BENCHMARK ELECTRONICS INC (BHE) ', 'MARRIOTT INTERNATIONAL INC /MD/ (MAR) ', 'Sprouts Farmers Market, Inc. (SFM) ', 'CF BANKSHARES INC. (CFBK) ', 'Repay Holdings Corp (RPAY) ', 'Sprouts Farmers Market, Inc. (SFM) ', 'MARRIOTT INTERNATIONAL INC /MD/ (MAR) ', 'Sprouts Farmers Market, Inc. (SFM) ', 'Albertsons Companies, Inc. (ACI) ', 'MARRIOTT INTERNATIONAL INC /MD/ (MAR) ', 'MARRIOTT INTERNATIONAL INC /MD/ (MAR) ', 'HENNESSY ADVISORS INC (HNNA) ', 'Repay Holdings Corp (RPAY, RPAYW) ', 'Repay Holdings Corp (RPAY, RPAYW, TBRGU) ', 'Arlo Technologies, Inc. (ARLO) ', 'Repay Holdings Corp (RPAY, RPAYW) ', 'NATIONAL HEALTH INVESTORS INC (NHI) ', 'MOTORCAR PARTS AMERICA INC (MPAA) ', 'RGC RESOURCES INC (RGCO) ', 'Arlo Technologies, Inc. (ARLO) ', 'CRYOLIFE INC (CRY) ', 'Mimecast Ltd (MIME) ', 'RGC RESOURCES INC (RGCO) ', 'MOTORCAR PARTS AMERICA INC (MPAA) ', 'NOODLES & Co (NDLS) ', 'PAPA JOHNS INTERNATIONAL INC (PZZA) ', 'MOTORCAR PARTS AMERICA INC (MPAA) ', 'MOTORCAR PARTS AMERICA INC (MPAA) ', 'PAPA JOHNS INTERNATIONAL INC (PZZA) ', 'MOTORCAR PARTS AMERICA INC (MPAA) ', 'Sprouts Farmers Market, Inc. (SFM) ', 'MOTORCAR PARTS AMERICA INC (MPAA) ', 'GARMIN LTD (GRMN) ', 'Sprouts Farmers Market, Inc. (SFM) ', 'nDivision Inc. (NDVN) ', 'nDivision Inc. (NDVN) ', 'nDivision Inc. 
(NDVN) ', 'WEYCO GROUP INC (WEYS) ', 'DiamondRock Hospitality Co (DRH) ', 'Pebblebrook Hotel Trust (PEB, PEB-PC, PEB-PD, PEB-PE, PEB-PF) ', 'Sprouts Farmers Market, Inc. (SFM) ', 'MYR GROUP INC. (MYRG) ', 'Chatham Lodging Trust (CLDT, CLDT-PA) ', 'WEYCO GROUP INC (WEYS) ', 'INFINITE GROUP INC (IMCI) ', 'DiamondRock Hospitality Co (DRH) ', 'Pebblebrook Hotel Trust (PEB, PEB-PC, PEB-PD, PEB-PE, PEB-PF) ', 'DiamondRock Hospitality Co (DRH, DRH-PA) ', 'Pebblebrook Hotel Trust (PEB, PEB-PC, PEB-PD, PEB-PE, PEB-PF) ', 'DLH Holdings Corp. (DLHC) ', 'Summit Hotel Properties, Inc. (INN) ', 'BOYD GAMING CORP (BYD) ', 'Summit Hotel Properties, Inc. (INN) ', 'DiamondRock Hospitality Co (DRH, DRH-PA) ', 'CINCINNATI FINANCIAL CORP (CINF) ', 'Summit Hotel Properties, Inc. (INN) ', 'Pebblebrook Hotel Trust (PEB, PEB-PC, PEB-PD, PEB-PE, PEB-PF) ', 'ARTIVION, INC. (AORT) ', 'STAR GROUP, L.P. (SGU) ', 'Pebblebrook Hotel Trust (PEB, PEB-PE, PEB-PF, PEB-PG, PEB-PH) ', 'RGC RESOURCES INC (RGCO) ', 'INFINITE GROUP INC (IMCI) ', 'LEGGETT & PLATT INC (LEG) ', 'RGC RESOURCES INC (RGCO) ', 'COSTCO WHOLESALE CORP /NEW (COST) ', 'DLH Holdings Corp. (DLHC) ', 'CANTERBURY PARK HOLDING CORP ', 'WEYCO GROUP INC (WEYS) ', 'DLH Holdings Corp. (DLHC) ', 'WEYCO GROUP INC (WEYS) ', 'Canterbury Park Holding Corp (CPHC) ', 'RGC RESOURCES INC (RGCO) ', 'IEC ELECTRONICS CORP (IEC) ', 'INFINITE GROUP INC (IMCI) ', 'Canterbury Park Holding Corp (CPHC) ', 'WEYCO GROUP INC (WEYS) ', 'Canterbury Park Holding Corp (CPHC) ', 'AMERICAN STATES WATER CO (AWR) <br> Golden State Water CO ', 'LEGGETT & PLATT INC (LEG) ', 'Vy Global Growth (VYGG, VYGG-UN, VYGG-WT) ', 'Summit Hotel Properties, Inc. (INN) ', 'Vy Global Growth (VYGG, VYGG-UN, VYGG-WT) ', 'Sunstone Hotel Investors, Inc. (SHO, SHO-PE, SHO-PF) ', 'CRYOLIFE INC (CRY) ', 'BOYD GAMING CORP (BYD) ', 'Sunstone Hotel Investors, Inc. (SHO, SHO-PE, SHO-PF) ', 'Summit Hotel Properties, Inc. (INN, INN-PE, INN-PF) ', 'Green Bancorp, Inc. 
(GNBC) ', 'TELKONET INC (TKOI) ', 'COHEN & STEERS INC (CNS) ', 'Sunstone Hotel Investors, Inc. (SHO, SHO-PE, SHO-PF) ', 'Green Bancorp, Inc. (GNBC) ']
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
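Some of the innerHTML values in the console output above hold several entities separated by <br> tags. A minimal post-processing sketch follows, assuming you want one clean name per entity. The helper name split_entities is hypothetical and not part of Selenium:

```python
import html
import re

def split_entities(inner_html):
    """Split an innerHTML string on <br> tags and tidy each piece."""
    # Split on <br>, <br/> or <br />, swallowing the whitespace around the tag.
    parts = re.split(r"\s*<br\s*/?>\s*", inner_html)
    # Decode HTML entities such as &amp; and drop empty fragments.
    return [html.unescape(p).strip() for p in parts if p.strip()]

cell = "AMERICAN STATES WATER CO (AWR) <br> Golden State Water CO "
names = split_entities(cell)
print(names)
```

Cells that contain a single entity come back as a one-item list, so the same helper can be applied to every element of the scraped list.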
I have a data frame with a column of countries that reviewers are from. I want to replace all of the countries that are NOT in my nationalities list with "other".
I created the following code but it will not run. I get this error:
ValueError: ('Lengths must match to compare', (51575,), (9,))
nationalities = ['United Kingdom', 'United States of America', 'Australia', 'Ireland', 'United Arab Emirates', 'Saudi Arabia', 'Netherlands', 'Germany', 'Canada' ]
sample_hotel_df['Reviewer_Nationality'] = sample_hotel_df['Reviewer_Nationality'].replace(np.where(sample_hotel_df['Reviewer_Nationality'] != nationalities), 'Other')
Sample Input:
sample_hotel_df['Reviewer_Nationality'] = np.array([' Latvia ', ' Israel ', ' Lebanon ', ' Azerbaijan ',
' Kazakhstan ', ' Iraq ', ' Thailand ', ' Denmark ', ' Bulgaria ',
' Luxembourg ', ' Jordan ', ' Kenya ', ' Iceland ', ' Estonia ',
' Serbia ', ' Malta ', ' Cyprus ', ' Greece ', ' South Africa ',
' Croatia ', ' Oman ', ' Bahrain ', ' Finland ', ' Singapore ',
' Malaysia ', ' Portugal ', ' Yemen ', ' Bangladesh ', ' Sudan ',
' Libya ', ' Palestinian Territory ', ' Lithuania ',
' Philippines ', ' Hong Kong ', ' ', ' Dominican Republic ',
' Armenia ', ' Slovakia ', ' Tunisia ', ' Chile ', ' Mauritius ',
' Nepal ', ' Peru ', ' Ghana ', ' Montenegro ', ' Jersey ',
' Morocco ', ' Andorra ', ' Sri Lanka ', ' Argentina ',
' Puerto Rico ', ' Honduras ', ' Indonesia ', ' Abkhazia Georgia ',
' Ukraine ', ' Mongolia ', ' Taiwan ', ' Georgia ',
' Bosnia and Herzegovina ', ' Montserrat ', ' Uruguay ', ' Syria ',
' Jamaica ', ' Angola ', ' Gibraltar ', ' Zambia '])
Output:
sample_hotel_df['Reviewer_Nationality'] = np.array(['United Kingdom',
'United States of America',
'Australia', 'Ireland',
'United Arab Emirates',
'Saudi Arabia',
'Netherlands', 'Germany',
'Canada', 'Other'
])
I can run a for loop but it's computationally heavy. Any suggestions?
Thanks!
You do not need str.replace for this.
sample_hotel_df.loc[~sample_hotel_df['Reviewer_Nationality'].isin(nationalities), 'Reviewer_Nationality'] = "other"
Let's say this is your CSV file (data.csv):
Reviewer_Nationality
Latvia
Israel
United States of America
Lebanon
United Kingdom
Australia
You can read it by using pandas:
>>> import pandas as pd
>>> rev_nat = pd.read_csv('data.csv')['Reviewer_Nationality'].to_list()
Then you can filter the nationalities in this way:
>>> nat = ['United Kingdom', 'United States of America', 'Australia', 'Ireland']
>>> result = list(set(n if n in nat else 'Other' for n in rev_nat))
The final result is (the order of a set is arbitrary):
['Other', 'United States of America', 'United Kingdom', 'Australia']
The simplest is probably :
ser = sample_hotel_df['Reviewer_Nationality']
sample_hotel_df['Reviewer_Nationality'] = ser.where(ser.isin(nationalities), 'Other')
In any case, use ser.isin(lst) and ~ser.isin(lst) in your filters instead of == and !=.
== and != compare element by element, so comparing a 51575-row Series with a 9-item list raises the "Lengths must match" error.
Edit :
Yes it works :)
But :
Your sample Series has no country that shouldn't be 'Other' according to your list...
All your countries have trailing and leading spaces. So you should clean it with .str.strip()
So this should work, even with your data :
ser = sample_hotel_df['Reviewer_Nationality'].str.strip()
sample_hotel_df['Reviewer_Nationality'] = ser.where(ser.isin(nationalities), 'Other')
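To illustrate why the whitespace matters, here is a small sketch on made-up data. The four sample rows and the shortened nationalities list are assumptions for the example, not the asker's real data:

```python
import pandas as pd

nationalities = ["United Kingdom", "Australia"]

# Hypothetical sample with the same padding as the question's data.
df = pd.DataFrame(
    {"Reviewer_Nationality": [" United Kingdom ", " Latvia ", " Australia ", " "]}
)

# Without stripping, no padded value matches the list, so every row becomes 'Other'.
raw = df["Reviewer_Nationality"]
unstripped = raw.where(raw.isin(nationalities), "Other").tolist()

# After stripping, membership works as intended.
ser = raw.str.strip()
result = ser.where(ser.isin(nationalities), "Other").tolist()

print(unstripped)
print(result)
```

Series.where keeps the values where the condition is True and substitutes 'Other' everywhere else, which is why no loop over the rows is needed.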
I'm quite new to Python. I'm trying to practice my web-scraping skills using BeautifulSoup on a Danish stock trading website: https://www.nordnet.dk/markedet/aktiekurser
This is my code:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.nordnet.dk/markedet/aktiekurser')
soup = BeautifulSoup(page.content, 'html.parser')
stocks = soup.find(id='tabs-tabpanel-0')
items = stocks.find_all(class_ ='c02356 c02375')
name = [item.find(class_ = 'c02393 c02394').get_text() for item in items]
Price = [item.find(class_ = 'number c02398').get_text() for item in items]
print(name)
print(Price)
It is working fine, but it seems like the class 'number c02398' contains more than one value, since print(Price) returns two values. How do I change my code, so that I only get one value from the class?
This is the span tag that the item variable contains:
<span class="number c02398" style="white-space:nowrap"><span class="c02420"> <!-- -->426.2<!-- --> </span><span aria-hidden="true">426,20</span></span>
If you want the value inside the span tag whose aria-hidden attribute is "true", use the line below:
Price = [item.find(class_ = 'number c02398').find("span", attrs={"aria-hidden":"true"}).get_text() for item in items]
Returns :
['426,20', '1.251,50', '13.035', '981,00', '337,30', '2.399', '600,00', '1.070,00', '925,00', '98,84', '178,50', '940,00', '12.250', '113,00', '229,00', '609,80', '560,00', '2.162', '492,50', '826,00', '198,35', '207,40', '264,40', '20,69', '56,30', '672,40', '223,50', '216,70', '183,40', '185,60', '255,40', '184,00', '469,50', '2.306', '53,04', '101,30', '1,125', '269', '126,10', '269,60', '229,20', '114,00', '541', '605', '24,16', '117,00', '114,80', '44,40', '78,80', '135', '71,80', '270,00', '126,20', '2,755', '12,20', '59,60', '2.040', '136,00', '116,00', '11,28', '219', '337,00', '110,00', '508', '49,05', '3,04', '5,24', '235', '87,40', '59,0', '4,90', '675', '57,50', '174,00', '47,15', '233', '10,20', '584', '1.015', '688', '118', '3,020', '4,12', '1,870', '248', '5,74', '5.200', '78,50', '4,92', '177,00', '0,393', '8,60', '15,50', '6.400', '448', '540', '13,95', '90,40', '9,43', '131,50']
If you want to get the value inside the span tag with class name: c02420, use the below code line:
Price = [item.find(class_ = 'number c02398').find("span").get_text() for item in items]
Returns:
[' 426.2 ', ' 1251.5 ', ' 13035 ', ' 981 ', ' 337.3 ', ' 2399 ', ' 600 ', ' 1070 ', ' 925 ', ' 98.84 ', ' 178.5 ', ' 940 ', ' 12250 ', ' 113 ', ' 229 ', ' 609.8 ', ' 560 ', ' 2162 ', ' 492.5 ', ' 826 ', ' 198.35 ', ' 207.4 ', ' 264.4 ', ' 20.69 ', ' 56.3 ', ' 672.4 ', ' 223.5 ', ' 216.7 ', ' 183.4 ', ' 185.6 ', ' 255.4 ', ' 184 ', ' 469.5 ', ' 2306 ', ' 53.04 ', ' 101.3 ', ' 1.125 ', ' 269 ', ' 126.1 ', ' 269.6 ', ' 229.2 ', ' 114 ', ' 541 ', ' 605 ', ' 24.16 ', ' 117 ', ' 114.8 ', ' 44.4 ', ' 78.8 ', ' 135 ', ' 71.8 ', ' 270 ', ' 126.2 ', ' 2.755 ', ' 12.2 ', ' 59.6 ', ' 2040 ', ' 136 ', ' 116 ', ' 11.28 ', ' 219 ', ' 337 ', ' 110 ', ' 508 ', ' 49.05 ', ' 3.04 ', ' 5.24 ', ' 235 ', ' 87.4 ', ' 59 ', ' 4.9 ', ' 675 ', ' 57.5 ', ' 174 ', ' 47.15 ', ' 233 ', ' 10.2 ', ' 584 ', ' 1015 ', ' 688 ', ' 118 ', ' 3.02 ', ' 4.12 ', ' 1.87 ', ' 248 ', ' 5.74 ', ' 5200 ', ' 78.5 ', ' 4.92 ', ' 177 ', ' 0.393 ', ' 8.6 ', ' 15.5 ', ' 6400 ', ' 448 ', ' 540 ', ' 13.95 ', ' 90.4 ', ' 9.43 ', ' 131.5 ']
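If you pick the aria-hidden values, they are strings in Danish number format. Comparing the two outputs above ('1.251,50' next to ' 1251.5 ', '1,125' next to ' 1.125 ') suggests that '.' is the thousands separator and ',' the decimal separator. Under that assumption, a minimal conversion sketch could look like this (parse_danish_number is a hypothetical helper name):

```python
def parse_danish_number(text):
    """Convert a Danish-formatted number such as '1.251,50' to a float."""
    # Drop thousands separators first, then turn the decimal comma into a dot.
    cleaned = text.strip().replace(".", "").replace(",", ".")
    return float(cleaned)

prices = ["426,20", "1.251,50", "13.035", "1,125"]
values = [parse_danish_number(p) for p in prices]
print(values)
```

The values from the c02420 span already use a dot as the decimal separator, so float(text.strip()) is enough for those.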
I have a nested dictionary like the one below. I would like to replace the end of a string using the inner key/value pairs: if an inner key appears at the end of the string, replace it with its value, but only when the row's country code equals the outer dictionary key (not the inner key).
country_dict = {
    'IND': {' PVT. LTD.': ' Pvt. Ltd.',
            ' pvt. Ltd': ' Pvt. Ltd.',
            ' PVT LTD': ' Pvt. Ltd.',
            ' L.L.P.': ' LLP',
            ' LTD.': ' Ltd.',
            ' LLP.': ' LLP',
            ' ltd': ' Ltd.',
            ' llp': ' LLP'},
    'GBR': {' P.L.C.': ' PLC',
            ' C.I.C.': ' CIC',
            ' p.l.c': ' PLC',
            ' c.i.c': ' CIC',
            ' s.e.': ' SE',
            ' PLC.': ' PLC'},
    'USA': {' LTD. CO.': ' Ltd. Co.',
            ' L.L.L.P.': ' LLLP',
            ' ltd. Co': ' Ltd. Co.',
            ' l.l.l.p': ' LLLP',
            ' L.L.P.': ' LLP',
            ' L.L.C.': ' LLC',
            ' l.l.p': ' LLP',
            ' l.l.c': ' LLC'}
}
My dataframe has two columns, Name (the legal name) and Reg Country Code:
Name               Reg Country Code
NexPoint LTD. CO.  USA
Silverplay P.L.C.  GBR
ALLOYS PVT. LTD.   IND
GALLIUM ltd.       IND
ELLIOTT s.e.       GBR
I used the code below. It replaces the string whenever the legal name contains an inner key, but it does not check the country code against the outer key. Can someone please suggest a fix? (I have a big list.)
for i in range(len(df)):
    for k1 in country_dict.items():
        if df.loc[i, 'Reg Country Code'] == k1:
            for k2, v2 in country_dict[k1].items():
                df.loc[df['Reg Country Code'] == k1, 'Name'] = [re.sub(k2, v, x) if x.endswith(k2) else x for x in df.loc[df['Reg Country Code'] == k1, 'Name']]
My output should be:
Name               Reg Country Code
NexPoint Ltd. Co.  USA
Silverplay PLC     GBR
ALLOYS Pvt. Ltd.   IND
GALLIUM Ltd.       IND
ELLIOTT SE         GBR
You can group the df by country code and replace
df['NAME'] = df.groupby('REG COUNTRY CODE')['NAME'].apply(lambda x: x.replace(d[x.name], regex = True))
NAME REG COUNTRY CODE
0 NexPoint Ltd. Co. USA
1 Silverplay PLC GBR
2 ALLOYS Pvt. Ltd. IND
3 GALLIUM Ltd.. IND
4 ELLIOTT SE GBR
Explanation:
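Inside apply, x is the Series of names for one group, and x.name is that group's key (the country code in this case).
By using d[x.name], we look up the inner dictionary for that country code (d is the nested dictionary from the question).
Setting regex to True makes replace substitute matching substrings instead of requiring the whole string to match. The keys are then treated as regular expressions, so '.' matches any character and a match is not anchored to the end of the name, which is why 'GALLIUM ltd.' becomes 'GALLIUM Ltd..' above.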
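If the replacement must only happen at the very end of the name, a plain-Python alternative can be sketched as follows. This is not the groupby approach above: it is an illustrative sketch that uses str.endswith, a shortened copy of the mapping, and a hypothetical helper name normalize_suffix.

```python
# Shortened, illustrative copy of the nested mapping from the question.
suffixes_by_country = {
    "IND": {" PVT. LTD.": " Pvt. Ltd.", " LTD.": " Ltd."},
    "GBR": {" P.L.C.": " PLC", " s.e.": " SE"},
    "USA": {" LTD. CO.": " Ltd. Co.", " L.L.C.": " LLC"},
}

def normalize_suffix(name, country_code, mapping=suffixes_by_country):
    """Replace a trailing legal suffix using only the given country's rules."""
    rules = mapping.get(country_code, {})
    # Try longer suffixes first so ' PVT. LTD.' wins over ' LTD.'.
    for old in sorted(rules, key=len, reverse=True):
        if name.endswith(old):
            return name[: -len(old)] + rules[old]
    return name  # no suffix matched: leave the name unchanged

rows = [
    ("NexPoint LTD. CO.", "USA"),
    ("Silverplay P.L.C.", "GBR"),
    ("ALLOYS PVT. LTD.", "IND"),
    ("ELLIOTT s.e.", "GBR"),
]
cleaned = [normalize_suffix(name, code) for name, code in rows]
print(cleaned)
```

With a DataFrame, the same function could be applied per row, for example through a list comprehension over zip(df['Name'], df['Reg Country Code']). Because the check is a literal endswith, a name such as 'GALLIUM ltd.' is only changed if a key with the trailing dot (' ltd.') is present in the mapping.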
This is the code in which I tried to get data from a website using requests and save it in a dictionary called table. When I iterate through the values and append them to a list, I run into the problem described below. Any help is appreciated.
import requests
from bs4 import BeautifulSoup
list1 = []
table = {}
r = requests.get("https://www.century21.com/real-estate/rock-springs-wy/LCWYROCKSPRINGS/?k=1")
content = r.content
soup = BeautifulSoup(content,'html.parser')
all = soup.find_all('div',{"class":"property-card-primary-info"})
for item in all:
    print(item.find('a',{"class":"listing-price"}).text.replace('\n','').replace(' ',''))
    table['address'] = item.find('div',{"class":"property-address"}).text.replace('\n','').replace(' ','')
    table['city'] = item.find('div',{"class":"property-city"}).text.replace('\n','').replace(' ','')
    table['beds'] = item.find('div',{"class":"property-beds"}).text.replace('\n','').replace(' ','')
    table['baths'] = item.find('div',{"class":"property-baths"}).text.replace('\n','').replace(' ','')
    try:
        table['half-baths'] = item.find("div",{"class":"property-half-baths"}).text.replace('\n','').replace(' ','')
    except:
        table['half-baths'] = None
    try:
        table['property sq.ft.'] = item.find("div",{"class":"property-sqft"}).text.replace(' ','').replace("\n",'')
    except:
        table['property sq.ft.'] = None
    list1.append(table)
list1
OUTPUT
$325,000
$249,000
$390,000
$274,900
$208,000
$169,000
$127,500
$990,999
I get unique values when I print the prices, but when I append to the list, all the values are replicated. Any help would mean a lot.
Question: how do I get rid of this replication and get the corresponding values for each listing?
[{'address': ' 1129 Hilltop Drive',
'city': 'Rock Springs WY 82901 ',
'beds': '4 beds ',
'baths': '5 baths ',
'half-baths': '2 half baths ',
'property sq.ft.': '10,300 sq. ft '},
{'address': ' 1129 Hilltop Drive',
'city': 'Rock Springs WY 82901 ',
'beds': '4 beds ',
'baths': '5 baths ',
'half-baths': '2 half baths ',
'property sq.ft.': '10,300 sq. ft '},
{'address': ' 1129 Hilltop Drive',
'city': 'Rock Springs WY 82901 ',
'beds': '4 beds ',
'baths': '5 baths ',
'half-baths': '2 half baths ',
'property sq.ft.': '10,300 sq. ft '},
{'address': ' 1129 Hilltop Drive',
'city': 'Rock Springs WY 82901 ',
'beds': '4 beds ',
'baths': '5 baths ',
'half-baths': '2 half baths ',
'property sq.ft.': '10,300 sq. ft '},
{'address': ' 1129 Hilltop Drive',
'city': 'Rock Springs WY 82901 ',
'beds': '4 beds ',
'baths': '5 baths ',
'half-baths': '2 half baths ',
'property sq.ft.': '10,300 sq. ft '},
{'address': ' 1129 Hilltop Drive',
'city': 'Rock Springs WY 82901 ',
'beds': '4 beds ',
'baths': '5 baths ',
'half-baths': '2 half baths ',
'property sq.ft.': '10,300 sq. ft '},
{'address': ' 1129 Hilltop Drive',
'city': 'Rock Springs WY 82901 ',
'beds': '4 beds ',
'baths': '5 baths ',
'half-baths': '2 half baths ',
'property sq.ft.': '10,300 sq. ft '},
{'address': ' 1129 Hilltop Drive',
'city': 'Rock Springs WY 82901 ',
'beds': '4 beds ',
'baths': '5 baths ',
'half-baths': '2 half baths ',
'property sq.ft.': '10,300 sq. ft '}]
for item in all:
    table = {}  # important: create a new dict for every listing, otherwise every list entry refers to the same dict
    print(item.find('a',{"class":"listing-price"}).text.replace('\n','').replace(' ',''))
    table['address'] = item.find('div',{"class":"property-address"}).text.replace('\n','').replace(' ','')
    table['city'] = item.find('div',{"class":"property-city"}).text.replace('\n','').replace(' ','')
    table['beds'] = item.find('div',{"class":"property-beds"}).text.replace('\n','').replace(' ','')
    table['baths'] = item.find('div',{"class":"property-baths"}).text.replace('\n','').replace(' ','')
    try:
        table['half-baths'] = item.find("div",{"class":"property-half-baths"}).text.replace('\n','').replace(' ','')
    except:
        table['half-baths'] = None
    try:
        table['property sq.ft.'] = item.find("div",{"class":"property-sqft"}).text.replace(' ','').replace("\n",'')
    except:
        table['property sq.ft.'] = None
    list1.append(table)
print(list1)  # print the list once, outside the loop (a set cannot hold dicts, so set(list1) would raise TypeError)
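The replication comes from appending the same dict object on every iteration: the list ends up holding several references to one dict, which shows whatever was written last. A minimal sketch with made-up prices (no scraping involved) shows the difference:

```python
prices = ["$325,000", "$249,000", "$390,000"]

# One dict created outside the loop: every append stores the same object.
shared = {}
aliased = []
for price in prices:
    shared["price"] = price
    aliased.append(shared)

# A new dict created inside the loop: every append stores a distinct object.
fresh = []
for price in prices:
    row = {"price": price}
    fresh.append(row)

print(aliased)  # three references to one dict holding the last price
print(fresh)    # three different dicts, one per price
```

Appending shared.copy() inside the first loop would also work, but creating the dict inside the loop is usually the clearer choice.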