How to find the attribute and element id by selenium.webdriver? - python

I am learning web scrapping since I need it for my work. I wrote the following code:
from selenium import webdriver
chromedriver='/home/es/drivers/chromedriver'
driver = webdriver.Chrome(chromedriver)
driver.implicitly_wait(30)
driver.get('http://crdd.osdd.net/raghava/hemolytik/submitkey_browse.php?ran=1955')
df = pd.read_html(driver.find_element_by_id("table.example.display.datatable").get_attribute('example'))[0]
However, it is showing the following error:
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="table.example.display.datatable"]"}
(Session info: chrome=103.0.5060.134)
Then I inspect the table that I wanna scrape this table from this page
what is the attribute that needs to be included in get_attribute() function in the following line?
df = pd.read_html(driver.find_element_by_id("table.example.display.datatable").get_attribute('example'))[0]
what I should write in the driver.find_element_by_id?
EDITED:
Some tables have lots of records in multi-pages.
For example, this page has 2,246 entries, which shows 100 entries on each page. Once I tried to web-scrape it, there were only 320 entries in df and the record ID is from 1232-1713, which means it took entries from the next few pages and it is not starting from the first page to the end at the last page.
What we can do in such cases?

You need to get the outerHTML property of the table first, then call the table element from pandas.
You need to wait for element to be visible. Use explicit wait like WebdriverWait()
driver.get('http://crdd.osdd.net/raghava/hemolytik/submitkey_browse.php?ran=1955')
table=WebDriverWait(driver,10).until(EC.visibility_of_element_located((By.CSS_SELECTOR,"table#example")))
tableRows=table.get_attribute("outerHTML")
df = pd.read_html(tableRows)[0]
print(df)
Import below libraries.
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
import pandas as pd
Output:
ID PMID YEAR ... DSSP Natural Structure Final Structure
0 1643 16137634 2005 ... CCCCCCCCCCCSCCCC NaN NaN
1 1644 16137634 2005 ... CCTTSCCSSCCCC NaN NaN
2 1645 16137634 2005 ... CTTTCGGGHHHHHHHHCC NaN NaN
3 1646 16137634 2005 ... CGGGTTTHHHHHHHGGGC NaN NaN
4 1647 16137634 2005 ... CCSCCCSSCHHHHHHHHHTTC NaN NaN
5 1910 16730859 2006 ... CCCCCCCSSCCSHHHHHHHHTTHHHHHHHHSSCCC NaN NaN
6 1911 16730859 2006 ... CCSCC NaN NaN
7 1912 16730859 2006 ... CCSSSCSCC NaN NaN
8 1913 16730859 2006 ... CCCSSCCSSCCSHHHHHTTHHHHTTTCSCC NaN NaN
9 1914 16730859 2006 ... CCSHHHHHHHHHHHHHCCCC NaN NaN
10 2110 11226440 2001 ... CCCSSCCCBTTBTSSSSSSCSCC NaN NaN
11 3799 9204560 1997 ... CCSSCC NaN NaN
12 4149 16137634 2005 ... CCHHHHHHHHHHHC NaN NaN
[13 rows x 17 columns]

If you want to select table by #id you need
driver.find_element_by_id("example")
By.CSS:
driver.find_element_by_css_selector("table#example")
By.XPATH:
driver.find_element_by_xpath("//table[#id='example'])
If you want to extract #id value you need
.get_attribute('id')
Since there is not much sense in searching by #id to extract that exact #id you might use other attribute of table node:
driver.find_element_by_xpath("//table[#aria-describedby='example_info']").get_attribute('id')

Related

How to trim down a Pandas data frame rows?

I'm trying so hard to shorten this awful lot of rows from an XML sitemap but I can't find a solution to trim it down.
import advertools as adv
import pandas as pd
site = "https://www.halfords.com/sitemap_index.xml"
sitemap = adv.sitemap_to_df(site)
sitemap = sitemap.dropna(subset=["loc"]).reset_index(drop=True)
# Some sitemaps keeps urls with "/" on the end, some is with no "/"
# If there is "/" on the end, we take the second last column as slugs
# Else, the last column is the slug column
slugs = sitemap['loc'].dropna()[~sitemap['loc'].dropna().str.endswith('/')].str.split('/').str[-2].str.replace('-', ' ')
slugs2 = sitemap['loc'].dropna()[~sitemap['loc'].dropna().str.endswith('/')].str.split('/').str[-1].str.replace('-', ' ')
# Merge two series
slugs = list(slugs) + list(slugs2)
# adv.word_frequency automatically removes the stop words
word_counts_onegram = adv.word_frequency(slugs)
word_counts_twogram = adv.word_frequency(slugs, phrase_len=2)
competitor = pd.concat([word_counts_onegram, word_counts_twogram])\
.rename({'abs_freq':'Count','word':'Ngram'}, axis=1)\
.sort_values('Count', ascending=False)
competitor.to_csv('competitor.csv',index=False)
competitor
competitor.shape
(67758, 2)
(67758, 2)
I've been raveling around several blogs included resources on Stack Overflow but nothing seemed to work.
This is definitely something going on with my zero expertise in coding I suppose
Two things:
You can use adv.url_to_df to split URLs and get the slugs (there should be a column called last_dir:
urldf = adv.url_to_df(sitemap['loc'].dropna())
urldf
url
scheme
netloc
path
query
fragment
dir_1
dir_2
dir_3
dir_4
dir_5
dir_6
dir_7
dir_8
dir_9
last_dir
0
https://www.halfords.com/cycling/cycling-technology/helmet-cameras/removu-k1-4k-camera-and-stabiliser-694977.html
https
www.halfords.com
/cycling/cycling-technology/helmet-cameras/removu-k1-4k-camera-and-stabiliser-694977.html
nan
nan
cycling
cycling-technology
helmet-cameras
removu-k1-4k-camera-and-stabiliser-694977.html
nan
nan
nan
nan
nan
removu-k1-4k-camera-and-stabiliser-694977.html
1
https://www.halfords.com/technology/bluetooth-car-kits/jabra-drive-bluetooth-speakerphone---white-695094.html
https
www.halfords.com
/technology/bluetooth-car-kits/jabra-drive-bluetooth-speakerphone---white-695094.html
nan
nan
technology
bluetooth-car-kits
jabra-drive-bluetooth-speakerphone---white-695094.html
nan
nan
nan
nan
nan
nan
jabra-drive-bluetooth-speakerphone---white-695094.html
2
https://www.halfords.com/tools/power-tools-and-accessories/power-tools/stanley-fatmax-v20-18v-combi-drill-kit-695102.html
https
www.halfords.com
/tools/power-tools-and-accessories/power-tools/stanley-fatmax-v20-18v-combi-drill-kit-695102.html
nan
nan
tools
power-tools-and-accessories
power-tools
stanley-fatmax-v20-18v-combi-drill-kit-695102.html
nan
nan
nan
nan
nan
stanley-fatmax-v20-18v-combi-drill-kit-695102.html
3
https://www.halfords.com/technology/dash-cams/mio-mivue-c450-695262.html
https
www.halfords.com
/technology/dash-cams/mio-mivue-c450-695262.html
nan
nan
technology
dash-cams
mio-mivue-c450-695262.html
nan
nan
nan
nan
nan
nan
mio-mivue-c450-695262.html
4
https://www.halfords.com/technology/dash-cams/mio-mivue-818-695270.html
https
www.halfords.com
/technology/dash-cams/mio-mivue-818-695270.html
nan
nan
technology
dash-cams
mio-mivue-818-695270.html
nan
nan
nan
nan
nan
nan
mio-mivue-818-695270.html
There are options that pandas provides, which you can change. For example:
pd.options.display.max_rows
60
# change it to display more/fewer rows:
pd.options.display.max_rows = 100
As you did, you can easily create onegrams and bigrams, combine them, and display them:
text_list = urldf['last_dir'].str.replace('-', ' ').dropna()
one_grams = adv.word_frequency(text_list, phrase_len=1)
bigrams = adv.word_frequency(text_list, phrase_len=2)
print(pd.concat([one_grams, bigrams])
.sort_values('abs_freq', ascending=False)
.head(15) # <-- change this to 100 for example
.reset_index(drop=True))
word
abs_freq
0
halfords
2985
1
car
1430
2
bike
922
3
kit
829
4
black
777
5
laser
686
6
set
614
7
wheel
540
8
pack
524
9
mats
511
10
car mats
478
11
thule
453
12
paint
419
13
4
413
14
spray
382
Hope that helps?

Pandas dataframe merge row by addition

I want to create a dataframe from census data. I want to calculate the number of people that returned a tax return for each specific earnings group.
For now, I wrote this
census_df = pd.read_csv('../zip code data/19zpallagi.csv')
sub_census_df = census_df[['zipcode', 'agi_stub', 'N02650', 'A02650', 'ELDERLY', 'A07180']].copy()
num_of_returns = ['Number_of_returns_1_25000', 'Number_of_returns_25000_50000', 'Number_of_returns_50000_75000',
'Number_of_returns_75000_100000', 'Number_of_returns_100000_200000', 'Number_of_returns_200000_more']
for i, column_name in zip(range(1, 7), num_of_returns):
sub_census_df[column_name] = sub_census_df[sub_census_df['agi_stub'] == i]['N02650']
I have 6 groups attached to a specific zip code. I want to get one row, with the number of returns for a specific zip code appearing just once as a column. I already tried to change NaNs to 0 and to use groupby('zipcode').sum(), but I get 50 million rows summed for zip code 0, where it seems that only around 800k should exist.
Here is the dataframe that I currently get:
zipcode agi_stub N02650 A02650 ELDERLY A07180 Number_of_returns_1_25000 Number_of_returns_25000_50000 Number_of_returns_50000_75000 Number_of_returns_75000_100000 Number_of_returns_100000_200000 Number_of_returns_200000_more Amount_1_25000 Amount_25000_50000 Amount_50000_75000 Amount_75000_100000 Amount_100000_200000 Amount_200000_more
0 0 1 778140.0 10311099.0 144610.0 2076.0 778140.0 NaN NaN NaN NaN NaN 10311099.0 NaN NaN NaN NaN NaN
1 0 2 525940.0 19145621.0 113810.0 17784.0 NaN 525940.0 NaN NaN NaN NaN NaN 19145621.0 NaN NaN NaN NaN
2 0 3 285700.0 17690402.0 82410.0 9521.0 NaN NaN 285700.0 NaN NaN NaN NaN NaN 17690402.0 NaN NaN NaN
3 0 4 179070.0 15670456.0 57970.0 8072.0 NaN NaN NaN 179070.0 NaN NaN NaN NaN NaN 15670456.0 NaN NaN
4 0 5 257010.0 35286228.0 85030.0 14872.0 NaN NaN NaN NaN 257010.0 NaN NaN NaN NaN NaN 35286228.0 NaN
And here is what I want to get:
zipcode Number_of_returns_1_25000 Number_of_returns_25000_50000 Number_of_returns_50000_75000 Number_of_returns_75000_100000 Number_of_returns_100000_200000 Number_of_returns_200000_more
0 0 778140.0 525940.0 285700.0 179070.0 257010.0 850.0
here is one way to do it using groupby and sum the desired columns
num_of_returns = ['Number_of_returns_1_25000', 'Number_of_returns_25000_50000', 'Number_of_returns_50000_75000',
'Number_of_returns_75000_100000', 'Number_of_returns_100000_200000', 'Number_of_returns_200000_more']
df.groupby('zipcode', as_index=False)[num_of_returns].sum()
zipcode Number_of_returns_1_25000 Number_of_returns_25000_50000 Number_of_returns_50000_75000 Number_of_returns_75000_100000 Number_of_returns_100000_200000 Number_of_returns_200000_more
0 0 778140.0 525940.0 285700.0 179070.0 257010.0 0.0
This question needs more information to actually give a proper answer. For example you leave out what is meant by certain columns in your data frame:
- `N1: Number of returns`
- `agi_stub: Size of adjusted gross income`
According to IRS this has the following levels.
Size of adjusted gross income "0 = No AGI Stub
1 = ‘Under $1’
2 = '$1 under $10,000'
3 = '$10,000 under $25,000'
4 = '$25,000 under $50,000'
5 = '$50,000 under $75,000'
6 = '$75,000 under $100,000'
7 = '$100,000 under $200,000'
8 = ‘$200,000 under $500,000’
9 = ‘$500,000 under $1,000,000’
10 = ‘$1,000,000 or more’"
I got the above from https://www.irs.gov/pub/irs-soi/16incmdocguide.doc
With this information, I think what you want to find is the number of
people who filed a tax return for each of the income levels of agi_stub.
If that is what you mean then, this can be achieved by:
import pandas as pd
data = pd.read_csv("./data/19zpallagi.csv")
## select only the desired columns
data = data[['zipcode', 'agi_stub', 'N1']]
## solution to your problem?
df = data.pivot_table(
index='zipcode',
values='N1',
columns='agi_stub',
aggfunc=['sum']
)
## bit of cleaning up.
PREFIX = 'agi_stub_level_'
df.columns = [PREFIX + level for level in df.columns.get_level_values(1).astype(str)]
Here's the output.
In [77]: df
Out[77]:
agi_stub_level_1 agi_stub_level_2 ... agi_stub_level_5 agi_stub_level_6
zipcode ...
0 50061850.0 37566510.0 ... 21938920.0 8859370.0
1001 2550.0 2230.0 ... 1420.0 230.0
1002 2850.0 1830.0 ... 1840.0 990.0
1005 650.0 570.0 ... 450.0 60.0
1007 1980.0 1530.0 ... 1830.0 460.0
... ... ... ... ... ...
99827 470.0 360.0 ... 170.0 40.0
99833 550.0 380.0 ... 290.0 80.0
99835 1250.0 1130.0 ... 730.0 190.0
99901 1960.0 1520.0 ... 1030.0 290.0
99999 868450.0 644160.0 ... 319880.0 142960.0
[27595 rows x 6 columns]

Scraping dynamic data selenium - Unable to locate element

I am very new to scraping and have a question. I am scraping worldometers covid data. As it is dynamic - I am doing it with selenium.
The code is the following:
from selenium import webdriver
import time
URL = "https://www.worldometers.info/coronavirus/"
# Start the Driver
driver = webdriver.Chrome(executable_path = r"C:\Webdriver\chromedriver.exe")
# Hit the url and wait for 10 seconds.
driver.get(URL)
time.sleep(10)
#find class element
data= driver.find_elements_by_class_name("odd" and "even")
#for loop
for d in data:
country=d.find_element_by_xpath(".//*[#id='main_table_countries_today']").text
print(country)
current output:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":".//*[#id='main_table_countries_today']"}
(Session info: chrome=96.0.4664.45)
To scrape table within worldometers covid data you need to induce WebDriverWait for the visibility_of_element_located() and using DataFrame from Pandas you can use the following Locator Strategy:
Code Block:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
options = Options()
options.add_argument("start-maximized")
s = Service('C:\\BrowserDrivers\\chromedriver.exe')
driver = webdriver.Chrome(service=s, options=options)
driver.get("https://www.worldometers.info/coronavirus/")
data = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#main_table_countries_today"))).get_attribute("outerHTML")
df = pd.read_html(data)
print(df)
driver.quit()
Console Output:
[ # Country,Other TotalCases NewCases ... Deaths/1M pop TotalTests Tests/ 1M pop Population
0 NaN World 264359298 632349.0 ... 673.3 NaN NaN NaN
1 1.0 USA 49662381 89259.0 ... 2415.0 756671013.0 2267182.0 3.337495e+08
2 2.0 India 34609741 3200.0 ... 336.0 643510926.0 459914.0 1.399198e+09
3 3.0 Brazil 22118782 12910.0 ... 2865.0 63776166.0 297051.0 2.146975e+08
4 4.0 UK 10329074 53945.0 ... 2124.0 364875273.0 5335159.0 6.839070e+07
.. ... ... ... ... ... ... ... ... ...
221 221.0 Samoa 3 NaN ... NaN NaN NaN 2.002800e+05
222 222.0 Saint Helena 2 NaN ... NaN NaN NaN 6.103000e+03
223 223.0 Micronesia 1 NaN ... NaN NaN NaN 1.167290e+05
224 224.0 Tonga 1 NaN ... NaN NaN NaN 1.073890e+05
225 NaN Total: 264359298 632349.0 ... 673.3 NaN NaN NaN
[226 rows x 15 columns]]

Scraping table from Python Beautifulsoup

I tried to scrape table from this website: https://stockrow.com/VRTX/financials/income/quarterly
I am using Python Google Colab and I'd like to have the dates as columns. (e.g. 2020-06-30 etc) I used code to do something like this:
source = urllib.request.urlopen('https://stockrow.com/VRTX/financials/income/quarterly').read()
soup = bs.BeautifulSoup(source,'lxml')
table = soup.find_all('table')
However, I cannot get the tables. I am a bit new to scraping so I looked at other Stackoverflow pages but couldn't solve the problem. Can you please help me? That would be much appreciated.
You can use their API to load the data:
import requests
import pandas as pd
indicators_url = 'https://stockrow.com/api/indicators.json'
data_url = 'https://stockrow.com/api/companies/VRTX/financials.json?ticker=VRTX&dimension=Q&section=Income+Statement'
indicators = {i['id']: i for i in requests.get(indicators_url).json()}
all_data = []
for d in requests.get(data_url).json():
d['id'] = indicators[d['id']]['name']
all_data.append(d)
df = pd.DataFrame(all_data)
df.to_csv('data.csv')
print(df)
Prints:
id 2020-06-30 2020-03-31 2019-12-31 2019-09-30 2019-06-30 ... 2011-12-31 2011-09-30 2011-06-30 2011-03-31 2010-12-31 2010-09-30
0 Consolidated Net Income/Loss 837270000.0 602753000.0 583234100.0 57518000.0 267427000.0 ... 188141000.0 228452000.0 -199318000.0 -176096000.0 -180392000.0 -208957000.0
1 EPS (Basic, from Continuous Ops) 3.2248 2.3199 2.2654 0.2239 1.044 ... 0.9374 1.109 -0.9751 -0.8703 -0.8966 -1.0402
2 Net Profit Margin 0.5492 0.3978 0.4127 0.0606 0.2841 ... 0.2816 0.3354 -1.5213 -2.3906 -2.7531 -8.7816
3 Gross Profit 1339965000.0 1352610000.0 1228253000.0 817914000.0 805553000.0 ... 533213000.0 620794000.0 105118000.0 70996000.0 62475000.0 20567000.0
4 Income Tax Provision -12500000.0 54781000.0 93716000.0 13148000.0 59711000.0 ... 22660000.0 -27842000.0 24448000.0 0.0 NaN 0.0
5 Operating Income 718033000.0 720224100.0 551464400.0 99333000.0 269960000.0 ... 223901900.0 215707000.0 -165890000.0 -159899000.0 -166634000.0 -199588000.0
6 EBIT 718033000.0 720224100.0 551464700.0 99333000.0 269960000.0 ... 223901900.0 215707000.0 -165890000.0 -159899000.0 -166634000.0 -199588000.0
7 EPS (Diluted, from Cont. Ops) 3.1787 2.2874 2.2319 0.2208 1.0293 ... 1.0011 1.0415 -0.9751 -0.8703 -0.8966 -1.0402
8 EBITDA 744730000.0 747045000.0 577720400.0 125180000.0 297658000.0 ... 233625900.0 223457000.0 -157181000.0 -151041000.0 -158429000.0 -192830000.0
9 EPS (Basic, Consolidated) 3.2248 2.3199 2.2654 0.2239 1.044 ... 0.9374 1.109 -0.9751 -0.8703 -0.8966 -1.0402
10 EBT 824770000.0 657534000.0 676950000.0 70666000.0 327138000.0 ... 210801000.0 200610000.0 -174870000.0 -176096000.0 -180392000.0 -208957000.0
11 Operating Cash Flow Margin 0.6812 0.5384 0.3156 0.3525 0.4927 ... 0.8941 0.0651 -1.8894 -2.5336 -2.535 -6.8918
12 EBT margin 0.541 0.434 0.479 0.0744 0.3475 ... 0.3742 0.3043 -1.5283 -2.3906 -2.7531 -8.7816
13 EBIT Margin 0.471 0.4754 0.3902 0.1046 0.2868 ... 0.3975 0.3272 -1.4498 -2.1707 -2.5431 -8.3878
14 Income from Continuous Operations 837270000.0 602753000.0 583234000.0 57518000.0 267427000.0 ... 188141000.0 228452000.0 -199318000.0 -176096000.0 -180392000.0 -208957000.0
15 R&D Expenses 420928000.0 448528000.0 480011000.0 555948000.0 379091000.0 ... 186438000.0 189052000.0 173604000.0 158612000.0 168888000.0 170434000.0
16 Non-operating Interest Expenses 13871000.0 14136000.0 14249000.0 14548000.0 14837000.0 ... 11659000.0 7059000.0 6962000.0 12001000.0 7686000.0 3951000.0
17 EBITDA Margin 0.4885 0.4931 0.4088 0.1318 0.3162 ... 0.4147 0.339 -1.3737 -2.0505 -2.4179 -8.1038
18 Non-operating Income/Expense 106737000.0 -62690000.0 125485000.0 -28667000.0 57178000.0 ... -13101000.0 -15097000.0 -8980000.0 -16197000.0 -13758000.0 -9369000.0
19 EPS (Basic) 3.22 2.32 2.26 0.22 1.04 ... 0.76 1.06 -0.85 -0.87 -0.9 -1.04
20 Gross Margin 0.879 0.8927 0.8691 0.8611 0.8558 ... 0.9465 0.9417 0.9187 0.9638 0.9535 0.8643
21 Revenue 1524485000.0 1515107000.0 1413265000.0 949828000.0 941293000.0 ... 563340000.0 659200000.0 114424000.0 73662000.0 65524000.0 23795000.0
22 Shares (Diluted, Average) 263403000.0 263515000.0 262108000.0 260473000.0 259822000.0 ... 217602000.0 219349000.0 204413000.0 202329000.0 201355000.0 200887000.0
23 Cost of Revenue 184520000.0 162497000.0 185012000.0 131914000.0 135740000.0 ... 30127000.0 38406000.0 9306000.0 2666000.0 3049000.0 3228000.0
24 SG&A Expenses 191804000.0 182258000.0 195277000.0 159674000.0 156502000.0 ... 121881000.0 110654000.0 96663000.0 71523000.0 62478000.0 48855000.0
25 EPS (Diluted, Consolidated) 3.1787 2.2874 2.2319 0.2208 1.0293 ... 1.0011 1.0415 -0.9751 -0.8703 -0.8966 -1.0402
26 Revenue Growth 0.6196 0.765 0.6242 0.2107 0.2515 ... 7.5975 26.7033 2.6185 2.2842 0.9335 -0.0466
27 Shares (Basic, Weighted) 259637000.0 259815000.0 256728000.0 256946000.0 256154000.0 ... 204891000.0 206002000.0 204413000.0 202329000.0 200402000.0 200887000.0
28 Income after Tax 837270000.0 602753000.0 583234000.0 57518000.0 267427000.0 ... 188141000.0 228452000.0 -199318000.0 -176096000.0 -180392000.0 -208957000.0
29 EPS (Diluted) 3.18 2.29 2.23 0.22 1.03 ... 0.74 1.02 -0.85 -0.87 -0.9 -1.04
30 Net Income Common 837270000.0 602753000.0 583234100.0 57518000.0 267427000.0 ... 158629000.0 221110000.0 -174069000.0 -176096000.0 -180392000.0 -208957000.0
31 Shares (Diluted, Weighted) 263403000.0 263515000.0 260673000.0 260473000.0 259822000.0 ... 208807000.0 219349000.0 204413000.0 202329000.0 200402000.0 200887000.0
32 Non-Controlling Interest NaN NaN NaN NaN NaN ... 29512000.0 7342000.0 -25249000.0 0.0 NaN 0.0
33 Dividends (Preferred) NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
34 EPS (Basic, from Discontinued Ops) NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
35 EPS (Diluted, from Disc. Ops) NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
36 Income from Discontinued Operations NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
[37 rows x 41 columns]
And saves data.csv:
Or donwload their XLSX from that page:
url = 'https://stockrow.com/api/companies/VRTX/financials.xlsx?dimension=Q&section=Income%20Statement&sort=desc'
df = pd.read_excel(url)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
print(df)
First problem is, that table is loaded via javascript and BeautifulSoup does not find it, because it's not loaded yet at the moment of parsing. To solve this problem you'll need to use selenium.
Second problem is, that there is no table tag in HTML, it uses grid formatting.
Since you're using Google Colab, you'll need to install there selenium web driver (code taken from this answer):
!pip install selenium
!apt-get update # to update ubuntu to correctly run apt install
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
wd = webdriver.Chrome('chromedriver',chrome_options=chrome_options)
After that you can load the page and parse it:
from bs4 import BeautifulSoup
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
# load page via selenium
wd.get("https://stockrow.com/VRTX/financials/income/quarterly")
# wait 5 seconds until element with class mainGrid will be loaded
grid = WebDriverWait(wd, 5).until(EC.presence_of_element_located((By.CLASS_NAME, 'mainGrid')))
# parse content of the grid
soup = BeautifulSoup(grid.get_attribute('innerHTML'), 'lxml')
# access grid cells, your logic should be here
for tag in soup.find_all('div', {'class': 'financials-value'}):
print(tag)

Selenium code can not catch the table from Chrome

I am using selenium to parse from
https://www.worldometers.info/coronavirus/
and doing as the following, I get attribute error and the table variable remains empty, what is the reason ?
I use Chrome 80. Are the tags right ?
AttributeError: 'NoneType' object has no attribute 'tbody'
from selenium import webdriver
import bs4
browser = webdriver.Chrome()
browser.get("https://www.worldometers.info/coronavirus/")
html = bs4.BeautifulSoup(browser.page_source, "html.parser")
table = html.find("table",class_="table table-bordered table-hover main_table_countries dataTable no-footer") #
Wherever I have table tags, I find it easier to use pandas to capture the table.
import pandas as pd
url = 'https://www.worldometers.info/coronavirus/'
table = pd.read_html(url)[0]
Output:
print(table)
Country,Other TotalCases ... Tot Cases/1M pop Tot Deaths/1M pop
0 China 81093 ... 56.00 2.0
1 Italy 63927 ... 1057.00 101.0
2 USA 43734 ... 132.00 2.0
3 Spain 35136 ... 751.00 49.0
4 Germany 29056 ... 347.00 1.0
.. ... ... ... ... ...
192 Somalia 1 ... 0.06 NaN
193 Syria 1 ... 0.06 NaN
194 Timor-Leste 1 ... 0.80 NaN
195 Turks and Caicos 1 ... 26.00 NaN
196 Total: 378782 ... 48.60 2.1
[197 rows x 10 columns]

Categories

Resources