Get data from table in Beautiful Soup - Python

I am trying to retrieve the 'Shares Outstanding' of a stock via this page:
https://www.sec.gov/cgi-bin/viewer?action=view&cik=320193&accession_number=0000320193-20-000052&xbrl_type=v#
(Click on 'Financial Statements' - 'Condensed Consolidated Balance Sheets (Unaudited) (Parenthetical)')
The data is in the bottom row of the table. I am using Beautiful Soup, but I am having issues retrieving the share count.
The code I am using:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.sec.gov/cgi-bin/viewer?action=view&cik=320193&accession_number=0000320193-20-000052&xbrl_type=v#'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
rows = soup.find_all('tr')
for row in rows:
    document = row.find('a', string='Common stock, shares outstanding (in shares)')
    shares = row.find('td', class_='nump')
    if None in (document, shares):
        continue
    print(document)
    print(shares)
This returns nothing, but the desired output is 4,323,987,000. Can someone help me retrieve this data?
Thanks!

That's a JS-rendered page. Use Selenium:
import time
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/cgi-bin/viewer?action=view&cik=320193&accession_number=0000320193-20-000052&xbrl_type=v#'
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
driver.get(url)
time.sleep(10)  # <--- waits 10 seconds so the page can render
# print(driver.page_source)  # <--- this will give you the rendered source code
soup = BeautifulSoup(driver.page_source, 'html.parser')
rows = soup.find_all('tr')
for row in rows:
    shares = row.find('td', class_='nump')
    if shares:
        print(shares)
Output:
<td class="nump">4,334,335<span></span>
</td>
<td class="nump">4,334,335<span></span>
</td>
Better, use:
shares = soup.find('td', class_='nump')
if shares:
    print(shares.text.strip())
Output:
4,334,335

Ah, the joys of scraping EDGAR filings :(...
You're not getting your expected output because you're looking in the wrong place. The URL you have is an iXBRL viewer. The data comes from here:
url = 'https://www.sec.gov/Archives/edgar/data/320193/000032019320000052/R1.htm'
You can either find that URL by looking at the Network tab in the developer tools, or you can simply translate the viewer URL into this one: for example, the 320193 figure is the CIK number, and the accession number with its dashes removed forms the directory name.
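As a sketch of that translation (hedged: R1.htm is just the first rendered report in the filing, and the report number that holds the balance sheet can vary from filing to filing):
cik = '320193'
accession = '0000320193-20-000052'
# Archive directory = CIK (no leading zeros) + accession number with dashes removed
archive = f'https://www.sec.gov/Archives/edgar/data/{int(cik)}/{accession.replace("-", "")}'
url = f'{archive}/R1.htm'
print(url)  # https://www.sec.gov/Archives/edgar/data/320193/000032019320000052/R1.htm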
Once you figure that out, the rest is simple:
import requests
from bs4 import BeautifulSoup as bs

req = requests.get(url)
soup = bs(req.text, 'lxml')
soup.select_one('.nump').text.strip()
Output:
'4,334,335'
Edit:
To search by "Shares Outstanding", try:
targets = soup.select('tr.ro')
for target in targets:
    targ = target.select('td.pl')
    for t in targ:
        if "Shares Outstanding" in t.text:
            print(target.select_one('td.nump').text.strip())
And might as well throw this one in: another way to do it is to use XPath instead, via the lxml library:
import lxml.html as lh

doc = lh.fromstring(req.text)
doc.xpath('//tr[@class="ro"]//td[@class="pl "][contains(.//text(),"Shares Outstanding")]/following-sibling::td[@class="nump"]/text()')[0]

Related

I'm Trying to Scrape the Duration of TikTok Videos but I Am Getting 'None'

I want to scrape the duration of TikTok videos for an upcoming project, but my code isn't working:
import requests
from bs4 import BeautifulSoup
content = requests.get('https://vm.tiktok.com/ZMFFKmx3K/').text
soup = BeautifulSoup(content, 'lxml')
data = soup.find('div', class_="tiktok-1g3unbt-DivSeekBarTimeContainer e123m2eu1")
print(data)
Using an example TikTok, I would think this would work. Could anyone help?
If you turn off JavaScript and check the element selection in Chrome DevTools, you will see that the value is something like 00/000; when you turn JS back on and the video is playing, the duration counts up until the video finishes. So the real duration value of that element depends on JS, and you have to use an automation tool such as Selenium to grab that dynamic value. How much of the duration you scrape depends on time.sleep() if you are using Selenium; if time.sleep() runs longer than the video length, the element can come back empty and you will hit a None/TypeError.
Example:
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

webdriver_service = Service("./chromedriver")  # your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)
url = 'https://vm.tiktok.com/ZMFFKmx3K/'
driver.get(url)
driver.maximize_window()
time.sleep(25)
soup = BeautifulSoup(driver.page_source, "lxml")
data = soup.find('div', class_="tiktok-1g3unbt-DivSeekBarTimeContainer e123m2eu1")
print(data.text)
Output:
00:25/00:28
The ID suffix in the class name is likely randomized. Try using a regex to get the element by a class containing 'TimeContainer' plus some other ID:
import requests
from bs4 import BeautifulSoup
import re
content = requests.get('https://vm.tiktok.com/ZMFFKmx3K/').text
soup = BeautifulSoup(content, 'lxml')
data = soup.find('div', {'class': re.compile(r'TimeContainer.*$')})
print(data)
Your next issue is that the page loads before the video, so you'll get 0/0 for the time. Try Selenium instead so you can add timed waits for loading.
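If you'd rather not guess a fixed sleep, an explicit wait is a more reliable variant. A minimal sketch, assuming the 'TimeContainer' substring from the class name above still appears in the rendered markup (TikTok's class names look generated and may change):
import re
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://vm.tiktok.com/ZMFFKmx3K/')
# Wait up to 30 seconds until some element with 'TimeContainer' in its class renders
WebDriverWait(driver, 30).until(
    lambda d: d.find_elements(By.CSS_SELECTOR, "div[class*='TimeContainer']"))
soup = BeautifulSoup(driver.page_source, 'lxml')
data = soup.find('div', {'class': re.compile(r'TimeContainer')})
print(data.text if data else None)
driver.quit()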

Python BS4 find_all replaces text inside the tag with <!--empty-->

I am trying to scrape mortgage rates from https://www.td.com/ca/en/personal-banking/products/mortgages/mortgage-rates/
When I use find_all to get the value from a cell in a specific table, the returned value is "<!--empty-->" instead of the text within that cell.
The actual html for that cell is:
<span class="h2 ng-binding ng-isolate-scope" code="a.reslrates.MTGF036C" high-ratio="false" resl-rate="" type="S">2.54%</span>
The result which is returned is:
<span class="h2" code="a.reslrates.MTGF036C" high-ratio="false" resl-rate="" type="S"><!--empty--></span>
Instead of the 2.54% rate text, I get the <!--empty--> result. Am I missing something here? Full code below:
import requests
from bs4 import BeautifulSoup

html_text = requests.get("https://www.td.com/ca/en/personal-banking/products/mortgages/mortgage-rates/").text
soup = BeautifulSoup(html_text, "html.parser")
# Get the table
table = soup.find("div", class_="td-rates-table rates-bg-row1").table
rows = table.tbody.find_all("tr")
for row in rows:
    for rate in row.find_all("td"):
        print(rate)
I appreciate all responses! Thanks a lot!
Using Selenium. Please install the necessary dependencies and execute the script.
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

driver = webdriver.Chrome(ChromeDriverManager().install())
url = 'https://www.td.com/ca/en/personal-banking/products/mortgages/mortgage-rates/'
driver.get(url)
soup = BeautifulSoup(driver.page_source, "html.parser")
# Get the table
table = soup.find("div", class_="td-rates-table rates-bg-row1").table
rows = table.tbody.find_all("tr")
for row in rows:
    for rate in row.find_all("td"):
        print(rate.text)

Is there a way to web scrape a website with unchanging URLs?

I am trying to web scrape a dynamic page using Selenium, Beautiful Soup and Python, and am able to scrape the first page. But when I try to get to the next page, the URL doesn't change, and when I inspect the page, I am unable to see any Form Data either. Can someone help me?
import csv
from selenium import webdriver
from bs4 import BeautifulSoup

companies = []
overview = []
people = []
driver = webdriver.Chrome(executable_path=r'C:\Users\rahul\Downloads\chromedriver_win32 (1)\chromedriver.exe')
driver.get('https://coverager.com/data/companies/')
driver.maximize_window()
src = driver.page_source
soup = BeautifulSoup(src, 'lxml')
table = soup.find('tbody')
descrip = []
table_rows = table.find_all('tr')
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text.strip() for i in td]
    descrip.append(row)
# file = open('gag.csv', 'w')
# with file:
#     write = csv.writer(file)
#     write.writerows(descrip)
url = 'https://coverager.com'
a_tags = table.find_all('a', href=True)
for link in a_tags:
    ol = link.get('href')
    pl = link.string.strip()
    driver.get(url + ol)
    driver.implicitly_wait(1000)
    data1 = driver.find_element_by_class_name('tab-details').text
    overview.append(data1.strip())
    driver.find_element_by_link_text('People').click()
    p_tags = driver.find_element_by_class_name('tab-details').text
    people.append(p_tags)
In your case of https://coverager.com/data/companies/ it would be much easier to scrape the API call instead of the HTML on the page.
Open DevTools (in Chrome, right-click and hit Inspect) and go to the Network tab. When you hit the "next" button, a row should show up in the Network tab. Click on this row and then go to Preview; you should see the companies in this tab.
The API is accessing links which look like the following:
https://coverager.com/wp-json/ath/v1/coverager-data/companies?per_page=20&page=2&draw=4&column=3&dir=desc&filters=%7B%22companies%22:[],%22company_lob%22:[],%22industry%22:[],%22company_type%22:[],%22company_category%22:[],%22region%22:[],%22founded%22:[],%22company_stage%22:[],%22company_business_model%22:[]%7D
It seems like all the pages call the same API URL but change the page= and draw= parameters, which are 2 apart.
So, simply use requests to call this family of links and loop through as many pages as you need, as sketched below. You could also change per_page to return as many companies as you need per call. You will have to test that, though.
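A rough sketch of that loop (the page=/draw= relationship and the response shape are inferred from the Network tab, so verify them in the Preview pane first; the filters parameter is simplified here to an empty JSON object):
import requests

base = ('https://coverager.com/wp-json/ath/v1/coverager-data/companies'
        '?per_page=20&page={page}&draw={draw}&column=3&dir=desc&filters=%7B%7D')
rows = []
for page in range(1, 6):  # first five pages; adjust as needed
    r = requests.get(base.format(page=page, draw=page + 2))
    rows.extend(r.json().get('data', []))  # the 'data' key is an assumption; check the preview
print(len(rows))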

Using Python to Scrape Sky Cinema List

I'd like to gather a list of films, and their links, for all available movies on the Sky Cinema website.
The website is:
http://www.sky.com/tv/channel/skycinema/find-a-movie#/search?genre=all&window=skyCinema&certificate=all&offset=0&scrollPosition=200
I am using Python 3.6 and Beautiful Soup.
I am having problems finding the title and link, especially as there are several pages to click through, possibly based on the scroll position in the URL. I've tried using BS4 and Python but there is no output; the code I have tried would only return the title. I'd like the title and the link to the film, but as these are in different areas on the site, I am unsure how this is done.
Code I have tried:
from bs4 import BeautifulSoup
import requests
link = "http://www.sky.com/tv/channel/skycinema/find-a-movie#/search?genre=all&window=skyCinema&certificate=all&offset=0&scrollPosition=200"
r = requests.get(link)
page = BeautifulSoup(r.content, "html.parser")
for dd in page.find_all("div", {"class": "sentence-result-infos"}):
    title = dd.find(class_="title ellipsis ng-binding").text.strip()
    print(title)
spans = page.find_all('span', {'class': 'title ellipsis ng-binding'})
for span in spans:
    print(span.text)
I'd like the output to show as the title, link.
EDIT:
I have just tried the following but get an error that "text" is not an attribute:
from bs4 import BeautifulSoup
from requests_html import HTMLSession
session = HTMLSession()
response = session.get('http://www.sky.com/tv/channel/skycinema/find-a-movie/search?genre=all&window=skyCinema&certificate=all&offset=0&scrollPosition=200')
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find('span', {'class': 'title ellipsis ng-binding'}).text.strip()
print(title)
There is an API to be found in the Network tab. You can get all results with one call if you set the limit to a number greater than the expected result count:
r = requests.get('http://www.sky.com/tv/api/search/movie?limit=10000&window=skyMovies').json()
Or use the number you can see on the page
import requests
import pandas as pd
base = 'http://www.sky.com/tv'
r = requests.get('http://www.sky.com/tv/api/search/movie?limit=1555&window=skyMovies').json()
data = [(item['title'], base + item['url']) for item in r['items']]
df = pd.DataFrame(data, columns = ['Title', 'Link'])
print(df)
First of all, read the terms and conditions of the site you are going to scrape.
Next, you need Selenium:
from selenium import webdriver
import bs4
# MODIFY the url with YOURS
url = "type the url to scrape here"
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
soup = bs4.BeautifulSoup(html, "html.parser")
baseurl = 'http://www.sky.com/'
titles = [n.text for n in soup.find_all('span', {'class':'title ellipsis ng-binding'})]
links = [baseurl+h['href'] for h in soup.find_all('a', {'class':'sentence-result-pod ng-isolate-scope'})]

Extract data from BSE website

How can I extract the values of Security ID, Security Code, Group / Index, Wtd.Avg Price, Trade Date, Quantity Traded, and % of Deliverable Quantity to Traded Quantity using Python 3, and save them to an XLS file? Below is the link.
https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/
PS: I am completely new to Python. I know there are a few libs which make scraping easier, like BeautifulSoup, Selenium, requests, lxml etc., but I don't have much idea about them.
Edit 1:
I tried something
from bs4 import BeautifulSoup
import requests
URL = 'https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/'
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
table = soup.find('div', attrs = {'id':'newheaddivgrey'})
print(table)
Its output is None. I was expecting all the tables on the webpage, which I could then filter further to get the required data.
import requests
import lxml.html
URL = 'https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/'
r = requests.get(URL)
root = lxml.html.fromstring(r.content)
title = root.xpath('//*[@id="SecuritywiseDeliveryPosition"]/table/tbody/tr/td/table/tbody/tr[1]/td')
print(title)
I tried other code; same problem.
Edit 2:
I tried Selenium, but I am not getting the table contents.
from selenium import webdriver
driver = webdriver.Chrome(r"C:\Program Files\JetBrains\PyCharm Community Edition 2017.3.3\bin\chromedriver.exe")
driver.get('https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/')
table = driver.find_elements_by_xpath('//*[@id="SecuritywiseDeliveryPosition"]/table/tbody/tr/td/table/tbody/tr[1]/td')
print(table)
driver.quit()
Output is [<selenium.webdriver.remote.webelement.WebElement (session="befdd4f01e6152942c9cfc7c563a6bf2", element="0.13124528538297953-1")>]
After loading the page with Selenium, you can get the JavaScript-modified page source using driver.page_source. You can then pass this page source to the BeautifulSoup object.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/')
html = driver.page_source
driver.quit()
soup = BeautifulSoup(html, 'lxml')
table = soup.find('div', id='SecuritywiseDeliveryPosition')
This code will give you the Securitywise Delivery Position table in the table variable. The soup object contains the full page source, including the elements that were added dynamically, so you can parse it to get all the values you mentioned.
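For example, to get from the table variable to the XLS file the question asks for, pandas can flatten the extracted HTML. A sketch, assuming the div contains a single well-formed table and that openpyxl is installed for the Excel write:
import pandas as pd

df = pd.read_html(str(table))[0]  # parse the extracted table HTML into a DataFrame
df.to_excel('bse_data.xlsx', index=False)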
