Is there a way to web scrape a website with unchanging URLs? - python

I am trying to web-scrape a dynamic page using Selenium, BeautifulSoup, and Python, and I am able to scrape the first page. But when I try to get to the next page, the URL doesn't change, and when I inspect the request I am unable to see any Form Data either. Can someone help me?
import time
from selenium import webdriver
from parsel import Selector
from bs4 import BeautifulSoup
import random
import re
import csv
import requests
import pandas as pd

companies = []
overview = []
people = []

driver = webdriver.Chrome(executable_path=r'C:\Users\rahul\Downloads\chromedriver_win32 (1)\chromedriver.exe')
driver.get('https://coverager.com/data/companies/')
driver.maximize_window()

src = driver.page_source
soup = BeautifulSoup(src, 'lxml')
table = soup.find('tbody')

descrip = []
table_rows = table.find_all('tr')
for tr in table_rows:
    td = tr.find_all('td')
    #print(td)
    row = [i.text.strip() for i in td]
    descrip.append(row)
    #print(row)

#file = open('gag.csv','w')
#with file:
#    write = csv.writer(file)
#    write.writerows(descrip)

url = 'https://coverager.com'
a_tags = table.find_all('a', href=True)
for link in a_tags:
    ol = link.get('href')
    pl = link.string.strip()
    #companies.append(row)
    #print(pl)
    #print(ol)
    driver.get(url + ol)
    driver.implicitly_wait(1000)
    data1 = driver.find_element_by_class_name('tab-details').text
    overview.append(data1.strip())
    data2 = driver.find_element_by_link_text('People').click()
    p_tags = driver.find_element_by_class_name('tab-details').text
    people.append(p_tags)

In your case of https://coverager.com/data/companies/ it would be much easier to scrape the API call instead of the HTML on the page.
Open dev tools (on Chrome, right-click and hit Inspect) and go to the Network tab. When you hit the "next" button, a row should show up in the Network tab. Click on this row and then go to Preview. You should see the companies in this tab.
The API is accessing links which look like the following:
https://coverager.com/wp-json/ath/v1/coverager-data/companies?per_page=20&page=2&draw=4&column=3&dir=desc&filters=%7B%22companies%22:[],%22company_lob%22:[],%22industry%22:[],%22company_type%22:[],%22company_category%22:[],%22region%22:[],%22founded%22:[],%22company_stage%22:[],%22company_business_model%22:[]%7D
It seems like all the pages call the same API URL but change the page= and draw= parameters, which stay 2 apart.
So, simply use requests to call this class of links and loop through as many pages as you need! You could also change per_page to return as many companies as you need. You will have to test that, though.
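For example, here is a minimal sketch of paging through that API with requests, assuming the endpoint keeps behaving as observed; the JSON key holding the rows is a guess, so check the Preview tab and adjust:
import requests

base = 'https://coverager.com/wp-json/ath/v1/coverager-data/companies'
# The filters blob is passed through verbatim from the observed request.
filters = ('{"companies":[],"company_lob":[],"industry":[],"company_type":[],'
           '"company_category":[],"region":[],"founded":[],"company_stage":[],'
           '"company_business_model":[]}')

rows = []
for page in range(1, 6):  # first five pages
    params = {
        'per_page': 20,
        'page': page,
        'draw': page + 2,  # draw appears to stay 2 ahead of page
        'column': 3,
        'dir': 'desc',
        'filters': filters,
    }
    resp = requests.get(base, params=params)
    resp.raise_for_status()
    rows.extend(resp.json().get('data', []))  # 'data' key is an assumption

print(len(rows))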

Related

Get data from table in beautiful soup

I am trying to retrieve the 'Shares Outstanding' of a stock via this page:
https://www.sec.gov/cgi-bin/viewer?action=view&cik=320193&accession_number=0000320193-20-000052&xbrl_type=v#
(Click on 'Financial Statements' - 'Condensed Consolidated Balance Sheets (Unaudited) (Parenthetical)'.)
The data is at the bottom of the table in the left row. I am using Beautiful Soup, but I am having issues with retrieving the share count.
The code I am using:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.sec.gov/cgi-bin/viewer?action=view&cik=320193&accession_number=0000320193-20-000052&xbrl_type=v#'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

rows = soup.find_all('tr')
for row in rows:
    document = row.find('a', string='Common stock, shares outstanding (in shares)')
    shares = row.find('td', class_='nump')
    if None in (document, shares):
        continue
    print(document)
    print(shares)
This returns nothing, but the desired output is 4,323,987,000.
Can someone help me to retrieve this data?
Thanks!
That's a JS-rendered page. Use Selenium:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time
# import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/cgi-bin/viewer?action=view&cik=320193&accession_number=0000320193-20-000052&xbrl_type=v#'
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.set_window_size(1024, 600)
driver.maximize_window()
driver.get(url)
time.sleep(10)  # <--- waits for 10 seconds so that the page can get rendered
# action = webdriver.ActionChains(driver)
# print(driver.page_source)  # <--- this will give you the source code

soup = BeautifulSoup(driver.page_source, 'html.parser')
rows = soup.find_all('tr')
for row in rows:
    shares = row.find('td', class_='nump')
    if shares:
        print(shares)
<td class="nump">4,334,335<span></span>
</td>
<td class="nump">4,334,335<span></span>
</td>
Better, use:
shares = soup.find('td', class_='nump')
if shares:
    print(shares.text.strip())
4,334,335
Ah, the joys of scraping EDGAR filings :(...
You're not getting your expected output because you're looking in the wrong place. The URL you have is an iXBRL viewer. The data comes from here:
url = 'https://www.sec.gov/Archives/edgar/data/320193/000032019320000052/R1.htm'
You can either find that URL by looking at the Network tab in the developer tools, or you can simply translate the viewer URL into this URL: for example, the 320193 figure is the CIK number, the accession number (without dashes) becomes the directory name, etc.
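A hedged sketch of that translation, assuming the data sits in R1.htm (which report number holds the balance sheet varies by filing, so verify in the Network tab):
from urllib.parse import urlparse, parse_qs

viewer = ('https://www.sec.gov/cgi-bin/viewer?action=view&cik=320193'
          '&accession_number=0000320193-20-000052&xbrl_type=v')
qs = parse_qs(urlparse(viewer).query)
cik = qs['cik'][0]
accession = qs['accession_number'][0].replace('-', '')  # directory drops the dashes
report_url = f'https://www.sec.gov/Archives/edgar/data/{cik}/{accession}/R1.htm'
print(report_url)  # -> the Archives URL shown above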
Once you figure that out, the rest is simple:
import requests
from bs4 import BeautifulSoup as bs

req = requests.get(url)
soup = bs(req.text, 'lxml')
soup.select_one('.nump').text.strip()
Output:
'4,334,335'
Edit:
To search by "Shares Outstanding", try:
targets = soup.select('tr.ro')
for target in targets:
    targ = target.select('td.pl')
    for t in targ:
        if "Shares Outstanding" in t.text:
            print(target.select_one('td.nump').text.strip())
And might as well throw this one in: another, different way to do it is to use XPath instead, using the lxml library:
import lxml.html as lh

doc = lh.fromstring(req.text)
doc.xpath('//tr[@class="ro"]//td[@class="pl "][contains(.//text(),"Shares Outstanding")]/following-sibling::td[@class="nump"]/text()')[0]

How do you reference a specific ID while web scraping in Python?

I am trying to web scrape this site in order to get basic stock information: https://www.macrotrends.net/stocks/charts/AAPL/apple/financial-ratios
My code is as follows:
from requests import get
from bs4 import BeautifulSoup as bs
url = 'https://www.macrotrends.net/stocks/charts/AAPL/apple/financial-ratios'
response = get(url)
html_soup = bs(response.text, 'html.parser')
stock_container = html_soup.find_all("div", attrs= {'id': 'row0jqxgrid'})
print(len(stock_container))
Right now I am taking it slow and just trying to return the number of divs under the id "row0jqxgrid". I am pretty sure everything up to line 8 is fine, but I don't know how to properly reference the id using attrs, or if that's even possible.
Can anybody provide any information?
Ross
You can use selenium for this job:
from selenium import webdriver
import os
# define path to chrome driver
chrome_driver = os.path.abspath(os.path.dirname(__file__)) + '/chromedriver'
browser = webdriver.Chrome(chrome_driver)
browser.get("https://www.macrotrends.net/stocks/charts/AAPL/apple/financial-ratios")

# get row element
row = browser.find_element_by_xpath('//*[@id="row0jqxgrid"]')
# find all divs currently displayed
divs_list = row.find_elements_by_tag_name('div')

# get text from cells
for item in divs_list:
    print(item.text)
Output:
Output text is doubled because the table data are loaded dynamically as you move the bottom scrollbar to the right.
Current Ratio
Current Ratio
1.5401
1.5401
1.1329
1.1329
1.2761
1.2761
1.3527
1.3527
1.1088
1.1088
1.0801
1.0801
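Given that doubling, a minimal sketch of collapsing the repeats, assuming duplicates always arrive back-to-back as in the output above:
texts = [item.text for item in divs_list if item.text]
# drop an entry when it repeats the one immediately before it
deduped = [t for i, t in enumerate(texts) if i == 0 or t != texts[i - 1]]
print(deduped)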

Extracting text from span

I have a problem regarding a span tag that has no id or class.
The larger goal is to extract the text between "ITEM 1. BUSINESS" and "ITEM 1A. RISK FACTORS" from the link below. However, I can't figure out a way to find this part, because the span it is in has neither an id nor a class I can search for (only the parent div the span is in: div = soup.find("div", {"id": "dynamic-xbrl-form"})).
This code does not work, sadly: #text = unicodedata.normalize('NFKD', soup.get_text()).replace('\n', '')
Here is my approach:
import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/ix?doc=/Archives/edgar/data/934549/000093454919000017/actg2018123110-k.htm#s62CF0831C63E51C2BEF33F4163F1DE65'
raw = requests.get(url)
soup = BeautifulSoup(raw.content)
div = soup.find("span", {"id": ... })
print(div.text)
Do you have any ideas or hints?
Thanks a lot
Julius
As @Gagan said, the content of the website is loaded from JavaScript. You need to use Selenium.
Selenium is more powerful here than plain requests. I used ChromeDriver, so if you haven't installed it yet, you can get it at
http://chromedriver.chromium.org/
from selenium import webdriver

driver_path = r'your driver path'
browser = webdriver.Chrome(executable_path=driver_path)
browser.get("https://www.sec.gov/ix?doc=/Archives/edgar/data/934549/000093454919000017/actg2018123110-k.htm#s62CF0831C63E51C2BEF33F4163F1DE65")
datas = browser.find_elements_by_css_selector("span")  # use # or . to select by id or class, e.g. span#id_name, span.class_name
for spans in datas:
    print(spans.text)
You can also get the full page source:
print (browser.page_source)
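From there, a hedged sketch of cutting out the section the question asks about; the exact spacing and casing of the headings in the filing are assumptions, so adjust the pattern after inspecting the real text:
import re

full_text = ' '.join(span.text for span in datas)
# lazily match everything between the two headings (heading format assumed)
pattern = r'ITEM\s+1\.\s*BUSINESS(.*?)ITEM\s+1A\.\s*RISK\s+FACTORS'
match = re.search(pattern, full_text, flags=re.S | re.I)
if match:
    print(match.group(1).strip()[:500])  # first 500 characters of the section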
The content of this page is loaded from JavaScript; you cannot use BeautifulSoup alone for this. Make use of Selenium for this purpose.
In my case, I am searching by the id of the span tag; this solved it for me:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.facebook.com/hackerv728'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
titles = soup.find_all('span', id='fb-timeline-cover-name')
for title in titles:
    print(title.text.strip())

Extract data from BSE website

How can I extract the values of Security ID, Security Code, Group / Index, Wtd.Avg Price, Trade Date, Quantity Traded, and % of Deliverable Quantity to Traded Quantity using Python 3 and save them to an XLS file? Below is the link.
https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/
PS: I am completely new to Python. I know there are a few libs which make scraping easier, like BeautifulSoup, selenium, requests, lxml, etc. I don't have much idea about them.
Edit 1:
I tried something:
from bs4 import BeautifulSoup
import requests
URL = 'https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/'
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
table = soup.find('div', attrs = {'id':'newheaddivgrey'})
print(table)
Its output is None. I was expecting all tables in the webpage and filter them further to get required data.
import requests
import lxml.html
URL = 'https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/'
r = requests.get(URL)
root = lxml.html.fromstring(r.content)
title = root.xpath('//*[@id="SecuritywiseDeliveryPosition"]/table/tbody/tr/td/table/tbody/tr[1]/td')
print(title)
Tried another approach. Same problem.
Edit 2:
Tried selenium, but I am not getting the table contents.
from selenium import webdriver
driver = webdriver.Chrome(r"C:\Program Files\JetBrains\PyCharm Community Edition 2017.3.3\bin\chromedriver.exe")
driver.get('https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/')
table = driver.find_elements_by_xpath('//*[@id="SecuritywiseDeliveryPosition"]/table/tbody/tr/td/table/tbody/tr[1]/td')
print(table)
driver.quit()
Output is [<selenium.webdriver.remote.webelement.WebElement (session="befdd4f01e6152942c9cfc7c563a6bf2", element="0.13124528538297953-1")>]
After loading the page with Selenium, you can get the JavaScript-modified page source using driver.page_source. You can then pass this page source into a BeautifulSoup object.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/')
html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, 'lxml')
table = soup.find('div', id='SecuritywiseDeliveryPosition')
This code will give you the Securitywise Delivery Position table in the table variable, and the soup object contains the full page source, including the elements that were dynamically added. You can then parse either one to get the different values you mentioned.
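Since the question asks for an XLS file, here is a minimal sketch of flattening that table and writing it out with pandas; the row/cell structure inside the div is an assumption, so inspect the real markup and adjust the selectors:
import pandas as pd

rows = []
for tr in table.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:
        rows.append(cells)

df = pd.DataFrame(rows)
# writing .xlsx requires openpyxl to be installed
df.to_excel('delivery_position.xlsx', index=False, header=False)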

Count Images on Amazon Product Detail Page Python

I'm new to coding with Python, so please bear with me. I'm trying to find the number of product images a product has on Amazon.
1. I can't seem to get it to work correctly?
2. Is there a way to insert a list of ASINs so they can all print out with their counts?
Thanks!
import bs4
import webbrowser
import requests
File = requests.get('https://www.amazon.com/dp/B01MRXQPJ5')
soup = bs4.BeautifulSoup(File.text, 'html.parser' )
elems = soup.select('ul.a-unordered-list a-nostyle a-button-list a-vertical a-spacing-top-micro > li ')
Since Amazon renders its page using JavaScript, the content is generated client-side instead of server-side.
When you use requests you only get the server-side content. To get the content generated client-side, you must use selenium or dryscrape, for example.
Here's working code that will count the number of images for a list of product IDs.
Code:
import selenium.webdriver as webdriver
import lxml.html.clean as clean
from bs4 import BeautifulSoup

urls = ['B017TSPK5K', 'B00B96KLCQ', 'B01MZ9E6CG']
browser = webdriver.Chrome()
for url in urls:
    amazon_url = "https://www.amazon.com/dp/{}".format(url)
    browser.get(amazon_url)
    content = browser.page_source
    # strip scripts and other unsafe markup before parsing
    cleaner = clean.Cleaner()
    content = cleaner.clean_html(content)
    soup = BeautifulSoup(content, 'html.parser')
    soup_li = soup.find_all('li', {'class': 'a-spacing-small item a-declarative'})
    print("Product ID: {} has {} images.".format(url, len(soup_li)))
browser.close()
Output:
'Product ID: B017TSPK5K has 2 images.'
'Product ID: B00B96KLCQ has 5 images.'
'Product ID: B01MZ9E6CG has 3 images.'
