I am trying to scrape this dynamic website to get the course names and lecture times offered during a semester: https://www.utsc.utoronto.ca/registrar/timetable
The problem is that when you first enter the website no courses are displayed yet; only after selecting the "Session(s)" and clicking "Search for Courses" do the courses start to show up.
Here is where the problems start:
I cannot do
html = urlopen(url).read()
using urllib.request, as it only returns the HTML of the page before anything is displayed.
I did a quick search on how to scrape dynamic websites and ran across code like this:
import requests
url = 'https://www.utsc.utoronto.ca/registrar/timetable'
r = requests.get(url)
data = r.json()
print(data)
However, when I run this it raises "JSONDecodeError: Expecting value", and I have no idea why this occurs when it has worked on other dynamic websites.
I do not have to use Selenium or Beautiful Soup, so if there are better alternatives I will gladly try them. Also, I was wondering: when I call
html = urlopen(url).read()
what format is the returned HTML in? I want to know if I can just copy the changed HTML from inspecting the website after selecting the Session(s) and clicking search.
You can use this code to get the data you need:
import requests

url = "https://www.utsc.utoronto.ca/regoffice/timetable/view/api.php"

# for the winter session
payload = "coursecode=&sessions%5B%5D=20219&instructor=&courseTitle="
headers = {
    'content-type': 'application/x-www-form-urlencoded; charset=UTF-8'
}

response = requests.post(url, headers=headers, data=payload)
print(response.text)
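As a side note, that URL-encoded payload is just a set of form fields; decoding it with the standard library makes the request easier to tweak, and requests will happily re-encode a plain dict for you. A small sketch (the field meanings are assumptions read off the search form):

```python
from urllib.parse import parse_qsl

# The payload from the snippet above, decoded into ordinary form fields
payload = "coursecode=&sessions%5B%5D=20219&instructor=&courseTitle="
fields = dict(parse_qsl(payload, keep_blank_values=True))
print(fields)
# requests can send the dict directly: requests.post(url, data=fields)
```

This way you can swap in a different session code without hand-encoding %5B%5D yourself.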
from bs4 import BeautifulSoup
from selenium import webdriver

# render the page with the Chrome driver and return all the HTML on it
def render_page(url):
    driver = webdriver.Chrome(PATH)  # PATH is the path to your chromedriver executable
    driver.get(url)
    r = driver.page_source
    driver.quit()
    return r

def create_soup(html_text):
    soup = BeautifulSoup(html_text, 'lxml')
    return soup

You will need to use Selenium for this if the content is loaded dynamically. Create a BeautifulSoup object from the value returned by render_page() and see if you can manipulate the data there.
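To illustrate the "manipulate the data" step, here is how a soup could be queried once render_page() has returned the full HTML. The markup and class names below are made up for the sketch; the real timetable structure has to be checked in your browser's dev tools:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML of the kind the rendered timetable page might contain
html_text = """
<div class="course">
  <span class="code">CSCA08H3</span>
  <span class="time">LEC01 Mon 10:00-12:00</span>
</div>
"""
soup = BeautifulSoup(html_text, "html.parser")
course = soup.select_one("span.code").text
lecture = soup.select_one("span.time").text
print(course, lecture)
```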
Related
I want to scrape the Korean exchange rate from this website: http://www.smbs.biz/ExRate/StdExRate.jsp
The daily exchange rate is provided in a table, so I tried to scrape it using BeautifulSoup, but the response is empty.
Here is my attempt:
url = "http://www.smbs.biz/ExRate/StdExRate.jsp"
html = requests.get(url, verify=False).text
soup = BeautifulSoup(html, 'html.parser')
title = soup.select_one('#frm_SearchDate > div:nth-child(17) > table')
title.text
Result:
'\n일별 매매기준율\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'
Always, and first of all, take a look at your soup to see if all the expected ingredients are there.
The data is loaded via an XHR request, and the table is rendered dynamically by JavaScript. That is why you won't get the table with BeautifulSoup: it cannot be found in the response to your request.
There are options to get it anyway:
Check your browser's dev tools on the XHR tab to locate the API and pull part of the info from there.
Use Selenium to get driver.page_source with the whole table from the 'browser-like' rendered version of the website.
Example
import requests
from bs4 import BeautifulSoup
url = 'http://www.smbs.biz/ExRate/StdExRate_xml.jsp?arr_value=USD_2023-01-12_2023-02-03'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
{s.get('label'): s.get('value') for s in soup.select('set')}
Output
{'23.01.12': '1245.3',
'23.01.13': '1244.6',
'23.01.16': '1240.6',
'23.01.17': '1234',
'23.01.18': '1238.5',
'23.01.19': '1239.8',
'23.01.20': '1236',
'23.01.25': '1234.4',
'23.01.26': '1233.4',
'23.01.27': '1231.4',
'23.01.30': '1230.2',
'23.01.31': '1228.7',
'23.02.01': '1230.8',
'23.02.02': '1231.4',
'23.02.03': '1219.3'}
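Since the endpoint returns XML, the standard library can handle the parsing as well. A minimal offline sketch using a hand-written snippet in the shape the API appears to return:

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like the StdExRate_xml.jsp response
xml = ('<chart>'
       '<set label="23.01.12" value="1245.3"/>'
       '<set label="23.01.13" value="1244.6"/>'
       '</chart>')
root = ET.fromstring(xml)
rates = {s.get('label'): float(s.get('value')) for s in root.iter('set')}
print(rates)
```

This also converts the values to floats, which the string-based dict comprehension above leaves as text.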
Attempting to scrape the table at https://coronavirus.health.ny.gov/zip-code-vaccination-data
I've looked at Python BeautifulSoup - Scrape Web Content Inside Iframes and have gotten this far, but I don't know how to extract the information from the soup.
Any help is greatly appreciated.
import requests
from bs4 import BeautifulSoup
s = requests.Session()
r = s.get("https://coronavirus.health.ny.gov/zip-code-vaccination-data")
soup = BeautifulSoup(r.content, "html.parser")
iframe_src = '//static-assets.ny.gov/load_global_footer/ajax?iframe=true'
r = s.get(f"https:{iframe_src}")
soup = BeautifulSoup(r.content, "html.parser")
The website you're trying to scrape generates the <iframe> dynamically with JavaScript, so you need either something to automate browser actions, like Selenium or Puppeteer, or to assign the <iframe> URL to a variable, since it seems unlikely to change in the near future. Here is the URL of your <iframe>:
https://public.tableau.com/views/Vaccination_Rate_Public/NYSVaccinationRate?:embed=y&:showVizHome=n&:tabs=n&:toolbar=n&:device=desktop&showShareOptions=false&:apiID=host0#navType=1&navSrc=Parse
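If you do go the browser-automation route instead of hard-coding the URL, pulling the <iframe> src out of the rendered page is straightforward. The HTML below is a made-up stand-in for driver.page_source:

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the rendered page source
rendered = ('<html><body>'
            '<iframe src="https://public.tableau.com/views/'
            'Vaccination_Rate_Public/NYSVaccinationRate"></iframe>'
            '</body></html>')
soup = BeautifulSoup(rendered, "html.parser")
iframe_url = soup.find("iframe")["src"]
print(iframe_url)
```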
I am trying to get the links to the individual search results on a website (National Gallery of Art), but the link to the search doesn't load the search results. Here is how I try to do it:
url = 'https://www.nga.gov/collection-search-result.html?artist=C%C3%A9zanne%2C%20Paul'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
I can see that the links to the individual results should be found under soup.findAll('a'), but they do not appear; instead the last output is a link to an empty search result:
https://www.nga.gov/content/ngaweb/collection-search-result.html
How could I get a list of links, the first of which is the first search result (https://www.nga.gov/collection/art-object-page.52389.html), the second is the second search result (https://www.nga.gov/collection/art-object-page.52085.html) etc?
Actually, the data is generated from an API call's JSON response, which contains the desired list of links.
Code:
import requests

url = 'https://www.nga.gov/collection-search-result/jcr:content/parmain/facetcomponent/parList/collectionsearchresu.pageSize__30.pageNumber__1.json?artist=C%C3%A9zanne%2C%20Paul&_=1634762134895'
r = requests.get(url)
for item in r.json()['results']:
    url = item['url']
    abs_url = f'https://www.nga.gov{url}'
    print(abs_url)
Output:
https://www.nga.gov/content/ngaweb/collection/art-object-page.52389.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.52085.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46577.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46580.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46578.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.136014.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46576.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53120.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.54129.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.52165.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46575.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53122.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.93044.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.66405.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53119.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53121.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46579.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.66406.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.45866.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53123.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.45867.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.45986.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.45877.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.136025.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.74193.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.74192.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.66486.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76288.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76223.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76268.html
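One more observation: the pageSize__30.pageNumber__1 segment in the URL suggests the endpoint is paginated, so further results could presumably be fetched by stepping the page number. This is a guess based only on the URL pattern; a sketch of building the candidate URLs:

```python
# Hypothetical pagination sketch; the template mirrors the URL above
base = ('https://www.nga.gov/collection-search-result/jcr:content/parmain/'
        'facetcomponent/parList/collectionsearchresu.pageSize__30.'
        'pageNumber__{page}.json?artist=C%C3%A9zanne%2C%20Paul')
urls = [base.format(page=p) for p in range(1, 4)]
for u in urls:
    print(u)
```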
This seems to work for me:
from bs4 import BeautifulSoup
import requests
url = 'https://www.nga.gov/collection-search-result.html?artist=C%C3%A9zanne%2C%20Paul'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for a in soup.findAll('a'):
    print(a['href'])
It returns all of the a href links in the HTML.
For the links from the search results specifically, those are loaded via AJAX, so you would need something that renders the JavaScript, like headless Chrome. You can read about one way to implement this here, which fits your use case very closely: http://theautomatic.net/2019/01/19/scraping-data-from-javascript-webpage-python/
If you want to ask how to render JavaScript from Python and then parse the result, you would need to close this question and open a new one, as it is not scoped correctly as is.
I've been trying to fetch the links connected to different exhibitors from this webpage using a Python script, but I get nothing as a result, and no error either. The class name m-exhibitors-list__items__item__name__link I've used within my script is available in the page source, so it is not generated dynamically.
What change should I make within my script to get the links?
This is what I've tried:
from bs4 import BeautifulSoup
import requests
link = 'https://www.topdrawer.co.uk/exhibitors?page=1'
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0'
    response = s.get(link)
    soup = BeautifulSoup(response.text, "lxml")
    for item in soup.select("a.m-exhibitors-list__items__item__name__link"):
        print(item.get("href"))
One such link I'm after (the first one):
https://www.topdrawer.co.uk/exhibitors/alessi-1
#Life is complex is right: the site you are scraping is protected by the Incapsula service, which guards against web scraping and other attacks by checking whether the request headers come from a browser or from a robot. It is also likely the site has proprietary data, or it may be defending against other threats.
However, there is a way to achieve what you want using Selenium and BS4.
The following is a code snippet for your reference:
from bs4 import BeautifulSoup
from selenium import webdriver

link = 'https://www.topdrawer.co.uk/exhibitors?page=1'
CHROMEDRIVER_PATH = r"C:\Users\XYZ\Downloads\Chromedriver.exe"  # raw string, so "\U" is not parsed as an escape
wd = webdriver.Chrome(CHROMEDRIVER_PATH)
wd.get(link)
html_page = wd.page_source
soup = BeautifulSoup(html_page, "lxml")
results = soup.findAll("a", {"class": "m-exhibitors-list__items__item__name__link"})
# iterate over the list of anchor tags to get the href attribute
for item in results:
    print(item.get("href"))
wd.quit()
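A side note on the Windows path: in an ordinary string literal, \U begins a unicode escape, so a path like "C:\Users\..." is a syntax error in Python 3. A raw string (or forward slashes) avoids the problem:

```python
# "\U" starts a unicode escape in a normal string literal, so Windows
# paths should be written as raw strings (or with forward slashes)
path = r"C:\Users\XYZ\Downloads\Chromedriver.exe"
print(path)
```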
The site that you are attempting to scrape is protected with Incapsula.
import requests
from bs4 import BeautifulSoup
from pprint import pprint

target_url = 'https://www.topdrawer.co.uk/exhibitors?page=1'
http_headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder headers; Incapsula inspects these
response = requests.get(target_url,
                        headers=http_headers, allow_redirects=True, verify=True, timeout=30)
raw_html = response.text
soupParser = BeautifulSoup(raw_html, 'lxml')
pprint(soupParser.text)

Output:
('Request unsuccessful. Incapsula incident ID: '
 '438002260604590346-1456586369751453219')
Read through this: https://www.quora.com/How-can-I-scrape-content-with-Python-from-a-website-protected-by-Incapsula
and these: https://stackoverflow.com/search?q=Incapsula
I am trying to scrape this page with bs4, and I was wondering how I can scrape the EUR/USD rate, price change, and price %.
I am pretty new to this, so this is all I have so far:
import requests
from bs4 import BeautifulSoup
url = 'http://www.cnbc.com/pre-markets/'
source_code = requests.get(url).text
soup = BeautifulSoup(source_code, 'lxml')
for r in soup.find_all('td', {'class': 'first text'}):
    print(r)
The data you're looking for is probably loaded with JavaScript, and therefore you can't see it with bs4. But you can get it using a headless browser like PhantomJS, Selenium, or Splash. See also this answer: scraping dynamic updates of temperature sensor data from a website
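For what it's worth, the selector itself is fine; run against hand-written HTML of the sort the page contains after JavaScript executes, it matches as expected. The row below is an assumption for illustration only:

```python
from bs4 import BeautifulSoup

# Made-up row shaped like the rendered market table
html = ('<table><tr>'
        '<td class="first text">EUR/USD</td>'
        '<td class="change">-0.0021</td>'
        '</tr></table>')
soup = BeautifulSoup(html, "html.parser")
cell = soup.find('td', {'class': 'first text'})
print(cell.text)
```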