Beautiful Soup and Selenium cannot scrape website contents - python

So I am trying to scrape the contents of a webpage. Initially I tried to use BeautifulSoup, however I was unable to grab the contents because they are loaded in dynamically.
After reading around I tried to use Selenium based on people's suggestions, however after doing so I'm still unable to grab the contents. The scraped content is the same as with Beautiful Soup.
Is it just not possible to scrape the contents of this webpage? (ex: https://odb.org/TW/2021/08/11/accessible-to-all)
import datetime as d
import requests
from bs4 import BeautifulSoup as bs

# BeautifulSoup implementation
def devo_scrap():
    full_date = d.date.today()
    string_date = str(full_date)
    format_date = string_date[0:4] + '/' + string_date[5:7] + '/' + string_date[8:]
    url = "https://odb.org/" + format_date
    r = requests.get(url)
    soup = bs(r.content, 'lxml')
    return soup

print(devo_scrap())
So the above is the Beautiful Soup implementation. Does anyone have any suggestions? Is it just not possible to scrape? Thanks in advance.
(Updated with Selenium Implementation)
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import datetime as d

PATH = '<chrome driver path>'
driver = webdriver.Chrome(PATH)

full_date = d.date.today()
string_date = str(full_date)
format_date = string_date[0:4] + '/' + string_date[5:7] + '/' + string_date[8:]
url = "https://odb.org/" + format_date
content = driver.get(url)
print(content)
The content (HTML) grabbed with Selenium is the same as with BeautifulSoup.

You can simply do:
source = driver.page_source
to get the page source using Selenium, and then convert that source into BeautifulSoup as usual:
source = BeautifulSoup(source, "lxml")
Complete code with some improvements:
from selenium import webdriver
from datetime import datetime
import time
from bs4 import BeautifulSoup

now = datetime.today()
format_date = now.strftime("%Y/%m/%d")

driver = webdriver.<>(executable_path=r'<>')
url = "https://odb.org/" + format_date
driver.get(url)
time.sleep(10)  # to load the page completely

content = BeautifulSoup(driver.page_source, "lxml")
print(content)

# Title:
print(content.find("h1", class_="devo-title").text)
# Content:
print(content.find("article", class_="content").text)

Data comes from an API call which returns a list with a single dictionary. Some of the values in the dictionary are HTML, so you will need an HTML parser to parse out the info. You might choose to do this based on the associated keys. For now, for demoing the contents, I have used a simple test of whether the value is a string that starts with "<". You should consider whether something more robust is required.
from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://api.experience.odb.org/devotionals/v2?site_id=1&status=publish&country=TW&on=08-11-2021',
                 headers={'User-Agent': 'Mozilla/5.0'})
data = r.json()[0]

for k, v in data.items():
    print(k + ' :')
    if isinstance(v, str):
        if v.startswith('<'):
            # Value is an HTML fragment, so parse it and print the text only
            soup = bs(v, 'lxml')
            print(soup.get_text(' '))
        else:
            print(v)
    else:
        print(v)
    print()
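The question builds the odb.org URL from today's date; the same idea can be applied to this API endpoint. A small sketch, assuming the on parameter follows the MM-DD-YYYY pattern seen in the URL above:
from datetime import date
import requests

# Assumption: the 'on' parameter uses MM-DD-YYYY, as in the example URL above.
on = date.today().strftime('%m-%d-%Y')
api_url = 'https://api.experience.odb.org/devotionals/v2?site_id=1&status=publish&country=TW&on=' + on
r = requests.get(api_url, headers={'User-Agent': 'Mozilla/5.0'})
data = r.json()[0]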

Related

Parsing data scraped from Javascript rendered webpage with python

I am trying to use .find off of a soup variable, but when I go to the webpage and try to find the right class it returns None.
from bs4 import *
import time
import pandas as pd
import pickle
import html5lib
from requests_html import HTMLSession

s = HTMLSession()
url = "https://cryptoli.st/lists/fixed-supply"

def get_data(url):
    r = s.get(url)
    global soup
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def get_next_page(soup):
    page = soup.find('div', {'class': 'dataTables_paginate paging_simple_numbers'})
    return page

get_data(url)
print(get_next_page(soup))
The "page" variable returns "None" even though I pulled it from the website element inspector. I suspect it has something to do with the fact that the website is rendered with javascript but can't figure out why. If I take away the {'class' : ''datatables_paginate paging_simple_numbers'} and just try to find 'div' then it works and returns the first div tag so I don't know what else to do.
So you want to scrape dynamic page content. You can use Beautiful Soup with the Selenium webdriver. This answer is based on the explanation here: https://www.geeksforgeeks.org/scrape-content-from-dynamic-websites/
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

url = "https://cryptoli.st/lists/fixed-supply"
driver = webdriver.Chrome('./chromedriver')
driver.get(url)

# This is just to ensure that the page is loaded
time.sleep(5)

html = driver.page_source
# This renders the JS code and stores all
# of the information in static HTML code.

# Now, we can simply apply bs4 to the html variable
soup = BeautifulSoup(html, "html.parser")
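With the rendered HTML in soup, the original lookup from the question should now find the paginator:
# The lookup from the question, run against the rendered HTML this time.
page = soup.find('div', {'class': 'dataTables_paginate paging_simple_numbers'})
print(page)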

How to extract downloadable links inside table containing list of links, and scrape multiple pages

I would like to extract all .doc links when clicking into a specific title (i.e. an announcement in this case) inside the table.
I am able to extract the title, date, and all links at the first level for only one page, as per the code below:
from lxml import etree
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import sys
import pandas as pd
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
import requests

frame = []
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=chrome_options)

for page_number in range(1, 78):
    url = 'http://example.com/index{}.html'.format(page_number)
    driver.get(url)
    html = etree.HTML(driver.page_source)
    extract_announcements_list = html.xpath('//table[@id="14681"]/tbody/tr/td/table[@width="90%"][position()>=2 and position() <= (last())]')
    for i in extract_announcements_list:
        date = i.xpath('./tbody/tr/td[3]/text()')
        title = i.xpath('./tbody/tr/td[2]/font/a/@title')
        link = i.xpath('./tbody/tr/td[2]/font/a/@href')
        real_link = 'http://example.com' + link[0]
        print(title, date, real_link)
        frame.append({
            'title': title,
            'link': real_link,
            'date': date,
            'content': doc_link,  # this is the doc_link I want to extract in the second level
        })

dfs = pd.DataFrame(frame)
dfs.to_csv('myscraper.csv', index=False, encoding='utf-8-sig')
I have been trying for hours to find a solution for this. I would really appreciate it if someone could help me extract the second-level link to get the content for the .doc link ('content': doc_link), as well as a way to scrape all pages of the website.
Thank you very much in advance!
UPDATED: Many thanks to @Ares Zephyr for sharing your code. Here is what I have changed in my code as per the suggestion, but it did not yield any results for the inside links.
from lxml import etree
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import sys
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup
import requests

frame = []
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=chrome_options)

for page_number in range(1, 2):
    url = 'http://example.com/index{}.html'.format(page_number)
    print('Downloading page %s...' % url)
    driver.get(url)
    html = etree.HTML(driver.page_source)
    html_page = urllib.request.urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    extract_announcements_list = html.xpath('//table[@id="14681"]/tbody/tr/td/table[@width="90%"][position()>=2 and position() <= (last())]')
    for i in extract_announcements_list:
        date = i.xpath('./tbody/tr/td[3]/text()')
        title = i.xpath('./tbody/tr/td[2]/font/a/@title')
        link = i.xpath('./tbody/tr/td[2]/font/a/@href')
        real_link = 'http://example.com' + link[0]
        soup = BeautifulSoup(requests.get(real_link).content, 'html.parser')
        for doc_link in soup.findAll('a'):
            thelink = doc_link.get('href')
            frame.append({
                'title': title,
                'link': real_link,
                'date': date,
                'doclink': thelink,
            })

dfs = pd.DataFrame(frame)
dfs.to_csv('myscraper.csv', index=False, encoding='utf-8-sig')
You need to indent this piece in your code for the append function to work on all your scraped data. I believe that is what @arundeep-chohan is also trying to highlight.
frame.append({
    'title': announcement_title,
    'link': real_link,
    'date': announcement_date,
    'content': doc_link,  # this is the doc_link I want to extract in the second level
})
The logic to find the document files is as follows; please modify it and use it. This is part of my code that I use to download .pdf files.
for link in soup.findAll('a'):
    theLink = link.get('href')
    name = link.string
    # Logic to find .pdf files
    if theLink[-4:] == ".pdf":
        fileExtension = ".pdf"

BeautifulSoup, lxml not downloading time text

I'm trying to download data from a website. Everything is working fine except when it encounters dates; it will just return "". I've looked at the HTML downloaded into the program and it has nothing between the tags, which is why it's returning nothing. When you inspect the HTML online you can see it there clearly. Does anyone have any ideas?
from bs4 import BeautifulSoup
import requests

stocks = ["3PL"]
keys = list()
values = list()

for stock in stocks:
    source = requests.get(r"https://www.reuters.com/companies/" + stock + ".AX/key-metrics").text
    soup = BeautifulSoup(source, 'lxml')
    for data in soup.find_all("tr", class_="data"):
        keys.append(data.th.text)
        if data.td.text != "--":
            values.append(data.td.text)
        else:
            values.append("nan")

print(keys[3])
print(values[3])  # This should return the date
It would seem your data is added with JavaScript. This is something requests will not handle, as it won't render the page like a normal browser; it only fetches the raw data.
However, you can use the selenium package to do this successfully. To install it:
pip install selenium
You may need to set up web drivers to use Firefox or Chrome, but in the case below I used the browser that worked out of the box, being Safari.
I have adjusted your code a little to use the selenium package, and put your data in a dict for nicer consistency.
from bs4 import BeautifulSoup
from selenium import webdriver

stocks = ["3PL"]
response_data = {}
driver = webdriver.Safari()

for stock in stocks:
    url = r"https://www.reuters.com/companies/" + stock + ".AX/key-metrics"
    driver.get(url)
    source = driver.page_source
    soup = BeautifulSoup(source)
    for data in soup.find_all("tr", class_="data"):
        if data.td.text != "--":
            response_data[data.th.text] = data.td.text
        else:
            response_data[data.th.text] = 'nan'

driver.close()
Now you can check if the data is correctly downloaded:
print(response_data['Pricing date'])
Sep-04

Python, BS and Selenium

I am trying to web-scrape a dynamic JavaScript page with BS and Python, and I've read a lot of things to come up with this code, where I try to scrape a price rendered with JavaScript on a well-known website, for example:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://www.nespresso.com/fr/fr/order/capsules/original/"
browser = webdriver.PhantomJS(executable_path = "C:/phantomjs-2.1.1-windows/bin/phantomjs.exe")
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
soup.find("span", {'class':'ProductListElement__price'}).text
But the only result I get is '\xa0', which is the source value, not the JavaScript-rendered value, and I don't really know what I did wrong...
Best regards
You don't need the expense of a browser. The info is in a script tag, so you can regex that out and handle it with the json library:
import requests, re, json
r = requests.get('https://www.nespresso.com/fr/fr/order/capsules/original/')
p = re.compile(r'window\.ui\.push\((.*ProductList.*)\)')
data = json.loads(p.findall(r.text)[0])
products = {product['name']:product['price'] for product in data['configuration']['eCommerceData']['products']}
print(products)
Here are two ways to get the prices:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://www.nespresso.com/fr/fr/order/capsules/original/"
browser = webdriver.Chrome()
browser.get(url)
html = browser.page_source
# Getting the prices using bs4
soup = BeautifulSoup(html, 'lxml')
prices = soup.select('.ProductListElement__price')
print([p.text for p in prices])
# Getting the prices using selenium
prices = browser.find_elements_by_class_name("ProductListElement__price")
print([p.text for p in prices])

How to scrape src from img html in python

I'm trying to scrape the src of the img, but the code I found returns many img src values, just not the one I want. I can't figure out what I am doing wrong. I am scraping TripAdvisor at "https://www.tripadvisor.dk/Restaurant_Review-g189541-d15804886-Reviews-The_Pescatarian-Copenhagen_Zealand.html"
So this is the HTML snippet I'm trying to extract from:
<div class="restaurants-detail-overview-cards-LocationOverviewCard__cardColumn--2ALwF"><h6>Placering og kontaktoplysninger</h6><span><div><span data-test-target="staticMapSnapshot" class=""><img class="restaurants-detail-overview-cards-LocationOverviewCard__mapImage--22-Al" src="https://trip-raster.citymaps.io/staticmap?scale=1&zoom=15&size=347x137&language=da&center=55.687988,12.596316&markers=icon:http%3A%2F%2Fc1.tacdn.com%2F%2Fimg2%2Fmaps%2Ficons%2Fcomponent_map_pins_v1%2FR_Pin_Small.png|55.68799,12.596316"></span></div></span>
I want the code to return: (a sub-string from src)
55.68799,12.596316
I have tried:
import pandas as pd
pd.options.display.max_colwidth = 200
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
import re
web_url = "https://www.tripadvisor.dk/Restaurant_Review-g189541-d15804886-Reviews-The_Pescatarian-Copenhagen_Zealand.html"
url = urlopen(web_url)
url_html = url.read()
soup = bs(url_html, 'lxml')
soup.find_all('img')
for link in soup.find_all('img'):
    print(link.get('src'))
The output is along the lines of this, BUT NOT the src that I need:
https://static.tacdn.com/img2/branding/rebrand/TA_logo_secondary.svg
https://static.tacdn.com/img2/branding/rebrand/TA_logo_primary.svg
https://static.tacdn.com/img2/branding/rebrand/TA_logo_secondary.svg
data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
You can do this with just requests and re. Only the co-ordinates part of the src is the location-based variable.
import requests, re
p = re.compile(r'"coords":"(.*?)"')
r = requests.get('https://www.tripadvisor.dk/Restaurant_Review-g189541-d15804886-Reviews-The_Pescatarian-Copenhagen_Zealand.html')
coords = p.findall(r.text)[1]
src = f'https://trip-raster.citymaps.io/staticmap?scale=1&zoom=15&size=347x137&language=da&center={coords}&markers=icon:http://c1.tacdn.com//img2/maps/icons/component_map_pins_v1/R_Pin_Small.png|{coords}'
print(src)
print(coords)
Selenium is a workaround. I tested it and it works like a charm. Here you are:
from selenium import webdriver

driver = webdriver.Chrome('chromedriver.exe')
driver.get("https://www.tripadvisor.dk/Restaurant_Review-g189541-d15804886-Reviews-The_Pescatarian-Copenhagen_Zealand.html")
links = driver.find_elements_by_xpath("//*[@src]")
urls = []
for link in links:
    url = link.get_attribute('src')
    if '|' in url:
        urls.append(url.split('|')[1])  # saves in the list only the numbers you want, i.e. 55.68799,12.596316
        print(url)
print(urls)
Result of the above:
['55.68799,12.596316']
If you haven't used Selenium before, you can find a webdriver here:
https://chromedriver.storage.googleapis.com/index.html?path=2.46/
or here:
https://sites.google.com/a/chromium.org/chromedriver/downloads
