Scraping paginated data loaded with JavaScript - Python

I am trying to use Selenium and BeautifulSoup to scrape videos off a website. The videos are loaded when the 'videos' tab is clicked (via JS, I guess). Once the videos are loaded, there is also pagination, and the videos on each page are loaded on click (again via JS, I guess).
Here is how it looks, and what I get when I inspect the element (screenshots not included).
My issue is I can't seem to get all videos across all pages, I can only get the first page. Here is my code,
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as soup
import random
import time

chrome_options = webdriver.ChromeOptions()
prefs = {"profile.default_content_setting_values.notifications": 2}
chrome_options.add_experimental_option("prefs", prefs)
chrome_options.add_argument('--headless')

seconds = 5 + (random.random() * 5)

driver = webdriver.Chrome(chrome_options=chrome_options)
driver.implicitly_wait(30)
driver.get("https://")
time.sleep(seconds)
time.sleep(seconds)

for i in range(1):
    element = driver.find_element_by_id("tab-videos")
    driver.execute_script("arguments[0].click();", element)
    time.sleep(seconds)
    time.sleep(seconds)

    html = driver.page_source
    page_soup = soup(html, "html.parser")
    containers = page_soup.findAll("div", {"id": "tabVideos"})

    for videos in containers:
        main_videos = videos.find_all("div", {"class": "thumb-block tbm-init-ok"})
        print(main_videos)

driver.quit()
Please what am I missing here?

The content is loaded from the URL 'https://www.x***s.com/amateur-channels/ajibola_elizabeth/videos/best/{page}', where {page} starts at 0.
This script will print all video URLs:
import requests
from bs4 import BeautifulSoup

url = 'https://www.x***s.com/amateur-channels/ajibola_elizabeth/videos/best/{page}'

page = 0
while True:
    soup = BeautifulSoup(requests.get(url.format(page=page)).content, 'html.parser')

    for video in soup.select('div[id^="video_"] .title a'):
        u = video['href'].rsplit('/', maxsplit=2)
        print('https://www.x***s.com/video' + u[-2] + '/' + u[-1])

    next_page = soup.select_one('a.next-page')
    if not next_page:
        break

    page += 1
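If you would rather stay entirely inside Selenium (as in the original script), a rough sketch of the same idea is below: click the videos tab, re-read driver.page_source after every page change, and stop when the next-page control disappears. The selectors (tab-videos, a.next-page, the thumb-block class) are taken from the question and the answer above and may need adjusting to the real markup.

# Sketch only: Selenium-based pagination, assuming the selectors used in this thread.
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
driver.get("https://")   # real URL elided in the question

# open the videos tab via JavaScript, as in the question
tab = driver.find_element(By.ID, "tab-videos")
driver.execute_script("arguments[0].click();", tab)
time.sleep(5)

all_videos = []
while True:
    # re-parse the page source after every page change
    page_soup = BeautifulSoup(driver.page_source, "html.parser")
    all_videos.extend(page_soup.find_all("div", {"class": "thumb-block tbm-init-ok"}))

    # click the next-page control if it exists, otherwise stop
    next_links = driver.find_elements(By.CSS_SELECTOR, "a.next-page")
    if not next_links:
        break
    driver.execute_script("arguments[0].click();", next_links[0])
    time.sleep(5)

print(len(all_videos))
driver.quit()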

Related

How to get selenium to webscrape the 2nd page of results in a popup window

I am trying to webscrape several pages of results. The first page works fine, but when I switch to the next page, unfortunately, it just scrapes the first page of results again. The results don't return a new URL, so that approach doesn't work; instead, it's a window on top of the page opened from the URL. I also can't seem to figure out how to append the results of the first page to those of the second page; they come out as separate lists. Below is the code I have.
from selenium import webdriver
import time
import requests
from bs4 import BeautifulSoup
from selenium.webdriver.common.keys import Keys
#original webscraping code to get the names of locations from page 1
url = r'https://autochek.africa/en/ng/fix-your-car/service/scheduled-car-service'
driver = webdriver.Chrome()
driver.get(url)
xpath_get_locations = r'/html/body/div[1]/div/div[2]/div/div[1]/div/div[2]/div[2]/div/div/form/div[7]/div/label'
driver.find_element_by_xpath(xpath_get_locations).click()
soup = BeautifulSoup(driver.page_source, 'html.parser')
location_results = [i.text for i in soup.find_all('div', {'class': 'jsx-1642469937 state'})]
print(location_results)
time.sleep(3)
#finished page 1, finding the next button to go to page 2
xpath_find_next_button = r'/html/body/div[1]/div/div[2]/div/div[1]/div/div[2]/div[2]/div[2]/div[2]/div/div/div[3]/ul/li[13]'
driver.find_element_by_xpath(xpath_find_next_button).click()
#getting the locations from page 2
second_page_results = [i.text for i in soup.find_all('div', {'class': 'jsx-1642469937 state'})]
print(second_page_results)
time.sleep(2)
After loading a new page or running some JavaScript code on the page, you have to run
soup = BeautifulSoup(driver.page_source, 'html.parser')
again to work with the new HTML.
Or skip BeautifulSoup and do everything in Selenium.
Use find_elements_... (with the s in elements):
items = driver.find_elements_by_xpath('//div[@class="jsx-1642469937 state"]')
location_results = [i.text for i in items]
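For the BeautifulSoup route, the key point is simply re-parsing page_source after every click; a minimal sketch (my own, reusing the class name and next-button XPath from the rest of this answer) might look like this:

# Sketch: re-create the soup after every page change so it reflects the new HTML.
# Assumes `driver` is the already-opened Chrome driver from the code above.
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By

all_locations = []
while True:
    soup = BeautifulSoup(driver.page_source, 'html.parser')   # fresh parse for each page
    all_locations += [i.text for i in soup.find_all('div', {'class': 'jsx-1642469937 state'})]
    try:
        driver.find_element(By.XPATH, '//li[@class="next-li"]/a').click()
    except Exception:
        break   # no "next" button left: last page reached

print(all_locations)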
By the way:
(the XPath doesn't need the prefix r because it doesn't use \)
A shorter and more readable XPath:
#xpath_get_locations = r'/html/body/div[1]/div/div[2]/div/div[1]/div/div[2]/div[2]/div/div/form/div[7]/div/label'
xpath_get_locations = '//label[text()="Drop-off at Autochek location"]'
And it is simpler to use the Next > button instead of searching for buttons 2, 3, etc.
xpath_find_next_button = '//li[@class="next-li"]/a'
EDIT:
Full working code, which uses a while loop to visit all pages.
I added the module webdriver_manager, which automatically downloads a (fresh) driver for the browser.
I use find_element(By.XPATH, ...) because find_element_by_xpath(...) is deprecated.
from selenium import webdriver
from selenium.webdriver.common.by import By
#from selenium.webdriver.common.keys import Keys
#from selenium.webdriver.support.ui import WebDriverWait
#from selenium.webdriver.support import expected_conditions as EC
#from selenium.common.exceptions import NoSuchElementException, TimeoutException
import time
#from bs4 import BeautifulSoup
from webdriver_manager.chrome import ChromeDriverManager
#from webdriver_manager.firefox import GeckoDriverManager

driver = webdriver.Chrome(executable_path=ChromeDriverManager().install())
#driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())

# ---

url = 'https://autochek.africa/en/ng/fix-your-car/service/scheduled-car-service'
driver.get(url)

#xpath_get_locations = r'/html/body/div[1]/div/div[2]/div/div[1]/div/div[2]/div[2]/div/div/form/div[7]/div/label'
xpath_get_locations = '//label[text()="Drop-off at Autochek location"]'
driver.find_element(By.XPATH, xpath_get_locations).click()

# ---

all_locations = []

while True:
    # --- get locations on page

    time.sleep(1)  # sometimes JavaScript may need time to add new items (and you can't catch it with `WebDriverWait`)

    #soup = BeautifulSoup(driver.page_source, 'html.parser')
    #items = soup.find_all('div', {'class': 'jsx-1642469937 state'})
    items = driver.find_elements(By.XPATH, '//div[@class="jsx-1642469937 state"]')

    locations = [i.text for i in items]
    print(locations)
    print('-------')

    all_locations += locations

    # --- find button `next >` and try to click it

    #xpath_find_next_button = r'/html/body/div[1]/div/div[2]/div/div[1]/div/div[2]/div[2]/div[2]/div[2]/div/div/div[3]/ul/li[13]'
    xpath_find_next_button = '//li[@class="next-li"]/a'
    try:
        driver.find_element(By.XPATH, xpath_find_next_button).click()
    except:
        break  # exit loop

# ---

#driver.close()

How to wait for a page to fully load using requests_html

While accessing this link https://www.dickssportinggoods.com/f/tents-accessories?pageNumber=2 with requests_html, I need to wait some time before the page actually loads. Is that possible with this library?
My code:
from requests_html import HTMLSession
from bs4 import BeautifulSoup
from lxml import etree

s = HTMLSession()
response = s.get('https://www.dickssportinggoods.com/f/tents-accessories?pageNumber=2')
response.html.render()

soup = BeautifulSoup(response.content, "html.parser")
dom = etree.HTML(str(soup))
item = dom.xpath('//a[@class="rs_product_description d-block"]/text()')[0]
print(item)
It looks like the data you are looking for can be fetched using HTTP GET to
https://prod-catalog-product-api.dickssportinggoods.com/v2/search?searchVO=%7B%22selectedCategory%22%3A%2212301_1809051%22%2C%22selectedStore%22%3A%220%22%2C%22selectedSort%22%3A1%2C%22selectedFilters%22%3A%7B%7D%2C%22storeId%22%3A15108%2C%22pageNumber%22%3A2%2C%22pageSize%22%3A48%2C%22totalCount%22%3A112%2C%22searchTypes%22%3A%5B%22PINNING%22%5D%2C%22isFamilyPage%22%3Atrue%2C%22appliedSeoFilters%22%3Afalse%2C%22snbAudience%22%3A%22%22%2C%22zipcode%22%3A%22%22%7D
The call returns JSON, so you can use it directly with zero scraping code.
Copy/paste the URL into the browser to see the data.
You can specify the page number in the URL; the decoded searchVO parameter looks like this:
searchVO={"selectedCategory":"12301_1809051","selectedStore":"0","selectedSort":1,"selectedFilters":{},"storeId":15108,"pageNumber":2,"pageSize":48,"totalCount":112,"searchTypes":["PINNING"],"isFamilyPage":true,"appliedSeoFilters":false,"snbAudience":"","zipcode":""}
Working code below (note that pageNumber inside the encoded searchVO is the value that changes per page):
import requests
import pprint

page_num = 2
url = f'https://prod-catalog-product-api.dickssportinggoods.com/v2/search?searchVO=%7B%22selectedCategory%22%3A%2212301_1809051%22%2C%22selectedStore%22%3A%220%22%2C%22selectedSort%22%3A1%2C%22selectedFilters%22%3A%7B%7D%2C%22storeId%22%3A15108%2C%22pageNumber%22%3A{page_num}%2C%22pageSize%22%3A48%2C%22totalCount%22%3A112%2C%22searchTypes%22%3A%5B%22PINNING%22%5D%2C%22isFamilyPage%22%3Atrue%2C%22appliedSeoFilters%22%3Afalse%2C%22snbAudience%22%3A%22%22%2C%22zipcode%22%3A%22%22%7D'

r = requests.get(url)
if r.status_code == 200:
    pprint.pprint(r.json())
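A slightly cleaner variant (my own sketch, not from the original answer) is to build the searchVO JSON in Python and let requests do the URL encoding, so the page number isn't spliced into a pre-encoded string:

# Sketch: construct searchVO as a dict and let requests encode it.
import json
import requests

def fetch_page(page_num: int):
    search_vo = {
        "selectedCategory": "12301_1809051",
        "selectedStore": "0",
        "selectedSort": 1,
        "selectedFilters": {},
        "storeId": 15108,
        "pageNumber": page_num,
        "pageSize": 48,
        "totalCount": 112,
        "searchTypes": ["PINNING"],
        "isFamilyPage": True,
        "appliedSeoFilters": False,
        "snbAudience": "",
        "zipcode": "",
    }
    r = requests.get(
        "https://prod-catalog-product-api.dickssportinggoods.com/v2/search",
        params={"searchVO": json.dumps(search_vo, separators=(",", ":"))},
    )
    r.raise_for_status()
    return r.json()

print(fetch_page(2))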
You can use Selenium in headless mode as well.
Selenium has the capability to wait until elements are found, using explicit waits.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
options = webdriver.ChromeOptions()
options.add_argument('--window-size=1920,1080')
options.add_argument("--headless")
driver = webdriver.Chrome(executable_path = driver_path, options = options)
driver.get("URL here")
wait = WebDriverWait(driver, 20)
wait.until(EC.visibility_of_element_located((By.XPATH, "//a[@class='rs_product_description d-block']")))
PS: You'd have to download chromedriver first.
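Once the wait succeeds, you still need to read the element's text. A short continuation of the snippet above (my addition, assuming the same class name as the question) could be:

# Continuation sketch: the wait returns the element, so capture it and read its text.
item = wait.until(
    EC.visibility_of_element_located(
        (By.XPATH, "//a[@class='rs_product_description d-block']")
    )
)
print(item.text)
driver.quit()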

Python BSoup doesn't find hidden elements, problems with Selenium as alternative

Hey, I asked a similar question before, and from what I've learned, BeautifulSoup doesn't find what I'm searching for because the element is hidden.
For context: I see fancyCompLabel when I hover over Rathberger and examine the HTML code.
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = "https://www.techpilot.de/zulieferer-suchen?laserschneiden%202d%20(laserstrahlschneiden)"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
cclass=soup.find("div",class_="fancyCompLabel")
print(cclass)
It seems like Selenium could be a fix, but I have real problems implementing it.
Here's what I tried:
from bs4 import BeautifulSoup
from selenium import webdriver
import requests
# Set some Selenium Options
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# Webdriver
wd = webdriver.Chrome('chromedriver',options=options)
# URL
url = 'https://www.techpilot.de/zulieferer-suchen?laserschneiden%202d%20(laserstrahlschneiden)'
# Load URL
wd.get(url)
# Get HTML
soup = BeautifulSoup(wd.page_source, 'html.parser')
cclass=soup.find_all("div",class_="fancyCompLabel")
print(cclass)
1. There are two "Accept cookies" modals, and one of them is inside an iframe. You need to accept both and switch to the iframe of the second one.
2. You need to wait for your elements to become present in the DOM after you accept the cookies.
I've managed to accomplish this with Selenium.
Solution
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
# Set some Selenium Options
options = webdriver.ChromeOptions()
# options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# Webdriver
wd = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver', options=options)
# URL
url = 'https://www.techpilot.de/zulieferer-suchen?laserschneiden%202d%20(laserstrahlschneiden)'
# Load URL
wd.get(url)
# Get HTML
soup = BeautifulSoup(wd.page_source, 'html.parser')
wait = WebDriverWait(wd, 15)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#bodyJSP #CybotCookiebotDialogBodyLevelButtonLevelOptinAllowAll"))).click()
wait.until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "#efficientSearchIframe")))
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".hideFunctionalScrollbar #CybotCookiebotDialogBodyLevelButtonLevelOptinAllowAll"))).click()
#wd.switch_to.default_content() # you do not need to switch to default content because iframe is closed already
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".fancyCompLabel")))
results = wd.find_elements_by_css_selector(".fancyCompLabel")
for p in results:
    print(p.text)
Result:
Rathberger GmbH
Deschberger Metall- und Blechbearbeit...
Anant GmbH
Gröbmer GmbH
Berma Plaatwerk BV
Punzonado y Láser METALKOR S.L.
Blankart AG - Werkzeugbau + Fertigung
Goodwill Precision Machinery (German ...
Bechtold GmbH
PMT GmbH
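If you prefer to keep the BeautifulSoup parsing from your original attempt, the rule from the earlier answer applies here too: re-create the soup from wd.page_source only after the cookie banners are handled and the elements are present. A sketch, assuming the setup above has already run through the final wait:

# Sketch: parse the now-populated content with BeautifulSoup instead of Selenium.
from bs4 import BeautifulSoup

soup = BeautifulSoup(wd.page_source, 'html.parser')   # page_source now reflects the iframe we switched to
for div in soup.find_all("div", class_="fancyCompLabel"):
    print(div.text)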

Python 3 extract html data from sports site

I have been trying to extract data from a sports site and so far failing. I am trying to extract the 35, "Shots on Goal", and 23 from the block below, but without success.
<div class="statTextGroup">
  <div class="statText statText--homeValue">35</div>
  <div class="statText statText--titleValue">Shots on Goal</div>
  <div class="statText statText--awayValue">23</div>
</div>
from bs4 import BeautifulSoup
import requests
result = requests.get("https://www.scoreboard.com/uk/match/lvbns58C/#match-statistics;0")
src = result.content
soup = BeautifulSoup(src, 'html.parser')
stats = soup.find("div", {"class": "tab-statistics-0-statistic"})
print(stats)
This is the code I have been trying to use, and when I run it I get "None" printed. Could someone help me so I can print out the data?
Full page found here: https://www.scoreboard.com/uk/match/lvbns58C/#match-statistics;0
As the website is rendered by JavaScript, a possible option would be to load the page using Selenium and then parse it with BeautifulSoup:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
# initialize selenium driver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
wd = webdriver.Chrome('<<PATH_TO_SELENIUMDRIVER>>', options=chrome_options)
# load page via selenium
wd.get("https://www.scoreboard.com/uk/match/lvbns58C/#match-statistics;0")
# wait up to 30 seconds until the element with id "statistics-content" is loaded
table = WebDriverWait(wd, 30).until(EC.presence_of_element_located((By.ID, 'statistics-content')))
# parse content of the table
soup = BeautifulSoup(table.get_attribute('innerHTML'), 'html.parser')
print(soup)
# close selenium driver
wd.quit()
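To go one step further and pull out the individual values from the question's HTML (35, "Shots on Goal", 23), you could iterate over the statTextGroup blocks in that soup. A sketch, assuming the class names shown in the question:

# Sketch: read home value / title / away value out of each statTextGroup block.
for group in soup.find_all("div", class_="statTextGroup"):
    home = group.find("div", class_="statText--homeValue")
    title = group.find("div", class_="statText--titleValue")
    away = group.find("div", class_="statText--awayValue")
    if home and title and away:
        print(home.text, title.text, away.text)   # e.g. "35 Shots on Goal 23"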

Scraping: cannot extract content from webpage

I am trying to scrape the news content from the following page, but with no success.
https://www.business-humanrights.org/en/latest-news/?&search=nike
I have tried with Beautifulsoup :
r = requests.get("https://www.business-humanrights.org/en/latest-news/?&search=nike")
soup = BeautifulSoup(r.content, 'lxml')
soup
but the content that I am looking for - the bits of news that are tagged with div class='card__content' - does not appear in the soup output.
I also checked, but I could not find any frames to switch to.
Finally, I tried with phantomjs and the following code but with no success:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = "https://www.business-humanrights.org/en/latest-news/?&search=nike"
driver = webdriver.PhantomJS(executable_path= '~\Chromedriver\phantomjs-2.1.1-windows\bin\phantomjs.exe')
driver.get(url)
time.sleep(7)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')
container = soup.find_all('div', attrs={'class': 'card__content'})
print(container)
I am running out of options, anyone can help?
Use the API:
import requests
r = requests.get("https://www.business-humanrights.org/en/api/internal/explore/?format=json&search=nike")
print(r.json())
I don't understand why you're facing this. I tried the same, but with requests_html instead of requests and bs4; XPath expressions can be used directly in this library, without any other libraries.
import requests_html

session = requests_html.HTMLSession()
URL = 'https://www.business-humanrights.org/en/latest-news/?&search=nike'
res = session.get(URL)
divs_with_required_class = res.html.xpath(r'//div[@class="card__content"]')
for item in divs_with_required_class:
    print(f'Div {divs_with_required_class.index(item) + 1}:\n', item.text, end='\n\n')
driver.page_source returns the initial HTML document content here no matter how long you wait (the time.sleep(7) has no effect).
Try the following instead:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver.get(url)
cards = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@class='card__content' and normalize-space(.)]")))
texts = [card.text for card in cards]
print(texts)
driver.quit()
