While accessing this link https://www.dickssportinggoods.com/f/tents-accessories?pageNumber=2 with requests_html, i need to wait to wait some time before the page actually loads. Is it possible with this?
My code:
from requests_html import HTMLSession
from bs4 import BeautifulSoup
from lxml import etree
s = HTMLSession()
response = s.get(
'https://www.dickssportinggoods.com/f/tents-accessories?pageNumber=2')
response.html.render()
soup = BeautifulSoup(response.content, "html.parser")
dom = etree.HTML(str(soup))
item = dom.xpath('//a[#class="rs_product_description d-block"]/text()')[0]
print(item)
It looks like the data you are looking for can be fetched using HTTP GET to
https://prod-catalog-product-api.dickssportinggoods.com/v2/search?searchVO=%7B%22selectedCategory%22%3A%2212301_1809051%22%2C%22selectedStore%22%3A%220%22%2C%22selectedSort%22%3A1%2C%22selectedFilters%22%3A%7B%7D%2C%22storeId%22%3A15108%2C%22pageNumber%22%3A2%2C%22pageSize%22%3A48%2C%22totalCount%22%3A112%2C%22searchTypes%22%3A%5B%22PINNING%22%5D%2C%22isFamilyPage%22%3Atrue%2C%22appliedSeoFilters%22%3Afalse%2C%22snbAudience%22%3A%22%22%2C%22zipcode%22%3A%22%22%7D
The call will return a JSON and you can use that direcly with zero scraping code.
Copy/Paste the URL into the browser --> see the data.
You can specify the page number in the url:
searchVO={"selectedCategory":"12301_1809051","selectedStore":"0","selectedSort":1,"selectedFilters":{},"storeId":15108,"pageNumber":2,"pageSize":48,"totalCount":112,"searchTypes":["PINNING"],"isFamilyPage":true,"appliedSeoFilters":false,"snbAudience":"","zipcode":""}
working code below
import requests
import pprint
page_num = 2
url = f'https://prod-catalog-product-api.dickssportinggoods.com/v2/search?searchVO=%7B%22selectedCategory%22%3A%2212301_1809051%22%2C%22selectedStore%22%3A%220%22%2C%22selectedSort%22%3A1%2C%22selectedFilters%22%3A%7B%7D%2C%22storeId%22%3A15108%2C%22pageNumber%22%3A2%2C%2{page_num}pageSize%22%3A48%2C%22totalCount%22%3A112%2C%22searchTypes%22%3A%5B%22PINNING%22%5D%2C%22isFamilyPage%22%3Atrue%2C%22appliedSeoFilters%22%3Afalse%2C%22snbAudience%22%3A%22%22%2C%22zipcode%22%3A%22%22%7D'
r = requests.get(url)
if r.status_code == 200:
pprint.pprint(r.json())
You can induce Selenium as well in headless mode.
Selenium has the capability to wait unit elements are found with Explicit waits.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
options = webdriver.ChromeOptions()
options.add_argument('--window-size=1920,1080')
options.add_argument("--headless")
driver = webdriver.Chrome(executable_path = driver_path, options = options)
driver.get("URL here")
wait = WebDriverWait(driver, 20)
wait.until(EC.visibility_of_element_located((By.XPATH, "//a[#class='rs_product_description d-block']")))
PS: You'd have to download chromedriver from here
Related
I'm trying to scrape the following table from this URL: https://baseballsavant.mlb.com/leaderboard/outs_above_average?type=Fielder&startYear=2022&endYear=2022&split=no&team=&range=year&min=10&pos=of&roles=&viz=show
This is my code:
import requests
from bs4 import BeautifulSoup
url = "https://baseballsavant.mlb.com/leaderboard/outs_above_average?type=Fielder&startYear=2022&endYear=2022&split=no&team=&range=year&min=10&pos=of&roles=&viz=show"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
table = soup.find("table")
for row in table.findAll("tr"):
print([i.text for i in row.findAll("td")])
However, my variable table returns None, even though there is clearly a table tag in the HTML code of the website. How do I get it?
The webpage is loaded dynamically and relies on JavaScript, therefore requests won't support it. You could use another parser library such as selenium.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
url = "https://baseballsavant.mlb.com/leaderboard/outs_above_average?type=Fielder&startYear=2022&endYear=2022&split=no&team=&range=year&min=10&pos=of&roles=&viz=show"
driver.get(url)
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.TAG_NAME, 'table')))
table = driver.find_element(By.TAG_NAME, 'table')
table_html = table.get_attribute('innerHTML')
# print('table html:', table_html)
for tr_web_element in table.find_elements(By.TAG_NAME, 'tr'):
for td_web_element in tr_web_element.find_elements(By.TAG_NAME, 'td'):
print(td_web_element.text)
driver.close()
Or see this answer to incorporate Selenium with BeautifulSoup.
I am working on an office project to get data to check active status on different websites but whenever I want to get data sometimes it shows none and sometimes it shows this Attribute error, I follow youtube videos steps but still get this error. help, please.
//Python Code
from bs4 import BeautifulSoup
import requests
html_text = requests.get(
"https://www.mintscan.io/cosmos/validators/cosmosvaloper1we6knm8qartmmh2r0qfpsz6pq0s7emv3e0meuw").text
soup = BeautifulSoup(html_text, 'lxml')
status = soup.find('div', {'class': "ValidatorInfo_statusBadge__PBIGr"})
para = status.find('p').text
print(para)
The url is dynamic meaning data is populated by javascript. So you need automation tool something like selenium.
from bs4 import BeautifulSoup
import time
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
url = 'https://www.mintscan.io/cosmos/validators/cosmosvaloper1we6knm8qartmmh2r0qfpsz6pq0s7emv3e0meuw'
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
time.sleep(8)
driver.get(url)
time.sleep(10)
soup = BeautifulSoup(driver.page_source, 'lxml')
#driver.close()
status = soup.find('div', {'class': "ValidatorInfo_statusBadge__PBIGr"})
para = status.find('p').text
print(para)
Output:
Active
You have the most common problem - modern pages use JavaScript to add elements but requests/BeautifulSoup can't run JavaScript.
So soup.find('div',...) gives None instead expected element and later it makes problem with None.find('p')
You may use Selenium to control real web browser which can run JavaScript.
from selenium import webdriver
#from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
#from selenium.common.exceptions import NoSuchElementException, TimeoutException
#from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.firefox import GeckoDriverManager
url = "https://www.mintscan.io/cosmos/validators/cosmosvaloper1we6knm8qartmmh2r0qfpsz6pq0s7emv3e0meuw"
#driver = webdriver.Chrome(executable_path=ChromeDriverManager().install())
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())
driver.get(url)
#status = driver.find_element(By.XPATH, '//div[#class="ValidatorInfo_statusBadge__PBIGr"]')
wait = WebDriverWait(driver, 10)
status = wait.until(EC.visibility_of_element_located((By.XPATH, '//div[#class="ValidatorInfo_statusBadge__PBIGr"]')))
print(status.text)
Eventually you should check if page gives some API to get data.
You may also use DevTools (tab: Network) to check if JavaScript reads data from some URL and you may try to use this URL with requests. It could work faster than with Selenium but server may detect script/bot and block it.
JavaScript usually get data as JSON so it may not need to scrape HTML with BeautifulSoup
I am using the below code to print the soup variable that is nothing but the source code of the page.
Code
from urllib.request import urlopen
from bs4 import BeautifulSoup
import json, requests, re, sys
from selenium import webdriver
import re, time
yes_url = "https://www.yesbank.in/personal-banking/yes-first/cards/credit-card/yes-first-exclusive-credit-card"
driver = webdriver.Chrome(executable_path="C:\\Users\\Hari\\Downloads\\chromedriver.exe")
driver.get(yes_url)
time.sleep(3)
# r = requests.get(yes_url)
soup = BeautifulSoup(driver.page_source, 'lxml')
print(soup)
driver.close()
Link I am scraping the page source from is : https://www.yesbank.in/personal-banking/yes-first/cards/credit-card/yes-first-exclusive-credit-card
After running the above code the code keeps running till hours and hours but I don't get the output.
Please help me in scraping the page source, so that I get some output after I run the code.
Issue: You are dealing with a modern website which check the browser itself if it's controlled or not using robust.
How That can be done?
Simply, open your browser console and then type the following:
navigator.webdriver
if it's false so your browser isn't contorlled by any robust program such as selenium.
if it's true so it's controlled.
In your case, you've to disable it in order to trick the website checking mechanism.
Below you can achieve your goal:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup
options = Options()
options.headless = True
options.set_preference("dom.webdriver.enabled", False)
driver = webdriver.Firefox(options=options)
driver.get('https://www.yesbank.in/personal-banking/yes-first/cards/credit-card/yes-first-exclusive-credit-card')
try:
element = WebDriverWait(driver, 10).until(
EC.title_contains('YES'))
soup = BeautifulSoup(driver.page_source, 'lxml')
print(soup.prettify())
finally:
driver.quit()
I am trying to scrape the news content from the following page, but with no success.
https://www.business-humanrights.org/en/latest-news/?&search=nike
I have tried with Beautifulsoup :
r = requests.get("https://www.business-humanrights.org/en/latest-news/?&search=nike")
soup = BeautifulSoup(r.content, 'lxml')
soup
but the content that I am looking for - the bits of news that are tagged as div class = 'card__content', do not appear in the soup output.
I also checked, but I could not find any frames to switch to.
Finally, I tried with phantomjs and the following code but with no success:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = "https://www.business-humanrights.org/en/latest-news/?&search=nike"
driver = webdriver.PhantomJS(executable_path= '~\Chromedriver\phantomjs-2.1.1-windows\bin\phantomjs.exe')
driver.get(url)
time.sleep(7)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')
container = soup.find_all('div', attrs={
'class':'card__content'})
print(container)
I am running out of options, anyone can help?
Use API
import requests
r = requests.get("https://www.business-humanrights.org/en/api/internal/explore/?format=json&search=nike")
print(r.json())
I didn't understand why you're facing this. I tried the same above but not with requests and bs4. I used requests_html. xpaths can be used directly in this library without any other libraries.
import requests_html
session = requests_html.HTMLSession()
URL = 'https://www.business-humanrights.org/en/latest-news/?&search=nike'
res = session.get(URL)
divs_with_required_class = res.html.xpath(r'//div[#class="card__content"]')
for item in divs_with_required_class:
print(f'Div {divs_with_required_class.index(item) + 1}:\n', item.text, end='\n\n')
driver.page_source returns initial HTML-doc content no matter how long you wait (time.sleep(7) has no effect).
Try below:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver.get(url)
cards = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//div[#class='card__content' and normalize-space(.)]")))
texts = [card.text for card in cards]
print(texts)
driver.quit()
I am stuggling to scrape as per code below. Would apprciate it if someone can have a look at what I am missing?
Regards
PyProg70
from selenium import webdriver
from selenium.webdriver import FirefoxOptions
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from bs4 import BeautifulSoup
import pandas as pd
import re, time
binary = FirefoxBinary('/usr/bin/firefox')
opts = FirefoxOptions()
opts.add_argument("--headless")
browser = webdriver.Firefox(options=opts, firefox_binary=binary)
browser.implicitly_wait(10)
url = 'http://tenderbulletin.eskom.co.za/'
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
not Java but Javascript. it dynamic page you need to wait and check if Ajax finished the request and content rendered using WebDriverWait.
....
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
.....
browser.get(url)
# wait max 30 second until table loaded
WebDriverWait(browser, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR , 'table.CSSTableGenerator .ng-binding')))
html = browser.find_element_by_css_selector('table.CSSTableGenerator')
soup = BeautifulSoup(html.get_attribute("outerHTML"), 'lxml')
print(soup.prettify().encode('utf-8'))