Python BeautifulSoup - can't read website pagination

I tried to extract the div with class='no-selected-number extreme-number', which contains the website's pagination, but I don't get the expected result. Can anyone help me?
Below is my code:
import requests
from bs4 import BeautifulSoup

URL = "https://www.falabella.com.pe/falabella-pe/category/cat40703/Perfumes-de-Mujer/"
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538 Safari/537.36'}
r = requests.get(URL, headers=headers, timeout=5)
html = r.content
soup = BeautifulSoup(html, 'lxml')
box_3 = soup.find_all('div', 'fb-filters-sort')
for div in box_3:
    # attrs must be a dict ("class": value), not a set ("class", value)
    last_page = div.find_all("div", {"class": "no-selected-number extreme-number"})
    print(last_page)

You may need a method that allows time for the page to load, e.g. Selenium. I don't think the data you are after is present in the response that requests receives.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

chrome_options = Options()
chrome_options.add_argument("--headless")
url = "https://www.falabella.com.pe/falabella-pe/category/cat40703/Perfumes-de-Mujer/"
# Selenium 4 style: pass the options object with the `options=` keyword
d = webdriver.Chrome(options=chrome_options)
d.get(url)
# Selenium 4 style: find_element(By.CSS_SELECTOR, ...) instead of find_element_by_css_selector
print(d.find_element(By.CSS_SELECTOR, '.content-items-number-list .no-selected-number.extreme-number:last-child').text)
d.quit()
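Before reaching for a browser, it can help to confirm that the target classes really are missing from the HTML that requests receives. The following is a minimal sketch of such a check. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs, and both HTML strings are hypothetical stand-ins (not the real falabella markup) for a static response and a JavaScript-rendered DOM.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class token that appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def classes_present(html, wanted):
    """Return True if every class in `wanted` occurs somewhere in `html`."""
    collector = ClassCollector()
    collector.feed(html)
    return set(wanted) <= collector.classes


# Hypothetical static response: the pagination container is still empty.
static_html = (
    '<div class="fb-filters-sort">'
    '<div class="content-items-number-list"></div></div>'
)
# Hypothetical rendered DOM: scripts have injected the page numbers.
rendered_html = (
    '<div class="fb-filters-sort"><div class="content-items-number-list">'
    '<div class="no-selected-number extreme-number">12</div></div></div>'
)

wanted = ["no-selected-number", "extreme-number"]
print(classes_present(static_html, wanted))    # False
print(classes_present(rendered_html, wanted))  # True
```

In practice you would pass r.text from the requests call as the first argument. A False result suggests the element is added by JavaScript, which is when a browser-driven approach such as Selenium becomes necessary.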

Related

BeautifulSoup doesn’t find tags

BeautifulSoup doesn’t find any tag on this page. Does anyone know what the problem can be?
I can find elements on the page with selenium, but since I have a list of pages, I don’t want to use selenium.
import requests
from bs4 import BeautifulSoup
url = 'https://dzen.ru/news/story/VMoskovskoj_oblasti_zapushhen_chat-bot_ochastichnoj_mobilizacii--b093f9a22a32ed6731e4a4ca50545831?lang=ru&from=reg_portal&fan=1&stid=fOB6O7PV5zeCUlGyzvOO&t=1664886434&persistent_id=233765704&story=90139eae-79df-5de1-9124-0d830e4d59a5&issue_tld=ru'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
soup.find_all('h1')
You can get the info on that page by adding headers to your request, mimicking what you see in the browser's Dev tools, under the Network tab, for the main request to that URL. Here is one way to get all links from that page:
import requests
from bs4 import BeautifulSoup as bs
headers = {
    'Cookie': 'sso_checked=1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://dzen.ru/news/story/VMoskovskoj_oblasti_zapushhen_chat-bot_ochastichnoj_mobilizacii--b093f9a22a32ed6731e4a4ca50545831?lang=ru&from=reg_portal&fan=1&stid=fOB6O7PV5zeCUlGyzvOO&t=1664886434&persistent_id=233765704&story=90139eae-79df-5de1-9124-0d830e4d59a5&issue_tld=ru'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
links = [a.get('href') for a in soup.select('a')]
print(links)
Result printed in terminal:
['/news', 'https://dzen.ru/news', 'https://dzen.ru/news/region/moscow', 'https://dzen.ru/news/rubric/mobilizatsiya', 'https://dzen.ru/news/rubric/personal_feed', 'https://dzen.ru/news/rubric/politics', 'https://dzen.ru/news/rubric/society', 'https://dzen.ru/news/rubric/business', 'https://dzen.ru/news/rubric/world', 'https://dzen.ru/news/rubric/sport', 'https://dzen.ru/news/rubric/incident', 'https://dzen.ru/news/rubric/culture', 'https://dzen.ru/news/rubric/computers', 'https://dzen.ru/news/rubric/science', 'https://dzen.ru/news/rubric/auto', 'https://www.mosobl.kp.ru/online/news/4948743/?utm_source=yxnews&utm_medium=desktop', 'https://www.mosobl.kp.ru/online/news/4948743/?utm_source=yxnews&utm_medium=desktop', 'https://www.mosobl.kp.ru/online/news/4948743/?utm_source=yxnews&utm_medium=desktop', 'https://mosregtoday.ru/soc/v-podmoskove-zapustili-chat-bot-po-voprosam-chastichnoj-mobilizacii/?utm_source=yxnews&utm_medium=desktop', ...]
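The printed list mixes a relative link ('/news') with absolute ones and repeats some entries. A small post-processing step can tidy that up. This is a sketch only: the helper name absolute_unique and the shortened base URL are made up for illustration, and the sample hrefs are a subset of the output above plus a None, which a.get('href') returns for anchors without an href attribute.

```python
from urllib.parse import urljoin


def absolute_unique(base_url, hrefs):
    """Resolve each href against base_url, skip empty values, drop duplicates, keep order."""
    resolved = (urljoin(base_url, href) for href in hrefs if href)
    # dict.fromkeys preserves first-seen order while removing repeats
    return list(dict.fromkeys(resolved))


# Shortened, hypothetical stand-in for the story URL used above.
base = "https://dzen.ru/news/story/example"

hrefs = [
    "/news",
    "https://dzen.ru/news",
    None,
    "https://www.mosobl.kp.ru/online/news/4948743/?utm_source=yxnews&utm_medium=desktop",
    "https://www.mosobl.kp.ru/online/news/4948743/?utm_source=yxnews&utm_medium=desktop",
]

for link in absolute_unique(base, hrefs):
    print(link)
# https://dzen.ru/news
# https://www.mosobl.kp.ru/online/news/4948743/?utm_source=yxnews&utm_medium=desktop
```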

Difficulty Scraping Product Information from Website

I am having difficulties scraping the "product name" and "price" from this website: https://www.fairprice.com.sg/product/zespri-new-zealand-kiwifruit-green-4-6-per-pack-13045571
I am looking to scrape "$4.30" and "Zespri New Zealand Kiwifruit - Green" from the webpage. I have tried various approaches (Beautiful Soup, requests_html, Selenium) without any success. The sample code for each approach is included below.
I am able to view the 'price' and 'product name' details in the "Developer Tools" tab of Chrome. It seems that the webpage uses JavaScript to load the product information dynamically, so the approaches mentioned above are not able to scrape it properly.
Appreciate any assistance on this issue.
Requests_html Approach:
from requests_html import HTMLSession
import json
url='https://www.fairprice.com.sg/product/zespri-new-zealand-kiwifruit-green-4-6-per-pack-13045571'
session = HTMLSession()
r = session.get(url)
r.html.render(timeout=20)
json_text=r.html.xpath("//script[@type='application/ld+json']/text()")[0][:-1]
json_data = json.loads(json_text)
print(json_data['name']['price'])
Beautiful Soup Approach:
import sys
import time
from bs4 import BeautifulSoup
import requests
import re
url='https://www.fairprice.com.sg/product/zespri-new-zealand-kiwifruit-green-4-6-per-pack-13045571'
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.69'}
page=requests.get(url, headers=headers)
time.sleep(2)
soup=BeautifulSoup(page.text,'html.parser')
linkitem=soup.find_all('span',attrs={'class':'sc-1bsd7ul-1 djlKtC'})
print(linkitem)
linkprice=soup.find_all('span',attrs={'class':'sc-1bsd7ul-1 sc-13n2dsm-5 kxEbZl deQJPo'})
print(linkprice)
Selenium Approach:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = "https://www.fairprice.com.sg/product/zespri-new-zealand-kiwifruit-green-4-6-per-pack-13045571"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(3)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')
linkitem = soup.find_all('span',attrs={'class':'sc-1bsd7ul-1 djlKtC'})
print(linkitem)
Your approach with the embedded JSON needs some refinement, but you're almost there. This can also be done with pure requests and bs4.
PS. I'm using different URLs, as the one you gave returns a 404.
Here's how:
import json
import requests
from bs4 import BeautifulSoup
urls = [
    "https://www.fairprice.com.sg/product/11798142",
    "https://www.fairprice.com.sg/product/vina-maipo-cabernet-sauvignon-merlot-750ml-11690254",
    "https://www.fairprice.com.sg/product/new-moon-new-zealand-abalone-425g-75342",
]
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:96.0) Gecko/20100101 Firefox/96.0",
}
for url in urls:
    # Parse the first JSON-LD <script> block; [:-1] drops its trailing character
    product_data = (
        json.loads(
            BeautifulSoup(requests.get(url, headers=headers).text, "lxml")
            .find("script", type="application/ld+json")
            .string[:-1]
        )
    )
    print(product_data["name"])
    print(product_data["offers"]["price"])
This should output:
Nongshim Instant Cup Noodle - Spicy
1.35
Vina Maipo Red Wine - Cabernet Sauvignon Merlot
14.95
New Moon New Zealand Abalone
33.8
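The same idea can be sketched offline to show where "name" and "offers" → "price" sit in the JSON-LD block. This is an illustration under assumptions, not the site's actual markup: the sample HTML, the product values and the helper name first_ld_json are all made up, and the trailing semicolon is only a guess at why the answer slices off the last character. A regular expression and the standard json module stand in for BeautifulSoup so the snippet runs without extra installs.

```python
import json
import re

# Matches the body of a <script type="application/ld+json"> element.
LD_JSON_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.IGNORECASE | re.DOTALL,
)


def first_ld_json(html):
    """Return the first JSON-LD object found in `html`, or None if there is none."""
    match = LD_JSON_RE.search(html)
    if match is None:
        return None
    text = match.group(1).strip()
    # raw_decode parses one JSON value and ignores anything after it,
    # so a stray trailing character does not need to be sliced off by hand.
    data, _end = json.JSONDecoder().raw_decode(text)
    return data


# Hypothetical page fragment with a trailing semicolon after the JSON.
sample_html = """
<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Example Kiwifruit",
 "offers": {"@type": "Offer", "price": "4.30"}};
</script>
</head></html>
"""

product = first_ld_json(sample_html)
print(product["name"])             # Example Kiwifruit
print(product["offers"]["price"])  # 4.30
```

Note that "price" lives under "offers", not under "name", which is why json_data['name']['price'] in the question could not work.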

Python How Can I Scrape Image URL and Image Title From This Link With BeautifulSoup

I could not get the image URL and image title with Beautiful Soup. I would be glad if you could help me. Thanks.
The image URL I want to scrape is:
https://cdn.homebnc.com/homeimg/2017/01/29-entry-table-ideas-homebnc.jpg
The title I want to scrape is:
37 Best Entry Table Ideas (Decorations and Designs) for 2017
from bs4 import BeautifulSoup
import requests
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
}
response = requests.get("https://www.bing.com/images/search?view=detailV2&ccid=sg67yP87&id=39EC3D95F0FC25C52E714B1776D819AB564D474B&thid=OIP.sg67yP87Kr9hQF8PiKnKZQHaLG&mediaurl=https%3a%2f%2fcdn.homebnc.com%2fhomeimg%2f2017%2f01%2f29-entry-table-ideas-homebnc.jpg&cdnurl=https%3a%2f%2fth.bing.com%2fth%2fid%2fR.b20ebbc8ff3b2abf61405f0f88a9ca65%3frik%3dS0dNVqsZ2HYXSw%26pid%3dImgRaw%26r%3d0&exph=2247&expw=1500&q=table+ideas&simid=608015185750411203&FORM=IRPRST&ck=7EA9EDE471AB12F7BDA2E7DA12DC56C9&selectedIndex=0&qft=+filterui%3aimagesize-large", headers=headers)
print(response.status_code)
soup = BeautifulSoup(response.content, "html.parser")
As mentioned in the comment above by @Carst3n, BeautifulSoup only gives you the HTML as it is before any scripts are executed. For this reason, you should try to scrape the website with a combination of Selenium and BeautifulSoup.
You will need to download chromedriver first.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import time

PATH_TO_CHROME_DRIVER = "/path/to/chromedriver"  # placeholder: set this to your chromedriver location
opts = Options()
opts.add_argument("--headless")  # no leading space inside the flag
opts.add_argument("--log-level=3")
# Selenium 4 style: the driver path is passed through a Service object
driver = webdriver.Chrome(service=Service(PATH_TO_CHROME_DRIVER), options=opts)
driver.get("https://www.bing.com/images/search?view=detailV2&ccid=sg67yP87&id=39EC3D95F0FC25C52E714B1776D819AB564D474B&thid=OIP.sg67yP87Kr9hQF8PiKnKZQHaLG&mediaurl=https%3a%2f%2fcdn.homebnc.com%2fhomeimg%2f2017%2f01%2f29-entry-table-ideas-homebnc.jpg&cdnurl=https%3a%2f%2fth.bing.com%2fth%2fid%2fR.b20ebbc8ff3b2abf61405f0f88a9ca65%3frik%3dS0dNVqsZ2HYXSw%26pid%3dImgRaw%26r%3d0&exph=2247&expw=1500&q=table+ideas&simid=608015185750411203&FORM=IRPRST&ck=7EA9EDE471AB12F7BDA2E7DA12DC56C9&selectedIndex=0&qft=+filterui%3aimagesize-large")
time.sleep(2)  # crude wait so the scripts can populate the page
soup_file = driver.page_source
driver.quit()
soup = BeautifulSoup(soup_file, "html.parser")  # name the parser explicitly
print(soup.select('#mainImageWindow .nofocus')[0].get('src'))
print(soup.select('.novid')[0].string)
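As a side note, for this particular Bing URL the image address is already carried, percent-encoded, in the mediaurl query parameter, so it can be recovered without loading the page at all. The sketch below uses a shortened copy of the URL from the question (only the view, mediaurl and q parameters are kept). It assumes the link you start from has a mediaurl parameter, and it does not recover the title, which still has to come from the rendered page.

```python
from urllib.parse import parse_qs, urlparse

# Shortened copy of the search URL from the question.
search_url = (
    "https://www.bing.com/images/search?view=detailV2"
    "&mediaurl=https%3a%2f%2fcdn.homebnc.com%2fhomeimg%2f2017%2f01%2f"
    "29-entry-table-ideas-homebnc.jpg"
    "&q=table+ideas"
)

# parse_qs splits the query string and decodes the percent-escapes.
params = parse_qs(urlparse(search_url).query)
image_url = params["mediaurl"][0]
print(image_url)
# https://cdn.homebnc.com/homeimg/2017/01/29-entry-table-ideas-homebnc.jpg
```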

Why is my CSS selector not working with beautifulsoup but works fine as a chrome console query?

I have a css selector that works fine when executing it in the chrome JS Console, but does not work when running it through beautifulsoup on one example, yet works on another (I'm unable to discern the difference between the two).
url_1 = 'https://www.amazon.com/s?k=bacopa&page=1'
url_2 = 'https://www.amazon.com/s?k=acorus+calamus&page=1'
The following query works fine on both when executing it in the chrome console.
document.querySelectorAll('div.s-result-item')
Then running the two urls through beautifulsoup, this is the output I get.
url_1 (works)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
r = requests.get(url_1, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
listings = soup.select('div.s-result-item')
print(len(listings))
output: 53 (correct)
url_2 (does not work)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
r = requests.get(url_2, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
listings = soup.select('div.s-result-item')
print(len(listings))
output: 0 (incorrect - expected: 49)
Does anyone know what might be going on here and how I can get the css selector to work with beautifulsoup?
I think the problem is the HTML itself. Try changing the parser to 'lxml', which may cope with it better than 'html.parser'. You can also shorten your CSS selector to just the class, and reuse the connection with a Session object for efficiency.
import requests
from bs4 import BeautifulSoup as bs
urls = ['https://www.amazon.com/s?k=bacopa&page=1','https://www.amazon.com/s?k=acorus+calamus&page=1']
with requests.Session() as s:
    for url in urls:
        r = s.get(url, headers={'User-Agent': 'Mozilla/5.0'})
        soup = bs(r.content, 'lxml')
        listings = soup.select('.s-result-item')
        print(len(listings))
Try the Selenium library to download the webpage:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
url_1 = 'https://www.amazon.com/s?k=bacopa&page=1'
url_2 = 'https://www.amazon.com/s?k=acorus+calamus&page=1'
# set the chromedriver path (Selenium 4 style: passed through a Service object)
driver = webdriver.Chrome(service=Service('/usr/bin/chromedriver'))
# download the webpage
driver.get(url_2)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
listings = soup.find_all('div', {'class': 's-result-item'})
print(len(listings))
Output:
url_1: 50
url_2: 48

printing number from html tag in python

Hi, I have been trying to get the time data (hours, minutes, seconds) from this website: https://clockofeidolon.com. The time information is kept in the 'span class="big' tags, so I tried to use BeautifulSoup to print their contents and came up with this:
from bs4 import BeautifulSoup
from requests import Session
session = Session()
session.headers['user-agent'] = (
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/'
'66.0.3359.181 Safari/537.36'
)
url = 'https://clockofeidolon.com'
response = session.get(url=url)
data = response.text
soup = BeautifulSoup(data, "html.parser")
spans = soup.find_all('<span class="big')
print([span.text for span in spans])
But the output only shows "[]" and nothing else. How would I go about printing the number in each of the 3 tags?
As mentioned, this can be achieved with Selenium. Once you have the correct geckodriver installed, the following should get you on the right track:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://clockofeidolon.com')
html = driver.page_source
soup = BeautifulSoup(html,'lxml')
spans = soup.find_all(class_='big-hour')
for span in spans:
    print(span.text)
driver.quit()
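The snippet above prints only the hour. Once the rendered HTML is available (for instance from driver.page_source), all three values can be collected by matching every span whose class starts with "big". The following is a sketch under assumptions: the sample markup and the class names big-minute and big-second are guesses modelled on big-hour, not taken from the real page, and the standard library's html.parser is used in place of BeautifulSoup so the example runs without extra installs.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text of every <span> that has a class starting with a prefix."""

    def __init__(self, class_prefix):
        super().__init__()
        self.class_prefix = class_prefix
        self.depth = 0   # greater than 0 while inside a matching span
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth:
            self.depth += 1  # nested span inside a matching one
            return
        classes = (dict(attrs).get("class") or "").split()
        if any(c.startswith(self.class_prefix) for c in classes):
            self.depth = 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


# Hypothetical rendered markup; only "big-hour" appears in the answer above.
rendered = (
    '<div><span class="big-hour">01</span>:'
    '<span class="big-minute">23</span>:'
    '<span class="big-second">45</span></div>'
)

collector = SpanTextCollector("big")
collector.feed(rendered)
print([text.strip() for text in collector.texts])  # ['01', '23', '45']
```

With BeautifulSoup, the original attempt fails for a second reason besides the missing JavaScript rendering: find_all expects a tag name plus attribute filters, not a fragment of raw markup such as '<span class="big'.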
