I am having difficulties scraping the "product name" and "price" from this website: https://www.fairprice.com.sg/product/zespri-new-zealand-kiwifruit-green-4-6-per-pack-13045571
I am looking to scrape "$4.30" and "Zespri New Zealand Kiwifruit - Green" from the webpage. I have tried various approaches (Beautiful Soup, requests_html, Selenium) without any success. Attached are the sample code approaches I have taken.
I am able to view the 'price' and 'product name' details in the "Developer Tools" tab of Chrome. It seems the webpage uses JavaScript to dynamically load the product information, so the approaches mentioned above are not able to scrape it properly.
Appreciate any assistance on this issue.
Requests_html Approach:
from requests_html import HTMLSession
import json
url='https://www.fairprice.com.sg/product/zespri-new-zealand-kiwifruit-green-4-6-per-pack-13045571'
session = HTMLSession()
r = session.get(url)
r.html.render(timeout=20)
json_text = r.html.xpath("//script[@type='application/ld+json']/text()")[0][:-1]
json_data = json.loads(json_text)
print(json_data['name']['price'])
Beautiful Soup Approach:
import sys
import time
from bs4 import BeautifulSoup
import requests
import re
url='https://www.fairprice.com.sg/product/zespri-new-zealand-kiwifruit-green-4-6-per-pack-13045571'
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.69'}
page=requests.get(url, headers=headers)
time.sleep(2)
soup=BeautifulSoup(page.text,'html.parser')
linkitem=soup.find_all('span',attrs={'class':'sc-1bsd7ul-1 djlKtC'})
print(linkitem)
linkprice=soup.find_all('span',attrs={'class':'sc-1bsd7ul-1 sc-13n2dsm-5 kxEbZl deQJPo'})
print(linkprice)
Selenium Approach:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = "https://www.fairprice.com.sg/product/zespri-new-zealand-kiwifruit-green-4-6-per-pack-13045571"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(3)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')
linkitem = soup.find_all('span',attrs={'class':'sc-1bsd7ul-1 djlKtC'})
print(linkitem)
That approach of yours with the embedded JSON needs some refinement; in other words, you're almost there. Also, this can be done with pure requests and bs4.
PS. I'm using different URLs, as the one you gave returns a 404.
Here's how:
import json
import requests
from bs4 import BeautifulSoup
urls = [
    "https://www.fairprice.com.sg/product/11798142",
    "https://www.fairprice.com.sg/product/vina-maipo-cabernet-sauvignon-merlot-750ml-11690254",
    "https://www.fairprice.com.sg/product/new-moon-new-zealand-abalone-425g-75342",
]
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:96.0) Gecko/20100101 Firefox/96.0",
}

for url in urls:
    product_data = json.loads(
        BeautifulSoup(requests.get(url, headers=headers).text, "lxml")
        .find("script", type="application/ld+json")
        .string[:-1]
    )
    print(product_data["name"])
    print(product_data["offers"]["price"])
This should output:
Nongshim Instant Cup Noodle - Spicy
1.35
Vina Maipo Red Wine - Cabernet Sauvignon Merlot
14.95
New Moon New Zealand Abalone
33.8
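As a side note, the .string[:-1] slice assumes there is exactly one stray character after the JSON-LD payload. If you want something a bit more defensive, a small helper (a sketch; parse_ld_json is a hypothetical name) could strip whitespace and trailing semicolons before parsing:

import json

def parse_ld_json(raw):
    # Strip surrounding whitespace and any stray trailing semicolon
    # before handing the payload to json.loads.
    cleaned = raw.strip().rstrip(";")
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None

You would then pass it the .string of the ld+json script tag instead of slicing.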
Related
BeautifulSoup doesn't find any tag on this page. Does anyone know what the problem could be?
I can find elements on the page with Selenium, but since I have a list of pages, I don't want to use Selenium.
import requests
from bs4 import BeautifulSoup
url = 'https://dzen.ru/news/story/VMoskovskoj_oblasti_zapushhen_chat-bot_ochastichnoj_mobilizacii--b093f9a22a32ed6731e4a4ca50545831?lang=ru&from=reg_portal&fan=1&stid=fOB6O7PV5zeCUlGyzvOO&t=1664886434&persistent_id=233765704&story=90139eae-79df-5de1-9124-0d830e4d59a5&issue_tld=ru'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
print(soup.find_all('h1'))
You can get the info on that page by adding headers to your request, mimicking the main request to that URL that you can see in the Dev tools - Network tab. Here is one way to get all the links from that page:
import requests
from bs4 import BeautifulSoup as bs
headers = {
    'Cookie': 'sso_checked=1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://dzen.ru/news/story/VMoskovskoj_oblasti_zapushhen_chat-bot_ochastichnoj_mobilizacii--b093f9a22a32ed6731e4a4ca50545831?lang=ru&from=reg_portal&fan=1&stid=fOB6O7PV5zeCUlGyzvOO&t=1664886434&persistent_id=233765704&story=90139eae-79df-5de1-9124-0d830e4d59a5&issue_tld=ru'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
links = [a.get('href') for a in soup.select('a')]
print(links)
Result printed in terminal:
['/news', 'https://dzen.ru/news', 'https://dzen.ru/news/region/moscow', 'https://dzen.ru/news/rubric/mobilizatsiya', 'https://dzen.ru/news/rubric/personal_feed', 'https://dzen.ru/news/rubric/politics', 'https://dzen.ru/news/rubric/society', 'https://dzen.ru/news/rubric/business', 'https://dzen.ru/news/rubric/world', 'https://dzen.ru/news/rubric/sport', 'https://dzen.ru/news/rubric/incident', 'https://dzen.ru/news/rubric/culture', 'https://dzen.ru/news/rubric/computers', 'https://dzen.ru/news/rubric/science', 'https://dzen.ru/news/rubric/auto', 'https://www.mosobl.kp.ru/online/news/4948743/?utm_source=yxnews&utm_medium=desktop', 'https://www.mosobl.kp.ru/online/news/4948743/?utm_source=yxnews&utm_medium=desktop', 'https://www.mosobl.kp.ru/online/news/4948743/?utm_source=yxnews&utm_medium=desktop', 'https://mosregtoday.ru/soc/v-podmoskove-zapustili-chat-bot-po-voprosam-chastichnoj-mobilizacii/?utm_source=yxnews&utm_medium=desktop', ...]
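Note that some of those hrefs are relative (e.g. /news). If you need absolute URLs, one option (a small sketch reusing url and links from the snippet above) is to resolve them with urllib.parse.urljoin:

from urllib.parse import urljoin

# Resolve relative hrefs like '/news' against the page URL;
# skip anchors that carry no href at all.
absolute_links = [urljoin(url, href) for href in links if href]
print(absolute_links)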
I could not reach the image URL and image title with Beautiful Soup. I will be glad if you can help me. Thanks.
The image URL I want to scrape is:
https://cdn.homebnc.com/homeimg/2017/01/29-entry-table-ideas-homebnc.jpg
The title I want to scrape:
37 Best Entry Table Ideas (Decorations and Designs) for 2017
from bs4 import BeautifulSoup
import requests
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
}
response = requests.get("https://www.bing.com/images/search?view=detailV2&ccid=sg67yP87&id=39EC3D95F0FC25C52E714B1776D819AB564D474B&thid=OIP.sg67yP87Kr9hQF8PiKnKZQHaLG&mediaurl=https%3a%2f%2fcdn.homebnc.com%2fhomeimg%2f2017%2f01%2f29-entry-table-ideas-homebnc.jpg&cdnurl=https%3a%2f%2fth.bing.com%2fth%2fid%2fR.b20ebbc8ff3b2abf61405f0f88a9ca65%3frik%3dS0dNVqsZ2HYXSw%26pid%3dImgRaw%26r%3d0&exph=2247&expw=1500&q=table+ideas&simid=608015185750411203&FORM=IRPRST&ck=7EA9EDE471AB12F7BDA2E7DA12DC56C9&selectedIndex=0&qft=+filterui%3aimagesize-large", headers=headers)
print(response.status_code)
soup = BeautifulSoup(response.content, "html.parser")
As mentioned in the comment above by @Carst3n, BeautifulSoup is only giving you the HTML as it is before any scripts are executed. For this reason you should try to scrape the website with a combination of Selenium and BeautifulSoup.
You can download chromedriver from here
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
opts = Options()
opts.add_argument("--headless")
opts.add_argument("--log-level=3")
driver = webdriver.Chrome(PATH_TO_CHROME_DRIVER, options=opts)  # set PATH_TO_CHROME_DRIVER to your chromedriver path
driver.get("https://www.bing.com/images/search?view=detailV2&ccid=sg67yP87&id=39EC3D95F0FC25C52E714B1776D819AB564D474B&thid=OIP.sg67yP87Kr9hQF8PiKnKZQHaLG&mediaurl=https%3a%2f%2fcdn.homebnc.com%2fhomeimg%2f2017%2f01%2f29-entry-table-ideas-homebnc.jpg&cdnurl=https%3a%2f%2fth.bing.com%2fth%2fid%2fR.b20ebbc8ff3b2abf61405f0f88a9ca65%3frik%3dS0dNVqsZ2HYXSw%26pid%3dImgRaw%26r%3d0&exph=2247&expw=1500&q=table+ideas&simid=608015185750411203&FORM=IRPRST&ck=7EA9EDE471AB12F7BDA2E7DA12DC56C9&selectedIndex=0&qft=+filterui%3aimagesize-large")
time.sleep(2)
soup_file = driver.page_source
driver.quit()
soup = BeautifulSoup(soup_file, 'html.parser')
print(soup.select('#mainImageWindow .nofocus')[0].get('src'))
print(soup.select('.novid')[0].string)
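If the fixed time.sleep(2) ever proves flaky, an explicit wait is usually more reliable. A minimal sketch (to run before driver.quit(), reusing the #mainImageWindow .nofocus selector from above):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the main image to be present,
# rather than sleeping for a fixed interval.
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#mainImageWindow .nofocus"))
)
print(element.get_attribute("src"))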
Following this article, I created my first web scraper with Python. My intention is to scrape Google Shopping, looking for product prices. The script works, but I want to search for more than one product when I run it.
So, I'm looping over a list of products like this:
from time import sleep
from random import randint
import requests
from bs4 import BeautifulSoup
# from dataProducts import products
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
stores = ["Submarino", "Casas Bahia", "Extra.com.br", "Americanas.com",
          "Pontofrio.com", "Shoptime", "Magazine Luiza", "Amazon.com.br - Retail", "Girafa"]
products = [
    {
        "name": "Console Playstation 5",
        "lowestPrice": 4000.0,
        "highestPrice": 4400.0
    },
    {
        "name": "Controle Xbox Robot",
        "lowestPrice": 320.0,
        "highestPrice": 375.0
    }
]
for product in products:
    params = {"q": product["name"], 'tbm': 'shop'}
    response = requests.get("https://www.google.com/search",
                            params=params,
                            headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    # Normal results
    for shopping_result in soup.select('.sh-dgr__content'):
        product = shopping_result.select_one('.Lq5OHe.eaGTj h4').text
        price = shopping_result.select_one('span.kHxwFf span.a8Pemb').text
        store = shopping_result.select_one('.IuHnof').text
        link = f"https://www.google.com{shopping_result.select_one('.Lq5OHe.eaGTj')['href']}"
        if store in stores:
            print(product)
            print(price)
            print(store)
            print(link)
            print()
    print()
    print('####################################################################################################################################################')
When I run the script, it doesn't bring back all the data. And sometimes it doesn't bring any data from the first search at all; it just shows the prints from the second iteration. I tried to put a sleep of 10 seconds after the soup line, and after the last line of the loop, and nothing changed.
I don't understand why my script can't get all the results for the given products. Can anyone give me a little help?
To start off, I would recommend Selenium; plain requests will most times not bring back the data. Second, if you are trying to get stock alerts for PS5s or Xboxes, I would scrape a retailer's website rather than Google. You will need to install Chrome and ChromeDriver. Link: https://chromedriver.chromium.org/downloads Below is how to use Selenium!
from fake_useragent import UserAgent
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

ua = UserAgent()
options = Options()
options.add_argument("user-agent=" + ua.random)
options.add_argument("--headless")
options.add_argument("--disable-gpu")
options.add_experimental_option("excludeSwitches", ["enable-logging"])
browser = webdriver.Chrome("chromedriver location", options=options)  # replace "chromedriver location" with your chromedriver path
browser.get("https://google.com")
html = browser.page_source
So you need to run:
pip install selenium
pip install fake_useragent
to get it set up.
Then, using html, you can use BS4 to scrape the website.
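For instance, once you have html from browser.page_source, the BS4 side might look like this (a sketch; the h3 selector for result titles is an assumption, so confirm it in DevTools):

from bs4 import BeautifulSoup

# Parse the rendered page source returned by Selenium.
soup = BeautifulSoup(html, "html.parser")

# 'h3' is only an assumed selector for Google result titles;
# verify the real one in DevTools before relying on it.
for title in soup.select("h3"):
    print(title.get_text(strip=True))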
I tried to extract the div with class='no-selected-number extreme-number' that contains the website pagination, but I don't get the expected result. Can anyone help me?
Below is my code:
import requests
from bs4 import BeautifulSoup

URL = "https://www.falabella.com.pe/falabella-pe/category/cat40703/Perfumes-de-Mujer/"
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538 Safari/537.36'}
r = requests.get(URL, headers=headers, timeout=5)
html = r.content
soup = BeautifulSoup(html, 'lxml')
box_3 = soup.find_all('div', 'fb-filters-sort')
for div in box_3:
    last_page = div.find_all("div", {"class": "no-selected-number extreme-number"})
    print(last_page)
You may need a method that allows time for page loading, e.g. using Selenium. I don't think the data you are after is present with requests alone.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--headless")
url ="https://www.falabella.com.pe/falabella-pe/category/cat40703/Perfumes-de-Mujer/"
d = webdriver.Chrome(chrome_options=chrome_options)
d.get(url)
print(d.find_element_by_css_selector('.content-items-number-list .no-selected-number.extreme-number:last-child').text)
d.quit()
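As a side note, chrome_options= and find_element_by_css_selector are deprecated in Selenium 4; if you are on a newer version, a rough equivalent (a sketch, reusing the same url) would be:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

url = "https://www.falabella.com.pe/falabella-pe/category/cat40703/Perfumes-de-Mujer/"
chrome_options = Options()
chrome_options.add_argument("--headless")
# 'options=' replaces the deprecated 'chrome_options=' keyword
d = webdriver.Chrome(options=chrome_options)
d.get(url)
# find_element(By.CSS_SELECTOR, ...) replaces find_element_by_css_selector(...)
print(d.find_element(By.CSS_SELECTOR, '.content-items-number-list .no-selected-number.extreme-number:last-child').text)
d.quit()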
I wanted to pick up the address of "Spotlight 29 Casino" through a Google search in a Python script. Why isn't my code working?
from bs4 import BeautifulSoup
# from googlesearch import search
import urllib.request
import datetime
article='spotlight 29 casino address'
url1 ='https://www.google.co.in/#q='+article
content1 = urllib.request.urlopen(url1)
soup1 = BeautifulSoup(content1,'lxml')
#print(soup1.prettify())
div1 = soup1.find('div', {'class':'Z0LcW'}) #get the div where it's located
# print (datetime.datetime.now(), 'street address: ' , div1.text)
print (div1)
Pastebin Link
Google uses JavaScript rendering for that purpose; that's why you don't receive that div with urllib.request.urlopen.
As a solution, you can use Selenium, a Python library for emulating a browser. Install it with the pip install selenium console command, and then code like this should work:
from bs4 import BeautifulSoup
from selenium import webdriver
article = 'spotlight 29 casino address'
url = 'https://www.google.co.in/#q=' + article
driver = webdriver.Firefox()
driver.get(url)
html = BeautifulSoup(driver.page_source, "lxml")
div = html.find('div', {'class': 'Z0LcW'})
print(div.text)
If you want to get Google search results, Selenium with Python is the simpler way.
Below is simple code:
from selenium import webdriver
import urllib.parse
from bs4 import BeautifulSoup
chromedriver = '/xxx/chromedriver'  # replace /xxx/ with the path where chromedriver is installed
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(chromedriver, chrome_options=chrome_options)
article='spotlight 29 casino address'
driver.get("https://www.google.co.in/#q="+urllib.parse.quote(article))
# driver.page_source <-- html source, you can parser it later.
soup = BeautifulSoup(driver.page_source, 'lxml')
div = soup.find('div',{'class':'Z0LcW'})
print(div.text)
driver.quit()
You were getting an empty div because, when using the requests library, the default user-agent is python-requests, so your request was blocked by Google (in this case) and you received different HTML with different elements. A user-agent fakes a "real" user visit.
You can achieve it without Selenium, as long as the address is in the HTML code (which in this case it is), by passing a user-agent into the request headers:
headers = {
    "User-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get("YOUR_URL", headers=headers)
Here's the code and a full example:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
response = requests.get("https://www.google.com/search?q=spotlight 29 casino address", headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
print(soup.select_one(".sXLaOe").text)
# 46-200 Harrison Pl, Coachella, CA 92236
P.S. There's a dedicated web scraping blog of mine. If you need to parse search engines, have a try using SerpApi.
Disclaimer: I work for SerpApi.