I'm trying to scrape the href values for the items on the following page, but only for items that are shown as in stock: https://www.waitrosecellar.com/whisky-shop/view-all-whiskies/whisky-by-brand/macallan
With the following code I can scrape the hrefs, but the out_of_stock flag does not appear to be working: the printed list still includes items that are out of stock. My code:
import ssl
import requests
import sys
import time
import smtplib
from email.message import EmailMessage
import hashlib
from urllib.request import urlopen
from datetime import datetime
import json
import random
import requests
from itertools import cycle
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from urllib3.exceptions import InsecureRequestWarning
from requests_html import HTMLSession
session = HTMLSession()
user_agent_list = [
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
]
for i in range(1,4):
    #Pick a random user agent
    user_agent = random.choice(user_agent_list)
    #Set the headers
    headers = {'User-Agent': user_agent}
url = 'https://www.waitrosecellar.com/whisky-shop/view-all-whiskies/whisky-by-brand/macallan'
response = requests.get(url,headers=headers)
soup = BeautifulSoup(response.text,features="html.parser")
test = []
for product in soup.find_all('div', class_="productName"):
    out_of_stock=False
    for span in product.parent.find_all('span', ):
        if "Out of stock" in span.text:
            out_of_stock = True
            break
    if not out_of_stock:
        test.append(product.a['href'])
print(test)
Please could I have suggestions as to how to make the out_of_stock flag work correctly, in order to only print items that are in stock. Thank you!
Here is one way to differentiate between out of stock/available products:
import requests
from bs4 import BeautifulSoup as bs
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://www.waitrosecellar.com/whisky-shop/view-all-whiskies/whisky-by-brand/macallan'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
cards = soup.select('div[class="productCard"]')
for c in cards:
    product = c.select_one('div[class="productName" ] a').text.strip()
    product_url = c.select_one('div[class="productName" ] a').get('href')
    availability = 'Product Available' if c.select_one('div[class="productOutOfStock"]').get('style') == 'display:none;' else 'Out of Stock'
    if availability == 'Product Available':
        print(product, product_url, availability)
Result in terminal:
Macallan 12 Year Old Sherry Oak https://www.waitrosecellar.com/macallan-12-year-old-sherry-oak-717201 Product Available
Of course, you can extract other data points about the products as well. See the BeautifulSoup documentation here: https://beautiful-soup-4.readthedocs.io/en/latest/
Also, Requests-HTML appears to be unmaintained: its last release was on Feb 17, 2019, almost 4 years ago.
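The answer's check compares the style attribute to the exact string 'display:none;', so a variant such as 'display: none' would be read as out of stock. A minimal sketch of a more tolerant check follows. The helper name is_hidden is hypothetical, and it assumes the site only ever uses an inline display declaration to hide the badge:

```python
def is_hidden(style):
    """Return True if an inline style string ends up with display set to none.

    Accepts None or an empty string (no style attribute) and returns False.
    """
    hidden = False
    # Walk the declarations in order so that a later "display" wins.
    for declaration in (style or '').split(';'):
        name, _, value = declaration.partition(':')
        if name.strip().lower() == 'display':
            hidden = value.strip().lower() == 'none'
    return hidden


# Spacing and letter case no longer matter.
print(is_hidden('display:none;'), is_hidden('display: none'), is_hidden(None))
```

In the loop above, the value to pass would be the style attribute of the productOutOfStock element, with a hidden badge meaning the product is available.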
Related
When I try to scrape a website over multiple pages, BeautifulSoup returns the first page's content for every page in the range, so the same results are repeated again and again.
data=pd.DataFrame()
for i in range(1,10):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
    url="https://www.collegesearch.in/engineering-colleges-india".format(i)
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html5lib')
    #clg url and name
    clg=soup.find_all('h2', class_='media-heading mg-0')
    #other details
    details=soup.find_all('dl', class_='dl-horizontal mg-0')
    _dict={'clg':clg,'details':details}
    df=pd.DataFrame(_dict)
    data=data.append(df,ignore_index=True)
This is not a BeautifulSoup issue. Check your loop: you never change the page, because the url string has no placeholder for .format(i) to fill, so it is always the same:
https://www.collegesearch.in/engineering-colleges-india
So change your code to pass the loop counter as the value of the page parameter:
for i in range(1,10):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
    url=f"https://www.collegesearch.in/engineering-colleges-india?page={i}"
    print(url)
It may also be worth a short read of the Python tutorial section on string formatting: https://docs.python.org/3/tutorial/inputoutput.html
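To make the difference concrete, here is a small sketch that needs no network access. It shows that str.format() returns a string without placeholders unchanged, and builds one URL per page with the page query parameter used in the answer. The helper name page_urls is hypothetical:

```python
from urllib.parse import urlencode

BASE_URL = "https://www.collegesearch.in/engineering-colleges-india"


def page_urls(base_url, first, last):
    """Build one URL per page number, from first to last inclusive."""
    return [
        f"{base_url}?{urlencode({'page': page})}"
        for page in range(first, last + 1)
    ]


# No "{}" in the string, so format() has nothing to substitute.
unchanged = BASE_URL.format(3)
print(unchanged == BASE_URL)

urls = page_urls(BASE_URL, 1, 3)
for url in urls:
    print(url)
```

Each URL in the list can then be passed to requests.get() inside the loop.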
I am trying to build a simple web scraper to monitor Nike's site here in Brazil.
Basically, I want to track the products that are in stock right now, so that I can detect when new products are added.
My problem is that when I browse to https://www.nike.com.br/snkrs#estoque I see different products from the ones I get using the Python requests library.
Here is the code I am using:
import requests
from bs4 import BeautifulSoup
headers ={
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
url = 'https://www.nike.com.br/snkrs#estoque'
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
len(soup.find_all(class_='produto produto--comprar'))
This code gives me 40, but in the browser I can see 56 products: https://prnt.sc/26jeo1i
The products are loaded from a different URL, split across 3 pages.
import requests
from bs4 import BeautifulSoup
headers ={
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
productList = []
for p in [1,2,3]:
    url = f'https://www.nike.com.br/Snkrs/Estoque?p={p}&demanda=true'
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    productList += soup.find_all(class_='produto produto--comprar')
Output:
print(len(productList))
56
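The hard-coded [1,2,3] works while the listing has exactly three pages. A hedged alternative is to keep requesting pages until one comes back with no products. The sketch below shows only that loop: collect_all and fetch_page are hypothetical names, the stand-in data is made up, and it assumes the site returns an empty page past the last one, which is not confirmed here:

```python
def collect_all(fetch_page, max_pages=50):
    """Call fetch_page(1), fetch_page(2), ... until a page yields nothing.

    max_pages is a safety cap in case the site never returns an empty page.
    """
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            break
        items.extend(batch)
    return items


# Stand-in for the real "request the page and find the products" step.
fake_pages = {1: ['a', 'b'], 2: ['c'], 3: []}
collected = collect_all(lambda p: fake_pages.get(p, []))
print(collected)
```

In the real scraper, fetch_page would wrap the requests.get() and soup.find_all() calls from the loop above.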
I am trying to get the price of a cryptocurrency from CoinMarketCap. Unfortunately, it's not working. Can anyone help?
I would like to display the price of this coin: https://coinmarketcap.com/currencies/bombcrypto/
My code:
#!/usr/bin/env python3
from bs4 import BeautifulSoup as S
import requests
c = input('bombcrypto')
url = f'https://coinmarketcap.com/currencies/{c}/'
headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
r = requests.get(url,headers=headers)
soup = S(r.content,'html.parser')
print(f'the price of {c} now is ')
x = soup.find(class_='sc-16r8icm-0 kjciSH priceTitle').text
print(x)
Thank you.
There are a couple of issues with your code. Make sure you follow the structure of the web page correctly when navigating through the elements. I also changed the input statement so that its argument is a prompt rather than the coin name.
#!/usr/bin/env python3
from bs4 import BeautifulSoup as S
import requests
c = input("Crypto: ")#'bombcrypto'
url = f'https://coinmarketcap.com/currencies/{c}/'
headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
r = requests.get(url,headers=headers)
soup = S(r.content,'html.parser')
print(f'the price of {c} now is ')
x = soup.find(class_='sc-16r8icm-0 kjciSH priceTitle').findChild(class_="priceValue").findChild('span')
print(x.text)
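On the input change: the argument to input() is only the prompt shown to the user, not a default value, so input('bombcrypto') still waits for the user to type something. If a default is wanted, one possible sketch is below. The names read_slug and coin_url are hypothetical, and the reader parameter exists only so the function can be exercised without a keyboard:

```python
def read_slug(default="bombcrypto", reader=input):
    """Prompt for a coin slug and fall back to default on empty input."""
    answer = reader(f"Crypto [{default}]: ").strip()
    return answer or default


def coin_url(slug):
    """Build the currency page URL in the form used in the question."""
    return f"https://coinmarketcap.com/currencies/{slug}/"


# Simulate the user pressing Enter without typing anything.
print(coin_url(read_slug(reader=lambda prompt: "")))
```

Calling read_slug() with no arguments uses the real input(), so pressing Enter selects bombcrypto.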
I am trying to scrape the product links from an Amazon page, but I only get 2 or 3 links back.
The link to the page is https://www.amazon.com/s?rh=n%3A1069242&fs=true&ref=lp_1069242_sar
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers ={
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}
r =requests.get('https://www.amazon.com/s?rh=n%3A1069242&fs=true&ref=lp_1069242_sar')
soup=BeautifulSoup(r.content, 'html.parser')
for link in soup.find_all('a',href=True):
    print(link['href'])
Here is the working solution:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import urljoin
base_url='https://www.amazon.com'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36','session':'141-2320098-4829807'}
r = requests.get('https://www.amazon.com/s?rh=n%3A1069242&fs=true&ref=lp_1069242_sar', headers = headers)
soup = BeautifulSoup(r.content, 'lxml')
for link in soup.find_all('a',class_="a-link-normal s-underline-text s-underline-link-text a-text-normal",href=True):
    p=link['href']
    l=urljoin(base_url,p)
    print(l)
Output:
https://www.amazon.com/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_browse_office-products_sr_pg1_1?ie=UTF8&adId=A05861132UJ9W79S82Z3&url=%2FFiskars-Inch-Student-Scissors-Pack%2Fdp%2FB08CL355MN%2Fref%3Dsr_1_1_sspa%3Fdchild%3D1%26qid%3D1633717907%26s%3Doffice-products%26sr%3D1-1-spons%26psc%3D1&qualifier=1633717907&id=1565389383398743&widgetName=sp_atf_browse
https://www.amazon.com/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_browse_office-products_sr_pg1_1?ie=UTF8&adId=A0918144191FAIKGYK3YC&url=%2FFiskars-Inch-Blunt-Kids-Scissors%2Fdp%2FB00TJSS9ZW%2Fref%3Dsr_1_2_sspa%3Fdchild%3D1%26qid%3D1633717907%26s%3Doffice-products%26sr%3D1-2-spons%26psc%3D1&qualifier=1633717907&id=1565389383398743&widgetName=sp_atf_browse
https://www.amazon.com/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_browse_office-products_sr_pg1_1?ie=UTF8&adId=A09889161KB2CNO5NB8QC&url=%2FLind-Kitchen-Dispenser-Decorative-Stationery%2Fdp%2FB07VRLW5C6%2Fref%3Dsr_1_3_sspa%3Fdchild%3D1%26qid%3D1633717907%26s%3Doffice-products%26sr%3D1-3-spons%26psc%3D1&qualifier=1633717907&id=1565389383398743&widgetName=sp_atf_browse
https://www.amazon.com/Zebra-Pen-Retractable-Ballpoint-18-Count/dp/B00M382RJO/ref=sr_1_4?dchild=1&qid=1633717907&s=office-products&sr=1-4
... and so on
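The first three links in the output are sponsored redirects that carry the product path in a percent-encoded url query parameter. If the direct product link is preferred, a sketch along these lines could unwrap them. The helper name absolute_product_url is hypothetical, and it assumes the url parameter always holds a product path, which is based only on the output shown above:

```python
from urllib.parse import urljoin, urlsplit, parse_qs

BASE_URL = "https://www.amazon.com"


def absolute_product_url(href, base_url=BASE_URL):
    """Make href absolute and unwrap a redirect that carries its target in ?url=."""
    full = urljoin(base_url, href)
    target = parse_qs(urlsplit(full).query).get("url")
    if target:
        # parse_qs has already percent-decoded the value.
        return urljoin(base_url, target[0])
    return full


# Shortened, made-up hrefs in the same shape as the output above.
sponsored = "/gp/slredirect/picassoRedirect.html/ref=x?ie=UTF8&url=%2FSome-Item%2Fdp%2FB000000000%2Fref%3Dsr_1_1&id=1"
organic = "/Zebra-Pen-Retractable-Ballpoint-18-Count/dp/B00M382RJO/ref=sr_1_4?dchild=1"
print(absolute_product_url(sponsored))
print(absolute_product_url(organic))
```

Links without a url parameter are simply joined with the base URL, as in the answer's loop.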
I am trying to scrape an API call with requests; the URL is in the code below.
It gives me the following error:
ValueError: No JSON object could be decoded
Here is the code:
import requests
import json
import time
from bs4 import BeautifulSoup
url = 'https://www.nseindia.com/api/event-calendar'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}
request = requests.get(url,headers=headers)
data = json.loads(request.text)
print(data)
How can I scrape this website?
Try this:
import requests
from bs4 import BeautifulSoup
url = 'https://www.nseindia.com/companies-listing/corporate-filings-event-calendar'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}
request = requests.get(url,headers=headers)
soup = BeautifulSoup(request.text,'html.parser')
print(soup)
The table is probably generated dynamically with JavaScript, so requests alone won't work. You need Selenium with a headless browser to render the page.
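The ValueError in the question is what json.loads() raises when the response body is not valid JSON, for example an HTML page or an empty body. What this site actually returns is not confirmed here. A small sketch for inspecting the body instead of crashing follows; the helper name decode_json is hypothetical:

```python
import json


def decode_json(body):
    """Return (data, None) on success, or (None, reason) if body is not JSON."""
    try:
        return json.loads(body), None
    except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
        preview = body[:80].strip()
        return None, f"not JSON ({exc}); body starts with: {preview!r}"


# A JSON body decodes normally; an HTML body is reported, not raised.
print(decode_json('{"events": []}'))
print(decode_json('<html><body>Access denied</body></html>'))
```

Passing request.text to this helper shows what the server really sent back, which helps in deciding whether a browser-based approach is needed.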