Unable to scrape website for price element - python

I want to scrape the name, roast and price of each coffee. The code below gets the name and roast, but I am not able to scrape the price: it shows up as 'None'.
import requests
from bs4 import BeautifulSoup as bs

URLS = ["https://www.thirdwavecoffeeroasters.com/products/vienna-roast","https://www.thirdwavecoffeeroasters.com/products/baarbara-estate","https://www.thirdwavecoffeeroasters.com/products/el-diablo-blend","https://www.thirdwavecoffeeroasters.com/products/organic-signature-filter-coffee-blend","https://www.thirdwavecoffeeroasters.com/products/moka-pot-express-blend-1","https://www.thirdwavecoffeeroasters.com/products/karadykan-estate","https://www.thirdwavecoffeeroasters.com/products/french-roast","https://www.thirdwavecoffeeroasters.com/products/signature-cold-brew-blend","https://www.thirdwavecoffeeroasters.com/products/bettadakhan-estate","https://www.thirdwavecoffeeroasters.com/products/monsoon-malabar-aa"]
for url in range(0,10):
    req=requests.get(URLS[url])
    soup = bs(req.text,"html.parser")
    coffees = soup.find_all("div",class_="col-md-4 col-sm-12 col-xs-12")
    for coffee in coffees:
        name = coffee.find("div",class_="product-details-main").find("ul",class_="uk-breadcrumb uk-text-uppercase").span.text
        roast = coffee.find("div",class_="uk-flex uk-flex-middle uk-width-1-1 coff_type_main").find("p",class_="coff_type uk-margin-small-left uk-text-uppercase").text.split("|")[0]
        # This is the lookup that comes back as None
        price = coffee.find("div",class_="uk-width-1-1 uk-first-column")
        print(name,roast,price)

The issue here is that in the response you are not getting the HTML element you are looking for. If you search for the price element in req.text, you won't find it. This is probably because the website renders the page dynamically with JavaScript.
If you look closely in req.text, however, you can find all the coffee variants with their respective price in the meta JavaScript object. You can look for it and parse it into a Python dictionary using the loads() function from the json library.
Here is working code that also prints the prices. Note that every price in that object is multiplied by 100, so divide by 100 if you need the amount as it is displayed.
import requests
from bs4 import BeautifulSoup as bs
import json
URLS = ["https://www.thirdwavecoffeeroasters.com/products/vienna-roast","https://www.thirdwavecoffeeroasters.com/products/baarbara-estate","https://www.thirdwavecoffeeroasters.com/products/el-diablo-blend","https://www.thirdwavecoffeeroasters.com/products/organic-signature-filter-coffee-blend","https://www.thirdwavecoffeeroasters.com/products/moka-pot-express-blend-1","https://www.thirdwavecoffeeroasters.com/products/karadykan-estate","https://www.thirdwavecoffeeroasters.com/products/french-roast","https://www.thirdwavecoffeeroasters.com/products/signature-cold-brew-blend","https://www.thirdwavecoffeeroasters.com/products/bettadakhan-estate","https://www.thirdwavecoffeeroasters.com/products/monsoon-malabar-aa"]
for url in URLS:
    req=requests.get(url)
    soup = bs(req.text,"html.parser")
    coffees = soup.find_all("div",class_="col-md-4 col-sm-12 col-xs-12")
    for coffee in coffees:
        name = coffee.find("div",class_="product-details-main").find("ul",class_="uk-breadcrumb uk-text-uppercase").span.text
        roast = coffee.find("div",class_="uk-flex uk-flex-middle uk-width-1-1 coff_type_main").find("p",class_="coff_type uk-margin-small-left uk-text-uppercase").text.split("|")[0]
        print(name,roast)
    # Find the JavaScript declaration (there is only one `meta` variable, so no chance of ambiguity)
    json_start = req.text.find("var meta = ") + len("var meta = ")
    # Find the end of the JavaScript object (there won't be any `;` inside)
    json_end = req.text.find(";",json_start)
    json_str = req.text[json_start:json_end]
    json_data = json.loads(json_str)
    # Get the prices from the JSON object
    product = json_data["product"]
    variants = product["variants"]
    for variant in variants:
        print(f'Variant: {variant["name"]}, price: {variant["price"]}')
    print("\n")
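The slice above relies on no `;` appearing inside the object. A slightly more defensive sketch, assuming the page still contains `var meta = {...}` with the same `product` / `variants` / `price` layout, is to let `json.JSONDecoder.raw_decode` find the end of the object by itself. The helper names `extract_meta` and `price_to_units` and the sample HTML (including its name and price values) are made up for illustration; the division by 100 follows the note above about prices being multiplied by 100.

```python
import json

MARKER = "var meta = "

def extract_meta(html):
    """Return the object assigned to `var meta`, or None if the marker is absent."""
    start = html.find(MARKER)
    if start == -1:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it, so no end delimiter has to be searched for.
    obj, _end = json.JSONDecoder().raw_decode(html, start + len(MARKER))
    return obj

def price_to_units(raw_price):
    """Convert a price stored as an integer multiplied by 100."""
    return raw_price / 100

# Hypothetical snippet standing in for req.text (note the ';' inside the name)
sample_html = (
    '<script>var meta = {"product": {"variants": '
    '[{"name": "Vienna Roast; 250g", "price": 45000}]}}; var other = 1;</script>'
)

meta = extract_meta(sample_html)
for variant in meta["product"]["variants"]:
    print(variant["name"], price_to_units(variant["price"]))
```

Returning None when the marker is missing avoids the silent failure of `str.find`, which returns -1 and would otherwise produce a meaningless slice.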
I hope this helps, cheers.

Related

Why is Python requests returning a different text value to what I get when I navigate to the webpage by hand?

I am trying to build a simple 'stock-checker' for a T-shirt I want to buy. Here is the link: https://yesfriends.co/products/mens-t-shirt-black?variant=40840532689069
As you can see, I am presented with 'Coming Soon' text, whereas usually, if an item is in stock, it will show 'Add To Cart'.
I thought the simplest way would be to use requests and beautifulsoup to isolate this <button> tag and read its text. If it eventually says 'Add To Cart', I will write the code to email/message myself that it's back in stock.
However, here's the code I have so far. The response says the button text is 'Add To Cart', which is not what the website actually shows:
import requests
import bs4
URL = 'https://yesfriends.co/products/mens-t-shirt-black?variant=40840532689069'
def check_stock(url):
    page = requests.get(url)
    soup = bs4.BeautifulSoup(page.content, "html.parser")
    buttons = soup.find_all('button', {'name': 'add'})
    return buttons
if __name__ == '__main__':
    buttons = check_stock(URL)
    print(buttons[0].text)
All the data is available as JSON inside a <script> tag, so we need to fetch the page and extract the part we need. A simple slice by index gives clean JSON:
import requests
import json
url = 'https://yesfriends.co/products/mens-t-shirt-black'
response = requests.get(url)
index_start = response.text.index('product:', 0) + len('product:')
index_finish = response.text.index(', }', index_start)
json_obj = json.loads(response.text[index_start:index_finish])
for variant in json_obj['variants']:
    available = 'IN STOCK' if variant['available'] else 'OUT OF STOCK'
    print(variant['id'], variant['option1'], available)
OUTPUT:
40840532623533 XXS OUT OF STOCK
40840532656301 XS OUT OF STOCK
40840532689069 S OUT OF STOCK
40840532721837 M OUT OF STOCK
40840532754605 L OUT OF STOCK
40840532787373 XL OUT OF STOCK
40840532820141 XXL OUT OF STOCK
40840532852909 3XL IN STOCK
40840532885677 4XL OUT OF STOCK
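For the stock-checker use case, only one variant matters. A minimal sketch, assuming the parsed `variants` list keeps the `id` and `available` keys used above, could look like this. The function name `variant_in_stock` is made up, and the sample data is copied from two rows of the output above.

```python
def variant_in_stock(variants, variant_id):
    """Return True/False for the given variant id, or None if the id is not listed."""
    for variant in variants:
        if variant["id"] == variant_id:
            return bool(variant["available"])
    return None

# Two rows taken from the output above: S is out of stock, 3XL is in stock
sample_variants = [
    {"id": 40840532689069, "option1": "S", "available": False},
    {"id": 40840532852909, "option1": "3XL", "available": True},
]

# The id is compared as an integer, so convert it first if it comes
# from the ?variant=... part of the URL.
if variant_in_stock(sample_variants, int("40840532689069")):
    print("Back in stock, send the notification")
else:
    print("Still unavailable")
```

Returning None for an unknown id keeps "not listed" separate from "listed but out of stock".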

Unable to parse a specific text in HTML with beautifulsoup

I am trying to parse the rating I have provided to a movie on IMDB. Below is my code:
from requests import Session
import bs4

with Session() as s:
    shaw = s.get('https://www.imdb.com/title/tt0111161/')
    shaw_soup = bs4.BeautifulSoup(shaw.content, 'html.parser')
    title_block = shaw_soup.find(class_ = 'title_block')
    rating_widget = title_block.find('div', id = 'star-rating-widget')
    star_rating_value = rating_widget.find('span', class_ = 'star-rating-value')
    print(star_rating_value)
The HTML structure of that part of the page was attached as a screenshot, which is not reproduced here.
The output of the print(star_rating_value) is None.
The curious part is when I am parsing other attributes, there is no issue. This issue is only for parsing the rating which I have provided to a movie.

Asking the user to input something and use Beautiful Soup to parse a website

I am supposed to use Beautiful Soup 4 to obtain course information off of my school's website as an exercise. I have been at this for the past few days and my code still does not work.
The first thing I do is ask the user to input the course catalog abbreviation. For example, ICS is the abbreviation for Information for Computer Science. Beautiful Soup 4 is then supposed to list all of the courses and how many students are enrolled.
While I was able to get the input portion to work, I still have errors or the program just stops.
Question: Is there a way for Beautiful Soup to accept user input so that when the user inputs ICS, the output would be a list of all courses that are related to ICS?
Here is the code and my attempt at it:
from bs4 import BeautifulSoup
import requests
import re
#get input for course
course = input('Enter the course:')
#Here is the page link
BASE_AVAILABILITY_URL = f"https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s={course}"
#get request and response
page_response = requests.get(BASE_AVAILABILITY_URL)
#getting Beautiful Soup to gather the html content
page_content = BeautifulSoup(page_response.content, 'html.parser')
#getting course information
main = page_content.find_all(class_='parent clearfix')
main_p = "".join(str (x) for x in main)
#get the course anchor tags
main_q = BeautifulSoup(main_p, "html.parser")
courses = main.find('a', href = True)
#get each course name
#empty dictionary for course list
courses_list = []
for a in courses:
    courses_list.append(a.text)
search = input('Enter the course title:')
for course in courses_list:
    if re.search(search, course, re.IGNORECASE):
        print(course)
This is the original code that was provided in a Jupyter Notebook:
import requests, bs4
BASE_AVAILABILITY_URL = f"https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s={course}"
#get input for course
course = input('Enter the course:')
def scrape_availability(text):
    soup = bs4.BeautifulSoup(text)
    r = requests.get(str(BASE_AVAILABILITY_URL) + str(course))
    rows = soup.select('.listOfClasses tr')
    for row in rows[1:]:
        columns = row.select('td')
        class_name = columns[2].contents[0]
        if len(class_name) > 1 and class_name != b'\xa0':
            print(class_name)
            print(columns[4].contents[0])
            print(columns[7].contents[0])
            print(columns[8].contents[0])
What's odd is that if the user saves the HTML file, uploads it into Jupyter Notebook, and then opens the file to be read, the courses are displayed. But for this task the user cannot save files; the input must directly produce the output.
The problem with your code is that page_content.find_all(class_='parent clearfix') returns an empty list []. So that's the first thing you need to change. Looking at the HTML, you'll want to look for the <table>, <tr> and <td> tags.
Working off the original code that was provided, you just need to alter a few things so that it flows logically.
I'll point out what I changed:
import requests, bs4
BASE_AVAILABILITY_URL = f"https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s={course}"
#get input for course
course = input('Enter the course:')
def scrape_availability(text):
    soup = bs4.BeautifulSoup(text) #<-- need to get the html text before creating a bs4 object. So I move the request (line below) before this, and also adjusted the parameter for this function.
    # the rest of the code is fine
    r = requests.get(str(BASE_AVAILABILITY_URL) + str(course))
    rows = soup.select('.listOfClasses tr')
    for row in rows[1:]:
        columns = row.select('td')
        class_name = columns[2].contents[0]
        if len(class_name) > 1 and class_name != b'\xa0':
            print(class_name)
            print(columns[4].contents[0])
            print(columns[7].contents[0])
            print(columns[8].contents[0])
This will give you:
import requests, bs4
BASE_AVAILABILITY_URL = "https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s="
#get input for course
course = input('Enter the course:')
url = BASE_AVAILABILITY_URL + course
def scrape_availability(url):
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    rows = soup.select('.listOfClasses tr')
    for row in rows[1:]:
        columns = row.select('td')
        class_name = columns[2].contents[0]
        # compare against a str, not bytes: in Python 3 a str never equals b'\xa0'
        if len(class_name) > 1 and class_name != '\xa0':
            print(class_name)
            print(columns[4].contents[0])
            print(columns[7].contents[0])
            print(columns[8].contents[0])
scrape_availability(url)
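One detail worth noting: the earlier snippets build the URL with an f-string that reads `course` before `input()` has assigned it, which raises a NameError, while the final version avoids that by concatenating afterwards. As an alternative sketch, the query string can be built with `urllib.parse.urlencode`, which also escapes whatever the user types. The function name `build_availability_url` and the parameter names `inst` and `term` are hypothetical; the source only shows `i=MAN` and `t=202010` without saying what they mean.

```python
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://www.sis.hawaii.edu/uhdad/avail.classes"

def build_availability_url(course, inst="MAN", term="202010"):
    """Build the availability URL for a course abbreviation such as 'ICS'."""
    # `inst` and `term` are guessed names for the i= and t= parameters
    query = urlencode({"i": inst, "t": term, "s": course.strip()})
    return f"{AVAILABILITY_ENDPOINT}?{query}"

# In the real script the argument would come from input('Enter the course:')
print(build_availability_url("ICS"))
```

The returned string can be passed straight to scrape_availability(url).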

Scraping each element from website with BeautifulSoup

I wrote some code to scrape a real estate website. This is the link:
https://www.nekretnine.rs/stambeni-objekti/stanovi/lista/po-stranici/10/
From this page I can get only the location, size and price of each apartment. Is it possible to write code that goes to the page of each apartment and scrapes values from there, since it contains much more info? Check this link:
https://www.nekretnine.rs/stambeni-objekti/stanovi/arena-bulevar-arsenija-carnojevica-97m-2-lode-energoprojekt/NkvJK0Ou5tV/
I have posted my code below. I noticed that the URL changes when I click on a specific listing. For example:
arena-bulevar-arsenija-carnojevica-97m-2-lode-energoprojekt/NkvJK0Ou5tV/
I thought about creating a for loop, but there is no way to know how the URL changes because it has some id at the end:
NkvJK0Ou5tV
This is the code that I have:
from bs4 import BeautifulSoup
import requests
website = "https://www.nekretnine.rs/stambeni-objekti/stanovi/lista/po-stranici/10/"
soup = requests.get(website).text
my_html = BeautifulSoup(soup, 'lxml')
lokacija = my_html.find_all('p', class_='offer-location text-truncate')
ukupna_kvadratura = my_html.find_all('p', class_='offer-price offer-price--invert')
ukupna_cena = my_html.find_all('div', class_='d-flex justify-content-between w-100')
ukupni_opis = my_html.find_all('div', class_='mt-1 mb-1 mt-lg-0 mb-lg-0 d-md-block offer-meta-info offer-adress')
for lok, kvadratura, cena_stana, sumarno in zip(lokacija, ukupna_kvadratura, ukupna_cena, ukupni_opis):
    lok = lok.text.split(',')[0] #lokacija
    kv = kvadratura.span.text.split(' ')[0] #kvadratura
    jed = kvadratura.span.text.split(' ')[1] #jedinica mere
    cena = cena_stana.span.text #cena
    sumarno = sumarno.text
    datum = sumarno.split('|')[0].strip()
    status = sumarno.split('|')[1].strip()
    opis = sumarno.split('|')[2].strip()
    print(lok, kv, jed, cena, datum, status, opis)
Each listing has a div with class="placeholder-preview-box ratio-4-3". You can get the href from it, and that gives you the URL of the apartment's own page.
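As a rough sketch of that suggestion: there is no need to predict the id at the end of each URL, because the listing page already carries the full link. The snippet below collects every href found inside a div whose class includes placeholder-preview-box and turns it into an absolute URL with urljoin. It assumes the <a> tag sits inside that div, which has not been checked against the live page. It uses the standard library's HTMLParser instead of BeautifulSoup so that it runs without extra installs. The class name DetailLinkCollector and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE = "https://www.nekretnine.rs"

class DetailLinkCollector(HTMLParser):
    """Collect hrefs of <a> tags nested inside a preview-box <div>."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # > 0 while inside a preview-box div
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Count nested divs too, so the matching </div> is found correctly
            if self.depth or "placeholder-preview-box" in classes:
                self.depth += 1
        elif tag == "a" and self.depth and attrs.get("href"):
            self.links.append(urljoin(BASE, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

# Hypothetical fragment of the listing page
sample = """
<div class="row offer">
  <div class="placeholder-preview-box ratio-4-3">
    <a href="/stambeni-objekti/stanovi/arena-bulevar-arsenija-carnojevica-97m-2-lode-energoprojekt/NkvJK0Ou5tV/"><img src="x.jpg"></a>
  </div>
  <a href="/somewhere-else/">not a listing link</a>
</div>
"""

collector = DetailLinkCollector()
collector.feed(sample)
for link in collector.links:
    print(link)
```

Each collected URL can then be requested and parsed in the same way as the listing page.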
You can iterate over the links provided by the pagination at the bottom of the page:
from bs4 import BeautifulSoup as soup
import requests
d = soup(requests.get('https://www.nekretnine.rs/stambeni-objekti/stanovi/lista/po-stranici/10/').text, 'html.parser')
def scrape_page(page):
    return [
        {
            'title': i.h2.get_text(strip=True),
            'loc': i.p.get_text(strip=True),
            'price': i.find('p', {'class':'offer-price'}).get_text(strip=True),
        }
        for i in page.find_all('div', {'class':'row offer'})
    ]
result = [scrape_page(d)]
while d.find('a', {'class':'pagination-arrow arrow-right'}):
    d = soup(requests.get(f'https://www.nekretnine.rs{d.find("a", {"class":"pagination-arrow arrow-right"})["href"]}').text, 'html.parser')
    result.append(scrape_page(d))

Python 3 Scrape yellow Pages

I'm trying to scrape data from Yellow Pages, but I can't get the text of each business name and its address/phone. I'm using the code below; where am I going wrong? For now I only print the text of each business so I can check it while testing, and once that works I'm going to save the data to CSV.
import csv
import requests
from bs4 import BeautifulSoup
#dont worry about opening this file
"""with open('cities_louisiana.csv','r') as cities:
    lines = cities.read().splitlines()
cities.close()"""
for city in lines:
    print(city)
url = "http://www.yellowpages.com/search?search_terms=businesses&geo_location_terms=amite+LA&page="+str(count)
for city in lines:
    for x in range (0, 50):
        print("http://www.yellowpages.com/search?search_terms=businesses&geo_location_terms=amite+LA&page="+str(x))
        page = requests.get("http://www.yellowpages.com/search?search_terms=businesses&geo_location_terms=amite+LA&page="+str(x))
        soup = BeautifulSoup(page.text, "html.parser")
        name = soup.find_all("div", {"class": "v-card"})
        for name in name:
            try:
                print(name.contents[0]).find_all(class_="business-name").text
                #print(name.contents[1].text)
            except:
                pass
You should iterate over search results, then, for every search result locate the business name (the element with the "business-name" class) and the address (the element with the "adr" class):
for result in soup.select(".search-results .result"):
    name = result.select_one(".business-name").get_text(strip=True, separator=" ")
    address = result.select_one(".adr").get_text(strip=True, separator=" ")
    print(name, address)
.select() and .select_one() are handy CSS selector methods.
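Since the end goal is a CSV file, here is a minimal sketch of that last step using the standard csv module. It assumes the loop above has been changed to collect (name, address) tuples in a list instead of printing them. The function name `write_results`, the column headers and the sample row are made up for illustration.

```python
import csv
import io

def write_results(rows, fileobj):
    """Write (name, address) pairs to an open text file as CSV."""
    writer = csv.writer(fileobj)
    writer.writerow(["name", "address"])
    writer.writerows(rows)

# Placeholder row, not real scraped data
sample_rows = [("Example Business", "1 Example St, Amite, LA")]

# An in-memory buffer stands in for a real file here. With a real file,
# open it as open("results.csv", "w", newline="") as the csv docs recommend.
buffer = io.StringIO()
write_results(sample_rows, buffer)
print(buffer.getvalue())
```

The writer quotes the address automatically because it contains commas, so the file reads back as two columns.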
