Price comparison - python - python

Hi guys i am trying to create a program in python that compares prices from websites but i cant get the prices. I have managed to ge the title of the product and the quantity using the code bellow.
page = requests.get(urls[7],headers=Headers)
soup = BeautifulSoup(page.text, 'html.parser')
title = soup.find("h1",{"class" : "Titlestyles__TitleStyles-sc-6rxg4t-0 fDKOTS"}).get_text().strip()
quantity = soup.find("li", class_="quantity").get_text().strip()
total_price = soup.find('div', class_='Pricestyles__ProductPriceStyles-sc-118x8ec-0 fzwZWj price')
print(title)
print(quantity)
print(total_price)
Iam trying to get the price from this website (Iam creating a program do look for diper prices lol) https://www.drogasil.com.br/fralda-huggies-tripla-protecao-tamanho-m.html .
the price is not coming even if i get the text it always says that its nonetype.

Some of the information is built up via javascript from data stored in <script> sections in the HTML. You can access this directly by searching for it and using Python's JSON library to decode it into a Python structure. For example:
from bs4 import BeautifulSoup
import requests
import json
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}
url = 'https://www.drogasil.com.br/fralda-huggies-tripla-protecao-tamanho-m.html'
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')
script = soup.find('script', type='application/ld+json')
data = json.loads(script.text)
title = data['name']
total_price = data['offers']['price']
quantity = soup.find("li", class_="quantity").get_text().strip()
print(title)
print(quantity)
print(total_price)
Giving you:
HUGGIES FRALDAS DESCARTAVEL INFANTIL TRIPLA PROTECAO TAMANHO M COM 42 UNIDADES
42 Tiras
38.79
I recommend you add print(data) to see what other information is available.

Related

Why am I not seeing any results in my output from extracting indeed data using python

I am trying to run this code in idle 3.10.6 and I am not seeing any kind of data that should be extracted from Indeed. All this data should be in the output when I run it but it isn't. Below is the input statement
#Indeed data
import requests
from bs4 import BeautifulSoup
import pandas as pd
def extract(page):
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko"}
url = "https://www.indeed.com/jobs?q=Data&l=United+States&sc=0kf%3Ajt%28internship%29%3B&vjk=a2f49853f01db3cc={page}"
r = requests.get(url,headers)
soup = BeautifulSoup(r.content, "html.parser")
return soup
def transform(soup):
divs = soup.find_all("div", class_ = "jobsearch-SerpJobCard")
for item in divs:
title = item.find ("a").text.strip()
company = item.find("span", class_="company").text.strip()
try:
salary = item.find("span", class_ = "salarytext").text.strip()
finally:
salary = ""
summary = item.find("div",{"class":"summary"}).text.strip().replace("\n","")
job = {
"title":title,
"company":company,
'salary':salary,
"summary":summary
}
joblist.append(job)
joblist = []
for i in range(0,40,10):
print(f'Getting page, {i}')
c = extract(10)
transform(c)
df = pd.DataFrame(joblist)
print(df.head())
df.to_csv('jobs.csv')
Here is the output I get
Getting page, 0
Getting page, 10
Getting page, 20
Getting page, 30
Empty DataFrame
Columns: []
Index: []
Why is this going on and what should I do to get that extracted data from indeed? What I am trying to get is the jobtitle,company,salary, and summary information. Any help would be greatly apprieciated.
The URL string includes {page}, bit it's not an f-string, so it's not being interpolated, and the URL you are fetching is:
https://www.indeed.com/jobs?q=Data&l=United+States&sc=0kf%3Ajt%28internship%29%3B&vjk=a2f49853f01db3cc={page}
That returns an error page.
So you should add an f before opening quote when you set url.
Also, you are calling extract(10) each time, instead of extract(i).
This is the correct way of using url
url = "https://www.indeed.com/jobs?q=Data&l=United+States&sc=0kf%3Ajt%28internship%29%3B&vjk=a2f49853f01db3cc={page}".format(page=page)
r = requests.get(url,headers)
here r.status_code gives an error 403 which means the request is forbidden.The site will block your request from fullfilling.use indeed job search Api

How to extract movie genre from Metacritic website using BeautifulSoup

I want to do this for the top 500 movies of Metacritic found at https://www.metacritic.com/browse/movies/score/metascore/all/filtered?sort=desc
Each genre will be extracted from a detail link like this(for the first one): https://www.metacritic.com/movie/citizen-kane-1941/details
Just need some help on the extraction of the genre part from the HTML from the above-detailed link
My get_genre function (but I get an attribute error)
def get_genre(detail_link):
detail_page = requests.get(detail_link, headers = headers)
detail_soup = BeautifulSoup(detail_page.content, "html.parser")
try:
#time.sleep(1)
table=detail_soup.find('table',class_='details',summary=movie_name +" Details and Credits")
#print(table)
gen_line1=table.find('tr',class_='genres')
#print(gen_line1)
gen_line=gen_line1.find('td',class_='data')
#print(gen_line)
except:
time.sleep(1)
year=detail_soup.find(class_='release_date')
year=year.findAll('span')[-1]
year=year.get_text()
year=year.split()[-1]
table=detail_soup.find('table',class_='details',summary=movie_name +" ("+ year +")"+" Details and Credits")
#print(table)
gen_line1=table.find('tr',class_='genres')
#print(gen_line1)
gen_line=gen_line1.find('td',class_='data')
genres=[]
for line in gen_line:
genre = gen_line.get_text()
genres.append(genre.strip())
genres=list(set(genres))
genres=(str(genres).split())
return genres
you're too much focused on getting the table. just use what elements you're sure about. here's an example with select
from bs4 import BeautifulSoup
import requests
headers={'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_0) AppleWebKit/536.1 (KHTML, like Gecko) Chrome/58.0.849.0 Safari/536.1'}
detail_link="https://www.metacritic.com/movie/citizen-kane-1941/details"
detail_page = requests.get(detail_link, headers = headers)
detail_soup = BeautifulSoup(detail_page.content, "html.parser")
genres=detail_soup.select('tr.genres td.data span')
print([genre.text for genre in genres])
>>> ['Drama', 'Mystery']

BeautifulSoup: how to get the value of the price from the webpage's source code if there is no id within the source code for the price

I am doing web scraping in Python with BeautifulSoup and wondering if I there is a way of getting the value of a cell when it has no id. The code is as below:
from bs4 import BeautifulSoup
import requests
import time
import datetime
URL = "https://www.amazon.co.uk/Got-Data-MIS-Business-Analyst/dp/B09F319PK2/ref=sr_1_1?keywords=funny+got+data+mis+data+systems+business+analyst+tshirt&qid=1636481904&qsid=257-9827493-6142040&sr=8-1&sres=B09F319PK2%2CB09F33452D%2CB08MCBFLHC%2CB07Y8Z4SF8%2CB07GJGXY7P%2CB07Z2DV1C2%2CB085MZDMZ8%2CB08XYL6GRM%2CB095CXJ226%2CB08JDMYMPV%2CB08525RB37%2CB07ZDNR6MP%2CB07WL5JGPH%2CB08Y67YF63%2CB07GD73XD8%2CB09JN7Z3G2%2CB078W9GXJY%2CB09HVDRJZ1%2CB07JD7R6CB%2CB08JDKYR6Q&srpt=SHIRT"
headers = {"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 14092.77.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.107 Safari/537.36"}
page = requests.get(URL, headers = headers)
soup1 = BeautifulSoup(page.content, "html.parser")
soup2 = BeautifulSoup(soup1.prettify(), "html.parser")
title = soup2.find(id="productTitle").get_text()
price = soup2.find(id="priceblock_ourprice").get_text()
print(title)
print(price)
For this page, you have to select the garment size before the price is displayed. We can get the price from the dropdown list of sizes which is a SELECT with id = "dropdown_selected_size_name"
First let's get a list of the options in the SELECT dropdown:
options = soup2.find(id='variation_size_name').select('select option')
Then we can get the price say for size 'Large'
for opt in options:
if opt.get('data-a-html-content', '') == 'Large':
print(opt['value'])
or a little more succinctly:
print([opt['value'] for opt in options if opt.get('data-a-html-content', '') == 'Large'][0])

Webscraping information could be in 1 of 3 places on a page, not sure how to use an if statement to eliminate the nonetype results

The following code is scraping data from an amazon product page. I have since discovered that the data for the price could be in 1 of 3 places depending on the type of product and how the price has been added to the page. the 2 other CSS selectors are not present.
for the website url = f'https://www.amazon.co.uk/dp/B083PHB6XX' I am using price = soup.find('span', {'id': 'priceblock_ourprice'}).text.strip()
However, for the website url = f'https://www.amazon.co.uk/dp/B089SQHDMR' I need to use price = soup.find('span', {'id':'priceblock_pospromoprice'}).text.strip()
Finally, for the website url = f'https://www.amazon.co.uk/dp/B0813QVVSW' I need to use price = soup.find('span', {'class':'a-size-base a-color-price'}).text.strip()
I think a solution to hand this is for each url, try to find priceblock_ourprice and if that fails, try to find priceblock_pospromoprice and if that fails, go on to find a-size base a-color-price. I am not able to understand how to put this into an if statement which will not stop when an element is not found.
My original code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36"}
urls = [f'https://www.amazon.co.uk/dp/B083PHB6XX', f'https://www.amazon.co.uk/dp/B089SQHDMR', f'https://www.amazon.co.uk/dp/B0813QVVSW']
for url in urls:
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
name = soup.find('span', {'id': 'productTitle'}).text.strip()
price = soup.find('span', {'id': 'priceblock_ourprice'}).text.strip()
print(name)
print(price)
You were already on the right track with your thoughts. First check the existence of the element in the conditions and then set the value for the price.
if soup.find('span', {'id': 'priceblock_ourprice'}):
price = soup.find('span', {'id': 'priceblock_ourprice'}).text.strip()
elif soup.find('span', {'id':'priceblock_pospromoprice'}):
price = soup.find('span', {'id':'priceblock_pospromoprice'}).text.strip()
elif soup.find('span', {'class':'a-size-base a-color-price'}):
price = soup.find('span', {'class':'a-size-base a-color-price'}).text.strip()
else:
price = ''
Example
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36"}
urls = [f'https://www.amazon.co.uk/dp/B083PHB6XX', f'https://www.amazon.co.uk/dp/B089SQHDMR', f'https://www.amazon.co.uk/dp/B0813QVVSW']
for url in urls:
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
name = soup.find('span', {'id': 'productTitle'}).text.strip()
if soup.find('span', {'id': 'priceblock_ourprice'}):
price = soup.find('span', {'id': 'priceblock_ourprice'}).text.strip()
elif soup.find('span', {'id':'priceblock_pospromoprice'}):
price = soup.find('span', {'id':'priceblock_pospromoprice'}).text.strip()
elif soup.find('span', {'class':'a-size-base a-color-price'}):
price = soup.find('span', {'class':'a-size-base a-color-price'}).text.strip()
else:
price = ''
print(name)
print(price)
Output
Boti 36475 Shark Hand Puppet Baby Shark with Sound Function and Speed Control, Approx. 27 x 21 x 14 cm, Soft Polyester, Battery Operated
£19.50
LEGO 71376 Super Mario Thwomp Drop Expansion Set
£34.99
LEGO 75286 Star Wars General Grievous’s Starfighter Set
£86.99

Scraping live stock price using google finance

I have recently started learning python and one of my first project is to get live stock prices from Google finance using beautifulsoup. Basically I am looking up for a stock and setting a price alert.
here is what my code looks like.
import requests
import time
import tkinter
from bs4 import BeautifulSoup
def st_Price(symbol):
baseurl = 'http://google.com/finance/quote/'
URL = baseurl + symbol + ":NSE?hl=en&gl=in"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(class_="YMlKec fxKbKc")
result = results.__str__()
#print(result)
res = result.split("₹")[1].split("<")[0]
res_flt = float(res.replace(",",""))
return res_flt
def main():
sym = input("Enter Stock Symbol : ")
price = input("Enter desired price : ")
x = st_Price(sym)
while x < float(price):
print(x)
t1 = time.perf_counter()
x = st_Price(sym)
t2 = time.perf_counter()
print("Internal refresh time is {}".format(t2-t1))
else:
print("The Stock {} achieved price greater than {}".format(sym,x))
root = tkinter.Tk()
root.geometry("150x150")
tkinter.messagebox.showinfo(title="Price Alert",message="Stock Price {} greater Than {}".format(x,price))
root.destroy()
if __name__ == "__main__":
main()
I am looking up following class in the Page HTML:
HTML element for the Stock
The code works perfectly fine but it takes too much time to fetch the information:
Enter Stock Symbol : INFY
Enter desired price : 1578
1574.0
Internal refresh time is 9.915285099999892
1574.0
Internal refresh time is 7.2284357999997155
I am not too much familiar with HTML. By referring online documentation I was able to figure out how to scrape necessary part.
Is there any way to reduce the time to fetch the data ?
Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser.
Also, when using the requests library, the default requests user-agent is python-requests so websites understand that it's a bot or a script that sends a request, not a real user. Check what's your user-agent and pass it request headers.
To get just the current price you would need to use such CSS selector AHmHk .fxKbKc via the select_one() bs4 method, which could also change in the future.
from bs4 import BeautifulSoup
import requests, lxml
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
}
html = requests.get(f"https://www.google.com/finance/quote/INFY:NSE", headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")
current_price = soup.select_one(".zzDege").text
print(current_price)
# ₹1,860.50
Code and full example in the online IDE to scrape current price and right panel data:
from bs4 import BeautifulSoup
import requests, lxml, json
from itertools import zip_longest
def scrape_google_finance(ticker: str):
# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
# https://www.whatismybrowser.com/detect/what-is-my-user-agent
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
}
html = requests.get(f"https://www.google.com/finance/quote/{ticker}", headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")
ticker_data = {"right_panel_data": {},
"ticker_info": {}}
ticker_data["ticker_info"]["title"] = soup.select_one(".zzDege").text
ticker_data["ticker_info"]["current_price"] = soup.select_one(".AHmHk .fxKbKc").text
right_panel_keys = soup.select(".gyFHrc .mfs7Fc")
right_panel_values = soup.select(".gyFHrc .P6K39c")
for key, value in zip_longest(right_panel_keys, right_panel_values):
key_value = key.text.lower().replace(" ", "_")
ticker_data["right_panel_data"][key_value] = value.text
return ticker_data
data = scrape_google_finance(ticker="INFY:NSE")
# ensure_ascii=False to display Indian Rupee ₹ symbol
print(json.dumps(data, indent=2, ensure_ascii=False))
print(data["right_panel_data"].get("ceo"))
Outputs:
{
"right_panel_data": {
"previous_close": "₹1,882.95",
"day_range": "₹1,857.15 - ₹1,889.60",
"year_range": "₹1,311.30 - ₹1,953.90",
"market_cap": "7.89T INR",
"p/e_ratio": "36.60",
"dividend_yield": "1.61%",
"primary_exchange": "NSE",
"ceo": "Salil Parekh",
"founded": "Jul 2, 1981",
"headquarters": "Bengaluru, KarnatakaIndia",
"website": "infosys.com",
"employees": "292,067"
},
"ticker_info": {
"title": "Infosys Ltd",
"current_price": "₹1,860.50"
}
}
Salil Parekh
If you want to scrape more data with a line-by-line explanation, there's a Scrape Google Finance Ticker Quote Data in Python blog post of mine.

Categories

Resources