How do I properly use the find function from BeautifulSoup4 in Python 3? - python

I'm following a YouTube tutorial on how to scrape an Amazon product page. First I'm trying to get the product title. Later I want to get the Amazon price and the second-hand price. For this I'm using requests and bs4. Here is the code so far:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.de/Teenage-Engineering-Synthesizer-FM-Radio-AMOLED-Display/dp/B00CXSJUZS/ref=sr_1_1_sspa?__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&dchild=1&keywords=op-1&qid=1594672884&sr=8-1-spons&psc=1&smid=A1GQGGPCGF8PV9&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUFEMUZSUjhQMUM3NTkmZW5jcnlwdGVkSWQ9QTAwMzMwODkyQkpTNUJUUE9QUFVFJmVuY3J5cHRlZEFkSWQ9QTA4MzM4NDgxV1Y3UzVVN1lXTUZKJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}
page = requests.get(URL,headers=headers)
soup = BeautifulSoup(page.content,'html.parser')
title = soup.find('span',{'id' : "productTitle"})
print(title)
My title is None, so the find function doesn't find the element with the id "productTitle". But checking the soup shows that there is an element with that id.
So what's wrong with my code?
I also tried:
title = soup.find(id = "productTitle")

Try this:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.de/Teenage-Engineering-Synthesizer-FM-Radio-AMOLED-Display/dp/B00CXSJUZS/ref=sr_1_1_sspa?__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&dchild=1&keywords=op-1&qid=1594672884&sr=8-1-spons&psc=1&smid=A1GQGGPCGF8PV9&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUFEMUZSUjhQMUM3NTkmZW5jcnlwdGVkSWQ9QTAwMzMwODkyQkpTNUJUUE9QUFVFJmVuY3J5cHRlZEFkSWQ9QTA4MzM4NDgxV1Y3UzVVN1lXTUZKJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}
page = requests.get(URL,headers=headers)
soup = BeautifulSoup(page.content,'lxml')
title = soup.find('span',{'id' : "productTitle"})
print(title.text.strip())
You're doing the right thing but are using a "bad" parser. Read more about the differences between parsers here. I prefer lxml but also sometimes use html5lib. I also added
.text.strip()
to the print so that only the title text is printed.
Note: you have to install lxml for Python first!
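As a side note, here is a rough sketch (my own addition, not part of the original answer) of how you could fall back to the built-in parser when lxml isn't installed; it reuses the URL and headers defined above, and parser is just a local helper variable:
import requests
from bs4 import BeautifulSoup
try:
    import lxml  # noqa: F401 -- imported only to check that the parser is available
    parser = 'lxml'
except ImportError:
    parser = 'html.parser'  # built-in fallback, no extra install needed
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, parser)
title = soup.find('span', {'id': 'productTitle'})
if title is not None:
    print(title.text.strip())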

Related

Web Scraping just returns None

I'm trying to make a pop-up program with the MIR4 Draco price, but the price returns None:
import requests
from bs4 import BeautifulSoup
urll = 'https://www.xdraco.com/coin/price/'
headers = {
'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/86.0.4240.198 Safari/537.36"}
site = requests.get(urll, headers=headers)
soup = BeautifulSoup(site.content, 'html5lib')
price = soup.find('span', class_="amount")
print(price)
You won't be able to parse a site that is dynamically loaded using JS, as @jabbson mentioned.
This might be a way to get the data you want.
If you check the network requests being made by the page, you will find that it makes calls to a few different APIs. I found one that might have the info you're looking for. You can make POST requests to this API as shown below...
import requests
import json
headers = {'accept':'application/json, text/plain, */*','user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
html = requests.post('https://api.mir4global.com/wallet/prices/hydra/daily', headers=headers)
output = json.loads(html.text)
# 'output' is a dictionary. If we index the last element, we can get the latest data entry
print(output['Data'][-1])
OUTPUT:
{'CreatedDT': '2022-08-04 21:55:00', 'HydraPrice': '2.1301000000000001', 'HydraAmount': '13434', 'HydraPricePrev': '2.3336000000000001', 'HydraAmountPrev': '5972', 'HydraUSDWemixRate': '2.9401340627166839', 'HydraUSDKLAYRate': '0.29840511595654395', 'USDHydraRate': '6.2627795669928084'}
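If you only need a single value, such as the latest price, you can pull it out of that entry; a small sketch based on the keys visible in the output above:
latest = output['Data'][-1]
# the API returns numbers as strings, so convert before doing arithmetic
print(f"Latest price: {float(latest['HydraPrice'])} (as of {latest['CreatedDT']})")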

BeautifulSoup will only return None

I'm learning Beautiful Soup and I don't know what I could be doing wrong. I'm using soup.find on an id, and I've tried this on multiple different sites, but when I run it, it always returns None.
import requests
from bs4 import BeautifulSoup
site = 'https://www.amazon.com/PlayStation-5-Console/dp/B09DFCB66S'
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'}
def stock_check():
    page = requests.get(site, headers = headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    title = soup.find('span', id = 'productTitle')
    print(title)
stock_check()
There are 3 errors in your code:
1. incorrect locator
2. not invoking text
3. not injecting cookies
Now your code is working fine:
import requests
from bs4 import BeautifulSoup
site = 'https://www.amazon.com/PlayStation-5-Console/dp/B09DFCB66S'
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'}
cookies={'session':'141-2320098-4829807'}
def stock_check():
    page = requests.get(site, headers = headers, cookies=cookies)
    soup = BeautifulSoup(page.content, 'html.parser')
    title = soup.find('span', attrs={'id':'productTitle'})
    print(title.get_text(strip=True))
stock_check()
Output:
PlayStation 5 Console
The answers from HedgeHog and Fazlul are of course correct, but I want to comment on this.
When you scrape something from the web and try to extract tags from the HTML but get nothing, first check the whole HTML document you received to make sure it's what you expected. Personally I just print out soup.prettify() to debug this, as explained in BeautifulSoup's Quick Start:
Another nifty trick, if the HTML is impractical to read, is to paste it into an HTML previewer like this one, and we get the answer quickly.
BeautifulSoup can be a great tool, but you need to pay attention when using it.
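As a quick illustration of that debugging step, here is a small sketch of my own, mirroring the question above (the slice is just to keep the output short):
import requests
from bs4 import BeautifulSoup
site = 'https://www.amazon.com/PlayStation-5-Console/dp/B09DFCB66S'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'}
page = requests.get(site, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
# Dump the document you actually received; if Amazon served a captcha or
# robot-check page instead of the product page, you will see it here.
print(soup.prettify()[:2000])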

My code prints None when trying to web scrape

I'm a beginner who just started learning Python a week ago. I was trying to get the product title for a specific product on Amazon, but when I run my code it prints "None" instead of printing the title. Any help?
import requests
from bs4 import BeautifulSoup
url = 'https://www.amazon.com/Sony-ILCE7SM2-mount-Camera-Full-Frame/dp/B0158SRJVQ/ref=sr_1_1?dchild=1&keywords=a7s&qid=1589917834&sr=8-1'
headers = {
    'user_agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id='productTitle')
print(title)

Find a key inside a script tag with BeautifulSoup

What I'm trying to do is acquire a product ID from a script tag inside an HTML document. Unfortunately, StockX doesn't offer a public API, so I have to scrape the data from an HTML document. Here are my attempts at it (both work):
Attempt 1
import requests
PRODUCT_URL = 'https://stockx.com/supreme-steiff-bear-heather-grey'
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
response = requests.get(url=PRODUCT_URL, headers=HEADERS).text
PRODUCT_ID = response[response.find('"product":{"id":"')+17:].partition('"')[0]
PRODUCT_NAME = response[response.find('<title>')+7:].partition('<')[0]
Attempt 2
from bs4 import BeautifulSoup
import requests
# Gets HTML document
PRODUCT_URL = 'https://stockx.com/supreme-steiff-bear-heather-grey'
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
html_content = requests.get(url=PRODUCT_URL, headers=HEADERS)
# Make BeautifulSoup parser from HTML document
soup = BeautifulSoup(html_content.text, 'html.parser')
# Get product name
PRODUCT_NAME = soup.title.text
# Get script tag data with product ID
js_content = soup.find_all('script', type='text/javascript')[9].text
PRODUCT_ID = js_content[50:86]
print(PRODUCT_ID)
Output:
884861d2-abe6-4b09-90ff-c8ad1967ac8c
However, I feel like there is a better approach to this problem instead of just "hard-coding" in where to find the ID.
If you view the page source of the product URL and do a search for "product":{"id":, you will find that the ID is inside a nested dictionary that is assigned to an object inside a script tag.
Is there any better way to go about obtaining the product ID from an HTML document?
EDIT: Here is the content of html_content: https://gist.github.com/leecharles50/9b6b11fb458767cabcfc0ed4f961984d
My first idea was to parse the JavaScript inside the script tag. There is a package called slimit that can do this. See for example this answer.
However, in your case there is an even easier solution. I searched the DOM for the id you gave (884861d2-abe6-4b09-90ff-c8ad1967ac8) and found an occurrence inside the following tag:
<script type="application/ld+json">
{
[...]
"sku" : "884861d2-abe6-4b09-90ff-c8ad1967ac8c",
[...]
}
</script>
which contains valid JSON. Simply find the tag with BeautifulSoup:
tag = soup('script', {'type': 'application/ld+json'})[-1]
and decode the JSON within:
import json
product_id = json.loads(tag.text)['sku']
As you can see from the product URL, this has been tested on multiple product pages.
import requests
import json
from bs4 import BeautifulSoup
#product_url = 'https://stockx.com/supreme-steiff-bear-heather-grey'
product_url = 'https://stockx.com/air-jordan-1-retro-high-shattered-backboard-3'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
html_content = requests.get(url=product_url, headers=headers)
soup = BeautifulSoup(html_content.text, 'lxml')
script_tags = soup.find_all('script', attrs={'type': 'application/ld+json'})
product_info_text = script_tags[-1].text
# contains a bunch of useful info
product_info_json = json.loads(product_info_text, strict=False)
print(json.dumps(product_info_json, indent=4))
product_sku = product_info_json['sku']
print(product_sku)
I will try to implement this using a SoupStrainer.
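For reference, a rough sketch of what that could look like (my own assumption of the intent; note that SoupStrainer restricts parsing to matching tags and does not work with the html5lib parser):
import json
import requests
from bs4 import BeautifulSoup, SoupStrainer
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
html = requests.get('https://stockx.com/supreme-steiff-bear-heather-grey', headers=headers).text
# Only parse the ld+json script tags instead of the whole document
only_ld_json = SoupStrainer('script', attrs={'type': 'application/ld+json'})
soup = BeautifulSoup(html, 'html.parser', parse_only=only_ld_json)
product_info = json.loads(soup.find_all('script')[-1].text, strict=False)
print(product_info['sku'])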
Here is an alternative using regex:
import requests
import re
product_uuid = re.compile(r'"product":{"id":"(\w{8}-(?:\w{4}-){3}\w{12}){1}"')
product_name = re.compile(r'<title>(.*)</title>')
url = 'https://stockx.com/supreme-steiff-bear-heather-grey'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
content = requests.get(url, headers=headers)
if content.ok:
    PRODUCT_NAME = product_name.findall(content.text)[0]
    PRODUCT_UUID = product_uuid.findall(content.text)[0]
    print(PRODUCT_NAME)
    print(PRODUCT_UUID)
Slightly hard-coded, but easy to adjust, and it depends only on re and requests.
If you want to scrape at large volume, you can use an API such as Piloterr.

How to get live wind data from a site in Python

Hi, I am writing a Python script that takes the live wind data from a given site for where I live. If I use the following code I get a None value, but on the website there is information at the given position.
I tried this code:
import requests
from bs4 import BeautifulSoup
link = 'http://www.actuelewind.nl/?stationcode=6308#SpotPage'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'}
def checkwind():
    pagina = requests.get(link, headers=headers)
    soup = BeautifulSoup(pagina.content, 'html.parser')
    windsnelheid = soup.find('div', attrs={"id": "spotInfoWindsnelheidMS"})
    print(windsnelheid)
checkwind()
Can anyone show me how to get live wind from this website?
