Hello developers,
I am trying to learn how to scrape, and I came across this website here.
I want to get the link to the map using the requests and BeautifulSoup libraries.
This is what I have done so far:
import re
import requests
from bs4 import BeautifulSoup

URL = 'https://www.zoopla.com/new-homes/details/61239885/?search_identifier=0dcdcfea4b6e6e84e1a93c25c4c0d808'
headers = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36'}
res = requests.get(URL, timeout=15, headers=headers)
soup = BeautifulSoup(res.content, 'html.parser')
soup.find('div', class_=re.compile('MapInner'))
<div class="css-1md2b5a-MapInner e74mx470"></div>
I am able to locate the parent tag, but it contains no img tag. I believe this is because the image has not been rendered yet.
What can be done here if I don't want to use Selenium?
Related
I am trying to build a simple web scraper to monitor Nike's site here in Brazil.
Basically, I want to track products that are in stock right now, so I can check when new products are added.
My problem is that when I navigate to the site https://www.nike.com.br/snkrs#estoque I see different products compared to what I get with the Python requests library.
Here is the code I am using:
import requests
from bs4 import BeautifulSoup
headers ={
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
url = 'https://www.nike.com.br/snkrs#estoque'
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
len(soup.find_all(class_='produto produto--comprar'))
This code gives me 40, but using the browser I can see 56 products https://prnt.sc/26jeo1i
The data comes from a different URL and is split across 3 pages.
import requests
from bs4 import BeautifulSoup
headers ={
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
productList = []
for p in [1,2,3]:
    url = f'https://www.nike.com.br/Snkrs/Estoque?p={p}&demanda=true'
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    productList += soup.find_all(class_='produto produto--comprar')
Output:
print(len(productList))
56
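If the number of pages is not known up front, one option is to keep requesting pages until one comes back with no products. Below is a minimal sketch of that idea. The names collect_products and fake_fetch and the max_pages limit are made up for illustration, and the assumption that an empty page marks the end of the listing has not been checked against the Nike site. The fetch function is passed in so the sketch runs offline.

```python
from bs4 import BeautifulSoup


def collect_products(fetch_page, max_pages=50):
    """Gather product tags page by page until a page has none."""
    products = []
    for page_number in range(1, max_pages + 1):
        html = fetch_page(page_number)
        soup = BeautifulSoup(html, 'html.parser')
        found = soup.find_all(class_='produto produto--comprar')
        if not found:
            break  # assumption: an empty page means the listing has ended
        products += found
    return products


# Hypothetical stand-in for requests.get(...).content, so the sketch runs offline.
def fake_fetch(page_number):
    pages = {
        1: '<div class="produto produto--comprar">A</div>'
           '<div class="produto produto--comprar">B</div>',
        2: '<div class="produto produto--comprar">C</div>',
    }
    return pages.get(page_number, '<html></html>')


print(len(collect_products(fake_fetch)))
```

Against the real site, fetch_page would wrap the requests.get call from the loop above, for example a small function that builds the ?p={p}&demanda=true URL and returns page.content.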
I'm learning Beautiful Soup and I don't know what I could be doing wrong. I'm using soup.find on an id, and I've tried this on multiple different sites, but it always returns None.
import requests
from bs4 import BeautifulSoup
site = 'https://www.amazon.com/PlayStation-5-Console/dp/B09DFCB66S'
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'}
def stock_check():
    page = requests.get(site, headers = headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    title = soup.find('span', id = 'productTitle')
    print(title)

stock_check()
There are 3 problems in your code:
1. Incorrect locator
2. Not extracting the text
3. Not sending cookies
With these changes, the code works:
import requests
from bs4 import BeautifulSoup
site = 'https://www.amazon.com/PlayStation-5-Console/dp/B09DFCB66S'
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'}
cookies={'session':'141-2320098-4829807'}
def stock_check():
    page = requests.get(site, headers = headers,cookies=cookies)
    soup = BeautifulSoup(page.content, 'html.parser')
    title = soup.find('span', attrs={'id':'productTitle'})
    print(title.get_text(strip=True))

stock_check()
Output:
PlayStation 5 Console
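One caveat: title.get_text(...) raises AttributeError whenever soup.find returns None, for example if the server sends back a different page than expected. Here is a small sketch that guards against that. The helper name extract_title and both sample pages are hypothetical and stand in for a live response.

```python
from bs4 import BeautifulSoup


def extract_title(html):
    """Return the stripped text of span#productTitle, or None if it is absent."""
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.find('span', id='productTitle')
    # soup.find returns None when nothing matches, so check before reading text.
    if title is None:
        return None
    return title.get_text(strip=True)


# Hypothetical sample pages, used here instead of a live request.
product_page = '<html><body><span id="productTitle">  PlayStation 5 Console  </span></body></html>'
blocked_page = '<html><body><p>Enter the characters you see below</p></body></html>'

print(extract_title(product_page))
print(extract_title(blocked_page))
```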
The answers of HedgeHog and Fazlul are of course correct, but I want to comment on this.
When you scrape something from the web and try to extract tags from the HTML but get nothing, first check the whole HTML document you received to make sure it's what you expected. Personally, I just print out soup.prettify() to debug this, as explained in BeautifulSoup's Quick Start.
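As a rough sketch of that check (the function name describe_response, the preview_chars parameter and the sample HTML are invented for illustration), you can test whether the id appears in the raw markup at all and look at the start of the prettified document:

```python
from bs4 import BeautifulSoup


def describe_response(html, element_id, preview_chars=300):
    """Summarise whether element_id is present in the HTML that was received."""
    soup = BeautifulSoup(html, 'html.parser')
    return {
        # Is the id anywhere in the raw markup at all?
        'id_in_raw_html': element_id in html,
        # Does the parser expose an element with that id?
        'found_by_parser': soup.find(id=element_id) is not None,
        # A short, indented preview of what the server actually sent.
        'preview': soup.prettify()[:preview_chars],
    }


# Hypothetical response body standing in for page.text.
html = '<html><head><title>Robot Check</title></head><body><p>Please verify.</p></body></html>'
report = describe_response(html, 'productTitle')
print(report['id_in_raw_html'], report['found_by_parser'])
print(report['preview'])
```

If the id is missing from the raw HTML, the problem is with the response (blocking, JavaScript rendering, a different page), not with the find call.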
Another nifty trick if the HTML is impractical to read is to paste it into an HTML previewer like this one, which gives the answer quickly.
BeautifulSoup can be a great tool, but you need to pay attention when using it.
I'm following a YouTube tutorial on how to scrape an Amazon product page. First, I'm trying to get the product title. Later I want to get the Amazon price and the second-hand price. For this I'm using requests and bs4. Here is the code so far:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.de/Teenage-Engineering-Synthesizer-FM-Radio-AMOLED-Display/dp/B00CXSJUZS/ref=sr_1_1_sspa?__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&dchild=1&keywords=op-1&qid=1594672884&sr=8-1-spons&psc=1&smid=A1GQGGPCGF8PV9&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUFEMUZSUjhQMUM3NTkmZW5jcnlwdGVkSWQ9QTAwMzMwODkyQkpTNUJUUE9QUFVFJmVuY3J5cHRlZEFkSWQ9QTA4MzM4NDgxV1Y3UzVVN1lXTUZKJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}
page = requests.get(URL,headers=headers)
soup = BeautifulSoup(page.content,'html.parser')
title = soup.find('span',{'id' : "productTitle"})
print(title)
My title is None, so the find function doesn't find the element with the id "productTitle". But checking the soup shows that there is an element with that id.
So what's wrong with my code?
I also tried:
title = soup.find(id = "productTitle")
Try this:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.de/Teenage-Engineering-Synthesizer-FM-Radio-AMOLED-Display/dp/B00CXSJUZS/ref=sr_1_1_sspa?__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&dchild=1&keywords=op-1&qid=1594672884&sr=8-1-spons&psc=1&smid=A1GQGGPCGF8PV9&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUFEMUZSUjhQMUM3NTkmZW5jcnlwdGVkSWQ9QTAwMzMwODkyQkpTNUJUUE9QUFVFJmVuY3J5cHRlZEFkSWQ9QTA4MzM4NDgxV1Y3UzVVN1lXTUZKJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}
page = requests.get(URL,headers=headers)
soup = BeautifulSoup(page.content,'lxml')
title = soup.find('span',{'id' : "productTitle"})
print(title.text.strip())
You are doing the right thing but have a "bad" parser. Read more about the differences between parsers here. I prefer lxml but also sometimes use html5lib. I also added .text.strip() to the print so only the title text is printed.
Note: you have to install lxml for python first!
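If you are not sure lxml is installed everywhere the script runs, a possible approach is to fall back to the built-in parser. This is only a sketch: make_soup is a made-up helper and the sample HTML is invented. FeatureNotFound is the exception BeautifulSoup raises when the requested parser is unavailable. Keep in mind that the fallback parser may build a different tree for messy markup, which is the very difference described above.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred='lxml'):
    """Parse with the preferred parser, falling back to the built-in one."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is not installed.
        return BeautifulSoup(markup, 'html.parser')


# Hypothetical sample markup standing in for page.content.
html = '<html><body><span id="productTitle"> OP-1 Synthesizer </span></body></html>'
soup = make_soup(html)
print(soup.find('span', {'id': 'productTitle'}).text.strip())
```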
What I'm trying to do is acquire a product ID from a script tag inside an HTML document. Unfortunately, StockX doesn't offer a public API, so I have to scrape the data from an HTML document. Here are my attempts at it (both work):
Attempt 1
import requests
PRODUCT_URL = 'https://stockx.com/supreme-steiff-bear-heather-grey'
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
response = requests.get(url=PRODUCT_URL, headers=HEADERS).text
PRODUCT_ID = response[response.find('"product":{"id":"')+17:].partition('"')[0]
PRODUCT_NAME = response[response.find('<title>')+7:].partition('<')[0]
Attempt 2
from bs4 import BeautifulSoup
import requests
# Gets HTML document
PRODUCT_URL = 'https://stockx.com/supreme-steiff-bear-heather-grey'
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
html_content = requests.get(url=PRODUCT_URL, headers=HEADERS)
# Make BeautifulSoup parser from HTML document
soup = BeautifulSoup(html_content.text, 'html.parser')
# Get product name
PRODUCT_NAME = soup.title.text
# Get script tag data with product ID
js_content = soup.find_all('script', type='text/javascript')[9].text
PRODUCT_ID = js_content[50:86]
print(PRODUCT_ID)
Output:
884861d2-abe6-4b09-90ff-c8ad1967ac8c
However, I feel like there is a better approach to this problem instead of just "hard-coding" in where to find the ID.
If you view the page source of the product URL and do a search for "product":{"id":, you will find that the ID is inside a nested dictionary that is assigned to an object and inside a script tag.
Is there any better way to go about obtaining the product ID from an HTML document?
EDIT: Here is the content of html_content: https://gist.github.com/leecharles50/9b6b11fb458767cabcfc0ed4f961984d
My first idea was to parse the JavaScript inside the script tag. There is a package called slimit that can do this. See for example this answer.
However, in your case there is an even easier solution. I searched the DOM for the id you gave (884861d2-abe6-4b09-90ff-c8ad1967ac8c) and found an occurrence inside the following tag:
<script type="application/ld+json">
{
[...]
"sku" : "884861d2-abe6-4b09-90ff-c8ad1967ac8c",
[...]
}
</script>
which contains valid JSON. Simply find the tag with BeautifulSoup:
tag = soup('script', {'type': 'application/ld+json'})[-1]
and decode the JSON within:
import json
product_id = json.loads(tag.text)['sku']
As you can see by the product URL, this has been tested on multiple product pages.
import requests
import json
from bs4 import BeautifulSoup
#product_url = 'https://stockx.com/supreme-steiff-bear-heather-grey'
product_url = 'https://stockx.com/air-jordan-1-retro-high-shattered-backboard-3'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
html_content = requests.get(url=product_url, headers=headers)
soup = BeautifulSoup(html_content.text, 'lxml')
script_tags = soup.find_all('script', attrs={'type': 'application/ld+json'})
product_info_text = script_tags[-1].text
# contains a bunch of useful info
product_info_json = json.loads(product_info_text, strict=False)
print(json.dumps(product_info_json, indent=4))
product_sku = product_info_json['sku']
print(product_sku)
I will try to implement the use of a SoupStrainer.
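A rough sketch of what that could look like is below. SoupStrainer with parse_only tells BeautifulSoup to build a tree from the matching tags only. The helper name find_sku and the sample HTML are invented for illustration. Instead of relying on the last tag ([-1]), this version returns the first JSON-LD block that has a 'sku' key, which is an assumption about what you want.

```python
import json

from bs4 import BeautifulSoup, SoupStrainer


def find_sku(html):
    """Return the first 'sku' found in any application/ld+json script, else None."""
    # parse_only makes BeautifulSoup build a tree from the matching tags only.
    only_json_ld = SoupStrainer('script', attrs={'type': 'application/ld+json'})
    soup = BeautifulSoup(html, 'html.parser', parse_only=only_json_ld)
    for tag in soup.find_all('script'):
        try:
            data = json.loads(tag.string or '', strict=False)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        if isinstance(data, dict) and 'sku' in data:
            return data['sku']
    return None


# Hypothetical page source standing in for html_content.text.
html = '''
<html><head>
<script type="text/javascript">var x = 1;</script>
<script type="application/ld+json">{"@type": "BreadcrumbList"}</script>
<script type="application/ld+json">{"@type": "Product", "sku": "884861d2-abe6-4b09-90ff-c8ad1967ac8c"}</script>
</head><body></body></html>
'''
print(find_sku(html))
```

Note that parse_only is ignored by the html5lib parser, so this sketch sticks to html.parser (lxml should work as well).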
Here is an alternative using regex:
import requests
import re
product_uuid = re.compile(r'"product":{"id":"(\w{8}-(?:\w{4}-){3}\w{12}){1}"')
product_name = re.compile(r'<title>(.*)</title>')
url = 'https://stockx.com/supreme-steiff-bear-heather-grey'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
content = requests.get(url, headers=headers)
if content.ok:
    PRODUCT_NAME = product_name.findall(content.text)[0]
    PRODUCT_UUID = product_uuid.findall(content.text)[0]
    print(PRODUCT_NAME)
    print(PRODUCT_UUID)
It is slightly hard-coded but easy to adjust, and apart from requests it depends only on the standard library.
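One thing to watch for: findall(...)[0] raises IndexError when the pattern does not match. A variation using search that returns None instead might look like this. It is a sketch only: first_group is a made-up helper, the page source is a hypothetical sample, and the patterns are lightly adjusted (the redundant {1} dropped, the title match made non-greedy).

```python
import re

# Same UUID shape as above, without the redundant {1} quantifier.
PRODUCT_UUID = re.compile(r'"product":\{"id":"(\w{8}-(?:\w{4}-){3}\w{12})"')
# Non-greedy so only the first <title> element is captured.
PRODUCT_NAME = re.compile(r'<title>(.*?)</title>', re.DOTALL)


def first_group(pattern, text):
    """Return the first capture group of the first match, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None


# Hypothetical page source standing in for content.text.
page_source = (
    '<html><head><title>Supreme Steiff Bear</title></head>'
    '<script>window.data = {"product":{"id":"884861d2-abe6-4b09-90ff-c8ad1967ac8c"}};</script>'
    '</html>'
)
print(first_group(PRODUCT_NAME, page_source))
print(first_group(PRODUCT_UUID, page_source))
print(first_group(PRODUCT_UUID, '<html></html>'))
```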
If you need to scrape at large volumes, you can use the Piloterr API.
Hi, I am writing a Python script that reads the live wind speed for where I live from a given site. When I run the following code against the website I get None, but the website does show information at that position.
I tried this code:
import requests
from bs4 import BeautifulSoup
link = 'http://www.actuelewind.nl/?stationcode=6308#SpotPage'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'}

def checkwind():
    pagina = requests.get(link, headers=headers)
    soup = BeautifulSoup(pagina.content, 'html.parser')
    windsnelheid = soup.find('div', attrs={"id": "spotInfoWindsnelheidMS"})
    print(windsnelheid)

checkwind()
Can anyone show me how to get live wind from this website?