This web scraper was working for a while, but the website must have been updated, because it no longer works. After each request I get an Access Denied error; I have tried adding headers but still get the same issue. This is what the code prints:
<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>
You don't have permission to access "http://www.jdsports.co.uk/product/white-nike-air-force-1-shadow-womens/15984107/" on this server.<p>
Reference #18.4d4c1002.1616968601.6e2013c
</p></body>
</html>
Here's the part of the code that gets the HTML:
import requests
from bs4 import BeautifulSoup

scraper = requests.Session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
}
# info[0] holds the product URL; proxy_test is a proxy mapping defined elsewhere
html = scraper.get(info[0], proxies=proxy_test, headers=headers).text
soup = BeautifulSoup(html, 'html.parser')
print(soup)
stock = soup.findAll("button", {"class": "btn btn-default"})
What else can I try to fix it? The website I want to scrape is https://www.jdsports.co.uk/
Not sure where you are, but here in the US your code works for me. I just had to use a different product, as the one listed in the URL above didn't exist. I was able to see a list of buttons, and it didn't require headers either.
import requests
from bs4 import BeautifulSoup

url = 'https://www.jdsports.co.uk/product/black-nike-air-force-1-react-lv8-all-stars/16080098/'
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
print(soup.findAll("button", {"class": "btn btn-default"}))
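If the request still comes back with Access Denied from your location, the block may be tied to your IP or region rather than to your headers (the Reference # line in the error page looks like Akamai-style bot protection). As a sketch of something to try, you could send a fuller set of browser-like headers and check the status code before parsing; the extra header values here are illustrative:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-GB,en;q=0.9',
}
page = requests.get(url, headers=headers)
print(page.status_code)  # anything other than 200 means you are still blocked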
I'm trying to check this site for changes in the title:
https://allestoringen.nl/storing/kpn/
import requests
import bs4

url = 'https://allestoringen.nl/storing/kpn/'
source = requests.get(url).text
soup = bs4.BeautifulSoup(source, 'html.parser')
# 'Er is geen storing bij KPN' means 'There is no outage at KPN'
event_string = str(soup.find(text='Er is geen storing bij KPN'))
print(event_string)
However, event_string returns None every time.
The reason you don't get a result might be that the website doesn't accept your request. I got this result:
page = requests.get(url)
page.status_code # 403
page.reason # 'Forbidden'
You might want to take a look at this post for a solution. It is always a good idea to check the return status of your request in your code. But to solve your problem, you might want to check the <title> element instead of searching for a specific string.
# stolen from the post I mentioned
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
page = requests.get(url, headers=headers)
page.status_code  # 200. Adding a header solved the problem.
soup = bs4.BeautifulSoup(page.text, 'html.parser')
# get the title
print(soup.find("title").text)
# prints: KPN storing? Actuele storingen en problemen | Allestoringen
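A compact way to build that status check into a script is raise_for_status(), which raises an exception on any 4xx/5xx response instead of letting you parse an error page. A minimal sketch, reusing the url and headers above:
import requests
import bs4

page = requests.get(url, headers=headers)
page.raise_for_status()  # raises requests.exceptions.HTTPError on 403, 404, ...
soup = bs4.BeautifulSoup(page.text, 'html.parser')
print(soup.find("title").text)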
I am trying to web scrape a website, and when I do I get the output below. Is there a way I can scrape this website?
url = "https://www.mustang6g.com/forums/threads/pre-collision-alert-system.132807/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
print(soup)
The output of the above code is as follows:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>403 Forbidden</title>
</head><body>
<h1>Forbidden</h1>
<p>You don't have permission to access this resource.</p>
</body></html>
The website's server expects a header to be passed:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/75.0.3770.80 Safari/537.36'}
URL = 'https://www.mustang6g.com/forums/threads/pre-collision-alert-system.132807/'
response = requests.get(URL, headers=headers)
print(response.text)
By passing a header, we told the server that we are Mozilla. :)
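From there you can hand the HTML to BeautifulSoup exactly as in the question, for example:
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title.text)  # now the thread title rather than "403 Forbidden"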
I am running the following code to parse an Amazon page using Beautiful Soup in Python, but when I run the print line I keep getting None. I am wondering whether I am doing something wrong, or if there's an explanation/solution to this. Any help will be appreciated.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.ca/Magnetic-Erase-Whiteboard-Bulletin-Board/dp/B07GNVZKY2/ref=sr_1_3_sspa?keywords=whiteboard&qid=1578902710&s=office&sr=1-3-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEzOE5ZSkFGSDdCOFVDJmVuY3J5cHRlZElkPUEwMDM2ODA4M0dWMEtMWkI1U1hJJmVuY3J5cHRlZEFkSWQ9QTA0MDIwMjQxMEUwMzlMQ0pTQVlBJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {"User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id="productTitle")
print(title)
Your code is absolutely correct. There seems to be some issue with the parser that you have used (html.parser). I used html5lib in place of html.parser and the code now works:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.ca/Magnetic-Erase-Whiteboard-BulletinBoard/dp/B07GNVZKY2/ref=sr_1_3_sspa?keywords=whiteboard&qid=1578902710&s=office&sr=1-3-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEzOE5ZSkFGSDdCOFVDJmVuY3J5cHRlZElkPUEwMDM2ODA4M0dWMEtMWkI1U1hJJmVuY3J5cHRlZEFkSWQ9QTA0MDIwMjQxMEUwMzlMQ0pTQVlBJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {"User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html5lib')
title = soup.find(id='productTitle')
print(title)
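Note that html5lib is a separate package and is not bundled with Beautiful Soup; if it is missing, BeautifulSoup raises bs4.FeatureNotFound. Install it first with pip install html5lib.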
More info, not directly related to the answer: for the other answer given to this question, I wasn't asked for a captcha when visiting the page. However, Amazon does change the response content if it detects that a bot is visiting the website: remove the headers from the requests.get() call and inspect page.text, and you will see the altered response. The default headers added by the requests library lead to the request being identified as coming from a bot.
When requesting that page outside of a normal browser environment, it asked for a captcha; I'd assume that's why the element doesn't exist. Amazon probably has specific measures to counter "robots" accessing their pages; I suggest looking at their APIs to see if there's anything helpful instead of scraping the webpages directly.
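If you want to detect that case in code rather than by eye, a rough heuristic is to look for the captcha marker in the response before parsing. A sketch, assuming the URL and headers from the question; the 'captcha' substring check is just a heuristic for Amazon's interstitial page:
import requests
from bs4 import BeautifulSoup

page = requests.get(URL, headers=headers)
if 'captcha' in page.text.lower():
    print('Bot-check page served; try different headers or slow down.')
else:
    soup = BeautifulSoup(page.content, 'html5lib')
    print(soup.find(id='productTitle'))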
I am trying to pull the ingredients list from the following webpage:
https://skinsalvationsf.com/2012/08/updated-comedogenic-ingredients-list/
So the first ingredient I want to pull would be Acetylated Lanolin, and the last ingredient would be Octyl Palmitate.
Looking at the page source for this URL, I can see that the pattern for the ingredients list looks like this:
<td valign="top" width="33%">Acetylated Lanolin <sup>5</sup></td>
So I wrote some code to pull the list, and it is giving me zero results. Below is the code.
import requests
from bs4 import BeautifulSoup

r = requests.get('https://skinsalvationsf.com/2012/08/updated-comedogenic-ingredients-list/')
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('td', attrs={'valign': 'top'})
When I try len(results), it gives me zero. What am I doing wrong? Why am I not able to pull the list as intended? I am a beginner at web scraping.
Your web scraping code is working as intended. However, your request did not work. If you check the status code of your request, you can see that you get a 403 status.
r = requests.get('https://skinsalvationsf.com/2012/08/updated-comedogenic-ingredients-list/')
print(r.status_code) # 403
What happens is that the server does not allow a non-browser request. To make it work, you need to use a header while making the request. This header should be similar to what a browser would send:
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 6.1; WOW64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/56.0.2924.76 Safari/537.36')
}
r = requests.get('https://skinsalvationsf.com/2012/08/updated-comedogenic-ingredients-list/', headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('td', attrs={'valign': 'top'})
print(len(results))
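Once the request succeeds, you can pull the ingredient names out of the matched cells. A sketch; taking only the first text node of each cell drops the <sup> footnote number:
# Each cell looks like: <td valign="top" width="33%">Acetylated Lanolin <sup>5</sup></td>
ingredients = [td.find(text=True).strip() for td in results if td.find(text=True)]
print(ingredients[0])  # Acetylated Lanolin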
Your request was forbidden, hence you cannot crawl the page. The website seems to be blocking scrapers.
print(soup)
<html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<center><h1>403 Forbidden</h1></center>
<hr/><center>nginx</center>
</body>
</html>
I want to create a script to go to https://www.size.co.uk/featured/footwear/ and scrape the content, but somehow when I run the script I get Access Denied. Here is the code:
from urllib import urlopen
from bs4 import BeautifulSoup as BS
url = urlopen('https://www.size.co.uk/')
print BS(url, 'lxml')
The output is:
<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>
You don't have permission to access "http://www.size.co.uk/" on this server.
<p>
Reference #18.6202655f.1498945327.11002828
</p></body>
</html>
When I try it with other websites the code works fine, and when I use Selenium nothing happens, but I still want to know how to bypass this error without using Selenium. However, when I use Selenium on a different website like http://www.footpatrol.co.uk/shop, I get the same Access Denied error. Here is the code for Footpatrol:
from selenium import webdriver

# raw string so the backslashes in the Windows path are not treated as escapes
driver = webdriver.PhantomJS(r'C:\Users\V\Desktop\PY\web_scrape\phantomjs.exe')
driver.get('http://www.footpatrol.com')
pageSource = driver.page_source
soup = BS(pageSource, 'lxml')
print soup
Output is:
<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>
You don't have permission to access "http://www.footpatrol.co.uk/" on this
server.<p>
Reference #18.6202655f.1498945644.110590db
</p></body></html>
import requests
from bs4 import BeautifulSoup as BS

url = 'https://www.size.co.uk/'
agent = {"User-Agent": 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
page = requests.get(url, headers=agent)
print(BS(page.content, 'lxml'))
Try this:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
source = requests.get(url, headers=headers).text
print(source)
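For the Selenium/PhantomJS half of the question, the same idea applies: PhantomJS announces itself in its default User-Agent, and you can override that through the driver's capabilities. A sketch, reusing the phantomjs.exe path from the question:
from selenium import webdriver

caps = webdriver.DesiredCapabilities.PHANTOMJS.copy()
caps['phantomjs.page.settings.userAgent'] = (
    'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36')
driver = webdriver.PhantomJS(r'C:\Users\V\Desktop\PY\web_scrape\phantomjs.exe',
                             desired_capabilities=caps)
driver.get('http://www.footpatrol.com')
print(driver.page_source)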