You don't have permission to access this resource - Python web scraping

I am trying to scrape a website, and when I do, I get the output below. Is there a way I can scrape this website?
import requests
from bs4 import BeautifulSoup

url = "https://www.mustang6g.com/forums/threads/pre-collision-alert-system.132807/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
print(soup)
The output of the above code is as follows:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>403 Forbidden</title>
</head><body>
<h1>Forbidden</h1>
<p>You don't have permission to access this resource.</p>
</body></html>

The website's server expects a User-Agent header to be passed with the request:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/75.0.3770.80 Safari/537.36'}
URL = 'https://www.mustang6g.com/forums/threads/pre-collision-alert-system.132807/'
response = requests.get(URL, headers=headers)
print(response.text)
By passing the headers, we told the server that we are Mozilla. :)
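If you plan to request more than one page from the same site, a requests.Session lets you set the header once and reuse it for every request (a session also keeps cookies between requests, which some forums require). A minimal sketch along those lines, not the only way to do it:
import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/75.0.3770.80 Safari/537.36'
})

url = "https://www.mustang6g.com/forums/threads/pre-collision-alert-system.132807/"
page = session.get(url)
page.raise_for_status()  # raise an HTTPError instead of silently parsing a 403 error page
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.title)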

Related

Webscraping with Python Requests and getting Access Denied even after updating headers

This web scraper was working for a while, but the website must have been updated, because it no longer works. After each request I get an Access Denied error; I have tried adding headers but still get the same issue. This is what the code prints:
<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>
You don't have permission to access "http://www.jdsports.co.uk/product/white-nike-air-force-1-shadow-womens/15984107/" on this server.<p>
Reference #18.4d4c1002.1616968601.6e2013c
</p></body>
</html>
Here's the part of the code that gets the HTML:
scraper = requests.Session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
}
html = scraper.get(info[0], proxies=proxy_test, headers=headers).text
soup = BeautifulSoup(html, 'html.parser')
print(soup)
stock = soup.findAll("button", {"class": "btn btn-default"})
What else can I try to fix it? The website I am trying to scrape is https://www.jdsports.co.uk/.
Not sure where you are, but here in the US your code works for me. I just had to use a different product, as the one in the URL above didn't exist. I was able to see a list of buttons, and it didn't require headers either.
import requests
from bs4 import BeautifulSoup

url = 'https://www.jdsports.co.uk/product/black-nike-air-force-1-react-lv8-all-stars/16080098/'
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
soup.findAll("button", {"class": "btn btn-default"})
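Since the block can depend on where the request comes from, it helps to check the status code before parsing, so an Access Denied page is not mistaken for an empty result. A small sketch, reusing the product URL above:
import requests
from bs4 import BeautifulSoup

url = 'https://www.jdsports.co.uk/product/black-nike-air-force-1-react-lv8-all-stars/16080098/'
page = requests.get(url)
if page.status_code == 403:
    print("Blocked by the server; try browser-like headers or a different IP")
else:
    soup = BeautifulSoup(page.text, "html.parser")
    print(len(soup.findAll("button", {"class": "btn btn-default"})))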

Get content of HTML Header with Beautifulsoup

I'm creating a bot that should retrieve the status of an order.
I started with this:
import requests
from bs4 import BeautifulSoup

nextline = '\n'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
url = "https://footlocker.narvar.com/footlocker/tracking/de-mail?order_number=31900491219XXXXXXX"

def getstatus(url):
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    for EachPart in soup.select('div[class="tracking-status-container status_reposition"]'):
        print(EachPart)

getstatus(url)
But even after several tries, "EachPart" is empty.
Then I noticed that the information I need is not in the HTML body; it is in the header.
So if I just print the soup, I receive:
<head>
var translation = {"comm_sms_auto_response_msg":"........... "widgets_tracking_status_justshipped_status":"Ready to go" }
var xxxxxx
var xxxxxx
var xxxxxx
</head>
<body>
..................
</body>
In the "var translation", there is "widgets_tracking_status_justshipped_status":"Ready to go"
And thats what i need to extractm the "widgets_tracking_status_justshipped_status" and the text of the field, so "Ready to go".
For a JavaScript string, use a regex:
import re

def getstatus(url):
    response = requests.get(url, headers=headers)
    status = re.search(r'_justshipped_status":"([^"]+)', response.text).group(1)
    print(status)
    # Ready to go
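If you need more than one key from that object, an alternative is to capture the whole literal and parse it with the json module. A sketch, reusing the headers and url from the question, and assuming the translation object is valid JSON with no "};" inside its string values:
import json
import re
import requests

def get_translation(url):
    response = requests.get(url, headers=headers)
    # Capture the object literal assigned to `var translation`
    match = re.search(r'var translation = (\{.*?\});', response.text, re.DOTALL)
    return json.loads(match.group(1))

translation = get_translation(url)
print(translation['widgets_tracking_status_justshipped_status'])  # Ready to go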

Webscraping with http shows "Web page blocked"

I am trying to scrape an HTTP website using proxies, and when I try to extract text, it shows "Web Page Blocked". How can I avoid this error?
My code is as follows:
import requests
from bs4 import BeautifulSoup

url = "http://campanulaceae.myspecies.info/"
proxy_dict = {
    'http': "174.138.54.49:8080",
    'https': "174.138.54.49:8080"
}
page = requests.get(url, proxies=proxy_dict)
soup = BeautifulSoup(page.text, 'html.parser')
print(soup)
I get the output below when I try to print text from the website.
<html>
<head>
<title>Web Page Blocked</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="NO-CACHE" http-equiv="PRAGMA"/>
<meta content="initial-scale=1.0" name="viewport"/>
........
<body bgcolor="#e7e8e9">
<div id="content">
<h1>Web Page Blocked</h1>
<p>Access to the web page you were trying to visit has been blocked in accordance with company policy. Please contact your system administrator if you believe this is in error.</p>
That is because you did not specify a User-Agent in the request headers.
Quite often, sites block requests that come from robot-like sources.
Try it like this:
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36'}
page = requests.get(url, headers=headers, proxies=proxy_dict)
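For context, the header that requests sends by default identifies the library itself, which is exactly what such firewalls match on:
import requests

# Prints something like 'python-requests/2.25.1'
print(requests.utils.default_headers()['User-Agent'])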

My Beautiful Soup scraper is not working as intended

I am trying to pull the ingredients list from the following webpage:
https://skinsalvationsf.com/2012/08/updated-comedogenic-ingredients-list/
So the first ingredient I want to pull would be Acetylated Lanolin, and the last ingredient would be Octyl Palmitate.
Looking at the page source for this URL, I learn that the pattern for the ingredients list looks like this:
<td valign="top" width="33%">Acetylated Lanolin <sup>5</sup></td>
So I wrote some code to pull the list, and it is giving me zero results. Below is the code.
import requests
from bs4 import BeautifulSoup

r = requests.get('https://skinsalvationsf.com/2012/08/updated-comedogenic-ingredients-list/')
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('td', attrs={'valign': 'top'})
When I try len(results), it gives me zero.
What am I doing wrong? Why am I not able to pull the list as intended? I am a beginner at web scraping.
Your web scraping code is working as intended. However, your request did not work. If you check the status code of your request, you can see that you get a 403 status.
r = requests.get('https://skinsalvationsf.com/2012/08/updated-comedogenic-ingredients-list/')
print(r.status_code) # 403
What happens is that the server does not allow a non-browser request. To make it work, you need to use a header while making the request. This header should be similar to what a browser would send:
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 6.1; WOW64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/56.0.2924.76 Safari/537.36')
}
r = requests.get('https://skinsalvationsf.com/2012/08/updated-comedogenic-ingredients-list/', headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('td', attrs={'valign': 'top'})
print(len(results))
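Once the cells come back, the ingredient names can be read with get_text(). A short sketch continuing from the snippet above; note that each cell also contains the <sup> footnote digit:
# Continues from the snippet above
for td in results:
    print(td.get_text(" ", strip=True))  # e.g. 'Acetylated Lanolin 5'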
Your request was forbidden, hence you cannot crawl the page. It seems the website is blocking scraping.
print(soup)
<html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<center><h1>403 Forbidden</h1></center>
<hr/><center>nginx</center>
</body>
</html>
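You can detect this case without printing the whole soup by checking the response before parsing. A minimal sketch:
import requests

r = requests.get('https://skinsalvationsf.com/2012/08/updated-comedogenic-ingredients-list/')
if not r.ok:                        # True for any 4xx/5xx response
    print(r.status_code, r.reason)  # e.g. 403 Forbidden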

Access denied while scraping

I want to create a script that goes to https://www.size.co.uk/featured/footwear/ and scrapes the content, but when I run the script I get Access Denied. Here is the code:
from urllib import urlopen
from bs4 import BeautifulSoup as BS
url = urlopen('https://www.size.co.uk/')
print BS(url, 'lxml')
The output is
<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>
You don't have permission to access "http://www.size.co.uk/" on this server.
<p>
Reference #18.6202655f.1498945327.11002828
</p></body>
</html>
When I try it with other websites, the code works fine, and when I use Selenium nothing happens; but I still want to know how to bypass this error without using Selenium. However, when I use Selenium on a different website like http://www.footpatrol.co.uk/shop, I get the same Access Denied error. Here is the code for footpatrol:
from selenium import webdriver
from bs4 import BeautifulSoup as BS

driver = webdriver.PhantomJS(r'C:\Users\V\Desktop\PY\web_scrape\phantomjs.exe')
driver.get('http://www.footpatrol.com')
pageSource = driver.page_source
soup = BS(pageSource, 'lxml')
print soup
Output is:
<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>
You don't have permission to access "http://www.footpatrol.co.uk/" on this
server.<p>
Reference #18.6202655f.1498945644.110590db
</p></body></html>
import requests
from bs4 import BeautifulSoup as BS
url = 'https://www.size.co.uk/'
agent = {"User-Agent":'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
page = requests.get(url, headers=agent)
print (BS(page.content, 'lxml'))
Try this:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
source = requests.get(url, headers=headers).text
print(source)
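Since the question also tried PhantomJS: the same idea applies there, because the block keys on the User-Agent. A hedged sketch using the GhostDriver capability key for PhantomJS (this assumes the phantomjs executable is on your PATH; PhantomJS is deprecated in newer Selenium versions):
from selenium import webdriver

caps = webdriver.DesiredCapabilities.PHANTOMJS.copy()
caps['phantomjs.page.settings.userAgent'] = (
    'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36')
driver = webdriver.PhantomJS(desired_capabilities=caps)
driver.get('https://www.size.co.uk/')
print(driver.page_source[:200])
driver.quit()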
