Python 3: Urllib giving 403 error message

I wanted to run a Python 3 program I created a while ago that retrieves the weather for a specific US zip code from a website. It was working perfectly when I tried it a few months ago, but now I get a urllib 403 error.
I got some advice, and someone told me that the website no longer accepts bots.
My entire project looked like this:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
# asks about zipcode
print("What is your (valid) US zipcode?")
# turns zipcode into a string
zipcode = str(input())
# adds zipcode to the URL
my_url = 'https://weather.com/weather/today/l/' + zipcode + ':4:US'
#Opening up connection, grabbing the page.
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
# html parsing
page_soup = soup(page_html, "html.parser")
# grabs the temp
weather_data = page_soup.find("div", {"class":"today_nowcard-temp"})
# prints the temp without the extra code
print(weather_data.text)
Then, I was told to insert this before I open the connection:
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:50.0) Gecko/20100101 Firefox/50.0'}
This doesn't help.
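On its own, that headers dictionary does nothing: it only takes effect if it is attached to a Request object. A minimal sketch of what that would look like (the site may still return 403 if it rejects non-browser clients outright):
from urllib.request import Request, urlopen

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:50.0) Gecko/20100101 Firefox/50.0'}
# Attach the header to the request itself instead of just defining the dict
req = Request(my_url, headers=headers)
uClient = urlopen(req)
page_html = uClient.read()
uClient.close()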
My error is a 403 error. This is the entire message:
Traceback (most recent call last):
  File "c:/Users/natek/Downloads/Test.py", line 14, in <module>
    uClient = uReq(my_url)
  File "C:\Users\natek\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\natek\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Users\natek\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Users\natek\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\Users\natek\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "C:\Users\natek\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
I'm kind of stumped, and could use some help. Should I choose a new website entirely?

From what you are saying, this website no longer accepts requests that are missing some sort of authentication. Logging the requests the page itself makes, I could see this one being sent:
https://api.weather.com/v3/location/search?apiKey=d522aa97197fd864d36b418f39ebb323&format=json&language=en-US&locationType=locale&query=[SOMETHING I TYPED]
If you break down the query string, you can see apiKey=d522aa97197fd864d36b418f39ebb323. This means you need to provide an API key with the request for it to work as intended.
I would check whether the website has a way for you to register and acquire an API key of your own, allowing you to make requests directly, probably subject to a set of usage rules.
Below is an example of usage with the API key captured above (it will probably be invalidated within a few hours, but it is worth a shot).
const weatherApi = 'https://api.weather.com/v3/location/search?apiKey=d522aa97197fd864d36b418f39ebb323&format=json&language=en-US&locationType=locale&query='

$('#build').on('click', () => {
  const text = $('#text').val();
  const resultEl = $('#result');
  const uri = `${weatherApi}${encodeURI(text)}`;
  fetch(uri)
    .then(r => r.json())
    .then(r => JSON.stringify(r))
    .then(r => resultEl.html(r))
    .catch(e => alert(e));
});

<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<div>
  <input id='text' type='text'><button id='build'>Search</button>
</div>
<p id='result'></p>
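The same lookup can also be done from Python with the requests library; a minimal sketch, assuming the captured apiKey is still valid (it may not be):
import requests

# Endpoint and apiKey observed in the browser's network log; the key may already be revoked.
params = {
    'apiKey': 'd522aa97197fd864d36b418f39ebb323',
    'format': 'json',
    'language': 'en-US',
    'locationType': 'locale',
    'query': '90210',  # whatever the user typed
}
resp = requests.get('https://api.weather.com/v3/location/search', params=params)
resp.raise_for_status()
print(resp.json())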

Related

Error in web-Scraping Code Using BeautifulSoup

I want to get data from https://www.cvedetails.com/vulnerability-list/vendor_id-26/product_id-32238/Microsoft-Windows-10.html from page 1 to the last page, sorted by "CVE Number Ascending". The data I want to retrieve, in CSV format, is everything in the table header and the table data.
I have been trying out a few pieces of code, but nothing seems to work, and I'm kind of desperate now. This is where I have been trying to learn from: https://youtu.be/XQgXKtPSzUI. Any help would be appreciated.
I asked this once before and the replies I got were great, but they didn't quite get me what I need, and I am still confused about how this works, all the more so because of how weird the site's source code is.
#!/usr/bin/env python3
import bs4 # Good HTML parser
from urllib.request import urlopen as uReq # Helps with opening URL
from bs4 import BeautifulSoup as soup
# The target URL
my_url = 'https://www.cvedetails.com/vulnerability-list.php?vendor_id=26&product_id=32238&version_id=&page=1&hasexp=0&opdos=0&opec=0&opov=0&opcsrf=0&opgpriv=0&opsqli=0&opxss=0&opdirt=0&opmemc=0&ophttprs=0&opbyp=0&opfileinc=0&opginf=0&cvssscoremin=0&cvssscoremax=0&year=0&month=0&cweid=0&order=2&trc=851&sha=41e451b72c2e412c0a1cb8cb1dcfee3d16d51c44'
# Check process
# print(my_url)
# Open a connection and grab the webpage and downloads it
uClient = uReq(my_url)
# Save the webpage into a variable
page_html = uClient.read()
# Close the internet connection from uclient
uClient.close()
# Calling soup to parse the html with html parser and saving it to a variable
page_soup = soup(page_html,"html.parser")
print(page_soup.h1)
This is the error code
Traceback (most recent call last):
  File "./Testing3.py", line 21, in <module>
    uClient = uReq(my_url)
  File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/usr/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
To avoid this error, you need to supply a user agent header with the request. Try modifying your script as follows:
#!/usr/bin/env python3
# urllib.request helps with opening the URL; bs4 is a good HTML parser
from urllib.request import urlopen as uReq, Request
from bs4 import BeautifulSoup as soup

# setting the target url
my_url = 'https://www.cvedetails.com/vulnerability-list.php?vendor_id=26&product_id=32238&version_id=&page=1&hasexp=0&opdos=0&opec=0&opov=0&opcsrf=0&opgpriv=0&opsqli=0&opxss=0&opdirt=0&opmemc=0&ophttprs=0&opbyp=0&opfileinc=0&opginf=0&cvssscoremin=0&cvssscoremax=0&year=0&month=0&cweid=0&order=2&trc=851&sha=41e451b72c2e412c0a1cb8cb1dcfee3d16d51c44'
# send a browser-like User-Agent so the server does not return 403
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(my_url, headers=hdr)
page = uReq(req)
page_soup = soup(page.read(), "html.parser")
print(page_soup.h1)
Instead of urllib, why don't you use the requests module directly? Try this code:
import requests
from bs4 import BeautifulSoup as soup
my_url = 'https://www.cvedetails.com/vulnerability-list.php?vendor_id=26&product_id=32238&version_id=&page=1&hasexp=0&opdos=0&opec=0&opov=0&opcsrf=0&opgpriv=0&opsqli=0&opxss=0&opdirt=0&opmemc=0&ophttprs=0&opbyp=0&opfileinc=0&opginf=0&cvssscoremin=0&cvssscoremax=0&year=0&month=0&cweid=0&order=2&trc=851&sha=41e451b72c2e412c0a1cb8cb1dcfee3d16d51c44'
page_html = requests.get(my_url).text
page_soup = soup(page_html,"html.parser")
print(page_soup.h1)
output:
<h1>
Microsoft » Windows 10 : Security Vulnerabilities
</h1>
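The question also asked for the table contents in CSV form. A minimal sketch of how that could look once the page has been fetched as above; it assumes the vulnerability listing is the first <table> in page_soup, which may need adjusting to the site's actual markup:
import csv

# page_soup comes from either of the snippets above
table = page_soup.find("table")
with open("vulnerabilities.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for row in table.find_all("tr"):
        # collect both header (<th>) and data (<td>) cells
        cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        if cells:
            writer.writerow(cells)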

urllib.request.urlopen cannot fetch the primaries page of Stack Overflow elections

I have a little script to summarize and sort the candidate scores in Stack Exchange election primaries. It works for most sites, except for Stack Overflow, where retrieving the URL with urllib's request.urlopen fails with a 403 (Forbidden) error. To demonstrate the problem:
from urllib import request

urls = (
    'http://math.stackexchange.com/election/5?tab=primary',
    'http://serverfault.com/election/5?tab=primary',
    'http://stackoverflow.com/election/7?tab=primary',
)

for url in urls:
    print('fetching {} ...'.format(url))
    request.urlopen(url).read()
Output: the URLs of Math SE and Server Fault work fine, but Stack Overflow fails:
fetching http://math.stackexchange.com/election/5?tab=primary ...
fetching http://serverfault.com/election/5?tab=primary ...
fetching http://stackoverflow.com/election/7?tab=primary ...
Traceback (most recent call last):
  File "examples/t.py", line 11, in <module>
    request.urlopen(url).read()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 469, in open
    response = meth(req, response)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 579, in http_response
    'http', request, response, code, msg, hdrs)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 507, in error
    return self._call_chain(*args)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 441, in _call_chain
    result = func(*args)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 587, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
Using curl, all the URLs work, so the problem seems to be specific to urllib's request.urlopen. I tried on OS X and Linux with the same result. What's going on? How can this be explained?
Using requests instead of urllib
import requests

urls = (
    'http://math.stackexchange.com/election/5?tab=primary',
    'http://serverfault.com/election/5?tab=primary',
    'http://stackoverflow.com/election/7?tab=primary',
)

for url in urls:
    print('fetching {} ...'.format(url))
    data = requests.get(url)
and if you want to make it slightly more efficient by using a single HTTP session
import requests

urls = (
    'http://math.stackexchange.com/election/5?tab=primary',
    'http://serverfault.com/election/5?tab=primary',
    'http://stackoverflow.com/election/7?tab=primary',
)

with requests.Session() as session:
    for url in urls:
        print('fetching {} ...'.format(url))
        data = session.get(url)
It appears to be the user-agent that gets sent with urllib. This code works for me:
from urllib import request

urls = (
    'http://math.stackexchange.com/election/5?tab=primary',
    'http://serverfault.com/election/5?tab=primary',
    'http://stackoverflow.com/election/7?tab=primary',
)

for url in urls:
    print('fetching {} ...'.format(url))
    try:
        request.urlopen(url).read()
    except:
        print('got an exception, changing user-agent to urllib default')
        req = request.Request(url)
        req.add_header('User-Agent', 'Python-urllib/3.4')
        try:
            request.urlopen(req)
        except:
            print('got another exception, changing user-agent to something else')
            req.add_header('User-Agent', 'not-Python-urllib/3.4')
            request.urlopen(req)
    print('success with url: {}'.format(url))
And here's the current output (2015-11-16) with blank lines added for readability:
fetching http://math.stackexchange.com/election/5?tab=primary ...
success with url: http://math.stackexchange.com/election/5?tab=primary

fetching http://serverfault.com/election/5?tab=primary ...
success with url: http://serverfault.com/election/5?tab=primary

fetching http://stackoverflow.com/election/7?tab=primary ...
got an exception, changing user-agent to urllib default
got another exception, changing user-agent to something else
success with url: http://stackoverflow.com/election/7?tab=primary

"HTTP Error 401: Unauthorized" when querying youtube api for playlist with python

I am trying to write a simple Python 3 script that gets some playlist information via the YouTube API. However, I always get a 401 error, whereas it works perfectly when I enter the request string in a browser or make the request with wget. I'm relatively new to Python and I guess I'm missing some important point here.
This is my script. Of course, I actually use a real API key.
from urllib.request import Request, urlopen
from urllib.parse import urlencode

api_key = "myApiKey"
playlist_id = input('Enter playlist id: ')
output_file = input('Enter name of output file (default is playlist id): ')
if output_file == '':
    output_file = playlist_id

url = 'https://www.googleapis.com/youtube/v3/playlistItems'
params = {'part': 'snippet',
          'playlistId': playlist_id,
          'key': api_key,
          'fields': 'items/snippet(title,description,position,resourceId/videoId),nextPageToken,pageInfo/totalResults',
          'maxResults': 50,
          'pageToken': '', }

data = urlencode(params)
request = Request(url, data.encode('utf-8'))
response = urlopen(request)
content = response.read()
print(content)
Unfortunately it raises an error at response = urlopen(request):
Traceback (most recent call last):
  File "gpd-helper.py", line 35, in <module>
    response = urlopen(request)
  File "/usr/lib/python3.4/urllib/request.py", line 153, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.4/urllib/request.py", line 461, in open
    response = meth(req, response)
  File "/usr/lib/python3.4/urllib/request.py", line 571, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.4/urllib/request.py", line 499, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 433, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 579, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 401: Unauthorized
I looked through the documentation but couldn't find any hint. According to the docs, no authentication other than the API key is required for listing a public playlist.
After diving deeper into the Python and Google docs, I found the solution to my problem.
Python's Request object automatically issues a POST request when the data parameter is given, but the YouTube API expects a GET request with the parameters in the query string.
The solution is to either supply 'GET' for the method parameter (Python 3.4):
request = Request(url, data.encode('utf-8'), method='GET')
or concatenate the URL with the urlencoded parameters:
request = Request(url + '?' + data)
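Putting the second variant together, a minimal sketch that reuses the url and params from the question (the playlist id and API key below are placeholders):
from urllib.request import Request, urlopen
from urllib.parse import urlencode

url = 'https://www.googleapis.com/youtube/v3/playlistItems'
params = {
    'part': 'snippet',
    'playlistId': 'PL_placeholder_playlist_id',  # placeholder
    'key': 'myApiKey',                           # placeholder
    'maxResults': 50,
}

# Appending the urlencoded parameters to the URL makes urllib issue a plain GET.
request = Request(url + '?' + urlencode(params))
with urlopen(request) as response:
    print(response.read().decode('utf-8'))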

python automatic re-accessing without changing cookie

I have a problem accessing a specific web site. The site automatically redirects to a check page that displays "check your Browser".
The check page returns an HTTP 503 error the first time; the web browser (Chrome, IE, etc.) then automatically re-requests it, and finally I get into the site.
The problem is that I want to access the site from Python, so I have tried both urllib and urllib2:
u = urllib.urlopen(url)
print u.read()
The same with urllib2, but urllib2 doesn't work and raises the 503 error; urllib also gets the HTTP 503 page, but it doesn't raise an error.
So I need to re-access the page without the cookie changing:
u = urllib.urlopen(url)
u = urllib.urlopen(url)  ## cookie is changed
print u.read()
I simply tried calling urlopen twice, but the cookie changes and it doesn't work (I get the check page again).
So I used urllib2 with cookielib:
import os.path
import cookielib
import urllib2

cj = cookielib.LWPCookieJar()
if os.path.isfile('cookie.lpw'):
    cj.load('cookie.lpw')
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

theurl = url
txdata = None
txheaders = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
req = urllib2.Request(theurl, txdata, txheaders)
handle = urllib2.urlopen(req)  ## error raised
This is the error:
Traceback (most recent call last):
  File "<pyshell#20>", line 1, in <module>
    handle = urlopen(req)
  File "C:\Python27\lib\urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 410, in open
    response = meth(req, response)
  File "C:\Python27\lib\urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python27\lib\urllib2.py", line 448, in error
    return self._call_chain(*args)
  File "C:\Python27\lib\urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 531, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 503: Service Temporarily Unavailable
Simply put, I want to re-request the site after getting the HTTP 503 error, without the cookies changing, but I don't know how to do it.
Can somebody please help me?
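A minimal sketch of that idea, assuming the check page only needs a plain retry with the same cookies (it will not help if the page runs a JavaScript challenge): keep a single opener with one cookie jar, catch the 503, and retry through the same opener so the cookies set by the check page are sent again unchanged.
import time
import cookielib
import urllib2

cj = cookielib.LWPCookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}

def fetch_with_retry(url, retries=3, delay=5):
    # Retrying through the same opener reuses the cookie jar unchanged.
    req = urllib2.Request(url, None, headers)
    for attempt in range(retries):
        try:
            return opener.open(req).read()
        except urllib2.HTTPError as e:
            if e.code != 503 or attempt == retries - 1:
                raise
            time.sleep(delay)  # give the check page a moment, then retry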

Python follow redirects and then download the page?

I have the following python script and it works beautifully.
import urllib2
url = 'http://abc.com' # write the url here
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
print data
However, some of the URLs I give it may redirect 2 or more times. How can I have Python wait for the redirects to complete before loading the data?
For instance, when using the above code with
http://www.google.com/search?hl=en&q=KEYWORD&btnI=1
which is the equivalent of hitting the "I'm Feeling Lucky" button on a Google search, I get:
>>> url = 'http://www.google.com/search?hl=en&q=KEYWORD&btnI=1'
>>> usock = urllib2.urlopen(url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
    response = meth(req, response)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
    'http', request, response, code, msg, hdrs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 438, in error
    return self._call_chain(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 521, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
>>>
I've tried playing with the (url, data, timeout) arguments; however, I am unsure what to put there.
EDIT:
I actually found out that if I don't follow the redirect and just use the headers of the first response, I can grab the location of the next redirect and use that as my final link.
Use requests as the other answer states; here is an example. The final URL after redirection will be in r.url. In the example below, http is redirected to https.
For HEAD:
In [1]: import requests
...: r = requests.head('http://github.com', allow_redirects=True)
...: r.url
Out[1]: 'https://github.com/'
For GET:
In [1]: import requests
...: r = requests.get('http://github.com')
...: r.url
Out[1]: 'https://github.com/'
Note that for HEAD you have to specify allow_redirects; if you don't, you can still get the target from the headers, but this is not advised.
In [1]: import requests
In [2]: r = requests.head('http://github.com')
In [3]: r.headers.get('location')
Out[3]: 'https://github.com/'
To download the page you will need GET; you can then access the page using r.content.
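A small usage sketch of that last point, reusing the same URL as the examples above:
import requests

r = requests.get('http://github.com')  # GET follows redirects by default
print(r.url)        # final URL after all redirects, e.g. https://github.com/
page_html = r.text  # the downloaded page body (use r.content for raw bytes)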
You might be better off with the Requests library, which has a better API for controlling redirect handling:
https://requests.readthedocs.io/en/master/user/quickstart/#redirection-and-history
Requests:
https://pypi.org/project/requests/ (urllib replacement for humans)
