urllib.request.urlopen cannot fetch the primaries page of Stack Overflow elections - python

I have a little script to summarize and sort the candidate scores in Stack Exchange election primaries. It works for most sites, but for Stack Overflow, retrieving the URL with urllib's request.urlopen fails with a 403 (Forbidden) error. To demonstrate the problem:
from urllib import request

urls = (
    'http://math.stackexchange.com/election/5?tab=primary',
    'http://serverfault.com/election/5?tab=primary',
    'http://stackoverflow.com/election/7?tab=primary',
)

for url in urls:
    print('fetching {} ...'.format(url))
    request.urlopen(url).read()
Output: the Math SE and Server Fault URLs work fine, but Stack Overflow fails:
fetching http://math.stackexchange.com/election/5?tab=primary ...
fetching http://serverfault.com/election/5?tab=primary ...
fetching http://stackoverflow.com/election/7?tab=primary ...
Traceback (most recent call last):
  File "examples/t.py", line 11, in <module>
    request.urlopen(url).read()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 469, in open
    response = meth(req, response)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 579, in http_response
    'http', request, response, code, msg, hdrs)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 507, in error
    return self._call_chain(*args)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 441, in _call_chain
    result = func(*args)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 587, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
With curl, all three URLs work, so the problem seems specific to urllib's request.urlopen. I tried on OS X and Linux with the same result. What's going on? How can this be explained?
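For reference, a quick way to see what User-Agent urllib sends by default is to fetch a header echo service (httpbin.org is an assumption here, used only to echo the header back):

from urllib import request

# httpbin.org/user-agent echoes back the User-Agent header it received
print(request.urlopen('http://httpbin.org/user-agent').read())
# on Python 3.4 this prints something like: b'{"user-agent": "Python-urllib/3.4"}'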

Using requests instead of urllib
import requests

urls = (
    'http://math.stackexchange.com/election/5?tab=primary',
    'http://serverfault.com/election/5?tab=primary',
    'http://stackoverflow.com/election/7?tab=primary',
)

for url in urls:
    print('fetching {} ...'.format(url))
    data = requests.get(url)
and, if you want to make it slightly more efficient, reuse a single HTTP session:
import requests

urls = (
    'http://math.stackexchange.com/election/5?tab=primary',
    'http://serverfault.com/election/5?tab=primary',
    'http://stackoverflow.com/election/7?tab=primary',
)

with requests.Session() as session:
    for url in urls:
        print('fetching {} ...'.format(url))
        data = session.get(url)
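One caveat: unlike urlopen, requests.get does not raise on an HTTP error status, so a 403 would pass silently in the loops above. Calling raise_for_status() makes failures explicit:

data = session.get(url)
data.raise_for_status()  # raises requests.exceptions.HTTPError on 4xx/5xx responses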

It appears to be the User-Agent header that urllib sends. This code works for me:
from urllib import request

urls = (
    'http://math.stackexchange.com/election/5?tab=primary',
    'http://serverfault.com/election/5?tab=primary',
    'http://stackoverflow.com/election/7?tab=primary',
)

for url in urls:
    print('fetching {} ...'.format(url))
    try:
        request.urlopen(url).read()
    except Exception:
        print('got an exception, changing user-agent to urllib default')
        req = request.Request(url)
        req.add_header('User-Agent', 'Python-urllib/3.4')
        try:
            request.urlopen(req)
        except Exception:
            print('got another exception, changing user-agent to something else')
            req.add_header('User-Agent', 'not-Python-urllib/3.4')
            request.urlopen(req)
    print('success with url: {}'.format(url))
And here's the current output (2015-11-16) with blank lines added for readability:
fetching http://math.stackexchange.com/election/5?tab=primary ...
success with url: http://math.stackexchange.com/election/5?tab=primary

fetching http://serverfault.com/election/5?tab=primary ...
success with url: http://serverfault.com/election/5?tab=primary

fetching http://stackoverflow.com/election/7?tab=primary ...
got an exception, changing user-agent to urllib default
got another exception, changing user-agent to something else
success with url: http://stackoverflow.com/election/7?tab=primary
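If the goal is simply to make the fetch work, a shorter variant sets a non-default User-Agent up front; a minimal sketch (the exact string is arbitrary, as long as it isn't urllib's default):

from urllib import request

url = 'http://stackoverflow.com/election/7?tab=primary'
req = request.Request(url, headers={'User-Agent': 'not-Python-urllib/3.4'})  # any non-default UA
html = request.urlopen(req).read()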

Related

Python 3: Urllib giving 403 error message

I wanted to run a Python 3 program I created a while ago that retrieves the weather for a given US zipcode from a website. It worked perfectly when I tried it a few months ago, but now I get a urllib 403 error message.
I got some advice, and someone told me that the website no longer accepts bots.
My entire project looked like this:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
# asks about zipcode
print("What is your (valid) US zipcode?")
# turns zipcode into a string
zipcode = str(input())
# adds zipcode to the URL
my_url = 'https://weather.com/weather/today/l/' + zipcode + ':4:US'
#Opening up connection, grabbing the page.
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
# html parsing
page_soup = soup(page_html, "html.parser")
# grabs the temp
weather_data = page_soup.find("div", {"class":"today_nowcard-temp"})
# prints the temp without the extra code
print(weather_data.text)
Then, I was told to insert this before I open the connection:
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:50.0) Gecko/20100101 Firefox/50.0'}
This doesn't help.
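For the record, defining the dict by itself does nothing; it only takes effect if it is attached to the request, along these lines (a sketch; the site may still refuse the request even with the header set):

from urllib.request import Request, urlopen

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:50.0) Gecko/20100101 Firefox/50.0'}
req = Request(my_url, headers=headers)  # my_url built from the zipcode as above
uClient = urlopen(req)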
My error is a 403 error. This is the entire message:
Traceback (most recent call last):
  File "c:/Users/natek/Downloads/Test.py", line 14, in <module>
    uClient = uReq(my_url)
  File "C:\Users\natek\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\natek\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Users\natek\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Users\natek\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\Users\natek\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "C:\Users\natek\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
I'm kind of stumped, and could use some help. Should I choose a new website entirely?
From what you are saying, this website is rejecting requests that lack some sort of authentication. Quickly logging the browser's requests, I could see this one being made:
https://api.weather.com/v3/location/search?apiKey=d522aa97197fd864d36b418f39ebb323&format=json&language=en-US&locationType=locale&query=[SOMETHING I TYPED]
If you break down the query string, you can see apiKey=d522aa97197fd864d36b418f39ebb323. This means you need to provide an API key with the request for it to work as intended.
I would check whether the website has a way for you to register and acquire your own API key, allowing you to make requests directly, probably subject to a set of usage rules.
Below is a usage example with the API key captured above (it should be invalidated within a few hours, but I'll give it a shot).
const weatherApi = 'https://api.weather.com/v3/location/search?apiKey=d522aa97197fd864d36b418f39ebb323&format=json&language=en-US&locationType=locale&query='
$('#build').on('click', () => {
  const text = $('#text').val();
  const resultEl = $('#result');
  const uri = `${weatherApi}${encodeURI(text)}`;
  fetch(uri)
    .then(r => r.json())
    .then(r => JSON.stringify(r))
    .then(r => resultEl.html(r))
    .catch(e => alert(e));
});
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<div>
<input id='text' type='text'><button id='build'>Search</button>
</div>
<p id='result'></p>
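Since the question is in Python, here is a rough equivalent of the snippet above using requests (same endpoint and captured API key; both are assumptions, and the key may stop working at any time):

import requests

params = {
    'apiKey': 'd522aa97197fd864d36b418f39ebb323',  # key captured above; likely short-lived
    'format': 'json',
    'language': 'en-US',
    'locationType': 'locale',
    'query': 'New York',  # placeholder search term
}
resp = requests.get('https://api.weather.com/v3/location/search', params=params)
print(resp.json())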

"HTTP Error 401: Unauthorized" when querying youtube api for playlist with python

I am trying to write a simple Python 3 script that gets some playlist information via the YouTube API. However, I always get a 401 error, whereas it works perfectly when I enter the request string in a browser or make the request with wget. I'm relatively new to Python and I guess I'm missing some important point here.
This is my script. Of course, I actually use a real API key.
from urllib.request import Request, urlopen
from urllib.parse import urlencode

api_key = "myApiKey"
playlist_id = input('Enter playlist id: ')
output_file = input('Enter name of output file (default is playlist id): ')
if output_file == '':
    output_file = playlist_id

url = 'https://www.googleapis.com/youtube/v3/playlistItems'
params = {'part': 'snippet',
          'playlistId': playlist_id,
          'key': api_key,
          'fields': 'items/snippet(title,description,position,resourceId/videoId),nextPageToken,pageInfo/totalResults',
          'maxResults': 50,
          'pageToken': '', }
data = urlencode(params)

request = Request(url, data.encode('utf-8'))
response = urlopen(request)
content = response.read()
print(content)
Unfortunately it raises an error at response = urlopen(request):
Traceback (most recent call last):
  File "gpd-helper.py", line 35, in <module>
    response = urlopen(request)
  File "/usr/lib/python3.4/urllib/request.py", line 153, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.4/urllib/request.py", line 461, in open
    response = meth(req, response)
  File "/usr/lib/python3.4/urllib/request.py", line 571, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.4/urllib/request.py", line 499, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 433, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 579, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 401: Unauthorized
I looked through the documentation but couldn't find any hint. According to the docs, no authentication other than the API key is required for listing a public playlist.
After diving deeper into the Python and Google docs, I found the solution to my problem.
Python's Request object automatically creates a POST request when the data parameter is given, but the YouTube API expects a GET request with the parameters in the query string.
The solution is to either supply 'GET' for the method parameter in Python 3.4:
request = Request(url, data.encode('utf-8'), method='GET')
or to append the urlencoded parameters to the URL as a query string:
request = Request(url + '?' + data)
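Putting it together, a minimal end-to-end sketch of the second variant (placeholder key and playlist id):

from urllib.request import Request, urlopen
from urllib.parse import urlencode

params = {'part': 'snippet',
          'playlistId': 'PLxxxxxxxx',  # placeholder playlist id
          'key': 'myApiKey',           # placeholder API key
          'maxResults': 50}
url = 'https://www.googleapis.com/youtube/v3/playlistItems'
request = Request(url + '?' + urlencode(params))  # no data argument, so urllib issues a GET
response = urlopen(request)
print(response.read())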

python automatic re-accessing without changing cookie

I have a problem with accessing a specific web site.
The web site automatically redirects to a check page that displays "check your browser".
The check page returns an HTTP 503 error the first time.
Then the web browser (Chrome, IE, etc.) automatically re-requests the page.
Finally I can get into the web site.
The problem is that I want to access the site in Python.
So I tried both urllib and urllib2:
import urllib

u = urllib.urlopen(url)
print u.read()
The same with urllib2, but it doesn't work: urllib2 raises the 503 as an error, while urllib also gets the HTTP 503 code but doesn't raise.
So I need to re-access the page without the cookie changing:
u = urllib.urlopen(url)
u = urllib.urlopen(url)  ## cookie is changed
print u.read()
I simply tried calling the open function twice, but the cookie changes between the calls and it doesn't work (I get the check page again).
So I used urllib2 with cookielib:
import os.path
import cookielib
import urllib2

cj = cookielib.LWPCookieJar()
if os.path.isfile('cookie.lpw'):
    cj.load('cookie.lpw')
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

theurl = url
txdata = None
txheaders = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
req = urllib2.Request(theurl, txdata, txheaders)
handle = urllib2.urlopen(req)  ## error raised
The error:
Traceback (most recent call last):
  File "<pyshell#20>", line 1, in <module>
    handle = urlopen(req)
  File "C:\Python27\lib\urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 410, in open
    response = meth(req, response)
  File "C:\Python27\lib\urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python27\lib\urllib2.py", line 448, in error
    return self._call_chain(*args)
  File "C:\Python27\lib\urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 531, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 503: Service Temporarily Unavailable
Put simply, I want to re-request the site after the HTTP 503 error without changing cookies, but I don't know how to do it.
Can somebody help me?
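One plausible approach (a sketch, untested against the site in question) is to keep one opener with a single cookie jar and retry when the 503 is raised; HTTPCookieProcessor reuses the same jar across requests, so any cookie set by the check page is sent again on the retry. Note that if the check page requires JavaScript, no amount of retrying from urllib2 will pass it.

import time
import cookielib
import urllib2

cj = cookielib.LWPCookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
req = urllib2.Request(url, None, {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'})  # url as above

for attempt in range(3):
    try:
        handle = opener.open(req)
        break                       # got past the check page
    except urllib2.HTTPError as e:
        if e.code != 503:
            raise                   # a different failure; don't retry
        time.sleep(2)               # wait, then re-request with the same cookies
else:
    raise RuntimeError('still getting 503 after retries')

print handle.read()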

Get request Activities for strava v3 api python

Edit: as I couldn't tag this with Strava, here are the docs if you're interested: http://strava.github.io/api/
I get through the authentication fine and gain the access_token (and my athlete info) in a response.read().
I'm having problems with the next step:
I want to return information about a specific activity.
import urllib2
import urllib

access_token = str(tp[3])  # this comes from the auth response (not shown)
print access_token

ath_url = 'https://www.strava.com/api/v3/activities/108838256'
ath_val = {'access_token': access_token}
ath_data = urllib.urlencode(ath_val)
ath_req = urllib2.Request(ath_url, ath_data)
ath_response = urllib2.urlopen(ath_req)
the_page = ath_response.read()
print the_page
The error is:
Traceback (most recent call last):
  File "C:\Users\JordanR\Python2.6\documents\strava\auth.py", line 30, in <module>
    ath_response = urllib2.urlopen(ath_req)
  File "C:\Users\JordanR\Python2.6\lib\urllib2.py", line 124, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Users\JordanR\Python2.6\lib\urllib2.py", line 389, in open
    response = meth(req, response)
  File "C:\Users\JordanR\Python2.6\lib\urllib2.py", line 502, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Users\JordanR\Python2.6\lib\urllib2.py", line 427, in error
    return self._call_chain(*args)
  File "C:\Users\JordanR\Python2.6\lib\urllib2.py", line 361, in _call_chain
    result = func(*args)
  File "C:\Users\JordanR\Python2.6\lib\urllib2.py", line 510, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 404: Not Found
The 404 is a puzzle, as I know this activity exists.
Is 'access_token' the correct header information?
The documentation (http://strava.github.io/api/v3/activities/#get-details) uses Authorization: Bearer. I'm not sure how urllib would encode the Bearer part of the info.
Sorry if some of my terminology is a bit off; I'm new to this.
answered this bad boy.
import requests as r
access_token = tp[3]
ath_url = 'https://www.strava.com/api/v3/activities/108838256'
header = {'Authorization': 'Bearer 4b1d12006c51b685fd1a260490_example_jklfds'}
data = r.get(ath_url, headers=header).json()
It needed the "Bearer" part added in the dict.
thanks for the help idClark
I prefer using the third-party Requests module. You do indeed need to follow the documentation and use the Authorization header documented in the API.
Requests has an argument for header data: we can create a dict where the key is Authorization and its value is the single string 'Bearer access_token'.
#install requests from pip if you want
import requests as r
url = 'https://www.strava.com/api/v3/activities/108838256'
header = {'Authorization': 'Bearer access_token'}
r.get(url, headers=header).json()
If you really want to use urllib2:
# using urllib2
import urllib2

req = urllib2.Request(url)
req.add_header('Authorization', 'Bearer access_token')
resp = urllib2.urlopen(req)
content = resp.read()
Just remember access_token needs to be the literal token string, e.g. acc09cds09c097d9c097v9.
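If the token lives in a variable, as in the question, the header value can be built with string formatting (a small sketch, using the r alias from above):

header = {'Authorization': 'Bearer {}'.format(access_token)}
data = r.get(ath_url, headers=header).json()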

Python follow redirects and then download the page?

I have the following python script and it works beautifully.
import urllib2
url = 'http://abc.com' # write the url here
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
print data
However, some of the URLs I give it may redirect 2 or more times. How can I have Python follow the redirects to completion before loading the data?
For instance, when using the above code with
http://www.google.com/search?hl=en&q=KEYWORD&btnI=1
which is the equivalent of hitting the I'm Feeling Lucky button on a Google search, I get:
>>> url = 'http://www.google.com/search?hl=en&q=KEYWORD&btnI=1'
>>> usock = urllib2.urlopen(url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
    response = meth(req, response)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
    'http', request, response, code, msg, hdrs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 438, in error
    return self._call_chain(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 521, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
>>>
I've tried the (url, data, timeout) arguments; however, I am unsure what to put there.
EDIT:
I actually found out that if I don't follow the redirect and just read the headers of the first response, I can grab the Location of the next redirect and use that as my final link.
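That approach can be scripted: below is a sketch that disables urllib2's automatic redirect handling so the Location header can be read directly (returning None from redirect_request makes urllib2 surface the 30x response as an HTTPError):

import urllib2

class NoRedirect(urllib2.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # don't follow; the 30x response is raised as an HTTPError

opener = urllib2.build_opener(NoRedirect())
try:
    opener.open('http://github.com')  # any URL that redirects
except urllib2.HTTPError as e:
    print e.hdrs.get('Location')  # the redirect target, e.g. https://github.com/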
Use requests as the other answer states; here is an example. The final URL after redirects will be in r.url. In the examples below, http is redirected to https.
For HEAD:
In [1]: import requests
...: r = requests.head('http://github.com', allow_redirects=True)
...: r.url
Out[1]: 'https://github.com/'
For GET:
In [1]: import requests
...: r = requests.get('http://github.com')
...: r.url
Out[1]: 'https://github.com/'
Note that for HEAD you have to specify allow_redirects; if you don't, you can still find the target in the headers, but this is not advised:
In [1]: import requests
In [2]: r = requests.head('http://github.com')
In [3]: r.headers.get('location')
Out[3]: 'https://github.com/'
To download the page you will need GET; you can then access the page body using r.content.
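If you also want to see every hop rather than just the final URL, requests keeps the intermediate responses in r.history (a quick sketch):

import requests

r = requests.get('http://github.com')
for hop in r.history:            # intermediate redirect responses, in order
    print(hop.status_code, hop.url)
print(r.status_code, r.url)      # the final response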
You might be better off with the Requests library, which has better APIs for controlling redirect handling:
https://requests.readthedocs.io/en/master/user/quickstart/#redirection-and-history
Requests:
https://pypi.org/project/requests/ (urllib replacement for humans)
