I'm having a little trouble writing a script that works with URLs. I'm using urllib.urlopen() to fetch the content of each URL, but some of these URLs require authentication, and urlopen() then prompts me to type in a username and a password.
What I need is to ignore every URL that requires authentication: simply skip it and continue. Is there a way to do this?
I was thinking about catching the HTTPError exception, but the exception is handled inside urlopen() itself, so that doesn't work.
Thanks for every reply.
You are right about the urllib2.HTTPError exception:
exception urllib2.HTTPError
Though being an exception (a subclass of URLError), an HTTPError can also function as a non-exceptional file-like return value (the same thing that urlopen() returns). This is useful when handling exotic HTTP errors, such as requests for authentication.
code
An HTTP status code as defined in RFC 2616. This numeric value corresponds to a value found in the dictionary of codes as found in BaseHTTPServer.BaseHTTPRequestHandler.responses.
The code attribute of the exception can be used to verify that authentication is required - code 401.
>>> try:
...     conn = urllib2.urlopen('http://www.example.com/admin')
...     # read conn and process data
... except urllib2.HTTPError, x:
...     print 'Ignoring', x.code
...
Ignoring 401
>>>
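To skip such URLs inside a loop, the same check can be wrapped in a small helper. This is only a minimal sketch under a few assumptions: it targets Python 3, where urllib2 was split into urllib.request and urllib.error, and the name fetch_or_skip and its opener parameter are made up for illustration. urllib.request.urlopen() raises HTTPError on a 401 response instead of prompting for credentials.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch_or_skip(url, opener=urlopen):
    """Return the body of `url`, or None if the server demands authentication.

    `opener` is a hypothetical hook so the function can be tried without a network.
    """
    try:
        conn = opener(url)
    except HTTPError as err:
        if err.code == 401:
            # Authentication required: signal "skip this URL" to the caller.
            return None
        raise  # any other HTTP error is still reported
    try:
        return conn.read()
    finally:
        conn.close()
```

A caller can then test the result for None and continue with the next URL.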
There is a json() method in the requests library.
I want to handle its exceptions in case something suddenly goes wrong: for example, the server responds with something other than JSON, or there are encoding problems, and so on.
I dug into the documentation, then into the source, and found that requests itself can end up using either of two libraries:
try:
    import simplejson as json
except ImportError:
    import json
Accordingly, each of them has its own exception types.
My library is general-purpose, and I want it to work regardless of which libraries happen to be installed on the system.
How do I write this correctly:
try:
    answer = requests.get(url, params=param, headers=headers).json()
except (???, ???, ???):
    do_my_function(incorrect_answer)
so that it does not depend on which library was imported, or on that library's specific exceptions?
Since you dug into the source, you should have gone all the way. The documentation can help, too.
The requests [documentation] [1] on json() says:
In case the JSON decoding fails, r.json() raises an exception. For example, if the response gets a 204 (No Content), or if the response contains invalid JSON, attempting r.json() raises ValueError: No JSON object could be decoded.
So we must catch the ValueError exception:
try:
    answer = requests.get(url, params=param, headers=headers).json()
except ValueError:
    do_my_function(incorrect_answer)
If you look at the json and simplejson sources, you can see that both packages use a more specific exception derived from ValueError: JSONDecodeError (in the standard-library json it exists from Python 3.5 onward). So you can do this:
try:
    from simplejson import JSONDecodeError
except ImportError:
    from json import JSONDecodeError
...
try:
    answer = requests.get(url, params=param, headers=headers).json()
except JSONDecodeError:
    do_my_function(incorrect_answer)
[1]: https://requests.readthedocs.io/en/master/user/quickstart/#json-response-content
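Why catching ValueError is the library-independent choice can be checked with the standard library alone. The sketch below is illustrative only: decode_or_default is a made-up helper, and it calls json.loads directly instead of going through requests so that it runs without a network.

```python
import json


def decode_or_default(text, default=None):
    """Parse `text` as JSON; return `default` when decoding fails."""
    try:
        return json.loads(text)
    except ValueError:
        # json.JSONDecodeError derives from ValueError, so this one
        # clause also covers the more specific exception.
        return default


# True on Python 3.5+, where json.JSONDecodeError exists.
covers_specific_error = issubclass(json.JSONDecodeError, ValueError)
```

With this shape, a server that answers with HTML instead of JSON simply yields the default value.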
I used the following snippet to unshorten URLs with the requests library. It works when the redirects end at a valid hostname with a running web page, but this snippet, like every other variant I have tried, fails when the final URL points to an invalid website. I would still like to get the final URL, even if it is an invalid one.
The snippet is:
def unshorten_url(url):
    return requests.head(url, allow_redirects=True).url
print unshorten_url(<shortened URL>)
The shortened URL should redirect to this web page, whose host is invalid:
http://trekingear.com/product/4-get-a-real-rocky-mountain-high/?utm_source=Content&utm_medium=Postings&utm_campaign=Guffey%20X%20Mass
But it gives me this error:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='trekingear.com', port=80): Max retries exceeded with url: /product/4-get-a-real-rocky-mountain-high/?utm_source=Content&utm_medium=Postings&utm_campaign=Guffey%20X%20Mass (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10556dc50>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known',))
Here is the URL I am trying to unshorten:
How can I extract the final URL, the one with the invalid host, from this redirection chain?
You should not use requests.head like that, since by default it follows a 302 Redirect up to three times.
You could disable redirection (with retries=False) and use urllib3's urlopen directly. The returned response would then be the 302 itself, and its redirect location gives you the target URL:
urlopen(method, url, body=None, headers=None, retries=None,
        redirect=True, assert_same_host=True, timeout=<object object>,
        pool_timeout=None, release_conn=None, chunked=False, body_pos=None,
        **response_kw)
Get a connection from the pool and perform an HTTP request. This is the lowest level call for making a request, so you’ll need to specify all the raw details.
Parameters:
method – HTTP request method (such as GET, POST, PUT, etc.)
body – Data to send in the request body (useful for creating POST requests, see HTTPConnectionPool.post_url for more convenience).
headers – Dictionary of custom headers to send, such as User-Agent, If-None-Match, etc. If None, pool headers are used. If provided, these headers completely replace any pool-specific headers.
retries (Retry, False, or an int.) –
Configure the number of retries to allow before raising a MaxRetryError exception.
Pass None to retry until you receive a response. Pass a Retry object for fine-grained control over different types of retries. Pass an integer number to retry connection errors that many times, but no other types of errors. Pass zero to never retry.
And this is the relevant note:
If False, then retries are disabled and any exception is raised immediately. Also, instead of raising a MaxRetryError on redirects, the redirect response will be returned.
Example
(I actually ran a different test on my local web server, but I can't find a public server that serves broken 302 redirects.)
from urllib3 import PoolManager
manager = PoolManager(10)
req = manager.urlopen("GET", "http://en.wikipedia.org/wiki/Claude_E._Shannon", retries=False)
print req.get_redirect_location()
The above requests an HTTP page from Wikipedia, which generates a redirect to HTTPS:
https://en.wikipedia.org/wiki/Claude_E._Shannon
Redirects plus no retries
Your case is a bit different. You want to do redirects since the original URL will not yield the real redirection on the first try, but you want to get the failed redirect.
The problem here is that redirects are handled by the same code as error retries, so you can't disable only the latter. It's neither or both.
You then have to enable both, and do it the long way (intercepting the error). You might need to increase retries, which will slow down things when errors occur.
from urllib3.exceptions import MaxRetryError

try:
    # Did not know you can't post a URL shortener in a SO answer. Live and learn.
    req = manager.urlopen("GET", "http(COLON)(SLASH)(SLASH)t(DOT)co(SLASH)eWWk8s8Hzj")
    loc = req.get_redirect_location()
except MaxRetryError as fail:
    # build "loc" from scheme, host and url
    loc = "%s://%s%s" % (fail.pool.scheme, fail.pool.host, fail.url)
print loc
Your specific case
Since you're using a urllib3 wrapper, you can just unwrap the exception:
def unshorten_url(url):
    try:
        # This is your existing code
        return requests.head(url, allow_redirects=True).url
    except requests.ConnectionError as fail:
        return "%s://%s%s" % (fail.args[0].pool.scheme, fail.args[0].pool.host, fail.args[0].url)
You ought to provide for other possible errors, though.
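A different technique from the exception unwrapping above is to follow the redirect chain one hop at a time and remember the last URL seen, so that nothing has to be dug out of the exception. The following is a rough sketch under stated assumptions: resolve_redirects and its get_location callback are hypothetical names, and the callback is expected to return the Location header of a redirect response, return None for a non-redirect response, and raise OSError when the host cannot be reached.

```python
from urllib.parse import urljoin


def resolve_redirects(url, get_location, max_hops=10):
    """Follow redirects hop by hop and return the last URL reached.

    The last URL is returned even when its host turns out to be unreachable.
    """
    for _ in range(max_hops):
        try:
            location = get_location(url)
        except OSError:
            # The host of `url` cannot be contacted: this is the final target.
            return url
        if location is None:
            # Not a redirect: `url` is the final page.
            return url
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    return url
```

With requests, the callback could be lambda u: requests.head(u, allow_redirects=False).headers.get("Location"), since the requests exceptions derive from IOError (an alias of OSError in Python 3).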
Using the HTTPretty library for Python, I can create mock HTTP responses of my choice and then pick them up, e.g. with the requests library, like so:
import httpretty
import requests
# set up a mock
httpretty.enable()
httpretty.register_uri(
    method=httpretty.GET,
    uri='http://www.fakeurl.com',
    status=200,
    body='My Response Body'
)
response = requests.get('http://www.fakeurl.com')
# clean up
httpretty.disable()
httpretty.reset()
print(response)
Out: <Response [200]>
Is it also possible to register a URI that cannot be reached (e.g. connection timed out, connection refused, ...), so that no response is received at all? This is not the same as an established connection that returns an HTTP error code like 404.
I want to use this behaviour in unit tests to ensure that my error handling works as expected (it does different things for 'no connection established' and for 'connection established, bad HTTP status code'). As a workaround, I could try to connect to an invalid server like http://192.0.2.0, which would time out in any case. However, I would prefer to do all my unit testing without any real network connections.
In the meantime I figured it out: using an HTTPretty callback body seems to produce the desired behaviour. See the inline comments below.
This is not exactly what I was looking for: it is not a server that cannot be reached, so that the request times out, but a server that throws a timeout exception once it is reached. The effect is the same for my use case, however.
Still, if anybody knows a different solution, I'm looking forward to it.
import httpretty
import requests
# enable HTTPretty
httpretty.enable()
# create a callback body that raises an exception when opened
def exceptionCallback(request, uri, headers):
    # raise your favourite exception here, e.g. requests.ConnectionError or requests.Timeout
    raise requests.Timeout('Connection timed out.')
# set up a mock and use the callback function as response's body
httpretty.register_uri(
    method=httpretty.GET,
    uri='http://www.fakeurl.com',
    status=200,
    body=exceptionCallback
)
# try to get a response from the mock server and catch the exception
try:
    response = requests.get('http://www.fakeurl.com')
except requests.Timeout as e:
    print('requests.Timeout exception got caught...')
    print(e)
    # do whatever...
# clean up
httpretty.disable()
httpretty.reset()
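One possible different solution, offered only as a sketch, swaps HTTPretty for the standard library's unittest.mock and patches the call that would open the connection, so the failure happens before any server is reached. It uses urllib.request instead of requests to stay free of third-party packages; probe and probe_during_outage are made-up names for illustration.

```python
import urllib.error
import urllib.request
from unittest import mock


def probe(url):
    """Classify a URL as 'ok', 'http error <code>' or 'unreachable'."""
    try:
        urllib.request.urlopen(url, timeout=5).close()
    except urllib.error.HTTPError as err:
        # Connection established, but the server answered with an error status.
        return "http error %d" % err.code
    except OSError:
        # URLError and socket timeouts are OSError subclasses: no response at all.
        return "unreachable"
    return "ok"


def probe_during_outage(url):
    """Run probe() while urlopen is patched to fail before any connection."""
    refused = urllib.error.URLError("connection refused")
    with mock.patch("urllib.request.urlopen", side_effect=refused):
        return probe(url)
```

The same pattern should carry over to code that uses requests by patching the function the code under test actually calls and using requests.ConnectionError or requests.Timeout as the side effect.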
I'm currently doing a lot of work with BigQuery, and am using a lot of try... except.... It looks like just about every error I get back from BigQuery is an apiclient.errors.HttpError, but with different strings attached to them, e.g.:
<HttpError 409 when requesting https://www.googleapis.com/bigquery/v2/projects/some_id/datasets/some_dataset/tables?alt=json returned "Already Exists: Table some_id:some_dataset.some_table">
<HttpError 404 when requesting https://www.googleapis.com/bigquery/v2/projects/some_id/jobs/sdfgsdfg?alt=json returned "Not Found: Job some_id:sdfgsdfg">
among many others. Right now the only way I see to handle these is to run regexes on the error messages, but this is messy and definitely not ideal. Is there a better way?
BigQuery is a REST API, so the errors it returns follow standard HTTP error conventions.
In Python, an HttpError has a resp.status field that holds the HTTP status code.
As you show above, 409 is 'Conflict' and 404 is 'Not Found'.
For example:
import time

from googleapiclient.errors import HttpError

try:
    ...
except HttpError as err:
    # If the error is a rate limit or connection error,
    # wait and try again.
    if err.resp.status in [403, 500, 503]:
        time.sleep(5)
    else:
        raise
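The snippet above only shows the status check and the wait. A complete retry loop might look like the sketch below. It is an illustration under assumptions rather than part of any Google client library: call_with_retries and all of its parameters are hypothetical, and the only thing it relies on is that the exception carries resp.status, as HttpError does.

```python
import time


def call_with_retries(call, error_type, retriable=(403, 500, 503),
                      attempts=3, delay=5, sleep=time.sleep):
    """Call `call()`; retry when it raises `error_type` with a retriable status."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except error_type as err:
            # Give up on non-retriable statuses or when attempts are exhausted.
            if err.resp.status not in retriable or attempt == attempts:
                raise
            sleep(delay)
```

For an API request object named request (an assumed name), this could be used as call_with_retries(request.execute, HttpError).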
The response body is also a JSON object, so an even better way is to parse the JSON and read the error reason field:
if err.resp.get('content-type', '').startswith('application/json'):
    reason = json.loads(err.content).get('error').get('errors')[0].get('reason')
This can be:
notFound, duplicate, accessDenied, invalidQuery, backendError, resourcesExceeded, invalid, quotaExceeded, rateLimitExceeded, timeout, etc.
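Reading the reason can be made a little more defensive than the one-liner above, which fails if any level of the body is missing. The helper below is a sketch with a made-up name (error_reason); the payload shape it expects, error.errors[0].reason, is taken from the one-liner, and the sample body in the usage line is invented for illustration.

```python
import json


def error_reason(content):
    """Return the first error reason from a JSON error body, or None."""
    try:
        payload = json.loads(content)
    except ValueError:
        # The body is not JSON at all (e.g. an HTML error page).
        return None
    error = payload.get("error") if isinstance(payload, dict) else None
    errors = error.get("errors") if isinstance(error, dict) else None
    if not errors or not isinstance(errors[0], dict):
        return None
    return errors[0].get("reason")


# Hypothetical error body, shaped like the structure the one-liner reads.
sample = b'{"error": {"code": 409, "errors": [{"reason": "duplicate"}]}}'
reason = error_reason(sample)
```

Inside an except HttpError block it would be called as error_reason(err.content), and the result can be compared against the reasons listed above.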
Google Cloud now provides exception handlers:
from google.api_core.exceptions import AlreadyExists, NotFound
try:
    ...
except AlreadyExists:
    ...
except NotFound:
    ...
This should prove more exact in catching the details of the error.
See the source code for other exceptions you can use: http://google-cloud-python.readthedocs.io/en/latest/_modules/google/api_core/exceptions.html
The following URL returns the expected response in the browser:
http://ws.audioscrobbler.com/2.0/?method=user.getinfo&user=notonfile99&api_key=8e9de6bd545880f19d2d2032c28992b4
<lfm status="failed">
<error code="6">No user with that name was found</error>
</lfm>
But I am unable to access the XML in Python with the following code, because an HTTPError exception is raised:
"due to an HTTP Error 400: Bad Request"
import urllib2
urllib2.urlopen('http://ws.audioscrobbler.com/2.0/?method=user.getinfo&user=notonfile99&api_key=8e9de6bd545880f19d2d2032c28992b4')
I see that I can work around this by using urlretrieve rather than urlopen, but then the response gets written to disk.
Is there a way, using only the Python 2.7 standard library, to get hold of the XML response without having to read it from disk and clean up afterwards?
I see that this question has been asked before in a PHP context, but I don't know how to apply the answer to Python:
DOMDocument load on a page returning 400 Bad Request status
Copying from here: http://www.voidspace.org.uk/python/articles/urllib2.shtml#httperror
The exception that is thrown contains the full body of the error page:
#!/usr/bin/python2
import urllib2
try:
    resp = urllib2.urlopen('http://ws.audioscrobbler.com/2.0/?method=user.getinfo&user=notonfile99&api_key=8e9de6bd545880f19d2d2032c28992b4')
except urllib2.HTTPError, e:
    print e.code
    print e.read()
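For readers on Python 3, the same idea might be sketched as follows. This is an assumption-laden illustration, not a drop-in for the Python 2.7 code above: urllib2 became urllib.request and urllib.error, and fetch_with_status together with its opener parameter are hypothetical names added so the function can be exercised without a network.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch_with_status(url, opener=urlopen):
    """Return (status, body) for `url`, even when the status is an HTTP error."""
    try:
        resp = opener(url)
    except HTTPError as err:
        # The exception doubles as a file-like response holding the error page.
        try:
            return err.code, err.read()
        finally:
            err.close()
    with resp:
        return resp.status, resp.read()
```

The body returned for a 400 response can then be handed to an XML parser like any other response.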