I want to grab the HTTP status code when a URLError exception is raised.
I tried this, but it didn't help:
except URLError, e:
    logger.warning('It seems like the server is down. Code: ' + str(e.code))
You shouldn't check for a status code after catching URLError, since that exception can be raised in situations where there's no HTTP status code available, for example when you're getting connection refused errors.
Use HTTPError to check for HTTP-specific errors, and then use URLError to check for other problems:
try:
    urllib2.urlopen(url)
except urllib2.HTTPError, e:
    print e.code
except urllib2.URLError, e:
    print e.args
Of course, you'll probably want to do something more clever than just printing the error codes, but you get the idea.
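For instance, here is a hedged sketch of what "more clever" might look like; the retry-on-5xx policy, the example URL, and the logger name are my own illustration, not part of the original answer:
import logging
import urllib2

logger = logging.getLogger(__name__)
url = 'http://www.example.com'  # hypothetical URL, just for illustration

try:
    response = urllib2.urlopen(url)
except urllib2.HTTPError, e:
    if e.code == 404:
        logger.warning('Page not found: %s', url)              # give up on missing pages
    elif 500 <= e.code < 600:
        logger.warning('Server error %d, worth retrying', e.code)
    else:
        logger.error('Unexpected HTTP error: %d', e.code)
except urllib2.URLError, e:
    logger.error('Failed to reach the server: %s', e.reason)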
Not sure why you are getting this error. If you are using urllib2, this should help:
import urllib2
from urllib2 import URLError

try:
    urllib2.urlopen(url)
except URLError, e:
    print e.code
I'm trying to scrape data from www.money.usnews.com, but I consistently get this TimeoutError:
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly
respond after a period of time, or established connection failed because connected host has failed to
respond.
Here is my code (I included a different website as a test case, and it works). I am new to web scraping. What should I do?
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup

try:
    html = urlopen('https://money.usnews.com')
    # html = urlopen('http://www.timeanddate.com/weather/usa/los-angeles')
except HTTPError as e:
    print(e)
except URLError as e:
    print("No server")
except ValueError as e:
    print("NULL ")
else:
    print("b")
    bs = BeautifulSoup(html.read(), 'html.parser')
    print(bs.h1)
My script in Python 2.7 scrapes a website every minute, but sometimes it gives the error:
<urlopen error [Errno 54] Connection reset by peer>
How can I handle the exception? I am trying something like this:
import urllib2
from socket import error as SocketError
import errno

try:
    response = urllib2.urlopen(request).read()
except SocketError as e:
    if e.errno != errno.ECONNRESET:
        print "there is a time-out"
    pass
print "There is no time-out, continue"
You can handle the exception like this:
import urllib2
from socket import error as SocketError
import errno

try:
    urllib2.urlopen(request).read()
except SocketError, e:
    errorcode = e[0]
    if errorcode != errno.ECONNRESET:
        # Not the error we are looking for, re-raise
        raise e
    print "There is a time-out"
You can look up the individual error codes in the errno module documentation.
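As a quick illustration (not part of the original answer), the errno module maps numeric codes to symbolic names, which helps when deciding which errors to swallow and which to re-raise:
import errno

# Numeric value and symbolic name for "connection reset by peer"
print errno.ECONNRESET                   # e.g. 104 on Linux, 54 on BSD/macOS
print errno.errorcode[errno.ECONNRESET]  # 'ECONNRESET'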
I have recently run into an issue at work where we are having intermittent problems with an internal website not loading due to an interrupted system call. We are using urllib2 to access the website. I can't share the exact code, but here is basically how we do it:
payload = {'userName': user_name,
           'emailAddress': email_address,
           'password': password}
headers = {'Accept': 'application/json',
           'Content-Type': 'application/json',
           'Authorization': token}
values = json.dumps(payload)
req = urllib2.Request(url, values, headers)
try:
    response = urllib2.urlopen(req, timeout=30)
    break
except IOError, e:
    if e.errno != errno.EINTR:
        print e.errno
        raise
We log the errno and the raised exception. The exception is:
IOError: <urlopen error [Errno 4] Interrupted system call>
And the errno is None. I expected it to be 4.
Is there a better way to catch this error in Python 2.7? I am aware of PEP 475, but we cannot upgrade to Python 3 right now.
The <urlopen error [Errno 4] Interrupted system call> message indicates that this is actually a URLError from urllib2, which subclasses IOError but handles its arguments completely differently. That is why the errno and strerror attributes are not initialized. urllib2 both passes plain strings as the reason:
raise URLError("qop '%s' is not supported." % qop)
and wraps exceptions from other sources:
try:
    h.request(req.get_method(), req.get_selector(), req.data, headers)
except socket.error, err: # XXX what error?
    h.close()
    raise URLError(err)
This is why you will not find errno in the usual place:
>>> try:
...     urlopen('http://asdf')
... except URLError, e:
...     pass
...
>>> e
URLError(gaierror(-2, 'Name or service not known'),)
>>> e.errno
>>> e.reason
gaierror(-2, 'Name or service not known')
>>> e.reason.errno
-2
This worked in this case, but the reason attribute could be a string or a socket.error, which has (had) its own problems with errno.
The definition of URLError in urllib2.py:
class URLError(IOError):
    # URLError is a sub-type of IOError, but it doesn't share any of
    # the implementation.  need to override __init__ and __str__.
    # It sets self.args for compatibility with other EnvironmentError
    # subclasses, but args doesn't have the typical format with errno in
    # slot 0 and strerror in slot 1.  This may be better than nothing.
    def __init__(self, reason):
        self.args = reason,
        self.reason = reason

    def __str__(self):
        return '<urlopen error %s>' % self.reason
So, long story short, it's a horrible mess. You have to check e.reason and work out:
Is it just a string? If so, there will be no errno anywhere.
Is it a socket.error? Handle the quirks of that. Again, the errno attribute can be unset, or None, since it could also have been raised with a single string argument.
Is it a subclass of IOError or OSError (which subclass EnvironmentError)? Read the errno attribute of that, and hope for the best.
This can be, and probably is, overly cautious for your case, but it is good to understand the edges. Tornado had similar issues and uses a utility function to get the errno from an exception, but unfortunately that function does not work with URLErrors.
Something that could cover at least some cases (a fuller helper covering all three cases is sketched after it):
while True:  # or some bounded number of retries
    try:
        response = urllib2.urlopen(req, timeout=30)
        break
    except URLError, e:
        if getattr(e.reason, 'errno', None) == errno.EINTR:
            # Interrupted system call: retry
            continue
        raise  # not EINTR, let other errors propagate
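For completeness, here is a hedged sketch (mine, not from the original answer) of a helper that walks through the three cases listed above; the name errno_from_urlerror is made up for illustration:
import socket

def errno_from_urlerror(exc):
    """Best-effort extraction of an errno from a URLError; may return None."""
    reason = getattr(exc, 'reason', None)
    if isinstance(reason, basestring):
        # Case 1: reason is just a string; there is no errno anywhere.
        return None
    if isinstance(reason, socket.error):
        # Case 2: a wrapped socket.error; errno may still be unset or None.
        return getattr(reason, 'errno', None)
    if isinstance(reason, EnvironmentError):
        # Case 3: some other IOError/OSError subclass; read errno and hope.
        return getattr(reason, 'errno', None)
    return None
In the retry loop above, the getattr call could then be replaced with errno_from_urlerror(e) == errno.EINTR.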
How do I differentiate a timeout error from other URLErrors in Python?
EDIT
When I catch a URLError, it could be a "Temporary failure in name resolution", a timeout, or some other error. How can I tell one from another?
I use code like Option 2 below... but for a comprehensive answer, look at Michael Foord's urllib2 page.
If you use either Option 1 or Option 2 below, you can add as much intelligence and branching as you like in the except clauses by looking at e.code or e.reason.
Option 1:
from urllib2 import Request, urlopen, URLError, HTTPError

req = Request(someurl)
try:
    response = urlopen(req)
except HTTPError, e:
    print 'The server couldn\'t fulfill the request.'
    print 'Error code: ', e.code
except URLError, e:
    print 'We failed to reach a server.'
    print 'Reason: ', e.reason
else:
    # everything is fine
    pass
Option 2:
from urllib import urlencode
from urllib2 import Request, urlopen
# insert other code here...

error = False
error_code = ""

try:
    if method.upper() == "GET":
        response = urlopen(req)
    elif method.upper() == "POST":
        response = urlopen(req, data)
except IOError, e:
    if hasattr(e, 'reason'):
        #print 'We failed to reach a server.'
        #print 'Reason: ', e.reason
        error = True
        error_code = e.reason
    elif hasattr(e, 'code'):
        #print 'The server couldn\'t fulfill the request.'
        #print 'Error code: ', e.code
        error = True
        error_code = e.code
else:
    # info is dictionary of server parameters, such as 'content-type', etc...
    info = response.info().dict
    page = response.read()
I use the following code to differentiate a timeout error from other URLErrors:
except URLError, e:
    if e.reason.message == 'timed out':
        # handle timed out exception
        pass
    else:
        # other URLError
        pass
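A possibly more robust variant (my own suggestion, not from the original answer) checks the type of e.reason rather than its deprecated message attribute, since urllib2 typically wraps a socket.timeout as the reason when the connection times out:
import socket
import urllib2

try:
    response = urllib2.urlopen('http://www.example.com', timeout=5)
except urllib2.URLError, e:
    if isinstance(e.reason, socket.timeout):
        print "request timed out"
    else:
        print "other URLError:", e.reason
except socket.timeout:
    # a timeout while reading the response can surface directly as socket.timeout
    print "request timed out"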
I have the following code to do a postback to a remote URL:
request = urllib2.Request('http://www.example.com', postBackData, { 'User-Agent' : 'My User Agent' })

try:
    response = urllib2.urlopen(request)
except urllib2.HTTPError, e:
    checksLogger.error('HTTPError = ' + str(e.code))
except urllib2.URLError, e:
    checksLogger.error('URLError = ' + str(e.reason))
except httplib.HTTPException, e:
    checksLogger.error('HTTPException')
The postBackData is a dictionary encoded with urllib.urlencode, and checksLogger is a logger created with the logging module.
I have had a problem where this code is running while the remote server is down and the script exits (this runs on customer servers, so I don't know what the stack dump or error is at this time). I'm assuming it is because an exception or error is being raised that isn't handled. Are there any other exceptions that might be triggered that I'm not handling above?
Add a generic exception handler:
request = urllib2.Request('http://www.example.com', postBackData, { 'User-Agent' : 'My User Agent' })

try:
    response = urllib2.urlopen(request)
except urllib2.HTTPError, e:
    checksLogger.error('HTTPError = ' + str(e.code))
except urllib2.URLError, e:
    checksLogger.error('URLError = ' + str(e.reason))
except httplib.HTTPException, e:
    checksLogger.error('HTTPException')
except Exception:
    import traceback
    checksLogger.error('generic exception: ' + traceback.format_exc())
From the urlopen entry on the docs page, it looks like you just need to catch URLError. If you really want to hedge your bets against problems within the urllib code, you can also catch Exception as a fall-back. Do not use a bare except:, since that will also catch SystemExit and KeyboardInterrupt.
Edit: What I mean to say is, you're catching the errors it's supposed to throw. If it's throwing something else, it's probably due to the urllib code not catching something that it should have caught and wrapped in a URLError. Even the stdlib tends to miss simple things like AttributeError. Catching Exception as a fall-back (and logging what it caught) will help you figure out what's happening, without trapping SystemExit and KeyboardInterrupt.
$ grep "raise" /usr/lib64/python/urllib2.py
IOError); for HTTP errors, raises an HTTPError, which can also be
raise AttributeError, attr
raise ValueError, "unknown url type: %s" % self.__original
# XXX raise an exception if no one else should try to handle
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
perform the redirect. Otherwise, raise HTTPError if no-one
raise HTTPError(req.get_full_url(), code, msg, headers, fp)
raise HTTPError(req.get_full_url(), code,
raise HTTPError(req.get_full_url(), 401, "digest auth failed",
raise ValueError("AbstractDigestAuthHandler doesn't know "
raise URLError('no host given')
raise URLError('no host given')
raise URLError(err)
raise URLError('unknown url type: %s' % type)
raise URLError('file not on local host')
raise IOError, ('ftp error', 'no host given')
raise URLError(msg)
raise IOError, ('ftp error', msg), sys.exc_info()[2]
raise GopherError('no host given')
There is also the possibility of exceptions in urllib2 dependencies, or of exceptions caused by genuine bugs.
You are best off logging all uncaught exceptions to a file via a custom sys.excepthook. The key rule of thumb here is to never catch exceptions you aren't planning to correct, and logging is not a correction. So don't catch them just to log them.
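For instance, a minimal sketch of such a hook, assuming a hypothetical log file name crash.log:
import logging
import sys

logging.basicConfig(filename='crash.log', level=logging.ERROR)  # hypothetical log file

def log_uncaught(exc_type, exc_value, exc_traceback):
    # Let Ctrl-C terminate the program normally instead of being logged
    if issubclass(exc_type, KeyboardInterrupt):
        sys.__excepthook__(exc_type, exc_value, exc_traceback)
        return
    logging.error("Uncaught exception", exc_info=(exc_type, exc_value, exc_traceback))

sys.excepthook = log_uncaught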
You can catch all exceptions and log what gets caught:
import sys
import traceback

def formatExceptionInfo(maxTBlevel=5):
    cla, exc, trbk = sys.exc_info()
    excName = cla.__name__
    try:
        excArgs = exc.__dict__["args"]
    except KeyError:
        excArgs = "<no args>"
    excTb = traceback.format_tb(trbk, maxTBlevel)
    return (excName, excArgs, excTb)

try:
    x = x + 1
except:
    print formatExceptionInfo()
(Code from http://www.linuxjournal.com/article/5821)
Also read the documentation on sys.exc_info.
I catch:
httplib.HTTPException
urllib2.HTTPError
urllib2.URLError
I believe this covers everything, including socket errors.
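Putting that together, a minimal sketch of such a handler might look like this (the print statements and URL are illustrative, not from the original answer):
import httplib
import socket
import urllib2

try:
    response = urllib2.urlopen('http://www.example.com', timeout=30)
except urllib2.HTTPError, e:
    print 'HTTP error:', e.code
except urllib2.URLError, e:
    print 'URL error:', e.reason
except httplib.HTTPException, e:
    print 'HTTP exception:', e
except socket.error, e:
    # socket errors are usually wrapped in URLError, but catch them as a fall-back
    print 'Socket error:', e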