404 error received for working url using python urllib2 - python

I am trying to fetch the following URL: ow dot ly/LApK30cbLKj. It works in a browser, but I am getting an HTTP 404 error:
import urllib2

user_agent = 'Mozilla/5.0'  # any browser-like User-Agent string
my_url = 'ow' + '.ly/LApK30cbLKj' # SO won't accept an ow.ly url
headers = {'User-Agent': user_agent}
request = urllib2.Request(my_url, "", headers)
response = None
try:
    response = urllib2.urlopen(request)
except urllib2.HTTPError as e:
    print '+++HTTPError = ' + str(e.code)
Is there something I can do to fetch this URL with an HTTP 200 status, as I do when I visit it in a browser?

Your example works for me, except you need to add http://
my_url = 'http://ow' + '.ly/LApK30cbLKj'

You need to specify the URL's protocol. When you visit the URL in a browser, it defaults to HTTP, but urllib2 does not do that for you: you need to add http:// at the beginning of the URL, otherwise this error is raised:
ValueError: unknown url type: ow.ly/LApK30cbLKj
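If your URLs come from user input or a file, you can normalize them before handing them to urllib2. A minimal sketch using Python 2's urlparse module (the ensure_scheme helper is my own, not from the original answer):

import urlparse

def ensure_scheme(url, default='http'):
    # urllib2 rejects scheme-less URLs, so prepend one when missing.
    if not urlparse.urlparse(url).scheme:
        return default + '://' + url
    return url

print ensure_scheme('ow' + '.ly/LApK30cbLKj')  # prints http://ow.ly/LApK30cbLKj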

As @enjoi mentioned, I used requests:
import requests

result = None
try:
    # my_url as defined in the question; the original answer used its own agen_cont.source_url.
    # A timeout is needed for the Timeout handler below to ever fire.
    result = requests.get(my_url, timeout=30)
except requests.exceptions.Timeout as e:
    print '+++timeout exception:'
    print e
except requests.exceptions.TooManyRedirects as e:
    print '+++too many redirects exception:'
    print e
except requests.exceptions.RequestException as e:
    print '+++request exception:'
    print e
except Exception:
    import traceback
    print '+++generic exception: ' + traceback.format_exc()
if result:  # a Response is falsy for 4xx/5xx statuses, so this also filters bad responses
    final_url = result.url
    print final_url
    response = result.content
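One detail worth knowing: passing "" as the second argument to urllib2.Request turns the request into a POST, because any non-None data triggers a POST; omitting the data argument keeps a plain GET. With the scheme added and the data argument dropped, the original urllib2 code should work too. A minimal sketch (the User-Agent value is a placeholder):

import urllib2

my_url = 'http://ow' + '.ly/LApK30cbLKj'
request = urllib2.Request(my_url, headers={'User-Agent': 'Mozilla/5.0'})
try:
    response = urllib2.urlopen(request)
    print response.getcode()  # 200 on success
    print response.geturl()   # the final URL after urllib2 followed the ow.ly redirect
except urllib2.HTTPError as e:
    print '+++HTTPError = ' + str(e.code)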

Related

Can't catch exceptions with urllib2

I have a script printing out the response from an API, but I can't seem to catch any exceptions. I think I've gone through every question asked on this topic without any luck.
How can I check whether the script will catch any errors/exceptions?
I'm testing the script on a site I know returns 403 Forbidden, but the error doesn't show.
My script:
import urllib2

url_se = 'http://www.example.com'
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'API to File')]
try:
    request = opener.open(url_se)
except urllib2.HTTPError as e:
    print e.code
except urllib2.URLError as e:
    print e.args
except Exception:
    import traceback
    print 'Generic exception ' + traceback.format_exc()
response = request.read()
print response
Is this the right approach? What's the best practice for catching exceptions raised by urllib2?
There is a bug in your program: if any exception occurs in the try block, the variable request is never assigned, so the response = request.read() line raises a NameError. Correct it as:
import urllib2

url_se = 'http://www.example.com'
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'API to File')]
try:
    request = opener.open(url_se)
    response = request.read()
    print response
except urllib2.HTTPError as e:
    print e.code
except urllib2.URLError as e:
    print e.args
except Exception:
    import traceback
    print 'Generic exception ' + traceback.format_exc()
Test it on your machine and you will see the exceptions. Catching exceptions individually only makes sense if you do something specific with each of them; if you only want to log them, a single universal except block will do the job, as in the sketch below.
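For the log-only case, a single catch-all block with the logging module keeps the full traceback without per-exception handling. A minimal sketch along those lines (not from the original answer):

import logging
import urllib2

logging.basicConfig(level=logging.INFO)

url_se = 'http://www.example.com'
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'API to File')]
try:
    response = opener.open(url_se).read()
    print response
except Exception:
    # logging.exception records the message plus the full traceback.
    logging.exception('Request to %s failed', url_se)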

python using requests with valid hostname

I'm trying to use requests to download a list of URLs and catch the exception if a URL is bad. Here's my test code:
import requests
from requests.exceptions import ConnectionError

# good url
url = "http://www.google.com"
# bad url with good host
#url = "http://www.google.com/thereisnothing.jpg"
# url with bad host
#url = "http://somethingpotato.com"
print url
try:
    r = requests.get(url, allow_redirects=True)
    print "the url is good"
except ConnectionError as e:
    print e
    print "the url is bad"
The problem is that if I pass in url = "http://www.google.com", everything works as expected, since it is a good URL.
http://www.google.com
the url is good
But if I pass in url = "http://www.google.com/thereisnothing.jpg"
I still get:
http://www.google.com/thereisnothing.jpg
the url is good
So it's almost as if it's not even looking at anything after the "/".
Just to see whether the error checking works at all, I passed in a bad hostname: #url = "http://somethingpotato.com"
Which kicked back the error message I expected:
http://somethingpotato.com
HTTPConnectionPool(host='somethingpotato.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1b6cd15b90>: Failed to establish a new connection: [Errno -2] Name or service not known',))
the url is bad
What am I missing to make requests catch a bad URL, not just a bad hostname?
Thanks
requests does not raise an exception for a 404 response. Instead you need to filter those out by checking whether the status is 'ok' (HTTP 200):
import requests
from requests.exceptions import ConnectionError

# good url
url = "http://www.google.com/nothing"
# bad url with good host
#url = "http://www.google.com/thereisnothing.jpg"
# url with bad host
#url = "http://somethingpotato.com"
print url
try:
    r = requests.get(url, allow_redirects=True)
    if r.status_code == requests.codes.ok:
        print "the url is good"
    else:
        print "the url is bad"
except ConnectionError as e:
    print e
    print "the url is bad"
EDIT:
import requests
from requests.exceptions import ConnectionError

def printFailedUrl(url, response):
    if isinstance(response, ConnectionError):
        print "The url " + url + " failed to connect with the exception " + str(response)
    else:
        print "The url " + url + " produced the failed response code " + str(response.status_code)

def testUrl(url):
    try:
        r = requests.get(url, allow_redirects=True)
        if r.status_code == requests.codes.ok:
            print "the url is good"
        else:
            printFailedUrl(url, r)
    except ConnectionError as e:
        printFailedUrl(url, e)

def main():
    testUrl("http://www.google.com")                   # 'good' url
    testUrl("http://www.google.com/doesnotexist.jpg")  # 'bad' url with a 404 response
    testUrl("http://sdjgb")                            # 'bad' url with an unreachable host

main()
In this case one function can handle either an exception or a response object passed into it. This way you can respond differently when the url returns a non-'good' (non-200) status versus when the url is unusable and throws an exception. Hope this has the information you need.
What you want is to check r.status_code. Getting r.status_code for "http://www.google.com/thereisnothing.jpg" will give you 404. You can add a condition so that only a 200 response marks the URL as "good".
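Alternatively, requests can promote bad statuses to exceptions for you: Response.raise_for_status() raises requests.exceptions.HTTPError for any 4xx/5xx response, so one except-based flow covers both bad URLs and bad hosts. A minimal sketch:

import requests
from requests.exceptions import HTTPError, ConnectionError

url = "http://www.google.com/thereisnothing.jpg"
try:
    r = requests.get(url, allow_redirects=True)
    r.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
    print "the url is good"
except HTTPError as e:
    print "the url is bad (bad status):", e
except ConnectionError as e:
    print "the url is bad (bad host):", e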

How to check HTTP errors for more than two URLs?

Question: I have 3 URLs - testurl1, testurl2 and testurl3. I'd like to try testurl1 first; if I get a 404 error, try testurl2; if that also gets a 404 error, try testurl3. How can I achieve this? So far I've tried the code below, but it only works for two URLs. How do I add support for a third?
from urllib2 import Request, urlopen
from urllib2 import URLError, HTTPError

def checkfiles():
    req = Request('http://testurl1')
    try:
        response = urlopen(req)
        url1 = 'http://testurl1'
    except (HTTPError, URLError):
        url1 = 'http://testurl2'
    print url1
    finalURL = 'wget ' + url1 + '/testfile.tgz'
    print finalURL

checkfiles()
Another job for a plain old for loop:
for url in (testurl1, testurl2, testurl3):
    req = Request(url)
    try:
        response = urlopen(req)
    except HTTPError as err:
        if err.code == 404:
            continue
        raise
    else:
        # do what you want with the successful response here (or outside the loop)
        break
else:
    # They ALL errored out with HTTPError code 404. Handle this?
    raise err
Hmmm maybe something like this?
from urllib2 import Request, urlopen
from urllib2 import URLError, HTTPError

def checkfiles():
    try:
        response = urlopen(Request('http://testurl1'))
        url1 = 'http://testurl1'
    except (HTTPError, URLError):
        try:
            response = urlopen(Request('http://testurl2'))
            url1 = 'http://testurl2'
        except (HTTPError, URLError):
            url1 = 'http://testurl3'
    print url1
    finalURL = 'wget ' + url1 + '/testfile.tgz'
    print finalURL

checkfiles()
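The loop answer above generalizes more cleanly than nesting a try per URL; a sketch that wraps it in a small helper (the function name first_working_url is my own):

from urllib2 import Request, urlopen, HTTPError, URLError

def first_working_url(urls):
    # Return the first URL that opens successfully, or None if all fail.
    for url in urls:
        try:
            urlopen(Request(url))
            return url
        except (HTTPError, URLError):
            continue
    return None

url1 = first_working_url(['http://testurl1', 'http://testurl2', 'http://testurl3'])
if url1 is not None:
    print 'wget ' + url1 + '/testfile.tgz'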

In Python, how do I use urllib to see if a website is 404 or 200?

How do I get the status code from the response through urllib?
The getcode() method (added in Python 2.6) returns the HTTP status code that was sent with the response, or None if the URL is not an HTTP URL.
>>> import urllib
>>> a = urllib.urlopen('http://www.google.com/asdfsf')
>>> a.getcode()
404
>>> a = urllib.urlopen('http://www.google.com/')
>>> a.getcode()
200
You can use urllib2 as well:
import urllib2

req = urllib2.Request('http://www.python.org/fish.html')
try:
    resp = urllib2.urlopen(req)
except urllib2.HTTPError as e:
    if e.code == 404:
        # do something...
        pass
    else:
        # ...
        pass
except urllib2.URLError as e:
    # Not an HTTP-specific error (e.g. connection refused)
    # ...
    pass
else:
    # 200
    body = resp.read()
Note that HTTPError is a subclass of URLError which stores the HTTP status code.
For Python 3:
import urllib.request, urllib.error

url = 'http://www.google.com/asdfsf'
try:
    conn = urllib.request.urlopen(url)
except urllib.error.HTTPError as e:
    # Return code error (e.g. 404, 501, ...)
    print('HTTPError: {}'.format(e.code))
except urllib.error.URLError as e:
    # Not an HTTP-specific error (e.g. connection refused)
    print('URLError: {}'.format(e.reason))
else:
    # 200
    print('good')
import urllib2

try:
    fileHandle = urllib2.urlopen('http://www.python.org/fish.html')
    data = fileHandle.read()
    fileHandle.close()
except urllib2.URLError as e:
    print 'you got an error with the code', e

Overriding urllib2.HTTPError or urllib.error.HTTPError and reading response HTML anyway

I receive an 'HTTP Error 500: Internal Server Error' response, but I still want to read the data inside the error HTML.
With Python 2.6, I normally fetch a page using:
import urllib2
url = "http://google.com"
data = urllib2.urlopen(url)
data = data.read()
When attempting to use this on the failing URL, I get the exception urllib2.HTTPError:
urllib2.HTTPError: HTTP Error 500: Internal Server Error
How can I fetch such error pages (with or without urllib2), even while they are returning Internal Server Errors?
Note that with Python 3, the corresponding exception is urllib.error.HTTPError.
The HTTPError is a file-like object. You can catch it and then read its contents.
try:
    resp = urllib2.urlopen(url)
    contents = resp.read()
except urllib2.HTTPError as error:
    contents = error.read()
If you mean you want to read the body of the 500:
request = urllib2.Request(url, data, headers)
try:
    resp = urllib2.urlopen(request)
    print resp.read()
except urllib2.HTTPError as error:
    print "ERROR: ", error.read()
In your case, you don't need to build up the request. Just do
try:
    resp = urllib2.urlopen(url)
    print resp.read()
except urllib2.HTTPError as error:
    print "ERROR: ", error.read()
So you don't override urllib2.HTTPError; you just handle the exception.
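As noted above, in Python 3 the corresponding exception is urllib.error.HTTPError, and it is likewise a file-like object whose body you can read. A minimal Python 3 sketch:

import urllib.request
import urllib.error

url = "http://google.com"
try:
    contents = urllib.request.urlopen(url).read()
except urllib.error.HTTPError as error:
    # The server's error page (e.g. the 500 HTML) is still readable here.
    contents = error.read()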
import urllib2

alist = ['http://someurl.com']

def testUrl():
    errList = []
    for URL in alist:
        try:
            urllib2.urlopen(URL)
        except urllib2.URLError as err:
            # collect each failing URL with its reason instead of returning early
            errList.append(URL + " " + str(err.reason))
    return "".join(errList)

testUrl()
