Bad URLs in Python 3.4.3

I am new to this, so please help me. I am using urllib.request to open and read webpages. Can someone tell me how my code can handle redirects, timeouts, and badly formed URLs?
I have sort of found a way for timeouts, but I am not sure if it is correct. Is it? All opinions are welcome! Here it is:
import logging
from socket import timeout
from urllib.error import HTTPError, URLError
import urllib.request

try:
    text = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
except (HTTPError, URLError) as error:
    logging.error('Data of %s not retrieved because %s\nURL: %s', name, error, url)
except timeout:
    logging.error('socket timed out - URL %s', url)
Please help me as I am new to this. Thanks!

Take a look at the urllib error page.
So for the following behaviours:
Redirect: HTTP code 302, so that's an HTTPError with a code. You could also use the HTTPRedirectHandler instead of failing.
Timeouts: You have that correct.
Badly formed URLs: That's a URLError.
Here's the code I would use:
from socket import timeout
import urllib.error
import urllib.request

try:
    text = urllib.request.urlopen("http://www.google.com", timeout=0.1).read()
except urllib.error.HTTPError as error:
    print(error)
except urllib.error.URLError as error:
    print(error)
except timeout as error:
    print(error)
I can't find a redirecting URL offhand, so I'm not exactly sure how to check whether the HTTPError is a redirect.
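One way to surface redirects yourself (a minimal sketch, assuming you'd rather be told about a 3xx than have urlopen follow it silently; NoRedirect is just an illustrative name, and http://github.com is used because it normally redirects to https) is to install a redirect handler that refuses to follow them, so the 301/302 comes back as an HTTPError:
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # refusing the redirect makes urlopen raise HTTPError instead

opener = urllib.request.build_opener(NoRedirect())
try:
    opener.open("http://github.com", timeout=10)
except urllib.error.HTTPError as error:
    if error.code in (301, 302, 303, 307, 308):
        print("Redirected to", error.headers.get("Location"))
    else:
        print(error)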
You might find the requests package is a bit easier to use (it's suggested on the urllib page).

Using the requests package I was able to find a better solution. The only exceptions you need to handle are:
import requests

try:
    r = requests.get(url, timeout=5)
except requests.exceptions.Timeout:
    # Maybe set up for a retry, or continue in a retry loop
    ...
except requests.exceptions.TooManyRedirects:
    # Tell the user their URL was bad and try a different one
    ...
except requests.exceptions.ConnectionError:
    # Connection could not be completed
    ...
except requests.exceptions.RequestException:
    # Catastrophic error. Bail.
    raise
And to get the text of that page, all you need to do is:
r.text
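To act on the retry comment above, here is a minimal retry sketch (the three attempts and the two-second pause are illustrative values, not part of the answer):
import time
import requests

def fetch_text(url, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            r = requests.get(url, timeout=5)
            r.raise_for_status()
            return r.text
        except requests.exceptions.Timeout:
            if attempt == max_attempts:
                raise
            time.sleep(2)  # back off briefly before retrying
        except requests.exceptions.RequestException:
            raise  # treat anything else as fatal here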

Related

Why is 'https://revoked.badssl.com/' and 'https://pinning-test.badssl.com/' returning 200 response using Python requests?

I'm working with Python requests and testing URLs from https://badssl.com/ certificate section and all the invalid URLs are returning errors except for https://revoked.badssl.com/ and https://pinning-test.badssl.com/. They are responding with 200 status codes. I would like someone to explain why this is happening, despite the pages exhibiting errors such as NET::ERR_CERT_REVOKED and NET::ERR_SSL_PINNED_KEY_NOT_IN_CERT_CHAIN for the former and latter respectively.
import requests

def check_connection():
    url = 'https://revoked.badssl.com/' or 'https://pinning-test.badssl.com/'
    try:
        r = requests.get(url)
        r.raise_for_status()
        print(r)
    except requests.exceptions.RequestException as err:
        print("OOps: Something Else", err)
    except requests.exceptions.HTTPError as errh:
        print("Http Error:", errh)
    except requests.exceptions.ConnectionError as errc:
        print("Error Connecting:", errc)
    except requests.exceptions.Timeout as errt:
        print("Timeout Error:", errt)

check_connection()
You're not getting an analog to "NET::ERR_CERT_REVOKED" message because requests is just an HTTP request tool; it's not a browser. If you want to query an OCSP responder to see if a server certificate has been revoked, you can use the ocsp module to do that. There's an example here.
The answer is going to be similar for "NET::ERR_SSL_PINNED_KEY_NOT_IN_CERT_CHAIN"; the requests module isn't the sort of high-level tool that implements certificate pinning. In fact, even the development builds of major browsers don't implement this; there's some interesting discussion about this issue in https://github.com/chromium/badssl.com/issues/15.
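If you do want to look into revocation yourself, here is a minimal first-step sketch using the cryptography package (my own assumption; the answer above mentions an ocsp module instead): pull the server certificate and read the OCSP responder URL out of its Authority Information Access extension. A full check would then build an OCSP request with the issuer certificate and query that URL.
import ssl
from cryptography import x509
from cryptography.x509.oid import AuthorityInformationAccessOID, ExtensionOID

host = "revoked.badssl.com"
pem = ssl.get_server_certificate((host, 443))
cert = x509.load_pem_x509_certificate(pem.encode())

aia = cert.extensions.get_extension_for_oid(
    ExtensionOID.AUTHORITY_INFORMATION_ACCESS).value
ocsp_urls = [desc.access_location.value
             for desc in aia
             if desc.access_method == AuthorityInformationAccessOID.OCSP]
print(ocsp_urls)  # where a client would ask whether this certificate is revoked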

How to know for sure if requests.get has timed out?

In this line of code:
request = requests.get(url, verify=False, headers=headers, proxies=proxy, timeout=15)
How do I know that timeout=15 has been triggered so I can send a message that url did not send any data in 15 seconds?
If a response is not received from the server at all within the given time, then an exception requests.exceptions.Timeout is thrown, as per Exa's link from the other answer.
To test if this occurred we can use a try, except block to detect it and act accordingly, rather than just letting our program crash.
Expanding on the demonstration used in the docs:
import requests

try:
    r = requests.get('https://github.com/', timeout=0.001)
except requests.exceptions.Timeout as e:
    # code to run if we didn't get a reply in time
    print("Request timed out!\nDetails:", e)
else:
    # code to run if we did get a response, and only if we did
    print(r.headers)
Just substitute your url and timeout where appropriate.
An exception will be thrown. See this for more info.

Identify if a website is taking too long to respond

I need to find if a website is taking too long to respond or not.
For example, I need to identify this website as problematic: http://www.lowcostbet.com/
I am trying something like this:
print urllib.urlopen("http://www.lowcostbet.com/").getcode()
but I am getting Connection timed out.
My objective is just to create a routine that identifies which websites are taking too long to load (e.g. more than 4 seconds) and cancels the request.
urlopen from the urllib2 package has a timeout param.
You can use something like this:
from urllib2 import urlopen

TO = 4
website = "http://www.lowcostbet.com/"

try:
    response = urlopen(website, timeout=TO)
except:
    mark_as_not_responsive(website)
Update: please note that using my snippet as-is sucks, because you'll catch all kinds of exceptions here, not just timeouts. And you probably need to make several tries before marking a website as non-responsive.
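In Python 3 terms (urllib2 became urllib.request and urllib.error), a sketch that only treats timeouts as non-responsive and retries a few times might look like this; mark_as_not_responsive is the same hypothetical callback as above, and the attempt count is an illustrative value:
import socket
import urllib.error
import urllib.request

TO = 4
ATTEMPTS = 3  # illustrative value

def check(website):
    for _ in range(ATTEMPTS):
        try:
            urllib.request.urlopen(website, timeout=TO)
            return  # responded in time
        except socket.timeout:
            continue  # too slow, try again
        except urllib.error.URLError as exc:
            if isinstance(exc.reason, socket.timeout):
                continue  # connect phase timed out
            raise  # some other failure; don't treat it as slowness
    mark_as_not_responsive(website)  # hypothetical callback, defined elsewhere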
Also, requests.get has a timeout kwarg you can pass in.
From the docs:
requests.get('http://github.com', timeout=0.001)
This will raise an exception, so you probably want to handle that.
http://docs.python-requests.org/en/latest/user/quickstart/
The timeout value will be applied to both the connect and the read timeouts. Specify a tuple if you would like to set the values separately:
import requests

try:
    r = requests.get('https://github.com', timeout=(6.05, 27))
except requests.Timeout:
    ...
except requests.ConnectionError:
    ...
except requests.HTTPError:
    ...
except requests.RequestException:
    ...
else:
    print(r.text)

how can I detect whether the http and https service is ok in python?

I want to detect whether the http or https service is ok, in Python.
What I know so far is to use the httplib module:
use httplib.HTTPConnection to get the status, check whether it is 'OK' (code 200), and do the same for https using HTTPSConnection.
But I don't know whether this way is right, or whether there is a better way?
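For reference, the approach described above looks roughly like this in Python 3, where httplib became http.client (a minimal sketch; the host is a placeholder):
import http.client

conn = http.client.HTTPSConnection("example.com", timeout=10)
conn.request("HEAD", "/")
response = conn.getresponse()
print(response.status == 200)  # True if the service answered 200 OK
conn.close()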
I have a script that does this kind of check, I use urllib2 for that, whatever the protocol (http or https):
import urllib2

# url and TIMEOUT are defined elsewhere in the script this snippet comes from
result = False
error = None
try:
    # Open URL
    urllib2.urlopen(url, timeout=TIMEOUT)
    result = True
except urllib2.HTTPError as exc:
    # HTTPError is a subclass of URLError, so it has to be caught first
    error = 'HTTP Error: {0}'.format(str(exc))
except urllib2.URLError as exc:
    error = 'URL Error: {0}'.format(str(exc))
except Exception as exc:
    error = 'Unknown error: {0}'.format(str(exc))
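Since urllib2 is Python 2 only, here is a minimal Python 3 sketch of the same check (an adaptation, not the original script; TIMEOUT and the url are placeholders you would supply):
import urllib.error
import urllib.request

TIMEOUT = 10  # illustrative value

def service_is_ok(url):
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT) as response:
            return response.status == 200
    except urllib.error.URLError:
        # covers HTTPError too, since HTTPError subclasses URLError
        return False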

How do I catch a specific HTTP error in Python?

I have
import urllib2

try:
    urllib2.urlopen("some url")
except urllib2.HTTPError:
    <whatever>
but what I end up catching is any kind of HTTP error. I want to catch it only if the specified webpage doesn't exist (404?).
Python 3
from urllib.error import HTTPError
Python 2
from urllib2 import HTTPError
Just catch HTTPError, handle it, and if it's not Error 404, simply use raise to re-raise the exception.
See the Python tutorial.
Here is a complete example for Python 2:
import urllib2
from urllib2 import HTTPError

try:
    urllib2.urlopen("some url")
except HTTPError as err:
    if err.code == 404:
        ...  # handle the 404 case here
    else:
        raise
For Python 3.x
import urllib.request
import urllib.error

try:
    urllib.request.urlretrieve(url, fullpath)
except urllib.error.HTTPError as err:
    print(err.code)
Tim's answer seems misleading to me, especially when urllib2 does not return the expected code. For example, this error will be fatal (believe it or not, it is not an uncommon one when downloading URLs):
AttributeError: 'URLError' object has no attribute 'code'
A quick, but maybe not the best, solution is code using a nested try/except block:
import urllib2

try:
    urllib2.urlopen("some url")
except urllib2.HTTPError as err:
    try:
        if err.code == 404:
            ...  # Handle the error
        else:
            raise
    except:
        ...
More information on nested try/except blocks: Are nested try/except blocks in Python a good programming practice?
If from urllib.error import HTTPError doesn't work, try using from requests.exceptions import HTTPError.
Sample:
from requests.exceptions import HTTPError

try:
    <access some url>
except HTTPError:
    # Handle the error as usual
    ...
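For completeness, here is a minimal concrete version (the URL is a placeholder); note that requests only raises HTTPError when you call raise_for_status() on the response, and the status code is available on err.response:
import requests
from requests.exceptions import HTTPError

try:
    r = requests.get("http://example.com/some-url", timeout=10)
    r.raise_for_status()
except HTTPError as err:
    if err.response.status_code == 404:
        print("Page not found")
    else:
        raise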
