I need to find out whether a website is taking too long to respond.
For example, I need to identify this website as problematic: http://www.lowcostbet.com/
I am trying something like this:
print urllib.urlopen("http://www.lowcostbet.com/").getcode()
but I am getting Connection timed out.
My objective is just to create a routine that identifies which websites take too long to load (e.g. cancel the request after 4 seconds).
urlopen from the urllib2 package has a timeout parameter.
You can use something like this:
from urllib2 import urlopen

TO = 4
website = "http://www.lowcostbet.com/"

try:
    response = urlopen(website, timeout=TO)
except:
    mark_as_not_responsive(website)
UPD:
Please note that using my snippet as-is is a bad idea, because it catches all kinds of exceptions, not just timeouts. And you probably want to make several attempts before marking a website as non-responsive.
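A rough sketch of that idea (the retry count is arbitrary, mark_as_not_responsive is still assumed to exist elsewhere, and note that this still lumps all URLErrors in with timeouts, so narrow the except clause to taste):

import socket
from urllib2 import urlopen, URLError

TIMEOUT = 4
ATTEMPTS = 3

def is_responsive(website):
    for _ in range(ATTEMPTS):
        try:
            urlopen(website, timeout=TIMEOUT)
            return True
        except (URLError, socket.timeout):
            # timeouts and other URL errors both end up here; refine if needed
            continue
    return False

if not is_responsive("http://www.lowcostbet.com/"):
    mark_as_not_responsive("http://www.lowcostbet.com/")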
Also, requests.get has a timeout kwarg you can pass in.
From the docs:
requests.get('http://github.com', timeout=0.001)
This will raise an exception, so you probably want to handle that.
http://docs.python-requests.org/en/latest/user/quickstart/
The timeout value will be applied to both the connect and the read timeouts. Specify a tuple if you would like to set the values separately:
import requests

try:
    r = requests.get('https://github.com', timeout=(6.05, 27))
except requests.Timeout:
    ...
except requests.ConnectionError:
    ...
except requests.HTTPError:
    ...
except requests.RequestException:
    ...
else:
    print(r.text)
Related
I want to fetch some websites' source for a project. When I try to get the response, the program just gets stuck waiting. No matter how long I wait, there is no timeout and no response. Here is my code:
import urllib.request

link = "https://eu.mouser.com/"
linkResponse = urllib.request.urlopen(link)
readedResponse = linkResponse.readlines()

writer = open("html.txt", "w")
for line in readedResponse:
    writer.write(str(line))
    writer.write("\n")
writer.close()
When I try other websites, urlopen returns their response. But when I try to get "eu.mouser.com" and "uk.farnell.com", it never returns their response; urlopen does not even raise a timeout. What is the problem there? Is there another way to get a website's source?
The urllib.request.urlopen docs claim that
The optional timeout parameter specifies a timeout in seconds for
blocking operations like the connection attempt (if not specified, the
global default timeout setting will be used). This actually only works
for HTTP, HTTPS and FTP connections.
without explaining how to find said default. I managed to provoke a timeout by directly providing 5 (seconds) as the timeout:
import urllib.request
url = "https://uk.farnell.com"
urllib.request.urlopen(url, timeout=5)
gives
socket.timeout: The read operation timed out
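As an aside (my addition, not something the quoted docs spell out), the global default timeout mentioned above can be inspected and changed via the socket module:

import socket

print(socket.getdefaulttimeout())  # None by default, i.e. block indefinitely
socket.setdefaulttimeout(5)        # applies to urlopen calls that omit timeout=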
There are some sites that protect themselves from automated crawlers by implementing mechanisms that detect such bots. These can be very diverse and also change over time. If you really want to do everything you can to get the page crawled automatically, this usually means that you have to implement steps yourself to circumvent these protective barriers.
One example of this is the header information that is provided with every request. This can be changed before making the request, e.g. via request's header customization. But there are probably more things to do here and there.
If you're interested in starting developing such a thing (leaving aside the question of whether this is allowed at all), you can take this as a starting point:
from collections import namedtuple
from contextlib import suppress

import requests
from requests import ReadTimeout

Link = namedtuple("Link", ["url", "filename"])

links = {
    Link("https://eu.mouser.com/", "mouser.com"),
    Link("https://example.com/", "example1.com"),
    Link("https://example.com/", "example2.com"),
}

for link in links:
    with suppress(ReadTimeout):
        response = requests.get(link.url, timeout=3)
        with open(f"html-{link.filename}.txt", "w", encoding="utf-8") as file:
            file.write(response.text)
Here, protected sites that lead to ReadTimeout errors are simply skipped, with the possibility to go further, e.g. by enhancing requests.get(link.url, timeout=3) with a suitable headers parameter. But as I already mentioned, this is probably not the only customization that would have to be made, and the legal aspects should also be clarified.
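A minimal sketch of what that headers idea could look like, assuming a browser-like User-Agent is enough to satisfy the check (it often is not, so treat this purely as a starting point):

import requests

headers = {
    # A browser-like User-Agent; the exact string here is only an example.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}

response = requests.get("https://eu.mouser.com/", headers=headers, timeout=3)
print(response.status_code, len(response.text))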
The API I'm sending requests to has a bit of an unusual format for its responses:
1. It always returns status_code = 200
2. There's an additional error key inside the returned JSON that details the actual status of the response:
   2.1. error = 0 means it completed successfully
   2.2. error != 0 means something went wrong
I'm trying to use the Retry class in urllib3, but as far as I understand it only uses the status_code from the response, not its actual content.
Are there any other options?
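For illustration, the response bodies being described might look something like this (every field name except error is a guess on my part):

# A hypothetical successful response body and a hypothetical failed one:
success_body = {"error": 0, "data": {"id": 123, "value": "ok"}}
failure_body = {"error": 42, "message": "something went wrong"}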
If I'm hearing you right, then there are two cases in which you have 'errors' to handle:
1. Any non-200 response from the web server (e.g. 500, 403, etc.)
2. Whenever the API returns a non-zero value for 'error' in the JSON response, since the server always responds with HTTP 200 even if your request is bad.
Given that we need to handle two completely different cases that trigger a retry, it's easier to write your own retry handler than to hack this into the urllib3 library or similar, since that way you can specify exactly the cases in which a retry is needed.
You might try something like the approach below, which also takes into account the number of requests you're making to determine whether there's a repeated error case. On API response errors or HTTP errors it uses 'exponential backoff' (suggested via comments on my initial answer) so you don't constantly tax the server: each successive retry waits longer before trying again, until a MAX_RETRY count is reached. As written, the base wait is 1 second for the first retry, 2 seconds for the second, 4 seconds for the third, and so on, which lets the server catch up if it has to rather than being constantly over-taxed.
import requests
import time

MAX_RETRY = 5

def make_request():
    '''This makes a single request to the server to get data from it.'''
    # Replace 'get' with whichever method you're using, and the URL with the actual API URL
    r = requests.get('http://api.example.com')
    # If r.status_code is not 200, treat it as an error.
    if r.status_code != 200:
        raise RuntimeError(f"HTTP Response Code {r.status_code} received from server.")
    else:
        j = r.json()
        if j['error'] != 0:
            raise RuntimeError(f"API Error Code {j['error']} received from server.")
        else:
            return j

def request_with_retry(backoff_in_seconds=1):
    '''This makes a request retry up to MAX_RETRY set above with exponential backoff.'''
    attempts = 1
    while True:
        try:
            data = make_request()
            return data
        except RuntimeError as err:
            print(err)
            if attempts > MAX_RETRY:
                raise RuntimeError("Maximum number of attempts exceeded, aborting.")
            sleep = backoff_in_seconds * 2 ** (attempts - 1)
            print(f"Retrying request (attempt #{attempts}) in {sleep} seconds...")
            time.sleep(sleep)
            attempts += 1
Then you couple these two functions together with the following to actually attempt to get data from the API server, and then either error out hard or do something with the data if no errors were encountered:
# This code actually *calls* these functions which contain the request with retry and
# exponential backoff *and* the individual request process for a single request.
try:
    data = request_with_retry()
except RuntimeError as err:
    print(err)
    exit(1)
After that code, you can just 'do something' with data, which is the JSON(?) output of your API, even if that part is included in another function. You just need the two functions above (written this way to reduce code duplication).
I am new to this, so please help me. I am using urllib.request to open and read webpages. Can someone tell me how my code can handle redirects, timeouts, and badly formed URLs?
I have sort of found a way to handle timeouts, but I am not sure if it is correct. Is it? All opinions are welcome! Here it is:
import logging
import urllib.request
from socket import timeout
from urllib.error import HTTPError, URLError

# url and name are assumed to be defined earlier in the surrounding code
try:
    text = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
except (HTTPError, URLError) as error:
    logging.error('Data of %s not retrieved because %s\nURL: %s', name, error, url)
except timeout:
    logging.error('socket timed out - URL %s', url)
Please help me as I am new to this. Thanks!
Take a look at the urllib error page.
So for the following behaviours:
Redirect: HTTP code 302, so that's an HTTPError with a code. You could also use the HTTPRedirectHandler instead of failing.
Timeouts: You have that correct.
Badly formed URLs: That's a URLError.
Here's the code I would use:
from socket import timeout
import urllib.error
import urllib.request

try:
    text = urllib.request.urlopen("http://www.google.com", timeout=0.1).read()
except urllib.error.HTTPError as error:
    print(error)
except urllib.error.URLError as error:
    print(error)
except timeout as error:
    print(error)
I couldn't find a redirecting URL, so I'm not exactly sure how to check whether the HTTPError is a redirect.
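One possible way around that (a sketch, not tested against every case): urlopen follows redirects automatically by default, so you can detect that one happened by comparing the final URL with the one you asked for:

import urllib.request

requested = "http://github.com"  # plain HTTP, usually redirected to HTTPS
response = urllib.request.urlopen(requested, timeout=10)
if response.geturl() != requested:
    print("Redirected to", response.geturl())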
You might find the requests package is a bit easier to use (it's suggested on the urllib page).
Using the requests package I was able to find a better solution. The only exceptions you need to handle are:
import requests

try:
    r = requests.get(url, timeout=5)
except requests.exceptions.Timeout:
    pass  # Maybe set up for a retry, or continue in a retry loop
except requests.exceptions.TooManyRedirects:
    pass  # Tell the user their URL was bad and try a different one
except requests.exceptions.ConnectionError:
    pass  # Connection could not be completed
except requests.exceptions.RequestException as e:
    raise  # Catastrophic error. Bail.
And to get the text of that page, all you need to do is:
r.text
I'm using an API via HTTP requests that return JSON. Calling the API, however, requires a start and an end page to be indicated, such as this:
import requests

def API_request(URL):
    while True:
        try:
            Response = requests.get(URL)
            Data = Response.json()
            return Data['data']
        except Exception as APIError:
            print(APIError)
            continue
        break

def build_orglist(start_page, end_page):
    APILink = ("http://sc-api.com/?api_source=live&system=organizations&action="
               "all_organizations&source=rsi&start_page={0}&end_page={1}&items_"
               "per_page=500&sort_method=&sort_direction=ascending&expedite=1&f"
               "ormat=json".format(start_page, end_page))
    return API_request(APILink)
The only way to know that you're no longer at an existing page is when the returned JSON is null.
If I wanted to run multiple build_orglist calls going over every single page asynchronously until I reach the end (null JSON), how could I do so?
I went with a mix of @LukasGraf's answer, using sessions to unify all of my HTTP connections into a single session, and made use of grequests for making the group of HTTP requests in parallel.
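For anyone curious, a rough sketch of that combination might look like the following; build_orglist_url is a hypothetical helper standing in for the URL construction shown in the question, and the page range is just an example:

import grequests
import requests

# Hypothetical helper standing in for the query-string building done in
# build_orglist above; only one page per request here for simplicity.
def build_orglist_url(page):
    return ("http://sc-api.com/?api_source=live&system=organizations"
            "&action=all_organizations&source=rsi&start_page={0}"
            "&end_page={0}&items_per_page=500&format=json".format(page))

session = requests.Session()  # one connection pool shared by every request

# One pending request per page (the range is just an example); grequests.map
# sends them in parallel and returns the responses in the original order.
pending = [grequests.get(build_orglist_url(page), session=session)
           for page in range(1, 11)]

for response in grequests.map(pending):
    if response is None:  # the request itself failed
        continue
    body = response.json()
    if not body or body.get('data') is None:  # null JSON: no more pages
        break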
I read that when I get this error I should specify the URL better. I assume that I should choose between two displayed or accessible options. How can I do that?
I couldn't find anything in urllib or its tutorial. Is my assumption true? Can I read the possible URLs somewhere?
When I open this URL in my browser I am redirected to a new URL.
The url I try to access: http://www.uniprot.org/uniprot/P08198_CSG_HALHA.fasta
The new url I am redirected: http://www.uniprot.org/uniprot/?query=replaces:P08198&format=fasta
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen(url)
except urllib.error.HTTPError as e:
    if int(e.code) == 300:
        pass  # what now?
The status code 300 is returned by the server to tell you that your request is somehow not complete and that you shall be more specific.
Testing the URL, I went to http://www.uniprot.org/ and entered "P08198" into the search. This resulted in the page http://www.uniprot.org/uniprot/P08198 telling me
Demerged into Q9HM69, B0R8E4 and P0DME1. [ List ]
To me it seems the query for this protein is not specific enough, as this protein code was split into subcategories or subcodes Q9HM69, B0R8E4 and P0DME1.
Conclusion
Status code 300 is a signal from the server application that your request is somehow ambiguous. The way you can make it specific enough is application specific and has nothing to do with Python or HTTP status codes; you have to find more details about a good URL in the application logic.
So I ran into this issue and wanted to get the actual content returned.
It turns out that this is the solution to my problem:
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen(url)
except urllib.error.HTTPError as e:
    if int(e.code) == 300:
        # The HTTPError object is file-like, so the body of the 300 response
        # (which lists the alternatives) can be read directly from the exception.
        response = e.read()
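Note that e.read() returns bytes, so if you want to inspect the returned content as text you will most likely want to decode it, e.g. response.decode('utf-8'), though the actual charset may differ.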