I have created a web bot that iterates over a website, e.g. example.com/?id=int where int is some integer. The function gets the result as raw HTML using the requests library, then hands it to parseAndWrite to extract a div and save its value in a SQLite DB:
def archive(initial_index, final_index):
    while True:
        try:
            for i in range(initial_index, final_index):
                res = requests.get('https://www.example.com/?id='+str(i))
                parseAndWrite(res.text)
                print(i, ' archived')
        except requests.exceptions.ConnectionError:
            print("[-] Connection lost. ")
            continue
        except:
            exit(1)
        break

archive(1, 10000)
My problem is that, after some time, the loop doesn't continue up to 10000 but restarts from a seemingly random value, resulting in many duplicate records in the database. What is causing this inconsistency?
I think your two loops are nested in the wrong order. The outer while loop is supposed to retry any URLs that cause connection errors, but you've put it outside the for loop that iterates over the URL numbers. That means you start over from the initial index whenever an error happens.
Try swapping the loops, and you'll only repeat one URL until it works:
def archive(initial_index, final_index):
    for i in range(initial_index, final_index):
        while True:
            try:
                res = requests.get('https://www.example.com/?id='+str(i))
                parseAndWrite(res.text)
                print(i, ' archived')
            except requests.exceptions.ConnectionError:
                print("[-] Connection lost. ")
                continue
            except:
                exit(1)
            break

archive(1, 10000)
A general rule for a try statement is to put as little code as possible inside it: only the code you expect to produce the error you want to catch. All other code goes before or after the statement.
Don't catch errors you don't know what to do with. Exiting the program is rarely the right thing to do; that will happen anyway if no one else catches the exception, so give your caller the chance to handle it.
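For instance, instead of the bare except: that calls exit(1), you could let anything other than a ConnectionError bubble up and decide what it means at the call site. A minimal sketch of that idea (not code from the question or the original answer):

try:
    archive(1, 10000)
except Exception as err:
    # the caller decides what a failure means: log it, retry later, or exit
    print("archive failed:", err)
    raise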
And finally, don't build URLs yourself; let the requests library do that for you. The base URL is https://www.example.com/; the id parameter and its value can be passed via a dict to requests.get.
Your outer loop will iterate over the various parameters used to construct the URL; the inner loop will try the request until it succeeds. Once the inner loop terminates, then you can use the response to call parseAndWrite.
def archive(initial_index, final_index):
    base_url = 'https://www.example.com/'
    for i in range(initial_index, final_index + 1):
        while True:
            try:
                res = requests.get(base_url, params={'id': i})
            except requests.exceptions.ConnectionError:
                print("[-] Connection lost, trying again")
                continue
            else:
                break
        parseAndWrite(res.text)
        print('{} archived'.format(i))

archive(1, 10000)
You might also consider letting requests handle the retries for you. See Can I set max_retries for requests.request? for a start.
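A minimal sketch of that approach, mounting urllib3's Retry on a requests Session (the retry count, backoff factor and status codes below are arbitrary examples, not values from the linked question):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
session.mount('https://', HTTPAdapter(max_retries=retries))

# connection failures and the listed status codes are now retried automatically
res = session.get('https://www.example.com/', params={'id': 1})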
If any connection error occurs, you restart at initial_index. Instead, you could retry the current index again and again, until the connection succeeds:
def archive(initial_index, final_index):
    for i in range(initial_index, final_index):
        while True:
            try:
                response = requests.get(f'https://www.example.com/?id={i}')
                parseAndWrite(response.text)
                print(f'{i} archived')
            except requests.exceptions.ConnectionError:
                print("[-] Connection lost. ")
            else:
                break

archive(1, 10000)
I have a function for a scraper that connects to a webpage and checks the response code (anything in the 200 range is fine, anything else is not OK). The function retries the connection when it hits a connection error or an SSLError, and tries again and again until the retry limit has been reached. Within my try block, I try to validate the response with an if/else statement. If the response is OK, it is returned; otherwise, the else branch should print the response code but also execute the except block. Should this be done by manually raising an exception and catching it in the except block, as here?
# Try to get the url at least ten times, in case it times out
def retries(scraper, json, headers, record_url, tries=10):
    for i in range(tries):
        try:
            response = scraper.post("https://webpageurl/etc", json=json, headers=headers)
            if response.ok:
                print('OK!')
                return response
            else:
                print(str(response))
                raise Exception("Bad Response Code")
        except (Exception, ConnectionError, SSLError):
            if i < tries - 1:
                sleep(randint(1, 2))
                continue
            else:
                return 'Con error'
Since this is not really an exception, you could just check for a problem condition after the try .. except block:
# Try to get the url at least ten times, in case it times out
def retries(scraper, json, headers, record_url, tries=10):
    problem = False
    for i in range(tries):
        try:
            response = scraper.post("https://webpageurl/etc", json=json, headers=headers)
            if response.ok:
                print('OK!')
                return response
            else:
                print(str(response))
                problem = True
        except (Exception, ConnectionError, SSLError):
            problem = True
        if problem:
            if i < tries - 1:
                sleep(randint(1, 2))
                continue
            else:
                return 'Con error'
Since your try..except is within the for loop, it would continue to the next iteration regardless; I assumed that was by design.
In the example above, I removed the "Bad Response Code" text, since you made no use of it anyway (you don't access the exception, and don't reraise it), but of course instead of a flag, you could also pass a specific problem code, or even a message.
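A sketch of that variant (hypothetical, not the original answer's code): it would replace the for loop inside retries, reuses the question's names (scraper, json, headers, tries, sleep, randint), and catches only the network errors rather than bare Exception, carrying a message instead of a boolean:

    for i in range(tries):
        problem = None
        try:
            response = scraper.post("https://webpageurl/etc", json=json, headers=headers)
            if response.ok:
                print('OK!')
                return response
            problem = "bad response code: {}".format(response.status_code)
        except (ConnectionError, SSLError) as err:
            problem = "connection problem: {}".format(err)
        if problem:
            print(problem)
            if i < tries - 1:
                sleep(randint(1, 2))
            else:
                return 'Con error'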
The advantage of the if after the except is that no exception has to be raised to achieve the same result; raising and catching an exception is a much more expensive operation.
However, I'm not saying the above is always preferred: the advantage of an exception is that you can re-raise it as needed and catch it outside the function instead of inside, if the situation requires it.
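If you want to get a feel for the cost difference yourself, a rough timeit comparison (purely illustrative, not part of the original answer) could look like this:

import timeit

def with_flag():
    problem = True          # simulate detecting a bad response via a flag
    return problem

def with_exception():
    try:
        raise Exception()   # simulate detecting it by raising
    except Exception:
        return True

print(timeit.timeit(with_flag, number=100000))
print(timeit.timeit(with_exception, number=100000))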
I have a microservice with a job that needs to happen only if a different server is up.
For a few weeks it worked great: if the server was down, the microservice slept a bit without doing the job (as it should), and if the server was up, the job was done.
The server is never down for more than a few minutes (for sure! the server is highly monitored), so the job is skipped 2-3 times at most.
Today I entered my Docker container and noticed in the logs that the job hasn't even been attempted for a few weeks now (bad choice not to monitor it, I know), which I assume indicates that some kind of deadlock happened.
I also assume that the problem is with my exception handling; I could use some advice, as I work alone.
def is_server_healthy():
    url = "url"  # correct url for health check path
    try:
        res = requests.get(url)
    except Exception as ex:
        LOGGER.error(f"Can't health check!{ex}")
    finally:
        pass
    return res

def init():
    while True:
        LOGGER.info(f"Sleeping for {SLEEP_TIME} Minutes")
        time.sleep(SLEEP_TIME*ONE_MINUTE)
        res = is_server_healthy()
        if res.status_code == 200:
            my_api.DoJob()
            LOGGER.info(f"Server is: {res.text}")
        else:
            LOGGER.info(f"Server is down... {res.status_code}")
(The names of the variables were changed to simplify the question)
The health check is simple enough: it returns "up" if the server is up. Anything else is considered down, so unless status 200 and "up" come back, I consider the server to be down.
If your server is down, you get an uncaught error:
NameError: name 'res' is not defined
Why? See:
def is_server_healthy():
    url = "don't care"
    try:
        raise Exception()  # simulate fail
    except Exception as ex:
        print(f"Can't health check!{ex}")
    finally:
        pass
    return res  ## name is not known ;o)

res = is_server_healthy()
if res.status_code == 200:  # here, next exception bound to happen
    my_api.DoJob()
    LOGGER.info(f"Server is: {res.text}")
else:
    LOGGER.info(f"Server is down... {res.status_code}")
Even if you declared the name, it would try to access an attribute that's not there:
if res.status_code == 200:  # here - object has no attribute 'status_code'
    my_api.DoJob()
    LOGGER.info(f"Server is: {res.text}")
else:
    LOGGER.info(f"Server is down... {res.status_code}")
It would try to access a member that simply isn't there => exception, and the process is gone.
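One way to avoid both failure modes is to return None from is_server_healthy when the request fails and to check for that in the caller. A minimal sketch reusing the question's names (LOGGER, SLEEP_TIME, ONE_MINUTE, my_api), not the original code; the res.text check reflects the "up" body described above:

def is_server_healthy():
    url = "url"  # the real health-check URL
    try:
        return requests.get(url)
    except Exception as ex:
        LOGGER.error(f"Can't health check! {ex}")
        return None

def init():
    while True:
        LOGGER.info(f"Sleeping for {SLEEP_TIME} Minutes")
        time.sleep(SLEEP_TIME * ONE_MINUTE)
        res = is_server_healthy()
        if res is not None and res.status_code == 200 and res.text == "up":
            my_api.DoJob()
            LOGGER.info(f"Server is: {res.text}")
        else:
            LOGGER.info("Server is down or unreachable")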
You are probably better off using a system-specific way to run your script once every minute (a cron job, Task Scheduler) than idling in a while True: with sleep.
I'm trying to implement a method that makes a few attempts to download an image from a URL. To do so, I'm using the requests lib. An example of my code is:
while attempts < nmr_attempts:
    try:
        attempts += 1
        response = requests.get(self.basis_url, params=query_params, timeout=response_timeout)
    except Exception as e:
        pass
Each attempt shouldn't spend more than response_timeout making the request. However, it seems that the timeout parameter is not doing anything, since the requests do not respect the time I set.
How can I limit the maximum blocking time of the requests.get() call?
Thanks in advance
Can you try the following (getting rid of the try-except block) and see if it helps? except Exception is probably suppressing the exception that requests.get throws.
while attempts < nmr_attempts:
    attempts += 1
    response = requests.get(self.basis_url, params=query_params, timeout=response_timeout)
Or with your original code, you can catch requests.exceptions.ReadTimeout exception. Such as:
while attempts < nmr_attempts:
    try:
        attempts += 1
        response = requests.get(self.basis_url, params=query_params, timeout=response_timeout)
    except requests.exceptions.ReadTimeout as e:
        do_something()
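If the hang happens while connecting rather than reading, requests also accepts a (connect, read) timeout tuple, and requests.exceptions.Timeout is the parent of both ConnectTimeout and ReadTimeout. A sketch under the same assumptions as the question's snippet (running inside the same method, with nmr_attempts, query_params and response_timeout defined; the 3.05 connect timeout is an arbitrary example):

attempts = 0
response = None
while attempts < nmr_attempts:
    attempts += 1
    try:
        response = requests.get(self.basis_url, params=query_params,
                                timeout=(3.05, response_timeout))
        break  # success, stop retrying
    except requests.exceptions.Timeout:
        continue  # both connect and read timeouts end up here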
I made a simple script for amusement that takes the latest comment from http://www.reddit.com/r/random/comments.json?limit=1 and speaks it through espeak. I ran into a problem, however. If Reddit fails to give me the JSON data, which it commonly does, the script stops and gives a traceback. This is a problem, as it stops the script. Is there any sort of way to retry getting the JSON if it fails to load? I am using requests, if that means anything.
If you need it, here is the part of the code that gets the json data
url = 'http://www.reddit.com/r/random/comments.json?limit=1'
r = requests.get(url)
quote = r.text
body = json.loads(quote)['data']['children'][0]['data']['body']
subreddit = json.loads(quote)['data']['children'][0]['data']['subreddit']
Regarding the vocabulary: the actual error you're having is an exception that has been thrown at some point in the program because of a detected runtime error, and the traceback is the listing of the call stack that tells you where the exception was thrown.
Basically, what you want is an exception handler:
try:
    url = 'http://www.reddit.com/r/random/comments.json?limit=1'
    r = requests.get(url)
    quote = r.text
    body = json.loads(quote)['data']['children'][0]['data']['body']
    subreddit = json.loads(quote)['data']['children'][0]['data']['subreddit']
except Exception as err:
    print(err)
so that you skip over the part that depends on the thing that failed. Have a look at this doc as well: HandlingExceptions - Python Wiki
As pss suggests, if you want to retry after the url failed to load:
done = False
while not done:
    try:
        url = 'http://www.reddit.com/r/random/comments.json?limit=1'
        r = requests.get(url)
    except Exception as err:
        print(err)
    else:
        done = True

quote = r.text
body = json.loads(quote)['data']['children'][0]['data']['body']
subreddit = json.loads(quote)['data']['children'][0]['data']['subreddit']
N.B.: That solution may not be optimal, since if you're offline or the URL always fails, it'll loop forever. If you retry too fast and too often, Reddit may also ban you.
N.B. 2: I'm using the newest Python 3 syntax for exception handling, which may not work with Python older than 2.7.
N.B. 3: You may also want to choose a class other than Exception for the exception handling, to be able to select what kind of error you want to handle. It mostly depends on your app design, and given what you say, you might want to handle requests.exceptions.ConnectionError, but have a look at requests' docs to choose the right one.
Here's what you may want, but please think this through and adapt it to your use case:
import sys
import requests
import time
import json

def get_reddit_comments():
    retries = 5
    while retries != 0:
        try:
            url = 'http://www.reddit.com/r/random/comments.json?limit=1'
            r = requests.get(url)
            break  # if the request succeeded we get out of the loop
        except requests.exceptions.ConnectionError as err:
            print("Warning: couldn't get the URL: {}".format(err))
            time.sleep(1)  # wait 1 second between two requests
            retries -= 1
    if retries == 0:  # if we've done 5 attempts, we fail loudly
        return None
    return r.text

def use_data(quote):
    if not quote:
        print("could not get URL, despite multiple attempts!")
        return False
    data = json.loads(quote)
    if 'error' in data.keys():
        print("could not get data from reddit: error code #{}".format(data['error']))
        return False
    body = data['data']['children'][0]['data']['body']
    subreddit = data['data']['children'][0]['data']['subreddit']
    # … do stuff with your data here
    return True  # signal success to the caller

if __name__ == "__main__":
    quote = get_reddit_comments()
    if not use_data(quote):
        print("Fatal error: Couldn't handle data receipt from reddit.")
        sys.exit(1)
I hope this snippet helps you design your program correctly. And now that you've discovered exceptions, please remember that exceptions are for handling things that should stay exceptional. Whenever you throw an exception somewhere in a program, ask yourself whether it covers something genuinely unexpected (like a webpage not loading), or an expected error (like a page loading but giving you output you didn't expect).
My goal:
To go through a list of websites to check them using Requests. This is being done in apply_job.
My problem:
When job_pool.next is called, a few websites are in error, and instead of raising an error they just hang and don't even give a TimeoutError. That's why I am using a 10 s timeout in the next call. This timeout works well, but when the TimeoutError exception arises, subsequent calls to next keep raising the exception even though the next websites are good. It seems to me that it doesn't move on to the next item and just loops over the same one.
I tried with imap and imap_unordered, no difference in that.
My code here:
def run_check(websites):
    """ Run check on the given websites """
    import multiprocessing
    from multiprocessing.pool import ThreadPool

    pool = ThreadPool(processes=JOB_POOL_SIZE)
    try:
        job_pool = pool.imap_unordered(apply_job, websites)
        try:
            while True:
                try:
                    res = job_pool.next(10)
                except multiprocessing.TimeoutError:
                    logging.error("Timeout Error")
                    res = 'No Res'
                csv_callback(res)
        except StopIteration:
            pass
        pool.terminate()
    except Exception, e:
        logging.error("Run_check Error: %s"%e)
        raise
I use res = requests.get(url, timeout=10) to check the websites. This timeout doesn't work for this issue.
To test, here are the websites that makes the problem (not every time but very often): http://www.kddecorators.netfirms.com, http://www.railcar.netfirms.com.
I can't figure out what is different with these websites, but my guess is that they keep sending a byte once in a while, so it isn't considered a real timeout even though they are unusable.
If anyone has an idea, it would be greatly appreciated, I have been stuck on that one for a few days now. I even tried future and async but they don't raise the exception which I need.
Thanks guys!
Your intuition that passing a timeout to next would abort the job is wrong. It only aborts the waiting; the particular job keeps running, and the next time you wait, you wait for the same job. To achieve a timeout on the actual jobs, you should look at the requests documentation. Note that there is no reliable way to terminate another thread, so if you absolutely cannot make your jobs terminate within a reasonable time frame, you can switch to a process-based pool and forcefully terminate the processes (e.g. using signal.alarm).
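A rough sketch of that signal.alarm idea (Unix only; check_website, hard_limit and JobTimeout are placeholder names, not from the question's code). Signal handlers can only be installed in the main thread of a process, which is one more reason this belongs in worker processes rather than threads:

import signal
import requests

class JobTimeout(Exception):
    pass

def _alarm_handler(signum, frame):
    raise JobTimeout()

def check_website(url, hard_limit=15):
    """Run inside a worker process; give up after hard_limit seconds no matter what."""
    signal.signal(signal.SIGALRM, _alarm_handler)
    signal.alarm(hard_limit)           # deliver SIGALRM after hard_limit seconds
    try:
        return requests.get(url, timeout=10)
    except (JobTimeout, requests.exceptions.RequestException):
        return None
    finally:
        signal.alarm(0)                # cancel any pending alarm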
I found a solution for my issue: I used eventlet and its Timeout function.
import eventlet
from eventlet import tpool
from eventlet.timeout import Timeout

def apply_job(account_info):
    """ Job for the Thread """
    try:
        account_id = account_info['id']
        account_website = account_info['website']
        url = account_website
        result = "ERROR: GreenPool Timeout"
        with Timeout(TIMEOUT*2, False):
            url, result = tpool.execute(website.try_url, account_website)
        return (account_id, account_website, url, result)
    except Exception, e:
        logging.error("Apply_job Error: %s"%e)

def start_db(res):
    update_db(res)
    csv_file.csv_callback(res)

def spawn_callback(result):
    res = result.wait()
    tpool.execute(start_db, res)

def run_check(websites):
    """ Run check on the given websites """
    print str(len(websites)) + " items found\n"
    pool = eventlet.GreenPool(100)
    for i, account_website in enumerate(websites):
        res = pool.spawn(apply_job, account_website)
        res.link(spawn_callback)
    pool.waitall()
This solution works well because it times out the whole execution of website.try_url in the call url, result = tpool.execute(website.try_url, account_website).