I am pulling data from a streaming API using Python 3, and I need to stop pulling that data after 60 seconds. I'd also be open to suggestions on chunk_size or an alternative approach to streaming.
So far this is what I have:
import requests

response = requests.get('link to site', stream=True)
for data in response.iter_content(chunk_size=100):
    print(data)
More speculation than an answer, but you could set a timer and then close the response. This may do it, but I don't have a good way to test it. I don't know which exception to expect when the response is closed, so I catch them all and print them so the code can be adjusted.
import threading
import requests

response = requests.get('link to site', stream=True)
timer = threading.Timer(60, response.close)
try:
    timer.start()
    for data in response.iter_content(chunk_size=100):
        print(data)
except Exception as e:
    print("you want to catch this", e)
finally:
    timer.cancel()
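An alternative that avoids threads, sketched here under the assumption that the server keeps sending chunks: check the elapsed time inside the loop and break once 60 seconds have passed. Note the check only runs when a chunk actually arrives, so a completely silent connection can still block inside iter_content.

import time
import requests

response = requests.get('link to site', stream=True)
deadline = time.monotonic() + 60  # stop reading after 60 seconds

for data in response.iter_content(chunk_size=100):
    print(data)
    if time.monotonic() > deadline:
        break

response.close()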
I'm trying to implement a method that makes a few attempts to download an image from a URL. To do so, I'm using the requests library. An example of my code is:
while attempts < nmr_attempts:
    try:
        attempts += 1
        response = requests.get(self.basis_url, params=query_params, timeout=response_timeout)
    except Exception as e:
        pass
Each attempt shouldn't spend more than "response_timeout" making the request. However, it seems that the timeout parameter is not doing anything, since it does not respect the value I give it.
How can I limit the maximum blocking time of the requests.get() call?
Thanks in advance
Can you try the following (get rid of the try-except block) and see if it helps? except Exception is probably suppressing the exception that requests.get throws.
while attempts < nmr_attempts:
    response = requests.get(self.basis_url, params=query_params, timeout=response_timeout)
Or, with your original code, you can catch the requests.exceptions.ReadTimeout exception. For example:
while attempts < nmr_attempts:
    try:
        attempts += 1
        response = requests.get(self.basis_url, params=query_params, timeout=response_timeout)
    except requests.exceptions.ReadTimeout as e:
        do_something()
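For completeness, here is a sketch of what the full retry loop might look like with a (connect, read) timeout tuple and a short pause between attempts; nmr_attempts, self.basis_url, query_params and response_timeout come from the question, while the 5-second connect timeout and the 1-second pause are assumptions:

import time
import requests

attempts = 0
response = None
while attempts < nmr_attempts and response is None:
    attempts += 1
    try:
        # timeout=(connect, read): fail fast if we can't connect,
        # and give up if the server stops sending data for too long
        response = requests.get(self.basis_url, params=query_params,
                                timeout=(5, response_timeout))
    except requests.exceptions.RequestException as e:
        print("Attempt {} failed: {}".format(attempts, e))
        time.sleep(1)  # assumed pause before retrying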
I have an array that contains URLs of remote files.
At first I tried to download all the files using this bad approach:
for a in ARRAY:
    wget.download(url=a, out=path_folder)
It fails for various reasons: the host server returns a timeout, some URLs are broken, etc.
How can I handle this process more professionally? But I can not apply that to my case.
If you still want to use wget, you can wrap the download in a try..except block that just prints any exception and moves on to the next file:
for f in files:
    try:
        wget.download(url=f, out=path_folder)
    except Exception as e:
        print("Could not download file {}".format(f))
        print(e)
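If you also want to keep track of which downloads failed (for example to retry or report them later), a small variation of the same idea collects the failing URLs instead of only printing them; files and path_folder are the same names as above:

import wget

failed = []
for f in files:
    try:
        wget.download(url=f, out=path_folder)
    except Exception as e:
        print("Could not download file {}: {}".format(f, e))
        failed.append(f)

if failed:
    print("{} downloads failed: {}".format(len(failed), failed))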
Here is a way to define a timeout; it reads the filename from the URL and retrieves big files as a stream, so your memory won't get overfilled:
import os
import requests
from urllib.parse import urlparse

timeout = 30  # Seconds

for url in urls:
    try:
        # Make the actual request, set the timeout for no data to X seconds
        # and enable streaming responses so we don't have to keep the large files in memory
        request = requests.get(url, timeout=timeout, stream=True)

        # Get the filename from the URL
        name = os.path.basename(urlparse(url).path)

        # Open the output file and make sure we write in binary mode
        with open(name, 'wb') as fh:
            # Walk through the request response in chunks of 1024 * 1024 bytes, so 1MiB
            for chunk in request.iter_content(1024 * 1024):
                # Write the chunk to the file
                fh.write(chunk)
    except Exception as e:
        print("Something went wrong:", e)
You can use urllib:
import urllib.request
urllib.request.urlretrieve('http://www.example.com/files/file.ext', 'folder/file.ext')
You can use a try/except around the urlretrieve call to catch any errors, for example an HTTP error from the server:
import urllib.error

try:
    urllib.request.urlretrieve('http://www.example.com/files/file.ext', 'folder/file.ext')
except urllib.error.HTTPError as e:
    print('The server couldn\'t fulfill the request.')
    print('Error code: ', e.code)
Adding it as another answer:
If you want to solve the timeout issue, you can use the requests library:
import requests

try:
    requests.get('http://url/to/file')
except requests.exceptions.RequestException as e:
    print('Error: ', e)
If you haven't specified a timeout, the request won't time out; it can block indefinitely.
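For reference, a minimal sketch of the same call with an explicit timeout so the request gives up instead of blocking indefinitely; the 10-second value and the placeholder URL are just examples:

import requests

try:
    response = requests.get('http://url/to/file', timeout=10)  # give up after 10 seconds of no data
    response.raise_for_status()  # turn 4xx/5xx status codes into exceptions
except requests.exceptions.RequestException as e:
    print('Request failed:', e)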
I have a simple long poll thing using python3 and the requests package. It currently looks something like:
def longpoll():
    session = requests.Session()
    while True:
        try:
            fetched = session.get(MyURL)
            input = base64.b64decode(fetched.content)
            output = process(input)
            session.put(MyURL, data=base64.b64encode(output))
        except Exception as e:
            print(e)
            time.sleep(10)
There is a case where, instead of processing the input and putting the result, I'd like to raise an HTTP error. Is there a simple way to do this from the high-level Session interface? Or do I have to drill down to use the lower-level objects?
Since you have control over the server, you may want to reverse the second call.
Here is an example using bottle to receive the second poll:
def longpoll():
    session = requests.Session()
    while True:  # I'm guessing that the server does not care that we call it a lot of times...
        try:
            session.post(MyURL, {"ip_address": my_ip_address})  # request work or "I'm alive"
            # input = base64.b64decode(fetched.content)
            # output = process(data)
            # session.put(MyURL, data=base64.b64encode(response))
        except Exception as e:
            print(e)
            time.sleep(10)


@bottle.post("/process")
def process_new_work():
    data = bottle.request.json
    output = process(data)  # if an error is thrown, an HTTP error will be returned by the framework
    return output
This way the server will get the output or a bad HTTP status.
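If you want to report a failure explicitly rather than relying on the framework turning an uncaught exception into a 500, bottle.abort lets you choose the status code. This is only a sketch, and the JSON validation check is a hypothetical example; process is the same function as above:

import bottle

@bottle.post("/process")
def process_new_work():
    data = bottle.request.json
    if data is None:
        bottle.abort(400, "Expected a JSON request body")  # explicit client error
    try:
        return process(data)
    except Exception as e:
        bottle.abort(500, "Processing failed: {}".format(e))  # explicit server error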
I made a simple script for amusement that takes the latest comment from http://www.reddit.com/r/random/comments.json?limit=1 and speaks it through espeak. I ran into a problem, however: if Reddit fails to give me the JSON data, which it commonly does, the script stops and gives a traceback. This is a problem, as it stops the script. Is there any way to retry getting the JSON if it fails to load? I am using requests, if that matters.
If you need it, here is the part of the code that gets the JSON data:
url = 'http://www.reddit.com/r/random/comments.json?limit=1'
r = requests.get(url)
quote = r.text
body = json.loads(quote)['data']['children'][0]['data']['body']
subreddit = json.loads(quote)['data']['children'][0]['data']['subreddit']
Regarding vocabulary: the actual error you're getting is an exception that has been raised at some point in the program because of a detected runtime error, and the traceback is the call-stack listing that tells you where the exception was raised.
Basically, what you want is an exception handler:
try:
    url = 'http://www.reddit.com/r/random/comments.json?limit=1'
    r = requests.get(url)
    quote = r.text
    body = json.loads(quote)['data']['children'][0]['data']['body']
    subreddit = json.loads(quote)['data']['children'][0]['data']['subreddit']
except Exception as err:
    print(err)
so that you skip over the part that depends on the request that failed. Have a look at this doc as well: HandlingExceptions - Python Wiki
As pss suggests, if you want to retry after the url failed to load:
done = False
while not done:
    try:
        url = 'http://www.reddit.com/r/random/comments.json?limit=1'
        r = requests.get(url)
        done = True  # only stop retrying once the request succeeded
    except Exception as err:
        print(err)

quote = r.text
body = json.loads(quote)['data']['children'][0]['data']['body']
subreddit = json.loads(quote)['data']['children'][0]['data']['subreddit']
N.B.: That solution may not be optimal, since if you're offline or the URL always fails, it'll loop forever. If you retry too fast and too often, Reddit may also ban you.
N.B. 2: I'm using the newest Python 3 syntax for exception handling, which may not work with Python older than 2.7.
N.B. 3: You may also want to choose a class other than Exception for the exception handling, to be able to select what kind of error you want to handle. It mostly depends on your app design, and given what you say, you might want to handle requests.exceptions.ConnectionError, but have a look at request's doc to choose the right one.
Here's what you may want, but please think this through and adapt it to your use case:
import requests
import time
import json
import sys


def get_reddit_comments():
    retries = 5
    while retries != 0:
        try:
            url = 'http://www.reddit.com/r/random/comments.json?limit=1'
            r = requests.get(url)
            break  # if the request succeeded we get out of the loop
        except requests.exceptions.ConnectionError as err:
            print("Warning: couldn't get the URL: {}".format(err))
            time.sleep(1)  # wait 1 second between two requests
            retries -= 1
    if retries == 0:  # if we've done 5 attempts, we fail loudly
        return None
    return r.text


def use_data(quote):
    if not quote:
        print("could not get URL, despite multiple attempts!")
        return False
    data = json.loads(quote)
    if 'error' in data.keys():
        print("could not get data from reddit: error code #{}".format(data['error']))
        return False
    body = data['data']['children'][0]['data']['body']
    subreddit = data['data']['children'][0]['data']['subreddit']
    # … do stuff with your data here
    return True


if __name__ == "__main__":
    quote = get_reddit_comments()
    if not use_data(quote):
        print("Fatal error: Couldn't handle data receipt from reddit.")
        sys.exit(1)
I hope this snippet will help you correctly design your program. And now that you've discovered exceptions, please always remember that exceptions are for handling things that shall stay exceptional. If you throw an exception at some point in one of your programs, always ask yourself if this is something that should happen when something unexpected happens (like a webpage not loading), or if it's an expected error (like a page loading but giving you an output that is not expected).
My goal:
To go through a list of websites to check them using Requests. This is being done in apply_job.
My problem:
When job_pool.next is called, a few websites are in error: instead of giving an error, they just hang and don't even raise a TimeoutError. That's why I am using a 10-second timeout in the next call. This timeout works well, but when the TimeoutError exception arises, subsequent calls to next keep raising the exception even though the next websites are good. It seems to me that it doesn't move on to the next item and just loops over the same one.
I tried with both imap and imap_unordered; it made no difference.
My code here:
def run_check(websites):
    """ Run check on the given websites """
    import multiprocessing
    from multiprocessing.pool import ThreadPool

    pool = ThreadPool(processes=JOB_POOL_SIZE)
    try:
        job_pool = pool.imap_unordered(apply_job, websites)
        try:
            while True:
                try:
                    res = job_pool.next(10)
                except multiprocessing.TimeoutError:
                    logging.error("Timeout Error")
                    res = 'No Res'
                csv_callback(res)
        except StopIteration:
            pass
        pool.terminate()
    except Exception as e:
        logging.error("Run_check Error: %s" % e)
        raise
I use res = requests.get(url, timeout=10) to check the websites. This timeout doesn't work for this issue.
To test, here are websites that cause the problem (not every time, but very often): http://www.kddecorators.netfirms.com, http://www.railcar.netfirms.com.
I can't figure out what is different about these websites, but my guess is that they keep sending a byte once in a while, so it isn't considered a real timeout even though they are unusable.
If anyone has an idea, it would be greatly appreciated; I have been stuck on this one for a few days now. I even tried futures and async, but they don't raise the exception I need.
Thanks guys!
Your intuition that passing a timeout to next would abort the job is wrong. It just aborts the waiting, but the particular job keeps running, and the next time you wait, you wait for the same job. To achieve a timeout on the actual jobs, you should look at the requests documentation. Note that there is no reliable way to terminate another thread, so if you absolutely cannot make your jobs terminate within a reasonable time frame, you can switch to a process-based pool and forcefully terminate the processes (e.g. using signal.alarm).
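As the answer suggests, the per-request timeout belongs inside the worker itself. Here is a minimal sketch of what apply_job could look like with one; the function body is an assumption, not the asker's actual code:

import requests

def apply_job(url):
    """ Check a single website with a bounded request time. """
    try:
        # timeout=(connect, read): don't wait forever to connect or for data
        res = requests.get(url, timeout=(5, 10))
        return (url, res.status_code)
    except requests.exceptions.RequestException as e:
        return (url, "ERROR: {}".format(e))

Keep in mind, as the question notes, that the read timeout resets every time a byte arrives, so a server that trickles data can still exceed it; that is why the workaround below wraps the whole call in an overall timeout instead.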
I found a solution for my issue: I used eventlet and its Timeout feature.
def apply_job(account_info):
    """ Job for the Thread """
    try:
        account_id = account_info['id']
        account_website = account_info['website']
        url = account_website
        result = "ERROR: GreenPool Timeout"
        with Timeout(TIMEOUT * 2, False):
            url, result = tpool.execute(website.try_url, account_website)
        return (account_id, account_website, url, result)
    except Exception as e:
        logging.error("Apply_job Error: %s" % e)


def start_db(res):
    update_db(res)
    csv_file.csv_callback(res)


def spawn_callback(result):
    res = result.wait()
    tpool.execute(start_db, res)


def run_check(websites):
    """ Run check on the given websites """
    print(str(len(websites)) + " items found\n")
    pool = eventlet.GreenPool(100)
    for i, account_website in enumerate(websites):
        res = pool.spawn(apply_job, account_website)
        res.link(spawn_callback)
    pool.waitall()
This solution works well because it times out the whole execution of the function website.try_url in the call url, result = tpool.execute(website.try_url, account_website).
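For reference, a minimal, self-contained sketch of how eventlet.Timeout behaves when its second argument is False: the block is silently abandoned once the time is up instead of raising an exception. eventlet.sleep stands in for any blocking (green) call such as website.try_url above.

import eventlet

result = "ERROR: GreenPool Timeout"
with eventlet.Timeout(2, False):   # False: don't raise, just leave the block when time is up
    eventlet.sleep(10)             # stands in for the slow call
    result = "finished"
print(result)                      # prints "ERROR: GreenPool Timeout"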