I am writing a small multi-threaded HTTP file downloader and would like to be able to shrink the pool of available threads as the code encounters errors.
The errors would be specific HTTP errors returned when the web server is not allowing any more connections.
For example, if I set up a pool of 5 threads, each thread attempts to open its own connection and download a chunk of the file. The server may only allow 2 connections and will, I believe, return 503 errors. I want to detect this and shut down a thread, eventually limiting the size of the pool to presumably only the 2 connections the server will allow.
Can I make a thread stop itself?
Is self._Thread__stop() sufficient?
Do I also need to join()?
Here's my worker class that does the downloading. It grabs work from the queue to process, and once a chunk is downloaded it dumps the result into resultQ to be saved to file by the main thread.
It's in here that I would like to detect an HTTP 503 and stop/kill/remove a thread from the available pool - and of course re-add the failed chunk back to the queue so the remaining threads will process it.
import socket
import threading
import urllib2

# `headers` is assumed to be defined elsewhere in the script.

class Downloader(threading.Thread):
    def __init__(self, queue, resultQ, file_name):
        threading.Thread.__init__(self)
        self.workQ = queue
        self.resultQ = resultQ
        self.file_name = file_name

    def run(self):
        while True:
            block_num, url, start, length = self.workQ.get()
            print 'Starting Queue #: %s' % block_num
            print start
            print length
            # Download the file
            self.download_file(url, start, length)
            # Tell queue that this task is done
            print 'Queue #: %s finished' % block_num
            self.workQ.task_done()
    def download_file(self, url, start, length):
        request = urllib2.Request(url, None, headers)
        if length == 0:
            return None
        request.add_header('Range', 'bytes=%d-%d' % (start, start + length))

        while 1:
            try:
                data = urllib2.urlopen(request)
            except urllib2.URLError, u:
                print "Connection did not start with", u
            else:
                break

        chunk = ''
        block_size = 1024
        remaining_blocks = length
        while remaining_blocks > 0:
            if remaining_blocks >= block_size:
                fetch_size = block_size
            else:
                fetch_size = int(remaining_blocks)
            try:
                data_block = data.read(fetch_size)
                if len(data_block) == 0:
                    print "Connection: [TESTING]: 0 sized block" + \
                        " fetched."
                if len(data_block) != fetch_size:
                    print "Connection: len(data_block) != length" + \
                        ", but continuing anyway."
                    self.run()
                    return
            except socket.timeout, s:
                print "Connection timed out with", s
                self.run()
                return
            remaining_blocks -= fetch_size
            chunk += data_block
        # hand the finished chunk back to the main thread
        self.resultQ.put([start, chunk])
Below is where I init the thread pool; further down I put items onto the queue.
# create a thread pool and give them a queue
for i in range(num_threads):
    t = Downloader(workQ, resultQ, file_name)
    t.setDaemon(True)
    t.start()
Can I make a thread stop itself?
Don't use self._Thread__stop(). It is enough to exit the thread's run() method (you can check a flag or read a sentinel value from a queue to know when to exit).
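For illustration, here is a minimal sketch of the sentinel approach (the Worker class and workQ name are hypothetical, not from the question's code): putting None on the queue tells a thread to return from run(), which ends it cleanly.

# Minimal sketch (assumed names): exit run() when a None sentinel arrives.
import threading
import Queue

workQ = Queue.Queue()

class Worker(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.workQ = queue

    def run(self):
        while True:
            item = self.workQ.get()
            if item is None:           # sentinel: stop this thread
                self.workQ.task_done()
                return                 # returning from run() ends the thread
            # ... process item ...
            self.workQ.task_done()

workers = [Worker(workQ) for _ in range(5)]
for w in workers:
    w.start()
# later, to retire one thread, enqueue one sentinel:
workQ.put(None)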
It's in here where I would like to detect a http 503 and stop/kill/remove a thread from the available pools - and of course re-add the failed chunk back to the queue so the remaining threads will process it
You can simplify the code by separating responsibilities:
download_file() should not try to reconnect in an infinite loop. If there is an error, let the code that calls download_file() resubmit it if necessary.
Control over the number of concurrent connections can be encapsulated in a Semaphore object. The number of threads may then differ from the number of concurrent connections.
import concurrent.futures  # on Python 2.x: pip install futures
from threading import BoundedSemaphore

def download_file(args):
    nconcurrent.acquire(timeout=args['timeout'])  # block if there are too many connections
    # ...
    nconcurrent.release()  # NOTE: don't release it on exception,
                           # allow the caller to handle it

# you could put it into a dictionary: server -> semaphore, instead of a global
nconcurrent = BoundedSemaphore(5)  # start with at most 5 concurrent connections

with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    future_to_args = dict((executor.submit(download_file, args), args)
                          for args in generate_initial_download_tasks())
    while future_to_args:
        for future in concurrent.futures.as_completed(dict(**future_to_args)):
            args = future_to_args.pop(future)
            try:
                result = future.result()
            except Exception as e:
                print('%r generated an exception: %s' % (args, e))
                if getattr(e, 'code', None) != 503:
                    # don't decrease the number of concurrent connections
                    nconcurrent.release()
                # resubmit
                args['timeout'] *= 2
                future_to_args[executor.submit(download_file, args)] = args
            else:  # successfully downloaded `args`
                print('%r returned %r' % (args, result))
See ThreadPoolExecutor() example.
You should be using a thread pool to control the life of your threads:
http://www.inductiveload.com/posts/easy-thread-pools-in-python-with-threadpool/
Then, when a thread exits, you can send a message to the main thread (the one handling the thread pool), change the size of the thread pool, and postpone new or failed requests in a stack that you'll empty afterwards.
tedelanay is absolutely right about the daemon status you're giving to your threads. There is no need to set them as daemons.
Basically, you can simplify your code. You could do something as follows:
import threadpool

def process_tasks():
    pool = threadpool.ThreadPool(4)
    requests = threadpool.makeRequests(download_file, arguments)
    for req in requests:
        pool.putRequest(req)
    # wait for them to finish (or you could go and do something else)
    pool.wait()

if __name__ == '__main__':
    process_tasks()
where arguments is up to your strategy. Either you give your threads a queue as an argument and then empty the queue, or you process the queue in process_tasks, block while the pool is full, and open a new thread when one finishes but the queue is not yet empty. It all depends on your needs and the context of your downloader. A rough sketch of the queue-based approach follows.
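As a sketch of that strategy with the threadpool module - the chunk list, the callbacks, and the exact makeRequests argument format are assumptions based on my reading of the module's docs, so verify them against the links below:

import threadpool

# Hypothetical inputs: one (url, start, length) tuple per chunk of the file.
chunks = [(url, i * chunk_size, chunk_size) for i in range(num_chunks)]

def download_file(url, start, length):
    # ... perform the ranged download; raise on an HTTP 503 ...
    return (start, data)

def handle_result(request, result):
    resultQ.put(result)   # hand the finished chunk back to the main thread

def handle_error(request, exc_info):
    # On a 503 you could shrink the pool (dismissWorkers) and re-queue request.args here.
    pass

pool = threadpool.ThreadPool(5)
arguments = [(c, {}) for c in chunks]   # (args, kwargs) pairs for makeRequests
for req in threadpool.makeRequests(download_file, arguments,
                                   handle_result, handle_error):
    pool.putRequest(req)
pool.wait()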
resources:
http://chrisarndt.de/projects/threadpool/
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/203871
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/196618
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/302746
http://lethain.com/using-threadpools-in-python/
A Thread object terminates the thread simply by returning from the run method - it doesn't call stop. If you set your thread to daemon mode there is no need to join, but otherwise the main thread needs to do it. It is common for the thread to use the resultq to report that it is exiting, and for the main thread to use that info to do the join. This helps with orderly termination of your process. You can get strange errors during system exit if Python is still juggling multiple threads, and it's best to side-step that.
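A minimal sketch of that hand-off, with hypothetical Worker and resultq names: the worker puts a final marker on the queue before returning, and the main thread uses it to know which thread to join.

# Sketch: a worker announces its own exit on resultq so the main thread can join() it.
import threading
import Queue

resultq = Queue.Queue()

class Worker(threading.Thread):
    def run(self):
        # ... do the real work, putting results on resultq ...
        resultq.put(('exiting', self.name))   # report termination before run() returns

workers = {}
for _ in range(5):
    w = Worker()
    workers[w.name] = w
    w.start()

while workers:
    msg = resultq.get()
    if msg[0] == 'exiting':
        t = workers.pop(msg[1])
        t.join()                              # orderly termination, no daemon needed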
Related
I am using these Python 3 modules:
requests for HTTP GET calls to a few Particle Photons which are set up as simple HTTP servers
As the client I am using a Raspberry Pi (which is also an Access Point) as an HTTP client, which uses multiprocessing.dummy.Pool for making HTTP GET requests to the above mentioned Photons
The polling routine is as follows:
def pollURL(url_of_photon):
"""
pollURL: Obtain the IP Address and create a URL for HTTP GET Request
#param: url_of_photon: IP address of the Photon connected to A.P.
"""
create_request = 'http://' + url_of_photon + ':80'
while True:
try:
time.sleep(0.1) # poll every 100ms
response = requests.get(create_request)
if response.status_code == 200:
# if success then dump the data into a temp dump file
with open('temp_data_dump', 'a+') as jFile:
json.dump(response.json(), jFile)
else:
# Currently just break
break
except KeyboardInterrupt as e:
print('KeyboardInterrupt detected ', e)
break
The url_of_photon values are simple IPv4 Addresses obtained from the dnsmasq.leases file available on the Pi.
the main() function:
from multiprocessing.dummy import Pool as ThreadPool

def main():
    # obtain the IP and MAC addresses from the lease file
    IP_addresses = []
    MAC_addresses = []
    with open('/var/lib/misc/dnsmasq.leases', 'r') as leases_file:
        # split lines and words to obtain the useful stuff
        for lines in leases_file:
            fields = lines.strip().split()
            # use logging in future
            print('Photon with MAC: %s has IP address: %s' % (fields[1], fields[2]))
            IP_addresses.append(fields[2])
            MAC_addresses.append(fields[1])

    # Create Thread Pool
    pool = ThreadPool(len(IP_addresses))
    results = pool.map(pollURL, IP_addresses)
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
Problem
The program runs well, however when I press CTRL + C the program does not terminate. Upon digging I found that the way to stop it is to use CTRL + \
How do I handle this in my pollURL function so the program exits safely, i.e. performs pool.join() and leaves no leftover processes?
Notes:
The KeyboardInterrupt is never recognized within the function. Hence I am facing trouble trying to detect CTRL + \.
The pollURL function is executed in another thread. In Python, signals are handled only in the main thread. Therefore, SIGINT will raise the KeyboardInterrupt only in the main thread.
From the signal documentation:
Signals and threads
Python signal handlers are always executed in the main Python thread, even if the signal was received in another thread. This means that signals can’t be used as a means of inter-thread communication. You can use the synchronization primitives from the threading module instead.
Besides, only the main thread is allowed to set a new signal handler.
You can implement your solution in the following way (pseudocode).
event = threading.Event()

def looping_function( ... ):
    while event.is_set():
        do_your_stuff()

def main():
    try:
        event.set()
        pool = ThreadPool()
        pool.map( ... )
    except KeyboardInterrupt:
        event.clear()
    finally:
        pool.close()
        pool.join()
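Applied to the question's pollURL, a sketch might look like the following; the event name, the requests timeout, and the map_async/wait loop (a common trick to keep the main thread responsive to Ctrl+C) are my additions, not part of the original code:

import threading
import time

import requests
from multiprocessing.dummy import Pool as ThreadPool

keep_polling = threading.Event()

def pollURL(url_of_photon):
    create_request = 'http://' + url_of_photon + ':80'
    while keep_polling.is_set():              # loop only while the main thread allows it
        time.sleep(0.1)
        try:
            response = requests.get(create_request, timeout=2)
        except requests.RequestException:
            continue
        # ... handle the 200 response and dump the JSON as before ...

def main(IP_addresses):
    keep_polling.set()
    pool = ThreadPool(len(IP_addresses))
    try:
        result = pool.map_async(pollURL, IP_addresses)
        while not result.ready():
            result.wait(1)                    # keep the main thread responsive to Ctrl+C
    except KeyboardInterrupt:
        keep_polling.clear()                  # tell every worker to finish its loop
    finally:
        pool.close()
        pool.join()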
I'm working in an environment where web applications fork processes on demand and each process has its own thread pool to service web requests. The threads may need to issue HTTPS requests to outside services, and the requests library is currently used to do so. When requests usage was first added, it was used naively by creating a new requests.Session and requests.adapters.HTTPAdapter for each request, or even by simply calling requests.get or requests.post on demand. The problem that arises is that a new connection is established each time instead of potentially taking advantage of HTTP persistent connections. A potential fix would be to use a connection pool, but what is the recommended way of sharing a HTTP connection pool between threads when using the requests library? Is there one?
The first thought would be to share a single requests.Session, but that is currently not safe, as described in "Is the Session object from Python's Requests library thread safe?" and "Document threading contract for Session class". Is it safe and sufficient to have a single global requests.adapters.HTTPAdapter that is shared between requests.Session objects created on demand in each thread? According to "Our use of urllib3's ConnectionPools is not threadsafe.", even that may not be a valid use. That said, only needing to connect to a small number of distinct remote endpoints may make it a viable approach regardless.
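For concreteness, here is roughly the arrangement I mean - one global HTTPAdapter mounted into a per-thread Session via threading.local. The pool sizes and helper names are just for illustration, and the issues cited above suggest even this sharing may not be fully thread-safe:

import threading

import requests
from requests.adapters import HTTPAdapter

# One shared adapter, so its underlying urllib3 connection pool can be reused.
shared_adapter = HTTPAdapter(pool_connections=10, pool_maxsize=10)
_local = threading.local()

def get_session():
    # Build a Session lazily per thread, but mount the shared adapter into each one.
    if not hasattr(_local, 'session'):
        session = requests.Session()
        session.mount('https://', shared_adapter)
        session.mount('http://', shared_adapter)
        _local.session = session
    return _local.session

def fetch(url):
    return get_session().get(url, timeout=10)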
I doubt there is an existing way to do this in requests, but you can modify my code to wrap a requests Session() instead of the standard urllib2.
This is my code that I use when I want to get data from multiple sites at the same time:
# Following code I keep in a file named do.py
# It can be used to perform any number of otherwise blocking IO operations simultaneously.
# Results are available to you when all IO operations are completed.
# Completed means either IO finished successfully or an exception got raised.
# So when all tasks are completed, you pick up results.
# Usage example:
# >>> import do
# >>> results = do.simultaneously([
# ...     (func1, func1_args, func1_kwargs),
# ...     (func2, func2_args, func2_kwargs), ...])
# >>> for x in results:
# ...     print x
# ...
from thread import start_new_thread as thread
from thread import allocate_lock
from collections import deque
from time import sleep

class Task:
    """A task's thread holder. Keeps results or exceptions raised.
    This could be a bit more robustly implemented using
    the threading module.
    """
    def __init__ (self, func, args, kwargs, pool):
        self.func = func
        self.args = args
        self.kwargs = kwargs
        self.result = None
        self.done = 0
        self.started = 0
        self.xraised = 0
        self.tasks = pool
        pool.append(self)
        self.allow = allocate_lock()
        self.run()

    def run (self):
        thread(self._run, ())

    def _run (self):
        self.allow.acquire()  # Prevent same task from being started multiple times
        self.started = 1
        self.result = None
        self.done = 0
        self.xraised = 0
        try:
            self.result = self.func(*self.args, **self.kwargs)
        except Exception, e:
            e.task = self  # Keep a reference to the task in an exception
                           # This way we can access the original task from a caught exception
            self.result = e
            self.xraised = 1
        self.done = 1
        self.allow.release()

    def wait (self):
        while not self.done:
            try: sleep(0.001)
            except: break

    def withdraw (self):
        if not self.started: self.run()
        if not self.done: self.wait()
        self.tasks.remove(self)
        return self.result

    def remove (self):
        self.tasks.remove(self)

def simultaneously (tasks, xraise=0):
    """Starts all functions within iterable <tasks>.
    Then waits for all to be finished.
    Iterable <tasks> may contain subiterables with:
        (function, [[args,] kwargs])
    or just functions. These would be called without arguments.
    Returns an iterator that yields the result of each called function.
    If an exception is raised within a task the Exception()'s instance will be returned unless
    <xraise> is 1 or True. Then the first encountered exception within results will be raised.
    Results will start to yield after all funcs() either return or raise an exception.
    """
    pool = deque()
    for x in tasks:
        func = lambda: None
        args = ()
        kwargs = {}
        if not isinstance(x, (tuple, list)):
            Task(x, args, kwargs, pool)
            continue
        l = len(x)
        if l: func = x[0]
        if l > 1:
            args = x[1]
            if not isinstance(args, (tuple, list)): args = (args,)
        if l > 2:
            if isinstance(x[2], dict):
                kwargs = x[2]
        Task(func, args, kwargs, pool)
    for t in pool: t.wait()
    while pool:
        t = pool.popleft()
        if xraise and t.xraised:
            raise t.result
        yield t.result
# So, I do this using urllib2; you can do it using requests if you want.
from urllib2 import URLError, HTTPError, urlopen
import do

class AccessError(Exception):
    """Raised if a server rejects us because we bombarded the same server with multiple connections in too small time slots."""
    pass

def retrieve (url):
    try:
        u = urlopen(url)
        r = u.read()
        u.close()
        return r
    except HTTPError, e:
        msg = "HTTPError %i - %s" % (e.code, e.msg)
        t = AccessError()
        if e.code in (401, 403, 429):
            msg += " (perhaps you're making too many calls)"
            t.reason = "perhaps you are making too many calls"
        elif e.code in (502, 504):
            msg += " (service temporarily not available)"
            t.reason = "service temporarily not available"
        else: t.reason = e.msg
        t.args = (msg,)
        t.message = msg
        t.msg = e.msg; t.code = e.code
        t.orig = e
        raise t
    except URLError, e:
        msg = "URLError %s - %s (%s)" % (str(e.errno), str(e.message), str(e.reason))
        t = AccessError(msg)
        t.reason = str(e.reason)
        t.msg = str(t.message)
        t.code = e.errno
        t.orig = e
        raise t
    except: raise

urls = ["http://www.google.com", "http://www.amazon.com", "http://stackoverflow.com", "http://blah.blah.sniff-sniff"]
retrieval = []
for u in urls:
    retrieval.append((retrieve, u))

x = 0
for data in do.simultaneously(retrieval):
    url = urls[x]
    if isinstance(data, Exception):
        print url, "not retrieved successfully!\nThe error is:"
        print data
    else:
        print url, "returned", len(data), "characters!!\nFirst 100:"
        print data[:100]
    x += 1

# If you need persistent HTTP, you tweak the retrieve() function to be able to hold the connection open.
# After retrieving the currently requested data, you save opened connections in a global dict() with domains as keys.
# When the next retrieve() is called and the domain already has an opened connection, you remove the connection
# from the dict (to prevent any other retrieve grabbing it in the middle of nowhere), then you use it to send a
# new request if possible (if it isn't timed out or something); if the connection broke, you just open a new one.
# You will probably have to introduce some limits if you will be using multiple connections to the same server
# at once, like no more than 4 connections to the same server at once, some delays between requests, and so on.
# No matter which approach to multithreading you choose (something like I propose or some other mechanism),
# thread safety is in trouble because HTTP is a serialized protocol.
# You send a request, you await the answer. After you receive the whole answer, you can make a new request if
# HTTP/1.1 is used and the connection is being kept alive.
# If your thread tries to send a new request during the data download, a general mess will occur.
# So you design your system to open as many connections as possible, but always wait for one to be free before
# reusing it. That's the trick here.
# As for any other part of requests being thread-unsafe for some reason, well, you should check the code to see
# which calls exactly should be kept atomic and then use a lock. But don't put it in a block where major IO is
# occurring or it will be as if you aren't using threads at all.
I am trying to implement a Python (2.6.x/2.7.x) thread pool that would check for network connectivity (ping or whatever); the entire pool of threads must be killed/terminated when the check is successful.
So I am thinking of creating a pool of, let's say, 10 worker threads. If any one of them is successful in pinging, the main thread should terminate all the rest.
How do I implement this?
This is not compilable code; it is just to give you an idea of how to make threads communicate.
Inter-process or inter-thread communication happens through queues, pipes, and some other ways; here I'm using queues for communication.
It works like this: I'll send IP addresses in in_queue and add responses to out_queue. My main thread monitors out_queue and, if it gets the desired result, it marks all the threads to terminate.
Below is the pinger thread definition:
import threading
from Queue import Queue, Empty

# A thread that pings an ip.
class Pinger(threading.Thread):
    def __init__(self, kwargs=None, name=None):
        threading.Thread.__init__(self, name=name)
        self.kwargs = kwargs
        self.stop_pinging = False

    def run(self):
        ip_queue = self.kwargs.get('in_queue')
        out_queue = self.kwargs.get('out_queue')
        while not self.stop_pinging:
            try:
                data = ip_queue.get(timeout=1)
                ping_status = ping(data)
                # This is pseudo code, you've to take care of
                # your own ping.
                if ping_status:
                    out_queue.put('success')
                    # you can even break here if you don't want to
                    # continue after one success
                else:
                    out_queue.put('failure')
                if ip_queue.empty():
                    break
            except Empty, e:
                pass
Here is the main thread block..
# Create the shared queues and launch the thread pool
in_queue = Queue()
out_queue = Queue()

ip_list = ['ip1', 'ip2', '....']

# This is to add all the ips to the queue, or you can
# customize it to add them through some producer.
for ip in ip_list:
    in_queue.put(ip)

pinger_pool = []
for i in xrange(1, 10):
    pinger_worker = Pinger(kwargs={'in_queue': in_queue, 'out_queue': out_queue}, name=str(i))
    pinger_pool.append(pinger_worker)
    pinger_worker.start()

while 1:
    if out_queue.get() == 'success':
        for pinger in pinger_pool:
            pinger.stop_pinging = True
        break
Note: This is pseudo code; you should adapt it and make it workable as you like.
I have a piece of multi-threaded code - 3 threads that poll data from SQS and add it to a python queue, and 5 threads that take the messages from the python queue, process them, and send them to a back end system.
Here is the code:
import Queue
import threading
import time

from urllib3 import PoolManager

# `sqs_queue` (the boto SQS queue handle), `backend_endpoint` and `processedMsg` are set up elsewhere.

python_queue = Queue.Queue()

class GetDataFromSQS(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, python_queue):
        threading.Thread.__init__(self)
        self.python_queue = python_queue

    def run(self):
        while True:
            time.sleep(0.5)  # sleep for a few secs before querying again
            try:
                msgs = sqs_queue.get_messages(10)
                if msgs == None:
                    print "sqs is empty now!"
                for msg in msgs:
                    # place each message block from sqs into the python queue for processing
                    self.python_queue.put(msg)
                    print "Adding a new message to Queue. Queue size is now %d" % self.python_queue.qsize()
                    # delete from sqs
                    sqs_queue.delete_message(msg)
            except Exception as e:
                print "Exception in GetDataFromSQS :: %s" % e

class ProcessSQSMsgs(threading.Thread):
    def __init__(self, python_queue):
        threading.Thread.__init__(self)
        self.python_queue = python_queue
        self.pool_manager = PoolManager(num_pools=6)

    def run(self):
        while True:
            # grabs the message to be parsed from the python queue
            python_queue_msg = self.python_queue.get()
            try:
                processMsgAndSendToBackend(python_queue_msg, self.pool_manager)
            except Exception as e:
                print "Error parsing:: %s" % e
            finally:
                self.python_queue.task_done()

def processMsgAndSendToBackend(msg, pool_manager):
    if msg != "":
        ###### All the code related to processing the msg
        for individualValue in processedMsg:
            try:
                response = pool_manager.urlopen('POST', backend_endpoint, body=individualValue)
                if response == None:
                    print "Error"
                else:
                    response.release_conn()
            except Exception as e:
                print "Exception! Post data to backend: %s" % e

def startMyPython():
    # spawn a pool of threads, and pass them the queue instance
    for i in range(3):
        sqsThread = GetDataFromSQS(python_queue)
        sqsThread.start()

    for j in range(5):
        parseThread = ProcessSQSMsgs(python_queue)
        # parseThread.setDaemon(True)
        parseThread.start()

    # wait on the queue until everything has been processed
    python_queue.join()
    # python_queue.close() -- should i do this?

startMyPython()
The problem:
3 python workers die randomly (monitored using top -p -H) once every few days, and everything is alright if I kill the process and start the script again. I suspect the workers that vanish are the 3 GetDataFromSQS threads. And because GetDataFromSQS dies, the other 5 workers, although running, always sleep as there is no data in the python queue. I am not sure what I am doing wrong here as I am pretty new to python and followed this tutorial for creating this queuing logic and threads - http://www.ibm.com/developerworks/aix/library/au-threadingpython/
Thanks in advance for your help. Hope I have explained my problem clearly.
The problem of the threads hanging was related to getting a handle on the SQS queue. I used IAM for managing credentials and the boto SDK for connecting to SQS.
The root cause of this issue was that the boto package was reading the auth metadata from AWS and it was failing once in a while.
The fix is to edit the boto config, increasing the number of attempts that are made to perform the auth call to AWS.
[Boto]
metadata_service_num_attempts = 5
( https://groups.google.com/forum/#!topic/boto-users/1yX24WG3g1E )
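If editing the config file is awkward in your deployment, I believe the same option can be set programmatically through boto 2's config object before connecting to SQS - treat the exact calls below as an assumption and verify them against the boto docs:

# Assumed boto 2 config API: raise the metadata retry count before connecting.
import boto
import boto.sqs

if not boto.config.has_section('Boto'):
    boto.config.add_section('Boto')
boto.config.set('Boto', 'metadata_service_num_attempts', '5')

conn = boto.sqs.connect_to_region('us-east-1')   # region name is a placeholder
sqs_queue = conn.get_queue('my-queue-name')      # queue name is a placeholder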
I was reading an article on Python multi-threading using Queues and have a basic question.
Based on the print statements, 5 threads are started as expected. So, how does the queue work?
1. The thread is started initially, and when the queue is populated with an item does it get restarted and start processing that item?
2. If we use the queue system and the threads process the queue item by item, how is there an improvement in performance? Isn't it similar to serial processing, i.e. 1 by 1?
import Queue
import threading
import urllib2
import datetime
import time

hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com",
         "http://ibm.com", "http://apple.com"]

queue = Queue.Queue()

class ThreadUrl(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        print 'threads are created'
        self.queue = queue

    def run(self):
        while True:
            # grabs host from queue
            print 'thread starting to run'
            now = datetime.datetime.now()
            host = self.queue.get()
            # grabs urls of hosts and prints first 1024 bytes of page
            url = urllib2.urlopen(host)
            print 'host=%s ,threadname=%s' % (host, self.getName())
            print url.read(20)
            # signals to queue job is done
            self.queue.task_done()

start = time.time()

if __name__ == '__main__':
    # spawn a pool of threads, and pass them queue instance
    print 'program start'
    for i in range(5):
        t = ThreadUrl(queue)
        t.setDaemon(True)
        t.start()

    # populate queue with data
    for host in hosts:
        queue.put(host)

    # wait on the queue until everything has been processed
    queue.join()

    print "Elapsed Time: %s" % (time.time() - start)
A queue is similar to a list container, but with internal locking to make it a thread-safe way to communicate data.
What happens when you start all of your threads is that they all block on the self.queue.get() call, waiting to pull an item from the queue. When an item is put into the queue from your main thread, one of the threads will become unblocked and receive the item. It can then continue to process it until it finishes and returns to a blocking state.
All of your threads can run concurrently because they are all able to receive items from the queue. This is where you would see your improvement in performance. If the urlopen and read take time in one thread and it is waiting on IO, that means another thread can do work. The queue object's job is simply to manage the locking access and pop off items to the callers.
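To make the performance point concrete, here is a rough comparison under the assumption that each fetch is dominated by network IO (the hosts list and the 5-thread count come from the question's code):

# Rough sketch: compare serial fetches to the 5-thread queue version.
# With IO-bound work, urlopen() releases the GIL while waiting on the network,
# so the threaded run should take roughly the time of the slowest host rather
# than the sum of all of them.
import time
import urllib2

hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com",
         "http://ibm.com", "http://apple.com"]

serial_start = time.time()
for host in hosts:
    urllib2.urlopen(host).read(1024)   # one request at a time
print "Serial: %.2fs" % (time.time() - serial_start)

# The threaded version above (5 ThreadUrl workers + queue.join()) does the same
# five fetches, but the waits overlap, so the elapsed time is much smaller.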