Simulating Cancellation Tokens in Python Threading

I just wrote a task queue in Python whose job is to limit the number of tasks that are run at one time. This is a little different than Queue.Queue because instead of limiting how many items can be in the queue, it limits how many can be taken out at one time. It still uses an unbounded Queue.Queue to do its job, but it relies on a Semaphore to limit the number of threads:
from Queue import Queue
from threading import BoundedSemaphore, Lock, Thread

class TaskQueue(object):
    """
    Queues tasks to be run in separate threads and limits the number
    of concurrently running tasks.
    """

    def __init__(self, limit):
        """Initializes a new instance of a TaskQueue."""
        self.__semaphore = BoundedSemaphore(limit)
        self.__queue = Queue()
        self.__cancelled = False
        self.__lock = Lock()

    def enqueue(self, callback):
        """Indicates that the given callback should be run."""
        self.__queue.put(callback)

    def start(self):
        """Tells the task queue to start running the queued tasks."""
        thread = Thread(target=self.__process_items)
        thread.start()

    def stop(self):
        self.__cancel()
        # prevent blocking on a semaphore.acquire
        self.__semaphore.release()
        # prevent blocking on a Queue.get
        self.__queue.put(lambda: None)

    def __cancel(self):
        print 'canceling'
        with self.__lock:
            self.__cancelled = True

    def __process_items(self):
        while True:
            # see if the queue has been stopped before blocking on acquire
            if self.__is_canceled():
                break
            self.__semaphore.acquire()
            # see if the queue has been stopped before blocking on get
            if self.__is_canceled():
                break
            callback = self.__queue.get()
            # see if the queue has been stopped before running the task
            if self.__is_canceled():
                break

            def runTask():
                try:
                    callback()
                finally:
                    self.__semaphore.release()

            thread = Thread(target=runTask)
            thread.start()
            self.__queue.task_done()

    def __is_canceled(self):
        with self.__lock:
            return self.__cancelled
The Python interpreter runs forever unless I explicitly stop the task queue. This is a lot more tricky than I thought it would be. If you look at the stop method, you'll see that I set a canceled flag, release the semaphore and put a no-op callback on the queue. The last two parts are necessary because the code could be blocking on the Semaphore or on the Queue. I basically have to force these to go through so that the loop has a chance to break out.
This code works. This class is useful when running a service that is trying to run thousands of tasks in parallel. In order to keep the machine running smoothly and to prevent the OS from screaming about too many active threads, this code will limit the number of threads living at any one time.
I have written a similar chunk of code in C# before. What made that code particularly cut and dried was that .NET has something called a CancellationToken that just about every threading class uses. Any time there is a blocking operation, that operation takes an optional token. If the parent task is ever canceled, any child tasks blocking with that token will be immediately canceled as well. This seems like a much cleaner way to exit than to "fake it" by releasing semaphores or putting values in a queue.
I was wondering if there is an equivalent way of doing this in Python? I definitely want to be using threads instead of something like asynchronous events. I am wondering if there is a way to achieve the same thing using two Queue.Queues, where one has a max size and the other doesn't - but I'm still not sure how to handle cancellation.
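Something with this shape, sketched on top of threading.Event, is roughly what I am after (the name and API are made up for illustration, not anything standard):

import threading

class CancellationToken(object):
    """Illustrative only; Python has no built-in equivalent."""

    def __init__(self):
        self.__event = threading.Event()

    def cancel(self):
        self.__event.set()

    def is_cancelled(self):
        return self.__event.is_set()

    def wait(self, timeout=None):
        # blocks until cancelled or until the timeout elapses
        return self.__event.wait(timeout)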

I think your code can be simplified by using poisoning and Thread.join():
from Queue import Queue
from threading import Thread

poison = object()

class TaskQueue(object):

    def __init__(self, limit):
        def process_items():
            while True:
                callback = self._queue.get()
                if callback is poison:
                    break
                try:
                    callback()
                except:
                    pass
                finally:
                    self._queue.task_done()
        self._workers = [Thread(target=process_items) for _ in range(limit)]
        self._queue = Queue()

    def enqueue(self, callback):
        self._queue.put(callback)

    def start(self):
        for worker in self._workers:
            worker.start()

    def stop(self):
        for worker in self._workers:
            self._queue.put(poison)
        while self._workers:
            self._workers.pop().join()
Untested.
I removed the comments, for brevity.
Also, in this version process_items() is truly private.
BTW: The whole point of the Queue module is to free you from the dreaded locking and event stuff.
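A quick usage sketch of this version (hypothetical, and as untested as the class itself):

def make_task(n):
    def task():
        print 'task %d' % n
    return task

tq = TaskQueue(limit=4)
for i in range(10):
    tq.enqueue(make_task(i))
tq.start()
tq.stop()   # queues one poison per worker, then joins them all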

You seem to be creating a new thread for each task from the queue. This is wasteful in itself, and also leads you to the problem of how to limit the number of threads.
Instead, a common approach is to create a fixed number of worker threads and let them freely pull tasks from the queue. To cancel the queue, you can clear it and let the workers stay alive in anticipation of future work.

I took Janne Karila's advice and created a thread pool. This eliminated the need for a semaphore. The problem is that if you ever expect the queue to go away, you have to stop the worker threads from running (just a variation of what I did before). The new code is fairly similar:
from Queue import Queue
from threading import Event, Lock, Thread

class TaskQueue(object):
    """
    Queues tasks to be run in separate threads and limits the number
    of concurrently running tasks.
    """

    def __init__(self, limit):
        """Initializes a new instance of a TaskQueue."""
        self.__workers = []
        for _ in range(limit):
            worker = Thread(target=self.__process_items)
            self.__workers.append(worker)
        self.__queue = Queue()
        self.__cancelled = False
        self.__lock = Lock()
        self.__event = Event()

    def enqueue(self, callback):
        """Indicates that the given callback should be run."""
        self.__queue.put(callback)

    def start(self):
        """Tells the task queue to start running the queued tasks."""
        for worker in self.__workers:
            worker.start()

    def stop(self):
        """
        Stops the queue from processing any more tasks. Any actively running
        tasks will run to completion.
        """
        self.__cancel()
        # prevent blocking on a Queue.get
        for _ in range(len(self.__workers)):
            self.__queue.put(lambda: None)
            # wait for a worker to signal that it has exited
            self.__event.wait()
            self.__event.clear()

    def __cancel(self):
        with self.__lock:
            self.__queue.queue.clear()
            self.__cancelled = True

    def __process_items(self):
        while True:
            callback = self.__queue.get()
            # see if the queue has been stopped before running the task
            if self.__is_canceled():
                break
            try:
                callback()
            except:
                pass
            finally:
                self.__queue.task_done()
        self.__event.set()

    def __is_canceled(self):
        with self.__lock:
            return self.__cancelled
If you look carefully, I had to do some accounting to kill off the workers. I basically wait on an Event once for each worker. I clear the underlying queue to prevent workers from being cancelled any other way. I also wait after pumping each bogus value into the queue, so only one worker can cancel out at a time.
I've run some tests on this and it appears to be working. It would still be nice to eliminate the need for bogus values.
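One way to eliminate the bogus values, essentially the poison-pill idea from the first answer applied to this class, is a dedicated sentinel object that workers recognize and exit on, after which stop() can simply join the workers instead of waiting on an Event (a sketch, untested):

from Queue import Queue
from threading import Thread

_STOP = object()   # sentinel; never equal to a real callback

class TaskQueue(object):
    def __init__(self, limit):
        self.__queue = Queue()
        self.__workers = [Thread(target=self.__process_items)
                          for _ in range(limit)]

    def enqueue(self, callback):
        self.__queue.put(callback)

    def start(self):
        for worker in self.__workers:
            worker.start()

    def stop(self):
        # one sentinel per worker; tasks already queued still run first
        for _ in self.__workers:
            self.__queue.put(_STOP)
        for worker in self.__workers:
            worker.join()

    def __process_items(self):
        while True:
            callback = self.__queue.get()
            try:
                if callback is _STOP:
                    break
                callback()
            except Exception:
                pass
            finally:
                self.__queue.task_done()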

Related

Python multiprocessing: add tasks to queue but prevent them being picked for a given time

I am using multiprocessing with multiple workers (subclasses of multiprocessing.Process) and queues (multiprocessing.JoinableQueue), to implement a complex workflow of data manipulation.
One of the workers (JobSender) is submitting jobs to a remote system (a web service), which returns an identifier immediately. Those jobs can take a very long time to be performed.
I therefore have another worker (StatusPoller) in charge of polling that remote system for status of the job. To do so, the JobSender adds the identifier in a queue that the StatusPoller uses as input. If the job is not completed, the StatusPoller puts the identifier back on the same queue. If the job is completed, the StatusPoller retrieves the result information and then adds it to a list (multiprocessing.Manager.list()).
My question: I don't want to hammer the remote system with continuous requests for status, which would happen in my setup. I want to introduce a delay somewhere to ensure that status polling for any given identifier only happens every 20 seconds.
Currently I'm doing this by having a time.sleep(20) just before the StatusPoller puts the identifier back on the queue. But that means that the StatusPoller is now idle for 20 seconds and cannot pick up another polling task from the queue. I will have multiple StatusPollers but I can't have one for each of the jobs (there might be hundreds of those).
class StatusPoller(multiprocessing.Process):
    def __init__(self, polling_queue, results_queue, errors_queue):
        multiprocessing.Process.__init__(self)
        self.polling_queue = polling_queue
        self.results_queue = results_queue

    def run(self):
        while True:
            # Pick a task from the queue
            next_id = self.polling_queue.get()
            # Poison pill => shutdown
            if next_id == 'END':
                self.polling_queue.task_done()
                break
            # Process the task
            response = remote_system.get_status(next_id)
            if response == "IN_PROGRESS":
                time.sleep(20)
                self.polling_queue.put(next_id)
            else:
                self.results_queue.put(response)
            self.polling_queue.task_done()
Any idea how to implement such a workflow?
When you consider that the multiprocessing.Process and threading.Thread classes can be instantiated with the target keyword, I consider it an antipattern to subclass these classes, since you then lose some flexibility and reuse. In fact, in your case I would think that given that StatusPoller is just waiting on a queue and a reply from a network, multithreading would be more than adequate, especially if, as you say, you have "hundreds of those." I also cannot see in your current code the need for a joinable queue.
So I would suggest using multithreading with regular queue.Queue instances and a scheduler (heavily modified from the sched module's sched.scheduler class), a single instance of which can be shared among all StatusPoller instances, since the code appears to be thread safe. Here is the general idea:
from threading import Thread
from queue import Queue
import time

# Start of modified sched.scheduler code:
#########################################################
# Heavily modified from sched.scheduler
import heapq
from collections import namedtuple
import threading
from time import monotonic as _time

class Event(namedtuple('Event', 'time, priority, action, argument, kwargs')):
    __slots__ = []
    def __eq__(s, o): return (s.time, s.priority) == (o.time, o.priority)
    def __lt__(s, o): return (s.time, s.priority) < (o.time, o.priority)
    def __le__(s, o): return (s.time, s.priority) <= (o.time, o.priority)
    def __gt__(s, o): return (s.time, s.priority) > (o.time, o.priority)

_sentinel = object()

class Scheduler():
    """
    Code modified from sched.scheduler
    """
    delayfunc = time.sleep

    def __init__(self, timefunc=_time):
        """Initialize a new instance, passing the time functions"""
        self._queue = []
        self.timefunc = timefunc
        self.got_event = threading.Condition(threading.RLock())
        self.thread_started = False

    def enterabs(self, time, priority, action, argument=(), kwargs=_sentinel):
        """Enter a new event in the queue at an absolute time.

        Returns an ID for the event which can be used to remove it,
        if necessary.
        """
        if kwargs is _sentinel:
            kwargs = {}
        event = Event(time, priority, action, argument, kwargs)
        with self.got_event:
            if not self.thread_started:
                self.thread_started = True
                threading.Thread(target=self.run, daemon=True).start()
            heapq.heappush(self._queue, event)
            # Show new Event has been entered:
            self.got_event.notify()
        return event  # The ID

    def cancel(self, event):
        """Remove an event from the queue.

        This must be presented the ID as returned by enter().
        If the event is not in the queue, this raises ValueError.
        """
        with self.got_event:
            self._queue.remove(event)
            heapq.heapify(self._queue)

    def enter(self, delay, priority, action, argument=(), kwargs=_sentinel):
        """A variant that specifies the time as a relative time.

        This is actually the more commonly used interface.
        """
        time = self.timefunc() + delay
        return self.enterabs(time, priority, action, argument, kwargs)

    def empty(self):
        """Check whether the queue is empty."""
        with self.got_event:
            return not self._queue

    def run(self):
        """Execute events until the queue is empty."""
        # localize variable access to minimize overhead
        # and to improve thread safety
        got_event = self.got_event
        q = self._queue
        timefunc = self.timefunc
        delayfunc = self.delayfunc
        pop = heapq.heappop
        while True:
            try:
                while True:
                    with got_event:
                        got_event.wait_for(lambda: len(q) != 0)
                        time, priority, action, argument, kwargs = q[0]
                        now = timefunc()
                        if time > now:
                            # Wait for either the time to elapse or a new
                            # event to be added:
                            got_event.wait(timeout=(time - now))
                            continue
                        pop(q)
                    action(*argument, **kwargs)
                    delayfunc(0)  # Let other threads run
            except:
                pass

    @property
    def queue(self):
        """An ordered list of upcoming events.

        Events are named tuples with fields for:
            time, priority, action, arguments, kwargs
        """
        # Use heapq to sort the queue rather than using 'sorted(self._queue)'.
        # With heapq, two events scheduled at the same time will show in
        # the actual order they would be retrieved.
        with self.got_event:
            events = self._queue[:]
        return list(map(heapq.heappop, [events]*len(events)))
###########################################################

def re_queue(polling_queue, id):
    polling_queue.put(id)

class StatusPoller:
    scheduler = Scheduler()

    def __init__(self, polling_queue, results_queue, errors_queue):
        self.polling_queue = polling_queue
        self.results_queue = results_queue

    def run(self):
        while True:
            # Pick a task from the queue
            next_id = self.polling_queue.get()
            # Poison pill => shutdown
            if next_id == 'END':
                break
            # Process the task
            response = remote_system.get_status(next_id)
            if response == "IN_PROGRESS":
                self.scheduler.enter(20, 1, re_queue, argument=(self.polling_queue, next_id))
            else:
                self.results_queue.put(response)
Explanation
First, why did I say that I saw no reason for a JoinableQueue? The run method is programmed to return when it finds an 'END' input message. But because run requeues messages onto the polling_queue whenever it gets an "IN_PROGRESS" response from the remote system, the possibility exists that when 'END' is received and run terminates, one or more of these requeued messages will still be on the queue. So how can another process or thread depend on calling polling_queue.join() without possibly hanging? It cannot.
Instead, if you have N processes or threads (we haven't decided yet which) doing get requests against a single queue instance, it should suffice to just put N 'END' shutdown messages on the queue. This will result in the N processes terminating. The main process now instead of joining the queue just joins the N processes or threads if it wishes to block on the actual termination of these processes/threads.
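Concretely, the shutdown protocol might look like this (a sketch building on the StatusPoller class above; the thread count and queue setup are illustrative):

import queue
import threading

N_POLLERS = 8   # illustrative
polling_queue = queue.Queue()
results_queue = queue.Queue()
errors_queue = queue.Queue()

pollers = [StatusPoller(polling_queue, results_queue, errors_queue)
           for _ in range(N_POLLERS)]
threads = [threading.Thread(target=p.run) for p in pollers]
for t in threads:
    t.start()

# ... put ids on polling_queue ...

for _ in threads:
    polling_queue.put('END')   # one pill per consumer thread
for t in threads:
    t.join()                   # block until every poller has exited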
The way I would use a JoinableQueue, which I don't think fits your use case, would be if the processes/threads were in an infinite loop, never terminating, that is, never quitting "prematurely" and therefore never leaving items on the queue. You would make these processes/threads daemons so that they would eventually end when the main process terminates, which also means you could not force a termination with an 'END' message. So I just don't see how a JoinableQueue works here, but you can point out to me if I have misunderstood something.
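For contrast, here is a minimal sketch of the daemon-worker pattern where a JoinableQueue does make sense: workers that never exit, with the producer blocking on join() until every item has been acknowledged via task_done(). (handle and work_items are hypothetical placeholders.)

import multiprocessing as mp

def worker(q):
    while True:                  # never exits; daemon dies with the main process
        item = q.get()
        try:
            handle(item)         # hypothetical task handler
        finally:
            q.task_done()

if __name__ == '__main__':
    q = mp.JoinableQueue()
    for _ in range(4):
        mp.Process(target=worker, args=(q,), daemon=True).start()
    for item in work_items:      # hypothetical iterable of tasks
        q.put(item)
    q.join()                     # returns once task_done() was called for every put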
Yes, StatusPoller could be the target of a Process instance (or even a subclass of Process as you originally had it, although apart from that being how you currently have it coded, I see no advantage to doing so). But it seems to me that it will be spending most of its time waiting on either a queue get or a network response. In both cases it will release the Global Interpreter Lock, so multithreading should be very performant. Threads also take up far fewer resources if we are indeed talking about creating hundreds of these tasks, especially if you are running under Windows. With processes you would also not be able to share the scheduler, which runs in its own thread, across all StatusPoller instances; there would be one scheduler running in each process, since each StatusPoller would be running in its own process.

The workers in ThreadPoolExecutor are not really daemon

The thing I cannot figure out is that although ThreadPoolExecutor uses daemon workers, they keep running even after the main thread exits.
I can provide a minimal example in python3.6.4:
import concurrent.futures
import time

def fn():
    while True:
        time.sleep(5)
        print("Hello")

thread_pool = concurrent.futures.ThreadPoolExecutor()
thread_pool.submit(fn)
while True:
    time.sleep(1)
    print("Wow")
Both main thread and the worker thread are infinite loops. So if I use KeyboardInterrupt to terminate main thread, I expect that the whole program will terminate too. But actually the worker thread is still running even though it is a daemon thread.
The source code of ThreadPoolExecutor confirms that worker threads are daemon threads:
t = threading.Thread(target=_worker,
                     args=(weakref.ref(self, weakref_cb),
                           self._work_queue))
t.daemon = True
t.start()
self._threads.add(t)
Further, if I manually create a daemon thread, it works like a charm:
from threading import Thread
import time

def fn():
    while True:
        time.sleep(5)
        print("Hello")

thread = Thread(target=fn)
thread.daemon = True
thread.start()
while True:
    time.sleep(1)
    print("Wow")
So I really cannot figure out this strange behavior.
Suddenly... I found out why. Reading more of the ThreadPoolExecutor source code:
# Workers are created as daemon threads. This is done to allow the interpreter
# to exit when there are still idle threads in a ThreadPoolExecutor's thread
# pool (i.e. shutdown() was not called). However, allowing workers to die with
# the interpreter has two undesirable properties:
#   - The workers would still be running during interpreter shutdown,
#     meaning that they would fail in unpredictable ways.
#   - The workers could be killed while evaluating a work item, which could
#     be bad if the callable being evaluated has external side-effects e.g.
#     writing to a file.
#
# To work around this problem, an exit handler is installed which tells the
# workers to exit when their work queues are empty and then waits until the
# threads finish.

_threads_queues = weakref.WeakKeyDictionary()
_shutdown = False

def _python_exit():
    global _shutdown
    _shutdown = True
    items = list(_threads_queues.items())
    for t, q in items:
        q.put(None)
    for t, q in items:
        t.join()

atexit.register(_python_exit)
There is an exit handler which will join all unfinished worker...
Here's a way to avoid this problem. Bad design can be beaten by another bad design. People write daemon=True only if they really know that the worker won't damage any objects or files.
In my case, I created a ThreadPoolExecutor with a single worker, and after a single submit I deleted the newly created thread from the executor's internal registry so the interpreter won't wait for this thread to stop on its own. Notice that worker threads are created after submit, not after the initialization of ThreadPoolExecutor.
import concurrent.futures.thread
from concurrent.futures import ThreadPoolExecutor
...
executor = ThreadPoolExecutor(max_workers=1)
future = executor.submit(lambda: self._exec_file(args))
del concurrent.futures.thread._threads_queues[list(executor._threads)[0]]
It works in Python 3.8 but may not work in 3.9+ since this code is accessing private variables.
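For what it's worth, Python 3.9 added a supported way to get part of this behavior: Executor.shutdown() accepts a cancel_futures flag, which discards work that has not started yet. It still cannot stop a task that is already running, such as the infinite loop above.

executor.shutdown(wait=False, cancel_futures=True)   # Python 3.9+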
See the working piece of code on github

how to to terminate process using python's multiprocessing

I have some code that needs to run against several other systems that may hang or have problems not under my control. I would like to use python's multiprocessing to spawn child processes to run independent of the main program and then when they hang or have problems terminate them, but I am not sure of the best way to go about this.
When terminate is called it does kill the child process, but then it becomes a defunct zombie that is not released until the process object is gone. The example code below, where the loop never ends, works to kill it and allow a respawn when called again, but does not seem like a good way of going about this (i.e. creating the multiprocessing.Process() in __init__() would be better).
Anyone have a suggestion?
class Process(object):
    def __init__(self):
        self.thing = Thing()
        self.running_flag = multiprocessing.Value("i", 1)

    def run(self):
        self.process = multiprocessing.Process(target=self.thing.worker, args=(self.running_flag,))
        self.process.start()
        print self.process.pid

    def pause_resume(self):
        self.running_flag.value = not self.running_flag.value

    def terminate(self):
        self.process.terminate()

class Thing(object):
    def __init__(self):
        self.count = 1

    def worker(self, running_flag):
        while True:
            if running_flag.value:
                self.do_work()

    def do_work(self):
        print "working {0} ...".format(self.count)
        self.count += 1
        time.sleep(1)
You might run the child processes as daemons in the background.
process.daemon = True
Any errors and hangs (or an infinite loop) in a daemon process will not affect the main process, and it will only be terminated once the main process exits.
This will work for simple problems, until you run into a lot of child daemon processes that keep consuming memory from the parent process without any explicit control.
The best way is to set up a Queue to have all the child processes communicate with the parent process, so that we can join them and clean up nicely. Here is some simple code that will check if a child process is hanging (aka time.sleep(1000)) and send a message to the queue for the main process to take action on:
import multiprocessing as mp
import time
import queue

running_flag = mp.Value("i", 1)

def worker(running_flag, q):
    count = 1
    while True:
        if running_flag.value:
            print(f"working {count} ...")
            count += 1
            q.put(count)
            time.sleep(1)
            if count > 3:
                # Simulate hanging with sleep
                print("hanging...")
                time.sleep(1000)

def watchdog(q):
    """
    Checks the queue for updates and sends a signal to it
    when the child process isn't sending anything for too long
    """
    while True:
        try:
            msg = q.get(timeout=10.0)
        except queue.Empty as e:
            print("[WATCHDOG]: Maybe WORKER is slacking")
            q.put("KILL WORKER")

def main():
    """The main process"""
    q = mp.Queue()
    workr = mp.Process(target=worker, args=(running_flag, q))
    wdog = mp.Process(target=watchdog, args=(q,))
    # run the watchdog as daemon so it terminates with the main process
    wdog.daemon = True
    workr.start()
    print("[MAIN]: starting process P1")
    wdog.start()
    # Poll the queue
    while True:
        msg = q.get()
        if msg == "KILL WORKER":
            print("[MAIN]: Terminating slacking WORKER")
            workr.terminate()
            time.sleep(0.1)
            if not workr.is_alive():
                print("[MAIN]: WORKER is a goner")
                workr.join(timeout=1.0)
                print("[MAIN]: Joined WORKER successfully!")
                q.close()
                break  # watchdog process daemon gets terminated

if __name__ == '__main__':
    main()
Without terminating the worker, an attempt to join() it from the main process would have blocked forever, since the worker never finishes.
The way Python multiprocessing handles processes is a bit confusing.
From the multiprocessing guidelines:
Joining zombie processes
On Unix when a process finishes but has not been joined it becomes a zombie. There should never be very many because each time a new process starts (or active_children() is called) all completed processes which have not yet been joined will be joined. Also calling a finished process’s Process.is_alive will join the process. Even so it is probably good practice to explicitly join all the processes that you start.
In order to avoid a process becoming a zombie, you need to call its join() method once you kill it.
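Applied to the terminate() method from the question, that looks like:

def terminate(self):
    self.process.terminate()
    self.process.join()   # reap the child so it does not linger as a zombie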
If you want a simpler way to deal with the hanging calls in your system you can take a look at pebble.

How do I feed an infinite generator to eventlet (or gevent)?

The docs of both eventlet and gevent have several examples on how to asyncronously spawn IO tasks and get the results latter.
But so far, in all the examples where a value should be returned from the async call, I always find a blocking call after all the calls to spawn(): either join(), joinall(), wait(), or waitall().
This assumes that calling the functions that use IO is immediate and we can jump right into the point where we are waiting for the results.
But in my case I want to get the jobs from a generator that can be slow and or arbitrarily large or even infinite.
I obviously can't do this
pile = eventlet.GreenPile(pool)
for url in mybiggenerator():
    pile.spawn(fetch_title, url)
titles = '\n'.join(pile)
because mybiggenerator() can take a long time before it is exhausted. So I have to start consuming the results while I am still spawning async calls.
This is probably usually done with recourse to queues, but I'm not really sure how. Say I create a queue to hold jobs, push a bunch of jobs from a greenlet called P and pop them from another greenlet C.
When in C, if I find that the queue is empty, how do I know if P has pushed every job it had to push or if it is just in the middle of an iteration?
Alternatively, eventlet allows me to loop through a pile to get the return values, but can I start doing this without having spawned all the jobs I have to spawn? How? This would be a simpler alternative.
You don't need any pool or pile by default. They're just convenient wrappers to implement a particular strategy. First you should get an idea of how exactly your code must work under all circumstances, that is: when and why you start another greenthread, and when and why you wait for something.
When you have some answers to these questions and doubts about others, ask away. In the meanwhile, here's a prototype that processes an infinite "generator" (actually a queue).
import eventlet

queue = eventlet.queue.Queue(10000)
wait = eventlet.semaphore.CappedSemaphore(1000)

def fetch(url):
    # httplib2.Http().request
    # or requests.get
    # or urllib.urlopen
    # or whatever API you like
    return response

def crawl(url):
    with wait:
        response = fetch(url)
    links = parse(response)
    for url in links:
        queue.put(url)

def spawn_crawl_next():
    try:
        url = queue.get(block=False)
    except eventlet.queue.Empty:
        return False
    # use another CappedSemaphore here to limit number of outstanding connections
    eventlet.spawn(crawl, url)
    return True

def crawler():
    while True:
        if spawn_crawl_next():
            continue
        while wait.balance != 0:
            eventlet.sleep(1)
        # if last spawned `crawl` enqueued more links -- process them
        if not spawn_crawl_next():
            break

def main():
    queue.put('http://initial-url')
    crawler()
Re: "concurrent.futures from Python3 does not really apply to "eventlet or gevent" part."
In fact, eventlet can be combined with concurrent.futures to deploy the ThreadPoolExecutor as a GreenThread executor.
See: https://github.com/zopefiend/green-concurrent.futures-with-eventlet/commit/aed3b9f17ac27eeaf8c56210e0c8e4aff2ecbdb5
I had the same problem and it has been super difficult to find any answers.
I think I managed to get something working by having a consumer running on a separate thread and using Event for synchronization. Seems to work fine.
Only caveat is that you have to be careful with monkey-patching. If you monkey-patch threading facilities this will probably not work.
import gevent
import gevent.queue
import threading
import time

q = gevent.queue.JoinableQueue()
queue_not_empty = threading.Event()

def run_task(task):
    print(f"Started task {task} # {time.time()}")
    # Use whatever has been monkey-patched with gevent here
    gevent.sleep(1)
    print(f"Finished task {task} # {time.time()}")

def consumer():
    while True:
        print("Waiting for item in queue")
        queue_not_empty.wait()
        try:
            task = q.get()
            print(f"Dequed task {task} for consumption # {time.time()}")
        except gevent.exceptions.LoopExit:
            queue_not_empty.clear()
            continue
        try:
            gevent.spawn(run_task, task)
        finally:
            q.task_done()
        gevent.sleep(0)  # Kickstart task

def enqueue(item):
    q.put(item)
    queue_not_empty.set()

# Run consumer on separate thread
consumer_thread = threading.Thread(target=consumer, daemon=True)
consumer_thread.start()

# Add some tasks
for i in range(5):
    enqueue(i)

time.sleep(2)
Output:
Waiting for item in queue
Dequed task 0 for consumption # 1643232632.0220542
Started task 0 # 1643232632.0222237
Waiting for item in queue
Dequed task 1 for consumption # 1643232632.0222733
Started task 1 # 1643232632.0222948
Waiting for item in queue
Dequed task 2 for consumption # 1643232632.022315
Started task 2 # 1643232632.02233
Waiting for item in queue
Dequed task 3 for consumption # 1643232632.0223525
Started task 3 # 1643232632.0223687
Waiting for item in queue
Dequed task 4 for consumption # 1643232632.022386
Started task 4 # 1643232632.0224123
Waiting for item in queue
Finished task 0 # 1643232633.0235817
Finished task 1 # 1643232633.0236874
Finished task 2 # 1643232633.0237293
Finished task 3 # 1643232633.0237558
Finished task 4 # 1643232633.0237799
Waiting for item in queue
With the new concurrent.futures module in Py3k, I would say (assuming that the processing you want to do is actually something more complex than join):
with concurrent.futures.ThreadPoolExecutor(max_workers=foo) as wp:
    res = [wp.submit(fetchtitle, url) for url in mybiggenerator()]
    ans = '\n'.join(a.result() for a in concurrent.futures.as_completed(res))
This will allow you to start processing results before all of your fetchtitle calls complete. However, it will require you to exhaust mybiggenerator before you continue -- it's not clear how you want to get around this, unless you want to set some max_urls parameter or similar. That would still be something you could do with your original implementation, though.
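If exhausting mybiggenerator up front is the sticking point, one common workaround is to cap the number of outstanding futures and refill the window as results complete. A sketch (fetchtitle and mybiggenerator as above; the window size is arbitrary):

import concurrent.futures

def fetch_all(urls, max_workers=10, window=20):
    # keep at most `window` futures in flight; yield results as they finish
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as wp:
        pending = set()
        for url in urls:
            pending.add(wp.submit(fetchtitle, url))
            if len(pending) >= window:
                done, pending = concurrent.futures.wait(
                    pending, return_when=concurrent.futures.FIRST_COMPLETED)
                for fut in done:
                    yield fut.result()
        for fut in concurrent.futures.as_completed(pending):
            yield fut.result()

# titles = '\n'.join(fetch_all(mybiggenerator()))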

Index out of Range in consumer Thread

I have a thread that should wait for tasks to arrive from multiple other threads and execute them until no task is left. If no task is left, it should wait again.
I tried it with this class (only relevant code):
from threading import Event, Thread

class TaskExecutor(object):
    def __init__(self):
        self.event = Event()
        self.taskInfos = []
        self._running = True
        task_thread = Thread(target=self._run_worker_thread)
        task_thread.daemon = True
        task_thread.start()

    def _run_worker_thread(self):
        while self.is_running():
            if len(self.taskInfos) == 0:
                self.event.clear()
                self.event.wait()
            try:
                msg, task = self.taskInfos[0]
                del self.taskInfos[0]
                if task:
                    task.execute(msg)
            except Exception, e:
                logger.error("Error " + str(e))

    def schedule_task(self, msg, task):
        self.taskInfos.append((msg, task))
        self.event.set()
Multiple threads call schedule_task whenever they want to add a task.
The problem is that I sometimes get an error saying list index out of range from the msg, task = self.taskInfos[0] line. The del self.taskInfos[0] below it is the only place where I delete a task.
How can that happen? I feel like I have to synchronize everything, but there is no such keyword in Python, and reading the docs brought up this pattern.
This code is pretty hopeless - give up on it and do something sane ;-) What's sane? Use a Queue.Queue. That's designed to do what you want.
Replace:
self.event = Event()
self.taskInfos = []
with:
self.taskInfos = Queue.Queue()
(of course you have to import Queue too).
To add a task:
self.taskInfos.put((msg, task))
To get a task:
msg, task = self.taskInfos.get()
That will block until a task is available. There are also options to do a non-blocking .get() attempt, and to do a .get() attempt with a timeout (read the docs).
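For example, the timeout variant (Python 2 spelling, to match the rest of this code; what you do on Queue.Empty is up to you):

try:
    msg, task = self.taskInfos.get(timeout=5)
except Queue.Empty:
    pass   # no task arrived within 5 seconds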
Trying to fix the code you have would be a nightmare. At heart, Events are not powerful enough to do what you need for thread safety in this context. In fact, any time you see code doing Event.clear(), it's probably buggy (subject to races).
Edit: what will go wrong next
If you continue trying to fix this code, this is likely to happen next:
1. the queue is empty
2. thread 1 does len(self.taskInfos) == 0, and loses its timeslice
3. thread 2 does self.taskInfos.append((msg, task))
4. and does self.event.set()
5. and loses its timeslice
6. thread 1 resumes and does self.event.clear()
7. and does self.event.wait()
That's why Python supplies Queue.Queue. You're exceedingly unlikely to get a correct solution using a feeble Event.
The following sequence is possible (assume Thread #0 is a consumer and runs your _run_worker_thread method, and threads Thread #1 and Thread #2 are producers and call schedule_task method):
1. Thread #0 waits on the event, queue is empty
2. Thread #1 calls schedule_task and is preempted before set
3. Thread #2 calls schedule_task and reaches set
4. Thread #0 awakens and does two iterations, clearing the task queue
5. Thread #1 awakens and reaches the set call
6. Thread #0 awakens with incorrect state - the queue is empty
Steps 4 and 6 are key to understanding the race that is possible. Basically, the worker thread can consume all the tasks, spinning with the if len(self.taskInfos) == 0 condition being False, before all producers manage to set the event after appending to the queue.
Possible solutions include checking the condition again after the wait, as suggested in the comments by xndrme, or using the Lock class; the best option is probably the Queue.Queue class mentioned by Tim Peters in his answer.
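For completeness, here is a sketch of the "check the condition again after wait" fix, using threading.Condition; as both answers say, Queue.Queue remains the better tool:

from threading import Condition

class TaskList(object):
    def __init__(self):
        self._cond = Condition()
        self._tasks = []

    def schedule_task(self, msg, task):
        with self._cond:
            self._tasks.append((msg, task))
            self._cond.notify()

    def next_task(self):
        with self._cond:
            while not self._tasks:   # re-check after every wakeup
                self._cond.wait()
            return self._tasks.pop(0)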
