I am using multiprocessing with multiple workers (subclasses of multiprocessing.Process) and queues (multiprocessing.JoinableQueue), to implement a complex workflow of data manipulation.
One of the workers (JobSender) is submitting jobs to a remote system (a web service), which returns an identifier immediately. Those jobs can take a very long time to be performed.
I therefore have another worker (StatusPoller) in charge of polling that remote system for status of the job. To do so, the JobSender adds the identifier in a queue that the StatusPoller uses as input. If the job is not completed, the StatusPoller puts the identifier back on the same queue. If the job is completed, the StatusPoller retrieves the result information and then adds it to a list (multiprocessing.Manager.list()).
My question: I don't want to hammer the remote system with continuous requests for status, which would happen in my setup. I want to introduce a delay somewhere to ensure that status polling for any given identifier only happens every 20 seconds.
Currently I'm doing this by having a time.sleep(20) just before the StatusPoller puts the identifier back on the queue. But that means that the StatusPoller is now idle for 20 seconds and cannot pick up another polling task from the queue. I will have multiple StatusPollers but I can't have one for each of the jobs (there might be hundreds of those).
class StatusPoller(multiprocessing.Process):
    def __init__(self, polling_queue, results_queue, errors_queue):
        multiprocessing.Process.__init__(self)
        self.polling_queue = polling_queue
        self.results_queue = results_queue

    def run(self):
        while True:
            # Pick a task from the queue
            next_id = self.polling_queue.get()
            # Poison pill => shutdown
            if next_id == 'END':
                self.polling_queue.task_done()
                break
            # Process the task
            response = remote_system.get_status(next_id)
            if response == "IN_PROGRESS":
                time.sleep(20)
                self.polling_queue.put(next_id)
            else:
                self.results_queue.put(response)
            self.polling_queue.task_done()
Any idea how to implement such a workflow?
When you consider that the multiprocessing.Process and threading.Thread classes can be instantiated with the target keyword, I consider it an antipattern to subclass these classes, since you then lose some flexibility and reuse. In fact, given that StatusPoller is just waiting on a queue and on a reply from the network, I would think that multithreading would be more than adequate, especially if, as you say, you have "hundreds of those." I also cannot see in your current code the need for a joinable queue.
So I would suggest using multithreading with regular queue.Queue instances and a scheduler based on the sched.scheduler class from the sched module, which can be shared among all StatusPoller instances as the code appears to be thread safe. Here is the general idea:
from threading import Thread
from queue import Queue
import time

# Start of modified sched.scheduler code:
#########################################################
# Heavily modified from sched.scheduler
import time
import heapq
from collections import namedtuple
import threading
from time import monotonic as _time

class Event(namedtuple('Event', 'time, priority, action, argument, kwargs')):
    __slots__ = []
    def __eq__(s, o): return (s.time, s.priority) == (o.time, o.priority)
    def __lt__(s, o): return (s.time, s.priority) <  (o.time, o.priority)
    def __le__(s, o): return (s.time, s.priority) <= (o.time, o.priority)
    def __gt__(s, o): return (s.time, s.priority) >  (o.time, o.priority)

_sentinel = object()

class Scheduler():
    """
    Code modified from sched.scheduler
    """
    delayfunc = time.sleep

    def __init__(self, timefunc=_time):
        """Initialize a new instance, passing the time functions"""
        self._queue = []
        self.timefunc = timefunc
        self.got_event = threading.Condition(threading.RLock())
        self.thread_started = False

    def enterabs(self, time, priority, action, argument=(), kwargs=_sentinel):
        """Enter a new event in the queue at an absolute time.
        Returns an ID for the event which can be used to remove it,
        if necessary.
        """
        if kwargs is _sentinel:
            kwargs = {}
        event = Event(time, priority, action, argument, kwargs)
        with self.got_event:
            if not self.thread_started:
                self.thread_started = True
                threading.Thread(target=self.run, daemon=True).start()
            heapq.heappush(self._queue, event)
            # Show new Event has been entered:
            self.got_event.notify()
        return event  # The ID

    def cancel(self, event):
        """Remove an event from the queue.
        This must be presented the ID as returned by enter().
        If the event is not in the queue, this raises ValueError.
        """
        with self.got_event:
            self._queue.remove(event)
            heapq.heapify(self._queue)

    def enter(self, delay, priority, action, argument=(), kwargs=_sentinel):
        """A variant that specifies the time as a relative time.
        This is actually the more commonly used interface.
        """
        time = self.timefunc() + delay
        return self.enterabs(time, priority, action, argument, kwargs)

    def empty(self):
        """Check whether the queue is empty."""
        with self.got_event:
            return not self._queue

    def run(self):
        """Execute events until the queue is empty."""
        # localize variable access to minimize overhead
        # and to improve thread safety
        got_event = self.got_event
        q = self._queue
        timefunc = self.timefunc
        delayfunc = self.delayfunc
        pop = heapq.heappop
        while True:
            try:
                while True:
                    with got_event:
                        got_event.wait_for(lambda: len(q) != 0)
                        time, priority, action, argument, kwargs = q[0]
                        now = timefunc()
                        if time > now:
                            # Wait for either the time to elapse or a new
                            # event to be added:
                            got_event.wait(timeout=(time - now))
                            continue
                        pop(q)
                    action(*argument, **kwargs)
                    delayfunc(0)  # Let other threads run
            except:
                pass

    @property
    def queue(self):
        """An ordered list of upcoming events.
        Events are named tuples with fields for:
            time, priority, action, arguments, kwargs
        """
        # Use heapq to sort the queue rather than using 'sorted(self._queue)'.
        # With heapq, two events scheduled at the same time will show in
        # the actual order they would be retrieved.
        with self.got_event:
            events = self._queue[:]
        return list(map(heapq.heappop, [events]*len(events)))
###########################################################

def re_queue(polling_queue, id):
    polling_queue.put(id)

class StatusPoller:
    scheduler = Scheduler()

    def __init__(self, polling_queue, results_queue, errors_queue):
        self.polling_queue = polling_queue
        self.results_queue = results_queue

    def run(self):
        while True:
            # Pick a task from the queue
            next_id = self.polling_queue.get()
            # Poison pill => shutdown
            if next_id == 'END':
                break
            # Process the task
            response = remote_system.get_status(next_id)
            if response == "IN_PROGRESS":
                self.scheduler.enter(20, 1, re_queue, argument=(self.polling_queue, next_id))
            else:
                self.results_queue.put(response)
Explanation
First, why did I say that I saw no reason for a JoinableQueue? The run method is programmed to return if it finds an input message that is 'END'. But because this method, on receiving an "IN_PROGRESS" response from the remote system, requeues the message back onto the polling_queue, the possibility exists that when 'END' is received and run terminates, one or more of these requeued messages remain on the queue. So how can another process or thread depend on calling polling_queue.join() without possibly hanging? It cannot.
Instead, if you have N processes or threads (we haven't decided yet which) doing get requests against a single queue instance, it should suffice to just put N 'END' shutdown messages on the queue. This will result in the N processes terminating. The main process now instead of joining the queue just joins the N processes or threads if it wishes to block on the actual termination of these processes/threads.
The way I would use a JoinableQueue, which I don't think fits your use case, would be if the processes/threads were in an infinite loop never terminating, that is, not quitting "prematurely" and therefore never leaving items left on the queue. You would make these processes/threads daemon processes so that they would eventually end when the main process eventually terminates. So you could not force a termination with an 'END' message. So I just don't see how a JoinableQueue works here, but you can point out to me if I have misunderstood something.
Yes, StatusPoller could be the target of a Process instance (or even a subclass of Process as you originally had it, although apart from that being how you currently have it coded, I see no advantage to doing that). But it seems to me that it will spend most of its time waiting either to get from a queue or to get a network response. In both cases it releases the Global Interpreter Lock, so multithreading should perform very well. Threads will also take up far fewer resources if we are indeed talking about creating hundreds of instances of these tasks, especially if you are running under Windows. With processes, you would also not be able to share the scheduler, which runs in its own thread, across all StatusPoller instances: there would be one scheduler running in each process, since each StatusPoller would be running in its own process.
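To make the shutdown idea concrete, here is a minimal sketch of N threads reading from one queue, stopped by N 'END' messages and then joined directly. The names worker and run_workers are made up for this illustration, and the worker only echoes items instead of polling a remote system.

```python
import queue
import threading

END = 'END'  # poison pill, same convention as in the question


def worker(polling_queue, results_queue):
    """Hypothetical stand-in for StatusPoller.run: consume until END arrives."""
    while True:
        item = polling_queue.get()
        if item == END:
            break
        results_queue.put(item)


def run_workers(items, n_workers=4):
    polling_queue = queue.Queue()
    results_queue = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(polling_queue, results_queue))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for item in items:
        polling_queue.put(item)
    # One END per worker: each worker consumes exactly one pill and exits.
    for _ in threads:
        polling_queue.put(END)
    # Join the threads themselves rather than the queue.
    for t in threads:
        t.join()
    # All producers have finished, so draining until empty is safe here.
    results = []
    while not results_queue.empty():
        results.append(results_queue.get())
    return results
```

Because the main thread joins the workers and not the queue, nothing hangs even if some other component leaves items behind on polling_queue.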
I am sharpening my Python skills and have started learning about websockets as an educational tool.
Therefore, I'm working with real-time data received every millisecond via a websocket. I would like to separate its acquisition/processing/plotting in a clean and comprehensive way. Acquisition and processing are critical, whereas plotting can be updated every ~100ms.
A) I am assuming that the raw data arrives at a constant rate, every ms.
B) If processing isn't quick enough (>1ms), skip the data that arrived while busy and stay synced with A)
C) Every ~100ms or so, get the last processed data and plot it.
I guess that a Minimal Working Example would start like this:
import threading

class ReceiveData(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def receive(self):
        pass

class ProcessData(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def process(self):
        pass

class PlotData(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def plot(self):
        pass
Starting with this (is it even the right way to go ?), how can I pass the raw data from ReceiveData to ProcessData, and periodically to PlotData ? How can I keep the executions synced, and repeat calls every ms or 100ms ?
Thank you.
I think your general approach with threads for receiving and processing the data is fine. For communication between the threads, I would suggest a producer-consumer approach. Here is a complete example using a Queue as a data structure.
In your case you want to skip unprocessed data and use only the most recent element. For achieving this, collections.deque (see documentation) might be a better choice for you - see also this discussion.
d = collections.deque(maxlen=1)
The producer side would then append data to the deque like this:
d.append(item)
And the main loop on the consumer side might look like this:
while True:
    try:
        item = d.pop()
        print('Getting item' + str(item))
    except IndexError:
        print('Deque is empty')
    # time.sleep(s) if you want to poll the latest data every s seconds
Possibly, you can merge the ReceiveData and ProcessData functionalities into just one class / thread and use only one deque between this class and PlotData.
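Putting the pieces together, a rough sketch of the "latest value only" hand-off between two threads could look like the following. The function names, the integer payload, and the sleep intervals are placeholders for this illustration and not part of the original setup.

```python
import collections
import threading
import time


def producer(latest, n_items, stop):
    """Pretend data source: publishes 0..n_items-1, about one item per millisecond."""
    for i in range(n_items):
        latest.append(i)  # with maxlen=1 the previous, unread item is dropped
        time.sleep(0.001)
    stop.set()


def consumer(latest, stop, seen):
    """Slower reader: always takes the most recent item and skips stale ones."""
    while True:
        stopped = stop.is_set()  # read the flag BEFORE trying to pop
        try:
            seen.append(latest.pop())
        except IndexError:
            if stopped:
                break  # producer finished and nothing is left
            time.sleep(0.005)  # nothing new yet, poll again shortly


def demo(n_items=50):
    latest = collections.deque(maxlen=1)
    stop = threading.Event()
    seen = []
    threads = [
        threading.Thread(target=producer, args=(latest, n_items, stop)),
        threading.Thread(target=consumer, args=(latest, stop, seen)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

The consumer may see fewer items than were produced, which is the intended behaviour: items that arrive while it is busy are overwritten, but the items it does see stay in order and the final one is never lost.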
(python2.7)
I'm trying to do a kind of scanner, that has to walk through CFG nodes, and split in different processes on branching for parallelism purpose.
The scanner is represented by an object of class Scanner. This class has one method traverse that walks through the said graph and splits if necessary.
Here is how it looks:
class Scanner(object):
    def __init__(self, atrb1, ...):
        self.attribute1 = atrb1
        self.process_pool = Pool(processes=4)

    def traverse(self, ...):
        [...]
        if branch:
            self.process_pool.map(my_func, todo_list)
My problem is the following:
How do I create an instance of multiprocessing.Pool that is shared between all of my processes? I want it to be shared because, since a path can be split again, I do not want to end up with a kind of fork bomb, and having the same Pool will help me limit the number of processes running at the same time.
The above code does not work, since Pool can not be pickled. As a consequence, I have tried this:
class Scanner(object):
    def __getstate__(self):
        self_dict = self.__dict__.copy()
        del self_dict['process_pool']
        return self_dict
    [...]
But obviously, it results in having self.process_pool not defined in the created processes.
Then, I tried to create a Pool as a module attribute:
process_pool = Pool(processes=4)

def my_func(x):
    [...]

class Scanner(object):
    def __init__(self, atrb1, ...):
        self.attribute1 = atrb1

    def traverse(self, ...):
        [...]
        if branch:
            process_pool.map(my_func, todo_list)
It does not work, and this answer explains why.
But here comes the thing, wherever I create my Pool, something is missing. If I create this Pool at the end of my file, it does not see self.attribute1, the same way it did not see answer and fails with an AttributeError.
I'm not even trying to share it yet, and I'm already stuck with multiprocessing's way of doing things.
I don't know if I have not been thinking correctly the whole thing, but I can not believe it's so complicated to handle something as simple as "having a worker pool and giving them tasks".
Thank you,
EDIT:
I resolved my first problem (AttributeError), my class had a callback as its attribute, and this callback was defined in the main script file, after the import of the scanner module... But the concurrency and "do not fork bomb" thing is still a problem.
What you want to do can't be done safely. Think about what would happen if you somehow had a single Pool shared across parent and worker processes, with, say, two worker processes. The parent runs a map that tries to perform two tasks, and each task needs to map two more tasks. The two parent-dispatched tasks go one to each worker, and the parent blocks. Each worker sends two more tasks to the shared pool and blocks waiting for them to complete. But now all workers are occupied, each waiting for a worker to become free; you've deadlocked.
A safer approach would be to have the workers return enough information to dispatch additional tasks in the parent. Then you could do something like:
class MoreWork(object):
    def __init__(self, func, *args):
        self.func = func
        self.args = args

pool = multiprocessing.Pool()
try:
    base_task = somefunc, someargs
    outstanding = collections.deque([pool.apply_async(*base_task)])
    while outstanding:
        result = outstanding.popleft().get()
        if isinstance(result, MoreWork):
            outstanding.append(pool.apply_async(result.func, result.args))
        else:
            ... do something with a "final" result, maybe breaking the loop ...
finally:
    pool.terminate()
What the functions are is up to you, they'd just return information in a MoreWork when there was more to do, not launch a task directly. The point is to ensure that by having the parent be solely responsible for task dispatch, and the workers solely responsible for task completion, you can't deadlock due to all workers being blocked waiting for tasks that are in the queue, but not being processed.
This is also not at all optimized; ideally, you wouldn't block waiting on the first item in the queue if other items in the queue were complete; it's a lot easier to do this with the concurrent.futures module, specifically with concurrent.futures.wait to wait on the first available result from an arbitrary number of outstanding tasks, but you'd need a third party PyPI package to get concurrent.futures on Python 2.7.
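For illustration, here is roughly what the concurrent.futures.wait variant could look like on Python 3. This is only a sketch under a few assumptions: it uses ThreadPoolExecutor to stay self-contained (a ProcessPoolExecutor should slot into the same loop as long as the functions and arguments are picklable), tasks return a list of MoreWork objects instead of a single one, and explore and run_all are invented names.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


class MoreWork(object):
    def __init__(self, func, *args):
        self.func = func
        self.args = args


def explore(depth):
    """Hypothetical task: branch in two until depth 0, then return a final 1."""
    if depth == 0:
        return 1
    # Describe the follow-up work; do not submit it from inside the worker.
    return [MoreWork(explore, depth - 1), MoreWork(explore, depth - 1)]


def run_all(func, *args, max_workers=2):
    finals = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outstanding = {pool.submit(func, *args)}
        while outstanding:
            # Wake up as soon as ANY outstanding task finishes.
            done, outstanding = wait(outstanding, return_when=FIRST_COMPLETED)
            for future in done:
                result = future.result()
                if isinstance(result, list):  # a list of MoreWork objects
                    for work in result:
                        outstanding.add(pool.submit(work.func, *work.args))
                else:
                    finals.append(result)
    return finals
```

Only the parent ever submits work, so even with two workers and a branching task tree no worker blocks waiting on another task.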
Context
I recently posted a timer class for review on Code Review. I'd had a gut feeling there were concurrency bugs as I'd once seen 1 unit test fail, but was unable to reproduce the failure. Hence my post to code review.
I got some great feedback highlighting various race conditions in the code. (I thought) I understood the problem and the solution, but before making any fixes, I wanted to expose the bugs with a unit test. When I tried, I realised it was difficult. Various stack exchange answers suggested I'd have to control the execution of threads to expose the bug(s) and any contrived timing would not necessarily be portable to a different machine. This seemed like a lot of accidental complexity beyond the problem I was trying to solve.
Instead I tried using the best static analysis (SA) tool for python, PyLint, to see if it'd pick out any of the bugs, but it couldn't. Why could a human find the bugs through code review (essentially SA), but a SA tool could not?
Afraid of trying to get Valgrind working with python (which sounded like yak-shaving), I decided to have a bash at fixing the bugs without reproducing them first. Now I'm in a pickle.
Here's the code now.
from threading import Timer, Lock
from time import time

class NotRunningError(Exception): pass
class AlreadyRunningError(Exception): pass

class KitchenTimer(object):
    '''
    Loosely models a clockwork kitchen timer with the following differences:
        You can start the timer with arbitrary duration (e.g. 1.2 seconds).
        The timer calls back a given function when time's up.
        Querying the time remaining has 0.1 second accuracy.
    '''

    PRECISION_NUM_DECIMAL_PLACES = 1
    RUNNING = "RUNNING"
    STOPPED = "STOPPED"
    TIMEUP = "TIMEUP"

    def __init__(self):
        self._stateLock = Lock()
        with self._stateLock:
            self._state = self.STOPPED
            self._timeRemaining = 0

    def start(self, duration=1, whenTimeup=None):
        '''
        Starts the timer to count down from the given duration and call whenTimeup when time's up.
        '''
        with self._stateLock:
            if self.isRunning():
                raise AlreadyRunningError
            else:
                self._state = self.RUNNING
                self.duration = duration
                self._userWhenTimeup = whenTimeup
                self._startTime = time()
                self._timer = Timer(duration, self._whenTimeup)
                self._timer.start()

    def stop(self):
        '''
        Stops the timer, preventing whenTimeup callback.
        '''
        with self._stateLock:
            if self.isRunning():
                self._timer.cancel()
                self._state = self.STOPPED
                self._timeRemaining = self.duration - self._elapsedTime()
            else:
                raise NotRunningError()

    def isRunning(self):
        return self._state == self.RUNNING

    def isStopped(self):
        return self._state == self.STOPPED

    def isTimeup(self):
        return self._state == self.TIMEUP

    @property
    def timeRemaining(self):
        if self.isRunning():
            self._timeRemaining = self.duration - self._elapsedTime()
        return round(self._timeRemaining, self.PRECISION_NUM_DECIMAL_PLACES)

    def _whenTimeup(self):
        with self._stateLock:
            self._state = self.TIMEUP
            self._timeRemaining = 0
            if callable(self._userWhenTimeup):
                self._userWhenTimeup()

    def _elapsedTime(self):
        return time() - self._startTime
Question
In the context of this code example, how can I expose the race conditions, fix them, and prove they're fixed?
Extra points
extra points for a testing framework suitable for other implementations and problems rather than specifically to this code.
Takeaway
My takeaway is that the technical solution to reproduce the identified race conditions is to control the synchronisation of two threads to ensure they execute in the order that will expose a bug. The important point here is that they are already identified race conditions. The best way I've found to identify race conditions is to put your code up for code review and encourage more expert people to analyse it.
Traditionally, forcing race conditions in multithreaded code is done with semaphores, so you can force a thread to wait until another thread has achieved some edge condition before continuing.
For example, your object has some code to check that start is not called if the object is already running. You could force this condition to make sure it behaves as expected by doing something like this:
starting a KitchenTimer
having the timer block on a semaphore while in the running state
starting the same timer in another thread
catching AlreadyRunningError
To do some of this you may need to extend the KitchenTimer class. Formal unit tests will often use mock objects which are defined to block at critical times. Mock objects are a bigger topic than I can address here, but googling "python mock object" will turn up a lot of documentation and many implementations to choose from.
Here's a way that you could force your code to throw AlreadyRunningError:
import threading

class TestKitchenTimer(KitchenTimer):

    _runningLock = threading.Condition()

    def start(self, duration=1, whenTimeUp=None):
        KitchenTimer.start(self, duration, whenTimeUp)
        with self._runningLock:
            print "waiting on _runningLock"
            self._runningLock.wait()

    def resume(self):
        with self._runningLock:
            self._runningLock.notify()

timer = TestKitchenTimer()

# Start the timer in a subthread. This thread will block as soon as
# it is started.
thread_1 = threading.Thread(target = timer.start, args = (10, None))
thread_1.start()

# Attempt to start the timer in a second thread, causing it to throw
# an AlreadyRunningError.
try:
    thread_2 = threading.Thread(target = timer.start, args = (10, None))
    thread_2.start()
except AlreadyRunningError:
    print "AlreadyRunningError"
    timer.resume()
    timer.stop()
Reading through the code, identify some of the boundary conditions you want to test, then think about where you would need to pause the timer to force that condition to arise, and add Conditions, Semaphores, Events, etc. to make it happen. For example, what happens if, just as the timer runs the whenTimeup callback, another thread tries to stop it? You can force that condition by making the timer wait as soon as it has entered _whenTimeup:
import threading
from time import sleep

class TestKitchenTimer(KitchenTimer):

    _runningLock = threading.Condition()

    def _whenTimeup(self):
        with self._runningLock:
            self._runningLock.wait()
        KitchenTimer._whenTimeup(self)

    def resume(self):
        with self._runningLock:
            self._runningLock.notify()

def TimeupCallback():
    print "TimeupCallback was called"

timer = TestKitchenTimer()

# The timer thread will block when the timer expires, but before the callback
# is invoked.
thread_1 = threading.Thread(target = timer.start, args = (1, TimeupCallback))
thread_1.start()
sleep(2)

# The timer is now blocked. In the parent thread, we stop it.
timer.stop()
print "timer is stopped: %r" % timer.isStopped()

# Now allow the countdown thread to resume.
timer.resume()
Subclassing the class you want to test isn't an awesome way to instrument it for testing: you'll have to override basically all of the methods in order to test race conditions in each one, and at that point there's a good argument to be made that you're not really testing the original code. Instead, you may find it cleaner to put the semaphores right in the KitchenTimer object but initialized to None by default, and have your methods check if testRunningLock is not None: before acquiring or waiting on the lock. Then you can force races on the actual code that you're submitting.
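As a rough sketch of that last idea (written for Python 3), the class below carries an optional test hook that defaults to None and is only called at the point where a race is suspected. OnceStarter, start_once, and force_race are made-up names, and the class is a deliberately buggy stand-in, not KitchenTimer itself: the check is done outside the lock.

```python
import threading


class OnceStarter(object):
    """Hypothetical class with a check-then-act bug: the check is not locked."""

    def __init__(self, test_hook=None):
        self._lock = threading.Lock()
        self._test_hook = test_hook  # None in production, a callable in tests
        self.started_by = []

    def start_once(self):
        if not self.started_by:  # check ...
            if self._test_hook is not None:
                self._test_hook()  # test-only pause between check and act
            with self._lock:  # ... then act
                self.started_by.append(threading.current_thread().name)


def force_race():
    # A two-party barrier holds each thread right after its check, so both
    # threads are guaranteed to have passed the check before either one acts.
    barrier = threading.Barrier(2)
    starter = OnceStarter(test_hook=barrier.wait)
    threads = [threading.Thread(target=starter.start_once) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return starter.started_by
```

With the hook in place the race fires on every run, so a test can assert on it before the fix and assert that it is gone afterwards. Moving the check inside the lock is the fix for this particular toy class.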
Some reading on Python mock frameworks that may be helpful. In fact, I'm not sure that mocks would be helpful in testing this code: it's almost entirely self-contained and doesn't rely on many external objects. But mock tutorials sometimes touch on issues like these. I haven't used any of these, but their documentation looks like a good place to get started:
Getting Started with Mock
Using Fudge
Python Mock Testing Techniques and Tools
The most common solution to testing thread (un)safe code is to start a lot of threads and hope for the best. The problem I, and I can imagine others, have with this is that it relies on chance and it makes tests 'heavy'.
As I ran into this a while ago, I wanted to go for precision instead of brute force. The result is a piece of test code that causes race conditions by letting the threads race neck and neck.
Sample racey code
spam = []

def set_spam():
    spam[:] = foo()
    use(spam)
If set_spam is called from several threads, a race condition exists between modification and use of spam. Let's try to reproduce it consistently.
How to cause race-conditions
class TriggeredThread(threading.Thread):
    def __init__(self, sequence=None, *args, **kwargs):
        self.sequence = sequence
        self.lock = threading.Condition()
        self.event = threading.Event()
        threading.Thread.__init__(self, *args, **kwargs)

    def __enter__(self):
        self.lock.acquire()
        while not self.event.is_set():
            self.lock.wait()
        self.event.clear()

    def __exit__(self, *args):
        self.lock.release()
        if self.sequence:
            next(self.sequence).trigger()

    def trigger(self):
        with self.lock:
            self.event.set()
            self.lock.notify()
Then to demonstrate the use of this thread:
spam = []  # Use a list to share values across threads.
results = []  # Register the results.

def set_spam():
    thread = threading.current_thread()
    with thread:  # Acquires the lock.
        # Set 'spam' to thread name
        spam[:] = [thread.name]
    # Thread 'releases' the lock upon exiting the context.
    # The next thread is triggered and this thread waits for a trigger.
    with thread:
        # Since each thread overwrites the content of the 'spam'
        # list, this should only result in True for the last thread.
        results.append(spam == [thread.name])

threads = [
    TriggeredThread(name='a', target=set_spam),
    TriggeredThread(name='b', target=set_spam),
    TriggeredThread(name='c', target=set_spam)]

# Create a shifted sequence of threads and share it among the threads.
thread_sequence = itertools.cycle(threads[1:] + threads[:1])
for thread in threads:
    thread.sequence = thread_sequence

# Start each thread
[thread.start() for thread in threads]
# Trigger first thread.
# That thread will trigger the next thread, and so on.
threads[0].trigger()
# Wait for each thread to finish.
[thread.join() for thread in threads]
# The last thread 'has won the race' overwriting the value
# for 'spam', thus [False, False, True].
# If set_spam were thread-safe, all results would be true.
assert results == [False, False, True], "race condition triggered"
assert results == [True, True, True], "code is thread-safe"
I think I explained enough about this construction so you can implement it for your own situation. I think this fits the 'extra points' section quite nicely:
extra points for a testing framework suitable for other implementations and problems rather than specifically to this code.
Solving race-conditions
Shared variables
Each threading issue is solved in its own specific way. In the example above I caused a race condition by sharing a value across threads. Similar problems can occur when using global variables, such as a module attribute. The key to solving such issues may be to use thread-local storage:
# The thread local storage is a global.
# This may seem weird at first, but it isn't actually shared among threads.
data = threading.local()
data.spam = []  # This list only exists in this thread.
results = []  # Results *are* shared though.

def set_spam():
    thread = threading.current_thread()
    # 'get' or set the 'spam' list. This actually creates a new list.
    # If the list was shared among threads this would cause a race-condition.
    data.spam = getattr(data, 'spam', [])
    with thread:
        data.spam[:] = [thread.name]
    with thread:
        results.append(data.spam == [thread.name])

# Start the threads as in the example above.

assert all(results)  # All results should be True.
Concurrent reads/writes
A common threading issue is the problem of multiple threads reading and/or writing to a data holder concurrently. This problem is solved by implementing a read-write lock. The actual implementation of a read-write lock may differ. You may choose a read-first lock, a write-first lock or just at random.
I'm sure there are examples out there describing such locking techniques. I may write an example later as this is quite a long answer already. ;-)
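In the meantime, here is one possible minimal sketch of a read-first (reader-preferring) lock built from two plain threading.Lock objects. The class and method names are made up, it is not re-entrant, and a steady stream of readers can starve writers, so treat it as an illustration and not as a production implementation.

```python
import threading


class ReadWriteLock(object):
    """Read-first lock: many concurrent readers, or exactly one writer."""

    def __init__(self):
        self._readers = 0
        self._readers_lock = threading.Lock()  # protects the reader count
        self._writer_lock = threading.Lock()   # held by a writer, or by the readers as a group

    def acquire_read(self):
        with self._readers_lock:
            self._readers += 1
            if self._readers == 1:
                # The first reader locks writers out on behalf of all readers.
                self._writer_lock.acquire()

    def release_read(self):
        with self._readers_lock:
            self._readers -= 1
            if self._readers == 0:
                # The last reader lets writers back in. A plain Lock may be
                # released by a different thread than the one that acquired it.
                self._writer_lock.release()

    def acquire_write(self):
        self._writer_lock.acquire()

    def release_write(self):
        self._writer_lock.release()
```

A reader wraps its access in acquire_read() / release_read() and a writer in acquire_write() / release_write(); a write-first variant would additionally need to make new readers wait while a writer is queued.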
Notes
Have a look at the threading module documentation and experiment with it a bit. As each threading issue is different, different solutions apply.
While on the subject of threading, have a look at the Python GIL (Global Interpreter Lock). It is important to note that threading may not actually be the best approach in optimizing performance (but this is not your goal). I found this presentation pretty good: https://www.youtube.com/watch?v=zEaosS1U5qY
You can test it by using a lot of threads:
import sys, random, thread
from time import time, sleep

def timeup():
    sys.stdout.write("Timer:: Up %f" % time())

def trdfunc(kt, tid):
    while True :
        sleep(1)
        if not kt.isRunning():
            if kt.start(1, timeup):
                sys.stdout.write("[%d]: started\n" % tid)
        else:
            if random.random() < 0.1:
                kt.stop()
                sys.stdout.write("[%d]: stopped\n" % tid)
        sys.stdout.write("[%d] remains %f\n" % ( tid, kt.timeRemaining))

kt = KitchenTimer()
kt.start(1, timeup)
for i in range(1, 100):
    thread.start_new_thread ( trdfunc, (kt, i) )
trdfunc(kt, 0)
A couple of problems I see:
When a thread sees the timer as not running and tries to start it, the code generally raises an exception due to a context switch between the test and the start. I think raising an exception is too much. Alternatively, you can have an atomic testAndStart function.
A similar problem occurs with stop. You can implement a testAndStop function.
Even this code from the timeRemaining function:
if self.isRunning():
    self._timeRemaining = self.duration - self._elapsedTime()
needs some sort of atomicity; perhaps you need to grab a lock before testing isRunning.
If you plan to share this class between threads, you need to address these issues.
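A minimal sketch of what such atomic operations might look like follows, reduced to the running/stopped flag only. StartStop, test_and_start, test_and_stop, and count_winners are hypothetical names, and the real timer would also have to create and cancel its threading.Timer inside the same locked section.

```python
import threading


class StartStop(object):
    """Hypothetical reduction of the timer to its running/stopped state."""

    def __init__(self):
        self._lock = threading.Lock()
        self._running = False

    def test_and_start(self):
        """Start if stopped. Returns True only for the call that started it."""
        with self._lock:  # test and state change happen under one lock
            if self._running:
                return False
            self._running = True
            return True

    def test_and_stop(self):
        """Stop if running. Returns True only for the call that stopped it."""
        with self._lock:
            if not self._running:
                return False
            self._running = False
            return True


def count_winners(n_threads=20):
    """Let many threads try to start the same object; count who succeeded."""
    timer = StartStop()
    wins = []
    wins_lock = threading.Lock()

    def attempt():
        if timer.test_and_start():
            with wins_lock:
                wins.append(threading.current_thread().name)

    threads = [threading.Thread(target=attempt) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(wins)
```

Returning a boolean instead of raising lets callers that lose the race carry on without exception handling, which is the behaviour suggested above.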
In general, this is not a viable solution. You can reproduce this race condition by using a debugger: set breakpoints at some locations in the code; when it hits one of the breakpoints, freeze that thread and run the code until it hits another breakpoint; then freeze this thread and unfreeze the first one. You can interleave thread execution in any way using this technique.
The problem is that the more threads and code you have, the more ways there are to interleave their side effects. In fact, the number grows exponentially. There is no viable way to test this in general. It is possible only in some simple cases.
The solutions to this problem are well known. Write code that is aware of its side effects, control side effects with synchronisation primitives like locks, semaphores or queues, or use immutable data if possible.
Maybe a more practical way is to use runtime checks to force the correct call order. For example (pseudocode):
class RacyObject:
    def __init__(self):
        self.__cnt = 0
        ...

    def isReadyAndLocked(self):
        acquire_object_lock
            if self.__cnt % 2 != 0:
                # another thread is ready to start the Job
                return False
            if self.__is_ready:
                self.__cnt += 1
                return True
            # Job is in progress or isn't ready yet
            return False
        release_object_lock

    def doJobAndRelease(self):
        acquire_object_lock
            if self.__cnt % 2 != 1:
                raise RaceConditionDetected("Incorrect order")
            self.__cnt += 1
            do_job()
        release_object_lock
This code will throw an exception if you don't check isReadyAndLocked before calling doJobAndRelease. This can be tested easily using only one thread.
obj = RacyObject()
...
# correct usage
if obj.isReadyAndLocked():
    obj.doJobAndRelease()
Problem
I've segmented a long-running task into logical subtasks, so I can report the results of each subtask as it completes. However, I'm trying to report the results of a task that will effectively never complete (instead yielding values as it goes), and am struggling to do so with my existing solution.
Background
I'm building a web interface to some Python programs I've written. Users can submit jobs through web forms, then check back to see the job's progress.
Let's say I have two functions, each accessed via separate forms:
med_func: Takes ~1 minute to execute, results are passed off to render(), which produces additional data.
long_func: Returns a generator. Each yield takes on the order of 30 minutes, and should be reported to the user. There are so many yields, we can consider this iterator as infinite (terminating only when revoked).
Code, current implementation
With med_func, I report results as follows:
On form submission, I save an AsyncResult to a Django session:
task_result = med_func.apply_async([form], link=render.s())
request.session["task_result"] = task_result
The Django view for the results page accesses this AsyncResult. When a task has completed, results are saved into an object that is passed as context to a Django template.
def results(request):
    """ Serve (possibly incomplete) results of a session's latest run. """
    session = request.session
    try:  # Load most recent task
        task_result = session["task_result"]
    except KeyError:  # Already cleared, or doesn't exist
        if "results" not in session:
            session["status"] = "No job submitted"
    else:  # Extract data from Asynchronous Tasks
        session["status"] = task_result.status
        if task_result.ready():
            session["results"] = task_result.get()
            render_task = task_result.children[0]
            # Decorate with rendering results
            session["render_status"] = render_task.status
            if render_task.ready():
                session["results"].render_output = render_task.get()
                del(request.session["task_result"])  # Don't need any more
    return render_to_response('results.html', request.session)
This solution only works when the function actually terminates. I can't chain together logical subtasks of long_func, because the number of yields is unknown (each iteration of long_func's loop may not produce a result).
Question
Is there any sensible way to access yielded objects from an extremely long-running Celery task, so that they can be displayed before the generator is exhausted?
In order for Celery to know what the current state of the task is, it sets some metadata in whatever result backend you have. You can piggy-back on that to store other kinds of metadata.
from celery.result import AsyncResult

def yielder():
    for i in range(2**100):
        yield i

@task
def report_progress():
    for progress in yielder():
        # set current progress on the task
        report_progress.backend.mark_as_started(
            report_progress.request.id,
            progress=progress)

def view_function(request):
    task_id = request.session['task_id']
    task = AsyncResult(task_id)
    progress = task.info['progress']
    # do something with your current progress
I wouldn't throw a ton of data in there, but it works well for tracking the progress of a long-running task.
Paul's answer is great. As an alternative to using mark_as_started you can use Task's update_state method. They ultimately do the same thing, but the name "update_state" is a little more appropriate for what you're trying to do. You can optionally define a custom state that indicates your task is in progress (I've named my custom state 'PROGRESS'):
from celery.result import AsyncResult

def yielder():
    for i in range(2**100):
        yield i

@task
def report_progress():
    for progress in yielder():
        # set current progress on the task
        report_progress.update_state(state='PROGRESS', meta={'progress': progress})

def view_function(request):
    task_id = request.session['task_id']
    task = AsyncResult(task_id)
    progress = task.info['progress']
    # do something with your current progress
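One caveat when reading the metadata back: task.info only holds the meta dict once the task has reported progress. Before that it is None, and after a failure it is the raised exception, so indexing it directly can fail. A minimal sketch of a defensive read, assuming the custom 'PROGRESS' state used above; extract_progress is a hypothetical helper, not a Celery function, and it takes the state and info values as plain arguments.

```python
def extract_progress(state, info, default=None):
    """Return the reported progress, or `default` if none is available.

    `state` and `info` are meant to be the values of AsyncResult.state
    and AsyncResult.info; the 'PROGRESS' state name is the custom one
    chosen in the answer above.
    """
    if state == 'PROGRESS' and isinstance(info, dict):
        return info.get('progress', default)
    return default
```

In the view this would be called as extract_progress(task.state, task.info).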
Celery part:
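def long_func(*args, **kwargs):
    i = 0
    while True:
        yield i
        do_something_here(*args, **kwargs)
        i += 1

@task()
def test_yield_task(task_id=None, **kwargs):
    # Fall back to the task's own id, so the key matches the id
    # that the web client stores in the session
    task_id = task_id or test_yield_task.request.id
    the_progress = 0
    for the_progress in long_func(**kwargs):
        cache.set('celery-task-%s' % task_id, the_progress)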
Webclient side, starting task:
r = test_yield_task.apply_async()
request.session['task_id'] = r.task_id
Testing last yielded value:
v = cache.get('celery-task-%s' % request.session.get('task_id'))
if v is not None:  # the first yielded value is 0, which is falsy
    do_something()
If you would rather not use a cache, or cannot, you can use a database, a file, or any other store that both the Celery worker and the web server can access. A cache is the simplest solution, but the workers and the server have to share the same cache.
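The pattern can be tried without Celery or a real cache. The sketch below is a stand-in under stated assumptions: DictCache is a hypothetical in-process replacement for the shared cache, a plain thread plays the worker, and max_steps exists only so the demo terminates.

```python
import itertools
import threading


class DictCache:
    """Hypothetical in-process stand-in for a shared cache (get/set only)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def set(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)


def long_func():
    """Endless generator standing in for the real long-running work."""
    for i in itertools.count():
        yield i


def run_task(cache, task_id, max_steps):
    """Worker side: publish each yielded value under a per-task key."""
    for step, value in enumerate(long_func()):
        cache.set('celery-task-%s' % task_id, value)
        if step + 1 >= max_steps:
            break


def latest_value(cache, task_id):
    """Web side: last published value, or None if nothing was reported."""
    return cache.get('celery-task-%s' % task_id)
```

Only the most recent value is kept per task. If every yielded value has to be shown to the user, the worker would need to append to a list or write rows to a database instead.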
A couple of options to consider:
1 -- task groups. If you can enumerate all the subtasks at invocation time, you can apply the group as a whole. That returns a TaskSetResult object (GroupResult in newer Celery releases) that you can use to monitor the results of the group as a whole, or of individual tasks in the group. Query it whenever you need to check status.
2 -- callbacks. If you can't enumerate all the subtasks (or even if you can!), you can define a web hook / callback as the last step in the task, called when the rest of the task completes. The hook would hit a URI in your app that ingests the result and makes it available via the DB or an app-internal API.
Some combination of these could solve your challenge.
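As a rough analogy only, the two options can be combined with the standard library's concurrent.futures rather than Celery: a known list of subtasks is submitted together, and a completion callback records each result. The subtask function and the store dict (standing in for the DB the hook would write to) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def subtask(n):
    """Hypothetical subtask: square its input."""
    return n * n


def run_group(inputs):
    """Run all subtasks and record each result from a completion callback."""
    store = {}  # stands in for the DB the callback would write to

    def make_callback(key):
        def on_done(future):
            store[key] = future.result()
        return on_done

    # Leaving the `with` block waits for all workers, and therefore
    # for all callbacks, to finish.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for n in inputs:
            future = pool.submit(subtask, n)
            future.add_done_callback(make_callback(n))
    return store
```

The web side would then read from the store at any time, seeing whichever results have arrived so far.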
See also this great PyCon presentation from one of the Instagram engineers.
http://blogs.vmware.com/vfabric/2013/04/how-instagram-feeds-work-celery-and-rabbitmq.html
At video mark 16:00, he discusses how they structure long lists of sub-tasks.