I want to send a "pause" signal to a long running task in Celery and I'm trying to figure out the best way to do it. In the view I can pull an instance of the object from the database and tell that to save, but it's not the same as the instance of the object in Celery. The object doesn't check back to see if it's paused.
Polling the database from within the long-running class and task feels weird and impractical so I'm looking at sending my instance a message. I looked at using pubsub but I would prefer to use Django signals as it's already a Django project. I might be approaching this the wrong way.
Here's an example that does not work:
Models.py
class LongRunningClass(models.Model):
    is_paused = models.BooleanField(default=False)
    processed_files = models.IntegerField(default=0)
    total_files = models.IntegerField(default=100)

    def long_task(self):
        remaining_files = self.total_files - self.processed_files
        for i in xrange(remaining_files):
            if not self.is_paused:
                self.processed_files += 1
            time.sleep(1)
        # Task complete, let's save.
        self.save()
Views.py
def pause_task(self, pk):
    lrc = LongRunningClass.objects.get(pk=pk)
    lrc.is_paused = True
    lrc.save()
    return HttpResponse(json.dumps({'is_paused': lrc.is_paused}))

def resume_task(self, pk):
    lrc = LongRunningClass.objects.get(pk=pk)
    lrc.is_paused = False
    lrc.save()
    # Pretend this is a Celery task
    lrc.long_task()
So if I modify models.py to use signals, I can add these lines but it still does not quite work:
pause_signal = django.dispatch.Signal(providing_args=['is_paused'])

@django.dispatch.receiver(pause_signal)
def pause_callback(sender, **kwargs):
    if 'is_paused' in kwargs:
        sender.is_paused = kwargs['is_paused']
        sender.save()
That doesn't affect the instantiated class that's already running either. How can I tell the instance of my model running within the task to pause?
A Celery task runs in a separate process. Django signals are a standard "observer" pattern that works within a single process, so there is no way to organize communication between processes using signals. You need to reload the object from the database to find out whether its properties have changed.
class LongRunningClass(models.Model):
    is_paused = models.BooleanField(default=False)
    processed_files = models.IntegerField(default=0)
    total_files = models.IntegerField(default=100)

    def get_is_paused(self):
        # Re-read the flag from the database; self.is_paused may be stale.
        db_obj = LongRunningClass.objects.get(pk=self.pk)
        return db_obj.is_paused

    def long_task(self):
        remaining_files = self.total_files - self.processed_files
        for i in xrange(remaining_files):
            # Call the method; the bare attribute would always be truthy.
            if not self.get_is_paused():
                self.processed_files += 1
            time.sleep(1)
        # Task complete, let's save.
        self.save()
This is not a great design: it would be better to move long_task elsewhere and operate on a freshly loaded LongRunningClass instance, but it will do the job. You could add a cache such as memcached here if you don't want to hit the database that often.
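The re-read-on-every-iteration idea can be tried outside Django. What follows is a minimal sketch, assuming a plain sqlite3 table stands in for the model; the table name `job` and the helpers `make_db`, `is_paused`, `run_step` and `processed` are hypothetical names for illustration, not Django or Celery APIs.

```python
import sqlite3

def make_db():
    """Create an in-memory table standing in for LongRunningClass (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job (id INTEGER PRIMARY KEY, is_paused INTEGER, "
        "processed_files INTEGER, total_files INTEGER)"
    )
    conn.execute("INSERT INTO job VALUES (1, 0, 0, 5)")
    conn.commit()
    return conn

def is_paused(conn, pk):
    """Read the flag fresh from storage instead of trusting a cached copy."""
    row = conn.execute("SELECT is_paused FROM job WHERE id = ?", (pk,)).fetchone()
    return bool(row[0])

def run_step(conn, pk):
    """Process one file unless the job is paused; return False when paused."""
    if is_paused(conn, pk):
        return False
    # Update only the counter so the pause flag is never overwritten.
    conn.execute(
        "UPDATE job SET processed_files = processed_files + 1 "
        "WHERE id = ? AND processed_files < total_files",
        (pk,),
    )
    conn.commit()
    return True

def processed(conn, pk):
    """Return the current value of the processed_files counter."""
    row = conn.execute("SELECT processed_files FROM job WHERE id = ?", (pk,)).fetchone()
    return row[0]

conn = make_db()
run_step(conn, 1)                                          # one file processed
conn.execute("UPDATE job SET is_paused = 1 WHERE id = 1")  # the "view" pauses the job
conn.commit()
run_step(conn, 1)                                          # skipped: the flag is re-read
```

The sketch writes only the counter column. In Django, a plain save() on the task's in-memory instance writes every field, including its stale is_paused value, so something like save(update_fields=['processed_files']) is worth considering there.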
BTW: I'm not 100% sure, but you may have another design issue here. Really long-running tasks built around this kind of loop are rarely needed. Consider removing the loop from your program (you have queues!). Take a look:
# Pseudocode: schedule this with celerybeat to add XX files for processing every XX minutes.
@celery.task
def scheduled_task(lr_pk):
    lr = LongRunningClass.objects.get(pk=lr_pk)
    if not lr.is_paused:
        remaining_files = lr.total_files - lr.processed_files
        # Never queue more files than are left.
        for i in xrange(min(lr.files_per_iteration, remaining_files)):
            process_file.delay(lr.pk, i)

# Processes a single file; rate-limited and routed to its own queue.
@celery.task(rate_limit='1/m', queue='process_file')
def process_file(lr_pk, i):
    # do something with i
    lr = LongRunningClass.objects.get(pk=lr_pk)
    lr.processed_files += 1
    lr.save()
To implement this solution you have to set up celerybeat and create a separate queue for these tasks. In return you get a lot of control over your program: rate limits, parallel execution, and code that no longer hangs in sleep(1). If you create another model for each file, you can also track which files have been processed, handle errors, and so on.
Take a look at celery.contrib.abortable -- this is an alternate base class for Celery tasks that implements a signal between caller and task to handle terminations, that could also be used to implement a "pause".
When caller calls abort(), a status is marked in the backend. Task calls self.is_aborted() to see if that special status has been set; and then implements whatever action is appropriate (terminate, pause, ignore etc.). The action is under the task's control; this is not automated task termination.
This could be used as-is if it is sensible for the specific task to interpret the ABORT signal as a request for a pause. Or you could extend the class to provide more signals, not just the existing ABORT.
Related
I am learning about coroutines.
class Scheduler:
    def __init__(self):
        self.ready = Queue()  # a queue of tasks that are ready to run.
        self.taskmap = {}  # dictionary that keeps track of all active tasks (each task has a unique integer task ID)

    def new(self, target):  # introduce a new task to the scheduler
        newtask = Task(target)
        self.taskmap[newtask.tid] = newtask
        self.schedule(newtask)
        return newtask.tid

    def schedule(self, task):
        self.ready.put(task)

    def mainloop(self):
        while self.taskmap:  # does not remove element from taskmap
            task = self.ready.get()
            result = task.run()
            self.schedule(task)
When reading the line task = self.ready.get() in mainloop, I suddenly realized that the nature of a data structure is about control: it controls the next step. The nature of an algorithm is also about control: it controls all the steps.
Does this understanding make sense?
The Queue object defines control of what step is next, yes. It's FIFO, as described here.
Here, it looks like you're just trying to keep track of tasks, whether there are any remaining, which are executing, and so on. This is "controlling all the steps." Yes.
What's unclear is the purpose. The data structure and algorithm should be suited to your purpose. asyncio can help you implement parallelism and event-driven designs, for example. Sometimes the goal is to quickly and efficiently render data from a source into a data structure. What you're getting at is more meaningful (to me, at least) in the context of an end goal.
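To make the "queue decides the next step, loop decides all the steps" point concrete, here is a minimal runnable sketch in Python 3. The Task class is not shown in the question, so the one below is a hypothetical wrapper around a generator, and unlike the question's mainloop this version removes finished tasks from taskmap so the loop can end.

```python
from queue import Queue

class Task:
    """Hypothetical wrapper around a generator-based coroutine."""
    _next_id = 0

    def __init__(self, target):
        Task._next_id += 1
        self.tid = Task._next_id
        self.target = target

    def run(self):
        # Advance the coroutine to its next yield.
        return next(self.target)

class Scheduler:
    def __init__(self):
        self.ready = Queue()   # data structure: decides which task runs next
        self.taskmap = {}      # all tasks that have not finished yet

    def new(self, target):
        task = Task(target)
        self.taskmap[task.tid] = task
        self.ready.put(task)
        return task.tid

    def mainloop(self):
        # Algorithm: keeps stepping tasks until none are left.
        trace = []
        while self.taskmap:
            task = self.ready.get()          # FIFO order
            try:
                trace.append(task.run())
            except StopIteration:
                del self.taskmap[task.tid]   # finished tasks leave the map
                continue
            self.ready.put(task)             # unfinished tasks go to the back
        return trace

def worker(name, steps):
    for i in range(steps):
        yield "%s%d" % (name, i)

sched = Scheduler()
sched.new(worker("a", 2))
sched.new(worker("b", 3))
trace = sched.mainloop()
# trace == ['a0', 'b0', 'a1', 'b1', 'b2']
```

Swapping the Queue for a different structure (a priority queue, say) would change the order of the trace without touching mainloop, which is one way to see the division of labour described above.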
I'm fairly new to Python and Django, so please let me know if there is a better way to do this. What I am trying to do is have each Device (which inherits from models.Model) kick off a long running background thread which constantly checks the health of that Device. However when I run my code, it does not seem to be executing like a daemon, as the server is sluggish and continually times out. This background thread will (in most cases) run the life of the program.
Below is a simplified version of my code:
class Device(models.Model):
    active = models.BooleanField(default=True)
    is_healthy = models.BooleanField(default=True)
    last_heartbeat = models.DateTimeField(null=True, blank=True)

    def __init__(self, *args, **kwargs):
        super(Device, self).__init__(*args, **kwargs)
        # start daemon thread that polls device's health
        thread = Thread(name='device_health_checker', target=self.health_checker())
        thread.daemon = True
        thread.start()

    def health_checker(self):
        while self.active:
            if self.last_heartbeat is not None:
                time_since_last_heartbeat = timezone.now() - self.last_heartbeat
                self.is_healthy = False if time_since_last_heartbeat.total_seconds() >= 60 else True
                self.save()
            time.sleep(10)
This seems like a very simple use of threading, but every time I search for solutions, the suggested approach is to use celery which seems like overkill to me. Is there a way to get this to work without the need for something like celery?
As @knbk mentioned in a comment, "Every time you query for devices, a new thread will be created for each device that is returned". This is something I originally overlooked.
However, I was able to solve my issue using a single background thread that is started from the Django application config. This is a much simpler approach than adding a third-party library (like Celery).
class DeviceApp(AppConfig):
    name = 'device_app'

    def ready(self):
        # start daemon thread that polls device's health
        thread = Thread(name='device_health_checker', target=self.device_health_check)
        thread.daemon = True
        thread.start()

    def device_health_check(self):
        while True:
            for device in Device.objects.get_queryset():
                if device.last_heartbeat is not None:
                    time_since_last_heartbeat = timezone.now() - device.last_heartbeat
                    device.is_healthy = False if time_since_last_heartbeat.total_seconds() >= 60 else True
                    device.save()
            time.sleep(10)
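One detail in the question's code is worth spelling out: target=self.health_checker() calls the method immediately, in the calling thread, and hands its return value to Thread. Because that method loops while self.active, the call never returns and __init__ blocks. The answer's code passes the method itself (no parentheses). A small stdlib-only sketch of the difference, with a hypothetical one-shot health_checker:

```python
import threading

calls = []

def health_checker():
    # Record which thread actually executes the body.
    calls.append(threading.current_thread().name)

# Wrong: the parentheses call the function right here, in the current thread,
# and pass its return value (None) as the target.
t_wrong = threading.Thread(name="checker", target=health_checker())
t_wrong.start()
t_wrong.join()

# Right: pass the function object; the new thread calls it.
t_right = threading.Thread(name="checker", target=health_checker)
t_right.daemon = True
t_right.start()
t_right.join()

# calls[0] is the name of the calling thread, calls[1] is "checker".
```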
When you start out in your development environment, the number of devices is likely quite low, so the number of threads is perhaps in the double digits as you test things out.
But this thread issue will rapidly become untenable as you increase the number of devices even if you got the code to work. So using celery with a celery beat is the better way to do it.
Also consider that you are new to Django and Python, trying to master threads on top of that would add even more complexity. Using celery for this would be a lot simpler and neater in the end.
I'm quite new to python and Django and I'm struggling to understand what's the best way to tackle my problem, which is:
I have a view which has to activate a process/task/thread and return a success.
The process/task/thread operates a device and it will update its status based on the device inputs.
I then have another view which I will poll with ajax, and this view should be able to query that background process/task/thread for its status and return it to the caller.
I've read about quite a few different options like multiprocessing, gevent, celery, and sessions, but I still can't get my head around it.
I tried the session, but obviously I don't have access to the request object from within the background task.
I didn't try gevent or celery because I thought there would be an easier solution without any additional frameworks (I don't really want to install RabbitMQ etc.).
Tried the multiprocessing and that's the code:
def test_process(request):
    manager = Manager()
    d = manager.dict()
    p = Process(target=test_function, args=(d, ))
    p.daemon = True
    p.start()
    return HttpResponse(json.dumps('Ok'), content_type="application/json")

def test_function(d):
    d['test'] = 'alex'

def test_manager(request):
    manager = Manager()
    data = manager.dict().get('test')
    return HttpResponse(json.dumps(data), content_type="application/json")
After I wrote it, I realized that the dictionary is probably only shared between the background process and the process of the request that executed test_process, so test_manager gets an empty dictionary.
Dunno where to go from here
Any help ?
Cheers
To share data between a child and a parent process using the multiprocessing interface you may use one of the methods proposed in https://docs.python.org/2/library/multiprocessing.html, for instance a Queue or a Pipe.
Here's what you should do to use a queue to talk to a child process from within a Django web application (I suppose that the background/child process is controlling a single device for all users of the web application, so everybody will get the same results -- this could also be per session):
Create a global queue object inside your views.py, like this: global_q = Queue().
Create a view for initializing the process, and pass the global Queue to the process function:
def init_process(request):
    p = Process(target=the_process, args=(global_q, ))
    p.daemon = True
    p.start()
    return HttpResponse(json.dumps('Ok'), content_type="application/json")
Create a different view that will read from the global Queue:
def read_process_status(request):
    data = global_q.get()
    return HttpResponse(json.dumps(data), content_type="application/json")
Your process function handles the device and writes events to the queue parameter when needed:
def the_process(local_q):
    # do some things
    local_q.put([6])
    # do some other things
    local_q.put([34])
For the above to work without problems you must check whether the queue is empty, make the read non-blocking, and so on, but you get the idea.
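The non-blocking read mentioned above could look roughly like this. It is a sketch only: read_latest is a hypothetical helper, and the demo uses queue.Queue so that it runs in a single process. multiprocessing.Queue offers the same get_nowait() method and raises the same queue.Empty exception.

```python
import queue

def read_latest(q, default=None):
    """Drain the queue without blocking and return the newest item, if any."""
    latest = default
    while True:
        try:
            latest = q.get_nowait()
        except queue.Empty:
            # Nothing (more) to read: return what we have instead of waiting.
            return latest

q = queue.Queue()
empty_result = read_latest(q, default="no data yet")  # returns immediately
q.put([6])
q.put([34])
latest = read_latest(q)  # older events are discarded, the newest one wins
```

A status view built this way would return the default when the device has reported nothing new, rather than hanging the request in a blocking get().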
Problem
I've segmented a long-running task into logical subtasks, so I can report the results of each subtask as it completes. However, I'm trying to report the results of a task that will effectively never complete (instead yielding values as it goes), and am struggling to do so with my existing solution.
Background
I'm building a web interface to some Python programs I've written. Users can submit jobs through web forms, then check back to see the job's progress.
Let's say I have two functions, each accessed via separate forms:
med_func: Takes ~1 minute to execute, results are passed off to render(), which produces additional data.
long_func: Returns a generator. Each yield takes on the order of 30 minutes, and should be reported to the user. There are so many yields, we can consider this iterator as infinite (terminating only when revoked).
Code, current implementation
With med_func, I report results as follows:
On form submission, I save an AsyncResult to a Django session:
task_result = med_func.apply_async([form], link=render.s())
request.session["task_result"] = task_result
The Django view for the results page accesses this AsyncResult. When a task has completed, results are saved into an object that is passed as context to a Django template.
def results(request):
    """ Serve (possibly incomplete) results of a session's latest run. """
    session = request.session
    try:  # Load most recent task
        task_result = session["task_result"]
    except KeyError:  # Already cleared, or doesn't exist
        if "results" not in session:
            session["status"] = "No job submitted"
    else:  # Extract data from Asynchronous Tasks
        session["status"] = task_result.status
        if task_result.ready():
            session["results"] = task_result.get()
            render_task = task_result.children[0]
            # Decorate with rendering results
            session["render_status"] = render_task.status
            if render_task.ready():
                session["results"].render_output = render_task.get()
                del(request.session["task_result"])  # Don't need any more
    return render_to_response('results.html', request.session)
This solution only works when the function actually terminates. I can't chain together logical subtasks of long_func, because there are an unknown number of yields (each iteration of long_func's loop may not produce a result).
Question
Is there any sensible way to access yielded objects from an extremely long-running Celery task, so that they can be displayed before the generator is exhausted?
In order for Celery to know what the current state of the task is, it sets some metadata in whatever result backend you have. You can piggy-back on that to store other kinds of metadata.
def yielder():
    for i in range(2**100):
        yield i

@task
def report_progress():
    for progress in yielder():
        # set current progress on the task
        report_progress.backend.mark_as_started(
            report_progress.request.id,
            progress=progress)

def view_function(request):
    task_id = request.session['task_id']
    task = AsyncResult(task_id)
    progress = task.info['progress']
    # do something with your current progress
I wouldn't throw a ton of data in there, but it works well for tracking the progress of a long-running task.
Paul's answer is great. As an alternative to using mark_as_started you can use Task's update_state method. They ultimately do the same thing, but the name "update_state" is a little more appropriate for what you're trying to do. You can optionally define a custom state that indicates your task is in progress (I've named my custom state 'PROGRESS'):
def yielder():
    for i in range(2**100):
        yield i

@task
def report_progress():
    for progress in yielder():
        # set current progress on the task
        report_progress.update_state(state='PROGRESS', meta={'progress': progress})

def view_function(request):
    task_id = request.session['task_id']
    task = AsyncResult(task_id)
    progress = task.info['progress']
    # do something with your current progress
Celery part:
def long_func(*args, **kwargs):
    i = 0
    while True:
        yield i
        do_something_here(*args, **kwargs)
        i += 1

@task()
def test_yield_task(task_id=None, **kwargs):
    # Fall back to the Celery-assigned ID so the key matches the one the client stores.
    task_id = task_id or test_yield_task.request.id
    for the_progress in long_func(**kwargs):
        cache.set('celery-task-%s' % task_id, the_progress)
Webclient side, starting task:
r = test_yield_task.apply_async()
request.session['task_id'] = r.task_id
Testing last yielded value:
v = cache.get('celery-task-%s' % request.session.get('task_id'))
if v is not None:  # 0 is a valid progress value
    do_something()
If you don't want to use the cache, or can't, you can use the database, a file, or any other place that both the Celery worker and the web server can access. The cache is the simplest solution, but the workers and the server have to use the same cache.
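The pattern shared by these answers (the task publishes its latest yielded value under a key derived from the task ID, and the view reads that key) can be sketched without Celery. In the sketch below, ProgressStore is a hypothetical stand-in for whatever shared store you pick, a thread stands in for the worker, and the task ID "abc123" is made up.

```python
import itertools
import threading

class ProgressStore:
    """Hypothetical stand-in for a shared cache with get/set."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def set(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)

def long_func():
    """An effectively infinite generator, as in the question."""
    return itertools.count()

def run_task(store, task_id, results):
    # Publish each yielded value; only the most recent one is kept.
    for value in results:
        store.set("celery-task-%s" % task_id, value)

store = ProgressStore()
worker = threading.Thread(
    target=run_task,
    # islice cuts the infinite generator short so the demo terminates.
    args=(store, "abc123", itertools.islice(long_func(), 5)),
)
worker.start()
worker.join()

latest = store.get("celery-task-abc123")    # last value the task published
missing = store.get("celery-task-unknown")  # None: nothing published yet
```

Because a progress value of 0 is falsy, the reader should compare against None rather than test the value's truthiness.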
A couple options to consider:
1 -- task groups. If you can enumerate all the sub tasks from the time of invocation, you can apply the group as a whole -- that returns a TaskSetResult object you can use to monitor the results of the group as a whole, or of individual tasks in the group -- query this as-needed when you need to check status.
2 -- callbacks. If you can't enumerate all sub tasks (or even if you can!) you can define a web hook / callback that's the last step in the task -- called when the rest of the task completes. The hook would be against a URI in your app that ingests the result and makes it available via DB or app-internal API.
Some combination of these could solve your challenge.
See also this great PyCon preso from one of the Instagram engineers.
http://blogs.vmware.com/vfabric/2013/04/how-instagram-feeds-work-celery-and-rabbitmq.html
At video mark 16:00, he discusses how they structure long lists of sub-tasks.
I have an application that has a GUI thread and many different worker threads. In this application, I have a functions.py module, which contains a lot of different "utility" functions that are used all over the application.
Yesterday the application was released and some users (a minority, but still) have reported problems with the application crashing. I looked over my code and noticed a possible design flaw, and would like to check with the lovely people of SO and see if I am right and if this is indeed a flaw.
Suppose I have this defined in my functions.py module:
class Functions:
    solveComputationSignal = Signal(str)
    updateStatusSignal = Signal(int, str)
    text = None

    @classmethod
    def setResultText(self, text):
        self.text = text

    @classmethod
    def solveComputation(cls, platform, computation, param=None):
        #Not the entirety of the method is listed here
        result = urllib.urlopen(COMPUTATION_URL).read()
        if param is None:
            cls.solveComputationSignal.emit(result)
        else:
            cls.solveAlternateComputation(platform, computation)
        while not self.text:
            time.sleep(3)
        return self.text if self.text else False

    @classmethod
    def updateCurrentStatus(cls, platform, statusText):
        cls.updateStatusSignal.emit(platform, statusText)
I think these methods in themselves are fine. The two signals defined here are connected to in the GUI thread. The first signal pops-up a dialog in which the computation is presented. The GUI thread calls the setResultText() method and sets the resulting string as entered by the user (if anyone knows of a better way to wait until the user has inputted the text other than sleeping and waiting for self.text to become True, please let me know). The solveAlternateComputation is another method in the same class that solves the computation automatically, however, it too calls the setResultText() method that sets the resulting text.
The second signal updates the statusBar text of the main GUI as well.
What's worse is that I think the above design, while perhaps flawed, is not the problem.
The problem lies, I believe, in the way I call these methods, which is from the worker threads (note that I have multiple similar workers, all of which are different "platforms").
Assume I have this (and I do):
class WorkerPlatform1(QThread):
    #Init and other methods are here
    def run(self):
        #Thread does its job here, but then when it needs to present the
        #computation, instead of emitting a signal, this is what I do
        self.f = functions.Functions
        result = self.f.solveComputation(platform, computation)
        if result:
            #Go on with the task
            pass
        else:
            self.f.updateCurrentStatus(platform, "Error grabbing computation!")
In this case I think that my flaw is that the thread itself is not emitting any signals, but rather calling callables residing outside of that thread directly. Am I right in thinking that this could cause my application to crash? Although the faulty module is reported as QtGui4.dll
One more thing: both of these methods in the Functions class are accessed by many threads almost simultaneously. Is this even advisable - have methods residing outside of a thread be accessed by many threads all at the same time? Can it so happen that I "confuse" my program? The reason I am asking is because people who say that the application is not crashing report that, very often, the solveComputation() returns the incorrect text - not all the time, but very often. Since that COMPUTATION_URL's server can take some time to respond (even 10+ seconds), is it possible that, once a thread calls that method, while the urllib library is still waiting for server response, in that time another thread can call it, causing it to use a different COMPUTATION_URL, which will result in it returning an incorrect value on some cases?
Finally, I am thinking of solutions: for my first (crashing) problem, do you think the proper solution would be to directly emit a Signal from the thread itself, and then connect it in the GUI thread? Is that the right way to go about it?
Secondly, for the solveComputation returning incorrect values, would I solve it by moving that method (and accompanying methods) to every Worker class? then I could call them directly and hopefully have the correct response - or, dozens of different responses (since I have that many threads) - for every thread?
Thank you all and I apologize for the wall of text.
EDIT: I would like to add that when running in console with some users, this error appears QObject: Cannot create children for a parent that is in a different thread.
(Parent is QLabel(0x4795500), parent's thread is QThread(0x2d3fd90), current thread is WordpressCreator(0x49f0548)
Your design is flawed if you really are using your Functions class like this with classmethods storing results on class attributes, being shared amongst multiple workers. It should be using all instance methods, and each thread should be using an instance of this class:
class Functions(QObject):
    solveComputationSignal = pyqtSignal(str)
    updateStatusSignal = pyqtSignal(int, str)

    def __init__(self, parent=None):
        super(Functions, self).__init__(parent)
        self.text = ""

    def setResultText(self, text):
        self.text = text

    def solveComputation(self, platform, computation, param=None):
        result = urllib.urlopen(COMPUTATION_URL).read()
        if param is None:
            self.solveComputationSignal.emit(result)
        else:
            self.solveAlternateComputation(platform, computation)
        while not self.text:
            time.sleep(3)
        return self.text if self.text else False

    def updateCurrentStatus(self, platform, statusText):
        self.updateStatusSignal.emit(platform, statusText)

# worker_A
def run(self):
    ...
    f = Functions()

# worker_B
def run(self):
    ...
    f = Functions()
Also, for doing your urlopen, instead of doing sleeps to check for when it is ready, you can make use of the QNetworkAccessManager to make your requests and use signals to be notified when results are ready.
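The "incorrect text" symptom follows directly from storing the result on the class. Here is a small Qt-free sketch of the difference, using plain threads; SharedFunctions, PerWorkerFunctions and run_workers are hypothetical names used only for this illustration.

```python
import threading

class SharedFunctions:
    text = None  # class attribute: one slot shared by every caller

    @classmethod
    def set_result_text(cls, text):
        cls.text = text

class PerWorkerFunctions:
    def __init__(self):
        self.text = None  # instance attribute: one slot per worker

    def set_result_text(self, text):
        self.text = text

def run_workers(make_helper, names):
    """Run one thread per name; each stores its own name on its helper."""
    helpers = {}

    def work(name):
        helper = make_helper()
        helper.set_result_text(name)
        helpers[name] = helper

    threads = [threading.Thread(target=work, args=(n,)) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return helpers

names = ["platform1", "platform2", "platform3"]
shared = run_workers(lambda: SharedFunctions, names)   # every worker gets the class
isolated = run_workers(PerWorkerFunctions, names)      # every worker gets an instance

# With the class attribute, all workers see whichever value was written last.
shared_values = {h.text for h in shared.values()}
# With instances, each worker keeps the value it wrote.
isolated_values = {name: h.text for name, h in isolated.items()}
```

In the shared case every worker reads the same single value, so two of the three get a result that belongs to another platform; with one instance per worker that cannot happen.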