Is there any Celery functionality or preferred way of executing periodic background tasks locally when using a single worker? Sort of like a background thread, but scheduled and handled by Celery?
celery.beat doesn't seem suitable, as it appears to be tied to a single consumer (so it could run on any server). That's the type of scheduling I'm after, except that the task should always run locally on each server running this worker (the task does some cleanup and gathers stats related to the main task the worker handles).
I may be going about this the wrong way, but I'm confined to implementing this within a celery worker daemon.
You could register a custom remote control command and invoke it with the broadcast function from a cron job to run cleanup or whatever else might be required.
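For example (a rough sketch using the Celery 4+ control_command decorator; the command name local_cleanup and its body are illustrative only, so verify against your Celery version), you would register the command in a module the worker imports and trigger it from cron via broadcast:

# worker side: register a custom remote control command
# (this module must be imported by the worker process)
from celery.worker.control import control_command

@control_command()
def local_cleanup(state):
    # runs inside every worker that receives the broadcast,
    # so the cleanup/stats work stays local to that machine
    # ... do cleanup / gather stats here ...
    return {'ok': 'cleanup done'}

# cron side: broadcast the command to all (or selected) workers
from proj.celery import app  # your Celery app instance (name assumed)

app.control.broadcast('local_cleanup', reply=True)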
One possible method I thought of, though not ideal, is to patch the celery.worker.heartbeat Heart() class.
Since we already use heartbeats, the class allows for a simple modification: either add another self.timer.call_repeatedly() entry to its start() method, or add a self.eventer.on_enabled.add() entry in __init__ that references a new method, which in turn uses self.timer.call_repeatedly() to perform the periodic task.
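A rough monkey-patch sketch of that idea (hedged: Heart's internals differ between Celery versions, and run_local_cleanup is a hypothetical function you would supply):

from celery.worker import heartbeat

def run_local_cleanup():
    # hypothetical per-worker maintenance: cleanup, local stats, ...
    pass

_original_start = heartbeat.Heart.start

def _patched_start(self):
    _original_start(self)
    # reuse the heartbeat's timer to schedule an extra periodic call;
    # assumes self.timer exposes call_repeatedly(secs, fun) as described above
    self._cleanup_tref = self.timer.call_repeatedly(300.0, run_local_cleanup)

heartbeat.Heart.start = _patched_start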
Related
I have to set up a worker which handles some data after a certain event happens. I know I can start the worker with python manage.py runworker my_worker, but what I would need is to start the worker in the same process as the main Django app on a separate thread.
Why do I need it in a separate thread and not in a separate process? Because the worker would perform a pretty light-weight job which would not overload the server's resources, and, moreover, the effort of making the set up for a new process in the production is not worth the gain in performance. In other words, I would prefer to keep it in the Django's process if possible.
Why not perform the job synchronously? Because it is a separate logic that needs the possibility to be extended, and it is out of the main HTTP request-reply scope. It is a post-processing task which doesn't interfere with the main logic. I need to decouple this task from an infrastructural point-of-view, not only logical (e.g. with plain signals).
Is there any possibility provided by Django Channels to run a worker in such a way?
Would there be any downsides to start the worker manually on a separate thread?
Right now I have the setup for a message broker consumer thread (without using Channels), so I have the entry point for starting a new worker thread. But as I've seen from Channels' runworker command, it loads the whole application, so a naïve worker.run() call doesn't seem like the proper way to do it (I might be wrong about this).
I found an answer to my question.
The answer is no, you can't just start a worker within the same process. This is because the consumer needs to run inside an event loop thread and it is not good at all to have more than one event loop thread in the same process (Django WSGI application already runs the main thread with an event loop).
The best you can do is to start the worker in a separate process. As I mentioned in my question, I started a message broker consumer on a separate thread, which was not a good approach either, so I changed my configuration to start the consumers as separate processes.
I have apscheduler running in Django and it appears to work... okay. In my project's __init__.py, I initialize the scheduler:
import atexit
from apscheduler.scheduler import Scheduler  # apscheduler 2.x API assumed
from django.conf import settings

scheduler = Scheduler(daemon=True)
print("\n\n\n\n\n\n\n\nstarting scheduler")
scheduler.configure({'apscheduler.jobstores.file.class': settings.APSCHEDULER['jobstores.file.class']})
scheduler.start()
atexit.register(lambda: scheduler.shutdown(wait=False))
The first problem with this is that the print shows this code is executed twice. Secondly, in other applications, I'd like to reference the scheduler, but haven't a clue how to do that. If I get another instance of a scheduler, I believe it is a separate threadpool and not the one created here.
How do I get one and only one instance of apscheduler running?
How do I reference that instance in other apps?
That depends on how you ended up with two scheduler instances in the first place. Are you starting apscheduler in a worker thread/process? If you have more than one such worker, you're going to get multiple instances of the scheduler. So, you have to find a way to prevent the scheduler from being started more than once by either running it in a different process if possible, or adding some condition to the scheduler startup.
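If the duplicate startup comes from Django's development server, it is usually the autoreloader running your project init code in both the parent and the reloaded child process. A common workaround (a sketch, relevant only under runserver) is to guard the startup on the RUN_MAIN environment variable that the autoreloader sets in the serving child:

import atexit
import os

from apscheduler.scheduler import Scheduler  # apscheduler 2.x, as in the question
from django.conf import settings

scheduler = None

if os.environ.get('RUN_MAIN') == 'true':
    scheduler = Scheduler(daemon=True)
    scheduler.configure({'apscheduler.jobstores.file.class':
                         settings.APSCHEDULER['jobstores.file.class']})
    scheduler.start()
    atexit.register(lambda: scheduler.shutdown(wait=False))

In production (gunicorn, uwsgi, mod_wsgi) this guard does not apply; there you need to run the scheduler in a single dedicated process or add your own locking.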
You don't. Variables are local to each process. The best you can do is to build some kind of remote execution system, either using a REST service or a remote control system like execnet or rpyc.
In celery, is there a simple way to create a (series of) task(s) that I could use to automagically restart a worker?
The goal is to have my deployment automagically restart all the child celery workers every time it gets a new source from github. So I could then send out a restartWorkers() task to my management celery instance on that machine that would kill (actually stopwait) all the celery worker processes on that machine, and restart them with the new modules.
The plan is for each machine to have:
Management node [Queues: Management, machine-specific] - Responsible for managing the rest of the workers on the machine, bringing up new nodes and killing old ones as necessary
Worker nodes [Queues: git revision specific, worker specific, machine specific] - Actually responsible for doing the work.
It looks like the code I need is somewhere in dist-packages/celery/bin/celeryd_multi.py, but the source is rather opaque about how workers are started, and I can't tell how it's supposed to work or where it actually starts the nodes. (It looks like shutdown_nodes is the correct code to call for killing the processes, and I'm slowly debugging my way through it to figure out what my arguments should be.)
Is there a function/functions restart_nodes(self, nodes) somewhere that I could call or am I going to be running shell scripts from within python?
Also, is there a simpler way to reload the source into Python than killing and restarting the processes? If I knew that reloading the module actually worked (experiments say it doesn't: changes to functions don't take effect until I restart the process), I'd just do that instead of the indirection with management nodes.
EDIT:
I can now shut down workers, thanks to broadcast (thank you, mihael; if I had more rep, I'd upvote). Is there any way to broadcast a restart? There's pool_restart, but that doesn't kill the node, which means it won't pick up the new source.
I've been looking into the behind-the-scenes source in celery.bin.celeryd:WorkerCommand().run(), but there's some odd setup happening before and after the run call, so I can't just call that function and be done; it crashes. It makes no sense to call a shell command from a Python script to run another Python script, and I can't believe I'm the first one to want to do this.
You can try to use the broadcast functionality of Celery.
Here you can see some good examples: https://github.com/mher/flower/blob/master/flower/api/control.py
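For instance (a minimal sketch; the app import path and worker names are placeholders), the built-in shutdown control command can be broadcast to stop workers gracefully:

from proj.celery import app  # your Celery app instance (name assumed)

# ask specific workers to stop once their current tasks finish
app.control.broadcast(
    'shutdown',
    destination=['worker1@myhost', 'worker2@myhost'],
)

The restart itself still has to happen outside the dying worker, e.g. the management node spawning a fresh celery worker process, or a process supervisor configured to respawn it.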
I need to handle a large (time- and memory-consuming) process asynchronously in a web2py application, called from inside a controller method.
My specific use case is to call a process via stdlib.subprocess and wait for it to exit without blocking the web server, but I am open to alternative methods.
Hands-on examples would be a plus.
3rd-party library recommendations are welcome.
CRON scheduling is not required/wanted.
Assuming you'll need to start multiple, possibly simultaneous, instances of the background task, the solution is a task queue. I've heard good things about Celery and RabbitMQ if you're looking for 3rd-party options, and web2py includes its own task queue system that might be sufficient for your needs.
With either tool, you'll define a function that encapsulates the operation you want the background process to perform, then bring the task queue workers online. The web2py manual and forums indicate this can be done with an @reboot statement in the web2py cron system, which is triggered whenever the web server starts. There are probably other ways to start the workers if this is unsatisfactory.
In your controller you'll insert a task into the task queue, passing any necessary parameters as inputs to the function (the background function will not run in the same environment as the controller, so it won't have access to the session, DB, etc. unless you explicitly pass the appropriate values into the task function).
Now, to get the output of the background operation to the user. When you insert a task into the task queue, you should get back a unique ID for the task. You would then implement controller logic (either something that expects an AJAX call, or a page that keeps refreshing until the task completes) that calls the task queue's API to check the status of the specified task. If the task's status is "finished", return the data to the user. If not, keep waiting.
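With web2py's built-in scheduler, that flow looks roughly like this (a sketch only; check it against the scheduler chapter of the web2py book, since the helper names and status values shown here are from memory):

# in a model file, e.g. models/scheduler.py
from gluon.scheduler import Scheduler

def long_running_job(arg1):
    # the time- and memory-consuming work goes here
    return dict(result='done for %s' % arg1)

scheduler = Scheduler(db)

# in a controller, e.g. controllers/default.py
def start_job():
    row = scheduler.queue_task(long_running_job, pvars=dict(arg1=request.vars.arg1))
    return dict(task_id=row.id)

def job_status():
    task = scheduler.task_status(int(request.vars.task_id), output=True)
    if task and task.scheduler_task.status == 'COMPLETED':
        return dict(done=True, result=task.result)
    return dict(done=False)

The workers themselves are started with something like python web2py.py -K yourapp, which is what the @reboot cron entry mentioned above would do.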
Maybe review the book section on running tasks in the background. You can use the new scheduler or create a homemade queue (email example). There's also a web2py-celery plugin, though I'm not sure what state that is in.
This is more difficult than one might expect. Note the deadlock warnings in the stdlib.subprocess documentation. It's easy if you don't mind blocking: use Popen.communicate. To work around the blocking, you can manage the process using stdlib.subprocess from a thread.
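A minimal sketch of that thread-based approach (names are illustrative and error handling is omitted):

import subprocess
import threading

results = {}  # shared dict the web layer can poll, keyed by job id

def run_job(job_id, cmd):
    # communicate() reads stdout/stderr to EOF and waits for exit, which
    # avoids the pipe deadlocks mentioned above; the blocking happens in
    # this worker thread, not in the request handler
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    results[job_id] = (proc.returncode, out, err)

def start_job(job_id, cmd):
    t = threading.Thread(target=run_job, args=(job_id, cmd))
    t.daemon = True
    t.start()

# usage: start_job('job-1', ['sleep', '5']) and poll results['job-1'] later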
My favorite way to deal with subprocesses is to use Twisted's spawnProcess. But, it is not easy to get Twisted to play nicely with other frameworks.
I would like to hold running threads in my Django application. Since I cannot do so in the model or in the session, I thought of holding them in a singleton. I've been checking this out for a while and haven't really found a good how-to for this.
Does anyone know how to create a thread-safe singleton in python?
EDIT:
More specifically, what I want to do is implement some kind of "anytime algorithm": when a user presses a button, a response is returned and a new computation begins (a new thread). I want this thread to run until the user presses the button again, and then my app will return the best solution it has managed to find. To do that, I need to save the thread object somewhere; I thought of storing it in the session, which apparently I cannot do.
The bottom line is: I have a fat computation I want to do on the server side, in different threads, while the user is using my site.
Unless you have a very good reason, you should execute the long-running threads in a different process altogether and use Celery to execute them:
Celery is an open source asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well.
The execution units, called tasks, are executed concurrently on one or more worker nodes using multiprocessing, Eventlet or gevent. Tasks can execute asynchronously (in the background) or synchronously (wait until ready).
Celery guide for djangonauts: http://django-celery.readthedocs.org/en/latest/getting-started/first-steps-with-django.html
For singletons and sharing data between tasks/threads, again, unless you have a good reason, you should use the db layer (aka, models) with some caution regarding db locks and refreshing stale data.
Update: regarding your use case, define a Computation model with a status field. When a user starts a computation, an instance is created and a task starts to run. The task monitors the status field (checking the db once in a while). When the user clicks the button again, a view changes the status to 'user requested stop', causing the task to terminate.
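A sketch of that pattern (the model, field and task names are illustrative, not an existing API; the view handling the second button press would simply set status to 'stop_requested' and save):

# models.py
from django.db import models

class Computation(models.Model):
    status = models.CharField(max_length=20, default='running')
    best_solution = models.TextField(blank=True)

# tasks.py
from celery import shared_task
from .models import Computation

@shared_task
def run_computation(computation_id):
    comp = Computation.objects.get(pk=computation_id)
    while True:
        # ... do one chunk of work here, saving any improved best_solution ...
        comp.refresh_from_db()
        if comp.status == 'stop_requested':
            break
    comp.status = 'finished'
    comp.save()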
If you want asynchronous code in a web application then you're taking the wrong approach. You should run background tasks with a specialist task queue like Celery: http://celeryproject.org/
The biggest problem you have is web server architecture. Unless you go against the recommended Django web server configuration and use a worker thread MPM, you will have no way to track your thread handles between requests as each request typically occupies its own process. This is how Apache normally works: http://httpd.apache.org/docs/2.0/mod/prefork.html
EDIT:
In light of your edit, I think you might learn more by creating a custom solution that does the following (a rough sketch follows at the end of this answer):
Maintain start/stop state in the database
Create a new program that runs as a daemon
Periodically check the start/stop state and begin or end work from there
There's no need for multithreading here unless you need to create a new process for each user. If so, things get more complicated and using Celery will make your life much easier.
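A bare-bones sketch of that daemon (assumes a recent Django where django.setup() exists; the Job model and its fields are hypothetical):

import time

import django
django.setup()  # requires DJANGO_SETTINGS_MODULE to be set in the environment

from myapp.models import Job  # hypothetical model holding the start/stop state

def main():
    while True:
        for job in Job.objects.filter(state='start_requested'):
            job.state = 'running'
            job.save()
            # ... do the heavy work for this job ...
            job.state = 'done'
            job.save()
        time.sleep(5)  # poll interval

if __name__ == '__main__':
    main()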