Python Celery task to restart celery worker

In Celery, is there a simple way to create a (series of) task(s) that I could use to automagically restart a worker?
The goal is to have my deployment automagically restart all the child celery workers every time it gets new source from GitHub. So I could then send out a restartWorkers() task to my management celery instance on that machine that would kill (actually stopwait) all the celery worker processes on that machine and restart them with the new modules.
The plan is for each machine to have:
Management node [Queues: Management, machine-specific] - Responsible for managing the rest of the workers on the machine, bringing up new nodes and killing old ones as necessary
Worker nodes [Queues: git revision specific, worker specific, machine specific] - Actually responsible for doing the work.
It looks like the code I need is somewhere in dist_packages/celery/bin/celeryd_multi.py, but the source is rather opaque for starting workers, and I can't tell how it's supposed to work or where it's actually starting the nodes. (It looks like shutdown_nodes is the correct code to be calling for killing the processes, and I'm slowly debugging my way through it to figure out what my arguments should be)
Is there a function/functions restart_nodes(self, nodes) somewhere that I could call, or am I going to be running shell scripts from within Python?
Also, is there a simpler way to reload the source into Python than killing and restarting the processes? If I knew that reloading the module actually worked (experiments say that it doesn't; changes to functions don't percolate until I restart the process), I'd just do that instead of the indirection with management nodes.
EDIT:
I can now shut down, thanks to broadcast (thank you, mihael; if I had more rep, I'd upvote). Is there any way to broadcast a restart? There's pool_restart, but that doesn't kill the node, which means it won't update the source.
I've been looking into some of the behind-the-scenes source in celery.bin.celeryd:WorkerCommand().run(), but there's some weird stuff going on before and after the run call, so I can't just call that function and be done, because it crashes. It makes no sense to call a shell command from a Python script just to run another Python script, and I can't believe I'm the first one to want to do this.

You can try to use the broadcast functionality of Celery.
Here you can see some good examples: https://github.com/mher/flower/blob/master/flower/api/control.py
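For illustration, here is a minimal sketch of what a broadcast-based shutdown task could look like; the app name, broker URL, and the assumption that something external (supervisord, celeryd-multi, or the management node) relaunches the workers after they exit are mine, not from the answer:

    from celery import Celery

    # Assumed app and broker URL; substitute your own configuration.
    app = Celery('proj', broker='redis://localhost:6379/0')

    @app.task
    def restart_workers():
        # Warm-shutdown every worker: finish the current tasks, then exit.
        app.control.broadcast('shutdown')
        # Or target specific nodes:
        # app.control.broadcast('shutdown', destination=['worker1@myhost'])
        # Bringing the workers back up with the new source is left to
        # whatever launched them (supervisord, celeryd-multi, the
        # management node, ...).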

Related

How can one maintain communication with a child process if the parent process restarts?

I'm writing a simple job scheduler and I need to maintain some basic level of communication from the scheduler (parent) to spawned processes (children).
Basically I want to be able to send signals from scheduler -> child and also inform the scheduler of a child's exit status.
Generally this is all fine and dandy using regular subprocess Popen with signaling / waitpid. My issue is that I want to be able to restart the scheduler without impacting running jobs, and re-establish communication once the scheduler restarts.
I've gone through a few ideas; nothing feels "right".
Right now I have a simple wrapper script that maintains communication with the scheduler using named pipes, and the wrapper actually runs the scheduled command. This seems to work well in most cases, but I have seen it subject to some races, mainly because named pipes need the connection to be alive in order to read/write (no buffering).
I'm considering trying Linux message queues instead, but I'm not sure that's the right path. Another option is to set up a listening socket on the scheduler and have sub-processes connect to it when they start, re-establishing the connection if it breaks. This might be overkill, but maybe it's the most robust.
It's a shame I have to go through all this trouble since I can't just "re-attach" to the child process on restart (I realize there are reasons for this).
Any other thoughts? This seems like a common problem any scheduler would hit, am I missing some obvious solution here?
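For what it's worth, a rough sketch of the socket option from the child's side, assuming the scheduler listens on a well-known Unix socket (the path, retry interval, and message format are made up for the example):

    import os
    import socket
    import time

    SOCK_PATH = '/tmp/scheduler.sock'   # hypothetical rendezvous point

    def child_loop():
        while True:
            try:
                s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
                s.connect(SOCK_PATH)
                s.sendall(f'hello {os.getpid()}\n'.encode())
                while True:
                    data = s.recv(4096)   # signals / commands from scheduler
                    if not data:          # scheduler went away
                        break
                    # ... act on the signal, report status, etc. ...
            except OSError:
                pass                      # scheduler not up (yet)
            time.sleep(1)                 # retry until the scheduler returns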

Is celery the appropriate tech for running long-running processes I simply need to start/stop?

My python program isn't the sort of thing you'd create an init script for. It's simply a long-running process which needs to run until I tell it to shut down.
I run multiple instances of the program, each with different cmd-line args. Just FYI, the program acts like a Physics Tutor who chats with my users, and each instance represents a different Physics problem.
My Django app communicates with these processes using Redis pub/sub.
I'd like to improve how I start/stop and manage these processes from Django views. What I don't know is if Celery is the proper technology to do this for me. A lot of the celery docs make it sound like it's for running short-lived asynchronous tasks, such as their 'add()' example task.
Currently my views are doing some awful 'spawn' stuff to start the processes, and I'm keeping track of which processes are running in a completely ad-hoc way utilizing a Redis hash.
My program actually only daemonizes if I pass it a -d argument, which I suppose I wouldn't pass if using Celery, although it does output to stdout/stderr if I don't pass that option.
All I really need is:
A way to start/stop my processes
information on whether the start/stop operation succeeded
information on which of my processes are running
What I don't want is:
multiple instances of a process with the same configuration running
to have to replace the way I communicate with Django (Redis pub/sub)
Does celery sound like the proper tech for me to use for my process management?
Maybe you can utilize supervisor for this. It is good at running and monitoring long-running processes and has an XML-RPC interface.
You can view an example of what I did here (example output here).
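Roughly, the XML-RPC side looks like the sketch below; it assumes supervisord's inet_http_server is enabled on port 9001 and that a program named physics_tutor_1 is defined in supervisord.conf (both are assumptions for the example):

    from xmlrpc.client import ServerProxy

    server = ServerProxy('http://localhost:9001/RPC2')

    server.supervisor.startProcess('physics_tutor_1')    # raises a Fault on failure
    info = server.supervisor.getProcessInfo('physics_tutor_1')
    print(info['statename'])                             # e.g. 'RUNNING'

    # Which of my processes are running?
    print([p['name'] for p in server.supervisor.getAllProcessInfo()
           if p['statename'] == 'RUNNING'])

    server.supervisor.stopProcess('physics_tutor_1')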

Celery worker: periodic local background task

Is there any Celery functionality or preferred way of executing periodic background tasks locally when using a single worker? Sort of like a background thread, but scheduled and handled by Celery?
celery.beat doesn't seem suitable, as it appears to be tied simply to a consumer (so the task could run on any server). That's the type of scheduling I'm after, but I need a task that always runs locally on each server running this worker (the task does some cleanup and stats relating to the main task the worker handles).
I may be going about this the wrong way, but I'm confined to implementing this within a celery worker daemon.
You could use a custom remote control command and call the broadcast function from a cron job to run cleanup or whatever else might be required.
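A sketch of that idea, using the registration style from the Celery 3.x docs (the exact API differs between Celery versions, and local_cleanup is a made-up command name):

    from celery.worker.control import Panel

    @Panel.register
    def local_cleanup(state, **kwargs):
        # Runs inside every worker that receives the broadcast.
        ...  # worker-local cleanup / stats go here
        return {'ok': 'cleanup complete'}

    # From a cron job (or any script with access to the broker):
    #   app.control.broadcast('local_cleanup', reply=True)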
One possible method I thought of, though not ideal, is to patch the celery.worker.heartbeat Heart() class.
Since we already use heartbeats, the class allows for a simple modification to its start() method (add another self.timer.call_repeatedly() entry), or you could add a self.eventer.on_enabled.add() entry in __init__ that references a new method which also uses self.timer.call_repeatedly() to perform the periodic task.
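A purely illustrative sketch of that patch, with the caveat that celery.worker.heartbeat internals vary between versions and run_local_stats is a placeholder for the worker-local job:

    from celery.worker.heartbeat import Heart

    def run_local_stats():
        ...  # cleanup / stats for this worker only

    class StatsHeart(Heart):
        def start(self):
            super().start()
            # Piggyback on the timer the heartbeat already uses.
            self.timer.call_repeatedly(300.0, run_local_stats)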

Making a zmq server run forever in Django?

I'm trying to figure out the best way to keep a zeroMQ listener running forever in my Django app.
I'm setting up a zmq server app in my Django project that acts as an internal API to other applications on our network (no need to go through the http/requests stuff since these apps are internal). I want the zmq listener inside my Django project to always be alive.
I want the zmq listener in my Django project so I have access to all of the project's models (for querying) and other Django context.
I'm currently thinking:
Set up a Django management command that will run the listener and keep it alive forever (aka infinite loop inside the zmq listener code) or
use a celery worker to always keep the zmq listener alive? But I'm not exactly sure how to get a celery worker to restart a task only if it's not running. All the celery docs are about frequency/delayed running. Or maybe I should let celery purge the task at a given interval and restart it anyway...
Any tips, advice on performance implications or alternate approaches?
Setting up a management command is a fine way to do this, especially if you're running on your own hardware.
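A bare-bones sketch of that management command (the port, socket type, and message handling are placeholders):

    import zmq
    from django.core.management.base import BaseCommand

    class Command(BaseCommand):
        help = 'Run the internal zmq API listener forever'

        def handle(self, *args, **options):
            ctx = zmq.Context()
            sock = ctx.socket(zmq.REP)
            sock.bind('tcp://*:5555')        # assumed port
            while True:                      # runs until the process is killed
                request = sock.recv_json()
                # ... query Django models, build a reply ...
                sock.send_json({'ok': True, 'echo': request})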
If you're running in a cloud, where a machine may disappear along with your process, then the latter is a better option. This is how I've done it:
Set up a periodic task that runs every N seconds (you need celerybeat running somewhere).
When the task spawns, it first checks a shared network resource (redis, zookeeper, or a db) to see if another process has an active/valid lease. If one exists, abort.
If there's no valid lease, obtain your lease (beware of concurrency here!), and start your infinite loop, making sure you periodically renew the lease (see the sketch at the end of this answer).
Add instrumentation so that you know who is running the process and where.
Start celery workers on multiple boxes, consuming from the same queue your periodic task is designated for.
The second solution is more complex and harder to get right; so if you can, a singleton is great, and consider using something like supervisord to ensure the process gets restarted if it faults for some reason.
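A rough sketch of the lease step, using a Redis key with a TTL as the shared resource (the key name, timeout, and handle_one_zmq_message are assumptions for the example):

    import redis
    from celery import shared_task

    r = redis.Redis()
    LEASE_KEY = 'zmq-listener-lease'
    LEASE_TTL = 30   # seconds; renew well before this expires

    @shared_task
    def ensure_listener():
        # NX + EX makes "acquire only if absent" a single atomic operation.
        if not r.set(LEASE_KEY, 'me', nx=True, ex=LEASE_TTL):
            return 'another node holds the lease'
        try:
            while True:
                handle_one_zmq_message()         # placeholder for the real loop
                r.expire(LEASE_KEY, LEASE_TTL)   # renew the lease as we go
        finally:
            r.delete(LEASE_KEY)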

Preventing management commands from running more than one at a time

I'm designing a long running process, triggered by a Django management command, that needs to run on a fairly frequent basis. This process is supposed to run every 5 min via a cron job, but I want to prevent it from running a second instance of the process in the rare case that the first takes longer than 5 min.
I've thought about creating a touch file that gets created when the management process starts and is removed when the process ends. A second management command process would then check to make sure the touch file didn't exist before running. But that seems like a problem if a process dies abruptly without properly removing the touch file. It seems like there's got to be a better way to do that check.
Does anyone know any good tools or patterns to help solve this type of issue?
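One commonly used pattern, separate from the answer below: take an exclusive fcntl lock on the file instead of just touching it. The kernel releases the lock automatically when the process dies, so a crash can't leave a stale guard behind (the lock-file path here is arbitrary):

    import fcntl
    import sys

    lock_file = open('/tmp/my_command.lock', 'w')
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        sys.exit('previous run still in progress')

    # ... long-running management command body goes here ...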
For this reason I prefer to have a long-running process that gets its work off of a shared queue. By long-running I mean that its lifetime is longer than a single unit of work. The process is then controlled by some daemon service such as supervisord, which can take over control of restarting the process when it crashes. This delegates the work appropriately to something that knows how to manage process lifecycles and frees you from having to worry about the nitty-gritty of POSIX processes in the scope of your script.
If you have a queue, you also have the luxury of being able to spin up multiple processes that can each take jobs off of the queue and process them, but that sounds like it's out of scope of your problem.
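A minimal sketch of that consumer, using a Redis list as the shared queue (the queue name and process_job are placeholders; supervisord would be configured to keep this script running):

    import redis

    r = redis.Redis()

    def main():
        while True:
            _, payload = r.blpop('work-queue')    # blocks until a job arrives
            process_job(payload)                  # placeholder for the real work

    if __name__ == '__main__':
        main()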
