I have a Django+Celery application, where Celery is used to push (and pull) Django model instances to a third-party SOAP service.
My Django models have dependencies between them and also a simple hash like this:
from django.db import models

class MyModel(models.Model):
    def get_dependencies(self):
        # ...
        return [...]

    def __hash__(self):
        return hash(self.__class__.__name__ + str(self.pk))
This hash came in handy in my own implementation, which I had to drop due to stability issues. Celery is much sounder ground.
When I push an instance over to the SOAP service, I must make sure that its dependencies have been pushed. This is done by checking all related instances for a pushed_ok timestamp field.
The difficult part is when an instance a, which depends on a list of instances deps (all instances of MyModel subclasses), is being pushed. I cannot push a until all instances in deps have been processed by Celery. In other words, I need to serialize tasks so that the dependency order is respected.
Celery is run like this:
celery -A server worker -P eventlet -c 100
How can I prevent one of the eventlets (or processes/threads) from running a before its dependencies, if any, have finished being run by other eventlets?
Thank you for your help.
I went for a pragmatic solution: instead of trying to serialize tasks according to the resources' dependencies, I moved all dependency checking of a resource (which includes pushing out-of-sync dependencies to the SOAP server) inside the Celery task.
The upside is that it keeps things simple and I could implement it rapidly.
The downside is that I'm tying up a worker for what can be many synchronous SOAP operations, instead of dispatching those operations across workers.
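Roughly, the task now looks like this (a sketch; get_instance(), is_pushed() and push_to_soap() are hypothetical stand-ins for my model lookup, pushed_ok timestamp check and SOAP client code):

from celery import shared_task

@shared_task
def push_resource(model_label, pk):
    instance = get_instance(model_label, pk)   # hypothetical: resolve the model instance
    # push any out-of-sync dependencies synchronously, inside this same task
    for dep in instance.get_dependencies():
        if not is_pushed(dep):                 # hypothetical: checks the pushed_ok timestamp
            push_to_soap(dep)                  # hypothetical SOAP client call
    push_to_soap(instance)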
This question might seem odd to folks, but it's actually a creative question.
I'm using Django (v3.2.3) and celery (v5.2.3) for a project. I've noticed that the workers and the master process all share the same code (probably because celery loads my app modules before it forks the child processes, for configuration reasons). While this would normally be fine, I want to do something more unreasonable :smile: and have the celery workers each load my project code after they fork (similar to what uwsgi does with its lazy-apps configuration).
Some responses here will ask why, but let's not focus on that (remember, I'm being unreasonable). Let's just assume I don't want to write thread-safe code. The risks are understood, namely that each child worker would use more memory and be slower to restart.
It's not clear to me from reading the celery code how this would be possible. I've tried the following, to no avail (rough sketch after the list):
listen to the signal worker_process_init (source here)
then use my project's instantiated app ref and talk to the DjangoFixup interface app._fixups[0] here
and try to manually call all the registered signal callbacks for the DjangoFixupWorker here
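Put together, the attempt looks roughly like this (a sketch of my own code; myproject.celery is a placeholder for where my app is instantiated, and the fixup indexing is an assumption from reading the celery source):

from celery.signals import worker_process_init
from myproject.celery import app  # my instantiated Celery app

@worker_process_init.connect
def load_project_after_fork(**kwargs):
    # talk to the DjangoFixup interface on the app (assumption: it's the first fixup)
    fixup = app._fixups[0]
    # ...then try to manually invoke the DjangoFixupWorker signal callbacks here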
Any ideas on the steps to get this to work would be much appreciated.
I have some fairly complex periodic tasks that need to be offloaded from the Django context. django-celery-beat looks promising. While I was going through the celery beat docs, I found this:
You have to ensure only a single scheduler is running for a schedule at a time, otherwise you’d end up with duplicate tasks. Using a centralized approach means the schedule doesn’t have to be synchronized, and the service can operate without using locks.
A typical production deployment will spawn a pool of worker processes, each running a Django instance. Will that result in the creation of multiple scheduler processes as well? Do I need some synchronisation logic?
Thanks for your time!
It does not.
You can dig into the issues page on their GitHub repo for confirmation. I think it's weird that the documentation doesn't call this out, but I suppose you have to assume that's how all celery beat schedulers work unless they specify otherwise.
In theory, you could build your own synchronization, but it will probably be a better experience to use a different scheduler that has that functionality built in, like Heroku's redbeat: https://blog.heroku.com/redbeat-celery-beat-scheduler.
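If you do go with RedBeat, the wiring is roughly this (a sketch; it assumes celery-redbeat is installed, app is your Celery instance, and the Redis URLs are placeholders):

from celery import Celery

app = Celery('proj', broker='redis://localhost:6379/0')

# point RedBeat at a Redis database it can use for its schedule and lock
app.conf.redbeat_redis_url = 'redis://localhost:6379/1'

# then start beat with RedBeat's scheduler class, e.g.:
#   celery -A proj beat -S redbeat.RedBeatScheduler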
I am trying to figure out the best way to structure a Django app that uses Celery to handle async and scheduled tasks in an autoscaling AWS ElasticBeanstalk environment.
So far I have used only a single-instance Elastic Beanstalk environment with Celery + Celerybeat, and this worked perfectly fine. However, I want to have multiple instances running in my environment, because every now and then an instance crashes and it takes a lot of time until it is back up. But I can't scale my current architecture to more than one instance, because Celerybeat is supposed to run only once across all instances; otherwise every task scheduled by Celerybeat would be submitted multiple times (once for every EC2 instance in the environment).
I have read about multiple solutions, but all of them seem to have issues that don't make it work for me:
Using django cache + locking: This approach is more like a quick fix than a real solution. It can't be the solution if you have a lot of scheduled tasks, since you need to add code to check the cache for every task. Also, tasks are still submitted multiple times; this approach only makes sure that execution of the duplicates stops.
Using leader_only option with ebextensions: Works fine initially, but if an EC2 instance in the environment crashes or is replaced, this would lead to a situation where no Celerybeat is running at all, because the leader is only defined once, at the creation of the environment.
Creating a new Django app just for async tasks in the Elastic Beanstalk worker tier: Nice, because web servers and workers can be scaled independently and the web server performance is not affected by huge async workloads performed by the workers. However, this approach does not work with Celery, because the worker tier's SQS daemon removes messages and posts the message bodies to a predefined URL. Additionally, I don't like the idea of having a complete additional Django app that needs to import the models from the main app and needs to be separately updated and deployed whenever the tasks in the main app are modified.
How do I use Celery with scheduled tasks in a distributed Elastic Beanstalk environment without task duplication? E.g. how can I make sure exactly one Celerybeat instance is running at all times across all instances in the Elastic Beanstalk environment (even if the instance currently running Celerybeat crashes)?
Are there any other ways to achieve this? What's the best way to use Elastic Beanstalk's Worker Tier Environment with Django?
I guess you could split celery beat out into a different group.
Your autoscaling group runs multiple Django instances, but celery beat is not included in the EC2 config of that scaling group.
You should have a separate set of instances (or just one instance) for celery beat.
In case someone experiences similar issues: I ended up switching to a different queue/task framework for Django. It is called django-q and was set up and working in less than an hour. It has all the features I needed and also better Django integration than Celery (since djcelery is no longer active).
Django-q is super easy to use and also lighter than the huge Celery framework. I can only recommend it!
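For reference, the setup is roughly this (a sketch; the cluster options and the 'myapp.tasks.sync_posts' task path are placeholders for your own):

# settings.py: django-q cluster using the Django ORM as its broker
Q_CLUSTER = {
    'name': 'myproject',
    'workers': 4,
    'orm': 'default',
}

# register a periodic task (e.g. from a shell or data migration)
from django_q.tasks import schedule
from django_q.models import Schedule

schedule('myapp.tasks.sync_posts', schedule_type=Schedule.HOURLY)

# the workers and scheduler are then started with: python manage.py qcluster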
We have an application that uses a Celery instance in two ways: the instance's .task attribute is used as our task decorator, and when we invoke celery workers, we pass the instance as the -A (--app) argument. In other words, the same Celery instance serves both producers (the tasks) and consumers (the celery workers), and this has worked fine so far.
Now, we are considering using Bigwig RabbitMQ, which is an AMQP service provider, and they publish two different URLs, one optimized for message producers, the other optimized for message consumers.
What's the best way for us to modify our setup in order to take advantage of the separate broker endpoints? I'm assuming a single Celery instance can only use a single broker URL (via the BROKER_URL setting). Should we use two distinct Celery instances configured identically except for the BROKER_URL setting?
This feature will be available in Celery 4.0: http://docs.celeryproject.org/en/master/whatsnew-4.0.html#configure-broker-url-for-read-write-separately
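With the 4.0 settings that page describes, it would look roughly like this (a sketch; the URLs are placeholders for the two Bigwig endpoints):

from celery import Celery

app = Celery('proj')
# workers (consumers) read from the consumer-optimized endpoint
app.conf.broker_read_url = 'amqp://user:pass@consumer.bigwig.example/vhost'
# .delay() / .apply_async() calls (producers) publish to the producer-optimized endpoint
app.conf.broker_write_url = 'amqp://user:pass@producer.bigwig.example/vhost'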
Yes, you are right: one Celery instance can use only one broker URL. As you said, the only way is to use two Celery instances with different BROKER_URL settings, one for consuming and one for producing.
Technically it's trivial; you can take advantage of config_from_object (http://celery.readthedocs.org/en/latest/reference/celery.html#celery.Celery.config_from_object). Of course you will have two instances running, but I don't think that introduces any problem.
There is also another option explained here, but I would avoid it.
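A sketch of that two-instance approach (the URLs are placeholders; the broker can equally be set via config_from_object as linked above):

from celery import Celery

# imported by Django code that enqueues tasks with .delay() / .apply_async()
producer_app = Celery('proj', broker='amqp://user:pass@producer.host/vhost')

# passed to the worker via -A, so it consumes from the consumer-optimized endpoint
consumer_app = Celery('proj', broker='amqp://user:pass@consumer.host/vhost')

Note that both instances would need to register the same tasks, since tasks are matched by name on the consuming side.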
I have two sites running essentially the same codebase, with only slight differences in settings. Each site is built in Django, with a WordPress blog integrated.
Each site needs to import blog posts from WordPress and store them in the Django database. When a user publishes a post, WordPress hits a webhook URL on the Django side, which kicks off a Celery task that grabs the JSON version of the post and imports it.
My initial thought was that each site could run its own instance of manage.py celeryd, each in its own virtualenv, and the two sites would stay out of each other's way. Each is daemonized with a separate upstart script.
But it looks like they're colliding somehow. I can run one at a time successfully, but if both are running, one instance won't receive tasks, or tasks will run with the wrong settings (in this case, each has a WORDPRESS_BLOG_URL setting).
I'm using a Redis queue, if that makes a difference. What am I doing wrong here?
Have you specified the name of the default queue that celery should use? If you haven't set CELERY_DEFAULT_QUEUE, both sites will be using the same queue and getting each other's messages. You need to set this setting to a different value for each site to keep the messages separate.
Edit
You're right, CELERY_DEFAULT_QUEUE is only for backends like RabbitMQ. I think you need to set a different database number for each site, using a different number at the end of your broker URL.
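For example (a sketch; host and port are placeholders), the two settings files could point at different Redis databases:

# site A settings.py
BROKER_URL = 'redis://localhost:6379/0'

# site B settings.py
BROKER_URL = 'redis://localhost:6379/1'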
If you are using django-celery, make sure you don't have an instance of celery running outside of your virtualenvs. Then start the celery instances within your virtualenvs using manage.py celeryd, as you have done. I recommend setting up supervisord to keep track of your instances.