Multiple instances of celerybeat for autoscaled django app on elasticbeanstalk

I am trying to figure out the best way to structure a Django app that uses Celery to handle async and scheduled tasks in an autoscaling AWS ElasticBeanstalk environment.
So far I have used only a single-instance Elastic Beanstalk environment with Celery + Celerybeat, and this worked perfectly fine. However, I want multiple instances running in my environment, because every now and then an instance crashes and it takes a long time until it is back up. I can't scale my current architecture beyond one instance, because Celerybeat is supposed to run only once across all instances; otherwise every task scheduled by Celerybeat is submitted multiple times (once for every EC2 instance in the environment).
I have read about multiple solutions, but all of them seem to have issues that don't make it work for me:
Using django cache + locking: This approach is more of a quick fix than a real solution. It doesn't scale if you have a lot of scheduled tasks, because you need to add code to check the cache for every task. Also, tasks are still submitted multiple times; this approach only makes sure that the duplicates stop executing.
Using leader_only option with ebextensions: Works fine initially, but if an EC2 instance in the environment crashes or is replaced, this leads to a situation where no Celerybeat is running at all, because the leader is only defined once at the creation of the environment.
Creating a new Django app just for async tasks in the Elastic Beanstalk worker tier: Nice, because web servers and workers can be scaled independently and the web server performance is not affected by huge async workloads performed by the workers. However, this approach does not work with Celery because the worker tier SQS daemon removes messages and posts the message bodies to a predefined URL. Additionally, I don't like the idea of having a complete additional Django app that needs to import the models from the main app and needs to be separately updated and deployed if the tasks are modified in the main app.
How do I use Celery with scheduled tasks in a distributed Elastic Beanstalk environment without task duplication? E.g. how can I make sure that exactly one Celerybeat instance is running across all instances all the time in the Elastic Beanstalk environment (even if the current instance with Celerybeat crashes)?
Are there any other ways to achieve this? What's the best way to use Elastic Beanstalk's Worker Tier Environment with Django?
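For reference, the cache + locking approach from the list above is usually built on an atomic "add if absent" cache operation. The following is only a minimal sketch: InMemoryCache is a hypothetical stand-in that mimics the add method of Django's cache API so the example runs on its own, and run_once and the key name are made up for illustration. In a real deployment the cache would have to be shared by all instances (e.g. Memcached or Redis), and the timeout would need to be a bit shorter than the schedule interval.

```python
import functools
import time


class InMemoryCache:
    """Hypothetical stand-in for a shared cache with a Django-like add()."""

    def __init__(self):
        self._data = {}

    def add(self, key, value, timeout):
        # Store the value only if the key is absent or expired; report success.
        now = time.monotonic()
        entry = self._data.get(key)
        if entry is not None and entry[1] > now:
            return False
        self._data[key] = (value, now + timeout)
        return True


def run_once(cache, key, timeout=60):
    """Run the wrapped task only if nobody else claimed the key recently."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not cache.add(key, "locked", timeout):
                return None  # duplicate submission: skip the work
            # The key is left to expire so later duplicates are skipped too.
            return func(*args, **kwargs)

        return wrapper

    return decorator
```

As noted in the list, this does not stop the duplicate submissions themselves; it only turns the duplicates into no-ops.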

I guess you could move Celery beat into a separate group.
Your auto scaling group runs multiple Django instances, but Celery is not included in the EC2 config of the scaling group.
You should have a different set of instances (or just one) for Celery beat.

In case someone experiences similar issues: I ended up switching to a different queue / task framework for Django. It is called django-q and was set up and working in less than an hour. It has all the features that I needed and also better Django integration than Celery (since djcelery is no longer active).
Django-q is super easy to use and also lighter than the huge Celery framework. I can only recommend it!

Related

In Django, what are the best ways to handle scheduled jobs where only one job is spun up instead of multiple?

[TLDR] What are the best ways to handle scheduled jobs in development where only one job is spun up instead of two? Do you use the 'noreload' option? Do you test to see if a file is locked and then stop the second instance of the job if it is? Are there better alternatives?
[Context]
[edit] We are still in a development environment and are looking at next steps for a production environment.
My team and I are currently developing a Django project, Django 1.9 with Python 3.5. We just discovered that Django spins up two instances of itself to allow for real time code changes.
APScheduler 3.1.0 is being used to schedule a DB ping every few minutes to see if there is new data for us to process. However, when Django spins up, we noticed that we were pinging twice and that there were two instances of our functions running. We tried to shut down the second job through APS but as they are in two different processes, APS is unable to see the other job.
After researching this, we discovered the 'noreload' option and another suggestion to test if a file has been locked.
The 'noreload' option prevents Django from spinning up the second instance. This solution works but feels weird. We haven't come across any documentation or guides that say whether this is something you should or should not do in production.
Filelock 2.0.6 is another option that we have tested. In this solution, the two scheduled tasks ping a local file to see if it is locked. If it isn't locked, then that task will lock it and run while the other one will stop running. If the task crashes, then the locked file will remain locked until a server restart. This feels like a hack.
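One way around the stale-lock problem is an operating-system lock instead of a marker file, because the kernel releases such a lock when the process holding it exits, even after a crash. The sketch below uses fcntl.flock from the standard library rather than the Filelock package mentioned above, so it is Unix-only; try_acquire and the lock path are hypothetical names.

```python
import fcntl


def try_acquire(path):
    """Return an open file holding an exclusive lock, or None if it is taken."""
    handle = open(path, "a")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle  # keep this object open for as long as the job should run
```

The process that gets a file object back starts the scheduler; the one that gets None skips it. Closing the file (or the process dying) frees the lock for the next start.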
In a production environment, are these good solutions? Are there other alternatives that we should look at for handling scheduled tasks that are better for this? Are there cons to either of these solutions that we haven't thought of?
'noreload' - is this something that is done normally in a production environment?

Django app with long running calculations

I'm creating a Django web app which features potentially very long running calculations of up to an hour. The calculations are simulation models built in Python. The web app sends inputs to the simulation model and after some time receives the answer. Also, the user should be able to close his browser after starting the simulation and if he logs in the next day the results should be there.
From my research it seems like I can use Celery together with Redis/RabbitMQ as broker to run the calculation in the background. Ideally I would want to display progress updates using ajax, so that the page updates without a user refresh when the calculation is complete.
I want to host the app on Heroku, so the calculation will also be running on the Heroku server. How hard will it be if I want to move the calculation engine to another server? It might be useful if the calculation engine is on a different server.
So my question is: is the approach above a good one, or what other options can I look at?

I think Celery is a good approach. I'm not sure whether you need Redis/RabbitMQ as a broker or could just use MySQL; it depends on your tasks. Celery workers can be run on different servers, so Celery supports distributed queues.
Another approach: implement a queue engine yourself in Python, with the database as a broker and cron for job execution. But that could be a dirty way with a lot of pain and bugs.
So I think Celery is the nicer way to do it.

If you are running on Heroku, you want django-rq, not Celery. See https://devcenter.heroku.com/articles/python-rq.

Autoscale Python Celery with Amazon EC2

I have a Celery Task-Manager to crunch some numbers for company analytics.
The Task-Manager and workers are hosted on an Amazon EC2 Linux Server.
I need to set up the system such that if we send too many tasks to Celery, Amazon automatically sets up a new EC2 instance to run more workers and balances the load across these workers.
The services that I'm aware of are the Amazon Autoscale and Amazon Load balancing services, which seem like exactly what I want to use; however, I'm not sure what the best way to configure Celery is.
I think that I ought to have a celery "master" which is collecting all the tasks and a number of celery workers which execute them. As the number of tasks increases I want to add more workers. The way the autoscale works (by taking an AMI of the celery server) I think that I'm currently cloning the Master as well as the workers which seems like not what I want to do.
How do I organise this to achieve my end goal, which is flexible autoscaling task management using Celery to manage the tasks and Amazon Web Services to host the computing?
As much detail as possible in any answers (or links to tutorials!) would be greatly appreciated as most tutorials or advice seems to assume large quantities of knowledge which I don't currently have!

You do not need a master-worker architecture to get this to work. If I understand your question correctly, you want to be able to scale based on queue size. I would say it will be easier if you follow these steps:
Set up elasticache/sqs for the broker (since you're in aws)
For custom scaling - A periodic task which checks queue sizes using something like this OR add amazon autoscaling to just add/remove machines when CPU usage is high (assuming that that is a good enough indication of load). Also, start workers with --autoscale so that the CPU usage gets reflected correctly.
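The custom-scaling step above boils down to turning a queue length into a target number of workers. A minimal sketch of that decision follows; the function name and all thresholds are hypothetical, and how the queue length is read depends on the broker and is not shown.

```python
import math


def desired_workers(queue_length, tasks_per_worker=100, min_workers=1, max_workers=10):
    """Return how many workers should run for the given backlog."""
    needed = math.ceil(queue_length / tasks_per_worker)
    # Clamp so the group never scales to zero or beyond the budget.
    return max(min_workers, min(max_workers, needed))
```

A periodic task could call this with the current queue length and then adjust the auto scaling group's desired capacity to the returned value.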

Status of Python Celery tasks

I'm wondering what kind of options there are for monitoring celery tasks from a browser, after they have been deployed to a worker?
My current application stack is a flask app running inside twisted, using celery to run dozens to thousands of small background tasks (updating metadata in a repository, creating image derivatives, etc.) I'm envisioning using ajax long-polling to monitor the status of the celery tasks initiated by the user. I'm using redis for the backend broker and results.
I see celery has some command line ways to monitor tasks, or flower for a web dashboard. But if I wanted to see more detailed status from a particular task sent to celery, would it make more sense for that task to print / write to a log file, then long-poll that file for changes from the flask front-end?
At this point a user can say, "update these 10,000 items", the tasks are sent to celery, and the front-end very quickly says, "job sent!". And the tasks do complete. But I'd like to have the user navigate to "/status" and see the status of those 10,000 small jobs - even a scrolling log file would probably work.
Any suggestions would be greatly appreciated. Took a lot of head scratching to make it this far sketching things out, but I'm spinning my wheels figuring out exactly WHAT to long-poll from the user front-end.

Try Jobtastic, which extends Celery.
From the project description:
Jobtastic gives you goodies like:
Easy progress estimation/reporting
Job status feedback
Helper methods for gracefully handling a dead task broker (delay_or_eager and delay_or_fail)
Super-easy result caching
Thundering herd avoidance
Integration with a celery jQuery plugin for easy client-side progress display
Memory leak detection in a task run

Jobtastic was a great idea, but not quite what worked for us. In the end, we decided to create an incrementing job number (stored in Redis alongside results and broker), push all Celery task ids associated with that job number into a Python object, then pickle and store that in Redis. We can then use that later to see if the entire "job" is complete, or the status thereof. For our purposes, it works just lovely.
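A rough sketch of that bookkeeping, under stated assumptions: a plain dict stands in for Redis, the key names and function names are hypothetical, and get_state is any callable that maps a task id to a Celery state string (in a real app it would typically wrap a lookup of the task's result state). With real Redis the counter would use the atomic INCR command instead of read-then-write.

```python
import pickle


def save_job(store, task_ids):
    """Allocate the next job number and store its task ids, pickled."""
    job_number = store.get("job_counter", 0) + 1
    store["job_counter"] = job_number
    store["job:%d" % job_number] = pickle.dumps(list(task_ids))
    return job_number


def job_status(store, job_number, get_state):
    """Count how many tasks of the job are in each state."""
    task_ids = pickle.loads(store["job:%d" % job_number])
    counts = {}
    for task_id in task_ids:
        state = get_state(task_id)
        counts[state] = counts.get(state, 0) + 1
    return counts


def job_complete(store, job_number, get_state):
    """A job counts as complete when every task has reached SUCCESS."""
    counts = job_status(store, job_number, get_state)
    return set(counts) <= {"SUCCESS"}
```

A "/status" view could return the result of job_status as JSON and let the front-end poll it.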

How can I ensure a Celery task runs with the right settings?

I have two sites running essentially the same codebase, with only slight differences in settings. Each site is built in Django, with a WordPress blog integrated.
Each site needs to import blog posts from WordPress and store them in the Django database. When a user publishes a post, WordPress hits a webhook URL on the Django side, which kicks off a Celery task that grabs the JSON version of the post and imports it.
My initial thought was that each site could run its own instance of manage.py celeryd, each is in its own virtualenv, and the two sites would stay out of each other's way. Each is daemonized with a separate upstart script.
But it looks like they're colliding somehow. I can run one at a time successfully, but if both are running, one instance won't receive tasks, or tasks will run with the wrong settings (in this case, each has a WORDPRESS_BLOG_URL setting).
I'm using a Redis queue, if that makes a difference. What am I doing wrong here?

Have you specified the name of the default queue that Celery should use? If you haven't set CELERY_DEFAULT_QUEUE then both sites will be using the same queue and getting each other's messages. You need to set this setting to a different value for each site to keep the messages separate.
Edit
You're right, CELERY_DEFAULT_QUEUE is only for backends like RabbitMQ. I think you need to set a different database number for each site, using a different number at the end of your broker url.
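As an illustration of the database-number suggestion: a Redis broker URL ends in the database index, so the two sites could differ only in that last path segment. The host, port, and helper name below are assumptions, and the setting name depends on the Celery version (BROKER_URL is assumed here for an older django-celery setup).

```python
def redis_broker_url(db, host="localhost", port=6379):
    """Build a Redis broker URL that points at the given database number."""
    return "redis://%s:%d/%d" % (host, port, db)


# Site A settings.py (hypothetical): BROKER_URL = redis_broker_url(0)
# Site B settings.py (hypothetical): BROKER_URL = redis_broker_url(1)
```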

If you are using django-celery then make sure you don't have an instance of celery running outside of your virtualenvs. Then start the celery instance within your virtualenvs using manage.py celeryd like you have done. I recommend setting up supervisord to keep track of your instances.
