setup celery and redis with existing flask-sqlalchemy-postgres - python

I currently have Postgres set up with the flask-sqlalchemy extension on AWS using Elastic Beanstalk (EB). Postgres is running on RDS. Now I want to set up some background tasks. I read about Celery, and it seems to fit the use case reasonably well.
I want to understand how I can set that up on AWS so that it talks to the same database. For the actual queues I want to use Redis. The business logic for the background processes and what I have in the Flask web server is very intertwined. What would the deployment process look like (with or without EB)? I am okay with setting up a new instance if needed for Celery and Redis, as long as I don't have to separate the business logic much.
Another hacky solution I have been considering is to set up cron jobs on a node that hit certain URLs in the Flask application to execute background tasks, but I would rather have a more scalable solution.

I'm using Flask with a similar setup and I followed this answer:
How do you run a worker with AWS Elastic Beanstalk?
I also set up Redis using this .config file:
https://gist.github.com/yustam/9086610
However, for my setup, I changed the command to be this:
command=/opt/python/run/venv/bin/python2.7 manage.py celery
My manage.py has:
@manager.command
def celery():
    """
    Start the celery worker.
    """
    with app.app_context():
        return celery_main(['celery', 'worker'])
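In case it helps to see how the pieces fit together, below is a minimal sketch (not the poster's code; the host names and config keys are placeholders) of wiring Celery to an existing Flask + SQLAlchemy app with Redis as the broker, so tasks run inside the app context and hit the same Postgres database. The worker process started by the manage.py command above could then import this module, or your project's equivalent.

from celery import Celery
from flask import Flask
from flask_sqlalchemy import SQLAlchemy

db = SQLAlchemy()

def make_celery(app):
    """Create a Celery app whose tasks run inside the Flask app context."""
    celery = Celery(
        app.import_name,
        broker=app.config['CELERY_BROKER_URL'],
        backend=app.config['CELERY_RESULT_BACKEND'],
    )
    celery.conf.update(app.config)

    TaskBase = celery.Task

    class ContextTask(TaskBase):
        abstract = True

        def __call__(self, *args, **kwargs):
            # run every task inside the app context so SQLAlchemy sessions work
            with app.app_context():
                return TaskBase.__call__(self, *args, **kwargs)

    celery.Task = ContextTask
    return celery

app = Flask(__name__)
app.config.update(
    SQLALCHEMY_DATABASE_URI='postgresql://user:password@my-rds-host/mydb',  # placeholder
    CELERY_BROKER_URL='redis://my-redis-host:6379/0',                       # placeholder
    CELERY_RESULT_BACKEND='redis://my-redis-host:6379/1',
)
db.init_app(app)
celery = make_celery(app)

@celery.task
def crunch_numbers(record_id):
    # the task can use the same models as the Flask views,
    # since it talks to the same Postgres database
    ...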

Related

Django celery redis remove a specific periodic task from queue

There is a specific periodic task that needs to be removed from the message queue. I am using the Redis and Celery configuration described here.
tasks.py
@periodic_task(run_every=crontab(minute='*/6'))
def task_abcd():
    """
    some operations here
    """
There are other periodic tasks in the project as well, but I need this specific task to stop running from now on.
As explained in this answer, will the following code work?
@periodic_task(run_every=crontab(minute='*/6'))
def task_abcd():
    pass
In this example the periodic task schedule is defined directly in code, meaning it is hard-coded and cannot be altered dynamically without a code change and app re-deploy.
The provided code, with the task logic deleted or with a simple return at the beginning, will "work", but it does not answer the question: the task will still run, there is just no code executed when it does.
Also, it is recommended NOT to use @periodic_task, which is deprecated:
"""Deprecated decorator, please use :setting:beat_schedule."""
First, change the method from a @periodic_task to a regular Celery @task; and because you are using Django, it is better to go straight for @shared_task:
from celery import shared_task

@shared_task
def task_abcd():
    ...
Now this is just an ordinary Celery task, which needs to be called explicitly, or it can be run periodically if added to the Celery beat schedule.
For production, and if using multiple workers, it is not recommended to run the celery worker with embedded beat (-B); run a separate instance of the celery beat scheduler.
The schedule can be specified in celery.py or in the Django project settings (settings.py).
It is still not very dynamic, as the app needs to be reloaded for changed settings to be re-read.
Then, use the database scheduler (django-celery-beat), which allows dynamically creating schedules: which tasks need to run, when, and with what arguments. It even provides nice Django admin web views for administration!
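For illustration, a static schedule entry for such a task could look like this in celery.py (a sketch; 'myapp.tasks.task_abcd' is a hypothetical dotted path and app is assumed to be your Celery application). Removing the entry removes the periodic run without touching the task code:

from celery import Celery
from celery.schedules import crontab

app = Celery('myapp')  # normally defined once in your project's celery.py

app.conf.beat_schedule = {
    'task-abcd-every-6-minutes': {
        'task': 'myapp.tasks.task_abcd',   # hypothetical dotted path
        'schedule': crontab(minute='*/6'),
        'args': (),
    },
}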
That code will work but I'd go for something that doesn't force you to update your code every time you need to disable/enable the task.
What you could do is to use a configurable variable whose value could come from an admin panel, a configuration file, or whatever you want, and use that to return before your code runs if the task is in disabled mode.
For instance:
@periodic_task(run_every=crontab(minute='*/6'))
def task_abcd():
    config = load_config_for_task_abcd()
    if not config.is_enabled:
        return
    # some operations here
In this way, even if your task is scheduled, its operations won't be executed.
If you simply want to remove the periodic task, have you tried removing the function and then restarting your Celery service? You can restart your Redis service as well as your Django server for good measure.
Make sure that the function you removed is not referenced anywhere else.

How to test code that creates Celery tasks?

I've read Testing with Celery but I'm still a bit confused. I want to test code that generates a Celery task by running the task manually and explicitly, something like:
def test_something(self):
    do_something_that_generates_a_celery_task()
    assert_state_before_task_runs()
    run_task()
    assert_state_after_task_runs()
I don't want to entirely mock up the creation of the task but at the same time I don't care about testing the task being picked up by a Celery worker. I'm assuming Celery works.
The actual context in which I'm trying to do this is a Django application where there's some code that takes too long to run in a request, so, it's delegated to background jobs.
In test mode use CELERY_TASK_ALWAYS_EAGER = True. You can set this in your Django settings.py if you have followed the default guide for django-celery configuration.
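As a rough sketch of what that looks like in practice (the model and task names below are made up), eager mode runs the task inline when .delay() is called, so the test can assert on the side effects immediately:

# settings used by the test run
CELERY_TASK_ALWAYS_EAGER = True        # Celery 4+ name; CELERY_ALWAYS_EAGER on older setups
CELERY_TASK_EAGER_PROPAGATES = True    # re-raise exceptions from eagerly executed tasks

# tests.py
from django.test import TestCase

from myapp.models import Report          # hypothetical model
from myapp.tasks import generate_report  # hypothetical task

class GenerateReportTest(TestCase):
    def test_report_is_created(self):
        self.assertEqual(Report.objects.count(), 0)
        # with ALWAYS_EAGER this runs synchronously, no worker or broker needed
        generate_report.delay(user_id=1)
        self.assertEqual(Report.objects.count(), 1)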

Multiple instances of celerybeat for autoscaled django app on elasticbeanstalk

I am trying to figure out the best way to structure a Django app that uses Celery to handle async and scheduled tasks in an autoscaling AWS ElasticBeanstalk environment.
So far I have used only a single-instance Elastic Beanstalk environment with Celery + Celerybeat, and this worked perfectly fine. However, I want to have multiple instances running in my environment, because every now and then an instance crashes and it takes a lot of time until it is back up. But I can't scale my current architecture to more than one instance, because Celerybeat is supposed to run only once across all instances; otherwise every task scheduled by Celerybeat would be submitted multiple times (once for every EC2 instance in the environment).
I have read about multiple solutions, but all of them seem to have issues that don't make it work for me:
Using django cache + locking: This approach is more like a quick fix than a real solution. It can't be the solution if you have a lot of scheduled tasks, since you need to add code to check the cache for every task. Also, tasks are still submitted multiple times; this approach only makes sure that execution of the duplicates stops.
Using the leader_only option with ebextensions: Works fine initially, but if an EC2 instance in the environment crashes or is replaced, this leads to a situation where no Celerybeat is running at all, because the leader is only defined once, at the creation of the environment.
Creating a new Django app just for async tasks in the Elastic Beanstalk worker tier: Nice, because web servers and workers can be scaled independently and web server performance is not affected by huge async workloads performed by the workers. However, this approach does not work with Celery, because the worker tier's SQS daemon removes messages and posts the message bodies to a predefined URL. Additionally, I don't like the idea of having a complete additional Django app that needs to import the models from the main app and needs to be separately updated and deployed whenever the tasks in the main app are modified.
How do I use Celery with scheduled tasks in a distributed Elastic Beanstalk environment without task duplication? E.g. how can I make sure that exactly one Celerybeat instance is running across all instances at all times in the Elastic Beanstalk environment (even if the instance currently running Celerybeat crashes)?
Are there any other ways to achieve this? What's the best way to use Elastic Beanstalk's Worker Tier Environment with Django?
I guess you could single out celery beat into a different group.
Your autoscaling group runs multiple Django instances, but celery beat is not included in the EC2 config of that scaling group.
You should have a separate set of instances (or just one) for celery beat.
In case someone experiences similar issues: I ended up switching to a different queue/task framework for Django. It is called django-q and was set up and working in less than an hour. It has all the features I needed and also better Django integration than Celery (since djcelery is no longer active).
Django-q is super easy to use and also lighter than the huge Celery framework. I can only recommend it!
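For anyone considering that route, here is a rough sketch of the django-q equivalents (the task path is hypothetical; API names follow django-q's documented interface). Because schedules are stored as database rows, they work across autoscaled instances and can be edited in the Django admin:

from django_q.tasks import async_task, schedule
from django_q.models import Schedule

# fire-and-forget background task
async_task('myapp.tasks.task_abcd')

# periodic schedule stored in the database (no separate beat process needed)
schedule(
    'myapp.tasks.task_abcd',
    schedule_type=Schedule.MINUTES,
    minutes=6,
    repeats=-1,  # repeat indefinitely
)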

Django app with long running calculations

I'm creating a Django web app which features potentially very long running calculations of up to an hour. The calculations are simulation models built in Python. The web app sends inputs to the simulation model and after some time receives the answer. Also, the user should be able to close his browser after starting the simulation and if he logs in the next day the results should be there.
From my research it seems like I can use Celery together with Redis/RabbitMQ as broker to run the calculation in the background. Ideally I would want to display progress updates using ajax, so that the page updates without a user refresh when the calculation is complete.
I want to host the app on Heroku, so the calculation will also be running on the Heroku server. How hard will it be if I want to move the calculation engine to another server? It might be useful if the calculation engine is on a different server.
So my question is: is the approach above a good one, or what other options should I look at?
I think Celery is a good approach. I'm not sure whether you need Redis/RabbitMQ as a broker or whether you could just use MySQL; it depends on your tasks. Celery workers can be run on different servers, so Celery supports distributed queues.
Another approach: implement some queue engine in Python, with a database as the broker and a cron for job execution. But that could be a dirty way with a lot of pain and bugs.
So I think Celery is the nicer way to do it.
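To make the progress-update idea from the question concrete, here is a hedged sketch (all names are invented) of a bound Celery task that reports progress through its result backend, plus a Django view the page could poll via AJAX:

from celery import shared_task
from celery.result import AsyncResult
from django.http import JsonResponse

@shared_task(bind=True)
def run_simulation(self, params):
    total = 100  # hypothetical number of simulation steps
    for i in range(total):
        # ... advance the simulation one step ...
        self.update_state(state='PROGRESS', meta={'current': i + 1, 'total': total})
    return {'status': 'done'}

def simulation_status(request, task_id):
    """Polled by the page via AJAX to show progress without a refresh."""
    result = AsyncResult(task_id)
    payload = {'state': result.state}
    if result.state == 'PROGRESS':
        payload.update(result.info)      # {'current': ..., 'total': ...}
    elif result.successful():
        payload['result'] = result.result
    return JsonResponse(payload)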
If you are running on Heroku, you want django-rq, not Celery. See https://devcenter.heroku.com/articles/python-rq.

uWSGI, cherrypy and threading

preface: I would like to separate these problems into smaller questions, but apparently, I am missing some pieces of the puzzle and it seems impossible to me.
I developed my cherrypy application using cherrypy's built-in WSGI server. I naively assumed that when the time came, I would be able to use the created WSGI Application class and deploy it using any WSGI-compliant server.
I used this blog post to create my own (but very similar) cherrypy Plugin and Tool to connect to database using SQLAlchemy during http requests.
I expected that any server would somehow work like cherrypy's built-in server:
the main process will spawn X threads to satisfy X concurrent requests
my engine Plugin will create a SQLAlchemy engine with connection pool = X (so every request has its own connection)
on request arrival, my Tool will supply a SQLAlchemy connection from the pool
This flow does not match uWSGI (as far as I understand it).
I assign my application.py in uWSGI configuration. This file looks something like this:
cherrypy.tools.db = DbConnectorTool()
cherrypy.engine.dbengine = DbEnginePlugin(cherrypy.engine, settings.database)
cherrypy.config.update({
    'engine.dbengine.on': True
})
from myapp.application import Application
root = Application(settings)
application = cherrypy.Application(root, script_name='', config=settings)
I was using this application.py to mount my application into cherrypy's built in server when I was developing and testing it.
The problems are that uWSGI does not create any threads itself and my SQLAlchemy plugin is not working with it, because no cherrypy.engine is created.
Does uWSGI support threading in the sense of using threads to serve multiple concurrent requests? Can I start these threads in my application.py? Will uWSGI understand it and use these threads for concurrent requests? And how can this be done? I think cherrypy can be used somehow, or not?
And what about my SQLAlchemy Plugin, how can I start cherrypy.engine when using only WSGI cherrypy.Application?
Any help or information that could help me will be appreciated.
Edit:
My uWSGI configuration:
<uwsgi>
<socket>127.0.0.1:9001</socket>
<master/>
<daemonize>/var/log/uwsgi/app.log</daemonize>
<logdate/>
<threads/>
<pidfile>/home/web/uwsgi.pid</pidfile>
<uid>uwsgi</uid>
<gid>uwsgi</gid>
<workers>2</workers>
<harakiri>90</harakiri>
<harakiri-verbose/>
<home>/home/web/</home>
<pythonpath>/home/web/instance</pythonpath>
<module>core.application</module>
<no-orphans/>
<touch-reload>/home/web/uwsgi-reload-web</touch-reload>
</uwsgi>
uWSGI uses worker processes, not threads. This means that globals are no longer shared between all requests. You can use the SharedArea for global data.
The processes are forked by default, so make sure you're ok with that or adjust settings (see Things to know).
Get CherryPy's WSGI application with a cherrypy.tree.mount(root, config=settings) call.
If your DB plugin does not have threading / shared-data issues, chances are it will work. Like you say, you may need cherrypy.engine.start(), but definitely not cherrypy.engine.block(), since your main thread is now the uWSGI worker.
You should post your uWSGI config, otherwise it will be hard to understand what is going on.
By the way, to spawn additional threads (per worker) you simply need to add --threads N.
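Putting those points together, the uWSGI entry module could look roughly like this (a sketch only; it reuses the question's DbConnectorTool, DbEnginePlugin and settings, imported here from hypothetical locations):

import cherrypy

from myapp import settings                                   # hypothetical
from myapp.application import Application
from myapp.plugins import DbConnectorTool, DbEnginePlugin    # hypothetical

cherrypy.tools.db = DbConnectorTool()
cherrypy.engine.dbengine = DbEnginePlugin(cherrypy.engine, settings.database)
cherrypy.config.update({'engine.dbengine.on': True})

root = Application(settings)
# mount returns a WSGI-callable cherrypy.Application for uWSGI to serve
application = cherrypy.tree.mount(root, script_name='', config=settings)

# start the engine so plugins (like the DB engine) run, but do not block():
# the uWSGI worker owns the main loop
cherrypy.engine.start()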
