I am trying to run some tasks in Django that consume a lot of time. For that, I will be running background tasks.
After some R&D, I have found two solutions:
Celery with RabbitMQ.
Django Background tasks.
Both options seem to fulfill the criteria, but setting up Celery will require some work. The second option is fairly simple to set up, and I could start writing background tasks quickly. If I adopt the second option, my questions are:
How well does Django Background Tasks perform (scalability-wise, in a production environment)?
Can I poll the tasks (after some time) in the DB to check a task's status?
What is the architecture of Django-Background-Tasks? I couldn't find any clear explanation of its architecture (or have I missed some resource?).
Again, coming back to the first point: how well does Django Background Tasks perform in production? (I'm asking about prior experience of using it in prod.)
Setting up Celery takes work (although less when using Redis). It's also a serious tool with almost a decade of investment and widespread industry adoption.
As for performance, the scaling behaviors of task systems backed by queues versus those backed by an RDBMS are well understood, but they may not be relevant to you, as "scalability" is a very subjective term. This thread provides some good framing on the subject and questions.
Comparing stars on GitHub (bg tasks' 3XX vs Celery's 13XXX), you should realize Django-Background-tasks has a smaller user base, and you're probably going to need to get into the internals to understand the architecture and precise mechanics. That shouldn't stop you – just be prepared to DIY when answers aren't forthcoming.
How well does Django Background Tasks perform? - This will depend on how and what you implement. One thing to note is that Django-Background-Tasks stores its queue in the database, whereas Celery can use Redis or RabbitMQ as its broker, so we will most probably see a considerable performance difference here.
Can I poll the tasks (after some time) in DB to check the task's status? - It's possible in Celery, and you may be able to find a solution by inspecting the internal code of Django-Background-Tasks. One more thing: we can abort a Celery task, which may not be possible in Django-Background-Tasks.
Architecture of Django-Background-tasks? Couldn't find any clear explanation about its architecture (Or have I missed some resource?) - It's a simple Django-based project. You can have a look at the code. It seems to be pretty straightforward.
Again coming to the first point, how well does Django Background tasks perform in production. - I haven't used it in production. But since Django-Background-Tasks is database-based and Celery can be configured to use Redis or RabbitMQ, I think Celery has a plus point here.
To me, this comparison seems like comparing a pistol with a high-end automatic machine gun. Both do the same job, but one is simple and straightforward, while the other is a little complicated but comes with lots of options and scope.
Choose based on your use case.
I have decided to use Django-Background-Tasks. Let me clarify my motivations.
The tasks that will be processed by Django-Background-Tasks don't need to be processed quickly. As the name states, they are background tasks. I accept delays.
The architecture of Django-Background-Tasks is very simple. When your code calls a method that is meant to be processed in the background, a task record is inserted into the Django-Background-Tasks tables in your database. The method you called is not actually executed; the call is proxied. You then have to run a separate process to execute the jobs, and your method is executed in that process.
The process that executes the jobs can be started by a cron entry on your server.
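To make the flow above concrete, here is a minimal standard-library sketch of the same idea. It is not the library's actual code or API: the `task` table, the `queued` decorator, `process_pending` and `send_report` are all hypothetical names, and an in-memory SQLite database stands in for the Django database.

```python
import json
import sqlite3

# Hypothetical stand-in for the task table kept in the project's database.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE task ("
    "id INTEGER PRIMARY KEY, name TEXT, args TEXT, done INTEGER DEFAULT 0)"
)

REGISTRY = {}  # task name -> the real function


def queued(func):
    """Replace func with a proxy that records the call instead of running it."""
    REGISTRY[func.__name__] = func

    def proxy(*args):
        db.execute(
            "INSERT INTO task (name, args) VALUES (?, ?)",
            (func.__name__, json.dumps(args)),
        )
        db.commit()

    return proxy


def process_pending():
    """What the separate runner process would do: execute every stored task."""
    rows = db.execute("SELECT id, name, args FROM task WHERE done = 0").fetchall()
    for task_id, name, args in rows:
        REGISTRY[name](*json.loads(args))
        db.execute("UPDATE task SET done = 1 WHERE id = ?", (task_id,))
    db.commit()
    return len(rows)


sent = []  # records which calls really ran


@queued
def send_report(user_id):
    sent.append(user_id)


send_report(42)        # only inserts a row; nothing has run yet
assert sent == []
process_pending()      # the runner executes the stored call
assert sent == [42]
```

Because every task is just a row, checking a task's status later comes down to querying that table (here, the hypothetical `done` column).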
Since this setup is so easy and works for me, I decided to use Django-Background-Tasks. But if I needed something more responsive and fast, I would use Celery, since it works in memory and has an active process that processes the jobs, which isn't the case with Django-Background-Tasks.
I have a fairly complex periodic task that needs to be offloaded from the Django context. django-celery-beat looks promising. While going through the celery-beat docs, I found this:
You have to ensure only a single scheduler is running for a schedule at a time, otherwise you’d end up with duplicate tasks. Using a centralized approach means the schedule doesn’t have to be synchronized, and the service can operate without using locks.
A typical production deployment will spawn a pool of worker-processes each running a django instance. Will that result in creation of multiple scheduler processes as well? Do I need to have some synchronisation logic?
Thanks for your time!
It does not.
You can dig into the issues page of their GitHub repo for confirmation. I think it's odd that the documentation doesn't call this out, but I suppose you have to assume that's how all Celery beat schedulers work unless they specify otherwise.
In theory, you could build your own synchronization, but it will probably be a better experience to use a different scheduler that has that functionality built in, like Heroku's redbeat: https://blog.heroku.com/redbeat-celery-beat-scheduler.
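For a sense of what "build your own synchronization" could look like, here is a rough sketch of a single-holder lock based on atomic file creation. It is only an illustration under strong assumptions: `SchedulerLock` and the `beat.lock` file name are made up, the lock only works between processes on one host, and it has no expiry, so a crashed scheduler would leave it held. Coordinating across hosts would need a shared store instead.

```python
import os
import tempfile


class SchedulerLock:
    """Single-holder lock based on atomic file creation (O_CREAT | O_EXCL)."""

    def __init__(self, path):
        self.path = path
        self.held = False

    def acquire(self):
        try:
            fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            return False  # another scheduler already holds the lock
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        os.close(fd)
        self.held = True
        return True

    def release(self):
        if self.held:
            os.remove(self.path)
            self.held = False


lock_path = os.path.join(tempfile.mkdtemp(), "beat.lock")
first = SchedulerLock(lock_path)
second = SchedulerLock(lock_path)

assert first.acquire() is True     # this instance may run the scheduler
assert second.acquire() is False   # a second instance must stay idle
first.release()
assert second.acquire() is True    # the lock can be taken over after release
second.release()
```

Each would-be scheduler calls `acquire()` at startup and only starts scheduling if it returns True.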
Edit to clarify my question:
I want to attach a Python service to uWSGI using this feature (I can't understand the examples), and I also want to be able to communicate results between them. Below I present some context and my first thought on the communication matter, hoping for some advice or another approach to take.
I have an already developed python application that uses multiprocessing.Pool to run on demand tasks. The main reason for using the pool of workers is that I need to share several objects between them.
On top of that, I want to have a flask application that triggers tasks from its endpoints.
I've read several questions here on SO looking for possible drawbacks of using flask with python's multiprocessing module. I'm still a bit confused but this answer summarizes well both the downsides of starting a multiprocessing.Pool directly from flask and what my options are.
This answer shows an uWSGI feature to manage daemon/services. I want to follow this approach so I can use my already developed python application as a service of the flask app.
One of my main problems is that I look at the examples and do not know what I need to do next. In other words, how would I start the python app from there?
Another problem is about the communication between the flask app and the daemon process/service. My first thought is to use flask-socketIO to communicate, but then, if my server stops I need to deal with the connection... Is this a good way to communicate between server and service? What are other possible solutions?
Note:
I'm well aware of Celery, and I intend to use it in the near future. In fact, I have an already developed node.js app, in which users perform actions that should trigger specific tasks from the (also) already developed Python application. The thing is, I need a production-ready version as soon as possible, and instead of modifying the Python application, which uses multiprocessing, I thought it would be faster to create a simple Flask server to communicate with node.js through HTTP. This way I would only need to implement a Flask app that instantiates the Python app.
Edit:
Why do I need to share objects?
Simply because creating the objects in question takes too long. Actually, the creation takes an acceptable amount of time if done once, but since I'm expecting (maybe) hundreds to thousands of simultaneous requests, having to load every object again is something I want to avoid.
One of the objects is a scikit classifier model, persisted in a pickle file, which takes 3 seconds to load. Each user can create several "job spots", and each one will take over 2k documents to be classified. Each document will be uploaded at an unknown point in time, so I need to have this model loaded in memory (loading it again for every task is not acceptable).
This is one example of a single task.
Edit 2:
I've asked some questions related to this project before:
Bidirectional python-node communication
Python multiprocessing within node.js - Prints on sub process not working
Adding a shared object to a manager.Namespace
As stated, but to clarify: I think the best solution would be to use Celery, but in order to quickly have a production-ready solution, I'm trying to use this uWSGI attach-daemon solution.
I can see the temptation to hang on to multiprocessing.Pool. I'm using it in production as part of a pipeline. But Celery (which I'm also using in production) is much better suited to what you're trying to do, which is distribute work across cores to a resource that's expensive to set up. Have N cores? Start N Celery workers, each of which can load (or maybe lazy-load) the expensive model as a global. When a request comes in to the app, launch a task (e.g., task = predict.delay(args)), wait for it to complete (e.g., result = task.get()), and return a response. You're trading a little bit of time learning Celery for not having to write a bunch of coordination code.
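The "lazy-load the expensive model as a global" part can be sketched with the standard library alone. Everything here is a placeholder: `load_model` stands in for unpickling the real classifier, and `predict` is a plain function that would be registered as a Celery task in an actual worker. The point is only that the load happens once per worker process, on first use.

```python
import functools
import time

LOAD_COUNT = 0  # only here to show that loading happens once


def load_model():
    """Hypothetical stand-in for unpickling the classifier (the slow step)."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    time.sleep(0.01)  # pretend this is the expensive load
    return lambda document: len(document) % 2  # dummy "classifier"


@functools.lru_cache(maxsize=None)
def get_model():
    """Load the model on first use and keep it for the life of the process."""
    return load_model()


def predict(document):
    """Body of what would be the task; a real worker would register it with Celery."""
    return get_model()(document)


results = [predict(doc) for doc in ["first doc", "ab", "third"]]
assert LOAD_COUNT == 1  # three predictions, one load
```

With N worker processes the model ends up loaded N times in total, once per process, rather than once per task.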
I am building a REST API with Flask-RESTPlus. One of my endpoints takes a file uploaded from the client and runs some analysis. The job takes up to 30 seconds. I don't want the job to block the main process, so the endpoint will return a response with 200 or 201 right away while the job is still running. Results will be saved to a database and retrieved later.
It seems I have two options for long-running jobs.
Threading
Task-queue
Threading is simpler. But the problem is that there is a limit on the number of threads for a Flask app. In a standalone Python app, I could use a queue for the threads. But this is a REST API, and each request is independent. I don't know if there is a way to maintain a global queue for that. So if the requests exceed the thread limit, the app won't be able to take more requests.
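For what it's worth, a global queue along these lines can be sketched with a module-level `concurrent.futures.ThreadPoolExecutor`, which queues jobs internally once all of its threads are busy. The names below (`analyze`, `submit_job`, `job_status`, `JOBS`) are hypothetical, and the sketch leaves Flask out entirely; the two helper functions are what the upload and status endpoints would call.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

# One executor for the whole process; jobs beyond max_workers wait in its queue.
EXECUTOR = ThreadPoolExecutor(max_workers=2)
JOBS = {}  # job id -> Future


def analyze(data):
    """Hypothetical stand-in for the 30-second analysis."""
    return len(data)


def submit_job(data):
    """What the upload endpoint would call before returning 201 with the job id."""
    job_id = uuid.uuid4().hex
    JOBS[job_id] = EXECUTOR.submit(analyze, data)
    return job_id


def job_status(job_id):
    """What a status endpoint would call."""
    future = JOBS[job_id]
    if not future.done():
        return {"state": "running"}
    return {"state": "finished", "result": future.result()}


job_ids = [submit_job(b"x" * n) for n in range(5)]  # more jobs than threads
EXECUTOR.shutdown(wait=True)  # only for this demo, so every job has finished
statuses = [job_status(job_id) for job_id in job_ids]
```

The trade-offs are real, though: queued jobs are lost if the server restarts, and if the app runs as several server processes, each one gets its own executor and its own `JOBS` dictionary.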
A task queue with Celery and Redis is probably a better option. But this is just a proof of concept, and the timeline is tight. Setting up Celery and Redis with Flask is not easy; I am having lots of trouble on my dev machine, which runs Windows. It will be deployed on AWS, which is fairly complex.
I wonder if there is a third option for this case?
I would HIGHLY recommend using Celery, as you have already mentioned in your post. It is built exactly for this use case. Their docs are really informative, and there is no shortage of examples online that can get you up and running quickly.
Additionally, I would say THIS would be an excellent first resource for you to start with.
Celery is a fantastic solution to this problem; I have used it quite successfully in the past to manage millions of jobs per day.
The only real downside is the initial learning curve and complexity of debugging when things go sour (it can happen, especially with millions of jobs).
I've got my Django project running well, and a separate background process which will collect data from various sources and store that data in an index.
I've got a model in a Django app called Sources which contains, essentially, a list of sources that data can come from! I've successfully managed to create a signal that is activated/called when a new entry is put in the Sources model.
My question is, is there a simple way that anybody knows of whereby I can send some form of signal/message to my background process indicating that the Sources model has been changed? Or should I just resort to polling for changes every x seconds, because it's so much simpler?
Many thanks for any help received.
It's unclear how you are running the background process you're talking about.
Anyway, I'd suggest that you use the Sources model directly in your background task. There are convenient ways to run the task without leaving the realm of Django (so that you have access to your models). You can use Celery [1], for example, or RQ [2].
Then you won't need to pass any messages; any changes to the Sources model will take effect the next time your task is run.
[1] Celery is an open source asynchronous task queue/job queue, it isn't hard to set up and integrates with Django well.
Celery: general introduction
Django with celery introduction
[2] RQ means "Redis Queue", it is ‘a simple Python library for queueing jobs and processing them in the background with workers’.
Introductory post
GitHub repository
Polling is probably the easiest if you don't need split-second latency.
If you do, however, then you'll probably want to look into either, say,
sending a UNIX signal (or other methods of IPC, depending on platform) to the process
having the background process have a simple listening socket that you just send, say, a byte to (which is, admittedly, a form of IPC), and that triggers the action you want to trigger
or some sort of task/message queue. Celery or ZeroMQ come to mind.
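As a rough illustration of the listening-socket option, here is a standard-library sketch in which both sides run in one script so it can be tried directly. In practice the listener would live in the background process and `notify_background_process` would be called from the Django side, for example from the signal handler on the Sources model. All names are hypothetical, and the port is picked by the OS rather than fixed.

```python
import socket
import threading

source_changed = threading.Event()

# Background-process side: listen locally (port 0 lets the OS pick a free port).
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen()
port = listener.getsockname()[1]


def wait_for_notification():
    """Block until a client connects and sends one byte, then flag the change."""
    conn, _ = listener.accept()
    with conn:
        if conn.recv(1):
            source_changed.set()


worker = threading.Thread(target=wait_for_notification, daemon=True)
worker.start()


def notify_background_process():
    """Django side: send a single byte to say that Sources has changed."""
    with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
        client.sendall(b"\x01")


notify_background_process()
worker.join(timeout=5)
listener.close()
assert source_changed.is_set()
```

A real background process would loop on `accept()` and reload its list of sources each time a byte arrives, instead of handling a single notification.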
I'm writing a Python application that needs both concurrency and asynchronicity. I've had a few recommendations each for Twisted and Celery, but I'm having trouble determining which is the better choice for this application (I have no experience with either).
The application (which is not a web app) primarily centers around making SOAP calls out to various third party APIs. To process a given piece of data, I'll need to call several APIs sequentially. And I'd like to be able to have a pool of "workers" for each of these APIs so I can make more than 1 call at a time to each API. Nothing about this should be very cpu-intensive.
More specifically, an external process will add a new "Message" to this application's database. I will need a job that watches for new messages, and then pushes them through the Process. The process will contain 4-5 steps that need to happen in order, but can happen completely asynchronously. Each step will take the message and act upon it in some way, typically adding details to the message. Each subsequent step will require the output from the step that precedes it. For most of these Steps, the work involved centers around calling out to a third-party API typically with a SOAP client, parsing the response, and updating the message. A few cases will involve the creation of a binary file (harder to pickle, if that's a factor). Ultimately, once the last step has completed, I'll need to update a flag in the database to indicate the entire process is done for this message.
Also, since each step will involve waiting for a network response, I'd like to increase overall throughput by making multiple simultaneous requests at each step.
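To pin down the shape of the workload being described, here is a framework-neutral sketch using only the standard library's thread pools. It is neither Celery nor Twisted code, and the step functions (`lookup_customer`, `fetch_quote`, `mark_done`) are invented placeholders for the SOAP calls. It only shows ordered steps per message, a separate worker pool per step, and several messages in flight at once.

```python
from concurrent.futures import ThreadPoolExecutor


def lookup_customer(message):
    """Hypothetical step 1; a real step would make a SOAP call here."""
    return {**message, "customer": f"customer-{message['id']}"}


def fetch_quote(message):
    """Hypothetical step 2; it needs the output of step 1."""
    return {**message, "quote": len(message["customer"])}


def mark_done(message):
    """Final step: stands in for updating the flag in the database."""
    return {**message, "done": True}


# One pool per step, so each third-party API gets its own concurrency limit.
STEPS = [
    (lookup_customer, ThreadPoolExecutor(max_workers=4)),
    (fetch_quote, ThreadPoolExecutor(max_workers=2)),
    (mark_done, ThreadPoolExecutor(max_workers=1)),
]


def run_pipeline(message):
    """Push one message through every step in order, each on that step's pool."""
    for step, pool in STEPS:
        message = pool.submit(step, message).result()
    return message


# A separate driver pool lets several messages be in flight at once.
with ThreadPoolExecutor(max_workers=8) as driver:
    processed = list(driver.map(run_pipeline, [{"id": i} for i in range(5)]))

for _, pool in STEPS:
    pool.shutdown()
```

Since the work is network-bound rather than CPU-bound, threads are enough for the sketch; retries, rate limiting and persistence are exactly the parts a framework would add.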
Is either Celery or Twisted a more generally appropriate framework here? If they'll both solve the problem adequately, are there pros/cons to using one vs the other? Is there something else I should consider instead?
Is either Celery or Twisted a more generally appropriate framework here?
Depends on what you mean by "generally appropriate".
If they'll both solve the problem adequately, are there pros/cons to using one vs the other?
Not an exhaustive list.
Celery Pros:
Ready-made distributed task queue, with rate-limiting, re-tries, remote workers
Rapid development
Comparatively shallow learning curve
Celery Cons:
Heavyweight: multiple processes, external dependencies
Have to run a message passing service
Application "processes" will need to fit Celery's design
Twisted Pros:
Lightweight: single process and not dependent on a message passing service
Rapid development (for those familiar with it)
Flexible
Probably faster, no "internal" message passing required.
Twisted Cons:
Steep learning curve
Not necessarily as easy to add processing capacity later.
I'm familiar with both, and from what you've said, if it were me I'd pick Twisted.
I'd say you'll get it done quicker using Celery, but you'd learn more while doing it by using Twisted. If you have the time and inclination to follow the steep learning curve, I'd recommend you do this in Twisted.
Celery allows you to use the asynchronous behavior of various async libraries such as gevent and eventlet, so you can have the best of both worlds.
Example using eventlet
https://github.com/celery/celery/tree/master/examples/eventlet
Example using gevent
https://github.com/celery/celery/tree/master/examples/gevent