Django app with long running calculations

I'm creating a Django web app which features potentially very long running calculations of up to an hour. The calculations are simulation models built in Python. The web app sends inputs to the simulation model and after some time receives the answer. Also, the user should be able to close his browser after starting the simulation and if he logs in the next day the results should be there.
From my research it seems like I can use Celery together with Redis/RabbitMQ as broker to run the calculation in the background. Ideally I would want to display progress updates using ajax, so that the page updates without a user refresh when the calculation is complete.
I want to host the app on Heroku, so the calculation will also be running on the Heroku server. How hard will it be if I want to move the calculation engine to another server? It might be useful if the calculation engine is on a different server.
So my question is: is the approach above a good one, or what other options should I look at?

I think Celery is a good approach. I'm not sure whether you need Redis/RabbitMQ as a broker or could just use MySQL; it depends on your tasks. Celery workers can run on different servers, so Celery supports distributed queues.
Another approach: implement your own queue engine in Python, with the database as a broker and a cron job to execute the tasks. But that could be a dirty way to do it, with a lot of pain and bugs.
So I think that Celery is the nicer way to do it.
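For comparison, the database-as-broker alternative mentioned in this answer could look roughly like the sketch below. It uses only the standard library (sqlite3 in place of MySQL), and every name in it (`jobs` table, `enqueue`, `run_pending_jobs`, `simulate`) is hypothetical, not part of Django or Celery. The web app would call `enqueue` and return immediately, a cron job would call `run_pending_jobs`, and an ajax endpoint could poll `get_job` for progress.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id       INTEGER PRIMARY KEY AUTOINCREMENT,
    inputs   TEXT NOT NULL,
    status   TEXT NOT NULL DEFAULT 'pending',
    progress INTEGER NOT NULL DEFAULT 0,
    result   TEXT
)
"""


def init_db(conn):
    conn.execute(SCHEMA)
    conn.commit()


def enqueue(conn, inputs):
    """Web app side: store the inputs and return a job id right away."""
    cur = conn.execute("INSERT INTO jobs (inputs) VALUES (?)", (json.dumps(inputs),))
    conn.commit()
    return cur.lastrowid


def simulate(inputs, report_progress):
    """Hypothetical stand-in for the real simulation model."""
    values = inputs["values"]
    total = 0
    for i, value in enumerate(values, start=1):
        total += value * value
        report_progress(int(100 * i / len(values)))
    return {"sum_of_squares": total}


def run_pending_jobs(conn):
    """Worker side: called periodically (for example from cron)."""
    rows = conn.execute(
        "SELECT id, inputs FROM jobs WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    for job_id, raw_inputs in rows:
        conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (job_id,))
        conn.commit()

        def report_progress(percent, job_id=job_id):
            # Persist progress so a status page can read it while the job runs.
            conn.execute("UPDATE jobs SET progress = ? WHERE id = ?", (percent, job_id))
            conn.commit()

        try:
            result = simulate(json.loads(raw_inputs), report_progress)
            conn.execute(
                "UPDATE jobs SET status = 'done', result = ? WHERE id = ?",
                (json.dumps(result), job_id),
            )
        except Exception as exc:
            conn.execute(
                "UPDATE jobs SET status = 'failed', result = ? WHERE id = ?",
                (json.dumps({"error": str(exc)}), job_id),
            )
        conn.commit()


def get_job(conn, job_id):
    """Status lookup: works the next day too, because results live in the database."""
    row = conn.execute(
        "SELECT status, progress, result FROM jobs WHERE id = ?", (job_id,)
    ).fetchone()
    if row is None:
        return None
    status, progress, result = row
    return {
        "status": status,
        "progress": progress,
        "result": json.loads(result) if result else None,
    }
```

Note that this sketch does nothing to stop two overlapping cron runs from picking up the same job, and it has no retries or timeouts. Those are exactly the kinds of problems a task queue such as Celery already handles.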

If you are running on Heroku, you want django-rq, not Celery. See https://devcenter.heroku.com/articles/python-rq.

Related

Using celery to send tasks from component A to component B

The technology I would like to use in this example is Celery for queueing and python for component implementation.
Imagine a simple project that consists of two components. Component 1 is a web app that connects to an API and gathers data. Component 2 is a processor that can then process the data. When the web app has gotten a piece of data from the API, it is supposed to put a task containing the just-crawled data into a task queue, which is then consumed by the processor to process the data.
Whether or not this is a sensible way to go about a task like this is debatable and not the point of my question.
My question is this: the processing tasks are defined within the processor, since they state which processing function is to be executed, and the definition of that function is obviously within the processor. Given that the web app doesn't have access to the task definition, how does it communicate the task to the processor?
Do you have to hold a copy of the source code of the processor within the web app?
Do you make the processor a dependency of the web app?
What is the best practice approach to handle such a scenario?

What you are describing is probably one of the most common use-cases for Celery. Just look how many people are asking Django/Flask + Celery questions here on StackOverflow... If you are a Django user, there is an entire section in the Celery documentation describing how to do exactly what you want. Things should be similar with other frameworks.
Do you have to hold a copy of the source code of the processor within the web app?
As far as I know you do not have to (I do not use any web framework) but it could be that you do need to because of some deeper integration with Celery. If your web application knows the Celery task name, and its parameters, it can schedule it to run without actually having access to the Python code. This is accomplished using send_task(task_name, ...).
Do you make the processor a dependency of the web app?
As I wrote above, there are several ways to use it. If you want tighter integration, then yes. If you just want to run a task and get the result using send_task(), then your web application should only depend on Celery.
What is the best practice approach to handle such a scenario?
Follow the Django guide. I advise you to run Celery independently first and run some tasks, just so you learn the basic principles of how it distributes the work, etc.
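To illustrate why the web app needs only the task name and parameters, here is a minimal sketch of name-based dispatch using only the standard library. It is not how Celery is implemented, just the same idea in miniature: the sender serializes a name plus arguments, and only the receiver maps that name to code. The task name `processor.tasks.word_count` and the helpers `build_message` and `handle_message` are made up for this example.

```python
import json

# --- Processor side: owns the task implementations -----------------------
TASK_REGISTRY = {}


def task(name):
    """Register a function under a task name (hypothetical helper)."""
    def decorator(func):
        TASK_REGISTRY[name] = func
        return func
    return decorator


@task("processor.tasks.word_count")
def word_count(text):
    return len(text.split())


def handle_message(raw_message):
    """Look up the task by name and call it with the transmitted arguments."""
    message = json.loads(raw_message)
    func = TASK_REGISTRY[message["task"]]
    return func(*message["args"], **message["kwargs"])


# --- Web app side: knows only the name and the parameters ----------------
def build_message(task_name, *args, **kwargs):
    """Serialize a task request without importing any processor code."""
    return json.dumps({"task": task_name, "args": list(args), "kwargs": kwargs})
```

With Celery itself, the web app side would correspond to a call along the lines of `app.send_task("processor.tasks.word_count", args=["some text"])`, where `app` is a Celery instance configured with the same broker as the processor.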

gunicorn and/or celery: What is the way get the best out of both?

I have a machine learning application which uses Flask to expose an API (for production this is not a good idea, but even if I use Django in the future, the idea of the question shouldn't change).
The main problem is how to serve multiple requests to my app. A few months back, Celery was added to get around this problem. The number of Celery workers spawned is equal to the number of cores in the machine. For very few users this looked fine, and it was in production for some time.
When the number of concurrent users increased, it was evident that we should do performance testing on it. It turns out it can handle 20 users on a 30 GB, 8-core machine, without authentication and without any front-end. That does not look like a good number.
I didn't know there were things like application servers, web servers and model servers. When googling for this problem, gunicorn came up as a good application server for Python applications.
Should I use gunicorn or any other application server along with Celery, and why?
If I remove Celery and only use gunicorn with the application, can I achieve concurrency? I have read somewhere that Celery is not good for machine learning applications.
What are the purposes of gunicorn and Celery? How can we get the best out of both?
Note: Main goal is to maximize concurrency. While serving in production authentication will be added. One front-end application might come into action in between in production.

There is no shame in flask. If in fact you just need a web API wrapper, flask is probably a much better choice than django (simply because django is huge and you'd be using only a fraction of its capability).
However, your concurrency problems are apparently stemming from the fact that you are doing some heavy-duty processing for each request. There is simply no way around that; if you require a certain amount of computational resources per request, you can't magic those up. From here on, it's a juggling act.
If you want a guaranteed response immediately, you need to have as many workers as potential simultaneous requests. This may involve load balancing over multiple servers, if you can't scrounge up enough resources on one server. (cue gunicorn, a web application server, responsible for accepting connections and then distributing them to multiple application processes.)
If you are okay with not getting an immediate response, you can let stuff queue up. (cue celery, a task queue, which worker processes can use to retrieve the next thing to be done, and deposit results). This works best if you don't need a response in the same request-response cycle; e.g. you submit a job from client, and they only get an acknowledgement that the job has been received; you would need a second request to ask about the status of the job, and possibly the results of the job if it is finished.
Alternately, instead of Flask you could use websockets or Tornado, to push out the response to the client when it is available (as opposed to user polling for results, or waiting on a live HTTP connection and taking up a server process).
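The submit-then-poll pattern described above can be sketched without any framework. The following uses `concurrent.futures` from the standard library as a stand-in for a task queue; `submit_job`, `job_status` and `heavy_computation` are hypothetical names, and in a real app the first two would sit behind two separate HTTP endpoints.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

# A fixed-size pool: at most 4 jobs run at once, the rest wait in its queue.
executor = ThreadPoolExecutor(max_workers=4)
jobs = {}  # job id -> Future (in-memory only; lost on restart)


def heavy_computation(n):
    """Hypothetical stand-in for the expensive per-request work."""
    return sum(i * i for i in range(n))


def submit_job(n):
    """First request: queue the work and acknowledge immediately."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = executor.submit(heavy_computation, n)
    return {"job_id": job_id, "status": "accepted"}


def job_status(job_id):
    """Second request: report whether the job is finished, and its result."""
    future = jobs.get(job_id)
    if future is None:
        return {"status": "unknown"}
    if not future.done():
        return {"status": "pending"}
    if future.exception() is not None:
        return {"status": "failed", "error": str(future.exception())}
    return {"status": "finished", "result": future.result()}
```

This only shows the shape of the request flow. Threads in one process do not add CPU capacity for compute-heavy work in CPython, and the job table disappears on restart, which is why separate worker processes and a result backend (as with Celery) are the usual choice in production.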

How do I run background job in Flask without threading or task-queue

I am building a REST API with Flask-RESTPlus. One of my endpoints takes a file uploaded by the client and runs some analysis. The job takes up to 30 seconds. I don't want the job to block the main process, so the endpoint should return a response with 200 or 201 right away while the job is still running. Results will be saved to a database and retrieved later.
It seems I have two options for long-running jobs.
Threading
Task-queue
Threading is relatively simple. But the problem is that there is a limit on the number of threads for a Flask app. In a standalone Python app, I could use a queue for the threads. But this is a REST API, and each request is independent. I don't know if there is a way to maintain a global queue for that. So if the requests exceed the thread limit, it won't be able to take more requests.
A task queue with Celery and Redis is probably the better option. But this is just a proof of concept, and the timeline is kind of tight. Setting up Celery and Redis with Flask is not easy; I am having lots of trouble on my dev machine, which runs Windows. It will be deployed on AWS, which is kind of complex.
I wonder if there is a third option for this case?

I would HIGHLY recommend using Celery, as you have already mentioned in your post. It is built exactly for this use case. Their docs are really informative and there is no shortage of examples online that can get you up and running quickly.
Additionally, I would say THIS would be an excellent first resource for you to start with.

Celery is a fantastic solution to this problem. I have used it quite successfully in the past to manage millions of jobs per day.
The only real downside is the initial learning curve and complexity of debugging when things go sour (it can happen, especially with millions of jobs).

Status of Python Celery tasks

I'm wondering what kind of options there are for monitoring celery tasks from a browser, after they have been deployed to a worker?
My current application stack is a flask app running inside twisted, using celery to run dozens to thousands of small background tasks (updating metadata in a repository, creating image derivatives, etc.) I'm envisioning using ajax long-polling to monitor the status of the celery tasks initiated by the user. I'm using redis for the backend broker and results.
I see celery has some command line ways to monitor tasks, or flower for a web dashboard. But if I wanted to see more detailed status from a particular task sent to celery, would it make more sense for that task to print / write to a log file, then long-poll that file for changes from the flask front-end?
At this point a user can say, "update these 10,000 items", the tasks are sent to celery, and the front-end very quickly says, "job sent!". And the tasks do complete. But I'd like to have the user navigate to "/status" and see the status of those 10,000 small jobs - even a scrolling log file would probably work.
Any suggestions would be greatly appreciated. Took a lot of head scratching to make it this far sketching things out, but I'm spinning my wheels figuring out exactly WHAT to long-poll from the user front-end.

Try Jobtastic, which extends Celery.
From project description:
Jobtastic gives you goodies like:
Easy progress estimation/reporting
Job status feedback
Helper methods for gracefully handling a dead task broker (delay_or_eager and delay_or_fail)
Super-easy result caching
Thundering herd avoidance
Integration with a celery jQuery plugin for easy client-side progress display
Memory leak detection in a task run

Jobtastic was a great idea, but not quite what worked for us. In the end, we decided to create an incrementing job number (stored in Redis alongside the results and broker), push all Celery task IDs associated with that job number into a Python object, then pickle that object and store it in Redis. We can then use it later to see whether the entire "job" is complete, or what its status is. For our purposes, it works just lovely.
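The bookkeeping described here might look something like the sketch below. It is not the poster's actual code: a plain dict stands in for Redis, `concurrent.futures` stands in for Celery tasks and their results, and all names (`start_job`, `job_status`, `update_item`, the `job:<n>` key layout) are assumptions made for illustration.

```python
import pickle
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

store = {}          # stand-in for Redis: key -> bytes or int
task_futures = {}   # stand-in for Celery results: task id -> Future
executor = ThreadPoolExecutor(max_workers=8)


def next_job_number():
    """Incrementing job number (with Redis this would be an atomic INCR)."""
    store["job_counter"] = store.get("job_counter", 0) + 1
    return store["job_counter"]


def update_item(item):
    """Hypothetical small background task."""
    return item * 2


def start_job(items):
    """Launch one task per item and record the task ids under one job number."""
    job_number = next_job_number()
    task_ids = []
    for index, item in enumerate(items):
        task_id = f"job{job_number}-task{index}"
        task_futures[task_id] = executor.submit(update_item, item)
        task_ids.append(task_id)
    # Pickle the list of task ids and store it under the job number.
    store[f"job:{job_number}"] = pickle.dumps(task_ids)
    return job_number


def job_status(job_number):
    """Aggregate the state of every task in the job; suitable for /status polling."""
    task_ids = pickle.loads(store[f"job:{job_number}"])
    counts = Counter(
        "done" if task_futures[task_id].done() else "pending" for task_id in task_ids
    )
    return {
        "total": len(task_ids),
        "done": counts["done"],
        "pending": counts["pending"],
        "complete": counts["done"] == len(task_ids),
    }
```

A "/status" view could return the dictionary from `job_status` as JSON, so the front-end polls one small summary instead of 10,000 individual task states.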

Do I need celery when I am using gevent?

I am working on a Django web app that has functions (e.g. sync_files()) that take a long time to return. When I use gevent, my app does not block while sync_files() runs, and other clients can connect and interact with the web app just fine.
My goal is to have the webapp responsive to other clients and not block. I do not expect a zillion users to connect to my webapp (perhaps max 20 connections), and I do not want to set this up to become the next twitter. My app is running on a vps, so I need something light weight.
So in my case listed above, is it redundant to use celery when I am using gevent? Is there a specific advantage to using celery? I prefer not to use celery since it is yet another service that will be running on my machine.
edit: I found out that Celery can run its worker pool on gevent. Now I am a little more unsure about the relationship between gevent and Celery.

In short, you do need Celery.
Even if you use gevent and have concurrency, the problem becomes the request timeout. Let's say your task takes 10 minutes to run, while a typical request timeout is at most about a minute. If you trigger the task directly within a view, the server will start processing it, but after a minute the client (browser) will probably drop the connection, since it will think the server is offline. As a result, your data can become corrupt, since there is no guarantee about what happens when the connection closes. Celery solves this because it hands the task to a background process, which runs it independently of the view. So the user gets the view response right away, and at the same time the server starts processing the task. That is the correct pattern for handling any scenario that requires lots of processing.
