What is the use of Celery in Python?

I am confused about Celery. For example, I want to load a data file and it takes 10 seconds to load without Celery. With Celery, how will the user benefit? Will it take the same time to load the data?

Celery, and similar systems like Huey, are made to help us distribute (offload) work that normally can't execute concurrently on a single machine, or that would cause significant performance degradation if it did. The key word here is DISTRIBUTED.
You mentioned downloading a file. If it is a single file you need to download, and that is all, then you do not need Celery. But how about a more complex scenario: you need to download 100,000 files? Or more complex still: those 100,000 files need to be parsed, and the parsing process is CPU intensive?
Moreover, Celery will help you with retrying failed tasks, logging, monitoring, etc.
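As a rough illustration (not anything from the question itself), fanning that work out with Celery might look like the sketch below, assuming a Redis broker; process_file and the URLs are made-up names:

```python
from celery import Celery
import requests

app = Celery("files", broker="redis://localhost:6379/0")

@app.task(bind=True, max_retries=3, default_retry_delay=30)
def process_file(self, url):
    """Download one file and do the CPU-heavy parsing for it in a worker."""
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
    except requests.RequestException as exc:
        # Celery re-queues the task instead of losing the work.
        raise self.retry(exc=exc)
    return len(resp.content)  # stand-in for the real parsing step

# Queue one task per file; any number of workers on any number of machines
# can consume the queue:
# for url in urls:
#     process_file.delay(url)
```

That fan-out across worker processes (and machines) is what the DISTRIBUTED part buys you.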

Normally, the user has to wait for the data file to finish loading on the server. With Celery, the operation is performed on the server in the background, and the user does not have to wait for it. Even if the app crashes, the task remains queued.
Celery will keep track of the work you send to it in a broker/result backend such as Redis or RabbitMQ. This keeps the state out of your app server's process, which means that even if your app server crashes, your job queue will still remain. Celery also allows you to track tasks that fail.
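To make that concrete, here is a minimal sketch assuming a Redis broker and result backend (the URLs and file path are placeholders): the task id is all you need to look the state up again later, even from a different process or after a restart.

```python
from celery import Celery
from celery.result import AsyncResult

app = Celery("proj",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")

@app.task
def load_data_file(path):
    # the slow ~10 second load runs in a worker process, not the web process
    return {"path": path, "loaded": True}

# In the request handler: enqueue and return immediately.
result = load_data_file.delay("/tmp/data.csv")
task_id = result.id  # persist this (session, database, ...) for later lookups

# Later, from any process that can reach Redis:
print(AsyncResult(task_id, app=app).state)   # e.g. PENDING, SUCCESS, FAILURE
```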

Related

Number and type of Gunicorn workers and file handling in Flask app

I am new to Python and have just started using Flask to build a web application that might need to handle 5 or 6 users at the same time. My Flask app will trigger some shell scripts (based on user input) which might take 1 or 2 minutes to process, and it will also mail the user about the status of their request. It also writes logs to a particular .log file for each day. While reading about this, I found claims that I don't need to worry about handling multiple requests because the GIL will not allow more than one instance of my application to run, even if I have 10 instances running at the same time. Now, as mentioned in the Gunicorn docs, sync workers can handle only one request at a time, so it seems I need to use async workers, but will that affect my application, for example writing to the log file and executing the shell scripts? Do I need to implement locking on the log file, and will it affect the shell script execution?
GIL. It seems you don't need to care about the GIL. The GIL is about multithreading limitations; in your case, the workers are independently running instances of your application. And even in the case of multithreading, the GIL is released when you call out to your shell script, so that wouldn't be a problem either.
Async workers. As I understand your design, they will not help you, because you need the actions (running the shell script and writing the log) to happen sequentially, one after the other.
Locking. If each user request runs its own shell script process and writes to its own log file, then you don't need locking.
The core thing is that your application will be run by Gunicorn multiple times (the worker count), simultaneously and independently. The number of workers depends on how many users you want to serve and on the type of your load, but a value like 2 * cpu_cores_count is almost always a good fit. I would recommend the sync worker type.
I also wrote a blog post about production-ready Gunicorn configuration; you may find it useful.
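For reference, a gunicorn.conf.py along those lines might look like the sketch below; the exact numbers are assumptions to adapt, not recommendations from the answer above.

```python
# gunicorn.conf.py
import multiprocessing

bind = "0.0.0.0:8000"
worker_class = "sync"                          # sync workers, as recommended
workers = multiprocessing.cpu_count() * 2 + 1  # roughly 2 * CPU cores
timeout = 180                                  # the shell scripts can take 1-2 minutes
accesslog = "-"                                # access log to stdout
```

Start it with something like gunicorn -c gunicorn.conf.py myapp:app, where myapp:app points at your Flask application object.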

gunicorn and/or celery: What is the way get the best out of both?

I have a machine learning application which uses Flask to expose an API (for production this is not a good idea, but even if I move to Django in the future, the idea behind the question shouldn't change).
The main problem is how to serve multiple requests to my app. A few months back, Celery was added to get around this problem. The number of Celery workers spawned equals the number of cores on the machine. For very few users this looked fine and it was in production for some time.
When the number of concurrent users increased, it became evident that we should do performance testing. It turns out the app can handle 20 users on a 30 GB, 8-core machine, without authentication and without any front-end, which does not look like a good number.
I didn't know there were such things as application servers, web servers, and model servers. When googling this problem, Gunicorn came up as a good application server for Python applications.
Should I use Gunicorn, or some other application server, along with Celery, and why?
If I remove Celery and only use Gunicorn with the application, can I achieve concurrency? I have read somewhere that Celery is not good for machine learning applications.
What are the purposes of Gunicorn and Celery, and how can we get the best out of both?
Note: the main goal is to maximize concurrency. Authentication will be added when serving in production, and a front-end application might also come into play.
There is no shame in Flask. If in fact you just need a web API wrapper, Flask is probably a much better choice than Django (simply because Django is huge and you'd be using only a fraction of its capability).
However, your concurrency problems are apparently stemming from the fact that you are doing some heavy-duty processing for each request. There is simply no way around that; if you require a certain amount of computational resources per request, you can't magic those up. From here on, it's a juggling act.
If you want a guaranteed response immediately, you need to have as many workers as potential simultaneous requests. This may involve load balancing over multiple servers, if you can't scrounge up enough resources on one server. (cue gunicorn, a web application server, responsible for accepting connections and then distributing them to multiple application processes.)
If you are okay with not getting an immediate response, you can let stuff queue up. (cue celery, a task queue, which worker processes can use to retrieve the next thing to be done, and deposit results). This works best if you don't need a response in the same request-response cycle; e.g. you submit a job from client, and they only get an acknowledgement that the job has been received; you would need a second request to ask about the status of the job, and possibly the results of the job if it is finished.
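A bare-bones sketch of that submit-then-poll pattern with Flask and Celery; the endpoints, the make_prediction task, and the Redis URLs are illustrative assumptions, not your actual code:

```python
from celery import Celery
from celery.result import AsyncResult
from flask import Flask, jsonify, request

flask_app = Flask(__name__)
celery_app = Celery("ml", broker="redis://localhost:6379/0",
                    backend="redis://localhost:6379/1")

@celery_app.task
def make_prediction(payload):
    # CPU-heavy model inference runs in a Celery worker, not in the web process.
    return {"prediction": sum(payload.get("features", []))}  # placeholder

@flask_app.route("/jobs", methods=["POST"])
def submit_job():
    result = make_prediction.delay(request.get_json(force=True))
    return jsonify({"job_id": result.id}), 202          # acknowledgement only

@flask_app.route("/jobs/<job_id>")
def job_status(job_id):
    res = AsyncResult(job_id, app=celery_app)
    body = {"state": res.state}
    if res.successful():
        body["result"] = res.result
    return jsonify(body)
```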
Alternatively, instead of plain Flask you could use WebSockets or Tornado to push the response out to the client when it is available (as opposed to the user polling for results, or waiting on a live HTTP connection and tying up a server process).

Django Parallel Processing

I have a simple Django project.
Each time a user hits the homepage, some operations are performed, and the view is generated based on them. The problem is that these operations sometimes take a long time, depending on network connectivity. If a new user hits the homepage in the meantime, they have to wait for the previous user's request to be serviced before the page gets rendered.
I found that Celery is used for task scheduling and queuing, but I wonder if Celery is what I need. I need each user's request to be processed independently, not queued.
My project is a single-app project and will receive a maximum of 100 users at a time.
Thanks.
If the long process needs to be done in order to serve the request and generate the proper response then you cannot use Celery.
The debug web server that ships with Django is a multi-threaded, single-process server, but it is really very limited and should not be used in production.
If you use gunicorn or other wsgi servers you can run your application in multiple processes but you will hit the limit quickly if you're doing heavy processing.
In my opinion, the solution is to change the way you're processing things: either prepare the results ahead of time, or serve the request and do the processing in the background while showing the user a "Please wait..." message; here you can use Celery to do the processing.
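For the "prepare ahead" option, one possible sketch is a periodic Celery task (run by celery beat) that refreshes the slow data in the background so the view only reads a cached value. The task name, cache key, and 5-minute interval are assumptions, and it presumes a configured Django project and cache:

```python
from celery import Celery
from django.core.cache import cache

app = Celery("proj", broker="redis://localhost:6379/0")

app.conf.beat_schedule = {
    "refresh-homepage-data": {
        "task": "tasks.refresh_homepage_data",
        "schedule": 300.0,   # every 5 minutes; run with `celery -A proj beat`
    },
}

@app.task(name="tasks.refresh_homepage_data")
def refresh_homepage_data():
    data = fetch_data_over_the_network()     # the slow, network-bound part
    cache.set("homepage_data", data, timeout=None)

def fetch_data_over_the_network():
    return {"status": "ok"}                  # placeholder for the real work

# The view then becomes fast and independent per request:
# def home(request):
#     return render(request, "home.html", {"data": cache.get("homepage_data")})
```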
The other solution would be to use an event-based web server like Twisted, cyclone, or others.

Status of Python Celery tasks

I'm wondering what kind of options there are for monitoring celery tasks from a browser, after they have been deployed to a worker?
My current application stack is a flask app running inside twisted, using celery to run dozens to thousands of small background tasks (updating metadata in a repository, creating image derivatives, etc.) I'm envisioning using ajax long-polling to monitor the status of the celery tasks initiated by the user. I'm using redis for the backend broker and results.
I see celery has some command line ways to monitor tasks, or flower for a web dashboard. But if I wanted to see more detailed status from a particular task sent to celery, would it make more sense for that task to print / write to a log file, then long-poll that file for changes from the flask front-end?
At this point a user can say, "update these 10,000 items", the tasks are sent to celery, and the front-end very quickly says, "job sent!". And the tasks do complete. But I'd like to have the user navigate to "/status" and see the status of those 10,000 small jobs - even a scrolling log file would probably work.
Any suggestions would be greatly appreciated. Took a lot of head scratching to make it this far sketching things out, but I'm spinning my wheels figuring out exactly WHAT to long-poll from the user front-end.
Try Jobtastic, which extends Celery.
From project description:
Jobtastic gives you goodies like:
Easy progress estimation/reporting
Job status feedback
Helper methods for gracefully handling a dead task broker (delay_or_eager and delay_or_fail)
Super-easy result caching
Thundering herd avoidance
Integration with a celery jQuery plugin for easy client-side progress display
Memory leak detection in a task run
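Loosely adapted from Jobtastic's documented pattern, a task using those hooks could look roughly like this; verify the class attributes against the Jobtastic version you install, and note that update_metadata is a made-up placeholder:

```python
from jobtastic import JobtasticTask

class UpdateItemsTask(JobtasticTask):
    significant_kwargs = [("item_ids", str)]   # used for result caching
    herd_avoidance_timeout = 60                # thundering-herd protection, in seconds
    cache_duration = 0                         # don't reuse results

    def calculate_result(self, item_ids, **kwargs):
        total = len(item_ids)
        for done, item_id in enumerate(item_ids, start=1):
            update_metadata(item_id)           # placeholder for the repository update
            self.update_progress(done, total)  # feeds the progress/status UI
        return {"updated": total}

def update_metadata(item_id):
    pass  # placeholder for the real metadata update
```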
Jobtastic was a great idea, but not quite what worked for us. In the end, we decided to create an incrementing job number (stored in Redis, alongside the results and the broker), push all of the Celery task ids associated with that job number into a Python object, then pickle that object and store it in Redis. We can then use it later to see whether the entire "job" is complete, or what its status is. For our purposes it works just lovely.
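A simplified sketch of that idea, using a plain Redis list instead of a pickled object for brevity; the key names, Redis URL, and the tasks module with its update_item task are assumptions:

```python
import redis
from celery.result import AsyncResult
from tasks import app, update_item   # hypothetical Celery app and task module

r = redis.Redis.from_url("redis://localhost:6379/2")

def submit_job(item_ids):
    """Queue one Celery task per item and record their ids under a job number."""
    job_number = r.incr("job:counter")
    key = f"job:{job_number}:task_ids"
    for item_id in item_ids:
        result = update_item.delay(item_id)
        r.rpush(key, result.id)
    return job_number

def job_progress(job_number):
    """What the /status view (or its long-poll endpoint) would return."""
    task_ids = [t.decode() for t in r.lrange(f"job:{job_number}:task_ids", 0, -1)]
    done = sum(1 for t in task_ids if AsyncResult(t, app=app).ready())
    return {"total": len(task_ids), "done": done}
```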

Do I need celery when I am using gevent?

I am working on a Django web app that has functions (say, for example, sync_files()) that take a long time to return. When I use gevent, my app does not block while sync_files() runs, and other clients can connect and interact with the webapp just fine.
My goal is to keep the webapp responsive to other clients and not block. I do not expect a zillion users to connect to my webapp (perhaps 20 connections at most), and I do not want to set this up to become the next Twitter. My app is running on a VPS, so I need something lightweight.
So in the case described above, is it redundant to use Celery when I am using gevent? Is there a specific advantage to using Celery? I would prefer not to use Celery, since it is yet another service that will be running on my machine.
Edit: I found out that Celery can run its worker pool on gevent, which leaves me a little more unsure about the relationship between gevent and Celery.
In short, you do need Celery.
Even if you use gevent and have concurrency, the problem becomes the request timeout. Let's say your task takes 10 minutes to run, while the typical request timeout is around a minute. If you trigger the task directly within a view, the server will start processing it, but after a minute the client (browser) will probably drop the connection because it thinks the server is unresponsive. As a result, your data can become corrupt, since you cannot guarantee what will happen when the connection closes. Celery solves this by running the task in a background worker process, independent of the view: the user gets the view response right away, and at the same time the server starts processing the task. That is the correct pattern for handling any scenario that requires a lot of processing.
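A minimal sketch of that pattern using the sync_files() example from the question (the Redis URLs are assumptions): the view enqueues the task and returns immediately, and the worker itself can run on a gevent pool.

```python
from celery import Celery
from django.http import JsonResponse

app = Celery("proj",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")

@app.task
def sync_files():
    # the long-running synchronisation happens here, outside the request cycle
    return "done"

def sync_view(request):
    result = sync_files.delay()   # returns in milliseconds
    return JsonResponse({"task_id": result.id, "status": "queued"})

# The worker can use the gevent pool if the task is I/O bound, e.g.:
#   celery -A proj worker --pool=gevent --concurrency=100
```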
