I'm wondering what options there are for monitoring Celery tasks from a browser after they have been sent to a worker?
My current application stack is a Flask app running inside Twisted, using Celery to run dozens to thousands of small background tasks (updating metadata in a repository, creating image derivatives, etc.). I'm envisioning using AJAX long-polling to monitor the status of the Celery tasks initiated by the user. I'm using Redis as the broker and result backend.
I see Celery has some command-line ways to monitor tasks, and Flower for a web dashboard. But if I wanted to see more detailed status from a particular task sent to Celery, would it make more sense for that task to print / write to a log file, and then long-poll that file for changes from the Flask front-end?
At this point a user can say, "update these 10,000 items", the tasks are sent to Celery, and the front-end very quickly says, "job sent!". And the tasks do complete. But I'd like to have the user navigate to "/status" and see the status of those 10,000 small jobs - even a scrolling log file would probably work.
Any suggestions would be greatly appreciated. Took a lot of head scratching to make it this far sketching things out, but I'm spinning my wheels figuring out exactly WHAT to long-poll from the user front-end.
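For reference, the dispatch side of what I have now looks roughly like this (a simplified sketch; the route, task, and URLs are made up):

```python
from flask import Flask, jsonify
from celery import Celery

flask_app = Flask(__name__)
celery_app = Celery("tasks",
                    broker="redis://localhost:6379/0",
                    backend="redis://localhost:6379/0")

@celery_app.task
def update_item(item_id):
    # ... update metadata, create image derivatives, etc. ...
    return item_id

@flask_app.route("/update", methods=["POST"])
def update_items():
    item_ids = range(10000)  # stand-in for whatever the user selected
    task_ids = [update_item.delay(i).id for i in item_ids]
    # Nothing is kept around for a later "/status" check -- that's the gap.
    return jsonify({"message": "job sent!", "count": len(task_ids)})
```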
Try Jobtastic, which extends Celery.
From the project description:
Jobtastic gives you goodies like:
Easy progress estimation/reporting
Job status feedback
Helper methods for gracefully handling a dead task broker (delay_or_eager and delay_or_fail)
Super-easy result caching
Thundering herd avoidance
Integration with a celery jQuery plugin for easy client-side progress display
Memory leak detection in a task run
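For a sense of the API, here is a rough sketch of a Jobtastic task, loosely based on the project's README; the class attributes and the update_progress helper are written from memory, so treat the exact names and signatures as assumptions:

```python
from jobtastic import JobtasticTask

class UpdateItemsTask(JobtasticTask):
    # Which kwargs identify "the same" job, for caching and herd avoidance
    significant_kwargs = [('item_ids', str)]
    herd_avoidance_timeout = 60  # seconds to suppress duplicate jobs
    cache_duration = 0           # don't cache results

    def calculate_result(self, item_ids, **kwargs):
        total = len(item_ids)
        for done, item_id in enumerate(item_ids, start=1):
            # ... do the real work for item_id ...
            self.update_progress(done, total, update_frequency=10)
        return {'updated': total}
```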
Jobtastic was a great idea, but not quite what worked for us. In the end, we decided to create an incrementing job number (stored in Redis, alongside the results and broker), push all the Celery task ids associated with that job number into a Python object, then pickle that and store it in Redis. We can then use it later to see whether the entire "job" is complete, or check its status. For our purposes, it works just lovely.
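A minimal sketch of that bookkeeping, assuming a plain redis-py client and Celery's AsyncResult (the key names are made up):

```python
import pickle

import redis
from celery.result import AsyncResult

r = redis.Redis()  # same Redis instance that holds the broker/results

def record_job(async_results):
    """Store the task ids of already-dispatched Celery tasks under a new job number."""
    job_number = r.incr("job:counter")  # incrementing job number
    task_ids = [res.id for res in async_results]
    r.set("job:%d:tasks" % job_number, pickle.dumps(task_ids))
    return job_number

def job_status(job_number):
    """Summarize how many of the job's tasks have finished."""
    task_ids = pickle.loads(r.get("job:%d:tasks" % job_number))
    states = [AsyncResult(tid).state for tid in task_ids]
    finished = sum(1 for s in states if s in ("SUCCESS", "FAILURE"))
    return {"total": len(states), "finished": finished}
```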
Related
I have a Django website where I can register event listeners and monitoring tasks on certain websites, see info about these tasks, edit them, delete them, etc. These tasks are long-running, so I launch them as tasks in an asyncio event loop. I want them to be independent of the Django website, so I run these tasks in an event loop alongside a Sanic webserver and control them with API calls from the Django server. I don't know why, but I still feel that this solution is pretty scuffed, so is there a better way to do it? I was thinking about using Kubernetes, but these tasks aren't resource-heavy and are simple, so I don't think it's worth launching a new pod for each one.
Thanks for the help.
Ideally, it is a good idea to launch a new pod for each new event or job.
You can use a CronJob in Kubernetes so the pods are automatically cleaned up when the work is done.
It's always better to keep separate, small microservices rather than running the whole monolithic application inside one container.
On the management side, starting a new pod is also easy to manage, and it is cost-efficient if you scale your cluster up and down according to resource requirements.
You can also use a message broker with a listener that subscribes to a channel in the broker and performs the async task or handles the event when one arrives. The listener runs as a separate pod.
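As a rough illustration of the broker-plus-listener idea, the listener (which could run as its own pod) might look like this with Redis pub/sub; the channel name and handler are assumptions:

```python
import json

import redis

def handle_event(event):
    # ... start or stop a monitoring task based on the event payload ...
    print("handling", event)

def main():
    r = redis.Redis(host="redis", port=6379)
    pubsub = r.pubsub()
    pubsub.subscribe("monitoring-tasks")  # hypothetical channel name
    for message in pubsub.listen():
        if message["type"] != "message":
            continue  # skip subscribe confirmations
        handle_event(json.loads(message["data"]))

if __name__ == "__main__":
    main()
```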
I am confused about Celery. For example, I want to load a data file and it takes 10 seconds to load without Celery. With Celery, how will the user benefit? Will it take the same time to load the data?
Celery, and similar systems like Huey, are made to help us distribute (offload) work that normally can't execute concurrently on a single machine, or that would cause significant performance degradation if you tried. The key word here is DISTRIBUTED.
You mentioned downloading a file. If it is a single file you need to download, and that is all, then you do not need Celery. How about a more complex scenario - you need to download 100000 files? How about an even more complex one - these 100000 files need to be parsed and the parsing process is CPU-intensive?
Moreover, Celery will help you with retrying of failed tasks, logging, monitoring, etc.
Normally, the user has to wait for the data file load to finish on the server. But with the help of Celery, the operation is performed in the background and the user does not have to wait for it. Even if the app crashes, the task will still be queued.
Celery will keep track of the work you send to it in a database back-end such as Redis or RabbitMQ. This keeps the state out of your app server's process, which means even if your app server crashes your job queue will still remain. Celery also allows you to track tasks that fail.
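To make that concrete, here is a minimal sketch (the broker/backend URLs and file path are assumptions): the call to delay() returns immediately, the 10-second load happens in a worker process, and the returned id can be used to check on it later.

```python
import time

from celery import Celery

app = Celery("loader",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")

@app.task
def load_data_file(path):
    time.sleep(10)  # stand-in for the slow load
    return "loaded %s" % path

# In the web request handler:
result = load_data_file.delay("/data/big.csv")  # returns almost instantly
print(result.id, result.ready())                # ready() is False while the worker runs
# Later, e.g. from a status endpoint:
# result.state, or result.get(timeout=1) once it has finished
```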
I need your opinion on a challenge that I'm facing. I'm building a website that uses Django as a backend, PostgreSQL as my DB, GraphQL as my API layer and React as my frontend framework. The website is hosted on Heroku. I wrote a Python script that logs in to my Gmail account, parses a few emails based on pre-defined conditions, and stores the parsed data in a Google Sheet. Now I want the script to be part of my website, in which the user will specify what exactly needs to be parsed (i.e. filters) and then see the parsed data displayed in a table to review the accuracy of the parsing.
The part that I need some help with is how to architect such workflow. Below are few ideas that I managed to come up with after some googling:
Generate a GraphQL mutation that stores a 'task' into a task model. Once a new task entry is stored, a Django signal will trigger the script. I'm not sure yet if a signal can run custom Python functions, but from what I've read so far, it seems doable (I've put a rough sketch of what I mean after these two options).
Use Celery to run this task asynchronously. I'm not sure if asynchronous tasks are what I'm after here, as I need this task to run immediately after the user triggers the feature from the frontend, but I might be wrong. I'm also not sure whether I need Redis to store the task details or whether I can do that in PostgreSQL.
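To illustrate the first idea, here is the rough shape I have in mind with a post_save signal; the model, field, and function names are placeholders:

```python
from django.db.models.signals import post_save
from django.dispatch import receiver

from .models import ParsingTask    # hypothetical task model
from .parsing import parse_emails  # the existing parsing script, wrapped in a function

@receiver(post_save, sender=ParsingTask)
def run_parsing_task(sender, instance, created, **kwargs):
    if created:
        # Note: this runs synchronously, inside the request that created the task
        instance.result = parse_emails(filters=instance.filters)
        instance.save(update_fields=["result"])
```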
What is the best practice for implementing this feature? The task can be anything, not necessarily parsing emails; it could also be importing data from Excel. Any task that is user-generated rather than scheduled or repeated.
I'm sorry in advance if this question seems trivial to some of you. I'm not a professional developer and the above project is a way for me to sharpen my technical skills and learn new techniques.
Looking forward to learn from your experiences.
You can dissect your problem into the following steps:
User specifies task parameters
System executes task
System displays result to the User
You can either do all of these:
Sequentially and synchronously in one swoop; or
Step by step asynchronously.
Synchronously
You can run your script when generating a response, but it will come with the following downsides:
The process in the server processing your request will block until the script is finished. This may or may not affect the processing of other requests by that same server (this will depend on the number of simultaneous requests being processed, workload of the script, etc.)
The client (e.g. your browser) and even the server might time out if the script takes too long. You can fix this to some extent by configuring your server appropriately.
The beauty of this approach, however, is its simplicity. To do this, you can just pass the parameters through the request; the server parses them, runs the script, and returns the result.
There is no need to set up a message queue, task scheduler, or anything else.
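A bare-bones illustration of the synchronous route in Django; the view and the parse_emails function are made-up names standing in for your script:

```python
from django.http import JsonResponse

from .parsing import parse_emails  # hypothetical: the existing script as a callable

def parse_now(request):
    filters = request.GET.get("filters", "")
    result = parse_emails(filters=filters)  # blocks until the script finishes
    return JsonResponse({"result": result})
```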
Asynchronously
Ideally though, for long-running tasks, it is best to have this executed outside of the usual request-response loop for the following advantages:
The server responding to the requests can actually serve other requests.
Some scripts can take a while; for some you don't even know whether they will finish.
The script is no longer dependent on the reliability of the network (imagine running an expensive task and then your internet connection skips or is just plain intermittent; you won't be able to do anything).
The downside of this is now you have to set more things up, which increases the project's complexity and points of failure.
Producer-Consumer
Whatever you choose, it's usually best to follow the producer-consumer pattern:
Producer creates tasks and puts them in a queue
Consumer takes a task from the queue and executes it
The producer is basically you, the user. You specify the task and the parameters involved in that task.
This queue could be any datastore: an in-memory datastore like Redis; a messaging queue like RabbitMQ; or a relational database management system like PostgreSQL.
The consumer is your script executing these tasks. There are multiple ways of running the consumer/script: via Celery, as you mentioned, which runs multiple workers to execute the tasks passed through the queue; via a simple time-based job scheduler like crontab; or even by manually triggering the script.
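A toy version of that pattern with a Redis list as the queue; the key name and execute_task are assumptions standing in for your actual script:

```python
import json

import redis

r = redis.Redis()

def produce(task_params):
    """Producer: push a task description onto the queue."""
    r.lpush("tasks", json.dumps(task_params))

def execute_task(params):
    """Stand-in for the real work (parsing emails, importing from Excel, ...)."""
    print("executing", params)

def consume_forever():
    """Consumer: pop tasks and execute them, blocking while the queue is empty."""
    while True:
        _, raw = r.brpop("tasks")
        execute_task(json.loads(raw))
```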
The question is actually not trivial, as the solution depends on what task you are actually trying to do. It is best to evaluate the constraints, parameters, and actual tasks to decide which approach you will choose.
But just to give you a more relevant guideline:
Just keep it simple. Unless you have a compelling reason to do otherwise (e.g. the server is being bogged down, or the internet connection is not reliable in practice), there's really no reason to get fancy.
The more blocking the task is, the longer it takes, or the more dependent it is on third-party APIs over the network, the more it makes sense to push it to a background process to add reliability and resiliency.
For your email import script, I'd most likely push it to the background:
Have a page where you can add a task to the database
In the task details page, display the task details, and the result below if it exists or "Processing..." otherwise
Have a script that executes tasks (import emails from Gmail given the task parameters) and saves the results to the database
Schedule this script to run every few minutes via crontab
Yes, the above has side effects, like crontab running the script multiple times at the same time, but I won't go into detail without knowing more about the specifics of the task.
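For the crontab variant, here is a sketch of a Django management command that drains pending tasks; the ParsingTask model and its fields are assumptions. A crontab entry such as */5 * * * * python manage.py run_pending_tasks would then drive it.

```python
# yourapp/management/commands/run_pending_tasks.py
from django.core.management.base import BaseCommand

from yourapp.models import ParsingTask    # hypothetical task model
from yourapp.parsing import parse_emails  # the existing parsing script

class Command(BaseCommand):
    help = "Execute any tasks that have not been processed yet"

    def handle(self, *args, **options):
        for task in ParsingTask.objects.filter(status="pending"):
            task.status = "processing"
            task.save(update_fields=["status"])
            task.result = parse_emails(filters=task.filters)
            task.status = "done"
            task.save(update_fields=["status", "result"])
```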
I have a simple Django project.
Each time a user hits the homepage, some operations are performed, and the view is generated based on them. The problem is that when a user hits the homepage, sometimes the operations take a long time depending on network connectivity. If a new user hits the homepage in the meantime, they have to wait for the previous user's request to be serviced before their page gets rendered.
I found that Celery is used for task scheduling and queuing, but I wonder if Celery is what I need. I need each user's request to be processed independently and not queued.
My project is a single-app project and will receive a maximum of 100 users at a time.
Thanks.
If the long process needs to be done in order to serve the request and generate the proper response then you cannot use Celery.
The debug web server that ships with Django is a multi-threaded, single-process server, but it is really very limited and should not be used in production.
If you use gunicorn or other wsgi servers you can run your application in multiple processes but you will hit the limit quickly if you're doing heavy processing.
In my opinion, the solution is to change the way you're processing things: either prepare the results ahead of time, or serve the request and do the processing in the background while showing the user a "Please wait..." message; here you can use Celery to do the processing.
The other solution would be to use an event-based web server like Twisted or Cyclone.
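To illustrate the background-processing option with the "Please wait..." page, here is a minimal Celery-based sketch; the task body and template name are made up:

```python
from celery import shared_task
from django.shortcuts import render

@shared_task
def heavy_processing(user_id):
    # ... the slow, network-dependent work that currently blocks the view ...
    return "done"

def homepage(request):
    result = heavy_processing.delay(request.user.id)
    # Respond immediately; the page can poll a status URL using the task id.
    return render(request, "please_wait.html", {"task_id": result.id})
```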
I'm creating a Django web app which features potentially very long running calculations of up to an hour. The calculations are simulation models built in Python. The web app sends inputs to the simulation model and after some time receives the answer. Also, the user should be able to close his browser after starting the simulation and if he logs in the next day the results should be there.
From my research it seems like I can use Celery together with Redis/RabbitMQ as broker to run the calculation in the background. Ideally I would want to display progress updates using ajax, so that the page updates without a user refresh when the calculation is complete.
I want to host the app on Heroku, so the calculation will also be running on the Heroku server. How hard will it be if I want to move the calculation engine to another server? It might be useful if the calculation engine is on a different server.
So my question is: is the approach above a good one, or what other options can I look at?
I think Celery is a good approach. I'm not sure whether you need Redis/RabbitMQ as a broker or whether you could just use MySQL - it depends on your tasks. Celery workers can be run on different servers, since Celery supports distributed queues.
Another approach is to implement your own queue engine in Python, with the database as a broker and cron for job execution. But that could be a dirty route with a lot of pain and bugs.
So I think that Celery is the nicer way to do it.
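For the progress updates you mentioned, Celery's update_state is the usual hook; here is a rough sketch (the PROGRESS state name and meta keys are just a convention, and the broker/backend URLs are assumptions):

```python
from celery import Celery
from celery.result import AsyncResult

app = Celery("simulations",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")

@app.task(bind=True)
def run_simulation(self, params):
    total_steps = 100
    for step in range(total_steps):
        # ... advance the simulation model ...
        self.update_state(state="PROGRESS",
                          meta={"current": step + 1, "total": total_steps})
    return {"status": "complete"}

# Polled from an ajax view, given the task id stored when the job was started:
def progress(task_id):
    result = AsyncResult(task_id, app=app)
    return {"state": result.state, "info": result.info}
```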
If you are running on Heroku, you want django-rq, not Celery. See https://devcenter.heroku.com/articles/python-rq.
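Following the linked Heroku article, which uses plain RQ, the enqueueing side is only a few lines; run_model here is a hypothetical stand-in for the simulation function:

```python
from redis import Redis
from rq import Queue

from simulations import run_model  # hypothetical long-running calculation

q = Queue(connection=Redis())
job = q.enqueue(run_model, {"inputs": [1, 2, 3]})
print(job.id, job.get_status())  # poll the id later to see whether it finished
```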