I have a Python backend using Celery and Redis to serve long-running tasks. The app has a front end built in Vue.js. For these long-running tasks, I need to give the user real-time updates on the status of the task. The approach I have in mind is to poll the status endpoint using the setTimeout function in my JS code. Is there a better approach for this kind of use case?
Related
I need your opinion on a challenge that I'm facing. I'm building a website that uses Django as a backend, PostgreSQL as my DB, GraphQL as my API layer and React as my frontend framework. The website is hosted on Heroku. I wrote a Python script that logs in to my Gmail account, parses a few emails based on predefined conditions, and stores the parsed data in a Google Sheet. Now I want the script to be part of my website, where the user specifies what exactly needs to be parsed (i.e. filters) and the parsed data is then displayed in a table so the accuracy of the parsing can be reviewed.
The part that I need some help with is how to architect such a workflow. Below are a few ideas that I managed to come up with after some googling:
Generate a GraphQL mutation that stores a 'task' in a task model. Once a new task entry is stored, a Django signal will trigger the script. I'm not sure yet whether a signal can run custom Python functions, but from what I have read so far, it seems doable.
Use Celery to run this task asynchronously. But I'm not sure if asynchronous tasks are what I'm after here, as I need this task to run immediately after the user triggers the feature from the frontend. I might be wrong here, though. I'm also not sure whether I need Redis to store the task details or whether I can do that in PostgreSQL.
What is the best practice for implementing this feature? The task can be anything, not necessarily parsing emails; it could also be importing data from Excel. Any task that is user-generated rather than a scheduled or repeated task.
I'm sorry in advance if this question seems trivial to some of you. I'm not a professional developer and the above project is a way for me to sharpen my technical skills and learn new techniques.
Looking forward to learning from your experiences.
You can dissect your problem into the following steps:
User specifies task parameters
System executes task
System displays result to the User
You can either do all of these:
Sequentially and synchronously in one swoop; or
Step by step asynchronously.
Synchronously
You can run your script when generating a response, but it will come with the following downsides:
The process in the server processing your request will block until the script is finished. This may or may not affect the processing of other requests by that same server (this will depend on the number of simultaneous requests being processed, workload of the script, etc.)
The client (e.g. your browser) and even the server might time out if the script takes too long. You can fix this to some extent by configuring your server appropriately.
The beauty of this approach, however, is its simplicity. To do this, you just pass the parameters through the request; the server parses them, runs the script, and returns the result.
There is no message queue, task scheduler, or anything else to set up.
Asynchronously
Ideally though, for long-running tasks, it is best to have this executed outside of the usual request-response loop for the following advantages:
The server responding to the requests can actually serve other requests.
Some scripts can take a while, and for some you don't even know whether they will finish; in the background, neither case leaves a request hanging.
The script is no longer dependent on the reliability of the network (imagine running an expensive task when your internet connection drops or is just plain intermittent; you won't be able to do anything).
The downside of this is now you have to set more things up, which increases the project's complexity and points of failure.
Producer-Consumer
Whatever you choose, it's usually best to follow the producer-consumer pattern:
Producer creates tasks and puts them in a queue
Consumer takes a task from the queue and executes it
The producer is basically you, the user. You specify the task and the parameters involved in that task.
This queue could be any datastore: an in-memory datastore like Redis; a message queue like RabbitMQ; or a relational database management system like PostgreSQL.
The consumer is your script executing these tasks. There are multiple ways of running the consumer/script: via Celery, as you mentioned, which runs multiple workers to execute the tasks passed through the queue; via a simple time-based job scheduler like crontab; or even by triggering the script manually.
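As a minimal sketch of the producer-consumer pattern described above, the following uses only the Python standard library, with an in-process queue.Queue standing in for Redis, RabbitMQ or a database table. The run_task function and the task parameters are hypothetical placeholders, not part of the original script.

```python
import queue
import threading


def run_task(params):
    # Hypothetical stand-in for the real work (e.g. parsing emails).
    return {"filter": params["filter"], "matched": len(params["filter"])}


def consumer(task_queue, results):
    # Take tasks off the queue until the sentinel None arrives.
    while True:
        task = task_queue.get()
        if task is None:
            break
        task_id, params = task
        results[task_id] = run_task(params)


task_queue = queue.Queue()  # stand-in for Redis, RabbitMQ or a DB table
results = {}
worker = threading.Thread(target=consumer, args=(task_queue, results))
worker.start()

# Producer: the user-facing side only enqueues work and returns immediately.
task_queue.put((1, {"filter": "invoice"}))
task_queue.put((2, {"filter": "receipts"}))
task_queue.put(None)  # tell the consumer to stop
worker.join()
```

With Celery the shape stays the same: the web request plays the producer, the broker is the queue, and the Celery workers are the consumers.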
The question is actually not trivial, as the solution depends on what task you are actually trying to do. It is best to evaluate the constraints, parameters, and actual tasks to decide which approach you will choose.
But just to give you a more relevant guideline:
Keep it simple. Unless you have a compelling reason to do otherwise (e.g. the server is being bogged down, or the internet connection is not reliable in practice), there's really no reason to be fancy.
The more blocking the task is, the longer it takes, or the more it depends on third-party APIs over the network, the more it makes sense to push it to a background process to add reliability and resiliency.
For your email import script, I would most likely push it to the background:
Have a page where you can add a task to the database
In the task details page, display the task details, and the result below if it exists or "Processing..." otherwise
Have a script that executes tasks (import emails from gmail given the task parameters) and save the results to the database
Schedule this script to run every few minutes via crontab
Yes, the above has side effects, such as crontab starting the script again while a previous run is still in progress, but I won't go into detail without knowing more about the specifics of the task.
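The cron-driven steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the task table, its columns and the process function are hypothetical, and an in-memory SQLite database stands in for PostgreSQL. Each task is claimed with a conditional UPDATE, which is one simple way to reduce the risk of overlapping runs handling the same task twice.

```python
import sqlite3


def claim_next_task(conn):
    """Mark one pending task as running and return (id, params), or None."""
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT id, params FROM task WHERE status = 'pending' "
            "ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        updated = conn.execute(
            "UPDATE task SET status = 'running' "
            "WHERE id = ? AND status = 'pending'",
            (row[0],),
        )
        # rowcount is 0 if another run claimed the task first; in that case
        # this run gives up and the next scheduled run picks up the rest.
        return row if updated.rowcount == 1 else None


def process(params):
    # Hypothetical placeholder for the real import/parsing work.
    return params.upper()


def run_pending(conn):
    """Process every task this run manages to claim; return how many."""
    done = 0
    while (task := claim_next_task(conn)) is not None:
        task_id, params = task
        result = process(params)
        with conn:
            conn.execute(
                "UPDATE task SET status = 'done', result = ? WHERE id = ?",
                (result, task_id),
            )
        done += 1
    return done


conn = sqlite3.connect(":memory:")  # stand-in for the real database
conn.execute(
    "CREATE TABLE task (id INTEGER PRIMARY KEY, params TEXT, "
    "status TEXT DEFAULT 'pending', result TEXT)"
)
conn.execute("INSERT INTO task (params) VALUES ('from:billing'), ('from:hr')")
conn.commit()
processed = run_pending(conn)
```

The task details page then only needs to read the status and result columns to decide between showing the result and "Processing...".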
The technology I would like to use in this example is Celery for queueing and Python for the component implementation.
Imagine a simple project that consists of two components. One is a web app that connects to an API and gathers data. The second is a processor that can then process the data. When the web app has received a piece of data from the API, it is supposed to send a task to a task queue, including the just-crawled data, which is then consumed by the processor to process the data.
Whether or not this is a sensible way to go about a task like this is debatable and not the point of my question.
My question is this: the tasks are defined within the processor, since they state which processing function shall be executed, and the definition of that function is obviously within the processor. Given that the web app doesn't have access to the task definition, how does it communicate the task to the processor?
Do you have to hold a copy of the source code of the processor within the web app?
Do you make the processor a dependency of the web app?
What is the best practice approach to handle such a scenario?
What you are describing is probably one of the most common use-cases for Celery. Just look how many people are asking Django/Flask + Celery questions here on StackOverflow... If you are a Django user, there is an entire section in the Celery documentation describing how to do exactly what you want. Things should be similar with other frameworks.
Do you have to hold a copy of the source code of the processor within the web app?
As far as I know you do not have to (I do not use any web framework) but it could be that you do need to because of some deeper integration with Celery. If your web application knows the Celery task name, and its parameters, it can schedule it to run without actually having access to the Python code. This is accomplished using send_task(task_name, ...).
Do you make the processor a dependency of the web app?
As I wrote above, there are several ways to use it. If you want tighter integration, then yes. If you just want to run a task and get the result using send_task(), then your web application only needs to depend on Celery.
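A minimal sketch of the send_task() approach, assuming the processor registers its task under the name "processor.tasks.process_data" and that Redis runs locally; both the task name and the broker URL are hypothetical and must match whatever the worker actually uses.

```python
def make_app(broker_url="redis://localhost:6379/0"):
    # Imported lazily so the helper below can be exercised without a broker.
    from celery import Celery

    # The web app only needs Celery itself, not the processor's source code.
    return Celery("webapp", broker=broker_url, backend=broker_url)


def submit_for_processing(app, payload):
    """Queue the processor's task by name and return the result handle."""
    # "processor.tasks.process_data" is an assumed name for illustration.
    return app.send_task("processor.tasks.process_data", args=[payload])


# Typical use in the web app (requires a running broker and worker):
#   result = submit_for_processing(make_app(), {"url": "https://example.com"})
#   print(result.id)
```

Because the task is referenced only by its name, a typo in that string is not caught when the task is sent; the worker will simply not recognise the task.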
What is the best practice approach to handle such a scenario?
Follow the Django guide. I also advise you to run Celery on its own first and execute a few tasks, just so you learn the basic principles of how it distributes work.
Celery. An app sends a task to be executed.
r = task.delay()
Is it possible to execute code in the app address space upon task end other than by polling r?
There is no built-in way to do this in Celery. Generally speaking, if you want the callback to run in the calling application's address space, you will have to come up with your own way for the task to send a notification back to your app.
Here are some patterns you can use:
The general pattern
Using Django Channels
Using Firebase
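One way to picture the general pattern is a listener thread inside the app that waits for completion messages and runs the registered callback. The sketch below is an assumption-laden illustration, not a Celery feature: CompletionListener is a hypothetical helper, and an in-process queue.Queue stands in for whatever channel the task would really use to report back (Redis pub/sub, a webhook endpoint, and so on).

```python
import queue
import threading


class CompletionListener:
    """Runs registered callbacks inside the app process on task completion."""

    def __init__(self, channel):
        self._channel = channel
        self._callbacks = {}
        self._lock = threading.Lock()
        self._thread = threading.Thread(target=self._listen, daemon=True)

    def start(self):
        self._thread.start()

    def on_done(self, task_id, callback):
        # Remember which callback to run when task_id reports completion.
        with self._lock:
            self._callbacks[task_id] = callback

    def _listen(self):
        while True:
            message = self._channel.get()
            if message is None:  # sentinel: stop listening
                break
            task_id, result = message
            with self._lock:
                callback = self._callbacks.pop(task_id, None)
            if callback is not None:
                callback(result)

    def stop(self):
        self._channel.put(None)
        self._thread.join()


channel = queue.Queue()  # stand-in for Redis pub/sub, a webhook, etc.
listener = CompletionListener(channel)
listener.start()

received = []
listener.on_done("task-1", received.append)

# In a real setup the worker would publish this at the end of the task.
channel.put(("task-1", {"status": "ok"}))
listener.stop()
```

The callback runs in the listener thread of the calling application, so it has access to that process's memory, which is what polling r would otherwise be used for.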
I have a project in which the user will send an audio file from Android or the web to the server.
I need to perform speech-to-text processing on the server and return some files to the user on Android or the web. The server side has to be done in Python.
Please guide me on how this could be done.
Alongside your web application, you can have a queue of tasks that need to be run and worker process(es) to run and track those tasks. This is a popular pattern when web requests need to either start tasks in the background, check in on tasks, or get the result of a task. An introduction to this pattern can be found in the Task Queues section of the Full Stack Python open book. Celery and RQ are two popular projects that supply task queue management and can plug into an existing Python web application, such as one built with Django or Flask.
Once you have task management, you'll have to decide how to keep the user up to date on the status of a task. If you're stuck with having to use RPC-style web service calls only, then you can have clients (e.g. Android or browser) poll for the status by making a call to a web service you've created that checks on the task via your task queue manager's API.
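A polling client could look roughly like the sketch below. It assumes a hypothetical fetch_status callable that wraps the call to your status web service and returns a dict with a "state" key; the SUCCESS and FAILURE values follow Celery's state names, and the delays are arbitrary defaults.

```python
import time


def poll_until_done(fetch_status, initial_delay=1.0, max_delay=30.0,
                    timeout=600.0, sleep=time.sleep):
    """Call fetch_status() until a terminal state, backing off between calls."""
    delay = initial_delay
    waited = 0.0
    while True:
        status = fetch_status()
        if status["state"] in ("SUCCESS", "FAILURE"):
            return status
        if waited >= timeout:
            raise TimeoutError("task did not finish in time")
        sleep(delay)
        waited += delay
        # Double the delay each round so long tasks cause fewer requests.
        delay = min(delay * 2, max_delay)


# Example with canned responses instead of a real web service call.
responses = iter([
    {"state": "PENDING"},
    {"state": "STARTED"},
    {"state": "SUCCESS", "result": "hello"},
])
delays = []
final = poll_until_done(lambda: next(responses), sleep=delays.append)
```

Backing off keeps the number of wasted requests down for long tasks, at the cost of the user learning about completion a little later.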
If you want the user to be informed faster or want to reduce wasteful overhead from constant polling, consider supplying a websocket instead. Through a websocket connection, clients could subscribe to notifications of events such as the completion of a speech-to-text job. The Autobahn|Python library provides server code for implementing websockets as well as support for a protocol on top called WAMP that can be used to communicate subscriptions and messages or call upon services. If you need to stick with Django, consider something like django-websocket-redis instead.
I'm creating a Django web app which features potentially very long running calculations of up to an hour. The calculations are simulation models built in Python. The web app sends inputs to the simulation model and after some time receives the answer. Also, the user should be able to close his browser after starting the simulation and if he logs in the next day the results should be there.
From my research it seems like I can use Celery together with Redis/RabbitMQ as broker to run the calculation in the background. Ideally I would want to display progress updates using ajax, so that the page updates without a user refresh when the calculation is complete.
I want to host the app on Heroku, so the calculation will also be running on the Heroku server. How hard will it be if I want to move the calculation engine to another server? It might be useful if the calculation engine is on a different server.
So my question is: is the approach above a good one, or what other options can I look at?
I think Celery is a good approach. I'm not sure whether you need Redis/RabbitMQ as a broker or whether you could just use MySQL; it depends on your tasks. Celery workers can run on different servers, so Celery supports distributed queues.
Another approach is to implement your own queue engine in Python, with a database as the broker and cron for job execution. But that could be a dirty solution with a lot of pain and bugs.
So I think that Celery is the nicer way to do it.
If you are running on Heroku, you want django-rq, not Celery. See https://devcenter.heroku.com/articles/python-rq.