Appropriate method to define a pool of workers in Python

Appropriate method to define a pool of workers in Python - python

I've started a new Python 3 project in which my goal is to download tweets and analyze them. As I'll be downloading tweets from different subjects, I want to have a pool of workers that must download from Twitter status with the given keywords and store them in a database. I name this workers fetchers.
Other kind of worker is the analyzers whose function is to analyze tweets contents and extract information from them, storing the result in a database also. As I'll be analyzing a lot of tweets, would be a good idea to have a pool of this kind of workers too.
I've been thinking in using RabbitMQ and Celery for this but I have some questions:
General question: Is really a good approach to solve this problem?
I need at least one fetcher worker per downloading task and this could be running for a whole year (actually is a 15 minutes cycle that repeats and last for a year). Is it appropriate to define an "infinite" task?
I've been trying Celery and I used delay to launch some example tasks. The think is that I don't want to call ready() method constantly to check if the task is completed. Is it possible to define a callback? I'm not talking about a celery task callback, just a function defined by myself. I've been searching for this and I don't find anything.
I want to have a single RabbitMQ + Celery server with workers in different networks. Is it possible to define remote workers?

Yeah, it looks like a good approach to me.
There is no such thing as infinite task. You might reschedule a task it to run once in a while. Celery has periodic tasks, so you can schedule a task so that it runs at particular times. You don't necessarily need celery for this. You can also use a cron job if you want.
You can call a function once a task is successfully completed.
from celery.signals import task_success
#task_success(sender='task_i_am_waiting_to_complete')
def call_me_when_my_task_is_done():
pass
Yes, you can have remote workes on different networks.

Related

How to Inspect the Queue Processing a Celery Task

I'm currently leveraging celery for periodic tasks. I am new to celery. I have two workers running two different queues. One for slow background jobs and one for jobs user's queue up in the application.
I am monitoring my tasks on datadog because it's an easy way to confirm my workers a running appropriately.
What I want to do is after each task completes, record which queue the task was completed on.
#after_task_publish.connect()
def on_task_publish(sender=None, headers=None, body=None, **kwargs):
statsd.increment("celery.on_task_publish.start.increment")
task = celery.tasks.get(sender)
queue_name = task.queue
statsd.increment("celery.on_task_publish.increment", tags=[f"{queue_name}:{task}"])
The following function is something that I implemented after researching the celery docs and some StackOverflow posts, but it's not working as intended. I get the first statsd increment but the remaining code does not execute.
I am wondering if there is a simpler way to inspect inside/after each task completes, what queue processed the task.

Since your question says is there a way to inspect inside/after each task completes - I'm assuming you haven't tried this celery-result-backend stuff. So you could check out this feature which is provided by Celery itself : Celery-Result-Backend / Task-result-Backend .
It is very useful for storing results of your celery tasks.
Read through this => https://docs.celeryproject.org/en/stable/userguide/configuration.html#task-result-backend-settings
Once you get an idea of how to setup this result-backend, Search for result_extended key (in the same link) to be able to add queue-names in your task return values.
Number of options are available - Like you can setup these results to go to any of these :
Sql-DB / NoSql-DB / S3 / Azure / Elasticsearch / etc
I have made use of this Result-Backend feature with Elasticsearch and this how my task results are stored :
It is just a matter of adding few configurations in settings.py file as per your requirements. Worked really well for my application. And I have a weekly cron that clears only successful results of tasks - since we don't need the results anymore - and I can see only failed results (like the one in image).
These were main keys for my requirement : task_track_started and task_acks_late along with result_backend

Best Way to Handle user triggered task (like import data) in Django

I need your opinion on a challenge that I'm facing. I'm building a website that uses Django as a backend, PostgreSQL as my DB, GraphQL as my API layer and React as my frontend framework. Website is hosted on Heroku. I wrote a python script that logs me in to my gmail account and parse few emails, based on pre-defined conditions, and store the parsed data into Google Sheet. Now, I want the script to be part of my website in which user will specify what exactly need to be parsed (i.e. filters) and then display the parsed data in a table to review accuracy of the parsing task.
The part that I need some help with is how to architect such workflow. Below are few ideas that I managed to come up with after some googling:
generate a graphQL mutation that stores a 'task' into a task model. Once a new task entry is stored, a Django Signal will trigger the script. Not sure yet if Signal can run custom python functions, but from what i read so far, it seems doable.
Use Celery to run this task asynchronously. But i'm not sure if asynchronous tasks is what i'm after here as I need this task to run immediately after the user trigger the feature from the frontend. But i'm might be wrong here. I'm also not sure if I need Redis to store the task details or I can do that on PostgreSQL.
What is the best practice in implementing this feature? The task can be anything, not necessarily parsing emails; it can also be importing data from excel. Any task that is user generated rather than scheduled or repeated task.
I'm sorry in advance if this question seems trivial to some of you. I'm not a professional developer and the above project is a way for me to sharpen my technical skills and learn new techniques.
Looking forward to learn from your experiences.

You can dissect your problem into the following steps:
User specifies task parameters
System executes task
System displays result to the User
You can either do all of these:
Sequentially and synchronously in one swoop; or
Step by step asynchronously.
Synchronously
You can run your script when generating a response, but it will come with the following downsides:
The process in the server processing your request will block until the script is finished. This may or may not affect the processing of other requests by that same server (this will depend on the number of simultaneous requests being processed, workload of the script, etc.)
The client (e.g. your browser) and even the server might time out if the script takes too long. You can fix this to some extent by configuring your server appropriately.
The beauty of this approach however is it's simplicity. For you to do this, you can just pass the parameters through the request, server parses and does the script, then returns you the result.
No setting up of a message queue, task scheduler, or whatever needed.
Asynchronously
Ideally though, for long-running tasks, it is best to have this executed outside of the usual request-response loop for the following advantages:
The server responding to the requests can actually serve other requests.
Some scripts can take a while, some you don't even know if it's going to finish
Script is no longer dependent on the reliability of the network (imagine running an expensive task, then your internet connection skips or is just plain intermittent; you won't be able to do anything)
The downside of this is now you have to set more things up, which increases the project's complexity and points of failure.
Producer-Consumer
Whatever you choose, it's usually best to follow the producer-consumer pattern:
Producer creates tasks and puts them in a queue
Consumer takes a task from the queue and executes it
The producer is basically you, the user. You specify the task and the parameters involved in that task.
This queue could be any datastore: in-memory datastore like Redis; a messaging queue like RabbitMQ; or an relational database management system like PostgreSQL.
The consumer is your script executing these tasks. There are multiple ways of running the consumer/script: via Celery like you mentioned which runs multiple workers to execute the tasks passed through the queue; via a simple time-based job scheduler like crontab; or even you manually triggering the script
The question is actually not trivial, as the solution depends on what task you are actually trying to do. It is best to evaluate the constraints, parameters, and actual tasks to decide which approach you will choose.
But just to give you a more relevant guideline:
Just keep it simple, unless you have a compelling reason to do so (e.g. server is being bogged down, or internet connection is not reliable in practice), there's really no reason to be fancy.
The more blocking the task is, or the longer the task takes or the more dependent it is to third party APIs via the network, the more it makes sense to push this to a background process add reliability and resiliency.
In your email import script, I'll most likely push that to the background:
Have a page where you can add a task to the database
In the task details page, display the task details, and the result below if it exists or "Processing..." otherwise
Have a script that executes tasks (import emails from gmail given the task parameters) and save the results to the database
Schedule this script to run every few minutes via crontab
Yes the above has side effects, like crontab running the script in multiple times at the same time and such, but I won't go into detail without knowing more about the specifics of the task.

Perpetual tasks in Celery?

I'm building a django app where I use a camera to capture images, analyze them, store metadata and results of the analysis in a database, and finally present the data to users.
I'm considering using Celery to handle to background process of capturing images and then processing them:
app = Celery('myapp')
#app.task
def capture_and_process_images(camera):
while True:
image = camera.get_image()
process_image(image)
sleep(5000)
#app.task
def process_image(image):
# do some calculations
# django orm calls
# etc...
The first task will run perpetually, while the second should take ~20 seconds, so there will be multiple images being processed at once.
I haven't found any examples online of using Celery in this way, so I'm not sure if this is bad practice or not.
Can/should Celery be used to handle perpetually running tasks?
Thank you.

Running perpetual tasks in Celery is a done in practise. Take a look at daemonization, which essentially runs a permanent task without user interaction, so I wouldn't say there is anything wrong with running it permanently in your case.

Having celery task running infinitely is not seems like a good idea to me.
If you are going to capture images at some intervals I would suggest you to use some cron-like script getting an image every 5 seconds and launching celery task to process it.
Note also that it is a best practice to avoid synchronous subtasks in celery, see docs for more details.

Retrieving result from celery worker constantly

I have an web app in which I am trying to use celery to load background tasks from a database. I am currently loading the database upon request, but would like to load the tasks on an hourly interval and have them work in the background. I am using flask and am coding in python.I have redis running as well.
So far using celery I have gotten the worker to process the task and the beat to send the tasks to the worker on an interval. But I want to retrieve the results[a dataframe or query] from the worker and if the result is not ready then it should load the previous result of the worker.
Any ideas on how to do this?
Edit
I am retrieving the results from a database using sqlalchemy and I am rendering the results in a webpage. I have my homepage which has all the various links which all lead to different graphs which I want to be loaded in the background so the user does not have to wait long loading times.

The Celery Task is being executed by a Worker, and it's Result is being stored in the Celery Backend.
If I get you correctly, then I think you got few options:
Ignore the result of the graph-loading-task, store what ever you need, as a side effect of the task, in your database. When needed, query for the most recent result in that database. If the DB is Redis, you may find ZADD and ZRANGE suitable. This way you'll get the new if available, or the previous if not.
You can look for the result of a task if you provide it's id. You can do this when you want to find out the status, something like (where celery is the Celery app): result = celery.AsyncResult(<the task id>)
Use callback to update farther when new result is ready.
Let a background thread wait for the AsyncResult, or native_join, which is supported with Redis, and update accordingly (not recommended)
I personally used option #1 in similar cases (using MongoDB) and found it to be very maintainable and flexible. But possibly, due the nature of your UI, option #3 will more suitable for you needs.

Checking status of Task Queue in Google App Engine

I'm putting several tasks into a task queue and would like to know when the specific tasks are done. I haven't found anything in the API about call backs, or checking the status of a task, so I thought I'd see what other people do, or if there's a work around (or official) way to check. I don't care about individual tasks, if it helps, I'm putting 6 different tasks in, and want to know when all 6 are complete.
Thanks!

The new REST/JSON task queue API will let you do this.
http://code.google.com/appengine/docs/python/taskqueue/rest.html
This does not scale well to thousands of tasks...
I do like the pipeline API suggestion though!

You might be able to accomplish this with the pipeline api. You make something dependent on all 6 tasks and let it rip.
http://code.google.com/p/appengine-pipeline/
Good Luck.

You can use memcache. Use a unique key specific to this task group. Set a count when you kick off your tasks, and have each task atomically decrement it. When the value is 0, your tasks are complete. The task that finds this value to be 0 can call your callback.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.