How to Inspect the Queue Processing a Celery Task - python

I'm currently leveraging celery for periodic tasks. I am new to celery. I have two workers running two different queues. One for slow background jobs and one for jobs user's queue up in the application.
I am monitoring my tasks on datadog because it's an easy way to confirm my workers a running appropriately.
What I want to do is after each task completes, record which queue the task was completed on.
#after_task_publish.connect()
def on_task_publish(sender=None, headers=None, body=None, **kwargs):
statsd.increment("celery.on_task_publish.start.increment")
task = celery.tasks.get(sender)
queue_name = task.queue
statsd.increment("celery.on_task_publish.increment", tags=[f"{queue_name}:{task}"])
The following function is something that I implemented after researching the celery docs and some StackOverflow posts, but it's not working as intended. I get the first statsd increment but the remaining code does not execute.
I am wondering if there is a simpler way to inspect inside/after each task completes, what queue processed the task.

Since your question says is there a way to inspect inside/after each task completes - I'm assuming you haven't tried this celery-result-backend stuff. So you could check out this feature which is provided by Celery itself : Celery-Result-Backend / Task-result-Backend .
It is very useful for storing results of your celery tasks.
Read through this => https://docs.celeryproject.org/en/stable/userguide/configuration.html#task-result-backend-settings
Once you get an idea of how to setup this result-backend, Search for result_extended key (in the same link) to be able to add queue-names in your task return values.
Number of options are available - Like you can setup these results to go to any of these :
Sql-DB / NoSql-DB / S3 / Azure / Elasticsearch / etc
I have made use of this Result-Backend feature with Elasticsearch and this how my task results are stored :
It is just a matter of adding few configurations in settings.py file as per your requirements. Worked really well for my application. And I have a weekly cron that clears only successful results of tasks - since we don't need the results anymore - and I can see only failed results (like the one in image).
These were main keys for my requirement : task_track_started and task_acks_late along with result_backend

Related

Retrieving result from celery worker constantly

I have an web app in which I am trying to use celery to load background tasks from a database. I am currently loading the database upon request, but would like to load the tasks on an hourly interval and have them work in the background. I am using flask and am coding in python.I have redis running as well.
So far using celery I have gotten the worker to process the task and the beat to send the tasks to the worker on an interval. But I want to retrieve the results[a dataframe or query] from the worker and if the result is not ready then it should load the previous result of the worker.
Any ideas on how to do this?
Edit
I am retrieving the results from a database using sqlalchemy and I am rendering the results in a webpage. I have my homepage which has all the various links which all lead to different graphs which I want to be loaded in the background so the user does not have to wait long loading times.
The Celery Task is being executed by a Worker, and it's Result is being stored in the Celery Backend.
If I get you correctly, then I think you got few options:
Ignore the result of the graph-loading-task, store what ever you need, as a side effect of the task, in your database. When needed, query for the most recent result in that database. If the DB is Redis, you may find ZADD and ZRANGE suitable. This way you'll get the new if available, or the previous if not.
You can look for the result of a task if you provide it's id. You can do this when you want to find out the status, something like (where celery is the Celery app): result = celery.AsyncResult(<the task id>)
Use callback to update farther when new result is ready.
Let a background thread wait for the AsyncResult, or native_join, which is supported with Redis, and update accordingly (not recommended)
I personally used option #1 in similar cases (using MongoDB) and found it to be very maintainable and flexible. But possibly, due the nature of your UI, option #3 will more suitable for you needs.

reuse results for celery tasks

Is there any common solution to store and reuse celery task results without executing tasks again? I have many http fetch tasks in my metasearch project and wish to reduce number of useless http requests (they can take long time and return same results) by store results of first one and fire it back without real fetching. Also it will be very useful to does not start new fetch task when the same one is already in progress. Instead of running new job app has to return AsyncResult by id (id is unique and generated by task call args) of already pending task.
Looks like I need to define new apply_async(Celery.send_task) behavior for tasks with same task_id:
if task with given task_id doesn't started yet then start it
if task with given task_id already started return AsyncResult(task_id) without actually run task
#task decorator should accept new ttl
kwarg to determine cache time (only for redis backend?)
Looks like the simplest answer is to store your results in a cache (like a database) and first ask for the result from your cache else fire the http request.
I don't think there's something specific to celery that can perform this.
Edit:
To comply with the fact that you the tasks are sent at the same time an additional thing would be to build a lock for celery task (see Celery Task Lock receipt).
In your case you want to give the lock a name containing the task name and the url name. And you can use whatever system you want for cache if visible by all your workers (Redis in your case?)

Appropriate method to define a pool of workers in Python

I've started a new Python 3 project in which my goal is to download tweets and analyze them. As I'll be downloading tweets from different subjects, I want to have a pool of workers that must download from Twitter status with the given keywords and store them in a database. I name this workers fetchers.
Other kind of worker is the analyzers whose function is to analyze tweets contents and extract information from them, storing the result in a database also. As I'll be analyzing a lot of tweets, would be a good idea to have a pool of this kind of workers too.
I've been thinking in using RabbitMQ and Celery for this but I have some questions:
General question: Is really a good approach to solve this problem?
I need at least one fetcher worker per downloading task and this could be running for a whole year (actually is a 15 minutes cycle that repeats and last for a year). Is it appropriate to define an "infinite" task?
I've been trying Celery and I used delay to launch some example tasks. The think is that I don't want to call ready() method constantly to check if the task is completed. Is it possible to define a callback? I'm not talking about a celery task callback, just a function defined by myself. I've been searching for this and I don't find anything.
I want to have a single RabbitMQ + Celery server with workers in different networks. Is it possible to define remote workers?
Yeah, it looks like a good approach to me.
There is no such thing as infinite task. You might reschedule a task it to run once in a while. Celery has periodic tasks, so you can schedule a task so that it runs at particular times. You don't necessarily need celery for this. You can also use a cron job if you want.
You can call a function once a task is successfully completed.
from celery.signals import task_success
#task_success(sender='task_i_am_waiting_to_complete')
def call_me_when_my_task_is_done():
pass
Yes, you can have remote workes on different networks.

How can I run a task after other (already started) tasks are finished

I'm new to Celery and I'm trying to understand if it can solve my problem.
I need to start a number of tasks (An) and then run another task (B) after these are done. The problem is that tasks An are added sequentially and I don't want to wait for the last one to be added before I start the first one. Can I configure task B to execute after tasks An are done?
Now to the real scenario:
Task An - Process a file uploaded by user (Added after each file is
uploaded)
Task B - do something with the results of processing all
uploaded files
Alternative solutions are welcome as well
For sure you can do this, celery canvas supports many options, inluding the behaviour you require, running a task after a group of tasks ... it is called "Chords", e.g.:
from celery import chord
from tasks import task_upload1, task_upload2, task_upload3, final_execution
result = chord(task_upload1.s(), task_upload2.s(), task_upload3.s())(final_execution.s())
get_required_result = result.get()
you can refer to this link for more details
With RabbitMQ you can get exact behavior using message acknowledgment and aggregator pattern.
You start worker, that consumes messages (A) and do some work(process a file uploaded by user in your case), but doesn't sent ack when finished. Instead it takes next message form queue and if it's A task again, he is doing the same thing. At some point he will receive task B and could process all previous A's results, all atones and send ack to all of them.
Unfortunately, this scenario can't be done with Celery, because you have to specify all A tasks and final B task(chains, chords, callbacks, etc.) on creating time.
Alternatively, you can save Task.id for each successful A task in separate queue (not Celery queue) and process this messages, when executing B task. Celery can fit for this algorithm.

How to inspect and cancel Celery tasks by task name

I'm using Celery (3.0.15) with Redis as a broker.
Is there a straightforward way to query the number of tasks with a given name that exist in a Celery queue?
And, as a followup, is there a way to cancel all tasks with a given name that exist in a Celery queue?
I've been through the Monitoring and Management Guide and don't see a solution there.
# Retrieve tasks
# Reference: http://docs.celeryproject.org/en/latest/reference/celery.events.state.html
query = celery.events.state.tasks_by_type(your_task_name)
# Kill tasks
# Reference: http://docs.celeryproject.org/en/latest/userguide/workers.html#revoking-tasks
for uuid, task in query:
celery.control.revoke(uuid, terminate=True)
There is one issue that earlier answers have not addressed and may throw off people if they are not aware of it.
Among those solutions already posted, I'd use Danielle's with one minor modification: I'd import the task into my file and use its .name attribute to get the task name to pass to .tasks_by_type().
app.control.revoke(
[uuid for uuid, _ in
celery.events.state.State().tasks_by_type(task.name)])
However, this solution will ignore those tasks that have been scheduled for future execution. Like some people who commented on other answers, when I checked what .tasks_by_type() return I had an empty list. And indeed my queues were empty. But I knew that there were tasks scheduled to be executed in the future and these were my primary target. I could see them by executing celery -A [app] inspect scheduled but they were unaffected by the code above.
I managed to revoke the scheduled tasks by doing this:
app.control.revoke(
[scheduled["request"]["id"] for scheduled in
chain.from_iterable(app.control.inspect().scheduled()
.itervalues())])
app.control.inspect().scheduled() returns a dictionary whose keys are worker names and values are lists of scheduling information (hence, the need for chain.from_iterable which is imported from itertools). The task information is in the "request" field of the scheduling information and "id" contains the task id. Note that even after revocation, the scheduled task will still show among the scheduled tasks. Scheduled tasks that are revoked won't get removed from the list of scheduled tasks until their timers expire or until Celery performs some cleanup operation. (Restarting workers triggers such cleanup.)
You can do this in one request:
app.control.revoke([
uuid
for uuid, _ in
celery.events.state.State().tasks_by_type(task_name)
])
As usual with Celery, none of the answers here worked for me at all, so I did my usual thing and hacked together a solution that just inspects redis directly. Here we go...
# First, get a list of tasks from redis:
import redis, json
r = redis.Redis(
host=settings.REDIS_HOST,
port=settings.REDIS_PORT,
db=settings.REDIS_DATABASES['CELERY'],
)
l = r.lrange('celery', 0, -1)
# Now import the task you want so you can get its name
from my_django.tasks import my_task
# Now, import your celery app and iterate over all tasks
# from redis and nuke the ones that have a matching name.
from my_django.celery_init import app
for task in l:
task_headers = json.loads(task)['headers']
task_name = task_headers["task"]
if task_name == my_task.name:
task_id = task_headers['id']
print("Terminating: %s" % task_id)
app.control.revoke(task_id, terminate=True)
Note that revoking in this way might not revoke prefetched tasks, so you might not see results immediately.
Also, this answer doesn't support prioritized tasks. If you want to modify it to do that, you'll want some of the tips in my other answer that hacks redis.
It looks like flower provides monitoring:
https://github.com/mher/flower
Real-time monitoring using Celery Events
Task progress and history Ability to show task details (arguments,
start time, runtime, and more) Graphs and statistics Remote Control
View worker status and statistics Shutdown and restart worker
instances Control worker pool size and autoscale settings View and
modify the queues a worker instance consumes from View currently
running tasks View scheduled tasks (ETA/countdown) View reserved and
revoked tasks Apply time and rate limits Configuration viewer Revoke
or terminate tasks HTTP API
OpenID authentication

Categories

Resources