How to create a shared counter in Celery?

Is there a way to have a shared counter (shared between workers) in Celery? I am also open to other ideas on how to solve my problem, but would like to stick to Celery. Here is my problem:
I have a task that is dependent on an index passed to it. These tasks could pass or fail, but I need to target a number of passed tasks. If a job fails it should kick off a new job with the next available index.
I can of course do this through a function that tracks the active jobs and initiates the new jobs, but if there was something built in that'd be great.

You can use the task_failure Celery signal.
    from celery.signals import task_failure

    @task_failure.connect
    def fail_task_handler(sender=None, task_id=None, exception=None, **kwargs):
        print('a task has failed')
        # start a new task or do something else
More at http://celery.readthedocs.org/en/latest/userguide/signals.html#task-failure
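The bookkeeping described in the question (keep handing out the next index until enough jobs have passed) can be sketched without Celery. This is only an illustration under stated assumptions: `run_until_target`, `run_job` and `max_attempts` are hypothetical names, and the loop runs jobs one after another in a single process. With real workers, `run_job` would dispatch a task, and the index would have to come from storage shared by all workers (for example an atomic counter in Redis) rather than from a local `itertools.count`.

```python
from itertools import count


def run_until_target(run_job, target_successes, max_attempts=1000):
    """Run jobs with increasing indices until enough of them succeed.

    run_job is a hypothetical callable that takes an index and raises
    an exception when the job fails.
    """
    indices = count()  # source of the "next available index"
    succeeded = []
    attempts = 0
    while len(succeeded) < target_successes and attempts < max_attempts:
        index = next(indices)
        attempts += 1
        try:
            run_job(index)
        except Exception:
            # A failed job just uses up its index; the loop moves on
            # to the next one.
            continue
        succeeded.append(index)
    return succeeded
```

The return value is the list of indices whose jobs passed, so the caller can tell which indices were consumed by failures.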

Related

Python Celery: Updating the state of an AsyncResult

After a parent task succeeds, I want to update its state depending on the results of some of its child tasks.
However:
1/ I cannot find a way to retrieve the actual task instance based on its id, only its AsyncResult
    def level5_success(task_id):
        result = app.AsyncResult(task_id)
        # Set the parent task state (does not work)
        app.AsyncResult(task_id).update_state(state='HOWAREYOUDOING')
2/ I cannot find a way to update the state of an AsyncResult, only with the task itself using update_state:
    def on_level4_success(sender, *args, **kwargs):
        sender.update_state(state='HOWAREYOUDOING')
Any idea?

It feels like you are operating outside the bounds of what celery is designed to do. Coordination of work state and process is supposed to be done with the worker canvas, not by monkeying around with the celery internals. Even if you manage to get it to work, I doubt that state hacking is in the contract celery intends to keep with its API; it is entirely possible that your work will be broken by future changes to celery.
What are you trying to do that you cannot do with groups, chords and chains?

reuse results for celery tasks

Is there a common way to store and reuse Celery task results without executing the tasks again? My metasearch project has many HTTP fetch tasks, and I want to cut down on useless HTTP requests (they can take a long time and return the same results) by storing the result of the first request and returning it without actually fetching again. It would also be very useful not to start a new fetch task when the same one is already in progress. Instead of running a new job, the app should return the AsyncResult, by id, of the already pending task (the id is unique and generated from the task call args).
It looks like I need to define new apply_async (Celery.send_task) behavior for tasks with the same task_id:
if the task with the given task_id hasn't started yet, start it
if the task with the given task_id has already started, return AsyncResult(task_id) without actually running the task
the @task decorator should accept a new ttl kwarg to determine the cache time (only for the redis backend?)

It looks like the simplest answer is to store your results in a cache (such as a database) and ask the cache for the result first, firing the HTTP request only if it is not there.
I don't think there's anything specific to Celery that can do this.
Edit:
To handle tasks that are sent at the same time, you can additionally build a lock for the Celery task (see the Celery task lock recipe).
In your case, give the lock a name containing the task name and the URL. You can use whatever system you want for the cache, as long as it is visible to all your workers (Redis in your case?).
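The cache-then-lock flow described above might look roughly like the following sketch. Everything in it is an assumption made for illustration: `InMemoryCache` is a single-process stand-in for a cache shared by all workers (such as Redis or memcached), and `cache_key` and `fetch_once` are hypothetical helper names, not part of Celery.

```python
import hashlib
import threading
import time


class InMemoryCache:
    """Single-process stand-in for a cache shared by all workers."""

    def __init__(self):
        self._data = {}
        self._mutex = threading.Lock()

    def add(self, key, value, ttl):
        """Set key only if it is absent or expired; report whether it was set."""
        now = time.monotonic()
        with self._mutex:
            entry = self._data.get(key)
            if entry is not None and entry[1] > now:
                return False
            self._data[key] = (value, now + ttl)
            return True

    def set(self, key, value, ttl):
        with self._mutex:
            self._data[key] = (value, time.monotonic() + ttl)

    def get(self, key):
        with self._mutex:
            entry = self._data.get(key)
        if entry is None or entry[1] <= time.monotonic():
            return None
        return entry[0]

    def delete(self, key):
        with self._mutex:
            self._data.pop(key, None)


def cache_key(task_name, url):
    """Build a key from the task name and the URL, as suggested above."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return task_name + ":" + digest


def fetch_once(cache, task_name, url, fetch, result_ttl=300, lock_ttl=60):
    """Return the cached result, or fetch it while holding a lock.

    Returns None when another caller currently holds the lock.
    """
    key = cache_key(task_name, url)
    cached = cache.get("result:" + key)
    if cached is not None:
        return cached
    lock_key = "lock:" + key
    if not cache.add(lock_key, "1", lock_ttl):
        return None  # the same fetch is already in progress
    try:
        result = fetch(url)
        cache.set("result:" + key, result, result_ttl)
        return result
    finally:
        cache.delete(lock_key)
```

Inside a Celery task, `fetch` would be the real HTTP call. The lock expiry (`lock_ttl`) is there so that a crashed worker cannot block the URL forever.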

Having error queues in celery

Is there any way in Celery to automatically put a task into another queue if its execution fails?
For example, if the task is running in a queue x, on exception enqueue it to another queue named error_x.
Edit:
Currently I am using celery==3.0.13 along with django 1.4, with RabbitMQ as the broker.
Sometimes the task fails. Is there a way in Celery to add messages to an error queue and process them later?
The problem when a Celery task fails is that I don't have access to the message queue name, so I can't use self.retry to put it into a different error queue.

Well, you cannot use the retry mechanism if you want to route the task to another queue. From the docs:
retry() can be used to re-execute the task, for example in the event
of recoverable errors.
When you call retry it will send a new message, using the same
task-id, and it will take care to make sure the message is delivered
to the same queue as the originating task.
You'll have to relaunch the task yourself and route it manually to the queue you want whenever an exception is raised. This looks like a good job for error callbacks.
The main issue is that we need the task name in the error callback to be able to relaunch the task. We also may not want to add the callback by hand each time we launch a task, so a decorator is a good way to add the right callback automatically.
    from functools import partial, wraps

    import celery

    @celery.shared_task
    def error_callback(task_id, task_name, retry_queue, retry_routing_key):
        # We must retrieve the task object itself.
        # `tasks` is a dict of 'task_name': celery_task_object
        task = celery.current_app.tasks[task_name]
        # Relaunch the task in the specified queue.
        # Note: the original call arguments are not forwarded here.
        task.apply_async(queue=retry_queue, routing_key=retry_routing_key)

    def retrying_task(retry_queue, retry_routing_key):
        """Decorates function to automatically add error callbacks."""
        def retrying_decorator(func):
            @celery.shared_task
            @wraps(func)  # just to keep the original task name
            def wrapper(*args, **kwargs):
                return func(*args, **kwargs)
            # Monkey patch the apply_async method to add the callback.
            wrapper.apply_async = partial(
                wrapper.apply_async,
                link_error=error_callback.s(wrapper.name, retry_queue, retry_routing_key)
            )
            return wrapper
        return retrying_decorator

    # Usage:
    @retrying_task(retry_queue='another_queue', retry_routing_key='another_routing_key')
    def failing_task():
        print('Hi, I will fail!')
        raise Exception("I'm failing!")

    failing_task.apply_async()
You can adjust the decorator to pass whatever parameters you need.

I had a similar problem and solved it, maybe not in the most efficient way. My solution is as follows:
I created a Django model that stores all my Celery task ids and can check each task's state.
Then I created another Celery task that runs in an infinite loop, checks the actual state of every task marked 'RUNNING', and simply reruns any task whose state is 'FAILED'. I'm not actually changing the queue for the tasks I rerun, but I think you could implement some custom logic to decide where to put each task you rerun this way.

Python / rq - monitoring worker status

If this is an idiotic question, I apologize and will go hide my head in shame, but:
I'm using rq to queue jobs in Python. I want it to work like this:
Job A starts. Job A grabs data via web API and stores it.
Job A runs.
Job A completes.
Upon completion of A, job B starts. Job B checks each record stored by job A and adds some additional response data.
Upon completion of job B, user gets a happy e-mail saying their report's ready.
My code so far:
    from redis import Redis
    from rq import Queue, Worker, use_connection

    redis_conn = Redis()
    use_connection(redis_conn)
    q = Queue('normal', connection=redis_conn) # this is terrible, I know - fixing later
    w = Worker(q)
    job = q.enqueue(getlinksmod.lsGet, theURL, total, domainid)
    w.work()
I assumed my best solution was to have 2 workers, one for job A and one for B. The job B worker could monitor job A and, when job A was done, get started on job B.
What I can't figure out to save my life is how I get one worker to monitor the status of another. I can grab the job ID from job A with job.id. I can grab the worker name with w.name. But haven't the foggiest as to how I pass any of that information to the other worker.
Or, is there a much simpler way to do this that I'm totally missing?

Update, January 2015: this pull request has now been merged, and the parameter has been renamed to depends_on, i.e.:
    second_job = q.enqueue(email_customer, depends_on=first_job)
The original post left intact for people running older versions and such:
I have submitted a pull request (https://github.com/nvie/rq/pull/207) to handle job dependencies in RQ. When this pull request gets merged in, you'll be able to do:
    def generate_report():
        pass

    def email_customer():
        pass

    first_job = q.enqueue(generate_report)
    second_job = q.enqueue(email_customer, after=first_job)
    # In the second enqueue call, job is created,
    # but only moved into queue after first_job finishes
For now, I suggest writing a wrapper function to sequentially run your jobs. For example:
    def generate_report():
        pass

    def email_customer():
        pass

    def generate_report_and_email():
        generate_report()
        email_customer() # You can also enqueue this function, if you really want to

    # Somewhere else
    q.enqueue(generate_report_and_email)

According to the rq docs, each job object has a result attribute, accessible as job.result, which you can check. If the job hasn't finished, it will be None, but if you make sure that your job returns some value (even just "Done"), your other worker can check the result of the first job and begin working only when job.result has a value, meaning the first job has completed.

You are probably too deep into your project to switch, but if not, take a look at Twisted. http://twistedmatrix.com/trac/ I am using it right now for a project that hits APIs, scrapes web content, etc. It runs multiple jobs in parallel, as well as organizing certain jobs in order, so Job B doesn't execute until Job A is done.
This is the best tutorial for learning Twisted, if you want to attempt it: http://krondo.com/?page_id=1327

Combine the things that job A and job B do in one function, and then use e.g. multiprocessing.Pool (its map_async method) to farm that out over different processes.
I'm not familiar with rq, but multiprocessing is a part of the standard library. By default it uses as many processes as your CPU has cores, which in my experience is usually enough to saturate the machine.

Celery: standard method for querying pending tasks?

Is there any standard/backend-independent method for querying pending tasks based on certain fields?
For example, I have a task which needs to run once after the “last user interaction”, and I'd like to implement it something like:
    def user_changed_content():
        task = find_task(name="handle_content_change")
        if task is None:
            task = queue_task("handle_content_change")
        task.set_eta(datetime.now() + timedelta(minutes=5))
        task.save()
Or is it simpler to hook directly into the storage backend?

No, this is not possible.
Even if some transports may support accessing the "queue" out of order (e.g. Redis), it is not a good idea.
The task may not be on the queue anymore, and may instead be reserved by a worker.
See this part in the documentation: http://docs.celeryproject.org/en/latest/userguide/tasks.html#state
Given that, a better approach would be for the task to check whether it should reschedule itself when it starts:
    # `redis` is assumed to be an already configured Redis client.
    @task
    def reschedules():
        new_eta = redis.get(".".join([reschedules.request.id, "new_eta"]))
        if new_eta:
            return reschedules.retry(eta=new_eta)
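For the "run once, five minutes after the last user interaction" case from the question, the check the task performs when it starts could be as small as the sketch below. `next_eta` is a hypothetical helper, and it assumes the time of the last interaction is kept somewhere all workers can read it, which the answer does not spell out.

```python
from datetime import datetime, timedelta


def next_eta(last_interaction, now, quiet_period=timedelta(minutes=5)):
    """Return None if the task may run now, else the time to retry at."""
    due = last_interaction + quiet_period
    if now >= due:
        return None
    return due


# Example: the user last touched the content two minutes ago.
last = datetime(2024, 1, 1, 12, 0, 0)
eta = next_eta(last, now=last + timedelta(minutes=2))
```

Inside the task, a non-None value would be passed on as retry(eta=...), while None means the quiet period has elapsed and the real work can run.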
