Interact with celery ongoing task - python

We have a distributed architecture based on rabbitMQ and Celery.
We can launch in parallel multiple tasks without any issue. The scalability is good.
Now we need to control the task remotely: PAUSE, RESUME, CANCEL.
The only solution we found is to make in the Celery task a RPC call to another task that replies the command after a DB request. The Celery task and RPC task are not on the same machine and only the RPC task has access to the DB.
Do you have any advice how to improve it and easily communicate with an ongoing task?
Thank you
EDIT:
In fact we would like to do something like in the picture below. It's easy to do the Blue configuration or the Orange, but we don't know how to do both simultaneously.
Workers are subscribing to a common Jobs queue and each worker has its own Admin queue declared on an exchange.
EDIT:
IF this is not possible with Celery, I'am open to a solution with other frameworks like python-rq.

It look like the Control Bus pattern.
For a better scalability and in order to reduce the RPC call, I recommend to reverse the logic. The PAUSE, RESUME, CANCEL command are push to the Celery tasks through a control bus when the state change occurs. The Celery app will store the current state of the Celery app in a store (could be in memory, on the filesystem..). If task states must be kept even after a stop/start of the app, It will involve more work in order to keep both app synchronized (eg. synchronization at startup).

I'd like to demonstrate a general approach to implementing pause-able (and resume-able) ongoing celery tasks through the workflow pattern. Note: Original answer written here. Re-writing here due to this post being very relevant.
Concept
With celery workflows - you can design your entire operation to be divided into a chain of tasks. It doesn't necessarily have to be purely a chain, but it should follow the general concept of one task after another task (or task group) finishes.
Once you have a workflow like that, you can finally define points to pause at throughout your workflow. At each of these points, you can check whether or not the frontend user has requested the operation to pause and continue accordingly. The concept is this:-
A complex and time consuming operation O is split into 5 celery tasks - T1, T2, T3, T4, and T5 - each of these tasks (except the first one) depend on the return value of the previous task.
Let's assume we define points to pause after every single task, so the workflow looks like-
T1 executes
T1 completes, check if user has requested pause
If user has not requested pause - continue
If user has requested pause, serialize the remaining workflow chain and store it somewhere to continue later
... and so on. Since there's a pause point after each task, that check is performed after every one of them (except the last one of course).
But this is only theory, I struggled to find an implementation of this anywhere online so here's what I came up with-
Implementation
from typing import Any, Optional
from celery import shared_task
from celery.canvas import Signature, chain, signature
#shared_task(bind=True)
def pause_or_continue(
self, retval: Optional[Any] = None, clause: dict = None, callback: dict = None
):
# Task to use for deciding whether to pause the operation chain
if signature(clause)(retval):
# Pause requested, call given callback with retval and remaining chain
# chain should be reversed as the order of execution follows from end to start
signature(callback)(retval, self.request.chain[::-1])
self.request.chain = None
else:
# Continue to the next task in chain
return retval
def tappable(ch: chain, clause: Signature, callback: Signature, nth: Optional[int] = 1):
'''
Make a operation workflow chain pause-able/resume-able by inserting
the pause_or_continue task for every nth task in given chain
ch: chain
The workflow chain
clause: Signature
Signature of a task that takes one argument - return value of
last executed task in workflow (if any - othewise `None` is passsed)
- and returns a boolean, indicating whether or not the operation should continue
Should return True if operation should continue normally, or be paused
callback: Signature
Signature of a task that takes 2 arguments - return value of
last executed task in workflow (if any - othewise `None` is passsed) and
remaining chain of the operation workflow as a json dict object
No return value is expected
This task will be called when `clause` returns `True` (i.e task is pausing)
The return value and the remaining chain can be handled accordingly by this task
nth: Int
Check `clause` after every nth task in the chain
Default value is 1, i.e check `clause` after every task
Hence, by default, user given `clause` is called and checked
after every task
NOTE: The passed in chain is mutated in place
Returns the mutated chain
'''
newch = []
for n, sig in enumerate(ch.tasks):
if n != 0 and n % nth == nth - 1:
newch.append(pause_or_continue.s(clause=clause, callback=callback))
newch.append(sig)
ch.tasks = tuple(newch)
return ch
Explanation - pause_or_continue
Here pause_or_continue is the aforementioned pause point. It's a task that will be called at specific intervals (intervals as in task intervals, not as in time intervals). This task then calls a user provided function (actually a task) - clause - to check whether or not the task should continue.
If the clause function (actually a task) returns True, the user provided callback function is called, the latest return value (if any - None otherwise) is passed onto this callback, as well as the remaining chain of tasks. The callback does what it needs to do and pause_or_continue sets self.request.chain to None, which tells celery "The task chain is now empty - everything is finished".
If the clause function (actually a task) returns False, the return value from the previous task (if any - None otherwise) is returned back for the next task to receive - and the chain goes on. Hence the workflow continues.
Why are clause and callback task signatures and not regular functions?
Both clause and callback are being called directly - without delay or apply_async. It is executed in the current process, in the current context. So it behaves exactly like a normal function, then why use signatures?
The answer is serialization. You can't conveniently pass a regular function object to a celery task. But you can pass a task signature. That's exactly what I'm doing here. Both clause and callback should be a regular signature object of a celery task.
What is self.request.chain?
self.request.chain stores a list of dicts (representing jsons as the celery task serializer is json by default) - each of them representing a task signature. Each task from this list is executed in reverse order. Which is why, the list is reversed before passing to the user provided callback function (actually a task) - the user probably expects the order of tasks to be left to right.
Quick note: Irrelevant to this discussion, but if you're using the link parameter from apply_async to construct a chain instead of the chain primitive itself. self.request.callback is the property to be modified (i.e set to None to remove callback and stop chain) instead of self.request.chain
Explanation - tappable
tappable is just a basic function that takes a chain (which is the only workflow primitive covered here, for brevity) and inserts pause_or_continue after every nth task. You can insert them wherever you want really, it is upto you to define pause points in your operation. This is just an example!
For each chain object, the actual signatures of tasks (in order, going from left to right) is stored in the .tasks property. It's a tuple of task signatures. So all we have to do, is take this tuple, convert into a list, insert the pause points and convert back to a tuple to assign to the chain. Then return the modified chain object.
The clause and callback is also attached to the pause_or_continue signature. Normal celery stuff.
That covers the primary concept, but to showcase a real project using this pattern (and also to showcase the resuming part of a paused task), here's a small demo of all the necessary resources
Usage
This example usage assumes the concept of a basic web server with a database. Whenever an operation (i.e workflow chain) is started, it's assigned an id and stored into the database. The schema of that table looks like-
-- Create operations table
-- Keeps track of operations and the users that started them
CREATE TABLE operations (
id INTEGER PRIMARY KEY AUTOINCREMENT,
requester_id INTEGER NOT NULL,
completion TEXT NOT NULL,
workflow_store TEXT,
result TEXT,
FOREIGN KEY (requester_id) REFERENCES user (id)
);
The only field that needs to be known about right now, is completion. It just stores the status of the operation-
When the operation starts and a db entry is created, this is set to IN PROGRESS
When a user requests pause, the route controller (i.e view) modifies this to REQUESTING PAUSE
When the operation actually gets paused and callback (from tappable, inside pause_or_continue) is called, the callback should modify this to PAUSED
When the task is completed, this should be modified to COMPLETED
An example of clause
#celery.task()
def should_pause(_, operation_id: int):
# This is the `clause` to be used for `tappable`
# i.e it lets celery know whether to pause or continue
db = get_db()
# Check the database to see if user has requested pause on the operation
operation = db.execute(
"SELECT * FROM operations WHERE id = ?", (operation_id,)
).fetchone()
return operation["completion"] == "REQUESTING PAUSE"
This is the task to call at the pause points, to determine whether or not to pause. It's a function that takes 2 parameters.....well sort of. The first one is mandatory, tappable requires the clause to have one (and exactly one) argument - so it can pass the previous task's return value to it (even if that return value is None). In this example, the return value isn't required to be used - so we can just ignore it.
The second parameter is an operation id. See, all this clause does - is check a database for the operation (the workflow) entry and see if it has the status REQUESTING PAUSE. To do that, it needs to know the operation id. But clause should be a task with one argument, what gives?
Well, good thing signatures can be partial. When the task is first started and a tappable chain is created. The operation id is known and hence we can do should_pause.s(operation_id) to get the signature of a task that takes one parameter, that being the return value of the previous task. That qualifies as a clause!
An example of callback
import os
import json
from typing import Any, List
#celery.task()
def save_state(retval: Any, chains: dict, operation_id: int):
# This is the `callback` to be used for `tappable`
# i.e this is called when an operation is pausing
db = get_db()
# Prepare directories to store the workflow
operation_dir = os.path.join(app.config["OPERATIONS"], f"{operation_id}")
workflow_file = os.path.join(operation_dir, "workflow.json")
if not os.path.isdir(operation_dir):
os.makedirs(operation_dir, exist_ok=True)
# Store the remaining workflow chain, serialized into json
with open(workflow_file, "w") as f:
json.dump(chains, f)
# Store the result from the last task and the workflow json path
db.execute(
"""
UPDATE operations
SET completion = ?,
workflow_store = ?,
result = ?
WHERE id = ?
""",
("PAUSED", workflow_file, f"{retval}", operation_id),
)
db.commit()
And here's the task to be called when the task is being paused. Remember, this should take the last executed task's return value and the remaining list of signatures (in order, from left to right). There's an extra param - operation_id - once again. The explanation for this is the same as the one for clause.
This function stores the remaining chain in a json file (since it's a list of dicts). Remember, you can use a different serializer - I'm using json since it's the default task serializer used by celery.
After storing the remaining chain, it updates the completion status to PAUSED and also logs the path to the json file into the db.
Now, let's see these in action-
An example of starting the workflow
def start_operation(user_id, *operation_args, **operation_kwargs):
db = get_db()
operation_id: int = db.execute(
"INSERT INTO operations (requester_id, completion) VALUES (?, ?)",
(user_id, "IN PROGRESS"),
).lastrowid
# Convert a regular workflow chain to a tappable one
tappable_workflow = tappable(
(T1.s() | T2.s() | T3.s() | T4.s() | T5.s(operation_id)),
should_pause.s(operation_id),
save_state.s(operation_id),
)
# Start the chain (i.e send task to celery to run asynchronously)
tappable_workflow(*operation_args, **operation_kwargs)
db.commit()
return operation_id
A function that takes in a user id and starts an operation workflow. This is more or less an impractical dummy function modeled around a view/route controller. But I think it gets the general idea through.
Assume T[1-4] are all unit tasks of the operation, each taking the previous task's return as an argument. Just an example of a regular celery chain, feel free to go wild with your chains.
T5 is a task that saves the final result (result from T4) to the database. So along with the return value from T4 it needs the operation_id. Which is passed into the signature.
An example of pausing the workflow
def pause(operation_id):
db = get_db()
operation = db.execute(
"SELECT * FROM operations WHERE id = ?", (operation_id,)
).fetchone()
if operation and operation["completion"] == "IN PROGRESS":
# Pause only if the operation is in progress
db.execute(
"""
UPDATE operations
SET completion = ?
WHERE id = ?
""",
("REQUESTING PAUSE", operation_id),
)
db.commit()
return 'success'
return 'invalid id'
This employs the previously mentioned concept of modifying the db entry to change completion to REQUESTING PAUSE. Once this is committed, the next time pause_or_continue calls should_pause, it'll know that the user has requested the operation to pause and it'll do so accordingly.
An example of resuming the workflow
def resume(operation_id):
db = get_db()
operation = db.execute(
"SELECT * FROM operations WHERE id = ?", (operation_id,)
).fetchone()
if operation and operation["completion"] == "PAUSED":
# Resume only if the operation is paused
with open(operation["workflow_store"]) as f:
# Load the remaining workflow from the json
workflow_json = json.load(f)
# Load the chain from the json (i.e deserialize)
workflow_chain = chain(signature(x) for x in serialized_ch)
# Start the chain and feed in the last executed task result
workflow_chain(operation["result"])
db.execute(
"""
UPDATE operations
SET completion = ?
WHERE id = ?
""",
("IN PROGRESS", operation_id),
)
db.commit()
return 'success'
return 'invalid id'
Recall that, when the operation is paused - the remaining workflow is stored in a json. Since we are currently restricting the workflow to a chain object. We know this json is a list of signatures that should be turned into a chain. So, we deserialize it accordingly and send it to the celery worker.
Note that, this remaining workflow still has the pause_or_continue tasks as they were originally - so this workflow itself, is once again pause-able/resume-able. When it pauses, the workflow.json will simply be updated.

Related

Why are celery tasks flushing each others' sqlalchemy sessions?

(github minimal verifiable complete example at bottom)
I have a flask endpoint that takes a data submission (just a dict), stores it, spins up a celery task to process the data later, and returns.
The celery tasks' job is to update the fields of a contact using that data, and log the attributes it changes (name, oldvalue, newvalue)
In production, multiple tasks may come into being for the same update-job, and I'd like to prevent them from all submitting updates for the same contact
I tried to do this using a row-lock on the update-job row. After calculating all of the fields that need to be changed, and changing the python object itself, it will then:
Wait for lock on the update-job row
After acquiring the lock re-check if the row is still unprocessed
If yes, add the updated contact object, new change log objects, and the job object (now marked completed) to the session and commit
3.5) If no, abandon, do not try and commit and end the task.
Unfortunately I see multiple tasks committing their staged changes though.
I had assumed that db.session is a scoped session, and is therefore safe for use by different celery tasks forked from the same worker.
Are they instead performing operations on each other's sessions?
Relevant task function:
#celery.task(name="process_update")
def process_update(update_id):
print("Starting")
time.sleep(.5)
update = UpdateJob.query.get(update_id)
# Wait for previous updates to finish and abort after timeout
for attempt in range(40):
unfinished_changes_to_contact = db.session.query(UpdateJob).filter(
UpdateJob.contact_id == update.contact_id,
UpdateJob.id < update.id,
UpdateJob.contact_was_updated == False # noqa
).order_by(UpdateJob.id.desc()).first()
if unfinished_changes_to_contact is None:
break
time.sleep(.5)
else:
print(f"{update.id} waiting for {unfinished_changes_to_contact.id} to finish")
raise Exception("Already existing update to contact that is not finishing")
print("We are clear")
time.sleep(.5)
contact = db.session.query(Contact).get(update.contact_id)
for key, value in json.loads(update.data).items():
setattr(contact, key, value)
changes = field_changes_from_contact(contact)
for change in changes:
change.update_job_id = update.id
# Did another task already cover this update?
print("Rechecking. Locking")
processed_update = UpdateJob.query.with_for_update().get(update_id)
if processed_update.contact_was_updated is True:
print(processed_update)
print("Already done. Abandoning.")
return
print("Looks good. Committing.")
processed_update.contact_was_updated = True
db.session.add(processed_update)
db.session.add(contact)
db.session.bulk_save_objects(changes)
db.session.commit()
print("Commited")
In the logfile (https://ideone.com/wXg3SS) I can see that worker-1 was the first to issue an actual SELECT FOR UPDATE statement to the db, and worker-1 was indeed the first to commit. This is correct behavior.
But afterward, worker 2 should acquire lock, recheck that row to find it's "updated" field was now true, and abandon the whole thing. Instead it commits.
Also there are many statements being issued to the DB in between these events and I don't understand why.
MVCE on github
https://github.com/kfieldm/flask-race

Getting all task IDs from nested chains and chords

I'm using Celery 3.1.9 with a Redis backend. The job that I'm running is made of several subtasks which run in chords and chains. The structure looks like this:
prepare
download data (a chord of 2 workers)
parse and store downloaded data
long-running chord of 4 workers
finalize
generate report
Each item in the list is a subtask, they are all chained together. Steps 2 and 4 are chords. The whole thing is wired up by creating a chord for step 4 whose callback is a chain of 4 -> 6, then a chord is created for step 2, whose callback is 3 -> first chord. Then, finally a chain is created 1 -> second chord. This chain is then started with delay() and its ID is stored in the database.
The problem is two-fold. First I want to be able to revoke the whole thing and second I want to have a custom on_failure on my Task class that does some cleanup, and reports the failure to the user.
Currently I store the chain's task ID. I thought I could use this to revoke the chain. Also, in case of an error I wanted to walk the chain to its root (in the on_failure handler) to retrieve the relevant record from the database. This doesn't work, because when you re-create an instance of AsyncResult with just the ID of the task, its parent attribute is None.
The second thing I tried was to store the result of serializable() called on the outer chain's result. This however, does not return the entire tree of AsyncResult objects, it just returns the IDs of the first level in the chain (so not the IDs of the children in the chords.)
The third thing I tried was to implement serializable() myself, but as it turns out, the reason why the original method doesn't go further than 2 levels is because the chain's children are celery.canvas.chord objects, instead of AsyncResult instances.
An illustration of the problem:
chord([
foo.si(),
foo.si(),
foo.si(),
], bar.si() | bar.si())
res = chord.apply_async()
pprint(res.serializable())
Prints the following:
(('50c9eb94-7a63-49dc-b491-6fce5fed3713',
('d95a82b7-c107-4a2c-81eb-296dc3fb88c3',
[(('7c72310b-afc7-4010-9de4-e64cd9d30281', None), None),
(('2cb80041-ff29-45fe-b40c-2781b17e59dd', None), None),
(('e85ab83d-dd44-44b5-b79a-2bbf83c4332f', None), None)])),
None)
The first ID is the ID of the callback chain, the second ID is from the chord task itself, and the last three are the actual tasks inside the chord. But I can't get at the result from the task inside the callback chain (i.e. the ID of the two bar.si() calls).
Is there any way to get at the actual task IDs?
One hacky way is calling the tasks with apply_async, save the task ids and wait for them manually. In this way you will have complete control of happens but you should only wait for async tasks as last resort. Now you can access task id, return value, etc. For example something like this:
task1 = a_task.apply_async()
task2 = b_task.apply_async()
task3 = c_task.apply_async()
tasks = [task1, task2, task3]
for task in tasks:
task.wait()
I have a nested dag which is a mixture of groups and chains. The following recursive method works well to get the task_ids and their result:
import celery
def get_task_id_result_tuple_list(run_dag, with_result=True):
task_id_result_list = []
# for groups, parents are first task, then iterate over the children
if isinstance(run_dag, celery.result.GroupResult):
entry = (run_dag.parent, run_dag.parent.result) if with_result else run_dag.parent
task_id_result_list.append(entry)
children = run_dag.children
for child in children:
task_id_result_list.extend(get_task_id_result_tuple_list(child, with_result))
# for AsyncResults, append parents in reverse
elif isinstance(run_dag, celery.result.AsyncResult):
ch = run_dag
ch_list = [(ch, ch.result)] if with_result else [ch]
while ch.parent is not None:
ch = ch.parent
entry = (ch, ch.result) if with_result else ch
ch_list.append(entry)
# remember to reverse the list to get the calling order
task_id_result_list.extend(reversed(ch_list))
return task_id_result_list
# dag is the nested celery structure of chains and groups
run_dag = dag.apply_async()
task_id_result_tuples = get_task_id_result_tuple_list(run_dag)
task_id_only = get_task_id_result_tuple_list(run_dag, False)
NOTE: I have not tested this with chords yet but I imagine it would either work as is or would maybe need another conditional branch to handle that.

Django celery chain

I want to pass the result of one task to another task. I am using chain
som = chain (task_async_get_me_friends.s((userId), parse_friends.s()))()
q = som.get()
print q
My intention is to create 2 task. First task get the friends of the user and then pass this friends in the JSON object to the parse_friends task. I am getting the result from task_async_get_me_friends but then cannot get parse_friends to be called
#celery.task
def task_async_get_me_friends(userId, *args):
logger.info('First do something')
users_friends = fb_get_friends(userId)
# Till here everything is all good, I did see the celery logger. Getting result from fb
return {'result':'success', 'data':users_friends}
#celery.task
def parse_friends(users_friends,*args,**kwargs):
# This log line i cannot see in the celery
logger.info('Second do something'+str(users_friends))
# Do something with users_friends
EDIT: realized I had misunderstood which function did which
I'm still getting up to speed on celery, but I don't think your chain is doing what you want. Specifically chain takes a list of tasks; you're only providing 1 task (which happens to be using a 2nd task). I think what you want is:
som = chain (task_async_get_me_friends.s(userId),parse_friends.s())
This should call parse_friends and when that result is returned, pass that to task_async_get_me_friends (which has the first argument passed in as userId but is "waiting" for chain to provide the second argument (the json result).

Reporting yielded results of long-running Celery task

Problem
I've segmented a long-running task into logical subtasks, so I can report the results of each subtask as it completes. However, I'm trying to report the results of a task that will effectively never complete (instead yielding values as it goes), and am struggling to do so with my existing solution.
Background
I'm building a web interface to some Python programs I've written. Users can submit jobs through web forms, then check back to see the job's progress.
Let's say I have two functions, each accessed via separate forms:
med_func: Takes ~1 minute to execute, results are passed off to render(), which produces additional data.
long_func: Returns a generator. Each yield takes on the order of 30 minutes, and should be reported to the user. There are so many yields, we can consider this iterator as infinite (terminating only when revoked).
Code, current implementation
With med_func, I report results as follows:
On form submission, I save an AsyncResult to a Django session:
task_result = med_func.apply_async([form], link=render.s())
request.session["task_result"] = task_result
The Django view for the results page accesses this AsyncResult. When a task has completed, results are saved into an object that is passed as context to a Django template.
def results(request):
""" Serve (possibly incomplete) results of a session's latest run. """
session = request.session
try: # Load most recent task
task_result = session["task_result"]
except KeyError: # Already cleared, or doesn't exist
if "results" not in session:
session["status"] = "No job submitted"
else: # Extract data from Asynchronous Tasks
session["status"] = task_result.status
if task_result.ready():
session["results"] = task_result.get()
render_task = task_result.children[0]
# Decorate with rendering results
session["render_status"] = render_task.status
if render_task.ready():
session["results"].render_output = render_task.get()
del(request.session["task_result"]) # Don't need any more
return render_to_response('results.html', request.session)
This solution only works when the function actually terminates. I can't chain together logical subtasks of long_func, because there are an unknown number of yields (each iteration of long_func's loop may not produce a result).
Question
Is there any sensible way to access yielded objects from an extremely long-running Celery task, so that they can be displayed before the generator is exhausted?
In order for Celery to know what the current state of the task is, it sets some metadata in whatever result backend you have. You can piggy-back on that to store other kinds of metadata.
def yielder():
for i in range(2**100):
yield i
#task
def report_progress():
for progress in yielder():
# set current progress on the task
report_progress.backend.mark_as_started(
report_progress.request.id,
progress=progress)
def view_function(request):
task_id = request.session['task_id']
task = AsyncResult(task_id)
progress = task.info['progress']
# do something with your current progress
I wouldn't throw a ton of data in there, but it works well for tracking the progress of a long-running task.
Paul's answer is great. As an alternative to using mark_as_started you can use Task's update_state method. They ultimately do the same thing, but the name "update_state" is a little more appropriate for what you're trying to do. You can optionally define a custom state that indicates your task is in progress (I've named my custom state 'PROGRESS'):
def yielder():
for i in range(2**100):
yield i
#task
def report_progress():
for progress in yielder():
# set current progress on the task
report_progress.update_state(state='PROGRESS', meta={'progress': progress})
def view_function(request):
task_id = request.session['task_id']
task = AsyncResult(task_id)
progress = task.info['progress']
# do something with your current progress
Celery part:
def long_func(*args, **kwargs):
i = 0
while True:
yield i
do_something_here(*args, **kwargs)
i += 1
#task()
def test_yield_task(task_id=None, **kwargs):
the_progress = 0
for the_progress in long_func(**kwargs):
cache.set('celery-task-%s' % task_id, the_progress)
Webclient side, starting task:
r = test_yield_task.apply_async()
request.session['task_id'] = r.task_id
Testing last yielded value:
v = cache.get('celery-task-%s' % session.get('task_id'))
if v:
do_someting()
If you do not like to use cache, or it's impossible, you can use db, file or any other place which celery worker and server side will have both accesss. With cache it's a simplest solution, but workers and server have to use the same cache.
A couple options to consider:
1 -- task groups. If you can enumerate all the sub tasks from the time of invocation, you can apply the group as a whole -- that returns a TaskSetResult object you can use to monitor the results of the group as a whole, or of individual tasks in the group -- query this as-needed when you need to check status.
2 -- callbacks. If you can't enumerate all sub tasks (or even if you can!) you can define a web hook / callback that's the last step in the task -- called when the rest of the task completes. The hook would be against a URI in your app that ingests the result and makes it available via DB or app-internal API.
Some combination of these could solve your challenge.
See also this great PyCon preso from one of the Instagram engineers.
http://blogs.vmware.com/vfabric/2013/04/how-instagram-feeds-work-celery-and-rabbitmq.html
At video mark 16:00, he discusses how they structure long lists of sub-tasks.

Task management daemon

i have to do some long-time (2-3 days i think) tasks with my django ORM data. I look around and didnt find any good solutions.
django-tasks - http://code.google.com/p/django-tasks/ is not well documented, and i dont have any ideas how to use it.
celery - http://ask.github.com/celery/ is excessive for my tasks. Is it good for longtime tasks?
So, what i need to do, i just get all data or parts of data from my database, like:
Entry.objects.all()
And then i need execute same function for each one of QuerySet.
I think it should work around 2-3 days.
So, maybe someone explain for me how to build it.
P.S:at the moment i have only one idea, use cron and database to store process execution timeline.
Use Celery Sub-Tasks. This will allow you to start a long-running task (with many short-running subtasks underneath it), and keep good data on it's execution status within Celery's task result store. As an added bonus, subtasks will be spread across worker proccesses allowing you to take full advantage of multi-core servers or even multiple servers in order to reduce task runtime.
http://ask.github.com/celery/userguide/tasksets.html#task-sets
http://docs.celeryproject.org/en/latest/reference/celery.task.sets.html
EDIT: example:
import time, logging as log
from celery.task import task
from celery.task.sets import TaskSet
from app import Entry
#task(send_error_emails=True)
def long_running_analysis():
entries = list(Entry.objects.all().values('id'))
num_entries = len(entries)
taskset = TaskSet(analyse_entry.subtask(entry.id) for entry in entries)
results = taskset.apply_async()
while not results.ready()
time.sleep(10000)
print log.info("long_running_analysis is %d% complete",
completed_count()*100/num_entries)
if results.failed():
log.error("Analysis Failed!")
result_set = results.join() # brings back results in
# the order of entries
#perform collating or count or percentage calculations here
log.error("Analysis Complete!")
#task
def analyse_entry(id): # inputs must be serialisable
logger = analyse_entry.get_logger()
entry = Entry.objects.get(id=id)
try:
analysis = entry.analyse()
logger.info("'%s' found to be %s.", entry, analysis['status'])
return analysis # must be a dict or serialisable.
except Exception as e:
logger.error("Could not process '%s': %s", entry, e)
return None
If your calculations cannot be seggregated to per-entry tasks, you can always set it up so that one subtask performs tallys, one subtask performs another analysis type. and this will still work, and will still allow you to benifit from parelelleism.

Categories

Resources