Why are celery tasks flushing each others' sqlalchemy sessions?

Why are celery tasks flushing each others' sqlalchemy sessions? - python

(github minimal verifiable complete example at bottom)
I have a flask endpoint that takes a data submission (just a dict), stores it, spins up a celery task to process the data later, and returns.
The celery tasks' job is to update the fields of a contact using that data, and log the attributes it changes (name, oldvalue, newvalue)
In production, multiple tasks may come into being for the same update-job, and I'd like to prevent them from all submitting updates for the same contact
I tried to do this using a row-lock on the update-job row. After calculating all of the fields that need to be changed, and changing the python object itself, it will then:
Wait for lock on the update-job row
After acquiring the lock re-check if the row is still unprocessed
If yes, add the updated contact object, new change log objects, and the job object (now marked completed) to the session and commit
3.5) If no, abandon, do not try and commit and end the task.
Unfortunately I see multiple tasks committing their staged changes though.
I had assumed that db.session is a scoped session, and is therefore safe for use by different celery tasks forked from the same worker.
Are they instead performing operations on each other's sessions?
Relevant task function:
#celery.task(name="process_update")
def process_update(update_id):
print("Starting")
time.sleep(.5)
update = UpdateJob.query.get(update_id)
# Wait for previous updates to finish and abort after timeout
for attempt in range(40):
unfinished_changes_to_contact = db.session.query(UpdateJob).filter(
UpdateJob.contact_id == update.contact_id,
UpdateJob.id < update.id,
UpdateJob.contact_was_updated == False # noqa
).order_by(UpdateJob.id.desc()).first()
if unfinished_changes_to_contact is None:
break
time.sleep(.5)
else:
print(f"{update.id} waiting for {unfinished_changes_to_contact.id} to finish")
raise Exception("Already existing update to contact that is not finishing")
print("We are clear")
time.sleep(.5)
contact = db.session.query(Contact).get(update.contact_id)
for key, value in json.loads(update.data).items():
setattr(contact, key, value)
changes = field_changes_from_contact(contact)
for change in changes:
change.update_job_id = update.id
# Did another task already cover this update?
print("Rechecking. Locking")
processed_update = UpdateJob.query.with_for_update().get(update_id)
if processed_update.contact_was_updated is True:
print(processed_update)
print("Already done. Abandoning.")
return
print("Looks good. Committing.")
processed_update.contact_was_updated = True
db.session.add(processed_update)
db.session.add(contact)
db.session.bulk_save_objects(changes)
db.session.commit()
print("Commited")
In the logfile (https://ideone.com/wXg3SS) I can see that worker-1 was the first to issue an actual SELECT FOR UPDATE statement to the db, and worker-1 was indeed the first to commit. This is correct behavior.
But afterward, worker 2 should acquire lock, recheck that row to find it's "updated" field was now true, and abandon the whole thing. Instead it commits.
Also there are many statements being issued to the DB in between these events and I don't understand why.
MVCE on github
https://github.com/kfieldm/flask-race

Related

Concurrency on PostgreSQL Database with python subprocesses

I use python multiprocessing processes to establish multiple connections to a postgreSQL database via psycopg.
Every process establishes a connection, creates a cursor, fetches an object from a mp.Queue and does some work on the database. If everything works fine, the changes are commited and the connection is closed.
If one of the processes however creates an error (e.g. an ADD COLUMN request fails, because the COLUMN is already present), all the processes seem to stop working.
import psycopg2
import multiprocessing as mp
import Queue
def connect():
C = psycopg2.connect(host = "myhost", user = "myuser", password = "supersafe", port = 62013, database = "db")
cur = C.cursor()
return C,cur
def commit_and_close(C,cur):
C.commit()
cur.close()
C.close()
def commit(C):
C.commit()
def sub(queue):
C,cur = connect()
while not queue.empty():
work_element = queue.get(timeout=1)
#do something with the work element, that might produce an SQL error
commit_and_close(C,cur)
return 0
if __name__ == '__main__':
job_queue = mp.Queue()
#Fill Job_queue
print 'Run'
for i in range(20):
p=mp.Process(target=sub, args=(job_queue))
p.start()
I can see, that processes are still alive (because the job_queue is still full), but no Network traffic / SQL actions are happening. Is it possible, that an SQL error blocks communication from other subprocesses? How can I prevent that happening?

As chance would have it, I was doing something similar today.
It shouldn't be that the state of one connection can affect a different one, so I don't think we should start there.
There is clearly a race condition in your queue handling. You check if the queue is empty and then try to get a statement to execute. With multiple readers one of the others could empty the queue leaving the others all blocking on their queue.get. If the queue is empty when they all lock up then I would suspect this.
You also never join your processes back when they complete. I'm not sure what effect that would have in the larger picture, but it's probably good practice to clean up.
The other thing that might be happening is that your error-ing process is not rolling back properly. That might leave other transactions waiting to see if it completes or rolls back. They can wait for quite a long time by default but you can configure it.
To see what is happening, fire up psql and check out two useful system views pg_stat_activity and pg_locks. That should show where the cause lies.

Interact with celery ongoing task

We have a distributed architecture based on rabbitMQ and Celery.
We can launch in parallel multiple tasks without any issue. The scalability is good.
Now we need to control the task remotely: PAUSE, RESUME, CANCEL.
The only solution we found is to make in the Celery task a RPC call to another task that replies the command after a DB request. The Celery task and RPC task are not on the same machine and only the RPC task has access to the DB.
Do you have any advice how to improve it and easily communicate with an ongoing task?
Thank you
EDIT:
In fact we would like to do something like in the picture below. It's easy to do the Blue configuration or the Orange, but we don't know how to do both simultaneously.
Workers are subscribing to a common Jobs queue and each worker has its own Admin queue declared on an exchange.
EDIT:
IF this is not possible with Celery, I'am open to a solution with other frameworks like python-rq.

It look like the Control Bus pattern.
For a better scalability and in order to reduce the RPC call, I recommend to reverse the logic. The PAUSE, RESUME, CANCEL command are push to the Celery tasks through a control bus when the state change occurs. The Celery app will store the current state of the Celery app in a store (could be in memory, on the filesystem..). If task states must be kept even after a stop/start of the app, It will involve more work in order to keep both app synchronized (eg. synchronization at startup).

I'd like to demonstrate a general approach to implementing pause-able (and resume-able) ongoing celery tasks through the workflow pattern. Note: Original answer written here. Re-writing here due to this post being very relevant.
Concept
With celery workflows - you can design your entire operation to be divided into a chain of tasks. It doesn't necessarily have to be purely a chain, but it should follow the general concept of one task after another task (or task group) finishes.
Once you have a workflow like that, you can finally define points to pause at throughout your workflow. At each of these points, you can check whether or not the frontend user has requested the operation to pause and continue accordingly. The concept is this:-
A complex and time consuming operation O is split into 5 celery tasks - T1, T2, T3, T4, and T5 - each of these tasks (except the first one) depend on the return value of the previous task.
Let's assume we define points to pause after every single task, so the workflow looks like-
T1 executes
T1 completes, check if user has requested pause
If user has not requested pause - continue
If user has requested pause, serialize the remaining workflow chain and store it somewhere to continue later
... and so on. Since there's a pause point after each task, that check is performed after every one of them (except the last one of course).
But this is only theory, I struggled to find an implementation of this anywhere online so here's what I came up with-
Implementation
from typing import Any, Optional
from celery import shared_task
from celery.canvas import Signature, chain, signature
#shared_task(bind=True)
def pause_or_continue(
self, retval: Optional[Any] = None, clause: dict = None, callback: dict = None
):
# Task to use for deciding whether to pause the operation chain
if signature(clause)(retval):
# Pause requested, call given callback with retval and remaining chain
# chain should be reversed as the order of execution follows from end to start
signature(callback)(retval, self.request.chain[::-1])
self.request.chain = None
else:
# Continue to the next task in chain
return retval
def tappable(ch: chain, clause: Signature, callback: Signature, nth: Optional[int] = 1):
'''
Make a operation workflow chain pause-able/resume-able by inserting
the pause_or_continue task for every nth task in given chain
ch: chain
The workflow chain
clause: Signature
Signature of a task that takes one argument - return value of
last executed task in workflow (if any - othewise `None` is passsed)
- and returns a boolean, indicating whether or not the operation should continue
Should return True if operation should continue normally, or be paused
callback: Signature
Signature of a task that takes 2 arguments - return value of
last executed task in workflow (if any - othewise `None` is passsed) and
remaining chain of the operation workflow as a json dict object
No return value is expected
This task will be called when `clause` returns `True` (i.e task is pausing)
The return value and the remaining chain can be handled accordingly by this task
nth: Int
Check `clause` after every nth task in the chain
Default value is 1, i.e check `clause` after every task
Hence, by default, user given `clause` is called and checked
after every task
NOTE: The passed in chain is mutated in place
Returns the mutated chain
'''
newch = []
for n, sig in enumerate(ch.tasks):
if n != 0 and n % nth == nth - 1:
newch.append(pause_or_continue.s(clause=clause, callback=callback))
newch.append(sig)
ch.tasks = tuple(newch)
return ch
Explanation - pause_or_continue
Here pause_or_continue is the aforementioned pause point. It's a task that will be called at specific intervals (intervals as in task intervals, not as in time intervals). This task then calls a user provided function (actually a task) - clause - to check whether or not the task should continue.
If the clause function (actually a task) returns True, the user provided callback function is called, the latest return value (if any - None otherwise) is passed onto this callback, as well as the remaining chain of tasks. The callback does what it needs to do and pause_or_continue sets self.request.chain to None, which tells celery "The task chain is now empty - everything is finished".
If the clause function (actually a task) returns False, the return value from the previous task (if any - None otherwise) is returned back for the next task to receive - and the chain goes on. Hence the workflow continues.
Why are clause and callback task signatures and not regular functions?
Both clause and callback are being called directly - without delay or apply_async. It is executed in the current process, in the current context. So it behaves exactly like a normal function, then why use signatures?
The answer is serialization. You can't conveniently pass a regular function object to a celery task. But you can pass a task signature. That's exactly what I'm doing here. Both clause and callback should be a regular signature object of a celery task.
What is self.request.chain?
self.request.chain stores a list of dicts (representing jsons as the celery task serializer is json by default) - each of them representing a task signature. Each task from this list is executed in reverse order. Which is why, the list is reversed before passing to the user provided callback function (actually a task) - the user probably expects the order of tasks to be left to right.
Quick note: Irrelevant to this discussion, but if you're using the link parameter from apply_async to construct a chain instead of the chain primitive itself. self.request.callback is the property to be modified (i.e set to None to remove callback and stop chain) instead of self.request.chain
Explanation - tappable
tappable is just a basic function that takes a chain (which is the only workflow primitive covered here, for brevity) and inserts pause_or_continue after every nth task. You can insert them wherever you want really, it is upto you to define pause points in your operation. This is just an example!
For each chain object, the actual signatures of tasks (in order, going from left to right) is stored in the .tasks property. It's a tuple of task signatures. So all we have to do, is take this tuple, convert into a list, insert the pause points and convert back to a tuple to assign to the chain. Then return the modified chain object.
The clause and callback is also attached to the pause_or_continue signature. Normal celery stuff.
That covers the primary concept, but to showcase a real project using this pattern (and also to showcase the resuming part of a paused task), here's a small demo of all the necessary resources
Usage
This example usage assumes the concept of a basic web server with a database. Whenever an operation (i.e workflow chain) is started, it's assigned an id and stored into the database. The schema of that table looks like-
-- Create operations table
-- Keeps track of operations and the users that started them
CREATE TABLE operations (
id INTEGER PRIMARY KEY AUTOINCREMENT,
requester_id INTEGER NOT NULL,
completion TEXT NOT NULL,
workflow_store TEXT,
result TEXT,
FOREIGN KEY (requester_id) REFERENCES user (id)
);
The only field that needs to be known about right now, is completion. It just stores the status of the operation-
When the operation starts and a db entry is created, this is set to IN PROGRESS
When a user requests pause, the route controller (i.e view) modifies this to REQUESTING PAUSE
When the operation actually gets paused and callback (from tappable, inside pause_or_continue) is called, the callback should modify this to PAUSED
When the task is completed, this should be modified to COMPLETED
An example of clause
#celery.task()
def should_pause(_, operation_id: int):
# This is the `clause` to be used for `tappable`
# i.e it lets celery know whether to pause or continue
db = get_db()
# Check the database to see if user has requested pause on the operation
operation = db.execute(
"SELECT * FROM operations WHERE id = ?", (operation_id,)
).fetchone()
return operation["completion"] == "REQUESTING PAUSE"
This is the task to call at the pause points, to determine whether or not to pause. It's a function that takes 2 parameters.....well sort of. The first one is mandatory, tappable requires the clause to have one (and exactly one) argument - so it can pass the previous task's return value to it (even if that return value is None). In this example, the return value isn't required to be used - so we can just ignore it.
The second parameter is an operation id. See, all this clause does - is check a database for the operation (the workflow) entry and see if it has the status REQUESTING PAUSE. To do that, it needs to know the operation id. But clause should be a task with one argument, what gives?
Well, good thing signatures can be partial. When the task is first started and a tappable chain is created. The operation id is known and hence we can do should_pause.s(operation_id) to get the signature of a task that takes one parameter, that being the return value of the previous task. That qualifies as a clause!
An example of callback
import os
import json
from typing import Any, List
#celery.task()
def save_state(retval: Any, chains: dict, operation_id: int):
# This is the `callback` to be used for `tappable`
# i.e this is called when an operation is pausing
db = get_db()
# Prepare directories to store the workflow
operation_dir = os.path.join(app.config["OPERATIONS"], f"{operation_id}")
workflow_file = os.path.join(operation_dir, "workflow.json")
if not os.path.isdir(operation_dir):
os.makedirs(operation_dir, exist_ok=True)
# Store the remaining workflow chain, serialized into json
with open(workflow_file, "w") as f:
json.dump(chains, f)
# Store the result from the last task and the workflow json path
db.execute(
"""
UPDATE operations
SET completion = ?,
workflow_store = ?,
result = ?
WHERE id = ?
""",
("PAUSED", workflow_file, f"{retval}", operation_id),
)
db.commit()
And here's the task to be called when the task is being paused. Remember, this should take the last executed task's return value and the remaining list of signatures (in order, from left to right). There's an extra param - operation_id - once again. The explanation for this is the same as the one for clause.
This function stores the remaining chain in a json file (since it's a list of dicts). Remember, you can use a different serializer - I'm using json since it's the default task serializer used by celery.
After storing the remaining chain, it updates the completion status to PAUSED and also logs the path to the json file into the db.
Now, let's see these in action-
An example of starting the workflow
def start_operation(user_id, *operation_args, **operation_kwargs):
db = get_db()
operation_id: int = db.execute(
"INSERT INTO operations (requester_id, completion) VALUES (?, ?)",
(user_id, "IN PROGRESS"),
).lastrowid
# Convert a regular workflow chain to a tappable one
tappable_workflow = tappable(
(T1.s() | T2.s() | T3.s() | T4.s() | T5.s(operation_id)),
should_pause.s(operation_id),
save_state.s(operation_id),
)
# Start the chain (i.e send task to celery to run asynchronously)
tappable_workflow(*operation_args, **operation_kwargs)
db.commit()
return operation_id
A function that takes in a user id and starts an operation workflow. This is more or less an impractical dummy function modeled around a view/route controller. But I think it gets the general idea through.
Assume T[1-4] are all unit tasks of the operation, each taking the previous task's return as an argument. Just an example of a regular celery chain, feel free to go wild with your chains.
T5 is a task that saves the final result (result from T4) to the database. So along with the return value from T4 it needs the operation_id. Which is passed into the signature.
An example of pausing the workflow
def pause(operation_id):
db = get_db()
operation = db.execute(
"SELECT * FROM operations WHERE id = ?", (operation_id,)
).fetchone()
if operation and operation["completion"] == "IN PROGRESS":
# Pause only if the operation is in progress
db.execute(
"""
UPDATE operations
SET completion = ?
WHERE id = ?
""",
("REQUESTING PAUSE", operation_id),
)
db.commit()
return 'success'
return 'invalid id'
This employs the previously mentioned concept of modifying the db entry to change completion to REQUESTING PAUSE. Once this is committed, the next time pause_or_continue calls should_pause, it'll know that the user has requested the operation to pause and it'll do so accordingly.
An example of resuming the workflow
def resume(operation_id):
db = get_db()
operation = db.execute(
"SELECT * FROM operations WHERE id = ?", (operation_id,)
).fetchone()
if operation and operation["completion"] == "PAUSED":
# Resume only if the operation is paused
with open(operation["workflow_store"]) as f:
# Load the remaining workflow from the json
workflow_json = json.load(f)
# Load the chain from the json (i.e deserialize)
workflow_chain = chain(signature(x) for x in serialized_ch)
# Start the chain and feed in the last executed task result
workflow_chain(operation["result"])
db.execute(
"""
UPDATE operations
SET completion = ?
WHERE id = ?
""",
("IN PROGRESS", operation_id),
)
db.commit()
return 'success'
return 'invalid id'
Recall that, when the operation is paused - the remaining workflow is stored in a json. Since we are currently restricting the workflow to a chain object. We know this json is a list of signatures that should be turned into a chain. So, we deserialize it accordingly and send it to the celery worker.
Note that, this remaining workflow still has the pause_or_continue tasks as they were originally - so this workflow itself, is once again pause-able/resume-able. When it pauses, the workflow.json will simply be updated.

Pyramid: Multi-threaded Database Operation

My application receives one or more URLs (typically 3-4 URLs) from the user, scrapes certain data from those URLs and writes those data to the database. However, because scraping those data take a little while, I was thinking of running each of those scraping in a separate thread so that the scraping + writing to the database can keep on going in the background so that the user does not have to keep on waiting.
To implement that, I have (relevant parts only):
#view_config(route_name="add_movie", renderer="templates/add_movie.jinja2")
def add_movie(request):
post_data = request.POST
if "movies" in post_data:
movies = post_data["movies"].split(os.linesep)
for movie_id in movies:
movie_thread = Thread(target=store_movie_details, args=(movie_id,))
movie_thread.start()
return {}
def store_movie_details(movie_id):
movie_details = scrape_data(movie_id)
new_movie = Movie(**movie_details) # Movie is my model.
print new_movie # Works fine.
print DBSession.add(movies(**movie_details)) # Returns None.
While the line new_movie does print correct scrapped data, DBSession.add() doesn't work. In fact, it just returns None.
If I remove the threads and just call the method store_movie_details(), it works fine.
What's going on?

Firstly, the SA docs on Session.add() do not mention anything about the method's return value, so I would assume it is expected to return None.
Secondly, I think you meant to add new_movie to the session, not movies(**movie_details), whatever that is :)
Thirdly, the standard Pyramid session (the one configured with ZopeTransactionExtension) is tied to Pyramid's request-response cycle, which may produce unexpected behavior in your situation. You need to configure a separate session which you will need to commit manually in store_movie_details. This session needs to use scoped_session so the session object is thread-local and is not shared across threads.
from sqlalchemy.orm import scoped_session
from sqlalchemy.orm import sessionmaker
session_factory = sessionmaker(bind=some_engine)
AsyncSession = scoped_session(session_factory)
def store_movie_details(movie_id):
session = AsyncSession()
movie_details = scrape_data(movie_id)
new_movie = Movie(**movie_details) # Movie is my model.
session.add(new_movie)
session.commit()
And, of course, this approach is only suitable for very light-weight tasks, and if you don't mind occasionally losing a task (when the webserver restarts, for example). For anything more serious have a look at Celery etc. as Antoine Leclair suggests.

The transaction manager closes the transaction when the response is returned. DBSession has no transaction in the other threads when the response has been returned. Also, it's probably not a good idea to share a transaction across threads.
This is a typical use case for a worker. Check out Celery or RQ.

Reporting yielded results of long-running Celery task

Problem
I've segmented a long-running task into logical subtasks, so I can report the results of each subtask as it completes. However, I'm trying to report the results of a task that will effectively never complete (instead yielding values as it goes), and am struggling to do so with my existing solution.
Background
I'm building a web interface to some Python programs I've written. Users can submit jobs through web forms, then check back to see the job's progress.
Let's say I have two functions, each accessed via separate forms:
med_func: Takes ~1 minute to execute, results are passed off to render(), which produces additional data.
long_func: Returns a generator. Each yield takes on the order of 30 minutes, and should be reported to the user. There are so many yields, we can consider this iterator as infinite (terminating only when revoked).
Code, current implementation
With med_func, I report results as follows:
On form submission, I save an AsyncResult to a Django session:
task_result = med_func.apply_async([form], link=render.s())
request.session["task_result"] = task_result
The Django view for the results page accesses this AsyncResult. When a task has completed, results are saved into an object that is passed as context to a Django template.
def results(request):
""" Serve (possibly incomplete) results of a session's latest run. """
session = request.session
try: # Load most recent task
task_result = session["task_result"]
except KeyError: # Already cleared, or doesn't exist
if "results" not in session:
session["status"] = "No job submitted"
else: # Extract data from Asynchronous Tasks
session["status"] = task_result.status
if task_result.ready():
session["results"] = task_result.get()
render_task = task_result.children[0]
# Decorate with rendering results
session["render_status"] = render_task.status
if render_task.ready():
session["results"].render_output = render_task.get()
del(request.session["task_result"]) # Don't need any more
return render_to_response('results.html', request.session)
This solution only works when the function actually terminates. I can't chain together logical subtasks of long_func, because there are an unknown number of yields (each iteration of long_func's loop may not produce a result).
Question
Is there any sensible way to access yielded objects from an extremely long-running Celery task, so that they can be displayed before the generator is exhausted?

In order for Celery to know what the current state of the task is, it sets some metadata in whatever result backend you have. You can piggy-back on that to store other kinds of metadata.
def yielder():
for i in range(2**100):
yield i
#task
def report_progress():
for progress in yielder():
# set current progress on the task
report_progress.backend.mark_as_started(
report_progress.request.id,
progress=progress)
def view_function(request):
task_id = request.session['task_id']
task = AsyncResult(task_id)
progress = task.info['progress']
# do something with your current progress
I wouldn't throw a ton of data in there, but it works well for tracking the progress of a long-running task.

Paul's answer is great. As an alternative to using mark_as_started you can use Task's update_state method. They ultimately do the same thing, but the name "update_state" is a little more appropriate for what you're trying to do. You can optionally define a custom state that indicates your task is in progress (I've named my custom state 'PROGRESS'):
def yielder():
for i in range(2**100):
yield i
#task
def report_progress():
for progress in yielder():
# set current progress on the task
report_progress.update_state(state='PROGRESS', meta={'progress': progress})
def view_function(request):
task_id = request.session['task_id']
task = AsyncResult(task_id)
progress = task.info['progress']
# do something with your current progress

Celery part:
def long_func(*args, **kwargs):
i = 0
while True:
yield i
do_something_here(*args, **kwargs)
i += 1
#task()
def test_yield_task(task_id=None, **kwargs):
the_progress = 0
for the_progress in long_func(**kwargs):
cache.set('celery-task-%s' % task_id, the_progress)
Webclient side, starting task:
r = test_yield_task.apply_async()
request.session['task_id'] = r.task_id
Testing last yielded value:
v = cache.get('celery-task-%s' % session.get('task_id'))
if v:
do_someting()
If you do not like to use cache, or it's impossible, you can use db, file or any other place which celery worker and server side will have both accesss. With cache it's a simplest solution, but workers and server have to use the same cache.

A couple options to consider:
1 -- task groups. If you can enumerate all the sub tasks from the time of invocation, you can apply the group as a whole -- that returns a TaskSetResult object you can use to monitor the results of the group as a whole, or of individual tasks in the group -- query this as-needed when you need to check status.
2 -- callbacks. If you can't enumerate all sub tasks (or even if you can!) you can define a web hook / callback that's the last step in the task -- called when the rest of the task completes. The hook would be against a URI in your app that ingests the result and makes it available via DB or app-internal API.
Some combination of these could solve your challenge.

See also this great PyCon preso from one of the Instagram engineers.
http://blogs.vmware.com/vfabric/2013/04/how-instagram-feeds-work-celery-and-rabbitmq.html
At video mark 16:00, he discusses how they structure long lists of sub-tasks.

Task management daemon

i have to do some long-time (2-3 days i think) tasks with my django ORM data. I look around and didnt find any good solutions.
django-tasks - http://code.google.com/p/django-tasks/ is not well documented, and i dont have any ideas how to use it.
celery - http://ask.github.com/celery/ is excessive for my tasks. Is it good for longtime tasks?
So, what i need to do, i just get all data or parts of data from my database, like:
Entry.objects.all()
And then i need execute same function for each one of QuerySet.
I think it should work around 2-3 days.
So, maybe someone explain for me how to build it.
P.S:at the moment i have only one idea, use cron and database to store process execution timeline.

Use Celery Sub-Tasks. This will allow you to start a long-running task (with many short-running subtasks underneath it), and keep good data on it's execution status within Celery's task result store. As an added bonus, subtasks will be spread across worker proccesses allowing you to take full advantage of multi-core servers or even multiple servers in order to reduce task runtime.
http://ask.github.com/celery/userguide/tasksets.html#task-sets
http://docs.celeryproject.org/en/latest/reference/celery.task.sets.html
EDIT: example:
import time, logging as log
from celery.task import task
from celery.task.sets import TaskSet
from app import Entry
#task(send_error_emails=True)
def long_running_analysis():
entries = list(Entry.objects.all().values('id'))
num_entries = len(entries)
taskset = TaskSet(analyse_entry.subtask(entry.id) for entry in entries)
results = taskset.apply_async()
while not results.ready()
time.sleep(10000)
print log.info("long_running_analysis is %d% complete",
completed_count()*100/num_entries)
if results.failed():
log.error("Analysis Failed!")
result_set = results.join() # brings back results in
# the order of entries
#perform collating or count or percentage calculations here
log.error("Analysis Complete!")
#task
def analyse_entry(id): # inputs must be serialisable
logger = analyse_entry.get_logger()
entry = Entry.objects.get(id=id)
try:
analysis = entry.analyse()
logger.info("'%s' found to be %s.", entry, analysis['status'])
return analysis # must be a dict or serialisable.
except Exception as e:
logger.error("Could not process '%s': %s", entry, e)
return None
If your calculations cannot be seggregated to per-entry tasks, you can always set it up so that one subtask performs tallys, one subtask performs another analysis type. and this will still work, and will still allow you to benifit from parelelleism.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Why are celery tasks flushing each others' sqlalchemy sessions? - python

Related

Concurrency on PostgreSQL Database with python subprocesses

Interact with celery ongoing task

Pyramid: Multi-threaded Database Operation

Reporting yielded results of long-running Celery task

Task management daemon

Categories

Resources