I have a simple project where I create a bunch of unrelated chunks of work, create tasks, pass them to Redis, and have a number of workers spread out over a Docker Swarm chew through the queue of long-running tasks. When the workers finish, they dump their completed work in an NFS share and send a text value back to the Celery client.
I'm using celery.result.ResultSet's .join() function on the resultset array of asyncresult objects. The join() includes a callback that (for now) simply prints the result.
My problem is join() blocks until it receives each asyncresult value in the order it was given. My swarm is made up of a number of hosts that are vastly different machines, and it's important to me to have results come back as they finish, not in order or once they are all complete.
Is there a way via Celery to properly trigger a callback function as tasks are finished? I've looked at a lot of examples online and it seems like my only option is to try my luck with asyncio, but Python is not exactly my strong suit.
Func for creating tasks and ResultSet obj:
def populateQueue(encodeTasks):
    r = ResultSet([])
    taskHandles = {}
    for task in encodeTasks:
        try:
            ret = encode.delay(task)
            r.add(ret)
            logging.debug("Task ID: " + str(ret.task_id))
            taskHandles[ret.task_id] = ret
        except:
            logging.info("populateQueue fail: " + str(task.traceback))
    logging.info("Tasks queued: " + str(len(taskHandles)))
    return taskHandles, r
Part of main() which waits for results:
frameCountTotal = getFrameCount(targetFile)
encodeTasks = buildCmdString(targetFile, frameCountTotal, clientCount)
taskHandles, retSet = populateQueue(encodeTasks)
logging.info("Waiting on tasks...")
retSet.join(callback=testCallback)
Thanks in advance
Found an answer to my own question:
ResultSet has another method called join_native(), which I think uses more specific API calls to the broker as long as that broker is one of several known products (RabbitMQ, Redis, etc). Celery's documentation just says that it gives better performance if you meet the broker requirement. What the docs don't say is that it allows for out-of-order returns (at least on Redis, haven't tried RMQ).
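A minimal sketch of how that could look, assuming a result backend that supports native joins (such as Redis). The callback handed to join_native() is called with (task_id, value) each time a result arrives. The names on_task_done, completed and wait_for_results are hypothetical, and result_set stands for the ResultSet returned by populateQueue above.

```python
import logging

# task_id -> return value, filled in the order the results arrive
completed = {}

def on_task_done(task_id, value):
    """Callback for join_native(): invoked once per finished task."""
    completed[task_id] = value
    logging.info("Task %s finished: %s", task_id, value)

def wait_for_results(result_set, timeout=None):
    """Block until every task is done, handling each result as it arrives.

    result_set is assumed to be a celery.result.ResultSet.
    """
    result_set.join_native(timeout=timeout, callback=on_task_done)
    return completed
```

In the question's main(), this would replace the retSet.join(callback=testCallback) line with wait_for_results(retSet).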
Related
Hi, I don't feel like I have quite understood multiprocessing in Python correctly.
I want to run a function called 'run_worker' (which is simply code that runs and manages a subprocess) 20 times in parallel and wait for all the functions to complete. Each run_worker should run on a separate core/thread. I don't mind what order the processes complete in, hence I used async, and I don't have a return value, so I used map.
I thought that I should use:
if __name__ == "__main__":
    num_workers = 20
    param_map = []
    for i in range(num_workers):
        param_map += [experiment_id]
    pool = mp.Pool(processes=num_workers)
    pool.map_async(run_worker, param_map)
    pool.close()
    pool.join()
However, this code exits straight away and doesn't appear to execute run_worker properly. Also, do I really have to create a param_map of the same experiment_id to pass to the worker? This seems like a hack to get the right number of run_workers created. Ideally I would like to run a function with no parameters and no return value over multiple cores.
Note: I am using Windows Server 2019 in AWS.
Edit: added run_worker, which calls a subprocess that writes to a file:
def run_worker(experiment_id):
    hostname = socket.gethostname()
    experiment = conn.experiments(experiment_id).fetch()
    while experiment.progress.observation_count < experiment.observation_budget:
        suggestion = conn.experiments(experiment.id).suggestions().create()
        value = evaluate_model(suggestion.assignments)
        conn.experiments(experiment_id).observations().create(suggestion=suggestion.id,value=value,metadata=dict(hostname=hostname),)
        # Update the experiment object
        experiment = conn.experiments(experiment_id).fetch()
It seems that for this simple purpose you would be better off using pool.map instead of pool.map_async. They both run in parallel, but pool.map blocks until all operations are finished (see also this question). pool.map_async is especially meant for situations like this:
result = pool.map_async(func, iterable)
while not result.ready():
    # do some work while map_async is running
    pass
# blocking call to get the result
out = result.get()
Regarding your question about the parameters, the fundamental idea of a map operation is to map the values of one list/array/iterable to a new list of values of the same size, so map itself always needs an iterable of arguments. A function that takes no parameters can instead be submitted with pool.apply_async(func), once per worker.
If you would also share your run_worker function, that might help to get better answers to your question. That might also clear up why you would run a function without any arguments and return values using a map operation in the first place.
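To make the points above concrete, here is a small sketch of submitting a parameterless function several times and then blocking on each result. It uses multiprocessing.pool.ThreadPool only so the snippet runs anywhere without pickling concerns; multiprocessing.Pool exposes the same apply_async/get methods. The helper name run_all and the example worker are hypothetical. Calling get() on each result also re-raises any exception from the worker, which pool.map_async followed only by close() and join() would silently swallow.

```python
from multiprocessing.pool import ThreadPool  # same API as multiprocessing.Pool

def run_all(func, num_workers):
    """Run func() num_workers times in parallel and wait for all calls."""
    with ThreadPool(processes=num_workers) as pool:
        # apply_async needs no argument list, so func can be parameterless
        pending = [pool.apply_async(func) for _ in range(num_workers)]
        for p in pending:
            # get() blocks until that call finishes and re-raises
            # any exception that was raised inside the worker
            p.get()

def example_worker():
    """Hypothetical stand-in for run_worker: no parameters, no return value."""
    pass

if __name__ == "__main__":
    run_all(example_worker, 20)
```

If a worker fails immediately (for example because of a missing name), the p.get() call raises that error in the parent instead of letting the script exit quietly.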
I am using Celery to asynchronously perform a group of operations. There are a lot of these operations and each may take a long time, so rather than send the results back in the return value of the Celery worker function, I'd like to send them back one at a time as custom state updates. That way the caller can implement a progress bar with a change state callback, and the return value of the worker function can be of constant size rather than linear in the number of operations.
Here is a simple example in which I use the Celery worker function add_pairs_of_numbers to add a list of pairs of numbers, sending back a custom status update for every added pair.
#!/usr/bin/env python
"""
Run worker with:
celery -A tasks worker --loglevel=info
"""
from celery import Celery
app = Celery("tasks", broker="pyamqp://guest@localhost//", backend="rpc://")
@app.task(bind=True)
def add_pairs_of_numbers(self, pairs):
    for x, y in pairs:
        self.update_state(state="SUM", meta={"x":x, "y":y, "x+y":x+y})
    return len(pairs)
def handle_message(message):
    if message["status"] == "SUM":
        x = message["result"]["x"]
        y = message["result"]["y"]
        print(f"Message: {x} + {y} = {x+y}")
def non_looping(*pairs):
    task = add_pairs_of_numbers.delay(pairs)
    result = task.get(on_message=handle_message)
    print(result)
def looping(*pairs):
    task = add_pairs_of_numbers.delay(pairs)
    print(task)
    while True:
        pass
if __name__ == "__main__":
    import sys
    if sys.argv[1:] and sys.argv[1] == "looping":
        looping((3,4), (2,7), (5,5))
    else:
        non_looping((3,4), (2,7), (5,5))
If you run just ./tasks.py it executes the non_looping function. This does the standard Celery thing: makes a delayed call to the worker function and then uses get to wait for the result. A handle_message callback function prints each message, and the number of pairs added is returned as the result. This is what I want.
$ ./tasks.py
Message: 3 + 4 = 7
Message: 2 + 7 = 9
Message: 5 + 5 = 10
3
Though the non-looping scenario is sufficient for this simple example, the real world task I'm trying to accomplish is processing a batch of files instead of adding pairs of numbers. Furthermore the client is a Flask REST API and therefore cannot contain any blocking get calls. In the script above I simulate this constraint with the looping function. This function starts the asynchronous Celery task, but does not wait for a response. (The infinite while loop that follows simulates the web server continuing to run and handle other requests.)
If you run the script with the argument "looping" it runs this code path. Here it immediately prints the Celery task ID then drops into the infinite loop.
$ ./tasks.py looping
a39c54d3-2946-4f4e-a465-4cc3adc6cbe5
The Celery worker logs show that the add operations are performed, but the caller doesn't define a callback function, so it never gets the results.
(I realize that this particular example is embarrassingly parallel, so I could use chunks to divide this up into multiple tasks. However, in my non-simplified real-world case I have tasks that cannot be parallelized.)
What I want is to be able to specify a callback in the looping scenario. Something like this.
def looping(*pairs):
    task = add_pairs_of_numbers.delay(pairs, callback=handle_message) # There is no such callback.
    print(task)
    while True:
        pass
In the Celery documentation and all the examples I can find online (for example this), there is no way to define a callback function as part of the delay call or its apply_async equivalent. You can only specify one as part of a get callback. That's making me think this is an intentional design decision.
In my REST API scenario I can work around this by having the Celery worker process send a "status update" back to the Flask server in the form of an HTTP post, but this seems weird because I'm starting to replicate messaging logic in HTTP that already exists in Celery.
Is there any way to write my looping scenario so that the caller receives callbacks without making a blocking call, or is that explicitly forbidden in Celery?
It's a pattern that is not supported by Celery, although you can (somewhat) work around it by posting custom state updates from your task, as described here.
Use update_state() to update a task’s state:
@app.task(bind=True)
def upload_files(self, filenames):
    for i, file in enumerate(filenames):
        if not self.request.called_directly:
            self.update_state(state='PROGRESS',
                              meta={'current': i, 'total': len(filenames)})
The reason that Celery does not support such a pattern is that task producers (callers) are strongly decoupled from the task consumers (workers). The only communication between the two is the broker, which carries messages from producers to consumers, and the result backend, which carries results from consumers to producers. The closest you can currently get is polling a task's state, or writing a custom result backend that allows you to post events via either AMQP RPC or Redis subscriptions.
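As a rough illustration of the polling option, the sketch below turns a task handle into a small status dictionary without blocking. It relies only on the state and info attributes of Celery's AsyncResult; the function name describe_progress is hypothetical, and the 'PROGRESS' state with its 'current' and 'total' keys is assumed to match the upload_files example above.

```python
def describe_progress(async_result):
    """Return a JSON-friendly snapshot of a task without blocking.

    async_result is assumed to be a celery.result.AsyncResult.
    """
    state = async_result.state  # e.g. "PENDING", "PROGRESS", "SUCCESS"
    info = async_result.info    # meta dict for custom states, return value on success
    if state == "PROGRESS":
        return {"state": state, "current": info["current"], "total": info["total"]}
    if state == "SUCCESS":
        return {"state": state, "result": info}
    if state == "FAILURE":
        # for failed tasks, info holds the raised exception
        return {"state": state, "error": str(info)}
    return {"state": state}
```

A web endpoint could keep only the task ID, rebuild the handle with AsyncResult(task_id, app=app) on each request, and return this dictionary so the browser can poll for progress.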
Key problem: with asyncio.wait(aws, timeout=1, return_when=FIRST_COMPLETED), is there a simple way to check whether a returned task has timed out?
This is an extended question.
The scenario is like this:
The total number of coroutines is unknown.
The server only allows 10 connections.
The server sometimes returns a seemingly correct result that is actually wrong (e.g. an incorrect page).
The server sometimes does not return any data.
I want to retrieve as much of the data as possible.
So in order to get data faster, I need to limit the number of coroutines, check each returned page, and apply a timeout.
There are two simple methods at present.
1. Similar to threads, use a queue to build a coroutine pool plus 10 infinite-loop coroutines. I don't really like it, although in fact this method works very fast.
2. I tried to use the high-level asyncio API of Python 3.7 to simplify the structure of the program, using while tasks, asyncio.wait and return_when.
Here I came across the problem of how to detect timeouts for coroutines.
I built a simple demo:
import asyncio
async def test(delaytime):
    print(f"begin {delaytime}")
    await asyncio.sleep(delaytime )
    print(f"finish {delaytime} ")
async def main():
    # the number of tasks is unknown, range(10) is just a demo
    allts = list(range(10))
    ts = []
    while len(ts)<5:
        arg = allts.pop()
        t = asyncio.create_task(test(arg))
        t.arg = arg
        ts.append(t)
    while ts:
        dones,pendings = await asyncio.wait(ts,timeout=2,return_when=asyncio.FIRST_COMPLETED)
        for t in dones:
            # if check t.result() is error , i can append ts again
            print(t.arg,"is done")
            ts.remove(t)
        while len(ts)<5:
            if len(allts):
                arg = allts.pop()
                t = asyncio.create_task(test(arg))
                t.arg = arg
                ts.append(t)
            else:
                break
        # for t in pendings:
        #     # if can check t is timeout , i can append ts again
        #     pass
if __name__=="__main__":
    asyncio.run(main())
After debugging, I know that with return_when=asyncio.FIRST_COMPLETED, asyncio.wait returns every task that has not yet completed in pendings.
However, I can't tell which task has timed out.
I thought about using wait_for, but wait_for has no return_when argument.
Is there a simple way to identify the timed-out tasks so that I can re-add them to ts?
The issue is that the approach of using wait(return_when=FIRST_COMPLETED) is fundamentally incompatible with the use of timeout. Since different tasks have started at different times, a single timeout argument obviously can't apply to all tasks. If you want to use return_when=FIRST_COMPLETED, wrap each task in asyncio.wait_for:
t = asyncio.create_task(asyncio.wait_for(test(arg), 2))
Then, when the task is done, you can use t.exception() to test whether it has timed out, in which case it will return an asyncio.TimeoutError instance. This check should only be performed on the done tasks, because calling exception() on a task that is still pending raises an error.
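Put together, the approach might look like the sketch below. Each coroutine gets its own timeout through asyncio.wait_for, and asyncio.wait is called without a timeout of its own. The coroutine fetch and the helper run_with_timeouts are hypothetical names used only for illustration.

```python
import asyncio

async def fetch(delay):
    """Hypothetical stand-in for a request that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return delay

async def run_with_timeouts(delays, per_task_timeout):
    # Map each task back to its argument so timed-out work can be retried.
    tasks = {
        asyncio.create_task(asyncio.wait_for(fetch(d), per_task_timeout)): d
        for d in delays
    }
    finished, timed_out = [], []
    pending = set(tasks)
    while pending:
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED
        )
        for t in done:
            # exception() is only safe to call on tasks that are done
            if isinstance(t.exception(), asyncio.TimeoutError):
                timed_out.append(tasks[t])
            else:
                finished.append(t.result())
    return finished, timed_out

if __name__ == "__main__":
    print(asyncio.run(run_with_timeouts([0.01, 0.02, 1.0], 0.2)))
```

The arguments collected in timed_out can then be turned into new tasks and added back to the pending set, which is where the question's "append ts again" step would go.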
I'm trying to parse websites that contain cars' properties (154 kinds of properties). I have a huge list (named liste_test) that consists of 280,000 used-car announcement URLs.
def araba_cekici(liste_test,headers,engine):
    for link in liste_test:
        try:
            page = requests.get(link, headers=headers)
            .....
            .....
When I start my code like that:
araba_cekici(liste_test,headers,engine)
It works and I get results. But in approximately 1 hour I could only obtain the properties for 1,500 URLs. It is very slow, so I must use multiprocessing.
I found a multiprocessing solution on here and applied it to my code, but unfortunately it is not working.
import sys
import numpy as np
import multiprocessing as multi
def chunks(n, page_list):
    """Splits the list into n chunks"""
    return np.array_split(page_list,n)
cpus = multi.cpu_count()
workers = []
page_bins = chunks(cpus, liste_test)
for cpu in range(cpus):
    sys.stdout.write("CPU " + str(cpu) + "\n")
    # Process that will send corresponding list of pages
    # to the function perform_extraction
    worker = multi.Process(name=str(cpu),
                           target=araba_cekici,
                           args=(page_bins[cpu],headers,engine))
    worker.start()
    workers.append(worker)
for worker in workers:
    worker.join()
And it gives:
TypeError: can't pickle _thread.RLock objects
I found some answers regarding this error, but none of them work (at least I can't apply them to my code).
I also tried Python's multiprocessing Pool, but unfortunately it gets stuck in Jupyter Notebook and the code seems to run forever.
Late answer, but since this question turns up when searching on Google: multiprocessing sends the data to the worker processes via a multiprocessing.Queue, which requires all data/objects sent to be picklable.
In your code, you try to pass headers and engine, whose implementations you don't show. (Since headers holds the HTTP request headers, I suspect that engine is the issue here.) To solve your issue, you either have to make engine picklable, or only instantiate engine within the worker process.
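A small sketch of the second option, assuming the engine can be rebuilt from a plain string such as a connection URL. The Engine class here is a hypothetical stand-in (the real engine type is not shown in the question); it holds a lock, which is what makes such objects unpicklable. The helper is_picklable and the engine_url parameter are likewise illustrative.

```python
import pickle
import threading

class Engine:
    """Hypothetical stand-in for the question's `engine`."""
    def __init__(self, url):
        self.url = url
        self._lock = threading.RLock()  # locks cannot be pickled

def is_picklable(obj):
    """Return True if obj could be sent to a worker process."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError, AttributeError):
        return False

def araba_cekici(links, headers, engine_url):
    # Build the engine inside the worker, so only plain picklable
    # arguments (a list, a dict, a string) cross the process boundary.
    engine = Engine(engine_url)
    # Placeholder for the real scraping and database work.
    return [(link, engine.url) for link in links]
```

With this shape, the Process call would pass args=(page_bins[cpu], headers, engine_url) instead of the engine object itself, and is_picklable offers a quick check on any argument before starting the workers.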
I have to run some long-running tasks (2-3 days, I think) over my Django ORM data. I looked around and didn't find any good solutions.
django-tasks - http://code.google.com/p/django-tasks/ is not well documented, and I don't have any idea how to use it.
celery - http://ask.github.com/celery/ seems excessive for my tasks. Is it good for long-running tasks?
So, what I need to do is get all or part of the data from my database, like:
Entry.objects.all()
And then I need to execute the same function for each item of the QuerySet.
I think it should run for around 2-3 days.
So, maybe someone can explain to me how to build it.
P.S.: at the moment I have only one idea: use cron and the database to store the process execution timeline.
Use Celery Sub-Tasks. This will allow you to start a long-running task (with many short-running subtasks underneath it), and keep good data on its execution status within Celery's task result store. As an added bonus, subtasks will be spread across worker processes, allowing you to take full advantage of multi-core servers or even multiple servers in order to reduce task runtime.
http://ask.github.com/celery/userguide/tasksets.html#task-sets
http://docs.celeryproject.org/en/latest/reference/celery.task.sets.html
EDIT: example:
import time, logging as log
from celery.task import task
from celery.task.sets import TaskSet
from app import Entry
@task(send_error_emails=True)
def long_running_analysis():
    entries = list(Entry.objects.all().values('id'))
    num_entries = len(entries)
    # values('id') yields dicts, and subtask() takes its arguments as a tuple
    taskset = TaskSet(analyse_entry.subtask((entry['id'],)) for entry in entries)
    results = taskset.apply_async()
    while not results.ready():
        time.sleep(10000)
        log.info("long_running_analysis is %d%% complete",
                 results.completed_count()*100/num_entries)
    if results.failed():
        log.error("Analysis Failed!")
    result_set = results.join() # brings back results in
                                # the order of entries
    #perform collating or count or percentage calculations here
    log.error("Analysis Complete!")
@task
def analyse_entry(id): # inputs must be serialisable
    logger = analyse_entry.get_logger()
    entry = Entry.objects.get(id=id)
    try:
        analysis = entry.analyse()
        logger.info("'%s' found to be %s.", entry, analysis['status'])
        return analysis # must be a dict or serialisable.
    except Exception as e:
        logger.error("Could not process '%s': %s", entry, e)
        return None
If your calculations cannot be segregated into per-entry tasks, you can always set it up so that one subtask performs tallies and another subtask performs a different type of analysis. This will still work, and will still allow you to benefit from parallelism.