Aliases for celery tasks - python

I am switching to a new task naming scheme. Some parts of the code still use the old names, and some use the new ones. So, my question is: what is the proper way of aliasing Celery tasks?
@task
def new_task_name():
    pass
old_task_name = new_task_name # doesn't work
app.tasks['old_task_name'] = new_task_name # still doesn't work
I get an error similar to this:
Received unregistered task of type 'app.tasks.old_task_name'
UPDATE:
My current solution is forwarding tasks. But I still hope there's a cleaner approach:
@task
def old_task_name():
    new_task_name.delay()

@app.task(name='this-is-the-name')
def new_task_name():
    pass

This question is ancient but a more direct way to do this is:
@task(name='old-name')
def old_task_name(*args, **kwargs):
    return new_task_name(*args, **kwargs)
Celery tasks can still be called as normal methods too.

Related

Call method on many objects in parallel

I wanted to use concurrency in Python for the first time. So I started reading a lot about Python concurrency (GIL, threads vs processes, multiprocessing vs concurrent.futures vs ...) and saw a lot of convoluted examples, even among those using the high-level concurrent.futures library.
So I decided to just start trying stuff and was surprised with the very, very simple code I ended up with:
from concurrent.futures import ThreadPoolExecutor
class WebHostChecker(object):
    def __init__(self, websites):
        self.webhosts = []
        for website in websites:
            self.webhosts.append(WebHost(website))
    def __iter__(self):
        return iter(self.webhosts)
    def check_all(self):
        # sequential:
        #for webhost in self:
        #    webhost.check()
        # threaded:
        with ThreadPoolExecutor(max_workers=10) as executor:
            executor.map(lambda webhost: webhost.check(), self.webhosts)
class WebHost(object):
    def __init__(self, hostname):
        self.hostname = hostname
    def check(self):
        print("Checking {}".format(self.hostname))
        self.check_dns() # only modifies internal state, i.e.: sets self.dns
        self.check_http() # only modifies internal status, i.e.: sets self.http
Using the classes looks like this:
webhostchecker = WebHostChecker(["urla.com", "urlb.com"])
webhostchecker.check_all() # -> this calls .check() on all WebHost instances in parallel
The relevant multiprocessing/threading code is only 3 lines. I barely had to modify my existing code (which I hoped to be able to do when first starting to write the code for sequential execution, but started to doubt after reading the many examples online).
And... it works! :)
It perfectly distributes the IO-waiting among multiple threads and runs in less than 1/3 of the time of the original program.
So, now, my question(s):
What am I missing here?
Could I implement this differently? (Should I?)
Why are other examples so convoluted? (Although I must say I couldn't find an exact example doing a method call on multiple objects)
Will this code get me in trouble when I expand my program with features/code I cannot predict right now?
I think I already know of one potential problem and it would be nice if someone can confirm my reasoning: if WebHost.check() also becomes CPU bound I won't be able to swap ThreadPoolExecutor for ProcessPoolExecutor. Because every process will get cloned versions of the WebHost instances? And I would have to code something to sync those cloned instances back to the original?
Any insights/comments/remarks/improvements/... that can bring me to greater understanding will be much appreciated! :)
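On the ProcessPoolExecutor concern above: arguments sent to a worker process are pickled, so the worker mutates a copy and the parent's objects stay unchanged. Note also that a lambda cannot be pickled by the standard pickle module, so the lambda passed to executor.map would need to become a module-level function. Below is a minimal sketch that only simulates the process boundary with a pickle round trip instead of starting real processes. The names Host and check_in_worker are hypothetical stand-ins, not part of the original code.

```python
import pickle


class Host:
    """Hypothetical stand-in for WebHost; not the original class."""

    def __init__(self, hostname):
        self.hostname = hostname
        self.dns = None

    def check(self):
        # Pretend lookup: only mutates internal state, like WebHost.check().
        self.dns = "resolved:" + self.hostname
        return self.dns


def check_in_worker(host):
    # Simulate crossing a process boundary: the worker gets a pickled copy.
    # (Assumption for illustration; no real process is started here.)
    copy = pickle.loads(pickle.dumps(host))
    return copy.check()  # only the return value travels back to the parent


hosts = [Host("urla.com"), Host("urlb.com")]
results = [check_in_worker(h) for h in hosts]

# The originals were not modified by the "workers" ...
untouched = [h.dns for h in hosts]

# ... so the parent has to copy the returned state back itself.
for host, dns in zip(hosts, results):
    host.dns = dns
```

The design consequence is that check() would have to return whatever state it computes, and the parent would assign it back, rather than relying on in-place mutation.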
Ok, so I'll add my own first gotcha:
If webhost.check() raises an Exception, then the thread just ends and self.dns and/or self.http might NOT have been set. However, with the current code, you won't see the Exception UNLESS you also access the executor.map() results! That left me wondering why some objects raised AttributeErrors after running check_all() :)
This can easily be fixed by just evaluating every result (which is always None, because I'm not letting .check() return anything). You can do it after all threads have run or while they are running. I chose to let Exceptions be raised while they run (i.e. within the with statement), so the program stops at the first unexpected error:
def check_all(self):
    with ThreadPoolExecutor(max_workers=10) as executor:
        # this alone works, but does not raise any exceptions from the threads:
        #executor.map(lambda webhost: webhost.check(), self.webhosts)
        for i in executor.map(lambda webhost: webhost.check(), self.webhosts):
            pass
I guess I could also use list(executor.map(lambda webhost: webhost.check(), self.webhosts)) but that would unnecessarily use up memory.
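If stopping at the first error is not what you want, an alternative is to submit each call individually and inspect every future, so that one failing host does not hide the others. This is a minimal sketch under stated assumptions: FakeHost and the "bad.example" hostname are hypothetical stand-ins for WebHost, and the module-level check_all function is an illustration, not the original method.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class FakeHost:
    """Hypothetical stand-in for WebHost; fails for one hostname."""

    def __init__(self, hostname):
        self.hostname = hostname
        self.dns = None

    def check(self):
        if self.hostname == "bad.example":
            raise ValueError("lookup failed for " + self.hostname)
        self.dns = "ok"


def check_all(hosts, max_workers=10):
    """Run check() on every host; return {hostname: exception} for failures."""
    errors = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which future belongs to which host.
        futures = {executor.submit(host.check): host for host in hosts}
        for future in as_completed(futures):
            exc = future.exception()  # None if check() finished normally
            if exc is not None:
                errors[futures[future].hostname] = exc
    return errors


hosts = [FakeHost("urla.com"), FakeHost("bad.example"), FakeHost("urlb.com")]
errors = check_all(hosts)
```

Here every host is checked regardless of failures, and the caller decides afterwards what to do with the collected exceptions.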

Handling errors in python with multiple tasks

I have multiple tasks in my Python workflow and would like to know what the best way to handle errors would be.
class Task1():
    is_ready = False
    def run(self):
        try:
            a = 0/0
            # some more operations
            self._is_ready = True
        except:
            print('logging errors')
class Task2():
    _is_ready = False
    def run(self):
        try:
            a = 1
            # some more operations
            self._is_ready = True
        except:
            print('logging errors')
class Workflow():
    def run(self):
        self.task1 = Task1()
        self.task2 = Task2()
        self.task1.run()
        if self.task1.is_ready:
            self.task2.run()
w = Workflow()
w.run()
I basically want to run the tasks sequentially, with each one depending on whether the previous task raised errors,
i.e. if task1 runs fine, then process task2.
Can you please let me know whether the above approach is the right way?
I have 10 tasks in total, and adding multiple if statements does not sound like a great way.
There are really two questions here. One is about how to arrange the tasks in a sequence, the other is about how to break the sequence if one task fails.
If you want any kind of scalability, you will need an iterable of tasks, so that you can run a for loop over it. Using nested ifs is totally impractical as you yourself noticed. The basic structure will be conceptually something like this:
tasks = [Task1(), Task2(), ...]
for task in tasks:
    task.run()
    if task.failed():
        break
None of the portions of the loop need to appear as written. The loop itself can be replaced with any, all or next. The status check can be an attribute check, a method call or even an implied exception.
You have a number of options for how to decide if a task failed:
1. Use an internal flag as you are currently doing. Make sure that the flag has a consistent name in all the task classes (notice the typo _is_ready in Task2). This is a bit of overkill unless you have a use-case that really requires it, since it provides redundant information, and not very elegantly at that.
2. Use a return value in run. This is much nicer because you can write
for task in tasks:
    if not task.run():
        break
Or alternatively (as @MichaelButscher cleverly suggested)
all(task.run() for task in tasks)
In either case, your task should look like this:
class Task1:
    def run(self):
        try:
            # Some stuff
        except SomeException:
            # Log error
            return False
        return True
3. Just let the error propagate from the task implementation:
class Task1:
    def run(self):
        try:
            # Some stuff
        except SomeException:
            # Log error
            raise
I prefer this method to all the others because that's what exceptions are basically for in the first place. In this case, your loop will be even more minimalistic:
for task in tasks:
    task.run()
Or alternatively, but more obscurely
any(task.run() for task in tasks)
Or even
from collections import deque
deque((task.run() for task in tasks), maxlen=0)
The last two variants are really there only for reference purposes. If you go with exceptions, just write the basic for loop: it's plenty elegant enough and by far the least arcane.
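To make the exception-based approach concrete, here is a minimal sketch of a runner that stops at the first failing task and reports which tasks completed. TaskFailed, run_all and the three toy task classes are hypothetical names introduced only for this illustration.

```python
import logging


class TaskFailed(Exception):
    """Hypothetical exception type used only in this sketch."""


def run_all(tasks):
    """Run tasks in order; return the names of those that completed."""
    completed = []
    try:
        for task in tasks:
            task.run()
            completed.append(type(task).__name__)
    except TaskFailed:
        # The first failure ends the loop; later tasks never start.
        logging.exception("workflow stopped early")
    return completed


class Good:
    def run(self):
        pass


class Bad:
    def run(self):
        raise TaskFailed("boom")


class NeverReached:
    def run(self):
        raise AssertionError("should not run")


done = run_all([Good(), Bad(), NeverReached()])
```

Catching the exception once, around the whole loop, keeps the individual tasks free of status flags.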
Finally, I would recommend another fundamental change. If your tasks are truly arbitrary in nature, then you should consider allowing any callable taking no arguments to be a task. There is no particular need to restrict yourself to classes having a run method. If you need to have a task class, you can rename the method you call run to __call__, and all your instances will be callable with the () operator. The code would look conceptually like this then:
class CallableClass:
    def __call__(self):
        try:
            # Do something
        except:
            # Log error
            raise
def callable_function():
    try:
        # Do something
    except:
        # Log error
        raise
for task in tasks:
    task()
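As a small illustration of treating any zero-argument callable as a task, the sketch below mixes a callable class instance, a plain function, and a function bound to its argument with functools.partial. The names Download, cleanup, notify and log are hypothetical and exist only for this example.

```python
from functools import partial

log = []


class Download:
    """Hypothetical task class; instances are callable via __call__."""

    def __init__(self, url):
        self.url = url

    def __call__(self):
        log.append("download " + self.url)


def cleanup():
    log.append("cleanup")


def notify(recipient):
    log.append("notify " + recipient)


# partial() turns a function that needs arguments into a zero-argument callable.
tasks = [Download("urla.com"), partial(notify, "ops"), cleanup]

for task in tasks:
    task()
```

The loop does not need to know which kind of object it is calling, which is the point of dropping the run method requirement.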
If the run() methods could return a boolean success value and each task should only be run if previous succeeded, then it could be done like:
class Workflow():
    def run(self):
        task_list = (Task1(), Task2(), Task3(), ...)
        success = all(t.run() for t in task_list)
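The reason all() works here is that it stops consuming the generator at the first falsy result, so later tasks are never started. A minimal sketch, where make_task and the task names are hypothetical helpers for the demonstration:

```python
calls = []


def make_task(name, ok):
    """Build a hypothetical task: records its name, then reports success."""
    def run():
        calls.append(name)
        return ok
    return run


tasks = [
    make_task("first", True),
    make_task("second", False),
    make_task("third", True),
]

# all() stops consuming the generator at the first falsy result.
success = all(run() for run in tasks)
```

One caveat with this style: a run() that forgets to return True returns None, which is falsy and would silently end the sequence.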

How to run subtasks for a periodic task with Celery?

I would like to create a periodic task which queries the database, gets data from a data provider, makes some API requests, formats documents and sends them using another API.
The result of each task should be passed to the next task. From the documentation I understand that I have to use chain, group and chord to organise the chaining.
But the documentation also says not to run a subtask from within a task, because it can cause deadlocks.
So, the question is: how do I run subtasks inside the periodic task?
@app.task(name='envoy_emit_subscription', ignore_result=False)
def emit_subscriptions(frequency):
    # resulting queryset is the source for other tasks
    return Subscription.objects.filter(definition__frequency=1).values_list('pk', flat=True)
@app.task(name='envoy_query_data_provider', ignore_result=False)
def query_data_provider(pk):
    # gets the key from the chain and returns received data
    return "data"
@app.task(name='envoy_format_subscription', ignore_result=False)
def format_subscription(data):
    # formats documents
    return "formatted text"
@app.task(name='envoy_send_subscription', ignore_result=False)
def send_subscription(text):
    return send_text_somehow(text)
Sorry for the noob question, but I'm a noob in Celery, indeed.
Something like this maybe?
import time
while True:
    my_celery_chord()
    time.sleep(...)

Celery transfer command line arguments to Task

I am struggling with transferring additional command line arguments to a Celery task. I can set the desired attribute in a bootstep; however, the same attribute is empty when accessed directly from the task (I guess it gets overridden).
class Arguments(bootsteps.Step):
    def __init__(self, worker, environment, **options):
        ArgumentTask.args = {'environment': environment}
        # this works
        print(ArgumentTask.args)
Here is the custom task
class ArgumentTask(Task):
    abstract = True
    _args = {}
    @property
    def args(self):
        return self._args
    @args.setter
    def args(self, value):
        self._args.update(value)
And actual task
@celery.task(base=ArgumentTask, bind=True, name='jobs.send')
def send(self):
    # this prints empty dictionary
    print(self.args)
Do I need to use some additional persistence layer, eg. persistent objects or am I missing something really obvious?
Similar question
It does not seem to be possible. The reason is that your task could be consumed anywhere, by any consumer of the queue, and each consumer could have different command line parameters; therefore its processing should not depend on the worker's configuration.
If your problem is managing dev/prod environments, this is how we handled it in our project:
Each environment is jailed in its own venv with a configuration, so that the project is aware of its environment (in our case only the db links in the configuration change). Each environment also has its own queues and celery workers, launched with this command:
/path/venv/bin/celery worker -A async.myapp --workdir /path -E -n celery-name@server -Ofair
Hope it helped.
If you really want to dig into that, each task can access a .control which allows launching control operations on celery (like some monitoring). But I didn't find anything helpful there.

Locking with Tornado and multiple instances

I'm fairly new to Python and Tornado, so please forgive if I overcomplicated a long-solved problem, but I didn't find much else out there.
I'm running multiple Tornado instances (multiple instances per server, multiple servers) for an application and have some tasks that only one instance should perform, such as scheduling certain events in the application. Instead of running a dedicated instance that performs this task, I'd like to have an opportunistic approach where the first instance that tries gets to do the job.
Part of my solution is a database-based locking mechanism (MongoDB findAndUpdate). The code below seems to work just fine, but I'd like some advice on whether this is a good solution or whether there are ready-made locking and task distribution solutions out there for Tornado.
This is the decorator that acquires the lock when entering the function and releases it afterwards:
def locking(fn):
    @tornado.gen.engine
    def wrapped(wself, *args, **kwargs):
        @tornado.gen.engine
        def wrapped_callback(*cargs, **ckwargs):
            logging.info("release lock")
            yield tornado.gen.Task(lock.release_lock)
            logging.info("release lock done")
            original_callback(*cargs, **ckwargs)
        logging.info("acquire lock")
        yield tornado.gen.Task(model.SchedulerLock.initialize_lock, area_id=wself.area_id)
        lock = yield tornado.gen.Task(model.SchedulerLock.acquire_lock, area_id=wself.area_id)
        if lock:
            logging.info("acquire lock done")
            original_callback = kwargs['callback']
            kwargs['callback'] = wrapped_callback
            fn(wself, *args, **kwargs)
        else:
            logging.info("acquire lock not possible, postponed")
            ioloop = tornado.ioloop.IOLoop.instance()
            ioloop.add_timeout(datetime.timedelta(seconds=2), functools.partial(wrapped, wself, *args, **kwargs))
    return wrapped
The acquire_lock method returns the lock object or False
Any thoughts on this? I know that the lock is only half of the solution, as I also need a mechanism that ensures that a one-off task only gets done once. However, this can be achieved very similarly. Is there anything that solves the problem more elegantly?
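For readers who want to see the "first instance that tries gets the job" idea in isolation, here is a minimal sketch of a database-backed lock with an expiry time. It deliberately swaps in SQLite from the standard library as a stand-in for the MongoDB collection, and it is synchronous rather than Tornado-based. The table layout, the function names and the ttl/now parameters are all assumptions made for this illustration, not the original SchedulerLock model.

```python
import sqlite3
import time


def init(conn):
    # One row per locked area; the primary key makes a second insert fail.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS locks ("
        "area_id TEXT PRIMARY KEY, owner TEXT NOT NULL, expires_at REAL NOT NULL)"
    )


def acquire_lock(conn, area_id, owner, ttl=30.0, now=None):
    """Return True if `owner` now holds the lock for `area_id`."""
    now = time.time() if now is None else now
    with conn:  # one transaction: commit on success, roll back on error
        # Drop a stale lock left behind by a crashed instance.
        conn.execute(
            "DELETE FROM locks WHERE area_id = ? AND expires_at <= ?",
            (area_id, now),
        )
        try:
            conn.execute(
                "INSERT INTO locks (area_id, owner, expires_at) VALUES (?, ?, ?)",
                (area_id, owner, now + ttl),
            )
        except sqlite3.IntegrityError:
            return False  # another instance already holds a live lock
    return True


def release_lock(conn, area_id, owner):
    # Only the owner may release, so a late instance cannot free someone else's lock.
    with conn:
        conn.execute(
            "DELETE FROM locks WHERE area_id = ? AND owner = ?", (area_id, owner)
        )


conn = sqlite3.connect(":memory:")
init(conn)
first = acquire_lock(conn, "area-1", "instance-a", now=100.0)
second = acquire_lock(conn, "area-1", "instance-b", now=101.0)
after_expiry = acquire_lock(conn, "area-1", "instance-b", now=200.0)
release_lock(conn, "area-1", "instance-b")
after_release = acquire_lock(conn, "area-1", "instance-a", now=201.0)
```

The expiry time is the design choice worth noting: without it, an instance that crashes while holding the lock would block the task forever.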
