Keep initialized variables after executing Celery task - python

I have some code which looks like this:
model = None
@app.task()
def compute_something():
    global model
    if model is None:
        model = ...  # lengthy model initialization
    # Use model to perform the computation
So I want the set-up code (the lengthy model initialization) to be executed only once, when necessary, and further calls to the same task to reuse these initialized variables.
I know this partly breaks the concept of tasks being strictly atomic. But by default it does not seem to work, because (I assume) multiprocessing forks a separate process for each task, losing the initialization.
Is there a way to achieve something like this?
RELATED QUESTION:
Another way to look at this: is there a way for a worker to look into the task queue and group tasks in order to perform them together more efficiently?
Let's say a worker would be much more efficient processing a group of tasks at the same time than doing them one after the other (a GPU job, for instance, or, as here, loading a big parameter file into memory).
I was wondering whether there is a way for the worker to gather several instances of the same task from the queue and process them as a batch instead of one by one.

You may be able to initialise the model at module level:
model = initialise_model()
@app.task()
def compute_something():
    # Use model to perform the computation
    ...
However, this assumes that the model initialisation does not depend on anything passed to the task and that the model is not modified in any way by the task.
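If the initialisation must stay lazy, or should happen exactly once per worker process of the prefork pool, another option is Celery's worker_process_init signal. A minimal sketch, assuming an app instance, the initialise_model() helper from above, and a made-up model.compute() call:
from celery import Celery
from celery.signals import worker_process_init

app = Celery('tasks', broker='redis://localhost:6379/0')  # broker URL is an assumption

model = None

@worker_process_init.connect
def init_worker(**kwargs):
    # Runs once in each forked worker process, before it starts consuming tasks.
    global model
    model = initialise_model()

@app.task()
def compute_something(x):
    # The model is already loaded in this worker process; just use it.
    return model.compute(x)  # placeholder for the actual computation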
The second question doesn't make much sense. Can you give some context? What do you mean by "group tasks" exactly? Do you want certain tasks to run on a particular host to optimise access to a shared resource? If so, read the Celery documentation about routing.
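If pinning certain tasks to a particular host is indeed the goal, a minimal routing sketch (the queue name "gpu" and the task path "tasks.compute_something" are assumptions; a worker started with -Q gpu would then pick these tasks up):
app.conf.task_routes = {
    # send every compute_something task to a dedicated queue
    'tasks.compute_something': {'queue': 'gpu'},
}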

Related

How can I change the key names in celery

After finishing a task, Celery writes the result to Redis. However, the key of the entry is determined by Celery itself (I think it is the task id). I want to give the entries specific names, because that is how other services know where to find them.
In my code there is one producer that creates the tasks, and many workers.
Right now I call tasks_name.get() after all tasks have completed and create new Redis entries following my naming convention.
But that seems unnecessary and slow; Celery should just use my convention.
I am also thinking about renaming the keys afterwards, if that is feasible, but then how can I get the ids from the producer?
Maybe a custom RedisBackend class would work, but that seems too deep.
You can give the task a custom ID using the task_id parameter when submitting the task. It is supported by the apply and apply_async methods of a Celery task and by the send_task method of the Celery class.
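A minimal sketch, assuming a task called process_item and a naming convention based on an item id (both names are made up for illustration):
# The Redis result backend typically stores the result under a key derived
# from this id, e.g. "celery-task-meta-item-42".
result = process_item.apply_async(args=(42,), task_id='item-42')

# or, from a producer that only knows the task by name:
result = app.send_task('tasks.process_item', args=(42,), task_id='item-42')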

Python parallel programming model

I'm writing a machine learning program with the following components:
A shared "Experience Pool" with a binary-tree-like data structure.
N simulator processes. Each adds an "experience object" to the pool every once in a while. The pool is responsible for balancing its tree.
M learner processes that sample a batch of "experience objects" from the pool every few moments and perform whatever learning procedure.
I don't know what the best way to implement the above is. I'm not using TensorFlow, so I cannot take advantage of its parallel capabilities. More concretely:
My first thought was Python 3's built-in multiprocessing library. Unlike multithreading, however, the multiprocessing module cannot have different processes update the same global object. My hunch is that I should use the server-proxy model. Could anyone please give me a rough skeleton of code to start with?
Is MPI4py a better solution?
Any other libraries that would be a better fit? I've looked at celery, disque, etc. It's not obvious to me how to adapt them to my use case.
Based on the comments, what you're really looking for is a way to update a shared object from a set of processes that are carrying out a CPU-bound task. Being CPU-bound makes multiprocessing the obvious choice; if most of your work were IO-bound, multithreading would have been the simpler choice.
Your problem follows a simpler server-client model: the clients use the server as a simple stateful store, no communication between any child processes is needed, and no process needs to be synchronised.
Thus, the simplest way to do this is to:
Start a separate process that contains a server.
Inside the server logic, provide methods to update and read from a single object.
Treat both your simulator and learner processes as separate clients that can periodically read and update the global state.
From the server's perspective, the identity of the clients doesn't matter - only their actions do.
Thus, this can be accomplished by using a customised manager in multiprocessing as so:
# server.py
from multiprocessing.managers import BaseManager
# this represents the data structure you've already implemented.
from ... import ExperienceTree

# An important note: the way proxy objects work is by shared weak reference to
# the object. If all of your workers die, it takes your proxy object with
# it. Thus, if you have an instance, the instance is garbage-collected
# once all references to it have been erased. I have chosen to sidestep
# this in my code by using class variables and objects so that instances
# are never used - you may define __init__, etc. if you so wish, but
# just be aware of what will happen to your object once all workers are gone.
class ExperiencePool(object):

    tree = ExperienceTree()

    @classmethod
    def update(cls, experience_object):
        '''Update the tree with an experience object.'''
        cls.tree.update(experience_object)

    @classmethod
    def sample(cls):
        '''Sample the tree's experience objects.'''
        return cls.tree.sample()

# subclass BaseManager
class Server(BaseManager):
    pass

# register the class you just created - now you can access an instance of
# ExperiencePool using Server.Shared_Experience_Pool().
Server.register('Shared_Experience_Pool', ExperiencePool)

if __name__ == '__main__':
    # run the server on port 8080 of your own machine
    # (get_server() must be called on a manager that has not been started,
    #  so the context-manager form is not used here)
    manager = Server(('localhost', 8080), authkey=b'none')
    manager.get_server().serve_forever()
Now for all of your clients you can just do:
# client.py - you can always have a separate client file for a learner and a simulator.
from multiprocessing.managers import BaseManager
from server import ExperiencePool

class Server(BaseManager):
    pass

Server.register('Shared_Experience_Pool', ExperiencePool)

if __name__ == '__main__':
    # connect to the server running on port 8080 of your own machine.
    server_process = Server(('localhost', 8080), authkey=b'none')
    server_process.connect()
    experience_pool = server_process.Shared_Experience_Pool()
    # now do your own thing and call `experience_pool.sample()` or `update` whenever you want.
You may then launch one server.py and as many workers as you want.
Is This The Best Design?
Not always. You may run into race conditions in that your learners may receive stale or old data if they are forced to compete with a simulator node writing at the same time.
If you want to ensure a preference for the latest writes, you may additionally use a lock whenever your simulators are trying to write something, preventing your other processes from getting a read until the write finishes; a sketch of this follows.
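A minimal sketch of that locking idea, as a variant of the server-side ExperiencePool above. The manager server handles each client connection in its own thread inside the server process, so a plain threading.Lock should be enough to serialise access:
import threading
from ... import ExperienceTree  # as in server.py above

class ExperiencePool(object):

    tree = ExperienceTree()
    # one lock shared by every call that reaches the server process
    lock = threading.Lock()

    @classmethod
    def update(cls, experience_object):
        # writers hold the lock until the tree update has finished
        with cls.lock:
            cls.tree.update(experience_object)

    @classmethod
    def sample(cls):
        # readers wait for any in-progress write to complete
        with cls.lock:
            return cls.tree.sample()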

Transfer a Pipeline, or MapReduce, task to a different target after it started [python]

During the execution of a MapReduce Reducer [python], or during the execution of a Pipeline task, I sometimes want to move it to a different module (i.e. target).
If I knew, before creating the task, that I'd like to run it on a given target, I would write:
from mapreduce import mapreduce_pipeline
......
......
mr_pipeline = mapreduce_pipeline.MapreducePipeline(.......)
mr_pipeline.target = "F2"
yield mr_pipeline
But I only find out about this when the Reduce part starts: only during the Reduce phase do I realize how much memory the job requires, and only if it requires too much memory would I want to transfer it to a more powerful GAE module.
As far as I know, neither deferred.defer(....) nor Pipeline would work here. An ugly solution would be to rerun the mapper using a new Pipeline task, wait for it to complete, and fetch the results from it. This solution:
1. is ugly and not very maintainable.
2. is fragile.
3. wastes resources due to busy-waiting.
Is there any better solution?
Should the Pipeline or the MapReduce code be modified to support this?
You can always re-deploy your module with a higher machine class.
Or, because mapreduce/pipeline uses the task queue, you can update the queue's target parameter in your queue.yaml:
- name: my-queue
  rate: 10/s
  target: your-more-powerful-module

How to avoid pickling Celery task?

My scenario is as follows: I have a large machine learning model which is computed by a bunch of workers. In essence, each worker computes its own part of the model and then exchanges results with the others in order to maintain a globally consistent state of the model.
So, every Celery task computes its own part of the job.
But this means that the tasks aren't stateless, and here is my trouble: if I say some_task.delay( 123, 456 ), in reality I'm NOT sending two integers here!
I'm sending the whole state of the task, which gets pickled somewhere inside Celery. This state is typically about 200 MB :-((
I know that it's possible to select a decent serializer in Celery, but my question is how NOT to pickle ANY data that happens to be attached to the task.
How can I pickle only the arguments of the task?
Here is a citation from celery/app/task.py:
def __reduce__(self):
    # - tasks are pickled into the name of the task only, and the reciever
    # - simply grabs it from the local registry.
    # - in later versions the module of the task is also included,
    # - and the receiving side tries to import that module so that
    # - it will work even if the task has not been registered.
    mod = type(self).__module__
    mod = mod if mod and mod in sys.modules else None
    return (_unpickle_task_v2, (self.name, mod), None)
I simply don't want this to happen.
Is there a simple way around it, or am I just forced to build my own Celery (which is ugly to imagine)?
Don't use the celery results backend for this. Use a separate data store.
While you could just use Task.ignore_result, this would mean that you lose the ability to track the task's status etc.
The best solution would be to use one storage engine (e.g. Redis) for your results backend.
You should set up a separate storage engine (a separate instance of Redis, or maybe something like MongoDB, depending on your needs) to store the actual data.
In this way you can still see the status of your tasks but the large data sets do not affect the operation of celery.
Switching to the JSON serializer may reduce the serialization overhead, depending on the format of the data you generate. However, it can't solve the underlying problem of putting too much data through the results backend.
The results backend can handle relatively small amounts of data - once you go over a certain limit you start to prevent the proper operation of its primary task: communicating task status.
I would suggest updating your tasks so that they return a lightweight data structure containing useful metadata (e.g. to facilitate co-ordination between tasks), and storing the "real" data in a dedicated storage solution.
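A minimal sketch of that split, assuming a second Redis instance on port 6380 for the large payloads; the task name, key scheme and do_heavy_computation() helper are made up for illustration:
import pickle
import redis
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/1')  # holds only small task metadata

data_store = redis.Redis(host='localhost', port=6380)  # dedicated store for the big blobs

@app.task
def compute_part(part_id):
    result = do_heavy_computation(part_id)     # placeholder for the real work
    key = 'model-part-%s' % part_id
    data_store.set(key, pickle.dumps(result))  # the large blob goes to the dedicated store
    return {'part_id': part_id, 'key': key}    # only lightweight metadata goes through Celery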
You have to set ignore_result on your task, as it says in the docs:
Task.ignore_result
Don’t store task state. Note that this means you can’t use AsyncResult to check if the task is ready, or get its return value.
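For example, a minimal sketch assuming an existing app instance:
@app.task(ignore_result=True)
def exchange_state(part):
    # do the work here; the return value is never stored in the results backend
    pass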
This is a little off-topic, but still.
As I understand it, what is happening here is that you have several processes doing heavy calculations in parallel, with inter-process communication. So, instead of Celery, which is unsatisfying in your case, you could:
use zmq for inter-process communication (to send only the necessary data),
use supervisor for managing and running the processes (numprocs in particular will help with running multiple identical workers).
While this does not require writing your own Celery, some code will have to be written; a small sketch follows.
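A minimal pyzmq sketch of such a point-to-point exchange (the PUSH/PULL socket pair, the port and the message contents are all assumptions):
# sender.py - the worker that produced an update pushes only the data its peer needs
import zmq

context = zmq.Context()
sender = context.socket(zmq.PUSH)
sender.bind('tcp://*:5557')
sender.send_json({'part_id': 3, 'weights': [0.1, 0.2]})

# receiver.py - run in a separate process, pulls and applies the update
import zmq

context = zmq.Context()
receiver = context.socket(zmq.PULL)
receiver.connect('tcp://localhost:5557')
update = receiver.recv_json()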

Appropriate method to define a pool of workers in Python

I've started a new Python 3 project in which my goal is to download tweets and analyze them. As I'll be downloading tweets on different subjects, I want to have a pool of workers that download Twitter statuses matching the given keywords and store them in a database. I call these workers fetchers.
The other kind of worker is the analyzers, whose job is to analyze the tweets' contents and extract information from them, also storing the results in a database. As I'll be analyzing a lot of tweets, it would be a good idea to have a pool of this kind of worker too.
I've been thinking of using RabbitMQ and Celery for this, but I have some questions:
General question: Is this really a good approach to solve this problem?
I need at least one fetcher worker per download task, and it could be running for a whole year (actually it's a 15-minute cycle that repeats and lasts for a year). Is it appropriate to define an "infinite" task?
I've been trying Celery and I used delay to launch some example tasks. The thing is that I don't want to call the ready() method constantly to check whether a task is completed. Is it possible to define a callback? I'm not talking about a Celery task callback, just a function defined by myself. I've been searching for this and I can't find anything.
I want to have a single RabbitMQ + Celery server with workers in different networks. Is it possible to define remote workers?
Yeah, it looks like a good approach to me.
There is no such thing as an infinite task. You can, however, reschedule a task to run once in a while. Celery has periodic tasks, so you can schedule a task to run at particular times (see the sketch below). You don't necessarily need Celery for this; you could also use a cron job.
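A minimal periodic-task sketch using the beat_schedule setting; the 15-minute interval comes from the question, while the app name, broker URL, task name and keyword arguments are assumptions. A celery beat process running alongside the workers then drives the schedule:
from celery import Celery

app = Celery('twitter', broker='amqp://localhost')  # broker URL is an assumption

app.conf.beat_schedule = {
    'fetch-every-15-minutes': {
        'task': 'twitter.fetch_tweets',        # assumed task name
        'schedule': 15 * 60.0,                 # every 15 minutes, in seconds
        'args': (['keyword1', 'keyword2'],),   # assumed keywords
    },
}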
You can call a function once a task is successfully completed.
from celery.signals import task_success

@task_success.connect
def call_me_when_my_task_is_done(sender=None, result=None, **kwargs):
    # `sender` is the task that just finished; filter on its name if needed
    if sender.name == 'task_i_am_waiting_to_complete':
        pass  # your own callback logic goes here
Yes, you can have remote workers on different networks.
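Nothing special is needed for this beyond making the broker reachable over the network: every worker, wherever it runs, points at the same broker URL. A minimal sketch (host name and credentials are assumptions):
# celery_app.py, deployed on each remote machine
from celery import Celery

# all workers connect to the single RabbitMQ instance over the network
app = Celery('twitter', broker='amqp://user:password@rabbitmq.example.com:5672//')
The worker is then started on that machine as usual, pointing at this app module.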
