I have implemented WebSockets in a Django app using Django Channels. The front end now sends some data through the WebSocket, and I want the currently running Celery task to be able to read it. I tried creating a shared-memory static object, but it is not working.
SimulationInputs.add(simulation_id=simulation.id, init_data=init_inputs)
return InteractiveSimulationTask.delay_or_fail(
    simulation_id=simulation.id
)
class SimulationData:
    data = ''

class SimulationInputs:
    data = None

    @classmethod
    def init_manager(cls, manager):
        manager = Manager()
        cls.data = manager.dict()

    @classmethod
    def add(cls, simulation_id, init_data):
        cls.data[simulation_id] = init_data

    @classmethod
    def write(cls, simulation_id, simulation_data):
        if cls.data.get(simulation_id):
            cls.data[simulation_id] = simulation_data

    @classmethod
    def read(cls, simulation_id, simulation_data):
        simulation_data.data = cls.data.get(simulation_id)
# manage.py
if __name__ == "__main__":
    SimulationInputs.init_manager()
class InteractiveSimulationTask(JobtasticTask):
    def calculate_result(self, simulation_id, **kwargs):
        while True:
            SimulationInputs.read(simulation_id=self.simulation.id, simulation_data=simulation_data)
The task always throws an exception at cls.data.get(simulation_id): NoneType object has no attribute 'get'.
I need to share data between the Celery task and the main process.
Any hint?
Since you're using Celery, you probably have Redis or some other memory store available. Consider using that as your indirection layer, i.e. have the read and write methods use the simulation_id as a key to the simulation data.
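For example, a minimal sketch of that indirection with the redis-py client (the connection settings, key prefix, and JSON encoding are illustrative assumptions, not part of your code):

import json
import redis

r = redis.Redis(host='localhost', port=6379, db=0)

def write_inputs(simulation_id, data):
    # Store the latest inputs under a per-simulation key
    r.set('simulation:%s' % simulation_id, json.dumps(data))

def read_inputs(simulation_id):
    # Return the latest inputs, or None if nothing has been written yet
    raw = r.get('simulation:%s' % simulation_id)
    return json.loads(raw) if raw is not None else None

The Channels consumer would call write_inputs() when a WebSocket message arrives, and the Celery task would poll read_inputs() inside its loop.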
I believe the issue you're facing is due to the lifecycle of the python class. In init_manager when you assign to cls.data you're overwriting the class's property, not the instance's property. This doesn't do what you want it to, as evidenced by the error message: cls.data is going to be None.
What I think you're going for is the "Singleton Pattern". You want to have one and only one SimulationInputs object which can read/write the data for each ID. This discussion can help you with implementing a singleton in Python.
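As a rough illustration only (not taken from that discussion), one common way to express a singleton in Python is to reuse a single instance from __new__; the class name here is just a placeholder:

class SingletonInputs:
    _instance = None

    def __new__(cls):
        # Create the instance once and hand the same object back on every call
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.data = {}
        return cls._instance

Note that a singleton only guarantees one object per process; on its own it does not share state between the Django process and the Celery worker.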
I came to the conclusion that Django and Celery should not share memory, because they run in different processes and are different programs, so they should communicate through a socket or a messaging system. I solved my problem by using Redis Pub/Sub: https://redis.io/topics/pubsub.
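For reference, a minimal sketch of that Pub/Sub flow with the redis-py client (the channel naming scheme and JSON payloads are assumptions added for illustration):

import json
import redis

r = redis.Redis()

# Publisher side (e.g. the Channels consumer): push incoming WebSocket data.
def publish_inputs(simulation_id, data):
    r.publish('simulation-%s' % simulation_id, json.dumps(data))

# Subscriber side (e.g. inside the Celery task loop): consume new messages.
def iter_inputs(simulation_id):
    pubsub = r.pubsub()
    pubsub.subscribe('simulation-%s' % simulation_id)
    for message in pubsub.listen():
        if message['type'] == 'message':
            yield json.loads(message['data'])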
Related
I have a large Python 3.6 system where multiple processes and threads interact with each other and the user. Simplified, there is a Scheduler instance (subclasses threading.Thread) and a Worker instance (subclasses multiprocessing.Process). Both objects run for the entire duration of the program.
The user interacts with the Scheduler by adding Task instances and the Scheduler passes the task to the Worker at the correct moment in time. The worker uses the information contained in the task to do its thing.
Below is some stripped-down and simplified code from the project:
class Task:
    def __init__(self, name: str):
        self.name = name
        self.state = 'idle'

class Scheduler(threading.Thread):
    def __init__(self, worker: Worker):
        super().__init__()
        self.worker = worker
        self.start()

    def run(self):
        while True:
            # Do stuff until the user schedules a new task
            task = Task()  # <-- In reality the Task instance is not created here but the thread gets it from elsewhere
            task.state = 'scheduled'
            self.worker.change_task(task)
            # Do stuff until the task.state == 'finished'

class Worker(multiprocessing.Process):
    def __init__(self):
        super().__init__()
        self.current_task = None
        self.start()

    def change_task(self, new_task: Task):
        self.current_task = new_task
        self.current_task.state = 'accepted-idle'

    def run(self):
        while True:
            # Do stuff until the current task is updated
            self.current_task.state = 'accepted-running'
            # Task is running
            self.current_task.state = 'finished'
The system used to be structured so that the task contained multiple multiprocessing.Events indicating each of its possible states. Then, not the whole Task instance was passed to the worker, but each of the task's attributes was. As they were all multiprocessing-safe, it worked, with a caveat: the events changed in worker.run had to be created in worker.run and passed back to the task object for it to work. Not only is this a less than ideal solution, it no longer works with some changes I am making to the project.
Back to the current state of the project, as described by the Python code above. As is, this will never work because nothing makes it multiprocessing-safe at the moment. So I implemented a Proxy/BaseManager structure so that when a new Task is needed, the system gets it from the multiprocessing manager. I use this structure in a slightly different way elsewhere in the project as well. The issue is that worker.run never knows that self.current_task has been updated; it remains None. I expected this to be fixed by using the proxy, but clearly I am mistaken.
def Proxy(target: typing.Type) -> typing.Type:
    """
    Normally a Manager only exposes object methods. A NamespaceProxy can be used when registering the object with
    the manager to expose all the attributes. This also works for attributes created at runtime.
    https://stackoverflow.com/a/68123850/8353475

    1. Instead of exposing all the attributes manually, we effectively override __getattr__ to do it dynamically.
    2. Instead of defining a class that subclasses NamespaceProxy for each specific object class that needs to be
       proxied, this method is used to do it dynamically. The target parameter should be the class of the object you
       want to generate the proxy for. The generated proxy class will be returned.

    Example usage: FooProxy = Proxy(Foo)

    :param target: The class of the object to build the proxy class for
    :return: The generated proxy class
    """
    # __getattr__ is called when an attribute 'bar' is accessed on 'foo' and is not found, e.g. 'foo.bar'. 'bar' can
    # be a class method as well as a variable. The call gets rerouted from the base object to this proxy, where it is
    # processed.
    def __getattr__(self, key):
        result = self._callmethod('__getattribute__', (key,))
        # If the attr call was for a method we need some further processing
        if isinstance(result, types.MethodType):
            # A wrapper around the method that passes the arguments, actually calls the method and returns the result.
            # Note that at this point wrapper() does not get called, just defined.
            def wrapper(*args, **kwargs):
                # Call the method and pass the return value along
                return self._callmethod(key, args, kwargs)
            # Return the wrapper method (not the result, but the method itself)
            return wrapper
        else:
            # If the attr call was for a variable it can be returned as is
            return result

    dic = {'types': types, '__getattr__': __getattr__}
    proxy_name = target.__name__ + "Proxy"
    ProxyType = type(proxy_name, (NamespaceProxy,), dic)
    # This is a tuple of all the attributes that are/will be exposed. We copy all of them from the base class
    ProxyType._exposed_ = tuple(dir(target))
    return ProxyType

class TaskManager(BaseManager):
    pass

TaskProxy = Proxy(Task)
TaskManager.register('get_task', callable=Task, proxytype=TaskProxy)
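For context, a rough sketch of how this manager might be driven (the manager.start() call, the task name, and the attribute round-trip are assumptions added for illustration, not code from the project):

if __name__ == '__main__':
    manager = TaskManager()
    manager.start()

    # The Task is created in the manager process; this process only holds a TaskProxy.
    task = manager.get_task('some-task')
    task.state = 'scheduled'   # attribute writes go through the proxy
    print(task.state)          # attribute reads do too

The idea is that any process holding the proxy sees the same underlying Task object living in the manager process, which is what should let worker.run observe state changes.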
I'm trying to understand why Python cannot create the following class.
class SharedResource(multiprocessing.Lock):
    def __init__(self, blocking=True, timeout=-1):
        # super().__init__(blocking=True, timeout=-1)
        self.blocking = blocking
        self.timeout = timeout
        self.data = {}
TypeError: method expected 2 arguments, got 3
The reason why I'm subclassing Lock:
My objective is to create a shared list of resources that should be usable by only one process at a time. This concept will eventually be used in a Flask application where requests should not be able to use the resource concurrently.
RuntimeError: Lock objects should only be shared between processes through inheritance
class SharedResource():
    def __init__(self, id, model):
        '''
        id: model id
        model: Keras Model; only one worker at a time can call predict
        '''
        self.mutex = Lock()
        self.id = id
        self.model = model

manager = Manager()
shared_list = manager.list()  # a list of models
shared_list.append(SharedResource())

def worker1(l):
    # ... read some data
    while True:
        resource = l[0]
        with resource.mutex:
            resource.model.predict(...)  # ... some data
        time.sleep(60)

if __name__ == "__main__":
    processes = [Process(target=worker1, args=[shared_list])]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
The reason you are getting this error is that multiprocessing.Lock is actually a function.
In .../multiprocessing/context.py there are these lines:
def Lock(self):
    '''Returns a non-recursive lock object'''
    from .synchronize import Lock
    return Lock(ctx=self.get_context())
This may change in the future so you can verify this on your version of python by doing:
import multiprocessing
print(type(multiprocessing.Lock))
To actually subclass Lock you will need to do something like this:
from multiprocessing import synchronize
from multiprocessing.synchronize import Lock

# Since Lock is now a class, this should work:
class SharedResource(Lock):
    pass
I'm not endorsing this approach as a "good" solution, but it should solve your problem if you really need to subclass Lock. Subclassing things that try to avoid being subclassed is usually not a great idea, but sometimes it can be necessary. If you can solve the problem in a different way you may want to consider that.
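As a concrete sketch of that approach (my own addition, not from the answer above): in recent Python 3 versions, synchronize.Lock takes a keyword-only ctx argument, so a subclass would need to supply a context explicitly.

import multiprocessing
from multiprocessing.synchronize import Lock

class SharedResource(Lock):
    def __init__(self):
        # synchronize.Lock requires a multiprocessing context via the keyword-only ctx argument
        super().__init__(ctx=multiprocessing.get_context())
        self.data = {}

if __name__ == '__main__':
    res = SharedResource()
    with res:  # behaves like a normal Lock
        res.data['key'] = 'value'

Be aware that SemLock controls its own pickling, so extra attributes such as data are unlikely to survive being sent to a child process; that is another reason to prefer composition (a plain class holding a Lock) over inheritance here.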
Hello, I have a situation where I am calling an API to get a list of movies. For each record in the list, I call another API. I would like to make that for loop parallel for better performance. The following is my sample code:
movies = []
for movie in collection:
    tmdb_movie = tmdb.get_movie(movie['detail']['ids']['tmdb_id'])
    movies.append(tmdb_movie)
return movies
So my solution using multiprocessing is as follows:
pool = Pool()
output = pool.map(tmdb.get_movie, [movie['detail']['ids']['tmdb_id'] for movie in collection])
But when I execute this code, I get the following error:
PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup __builtin__.instancemethod failed
I would really appreciate it if someone could help me make this functionality parallel in Python 2.7.
The best option for this would be to use threads. Threads in Python cannot use CPUs in parallel, but they can execute concurrently while there are blocking operations. Processes, although they can truly run in parallel, are slow to start and communicate with, and are better suited to big CPU-bound workloads. Also, as you indicate in your question, processes can sometimes be difficult to launch.
You can use the somewhat-secret (i.e. undocumented but actually well known) multiprocessing.pool.ThreadPool class. If you are going to be doing this many times, you can create a pool once at the beginning and reuse it. You just need to make sure pool.close() and maybe also pool.join() are called when the program exits.
from multiprocessing.pool import ThreadPool
# Global/class variables
NUM_THREADS = 5
pool = ThreadPool(NUM_THREADS)
# Inside some function/method
return pool.map(lambda movie: tmdb.get_movie(movie['detail']['ids']['tmdb_id']), movies)
# On exit
pool.close()  # Prevents more jobs from being submitted
pool.join() # Waits until all jobs are finished
Your question is very broad and leaves out many details, so here's an outline of what would need to be done. To avoid the PicklingError, the database is opened in each process, which can be done by using an initializer function (named start_process() in the example code below).
Note: Due to the overhead involved in initializing the database to do one query, @jdehesa's multi-threading approach would likely be the better tactic in this situation (threading generally makes sharing a global variable less costly). Alternatively, you could make the get_movie() interface function process more than one id each time it's called (i.e. in "batches"), as sketched after the first example below.
class Database:
    """ Mock implementation. """
    def __init__(self, *args, **kwargs):
        pass  # Open/connect to database.
    def get_movie(self, id):
        return 'id_%s_foobar' % id

import multiprocessing as mp

def start_process(*args):
    global tmdb
    tmdb = Database(*args)

def get_movie(id):
    tmdb_movie = tmdb.get_movie(id)
    return tmdb_movie

if __name__ == '__main__':
    collection = [{'detail': {'ids': {'tmdb_id': 1}}},
                  {'detail': {'ids': {'tmdb_id': 2}}},
                  {'detail': {'ids': {'tmdb_id': 3}}}]

    pool_size = mp.cpu_count()
    with mp.Pool(processes=pool_size, initializer=start_process,
                 initargs=('init info',)) as pool:
        movies = pool.map(get_movie, (movie['detail']['ids']['tmdb_id']
                                      for movie in collection))
    print(movies)  # -> ['id_1_foobar', 'id_2_foobar', 'id_3_foobar']
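Regarding the batching idea from the note above, here is a rough sketch (the chunks() helper, the batch size, and the all_ids list are illustrative additions, not part of the original code):

def get_movies_batch(ids):
    # Each pool task handles several ids, using the per-process 'tmdb' global.
    return [tmdb.get_movie(id) for id in ids]

def chunks(seq, size):
    # Split a list into consecutive slices of at most 'size' elements.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# Inside the __main__ block, with all_ids being a plain list of tmdb ids:
# batches = pool.map(get_movies_batch, chunks(all_ids, 10))
# movies = [movie for batch in batches for movie in batch]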
A multiprocessing alternative, which would allow the database to be shared to some degree by multiple processes without connecting to it each time, would be to define a custom multiprocessing manager that opens the database once and provides an interface to it to get the info for one (or more) movies given their id(s). This is also discussed in the Sharing state between processes section (in the Server Process subsection) of the online documentation. The built-in Manager supports a number of container types, such as lists, dicts, and Queues.
Below is example code showing how to create your own custom manager for the database. If you uncomment the calls to print(), you'll see that only one Database instance is created:
class Database:
    """ Mock implementation. """
    def __init__(self, *args, **kwargs):
        # print('Database.__init__')
        pass  # Open/connect to database.
    def get_movie(self, id):
        return 'id_%s_foobar' % id

from functools import partial
import multiprocessing as mp
from multiprocessing.managers import BaseManager

class DB_Proxy(object):
    """ Shared Database instance proxy class. """
    def __init__(self, *args, **kwargs):
        self.database = Database(*args, **kwargs)
    def get_movie(self, id):
        # print('DB_Proxy.get_movie')
        tmdb_movie = self.database.get_movie(id)
        return tmdb_movie

class MyManager(BaseManager): pass  # Custom Manager
MyManager.register('DB_Proxy', DB_Proxy)

if __name__ == '__main__':
    collection = [{'detail': {'ids': {'tmdb_id': 1}}},
                  {'detail': {'ids': {'tmdb_id': 2}}},
                  {'detail': {'ids': {'tmdb_id': 3}}}]

    manager = MyManager()
    manager.start()
    db_proxy = manager.DB_Proxy('init info')

    pool_size = mp.cpu_count()
    with mp.Pool(pool_size) as pool:
        movies = pool.map(db_proxy.get_movie,
                          (movie['detail']['ids']['tmdb_id']
                           for movie in collection))
    print(movies)  # -> ['id_1_foobar', 'id_2_foobar', 'id_3_foobar']
I have a service that exposes an API which then feeds tasks; it is implemented with Falcon (API) and Celery (task management).
Specifically, my workers take a long time to load and their code looks something like this:
class HeavyOp(celery.Task):
    def __init__(self):
        self._asset = get_heavy_asset()  # <-- takes a long time

    @property
    def asset(self):
        return self._asset

@app.task(base=HeavyOp)
def my_task(data):
    return my_task.asset.do_something(data)
What actually goes on is that in the __init__ function some object is being read from disk and held in memory for as long as the worker lives.
Sometimes, I want to update that object.
Is there a way to reload the worker, without downtime? As this is all behind an API, I don't wish to have those few minutes of loading the heavy object as downtime.
We can assume the host has more than 1 core, but the solution must be a single host solution.
I don't think you need a custom base task class. What you want to achieve is a single-instance asset class which gets loaded after the worker has initialised and which you can reload from a task.
This approach works:
# worker.py
import os
import sys
import time

from celery import Celery
from celery.signals import worker_ready

app = Celery(include=('tasks',))

class Asset:
    def __init__(self):
        self.time = time.time()

class AssetLoader:
    __shared_state = {}

    def __init__(self):
        self.__dict__ = self.__shared_state
        if '_value' not in self.__dict__:
            self.get_heavy_asset()

    def get_heavy_asset(self):
        self._value = Asset()

    @property
    def value(self):
        return self._value

@worker_ready.connect
def after_worker_ready(sender, **kwargs):
    AssetLoader()
Here, I made AssetLoader a Borg class, but you can choose any other pattern/strategy to share a single instance of Asset. For illustrative purposes, I just capture the timestamp when executing get_heavy_asset.
# tasks.py
from worker import app, AssetLoader

@app.task(bind=True)
def load(self):
    AssetLoader().get_heavy_asset()
    return AssetLoader().value.time

@app.task(bind=True)
def my_task(self):
    return AssetLoader().value.time
Bear in mind that Asset is shared per worker process but not across workers. If you run with concurrency=1, it doesn't make a difference, but for anything else it does. But from what I gather in your use case, it should be fine either way.
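For example, assuming the module above is called worker.py, running the worker with a single process (command shown as an illustration) keeps exactly one Asset in memory:

celery -A worker worker --loglevel=info --concurrency=1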
traits_pickle_problem.py
from traits.api import HasTraits, List
import cPickle

class Client(HasTraits):
    data = List

class Person(object):
    def __init__(self):
        self.client = Client()
        # dynamic handler
        self.client.on_trait_event(self.report, 'data_items')

    def report(self, obj, name, old, new):
        print 'client added-- ', new.added

if __name__ == '__main__':
    p = Person()
    p.client.data = [1, 2, 3]
    p.client.data.append(10)
    cPickle.dump(p, open('testTraits.pkl', 'wb'))
The above code reports a dynamic trait. Everything works as expected in this code. However, using a new python process and doing the following:
>>> from traits_pickle_problem import Person, Client
>>> p=cPickle.load(open('testTraits.pkl','rb'))
>>> p.client.data.append(1000)
causes no report of the list append. However, re-establishing the listener separately as follows:
>>> p.client.on_trait_event(p.report,'data_items')
>>> p.client.data.append(1000)
client added-- [1000]
makes it work again.
Am I missing something, or does the handler need to be re-established in __setstate__ during the unpickling process?
Any help appreciated. This is for Python 2.7 (32-bit) on Windows with traits version 4.30.
Running pickletools.dis(cPickle.dumps(p)), you can see the handler object being referenced:
...
213: c GLOBAL 'traits.trait_handlers TraitListObject'
...
But there's no further information on how it should be wired to the report method. So either the trait_handler doesn't pickle itself out properly, or it's an ephemeral thing like a file handle that can't be pickled in the first place.
In either case, your best option is to overload __setstate__ and re-wire the event handler when the object is re-created. It's not ideal, but at least everything is contained within the object.
class Person(object):
    def __init__(self):
        self.client = Client()
        # dynamic handler
        self.client.on_trait_event(self.report, 'data_items')

    def __setstate__(self, d):
        self.client = d['client']
        self.client.on_trait_event(self.report, 'data_items')

    def report(self, obj, name, old, new):
        print 'client added-- ', new.added
Unpickling the file now correctly registers the event handler:
p=cPickle.load(open('testTraits.pkl','rb'))
p.client.data.append(1000)
>>> client added-- [1000]
You might find this talk Alex Gaynor did at PyCon interesting. It goes into the high points of how pickling work under the hood.
EDIT - initial response used on_trait_change - a typo that appears to work. Changed it back to on_trait_event for clarity.
I had the same problem but worked around it like this: imagine I want to pickle only parts of a quite big class, and some of the objects have been set to transient=True so they're not pickled because there is nothing important to save, e.g.
class LineSpectrum(HasTraits):
    andor_cam = Instance(ANDORiKonM, transient=True)
In contrast to objects which should be saved, e.g.
spectrometer = Instance(SomeNiceSpectrometer)
In my LineSpectrum class, I have:
def __init__(self, f):
    super(LineSpectrum, self).__init__()
    self.load_spectrum(f)

def __setstate__(self, state):  # WORKING!
    print("LineSpectrum: __setstate__ with super(...) call")
    self.__dict__.update(state)
    super(LineSpectrum, self).__init__()  # this has to be done, otherwise pickled sliders won't work; also update __dict__ first!
    self.from_pickle = True  # is not needed by traits, I need it for myself
    self.andor_cam = ANDORiKonM(self.filename)
    self.load_spectrum(self.filename)
In my case, this works perfectly - all sliders are working, all values set at the time the object has been pickled are set back.
Hope this works for you or anybody who's having the same problem. Got Anaconda Python 2.7.11, all packages updated.
PS: I know the thread is old, but didn't want to open a new one just for this.