I have a service that exposes an API which is then feeding tasks, it is implemented with Falcon (API) and Celery (task management).
Specifically, my workers take long time to load and their code looks something like this
class HeavyOp(celery.Task):
def __init__(self):
self._asset = get_heavy_asset() # <-- takes long time
#property
def asset(self):
return self._asset
#app.task(base=HeavyOp)
def my_task(data):
return my_task.asset.do_something(data)
What actually goes on is that in the __init__ function some object is being read from disk and held in memory for as long as the worker lives.
Sometimes, I want to update that object.
Is there a way to reload the worker, without downtime? As this is all behind an API, I don't wish to have those few minutes of loading the heavy object as downtime.
We can assume the host has more than 1 core, but the solution must be a single host solution.
I don't think you need a custom base task class. What you want to achieve is a single instance asset class which gets loaded after the worker has initialised and you can reload from a task.
This approach works:
# worker.py
import os
import sys
import time
from celery import Celery
from celery.signals import worker_ready
app = Celery(include=('tasks',))
class Asset:
def __init__(self):
self.time = time.time()
class AssetLoader:
__shared_state = {}
def __init__(self):
self.__dict__ = self.__shared_state
if '_value' not in self.__dict__:
self.get_heavy_asset()
def get_heavy_asset(self):
self._value = Asset()
#property
def value(self):
return self._value
#worker_ready.connect
def after_worker_ready(sender, **kwargs):
AssetLoader()
Here, I made AssetLoader a Borg class, but you can choose any other pattern/strategy to share a single instance of Asset. For illustrative purposes, I just capture the timestamp when executing get_heavy_asset.
# tasks.py
from worker import app, AssetLoader
#app.task(bind=True)
def load(self):
AssetLoader().get_heavy_asset()
return AssetLoader().value.time
#app.task(bind=True)
def my_task(self):
return AssetLoader().value.time
Bear in mind that Asset is shared per worker process but not across workers. If you run with concurrency=1, it doesn't make a difference, but for anything else it does. But from what I gather in your use case, it should be fine either way.
Related
I have a large Python 3.6 system where multiple processes and threads interact with each other and the user. Simplified, there is a Scheduler instance (subclasses threading.Thread) and a Worker instance (subclasses multiprocessing.Process). Both objects run for the entire duration of the program.
The user interacts with the Scheduler by adding Task instances and the Scheduler passes the task to the Worker at the correct moment in time. The worker uses the information contained in the task to do its thing.
Below is some stripped out and simplified code out of the project:
class Task:
def __init__(self, name:str):
self.name = name
self.state = 'idle'
class Scheduler(threading.Thread):
def __init__(self, worker:Worker):
super().init()
self.worker = worker
self.start()
def run(self):
while True:
# Do stuff until the user schedules a new task
task = Task() # <-- In reality the Task intance is not created here but the thread gets it from elsewhere
task.state = 'scheduled'
self.worker.change_task(task)
# Do stuff until the task.state == 'finished'
class Worker(multiprocessing.Process):
def __init__(self):
super().init()
self.current_task = None
self.start()
def change_task(self, new_task:Task):
self.current_task = new_task
self.current_task.state = 'accepted-idle'
def run(self):
while True:
# Do stuff until the current task is updated
self.current_task.state = 'accepted-running'
# Task is running
self.current_task.state = 'finished'
The system used to be structured so that the task contained multiple multiprocessing.Events indicating each of its possible states. Then, not the whole Task instance was passed to the worker, but each of the task's attributes was. As they were all multiprocessing safe, it worked, with a caveat. The events changed in worker.run had to be created in worker.run and back passed to the task object for it work. Not only is this a less than ideal solution, it no longer works with some changes I am making to the project.
Back to the current state of the project, as described by the python code above. As is, this will never work because nothing makes this multiprocessing safe at the moment. So I implemented a Proxy/BaseManager structure so that when a new Task is needed, the system gets it from the multiprocessing manager. I use this structure in a sightly different way elsewhere in the project as well. The issue is that the worker.run never knows that the self.current_task is updated, it remains None. I expected this to be fixed by using the proxy but clearly I am mistaken.
def Proxy(target: typing.Type) -> typing.Type:
"""
Normally a Manager only exposes only object methods. A NamespaceProxy can be used when registering the object with
the manager to expose all the attributes. This also works for attributes created at runtime.
https://stackoverflow.com/a/68123850/8353475
1. Instead of exposing all the attributes manually, we effectively override __getattr__ to do it dynamically.
2. Instead of defining a class that subclasses NamespaceProxy for each specific object class that needs to be
proxied, this method is used to do it dynamically. The target parameter should be the class of the object you want
to generate the proxy for. The generated proxy class will be returned.
Example usage: FooProxy = Proxy(Foo)
:param target: The class of the object to build the proxy class for
:return The generated proxy class
"""
# __getattr__ is called when an attribute 'bar' is called from 'foo' and it is not found eg. 'foo.bar'. 'bar' can
# be a class method as well as a variable. The call gets rerouted from the base object to this proxy, were it is
# processed.
def __getattr__(self, key):
result = self._callmethod('__getattribute__', (key,))
# If attr call was for a method we need some further processing
if isinstance(result, types.MethodType):
# A wrapper around the method that passes the arguments, actually calls the method and returns the result.
# Note that at this point wrapper() does not get called, just defined.
def wrapper(*args, **kwargs):
# Call the method and pass the return value along
return self._callmethod(key, args, kwargs)
# Return the wrapper method (not the result, but the method itself)
return wrapper
else:
# If the attr call was for a variable it can be returned as is
return result
dic = {'types': types, '__getattr__': __getattr__}
proxy_name = target.__name__ + "Proxy"
ProxyType = type(proxy_name, (NamespaceProxy,), dic)
# This is a tuple of all the attributes that are/will be exposed. We copy all of them from the base class
ProxyType._exposed_ = tuple(dir(target))
return ProxyType
class TaskManager(BaseManager):
pass
TaskProxy = Proxy(Task)
TaskManager.register('get_task', callable=Task, proxytype=TaskProxy)
I have implemented websocket in Django app using Django-channels, now the front-end send some data through the websocket and i want the current running celery task to be able to read it. I tried creating shared memory static object, but not working.
SimulationInputs.add(simulation_id=simulation.id, init_data=init_inputs)
return InteractiveSimulationTask.delay_or_fail(
simulation_id=simulation.id
)
class SimulationData:
data = ''
class SimulationInputs:
data = None
#classmethod
def init_manager(cls, manager):
manager = Manager()
cls.data = manager.dict()
#classmethod
def add(cls, simulation_id, init_data):
cls.data[simulation_id] = init_data
#classmethod
def write(cls, simulation_id, simulation_data):
if cls.data.get(simulation_id):
cls.data[simulation_id] = simulation_data
#classmethod
def read(cls, simulation_id, simulation_data):
simulation_data.data = cls.data.get(simulation_id)
# manage.y
if __name__ == "__main__":
SimulationInputs.init_manager()
class InteractiveSimulationTask(JobtasticTask):
def calculate_result(self, simulation_id, **kwargs):
while True:
SimulationInputs.read(simulation_id=self.simulation.id, simulation_data=simulation_data)
The task always throw exception cls.data.get(simulation_id): NoneObjectType has no method get
I need to share data between the celery task and the main process.
Any hint?
Since you're using celery, you probably have redis or some other memory-store available. Consider using this as your indirection layer, i.e. the read and write methods use the simulation_id as a key to the simulation data
I believe the issue you're facing is due to the lifecycle of the python class. In init_manager when you assign to cls.data you're overwriting the class's property, not the instance's property. This doesn't do what you want it to, as evidenced by the error message: cls.data is going to be None.
What I think you're going for is the "Singleton Pattern". You want to have one and only SimulationInputs object which can read/write the data for each ID. This discussion can help you with implementing a singleton in python
I come up to conclusion that Django and celery should not share the memory, because they are on diff. process and they are diff programs, so they should communicate through socket or messaging system. I solved my problem by using redis Pub/Sub https://redis.io/topics/pubsub.
I'm working on a Django app. I have an API endpoint, which if requested, must carry out a function that must be repeated a few times (until a certain condition is true). How I'm dealing with it right now is -
def shut_down(request):
# Do some stuff
while True:
result = some_fn()
if result:
break
time.sleep(2)
return True
While I know that this is a terrible approach and that I shouldn't be blocking for 2 seconds, I can't figure out how to get around it.
This works, after say a wait of 4 seconds. But I'd like something that keeps the loop running in the background, and stop once some_fn returns True. (Also, it is certain that some_fn will return True)
EDIT -
Reading Oz123's response gave me an idea which seems to work. Here's what I did -
def shut_down(params):
# Do some stuff
# Offload the blocking job to a new thread
t = threading.Thread(target=some_fn, args=(id, ), kwargs={})
t.setDaemon(True)
t.start()
return True
def some_fn(id):
while True:
# Do the job, get result in res
# If the job is done, return. Or sleep the thread for 2 seconds before trying again.
if res:
return
else:
time.sleep(2)
This does the job for me. It's simple but I don't know how efficient multithreading is in conjunction with Django.
If anyone can point out pitfalls of this, criticism is appreciated.
For many small projects celery is overkill. For those projects you can use schedule, it's very easy to use.
With this library you can make any function execute a task periodically:
import schedule
import time
def job():
print("I'm working...")
schedule.every(10).minutes.do(job)
schedule.every().hour.do(job)
schedule.every().day.at("10:30").do(job)
schedule.every().monday.do(job)
schedule.every().wednesday.at("13:15").do(job)
while True:
schedule.run_pending()
time.sleep(1)
The example runs in a blocking manner, but if you look in the FAQ, you will find that you can also run tasks in a parallel thread, such that you are not blocking, and remove the task once not needed anymore:
import threading
import time
from schedule import Scheduler
def run_continuously(self, interval=1):
"""Continuously run, while executing pending jobs at each elapsed
time interval.
#return cease_continuous_run: threading.Event which can be set to
cease continuous run.
Please note that it is *intended behavior that run_continuously()
does not run missed jobs*. For example, if you've registered a job
that should run every minute and you set a continuous run interval
of one hour then your job won't be run 60 times at each interval but
only once.
"""
cease_continuous_run = threading.Event()
class ScheduleThread(threading.Thread):
#classmethod
def run(cls):
while not cease_continuous_run.is_set():
self.run_pending()
time.sleep(interval)
continuous_thread = ScheduleThread()
continuous_thread.setDaemon(True)
continuous_thread.start()
return cease_continuous_run
Scheduler.run_continuously = run_continuously
Here is an example for usage in a class method:
def foo(self):
...
if some_condition():
return schedule.CancelJob # a job can dequeue it
# can be put in __enter__ or __init__
self._job_stop = self.scheduler.run_continuously()
logger.debug("doing foo"...)
self.foo() # call foo
self.scheduler.every(5).seconds.do(
self.foo) # schedule foo for running every 5 seconds
...
# later on foo is not needed any more:
self._job_stop.set()
...
def __exit__(self, exec_type, exc_value, traceback):
# if the jobs are not stop, you can stop them
self._job_stop.set()
This answer expands on Oz123's answer a little bit.
In order to get things working, I created a file called mainapp/jobs.py to contain my scheduled jobs. Then, in my apps.py module, I put from . import jobs in the ready method. Here's my entire apps.py file:
from django.apps import AppConfig
import os
class MainappConfig(AppConfig):
name = 'mainapp'
def ready(self):
from . import jobs
if os.environ.get('RUN_MAIN', None) != 'true':
jobs.start_scheduler()
(The RUN_MAIN check is because python manage.py runserver runs the ready method twice—once in each of two processes—but we only want to run it once.)
Now, here's what I put in my jobs.py file. First, the imports. You'll need to import Scheduler, threading and time as below. The F and UserHolding imports are just for what my job does; you won't import these.
from django.db.models import F
from schedule import Scheduler
import threading
import time
from .models import UserHolding
Next, write the function you want to schedule. The following is purely an example; your function won't look anything like this.
def give_admin_gold():
admin_gold_holding = (UserHolding.objects
.filter(inventory__user__username='admin', commodity__name='gold'))
admin_gold_holding.update(amount=F('amount') + 1)
Next, monkey-patch the schedule module by adding a run_continuously method to its Scheduler class. Do this by using the below code, which is copied verbatim from Oz123's answer.
def run_continuously(self, interval=1):
"""Continuously run, while executing pending jobs at each elapsed
time interval.
#return cease_continuous_run: threading.Event which can be set to
cease continuous run.
Please note that it is *intended behavior that run_continuously()
does not run missed jobs*. For example, if you've registered a job
that should run every minute and you set a continuous run interval
of one hour then your job won't be run 60 times at each interval but
only once.
"""
cease_continuous_run = threading.Event()
class ScheduleThread(threading.Thread):
#classmethod
def run(cls):
while not cease_continuous_run.is_set():
self.run_pending()
time.sleep(interval)
continuous_thread = ScheduleThread()
continuous_thread.setDaemon(True)
continuous_thread.start()
return cease_continuous_run
Scheduler.run_continuously = run_continuously
Finally, define a function to create a Scheduler object, wire up your job, and call the scheduler's run_continuously method.
def start_scheduler():
scheduler = Scheduler()
scheduler.every().second.do(give_admin_gold)
scheduler.run_continuously()
I recommend you use Celery's task management. You can refer this to set up this app (package if you're from javaScript background).
Once set, you can alter the code to:
#app.task
def check_shut_down():
if not some_fun():
# add task that'll run again after 2 secs
check_shut_down.delay((), countdown=3)
else:
# task completed; do something to notify yourself
return True
I can't comment on oz123's (https://stackoverflow.com/a/44897678/1108505) and Tanner Swett's (https://stackoverflow.com/a/60244694/5378866) excellent post, but as a final note I wanted to add that if you use Gunicorn and you have X number of workers, the section:
from django.apps import AppConfig
import os
class MainappConfig(AppConfig):
name = 'mainapp'
def ready(self):
from . import jobs
if os.environ.get('RUN_MAIN', None) != 'true':
jobs.start_scheduler()
will be executed that same number of times, launching X schedulers at the same time.
If we only want it to run only one instance (for example if you're going to create objects in the database), we would have to add in our gunicorn.conf.py file something like this:
def on_starting(server):
from app_project import jobs
jobs.start_scheduler()
And finally in the gunicorn call add the argument --preload
Here is my solution, with sources noted. This function will allow you to create a scheduler that you can start with your app, then add and subtract jobs at will. The check_interval variable allows you to trade-off between system resources and job execution timing.
from schedule import Scheduler
import threading
import warnings
import time
class RepeatTimer(threading.Timer):
"""Add repeated run of target to timer functionality. Source: https://stackoverflow.com/a/48741004/16466191"""
running: bool = False
def __init__(self, *args, **kwargs):
threading.Timer.__init__(self, *args, **kwargs)
def start(self) -> None:
"""Protect from running start method multiple times"""
if not self.running:
super(RepeatTimer, self).start()
self.running = True
else:
warnings.warn('Timer is already running, cannot be started again.')
def cancel(self) -> None:
"""Protect from running stop method multiple times"""
if self.running:
super(RepeatTimer, self).cancel()
self.running = False
else:
warnings.warn('Timer is already canceled, cannot be canceled again.')
def run(self):
"""Replace run method of timer to run continuously"""
while not self.finished.wait(self.interval):
self.function(*self.args, **self.kwargs)
class ThreadedScheduler(Scheduler, RepeatTimer):
"""Non-blocking scheduler. Advice taken from: https://stackoverflow.com/a/50465583/16466191"""
def __init__(
self,
run_pending_interval: float,
):
"""Initialize parent classes"""
Scheduler.__init__(self)
super(RepeatTimer, self).__init__(
interval=run_pending_interval,
function=self.run_pending,
)
def print_work(what_to_say: str):
print(what_to_say)
if __name__ == '__main__':
my_schedule = ThreadedScheduler(run_pending_interval=1)
job1 = my_schedule.every(1).seconds.do(print_work, what_to_say='Did_job1')
job2 = my_schedule.every(2).seconds.do(print_work, what_to_say='Did_job2')
my_schedule.cancel()
my_schedule.start()
time.sleep(7)
my_schedule.cancel_job(job1)
my_schedule.start()
time.sleep(7)
my_schedule.cancel()
I have a heavy external library class which takes time to initialize and consumes a lot of memory. I want to create it once per task instance, at minimum.
class NlpTask(Task):
def __init__(self):
print('initializing NLP parser')
self._parser = nlplib.Parser()
print('done initializing NLP parser')
#property
def parser(self):
return self._parser
#celery.task(base=NlpTask)
def my_task(arg):
x = my_task.parser.process(arg)
# etc.
Celery starts 32 worker processes, so I'd expect the printing "initializing ... done" 32 times, as I assume that a task instance is created per each worker. Surprisingly, I'm getting the printing once. What actually happens there? Thanks.
Your NlpTask is initializing once when it is getting registered with the worker.
If you have two tasks like
#celery.task(base=NlpTask)
def foo(arg):
pass
#celery.task(base=NlpTask)
def bar(arg):
pass
Then when you start a worker, you will see 2 initializations.
If you want to initialize it once for every worker, you can use worker_process_init signal.
from celery.signals import worker_process_init
#worker_process_init.connect()
def setup(**kwargs):
print('initializing NLP parser')
# setup
print('done initializing NLP parser')
Now, when you start a worker, you will see setup is being called by each process once.
for this:
that's my point - I'd expect once per worker, and it seems like once per celery instance. I edited the question – #davka
the answer must be use a sender filter in connect, like:
#worker_process_init.connect(sender='xx')
def func(sender, **kwargs):
if sender == 'xx':
# dosomething
but I found that it's not working in celery 4.0.2.
I'd like to do some cleanup operations inside the object just before its destruction. In this case it would be close the connection to the database.
Here is what I'm already doing:
Worker class:
from PyQt5 import QtCore
from pymongo import MongoClient, ASCENDING
from time import sleep
class StatusWidgetWorker(QtCore.QObject):
ongoing_conversions_transmit = QtCore.pyqtSignal([list])
def __init__(self, mongo_settings):
super().__init__()
print("StatusWidget Worker init")
mongo_client = MongoClient([mongo_settings["server_address"]])
self.log_database = mongo_client[mongo_settings["database"]]
self.ongoing_conversions = mongo_settings["ongoing_conversions"]
def status_retriever(self):
print("mongo bridge()")
while True:
ongoing_conversions_list = []
for doc in self.log_database[self.ongoing_conversions].find({}, {'_id': False}).sort([("start_date", ASCENDING)]):
ongoing_conversions_list.append(doc)
self.ongoing_conversions_transmit.emit(ongoing_conversions_list)
sleep(2)
And the function that call the worker from an other class :
def status_worker(self):
mongo_settings = "dict parameter"
self.worker_thread_status = QtCore.QThread()
self.worker_object_status = StatusWidgetWorker(mongo_settings)
self.worker_object_status.moveToThread(self.worker_thread_status)
self.worker_thread_status.started.connect(self.worker_object_status.status_retriever)
self.worker_object_status.ongoing_conversions_transmit.connect(self.status_table_auto_updater)
self.worker_thread_status.start()
Here is what I already tried:
Define a __del__ function in the Worker class, this function is never called.
Define a function in the Worker class and then connect it to the destroyed signal with self.destroyed.connect(self.function). This function is again never called. I think this happen because the signal is emitted when the object is already destroyed, not before its destruction.
I'm really wondering on how to this, here are some parts of answer:
http://www.riverbankcomputing.com/pipermail/pyqt/2014-November/035049.html
His approach seems a bit hacky to me (no offense to the author, there is probably no simple answer) and I have signals & parameters to pass to the worker witch would make the ThreadController class messier.
I find this solution a bit hacky because you have to set up a Controller class to do the Worker class job
If nobody has an answer, I'll probably use the ThreadController class and post the result here.
thank you for reading :-)
The usual rule in python apply:
there is a module for that
the solution is to use the atexit module and register the cleanup function in the __init__ function.
Example:
import atexit
class StatusWidgetWorker(QObject):
def __init__(self):
super().__init__()
# code here
atexit.register(self.cleanup)
def cleanup(self):
print("Doing some long cleanup")
sleep(2)
self.bla = "Done !"
print(self.bla)