I am struggling with transfering additional command line arguments to celery task. I can set the desired attribute in bootstep however the same attribute is emtpy when accessed directly from task (I guess it gets overriden)
class Arguments(bootsteps.Step):
def __init__(self, worker, environment, **options):
ArgumentTask.args = {'environment': environment}
# this works
print ArgumentTask.args
Here is the custom task
class ArgumentTask(Task):
abstract = True
_args = {}
#property
def args(self):
return self._args
#args.setter
def args(self, value):
self._args.update(value)
And actual task
#celery.task(base = ArgumentTask, bind = True, name = 'jobs.send')
def send(self):
# this prints empty dictionary
print self.args
Do I need to use some additional persistence layer, eg. persistent objects or am I missing something really obvious?
Similar question
It does not seem to be possible. The reason for that is that your task could be consumed anywhere by any consumer of the queue and each consumer having different command line parameters and therefore it's processing should not depend on workers configuration.
If your problem is to manage environment dev/prod this is the way we managed it in our project:
Each environment is jailed in it's venv having a configuration so that the project is self aware of it's environment(in our case it's just db links in configuration that changes). And each environment has its queues and celery workers launched with this command:
/path/venv/bin/celery worker -A async.myapp --workdir /path -E -n celery-name#server -Ofair
Hope it helped.
If you really want to dig hard on that, each task can access a .control which allows to launch control operations on celery (like some monitoring). But I didn't find anything helpful there.
Related
I'm just starting to use Jaeger for tracing and want to get the Python client to work with Luigi. The root of the problem is, that Luigi uses multiprocessing to fork worker processes. The docs mention that this can cause problems and recommend - in case of web apps - to defer the tracer initialization until the request handling process has been forked. This does not work for me, because I want to create traces in the main process and in the worker processes.
In the main process I initialize the tracer like described here:
from jaeger_client import Config
config = Config(...)
tracer = config.initialize_tracer()
This creates internally a Tornado io-loop that cannot be reused in the forked processes. So I try to re-initialize the Jaeger client in each Luigi worker process. This is possible by setting the (undocumented?) task_process_context of the worker-Section to a class implementing the context manager protocol. It looks like this:
class WorkerContext:
def __init__(self, process):
pass
def __enter__(self):
Config._initialized = False
Config._initialized_lock = threading.Lock()
config = Config(...)
self.tracer = config.initialize_tracer()
def __exit__(self, type, value, traceback):
self.tracer.close()
time.sleep(2)
Those two lines are of course very "hackish":
Config._initialized = False
Config._initialized_lock = threading.Lock()
The class variables are copied to the forked process and initialize_tracer would complain about already being initialized, if I would not reset the variables.
The details of how multiprocessing forks the new process and what this means for the Tornado loop are somewhat "mystic" to me. So my question is:
Is the above code safe or am I asking for trouble?
Of course I would get rid of accessing Configs internals. But if the solution could be considered safe, I would ask the maintainers for a reset method, that can be called in a forked process.
This question is a follow up of django + celery: disable prefetch for one worker, Is there a bug?
I had a problem with celery (see the question that I follow up) and in order to resolve it I'd like to have two celery workers with -concurrency 1 each but with two different settings of task_acks_late.
My current approach is working, but in my opinion not very beautiful. I am doing the following:
in settings.py of my django project:
CELERY_TASK_ACKS_LATE = os.environ.get("LACK", "False") == "True"
This allows me to start the celery workers with following commands:
LACK=True celery -A miniclry worker --concurrency=1 -n w2 -Q=fast,slow --prefetch-multiplier 1
celery -A miniclry worker --concurrency=1 -n w1 -Q=fast
What would be more intuitive would be if I could do something like:
celery -A miniclry worker --concurrency=1 -n w2 -Q=fast,slow --prefetch-multiplier 1 --late-ack=True
celery -A miniclry worker --concurrency=1 -n w1 -Q=fast --late-ack=False
I found Initializing Different Celery Workers with Different Values but don't understand how to embed this in my django / celery context. In which files would I have to add the code that's adding an argument to the parser and how could I use the custom param to modify task_acks_late of the celery settings.
Update:
Thanks to #Greenev's answer I managed to add custom options to celery. However it seems, that changing the config with this mechanism 'arrives too late' and the chagne is not taken into account.
One possible solution here is to provide acks_late=True as an argument of the shared_task decorator, given your code from the prior question:
#shared_task(acks_late=True)
def task_fast(delay=0.1):
logger.warning("fast in")
time.sleep(delay)
logger.warning("fast out")
UPD. I haven't got task_acks_late to be set using this approach, but you could add a command line argument as follows.
You've already linked to a solution. I can't see any django specifics here, just put the parser.add_argument code to where you have defined your app, given your code from the prior question, you would have something like this:
app = Celery("miniclry", backend="rpc", broker="pyamqp://")
app.config_from_object('django.conf:settings', namespace='CELERY')
def add_worker_arguments(parser):
parser.add_argument('--late-ack', default=False)
app.user_options['worker'].add(add_worker_arguments)
Then you could access your argument value in celeryd_init signal handler
#celeryd_init.connect
def configure_worker(sender=None, conf=None, options=None, **kwargs):
conf.task_acks_late = options.get('late-ack') # get custom argument value from options
I want to provide shared state for a Flask app which runs with multiple workers, i. e. multiple processes.
To quote this answer from a similar question on this topic:
You can't use global variables to hold this sort of data. [...] Use a data source outside of Flask to hold global data. A database, memcached, or redis are all appropriate separate storage areas, depending on your needs.
(Source: Are global variables thread safe in flask? How do I share data between requests?)
My question is on that last part regarding suggestions on how to provide the data "outside" of Flask. Currently, my web app is really small and I'd like to avoid requirements or dependencies on other programs. What options do I have if I don't want to run Redis or anything else in the background but provide everything with the Python code of the web app?
If your webserver's worker type is compatible with the multiprocessing module, you can use multiprocessing.managers.BaseManager to provide a shared state for Python objects. A simple wrapper could look like this:
from multiprocessing import Lock
from multiprocessing.managers import AcquirerProxy, BaseManager, DictProxy
def get_shared_state(host, port, key):
shared_dict = {}
shared_lock = Lock()
manager = BaseManager((host, port), key)
manager.register("get_dict", lambda: shared_dict, DictProxy)
manager.register("get_lock", lambda: shared_lock, AcquirerProxy)
try:
manager.get_server()
manager.start()
except OSError: # Address already in use
manager.connect()
return manager.get_dict(), manager.get_lock()
You can assign your data to the shared_dict to make it accessible across processes:
HOST = "127.0.0.1"
PORT = 35791
KEY = b"secret"
shared_dict, shared_lock = get_shared_state(HOST, PORT, KEY)
shared_dict["number"] = 0
shared_dict["text"] = "Hello World"
shared_dict["array"] = numpy.array([1, 2, 3])
However, you should be aware of the following circumstances:
Use shared_lock to protect against race conditions when overwriting values in shared_dict. (See Flask example below.)
There is no data persistence. If you restart the app, or if the main (the first) BaseManager process dies, the shared state is gone.
With this simple implementation of BaseManager, you cannot directly edit nested values in shared_dict. For example, shared_dict["array"][1] = 0 has no effect. You will have to edit a copy and then reassign it to the dictionary key.
Flask example:
The following Flask app uses a global variable to store a counter number:
from flask import Flask
app = Flask(__name__)
number = 0
#app.route("/")
def counter():
global number
number += 1
return str(number)
This works when using only 1 worker gunicorn -w 1 server:app. When using multiple workers gunicorn -w 4 server:app it becomes apparent that number is not a shared state but individual for each worker process.
Instead, with shared_dict, the app looks like this:
from flask import Flask
app = Flask(__name__)
HOST = "127.0.0.1"
PORT = 35791
KEY = b"secret"
shared_dict, shared_lock = get_shared_state(HOST, PORT, KEY)
shared_dict["number"] = 0
#app.route("/")
def counter():
with shared_lock:
shared_dict["number"] += 1
return str(shared_dict["number"])
This works with any number of workers, like gunicorn -w 4 server:app.
your example is a bit magic for me! I'd suggest reusing the magic already in the multiprocessing codebase in the form of a Namespace. I've attempted to make the following code compatible with spawn servers (i.e. MS Windows) but I only have access to Linux machines, so can't test there
start by pulling in dependencies and defining our custom Manager and registering a method to get out a Namespace singleton:
from multiprocessing.managers import BaseManager, Namespace, NamespaceProxy
class SharedState(BaseManager):
_shared_state = Namespace(number=0)
#classmethod
def _get_shared_state(cls):
return cls._shared_state
SharedState.register('state', SharedState._get_shared_state, NamespaceProxy)
this might need to be more complicated if creating the initial state is expensive and hence should only be done when it's needed. note that the OPs version of initialising state during process startup will cause everything to reset if gunicorn starts a new worker process later, e.g. after killing one due to a timeout
next I define a function to get access to this shared state, similar to how the OP does it:
def shared_state(address, authkey):
manager = SharedState(address, authkey)
try:
manager.get_server() # raises if another server started
manager.start()
except OSError:
manager.connect()
return manager.state()
though I'm not sure if I'd recommend doing things like this. when gunicorn starts it spawns lots of processes that all race to run this code and it wouldn't surprise me if this could go wrong sometimes. also if it happens to kill off the server process (because of e.g. a timeout) every other process will start to fail
that said, if we wanted to use this we would do something like:
ss = shared_state('server.sock', b'noauth')
ss.number += 1
this uses Unix domain sockets (passing a string rather than a tuple as an address) to lock this down a bit more.
also note this has the same race conditions as the OP's code: incrementing a number will cause the value to be transferred to the worker's process, which is then incremented, and sent back to the server. I'm not sure what the _lock is supposed to be protecting, but I don't think it'll do much
Is there a way to determine, programatically, that the current module being imported/run is done so in the context of a celery worker?
We've settled on setting an environment variable before running the Celery worker, and checking this environment variable in the code, but I wonder if there's a better way?
Simple,
import sys
IN_CELERY_WORKER_PROCESS = sys.argv and sys.argv[0].endswith('celery')\
and 'worker' in sys.argv
if IN_CELERY_WORKER_PROCESS:
print ('Im in Celery worker')
http://percentl.com/blog/django-how-can-i-detect-whether-im-running-celery-worker/
As of celery 4.2 you can also do this by setting a flag on the worker_ready signal
in celery.py:
from celery.signals import worker_ready
app = Celery(...)
app.running = False
#worker_ready.connect
def set_running(*args, **kwargs):
app.running = True
Now you can check within your task by using the global app instance
to see whether or not you are running. This can be very useful to determine which logger to use.
Depending on what your use-case scenario is exactly, you may be able to detect it by checking whether the request id is set:
#app.task(bind=True)
def foo(self):
print self.request.id
If you invoke the above as foo.delay() then the task will be sent to a worker and self.request.id will be set to a unique number. If you invoke it as foo(), then it will be executed in your current process and self.request.id will be None.
You can use the current_worker_task property from the Celery application instance class. Docs here.
With the following task defined:
# whatever_app/tasks.py
celery_app = Celery(app)
#celery_app.task
def test_task():
if celery_app.current_worker_task:
return 'running in a celery worker'
return 'just running'
You can run the following on a python shell:
In [1]: from whatever_app.tasks import test_task
In [2]: test_task()
Out[2]: 'just running'
In [3]: r = test_task.delay()
In [4]: r.result
Out[4]: u'running in a celery worker'
Note: Obviously for test_task.delay() to succeed, you need to have at least one celery worker running and configured to load tasks from whatever_app.tasks.
Adding a environment variable is a good way to check if the module is being run by celery worker. In the task submitter process we may set the environment variable, to mark that it is not running in the context of a celery worker.
But the better way may be to use some celery signals which may help to know if the module is running in worker or task submitter. For example, worker-process-init signal is sent to each child task executor process (in preforked mode) and the handler can be used to set some global variable indicating it is a worker process.
It is a good practice to start workers with names, so that it becomes easier to manage(stop/kill/restart) them. You can use -n to name a worker.
celery worker -l info -A test -n foo
Now, in your script you can use app.control.inspect to see if that worker is running.
In [22]: import test
In [23]: i = test.app.control.inspect(['foo'])
In [24]: i.app.control.ping()
Out[24]: [{'celery#foo': {'ok': 'pong'}}]
You can read more about this in celery worker docs
i'm pretty new to fabric and i'm trying to create a script which starts three different servers and a client, but it seems it halts at the first task. here is my code:
env.user='userX'
env.roledefs = {
'client':['HostC'],
'id 0':['Host0'],
'id 1':['Host1'],
'id 2':['Host2']
}
def go():
execute(serve0)
execute(serve1)
execute(serve2)
execute(req)
#roles('id 0')
def serve0():
run('./go/bin/kvsd -id 0 -config-file ~/go/src/github.com/userX/kvs/kvsd/conf/config.ini')
#roles('id 1')
def serve1():
run('./go/bin/kvsd -id 1 -config-file ~/go/src/github.com/userX/kvs/kvsd/conf/config.ini')
#roles('id 2')
def serve2():
run('./go/bin/kvsd -id 2 -config-file ~/go/src/github.com/userX/kvs/kvsd/conf/config.ini')
#roles('client')
def req():
run('./go/bin/kvsc -config-file ~/go/src/github.com/userX/kvs/kvsc/config.ini')
I run it with "fab go", but that only leads to it executing serve0, which doesn't stop without an interupt, and as such prevents any of the other tasks from executing. Is there a way to make them run in parallel? Also, is there a better way to tie spesific tasks to spesific hosts?
Add the #parallel decorator to your functions and they will automatically run in parallel. More details here: http://docs.fabfile.org/en/1.10/usage/parallel.html
Use -H when calling fabric from the cli to identify which hosts to use. From within the task you can check the env dictionary, or you can use the #hosts decorator to specify what runs where. More info here: http://docs.fabfile.org/en/1.10/usage/execution.html