I'm trying to find a way to be able to turn celery tasks on/off from the django admin. This is mostly to disable tasks that call external services when those services are down or have a scheduled maintenance period.
For my periodic tasks, this is easy, especially with django-celery. But for tasks that are called on demand I'm having some trouble. Currently I'm exploring just storing an on/off status for various tasks in a TaskControl model and then just checking that status at the beginning of task execution, returning None if the status is False. This makes me feel dirty due to all the extra db lookups every time a task kicks off. I could use a cache backend that isn't the db, but it seems a little overkill to add caching just for these handful of key/value pairs.
in models.py
    # this is a singleton model. singleton code bits omitted for brevity.
    class TaskControl(models.Model):
        some_status = models.BooleanField(default=True)
        # more statuses
in tasks.py
    @celery.task(ignore_result=True)
    def some_task():
        task_control = TaskControl.objects.get(pk=1)
        if not task_control.some_status:
            return None
        # otherwise execute task as normal
What is a better way to do this?
Option 1. Try your simple approach. See if it affects performance. If not, lose the “dirty” feeling.
Option 2. Cache in process memory with a singleton. Add freshness information to your TaskControl model:
    from datetime import datetime

    class TaskControl(models.Model):
        some_status = models.BooleanField(default=True)
        # more statuses
        expires = models.DateTimeField()
        check_interval = models.IntegerField(default=5 * 60)  # seconds

        def is_stale(self):
            # `retrieved` is set by the caller after each fetch (see task_ctl.py)
            now = datetime.utcnow()
            return (
                (now >= self.expires) or
                ((now - self.retrieved).total_seconds() >= self.check_interval))
Then in a task_ctl.py:
    from datetime import datetime

    _control = None

    def is_enabled():
        global _control
        if (_control is None) or _control.is_stale():
            _control = TaskControl.objects.get(pk=1)
            # There's probably a better way to set `retrieved`,
            # maybe with a signal or a `Model` override,
            # but this should work.
            _control.retrieved = datetime.utcnow()
        return _control.some_status
Option 3. Like option 2, but instead of time-based expiration, use Celery’s remote control to force all workers to reload the TaskControl (you’ll have to write your own control command, and I don’t know if all the internals you will need are public APIs).
Option 4, only applicable if all your Celery workers run on a single machine. Store the on/off flag as a file on that machine’s file system. Query its existence with os.path.exists (that should be a single stat() or something, cheap enough). If the workers and the admin panel are on different machines, use a special Celery task to create/remove the file.
Option 5. Like option 4, but on the admin/web machine: if the file exists, just don’t queue the task in the first place.
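Options 4 and 5 can be sketched with nothing but the standard library. This is a minimal illustration, not a definitive implementation: the helper names, the flag-file naming scheme, and the convention that an existing file means "disabled" are all assumptions made for the example.

```python
import os
import tempfile

# Hypothetical flag directory; any path visible to every worker would do.
FLAG_DIR = tempfile.gettempdir()


def flag_path(task_name, flag_dir=FLAG_DIR):
    """Return the path of the flag file for one task."""
    return os.path.join(flag_dir, "disable_%s.flag" % task_name)


def disable_task(task_name, flag_dir=FLAG_DIR):
    """Create an empty flag file; its existence means 'task disabled'."""
    with open(flag_path(task_name, flag_dir), "a"):
        pass


def enable_task(task_name, flag_dir=FLAG_DIR):
    """Remove the flag file; a missing file is not an error."""
    try:
        os.remove(flag_path(task_name, flag_dir))
    except FileNotFoundError:
        pass


def is_enabled(task_name, flag_dir=FLAG_DIR):
    """Check the flag with os.path.exists, as suggested in option 4."""
    return not os.path.exists(flag_path(task_name, flag_dir))
```

A task body (option 4) or the code that queues the task (option 5) would call is_enabled("some_task") first and return early when it is False.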
Related
I've wrapped Ray in a web API (using ray start --head and uvicorn with ray.init). Now I'm trying to:
Submit a job to Ray (via the API) and serialise the future and return to the user
Later, let the user hit an API to see if the task is finished and return the results
The kicker is that, the API being multi-threaded, I can't guarantee that the next call will come from the same thread. Here is what I naively thought would work:
    id = my_function.remote()
    id_hex = id.hex()
Then in another request/invocation:
    id = ray._raylet.ObjectID(binascii.unhexlify(id_hex))
    ray.get(id)
Now this never gets finished (it times out) even though I know the future is finished and that the code works if I run it in the same thread as the original request.
I'm guessing this has to do with using another initialisation of Ray.
Is there any way to force Ray to "refresh" a future's result from Redis?
Reconstructing an ObjectID by hand like that can cause unexpected behavior because of Ray's reference counting and optimization mechanisms. One recommendation is to use a detached actor: create one and delegate the calls to it. A detached actor survives for the lifetime of Ray (unless you kill it), so you don't need to worry about the problems you mentioned. Yes, it can make the program a bit slower, since it requires two hops, but I guess this overhead won't matter for your workload (a client submission model).
https://docs.ray.io/en/latest/advanced.html?highlight=detached#detached-actors
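    @ray.remote
    class TaskInvocator:
        def __init__(self):
            # hex key -> object ID; holding the IDs here keeps them alive
            self.futures = {}

        def your_function(self):
            object_id = real_function.remote()
            key = object_id.hex()
            self.futures[key] = object_id
            return key  # hand a plain string back to the caller

        def get_result(self, hex_object_id):
            return ray.get(self.futures[hex_object_id])

    # Create the detached actor once. These are the calls as originally posted;
    # the actor-creation and lookup APIs have changed between Ray releases,
    # so check the documentation for your version.
    TaskInvocator.remote(detached=True, name='invocator-1')

    # In the first request: start the work and keep only the string key.
    task_invocator = ray.util.get_actor('invocator-1')
    key = ray.get(task_invocator.your_function.remote())

    # ... later, possibly in another request ...
    result = ray.get(task_invocator.get_result.remote(key))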
Serializing a ray future to a string out-of-band and then deserializing it is not supported. The reason for this is because these futures are more than just IDs, they have a lot of state associated with them in various system components.
One thing you could do to support this type of API is have an actor that manages the lifetime of these tasks. When you start the task, you pass its ObjectID to the actor. Then, when a user hits the endpoint to check if it's finished, it pings the actor which looks up the corresponding ObjectID and calls ray.wait() on it.
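The shape of such a manager can be sketched without Ray. In the stand-in below, concurrent.futures replaces Ray entirely, so this is not Ray code; the class and method names are hypothetical. In a real deployment the class would be a Ray actor, and the done() check would be replaced by a non-blocking ray.wait() on the stored ObjectID.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class TaskRegistry:
    """Owns in-flight futures so callers only ever hold a plain string key."""

    def __init__(self, max_workers=2):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = {}  # key -> Future

    def submit(self, fn, *args):
        """Start fn(*args) and return an opaque key that is safe to serialise."""
        key = uuid.uuid4().hex
        self._futures[key] = self._pool.submit(fn, *args)
        return key

    def status(self, key):
        """Non-blocking check: report the result only once the work is done."""
        future = self._futures[key]
        if not future.done():
            return {"done": False}
        return {"done": True, "result": future.result()}
```

The web API would return the key from submit() to the user and call status(key) when the user polls; the future itself never leaves the registry.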
I have a Django application that uses large data structures in-memory (due to performance constraints). This wouldn't be a problem, but I'm using Heroku, where if the Python web process takes more than 30s to start, it is stopped because it's considered a timeout error. Because of this, I've used a daemon process (a worker in Heroku) to handle the construction of the data structures, and Redis to handle the message passing between processes.
When the worker finishes (approx. 1 minute), it stores the data structures (50 MB or so) in Redis.
And now comes the crux of the matter... Django follows the request/response paradigm and is synchronous. This implies that a Django view must exist to handle the callback from the worker announcing it's done. Even if I use something fancier like Redis pub/sub, I'm still forced to check the queue populated by the publisher inside a view.
How can I circumvent the necessity of using a Django view? Isn't there an async way of doing this?
Below is the solution where I use a pub/sub inside a view. This seems bad, but I can't think of another way.
views.py
    ...
    # data_handler can enqueue tasks on the default queue
    data_handler = DataHandler()
    strict_redis = redis.from_url(settings.DEFAULT_QUEUE)
    pub_sub = strict_redis.pubsub()
    # this puts the job of constructing the large data structures
    # on the default queue so a worker can pick it up. Being async,
    # it returns with an empty set of data structures.
    data_structures = data_handler.start()
    pub_sub.subscribe(settings.FINISHED_DATA_STRUCTURES_CHANNEL)

    @require_http_methods(['POST'])
    def store_and_fetch(request):
        # request.body is bytes, so decode it and parse with json.loads
        user_data = json.loads(request.body.decode('utf8'))
        message = pub_sub.get_message()
        if message:
            command = message['data'] if 'data' in message else ''
            if command == settings.FINISHED_DATA_STRUCTURES_INIT.encode('utf-8'):
                # this takes the data from redis and updates data_structures
                data_handler.update(data_structures)
        return HttpResponse(compute_response(user_data, data_structures))
Update: After working with this for multiple months, I can now say it's definitely better (and wiser) NOT to fiddle with Django's request/response cycle. Tools like Django RQ Scheduler or Celery can do async tasks just fine. If you want to update the main web process after some repeatable job completes, it's simpler to use something like the Python requests package and send a POST to the web process from the worker that did the scheduled job. This way we don't circumvent Django's mechanisms and, more importantly, it's simpler overall.
Regarding the Heroku constraints I mentioned at the beginning of the post: when I wrote this question I was quite a newbie with Heroku and didn't know much about the release phase. In the release phase we can set up all the complex logic we need for the main process. Thus, at the end of the release phase, we simply need to notify the web process in the manner I've described above, and use some distributed memory buffer (even Redis will work just fine).
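The notification step might look roughly like the sketch below. It swaps the requests package for the standard library's urllib so the example has no dependencies; the function names, the URL, and the payload shape are assumptions, and the receiving Django view is not shown.

```python
import json
import urllib.request


def build_notification(url, payload):
    """Build a JSON POST request announcing that the worker has finished."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def notify_web_process(url, payload, timeout=10):
    """Send the notification and return the HTTP status code."""
    request = build_notification(url, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

The worker would call notify_web_process with the URL of an ordinary Django view once it has written the data structures to Redis; that view then reloads them from Redis.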
My scenario is as follows: I have a large machine learning model, which is computed by a bunch of workers. In essence, the workers each compute their own part of the model and then exchange results in order to maintain a globally consistent state of the model.
So, every Celery task computes its own part of the job.
But this means that the tasks aren't stateless, and here is my trouble: if I say some_task.delay(123, 456), in reality I'm NOT sending two integers here!
I'm sending the whole state of the task, which is pickled somewhere in Celery. This state is typically about 200 MB :-((
I know that it's possible to select a decent serializer in Celery, but my question is how NOT to pickle just ANY data that happens to be in the task.
How do I pickle the arguments of the task only?
Here is a citation from celery/app/task.py:
    def __reduce__(self):
        # - tasks are pickled into the name of the task only, and the reciever
        # - simply grabs it from the local registry.
        # - in later versions the module of the task is also included,
        # - and the receiving side tries to import that module so that
        # - it will work even if the task has not been registered.
        mod = type(self).__module__
        mod = mod if mod and mod in sys.modules else None
        return (_unpickle_task_v2, (self.name, mod), None)
I simply don't want this to happen.
Is there a simple way around it, or am I forced to build my own Celery (which is ugly to imagine)?
Don't use the celery results backend for this. Use a separate data store.
While you could just use Task.ignore_result, this would mean that you lose the ability to track the task's status etc.
The best solution would be to use one storage engine (e.g. Redis) for your results backend.
You should set up a separate storage engine (a separate instance of Redis, or maybe something like MongoDB, depending on your needs) to store the actual data.
In this way you can still see the status of your tasks but the large data sets do not affect the operation of celery.
Switching to the JSON serializer may reduce the serialization overhead, depending on the format of the data you generate. However, it can't solve the underlying problem of putting too much data through the results backend.
The results backend can handle relatively small amounts of data. Once you go over a certain limit, you start to interfere with its primary job: communicating task status.
I would suggest updating your tasks so that they return a lightweight data structure containing useful metadata (to e.g. facilitate co-ordination between tasks), and storing the "real" data in a dedicated storage solution.
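A rough sketch of that split follows. A plain directory stands in for the dedicated store (which in practice might be a separate Redis instance or MongoDB), and the function names and metadata fields are hypothetical. Only the small dictionary returned by store_payload would travel through the results backend.

```python
import hashlib
import os
import pickle
import tempfile


def store_payload(data, store_dir):
    """Write the heavy data to the store and return lightweight metadata."""
    blob = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
    key = hashlib.sha256(blob).hexdigest()
    with open(os.path.join(store_dir, key + ".pkl"), "wb") as fh:
        fh.write(blob)
    # This small dict is what a task would return as its result.
    return {"key": key, "size_bytes": len(blob)}


def load_payload(meta, store_dir):
    """Fetch the heavy data back using the metadata a task returned."""
    with open(os.path.join(store_dir, meta["key"] + ".pkl"), "rb") as fh:
        return pickle.load(fh)
```

Another task that needs the data receives only the metadata and calls load_payload itself, so the broker and the results backend never see the large blob.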
You have to set ignore_result on your task, as the docs say:
Task.ignore_result
Don’t store task state. Note that this means you can’t use AsyncResult to check if the task is ready, or get its return value.
This is a little off-topic, but still.
As I understand it, you have several processes that do heavy calculations in parallel and communicate with each other. So, instead of Celery, which doesn't suit your case, you could:
use zmq for inter-process communications (to send only necessary data),
use supervisor for managing and running processes (numprocs in particular will help with running multiple same workers).
While this does not require writing your own Celery, you will need to write some code.
Is there any standard/backend-independent method for querying pending tasks based on certain fields?
For example, I have a task which needs to run once after the “last user interaction”, and I'd like to implement it something like:
    def user_changed_content():
        task = find_task(name="handle_content_change")
        if task is None:
            task = queue_task("handle_content_change")
        task.set_eta(datetime.now() + timedelta(minutes=5))
        task.save()
Or is it simpler to hook directly into the storage backend?
No, this is not possible.
Even if some transports (e.g. Redis) may support accessing the "queue" out of order, it is not a good idea.
The task may not be on the queue anymore, and instead reserved by a worker.
See this part in the documentation: http://docs.celeryproject.org/en/latest/userguide/tasks.html#state
Given that, a better approach would be for the task to check whether it should reschedule itself when it starts:
    @task
    def reschedules():
        new_eta = redis.get(".".join([reschedules.request.task_id, "new_eta"]))
        if new_eta:
            return reschedules.retry(eta=new_eta)
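The decision the task makes when it starts can be isolated as a pure function, which makes the "run once, five minutes after the last interaction" rule easy to test. This is only a sketch: the function name, the return convention, and the idea of storing a last-interaction timestamp are assumptions; the five-minute period comes from the question.

```python
from datetime import datetime, timedelta

# Quiet period taken from the question's example.
QUIET_PERIOD = timedelta(minutes=5)


def decide(now, last_interaction, quiet_period=QUIET_PERIOD):
    """Return ("run", None) if the quiet period has elapsed,
    otherwise ("retry", eta) with the time at which to try again."""
    due = last_interaction + quiet_period
    if now >= due:
        return ("run", None)
    return ("retry", due)
```

Inside the task, a "retry" outcome would map to a retry with that eta, and a "run" outcome to doing the actual work. Each user interaction then only needs to update the stored timestamp, never the queue.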
I have an "analytics dashboard" screen that is visible to my django web applications users that takes a really long time to calculate. It's one of these screens that goes through every single transaction in the database for a user and gives them metrics on it.
I would love for this to be a realtime operation, but calculation times can be 20-30 seconds for an active user (no paging allowed, it's giving averages on transactions.)
The solution that comes to mind is to calculate this in the backend via a manage.py batch command and then just display cached values to the user. Is there a Django design pattern to help facilitate these types of models/displays?
What you're looking for is a combination of offline processing and caching. By offline, I mean that the computation logic happens outside the request-response cycle. By caching, I mean that the result of your expensive calculation is sufficiently valid for X time, during which you do not need to recalculate it for display. This is a very common pattern.
Offline Processing
There are two widely-used approaches to work which needs to happen outside the request-response cycle:
Cron jobs (often made easier via a custom management command)
Celery
In relative terms, cron is simpler to set up, and Celery is more powerful/flexible. That being said, Celery enjoys fantastic documentation and a comprehensive test suite. I've used it in production on almost every project, and while it does involve some requirements, it's not really a bear to set up.
Cron
Cron jobs are the time-honored method. If all you need is to run some logic and store some result in the database, a cron job has zero dependencies. The only fiddly bit with cron jobs is getting your code to run in the context of your Django project -- that is, your code must correctly load your settings.py in order to know about your database and apps. For the uninitiated, this can lead to some aggravation in divining the proper PYTHONPATH and such.
If you're going the cron route, a good approach is to write a custom management command. You'll have an easy time testing your command from the terminal (and writing tests), and you won't need to do any special hoopla at the top of your management command to set up a proper Django environment. In production, you simply run path/to/manage.py yourcommand. I'm not sure if this approach works without the assistance of virtualenv, which you really ought to be using regardless.
Another aspect to consider with cronjobs: if your logic takes a variable amount of time to run, cron is ignorant of the matter. A cute way to kill your server is to run a two-hour cronjob like this every hour. You can roll your own locking mechanism to prevent this, just be aware of this—what starts out as a short cronjob might not stay that way when your data grows, or when your RDBMS misbehaves, etc etc.
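One way to roll such a lock is an exclusive lock file, sketched below with the standard library only. The names are hypothetical, and this simple version has a known weakness: if the process is killed before the cleanup runs, the stale file blocks later runs until someone removes it.

```python
import os
import tempfile
from contextlib import contextmanager


class AlreadyRunning(Exception):
    """Raised when another run of the job still holds the lock."""


@contextmanager
def single_instance(lock_path):
    """Hold an exclusive lock file for the duration of the job."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise AlreadyRunning(lock_path) from None
    try:
        # Record the PID to help diagnose a stale lock by hand.
        os.write(fd, str(os.getpid()).encode("ascii"))
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)
```

The management command would wrap its work in `with single_instance(path):` and exit quietly on AlreadyRunning, so an overlapping cron invocation does nothing instead of piling up.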
In your case, it sounds like cron is less applicable because you'd need to calculate the graphs for every user every so often, without regard to who is actually using the system. This is where Celery can help.
Celery
…is the bee's knees. Usually people are scared off by the "default" requirement of an AMQP broker. It's not terribly onerous setting up RabbitMQ, but it does require stepping outside of the comfortable world of Python a bit. For many tasks, I just use redis as my task store for Celery. The settings are straightforward:
    CELERY_RESULT_BACKEND = "redis"
    REDIS_HOST = "localhost"
    REDIS_PORT = 6379
    REDIS_DB = 0
    REDIS_CONNECT_RETRY = True
Voilà, no need for an AMQP broker.
Celery provides a wealth of advantages over simple cron jobs. Like cron, you can schedule periodic tasks, but you can also fire off tasks in response to other stimuli without holding up the request/response cycle.
If you don't want to compute the chart for every active user every so often, you will need to generate it on-demand. I'm assuming that querying for the latest available averages is cheap, computing new averages is expensive, and you're generating the actual charts client-side using something like flot. Here's an example flow:
User requests a page which contains an averages chart.
Check cache -- is there a stored, nonexpired queryset containing averages for this user?
If yes, use that.
If not, fire off a celery task to recalculate it, requery and cache the result. Since querying existing data is cheap, run the query if you want to show stale data to the user in the meantime.
If the chart is stale, optionally provide some indication of that, or do some ajax fanciness to ping Django every so often and ask if the refreshed chart is ready.
You could combine this with a periodic task to recalculate the chart every hour for users that have an active session, to prevent really stale charts from being displayed. This isn't the only way to skin the cat, but it provides you with all the control you need to ensure freshness while throttling CPU load of the calculation task. Best of all, the periodic task and the "on demand" task share the same logic—you define the task once and call it from both places for added DRYness.
Caching
The Django cache framework provides you with all the hooks you need to cache whatever you want for as long as you want. Most production sites rely on memcached as their cache backend. I've lately started using redis with the django-redis-cache backend instead, but I'm not sure I'd trust it for a major production site yet.
Here's some code showing off usage of the low-level caching API to accomplish the workflow laid out above:
    import pickle
    from django.core.cache import cache
    from django.shortcuts import render
    from mytasks import calculate_stuff  # used by the view; the task below would live in mytasks.py
    from celery.task import task

    @task
    def calculate_stuff(user_id):
        # ... do your work to update the averages ...
        # now pull the latest series
        averages = TransactionAverage.objects.filter(user=user_id, ...)
        # cache the pickled result for ten minutes
        cache.set("averages_%s" % user_id, pickle.dumps(averages), 60*10)

    def myview(request, user_id):
        ctx = {}
        cached = cache.get("averages_%s" % user_id, None)
        if cached:
            averages = pickle.loads(cached) # use the cached queryset
        else:
            # fetch the latest available data for now, same as in the task
            averages = TransactionAverage.objects.filter(user=user_id, ...)
            # fire off the celery task to update the information in the background
            calculate_stuff.delay(user_id) # doesn't happen in-process.
            ctx['stale_chart'] = True # display a warning, if you like
        ctx['averages'] = averages
        # ... do your other work ...
        return render(request, 'my_template.html', ctx)
Edit: worth noting that pickling a queryset loads the entire queryset into memory. If you're pulling up a lot of data with your averages queryset this could be suboptimal. Testing with real-world data would be wise in any case.
The simplest and, IMO, correct solution for such scenarios is to pre-calculate everything as things are updated, so that when the user views the dashboard you calculate nothing and just display already calculated values.
There are various ways to do that, but the generic concept is to trigger a calculation function in the background whenever something the calculation depends on changes.
For triggering such a calculation in the background I usually use Celery. Suppose the user adds an item foo in the view view_foo: we call a Celery task update_foo_count, which runs in the background and updates the foo count. Alternatively, you can have a Celery timer which updates the count, say, every 10 minutes, after checking whether a recalculation needs to be done. The recalculate flag can be set at the various places where the user updates data.
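The recalculate-flag variant can be sketched in plain Python, independent of Celery. Everything here is hypothetical: the class name, the in-memory storage (a real project would keep the values and flags in the database or cache), and the compute callback standing in for the expensive query.

```python
class PrecomputedCounts:
    """Keeps expensive per-user values fresh using a 'dirty' flag."""

    def __init__(self, compute):
        self._compute = compute  # expensive function: user_id -> value
        self._values = {}        # last computed value per user
        self._dirty = set()      # users whose data changed since then

    def mark_dirty(self, user_id):
        """Call this wherever the user updates the underlying data."""
        self._dirty.add(user_id)

    def refresh(self):
        """Recompute flagged users only; return the ids that were refreshed."""
        refreshed = list(self._dirty)
        for user_id in refreshed:
            self._values[user_id] = self._compute(user_id)
        self._dirty.clear()
        return refreshed

    def get(self, user_id, default=None):
        """Cheap read for the dashboard view; never computes anything."""
        return self._values.get(user_id, default)
```

The periodic job would call refresh(), while views call only mark_dirty() and get(), so the dashboard request never pays for the calculation.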
You need to have a look at Django’s cache framework.
If the data that is slow to compute can be denormalised and stored when data is added, rather than when it is viewed, then you may be interested in django-denorm.