Signaling a long-running Huey task to stop - python

I'm using Huey to perform some processing operations on objects I have in my Django application.
@db_task()
def process_album(album: Album) -> None:
    images = album.objects.find_non_processed_images()
    for image in images:
        if album.should_pause():
            return
        process_image(album, image)
This is a simplified example of the situation I want to solve. I have an Album model that stores a variable number of Images that need to be processed. The processing operation is defined in a separate function wrapped with the @task decorator so it can run concurrently (when the number of workers > 1).
The question is how to implement album.should_pause() correctly. The current implementation looks like this:
def should_pause(self):
    self.refresh_from_db()
    return self.processing_state != AlbumProcessingState.RUNNING
Therefore, on each iteration the database is queried to refresh the model and check that the state field hasn't changed to something other than AlbumProcessingState.RUNNING, which would indicate that the album's processing task should stop.
Although it works, it feels wrong to refresh the model from the database on every iteration, but that feeling might be misplaced. What do you think?
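If the per-iteration query ever becomes a measurable cost, a lighter variant is to fetch only the state column instead of reloading every field on the model. This is just a sketch reusing the question's field and enum names:

def should_pause(self):
    # fetch just the one column rather than refreshing the whole row
    state = Album.objects.values_list("processing_state", flat=True).get(pk=self.pk)
    return state != AlbumProcessingState.RUNNING

You could also throttle the check, e.g. only call should_pause() every N images, trading pause latency for fewer queries.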

Related

nidaqmx: prevent task from closing after being altered in function

I am trying to write an API that takes advantage of the python wrapper for NI-DAQmx, and need to have a global list of tasks that can be edited across the module.
Here is what I have tried so far:
1) Created an importable dictionary of tasks which is updated whenever a call is made to ni-daqmx. The function endpoint processes data from an HTTPS request, I promise it's not just a pointless wrapper around the ni-daqmx library itself.
e.g., on startup, the following is created:
# ./daq/__init__.py
import nidaqmx
# ... other stuff ...
TASKS = {}
Then the user can create a task by calling this endpoint:
# ./daq/task/task.py
from daq import TASKS
# ...
def api_create_task_endpoint(task_id):
    try:
        task = nidaqmx.Task(new_task_name=task_id)
        TASKS[task_id] = task
    except Exception:
        pass  # handle it
Everything up to here works as it should. I can get the task list, and the task stays open. I also tried explicitly calling task.control(nidaqmx.constants.TaskMode.TASK_RESERVE), but the following code gives me the same issue no matter what.
When I try to add channels to the task, it closes at the end of the function call no matter how I set the state.
# ./daq/task/channels.py
from daq import TASKS

def api_add_channel_task_endpoint(task_id, channel_type, function):
    # channel_type corresponds to ni-daqmx channel modules (e.g. ai_channels)
    # function corresponds to callable functions (e.g. add_ai_voltage_chan)
    # do some preliminary checks (e.g. task exists, channel type valid)
    channels = get_chans_from_json_post()
    with TASKS[task_id] as task:
        getattr(getattr(task, channel_type), function)(channels)
        # e.g. task.ai_channels.add_ai_voltage_chan("Dev1/ai0")
This is apparently closing the task. When I call api_create_task_endpoint(task_id) again, I receive a DaqResourceWarning that the task has been closed and no longer exists.
I similarly tried setting the TaskMode using task.control here, to no avail.
I would like to be able to make edits to the task by storing it in the module-wide TASKS dict, but cannot keep the Task open long enough to do so.
2) I also tried implementing this using the NI-MAX save feature. The issue with this is that tasks cannot be saved unless they already contain channels, which I don't necessarily want to do immediately after creating the task.
I attempted to work around this by adding some default behavior to api_create_task_endpoint() that adds a placeholder channel, to be removed when the user adds their first real channel.
The problem is, I can't find any documented way to remove channels from a task after adding them without a GUI (this is running on CentOS, so a GUI is a non-starter).
Thank you so much for any help!
I haven't used the Python bindings for NI-DAQmx, but
with TASKS[task_id] as task:
looks like it would stop and clear the task immediately after updating it because the program flow leaves the with block and Task.__exit__() executes.
Because you expect these tasks to live while the Python module is in use, my recommendation is to only use task.control() when you need to change a task's state.
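Concretely, the channel endpoint can look up the stored task with a plain dict access instead of the context manager, so nothing closes it when the function returns. A sketch reusing the question's names:

# ./daq/task/channels.py -- same endpoint, minus the `with` block
from daq import TASKS

def api_add_channel_task_endpoint(task_id, channel_type, function):
    channels = get_chans_from_json_post()
    task = TASKS[task_id]  # plain lookup; Task.__exit__() never runs
    getattr(getattr(task, channel_type), function)(channels)
    # e.g. task.ai_channels.add_ai_voltage_chan("Dev1/ai0")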

Keep initialized variables after executing Celery task

I have some code like this:
model = None

@app.task()
def compute_something():
    global model
    if model is None:
        model = initialise_model()  # lengthy set-up, run once per process
    # Use model to perform computation
So I want the set-up code (the lengthy model initialization) to be executed only once when necessary, and subsequent calls to the same task to reuse the initialized variable.
I know this partly breaks the concept of tasks being strictly atomic. But by default it does not seem to work, because (I assume) multiprocessing forks a separate process for each task, losing the initialization.
Is there a way to achieve something like this?
RELATED QUESTION:
Another way to look at this: is there a way for a worker to look into the task queue and group tasks to perform them together more efficiently?
Let's say a worker would be much more efficient processing a group of tasks at the same time than doing them one after the other (a GPU job, for instance, or loading a big parameter file into memory).
I was wondering if there was a way for the worker to gather several instances of the same task from the queue and process them as a batch instead of one by one.
You may be able to initialise the model at module level:

model = initialise_model()  # runs once, when the worker imports the module

@app.task()
def compute_something():
    # Use model to perform computation
    ...

However, this assumes that the model initialisation does not depend on anything passed to the task, and that the model is not modified in any way by the task.
The second question doesn't make much sense. Can you give some context? What do you mean by "group tasks" exactly? Do you want certain tasks to run on a particular host to optimise access to a shared resource? If so, read the Celery documentation about routing; a sketch follows.
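If a dedicated host per resource is what you're after, here is a minimal sketch of Celery 3.x-era routing configuration (the queue and task names are illustrative):

# settings.py -- pin the heavy task to its own queue (illustrative names)
CELERY_ROUTES = {
    "myapp.tasks.compute_something": {"queue": "gpu"},
}
# then run a dedicated worker for that queue, e.g.:
#   celery worker -A myapp -Q gpu --concurrency=1

With one single-process worker consuming the queue, the module-level model is loaded exactly once on that host.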

Turn Off Celery Tasks

I'm trying to find a way to be able to turn celery tasks on/off from the django admin. This is mostly to disable tasks that call external services when those services are down or have a scheduled maintenance period.
For my periodic tasks this is easy, especially with django-celery. But for tasks that are called on demand I'm having some trouble. Currently I'm exploring storing an on/off status for various tasks in a TaskControl model and then checking that status at the beginning of task execution, returning None if the status is False. This makes me feel dirty due to all the extra db lookups every time a task kicks off. I could use a cache backend that isn't the db, but it seems overkill to add caching just for this handful of key/value pairs.
in models.py
# this is a singleton model. singleton code bits omitted for brevity.
class TaskControl(models.Model):
    some_status = models.BooleanField(default=True)
    # more statuses
in tasks.py
@celery.task(ignore_result=True)
def some_task():
    task_control = TaskControl.objects.get(pk=1)
    if not task_control.some_status:
        return None
    # otherwise execute task as normal
What is a better way to do this?
Option 1. Try your simple approach. See if it affects performance. If not, lose the “dirty” feeling.
Option 2. Cache in process memory with a singleton. Add freshness information to your TaskControl model:
class TaskControl(models.Model):
    some_status = models.BooleanField(default=True)
    # more statuses
    expires = models.DateTimeField()
    check_interval = models.IntegerField(default=5 * 60)

    def is_stale(self):
        return (
            (datetime.utcnow() >= self.expires) or
            ((datetime.utcnow() - self.retrieved).total_seconds() >= self.check_interval))
Then in a task_ctl.py:
_control = None

def is_enabled():
    global _control
    if (_control is None) or _control.is_stale():
        _control = TaskControl.objects.get(pk=1)
        # There's probably a better way to set `retrieved`,
        # maybe with a signal or a `Model` override,
        # but this should work.
        _control.retrieved = datetime.utcnow()
    return _control.some_status
Option 3. Like option 2, but instead of time-based expiration, use Celery’s remote control to force all workers to reload the TaskControl (you’ll have to write your own control command, and I don’t know if all the internals you will need are public APIs).
Option 4, only applicable if all your Celery workers run on a single machine. Store the on/off flag as a file on that machine’s file system. Query its existence with os.path.exists (that should be a single stat() or something, cheap enough; see the sketch after these options). If the workers and the admin panel are on different machines, use a special Celery task to create/remove the file.
Option 5. Like option 4, but on the admin/web machine: if the file exists, just don’t queue the task in the first place.
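A minimal sketch of the file-flag check from options 4 and 5 (the path is illustrative):

import os

FLAG_PATH = "/var/run/myproject/some_task_disabled"  # illustrative path

def some_status_enabled():
    # one stat() call; far cheaper than a database round trip
    return not os.path.exists(FLAG_PATH)

The admin side creates or deletes that file (directly, or via a small Celery task if the workers live elsewhere), and some_task() checks some_status_enabled() instead of hitting the database.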

Python synchronise between threads and processes

A bit of background:
I am writing a function in Django to get the next invoice number, which needs to be sequential (no gaps), so the function looks like this:
def get_next_invoice_number():
    """
    Returns the max(invoice_number) + 1 from the payment records.
    Does NOT pre-allocate the number.
    """
    # TODO ensure this is thread safe
    max_num = Payment.objects.aggregate(Max('invoice_number'))['invoice_number__max']
    if max_num is not None:
        return max_num + 1
    return PaymentConfig.min_invoice_number
Now the problem is, this function only returns max() + 1, and in my production environment I have multiple Django processes, so if this function is called for two different payments (before the first record is saved), they will get the same invoice number.
To mitigate the problem I can override the save() method to call get_next_invoice_number(), minimising the gap between these function calls, but there is still a tiny window for the race to happen.
So I want to implement a lock in the approve method, something like
from multiprocessing import Lock

lock = Lock()

class Payment(models.Model):
    def approve(self):
        lock.acquire()
        try:
            self.invoice_number = get_next_invoice_number()
            self.save()
        except:
            pass
        finally:
            lock.release()
So my questions are:
Does this look okay?
The lock is for multiprocessing; what about threads?
UPDATE:
As my colleague pointed out, this is not going to work when it's deployed to multiple servers; the locks will be meaningless.
Looks like DB transaction locking is the way to go.
The easiest way to do this, by far, is with your database's existing tools for creating sequences. In fact, if you don't mind the value starting from 1 you can just use Django's AutoField.
If your business requirements are such that you need to choose a starting number, you'll have to see how to do this in the database. Here are some questions that might help.
Trying to ensure this with locks or transactions will be harder to do and slower to perform.
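If you nonetheless go the transaction-locking route the update settles on, one common shape is a single counter row locked with select_for_update. This is only a sketch, and the InvoiceCounter model is an invented helper, not part of the question's code:

from django.db import models, transaction

class InvoiceCounter(models.Model):
    # single-row table acting as the sequence (invented helper model)
    last_number = models.IntegerField(default=0)

def get_next_invoice_number():
    with transaction.atomic():
        # the row lock serialises callers across processes and servers
        counter = InvoiceCounter.objects.select_for_update().get(pk=1)
        counter.last_number += 1
        counter.save()
        return counter.last_number

To keep the sequence gapless, create the Payment inside the same atomic block, so a rollback also rolls back the increment.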

Django design pattern for web analytics screens that take a really long time to calculate

I have an "analytics dashboard" screen, visible to my django web application's users, that takes a really long time to calculate. It's one of those screens that goes through every single transaction in the database for a user and gives them metrics on it.
I would love for this to be a realtime operation, but calculation times can be 20-30 seconds for an active user (no paging allowed; it's computing averages over transactions).
The solution that comes to mind is to calculate this in the backend via a manage.py batch command and then just display cached values to the user. Is there a Django design pattern to help facilitate these types of models/displays?
What you're looking for is a combination of offline processing and caching. By offline, I mean that the computation logic happens outside the request-response cycle. By caching, I mean that the result of your expensive calculation is sufficiently valid for X time, during which you do not need to recalculate it for display. This is a very common pattern.
Offline Processing
There are two widely-used approaches to work which needs to happen outside the request-response cycle:
Cron jobs (often made easier via a custom management command)
Celery
In relative terms, cron is simpler to set up, and Celery is more powerful/flexible. That being said, Celery enjoys fantastic documentation and a comprehensive test suite. I've used it in production on almost every project, and while it does involve some extra requirements, it's not really a bear to set up.
Cron
Cron jobs are the time-honored method. If all you need is to run some logic and store some result in the database, a cron job has zero dependencies. The only fiddly bit with cron jobs is getting your code to run in the context of your django project -- that is, your code must correctly load your settings.py in order to know about your database and apps. For the uninitiated, this can lead to some aggravation in divining the proper PYTHONPATH and such.
If you're going the cron route, a good approach is to write a custom management command. You'll have an easy time testing your command from the terminal (and writing tests), and you won't need to do any special hoopla at the top of your management command to set up a proper django environment. In production, you simply run path/to/manage.py yourcommand. I'm not sure if this approach works without the assistance of virtualenv, which you really ought to be using regardless.
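A skeleton of such a command (the file path and command name are illustrative; calculate_stuff stands in for the task defined later in this answer):

# myapp/management/commands/refresh_analytics.py  (illustrative path)
from django.contrib.auth.models import User
from django.core.management.base import BaseCommand

from mytasks import calculate_stuff  # the task defined later in this answer

class Command(BaseCommand):
    help = "Recalculate cached analytics for every user"

    def handle(self, *args, **options):
        # manage.py has already loaded settings.py, so the ORM just works
        for user_id in User.objects.values_list("id", flat=True):
            calculate_stuff(user_id)  # or .delay(user_id) to hand off to celery

You can then point cron at path/to/manage.py refresh_analytics.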
Another aspect to consider with cronjobs: if your logic takes a variable amount of time to run, cron is ignorant of the matter. A cute way to kill your server is to run a two-hour cronjob like this every hour. You can roll your own locking mechanism to prevent overlap, but be aware: what starts out as a short cronjob might not stay that way when your data grows, or when your RDBMS misbehaves, etc.
In your case, it sounds like cron is less applicable because you'd need to calculate the graphs for every user every so often, without regard to who is actually using the system. This is where celery can help.
Celery
…is the bee's knees. Usually people are scared off by the "default" requirement of an AMQP broker. It's not terribly onerous setting up RabbitMQ, but it does require stepping outside of the comfortable world of Python a bit. For many tasks, I just use redis as my task store for Celery. The settings are straightforward:
CELERY_RESULT_BACKEND = "redis"
REDIS_HOST = "localhost"
REDIS_PORT = 6379
REDIS_DB = 0
REDIS_CONNECT_RETRY = True
Voilà, no need for an AMQP broker.
Celery provides a wealth of advantages over simple cron jobs. Like cron, you can schedule periodic tasks, but you can also fire off tasks in response to other stimuli without holding up the request/response cycle.
If you don't want to compute the chart for every active user every so often, you will need to generate it on-demand. I'm assuming that querying for the latest available averages is cheap, computing new averages is expensive, and you're generating the actual charts client-side using something like flot. Here's an example flow:
User requests a page which contains an averages chart.
Check cache -- is there a stored, nonexpired queryset containing averages for this user?
If yes, use that.
If not, fire off a celery task to recalculate it, requery and cache the result. Since querying existing data is cheap, run the query if you want to show stale data to the user in the meantime.
If the chart is stale, optionally provide some indication that it is stale, or do some ajax fanciness to ping django every so often and ask if the refreshed chart is ready.
You could combine this with a periodic task to recalculate the chart every hour for users that have an active session, to prevent really stale charts from being displayed. This isn't the only way to skin the cat, but it provides you with all the control you need to ensure freshness while throttling CPU load of the calculation task. Best of all, the periodic task and the "on demand" task share the same logic—you define the task once and call it from both places for added DRYness.
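The periodic half of that could look like this in Celery 3.x-era settings; refresh_active_user_charts is a hypothetical wrapper task that loops over users with active sessions and calls the on-demand task for each:

# settings.py -- hourly refresh for active users (hypothetical task name)
from celery.schedules import crontab

CELERYBEAT_SCHEDULE = {
    "refresh-active-user-charts": {
        "task": "mytasks.refresh_active_user_charts",
        "schedule": crontab(minute=0),  # top of every hour
    },
}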
Caching
The Django cache framework provides you with all the hooks you need to cache whatever you want for as long as you want. Most production sites rely on memcached as their cache backend; I've lately started using redis with the django-redis-cache backend instead, but I'm not sure I'd trust it for a major production site yet.
Here's some code showing off usage of the low-level caching API to accomplish the workflow laid out above:
import pickle

from celery.task import task
from django.core.cache import cache
from django.shortcuts import render

@task
def calculate_stuff(user_id):
    # ... do your work to update the averages ...
    # now pull the latest series
    averages = TransactionAverage.objects.filter(user=user_id)  # ... other filters elided ...
    # cache the pickled result for ten minutes
    cache.set("averages_%s" % user_id, pickle.dumps(averages), 60 * 10)

def myview(request, user_id):
    ctx = {}
    cached = cache.get("averages_%s" % user_id, None)
    if cached:
        averages = pickle.loads(cached)  # use the cached queryset
    else:
        # fetch the latest available data for now, same as in the task
        averages = TransactionAverage.objects.filter(user=user_id)  # ... other filters elided ...
        # fire off the celery task to update the information in the background
        calculate_stuff.delay(user_id)  # doesn't happen in-process
        ctx['stale_chart'] = True  # display a warning, if you like
    ctx['averages'] = averages
    # ... do your other work ...
    return render(request, 'my_template.html', ctx)
Edit: worth noting that pickling a queryset loads the entire queryset into memory. If you're pulling up a lot of data with your averages queryset this could be suboptimal. Testing with real-world data would be wise in any case.
The simplest and, IMO, correct solution for such scenarios is to pre-calculate everything as the underlying data is updated, so that when the user views the dashboard you calculate nothing and just display already-calculated values.
There can be various ways to do that, but the general concept is to trigger a calculation function in the background whenever something the calculation depends on changes.
For triggering such calculations in the background I usually use celery. Suppose a user adds an item foo in view view_foo; we call a celery task update_foo_count which runs in the background and updates the foo count. Alternatively, you can have a celery timer which updates the count, say, every 10 minutes, by checking whether recalculation needs to be done; a recalculate flag can be set in the various places where the user updates data. A sketch of the signal-driven trigger follows.
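Here is what that trigger might look like wired up with a post_save signal (the model, app, and task names are illustrative):

# signals.py -- recalculate in the background whenever a Foo changes
from django.db.models.signals import post_save
from django.dispatch import receiver

from myapp.models import Foo               # illustrative model
from myapp.tasks import update_foo_count   # the celery task mentioned above

@receiver(post_save, sender=Foo)
def trigger_recalculation(sender, instance, **kwargs):
    # hand the work to a worker; the request/response cycle is not blocked
    update_foo_count.delay(instance.user_id)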
You need to have a look at Django’s cache framework.
If the data that is slow to compute can be denormalised and stored when data is added, rather than when it is viewed, then you may be interested in django-denorm.
