Celery PeriodicTask per user - python

I'm working on a project whose main feature is periodically running one type of async task for each user. Every user will be able to configure the task (running daily, weekly, etc. at a specified time), and the task will use some data stored by the user. Now I'm wondering which approach is better: allow users to create their own PeriodicTask (through a restricted endpoint, of course), or create a single PeriodicTask (for example, running every 5 minutes) which iterates over all users and determines whether the task should be queued for each of them? I think I will use AMQP as the broker.

The periodic task scheduler in Celery is not designed to handle thousands of scheduled tasks, so from a performance perspective the much better solution is to have one task that runs at the smallest interval you need (e.g. if you allow users to schedule daily, weekly, or monthly, running the task daily is enough).
Such an approach is also more stable: every time the schedule changes, all of the schedule records are reloaded, which gets expensive when there are thousands of them.
It is also more secure, because you do not expose or reuse any internal mechanisms for task execution.
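A minimal sketch of that single dispatcher task, assuming a hypothetical UserTaskConfig model with an is_due() helper (neither is part of the question):
from celery import shared_task
from django.utils import timezone

@shared_task
def dispatch_user_tasks():
    # Scheduled by beat at the smallest interval you support (e.g. every 5 minutes).
    now = timezone.now()
    for config in UserTaskConfig.objects.filter(enabled=True):
        if config.is_due(now):  # hypothetical "is it time yet?" check
            run_user_task.delay(config.user_id)

@shared_task
def run_user_task(user_id):
    # The actual per-user work, using whatever data the user has stored.
    ...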

Related

Django Celery. How to make tasks execute at the appropriate different times given in variables

I'm trying to make a task that will run at a certain time. Example:
A customer borrowed a book on 01/01/2023 15:00, so the task should run exactly one week from then and charge a fee if he doesn't return it.
How do I make it run at these different, arbitrary times?
I'm trying to use Django Celery with RabbitMQ, but I'm only managing to run the task on a fixed schedule, e.g. every 60 minutes, not at different times.
You can check Celery beat for your case https://docs.celeryq.dev/en/stable/userguide/periodic-tasks.html
In general, what you can do is create:
a task that runs every 60 minutes and checks your DB instances for any notifications coming up in the next hour;
if there are any, trigger a celery task to notify the user at the right moment.
Check out periodic tasks with celery: https://docs.celeryq.dev/en/stable/userguide/periodic-tasks.html
Just create a task that is executed every minute (or every five minutes), checks which objects are due for a charge, and applies it to them.
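A rough sketch of the periodic-check-then-dispatch approach both answers describe, assuming a Loan model with due_date and fee_charged fields (names invented for illustration):
from celery import shared_task
from django.utils import timezone

@shared_task
def charge_overdue_loans():
    # Beat runs this every minute or every five minutes.
    overdue = Loan.objects.filter(due_date__lte=timezone.now(), fee_charged=False)
    for loan in overdue:
        charge_fee.delay(loan.id)

@shared_task
def charge_fee(loan_id):
    # Apply the late fee and mark the loan so it is not charged twice.
    ...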

Execute code at some point in time

I am creating a web app in Python Flask. What I want is that when I create an invoice, an e-mail is automatically sent on its renewal date to notify that the invoice has been renewed, and so on. Please take note that renewal occurs every 3 months. The code is something like:
def some_method():  # An API endpoint that adds an invoice from a request
    # Code that inserts the invoice details into the database
    # Let's assume that the start of the renewal_date is December 15, 2016
How will the automatic email sending be achieved without putting too much stress on the backend? I'm guessing that if there are 300 invoices, the server might get stressed out.
"Write it yourself." No, really.
Keep a database table of invoices to be sent. Every invoice has a status (values such as pending, sent, paid, ...) and an invoice date (which may be in the future). Use cron or similar to periodically run (e.g. 1/hour, 1/day) a program that queries the table for pending invoices for which the invoice date/time has arrived/passed, yet no invoice has yet been sent. This invoicing process sends the invoice, updates the status, and finishes. This invoicer utility will not be integral to your Flask web app, but live beside it as a support program.
Why? It's a simple, direct approach. It keeps the invoicing code in Python, close to your chosen app language and database. It doesn't require much excursion into external systems or middleware. It's straightforward to debug and monitor, using the same database, queries, and skills as writing the app itself. Simple, direct, reliable, done. What's not to love?
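A minimal sketch of such an invoicer utility, assuming an invoices table with status and due_at (ISO-formatted) columns and a placeholder send_invoice helper; all names here are illustrative:
import sqlite3
from datetime import datetime

def send_invoice(invoice_id, customer_email):
    # Placeholder: build and send the actual e-mail here.
    ...

def run_invoicer(db_path="app.db"):
    # Run from cron (e.g. hourly or daily); sends anything whose time has come.
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT id, customer_email FROM invoices "
        "WHERE status = 'pending' AND due_at <= ?",
        (datetime.utcnow().isoformat(),),
    ).fetchall()
    for invoice_id, customer_email in rows:
        send_invoice(invoice_id, customer_email)
        conn.execute("UPDATE invoices SET status = 'sent' WHERE id = ?", (invoice_id,))
    conn.commit()
    conn.close()

if __name__ == "__main__":
    run_invoicer()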
Now, I fully understand that a "write it yourself" recommendation runs contrary to typical "buy not build" doctrine. But I have tried all the main alternatives such as cron and Celery; my experience with a revenue-producing web app says they're not the way to go for a few hundred long-term invoicing events.
The TL;DR--Why Not Cron and Celery?
cron and its latter-day equivalents (e.g. launchd or Heroku Scheduler) run recurring events. Every 10 minutes, every hour, once a day, every other Tuesday at 3:10am UTC. They generally don't solve the "run this once, at time and date X in the future" problem, but they are great for periodic work.
Now, that last statement isn't strictly true. It describes cron and some of its replacements. But even traditional Unix provides at as a side-car to cron, and some cron follow-ons (e.g. launchd, systemd) bundle recurring and future event scheduling together (along with other kitchen appliances and the proverbial sink). Even so, there are some issues:
You're relying on external scheduling systems. That means another interface to learn, monitor, and debug if something goes wrong. There's significant impedance mismatch between those system-level schedulers and your Python app. Even if you farm out "run event at time X," you still need to write the Python code to send the invoice and properly move it along your business workflow.
Those systems, beautiful for a handful of events, generally lack interfaces that make reviewing, modifying, monitoring, or debugging hundreds of outstanding events straightforward. Debugging production app errors amidst an ocean of scheduled events is harrowing. You're talking about committing 300+ pending events to this external system. You must also consider how you'll monitor and debug that use.
Those schedulers are designed for "regular" not "high value" or "highly available" operations. As just one gotcha, what if an event is scheduled, but then you take downtime (planned or unplanned)? If the event time passes before the system is back up, what then? Most of the cron-like schedulers lack provisions for robustly handling "missed" events or "making them up at the earliest opportunity." That can be, in technical terms, "a bummer, man." Say the event triggered money collection--or in your case, invoice issuance. You have hundreds of invoices, and issuing those invoices is presumably business-critical. The capability gaps between system-level schedulers and your operational needs can be genuinely painful, especially as you scale.
Okay, what about driving those events into an external event scheduler like Celery? This is a much better idea. Celery is designed to run a large number of app events. It supports various backend engines (e.g. RabbitMQ) proven in practice to handle thousands upon untold thousands of events, and it has user interface options to help deal with event multitudes. So far so good! But:
You will find yourself dealing with the complexities of installing, configuring, and operating external middleware (e.g. RabbitMQ). The effort yields very high achievable scale, but the startup and operational costs are real. This is true even if you farm much of it to a cloud service like Heroku.
More important, while great as a job dispatcher for near-term events, Celery is not ideal as a long-wait-time scheduler. In production I've seen serious issues with "long throw" events (those posted a month, or in your case three months, in the future). While the problems aren't identical, just like cron etc., Celery long-throw events intersect ungracefully with normal app update and restart cycles. This is environment-dependent, but happens on popular cloud services like Heroku.
The Celery issues are not entirely unsolvable or fatal, but long-delay events don't enjoy the same "Wow! Celery made everything work so much better!" magic that you get for swarms of near-term events. And you must become a bit of a Celery, RabbitMQ, etc. engineer and caretaker. That's a high price and a lot of work for just scheduling a few hundred invoices.
In summary: While future invoice scheduling may seem like something to farm out, in practice it will be easier, faster, and more immediately robust to keep that function in your primary app code (not directly in your Flask web app, but as an associated utility), and just farm out the "remind me to run this N times a day" low-level tickler to a system-level job scheduler.
You can use crontab in Linux. The syntax looks like this:
crontab -e
1 2 3 4 5 /path/to/command arg1 arg2
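# field order: 1 = minute, 2 = hour, 3 = day of month, 4 = month, 5 = day of week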
Or maybe you can have a look at Celery, which I think is a good tool for handling task queues; you may find something useful in celery.schedules.
EDIT
Schedule Tasks on Linux Using Crontab
HowTo: Add Jobs To cron Under Linux or UNIX?
How to Schedule Tasks on Linux: An Introduction to Crontab Files
If I understand correctly, you'll want to get the date when you generate the invoice, then add 3 months (90 days). You can use datetime.timedelta(days=90) in Python for this. Take a look at: Adding 5 days to a date in Python.
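For example, a quick illustration of the timedelta arithmetic:
from datetime import date, timedelta

invoice_date = date(2016, 12, 15)
renewal_date = invoice_date + timedelta(days=90)  # date(2017, 3, 15)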
From there, you could theoretically spawn a thread with threading.Timer() (as seen here: Python - Start a Function at Given Time), but I would recommend against using Python for this part because, as you mention, it would put undue stress on the server (not to mention that if the server goes down, you lose all your scheduling).
Option A (Schedule a task for each invoice):
What would be better is using the OS to schedule a task in the future. If your backend is Linux-based, Cron should work nicely. Take a look at this for ideas: How to setup cron to run a file just once at a specific time in future?. Personally, I like this answer, which suggests creating a file in /etc/cron.d for each task and having the script delete its own file when it has finished executing.
Option B (Check daily if reminders should be sent):
I know it's not what you asked, but I'll also suggest it might be cleaner to handle this as a daily task. You can schedule a daily cron job like this:
0 22 * * * /home/emailbot/bin/send_reminder_emails.py
So, in this example, at min 0, hour 22 (10pm) every day, every month, every day-of-the-week, check to see if we should send reminder emails.
In your send_reminder_emails.py script, you check a record (could be a JSON or YML file, or your database, or a custom format) for any reminders that need to be sent "today". If there's none, the script just exits, and if there is, you send out a reminder to each person on the list. Optionally, you can clean up the entries in the file as the reminders expire, or periodically.
Then all you have to do is add an entry to the reminder file every time an invoice is generated.
with open("reminder_list.txt", "a") as my_file:
my_file.write("Invoice# person#email.com 2016-12-22")
An added benefit of this method is that if your server is down for maintenance, you can keep the entries and send them tomorrow by checking whether the email date has passed: datetime.datetime.now() >= datetime.datetime(2016, 12, 22). If you do that, you'll also want to keep a true/false flag that indicates whether the email has already been sent (so that you don't spam customers).
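A rough sketch of what send_reminder_emails.py could look like with the plain-text record format above (the send_email helper is a placeholder):
import datetime

def send_email(invoice_id, address):
    # Placeholder: actually send the reminder e-mail here.
    ...

def main():
    today = datetime.date.today()
    remaining = []
    with open("reminder_list.txt") as f:
        entries = [line.split() for line in f if line.strip()]
    for invoice_id, address, date_str in entries:
        due = datetime.datetime.strptime(date_str, "%Y-%m-%d").date()
        if due <= today:
            # Also covers dates missed while the server was down.
            send_email(invoice_id, address)
        else:
            remaining.append(f"{invoice_id} {address} {date_str}\n")
    # Drop entries that have been handled.
    with open("reminder_list.txt", "w") as f:
        f.writelines(remaining)

if __name__ == "__main__":
    main()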

Get certain celery tasks to only run during certain hours

Using Celery, with (if it matters) a rabbitmq broker and django as the result backend.
I have many thousands of tasks, such that it will take weeks to complete them all. However, some of them I only want to be run between certain hours of the day. This is because they involve downloading files from a public server that has explicitly requested this behaviour. Each task takes several minutes to complete. How best to go about this?
You can use crontab schedules to set up what you need.
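For example, with a recent Celery, a beat entry using a crontab schedule will only dispatch the download task during the allowed hours (the app name, task path, and hours below are assumptions):
from celery import Celery
from celery.schedules import crontab

app = Celery("myapp", broker="amqp://")

app.conf.beat_schedule = {
    "download-overnight": {
        "task": "myapp.tasks.download_file",
        # dispatch only between 01:00 and 06:59, every 10 minutes
        "schedule": crontab(hour="1-6", minute="*/10"),
    },
}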

Dynamic creation of PeriodicTasks

I have found this solution for adding periodic task schedules dynamically with django-celery.
My use case is mailings, which are added individually for users of a web site. Each mailing has a PeriodicTask associated with it, so there may potentially be a huge number of PeriodicTask records in the DB.
I'm interested: is that a valid (legal, proper, right) solution in this case, or is it better to have only one or a few PeriodicTasks which would check each mailing's last sent time and send it if necessary?
According to its creator, Ask Solem, in this thread:
There is no known limit to the number of periodic tasks, and the celerybeat scheduler should perform well even with a large number of schedule entries.
That Google group thread and this one are the most clarifying about the concern you have.
That said, I'd like to give you some advice: even though the celerybeat scheduler is able to handle huge numbers of periodic tasks, that comes at a cost: more database entries, more tasks to monitor, more RAM, possibly more complexity when debugging because you are creating tasks dynamically, and more hits to the database because you will have to check each mailing's sent datetime and then decide whether to send that email.
On the other hand, if you can have one periodic task that does a single query to retrieve just the mailing instances that have to be sent and then fires one subtask per email, it will be simpler in your code, when you have to debug it, and when you have to monitor it. Just my two cents.
Hope it helps.
Could you not have a single periodic task which runs every day, week or whatever, and inside that first calculate all the users who require mailings at that time? Once you know these, you could kick off a sub-task in Celery for each of them, so that they are all executed asynchronously and the main task completes very quickly, e.g.
@task
def send_periodic_emails():
    users_who_need_mail = get_users_who_need_mail()
    for user in users_who_need_mail:
        send_user_email.delay(user.id)

@task
def send_user_email(user_id):
    # Do the email sending here
    ...
I appreciate this doesn't answer the question as it's formed, but it should allow you to avoid finding out whether this limit exists or adding scheduled tasks programmatically!
A lot depends on the nature of your work. If you can group your users into classes for mailing purposes then it would seem natural to schedule mailing of the groups rather than mailing the individual users. If everyone is on a different schedule then by all means schedule each one individually. It's certainly legal and there's no compelling reason to avoid it if it's the natural solution to your problems.
You may want to run some tests to get an idea of the load you will generate, but your approach doesn't seem unreasonable.

Django design pattern for web analytics screens that take a really long time to calculate

I have an "analytics dashboard" screen that is visible to my django web applications users that takes a really long time to calculate. It's one of these screens that goes through every single transaction in the database for a user and gives them metrics on it.
I would love for this to be a realtime operation, but calculation times can be 20-30 seconds for an active user (no paging allowed, it's giving averages on transactions.)
The solution that comes to mind is to calculate this in the backend via a manage.py batch command and then just display cached values to the user. Is there a Django design pattern to help facilitate these types of models/displays?
What you're looking for is a combination of offline processing and caching. By offline, I mean that the computation logic happens outside the request-response cycle. By caching, I mean that the result of your expensive calculation is sufficiently valid for X time, during which you do not need to recalculate it for display. This is a very common pattern.
Offline Processing
There are two widely-used approaches to work which needs to happen outside the request-response cycle:
Cron jobs (often made easier via a custom management command)
Celery
In relative terms, cron is simpler to setup, and Celery is more powerful/flexible. That being said, Celery enjoys fantastic documentation and a comprehensive test suite. I've used it in production on almost every project, and while it does involve some requirements, it's not really a bear to setup.
Cron
Cron jobs are the time-honored method. If all you need is to run some logic and store some result in the database, a cron job has zero dependencies. The only fiddly bit with cron jobs is getting your code to run in the context of your django project -- that is, your code must correctly load your settings.py in order to know about your database and apps. For the uninitiated, this can lead to some aggravation in divining the proper PYTHONPATH and such.
If you're going the cron route, a good approach is to write a custom management command. You'll have an easy time testing your command from the terminal (and writing tests), and you won't need to do any special hoopla at the top of your management command to setup a proper django environment. In production, you simply run path/to/manage.py yourcommand. I'm not sure if this approach works without the assistance of virtualenv, which you really ought to be using regardless.
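A minimal custom management command along those lines (the app label, command name, and recalculate_all_users helper are invented for illustration):
# myapp/management/commands/update_dashboard_stats.py
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = "Recalculate dashboard statistics for all users."

    def handle(self, *args, **options):
        # settings.py is already loaded by manage.py, so the ORM is ready to use.
        from myapp.stats import recalculate_all_users  # hypothetical helper
        recalculate_all_users()
        self.stdout.write("Done.")
Cron can then call it with something like 0 3 * * * /path/to/virtualenv/bin/python /path/to/manage.py update_dashboard_stats.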
Another aspect to consider with cronjobs: if your logic takes a variable amount of time to run, cron is ignorant of the matter. A cute way to kill your server is to run a two-hour cronjob like this every hour. You can roll your own locking mechanism to prevent overlapping runs, but be aware that what starts out as a short cronjob might not stay that way when your data grows, or when your RDBMS misbehaves, etc.
In your case, it sounds like cron is less applicable because you'd need to calculate the graphs for every user every so often, without regards to who is actually using the system. This is where celery can help.
Celery
…is the bee's knees. Usually people are scared off by the "default" requirement of an AMQP broker. It's not terribly onerous setting up RabbitMQ, but it does require stepping outside of the comfortable world of Python a bit. For many tasks, I just use redis as my task store for Celery. The settings are straightforward:
CELERY_RESULT_BACKEND = "redis"
REDIS_HOST = "localhost"
REDIS_PORT = 6379
REDIS_DB = 0
REDIS_CONNECT_RETRY = True
Voilà, no need for an AMQP broker.
Celery provides a wealth of advantages over simple cron jobs. Like cron, you can schedule periodic tasks, but you can also fire off tasks in response to other stimuli without holding up the request/response cycle.
If you don't want to compute the chart for every active user every so often, you will need to generate it on-demand. I'm assuming that querying for the latest available averages is cheap, computing new averages is expensive, and you're generating the actual charts client-side using something like flot. Here's an example flow:
User requests a page which contains an averages chart.
Check cache -- is there a stored, nonexpired queryset containing averages for this user?
If yes, use that.
If not, fire off a celery task to recalculate it, requery and cache the result. Since querying existing data is cheap, run the query if you want to show stale data to the user in the meantime.
If the chart is stale, optionally provide some indication of that, or do some ajax fanciness to ping django every so often and ask if the refreshed chart is ready.
You could combine this with a periodic task to recalculate the chart every hour for users that have an active session, to prevent really stale charts from being displayed. This isn't the only way to skin the cat, but it provides you with all the control you need to ensure freshness while throttling CPU load of the calculation task. Best of all, the periodic task and the "on demand" task share the same logic—you define the task once and call it from both places for added DRYness.
Caching
The Django cache framework provides you with all the hooks you need to cache whatever you want for as long as you want. Most production sites rely on memcached as their cache backend; I've lately started using redis with the django-redis-cache backend instead, but I'm not sure I'd trust it for a major production site yet.
Here's some code showing off usage of the low-level caching API to accomplish the workflow laid out above:
import pickle

from django.core.cache import cache
from django.shortcuts import render
from celery.task import task


# mytasks.py
@task
def calculate_stuff(user_id):
    # ... do your work to update the averages ...
    # now pull the latest series
    averages = TransactionAverage.objects.filter(user=user_id, ...)
    # cache the pickled result for ten minutes
    cache.set("averages_%s" % user_id, pickle.dumps(averages), 60 * 10)


# views.py
from mytasks import calculate_stuff

def myview(request, user_id):
    ctx = {}
    cached = cache.get("averages_%s" % user_id, None)
    if cached:
        averages = pickle.loads(cached)  # use the cached queryset
    else:
        # fetch the latest available data for now, same as in the task
        averages = TransactionAverage.objects.filter(user=user_id, ...)
        # fire off the celery task to update the information in the background
        calculate_stuff.delay(user_id)  # doesn't happen in-process
        ctx['stale_chart'] = True  # display a warning, if you like
    ctx['averages'] = averages
    # ... do your other work ...
    return render(request, 'my_template.html', ctx)
Edit: worth noting that pickling a queryset loads the entire queryset into memory. If you're pulling up a lot of data with your averages queryset this could be suboptimal. Testing with real-world data would be wise in any case.
The simplest and, IMO, correct solution for such scenarios is to pre-calculate everything as things are updated, so that when the user views the dashboard you calculate nothing and just display the already-calculated values.
There are various ways to do that, but the general concept is to trigger a calculation function in the background whenever something the calculation depends on changes.
For triggering such a calculation in the background I usually use Celery. So, suppose a user adds an item foo in view view_foo; we call a celery task update_foo_count which runs in the background and updates the foo count. Alternatively, you can have a celery timer which updates the count, say, every 10 minutes by checking whether a re-calculation needs to be done; the recalculate flag can be set at the various places where the user updates data.
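A minimal sketch of the first variant, using the names mentioned above (view_foo, update_foo_count, and the Foo/DashboardStat models are all illustrative):
from celery import shared_task
from django.shortcuts import redirect

@shared_task
def update_foo_count(user_id):
    # Recalculate and store the pre-computed value for this user.
    count = Foo.objects.filter(owner_id=user_id).count()
    DashboardStat.objects.update_or_create(
        user_id=user_id, defaults={"foo_count": count}
    )

def view_foo(request):
    Foo.objects.create(owner=request.user, name=request.POST["name"])
    # Kick off the recalculation in the background; the response returns immediately.
    update_foo_count.delay(request.user.id)
    return redirect("dashboard")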
You need to have a look at Django’s cache framework.
If the data that is slow to compute can be denormalised and stored when data is added, rather than when it is viewed, then you may be interested in django-denorm.
