I am creating a web app in Python Flask. What I want is: when I create an invoice, an e-mail should automatically be sent on its renewal date to notify the customer that the invoice has been renewed, and so on. Please note that renewal occurs every 3 months. The code is something like:
    def some_method():  # an API endpoint that adds an invoice from a request
        # ...insert the invoice details into the database...
        # assume the start of the renewal_date is December 15, 2016
How can the automatic email execution be achieved without putting too much stress on the backend? I'm guessing that if there are 300 invoices, the server might be stressed out.
"Write it yourself." No, really.
Keep a database table of invoices to be sent. Every invoice has a status (values such as pending, sent, paid, ...) and an invoice date (which may be in the future). Use cron or similar to periodically run (e.g. 1/hour, 1/day) a program that queries the table for pending invoices for which the invoice date/time has arrived/passed, yet no invoice has yet been sent. This invoicing process sends the invoice, updates the status, and finishes. This invoicer utility will not be integral to your Flask web app, but live beside it as a support program.
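A minimal sketch of that invoicer utility (the table, column, and mail-helper names here are hypothetical; adapt them to your own schema and mailer):

    import sqlite3
    from datetime import datetime

    def run_invoicer(db_path="app.db"):
        conn = sqlite3.connect(db_path)
        now = datetime.utcnow().isoformat()
        rows = conn.execute(
            "SELECT id, email FROM invoices "
            "WHERE status = 'pending' AND invoice_date <= ?", (now,)).fetchall()
        for invoice_id, email in rows:
            send_invoice_email(invoice_id, email)  # hypothetical mail helper
            conn.execute("UPDATE invoices SET status = 'sent' WHERE id = ?",
                         (invoice_id,))
            conn.commit()  # commit per invoice so a mid-run crash can't re-send earlier ones
        conn.close()

    if __name__ == "__main__":
        run_invoicer()

Point cron (or similar) at this file once an hour and you're done.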
Why? It's a simple, direct approach. It keeps the invoicing code in Python, close to your chosen app language and database. It doesn't require much excursion into external systems or middleware. It's straightforward to debug and monitor, using the same database, queries, and skills as writing the app itself. Simple, direct, reliable, done. What's not to love?
Now, I fully understand that a "write it yourself" recommendation runs contrary to typical "buy not build" doctrine. But I have tried all the main alternatives such as cron and Celery; my experience with a revenue-producing web app says they're not the way to go for a few hundred long-term invoicing events.
The TL;DR--Why Not Cron and Celery?
cron and its latter-day equivalents (e.g. launchd or Heroku Scheduler) run recurring events. Every 10 minutes, every hour, once a day, every other Tuesday at 3:10am UTC. They generally don't solve the "run this once, at time and date X in the future" problem, but they are great for periodic work.
Now that last statement isn't strictly true. It describes cron and some of its replacements. But even traditional Unix provides at as a side-car to cron, and some cron follow-ons (e.g. launchd, systemd) bundle recurring and future event scheduling together (along with other kitchen appliances and the proverbial sink). Even so, there are some issues:
You're relying on external scheduling systems. That means another interface to learn, monitor, and debug if something goes wrong. There's significant impedance mismatch between those system-level schedulers and your Python app. Even if you farm out "run event at time X," you still need to write the Python code to send the invoice and properly move it along your business workflow.
Those systems, beautiful for a handful of events, generally lack interfaces that make reviewing, modifying, monitoring, or debugging hundreds of outstanding events straightforward. Debugging production app errors amidst an ocean of scheduled events is harrowing. You're talking about committing 300+ pending events to this external system. You must also consider how you'll monitor and debug that use.
Those schedulers are designed for "regular" not "high value" or "highly available" operations. As just one gotcha, what if an event is scheduled, but then you take downtime (planned or unplanned)? If the event time passes before the system is back up, what then? Most of the cron-like schedulers lack provisions for robustly handling "missed" events or "making them up at the earliest opportunity." That can be, in technical terms, "a bummer, man." Say the event triggered money collection--or in your case, invoice issuance. You have hundreds of invoices, and issuing those invoices is presumably business-critical. The capability gaps between system-level schedulers and your operational needs can be genuinely painful, especially as you scale.
Okay, what about driving those events into an external event scheduler like Celery? This is a much better idea. Celery is designed to run large numbers of app events. It supports various backend engines (e.g. RabbitMQ) proven in practice to handle thousands upon untold thousands of events, and it has user interface options to help deal with event multitudes. So far so good! But:
You will find yourself dealing with the complexities of installing, configuring, and operating external middleware (e.g. RabbitMQ). The effort yields very high achievable scale, but the startup and operational costs are real. This is true even if you farm much of it to a cloud service like Heroku.
More important, while great as a job dispatcher for near-term events, Celery is not ideal as a long-wait-time scheduler. In production I've seen serious issues with "long throw" events (those posted a month, or in your case three months, in the future). While the problems aren't identical to cron's, Celery's long-throw events likewise intersect ungracefully with normal app update and restart cycles. This is environment-dependent, but it happens on popular cloud services like Heroku.
The Celery issues are not entirely unsolvable or fatal, but long-delay events don't enjoy the same "Wow! Celery made everything work so much better!" magic that you get for swarms of near-term events. And you must become a bit of a Celery, RabbitMQ, etc. engineer and caretaker. That's a high price and a lot of work for just scheduling a few hundred invoices.
In summary: While future invoice scheduling may seem like something to farm out, in practice it will be easier, faster, and more immediately robust to keep that function in your primary app code (not directly in your Flask web app, but as an associated utility), and just farm out the "remind me to run this N times a day" low-level tickler to a system-level job scheduler.
You can use crontab on Linux. The syntax looks like this:

    crontab -e
    # minute  hour  day-of-month  month  day-of-week  command
    1 2 3 4 5 /path/to/command arg1 arg2
Or you can have a look at Celery, which I think is a good tool for handling task queues; you may find something useful in celery.schedules.
EDIT
Schedule Tasks on Linux Using Crontab
HowTo: Add Jobs To cron Under Linux or UNIX?
How to Schedule Tasks on Linux: An Introduction to Crontab Files
If I understand correctly, you'll want to get the date when you generate the invoice, then add 3 months (90 days). You can use datetime.timedelta(days=90) in Python for this. Take a look at: Adding 5 days to a date in Python.
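For instance, a quick sketch of computing the renewal date, treating "3 months" as 90 days as suggested above:

    from datetime import datetime, timedelta

    invoice_date = datetime(2016, 12, 15)             # when the invoice was generated
    renewal_date = invoice_date + timedelta(days=90)  # roughly 3 months later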
From there, you could theoretically spawn a thread with threading.Timer() (as seen here: Python - Start a Function at Given Time), but I would recommend against using Python for this part because, as you mention, it would put undue stress on the server (not to mention that if the server goes down, you lose all your scheduling).
Option A (Schedule a task for each invoice):
What would be better is using the OS to schedule a task in the future. If your backend is Linux-based, Cron should work nicely. Take a look at this for ideas: How to setup cron to run a file just once at a specific time in future?. Personally, I like this answer, which suggests creating a file in /etc/cron.d for each task and having the script delete its own file when it has finished executing.
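A rough sketch of that idea (paths and script names are hypothetical): on invoice creation, write a one-shot file into /etc/cron.d that fires at the renewal date, and have the invoked script delete the file once it has run (otherwise the entry would fire again a year later):

    def schedule_invoice_email(invoice_id, when):  # `when` is a datetime
        line = "%d %d %d %d * root /opt/app/send_invoice_email.py %s\n" % (
            when.minute, when.hour, when.day, when.month, invoice_id)
        with open("/etc/cron.d/invoice_%s" % invoice_id, "w") as f:
            f.write(line)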
Option B (Check daily if reminders should be sent):
I know it's not what you asked, but I'll also suggest it might be cleaner to handle this as a daily task. You can schedule a daily cron job like this:
0 22 * * * /home/emailbot/bin/send_reminder_emails.py
So, in this example, at min 0, hour 22 (10pm) every day, every month, every day-of-the-week, check to see if we should send reminder emails.
In your send_reminder_emails.py script, you check a record (could be a JSON or YAML file, your database, or a custom format) for any reminders that need to be sent "today". If there are none, the script just exits; if there are, you send out a reminder to each person on the list. Optionally, you can clean up the entries in the file as the reminders expire, or periodically.
Then all you have to do is add an entry to the reminder file every time an invoice is generated.
    with open("reminder_list.txt", "a") as my_file:
        my_file.write("Invoice# person@email.com 2016-12-22\n")
An added benefit of this method is that if your server is down for maintenance, you can keep the entries and send them tomorrow, by checking whether the email date has passed: datetime.datetime.now() >= datetime.datetime(2016, 12, 22). If you do that, you'll also want to keep a true/false flag that indicates whether the email has already been sent (so that you don't spam customers).
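Putting Option B together, a minimal sketch of send_reminder_emails.py, assuming one reminder per line in the format "invoice_id email date sent-flag" (extending the record with the sent flag mentioned above; send_reminder is a hypothetical mail helper):

    import datetime

    def main():
        today = datetime.date.today()
        kept = []
        for line in open("reminder_list.txt"):
            invoice_id, email, due, sent = line.split()
            due_date = datetime.datetime.strptime(due, "%Y-%m-%d").date()
            if sent == "no" and due_date <= today:  # due today, or missed during downtime
                send_reminder(invoice_id, email)
                sent = "yes"
            kept.append(" ".join([invoice_id, email, due, sent]) + "\n")
        with open("reminder_list.txt", "w") as f:   # rewrite the list with updated flags
            f.writelines(kept)

    if __name__ == "__main__":
        main()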
Related
I need your opinion on a challenge that I'm facing. I'm building a website that uses Django as a backend, PostgreSQL as my DB, GraphQL as my API layer, and React as my frontend framework. The website is hosted on Heroku. I wrote a Python script that logs in to my Gmail account and parses a few emails, based on pre-defined conditions, and stores the parsed data in a Google Sheet. Now, I want the script to be part of my website, in which the user will specify exactly what needs to be parsed (i.e. filters), and then display the parsed data in a table to review the accuracy of the parsing.
The part that I need some help with is how to architect such a workflow. Below are a few ideas that I managed to come up with after some googling:
Generate a GraphQL mutation that stores a 'task' into a task model. Once a new task entry is stored, a Django Signal will trigger the script. I'm not sure yet if a Signal can run custom Python functions, but from what I've read so far, it seems doable.
Use Celery to run this task asynchronously. But I'm not sure if asynchronous tasks are what I'm after here, as I need this task to run immediately after the user triggers the feature from the frontend; but I might be wrong. I'm also not sure if I need Redis to store the task details, or whether I can do that in PostgreSQL.
What is the best practice for implementing this feature? The task can be anything, not necessarily parsing emails; it could also be importing data from Excel. Any task that is user-generated, rather than a scheduled or repeated task.
I'm sorry in advance if this question seems trivial to some of you. I'm not a professional developer and the above project is a way for me to sharpen my technical skills and learn new techniques.
Looking forward to learning from your experiences.
You can dissect your problem into the following steps:
User specifies task parameters
System executes task
System displays result to the User
You can either do all of these:
Sequentially and synchronously in one swoop; or
Step by step asynchronously.
Synchronously
You can run your script when generating a response, but it will come with the following downsides:
The server process handling your request will block until the script is finished. This may or may not affect the processing of other requests by that same server (it depends on the number of simultaneous requests being processed, the workload of the script, etc.)
The client (e.g. your browser) and even the server might time out if the script takes too long. You can fix this to some extent by configuring your server appropriately.
The beauty of this approach, however, is its simplicity. To do this, you just pass the parameters through the request; the server parses them, runs the script, and returns the result.
No setting up of a message queue, task scheduler, or anything else is needed.
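A minimal sketch of the synchronous version (assuming a plain Django view rather than your GraphQL layer; parse_emails stands in for your existing script):

    from django.http import JsonResponse

    def run_task(request):
        filters = request.GET.get("filters", "")  # task parameters from the request
        result = parse_emails(filters)            # hypothetical: your existing script
        return JsonResponse({"result": result})   # the client waits until this returns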
Asynchronously
Ideally though, for long-running tasks, it is best to have this executed outside of the usual request-response loop for the following advantages:
The server responding to the requests can actually serve other requests.
Some scripts can take a while; with some, you don't even know if they're going to finish.
The script is no longer dependent on the reliability of the network (imagine running an expensive task, and then your internet connection skips or is just plain intermittent; you wouldn't be able to do anything).
The downside of this is now you have to set more things up, which increases the project's complexity and points of failure.
Producer-Consumer
Whatever you choose, it's usually best to follow the producer-consumer pattern:
Producer creates tasks and puts them in a queue
Consumer takes a task from the queue and executes it
The producer is basically you, the user. You specify the task and the parameters involved in that task.
This queue could be any datastore: an in-memory datastore like Redis; a messaging queue like RabbitMQ; or a relational database management system like PostgreSQL.
The consumer is your script executing these tasks. There are multiple ways of running the consumer/script: via Celery, as you mentioned, which runs multiple workers to execute the tasks passed through the queue; via a simple time-based job scheduler like crontab; or even by manually triggering the script.
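A generic sketch of the pattern with PostgreSQL as the queue, via a hypothetical Django Task model with params, status, result, and created fields:

    # producer: called from the view/mutation handling the user's request
    def create_task(params):
        Task.objects.create(params=params, status="pending")

    # consumer: a separate process (Celery worker, cron script, or manual run)
    def consume_tasks():
        for task in Task.objects.filter(status="pending").order_by("created"):
            task.result = run_script(task.params)  # hypothetical: your email parser
            task.status = "done"
            task.save()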
The question is actually not trivial, as the solution depends on what task you are actually trying to do. It is best to evaluate the constraints, parameters, and actual tasks to decide which approach you will choose.
But just to give you a more relevant guideline:
Just keep it simple. Unless you have a compelling reason to do otherwise (e.g. the server is being bogged down, or the internet connection is unreliable in practice), there's really no reason to be fancy.
The more blocking the task is, the longer it takes, or the more dependent it is on third-party APIs over the network, the more it makes sense to push it to a background process to add reliability and resiliency.
For your email-import script, I'll most likely push that to the background:
Have a page where you can add a task to the database
In the task details page, display the task details, and the result below if it exists or "Processing..." otherwise
Have a script that executes tasks (import emails from gmail given the task parameters) and save the results to the database
Schedule this script to run every few minutes via crontab
Yes, the above has side effects, like crontab running the script multiple times at the same time and such, but I won't go into detail without knowing more about the specifics of the task.
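That said, a simple guard against exactly that overlap is to claim each task atomically before working on it; a sketch with the same hypothetical Task model:

    def execute_pending_tasks():
        for task in Task.objects.filter(status="pending"):
            # the UPDATE succeeds for only one concurrent runner
            claimed = Task.objects.filter(pk=task.pk, status="pending") \
                                  .update(status="running")
            if not claimed:
                continue  # another cron run got there first
            task.result = import_emails(task.params)  # hypothetical task body
            task.status = "done"
            task.save()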
Suppose I have a model Event. I want to send a notification (email, push, whatever) to all invited users once the event has elapsed. Something along the lines of:
    class Event(models.Model):
        start = models.DateTimeField(...)
        end = models.DateTimeField(...)
        invited = models.ManyToManyField(User)

        def onEventElapsed(self):
            for user in self.invited.all():
                my_notification_backend.sendMessage(target=user, message="Event has elapsed")
Now, of course, the crucial part is to invoke onEventElapsed whenever timezone.now() >= event.end.
Keep in mind, end could be months away from the current date.
I have thought about two basic ways of doing this:
Use a periodic cron job (say, every five minutes or so) which checks if any events have elapsed within the last five minutes and executes my method.
Use celery and schedule onEventElapsed using the eta parameter to be run in the future (within the model's save method).
Considering option 1, a potential solution could be django-celery-beat. However, it seems a bit odd to run a task at a fixed interval just to send notifications. In addition, I came up with a (potential) issue that would (probably) result in a not-so-elegant solution:
Check every five minutes for events that have elapsed in the previous five minutes? That seems shaky; maybe some events are missed (or others get their notifications sent twice?). Potential workaround: add a boolean field to the model that is set to True once notifications have been sent.
Then again, option 2 also has its problems:
Manually take care of the situation when an event start/end datetime is moved. When using celery, one would have to store the taskID (easy, ofc) and revoke the task once the dates have changed and issue a new task. But I have read, that celery has (design-specific) problems when dealing with tasks that are run in the future: Open Issue on github. I realize how this happens and why it is everything but trivial to solve.
Now, I have come across some libraries which could potentially solve my problem:
celery_longterm_scheduler (but does this mean I cannot use Celery as I would have before, because of the different Scheduler class? This also ties into the possible usage of django-celery-beat... Using either of the two frameworks, is it still possible to queue jobs that are just a bit longer-running, but not months away?)
django-apscheduler, uses apscheduler. However, I was unable to find any information on how it would handle tasks that are run in the far future.
Is there a fundamental flaw with the way I am approaching this? I'm glad for any input you might have.
Notice: I know this is likely to be somewhat opinion-based; however, maybe there is a very basic thing that I have missed, regardless of what some might consider ugly or elegant.
We're doing something like this in the company I work for, and the solution is quite simple.
Have a cron / celery beat that runs every hour to check if any notification needs to be sent.
Then send those notifications and mark them as done. This way, even if your notification time is years ahead, it will still be sent. Using ETA is NOT the way to go for very long wait times; your cache / AMQP broker might lose the data.
You can reduce your interval depending on your needs, but do make sure the runs don't overlap.
If one hour is too big a time difference, then what you can do is run a scheduler every hour. The logic would be something like:
Run a task (let's call this the scheduler task) hourly, via celery beat, that gets all notifications that need to be sent in the next hour.
Schedule those notifications via apply_async(eta=...); this does the actual sending.
Using that methodology gets you the best of both worlds (eta and beat), as sketched below.
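A minimal sketch of that hybrid (the Notification model and its field names are hypothetical; schedule_next_hour is registered with celery beat to run hourly):

    from datetime import timedelta
    from celery import shared_task
    from django.utils import timezone

    @shared_task
    def schedule_next_hour():
        now = timezone.now()
        due = Notification.objects.filter(sent=False,
                                          send_at__gte=now,
                                          send_at__lt=now + timedelta(hours=1))
        for n in due:
            send_notification.apply_async(args=[n.pk], eta=n.send_at)

    @shared_task
    def send_notification(pk):
        n = Notification.objects.get(pk=pk)
        if not n.sent:  # guard against double sends
            deliver(n)  # hypothetical: actually send the message
            n.sent = True
            n.save()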
I'm currently making a program that would send random text messages at randomly generated times during the day. I first made my program in Python and then realized that if I would like other people to sign up to receive messages, I would have to use some sort of online framework. (If anyone knows a way to use my code in Python without having to change it, that would be amazing, but for now I have been trying to use web2py.) I looked into the scheduler, but it does not seem to do what I have in mind. If anyone knows if there is a way to pass a time value into a function and have it run at that time, that would be great. Thanks!
Check out the APScheduler module for cron-like scheduling of events in Python; their examples show how to schedule some Python code to run in a cron-ish way.
Still not sure about the random part, though...
As for a web framework that may appeal to you (seeing you are familiar with Python already), you should really look into Django (or, to keep things simple, just use WSGI).
Best.
I think you can actually use the Scheduler and Tasks features of web2py. I've never used them ;) but the documentation describes the creation of a task to which you can pass parameters from your code (so, something you need), and it should work fine for your needs:
scheduler.queue_task('mytask', start_time=myrandomtime)
So you need a web2py cron job, running every day and firing code similar to the above for each message to be sent (passing the parameters you need, possibly message content and phone number; see the examples in the web2py book). This would be a daily creation of tasks, which would then be processed by the scheduler.
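A sketch of that daily job, assuming web2py's scheduler is configured and a 'send_message' task function exists (messages_to_send() is a hypothetical source of pending messages):

    import random, datetime

    def queue_tomorrows_messages():
        tomorrow = datetime.date.today() + datetime.timedelta(days=1)
        for msg in messages_to_send():
            t = datetime.time(random.randint(8, 20), random.randint(0, 59))
            scheduler.queue_task('send_message',
                                 pvars={'phone': msg.phone, 'text': msg.text},
                                 start_time=datetime.datetime.combine(tomorrow, t))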
You can also have a simpler solution: one daily cron job which prepares the queue of messages with random times for the next day, and a second one which runs every ten minutes or so, checks what awaits processing, and sends the messages. So, no Tasks. This way is a bit ugly, though (consider a single processing run which takes more than 10 minutes). You may also want to keep and check statuses for the messages to be processed (like pending, ongoing, done) to prevent a situation in which two jobs are working on the same message, and to allow tracking the progress of the processing. Anyway, you could use the cron method in an early version of your software and later replace it with a better one :)
In any case, you should check expected number of messages to process and average processing time on your target platform - to make sure that the chosen method is quick enough for your needs.
This is an old question, but in case someone is interested, the answer is APScheduler's blocking scheduler, with jobs set to run at regular intervals with some jitter.
See: https://apscheduler.readthedocs.io/en/3.x/modules/triggers/interval.html
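A minimal sketch of that setup, assuming APScheduler 3.x (the jitter argument randomizes each run time by up to the given number of seconds):

    from apscheduler.schedulers.blocking import BlockingScheduler

    sched = BlockingScheduler()

    @sched.scheduled_job('interval', hours=1, jitter=600)
    def send_random_text():
        ...  # hypothetical: pick a recipient and send the message

    sched.start()  # blocks and keeps running the jobs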
I'm working on a fairly simple CGI with Python. I'm about to put it into Django, etc. The overall setup is pretty standard server side (i.e. computation is done on the server):
User uploads data files and clicks "Run" button
Server forks jobs in parallel behind the scenes, using lots of RAM and processor power. ~5-10 minutes later (average use case), the program terminates, having created a file of its output and some .png figure files.
Server displays web page with figures and some summary text
I don't think there are going to be hundreds or thousands of people using this at once; however, the computation going on takes a fair amount of RAM and processor power (each instance forks its most CPU-intensive task using Python's Pool).
I wondered if you know whether it would be worth the trouble to use a queueing system. I came across a Python module called beanstalkc, but on the page it said it was an "in-memory" queueing system.
What does "in-memory" mean in this context? I worry about memory, not just CPU time, and so I want to ensure that only one job runs (or is held in RAM, whether it receives CPU time or not) at a time.
Also, I was trying to decide whether
the result page (served by the CGI) should tell you its position in the queue (until it runs, and then displays the actual results page)
OR
the user should submit their email address to the CGI, which will email them the link to the results page when it is complete.
What do you think is the appropriate design methodology for a light traffic CGI for a problem of this sort? Advice is much appreciated.
Definitely use Celery. You can run an AMQP server, or I think you can use the database as a queue for the messages. It allows you to run tasks in the background, and it can use multiple worker machines to do the processing if you want. It can also do database-backed cron jobs if you use django-celery.
It's as simple as this to run a task in the background:
    @task
    def add(x, y):
        return x + y
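Calling it is just as simple; .delay() queues the call on a worker instead of running it inline:

    result = add.delay(4, 4)  # returns an AsyncResult immediately
    result.get()              # blocks until the worker finishes; returns 8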
In one project I have, it distributes the work over 4 machines, and it works great.
I want to write a long running process (linux daemon) that serves two purposes:
responds to REST web requests
executes jobs which can be scheduled
I originally had it working as a simple program that would run through the jobs and do the updates, which I then cron'd; but now I have the added REST requirement, and I would also like to change the frequency of some jobs but not others (let's say all jobs have different frequencies).
I have 0 experience writing long running processes, especially ones that do things on their own, rather than responding to requests.
My basic plan is to run the REST part in a separate thread/process, and figured I’d run the jobs part separately.
I'm wondering if there exist any patterns, specifically in Python (I've looked and haven't really found any examples of what I want to do), or if anyone has suggestions on where to begin with transitioning my project to meet these new requirements.
I’ve seen a few projects that touch on scheduling, but I’m really looking for real world user experience / suggestions here. What works / doesn’t work for you?
If the REST server and the scheduled jobs have nothing in common, do two separate implementations, the REST server and the jobs stuff, and run them as separate processes.
As mentioned previously, look into existing schedulers for the jobs stuff. I don't know if Twisted would be an alternative, but you might want to check out that platform.
If, OTOH, the REST interface invokes the same functionality as the scheduled jobs do, you should try to look at them as two interfaces to the same functionality, e.g. like this:
Write the actual jobs as programs the REST server can fork and run.
Have a separate scheduler that handles the timing of the jobs.
If a job is due to run, let the scheduler issue a corresponding REST request to the local server.
This way the scheduler only handles job descriptions and has no knowledge of its own about how they are implemented.
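The third point could be as small as this (the URL and payload are hypothetical):

    import requests

    def fire_job(job):  # called by the scheduler when a job is due
        requests.post("http://localhost:8000/jobs/run",
                      json={"name": job.name, "args": job.args})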
It's a common trait for long-running, high-availability processes to have an additional "supervisor" process that just checks that the necessary daemons are up and running, and restarts them as necessary.
One option is to simply choose a lightweight WSGI server from this list:
http://wsgi.org/wsgi/Servers
and let it do the work of a long-running process that serves requests. (I would recommend Spawning.) Your code can then concentrate on the REST API, handling requests through the well-defined WSGI interface, and scheduling jobs.
There are at least a couple of scheduling libraries you could use, but I don't know much about them:
http://sourceforge.net/projects/pycron/
http://code.google.com/p/scheduler-py/
Here's what we did.
Wrote a simple, pure-wsgi web application to respond to REST requests.
Start jobs
Report status of jobs
Extended the built-in wsgiref server to use the select module to check for incoming requests.
Activity on the socket is an ordinary REST request; we let wsgiref handle it. It will, eventually, call our WSGI applications to respond to status and submit requests.
A timeout means that we have to do two things:
Check all children that are running to see if they're done. Update their status, etc.
Check a crontab-like schedule to see if there's any scheduled work to do. This is a SQLite database that this server maintains.
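A minimal sketch of that loop (app, check_running_children, and run_due_scheduled_jobs are hypothetical stand-ins for our WSGI application and the two timeout duties):

    import select
    from wsgiref.simple_server import make_server

    httpd = make_server("", 8000, app)

    while True:
        # wait up to 5 seconds for an incoming connection
        readable, _, _ = select.select([httpd.socket], [], [], 5.0)
        if readable:
            httpd.handle_request()    # ordinary REST request
        else:
            check_running_children()  # update statuses of finished jobs
            run_due_scheduled_jobs()  # consult the SQLite crontab-like schedule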
I usually use cron for scheduling. As for REST you can use one of the many, many web frameworks out there. But just running SimpleHTTPServer should be enough.
You can schedule the REST service startup with cron's @reboot directive:
@reboot (cd /path/to/my/app && nohup python myserver.py &)
The usual design pattern for a scheduler would be:
Maintain a list of scheduled jobs, sorted by next-run-time (as Date-Time value);
When woken up, compare the first job in the list with the current time. If it's due or overdue, remove it from the list and run it. Continue working your way through the list this way until the first job is not due yet, then go to sleep for (next_job_due_date - current_time);
When a job finishes running, re-schedule it if appropriate;
After adding a job to the schedule, wake up the scheduler process.
Tweak as appropriate for your situation (e.g. sometimes you might want to re-schedule jobs to run again from the point at which they start running rather than when they finish).
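A minimal sketch of that pattern (single scheduler thread; job objects with a .due datetime, a .run() method, and a hypothetical .reschedule() that updates .due and returns True for recurring jobs; locking around the heap is elided for brevity):

    import heapq, itertools, threading
    from datetime import datetime

    jobs = []                   # heap of (due_time, seq, job), sorted by next-run-time
    seq = itertools.count()     # tie-breaker so equal due times never compare jobs
    wakeup = threading.Event()  # set by add_job() to wake the scheduler early

    def add_job(job):
        heapq.heappush(jobs, (job.due, next(seq), job))
        wakeup.set()

    def scheduler_loop():
        while True:
            now = datetime.now()
            while jobs and jobs[0][0] <= now:  # run everything due or overdue
                _, _, job = heapq.heappop(jobs)
                job.run()
                if job.reschedule():           # recurring job: put it back
                    add_job(job)
            timeout = (jobs[0][0] - now).total_seconds() if jobs else None
            wakeup.wait(timeout)               # sleep until due, or until a new job arrives
            wakeup.clear()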