2019: Dynamic cron jobs Google App Engine

2019: Dynamic cron jobs Google App Engine - python

I am developing a reporting service (i.e. Database reports via email) for a project on Google App Engine, naturally using the Google Cloud Platform.
I am using Python and Django but I feel that may be unimportant to my question specifically. I want to be able to allow users of my application schedule specific cron reports to send off at specified times of the day.
I know this is completely possible by running a cron on GAE on a minute-by-minute basis (using cron.yaml since I'm using Python) and providing the logic to determine which reports to run in whatever view I decide to make the cron hit, but this seems terribly inefficient to me, and seeing as the best answer I have found suggests doing the same thing (Adding dynamic cron jobs to GAE), I wanted an "updated" suggestion.
Is there at this point in time a better option than running a cron every minute and checking a DB full of client entries to determine which report to fire off?

You may want to have a look at the new Google Cloud Scheduler service (in beta at the moment), which is a fully managed cron job service. It allows you to create cron jobs programmatically via its REST API. So you could create a specific cron job per customer with the appropriate schedule to fit you needs.

Given this limit, my guess would be NO
Free applications can have up to 20 scheduled tasks. Paid applications can have up to 250 scheduled tasks.
https://cloud.google.com/appengine/docs/standard/python/config/cronref#limits
Another version of your minute-by-minute workaround would be a daily cron task that finds everyone that wants to be launched that day, and then use the _eta argument to pinpoint the precise moment in each day for each task to launch.

Related

How do I limit access to a cloud storage bucket to one process at a time?

I am fairly new to GCP.
I have some items in a cloud storage bucket.
I have written some python code to access this bucket and perform update operations.
I want to make sure that whenever the python code is triggered, it has exclusive access to the bucket so that I do not run into some sort of race condition.
For example, if I put the python code in a cloud function and trigger it, I want to make sure it completes before another trigger occurs. Is this automatically handled or do I have to do something to prevent this? If I have to add something like a semaphore, will subsequent triggers happen automatically after the the semaphore is released?

Google Cloud
Scheduler is a
fully managed cron jobs scheduling service available in GCP. It's
basically the cron jobs which trigger at a given time. All you need
to do is specify the frequency(The time when the job needs to be
triggered) and the target(HTTP, Pub/Sub, App Engine HTTP) and you can
specify the retry configuration like Max retry attempts, Max retry
duration etc..
App Engine has a built-in cron service that allows you to write a simple cron.yaml containing the time at which you want the job to run and which endpoint it should hit. App Engine will ensure that the cron is executed at the time which you have specified. Here’s a sample cron.yaml that hits the /tasks/summary endpoint in AppEngine deployment every 24 hours.
cron:
- description: "daily summary job"
url: /tasks/summary
schedule: every 24 hours

All of the info supplied has been helpful. The best answer has been to use a max-concurrent-dispatches setting of 1 so that only one task is dispatched at a time.

AWS: containarised serverless solutions

This is more of an architectural question. If I should be asking this question elsewhere, please let me know and I shall.
I have a use-case where I need to run the same python script (could be long-running) multiple times based on demand and pass on different parameters to each of them. The trigger for running the script is external and it should be able to pass on a STRING parameter to the script along with the resource configurations each script requires to run. We plan to start using AWS by next month and I was going through the various options I have. AWS Batch and Fargate both seem like feasible options given my requirements of not having to manage the infrastructure, dynamic spawning of jobs and management of jobs via python SDK.
The only problem is that both of these services are not available in India. I need to have my processing servers in India physically. What options do I have? Auto-scaling and Python SDK management (Creation and Deletion of tasks) are my main requirements (preferably containerized).

Why are you restricted to India? Often restrictions are due to data retention, in which case just store your data on Indian servers (S3, DynamoDB etc) and then you are able to run your 'program' in another AWS region

Execute a code at some point of time

I am creating a web app made of python flask.. What I want is if I created an invoice it automatically send an e-mail on its renewal date to notify that the invoice is renewed and so on.. Please take note that renewal occurs every 3 months The code is something like:
def some_method(): // An API Enpoint that adds an invoice from a request
// Syntax that inserts the details to the database
// Let's assume that the the start of the renewal_date is December 15, 2016
How will the automatic email execution be achieved? Without putting too much stress in the backend? Because I'm guessing that if there are 300 invoices then the server might be stressed out

"Write it yourself." No, really.
Keep a database table of invoices to be sent. Every invoice has a status (values such as pending, sent, paid, ...) and an invoice date (which may be in the future). Use cron or similar to periodically run (e.g. 1/hour, 1/day) a program that queries the table for pending invoices for which the invoice date/time has arrived/passed, yet no invoice has yet been sent. This invoicing process sends the invoice, updates the status, and finishes. This invoicer utility will not be integral to your Flask web app, but live beside it as a support program.
Why? It's simple and direct approach. It keeps the invoicing code in Python, close to your chosen app language and database. It doesn't require much excursion into external systems or middleware. It's straightforward to debug and monitor, using the same database, queries, and skills as writing the app itself. Simple, direct, reliable, done. What's not to love?
Now, I fully understand that a "write it yourself" recommendation runs contrary to typical "buy not build" doctrine. But I have tried all the main alternatives such as cron and Celery; my experience with a revenue-producing web app says they're not the way to go for a few hundred long-term invoicing events.
The TL;DR--Why Not Cron and Celery?
cron and its latter-day equivalents (e.g. launchd or Heroku Scheduler) run recurring events. Every 10 minutes, every hour, once a day, every other Tuesday at 3:10am UTC. They generally don't solve the "run this once, at time and date X in the future" problem, but they are great for periodic work.
Now that last statement isn't strictly true. It describes cron and some of its replacements. But even traditional Unix provides at as side-car to cron, and some cron follow-ons (e.g. launchd, systemd) bundle recurring and future event scheduling together (along with other kitchen appliances and the proverbial sink). Evens so, there are some issues:
You're relying on external scheduling systems. That means another interface to learn, monitor, and debug if something goes wrong. There's significant impedance mismatch between those system-level schedulers and your Python app. Even if you farm out "run event at time X," you still need to write the Python code to send the invoice and properly move it along your business workflow.
Those systems, beautiful for a handful of events, generally lack interfaces that make reviewing, modifying, monitoring, or debugging hundreds of outstanding events straightforward. Debugging production app errors amidst an ocean of scheduled events is harrowing. You're talking about committing 300+ pending events to this external system. You must also consider how you'll monitor and debug that use.
Those schedulers are designed for "regular" not "high value" or "highly available" operations. As just one gotcha, what if an event is scheduled, but then you take downtime (planned or unplanned)? If the event time passes before the system is back up, what then? Most of the cron-like schedulers lack provisions for robustly handling "missed" events or "making them up at the earliest opportunity." That can be, in technical terms, "a bummer, man." Say the event triggered money collection--or in your case, invoice issuance. You have hundreds of invoices, and issuing those invoices is presumably business-critical. The capability gaps between system-level schedulers and your operational needs can be genuinely painful, especially as you scale.
Okay, what about driving those events into an external event scheduler like Celery? This is a much better idea. Celery is designed to run large number of app events. It supports various backend engines (e.g. RabbitMQ) proven in practice to handle thousands upon untold thousands of events, and it has user interface options to help deal with event multitudes. So far so good! But:
You will find yourself dealing with the complexities of installing, configuring, and operating external middleware (e.g. RabbitMQ). The effort yields very high achievable scale, but the startup and operational costs are real. This is true even if you farm much of it to a cloud service like Heroku.
More important, while great as a job dispatcher for near-term events, Celery is not ideal as a long-wait-time scheduler. In production I've seen serious issues with "long throw" events (those posted a month, or in your case three months, in the future). While the problems aren't identical, just like cron etc., Celery long-throw events intersect ungracefully with normal app update and restart cycles. This is environment-dependent, but happens on popular cloud services like Heroku.
The Celery issues are not entirely unsolvable or fatal, but long-delay events don't enjoy the same "Wow! Celery made everything work so much better!" magic that you get for swarms of near-term events. And you must become a bit of a Celery, RabbitMQ, etc. engineer and caretaker. That's a high price and a lot of work for just scheduling a few hundred invoices.
In summary: While future invoice scheduling may seem like something to farm out, in practice it will be easier, faster, and more immediately robust to keep that function in your primary app code (not directly in your Flask web app, but as an associated utility), and just farm out the "remind me to run this N times a day" low-level tickler to a system-level job scheduler.

You can use crontab in Linux, The syntax look like this
crontab -e
1 2 3 4 5 /path/to/command arg1 arg2
Or maybe you can have a look at Celery, Which I think is a good tool to handle Task Queue, and you may find something useful here.celery.schedules
EDIT
Schedule Tasks on Linux Using Crontab
HowTo: Add Jobs To cron Under Linux or UNIX?
How to Schedule Tasks on Linux: An Introduction to Crontab Files

If I understand correctly, you'll want to get the date when you generate the invoice, then add 3 months (90 days). You can use datetime.timedelta(days=90) in Python for this. Take a look at: Adding 5 days to a date in Python.
From there, you could theoretically spawn a thread with Threading.timer() (as seen here: Python - Start a Function at Given Time), but I would recommend against using Python for this part because, as you mention, it would put undue stress on the server (not to mention if the server goes down, you lose all your scheduling).
Option A (Schedule a task for each invoice):
What would be better is using the OS to schedule a task in the future. If your backend is Linux-based, Cron should work nicely. Take a look at this for ideas: How to setup cron to run a file just once at a specific time in future?. Personally, I like this answer, which suggests creating a file in /etc/cron.d for each task and having the script delete its own file when it has finished executing.
Option B (Check daily if reminders should be sent):
I know it's not what you asked, but I'll also suggest it might be cleaner to handle this as a daily task. You can schedule a daily cron job like this:
0 22 * * * /home/emailbot/bin/send_reminder_emails.py
So, in this example, at min 0, hour 22 (10pm) every day, every month, every day-of-the-week, check to see if we should send reminder emails.
In your send_reminder_emails.py script, you check a record (could be a JSON or YML file, or your database, or a custom format) for any reminders that need to be sent "today". If there's none, the script just exits, and if there is, you send out a reminder to each person on the list. Optionally, you can clean up the entries in the file as the reminders expire, or periodically.
Then all you have to do is add an entry to the reminder file every time an invoice is generated.
with open("reminder_list.txt", "a") as my_file:
my_file.write("Invoice# person#email.com 2016-12-22")
An added benefit of this method is that if your server is down for maintenance, you can keep entries and send them tomorrow by checking if the email date has passed datetime.datetime.now() >= datetime(2016,12,22). If you do that, you'll also want to keep a true/false flag that indicates whether the email has already been sent (so that you don't spam customers).

GAE How to indicate that a queue will run for hours?

In the documentation indicate that a task :
Tasks targeted at an automatic scaled module must finish execution
within 10 minutes. If you have tasks that require more time or
computing resources, they can be sent to manual or basic scaling
modules, where they can run up to 24 hours.
The link surrounding manual or basic scaling modules talks about a target, but doesn't say more about how to have a task that runs for a day.
You guessed my question :) How do I tell GAE that this specific task will be run for a day, not a minute ?

You'll need to configure a module to use basic or manual scaling, deploy your task handling code to an instance for that module.
You can read more about configuring modules/versions/instances on the App Engine Modules page for Python

Django app with long running calculations

I'm creating a Django web app which features potentially very long running calculations of up to an hour. The calculations are simulation models built in Python. The web app sends inputs to the simulation model and after some time receives the answer. Also, the user should be able to close his browser after starting the simulation and if he logs in the next day the results should be there.
From my research it seems like I can use Celery together with Redis/RabbitMQ as broker to run the calculation in the background. Ideally I would want to display progress updates using ajax, so that the page updates without a user refresh when the calculation is complete.
I want to host the app on Heroku, so the calculation will also be running on the Heroku server. How hard will it be if I want to move the calculation engine to another server? It might be useful if the calculation engine is on a different server.
So my question is, is my this a good approach above or what other options can I look at?

I think Celery is a good approach. Not sure if you need Redis/RabbitMQ as a broker or you could just use MySQL - it depends on your tasks. Celery workers could be runned on the different servers, so Celery supports distributed queues.
Another approach - implement some queue engine with python, database as a broker and a cron for job executions. But it could be a dirty way with a lots of pain and bugs.
So I think that Celery is a more nice way to do it.

If you are running on Heroku, you want django-rq, not Celery. See https://devcenter.heroku.com/articles/python-rq.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.