I want to schedule a cron job on Google App Engine to view my 5 main pages every 10 minutes or so, to keep a current instance up and running and to increase page speed for users. I understand the basic syntax for creating a cron job, but I am curious what the Python for that would look like. Do I simply need to make 5 different cron jobs and have each one fetch a URL?
To answer your specific question, such a cron.yaml could look like this:
cron:
- description: five minute run
  url: /refresh
  schedule: every 5 minutes
where /refresh is a handler you've written in your app that is then called every N minutes automatically, e.g. myapplication.appspot.com/refresh.
There's no need to refresh a specific page or more than one. Just having the handler called will keep your app alive.
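For completeness, a minimal sketch of what that /refresh handler might look like on the old Python 2 runtime, assuming webapp2 (the handler name is made up):

import webapp2

class RefreshHandler(webapp2.RequestHandler):
    def get(self):
        # No real work is needed: merely handling the request
        # keeps an instance warm.
        self.response.write('ok')

app = webapp2.WSGIApplication([('/refresh', RefreshHandler)])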
But as others have noted, this is a bit much to keep an app permanently warm.
You don't have to resort to this. You can pay to have App Engine keep a certain number of frontends running constantly. They're referred to as "resident" instances.
https://developers.google.com/appengine/docs/adminconsole/instances
I don't know about AppEngine but, in generic Python, all you need is urllib.urlopen(). I'd probably just have a single script that pulls all 5 pages in order - I can't really think of a reason to make them separate.
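A rough sketch of such a script (the page URLs are placeholders; on Python 2 it would be urllib.urlopen() as mentioned above, on Python 3 urllib.request.urlopen()):

import urllib.request

PAGES = [
    'http://myapplication.appspot.com/',
    'http://myapplication.appspot.com/about',
    # ...your other main pages
]

for url in PAGES:
    try:
        # Fetch and discard the body; making the request is the point.
        urllib.request.urlopen(url, timeout=30).read()
    except IOError as e:
        print('failed to fetch %s: %s' % (url, e))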
https://cloud.google.com/appengine/docs/standard/python/config/appref#automatic_scaling_min_instances
This now seems like the proper way to solve the issue of keeping your single low-traffic auto-scaled instance warm. Basically, just add this to your app.yaml:
automatic_scaling:
  min_instances: 1
...Then add the warmup handler to your app (just so you're not throwing a 400 error every time GAE attempts to warm up your app):
https://cloud.google.com/appengine/docs/standard/python3/configuring-warmup-requests
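A rough sketch of such a warmup handler on the Python 3 runtime, assuming Flask (any WSGI framework works the same way); per the docs, app.yaml also needs warmup listed under inbound_services:

from flask import Flask

app = Flask(__name__)

# app.yaml also needs:
#   inbound_services:
#   - warmup

@app.route('/_ah/warmup')
def warmup():
    # Do any cache priming or connection setup here, so the first
    # real request doesn't pay the cold-start cost.
    return '', 200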
Don't waste your time with pinging; this has the exact same effect and cost.
A cron job can fetch only one URL. I see two ways:
1. Add a separate cron entry for every page.
2. Add one cron job whose handler enqueues a task for each page (see the sketch below).
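A rough sketch of option 2 on the Python 2 runtime, assuming webapp2 and the built-in task queue (the page paths are made up):

import webapp2
from google.appengine.api import taskqueue

PAGES = ['/page1', '/page2', '/page3', '/page4', '/page5']

class CronFanOutHandler(webapp2.RequestHandler):
    def get(self):
        # One cron entry hits this handler, which enqueues a task per page.
        for page in PAGES:
            taskqueue.add(url=page, method='GET')

app = webapp2.WSGIApplication([('/cron/fanout', CronFanOutHandler)])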
I am fairly new to GCP.
I have some items in a cloud storage bucket.
I have written some python code to access this bucket and perform update operations.
I want to make sure that whenever the python code is triggered, it has exclusive access to the bucket so that I do not run into some sort of race condition.
For example, if I put the Python code in a cloud function and trigger it, I want to make sure it completes before another trigger occurs. Is this automatically handled, or do I have to do something to prevent this? If I have to add something like a semaphore, will subsequent triggers happen automatically after the semaphore is released?
Google Cloud Scheduler is a fully managed cron job scheduling service available in GCP. It is essentially cron: jobs that trigger at a given time. All you need to do is specify the frequency (the time at which the job needs to be triggered) and the target (HTTP, Pub/Sub, App Engine HTTP), and you can also specify a retry configuration such as max retry attempts, max retry duration, etc.
App Engine has a built-in cron service that allows you to write a simple cron.yaml containing the time at which you want the job to run and which endpoint it should hit. App Engine will ensure that the cron is executed at the time which you have specified. Here’s a sample cron.yaml that hits the /tasks/summary endpoint in AppEngine deployment every 24 hours.
cron:
- description: "daily summary job"
  url: /tasks/summary
  schedule: every 24 hours
All of the info supplied has been helpful. The best answer has been to use a max-concurrent-dispatches setting of 1 so that only one task is dispatched at a time.
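For reference, assuming a Cloud Tasks queue, that setting can be applied with gcloud (the queue name is a placeholder):

gcloud tasks queues update my-queue --max-concurrent-dispatches=1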
How can I refresh my Python file every five minutes in Django while it is running? The data I'm web scraping changes every hour, and I need to update the value of the variable.
What you need is a task that runs periodically, and a cron job solves that. I recommend you take a look at django-cron or Celery; both are excellent options for creating scheduled tasks.
I recommend using a database like sqlite3; that is a better solution than restarting Django (the web service) every hour. You can store the data in the database, and Django can read it the way it would a variable.
The real problem here is that you're fetching the data at the start of the application and keeping it in memory. I see two better methods:
1. Move the data scraping code into your view function. This means you'll re-scrape on every call, ensuring you always have the freshest data, but at the cost of speed (the time it takes to make the request to your target URL).
2. Better yet: same as above, except you cache the results locally. The cache could also be kept in memory (although I'd use a file or database if you're running multiple Django app instances, to ensure they're all using the same data). With the least amount of change to what you have already, an in-memory cache can be achieved by simply adding a timestamp variable that records the current time on each fetch; if the last fetch was more than X minutes ago, refetch your data. A sketch follows below.
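A minimal sketch of that in-memory, timestamp-based cache (the scraping function name and the 5-minute window are assumptions; tune the TTL to whatever staleness you can tolerate):

import time

CACHE_TTL = 5 * 60  # seconds
_cache = {'data': None, 'fetched_at': 0.0}

def scrape_site():
    # Your existing scraping code goes here (hypothetical name).
    return {}

def get_data():
    now = time.time()
    if _cache['data'] is None or now - _cache['fetched_at'] > CACHE_TTL:
        # Last fetch is stale (or never happened): refetch and stamp the time.
        _cache['data'] = scrape_site()
        _cache['fetched_at'] = now
    return _cache['data']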
I am developing a reporting service (i.e. Database reports via email) for a project on Google App Engine, naturally using the Google Cloud Platform.
I am using Python and Django, but I feel that may be unimportant to my question specifically. I want to be able to allow users of my application to schedule specific cron reports to send off at specified times of the day.
I know this is completely possible by running a cron on GAE on a minute-by-minute basis (using cron.yaml since I'm using Python) and providing the logic to determine which reports to run in whatever view I decide to make the cron hit, but this seems terribly inefficient to me, and seeing as the best answer I have found suggests doing the same thing (Adding dynamic cron jobs to GAE), I wanted an "updated" suggestion.
Is there at this point in time a better option than running a cron every minute and checking a DB full of client entries to determine which report to fire off?
You may want to have a look at the new Google Cloud Scheduler service (in beta at the moment), which is a fully managed cron job service. It allows you to create cron jobs programmatically via its REST API, so you could create a specific cron job per customer with the appropriate schedule to fit your needs.
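A rough sketch of creating one such per-customer job with the google-cloud-scheduler Python client (the project, region, job name, and target URL are all placeholders; check the current client docs, since the service was in beta):

from google.cloud import scheduler_v1

client = scheduler_v1.CloudSchedulerClient()
parent = 'projects/my-project/locations/us-central1'

job = {
    'name': parent + '/jobs/customer-123-daily-report',
    'schedule': '0 8 * * *',  # unix-cron: every day at 08:00
    'time_zone': 'UTC',
    'http_target': {
        'uri': 'https://my-app.appspot.com/reports/run?customer=123',
        'http_method': scheduler_v1.HttpMethod.POST,
    },
}
client.create_job(parent=parent, job=job)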
Given this limit, my guess would be NO
Free applications can have up to 20 scheduled tasks. Paid applications can have up to 250 scheduled tasks.
https://cloud.google.com/appengine/docs/standard/python/config/cronref#limits
Another version of your minute-by-minute workaround would be a daily cron task that finds everyone who wants to be launched that day, and then uses the _eta argument to pinpoint the precise moment in the day at which each task should launch.
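A rough sketch of that daily fan-out using the deferred library's _eta argument (the query and worker function are hypothetical):

import datetime
from google.appengine.ext import deferred

def users_scheduled_for(day):
    # Hypothetical query: return users whose reports launch on `day`.
    return []

def send_report(user_id):
    # The actual per-user work; keep it idempotent since tasks can retry.
    raise NotImplementedError

def daily_cron():
    # Runs once a day; schedules each user's task at its exact launch time.
    for user in users_scheduled_for(datetime.date.today()):
        deferred.defer(send_report, user.key.id(), _eta=user.launch_datetime)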
I have been looking for a solution for my app that does not seem to be directly discussed anywhere. My goal is to publish an app and have it reach out, automatically, to a server I am working with. This just needs to be a simple POST. I have everything working fine, and am currently solving this problem with a cron job, but it is not quite sufficient: I would like the job to execute automatically once the app has been published, not after a minute (or whatever interval the cron may be set to).
In concept I am trying to have my app register itself with my server, and to do this I'd like for it to run once on publish and never run again.
Is there a solution to this problem? I have looked at Task Queues and am unsure if it is what I am looking for.
Any help will be greatly appreciated.
Thank you.
Personally, this makes more sense to me as a responsibility of your deploy process, rather than of the app itself. If you have your own deploy script, add the post request there (after a successful deploy). If you use google's command line tools, you could wrap that in a script. If you use a 3rd party tool for something like continuous integration, they probably have deploy hooks you could use for this purpose.
The main question will be how to ensure it only runs once for a particular version.
Here is an outline on how you might approach it.
You create a HasRun model, which you use to store the version of the deployed app; its presence indicates whether the one-time code has been run.
Then make sure you increment your version whenever you deploy new code.
In your warmup handler or appengine_config.py, grab the deployed version,
then in a transaction try to fetch the HasRun entity by key (the version number).
If you get the entity, don't run the one-time code.
If you cannot find it, create it and run the one-time code, either in a task (make sure the process is idempotent, as tasks can be retried) or in the warmup/front-facing request.
You will probably want to wrap all of that in a memcache CAS operation to provide a lock of some sort, to prevent some other instance from trying to do the same thing.
Alternatively, if you want to use the task queue, consider naming the task after the version number; a task with a particular name can only ever be submitted once.
It still needs to be idempotent (again, it could be scheduled to retry), but there will only ever be one task scheduled for that version - at least for a few weeks.
Or a combination/variation of all of the above.
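A minimal sketch of the transactional check described above, using ndb (a datastore transaction keyed on the version gives you the check-and-set; the names are made up):

import os
from google.appengine.ext import ndb

class HasRun(ndb.Model):
    ran_at = ndb.DateTimeProperty(auto_now_add=True)

@ndb.transactional
def should_run_once():
    # CURRENT_VERSION_ID identifies the deployed version on the Python 2 runtime.
    version = os.environ.get('CURRENT_VERSION_ID', 'unknown')
    key = ndb.Key(HasRun, version)
    if key.get() is not None:
        return False  # this version's one-time code has already run
    HasRun(key=key).put()
    return True  # caller now runs (or enqueues) the idempotent one-time code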
I'm developing software using the Google App Engine.
I have some considerations about the optimal design regarding the following issue: I need to create and save snapshots of some entities at regular intervals.
In the conventional relational db world, I would create db jobs which would insert new summary records.
For example, a job would insert a record for every active user that would contain his current score to the "userrank" table, say, every hour.
I'd like to know what's the best method to achieve this in Google App Engine. I know that there is the Cron service, but does it allow us to execute jobs which will insert/update thousands of records?
I think you'll find that snapshotting every user's state every hour isn't something that will scale well no matter what your framework. A more ordinary environment will disguise this by letting you have longer running tasks, but you'll still reach the point where it's not practical to take a snapshot of every user's data, every hour.
My suggestion would be this: add a 'last snapshot' field, and override the put() method of your model (assuming you're using Python; the same is possible in Java, but I don't know the syntax), such that whenever you update a record, it checks whether it's been more than an hour since the last snapshot, and if so, creates and writes a snapshot record.
In order to prevent concurrent updates creating two identical snapshots, you'll want to give the snapshots a key name derived from the time at which the snapshot was taken. That way, if two concurrent updates try to write a snapshot, one will harmlessly overwrite the other.
To get the snapshot for a given hour, simply query for the oldest snapshot newer than the requested period. As an added bonus, since inactive records aren't snapshotted, you're saving a lot of space, too.
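A rough sketch of that pattern with ndb (the model names and hour-based key are illustrative; it assumes the entity already has a key when put() is called):

import datetime
from google.appengine.ext import ndb

SNAPSHOT_INTERVAL = datetime.timedelta(hours=1)

class UserRankSnapshot(ndb.Model):
    score = ndb.IntegerProperty()
    taken_at = ndb.DateTimeProperty()

class UserScore(ndb.Model):
    score = ndb.IntegerProperty()
    last_snapshot = ndb.DateTimeProperty()

    def put(self, **kwargs):
        now = datetime.datetime.utcnow()
        if self.last_snapshot is None or now - self.last_snapshot >= SNAPSHOT_INTERVAL:
            # Key id derived from the user and the hour, so two concurrent
            # updates write the same snapshot and harmlessly overwrite each other.
            snap_id = '%s-%s' % (self.key.id(), now.strftime('%Y%m%d%H'))
            UserRankSnapshot(id=snap_id, score=self.score, taken_at=now).put()
            self.last_snapshot = now
        return super(UserScore, self).put(**kwargs)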
Have you considered using the remote api instead? This way you could get a shell to your datastore and avoid the timeouts. The Mapper class they demonstrate in that link is quite useful and I've used it successfully to do batch operations on ~1500 objects.
That said, cron should work fine too. You do have a limit on the time of each individual request so you can't just chew through them all at once, but you can use redirection to loop over as many users as you want, processing one user at a time. There should be an example of this in the docs somewhere if you need help with this approach.
I would use a combination of Cron jobs and a looping url fetch method detailed here: http://stage.vambenepe.com/archives/549. In this way you can catch your timeouts and begin another request.
To summarize the article: the cron job calls your initial process, you catch the timeout error and call the process again, masked as a second URL. You have to ping between two URLs to keep App Engine from thinking you are in an accidental loop. You also need to be careful that you do not loop infinitely: make sure there is an end state for your updating loop, since it would put you over your quotas pretty quickly if it never ended.
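A rough sketch of that two-URL loop, assuming webapp2 (the handler paths and the batch worker are hypothetical; the worker should do a bounded amount of work per request and return None when done, giving the loop its end state):

import webapp2

def process_one_batch(cursor):
    # Hypothetical: process a bounded chunk of users starting at `cursor`
    # and return the next cursor, or None once everyone has been processed.
    raise NotImplementedError

class UpdateHandler(webapp2.RequestHandler):
    def get(self):
        cursor = self.request.get('cursor') or None
        cursor = process_one_batch(cursor)
        if cursor is None:
            return  # end state reached: stop, or you'll loop forever
        # Alternate between the two URLs so App Engine doesn't flag
        # the redirect chain as an accidental loop.
        next_path = '/update_b' if self.request.path == '/update_a' else '/update_a'
        self.redirect('%s?cursor=%s' % (next_path, cursor))

app = webapp2.WSGIApplication([
    ('/update_a', UpdateHandler),
    ('/update_b', UpdateHandler),
])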