Execute very long-running tasks using Google Cloud - python

I have been using Google Cloud for a few weeks now and I am facing a big problem, given my limited GCP knowledge.
I have a Python project whose goal is to "scrape" data from a website using its API. My project runs a few tens of thousands of requests per execution, which can take very long (a few hours, maybe more).
I have 4 Python scripts in my project, all orchestrated by a bash script.
The execution is as follows:
The first script checks a CSV file containing all the instructions for the requests, executes the requests, and saves all the results in CSV files.
The second script checks the previously created CSV files and creates another CSV instruction file.
The first script runs again, but with the new instructions, and again saves the results in CSV files.
The second script checks again and does the same...
... and so on a few times.
The third script cleans the data, deletes duplicates, and creates a single CSV file.
The fourth script uploads the final CSV file to a storage bucket.
Now I want to get rid of that bash script, and I would like to automate the execution of those scripts approximately once a week.
The problem here is the execution time. Here is what I have already tested:
Google App Engine: the timeout of a request on GAE is limited to 10 minutes, and my functions can run for a few hours. GAE is not usable here.
Google Compute Engine: my scripts will run at most 10-15 hours a week; keeping a Compute Engine instance up all that time would be too pricey.
What could I do to automate the execution of my scripts in a cloud environment? What solutions might I have overlooked, without changing my code?
Thank you

A simple way to accomplish this without the need to get rid of the existing bash script that orchestrates everything would be:
Include the bash script in the instance's startup script.
At the end of the bash script, include a shutdown command.
Schedule the starting of the instance using Cloud Scheduler. You'll have to make an authenticated call to the GCE API to start the existing instance.
With that, your instance will start on a schedule, it will run the startup script (that will be your existing orchestrating script), and it will shut down once it's finished.
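
For reference, a minimal sketch of the kind of authenticated start call Cloud Scheduler would need, here written as a Python function (e.g. deployable as a Cloud Function that Scheduler invokes weekly). The project, zone, and instance names are placeholders:

```python
# Hypothetical sketch: start an existing GCE instance. Once it boots, GCE
# runs its startup script (the orchestrating bash script), which should
# end with a shutdown command such as `sudo shutdown -h now`.
from googleapiclient import discovery  # pip install google-api-python-client

PROJECT = "my-project"     # assumption: replace with your project ID
ZONE = "us-central1-a"     # assumption: replace with your instance's zone
INSTANCE = "scraper-vm"    # assumption: replace with your instance name

def start_instance(request=None):
    # Uses application default credentials when run inside GCP.
    compute = discovery.build("compute", "v1")
    compute.instances().start(
        project=PROJECT, zone=ZONE, instance=INSTANCE
    ).execute()
    return "instance start requested"
```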

Related

How to get Windows scheduler and Batch File to run only the Master Git Branch

I have a Python program doing a query and some data manipulation that is run every 5 minutes through a batch file and Windows Scheduler. I also want to start using Git and GitHub for better version control as the code base gets more advanced.
My question is: how do I set up the batch file to only run the code on the master branch? If I am in a separate branch doing some development, I assume it will run the code of whatever branch I am currently on.
Thanks! - Also, I would be open to suggestions on how to get away from Windows Scheduler... The program takes the data and puts it into a CSV that is then pulled into Tableau. So if I could somehow get that process off of Windows Scheduler, that'd be nice, but I suppose that's a separate question.
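
One hedged sketch of the usual workaround: have the scheduled job run from a dedicated clone that is always forced onto master, so whatever branch you develop on never affects the scheduled run. The path and entry point below are assumptions:

```python
# Hypothetical wrapper the scheduler runs instead of the script directly.
import subprocess

REPO_DIR = r"C:\jobs\myproject"  # assumption: a clone used only for scheduled runs

# Force the scheduling clone onto master and bring it up to date.
subprocess.run(["git", "-C", REPO_DIR, "checkout", "master"], check=True)
subprocess.run(["git", "-C", REPO_DIR, "pull"], check=True)

# assumption: main.py is the program's entry point
subprocess.run(["python", rf"{REPO_DIR}\main.py"], check=True)
```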

How can I schedule a Python script in the cloud?

I am developing a Python script that downloads some Excel files from a web service. These two files are combined with another one stored locally on my computer to produce the final file. This final file is loaded into a database and a PowerBI dashboard to finally visualize the data.
My question is: how can I schedule this to run daily if my computer is turned off? As I said, two files are web scraped (so no problem to schedule), but one file is stored locally.
One solution that comes to mind: store the local file in Google Drive/OneDrive and download it with the API, so my script does not depend on my computer. But if this were the case, how could I schedule it? What service would you use? Heroku, ...?
I am not entirely sure about your context, but I think you could look into using AWS Lambda for this. It is reasonably easy to set up and to create a schedule for running code.
It is even easier to achieve this using the serverless framework. This link shows an example built with Python that will run on a schedule.
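
As a rough illustration only: a Lambda handler is just a plain Python function, and the schedule lives in configuration (an EventBridge cron rule, or the serverless framework's schedule event), not in the code. The pipeline body here is a placeholder:

```python
# Hypothetical Lambda entry point; replace the body with your real
# download/combine/load steps, packaged with the deployment.
def handler(event, context):
    print("scheduled run triggered:", event.get("time", "unknown"))
    # ... download the Excel files, merge with the cloud-hosted file,
    # and push the result to the database here ...
    return {"status": "done"}
```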
I am running the schedule package for exactly this kind of thing.
It's easy to set up and works very well.
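
For context, a minimal sketch of the schedule package (pip install schedule); the job body and the daily time are placeholders. Note that this only works while the hosting process stays alive, so it suits an always-on box or dyno rather than a computer that gets turned off:

```python
import time
import schedule  # pip install schedule

def job():
    print("running the download/combine pipeline")  # assumption: your entry point

schedule.every().day.at("06:00").do(job)

while True:
    schedule.run_pending()
    time.sleep(60)
```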

How to run Python code periodically that connects to the internet

I'm working on a Python script that connects to the Twitter API to pull some tweets into an array, then pushes them to a MySQL database. It's a pretty basic script, but I'd like to set it up to run weekly.
I'd like to know the best way to deploy it so that it runs automatically every week, without me having to run it manually.
This depends on the platform where you intend to run your Python code. As martin says, this is not a Python question but rather a scheduling question.
You can create a batch file that activates Python and runs your script, and then use Task Scheduler to schedule your batch file to execute weekly.

How to execute specific code at a defined interval -Django-

I would like to know how to automatically execute code in Django at a defined interval.
I'm building a web crawler that collects information and stores it in a JSON file, plus a function that reads the file and stores its information in a SQLite database. This information is rendered and visible on my website. At the moment I have to run the crawler and the function that saves the data to the database with a click on a button, but that is very inefficient. It would be better if the database information were updated automatically, about every 6 hours (of course, only while the server is running).
If you want to keep the code on your web server and you have the permissions, then a cron job (Linux) or Scheduled Task (Windows) will do what you want. Set your cron job to run every six hours and have it call your Django script.
You can run the script as a Django management command, e.g. manage.py mycommand, or run it as a stand-alone PY file where you include the libs and call django.setup(). A sketch of the management-command approach follows.
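
A minimal sketch of such a management command, assuming the crawler and loader functions live in an app called myapp (the file would sit at myapp/management/commands/runcrawler.py):

```python
# Hypothetical management command; crawl_to_json and load_json_into_db
# stand in for your existing crawler and database-loading functions.
from django.core.management.base import BaseCommand

from myapp.crawler import crawl_to_json, load_json_into_db  # assumption

class Command(BaseCommand):
    help = "Run the crawler and load the results into the database"

    def handle(self, *args, **options):
        crawl_to_json()
        load_json_into_db()
        self.stdout.write(self.style.SUCCESS("Crawl complete"))
```

The matching crontab entry would then look something like 0 */6 * * * /path/to/venv/bin/python /path/to/manage.py runcrawler.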
Or, if you want the crawler to run within the context of your web server, your cron job can initiate a GET request using HTTPie. That gives you a nice API approach.
When I run this, I keep a flag in the database (a model called Job) that records what is running, in case a scheduled task is still going when cron fires again. I have some very long-running asynchronous tasks, and I tend to use cron rather than Celery.
If you are running your site on AWS or GCP, both have a cron equivalent. E.g. you could create a GCP Cloud Scheduler job that fires off a GET to your web server to trigger the crawler.

How to call a Google Apps Script from a local Python script

I just learned about Google Apps Script and am wondering whether this is a solution for me.
I have a Python script on my desktop which eventually creates a CSV file stored on my computer. In the end it would be great to have the values from this CSV file appended to an existing Google Spreadsheet.
So now I'm wondering: is it possible to create a Google Apps Script which fetches these values from the locally stored CSV, and ideally even to call this Google Apps Script from within my Python script?
Python scripts can invoke Google Apps Scripts via its Execution API. (You'll find a Python Quickstart at that link.)
To go the other way (have Google Apps Script pull the file from your computer) is impossible... but if your Python script puts the CSV file into your local Google Drive folder (which syncs to the cloud service), a Google Apps Script can access the synced file. You cannot trigger this based on any event; instead, it would need to be time-based. (Fine if you know the file is generated periodically, e.g. daily at 3 AM.)
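
For reference, a hedged sketch of the Execution API call with google-api-python-client; SCRIPT_ID and the Apps Script function name appendRows are placeholders, and creds must be OAuth credentials already authorized for the script (as covered by the Quickstart):

```python
from googleapiclient import discovery  # pip install google-api-python-client

SCRIPT_ID = "YOUR_SCRIPT_ID"  # assumption: your Apps Script project's ID

def run_apps_script(creds, rows):
    # Calls a hypothetical appendRows(rows) function defined in the script.
    service = discovery.build("script", "v1", credentials=creds)
    request = {"function": "appendRows", "parameters": [rows]}
    return service.scripts().run(scriptId=SCRIPT_ID, body=request).execute()
```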
Yes. You can go down the Execution API route. For a simpler approach, you could have a doGet() function in your script, make an HTTP GET request from Python to the script's URL, and pass the data as input. Example here: GAS as backend service
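
And a minimal sketch of that simpler doGet approach from the Python side; the /exec URL is a placeholder for your own web app deployment, and a small payload is assumed since GET query strings have length limits:

```python
import requests

GAS_URL = "https://script.google.com/macros/s/DEPLOYMENT_ID/exec"  # assumption

with open("output.csv", encoding="utf-8") as f:
    csv_text = f.read()

# On the Apps Script side, doGet(e) would read this from e.parameter.data.
resp = requests.get(GAS_URL, params={"data": csv_text})
resp.raise_for_status()
print(resp.text)
```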
