I am running a Python script on Heroku every 10 minutes using the Heroku Scheduler add-on. The script needs to be able to access the last time it was run. On my local machine I simply used a .txt file that I updated with the "last run time" whenever the program ran. The issue is that Heroku doesn't persist file changes made while a program runs, so the file doesn't update on Heroku. I have looked into alternatives like Amazon S3 and PostgreSQL, but these seem like major overkill for storing one line of text. Are there any simpler alternatives out there?
If anyone has a similar problem: I ended up finding Heroku Redis, which lets you store key-value pairs and then read them back (like a cloud-based Python dictionary).
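For anyone landing here later, here is a minimal sketch of that pattern with the redis-py client, assuming the add-on's REDIS_URL config var and a key name I made up (depending on the plan you may also need extra TLS options):

    import os
    import datetime
    import redis  # redis-py

    # Heroku Redis exposes its connection string in the REDIS_URL config var
    r = redis.from_url(os.environ["REDIS_URL"])

    # Read the previous run time (None on the very first run)...
    last_run = r.get("last_run")
    print("Last run:", last_run)

    # ...then record this run for next time
    r.set("last_run", datetime.datetime.utcnow().isoformat())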
I am developing a Python script that downloads some Excel files from a web service. These two files are combined with another one stored locally on my computer to produce the final file. This final file is loaded into a database and a Power BI dashboard to visualize the data.
My question is: how can I schedule this to run daily if my computer is turned off? As I said, two files are web-scraped (so no problem scheduling those), but one file is stored locally.
One solution that comes to mind: store the local file in Google Drive/OneDrive and download it with the API, so my script is not dependent on my computer. But in that case, how could I schedule it? What service would you use? Heroku, ...?
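Something like this is what I have in mind for the Drive part (a rough sketch with google-api-python-client; the file ID, credentials path and output name are placeholders, and the service account would need access to the file):

    import io
    from google.oauth2 import service_account
    from googleapiclient.discovery import build
    from googleapiclient.http import MediaIoBaseDownload

    # Placeholder credentials / file ID -- adjust to your own setup
    creds = service_account.Credentials.from_service_account_file(
        "service-account.json",
        scopes=["https://www.googleapis.com/auth/drive.readonly"],
    )
    drive = build("drive", "v3", credentials=creds)

    # Download the locally-maintained Excel file from Drive instead of from disk
    request = drive.files().get_media(fileId="YOUR_FILE_ID")
    buf = io.BytesIO()
    downloader = MediaIoBaseDownload(buf, request)
    done = False
    while not done:
        _, done = downloader.next_chunk()

    with open("local_copy.xlsx", "wb") as f:
        f.write(buf.getvalue())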
I am not entirely sure about your context, but I think you could look into using AWS Lambda for this. It is reasonably easy to set it up and also create a schedule for running code.
It is even easier to achieve this using the serverless framework. This link shows an example built with Python that will run on a schedule.
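To give a rough idea, a scheduled Lambda in Python is just a plain handler; the file and function names below are made up, and the daily trigger itself is declared outside the code (an EventBridge/CloudWatch schedule rule, or a `schedule: rate(1 day)` event in serverless.yml if you go the Serverless Framework route):

    # handler.py -- a minimal, illustrative Lambda entry point
    import datetime


    def run(event, context):
        """Invoked by the scheduled rule; no machine of yours needs to be on."""
        print(f"Scheduled run at {datetime.datetime.utcnow().isoformat()}Z")
        # ...download the two Excel files, combine them with the Drive/OneDrive
        # copy, produce the final file and load it wherever it needs to go...
        return {"status": "ok"}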
I am using the schedule package for exactly this kind of thing.
It's easy to set up and works very well.
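A minimal sketch of how the schedule package is used (note that it only fires while the process hosting it stays up, so it needs to run on a machine or dyno that is always on):

    import time
    import schedule


    def job():
        # download the Excel files, combine them, refresh the dashboard, etc.
        print("Running the daily refresh...")


    # Run once a day at 07:00 (the time of day is just an example)
    schedule.every().day.at("07:00").do(job)

    while True:
        schedule.run_pending()
        time.sleep(60)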
I have a very simple Heroku app that basically runs one Python script every ten minutes or so, 24/7. The script uses a text file to store a really simple queue of information (no sensitive info) that gets updated every time it runs.
I have Heroku set to deploy the app via GitHub, but it seems like way too much work to programmatically commit, push, and redeploy the entire thing just to update the queue in the text file. How can I attach this file to Heroku in a way that lets it be updated easily? I've been playing around with the free database add-ons, but those also seem overcomplicated (in the sense that I've got no clue how to use them).
I'm also totally open to accusations that I'm making mountains out of molehills when I could easily be using some other easier platform to freely run this script 24/7 with the queue file.
At this point I'm sure that nobody cares, but this answer is for you, future troubleshooter.
It turns out that the Heroku script works fine with a .txt file queue. Once the queue is included in the Heroku deployment, the script will pull from the queue and even update it, giving the correct behavior. All you have to do is put the queue in the GitHub repo and open/change the file from your Python code as you normally would.
The confusing thing is that this does not change the files on GitHub. It leaves the queue in the GitHub copy of the repo as the same text file it was when it was originally pushed. This means that pulling and pushing the repo is a little confusing, because the stored copy of the queue gets out of date very fast.
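In case it helps, the "open/change the file as you normally would" part is nothing special. A sketch, with a hypothetical queue.txt committed next to the Python files:

    QUEUE_PATH = "queue.txt"  # lives in the repo next to the scripts


    def read_queue():
        with open(QUEUE_PATH) as f:
            return [line.strip() for line in f if line.strip()]


    def write_queue(items):
        with open(QUEUE_PATH, "w") as f:
            f.write("\n".join(items) + "\n")


    queue = read_queue()
    # ...process the first item, then persist whatever should remain for the next run...
    write_queue(queue[1:])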
Thanks for the question, me, I'm happy to help.
So I'm fairly new to Airflow and have only really been using GitHub as a fairly basic push/pull tool, rather than getting under the hood and using it for anything more complex.
That being said, now is the time I wish to do something more complex with Airflow/GitHub.
My organisation uses Google Cloud for pretty much everything, and I currently use magnus to trigger my scheduled queries.
For many reasons, I'm aiming to move over to Airflow to perform these tasks. What I'm actually trying to do is host my source code on GitHub and use GitPython to find the .sql files for Airflow to then trigger my refresh.
I'm having trouble understanding how I can 'host' my GitHub repo in an Airflow instance and then isolate a file to push to a dag task.
So, problem 1 - each time I try to connect to my remote repo, I receive a Windows error:
Cmd('git') not found due to: FileNotFoundError('[WinError 2] The system cannot find the file specified')
cmdline: git pull Remote_server_Address.git
I've tried various commands, but I'm not really finding the documentation useful.
As I'm aiming to host the repo in Airflow (preferably within just a Python instance), I'm hoping I don't need to provide a local path - but even when I try to provide one, I still get the same error.
All help appreciated and apologies if it's vague.
Any other integration suggestions would also be welcome.
Thanks
It is a little hard to understand the setup you describe.
For example
isolate a file to push to a dag task
Does this mean you want a task to read a specific file when you run an instance of it?
If that is the case, you probably want to pass the file location (likely hosted in GCS) to the DAG. This explains how.
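A small sketch of that approach (the DAG id, conf key and default path here are placeholders); the file location arrives in dag_run.conf when the DAG is triggered:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def run_refresh(**context):
        # The path is supplied at trigger time, e.g.
        #   airflow dags trigger sql_refresh -c '{"sql_path": "gs://my-bucket/query.sql"}'
        sql_path = context["dag_run"].conf.get("sql_path", "gs://my-bucket/default.sql")
        print(f"Refreshing from {sql_path}")


    with DAG(
        dag_id="sql_refresh",
        start_date=datetime(2023, 1, 1),
        schedule_interval=None,  # triggered on demand
        catchup=False,
    ) as dag:
        PythonOperator(task_id="refresh", python_callable=run_refresh)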
However, a more common pattern is for something like a daily job to automatically select the file or run a query based on the date.
You could also set up a sensor that will trigger a DAG run when a file is added to a specific GCS folder, using the GCS sensor.
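For the sensor route, something along these lines, using the GCSObjectExistenceSensor from the Google provider package to hold up the rest of the DAG until the file appears (bucket and object names are placeholders):

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor

    with DAG("gcs_triggered_refresh", start_date=datetime(2023, 1, 1),
             schedule_interval="@daily", catchup=False) as dag:
        wait_for_sql_file = GCSObjectExistenceSensor(
            task_id="wait_for_sql_file",
            bucket="my-bucket",              # placeholder bucket
            object="incoming/refresh.sql",   # placeholder object path
            poke_interval=300,               # check every 5 minutes
            timeout=6 * 60 * 60,             # give up after 6 hours
        )
        # downstream tasks (e.g. the refresh itself) would be chained after the sensor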
I have been using Google Cloud for a few weeks now and I am facing a big problem, given my limited GCP knowledge.
I have a Python project whose goal is to "scrape" data from a website using its API. My project runs a few tens of thousands of requests per execution, and it can take very long (a few hours, maybe more).
I have 4 Python scripts in my project, and it is all orchestrated by a bash script.
The execution is as follows (see the sketch after this list):
The first script checks a CSV file containing all the instructions for the requests, executes the requests, and saves all the results in CSV files.
The second script checks the previously created CSV files and builds a new CSV instruction file.
The first script runs again, but with the new instructions, and again saves the results in CSV files.
The second script checks again and does the same again...
...and so on, a few times.
The third script cleans the data, removes duplicates, and creates a single CSV file.
The fourth script uploads the final CSV file to bucket storage.
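For reference, the loop above could be collapsed into a single Python entry point, which removes the bash script without touching the individual scripts (the script names and the number of passes are placeholders):

    import subprocess


    def run(script, *args):
        subprocess.run(["python", script, *args], check=True)


    N_PASSES = 3  # "...and so on, a few times"

    for _ in range(N_PASSES):
        run("script1_requests.py")      # execute the requests, save results to CSV
        run("script2_instructions.py")  # rebuild the instruction CSV from the results
    run("script3_clean.py")             # dedupe and merge everything into one CSV
    run("script4_upload.py")            # push the final CSV to bucket storage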
Now I want to get rid of that bash script, and I would like to automate the execution of those scripts roughly once a week.
The problem here is the execution time. Here is what I have already tested:
Google App Engine: the timeout of a request on GAE is limited to 10 minutes, and my functions can run for a few hours. GAE is not usable here.
Google Compute Engine: my scripts will run at most 10-15 hours a week; keeping a Compute Engine instance up the whole time would be too pricey.
What could I do to automate the execution of my scripts in a cloud environment? What solutions might I not have thought of that don't require changing my code?
Thank you
A simple way to accomplish this without the need to get rid of the existing bash script that orchestrates everything would be:
Include the bash script in the startup script for the instance.
At the end of the bash script, include a shutdown command.
Schedule the starting of the instance using Cloud Scheduler. You'll have to make an authenticated call to the GCE API to start the existing instance.
With that, your instance will start on a schedule, it will run the startup script (that will be your existing orchestrating script), and it will shut down once it's finished.
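For the "authenticated call to the GCE API" part, one option is a tiny Cloud Function that Cloud Scheduler hits once a week. A sketch with google-api-python-client, where the project, zone and instance names are placeholders:

    from googleapiclient import discovery


    def start_instance(request=None):
        # Uses Application Default Credentials, e.g. the Cloud Function's service account
        compute = discovery.build("compute", "v1")
        compute.instances().start(
            project="my-project",        # placeholder project id
            zone="europe-west1-b",       # placeholder zone
            instance="scraper-vm",       # placeholder instance name
        ).execute()
        return "instance start requested"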
I'm new to Python (and relatively new to programming in general) and I have created a small Python script that scrapes some data off a site once a week and stores it in a local database (I'm trying to do some statistical analysis on downloaded music). I've tested it on my Mac and would like to put it up on my server (a VPS with WiredTree running CentOS 5), but I have no idea where to start.
I tried Googling for it, but apparently I'm using the wrong terms, as "deploying" seems to mean creating an executable file. The only thing that seems to make sense is to set it up inside Django, but I think that might be overkill. I don't know...
EDIT: More clarity
You should look into cron for this, which will allow you to schedule the execution of your Python script.
If you aren't sure how to make your Python script executable, add a shebang to the top of the script, and then add execute permissions to the script using chmod.
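Roughly, that looks like this (the paths and times below are only examples):

    #!/usr/bin/env python
    # scraper.py -- the shebang above lets cron run the file directly once it
    # has been made executable:
    #
    #     chmod +x scraper.py
    #
    # and a weekly crontab entry (added with `crontab -e`) might look like:
    #
    #     0 3 * * 0  /home/me/scraper.py >> /home/me/scraper.log 2>&1


    def main():
        # scrape the site and store the results in the local database
        ...


    if __name__ == "__main__":
        main()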
Copy the script to the server.
Test the script manually on the server.
Set up cron ("crontab -e") with a schedule that will run the script soon, so you can test it.
Once you've debugged any issues, set cron to the appropriate time.
Sounds like a job for Cron?
Cron is a scheduler that provides a way to run certain scripts (apps, etc.) at certain times.
Here is a short tutorial that explains how to set up cron.
See this for more general cron information.
Edit:
Also, since you are using CentOS: if you end up having issues with your script later on... it could partly be caused by SELinux. There are ways to disable SELinux on your server (if you have enough access permissions), but there are also arguments against disabling SELinux.