I am working on a Python script that writes a large number of IDs to a MongoDB database. The script iterates through these IDs and adds them to the database, keyed on today's date. However, there are so many IDs that the script takes many hours to iterate over all of them.
Let's say I start the script today, it runs into tomorrow, but it somehow fails before finishing. Is there a way to make the script resume from where it left off instead of starting again from the beginning? I want this for two reasons: first, I don't want duplicate values in the database (the remaining IDs would now be added under tomorrow's date, differentiating them from the ones already inserted today); and second, it is a waste of time to restart the loop over the IDs from the beginning.
Basically, I'm looking for an answer to something like "How can I create a script that picks up where it left off in case of failure?", but for Python.
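One common pattern is to persist a small checkpoint as you go and to fix the run date once at startup, so a restart resumes at the saved position with the original date. Here is a minimal sketch of that idea; the all_ids list, the progress.txt checkpoint file, and the database/collection names are all hypothetical:

```python
import os
from datetime import date

from pymongo import MongoClient

CHECKPOINT = "progress.txt"        # hypothetical checkpoint file
all_ids = ["id1", "id2", "id3"]    # hypothetical list of ids
coll = MongoClient().mydb.ids      # hypothetical db/collection

# Resume from the saved position (and the original run date) if a
# previous run died partway through; otherwise start fresh.
start, run_date = 0, date.today().isoformat()
if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT) as f:
        saved_index, run_date = f.read().split(",")
        start = int(saved_index)

for i in range(start, len(all_ids)):
    # Upserting on _id means a re-run can never create duplicates.
    coll.update_one(
        {"_id": all_ids[i]},
        {"$set": {"date": run_date}},
        upsert=True,
    )
    with open(CHECKPOINT, "w") as f:
        f.write(f"{i + 1},{run_date}")  # persist progress as we go

os.remove(CHECKPOINT)  # finished cleanly: next run starts over
```

Writing the checkpoint after every document is the simplest correct option; if the per-write overhead matters, checkpoint every N IDs instead and accept re-upserting at most N documents on restart, which is harmless because the upserts are idempotent.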
I am new to programming and don't really have a clue what to do at the moment. I am currently trying to write code for the following problem. I'm using pandas to import a file (Excel, CSV) into Python, and I now want to automatically save a copy of this file for every business day. Once that works, I want to automatically delete the copies from Monday through Thursday when the new week starts, so I only have 4 copies at the end of the month. At the start of a new month I only want to keep the latest copy, so I end up with 12 copies of the file at the end of the year.
Please excuse my bad English; I hope you get what I mean. Thanks!
Your problem seems similar to the one in this thread: Python script to do something at the same time every day.
Once you get the "do something at the same time every day" part right, everything else is just file management. I suggest adding the current date, properly formatted, to your file names (using the datetime module). That makes it easy to check which files your script has to delete, and it also makes the files more human-readable than a bunch of "filename_number.extension" files sitting somewhere.
However, given that you're doing at most one file download a day, you may want a utility to launch the script for you every day, especially if the code runs on your personal computer. On UNIX platforms, cron is likely the way to go; on Windows, there is the Task Scheduler. That spares you from relaunching the script manually every time you restart your computer.
Edit: just to make it clear, if you go with the option in the paragraph above, then your script should simply be:
if today is a business day, then download and save the file
check whether any files need to be deleted, and delete those that do
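In code, that could look something like the sketch below; the file names, paths, and exact retention rule are assumptions to adapt:

```python
import datetime as dt
import shutil
from pathlib import Path

SOURCE = Path("data.xlsx")      # hypothetical source file
BACKUP_DIR = Path("backups")    # hypothetical backup folder
BACKUP_DIR.mkdir(exist_ok=True)

today = dt.date.today()

# When a new week starts, drop last week's Monday-Thursday copies,
# so only each week's Friday copy survives (4 per month).
if today.weekday() == 0:
    for f in BACKUP_DIR.glob("data_*"):
        stamp = dt.date.fromisoformat(f.stem.split("_", 1)[1])
        if stamp.weekday() != 4:        # keep Fridays only
            f.unlink()

# Save today's copy on business days (Monday=0 .. Friday=4); the
# date-stamped name is what makes the cleanup step trivial.
if today.weekday() < 5:
    dest = BACKUP_DIR / f"data_{today:%Y-%m-%d}{SOURCE.suffix}"
    shutil.copy(SOURCE, dest)
```

The month-end rule (keep only the latest copy of each month) follows the same pattern: parse the date out of the file name and prune by month. Scheduled from cron with something like `0 18 * * 1-5 python3 save_copy.py`, the script itself stays this simple.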
Every few days I have to load around 50 rows of data into a Postgres 14 database through a Python script. As part of this I want to record a run number as a column in the database. This number should be the same for all rows I insert at that time and larger than any number currently in that column, but other than that its actual value doesn't matter (i.e., I need to generate the number myself; I'm not pulling it in from somewhere else).
The obvious way to do this would be two calls from Python to the database: one to get the current maximum run number, and one to save the data with the run number set to one more than the retrieved value. Is this best practice? Is there a better way to do it with only one call? Would people recommend instead creating a function in Postgres that does this and calling that function? I feel like this is a common situation with an accepted best practice, but I don't know what it is.
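You can do it in a single round trip by letting Postgres compute the number inside the INSERT itself. Here's a sketch with psycopg2; the loads table, its columns, and the connection string are hypothetical:

```python
import psycopg2
from psycopg2.extras import execute_values

rows = [("widget", 3), ("gadget", 7)]    # hypothetical payload

conn = psycopg2.connect("dbname=mydb")   # hypothetical DSN
with conn, conn.cursor() as cur:
    # One statement: the CTE computes the next run number once, and
    # every inserted row reuses it. COALESCE covers an empty table.
    execute_values(
        cur,
        """
        WITH next AS (
            SELECT COALESCE(MAX(run_number), 0) + 1 AS n FROM loads
        )
        INSERT INTO loads (run_number, name, qty)
        SELECT next.n, v.name, v.qty
        FROM next, (VALUES %s) AS v(name, qty)
        """,
        rows,
    )
```

Note that a single statement narrows but does not eliminate the race where two concurrent loaders read the same MAX; if concurrent loads are ever possible, a Postgres sequence or an explicit lock is the safer choice. For a load every few days, the approach above is usually plenty.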
Here's what I'm trying to accomplish. I've got a Google Sheet that I populate with current astronomical, weather, and nautical information. The data comes from multiple sources. Most scripts run at a prescribed time: for example, sunrise/sunset data is gathered once a day, and current temperature and wind info is gathered every 15 minutes. I've set up crontab to run those. One of the data sets is the current tidal information: is it rising or falling, by how much, and when does it change direction? Tides don't happen at the same time every day, so to keep the sheet current I run that script every 15 minutes. That's overkill, because tides change direction only about every six hours.
Ideally, I would run the script within a minute or two of the next tide change to keep the sheet current. Is it possible to have a script update a row in the crontab? Is it advisable? I'm sure there's an elegant way to do this, but it eludes me. If someone can point me in the right direction, I'll do the legwork. I'm not asking for code; a simple "try this, it worked for me" will do.
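One approach that avoids touching the crontab at all is to have the script schedule its own next run with the Unix at daemon. A sketch, assuming your tide source can tell you the next change time (the timestamp and script path here are made up):

```python
import subprocess
from datetime import datetime, timedelta

# Hypothetical: in practice this comes from your tide data source.
next_tide_change = datetime(2024, 6, 1, 14, 32)

# Queue a one-shot job a couple of minutes after the turn; the job
# runs the script again, which re-queues itself for the next tide.
run_at = next_tide_change + timedelta(minutes=2)
subprocess.run(
    ["at", run_at.strftime("%H:%M"), run_at.strftime("%m/%d/%Y")],
    input=b"python3 /home/pi/tide_update.py\n",
    check=True,
)
```

If you'd rather stay in cron, the python-crontab package can rewrite a job's schedule programmatically, but a self-rescheduling at job tends to be simpler for a one-shot task whose time moves around.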
Currently, I'm using Google's two-step method to back up the Datastore and then import it into BigQuery.
I also reviewed the code that does this with a pipeline.
Neither method is efficient, and both are costly, since all the data is imported every time.
I only need to add the records created since the last import.
What is the right way of doing it?
Is there a working example of how to do it in Python?
You can look at Streaming inserts. I'm actually looking at doing the same thing in Java at the moment.
If you want to do it every hour, you could maybe add your inserts to a pull queue (either as serialised entities or keys/IDs) each time you put a new entity to Datastore. You could then process the queue hourly with a cron job.
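For reference, a streaming insert is only a couple of lines with the google-cloud-bigquery client library; the table ID and row contents below are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_project.my_dataset.my_table"  # hypothetical table

rows = [
    {"entity_id": "abc123", "created": "2015-06-01T00:00:00Z"},
]

# Streams only the new rows into the table, instead of re-importing
# the whole Datastore backup every time.
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Streaming insert failed:", errors)
```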
There is no full working example (as far as I know), but I believe the following process could help you:
1- You'd need to add a "last time changed" timestamp to your entities and keep it updated (see the sketch after this answer).
2- Every hour you can run a MapReduce job, where your mapper has a filter that checks the last-updated timestamp and only picks up entities that were updated in the last hour.
3- Manually add what needs to be added to your backup.
As I said, this is pretty high level, and the actual answer will require a bunch of code; I honestly don't think it is suited to Stack Overflow's format.
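To give a feel for step 1, here is what the "last time changed" property could look like with the ndb client, together with the cutoff query the mapper's filter in step 2 would rely on; the Record kind is hypothetical:

```python
from datetime import datetime, timedelta

from google.appengine.ext import ndb

class Record(ndb.Model):  # hypothetical entity kind
    payload = ndb.JsonProperty()
    # auto_now=True refreshes the timestamp on every put(), so it
    # always reflects the last change.
    updated = ndb.DateTimeProperty(auto_now=True)

# The hourly job only picks up entities touched in the last hour.
cutoff = datetime.utcnow() - timedelta(hours=1)
recent = Record.query(Record.updated >= cutoff).fetch()
```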
OK, so I am using gspread to pull data from Google Spreadsheets, but for what I am doing I need to pull data from long columns. The data being pulled isn't needed until halfway through the program. Is there a way to pull that data at the beginning, while the first half of the program is running?
- As it runs right now, it looks up some individual values, which takes ~5 seconds.
- Then it pulls the data from the columns, which takes ~4-15 seconds (it varies), but during that time it isn't doing ANYTHING but pulling the data, so it just sits there.
- Then it continues and does the rest of the calculations, which take ~1 second.
I feel like this is inefficient, and since the program deals in minutes, I worry that the delay might start to interfere with how it runs once the columns get especially long...
Here is the Pastebin for the code, with my information removed: http://pastebin.com/Wf5bfmZ0
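One way to overlap the download with the first half of the program is to start it in a background thread via concurrent.futures and collect the result only when it's actually needed. A sketch; the sheet name, credentials, and column numbers are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

import gspread

def fetch_columns():
    # The slow, network-bound part: pull the long columns.
    gc = gspread.service_account()   # assumes stored credentials
    ws = gc.open("my_sheet").sheet1  # hypothetical sheet name
    return ws.col_values(1), ws.col_values(2)

with ThreadPoolExecutor(max_workers=1) as pool:
    # The download starts right away, in the background.
    future = pool.submit(fetch_columns)

    # ... the first half of the program (~5 s of lookups) runs here ...

    # Blocks only if the download hasn't finished yet.
    col_a, col_b = future.result()

    # ... the remaining ~1 s of calculations use col_a / col_b ...
```

If the two halves overlap well, the 4-15 second dead time mostly disappears; the worst case is the same as today, since result() only waits for whatever is left of the download.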