I have two Python files that create and read text from a .txt file; in order for them to work they need to know the info inside the .txt file.
In Heroku I have a scheduler that runs one file, then the other. The big problem is that the files are reset every time to their state from the original repo. How can I get around this?
Heroku does not offer a persistent file system. You will need to store them in another service (like S3), or depending on what the contents of your files are, redesign to write and read from a database instead.
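A minimal sketch of the S3 option, assuming the boto3 library and a bucket you control; the bucket and key names below are placeholders:
import boto3

s3 = boto3.client("s3")
BUCKET = "my-app-state"  # placeholder bucket name
KEY = "state.txt"        # placeholder object key

def save_state(text):
    # The first script writes the shared info here instead of to a local .txt file.
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=text.encode("utf-8"))

def load_state():
    # The second script reads the same object when the scheduler runs it.
    response = s3.get_object(Bucket=BUCKET, Key=KEY)
    return response["Body"].read().decode("utf-8")
The AWS credentials would live in Heroku config vars (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY), which boto3 picks up from the environment.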
I am relatively new to web development and very new to using Web2py. The application I am currently working on is intended to take in a CSV upload from a user, generate a PDF file based on the contents of the CSV, and then allow the user to download that PDF. As part of this process I need to generate and access several intermediate files that are specific to each individual user (these files would be images, other PDFs, and some text files). I don't need to store these files in a database since they can be deleted after the session ends, but I am not sure of the best way or place to store these files and keep them separate per session. I thought that maybe the subfolders in the sessions folder would make sense, but I do not know how to dynamically get the path to the correct folder for the current session. Any suggestions pointing me in the right direction are appreciated!
I was hitting the error "TypeError: expected string or Unicode object, NoneType found" and ended up storing just a link in the session to the uploaded document in the db (or maybe the upload folder in your case). I would store the upload so it can proceed normally, and then clear out the values and the file if it is not 'approved'.
In similar circumstances, if the information is not confidential, I write the temporary files directly under /tmp.
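A minimal sketch of that approach, keyed on the web2py session id; the directory prefix and the cleanup helper are my own placeholders:
import os
import shutil
import tempfile

def session_workdir(session_id):
    # One scratch directory per session under /tmp (or the OS temp dir).
    path = os.path.join(tempfile.gettempdir(), "csv2pdf_%s" % session_id)
    os.makedirs(path, exist_ok=True)
    return path

def cleanup_workdir(session_id):
    # Delete the intermediate images/PDFs/text files once the session is done.
    shutil.rmtree(session_workdir(session_id), ignore_errors=True)
In a controller you could pass response.session_id (or any unique token you store in the session) as session_id.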
Short Explanation
Some CSV files arrive on a OneDrive server which is synced onto a machine where a script runs to read them and push them to BigQuery. While the script is running fine now, I intend to run it only after all files have synced completely (i.e. are available offline) on that machine since the last push...
Long Explanation
So basically I use a local database for the sales history of our organization, which I want to push to BigQuery as well to reflect realtime (lagged) info on dashboards and for other analyses and stuff, as a lot of other data besides sales history resides there. Since the database is strictly on-premises and cannot be accessed outside the organization's network (so literally no way to link it to BigQuery!), I have some people there who export day-to-time sales (sales from the start of the day till the time of export) info periodically (every 1-2 hrs) from the database and upload it to OneDrive. I have OneDrive synced on a machine where many other scripts are hosted (it's just convenient!) and I run a (Python) script that reads all the CSVs, combines them and pushes them to BigQuery. Often there are duplicates, so it is necessary to read all the files, remove duplicates and then push them to BigQuery (for which I use:
import os
import pandas as pd
from google.oauth2 import service_account

files = [file for file in os.listdir(input_directory) if file.count('-') <= 1]
data = [pd.read_excel(input_directory + file) for file in files if file.endswith('.xlsx')]
all_data = pd.concat(data, ignore_index=True).drop_duplicates()
def upload():
    all_data.to_gbq(project_id=project_id,
                    destination_table=table,
                    credentials=service_account.Credentials.from_service_account_file(
                        'credentials.json'),
                    progress_bar=True,
                    if_exists='replace')
What I am trying to do is only update the BigQuery table if there are any new changes when the script is run, since they don't always have time to do the exports.
My current approach is to write the length of the dataframe to a file at the end of the script:
with open("length.txt", "w") as f:
    f.write(str(len(all_data)))
and once all the files are read into the dataframe, I use:
if len(all_data) > int(open("length.txt", "r").readlines()[0]):
    upload()
But doing this requires all the files to be read into RAM. Reading so many files actually makes things a bit congested for the other scripts on the machine (RAM-wise), so I would rather not read them all into RAM as my current approach does.
I also tried accessing file attributes and building logic based on the modified date, but as soon as a new file is added, the date changes even when the file is not fully downloaded onto the machine. I also searched for a way to access the sync status of the files and came across Determine OneDrive Sync Status From Batch File, but that did not help. Any help bettering this situation is appreciated!
We have similar workflows to this where we load data from files into a database regularly by script. For us, once a file has been processed, we move it to a different directory as part of the python script. This way, we allow the python script to load all data from all files in the directory as it is definitely new data.
If the files are cumulative (contain old data as well as new data) and therefore you only want to load any rows that are new, this is where it gets tricky. You are definitely on the right track, as we use the modified date to ascertain whether the file has changed since we last processed it. In Python you can get this from the os library: os.path.getmtime(file_path).
This should give you the last date/time the file was changed in any way, for any operating system.
I recommend just moving the files out of your folder containing new files once they are loaded, to make it easier for your Python script to handle; see the sketch below. I do not know much about OneDrive though, so I cannot help with that aspect.
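A minimal sketch combining both ideas; the directory names and the last_run.txt state file are assumptions about your setup, not anything OneDrive-specific:
import os
import shutil

INCOMING = "incoming/"       # folder OneDrive syncs into (placeholder)
PROCESSED = "processed/"     # archive for files already loaded (placeholder)
STATE_FILE = "last_run.txt"  # newest modified time handled so far

def last_run_time():
    try:
        with open(STATE_FILE) as f:
            return float(f.read())
    except (FileNotFoundError, ValueError):
        return 0.0

def new_files():
    # Only files modified since the last successful run need to be read.
    cutoff = last_run_time()
    return [os.path.join(INCOMING, name)
            for name in os.listdir(INCOMING)
            if name.endswith(".xlsx")
            and os.path.getmtime(os.path.join(INCOMING, name)) > cutoff]

def mark_processed(paths):
    # Record the newest mtime we handled, then archive the files.
    if paths:
        newest = max(os.path.getmtime(p) for p in paths)
        with open(STATE_FILE, "w") as f:
            f.write(str(newest))
    os.makedirs(PROCESSED, exist_ok=True)
    for path in paths:
        shutil.move(path, PROCESSED)
This does not solve the "file exists but OneDrive has not finished downloading it yet" problem, but it does mean only the new files ever get read into RAM.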
Good luck!
I am running my Python script, in which I write Excel files, to put them onto my EC2 instance. However, I have noticed that these Excel files, although they are created, only appear on the server once the code stops.
I guess they are kept in cache but I would like them to be added to the server straight away. Is there a "commit()" to add to the code?
Many thanks
I guess they are kept in cache but I would like them to be added to the server straight away. Is there a "commit()" to add to the code?
No. It isn't possible to stream or write a partial .xlsx file the way you can with a CSV or HTML file, since the file format is a collection of XML files in a Zip container and it can't be generated until the file is closed.
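So the practical fix is to close each workbook as soon as you have finished writing it, rather than letting everything get flushed when the script exits. A minimal sketch, assuming you are writing the files with xlsxwriter:
import xlsxwriter

def write_report(path, rows):
    workbook = xlsxwriter.Workbook(path)
    worksheet = workbook.add_worksheet()
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            worksheet.write(r, c, value)
    # Nothing appears on disk until the workbook is closed.
    workbook.close()
If you are writing through pandas, DataFrame.to_excel(path) closes the file as soon as the call returns, so each report should land on the server immediately.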
Suppose I have a file hosted on GCS in a Python AppEngine project. Unfortunately, this file structure is something like:
outer.zip/
- inner.zip/
- vid_file
- png_file
The problem is that the two files inside inner.zip do not have their extensions (.mp4, .png), and it's causing all sorts of trouble. How do I rename the files so that it looks like:
outer.zip/
- inner.zip/
- vid_file.mp4
- png_file.png
so that the files inside inner.zip have their extensions?
I keep running into all sorts of limitations, since GCS doesn't allow file renaming, unarchiving, etc.
The files aren't terribly large.
P.S. I'm not very familiar with Python, so any code examples would be greatly appreciated, thanks!
There is absolutely no way to perform any alteration to GCS objects -- full stop. They are exactly the bunch of bytes you decided at their birth (uninterpreted by GCS itself) and thus they will stay.
The best you can do is create a new object which is almost like the original except that it fixes the little errors and oopses you made when creating the original. Then you can overwrite (i.e. completely replace) the original with the new, improved version.
Hopefully it's a one-off terrible mistake you made just once and now want to fix, so it's not worth writing a program for it. Just download that GCS object, use normal tools to unzip it and unzip any further zip files it may contain, do the fixes on the filesystem with your favorite local tools, zip things up again, and upload/rewrite the final zip to your desired new GCS object -- phew, you're done.
Alex is right that objects are immutable, i.e., no editing in place. The only way to accomplish what you're talking about is to download the current file, unzip it, update the files, re-zip them into the same-named file, and upload to GCS. GCS object overwrites are transactional, so the old content will be visible until the instant the upload completes. Doing it this way is obviously not very network efficient, but at least it wouldn't leave periods of time when the object is invisible (as deleting and re-uploading would).
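A rough sketch of that download / fix / overwrite cycle, assuming the google-cloud-storage client library; the bucket name and the exact entry names inside inner.zip are placeholders:
import io
import zipfile
from google.cloud import storage

def fix_extensions(bucket_name="my-bucket", object_name="outer.zip"):
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)

    # Download the outer zip into memory (the files "aren't terribly large").
    outer_bytes = blob.download_as_bytes()
    with zipfile.ZipFile(io.BytesIO(outer_bytes)) as outer:
        inner_bytes = outer.read("inner.zip")

    # Rewrite inner.zip with the extensions added to its entries.
    fixed_inner = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(inner_bytes)) as inner, \
         zipfile.ZipFile(fixed_inner, "w", zipfile.ZIP_DEFLATED) as out:
        for name in inner.namelist():
            data = inner.read(name)
            if name == "vid_file":
                name = "vid_file.mp4"
            elif name == "png_file":
                name = "png_file.png"
            out.writestr(name, data)

    # Rebuild outer.zip and overwrite the original object in one write.
    fixed_outer = io.BytesIO()
    with zipfile.ZipFile(fixed_outer, "w", zipfile.ZIP_DEFLATED) as out:
        out.writestr("inner.zip", fixed_inner.getvalue())
    blob.upload_from_string(fixed_outer.getvalue(), content_type="application/zip")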
"Import zipfile" and you can unzip the file once it's downloaded into gcs storage.
I have code doing exactly this on a nightly basis from a cron job.
I've never tried creating a zip file with GAE, but the docs say you can do it.
https://docs.python.org/2/library/zipfile.html
Is there a way to read the contents of a static data directory or interact with that data in any way from within the code of an application?
Edit: Please excuse me if it wasn't clear at first; I mean getting a list of the files in that directory, not reading the data in them.
No. Files marked as static in app.yaml are not available to your application; they're served from separate servers.
If you just need to list them, you could build a list as part of your deploy process. If you need to actually read them, you'll need to include a second copy in your application directory (although the "copy" can be just a symlink; appcfg.py will follow symlinks and upload them.)
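For the "build a list as part of your deploy process" option, here is a minimal sketch of a pre-deploy script that writes a manifest the application can read at runtime; the directory and manifest file names are placeholders:
import json
import os

STATIC_DIR = "static"              # the directory marked static in app.yaml (placeholder)
MANIFEST = "static_manifest.json"  # deployed alongside the code, so readable at runtime

paths = []
for root, _, names in os.walk(STATIC_DIR):
    for name in names:
        paths.append(os.path.relpath(os.path.join(root, name), STATIC_DIR))

with open(MANIFEST, "w") as f:
    json.dump(sorted(paths), f, indent=2)
The application can then json.load the manifest to list the static files without needing copies of the files themselves.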
You can just open them (read-only).