I am running a Python script that writes Excel files and puts them on my EC2 instance. However, I have noticed that these Excel files, although they are created, only appear on the server once the code stops.
I guess they are kept in cache but I would like them to be added to the server straight away. Is there a "commit()" to add to the code?
Many thanks
No. It isn't possible to stream or write a partial xlsx file the way you can with a CSV or HTML file: the file format is a collection of XML files in a Zip container, and it can't be generated until the file is closed.
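To illustrate why the file only becomes usable on close, here is a minimal sketch that uses the standard-library zipfile module as a stand-in for an xlsx writer (it is not the code any Excel library actually runs). The entry name and function name are made up for the illustration. A zip archive gets its central directory only when it is closed, so a reader rejects it before that point.

```python
import io
import zipfile


def container_validity_before_and_after_close():
    """Return whether a zip buffer is a readable archive before and after close()."""
    buffer = io.BytesIO()
    archive = zipfile.ZipFile(buffer, mode="w")
    # Hypothetical stand-in for one of the XML parts inside a real .xlsx file.
    archive.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")

    # The central directory has not been written yet, so this is not a valid zip.
    valid_before = zipfile.is_zipfile(io.BytesIO(buffer.getvalue()))

    archive.close()  # writes the central directory that makes the archive readable
    valid_after = zipfile.is_zipfile(io.BytesIO(buffer.getvalue()))
    return valid_before, valid_after
```

If the script keeps its workbooks open until it exits (an assumption, since the original code is not shown), closing or saving each workbook as soon as it is complete should make each file appear earlier.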
Related
It would be really useful (and cool) if I were able to load a csv file to a pandas dataframe, in the browser, using pyscript, without starting a local server. It would allow me to create easily distributable tools.
Is it even possible?
The closest I've seen is this code. It doesn't really load the csv to a single dataframe object (that I can then manipulate). It does skip the need to start a local server and displays the csv file in the browser.
I'm working on a project that needs to update a CSV file with user info periodically. The CSV is stored in an S3 bucket, so I'm assuming I would use boto3 to do this. However, I'm not exactly sure how to go about it. Would I need to download the CSV from S3 and then append to it, or is there a way to do it directly? Any code samples would be appreciated.
Ideally this would be something where DynamoDB would work pretty well (as long as you can create a hash key). Your solution would require the following.
Download the CSV
Append new values to the CSV file
Upload the CSV.
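The three steps above could look roughly like this. It is a sketch, not a definitive implementation: s3_client is assumed to be a boto3 S3 client (boto3.client("s3")), the function name is made up, and the file is assumed to be UTF-8 text small enough to hold in memory.

```python
import csv
import io


def append_rows_to_s3_csv(s3_client, bucket, key, rows):
    """Download a CSV object, append rows in memory, and upload the result."""
    # Step 1: download the current contents.
    response = s3_client.get_object(Bucket=bucket, Key=key)
    text = response["Body"].read().decode("utf-8")

    # Step 2: append the new rows, making sure the old data ends with a newline.
    buffer = io.StringIO()
    buffer.write(text)
    if text and not text.endswith("\n"):
        buffer.write("\n")
    csv.writer(buffer, lineterminator="\n").writerows(rows)

    # Step 3: upload, replacing the old object.
    updated = buffer.getvalue()
    s3_client.put_object(Bucket=bucket, Key=key, Body=updated.encode("utf-8"))
    return updated
```

Nothing here locks the object, so two writers running this at the same time can overwrite each other's rows.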
A big issue here is the possibility (not sure how this is planned) that the CSV file is updated multiple times before being uploaded, which would lead to data loss.
Using something like DynamoDB, you could have a table and just use the put_item API call to add new values as you see fit. Then, whenever you wish, you could write a Python script to scan for all the values and write them out as a CSV file.
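A rough sketch of that DynamoDB idea follows. It assumes table is a boto3 DynamoDB Table resource (for example boto3.resource("dynamodb").Table("users"), where the table name is hypothetical); the function names and field names are made up for illustration.

```python
import csv
import io


def add_user(table, user):
    """Store one record; `user` must include the table's key attribute."""
    table.put_item(Item=user)


def export_table_to_csv(table, fieldnames):
    """Scan every item in the table and return the rows as CSV text."""
    response = table.scan()
    items = list(response.get("Items", []))
    # scan() returns results in pages; follow LastEvaluatedKey until it is gone.
    while "LastEvaluatedKey" in response:
        response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        items.extend(response.get("Items", []))

    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()
```

Because each put_item call adds or replaces a single item, concurrent writers do not overwrite each other's rows the way a download-append-upload cycle can.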
Short Explanation
Some csv files are incoming on a OneDrive Server which is synced onto a machine where a script is running to read them and push them onto BigQuery. And while the script is running fine now, I intend to run it only after all files are synced completely (i.e. available offline) on that machine since last push...
Long Explanation
So basically I use a local database for the sales history of our organization, which I want to push to BigQuery as well to reflect real-time (lagged) info on dashboards and for other analyses, as a lot of other data besides sales history resides there. Since the database is strictly on-premises and cannot be accessed outside the organization’s network (so literally no way to link it to BigQuery!), I have some people there who export day-to-time sales (sales from the start of the day until the time of export) periodically (every 1-2 hrs) from the database and upload them to OneDrive. I have OneDrive on a machine where many other scripts are hosted (it's just convenient!) and I run a Python script that reads all the CSVs, combines them and pushes them to BigQuery. Often there are duplicates, so it is necessary to read all the files, remove duplicates and then push them to BigQuery, for which I use:
import os
import pandas as pd
from google.oauth2 import service_account

# input_directory, project_id and table are defined elsewhere in the script.
# Keep only files whose names contain at most one hyphen.
files = [file for file in os.listdir(input_directory) if file.count('-') <= 1]
data = [pd.read_excel(os.path.join(input_directory, file)) for file in files if file.endswith('.xlsx')]
all_data = pd.concat(data, ignore_index=True).drop_duplicates()

def upload():
    all_data.to_gbq(project_id=project_id,
                    destination_table=table,
                    credentials=service_account.Credentials.from_service_account_file(
                        'credentials.json'),
                    progress_bar=True,
                    if_exists='replace')
What I am trying to do is only update the BigQuery table if there are new changes when the script runs, since they don’t always have time to export.
My current approach is to write the length of the dataframe to a file at the end of the script:
with open("length.txt", "w") as f:
    f.write(str(len(all_data)))  # write() needs a string, not an int
and once all files are read in df, I use:
with open("length.txt") as f:
    previous_length = int(f.readline())
if len(all_data) > previous_length:
    upload()
But doing this requires all files to be read into RAM. Reading so many files makes things a bit congested for the other scripts on the machine (RAM-wise), so I do not want to read them all into RAM as my current approach does.
I tried accessing file attributes and building logic based on the modified date, but whenever a new file is added the date changes, even when the file is not yet fully downloaded on the machine. I also searched for ways to access the sync status of files and came across "Determine OneDrive Sync Status From Batch File", but that did not help. Any help improving this situation is appreciated!
We have similar workflows to this where we load data from files into a database regularly by script. For us, once a file has been processed, we move it to a different directory as part of the Python script. This way, the script can load all data from all files in the directory, as it is definitely new data.
If the files are cumulative (contain old data as well as new data) and therefore you only want to load any rows that are new, this is where it gets tricky. You are definitely on the right track, as we use the modified date to ascertain whether the file has changed since we last processed it. In Python you can get this from the os library: os.path.getmtime(file_path).
This returns the time the file's contents were last modified, as a number of seconds since the epoch, on any operating system.
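As a sketch of the modified-date check, the function below remembers each file's os.path.getmtime value in a small JSON file and reports which files differ on the next run, without opening any of the data files. The function name and the JSON state file are assumptions for illustration; keep the state file outside the watched directory so it does not list itself.

```python
import json
import os


def files_changed_since_last_run(directory, state_path):
    """Return names of files whose modification time differs from the saved one."""
    try:
        with open(state_path) as f:
            previous = json.load(f)
    except FileNotFoundError:
        previous = {}  # first run: every file counts as changed

    current = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            current[name] = os.path.getmtime(path)

    changed = [name for name, mtime in current.items() if previous.get(name) != mtime]

    # Save the new snapshot for the next run.
    with open(state_path, "w") as f:
        json.dump(current, f)
    return changed
```

This does not tell you whether OneDrive has finished syncing a file; a partially downloaded file will also show up as changed.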
I recommend just moving the files out of the folder containing new files once they are loaded, to make it easier for your Python script to handle. I do not know much about OneDrive though, so I cannot help with that aspect.
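Moving a file once it has been loaded could be sketched like this. The function name and the idea of a dedicated "processed" directory are illustrative assumptions, and how OneDrive reacts to files leaving the synced folder is not covered here.

```python
import os
import shutil


def archive_processed_file(path, processed_dir):
    """Move a file that has been loaded into a separate 'processed' directory."""
    os.makedirs(processed_dir, exist_ok=True)
    destination = os.path.join(processed_dir, os.path.basename(path))
    shutil.move(path, destination)
    return destination
```

After each successful load, calling this on every file that was read leaves only unprocessed files in the input folder for the next run.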
Good luck!
Good morning all.
I have a generic question about the best approach to handle large files with Django.
I created a python project where the user is able to read a binary file (usually the size is between 30-100MB). Once the file is read, the program processes the file and shows relevant metrics to the user. Basically it outputs the max, min, average, std of the data.
At the moment, you can only run this project from the cmd line. I'm trying to create a user interface so that anyone can use it. I decided to create a webpage using django. The page is very simple. The user uploads files, he then selects which file he wants to process and it shows the metrics to the user.
Working on my local machine I was able to implement it. I upload the files (it saves on the user's laptop and then it processes it). I then created an S3 account, and now the files are all uploaded to S3. The problem that I'm having is that when I try to get the file (I'm using smart_open (https://pypi.org/project/smart-open/)) it is really slow to read the file (for a 30MB file it's taking 300sec), but if I download the file and read it, it only takes me 8sec.
My question is: What is the best approach to retrieve files from S3, and process them? I'm thinking of simply downloading the file to my server, process it, and then deleting it. I've tried this on my localhost and it's fast. Downloading from S3 takes 5sec and processing takes 4sec.
Would this be a good approach? I'm a bit afraid that for instance if I have 10 users at the same time and each one creates a report then I'll have 10*30MB = 300MB of space that the server needs. Is this something practical, or will I fill up the server?
Thank you for your time!
Edit
To give a bit more context, what's making it slow is the f.read() line. Due to the format of the binary file, I have to read the file in the following way:
name = f.read(30)
unit = f.read(5)
data_length = f.read(2)  # raw bytes; must be converted to an integer before use as a length
# The read below is the part that takes a lot of time when I read directly from S3.
# If I download the file first, it is super fast.
data = f.read(data_length)
All,
After some experimenting, I found a solution that works for me.
import os
import boto3
s3 = boto3.client('s3')
with open('temp_file_name', 'wb') as data:
    s3.download_fileobj(Bucket='YOURBUCKETNAME', Key='YOURKEY', Fileobj=data)
read_file('temp_file_name')  # read_file is my own parsing function
os.remove('temp_file_name')
I don't know if this is the best approach or what its possible downsides are. I'll use it and come back to this post if I end up using a different solution.
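One possible downside of a fixed temporary file name is that two users processing files at the same time would write to the same path, and an error during processing would leave the file behind. A hedged variation, with made-up function and parameter names, uses a uniquely named temporary file and always deletes it. Here download is any callable that writes into an open binary file (for example lambda f: s3.download_fileobj(bucket, key, f)) and process takes a file path.

```python
import os
import tempfile


def process_via_temp_file(download, process):
    """Download into a uniquely named temp file, process it, and always delete it."""
    fd, temp_path = tempfile.mkstemp(suffix=".bin")
    try:
        with os.fdopen(fd, "wb") as temp_file:
            download(temp_file)
        return process(temp_path)
    finally:
        # Runs even if the download or the processing raises.
        os.remove(temp_path)
```

Disk usage at any moment is then bounded by the number of requests being processed at once times the file size.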
The problem with my previous approach was that f.read() was taking too long. It seems that every time I need to read more data, the program needs to connect to S3 (or something similar), and this is slow. What ended up working for me was to download the file to my server, read it, and then delete it once it had been read. Using this solution I got the same speeds I was getting when working locally (reading directly from my laptop).
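If the slowdown really does come from many small reads each going over the network, an alternative worth trying is to fetch the whole object once and parse it from memory. The sketch below parses the record layout described in the question (30-byte name, 5-byte unit, 2-byte length, then the data) from any binary stream. The byte order, the unsigned 16-bit length, the assumption that the length counts bytes, and the function name are all guesses for illustration.

```python
import io
import struct


def read_records(stream):
    """Parse (name, unit, data) records from a binary stream until it ends."""
    records = []
    while True:
        header = stream.read(37)  # 30-byte name + 5-byte unit + 2-byte length
        if len(header) < 37:
            break
        # "<30s5sH": little-endian, 30 bytes, 5 bytes, unsigned 16-bit integer.
        # Byte order and integer type are assumptions, not taken from the post.
        name, unit, data_length = struct.unpack("<30s5sH", header)
        data = stream.read(data_length)
        records.append((name.rstrip(b"\x00 "), unit.rstrip(b"\x00 "), data))
    return records
```

With boto3, the stream could be io.BytesIO(s3.get_object(Bucket=..., Key=...)["Body"].read()), which trades disk space for holding the whole file in RAM.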
If you are working with medium-sized files (30-50 MB) this approach seems to work. My only concern is whether the server will run out of disk space if we try to download a really large file.
I have two Python files that create and read text from a .txt file; in order to work, they need to know the information inside the .txt file.
In Heroku I have a scheduler that runs one file, then the other. The big problem is that the files are reset every time to their state from the original repo. How can I get around this?
Heroku does not offer a persistent file system. You will need to store them in another service (like S3), or depending on what the contents of your files are, redesign to write and read from a database instead.