I'm working on a project that needs to update a CSV file with user info periodically. The CSV is stored in an S3 bucket, so I'm assuming I would use boto3 to do this. However, I'm not exactly sure how to go about it. Would I need to download the CSV from S3 and then append to it, or is there a way to update it directly? Any code samples would be appreciated.
Ideally this would be something where DynamoDB would work pretty well (as long as you can create a hash key). Your solution would require the following.
Download the CSV.
Append the new values to the CSV file.
Upload the CSV.
A big issue here is the possibility (I'm not sure how this is planned) that the CSV file is updated more than once between a download and the following upload, in which case one update overwrites the other and data is lost.
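The three steps above can be sketched roughly as follows. This is a minimal sketch, assuming the file is small enough to hold in memory and is UTF-8 encoded; the function name and the bucket and key values are hypothetical. It does nothing about the concurrent-update problem just mentioned.

```python
import csv
import io


def append_rows_to_s3_csv(s3_client, bucket, key, new_rows):
    """Download a CSV object, append rows in memory, and upload it again."""
    # S3 objects cannot be appended to in place, so read the whole object first.
    response = s3_client.get_object(Bucket=bucket, Key=key)
    existing_text = response["Body"].read().decode("utf-8")

    buffer = io.StringIO()
    buffer.write(existing_text)
    # Make sure the new rows start on their own line.
    if existing_text and not existing_text.endswith("\n"):
        buffer.write("\n")
    csv.writer(buffer, lineterminator="\n").writerows(new_rows)

    # Overwrite the object with the combined content.
    combined = buffer.getvalue()
    s3_client.put_object(Bucket=bucket, Key=key, Body=combined.encode("utf-8"))
    return combined
```

With boto3 installed you would pass in boto3.client("s3") as s3_client, for example append_rows_to_s3_csv(s3, "my-bucket", "users.csv", [["42", "alice"]]).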
Using something like DynamoDB, you could have a table and just use the put_item API call to add new values as you see fit. Then, whenever you wish, you could write a Python script that scans for all the values and writes them out as a CSV file in whatever layout you like.
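A rough sketch of that approach, assuming table is a boto3 DynamoDB Table resource (for example boto3.resource("dynamodb").Table("users"), where the table name is hypothetical) and that you already know which attributes you want as columns. The helper names are made up for illustration.

```python
import csv
import io


def add_user(table, user):
    """Store one user record with a single put_item call."""
    table.put_item(Item=user)


def export_table_to_csv(table, fieldnames):
    """Scan the whole table and return its items as CSV text."""
    response = table.scan()
    items = list(response.get("Items", []))
    # A scan is paginated, so keep going while LastEvaluatedKey is present.
    while "LastEvaluatedKey" in response:
        response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        items.extend(response.get("Items", []))

    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        extrasaction="ignore",  # skip attributes that are not wanted as columns
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()
```

Because every writer only adds its own item, two updates arriving at the same time no longer overwrite each other the way two uploads of the same CSV would.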
So I have a lot of data in an Azure blob storage. Each user can upload some cases, and the end result can be represented as a series of pandas DataFrames. Now I want to be able to display some of this data on our site, but the files are several hundred MB each and there is no need to download all of it. What would be the best way to get part of a DataFrame?
I could make a folder structure in each blob storage containing the different columns of each DataFrame, and perhaps a more compact summary of the columns, but I would like to keep it in one file if possible.
I could also set up a database containing the info but I like the structure as it is - completely separated in cases.
Originally I thought I could do it in hdf5 but it seems that I need to download the entire file from the blob storage to my API backend before I can run my python code on it. I would prefer if I could keep the hdf5 files and get the parts of the columns from the blob storage directly but as far as I can see that is not possible.
I am thinking this is something that has been solved a million times before but it is a bit out of my domain so I have not been able to find a good solution for it.
Check out the BlobClient of the Azure Python SDK. The download_blob method might suit your needs. Use chunks() to get an iterator that lets you iterate over the file in chunks. You can also set other parameters to ensure that a chunk doesn't exceed a set size.
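As a minimal sketch of that idea, assuming the azure-storage-blob v12 package and that you already know which byte range of the blob you need: download_blob accepts offset and length, and the returned downloader exposes chunks(). The function name read_blob_range is made up for illustration.

```python
def read_blob_range(blob_client, offset, length):
    """Download only `length` bytes of a blob, starting at byte `offset`."""
    downloader = blob_client.download_blob(offset=offset, length=length)
    data = bytearray()
    # chunks() yields the requested range piece by piece instead of all at once.
    for chunk in downloader.chunks():
        data.extend(chunk)
    return bytes(data)
```

Here blob_client would be an azure.storage.blob.BlobClient, created for instance with BlobClient.from_connection_string. A byte range only helps if the file layout tells you where the wanted columns sit, which is not generally the case for an HDF5 file.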
I have a DataFrame that I would like to store as a CSV file in a Sharepoint.
It seems that the only way is to first save the CSV file locally and then upload it to SharePoint using SharePlum.
Is there a way to directly save DataFrame into Sharepoint as CSV file, without creating a local file?
Thanks a lot for your help.
It should be possible to write the CSV content to an in-memory buffer (e.g. StringIO or BytesIO) rather than to a local file - here is an example (last section of the page).
After that, you could use a library for writing the content directly to a Sharepoint: This discussion shows several approaches how to do that, including the Office365-REST-Python-Client and also SharePlum, which you have already mentioned.
Here are two more sources (Microsoft technical doc) that you might find useful:
How can I upload a file to Sharepoint using Python?
How to get and upload files from sharepoint with python?
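The in-memory buffer idea from this answer might look like the sketch below. It assumes df is a pandas DataFrame; the function name is hypothetical, and the actual upload call depends on whichever SharePoint library you pick, so it is left out.

```python
import io


def dataframe_to_csv_bytes(df, encoding="utf-8"):
    """Serialize a DataFrame to CSV bytes without creating a local file."""
    buffer = io.StringIO()
    # to_csv accepts a file-like object in place of a file path.
    df.to_csv(buffer, index=False)
    return buffer.getvalue().encode(encoding)
```

The returned bytes can then be handed to the upload function of the SharePoint client as the file content.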
Situation: A csv lands into AWS S3 every month. The vendor adds/removes/modifies columns from the file as they please. So the schema is not known ahead of time. The requirement is to create a table on-the-fly in Snowflake and load the data into said table. Matillion is our ELT tool.
This is what I have done so far.
Set up a Lambda that detects the arrival of the file, converts it to JSON, uploads it to another S3 directory and adds the filename to SQS.
Matillion detects the SQS message and loads the file with the JSON data into a VARIANT column in a SF table.
A SF stored procedure takes the VARIANT column and generates a table based on the number of fields in the JSON data. The VARIANT column in SF only works this way if it holds JSON data. CSV is sadly not supported.
This works with 10,000 rows. The problem arises when I run it on a full file, which is over 1 GB and has more than 10M rows. It crashes the Lambda job with an out-of-disk-space error at runtime.
These are the alternatives I have thought of so far:
Attach an EFS volume to the Lambda and use it to store the JSON file prior to the upload to S3. JSON data files are much larger than their CSV counterparts, so I expect the JSON file to be around 10-20 GB, since the file has over 10M rows.
Matillion has an Excel Query component that can take the headers, create a table on the fly and load the file. I was thinking I could convert the header row from the CSV into an Excel file within the Lambda, pass it over to Matillion, have it create the structures and then load the CSV file once the structure is created.
What are my other options here? Considerations include a nice repeatable design pattern to be used for future large CSVs or similar requirements, the cost of EFS, and whether I am making the best use of the tools that are available to me. Thanks!!!
Why not split the initial CSV file into multiple files and then process each file the same way you currently do?
Why are you converting the CSV into JSON? CSV can be loaded directly into a table without the extra transformation that JSON requires (a lateral flatten to turn the JSON into relational rows). And why not use Snowflake's Snowpipe feature to load the data directly into Snowflake, without Matillion? You can split large CSV files into smaller chunks before loading them into Snowflake; this helps distribute the processing load across SF warehouses.
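Splitting the file, as both answers suggest, could be sketched like this. It is a minimal standard-library example that streams the input row by row and repeats the header in every chunk; the function names and the chunk size are hypothetical.

```python
import csv
import io


def split_csv(source, rows_per_chunk):
    """Yield CSV text chunks, each starting with the original header row."""
    reader = csv.reader(source)
    header = next(reader, None)
    if header is None:
        return  # empty input, nothing to yield

    rows = []
    for row in reader:
        rows.append(row)
        if len(rows) == rows_per_chunk:
            yield _to_csv_text(header, rows)
            rows = []
    if rows:
        yield _to_csv_text(header, rows)


def _to_csv_text(header, rows):
    """Render one header plus its rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()
```

Only one chunk is held in memory at a time, so each piece can be uploaded or processed before the next one is read.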
I also load CSV files from SFTP into Snowflake, using Matillion, with no idea of the schema.
In my process, I create a "temp" table in Snowflake, with 50 VARCHAR columns (Our files should never exceed 50 columns). Our data always contains text, dates or numbers, so VARCHAR isn't a problem. I can then load the .csv file into the temp table. I believe this should work for files coming from S3 as well.
That will at least get the data into Snowflake. How to create the "final" table however, given your scenario, I'm not sure.
I can imagine being able to use the header row, and/or doing some analysis on the 'type' of data contained in each column, to determine the column type needed.
But if you can get the 'final' table created, you could move the data over from temp. Or alter the temp table itself.
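Using the header row to build the table, as imagined above, might start from something like this sketch. It assumes every column is created as VARCHAR, as in the temp-table approach; the function name and the name-cleaning rules are made up for illustration, and duplicate column names are not handled.

```python
import csv
import io
import re


def create_table_sql(table_name, header_line):
    """Build a CREATE TABLE statement with one VARCHAR column per CSV header."""
    header = next(csv.reader(io.StringIO(header_line)))
    columns = []
    for position, raw_name in enumerate(header, start=1):
        # Replace anything that is not a letter, digit or underscore.
        name = re.sub(r"\W+", "_", raw_name.strip()).strip("_").upper()
        if not name:
            name = f"COL_{position}"  # blank header cell
        elif name[0].isdigit():
            name = f"COL_{name}"  # avoid identifiers that start with a digit
        columns.append(f"{name} VARCHAR")
    return f"CREATE TABLE {table_name} ({', '.join(columns)})"
```

For example, create_table_sql("VENDOR_TEMP", "id,First Name") gives CREATE TABLE VENDOR_TEMP (ID VARCHAR, FIRST_NAME VARCHAR). Column types could be refined afterwards by analysing the loaded data.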
This can be achieved using an external table that is mapped to a single column, with the newline character as the delimiter. The external table also has a special virtual column, which can be processed to extract all the columns dynamically; a stored procedure can then create a table based on the number of columns present at any given time. There is an interesting video which talks about this limitation in Snowflake (https://youtu.be/hRNu58E6Kmg).
I am new to coding and I would like to know whether it is possible to load multiple Excel documents into one dataset using Python. If so, what is the code for this? All of the code I have seen is for loading a single Excel document. Moreover, do I have to convert the data into CSV form first, or can I use code to convert it into CSV after loading it?
I am using Jupyter Notebook in Anaconda to run my Python code.
Your assistance is greatly appreciated.
By uploading, do you mean reading a file? If so, just create a list or dictionary, open the files and read them one by one into your list or dictionary. It would also really help to create CSV files first. If you want to do that manually, you can easily save each file as CSV from Excel.
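A minimal sketch of that suggestion, assuming the Excel files have already been saved as UTF-8 CSV files in one folder and share the same header row. The function name and the extra source_file field are made up for illustration.

```python
import csv
from pathlib import Path


def load_csv_files(folder):
    """Read every .csv file in `folder` into one list of row dictionaries."""
    rows = []
    for path in sorted(Path(folder).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                row["source_file"] = path.name  # remember where each row came from
                rows.append(row)
    return rows
```

Calling load_csv_files("data") in a notebook cell returns one combined list that you can then filter or analyse as a single dataset.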
I have data in an Excel file that I would like to use to create a case in PSSE. The data is organized as it would appear in a case in PSSE (i.e. for a bus: bus number, name, base kV, and so on). Of course the data can be entered manually, but I'm working with over 500 buses. I have tried copying and pasting, but that seems to work only sometimes. For machine data, it barely works.
Is there a way to import this data into PSSE from an Excel file? I have recently started running PSSE with Python, so maybe there is a way to do it from there?
--
MK.
Yes. You can import data from an Excel file into PSSE using the Python package xlrd; however, I would recommend converting your Excel file to CSV before you import and using the csv module instead, as it is much easier. Importing data through the API is not just a copy-and-paste job into the nicely tabulated spreadsheet that PSSE shows for its case data.
Refer to the API documentation for PSSE, chapter II. Search this function, BUS_DATA_2. You will see that you can create buses with this function.
So your job should be threefold.
Import the CSV file data, with each line becoming a list of the data parameters for one bus (voltage, name, base kV, PU, etc.). Store each of these lists in another list.
Iterate through the new list you just created and call:
ierr = bus_data_2(i, intgar, realar, name)
and pass in your data from the CSV file (see the PSSE API documentation on how to do this). This will effectively load the data from the CSV file into your case (in the form of nodes or buses).
After you are finished, you will need to call a function called psspy.save("Casename.sav") to save your work in a new PSSE case.
Note: there are functions to load in line data, fix shunt data, generator data etc.
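The three steps above might be sketched as follows. This is only a sketch under loud assumptions: the CSV column names are hypothetical, and the layout used here for intgar (bus type, area, zone, owner) and realar (base kV, voltage magnitude, voltage angle) must be checked against the BUS_DATA_2 entry in the PSSE API documentation for your version. The API function is passed in as an argument so the helper does not depend on how psspy is imported.

```python
import csv


def load_buses(csv_path, bus_data_2):
    """Call `bus_data_2` once per CSV row and return the bus numbers that failed."""
    failed = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            number = int(row["number"])
            # ASSUMED argument layout: verify it in the PSSE API manual.
            intgar = [int(row["type"]), int(row["area"]),
                      int(row["zone"]), int(row["owner"])]
            realar = [float(row["base_kv"]), float(row["vm_pu"]),
                      float(row["va_deg"])]
            ierr = bus_data_2(number, intgar, realar, row["name"])
            if ierr != 0:  # a non-zero return code means the call was rejected
                failed.append(number)
    return failed
```

Inside a PSSE Python session you would call load_buses("buses.csv", psspy.bus_data_2), where buses.csv is a hypothetical file name, and then psspy.save("Casename.sav") as described above.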
Your other option is to call up the PTI folks as they can give you training.
Good luck
If you have an Excel data file with exactly the same "format" and same "info" as the regular case file (.sav), try this:
Open any small example .sav file from the example sub-folder of PSSE's installation folder.
In the PSSE GUI, copy each of your spreadsheets into the matching sheet of the working case (shown in spreadsheet view), i.e. the one holding the same "info" (say, bus, branch, etc.).
After you have finished copying everything, save the edited working case in the GUI as a new case.
If this doesn't work, I suggest you ask this question on the "Python for Power Systems" forum:
https://psspy.org/psse-help-forum/questions/