I want to save a dataframe to S3, but when I save the file to S3, it also creates an empty file named after the folder (${folder_name}) in which I want to save the file.
Syntax to save the dataframe:
df.write.parquet("s3n://bucket-name/shri/test")
It saves the file in the test folder, but it also creates $test under shri.
Is there a way I can save it without creating that extra folder?
I was able to do it by using the code below.
df.write.parquet("s3a://bucket-name/shri/test.parquet", mode="overwrite")
As far as I know, there is no way to control the naming of the individual Parquet files. When you write a DataFrame to Parquet, you specify what the directory name should be, and Spark creates the appropriate Parquet part files under that directory.
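If a single, specifically named file is really needed, one common workaround is to write a single partition to a temporary prefix and then copy the resulting part file to the desired key. A minimal sketch with boto3, where the bucket, temporary prefix, and target key are placeholders I made up for illustration:

import boto3

bucket = "bucket-name"            # hypothetical names for illustration
tmp_prefix = "shri/_tmp_test/"
target_key = "shri/test.parquet"

# 1) On the Spark side, write a single part file to the temporary prefix:
# df.coalesce(1).write.mode("overwrite").parquet(f"s3a://{bucket}/{tmp_prefix}")

# 2) On the boto3 side, copy that part file to the desired key and clean up
s3 = boto3.client("s3")
objects = s3.list_objects_v2(Bucket=bucket, Prefix=tmp_prefix)["Contents"]
part_key = next(o["Key"] for o in objects if o["Key"].endswith(".parquet"))

s3.copy_object(Bucket=bucket,
               CopySource={"Bucket": bucket, "Key": part_key},
               Key=target_key)
for obj in objects:
    s3.delete_object(Bucket=bucket, Key=obj["Key"])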
Related
Is there a workaround for using pandas ExcelWriter to append to an fsspec URL? I am working out of OneDrive and need to automatically append each new xlsx file that gets uploaded to the OneDrive folder to a master xlsx file (a new xlsx file gets added to the OneDrive folder daily, and I need to build a master list without changing previous data), but append mode does not work with fsspec URLs and overwrites the master xlsx file.
This script runs automatically on a trigger and picks up any new .xlsx files in the OneDrive folder. The columns are exactly the same, but the rows vary and the file names are not consistent (other than the .xlsx extension), so I do not think I can manipulate the start row or call a specific file name.
Is there a workaround for this? Essentially, I want a master xlsx file in OneDrive that grows and updates with each xlsx export that gets uploaded to the OneDrive folder every day.
I tried...
import pandas as pd

# excel_merged is the combined DataFrame of new data built earlier in the script
with pd.ExcelWriter(
    "/Users/silby/OneDrive/test/dataTest.xlsx",
    mode='a',
    engine='openpyxl',
    if_sheet_exists='overlay'
) as writer:
    excel_merged.to_excel(writer)
and expected it to append to the dataTest.xlsx file, but it overwrites the existing data instead.
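A sketch of one possible workaround, assuming the master file path from above and that excel_merged already holds the new rows: read the existing master file back into pandas, concatenate the new data, and rewrite the whole file in write mode instead of append mode.

import pandas as pd

master_path = "/Users/silby/OneDrive/test/dataTest.xlsx"

# Read the current master file, append the new rows, and rewrite the whole file
existing = pd.read_excel(master_path)
combined = pd.concat([existing, excel_merged], ignore_index=True)
combined.to_excel(master_path, index=False)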
I'm working on a project that needs to update a CSV file with user info periodically. The CSV is stored in an S3 bucket, so I'm assuming I would use boto3 to do this. However, I'm not exactly sure how to go about it: would I need to download the CSV from S3 and then append to it, or is there a way to do it directly? Any code samples would be appreciated.
Ideally, this is a case where DynamoDB would work pretty well (as long as you can create a hash key). Your current approach would require the following steps (see the sketch after this list).
Download the CSV
Append new values to the CSV file
Upload the CSV.
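A minimal sketch of that download/append/upload cycle with boto3 and pandas; the bucket, key, and column names are placeholders I made up for illustration:

import boto3
import pandas as pd

s3 = boto3.client("s3")
bucket, key = "my-bucket", "users.csv"      # hypothetical bucket and key
local_path = "/tmp/users.csv"

# Download the CSV, append a row, and upload it again
s3.download_file(bucket, key, local_path)
df = pd.read_csv(local_path)
df.loc[len(df)] = {"user_id": 123, "name": "example"}   # assumed columns
df.to_csv(local_path, index=False)
s3.upload_file(local_path, bucket, key)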
A big issue here is the possibility (I'm not sure how this is planned) that the CSV file gets updated multiple times between uploads, which could lead to data loss.
Using something like DynamoDB, you could have a table and just use the put_item API call to add new values as you see fit. Then, whenever you wish, you could write a Python script to scan all the values and write out a CSV file however you like!
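For example, a hedged sketch of that approach with boto3; the table name and attribute names are assumptions, not from the question:

import boto3
import csv

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("user-info")          # hypothetical table name

# Add a record whenever new user info arrives
table.put_item(Item={"user_id": "123", "name": "example"})

# Later, scan everything and dump it to a CSV
items = table.scan()["Items"]
with open("users.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["user_id", "name"])  # assumed attributes
    writer.writeheader()
    writer.writerows(items)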
I'm reading a folder of parquet files on s3, let's say parquet_dir="s3://parquet_dir/".
I'm using the line pd.read_parquet(parquet_dir) to read all of the files at once. The problem is that I don't want the DataFrames to be concatenated together; instead, I would like a dict holding parquet_path: parquet_df for each of the Parquet files in the folder, or something like that.
Note that I am trying to avoid calling pd.read_parquet on each of the files in the folder, because that is very slow compared to loading the whole directory at once.
Is there any way to do this?
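One option might be to go through pyarrow.dataset, which still discovers the whole prefix in one pass but lets you materialize each file separately. A sketch, assuming the files live under the same parquet_dir prefix as in the question:

import pyarrow.dataset as ds

parquet_dir = "s3://parquet_dir/"

# Discover all parquet files under the prefix in a single pass
dataset = ds.dataset(parquet_dir, format="parquet")

# Build {file path: DataFrame} instead of one concatenated frame
dfs = {frag.path: frag.to_table().to_pandas() for frag in dataset.get_fragments()}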
I am quite new to PySpark. I am trying to read and then save a CSV file using Azure Databricks.
After saving the file, I see many other files like "_committed_", "_started_", and "_SUCCESS", and finally the CSV file itself with a totally different name.
I have already tried DataFrame repartition(1) and coalesce(1), but those only address the CSV output being split into multiple partitions by Spark. Is there anything else that can be done using PySpark?
You can do the following:
df.toPandas().to_csv("path/to/file.csv")
It will create a single CSV file, as you expect.
Those are default log/marker files created when saving from PySpark; we can't eliminate them.
Using coalesce(1), you can save the output as a single file instead of multiple partitions.
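A minimal sketch of that, assuming df is the DataFrame being written and the output path is a placeholder:

# Collapse to one partition so Spark writes a single part-*.csv file
(df.coalesce(1)
   .write
   .mode("overwrite")
   .option("header", "true")
   .csv("dbfs:/tmp/output_dir"))

# The marker files (_SUCCESS, _committed_*, _started_*) are still written;
# the single part file can be copied/renamed afterwards with dbutils.fs.cp if needed.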
I'm trying to save a DataFrame to a path as Parquet files. The issue is that the display() function shows a bunch of results in "Prop_0", but whenever I try to save them, only the first one gets converted and written to the path.
The code I'm using is:
dbutils.fs.rm(Path_1, True)
avroFile = spark.read.format('com.databricks.spark.avro').load(Path_1)
avroFile.write.mode("overwrite").save(Path_2, format="parquet")
This is expected behaviour: Spark uses the Hadoop file format, which requires data to be partitioned; that's why you have part- files.
I'm able to run the above code without any issue.
You may use the below method to save a Spark DataFrame as Parquet files.
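The original answer's snippet isn't included above; a minimal sketch of one way to do it, reusing the Avro reader format from the question and with placeholder paths, might look like this:

# Read the Avro input and write it back out as Parquet (paths are placeholders)
avro_df = spark.read.format("com.databricks.spark.avro").load("dbfs:/mnt/source/avro_dir")
avro_df.write.mode("overwrite").parquet("dbfs:/mnt/target/parquet_dir")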