Save only the required CSV file using PySpark

I am quite new to PySpark. I am trying to read and then save a CSV file using Azure Databricks.
After saving the file I see many other files like "_committed", "_started", and "_SUCCESS", and finally a CSV file with a totally different name.
I have already tried repartition(1) and coalesce(1) on the DataFrame, but those only reduce the output to a single part file; they do not remove the extra files or let me choose the output file name. Is there anything else that can be done using PySpark?

You can do the following:
df.toPandas().to_csv('path/to/file.csv', index=False)
It will create a single CSV file with exactly the name you give it. Note that toPandas() collects the whole DataFrame into driver memory, so this only works for data that fits on the driver.

Those are metadata and marker files (for example _SUCCESS, plus the _committed/_started files written by Databricks' commit protocol) that Spark creates when saving output; they cannot be suppressed through the DataFrame writer. Using coalesce(1) you can at least save the data as a single part file, as shown in the sketch below.
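A minimal sketch of that approach, assuming a Databricks environment where dbutils is available (all paths and file names here are placeholders): write with coalesce(1), then copy the single part- file to the name you actually want.

# Write the DataFrame as one part file (placeholder output path).
output_dir = "/mnt/data/tmp_csv_out"
df.coalesce(1).write.mode("overwrite").option("header", "true").csv(output_dir)

# Locate the single part- file Spark produced and copy it to the desired name.
part_file = [f.path for f in dbutils.fs.ls(output_dir) if f.name.startswith("part-")][0]
dbutils.fs.cp(part_file, "/mnt/data/final_output.csv")

# Optionally remove the temporary directory along with the marker files.
dbutils.fs.rm(output_dir, True)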

Related

Writing to a CSV file in an S3 bucket using boto3

I'm working on a project that needs to update a CSV file with user info periodically. The CSV is stored in an S3 bucket, so I'm assuming I would use boto3 to do this. However, I'm not exactly sure how to go about it: would I need to download the CSV from S3 and then append to it, or is there a way to do it directly? Any code samples would be appreciated.
Ideally this would be a case where DynamoDB works well (as long as you can create a hash key). Your approach would require the following:
Download the CSV
Append new values to the CSV file
Upload the CSV.
A big issue here is the possibility (depending on how updates are scheduled) that the CSV file is modified by two writers between a download and the next upload, which would lead to data loss.
Using something like DynamoDB, you could have a table and just use the put_item API call to add new values as you see fit. Then, whenever you wish, you could write a Python script to scan all the values and produce a CSV file however you like; the sketch below shows the download/append/upload cycle.
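A minimal sketch of the download/append/upload cycle with boto3, assuming placeholder bucket and key names and no concurrent writers (per the caveat above):

import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "users.csv"  # hypothetical names

# Download the existing CSV content.
body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

# Append a new row, making sure the existing content ends with a newline.
if body and not body.endswith("\n"):
    body += "\n"
body += "alice,alice@example.com\n"

# Upload the updated file back to S3 (this overwrites the object).
s3.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))

For the DynamoDB route, a single put_item call per update replaces this whole cycle and avoids the lost-update problem.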

How to have a pandas DataFrame saved into SharePoint as a CSV file?

I have a DataFrame that I would like to store as a CSV file in SharePoint.
It seems that the only way is to first save the CSV file locally and then, using SharePlum, upload the file to SharePoint.
Is there a way to directly save a DataFrame into SharePoint as a CSV file, without creating a local file?
Thanks a lot for your help.
It should be possible to write the CSV content to an in-memory text buffer (e.g. StringIO or BytesIO) rather than to a local file - here is an example (last section of the page).
After that, you could use a library to write the content directly to SharePoint: this discussion shows several approaches, including the Office365-REST-Python-Client and also SharePlum, which you have already mentioned.
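A minimal sketch of that idea using SharePlum, where the site URL, credentials, and folder path are placeholder assumptions:

import io

import pandas as pd
from shareplum import Office365, Site
from shareplum.site import Version

df = pd.DataFrame({"name": ["alice"], "email": ["alice@example.com"]})

# Render the CSV into an in-memory buffer instead of a local file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Authenticate and upload straight to a document library
# (URL, credentials, and folder are hypothetical).
authcookie = Office365("https://yourorg.sharepoint.com",
                       username="user@yourorg.com",
                       password="secret").GetCookies()
site = Site("https://yourorg.sharepoint.com/sites/yoursite",
            version=Version.v365, authcookie=authcookie)
folder = site.Folder("Shared Documents/reports")
folder.upload_file(buffer.getvalue(), "report.csv")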
Here are two more sources (Microsoft technical doc) that you might find useful:
How can I upload a file to Sharepoint using Python?
How to get and upload files from sharepoint with python?

How to upload multiple Excel documents into one dataset using Python?

I am new to coding and I would like to know whether it is possible to upload multiple Excel documents into one dataset using Python. If so, what is the code for this? All of the code I have seen is for uploading a single Excel document. Also, do I have to convert the data into CSV form first, or can I convert it to CSV after uploading it?
I am using Jupyter Notebook in Anaconda to run my Python code.
Your assistance is greatly appreciated.
By uploading, do you mean reading a file? If so, just create a list or dictionary, open the files, and read them one by one into your list/dictionary, as in the sketch below. Converting to CSV first is not required, though it can simplify things; you can do it manually by saving each file as CSV in Excel.
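A minimal sketch with pandas, assuming the workbooks live in a single folder and share the same columns (the folder path is a placeholder); pandas reads .xlsx directly, so no CSV conversion is needed:

import glob

import pandas as pd

# Read every Excel file in the folder (requires the openpyxl engine for .xlsx).
files = glob.glob("data/*.xlsx")  # hypothetical folder
frames = [pd.read_excel(f) for f in files]

# Stack them into one dataset; ignore_index renumbers the rows.
dataset = pd.concat(frames, ignore_index=True)
print(dataset.shape)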

Problem with saving Spark DataFrame as Parquet

I'm trying to save a DataFrame to a path as Parquet files. The issue is: the display() function shows a bunch of results in "Prop_0", but whenever I try to save them, only the first one seems to get converted and written to the path.
The code I'm using is:
# Remove any previous output, then convert the Avro files to Parquet.
dbutils.fs.rm(Path_2, True)
avroFile = spark.read.format('com.databricks.spark.avro').load(Path_1)
avroFile.write.mode("overwrite").save(Path_2, format="parquet")
This is expected behaviour: Spark writes output through the Hadoop file format layer, which partitions the data - that's why you get multiple part- files under the output directory. All of the rows are there; they are just spread across those files.
I'm able to run the above code without any issue.
You may save a Spark DataFrame as Parquet files along the lines of the sketch below.
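A minimal sketch (reusing the variables from the question; the paths are placeholders) that writes the Parquet output and reads it back to confirm nothing was dropped:

# Write the DataFrame as Parquet (this produces part- files under Path_2).
avroFile.write.mode("overwrite").parquet(Path_2)

# Reading the directory back returns the full dataset, not just one part file.
result = spark.read.parquet(Path_2)
print(result.count())
result.show(5)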

PySpark: save dataframe to S3

I want to save a dataframe to S3, but when I do, it also creates an empty ${folder_name} marker object alongside the folder in which I want to save the file.
Syntax to save the dataframe:
df.write.parquet("s3n://bucket-name/shri/test")
It saves the files in the test folder, but it also creates an empty $test marker under shri.
Is there a way I can save it without creating that extra folder?
I was able to do it by using the code below. Switching from the s3n:// connector to s3a:// is what avoids the extra marker objects, since those _$folder$ keys are directory markers created by the older s3n filesystem.
df.write.parquet("s3a://bucket-name/shri/test.parquet", mode="overwrite")
As far as I know, there is no way to control the naming of the actual Parquet files. When you write a dataframe to Parquet, you specify the directory name, and Spark creates the appropriately named part- files under that directory; see the sketch below if you need a specific object name.
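If a single, specifically named object is required, one workaround (a sketch, assuming boto3 access; the bucket, prefix, and key names are hypothetical) is to write one part file with coalesce(1) and then copy it to the desired key:

import boto3

# First, write a single part file from Spark:
# df.coalesce(1).write.parquet("s3a://bucket-name/shri/test.parquet", mode="overwrite")

s3 = boto3.client("s3")
bucket, prefix = "bucket-name", "shri/test.parquet"  # placeholders

# Find the part- file Spark wrote under the prefix.
objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)["Contents"]
part_key = next(o["Key"] for o in objects if "part-" in o["Key"])

# Copy it to the exact key you want, then remove the original.
s3.copy_object(Bucket=bucket, Key="shri/final.parquet",
               CopySource={"Bucket": bucket, "Key": part_key})
s3.delete_object(Bucket=bucket, Key=part_key)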
