I'm using the below code to write to a CSV file.
df.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").option("nullValue"," ").save("/home/user/test_table/")
When I execute it, I'm getting the following error:
java.lang.UnsupportedOperationException: CSV data source does not support null data type.
Could anyone please help?
I had the same problem (though not using that exact command with the nullValue option), and I solved it by using the fillna method.
I also realised that fillna was not working on the _corrupt_record column, so I dropped it since I didn't need it.
df = df.drop('_corrupt_record')
df = df.fillna("")
df.write.option('header', 'true').format('csv').save('file_csv')
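For reference, a minimal end-to-end sketch of how this fix could be combined with the write command from the question; the paths and option values are simply the ones shown above, not verified independently:

# Drop the corrupt-record column and fill remaining nulls with empty strings,
# then write with the original header/nullValue options from the question.
df_clean = df.drop('_corrupt_record').fillna("")
(df_clean.coalesce(1)
    .write.format("com.databricks.spark.csv")
    .option("header", "true")
    .option("nullValue", " ")
    .save("/home/user/test_table/"))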
I am using the below code to rename one of the column names in a CSV file.
input_filepath ='/<path_to_file>/2500.csv'
df_csv = spark.read.option('header', True).csv(input_filepath)
df1 = df_csv.withColumnRenamed("xyz", "abc")
df1.printSchema()
So, the above code works fine. However, I also want to convert the CSV to parquet format. If I am correct, the above code makes the changes in memory only and does not touch the actual file. Please correct me if I am wrong.
If the changes are kept in memory, how can I write the changes to a parquet file?
For the file format conversion, I will be using the below code snippet:
df_csv.write.mode('overwrite').parquet()
But I am not able to figure out how to use it in this case. Please suggest.
Note: I am using a Databricks notebook for all the above steps.
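For what it's worth, a minimal sketch of how the PySpark parquet writer is typically called; the output path here is a placeholder, not from the question:

# Write the renamed dataframe to a parquet output directory (placeholder path).
output_path = '/<path_to_output>/2500_parquet'
df1.write.mode('overwrite').parquet(output_path)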
Hi, I am not 100% sure if this will solve your issue, but can you try to save your dataframe like this:
df.to_parquet(PATH_WHERE_TO_STORE)
Let me know if that helped.
Edit: My workflow usually goes like this
Export dataframe as csv
Check visually if everything is correct
Run this function:
import pandas as pd
def convert_csv_to_parquet(path_to_csv):
df = pd.read_csv(path_to_csv)
df.to_parquet(path_to_csv.replace(".csv", ".parq"), index=False)
This goes in the direction of Rakesh Govindulas' comment.
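For example, a call to that helper might look like this (the path is just illustrative):

# Writes 2500.parq next to the original 2500.csv.
convert_csv_to_parquet('/<path_to_file>/2500.csv')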
I have been trying to export a table to a csv file. The table is copied to the clipboard and is ready to be put into a csv (at least manually).
I have seen that you can read with pandas anything that you have in the clipboard and assign it to a dataframe, so I tried this code.
df = pd.read_clipboard()
df
df.to_csv('data.csv')
However, I got this error:
pandas.errors.ParserError: Expected 10 fields in line 5, saw 16. Error could possibly be due to
quotes being ignored when a multi-char delimiter is used.
I have been looking for a solution or an alternative, but failed.
Thanks in advance!
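One thing that might be worth trying, as a sketch only: read_clipboard defaults to a whitespace regex separator, which counts as a multi-character delimiter, so passing an explicit single-character separator (assuming the copied table is tab-separated) may avoid the parser error.

import pandas as pd

# Assume the clipboard table is tab-separated; an explicit '\t' separator
# avoids the default multi-character whitespace regex.
df = pd.read_clipboard(sep='\t')
df.to_csv('data.csv', index=False)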
I have a data frame I created with my original data appended with the topics from topic modeling. I keep running into errors when trying to export the data table into csv.
I've tried both the csv module and pandas, but I get errors from both.
The data table has 1765 rows so writing the file row by row is not really an option.
When using pandas, the most common errors are
DataFrame constructor not properly called!
and
function object has no attribute 'to_csv'
Code used:
import pandas as pd
data = (before.head)
df = pd.DataFrame(before.head)
df.to_csv (r'C:\Users\***\Desktop\beforetopics.csv', index = False, header=True)
print (df)
For the CSV module, there have been several errors such as
iterable expected, not method
Basically, how do I export this table (screenshot attached) into a csv file?
What is the command that you're trying to run?
Try this:
dataframe.to_csv('file_name.csv')
Or if it is the unicode error that you're coming across,
Try this:
dataframe.to_csv('file_name.csv', header=True, index=False, encoding='utf-8')
Since your dataframe's name is before,
Try this:
before.to_csv('file_name.csv', header=True, index=False, encoding='utf-8')
You can use the to_csv function:
before.to_csv('file_name.csv')
If you need extra options, you can check the documentation from here.
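As a side note, the errors quoted above are consistent with before.head being passed without parentheses: it is a method object, not a DataFrame, so constructing a DataFrame from it or calling .to_csv on it fails. A minimal sketch, assuming the dataframe is named before:

# Write the full dataframe directly rather than the .head method object.
before.to_csv(r'C:\Users\***\Desktop\beforetopics.csv', index=False, header=True)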
I'm using a Jupyter notebook and running Spark 2.4.3.
game_reviews = spark.read.format("csv").option("header", "true").load("./amazon_reviews_us_Video_Games_v1_00.tsv")
#reading is fine
game_reviews_2_columns =game_reviews.drop(
'marketplace','review_id','product_parent','product_title','product_category',
'helpful_votes' ,'total_votes','vine','verified_purchase','review_headline',
'review_body','review_date')
Running this code
game_reviews_2_columns.columns
still gives all columns:
['marketplace\tcustomer_id\treview_id\tproduct_id\tproduct_parent\tproduct_title\tproduct_category\tstar_rating\thelpful_votes\ttotal_votes\tvine\tverified_purchase\treview_headline\treview_body\treview_date']
What am I doing wrong?
It did not read the headers properly; the result was a list containing one huge string of column names and tabs.
game_reviews = (spark.read.format("csv")
    .option("header", "true")
    .option("delimiter", "\t")  # this is the missing parameter
    .load("./amazon_reviews_us_Video_Games_v1_00.tsv"))
I could also specify a schema and include it, but it works fine without one.
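If an explicit schema were wanted, a rough sketch might look like the following; only a few columns are shown, a supplied schema would need to list every column in the file, and the types here are guesses:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Truncated, illustrative schema; the real one must cover all columns above.
schema = StructType([
    StructField("marketplace", StringType(), True),
    StructField("customer_id", IntegerType(), True),
    StructField("review_id", StringType(), True),
    # ... remaining columns ...
])

game_reviews = (spark.read.format("csv")
    .option("header", "true")
    .option("delimiter", "\t")
    .schema(schema)
    .load("./amazon_reviews_us_Video_Games_v1_00.tsv"))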
I would like to write and later read a dataframe in Python.
df_final.to_csv(self.get_local_file_path(hash,dataset_name), sep='\t', encoding='utf8')
...
df_final = pd.read_table(self.get_local_file_path(hash,dataset_name), encoding='utf8',index_col=[0,1])
But then I get:
sys:1: DtypeWarning: Columns (7,17,28) have mixed types. Specify dtype
option on import or set low_memory=False.
I found this question, which in the bottom line says I should specify the field types when I read the file because "low_memory" is deprecated... I find that very inefficient.
Isn't there a simple way to write & later read a Dataframe? I don't care about the human-readability of the file.
You can pickle your dataframe:
df_final.to_pickle(self.get_local_file_path(hash,dataset_name))
Read it back later:
df_final = pd.read_pickle(self.get_local_file_path(hash,dataset_name))
If your dataframe is big and this gets too slow, you might have more luck using the HDF5 format:
df_final.to_hdf(self.get_local_file_path(hash,dataset_name), key='df_final')  # to_hdf requires a key for the object in the store
Read it back later:
df_final = pd.read_hdf(self.get_local_file_path(hash,dataset_name))
You might need to install PyTables first.
Both ways store the data along with their types. Therefore, this should solve your problem.
The warning is because pandas has detected conflicting data values in your column. You can specify the datatypes when reading the file if you wish.
,dtype={'FIELD':int,'FIELD2':str}
Etc.
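Put together, the dtype mapping would be passed to the read call from the question, roughly like this (the field names are placeholders for the real column names):

# 'FIELD' and 'FIELD2' are placeholders for the columns flagged in the warning.
df_final = pd.read_table(
    self.get_local_file_path(hash, dataset_name),
    encoding='utf8',
    index_col=[0, 1],
    dtype={'FIELD': int, 'FIELD2': str},
)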