I'm using the below code to write to a CSV file.
df.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").option("nullValue"," ").save("/home/user/test_table/")
When I execute it, I'm getting the following error:
java.lang.UnsupportedOperationException: CSV data source does not support null data type.
Could anyone please help?
I had the same problem (though not using that exact command with the nullValue option), and I solved it by using the fillna method.
I also realised that fillna was not working on the _corrupt_record column, so I dropped it since I didn't need it.
df = df.drop('_corrupt_record')
df = df.fillna("")
df.write.option('header', 'true').format('csv').save('file_csv')
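For reference, a minimal end-to-end sketch of how this fix could be combined with the write command from the question; the paths and option values are simply the ones shown above, not verified independently:

# Drop the corrupt-record column and fill remaining nulls with empty strings,
# then write with the original header/nullValue options from the question.
df_clean = df.drop('_corrupt_record').fillna("")
(df_clean.coalesce(1)
    .write.format("com.databricks.spark.csv")
    .option("header", "true")
    .option("nullValue", " ")
    .save("/home/user/test_table/"))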
I am using the below code to rename one of the column names in a CSV file.
input_filepath ='/<path_to_file>/2500.csv'
df_csv = spark.read.option('header', True).csv(input_filepath)
df1 = df_csv.withColumnRenamed("xyz", "abc")
df1.printSchema()
So, the above code works fine. However, I also want to convert the CSV to parquet format. If I am correct, the above code makes the changes in memory only and does not touch the actual file. Please correct me if I am wrong.
If the changes are kept in memory, how can I write the changes to a parquet file?
For the file format conversion, I will be using the below code snippet:
df_csv.write.mode('overwrite').parquet()
But I am not able to figure out how to use it in this case. Please suggest.
Note: I am using a Databricks notebook for all the above steps.
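For what it's worth, a minimal sketch of how the PySpark parquet writer is typically called; the output path here is a placeholder, not from the question:

# Write the renamed dataframe to a parquet output directory (placeholder path).
output_path = '/<path_to_output>/2500_parquet'
df1.write.mode('overwrite').parquet(output_path)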
Hi, I am not 100% sure if this will solve your issue, but can you try to save your dataframe like this:
df.to_parquet(PATH_WHERE_TO_STORE)
Let me know if that helped.
Edit: My workflow usually goes like this
Export dataframe as csv
Check visually if everything is correct
Run this function:
import pandas as pd
def convert_csv_to_parquet(path_to_csv):
df = pd.read_csv(path_to_csv)
df.to_parquet(path_to_csv.replace(".csv", ".parq"), index=False)
This goes in the direction of Rakesh Govindulas' comment.
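For example, a call to that helper might look like this (the path is just illustrative):

# Writes 2500.parq next to the original 2500.csv.
convert_csv_to_parquet('/<path_to_file>/2500.csv')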
I have been trying to export a table to a csv file. The table is copied to the clipboard and is ready to be put into a csv (at least manually).
I have seen that you can read with pandas anything that you have in the clipboard and assign it to a dataframe, so I tried this code.
df = pd.read_clipboard()
df
df.to_csv('data.csv')
However, I got this error:
pandas.errors.ParserError: Expected 10 fields in line 5, saw 16. Error could possibly be due to
quotes being ignored when a multi-char delimiter is used.
I have been looking for a solution or an alternative, but failed.
Thanks in advance!
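One thing that might be worth trying, as a sketch only: read_clipboard defaults to a whitespace regex separator, which counts as a multi-character delimiter, so passing an explicit single-character separator (assuming the copied table is tab-separated) may avoid the parser error.

import pandas as pd

# Assume the clipboard table is tab-separated; an explicit '\t' separator
# avoids the default multi-character whitespace regex.
df = pd.read_clipboard(sep='\t')
df.to_csv('data.csv', index=False)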
I have a data frame I created with my original data appended with the topics from topic modeling. I keep running into errors when trying to export the data table into csv.
I've tried both the csv module and pandas, but I get errors from both.
The data table has 1765 rows so writing the file row by row is not really an option.
When using pandas, the most common errors are
DataFrame constructor not properly called!
and
function object has no attribute 'to_csv'
Code used:
import pandas as pd
data = (before.head)
df = pd.DataFrame(before.head)
df.to_csv (r'C:\Users\***\Desktop\beforetopics.csv', index = False, header=True)
print (df)
For the CSV module, there have been several errors such as
iterable expected, not method
Basically, how do I export this table (screenshot attached) into a csv file?
What is the command that you're trying to run?
Try this:
dataframe.to_csv('file_name.csv')
Or if it is the unicode error that you're coming across,
Try this:
dataframe.to_csv('file_name.csv', header=True, index=False, encoding='utf-8')
Since your dataframe's name is before,
Try this:
before.to_csv('file_name.csv', header=True, index=False, encoding='utf-8')
You can use the to_csv function:
before.to_csv('file_name.csv')
If you need extra options, you can check the documentation from here.
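As a side note, the errors quoted above are consistent with before.head being passed without parentheses: it is a method object, not a DataFrame, so constructing a DataFrame from it or calling .to_csv on it fails. A minimal sketch, assuming the dataframe is named before:

# Write the full dataframe directly rather than the .head method object.
before.to_csv(r'C:\Users\***\Desktop\beforetopics.csv', index=False, header=True)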
I'm using a Jupyter notebook and running Spark 2.4.3.
game_reviews = spark.read.format("csv").option("header", "true").load("./amazon_reviews_us_Video_Games_v1_00.tsv")
#reading is fine
game_reviews_2_columns =game_reviews.drop(
'marketplace','review_id','product_parent','product_title','product_category',
'helpful_votes' ,'total_votes','vine','verified_purchase','review_headline',
'review_body','review_date')
Running this code
game_reviews_2_columns.columns
still gives all columns:
['marketplace\tcustomer_id\treview_id\tproduct_id\tproduct_parent\tproduct_title\tproduct_category\tstar_rating\thelpful_votes\ttotal_votes\tvine\tverified_purchase\treview_headline\treview_body\treview_date']
What am I doing wrong?
It did not read the headers properly; the result was a list containing one huge string of column names and tabs.
game_reviews = (spark.read.format("csv")
    .option("header", "true")
    .option("delimiter", "\t")  # this is the missing parameter
    .load("./amazon_reviews_us_Video_Games_v1_00.tsv"))
I could also specify a schema and include it, but it works fine without one.
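If an explicit schema were wanted, a rough sketch might look like the following; only a few columns are shown, a supplied schema would need to list every column in the file, and the types here are guesses:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Truncated, illustrative schema; the real one must cover all columns above.
schema = StructType([
    StructField("marketplace", StringType(), True),
    StructField("customer_id", IntegerType(), True),
    StructField("review_id", StringType(), True),
    # ... remaining columns ...
])

game_reviews = (spark.read.format("csv")
    .option("header", "true")
    .option("delimiter", "\t")
    .schema(schema)
    .load("./amazon_reviews_us_Video_Games_v1_00.tsv"))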
I would like to write and later read a dataframe in Python.
df_final.to_csv(self.get_local_file_path(hash,dataset_name), sep='\t', encoding='utf8')
...
df_final = pd.read_table(self.get_local_file_path(hash,dataset_name), encoding='utf8',index_col=[0,1])
But then I get:
sys:1: DtypeWarning: Columns (7,17,28) have mixed types. Specify dtype
option on import or set low_memory=False.
I found this question, which in the bottom line says I should specify the field types when I read the file because "low_memory" is deprecated... I find that very inefficient.
Isn't there a simple way to write & later read a Dataframe? I don't care about the human-readability of the file.
You can pickle your dataframe:
df_final.to_pickle(self.get_local_file_path(hash,dataset_name))
Read it back later:
df_final = pd.read_pickle(self.get_local_file_path(hash,dataset_name))
If your dataframe is big and this gets too slow, you might have more luck using the HDF5 format:
df_final.to_hdf(self.get_local_file_path(hash,dataset_name), key='df_final')  # to_hdf requires a key for the object in the store
Read it back later:
df_final = pd.read_hdf(self.get_local_file_path(hash,dataset_name))
You might need to install PyTables first.
Both ways store the data along with their types. Therefore, this should solve your problem.
The warning is because pandas has detected conflicting data values in your column. You can specify the datatypes when reading the file if you wish.
,dtype={'FIELD':int,'FIELD2':str}
Etc.
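Put together, the dtype mapping would be passed to the read call from the question, roughly like this (the field names are placeholders for the real column names):

# 'FIELD' and 'FIELD2' are placeholders for the columns flagged in the warning.
df_final = pd.read_table(
    self.get_local_file_path(hash, dataset_name),
    encoding='utf8',
    index_col=[0, 1],
    dtype={'FIELD': int, 'FIELD2': str},
)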