Spark `saveAsTextFile` writes output in Row() format - python

I'm trying to copy these files over from S3 to Redshift, and they are all in the format of Row(column1=value, column2=value,...), which obviously causes issues. How do I get a dataframe to write out in normal csv?
I'm calling it like this:
final_data.rdd.saveAsTextFile(
    path=r's3n://inst-analytics-staging-us-standard/spark/output',
    compressionCodecClass='org.apache.hadoop.io.compress.GzipCodec'
)
I've also tried writing out with the spark-csv module, and it seems like it ignores any of the computations I did, and just formats the original parquet file as a csv and dumps it out.
I'm calling that like this:
df.write.format('com.databricks.spark.csv').save('results')

The spark-csv approach is a good one and should work. Looking at your code, it seems you are calling df.write on the original DataFrame df, which is why your transformations are being ignored. To make it work, you should do something like:
final_data = # Do your logic on df and return a new DataFrame
final_data.write.format('com.databricks.spark.csv').save('results')
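For illustration, here is a minimal end-to-end sketch of that flow, assuming spark-csv is on the classpath and that you want gzipped CSV on S3; the bucket paths, column names and aggregation are hypothetical:

from pyspark.sql import functions as F

# hypothetical input location and columns
df = sqlContext.read.parquet('s3n://my-bucket/input/')
final_data = df.groupBy('column1').agg(F.sum('column2').alias('total'))

# write the transformed DataFrame as gzipped CSV to a hypothetical output path
(final_data.write
    .format('com.databricks.spark.csv')
    .option('header', 'true')
    .option('codec', 'org.apache.hadoop.io.compress.GzipCodec')
    .save('s3n://my-bucket/spark/output'))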

Related

How to convert a CSV to parquet after renaming a column name?

I am using the below code to rename one of the column names in a CSV file.
input_filepath ='/<path_to_file>/2500.csv'
df_csv = spark.read.option('header', True).csv(input_filepath)
df1 = df_csv.withColumnRenamed("xyz", "abc")
df1.printSchema()
So, the above code works fine. However, I also wanted to convert the CSV to parquet format. If I am correct, the above code only makes the changes in memory and not in the actual file. Please correct me if I am wrong.
If the changes are kept in memory, how can I write them out to a parquet file?
For the file format conversion, I will be using the below code snippet:
df_csv.write.mode('overwrite').parquet()
But I am not able to figure out how to use it in this case. Please suggest.
Note: I am using a Databricks notebook for all the above steps.
Hi, I am not 100% sure if this will solve your issue, but can you try to save your csv like this:
df.to_parquet(PATH_WHERE_TO_STORE)
Let me know if that helped.
Edit: My workflow usually goes like this:
1. Export the dataframe as a csv
2. Check visually that everything is correct
3. Run this function:
import pandas as pd

def convert_csv_to_parquet(path_to_csv):
    df = pd.read_csv(path_to_csv)
    df.to_parquet(path_to_csv.replace(".csv", ".parq"), index=False)
This goes in the direction of Rakesh Govindulas' comment.
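If you want to do the conversion in Spark itself (as your df_csv.write snippet suggests), a minimal sketch would be the following; the output path is a placeholder:

output_path = '/<path_to_file>/2500_parquet'   # placeholder output location
df1.write.mode('overwrite').parquet(output_path)

# read it back to confirm the renamed column made it into the parquet files
spark.read.parquet(output_path).printSchema()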

How to transform CSV or DB record data for Kafka and how to get it back into a csv or DF on the other side

I've successfully set up a Kafka instance at my job and I've been able to pass simple 'Hello World' messages through it.
However, I'm not sure how to do more interesting things. I've got a CSV that contains four records from a DB that I'm trying to move through kafka, then take into a DF on the other side and save it as a CSV again.
producer = KafkaProducer(bootstrap_servers='my-server-id:443',
                         ....
df = pd.read_csv('data.csv')
df = df.to_json()
producer.send(mytopic, df.encode('utf8'))
This comes back as a tuple object (consumer record object, bool) that contains a list of my data. I can access the data as:
msg[0][0][6].decode('utf8')
But that comes in as a single string that I can't pass to a dataframe simply (it just merges everything into one thing).
I'm not sure if I even need a dataframe or a to_json() method or anything. I'm really just not sure how to organize the data to send it properly and then get it back and feed it into a dataframe so that I can either a) save it to a CSV or b) reinsert the dataframe into a DB with to_sql.
Kafka isn't really suited to sending entire matrices/dataframes around.
You can send a list of CSV rows, JSON arrays, or preferably some other compressible binary data format such as Avro or Protobuf as whole objects. If you are working exclusively in Python, you could pickle the data you send and receive.
When you read the data, you must deserialize it, but how you do that is ultimately your choice; there is no simple answer for any given application.
The solution, for this one case, would be json_normalize, then to_csv. I would also like to point out that Kafka isn't required for you to test that, as you definitely should be writing unit tests...
import json
import pandas as pd

df = pd.read_csv('data.csv')
jdf = df.to_json(orient='records')             # serialize the rows as a JSON array
msg_value = jdf                                # pretend you got a message from Kafka, as a JSON string
df = pd.json_normalize(json.loads(msg_value))  # parse the JSON string, then back to a dataframe
df.to_csv()                                    # with no path this returns CSV text; pass a path to write a file
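To make the round trip concrete, here is a minimal sketch using kafka-python; the broker address, topic name and output file are placeholders, and any security settings from your working setup are omitted:

import json
import pandas as pd
from kafka import KafkaProducer, KafkaConsumer

# producer side: send the rows as one JSON array per message
producer = KafkaProducer(bootstrap_servers='my-server-id:443')
records = pd.read_csv('data.csv').to_json(orient='records')
producer.send('mytopic', records.encode('utf-8'))
producer.flush()

# consumer side: decode the bytes, parse the JSON, rebuild the dataframe
consumer = KafkaConsumer('mytopic',
                         bootstrap_servers='my-server-id:443',
                         auto_offset_reset='earliest',
                         consumer_timeout_ms=10000)
for msg in consumer:
    df = pd.json_normalize(json.loads(msg.value.decode('utf-8')))
    df.to_csv('roundtrip.csv', index=False)   # or df.to_sql(...) to load a DB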

Read Pandas DataFrame object like a flatfile

I have a custom Python library function that takes a csv flat file as input, read using data = open('file.csv', 'r').read(). But currently the data is processed in Python as a Pandas DataFrame. How can I pass this DataFrame to my custom library function as the flat-file content it accepts?
As a workaround I'm writing the DataFrame to disk and reading it back using the read function, which adds a second or two for each iteration. I want to avoid this process.
If you call the to_csv method of a pandas DataFrame without a path argument, the CSV output is returned as a string. So you can use to_csv on your DataFrame directly; it produces the same content you currently get by writing the DataFrame to disk and reading it back.
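A minimal sketch of that idea; my_library_function is a hypothetical stand-in for your custom function:

import io
import pandas as pd

def my_library_function(data):   # hypothetical stand-in for your custom library function
    print(str(data)[:40])

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
csv_text = df.to_csv(index=False)            # no path given -> the CSV content comes back as a string

my_library_function(csv_text)                # the same string you would get from open(...).read()
my_library_function(io.StringIO(csv_text))   # if the function expects a file-like object, wrap the string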

Exporting a dataframe in dataframe format to pass as an argument to the next program

I have certain computations performed on a dataset and I need the result to be stored in an external file.
If it were exported to CSV, then to process it further I'd have to convert it back to a DataFrame/SFrame, which again increases the lines of code.
Here's the snippet:
train_data = graphlab.SFrame(ratings_base)
Clearly, it is an SFrame and can be converted to a DataFrame using
df_train = train_data.to_dataframe()
Now that it is a DataFrame, I need it exported to a file without changing its structure, since the exported file will be used as an argument to another Python program. That program must accept a DataFrame and not a CSV.
I have already checked out place1, place2, place3, place4 and place5.
P.S. I'm still digging into Python serialization; if anyone can explain it in this context, that would be helpful.
I'd use the HDF5 format, as it's supported by Pandas (and reachable from a graphlab.SFrame via to_dataframe()), and besides that HDF5 is very fast.
Alternatively you can export the Pandas DataFrame to a pickle file and read it back from another script:
sf.to_dataframe().to_pickle(r'/path/to/pd_frame.pickle')
to read it back (from the same or from another script):
df = pd.read_pickle(r'/path/to/pd_frame.pickle')
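If you go the HDF5 route mentioned above instead, a minimal sketch with pandas would be the following; the path and key are placeholders, and it requires the PyTables ('tables') package:

import pandas as pd

# write the converted DataFrame to HDF5 (needs the 'tables' / PyTables package)
sf.to_dataframe().to_hdf(r'/path/to/pd_frame.h5', key='train', mode='w')

# read it back (from the same or from another script)
df = pd.read_hdf(r'/path/to/pd_frame.h5', key='train')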

How to export a table dataframe in PySpark to csv?

I am using Spark 1.3.1 (PySpark) and I have generated a table using a SQL query. I now have an object that is a DataFrame. I want to export this DataFrame object (I have called it "table") to a csv file so I can manipulate it and plot the columns. How do I export the DataFrame "table" to a csv file?
Thanks!
If the data frame fits in driver memory and you want to save it to the local file system, you can convert the Spark DataFrame to a local Pandas DataFrame using the toPandas method and then simply use to_csv:
df.toPandas().to_csv('mycsv.csv')
Otherwise you can use spark-csv:
Spark 1.3
df.save('mycsv.csv', 'com.databricks.spark.csv')
Spark 1.4+
df.write.format('com.databricks.spark.csv').save('mycsv.csv')
In Spark 2.0+ you can use csv data source directly:
df.write.csv('mycsv.csv')
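The Spark 2+ csv writer also accepts the usual options as keyword arguments; for example (the option values here are only illustrative):

# write a headered, gzipped, comma-separated csv, overwriting any existing output
df.write.csv('mycsv.csv', header=True, mode='overwrite', sep=',', compression='gzip')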
For Apache Spark 2+, to save the dataframe into a single csv file, use the following command:
query.repartition(1).write.csv("cc_out.csv", sep='|')
Here 1 indicates that I need only one partition for the csv; you can change it according to your requirements.
If you cannot use spark-csv, you can do the following:
df.rdd.map(lambda x: ",".join(map(str, x))).coalesce(1).saveAsTextFile("file.csv")
If you need to handle strings with line breaks or commas, that will not work. Use this:
import csv
import cStringIO

# Python 2 code (cStringIO); on Python 3 use io.StringIO and drop the .encode
def row2csv(row):
    buffer = cStringIO.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([str(s).encode("utf-8") for s in row])
    buffer.seek(0)
    return buffer.read().strip()

df.rdd.map(row2csv).coalesce(1).saveAsTextFile("file.csv")
You need to repartition the DataFrame into a single partition, then define the format, path and other parameters for the file in Unix file system format, and here you go:
df.repartition(1).write.format('com.databricks.spark.csv').save("/path/to/file/myfile.csv", header='true')
Read more about the repartition function
Read more about the save function
However, repartition is a costly function and toPandas() is the worst. Try using .coalesce(1) instead of .repartition(1) in the previous syntax for better performance.
Read more on repartition vs coalesce functions.
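Putting that together, a sketch of the cheaper variant (same hypothetical path as above):

# coalesce avoids the full shuffle that repartition triggers
df.coalesce(1) \
  .write.format('com.databricks.spark.csv') \
  .option('header', 'true') \
  .save('/path/to/file/myfile.csv')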
Using PySpark
The easiest way to write to csv in Spark 3.0+:
sdf.write.csv("/path/to/csv/data.csv")
This can generate multiple files based on the number of partitions in your data. If you want it in a single file, use repartition:
sdf.repartition(1).write.csv("/path/to/csv/data.csv")
Using Pandas
If your data is not too large and can be held in local Python memory, then you can make use of pandas too:
sdf.toPandas().to_csv("/path/to/csv/data.csv", index=False)
Using Koalas
sdf.to_koalas().to_csv("/path/to/csv/data.csv", index=False)
How about this (in case you don't want a one-liner)?
for row in df.collect():
    d = row.asDict()
    s = "%d\t%s\t%s\n" % (d["int_column"], d["string_column"], d["string_column"])
    f.write(s)
f is an open file object. Also, the separator is a TAB character, but it's easy to change to whatever you want.
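For completeness, a sketch of opening f around that loop; the file name and column names are placeholders:

# open f explicitly so it is closed when the loop finishes
with open('/tmp/output.tsv', 'w') as f:
    for row in df.collect():
        d = row.asDict()
        f.write("%d\t%s\t%s\n" % (d["int_column"], d["string_column"], d["string_column"]))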
I am late to the party, but: this will let me rename the file, move it to a desired directory and delete the unwanted additional directory Spark made.
import shutil
import os
import glob
path = 'test_write'
#write single csv
students.repartition(1).write.csv(path)
#rename and relocate the csv
shutil.move(glob.glob(os.getcwd() + '\\' + path + '\\' + r'*.csv')[0], os.getcwd()+ '\\' + path+ '.csv')
#remove additional directory
shutil.rmtree(os.getcwd()+'\\'+path)
Try display(df) and use the download option in the results. Please note: only 1 million rows can be downloaded with this option, but it's really quick.
I used the method with pandas and it gave me horrible performance. In the end it took so long that I stopped it and looked for another method.
If you are looking for a way to write to one csv instead of multiple csvs, this is what you are looking for:
df.coalesce(1).write.csv("train_dataset_processed", header=True)
It reduced the time to process my dataset from 2+ hours to 2 minutes.
