How to export a table dataframe in PySpark to csv?

I am using Spark 1.3.1 (PySpark) and I have generated a table using a SQL query. I now have an object that is a DataFrame. I want to export this DataFrame object (I have called it "table") to a csv file so I can manipulate it and plot the columns. How do I export the DataFrame "table" to a csv file?
Thanks!

If the DataFrame fits in driver memory and you want to save to the local file system, you can convert the Spark DataFrame to a local Pandas DataFrame with the toPandas method and then simply use to_csv:
df.toPandas().to_csv('mycsv.csv')
Otherwise you can use spark-csv:
Spark 1.3
df.save('mycsv.csv', 'com.databricks.spark.csv')
Spark 1.4+
df.write.format('com.databricks.spark.csv').save('mycsv.csv')
In Spark 2.0+ you can use csv data source directly:
df.write.csv('mycsv.csv')

For Apache Spark 2+, to save a DataFrame as a single CSV file, use the following command:
query.repartition(1).write.csv("cc_out.csv", sep='|')
Here 1 means that I want a single partition, and therefore a single CSV part file inside the cc_out.csv output directory. You can change it according to your requirements.

If you cannot use spark-csv, you can do the following:
df.rdd.map(lambda x: ",".join(map(str, x))).coalesce(1).saveAsTextFile("file.csv")
That will not work if you need to handle strings containing line breaks or commas. In that case use this instead:
import csv
import io
def row2csv(row):
    # csv.writer quotes fields that contain commas, quotes or line breaks
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([str(s) for s in row])
    # drop the trailing line terminator; saveAsTextFile adds its own
    return buffer.getvalue().strip()
df.rdd.map(row2csv).coalesce(1).saveAsTextFile("file.csv")
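To see why the plain ",".join breaks while csv.writer does not, here is a minimal sketch in plain Python (no Spark needed). The sample row and the helper name row_to_csv_line are made up for illustration, and a tuple stands in for a Spark Row.

```python
import csv
import io


def row_to_csv_line(row):
    """Format one row (any iterable of values) as a single CSV record."""
    buffer = io.StringIO()
    # lineterminator="" keeps the writer from appending "\r\n" to the record.
    csv.writer(buffer, lineterminator="").writerow(row)
    return buffer.getvalue()


# Hypothetical sample row: the second field contains a comma and quotes.
sample_row = (1, 'Smith, "Bob"', "ok")

naive_line = ",".join(map(str, sample_row))
quoted_line = row_to_csv_line(sample_row)

# Parsing each line back shows which one preserves the three original fields.
naive_fields = next(csv.reader([naive_line]))
quoted_fields = next(csv.reader([quoted_line]))

print(naive_line)
print(quoted_line)
print(len(naive_fields), len(quoted_fields))
```

The naive line parses back into more fields than the row had, while the quoted line round-trips to the original three values.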

You need to repartition the DataFrame into a single partition and then pass the format, path and other parameters to the writer, with the path in Unix file system format:
df.repartition(1).write.format('com.databricks.spark.csv').save("/path/to/file/myfile.csv", header='true')
However, repartition is a costly function and toPandas() is worse. Try using .coalesce(1) instead of .repartition(1) in the previous syntax for better performance.

Using PySpark
The easiest way to write CSV in Spark 3.0+:
sdf.write.csv("/path/to/csv/data.csv")
This can generate multiple part files, one per partition of the DataFrame. If you want a single file, use repartition:
sdf.repartition(1).write.csv("/path/to/csv/data.csv")
Using Pandas
If your data is not too large and fits in local Python memory, you can use pandas too:
sdf.toPandas().to_csv("/path/to/csv/data.csv", index=False)
Using Koalas
sdf.to_koalas().to_csv("/path/to/csv/data.csv", index=False)

How about this (in case you don't want a one-liner)?
for row in df.collect():
    d = row.asDict()
    s = "%d\t%s\t%s\n" % (d["int_column"], d["string_column"], d["string_column"])
    f.write(s)
Here f is an open file object. The separator is a TAB character, but it's easy to change to whatever you want.
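If a string column can itself contain the separator, hand-formatting each line gets fragile. A rough sketch of the same loop using the standard csv module is below; the list of dicts stands in for [row.asDict() for row in df.collect()], and the column names are hypothetical.

```python
import csv
import io

# Stand-in for the dicts produced by row.asDict(); column names are hypothetical.
rows = [
    {"int_column": 1, "string_column": "alpha"},
    {"int_column": 2, "string_column": "tab\there"},
]

# In real use this would be open("out.tsv", "w", newline="") instead of StringIO.
out = io.StringIO()
writer = csv.DictWriter(
    out,
    fieldnames=["int_column", "string_column"],
    delimiter="\t",
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(rows)

tsv_text = out.getvalue()
print(tsv_text)
```

The writer quotes the value that contains a TAB, so a reader using the same delimiter gets the original string back.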

'''
I am late to the party, but this lets me rename the file, move it to a desired directory and delete the unwanted additional directory Spark made.
'''
import shutil
import os
import glob
path = 'test_write'
# write a single csv part file into the directory `path`
students.repartition(1).write.csv(path)
# rename and relocate the csv (os.path.join keeps this portable across operating systems)
part_file = glob.glob(os.path.join(os.getcwd(), path, '*.csv'))[0]
shutil.move(part_file, os.path.join(os.getcwd(), path + '.csv'))
# remove the additional directory
shutil.rmtree(os.path.join(os.getcwd(), path))
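The same clean-up can be wrapped in a small helper. The sketch below is only an illustration: promote_single_part_file is a made-up name, and the directory layout is simulated with a temporary folder and an illustrative part file name, so it runs without Spark.

```python
import shutil
import tempfile
from pathlib import Path


def promote_single_part_file(output_dir, target_file):
    """Move the only part-*.csv in output_dir to target_file, then delete output_dir."""
    output_dir = Path(output_dir)
    part_files = sorted(output_dir.glob("part-*.csv"))
    if len(part_files) != 1:
        raise ValueError(f"expected exactly one part file, found {len(part_files)}")
    shutil.move(str(part_files[0]), str(target_file))
    shutil.rmtree(output_dir)
    return Path(target_file)


# Simulate what a single-partition write leaves behind (file names are illustrative).
work = Path(tempfile.mkdtemp())
spark_dir = work / "test_write"
spark_dir.mkdir()
(spark_dir / "part-00000-abc.csv").write_text("id,name\n1,Ann\n")
(spark_dir / "_SUCCESS").write_text("")

result = promote_single_part_file(spark_dir, work / "test_write.csv")
print(result.read_text())
```

Raising when there is not exactly one part file guards against silently picking an arbitrary file if the DataFrame was written with more than one partition.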

Try display(df) and use the download option in the results. Please note: only 1 million rows can be downloaded with this option, but it's really quick.

I used the method with pandas and it gave me horrible performance. In the end it took so long that I stopped it to look for another method.
If you are looking for a way to write to one CSV instead of multiple CSVs, this is what you are looking for:
df.coalesce(1).write.csv("train_dataset_processed", header=True)
It reduced the processing time for my dataset from 2+ hours to 2 minutes.

Related

What is the fastest way to retrieve header names from excel files using pandas

I have large Excel files whose column names I'm organizing into a unique list.
The code below works, but it takes ~9 minutes!
Does anyone have suggestions for speeding it up?
import pandas as pd
import os
get_col = list(pd.read_excel(r"E:\DATA\dbo.xlsx", nrows=1, engine='openpyxl').columns)
print(get_col)

Using pandas to extract just the column names of a large Excel file is very inefficient.
You can use openpyxl for this:
from openpyxl import load_workbook
wb = load_workbook(r"E:\DATA\dbo.xlsx", read_only=True)
columns = {}
for sheet in wb.worksheets:
    # read only the first row of each sheet
    for value in sheet.iter_rows(min_row=1, max_row=1, values_only=True):
        columns[sheet.title] = value
Here columns maps each sheet title to a tuple of its column names; if you only have one sheet, it holds a single tuple.

If you want faster reading, I suggest you use another file type. Excel files, while convenient, are binary files, so for pandas to read and parse one correctly it must load the full file. Using nrows or skipfooter to work with less data only takes effect after the full data is loaded, and therefore shouldn't really affect the waiting time. By contrast, when working with a .csv file, given its type and the lack of significant metadata, you can extract just the first rows as an iterable using the chunksize parameter of pd.read_csv().
Other than that, calling list() on a DataFrame already returns a list of its columns. So my only suggestion for the code you use is:
get_col = list(pd.read_excel(r"E:\DATA\dbo.xlsx", nrows=1, engine='openpyxl'))
The stronger suggestion is to change the file type if you specifically want to address this issue.
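To illustrate the point about CSV, here is a small sketch that reads only the header line of a CSV file. It uses the standard csv module rather than pandas so it runs anywhere; with pandas, pd.read_csv(path, nrows=0).columns should give the same names. The helper name and the sample file are made up for the example.

```python
import csv
import os
import tempfile


def read_csv_header(path):
    """Return the first row of a CSV file without reading the rest of it."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle))


# Build a throwaway CSV with a header and 1000 data rows (hypothetical columns).
fd, sample_path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["id", "name", "amount"])
    writer.writerows([[i, f"n{i}", i * 1.5] for i in range(1000)])

header = read_csv_header(sample_path)
print(header)

os.remove(sample_path)
```

Because the reader is lazy, only the first line is consumed no matter how large the file is.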

How to create a hierarchical csv file?

I have the following N invoices in Excel and I want to create a CSV from that file so that it can be imported whenever needed. How can I achieve this?

Assuming you have a Folder "excel" full of Excel Files within your Project-Directory and you also have another folder "csv" where you intend to put your generated CSV Files, you could pretty much easily batch-convert all the Excel Files in the "excel" Directory into "csv" using Pandas.
It will be assumed that you already have Pandas installed on your System. Otherwise, you could do that via: pip install pandas. The fairly commented Snippet below illustrates the Process:
# IMPORT PANDAS AND OS
import pandas as pd
import os
# SOURCE AND TARGET FOLDERS, RELATIVE TO THE PROJECT DIRECTORY
excelDir = "excel"
csvDir = "csv"
# OUR GOAL IS:::
# LOOP THROUGH THE FOLDER: excelDir.....
# AT EACH ITERATION IN THE LOOP, CHECK IF THE CURRENT FILE IS AN EXCEL FILE,
# IF IT IS, SIMPLY CONVERT IT TO CSV AND SAVE IT:
for fileName in os.listdir(excelDir):
    # DO WE HAVE AN EXCEL FILE?
    if fileName.endswith(".xls") or fileName.endswith(".xlsx"):
        # IF WE DO; THEN WE DO THE CONVERSION USING PANDAS...
        targetXLFile = os.path.join(excelDir, fileName)
        targetCSVFile = os.path.join(csvDir, fileName) + ".csv"
        # NOW, WE READ "IN" THE EXCEL FILE
        dFrame = pd.read_excel(targetXLFile)
        # ONCE WE ARE DONE READING, WE CAN SIMPLY SAVE THE DATA TO CSV
        dFrame.to_csv(targetCSVFile)
Hope this does the Trick for you.....
Cheers and Good-Luck.

Instead of putting the total output into one CSV, you could go with the following steps:
1. Convert your Excel content to CSV files or CSV objects.
2. Tag each object with its invoice id and save it into a dictionary.
3. Your dictionary data structure could look like {'invoice-id': csv-object, 'invoice-id2': csv-object2, ...}.
4. Write a custom function which reads your csv-object and gives you name, product-id, qty, etc.
Hope this helps.
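The steps above could be sketched roughly as follows. Everything here is an assumption for illustration: the column names (invoice_id, name, product_id, qty), the sample data, and the two helper functions are hypothetical, and each "csv-object" is simply a list of row dicts.

```python
import csv
import io
from collections import defaultdict

# Hypothetical flat export of the Excel sheet: one line per invoice item.
flat_csv = """invoice_id,name,product_id,qty
INV-1,Ann,P-10,2
INV-1,Ann,P-11,1
INV-2,Bob,P-10,5
"""


def group_by_invoice(csv_text):
    """Build {'invoice-id': [row, ...]} from flat CSV text."""
    invoices = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        invoices[row["invoice_id"]].append(row)
    return dict(invoices)


def invoice_items(invoices, invoice_id):
    """Return (name, product_id, qty) tuples for one invoice."""
    return [
        (row["name"], row["product_id"], int(row["qty"]))
        for row in invoices[invoice_id]
    ]


invoices = group_by_invoice(flat_csv)
print(invoice_items(invoices, "INV-1"))
```

Keying by invoice id keeps the hierarchy (invoice, then items) even though each underlying CSV row is flat.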

exporting dataframe into dataframe format to pass as argument into next program

I have certain computations performed on a dataset and I need the result to be stored in an external file.
If I exported it to CSV, I'd have to convert it back to a DataFrame/SFrame to process it further, which again increases the lines of code.
Here's the snippet:
train_data = graphlab.SFrame(ratings_base)
Clearly, it is an SFrame and can be converted to a DataFrame using
df_train = train_data.to_dataframe()
Now that it is a DataFrame, I need it exported to a file without changing its structure, since the exported file will be used as an argument to another Python program. That code must accept a DataFrame and not a CSV.
I have already checked place1, place2, place3, place4 and place5.
P.S. I'm still digging into Python serialization; if anyone can simplify it in this context, that would be helpful.

I'd use the HDF5 format, as it's supported by Pandas and by graphlab.SFrame, and besides that the HDF5 format is very fast.
Alternatively you can export the Pandas DataFrame to a pickle file and read it from other scripts:
sf.to_dataframe().to_pickle(r'/path/to/pd_frame.pickle')
To read it back (from the same or from another script):
df = pd.read_pickle(r'/path/to/pd_frame.pickle')
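On the Python serialization mentioned in the question: a pickle file stores the object together with its types, so the consumer gets the same structure back without re-parsing text. Below is a minimal round trip with the standard pickle module; a plain dict of columns stands in for the DataFrame, and the column names and temporary path are made up.

```python
import os
import pickle
import tempfile

# Stand-in for a DataFrame: column name -> values, with mixed types.
table = {"user_id": [1, 2, 3], "rating": [4.5, 3.0, 5.0]}

fd, pickle_path = tempfile.mkstemp(suffix=".pickle")
os.close(fd)

# "Producer" script: write the object to disk in binary mode.
with open(pickle_path, "wb") as handle:
    pickle.dump(table, handle)

# "Consumer" script: load it back; ints stay ints and floats stay floats.
with open(pickle_path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == table)

os.remove(pickle_path)
```

Only unpickle files you produced yourself or otherwise trust, since loading a pickle can execute arbitrary code.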

how add link to excel file using python

I'm generating a csv file that is opened in Excel and converted to xlsx manually.
The csv contains some paths to .txt files.
Is it possible to build the file paths in such a way that, when the csv is converted to xlsx, they become clickable hyperlinks?
Thanks.

I would be interested to understand your workflow a bit better, but to try and help with your specific request:
- The HYPERLINK solution proposed in the comments looks like a good one.
- If you are able to implement that upstream in the csv generation step, then great.
- If not, and/or you are interested in automating the conversion process, consider using the pandas library:
  1. Create a DataFrame object from the csv using the pandas.read_csv method.
  2. Convert your paths to HYPERLINKs.
  3. Write back to xlsx using the pandas.DataFrame.to_excel method.
E.g. if you have a file original.csv and the relevant column header is file_paths:
import pandas as pd
df = pd.read_csv('original.csv')
df['file_paths'] = '=HYPERLINK("' + df['file_paths'] + '")'
df.to_excel('new.xlsx', index=False)
Hope that helps!
Jon
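For the "upstream" option, a rough sketch of writing the HYPERLINK formulas directly during csv generation is shown below, using only the standard csv module. The paths and the helper name as_hyperlink are hypothetical, and whether Excel evaluates the formula when opening the csv may depend on the Excel version and its settings.

```python
import csv
import io


def as_hyperlink(path):
    """Wrap a path in an Excel HYPERLINK formula."""
    return f'=HYPERLINK("{path}")'


# Hypothetical paths to the .txt files.
paths = ["C:/logs/run1.txt", "C:/logs/run2.txt"]

# In real use this would be open("original.csv", "w", newline="").
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["file_paths"])
for path in paths:
    writer.writerow([as_hyperlink(path)])

csv_text = out.getvalue()
print(csv_text)
```

The csv writer doubles the inner quotes and wraps each formula in quotes, which is the standard CSV escaping for a field containing quote characters.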

Spark writes out `saveAsTextFile` in a Row() format

I'm trying to copy these files over from S3 to Redshift, and they are all in the format of Row(column1=value, column2=value,...), which obviously causes issues. How do I get a dataframe to write out in normal csv?
I'm calling it like this:
# final_data.rdd.saveAsTextFile(
# path=r's3n://inst-analytics-staging-us-standard/spark/output',
# compressionCodecClass='org.apache.hadoop.io.compress.GzipCodec'
# )
I've also tried writing out with the spark-csv module, and it seems like it ignores any of the computations I did, and just formats the original parquet file as a csv and dumps it out.
I'm calling that like this:
df.write.format('com.databricks.spark.csv').save('results')

The spark-csv approach is a good one and should work. Looking at your code, it seems you are calling df.write on the original DataFrame df, which is why it ignores your transformations. DataFrames are immutable, so each transformation returns a new DataFrame. To work properly, you should probably do:
final_data = transform(df)  # transform is a placeholder for your logic on df; it returns a new DataFrame
final_data.write.format('com.databricks.spark.csv').save('results')
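As for the Row(column1=value, ...) output in the question: saveAsTextFile writes the string form of each RDD element, and the string form of a Row is exactly that. A small sketch of the difference in plain Python is below; a namedtuple stands in for pyspark.sql.Row (an assumption made so the example runs without Spark), and the column names are taken from the question.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row; field names follow the question.
Row = namedtuple("Row", ["column1", "column2"])
record = Row(column1=1, column2="a")

# What writing the element as-is produces: its string representation.
as_text = str(record)

# What mapping each element to a delimited line first produces.
as_csv = ",".join(str(value) for value in record)

print(as_text)
print(as_csv)
```

So if you do stay with saveAsTextFile, map each Row to a delimited string first, for example with a helper like row2csv from the answers above.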
