I'm very new to Spark and still working through my first tests with it. I installed a single node and I'm using it as my master on a decent server, running:
pyspark --master local[20]
And of course I'm facing some difficulties with my first steps using pyspark.
I have a 40 GB CSV file with around 300 million lines in it. What I want to do is find the fastest way to split this file into smaller packages and store them as CSV files as well. For that I have two scenarios:
First one: split the file without any criteria, just split it equally into, let's say, 100 pieces (3 million rows each).
Second one: the CSV data I'm loading is tabular and I have one column X with 100K different IDs. What I would like to do is create a set of dictionaries and create smaller CSV files, where my dictionaries tell me to which package each row should go.
So far, this is where I am now:
sc = SparkContext.getOrCreate()
file_1 = r'D:\PATH\TOFILE\data.csv'
sdf = spark.read.option("header","true").csv(file_1, sep=";", encoding='cp1252')
Thanks for your help!
The best (and probably "fastest") way to do this would be to take advantage of Spark's built-in partitioning of RDDs and write one CSV file from each partition. You may repartition or coalesce to create the desired number of partitions (let's say, 100). This will give you maximum parallelism (based on your cluster resources and configuration), as each Spark executor works on one partition at a time.
You may do one of these:
Do a mapPartitions over the DataFrame and write each partition to a unique CSV file.
OR df.write.partitionBy('X').csv('mycsv.csv'), which will create one partition (and thereby one file) per unique value in 'X' (see the sketch after the note below).
Note: if you use HDFS to store your CSV files, Spark will automatically create multiple files to store the different partitions (number of files created = number of RDD partitions).
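A minimal sketch of both approaches, assuming the sdf loaded in the question and hypothetical output paths:
# Option 1: repartition to ~100 equal chunks and let Spark write one CSV part-file per partition.
sdf.repartition(100).write.csv('/tmp/equal_chunks', sep=';', header=True, mode='overwrite')
# Option 2: one sub-folder (and file) per distinct value of column X.
sdf.write.partitionBy('X').csv('/tmp/by_x', sep=';', header=True, mode='overwrite')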
What I did in the end was to load the data as a Spark dataframe, which automatically creates equal-sized partitions of about 128 MB (Spark's default for spark.sql.files.maxPartitionBytes), and then use the repartition method to redistribute my rows according to the values of a specific column in my dataframe.
# This will load my CSV data into a Spark dataframe and will generate the required number of ~128 MB partitions to hold my raw data.
sdf = spark.read.option('header','true').csv(file_1, sep=';', encoding='utf-8')
# This line will redistribute the rows of each partition according to the values of a specific column. Here I'm placing all rows with the same value on the same partition and I'm creating 20 partitions. (Spark handles allocating the rows, so the partitions end up roughly the same size.)
sdf_2 = sdf.repartition(20, 'TARGET_COLUMN')
# This line will save each of my 20 partitions to a different CSV file.
sdf_2.write.saveAsTable('CSVBuckets', format='csv', sep=';', mode='overwrite', path=output_path, header='True')
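A quick sanity check before writing, assuming the same sdf_2 as above:
# Confirm the number of partitions (and therefore CSV part-files) that will be written.
print(sdf_2.rdd.getNumPartitions())  # should print 20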
The easiest way to split a CSV file is to use the Unix utility called split.
Just google "split unix command line".
I split my files using split -l 3500 XBTUSDorderbooks4.csv orderbooks
Related
I have a use case wherein I am reading data from a source into a dataframe, doing a groupBy on a field and essentially breaking that dataframe into an array of dataframes.
My target state is to have all these dataframes written as individual CSV files in S3 (CSV because they need to be downloaded by the client and need to be human readable).
What's the best way of going about this?
I used this to split df into df_array: df_array = [(df.where(df[column_name] == i), i) for i in distinct_values]
And df.toPandas().to_csv(output_path + '.csv', index=False) on each dataframe individually to convert them to CSV files - but the challenges I'm facing with this approach are:
My understanding is that since I require a single CSV file per grouping value, to_csv (via toPandas()) will bring data from all worker nodes to the driver and may give a driver OOM issue.
I am unable to use Python multiprocessing to write the individual dataframes to S3, since the data is distributed on worker nodes and it gives me an error: Spark: Broadcast variables: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation.
No space left on device.
The pipeline is pretty slow as well. What is a better way to approach this use case?
[EDIT]
I want to control the names of the CSV files which get created as well. The target state is 1 CSV file per group-by value (let's call that field Name), so if there are 10 different Names in my initial df, the output will be 10 CSV files titled Name1.csv, Name2.csv and so on.
As you're using pyspark, why don't you use repartition and partitionBy to achieve your goal?
df.repartition(1).write.partitionBy('grouping_field1', 'grouping_field2', ...).save('/path/to/save', format='csv')
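Note that partitionBy writes one sub-directory per value (e.g. grouping_field1=Name1/part-00000-...csv), so getting files literally named Name1.csv means renaming the part files afterwards. A rough sketch, assuming a single grouping field and a local output path (for S3 you would do the equivalent with boto3):
import glob
import os
import shutil

out_dir = '/path/to/save'  # hypothetical output path from the write above
for part_dir in glob.glob(os.path.join(out_dir, 'grouping_field1=*')):
    value = os.path.basename(part_dir).split('=', 1)[1]
    part_file = glob.glob(os.path.join(part_dir, 'part-*.csv'))[0]
    shutil.move(part_file, os.path.join(out_dir, value + '.csv'))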
I have a bunch of parquet files, each containing a subset of my dataset. Let's say that the files are named data-N.parquet with N being an integer.
I can read them all and subsequently convert to a pandas dataframe:
import glob
import pyarrow.parquet as pq

files = glob.glob("data-**.parquet")
ds = pq.ParquetDataset(
    files,
    metadata_nthreads=64,
).read_table(use_threads=True)
df = ds.to_pandas()
This works just fine. What I would like to have is an additional column in the final data frame, indicating which file each row originates from.
As far as I understand, the ds data is partitioned, with one partition per file. So it would be a matter of including the partition key in the data frame.
Is this feasible?
The partition key can, at the moment, be included in the dataframe. However, all existing partitioning schemes use directory names for the key. So if your data were laid out as /N/data.parquet or /batch=N/data.parquet, this would work (you will need to supply a partitioning object when you read the dataset).
There is no way today (in pyarrow) to get the filename in the returned results.
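A common workaround (not pyarrow-native, just a sketch under the question's data-N.parquet naming) is to read the files one at a time and tag each chunk with its source before concatenating:
import glob
import pandas as pd
import pyarrow.parquet as pq

frames = []
for f in sorted(glob.glob("data-*.parquet")):
    chunk = pq.read_table(f, use_threads=True).to_pandas()
    chunk["source_file"] = f  # hypothetical column name for the originating file
    frames.append(chunk)
df = pd.concat(frames, ignore_index=True)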
I am using Pandas to split a large CSV into multiple CSVs, each containing a single row.
I have a CSV with 1 million records, and using the code below it is taking too much time.
For example, in the above case there will be 1 million CSVs created.
Can anyone help me decrease the time it takes to split the CSV?
for index, row in lead_data.iterrows():
    row.to_csv(row['lead_id'] + ".csv")
lead_data is the dataframe object.
Thanks
You don't need to loop through the data row by row. Filter records by lead_id and then export that subset to a CSV file. That way you will be able to split the files based on the lead ID.
Example: split out all EPL games where Arsenal was at home:
data = pd.read_csv('footbal/epl-2017-GMTStandardTime.csv')
print("Selecting Arsenal")
ft = data.loc[data['HomeTeam'] == 'Arsenal']
print(ft.head())
# Export data to CSV
ft.to_csv('arsenal.csv')
print("Done!")
This approach is much faster than writing one record at a time.
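Applied to the question's lead_data, a minimal sketch (assuming one file per lead_id is really what's needed) would group once and write each group, avoiding the per-row iterrows overhead:
# One to_csv call per lead_id instead of one per row.
for lead_id, group in lead_data.groupby('lead_id'):
    group.to_csv(f"{lead_id}.csv", index=False)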
I'm trying to write code that will read from a set of CSVs named my_file_*.csv into a Dask dataframe.
Then I want to set the partitions based on the length of each CSV. I'm trying to map a function onto each partition, and in order to do that, each partition must be a whole CSV.
I've tried resetting the index and then setting partitions based on the length of each CSV, but it looks like the index of the Dask dataframe is not unique.
Is there a better way to partition based on the length of each CSV?
So one partition should contain exactly one file?
You could do:
import dask.dataframe as dd
ddf = dd.read_csv("my_file_*.csv", blocksize=None)
Setting blocksize to None makes sure that files are not split up into several partitions. Therefore, ddf will be a Dask dataframe containing one file per partition.
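From there, mapping a function over each file is a matter of map_partitions; a rough sketch, with len standing in for whatever per-CSV function you need:
import dask.dataframe as dd

# blocksize=None keeps one whole CSV per partition.
ddf = dd.read_csv("my_file_*.csv", blocksize=None)

# The function is applied once per partition, i.e. once per file.
# Here it just counts rows; swap in your own per-file logic.
rows_per_file = ddf.map_partitions(len).compute()
print(rows_per_file)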
You might want to check out the documentation:
general instructions on how to generate Dask dataframes from data
details about read_csv
I'm trying to merge a single data column from 40 almost identical CSV files with Pandas. The files contain info about Windows processes in CSV form, generated by the Windows 'Tasklist' command.
What I want to do is merge the memory information from these files into a single file, using the PID as the key. However, there are some random insignificant processes that appear every now and then and cause inconsistency among the CSV files, meaning that in some files there might be 65 rows and in others 75. Those random processes are not significant, their changing PIDs should not matter, and they should simply be dropped when merging the files.
This is how I first tried to do it:
# CSV files have following columns
# Image Name, PID, Session Name, Session #, Mem Usage
file1 = pd.read_csv("tasklist1.txt")
file1 = file1.drop(file1.columns[[2,3]], axis=1)
for i in range(2,41):
    filename = "tasklist" + str(i) + ".txt"
    filei = pd.read_csv(filename)
    filei = filei.drop(filei.columns[[0,2,3]], axis=1)
    file1 = file1.merge(filei, on='PID')
file1.to_csv("Final.txt", index=False)
From the first CSV file I drop only the Session Name and Session # columns, but keep the Image Names as the titles for each row. From the following CSV files I keep just the PID and Mem Usage columns and try to merge the ever-growing merged dataframe with the data from the next file.
The problem is that when the loop reaches the 5th iteration, it cannot merge the files anymore, as I get the "Reindexing only valid with uniquely valued Index objects" error.
So I can merge the 1st file with the 2nd to 4th inside the first loop. If I then create a second loop where I merge the 5th file with the 6th to 8th and then merge these two merged files together, all the data from files 1 to 8 is merged perfectly fine.
Any suggestion on how to perform this kind of chained merge without creating an extra loop for every few files? At this point I'm experimenting with 40 files and could brute-force the whole process with nested loops, but that isn't an effective way of merging in the first place, and it's unacceptable if I need to scale this to merge even more files.
Duplicate column names will cause this error.
So you can add the suffixes parameter to the merge call:
suffixes : 2-length sequence (tuple, list, ...)
    Suffix to apply to overlapping column names in the left and right side, respectively (i.e., the overlapping value columns).
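A minimal sketch of how this applies to the loop above, with hypothetical per-iteration suffixes so every merged Mem Usage column keeps a unique name:
# Inside the loop: give the incoming Mem Usage column a unique suffix on each merge,
# so repeated merges never produce duplicate column names.
file1 = file1.merge(filei, on='PID', suffixes=('', '_' + str(i)))
With unique suffixes the chained merge should run through all 40 files in a single loop.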