Dask reading CSV, setting partition as CSV length - python

I'm trying to write code that will read from a set of CSVs named my_file_*.csv into a Dask dataframe.
Then I want to set the partitions based on the length of the CSV. I'm trying to map a function on each partition and in order to do that, each partition must be the whole CSV.
I've tried to reset the index, and then set partitions based on the length of each CSV but it looks like the index of the Dask dataframe is not unique.
Is there a better way to partition based on the length of each CSV?

So one partition should contain exactly one file?
You could do:
import dask.dataframe as dd
ddf = dd.read_csv("my_file_*.csv", blocksize=None)
Setting blocksize to None makes sure that files are not split up into several partitions. Therefore, ddf will be a Dask dataframe containing one file per partition.
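Since each partition then holds exactly one CSV file, mapping a function over the files is straightforward. A minimal sketch, assuming a hypothetical per-file function my_func that takes and returns a pandas DataFrame:
import dask.dataframe as dd
ddf = dd.read_csv("my_file_*.csv", blocksize=None)
def my_func(pdf):
    # pdf is a pandas DataFrame holding exactly one CSV file
    return pdf.assign(n_rows=len(pdf))
result = ddf.map_partitions(my_func).compute()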
You might want to check out the documentation:
general instructions on how to create Dask dataframes from data
details about read_csv

Related

Adding file name column to Dask DataFrame

I have a data set of around 400 CSV files containing a time series of multiple variables (my CSV has a time column and then multiple columns of other variables).
My final goal is to choose some variables and plot those 400 time series in a graph.
In order to do so, I tried to use Dask to read the 400 files and then plot them.
However, from my understanding, in order to actually draw 400 time series rather than a single appended data frame, I should group the data by the file name it came from.
Is there any Dask efficient way to add a column to each CSV so I could later groupby my results?
Parquet files are also an option.
For example, I tried to do something like this:
import dask.dataframe as dd
import os
filenames = ['part0.parquet', 'part1.parquet', 'part2.parquet']
df = dd.read_parquet(filenames, engine='pyarrow')
df = df.assign(file=lambda x: filenames[x.index])
df_grouped = df.groupby('file')
I understand that I can use from_delayed(), but then I lose all the parallel computation.
Thank you
If you can work with CSV files, then passing the include_path_column option might be sufficient for your purpose:
from dask.dataframe import read_csv
ddf = read_csv("some_path/*.csv", include_path_column="file_path")
print(ddf.columns)
# the list of columns will include `file_path` column
There is no equivalent option for read_parquet, but something similar can be achieved with delayed. Using delayed will not remove parallelism; the code just needs to make sure that the actual computation happens after the delayed tasks are defined.
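For example, a minimal sketch of the delayed approach (using the hypothetical file names from the question):
import dask
import dask.dataframe as dd
import pandas as pd
filenames = ['part0.parquet', 'part1.parquet', 'part2.parquet']
@dask.delayed
def load_with_name(path):
    pdf = pd.read_parquet(path)    # each file is read in its own parallel task
    return pdf.assign(file=path)   # record which file the rows came from
ddf = dd.from_delayed([load_with_name(f) for f in filenames])
df_grouped = ddf.groupby('file')   # grouping by the originating file now works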

Split a spark dataframe into multiple frames and write as CSV

I have a use case wherein I am reading data from a source into a dataframe, doing a groupBy on a field and essentially breaking that dataframe into an array of dataframes.
My target state is to have all these dataframes written as individual CSV files in S3 (CSV because they need to be downloaded by the client and need to be human readable).
What's the best way of going about this?
I used this to split df into df_array : df_array = [(df.where(df[column_name] == i),i) for i in distinct_values]
And df.toPandas().to_csv(output_path + '.csv', index=False) individually on the dataframes to convert them to CSV files - but the challenges I'm facing with this approach are:
My understanding is that since I require a single CSV file per grouping field, to_csv will bring data from all worker nodes to the driver and may give a driver OOM issue.
I am unable to use Python multiprocessing to write the individual dataframes to S3, since data is distributed on worker nodes and gives me an error: Spark: Broadcast variables: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation.
No space left on device.
The pipeline is pretty slow as well; what is a better way to approach this use case?
[EDIT]
I want to control the name of the CSV file which gets created as well. Target state is 1 CSV file per my group-by field ( let's call that Name ) so if there are 10 different Names in my initial df, output will be 10 CSV files each with the title as Name1.csv, Name2.csv and so on
As you're using PySpark, why don't you use repartition and partitionBy to achieve your goal?
df.repartition(1).write.partitionBy('grouping_field1', 'grouping_field2', ...).save('/path/to/save', format='csv')
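Note that partitionBy writes into subdirectories named grouping_field=value containing part-* files, rather than files named Name1.csv. If the exact file names matter, one workaround is to write each group separately and rename afterwards; a rough sketch (df, the Name column and the S3 path are placeholders from the question):
# Collect the distinct group values on the driver (small, since these are just the keys)
distinct_names = [r['Name'] for r in df.select('Name').distinct().collect()]
for name in distinct_names:
    (df.where(df['Name'] == name)
       .coalesce(1)                # one output file per group
       .write.mode('overwrite')
       .csv('s3://my-bucket/output/' + name, header=True))
# Spark still emits a part-0000*.csv inside each directory; renaming it to
# Name1.csv etc. has to be done afterwards (e.g. with boto3 or hadoop fs -mv).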

Read a partitioned parquet dataset from multiple files with PyArrow and add a partition key based on the filename

I have a bunch of parquet files, each containing a subset of my dataset. Let's say that the files are named data-N.parquet with N being an integer.
I can read them all and subsequently convert to a pandas dataframe:
import glob
import pyarrow.parquet as pq
files = glob.glob("data-**.parquet")
ds = pq.ParquetDataset(
    files,
    metadata_nthreads=64,
).read_table(use_threads=True)
df = ds.to_pandas()
This works just fine. What I would like to have is an additional column in the final data frame, indicating which file the data originated from.
As far as I understand, the ds data is partitioned, with one partition per file. So it would be a matter of including the partition key in the data frame.
Is this feasible?
Partition keys are, at the moment, included in the dataframe, but all existing partitioning schemes derive the key from directory names. So if your data were laid out as /N/data.parquet or /batch=N/data.parquet this would work (you would need to supply a partitioning object when you read the dataset).
There is no way today (in pyarrow) to get the filename in the returned results.
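As a workaround (a sketch, not a pyarrow feature), you can read each file separately and tag the rows with the file name before concatenating:
import glob
import pandas as pd
import pyarrow.parquet as pq
frames = []
for path in glob.glob("data-*.parquet"):
    pdf = pq.read_table(path).to_pandas()
    pdf["source_file"] = path    # record the originating file
    frames.append(pdf)
df = pd.concat(frames, ignore_index=True)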

How can I read each Parquet row group into a separate partition?

I have a parquet file with 10 row groups:
In [30]: print(pyarrow.parquet.ParquetFile("/tmp/test2.parquet").num_row_groups)
10
But when I load it using Dask Dataframe, it is read into a single partition:
In [31]: print(dask.dataframe.read_parquet("/tmp/test2.parquet").npartitions)
1
This appears to contradict this answer, which states that Dask Dataframe reads each Parquet row group into a separate partition.
How can I read each Parquet row group into a separate partition with Dask Dataframe? Or must the data be distributed over different files for this to work?
I believe that fastparquet will read each row-group separately, and the fact that pyarrow apparently doesn't could be considered a bug, or at least a feature enhancement that you could request on the Dask issue tracker. I would tend to agree that a set of files containing one row-group each and a single file containing the same row-groups should result in the same partition structure.
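Depending on your Dask version, read_parquet also accepts a split_row_groups argument that maps row groups to partitions; a minimal sketch:
import dask.dataframe as dd
# split_row_groups=True asks Dask to create one partition per row group
# (availability and default behaviour vary between Dask versions)
ddf = dd.read_parquet("/tmp/test2.parquet", split_row_groups=True)
print(ddf.npartitions)    # ideally 10, one per row group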
I can read the row groups in batches with pyarrow:
import pyarrow.parquet as pq
batch_size = 1  # number of rows per batch
_file = pq.ParquetFile("file.parquet")
batches = _file.iter_batches(batch_size)  # batches will be a generator of RecordBatch objects
for batch in batches:
    process(batch)

Pyspark split csv file in packets

I'm very new to Spark and still doing my first tests with it. I installed a single node and I'm using it as my master, on a decent server, running:
pyspark --master local[20]
And of course I'm facing some difficulties with my first steps using pyspark.
I have a CSV file of 40GB with around 300 million lines in it. What I want to do is find the fastest way to split this file into smaller packages and store them as CSV files as well. For that I have two scenarios:
First one: split the file without any criteria, just split it equally into, let's say, 100 pieces (3 million rows each).
Second one: the CSV data I'm loading is tabular and I have one column X with 100K different IDs. What I would like to do is to create a set of dictionaries and create smaller pieces of CSV files, where my dictionaries will tell me to which package each row should go.
So far, this is where I'm now:
from pyspark import SparkContext  # in the pyspark shell, sc and spark are already defined
sc = SparkContext.getOrCreate()
file_1 = r'D:\PATH\TOFILE\data.csv'
sdf = spark.read.option("header", "true").csv(file_1, sep=";", encoding='cp1252')
Thanks for your help!
The best (and probably "fastest") way to do this would be to take advantage of Spark's built-in partitioning of RDDs and write one CSV file from each partition. You may repartition or coalesce to create the desired number of partitions (let's say, 100). This will give you maximum parallelism (based on your cluster resources and configurations), as each Spark executor works on one partition at a time.
You may do one of these (a rough sketch follows the note below):
Do a mapPartitions over the DataFrame and write each partition to a unique CSV file.
OR use df.write.partitionBy("X").csv('mycsv.csv'), which will create one partition (and thereby file) per unique entry in "X".
Note. If you use HDFS to store your CSV files, Spark will automatically create multiple files to store the different partitions (number of files created = number of RDD partitions).
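A rough sketch in that spirit (the paths are placeholders; sdf is the dataframe loaded from the CSV; the first line uses repartition plus the CSV writer rather than a hand-written mapPartitions):
# ~100 equally sized pieces: one CSV part file is written per partition
sdf.repartition(100).write.mode('overwrite').csv(r'D:\PATH\output_equal', header=True)
# One output directory (with its own part files) per distinct value of column X
sdf.write.mode('overwrite').partitionBy('X').csv(r'D:\PATH\output_by_X', header=True)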
What I did in the end was to load the data as a Spark dataframe; Spark automatically creates equally sized partitions of 128MB (default configuration of Hive), and then I used the repartition method to redistribute my rows according to the values of a specific column of my dataframe.
# This will load my CSV data into a Spark dataframe and generate the required number of 128MB partitions to store my raw data.
sdf = spark.read.option('header', 'true').csv(file_1, sep=';', encoding='utf-8')
# This line will redistribute the rows of each partition according to the values of a specific column. Here I'm placing all rows with the same set of values on the same partition, and I'm creating 20 of them. (Spark handles allocating the rows, so the partitions end up roughly the same size.)
sdf_2 = sdf.repartition(20, 'TARGET_COLUMN')
# This line will save all my 20 partitions to different CSV files
sdf_2.write.saveAsTable('CSVBuckets', format='csv', sep=';', mode='overwrite', path=output_path, header='True')
The easiest way to split a CSV file is to use the Unix utility called split.
Just search for the split Unix command line tool.
I split my files using split -l 3500 XBTUSDorderbooks4.csv orderbooks
