I'm looking to enforce a specific size limit (4 GB) per file while writing a DataFrame to CSV in PySpark. I have already tried using maxPartitionBytes, but it is not working as expected.
Below is what I have used and tested on a 90 GB, ORC-formatted Hive table. At the export (write) stage it produces files of seemingly arbitrary sizes rather than 4 GB.
Any suggestion on how to split the files to a size limit while writing? I don't want to use repartition or coalesce here, as the DataFrame goes through a lot of wide transformations.
df.write.format("csv").mode("overwrite").option("maxPartitionBytes", 4*1024*1024*1024).save(outputpath)
According to the documentation, spark.sql.files.maxPartitionBytes applies on read. If you do any shuffles later, the final size of the tasks, and therefore the final files on write, may change (see the Spark configuration docs).
You may try spark.sql.files.maxRecordsPerFile instead, as according to the docs it applies on write:
spark.sql.files.maxRecordsPerFile — Maximum number of records to write out to a single file. If this value is zero or negative, there is no limit.
If that doesn't do the trick, I think the other option is, as you mentioned, to repartition this dataset just before the write.
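A minimal sketch of both options, assuming a recent Spark version where maxRecordsPerFile is also accepted as a per-write option; the record count is illustrative and should be derived from your average row size so that one file stays under roughly 4 GB (df and outputpath are from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Option 1: session-level setting, applied to every subsequent write
spark.conf.set("spark.sql.files.maxRecordsPerFile", 10000000)

# Option 2: per-write option, which takes precedence for this write only
(df.write
   .format("csv")
   .mode("overwrite")
   .option("maxRecordsPerFile", 10000000)  # illustrative; tune from average row size
   .save(outputpath))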
I am loading the dataset from a SQL database using pd.read_sql(). I tried to store 100 million rows and 300 columns in an Excel/CSV file, but it failed due to the 1,048,576-row limitation.
So I am trying to store the same data as a .tsv file using
df.to_csv("data.txt", header=True, index=False, sep='\t', mode='a')
I can't find a documented limitation for a tab-separated .txt file.
Is this good to go, or is there any other good option?
The only thing here that I am not sure about is how pandas works internally. Apart from that, your approach is totally fine. Hadoop widely uses the .tsv format to store and process data, and there is no such thing as "the limitation of a .tsv file": a file is just a sequence of bytes, and \t and \n are just characters with nothing special about them. The limitation you encountered is imposed by Microsoft Excel, not by the OS. For example, it was lower a long time ago, and other spreadsheet applications may impose different limits.
If you open('your_file.tsv', 'wt') and readline, bytes up to the next \n are simply read; nothing else happens. There is no rule about how many \ts are allowed before a \n or how many \ns are allowed in a file. They are all just bytes, and a file can contain as many characters as the OS allows.
That maximum varies across operating systems; according to NTFS vs FAT vs exFAT, the maximum file size on an NTFS file system is almost 16 TB. In practice, though, splitting a big file into multiple files of a reasonable size is a good idea, for example so you can distribute them easily.
To process data this big, you should take an iterative or distributed approach, for example with Hadoop.
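As a minimal sketch of the iterative approach (the connection string, table name, and chunk size are placeholders), you can stream from the database straight into a .tsv file with pandas' chunksize, so no more than one chunk is ever held in memory:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@host/db")  # placeholder connection

first = True
# read_sql with chunksize returns an iterator of DataFrames instead of one big frame
for chunk in pd.read_sql("SELECT * FROM big_table", engine, chunksize=500000):
    chunk.to_csv("data.tsv", sep="\t", index=False,
                 header=first, mode="w" if first else "a")
    first = False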
Probably not a good idea. Your limitation is your machine's memory, since pandas loads everything into memory, and a DataFrame of that size won't fit. You probably need more machines and a distributed computing framework like Apache Spark or Dask.
Alternatively, depending on what you want to do with the data, you might not need to load all of it into memory.
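For example, a rough sketch with Dask (column names and block size are made up for illustration), which reads the file lazily in blocks and only materializes the aggregated result:

import dask.dataframe as dd

# Lazily read the TSV in ~256 MB blocks; nothing is loaded until .compute()
df = dd.read_csv("data.tsv", sep="\t", blocksize="256MB")

# Out-of-core aggregation; "category" and "amount" are placeholder column names
summary = df.groupby("category")["amount"].mean().compute()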
I have received a huge (140MM-record) dataset and Dask has come in handy, but I'm not sure whether I could be doing a better job. The records are mostly numeric (two columns are dates), so converting from CSV to Parquet was a breeze (dask.dataframe.read_csv('in.csv').to_parquet('out.pq')), but:
(i) I would like to use the data on Amazon Athena, so a single Parquet file would be nice. How do I achieve this? As it stands, Dask saved it as hundreds of files.
(ii) For the exploratory data analysis I'm doing on this dataset, there are certain operations where I need more than a couple of variables, which won't fit into memory, so I'm constantly dumping two/three-variable views into SQL. Is this code an efficient use of Dask?
import dask.dataframe as dd

mmm = ['min', 'mean', 'max']
MY_COLUMNS = ['emisor', 'receptor', 'actividad', 'monto', 'grupo']
gdict = {'grupo': mmm, 'monto': mmm, 'actividad': ['mean', 'count']}

# read only the needed columns, then aggregate per (emisor, receptor)
df = dd.read_parquet('out.pq', columns=MY_COLUMNS).groupby(['emisor', 'receptor']).agg(gdict)
df = df.compute()
df.columns = ['_'.join(c) for c in df.columns]  # ('grupo','max') -> grupo_max
df.to_sql('er_stats', conn, index=False, if_exists='replace')  # conn: existing DB connection
Reading the file takes about 80 seconds and writing to SQL takes about 60 seconds.
To reduce the number of partitions, you should either set the blocksize when reading the CSV (preferred) or repartition before writing the Parquet. The "best" size depends on your memory and number of workers, but a single partition is probably not possible if your data is "huge". Putting the many partitions into a single file is also not possible (or, rather, not implemented), because Dask writes in parallel and there would be no way of knowing where in the file the next part goes before the previous part is finished. I could imagine writing code to read in successive Dask-produced parts and stream them into a single output; it would not be hard, but perhaps not trivial either.
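A minimal sketch of those two options (block size and partition count are illustrative, not recommendations; paths mirror the question):

import dask.dataframe as dd

# Option A (preferred): fewer, larger partitions straight from the CSV read
df = dd.read_csv('in.csv', blocksize='256MB')

# Option B: repartition just before writing the Parquet dataset
df = df.repartition(npartitions=32)  # size this to your memory and worker count

df.to_parquet('out.pq')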
writing to SQL about 60 seconds
This suggests that your output is still quite large. Is SQL the best option here? Perhaps writing to Parquet files again would be an option.
I am trying to save a dataset using partitionBy on S3 with PySpark, partitioning by a date column. The Spark job takes more than an hour to execute. If I run the code without partitionBy, it takes just 3-4 minutes.
Could somebody help me fine-tune the partitionBy?
OK, so Spark is terrible at doing IO, especially with respect to S3. Currently, when you write in Spark, it will use a whole executor to write the data sequentially. That, plus the back and forth between S3 and Spark, makes it quite slow. There are a few things you can do to mitigate or sidestep these issues:
Use a different partitioning strategy, if possible, with the goal of minimizing the number of files written.
If there is a shuffle involved before the write, you can change the setting for the default shuffle size: spark.sql.shuffle.partitions 200 (200 is the default). You'll probably want to reduce this and/or repartition the data before writing.
You can go around Spark's IO and write your own HDFS writer, or use the S3 API directly, e.g. with foreachPartition and a function that writes to S3. That way things will be written in parallel instead of sequentially.
Finally, you may want to use repartition and partitionBy together when writing (see DataFrame partitionBy to a single Parquet file (per partition)). This leads to one file per partition directory, and combined with maxRecordsPerFile (below) it will help keep your file sizes down.
As a side note, you can use the option spark.sql.files.maxRecordsPerFile 1000000 to help control file sizes and make sure they don't get out of control.
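Roughly, that combination could look like the sketch below ("date_col", the S3 path, and the numbers are placeholders, not recommendations):

# Smaller shuffle before the write, and a cap on records per output file
spark.conf.set("spark.sql.shuffle.partitions", 64)
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000000)

(df.repartition("date_col")       # one task per date -> one file per partition directory
   .write
   .mode("overwrite")
   .partitionBy("date_col")
   .parquet("s3://my-bucket/output/"))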
In short, you should avoid creating too many files, especially small ones. Also note that you will see a big performance hit when you go to read those 2000*n files back in as well.
We use all of the above strategies in different situations, but in general we just try to use a reasonable partitioning strategy plus repartitioning before the write. Another note: if a shuffle is performed, your partitioning is destroyed and Spark's automatic partitioning takes over; hence the need for the constant repartitioning.
Hope these suggestions help. Spark IO is quite frustrating, but just remember to keep the files read/written to a minimum and you should see fine performance.
Use version 2 of the FileOutputCommitter:
.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
There is already a nice question about this on SO, but the best answer is now 5 years old, so I think there should be better option(s) in 2018.
I am currently looking for a feature-engineering pipeline for a larger-than-memory dataset (using suitable dtypes).
The initial file is a CSV that doesn't fit in memory. Here are my needs:
1. Create features (mainly using groupby operations on multiple columns).
2. Merge the new features into the previous data (on disk, because it doesn't fit in memory).
3. Use a subset (or all) of the columns/index for some ML applications.
4. Repeat 1/2/3 (this is an iterative process: day 1, create 4 features; day 2, create 4 more; ...).
Attempt with Parquet and Dask:
First, I split the big CSV file into multiple small Parquet files. With this, Dask is very efficient for calculating the new features, but then I need to merge them into the initial dataset, and at the moment we cannot add new columns to Parquet files. Reading the CSV in chunks, merging, and re-saving to multiple Parquet files is too time-consuming, since feature engineering is an iterative process in this project.
Attempt with HDF and Dask:
I then turned to HDF because we can add columns and also use special queries, and it is still a binary file store. Once again, I split the big CSV file into multiple HDF files with the same key='base' for the base features, in order to get concurrent writing with Dask (which a single HDF file does not allow).
data = data.repartition(npartitions=10)  # otherwise to_hdf was saving 8 MB files
data.to_hdf('./hdf/data-*.hdf', key='base', format='table', data_columns=['day'], get=dask.threaded.get)
(Side question: specifying data_columns seems useless for Dask, as there is no "where" in dask.read_hdf?)
Contrary to what I expected, I am not able to merge the new feature into the multiple small files with code like this:
data = dd.read_hdf('./hdf/data-*.hdf', key='base')
data['day_pow2'] = data['day']**2
data['day_pow2'].to_hdf('./hdf/data-*.hdf', key='added', get=dask.threaded.get)
With dask.threaded I get "python stopped working" after 2%.
With dask.multiprocessing.get it takes forever and creates new files.
What are the most appropriated tools (storage and processing) for this workflow?
I will just copy a comment from the related issue on fastparquet: it is technically possible to add columns to existing Parquet data-sets, but this is not implemented in fastparquet and possibly not in any other Parquet implementation either.
Writing code to do this might not be too onerous (but it is not currently planned): the calls to write columns happen sequentially, so new columns for writing would need to percolate down to this function, together with the file position corresponding to the current first byte of the metadata in the footer. In addition, the schema would need to be updated separately (this is simple). The process would need to be repeated for every file of a data-set. This is not an "answer" to the question, but perhaps someone fancies taking on the task.
I would seriously consider using a database (indexed access) as storage, or even using Apache Spark (for processing data in a distributed/clustered way) with Hive/Impala as a backend.
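As a very rough sketch of the database idea (table, column, and file names are made up for illustration, and SQLite stands in for whatever database you pick): each batch of derived features gets its own indexed table, so later merges and column subsets can be read selectively instead of rewriting the base data:

import sqlite3
import pandas as pd

conn = sqlite3.connect("features.db")

# Placeholder batch of newly engineered features, keyed by a row id
new_features = pd.DataFrame({"row_id": [1, 2, 3],
                             "day": [1, 2, 3],
                             "day_pow2": [1, 4, 9]})

# Store each feature batch as its own table instead of rewriting the base data
new_features.to_sql("day_features", conn, if_exists="replace", index=False)

# An index on the join key gives cheap selective reads later
conn.execute("CREATE INDEX IF NOT EXISTS idx_row ON day_features (row_id)")

# Pull only the columns/rows needed for a given ML subset
subset = pd.read_sql("SELECT row_id, day_pow2 FROM day_features WHERE day > 1", conn)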
I am reading an input file using PySpark and I'm wondering what the best way is to repartition the input data so it is spread out evenly across the Mesos cluster.
Currently, I'm doing:
rdd = sc.textFile('filename').repartition(10)
I was looking at the sparkContext documentation and I noticed that the textFile method has an option called minPartitions, which defaults to None.
I'm wondering if it will be more efficient if I specify my partition value there. For example:
rdd = sc.textFile('filename', 10)
I'm assuming/hoping it will eliminate the need for a shuffle after the data has been read in, if I read the file in chunks to begin with.
Do I understand it correctly? If not, what is the difference between the two methods (if any)?
There are two main differences between these methods:
repartition shuffles the data after loading, while using minPartitions doesn't
repartition results in an exact number of partitions, while minPartitions provides only a lower bound (see Why does partition parameter of SparkContext.textFile not take effect?)
In general, if you load data using textFile there should be no need to repartition it further to get a roughly uniform distribution. Since input splits are computed based on the amount of data, all partitions should already be more or less the same size, so the only reason to further modify the number of partitions is to improve the utilization of resources like memory or CPU cores.
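To illustrate the difference (assuming an existing SparkContext sc and an input file large enough to produce several splits; the number 10 is arbitrary):

# Repartition after load: a full shuffle, but exactly 10 partitions
rdd_a = sc.textFile('filename').repartition(10)
print(rdd_a.getNumPartitions())   # 10

# minPartitions at read time: no shuffle, but only a lower bound
rdd_b = sc.textFile('filename', minPartitions=10)
print(rdd_b.getNumPartitions())   # >= 10, depends on the input splits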