Pandas/Dask - Very long time to write to file

Pandas/Dask - Very long time to write to file - python

I have a few files. The big one is ~87 million rows. I have others that are ~500K rows. Part of what I am doing is joining them, and when I try to do it with Pandas, I get memory issues. So I have been using Dask. It is super fast to do all the joins/applies, but then it takes 5 hours to write out to a csv, even if I know the resulting dataframe is only 26 rows.
I've read that some joins/applies are not the best for Dask, but does that mean it is slower using Dask? Because mine have been very quick. It takes seconds to do all of my computations/manipulations on the millions of rows. But it takes forever to write out. Any ideas how to speed this up/why this is happening?

You can use Dask Parallel Processing or try writing into Parquet file instead of CSV as Parquet operation is very fast with Dask

dask uses lazy evaluation. This means that when you perform the operations, you are actually only creating the processing graph.
Once you try to write your data to a csv file, Dask starts performing the operations.
And that is why it takes 5 hrs, he just needs to process a lot of data.
See https://tutorial.dask.org/01x_lazy.html for more information on the topic.
One way to speed up the processing would be to increase the parallelism by using a machine with more resources.

Related

Why does it take longer than using Pandas when I used modin.pandas [ray]

I'm just a Python newbie who's had fun dealing with data with Python.
When I was be able to use Python's representative data tool, Pandas, it seemed that it would be able to work on Excel very quickly.
However, I was somewhat disappointed to see it take more than 1 to 2 minutes to retrieve data(.xlsx) with 470,000 rows, and as a result, I found out that using modin and ray (or dask) would enable faster operation.
After learning how to use it simply as below, I compared it to using Pandas only. (this time, 100M rows data, about 5GB)
import ray
ray.init()
import modin.pandas as md
%%time
TB = md.read_csv('train.csv')
TB
But it only took 1 minute and 3 seconds to write Pandas, but it took 1 minute and 9 seconds to write modin [ray].
I was disappointed to see that it would take longer than just a small difference.
How can I use modin faster than pandas? Complex operations such as groupby or merge? Is there little difference in simply reading data?
Modin is faster to read data when other people are using it, is there something wrong with my computer's settings? I want to know why.
enter image description here
Write down the method installed at the prompt just in case you need it.
!pip install modin[ray]
!pip install ray[default]

First off, to do a fair assessment you always need to use the %%timeit magic command, which gives you an average of multiple runs.
Modin generally works best when you have:
Very large files
Large number of cores
The unimpressive performance, in your case, I believe is largely due to multi-processing management done by Ray/Dask, e.g. worker scheduling and all the set up that goes into parallelisation. When you meet at least one of the 2 criteria above (specially the first, given any current processor) the trade-off between the resource management and the speed up you get from Modin would be in your favour, but nor a 5GB file neither 6 cores are large enough to tip this in your favour. Parallelisation is costly, and the task must be worthy of it.
If it is a one-off, 1-2 minutes is not an unreasonable amount of time at all for this sort of thing. If it is a file that you are going to continuously read and write I would recommend writing it to HDF5 or pickle format in which case your read/write performance will improve far more than just using Modin.
Alternatively, Vaex is the fastest option around for reading any df. Though, I personally think it is still very incomplete and sometimes doesn't match the promises made about it beyond simple numerical-data operations, e.g. when you have large strings in your data.

dask out of memory error despite data size doesnt exceed memory

I'm trying to load a dask dataframe from a MySQL table which takes about 4gb space on disk. I'm using a single machine with 8gb of memory but as soon as I do a drop duplicate and try to get the length of the dataframe, an out of memory error is encountered.
Here's a snippet of my code:
df = dd.read_sql_table("testtable", db_uri, npartitions=8, index_col=sql.func.abs(sql.column("id")).label("abs(id)"))
df = df[['gene_id', 'genome_id']].drop_duplicates()
print(len(df))
I have tried more partitions for the dataframe(as many as 64) but they also failed. I'm confused why this could cause an OOM? The dataframe should fit in memory even without any parallel processing.

which takes about 4gb space on disk
It is very likely to be much much bigger than this in memory. Disk storage is optimised for compactness, with various encoding and compression mechanisms.
The dataframe should fit in memory
So, have you measured its size as a single pandas dataframe?
You should also keep in mind than any processing you do to your data often involves making temporary copies within functions. For example, you can only drop duplicates by first finding duplicates, which must happen before you can discard any data.
Finally, in a parallel framework like dask, there may be multiple threads and processes (you don't specify how you are running dask) which need to marshal their work and assemble the final output while the client and scheduler also take up some memory. In short, you need to measure your situation, perhaps tweak worker config options.

You don't want to read an entire DataFrame into a Dask DataFrame and then perform filtering in Dask. It's better to perform filtering at the database level and then read a small subset of the data into a Dask DataFrame.
MySQL can select columns and drop duplicates with distinct. The resulting data is what you should read in the Dask DataFrame.
See here for more information on syntax. It's easiest to query databases that have official connectors, like dask-snowflake.

Efficient use of dask with parquet files

I have received a huge (140MM records) dataset and Dask has come in handy but I'm not sure if I could perhaps do a better job. Imagine the records are mostly numeric (two columns are dates), so the process to transform from CSV to parquet was a breeze (dask.dataframe.read_csv('in.csv').to_parquet('out.pq')), but
(i) I would like to use the data on Amazon Athena, so a single parquet file would be nice. How to achieve this? As it stands, Dask saved it as hundreds of files.
(ii) For the Exploratory Data Analysis I'm trying with this dataset, there are certain operations where I need more then a couple of variables, which won't fit into memory so I'm constantly dumping two/three-variable views into SQL, is this code efficient use of dask?
mmm = ['min','mean','max']
MY_COLUMNS = ['emisor','receptor','actividad', 'monto','grupo']
gdict = {'grupo': mmm, 'monto': mmm, 'actividad': ['mean','count']}
df = dd.read_parquet('out.pq', columns=MY_COLUMNS).groupby(['emisor','receptor']).agg(gdict)
df = df.compute()
df.columns = ['_'.join(c) for c in df.columns] # ('grupo','max') -> grupo_max
df.to_sql('er_stats',conn,index=False,if_exists='replace')
Reading the file takes about 80 and writing to SQL about 60 seconds.

To reduce the number of partitions, you should either set the blocksize when reading the CSV (preferred), or repartition before writing the parquet. The "best" size depends on your memory and number of workers, but a single partition is probably not possible if your data is "huge". Putting the many partitions into a single file is also not possible (or, rather, not implemented), because dask writes in parallel and there would be no way of knowing where in the file the next part goes before the previous part is finished. I could imagine writing code to read in successive dask-produced parts and streaming them into a single output, it would not be hard but perhaps not trivial either.
writing to SQL about 60 seconds
This suggests that your output is still quite large. Is SQL the best option here? Perhaps writing again to parquet files would be possible.

Convert a csv larger than RAM into parquet with Dask

I have approximately 60,000 small CSV files of varying sizes 1MB to several hundred MB that I would like to convert into a single Parquet file. The total size of all the CSVs is around 1.3 TB. This is larger than the memory of the server that I am using (678 GB available).
Since all the CSVs have same fields, I've concatenated them into a single large file. I tried to process this file with Dask:
ddf = dd.read_csv("large.csv", blocksize="1G").to_parquet("large.pqt")
My understanding was that the blocksize option would prevent dask running out of memory when the job was split over multiple workers.
What happens is that eventually Dask does run out of memory and I get a bunch of messages like:
distributed.nanny - WARNING - Restarting worker
Is my approach completely wrong or am I just missing an important detail?

You don't have to concatenate all of your files into one large file. dd.read_csv is happy to accept a list of filenames, or a string with a "*" in it.
If you have text data in your CSV file, then loading it into pandas or dask dataframes can expand the amount of memory used considerably, so your 1GB chunks might be quite a bit bigger than you expect. Do things work if you use a smaller chunk size? You might want to consult this doc entry: https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-partitions
In general I recommend using Dask's dashboard to watch the computation, and see what is taking up your memory. This might help you find a good solution. https://docs.dask.org/en/latest/diagnostics-distributed.html

partitionBy taking too long while saving a dataset on S3 using Pyspark

I am trying to save a dataset using partitionBy on S3 using pyspark. I am partitioning by on a date column. Spark job is taking more than hour to execute it. If i run the code without partitionBy it just takes 3-4 mints.
Could somebody help me in fining tune the parititonby?

Ok, so spark is terrible at doing IO. Especially with respect to s3. Currently when you are writing in spark it will use a whole executor to write the data SEQUENTIALLY. That with the back and forth between s3 and spark leads to it being quite slow. So you can do a few things to help mitigate/side step these issues.
Use a different partitioning strategy, if possible, with the goal being minimizing files written.
If there is a shuffle involved before the write, you can change the settings around default shuffle size: spark.sql.shuffle.partitions 200 // 200 is the default you'll probably want to reduce this and/or repartition the data before writing.
You can go around sparks io and write your own hdfs writer or use s3 api directly. Using something like foreachpartition then a function for writing to s3. That way things will write in parallel instead of sequentially.
Finally, you may want to use repartition and partitionBy together when writing (DataFrame partitionBy to a single Parquet file (per partition)). This will lead to one file per partition when mixed with maxRecordsPerFile (below) above this will help keep your file size down.
As a side note: you can use the option spark.sql.files.maxRecordsPerFile 1000000 to help control file sizes to make sure they don't get out of control.
In short, you should avoid creating too many files, especially small ones. Also note: you will see a big performance hit when you go to read those 2000*n files back in as well.
We use all of the above strategies in different situations. But in general we just try to use a reasonable partitioning strategy + repartitioning before write. Another note: if a shuffle is performed your partitioning is destroyed and sparks automatic partitioning takes over. Hence, the need for the constant repartitioning.
Hope these suggestions help. SparkIO is quite frustrating but just remember to keep files read/written to a minimum and you should see fine performance.

Use version 2 of the FileOutputCommiter
.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.