How to speed up large data frame joins in Spark - python

I have two dataframes in Spark 2.4 of roughly the same size, each with about 40 million records. One is loaded directly from S3; the other is built by loading several dataframes and combining them with Spark SQL. I then join these two dataframes together multiple times, producing multiple result dataframes, and write each one as CSV to S3... However, my write times run upwards of 30 minutes, and I am not sure whether Spark is re-evaluating the dataframes or whether I simply need more CPUs for this task. Either way, I was hoping someone might have advice on how to optimize these write times.

When a dataframe is created from other dataframes, Spark first builds an execution plan; that plan is only evaluated when an action such as a write is executed.
The best way to handle this particular situation is to take advantage of Spark's (lazy) caching. (I have not seen a built-in eager-caching option for Spark, but if one exists it could be even better.)
By calling:
dataframe1.cache()
and
dataframe2.cache()
the first time you join the two dataframes, both are evaluated and loaded into the cache. When you join and write again, the two execution plans have already been evaluated, so the join and write are much faster.
This means the first write still takes over 30 minutes, but the other two writes are much quicker.
Additionally, you can improve performance with more CPUs and with proper partitioning and coalescing of the dataframes, which helps with the evaluation of the first join and write.
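A minimal sketch of that caching pattern, assuming illustrative S3 paths and a placeholder join key "id" (none of these names come from the original question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder inputs: one dataframe loaded from S3, one built elsewhere.
df1 = spark.read.csv("s3://bucket/input1/", header=True).cache()
df2 = spark.read.csv("s3://bucket/input2/", header=True).cache()
# Optional: an action such as df1.count() right after cache() forces eager materialization.

joined1 = df1.join(df2, "id")                              # "id" is a placeholder join key
joined1.write.mode("overwrite").csv("s3://bucket/out1/")   # first write evaluates and fills the cache

joined2 = df1.join(df2, "id").where("id IS NOT NULL")      # placeholder second join
joined2.write.mode("overwrite").csv("s3://bucket/out2/")   # reuses the cached inputs, so it is much faster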
Hope this helps.

Related

dask - merge/full cartesian product - memory issue

I have a dask dataframe and need to compare all the values in one column to all the values in the same column. Eventually I will be using a function on these value pairs, but for now I'm just trying to get the .merge to work. Example code:
my_dd['dummy'] = 1
my_dd = my_dd.merge(my_dd, how='inner', on='dummy', npartitions=100000)
print(str(len(my_dd)))
At this point my starting dataset is pretty small: my_dd has about 19K rows before the join, so after the join there are going to be about 360 million rows.
The above code gives me a memory error. I am running on a single machine / LocalCluster with 32 GB of RAM. When I specify npartitions, the code looks like it will work, but it eventually fails partway through (when most of the graph tasks are nearly complete).
I realize the problem here is the massive resulting dataframe, but I would imagine there is some solution. I have done a lot of digging and can't find anything.
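Not an answer from the original thread, but one common workaround is to build the cross product one chunk at a time and reduce each chunk immediately, so the full ~360-million-row frame never exists in memory at once. A plain-pandas sketch (the column name and the reduction are placeholders; how="cross" needs pandas 1.2+):

import pandas as pd

# Stand-in for the single column of my_dd (~19K rows).
values = pd.DataFrame({"val": range(19_000)})

chunk_size = 1_000
reduced = []
for start in range(0, len(values), chunk_size):
    chunk = values.iloc[start:start + chunk_size]
    pairs = chunk.merge(values, how="cross", suffixes=("_a", "_b"))  # chunk_size x 19K rows only
    # ... apply the real pairwise function here and keep just the reduced result ...
    reduced.append((pairs["val_a"] - pairs["val_b"]).abs().min())    # placeholder reduction
print(min(reduced))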

Why is the compute() method slow for Dask dataframes but the head() method is fast?

So I'm a newbie when it comes to working with big data.
I'm dealing with a 60 GB CSV file, so I decided to give Dask a try since it is built on pandas dataframes. This may be a silly question, but bear with me; I just need a little push in the right direction...
So I understand why the following query using the "compute" method would be slow (lazy computation):
df.loc[1:5 ,'enaging_user_following_count'].compute()
By the way, it took 10 minutes to compute.
But what I don't understand is why the following code using the "head" method prints the output in less than two seconds (in this case I'm asking for 250 rows, while the previous snippet asked for just 5):
df.loc[50:300 , 'enaging_user_following_count'].head(250)
Why doesn't the "head" method take a long time? I feel like I'm missing something here, because I'm able to pull a huge number of rows in a much shorter time than with the "compute" method.
Or is the compute method meant to be used in other situations?
Note: I tried to read the documentation, but there was no explanation of why head() is fast.
I played around with this a bit half a year ago. .head() does not check all of your partitions; it simply checks the first partition. There is no synchronization overhead, etc., so it is quite fast, but it does not take the whole dataset into account.
You could try
df.loc[-251: , 'enaging_user_following_count'].head(250)
IIRC, you will get the last 250 entries of the first partition rather than the rows at the actual last indices.
Also if you try something like
df.loc[conditionThatIsOnlyFulfilledOnPartition3 , 'enaging_user_following_count'].head(250)
you will get an error saying that head could not find 250 samples.
If you actually just want the first few entries, however, it is quite fast :)
This processes the whole dataset:
df.loc[1:5, 'enaging_user_following_count'].compute()
The reason is that loc is a label-based selector, and there is no telling which labels exist in which partition (there is no reason they should be monotonically increasing). If the index is well formed, you may have useful values in df.divisions, and in that case Dask can pick out only the partitions of your data that it needs.
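A hedged sketch of that last point: after setting a proper index, df.divisions becomes known and .loc can target individual partitions (the file pattern and index column here are made up):

import dask.dataframe as dd

df = dd.read_csv("data/big_file_*.csv")

# Setting an index is a one-off shuffle, but afterwards df.divisions records
# the index boundaries of every partition.
df = df.set_index("user_id")
print(df.divisions)

# With known divisions, this touches only the partitions covering labels 1..5.
subset = df.loc[1:5, "enaging_user_following_count"].compute()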

Pandas/Dask - Very long time to write to file

I have a few files. The big one is ~87 million rows; the others are ~500K rows. Part of what I am doing is joining them, and when I try to do it with Pandas, I get memory issues. So I have been using Dask. It is super fast to do all the joins/applies, but then it takes 5 hours to write out to a CSV, even though I know the resulting dataframe is only 26 rows.
I've read that some joins/applies are not the best fit for Dask, but does that mean Dask is slower for them? Because mine have been very quick: it takes seconds to do all of my computations/manipulations on the millions of rows. But it takes forever to write out. Any ideas how to speed this up, or why this is happening?
You can make better use of Dask's parallel processing, or try writing to a Parquet file instead of CSV, since Parquet writes are very fast with Dask.
Dask uses lazy evaluation. This means that when you define your operations, you are really only building the processing graph.
Only when you try to write your data to a CSV file does Dask start performing the operations.
And that is why it takes 5 hours: Dask simply has to process a lot of data at that point.
See https://tutorial.dask.org/01x_lazy.html for more information on the topic.
One way to speed up the processing would be to increase the parallelism by using a machine with more resources.
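A small sketch of both suggestions; result_dd and the output paths are placeholders standing in for the joined dataframe from the question:

import pandas as pd
import dask.dataframe as dd

# Placeholder for the already-joined Dask dataframe.
result_dd = dd.from_pandas(pd.DataFrame({"key": [1, 2, 3], "val": [10, 20, 30]}), npartitions=1)

# Parquet writes stay parallel and avoid CSV serialization overhead.
result_dd.to_parquet("output/joined_parquet/", write_index=False)

# If the final result really is only ~26 rows, it can be cheaper to compute it
# into a single pandas dataframe and write that directly:
result_dd.compute().to_csv("output/joined.csv", index=False)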

What parameters to use to improve performance of writing dataframes to a Parquet file?

I'm working with some data, and my code takes more than one minute to write a dataframe to a Parquet file. The dataframe has around 90,000 rows and 10 columns. It's my first time using Spark, so I'm not sure what performance to expect, but I think this is too much time. I've read some texts about improving Parquet write performance, but they haven't helped yet. I would like to know what parameters I could use to get better performance, or whether this is simply the normal time for data this small.
I have a for loop that iterates over my dataframe's date column, one day at a time, and writes to the file. In my current test there is only one day in the column, so the loop runs only once. All the other operations take about 10 seconds (I didn't include their code here), but when it reaches the line that writes the file, it takes more than 1 minute.
if i == 0:
    df.write.mode('overwrite').parquet(self.files['parquet'])
else:
    df.write.mode('append').parquet(self.files['parquet'])
You don't need a for loop for saving a Spark dataframe. Just do:
df.write.mode('overwrite').parquet(path)
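If the loop was only there to split the output by day, a hedged alternative (assuming the iterated column is literally named 'date') is to let Spark partition the output in a single write:

# One write, partitioned by the date column, instead of a Python loop.
(df.write
   .mode('overwrite')
   .partitionBy('date')                 # assumes the column is named 'date'
   .parquet(self.files['parquet']))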

Dataframe writing to Postgresql poor performance

Working in PostgreSQL, I have a cartesian join producing ~4 million rows.
The join takes ~5 seconds and the write back to the DB takes ~1 min 45 sec.
The data will be needed in Python, specifically in a pandas dataframe, so I am experimenting with reproducing this same data in Python. I should say here that all these tests are running on one machine, so nothing is going across a network.
Using psycopg2 and pandas, reading in the data and performing the join to get the 4 million rows (from an answer here: cartesian product in pandas) consistently takes under 3 seconds, which is impressive.
Writing the data back to a table in the database, however, takes anything from 8 minutes (best method) to 36+ minutes (plus some methods I rejected because I had to stop them after more than 1 hour).
While I was not expecting to reproduce the "SQL only" time, I would hope to get closer than 8 minutes (I'd have thought 3-5 minutes would not be unreasonable).
Slower methods include:
36 min - SQLAlchemy's table.insert (from 'test_sqlalchemy_core' here: https://docs.sqlalchemy.org/en/latest/faq/performance.html#i-m-inserting-400-000-rows-with-the-orm-and-it-s-really-slow)
13 min - psycopg2.extras.execute_batch (https://stackoverflow.com/a/52124686/3979391)
13-15 min (depending on chunksize) - pandas.DataFrame.to_sql (again using SQLAlchemy) (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html)
The best way (~8 min) is using psycopg2's cursor.copy_from method (found here: https://github.com/blaze/odo/issues/614#issuecomment-428332541).
This involves dumping the data to CSV first (in memory via io.StringIO); that alone takes 2 minutes.
So, my questions:
Anyone have any potentially faster ways of writing millions of rows from a pandas dataframe to postgresql?
The docs for the cursor.copy_from method (http://initd.org/psycopg/docs/cursor.html) state that the source object needs to support the read() and readline() methods (hence the need for io.StringIO). Presumably, if the dataframe supported those methods, we could dispense with the write to csv. Is there some way to add these methods?
Thanks.
Giles
EDIT:
On Q2: pandas can now use a custom callable for to_sql, and the example given here: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method does pretty much what I suggest above (i.e., it copies CSV data directly from STDIN using StringIO).
I found a ~40% increase in write speed using this method, which brings to_sql close to the "best" method mentioned above.
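For reference, a sketch of that callable, closely following the example in the linked pandas docs (the connection string, table name, and dataframe here are placeholders):

import csv
from io import StringIO

import pandas as pd
from sqlalchemy import create_engine

def psql_insert_copy(table, conn, keys, data_iter):
    # to_sql `method` callable that streams each chunk via COPY ... FROM STDIN.
    dbapi_conn = conn.connection                 # raw psycopg2 connection
    with dbapi_conn.cursor() as cur:
        buf = StringIO()
        csv.writer(buf).writerows(data_iter)     # dump this chunk to in-memory CSV
        buf.seek(0)
        columns = ', '.join('"{}"'.format(k) for k in keys)
        table_name = '{}.{}'.format(table.schema, table.name) if table.schema else table.name
        cur.copy_expert('COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns), buf)

engine = create_engine('postgresql+psycopg2://user:password@localhost/mydb')
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.to_sql('my_table', engine, index=False, if_exists='append', method=psql_insert_copy)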
Answering Q1 myself:
It seems the issue had more to do with PostgreSQL (or rather with databases in general). Taking into account the points made in this article: https://use-the-index-luke.com/sql/dml/insert I found the following:
1) Removing all indexes from the destination table resulted in the query running in 9 seconds. Rebuilding the indexes (in PostgreSQL) took a further 12 seconds, so still well under the other times (see the sketch after this list).
2) With only a primary key in place, inserting rows ordered by the primary key columns reduced the time taken to about a third. This makes sense, as there should be little or no shuffling of the index rows required. I also verified that this is the reason my cartesian join in PostgreSQL was faster in the first place (i.e., the rows happened to be ordered by the index, purely by chance); placing the same rows in a temporary table (unordered) and inserting from that actually took a lot longer.
3) I tried similar experiments on our MySQL systems and found the same increase in insert speed when removing indexes. With MySQL, however, it seemed that rebuilding the indexes used up any time gained.
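A hedged sketch of point 1 with psycopg2 (connection details, table, and index names are placeholders):

import io
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")          # placeholder connection
buf = io.StringIO("1,foo\n2,bar\n")                     # stands in for the CSV dumped from the dataframe

with conn, conn.cursor() as cur:
    cur.execute("DROP INDEX IF EXISTS results_colx_idx;")             # drop secondary indexes first
    cur.copy_expert("COPY results (id, colx) FROM STDIN WITH CSV", buf)
    cur.execute("CREATE INDEX results_colx_idx ON results (colx);")   # rebuild once, after the load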
I hope this helps anyone else who comes across this question from a search.
I still wonder whether it is possible to remove the write-to-CSV step in Python (Q2 above), as I believe I could then write something in Python that would be faster than pure PostgreSQL.
Thanks, Giles
