I have a dask dataframe and need to compare every value in one column against every value in that same column (a self cross join). Eventually I will be applying a function to these value pairs, but for now I'm just trying to get the .merge to work. Example code:
my_dd['dummy'] = 1
my_dd = my_dd.merge(my_dd, how='inner', on='dummy', npartitions=100000)
print(str(len(my_dd)))
At this point my starting data set is pretty small - my_dd before the join has about 19K rows, so after the join there are going to be about 360 million rows.
The above code gives me a memory error. I'm running on a single machine / LocalCluster with 32GB RAM. When specifying npartitions, the code looks like it will work but eventually fails partway through (when most of the graph tasks are complete).
I realize the problem here is the massive resulting df, but I would imagine there is some solution. Have done a lot of digging and can't find anything.
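For what it's worth, here is a minimal sketch of one way to keep the cross product from ever sitting in memory all at once (the partition count and output path are illustrative, and it assumes the end goal is to apply the pairwise function and persist or aggregate the result rather than materialize all ~360M rows):

# Split the left side into many small partitions; keep the right side as a plain
# pandas DataFrame (it is only ~19K rows), so the merge happens blockwise with
# no shuffle and each output partition stays a manageable size.
my_dd = my_dd.assign(dummy=1)
left = my_dd.repartition(npartitions=200)   # ~100 rows per partition
right = my_dd.compute()                     # small enough to hold in memory

pairs = left.merge(right, how="inner", on="dummy", suffixes=("_a", "_b"))

# Avoid len(pairs): it materializes and counts all ~360M rows at once. Applying
# the pairwise function and then writing (or aggregating) streams the work
# partition by partition instead.
pairs.to_parquet("pairs_output/")           # illustrative output path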
Related
I'm working with a dataset stored in an S3 bucket (parquet files) consisting of a total of ~165 million records (with ~30 columns). The requirement is to first group by a certain ID column and then generate 250+ features for each of these grouped records based on the data. Building these features is quite complex, using multiple Pandas functions along with 10+ supporting functions. The groupby should produce ~5-6 million records, so the final output should be a 6M x 250 shaped dataframe.
Now, I've tested the code on a smaller sample and it works fine. The issue is, when I'm implementing it on the entire dataset, it takes a very long time - the progress bar in Spark display doesn't change even after 4+ hours of running. I'm running this in AWS EMR Notebook connected to a Cluster (1 m5.xlarge Master & 2 m5.xlarge Core Nodes).
I've tried with 1 m5.4xlarge Master & 2 m5.4xlarge Core Nodes, 1 m5.xlarge Master & 8 m5.xlarge Core Nodes combinations among others. None of them have shown any progress.
I've tried running it in Pandas in-memory in my local machine for ~650k records, the progress was ~3.5 iterations/sec which came to be an ETA of ~647 hours.
So, the question is - can anyone suggest a better approach to cut down the time consumption and speed up the processing? Should another cluster type be used for this use case? Should the code be refactored, should the Pandas dataframe usage be removed, or would any other pointer help?
Thanks much in advance!
First things first: is your data partitioned enough to take advantage of all of your workers? If some part of your process causes it to coalesce to e.g. a single partition, then you're basically running single-threaded.
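As a quick sanity check (a sketch; the dataframe name, partition count and key column are illustrative):

# How many partitions is the data actually spread across?
print(spark_df.rdd.getNumPartitions())

# If it has collapsed to a handful of partitions, spread it back out before the
# expensive per-group work.
spark_df = spark_df.repartition(200, "id")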
Beyond that, I don't know for certain without seeing the code, but here's a subtle behaviour that can cause runtimes to become massive:
from sklearn.linear_model import LinearRegression

source_df = ...       # some pandas dataframe with a lot of features in columns
feature_cols = [...]  # placeholder for the subset of columns used as regressors

flattened_df = source_df.stack().reset_index().unstack()  # turn the features into rows
spark_df = spark.createDataFrame(flattened_df)  # 'index' is the column that contains the feature name

# a function to do a linear regression and calculate the residual
def your_good_pandas_function(key, slice):
    clf = LinearRegression()
    X = slice[feature_cols]
    y = slice[key]
    clf.fit(X, y)
    predicted = clf.predict(X)
    return y - predicted

def your_bad_pandas_function(key, slice):
    clf = LinearRegression()
    X = slice[feature_cols]
    y = slice[key]
    clf.fit(X, y)
    predicted = clf.predict(X)
    return source_df[key] - predicted  # reaches outside the UDF's scope

spark_df.groupBy('index').applyInPandas(your_good_pandas_function, schema=some_schema)  # fast
spark_df.groupBy('index').applyInPandas(your_bad_pandas_function, schema=some_schema)   # slow
These two applyInPandas functions do the same thing - they linear-regress some characteristics against a feature and calculate the residual. The first uses only variables that are in scope within the pandas UDF. The second uses a variable that is out of the UDF's scope. In the second case, Spark will "help" you out by serializing source_df and shipping it to every single invocation of your pandas UDF. This causes enormous memory usage and will kill your job.
Your data don't seem large enough to take that long, so my guess is that the reason it works on a small subset but not on the larger set is that you're inadvertently shipping the larger set into your applyInPandas calls.
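A sketch of one way to avoid that trap, with made-up column names ("id", "target", "f1", "f2") and building on the snippet above: join whatever the UDF needs into the Spark DataFrame before the groupBy, so it arrives inside each group's slice instead of being captured from the enclosing scope.

from sklearn.linear_model import LinearRegression

# Join the needed columns in up front (column names are illustrative).
extras = spark.createDataFrame(source_df[["id", "target"]])
spark_df = spark_df.join(extras, on="id", how="left")

def residuals(pdf):
    # Only local names and columns of pdf are referenced here, so no large
    # out-of-scope object has to be serialized and shipped with each task.
    clf = LinearRegression()
    X, y = pdf[["f1", "f2"]], pdf["target"]
    clf.fit(X, y)
    pdf["residual"] = y - clf.predict(X)
    return pdf[["id", "residual"]]

spark_df.groupBy("id").applyInPandas(residuals, schema="id long, residual double")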
Really struggling to make sense of all the performance-tuning information I'm finding online and to apply it in my notebook.
I have a dataframe that looks like this:
I'd like to pivot / unpivot this data into a wider dataframe, i.e.:
At the moment I use a simple script:
from pyspark.sql import functions as F

def pivotData(self, data):
    df = data
    # df.persist()
    df = df.groupBy("GROUP", "SUBGROUP").pivot("SOURCE").agg(F.first(F.col("VALUE")))
    return df
The above does exactly what I need on my smaller subset of data, pretty quickly, but as soon as I plug in the production parquet data to be consumed (which I assume has billions of records), IT TAKES FOREVER.
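One pivot-specific tuning that may be worth trying (a sketch; the SOURCE values below are placeholders): if the distinct SOURCE values are known up front and passed to pivot(), Spark can skip the extra pass over the data it otherwise runs just to discover them.

from pyspark.sql import functions as F

# "SRC_A", "SRC_B", "SRC_C" stand in for the real distinct SOURCE values.
known_sources = ["SRC_A", "SRC_B", "SRC_C"]
df = (
    data.groupBy("GROUP", "SUBGROUP")
        .pivot("SOURCE", known_sources)
        .agg(F.first(F.col("VALUE")))
)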
Other info: AWS Dev endpoint:
Number of workers: 5
Worker type: G.1X
Data processing units (DPUs): 6
This post is really just a reach-out to see if anyone has any tips on improving performance. Perhaps my code needs to change completely and move away from groupBy and pivot? I have absolutely no idea what sort of speeds I should be seeing when working with billions of records, but every article I read seems to be doing things in seconds :(
Any tips / articles you Python / PySpark / Glue experts have would be greatly appreciated. I'm growing tired of looking at this progress bar doing nothing.
I have 2 dataframes in Spark 2.4, each with about 40 million records, so they are close to the same size. One is generated simply by loading a dataframe from S3; the other loads a bunch of dataframes and uses Spark SQL to generate one big dataframe. I then join these 2 dataframes together multiple times into multiple dataframes and try to write them as CSV to S3. However, I am seeing write times upwards of 30 minutes, and I am not sure whether the dataframes are being re-evaluated or whether I simply need more CPUs for this task. Either way, I was hoping someone might have some advice on how to optimize these write times.
When a dataframe is created from other dataframes, what gets built first is an execution plan; that plan is only evaluated when a write (or another action) is executed.
The best way to handle this particular situation is to take advantage of Spark's caching. Note that cache() itself is lazy; if you want the cache populated eagerly, you can force it by calling an action such as count() right after caching.
By doing:
dataframe1.cache()
And
dataframe2.cache()
when you join these 2 dataframes the first time, both dataframes are evaluated and loaded into the cache. Then, when joining and writing again, the two dataframes' execution plans have already been evaluated, and the join and write become much faster.
This means the first write still takes over 30 minutes but the other 2 writes are much quicker.
Additionally, you can increase performance with additional CPUs and with proper partitioning and coalescing of the dataframes. That could help with the evaluation of the first join and write operation.
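A minimal sketch of the pattern (the key column, partition count and S3 path are illustrative):

dataframe1.cache()
dataframe1.count()   # action forces evaluation, so the cache is populated now
dataframe2.cache()
dataframe2.count()

joined = dataframe1.join(dataframe2, on="id")   # subsequent joins reuse the cache

# Controlling the number of output files can also help the S3 write itself.
joined.repartition(64).write.mode("overwrite").csv("s3://bucket/output/")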
Hope this helps.
Working in PostgreSQL, I have a cartesian join producing ~4 million rows.
The join takes ~5sec and the write back to the DB takes ~1min 45sec.
The data will be required for use in python, specifically in a pandas dataframe, so I am experimenting with duplicating this same data in python. I should say here that all these tests are running on one machine, so nothing is going across a network.
Using psycopg2 and pandas, reading in the data and performing the join to get the 4 million rows (from an answer here: cartesian product in pandas) takes consistently under 3 secs, impressive.
Writing the data back to a table in the database however takes anything from 8 minutes (best method) to 36+minutes (plus some methods I rejected as I had to stop them after >1hr).
While I was not expecting to reproduce the "sql only" time, I would hope to be able to get closer than 8 minutes (I'd have thought 3-5 mins would not be unreasonable).
Slower methods include:
36min - sqlalchemy's table.insert (from 'test_sqlalchemy_core' here https://docs.sqlalchemy.org/en/latest/faq/performance.html#i-m-inserting-400-000-rows-with-the-orm-and-it-s-really-slow)
13min - psycopg2.extras.execute_batch (https://stackoverflow.com/a/52124686/3979391)
13-15min (depends on chunksize) - pandas.dataframe.to_sql (again using sqlalchemy) (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html)
Best way (~8min) is using psycopg2's cursor.copy_from method (found here: https://github.com/blaze/odo/issues/614#issuecomment-428332541).
This involves dumping the data to a csv first (in memory via io.StringIO), that alone takes 2 mins.
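For reference, a minimal sketch of that route (df is the pandas dataframe in question; the connection string and table name are placeholders):

import io
import psycopg2

# Dump the dataframe to CSV in memory (this is the ~2 minute step), then stream
# it to the server with COPY. Note that copy_from uses PostgreSQL's text COPY
# format; if values can contain the separator or newlines, copy_expert with
# "COPY ... FROM STDIN WITH CSV" is the safer variant.
buf = io.StringIO()
df.to_csv(buf, index=False, header=False)
buf.seek(0)

conn = psycopg2.connect("dbname=mydb user=me")   # placeholder connection string
with conn, conn.cursor() as cur:
    cur.copy_from(buf, "target_table", sep=",", columns=list(df.columns))
conn.close()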
So, my questions:
Anyone have any potentially faster ways of writing millions of rows from a pandas dataframe to postgresql?
The docs for the cursor.copy_from method (http://initd.org/psycopg/docs/cursor.html) state that the source object needs to support the read() and readline() methods (hence the need for io.StringIO). Presumably, if the dataframe supported those methods, we could dispense with the write to csv. Is there some way to add these methods?
Thanks.
Giles
EDIT:
On Q2 - pandas can now use a custom callable for to_sql, and the example given here: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method does pretty much what I suggest above (i.e. it copies CSV data via COPY ... FROM STDIN, using StringIO).
I found an ~40% increase in write speed using this method, which brings to_sql close to the "best" method mentioned above.
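For anyone who wants it, a sketch of that callable, adapted from the pandas docs linked above (the engine URL and table name are placeholders, and df is the dataframe being written):

import csv
import io
from sqlalchemy import create_engine

def psql_insert_copy(table, conn, keys, data_iter):
    # Callable for DataFrame.to_sql(method=...): routes the rows through
    # COPY ... FROM STDIN WITH CSV instead of individual INSERT statements.
    with conn.connection.cursor() as cur:
        buf = io.StringIO()
        csv.writer(buf).writerows(data_iter)
        buf.seek(0)
        columns = ", ".join('"{}"'.format(k) for k in keys)
        table_name = "{}.{}".format(table.schema, table.name) if table.schema else table.name
        cur.copy_expert("COPY {} ({}) FROM STDIN WITH CSV".format(table_name, columns), buf)

engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")  # placeholder
df.to_sql("target_table", engine, index=False, if_exists="append", method=psql_insert_copy)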
Answering Q 1 myself:
It seems the issue had more to do with PostgreSQL (or rather with databases in general). Taking into account points made in this article: https://use-the-index-luke.com/sql/dml/insert I found the following:
1) Removing all indexes from the destination table resulted in the query running in 9 seconds. Rebuilding the indexes (in postgresql) took a further 12 seconds, so still well under the other times.
2) With only a primary key in place, inserting rows ordered by the primary key columns reduced the time taken to about a third. This makes sense as there should be little or no shuffling of the index rows required. I also verified that this is the reason my cartesian join in postgresql was faster in the first place (i.e. the rows happened to be ordered by the index, purely by chance); placing the same rows in a temporary table (unordered) and inserting from that actually took a lot longer.
3) I tried similar experiments on our mysql systems and found the same increase in insert speed when removing indexes. With mysql, however, it seemed that rebuilding the indexes used up any time gained.
I hope this helps anyone else who comes across this question from a search.
I still wonder if it is possible to remove the write-to-CSV step in Python (Q2 above), as I believe I could then write something in Python that would be faster than pure PostgreSQL.
Thanks, Giles
I have two dataframes, each around 400k rows; call them A and B. What I want to do is: for every row in dataframe B, look up that row's account number in dataframe A, and if it exists, drop that row from dataframe A. The problem is that when I try to run this code, I keep getting memory errors. Initially I was using iterrows, but that seems to be bad when working with large datasets, so I switched to apply, but I'm running into the same error. Below is simplified pseudocode of what I'm trying:
def reduceAccount(accountID):
    # Find the rows in frameA with this account number and drop them from frameA.
    idx = frameA.loc[frameA["AccountID"] == accountID].index
    frameA.drop(idx, inplace=True)

frameB["AccountID"].apply(reduceAccount)
I've even tried some shenanigans like iterating through the first few hundred/thousand rows, but after a cycle I still hit the memory error, which makes me think I'm still loading things into memory rather than clearing them out. Is there a better way to reduce dataframe A than what I'm trying? Note that I do not want to merge the frames (yet), just remove any row in dataframe A that has a matching key in dataframe B.
The issue is that in order to see all the values to filter on, you will need to store both DFs in memory at some point. You can improve your efficiency somewhat by not using apply(), which still iterates row by row. The following is a more efficient, vectorized approach using a boolean mask directly to keep only the rows of A whose AccountID does not appear in B:
dfA[~dfA["AccountID"].isin(dfB["AccountID"])]
However, if storage is the problem, then this may still not work. Some approaches to consider are chunking the data, as you say you've already tried, or some of the options in the pandas documentation on enhancing performance.
So basically you want every row in A whose 'AccountID' is not in B.
This can be done with a left anti-join: merge A against B's AccountID column with how='left' and indicator=True, then keep only the 'left_only' rows (a plain left join on its own keeps every row of A, so the indicator filter is the step that actually removes the matches).
I think this is best in terms of memory efficiency, as you'll be leveraging pandas' optimized built-in code.
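Concretely, a small sketch with made-up data showing both patterns:

import pandas as pd

frameA = pd.DataFrame({"AccountID": [1, 2, 3, 4], "val": list("wxyz")})
frameB = pd.DataFrame({"AccountID": [2, 4], "other": ["a", "b"]})

# Option 1: boolean mask (same idea as the isin() answer above).
reduced = frameA[~frameA["AccountID"].isin(frameB["AccountID"])]

# Option 2: left merge with an indicator column, then keep the "left_only" rows.
merged = frameA.merge(frameB[["AccountID"]], on="AccountID", how="left", indicator=True)
reduced_via_merge = merged.loc[merged["_merge"] == "left_only", frameA.columns]

print(reduced)            # rows with AccountID 1 and 3
print(reduced_via_merge)  # the same rows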