I'm really struggling to make sense of all the performance tuning information I'm finding online and to apply it in my notebook.
I have a dataframe that looks like the following.
I'd like to pivot/unpivot this data into a wider dataframe, i.e.:
At the moment I use a simple script:
from pyspark.sql import functions as F

def pivotData(self, data):
    df = data
    # df.persist()
    df = df.groupBy("GROUP", "SUBGROUP").pivot("SOURCE").agg(F.first(F.col("VALUE")))
    return df
The above does exactly what I need on my smaller subset of data, and pretty quickly, but as soon as I plug in the production parquet data to be consumed (which I assume has billions of records), it takes forever.
Other info: AWS Dev endpoint:
Number of workers: 5
Worker type: G.1X
Data processing units (DPUs): 6
This post is really just a reach-out to see if anyone has tips on improving performance. Perhaps my code needs to change completely and move away from groupBy and pivot? I have absolutely no idea what sort of speeds I should expect when working with billions of records, but every article I read seems to be doing things in seconds :(
Any tips or articles you python / pyspark / glue experts have would be greatly appreciated. I'm growing tired of looking at this progress bar doing nothing.
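One variation I have been meaning to try, but have not benchmarked yet, is passing the distinct SOURCE values to pivot() explicitly so Spark can skip the extra job that computes them. The values list below is a made-up placeholder:

from pyspark.sql import functions as F

# untested sketch: supplying the pivot values up front avoids an extra
# pass over the data just to discover the distinct SOURCE values
source_values = ["SOURCE_A", "SOURCE_B", "SOURCE_C"]

df = (data
      .groupBy("GROUP", "SUBGROUP")
      .pivot("SOURCE", source_values)
      .agg(F.first("VALUE")))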
Related
I have a dask dataframe and need to compare all the values in one column to all the values in the same column. Eventually I will be using a function on these value pairs, but for now I'm just trying to get the .merge to work. Example code:
my_dd['dummy'] = 1
my_dd = my_dd.merge(my_dd, how='inner', on='dummy', npartitions=100000)
print(str(len(my_dd)))
At this point my starting data set is pretty small - my_dd before the join has about 19K rows, so after the join there are going to be about 360 million rows.
The above code gives me a memory error. I'm running on a single machine / LocalCluster with 32 GB of RAM. When specifying npartitions, the code looks like it will work but eventually fails partway through (when most of the graph tasks are complete).
I realize the problem here is the massive resulting df, but I would imagine there is some solution. I've done a lot of digging and can't find anything.
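For illustration, this is the kind of chunked approach I imagine might work: hold one copy of the small table as plain pandas and cross it against each partition, so the full ~360-million-row product never has to exist at once. A rough, untested sketch:

# untested sketch: the left side is only ~19K rows, so keep one copy in memory
small_pd = my_dd.compute().assign(dummy=1)

def cross_with_small(part):
    # cross product of a single partition with the full small table
    return part.assign(dummy=1).merge(small_pd, on='dummy', suffixes=('_l', '_r'))

pairs = my_dd.map_partitions(cross_with_small)
# then apply the pairwise function and reduce per partition,
# rather than calling len() on the materialised product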
I am a beginner with pandas; I picked it up as it seemed to be the most popular library and the easiest to work with based on reviews. My intention is fast data processing using async processes (pandas doesn't really support async, but I haven't hit that problem yet). If you believe I could use a better library for my needs based on the scenarios below, please let me know.
My code is running websockets using asyncio which are fetching activity data constantly and storing it into a pandas DataFrame like so:
data_set.loc[len(data_set)] = [datetime.now(), res['data']['E'], res['data']['s'],
                               res['data']['p'], res['data']['q'], res['data']['m']]
That seems to work while printing out the results. The data frame gets big quickly, so I have a cleanup function that checks the len() of the data frame and drop()s rows.
My intention is to take the full set in data_set and create a summary view based on a group value, and to calculate additional analytics using the grouped data and data points at different date_time snapshots. These calculations would run multiple times per second.
What I mean is this (it's all made up, not working code, just the principle of what's needed):
grouped_data = data_set.groupby('name')
stats_data['name'] = grouped_data['name'].drop_duplicates()
stats_data['latest'] = grouped_data['column_name'].tail(1)
stats_data['change_over_1_day'] = ? (need to get the oldest record within a 1-day window out of multiple days of data, take the value from a specific column and compare it against ['latest'])
stats_data['change_over_2_day'] = ?
stats_data['change_over_3_day'] = ?
stats_data['total_over_1_day'] = grouped_data.filter(data > 1 day ago).sum(column_name)
I have googled a million things, and every time the examples are quite basic and don't really cover my scenario.
Any help appreciated.
The question was a bit vague, I guess, but after some more research (googling) and hours of trial and error I managed to accomplish everything I mentioned here.
Hopefully this can save some time for someone who is new to this:
# latest row per name
stats_data = data.loc[data.groupby('name')['date_time'].idxmax()].reset_index(drop=True)
# oldest value per name within the last day (day_1 = the cutoff timestamp one day back)
value_1_day_ago = data.loc[data[data.date_time > day_1].groupby('name')['date_time'].idxmin()].drop(labels=['date_time', 'id', 'volume', 'flag'], axis=1).set_index('name')['value']
stats_data['change_over_1_day'] = stats_data['value'].astype('float') / stats_data['name'].map(value_1_day_ago).astype('float') * 100 - 100
The same principle applies to the other columns.
If anyone has a much more efficient/faster way to do this, please post your answer.
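For anyone comparing approaches, here is a variant built on sort_values plus groupby first/last that I sketched but have not timed against the version above. It assumes the same 'name', 'date_time' and 'value' columns:

import pandas as pd

def change_over(df, days):
    # % change per name between the latest value and the oldest value
    # inside the trailing window of `days`
    cutoff = df['date_time'].max() - pd.Timedelta(days=days)
    ordered = df.sort_values('date_time')
    latest = ordered.groupby('name')['value'].last().astype(float)
    oldest_in_window = (ordered[ordered['date_time'] > cutoff]
                        .groupby('name')['value'].first().astype(float))
    return latest / oldest_in_window * 100 - 100

stats_data['change_over_1_day'] = stats_data['name'].map(change_over(data, 1))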
I'm working on a data set of ~100k lines in PySpark, and I want to convert it to Pandas. The data, on web clicks, contains string variables and is read from a .snappy.orc file in an Amazon S3 bucket via spark.read.orc(...).
The conversion is running too slow for my application (for reasons very well explained here on Stack Overflow), so I've tried to downsample my Spark DataFrame to one tenth - the dataset is so large that the statistical analysis I need to do is probably still valid. However, I need to repeat the analysis for 5000 similar datasets, which is why speed is a concern.
What surprised me is that the running time of df.sample(False, 0.1).toPandas() is exactly the same as df.toPandas() (approx. 180 s), so I don't get the reduction in running time I was hoping for.
I'm suspecting it may be a question of putting in a .cache() or .collect(), but I can't figure out a proper way to fit it in.
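For reference, this is roughly how I imagined fitting a .cache() in, i.e. materialising the 10% sample before converting, though I have not confirmed it actually helps (and the Arrow setting name may differ by Spark version):

# untested sketch: force the sample to be computed and cached, so toPandas()
# only has to collect the sampled rows; Arrow speeds up the conversion itself
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

sampled = df.sample(False, 0.1).cache()
sampled.count()               # materialise the sample into the cache
small_pdf = sampled.toPandas()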
TL;DR
I want to allow workers to use a scattered Pandas DataFrame, but not allow them to mutate any data. See below for sample code. Is this possible? (Or is this a pure Pandas question?)
Full question
I'm reading a Pandas Dataframe, and scattering it to the workers. I then use this future when I submit new tasks, and store it in a Variable for easy access.
Sample code:
import pyarrow.parquet as pq
from dask.distributed import Client, Variable

client = Client()  # assumes an already-running scheduler / LocalCluster
df = pq.read_table('/data/%s.parq' % dataset).to_pandas()
df = client.scatter(df, broadcast=True, direct=True)
v = Variable('dataset')
v.set(df)
When I submit a job I use:
def top_ten_x_based_on_y(dataset, column):
    return dataset.groupby(dataset['something'])[column].mean().sort_values(ascending=False)[0:10].to_dict()
a = client.submit(top_ten_x_based_on_y, df, 'a_column')
Now, I want to run 10-20 QPS against this dataset, which all workers hold in memory (data < RAM), but I want to protect against accidental changes to the dataset, such as one worker "corrupting" its own copy, which could lead to inconsistencies. Preferably this would raise an exception on any attempt to modify the data.
The data set is roughly 2GB.
I understand this might be problematic, since a Pandas DataFrame itself is not immutable (although a NumPy array can be made immutable).
Other ideas:
Copy the dataset on each query, but a 2 GB copy takes time even in RAM (roughly 1.4 seconds).
Devise a way to hash a DataFrame (problematic in itself, even though hash_pandas_object now exists), and check before and after (or every minute) whether the DataFrame is still as expected; see the rough sketch below. Running hash_pandas_object takes roughly 5 seconds.
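To make the second idea concrete, the rough shape I have in mind is something like this (the fingerprint is computed on the pandas frame before scattering; the helper names are made up):

import hashlib
from pandas.util import hash_pandas_object

def fingerprint(frame):
    # ~5 seconds on the 2 GB frame, as noted above
    return hashlib.sha1(hash_pandas_object(frame, index=True).values.tobytes()).hexdigest()

expected = fingerprint(df)   # before client.scatter(df, ...)

def guarded_top_ten(dataset, column):
    if fingerprint(dataset) != expected:
        raise RuntimeError("scattered dataframe appears to have been mutated")
    return (dataset.groupby(dataset['something'])[column]
            .mean().sort_values(ascending=False)[0:10].to_dict())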
Unfortunately Dask currently offers no additional features on top of Python to avoid mutation in this way. Dask just runs Python functions, and those Python functions can do anything they like.
Your suggestions of copying, or checking before running operations, seem sensible to me.
You might also consider raising this as a question or feature request to Pandas itself.
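For illustration, the copy-per-query route could look roughly like this, reusing the function from the question (the ~1.4 second copy cost applies on every call):

def top_ten_x_based_on_y(dataset, column):
    local = dataset.copy()   # defensive copy so the shared frame is never touched
    return (local.groupby(local['something'])[column]
            .mean().sort_values(ascending=False)[0:10].to_dict())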
A question for experienced Pandas users on how to approach working with DataFrame data.
Invariably we want to use Pandas to explore relationships among data elements. Sometimes we use groupby-type functions to get summary-level data on subsets of the data. Sometimes we use plots and charts to compare one column of data against another. I'm sure there are other applications I haven't thought of.
When I speak with other fairly novice users like myself, they generally try to extract portions of a "large" dataframe into smaller dfs that are sorted or formatted properly for the analysis or plot at hand. This approach certainly has disadvantages: if you strip a subset of data out into a smaller df and then want to run an analysis against a column you left in the bigger df, you have to go back and recut everything.
My question is: is it best practice for more experienced users to keep the large dataframe and pull out the data syntactically, so that the effect is the same as or similar to cutting out a smaller df? Or is it best to actually cut out smaller dfs to work with?
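To make the question concrete, here is the sort of contrast I mean (the frame and column names are made up):

# style A: cut out a smaller df to work with
recent = big_df[big_df['year'] >= 2020].copy()
summary_a = recent.groupby('region')['sales'].mean()

# style B: leave the big df intact and select inline each time
summary_b = (big_df.loc[big_df['year'] >= 2020]
             .groupby('region')['sales']
             .mean())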
Thanks in advance.