I'm relatively new to PySpark and I'm currently trying to implement the SVD algorithm for predicting user-item ratings. The input is a matrix with columns - user_id, item_id and rating. In the first step I initialize the biases (bu, bi) and the factor matrices (pu, qi) for each user and each item. So I start the algorithm with the following dataframe:
Initial dataframe
In the current case the number of partitions is 7 and counting all the rows takes 0.7 seconds. The number of rows is 2.5 million.
Partitions and count time
In the next step I add a column to my dataframe - error. I use a UDF function which calculates the error for each row with regard to all the other columns (I don't think the equation is relevant). Afterwards, the count action takes about the same amount of time.
Now comes the tricky part. I have to create 2 new dataframes. In the first I group together all the users (named userGroup) and in the second I group together all the items (named itemGroup). I have another UDF function implemented that updates the biases (update_b) and one that updates the factor matrices (update_factor_F). The userGroup dataframe has 1.5 million rows and the itemGroup has 72000 rows.
Updated biases and factors for each user
I then take the initial dataframe and join it firstly by user - I take the user_id, item_id and rating from the initial and the biases and factors from the userGroup dataframe. I repeat the same process with the itemGroup.
train = train.join(userGroup, train.u_id == userGroup.u_id_ug, 'outer') \
.select(train.u_id, train.i_id, train.rating, userGroup.bu, userGroup.pu)
I end up with a dataframe of the same size as the initial one. However, if I do a .count() it now takes around 8 seconds. I would have to repeat the above steps iteratively, and each iteration slows the .count() action down even further.
I know the issue lies in the join of the dataframes and have searched for solutions. So far I have tried different combinations of partitioning (I used .repartition(7, "u_id") on the userGroup dataframe) to try and match the number of partitions. I also tried repartitioning the final dataframe, but the .count() remains slow.
My goal is to not lose performance after each iteration.
As some of your dataframes can be used multiple times, you will want to cache them so that they are not re-evaluated every time you need them. To do this you can rely on the cache() or persist() operations.
Also, the logical plan of your dataframe will grow as you move forward in your iterative algorithm, which makes each iteration more expensive to compute than the last. To cope with this issue, you will need to rely on the checkpoint() operation to regularly break the lineage of your dataframes.
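Here is a minimal sketch of how the two ideas can be combined in the iteration loop. It reuses the names from the question (train, userGroup, u_id, u_id_ug, bu, pu); the bias/factor update logic is omitted, and the checkpoint directory and iteration count are just example values.

```python
# Sketch: cache + periodic checkpoint inside the iterative loop.
spark.sparkContext.setCheckpointDir("/tmp/svd_checkpoints")  # example path

train = train.cache()  # base ratings are reused on every iteration

for i in range(n_iterations):
    # ... recompute userGroup (and itemGroup) with the update UDFs here ...

    train = train.join(userGroup, train.u_id == userGroup.u_id_ug, "outer") \
                 .select(train.u_id, train.i_id, train.rating,
                         userGroup.bu, userGroup.pu)

    if i % 3 == 0:
        # checkpoint materializes the data and truncates the logical plan,
        # so the plan does not keep growing across iterations
        train = train.checkpoint()
    else:
        train = train.cache()

    train.count()  # force evaluation so the cache/checkpoint is populated
```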
Related
I have a big dataset, 10,000 or so rows, as a pandas DataFrame with columns ['Date', 'TAMETR'].
The float values under 'TAMETR' increase and decrease over time.
I wish to loop through the 'TAMETR' column and check if there are consecutive instances where values are greater than let's say 1. Ultimately I'd like to get the average duration length and the distribution of the instances.
I've played around a little with what is written here: How to count consecutive ordered values on pandas data frame
I doubt I fully understand the code, but I can't make it work. I don't understand how to tweak it to use greater than or less than (< / >).
The preferred output would be a dataframe, or array, with all the instances (greater than 1).
I can calculate the average and plot the distribution.
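In case it helps, here is a minimal sketch of one common way to collect the consecutive runs, assuming a DataFrame df with the 'Date' and 'TAMETR' columns and a threshold of 1:

```python
import pandas as pd

# Assumes df has columns 'Date' and 'TAMETR'; threshold is 1.
above = df['TAMETR'] > 1                      # True wherever the value exceeds the threshold
run_id = (above != above.shift()).cumsum()    # new id each time the condition flips
runs = (
    df[above]
    .groupby(run_id[above])
    .agg(start=('Date', 'first'), end=('Date', 'last'), length=('TAMETR', 'size'))
)

print(runs)                    # one row per consecutive instance above 1
print(runs['length'].mean())   # average duration, in rows
```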
I have a large dataset of 2M rows with columns warehouse code, part code, transport mode and time_taken. There are 10 ways of shipping a product, and I want to find the transport mode which took the least time for each product. There are 10 different warehouses and more than 2,700 unique products. I have written the function below, but it takes more than 14 hours to execute. Can anyone share a better solution to this problem?
def model_selection_df(df):
    value = pd.DataFrame()
    for WH in Whse:  # Whse: list of warehouse codes, defined elsewhere
        wh_df = df[df['WAREHOUSE_CODE'] == WH]
        unique_SKU = wh_df['VEND_PART'].unique()
        for part in unique_SKU:
            df_1 = wh_df[wh_df['VEND_PART'] == part]
            other_method = wh_df[(wh_df['VEND_PART'] == part) &
                                 (wh_df['TIME_TAKEN'] == df_1['TIME_TAKEN'].min())]
            value = value.append(other_method)
    return value
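One possible alternative, sketched below, replaces the nested loops with a single groupby over the same columns; rows tied on the minimum time are kept, matching the behaviour of the original function:

```python
import pandas as pd

# Minimum time per (warehouse, part) pair, broadcast back to every row,
# then keep only the rows that hit that minimum (ties included).
min_time = df.groupby(['WAREHOUSE_CODE', 'VEND_PART'])['TIME_TAKEN'].transform('min')
fastest = df[df['TIME_TAKEN'] == min_time]
```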
I have a dataframe with multiple scores and multiple dates. My goal is to bin each day into equal sized buckets (let's say 5 buckets) based on whatever score I choose. The problem is that some scores have an abundance of ties and therefore I need to first compute rank to introduce a tie-breaker criteria and then the qcut can be applied.
The simple solution is to create a field for the rank and then do groupby('date')['rank'].transform(pd.qcut). However, since efficiency is key, this implies doing two expensive groupbys and I was wondering if it is possible to "chain" the two operations into one sweep.
This is the closest I got; my goal is to create 5 buckets, but the qcut seems to be wrong since it is asking me to provide hundreds of labels:
df_main.groupby('date')['score'] \
    .apply(lambda x: pd.qcut(x.rank(method='first'),
                             5,
                             duplicates='drop',
                             labels=lbls))
Thanks
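A sketch of one way to do the ranking and binning in a single groupby pass, assuming df_main has 'date' and 'score' columns; using labels=False returns integer bucket codes and avoids having to supply a labels list:

```python
import pandas as pd

# Sketch: rank (to break ties) and qcut combined inside one groupby pass.
# Assumes df_main has 'date' and 'score' columns; 5 buckets per day.
def bucket(s, q=5):
    # rank first so ties get distinct positions, then cut the ranks into q equal-sized buckets
    return pd.qcut(s.rank(method='first'), q, labels=False)

df_main['bucket'] = df_main.groupby('date')['score'].transform(bucket)
```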
How does .sum() method in pandas.DataFrame physically work?
I'm calculating the proportion of each individual staff worker's salary to the total of all salaries.
The CSV has 33,000 rows.
The below function, add_proportion, goes row by row and reads each worker's salary, then divides it by salary.sum() for all rows.
Question: In each of these 33,000 cycles, does salary.sum() do its own 33,000 cycles to calculate the total over and over?
Asking because in this case the total number of cycles would be 1 billion (33,000 times 33,000), which should result in some kind of a delay. But there is no delay, the function runs instantly.
Therefore, does .sum() calculate the total during the first cycle only and then reuses the value?
Thanks.
import pandas as pd

staff = pd.read_csv('staff.csv', names=['name', 'salary'])

def add_proportion(group):
    group['proportion'] = group['salary'] / group['salary'].sum()
    return group
pandas uses numpy under the hood. In numpy, the behavior of applying operations between differently sized arrays is called broadcasting.
It depends how you are calling your add_proportion function, but the call to sum should only happen once for the whole dataframe (or once per group if you are doing a groupby(...).apply(add_proportion) for example).
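For illustration, here is a sketch of the same proportion calculation written without apply at all, which makes the single evaluation of sum explicit (column names follow the question):

```python
import pandas as pd

staff = pd.read_csv('staff.csv', names=['name', 'salary'])

total = staff['salary'].sum()                  # one pass over the 33,000 rows
staff['proportion'] = staff['salary'] / total  # vectorized division; no per-row re-summing
```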
Each sum runs in its own thread, which means that all the sums are made at the same time; they are parallelized.
The limit is your RAM, which determines the number of parallel processes you are allowed to have.
For more information, I would advise https://medium.com/#bfortuner/python-multithreading-vs-multiprocessing-73072ce5600b
I have two data frames. One contains 33,765 companies; the other contains 358,839 companies. I want to find the matches between the two using fuzzy matching. Because the number of records is too high, I am trying to break down the records of both data frames according to the 1st letter of the company name.
For example: For all the companies starting with letter "A", 1st data frame has 2600 records, and 2nd has 25000 records. I am implementing full merge between them and then applying fuzzy match to get all the companies with fuzz value more than 95.
This still does not work because the number of records is still too high to perform a full merge between them and then apply the fuzzy match. The kernel dies every time I do these operations. The same approach was working fine when the number of records in both frames was 4-digit.
Also, please suggest if there is a way to automate this for all letters 'A' to 'Z', instead of manually running the code for each letter (without making the kernel die).
Here's my code:
c = 'A'
df1 = df1[df1.companyName.str[0] == c].copy()
df2 = df2[df2.companyName.str[0] == c].copy()
df1['Join'] = 1
df2['Join'] = 1
df3 = pd.merge(df1, df2, left_on='Join', right_on='Join')
df3['Fuzz'] = df3.apply(lambda x: fuzz.ratio(x['companyName_x'], x['companyName_y']), axis=1)
df3.sort_values(['companyName_x', 'Fuzz'], ascending=False, inplace=True)
df4 = df3.groupby('companyName_x', as_index=False).first()
df5 = df4[df4.Fuzz >= 95]
You started going down the right path by chunking records based on a shared attribute (the first letter). In the record linkage literature, this concept is called blocking, and it's critical to reducing the number of comparisons to something tractable.
The way forward is to find even better blocking rules: maybe first five characters, or a whole word in common.
The dedupe library can help you find good blocking rules. (I'm a core dev for this library)
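If it helps, here is a rough sketch of automating the per-letter blocking without materializing the full cross merge, reusing fuzz.ratio and the companyName column from the question (the 95 cutoff is kept as-is); treat it as an illustration rather than a drop-in solution:

```python
import string
import pandas as pd
from fuzzywuzzy import fuzz

matches = []
for c in string.ascii_uppercase:
    # block both frames on the first letter, as in the original approach
    block1 = df1[df1.companyName.str[0] == c]
    block2 = df2[df2.companyName.str[0] == c]
    candidates = block2.companyName.tolist()
    if not candidates:
        continue
    for name in block1.companyName:
        # score against this block only, without building a merged dataframe
        best, score = max(((other, fuzz.ratio(name, other)) for other in candidates),
                          key=lambda t: t[1])
        if score >= 95:
            matches.append({'companyName_x': name, 'companyName_y': best, 'Fuzz': score})

result = pd.DataFrame(matches)
```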