LeftAnti join in pyspark is too slow - python

I am trying to do some operations in PySpark. I have a big dataframe (90 million rows, 23 columns) and another dataframe (30k rows, 1 column).
I need to remove from the first dataframe every row whose value in a certain column matches any of the values in the second dataframe.
firstdf = firstdf.join(seconddf, on = ["Field"], how = "leftanti")
The problem is that this operation is extremely slow (about 13 minutes on Databricks). Is there any way to improve its performance?
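One commonly suggested optimization for this pattern is to broadcast the small dataframe so that the anti join does not have to shuffle the 90-million-row side. A minimal sketch, assuming the dataframe and column names from the snippet above:

from pyspark.sql.functions import broadcast

# Deduplicate the 30k keys and ship them to every executor as a broadcast
# table, so the left anti join can be evaluated without shuffling firstdf.
keys = seconddf.select("Field").dropDuplicates(["Field"])
firstdf = firstdf.join(broadcast(keys), on=["Field"], how="leftanti")

Spark may already broadcast a table this small on its own (depending on spark.sql.autoBroadcastJoinThreshold), in which case the time may be dominated by scanning the large dataframe rather than by the join itself.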

Related

Groupby, lag and sum in python: Memory and Time Efficient way

I have a 15 million row dataset with 30 columns.
One column is ID, another is YearMonth, and the remaining 28 columns are a mix of int and float variables.
I'm only showing 5 columns for illustration purposes.
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': np.repeat(range(1, 150001), 100),
                   'YearMonth': [i for i in range(1, 101)] * 150000,
                   'Var1': np.random.randint(0, 5, 15000000),
                   'Var2': np.random.randint(0, 5, 15000000),
                   'Var3': np.random.randint(0, 5, 15000000)})
Assume that the data is already sorted by ID and YearMonth.
Now, for each column (Var1, Var2, Var3 in this case), I want to group by the ID column and take the sum over multiple lagged values. Because nested for loops were inefficient, I wrote the following code using NumPy and vectorized operations:
df = df.assign(**{
    '{}_total_lag_in_prev_{}_{}'.format(col, int(k[0]), int(k[1])):
        np.concatenate([
            df.groupby(['ID'])[col].shift(i).fillna(0).to_numpy().reshape(-1, 1)
            for i in range(int(k[0]), int(k[1]) + 1)
        ], axis=1).sum(axis=1).reshape(-1, 1)
    for col in ['Var1', 'Var2', 'Var3']
    for k in [(1, 2), (3, 4)]
})
This works perfectly. But what if I want to improve and optimise the code?
How can I decrease the execution time and improve the memory utilisation?
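For comparison, the same columns can be built with an explicit loop that materializes each group-wise shift as a pandas Series and sums those, instead of concatenating NumPy blocks. This is only a sketch under the same column and window assumptions as above; whether it actually wins on time or memory would have to be measured on the real data.

# Same result as the assign/concatenate version, continuing from the df built above.
grouped = df.groupby('ID')
new_cols = {}
for col in ['Var1', 'Var2', 'Var3']:
    for k0, k1 in [(1, 2), (3, 4)]:
        name = '{}_total_lag_in_prev_{}_{}'.format(col, k0, k1)
        # Sum of shift(k0) .. shift(k1) within each ID group, with NaNs treated as 0.
        new_cols[name] = sum(grouped[col].shift(i).fillna(0) for i in range(k0, k1 + 1))
df = df.assign(**new_cols)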

Fastest way to iterate over 70 million rows in pandas dataframe

I have two pandas dataframes, bookmarks and ratings, whose columns are respectively:
id_profile, id_asset, time_watched
id_profile, id_asset, score
I would like to find the score for each (profile, asset) couple in the ratings dataframe (set to 0 if it does not exist). The problem is that the bookmarks dataframe has 73 million rows, so it takes a very long time (after 15 minutes the code is still running). I suppose there is a better way to do it.
Here is my code:
def find_rating(val):
    res = ratings.loc[(ratings['id_profile'] == val[0]) & (ratings['id_asset'] == val[1])]
    if res.empty:
        return 0
    return res['score'].values[0]

arr = bookmarks[['id_profile', 'id_asset']].values
rates = [find_rating(i) for i in arr]
I work on Colab.
Do you think I can improve the execution speed?
Just some thoughts! I have not tried such large data in pandas.
In pandas, the data is indexed on rows as well as columns. So if you have 1 million rows with 5 columns, you have 5 million indexed values.
For a performance boost:
Check whether you can use Sparse* data structures - https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html
Filter out as much unnecessary data as feasible.
If you can, try to stick to using only NumPy. You lose a few features, but you also lose some drag. Worth exploring.
Use a distributed multithreading/multiprocessing tool like Dask or Ray. With 4 cores you can run 4 parallel jobs, which in the ideal case cuts the runtime to roughly a quarter.
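Beyond those general points, the per-row lookup in the question can usually be replaced by a single vectorized left merge. A hedged sketch, using the id_profile/id_asset/score names from the code above and assuming each (id_profile, id_asset) pair appears at most once in ratings (otherwise drop duplicates first):

# Left-merge the ratings onto the bookmarks on the (profile, asset) pair,
# then fill missing scores with 0 for pairs that have no rating.
merged = bookmarks.merge(
    ratings[['id_profile', 'id_asset', 'score']],
    on=['id_profile', 'id_asset'],
    how='left',
)
rates = merged['score'].fillna(0).to_numpy()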

What is the best way to loop over a large dataset in Python?

I am currently using the code below to loop over a dataset of around 20K records. I created a generator and used it in the for loop, and it took around 10 minutes to complete. Is there a more efficient way to loop over large datasets in Python? Essentially, for each unique value in the 'number' column of the dataframe (df_ir), I want to identify whether there are duplicate values in certain columns and, if there are, store the total number of such duplicates for each column in the dictionary d_cnt.
df_ir is a pandas dataframe with 120k records
df_ir['number'].unique() = 20K records
lst_tk = ['caller_id', 'opened_by', 'made_sla']
d_cnt = {}
for col in lst_tk:
    d_cnt[col] = 0

gen_inc = (i for i in df_ir['number'].unique())
for incnum in gen_inc:
    for col in lst_tk:
        if df_ir[df_ir['number'] == incnum][col].value_counts().count() > 1:
            d_cnt[col] += 1
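A vectorized alternative sketch, assuming the loop above is meant to count, per tracked column, how many 'number' groups contain more than one distinct value (using the df_ir and lst_tk names from the question; not benchmarked on the real data):

# Count distinct values of each tracked column within every 'number' group,
# then count the groups where that number of distinct values exceeds 1.
nunique_per_group = df_ir.groupby('number')[lst_tk].nunique()
d_cnt = (nunique_per_group > 1).sum().to_dict()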

What's the fastest way to check and drop rows between two asymmetrical dataframes?

I have two dataframes. Dataframe A (named data2_) has 2.5 million rows and 15 columns, and Dataframe B (named data) has 250 rows and 4 columns. Both have a matching column: IDENTITY.
I want to reduce Dataframe A to just the rows whose IDENTITY value matches one in Dataframe B.
I tried this, but it takes far too long to compute (tqdm estimates a year):
from tqdm import tqdm

for i in tqdm(list(range(data2_.shape[0]))):
    for t in list(range(data.shape[0])):
        if data2_["IDENTITY"].iloc[i] != data["IDENTITY"].iloc[t]:
            data2_.drop(index=i)
        else:
            pass
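The row-by-row drop can be replaced with a single boolean mask. A minimal sketch using isin, assuming the goal is to keep only the Dataframe A rows whose IDENTITY appears in Dataframe B:

# Keep only rows of data2_ whose IDENTITY value appears in data's IDENTITY column.
data2_filtered = data2_[data2_['IDENTITY'].isin(data['IDENTITY'])]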

Use Groupby to construct a dataframe with value counts of other column

I have a dataframe with two column features: startneighborhood and hour
hour can take any value from 1-24, i.e., [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24]
startneighborhood can be 37 different neighborhood options.
I want to count, for every neighborhood, the number of rows falling in each hour, and use "hour" as the index.
So my matrix would be 24 rows x 37 columns, with the hours 1-24 as my index and the 37 neighborhoods as the column names.
How can I use pandas to perform this computation? I'm a bit lost on the fastest way.
I've constructed the dataframe, with the hours as the index and the neighborhood names as the column names. I now just need to add the values.
I'm a little bit confused by the question, but I think what you want to do is a crosstab:
import pandas as pd

df = <...>  # construct your dataframe
table = pd.crosstab(index=df.hour, columns=df.startneighborhood)
This will give you a 24x37 table where each element is the count of occurrences of that combination of hour and startneighborhood.
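Since the title asks for a groupby route, an equivalent sketch that produces the same table with groupby/size/unstack (assuming the hour and startneighborhood column names above):

# Count rows per (hour, startneighborhood) pair, then pivot neighborhoods into
# columns; missing combinations become 0.
table = (
    df.groupby(['hour', 'startneighborhood'])
      .size()
      .unstack(fill_value=0)
      .reindex(range(1, 25), fill_value=0)  # ensure all 24 hours appear as rows
)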
