Groupby, lag and sum in Python: memory and time efficient way

I have a 15 million row dataset with 30 columns.
One column is ID, another is YearMonth, and the remaining 28 columns are a mix of int and float variables.
I'm only showing 5 columns for illustration purposes.
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': np.repeat(range(1, 150001), 100),
                   'YearMonth': [i for i in range(1, 101)] * 150000,
                   'Var1': np.random.randint(0, 5, 15000000),
                   'Var2': np.random.randint(0, 5, 15000000),
                   'Var3': np.random.randint(0, 5, 15000000)})
Assume the data is already sorted by ID and YearMonth.
Now, for each column (Var1, Var2, Var3 in this case), I want to group by ID and take the sum of several lagged values. Because nested for loops were too slow, I wrote the following code using NumPy and vectorised operations:
df = df.assign(**{
    '{}_total_lag_in_prev_{}_{}'.format(col, int(k[0]), int(k[1])):
        np.concatenate([
            df.groupby(['ID'])[col].shift(i).fillna(0).to_numpy().reshape(-1, 1)
            for i in range(int(k[0]), int(k[1]) + 1)
        ], axis=1).sum(axis=1).reshape(-1, 1)
    for col in ['Var1', 'Var2', 'Var3']
    for k in [(1, 2), (3, 4)]
})
This works correctly, but I would like to optimise it further.
How can I decrease the execution time and improve memory utilisation?
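One possible direction (a sketch, not from the original post): within each ID, the sum of lags a through b equals the group cumulative sum shifted by a minus the same cumulative sum shifted by b + 1, so you never need to materialise one shifted copy per lag:

g = df.groupby('ID', sort=False)
new_cols = {}
for col in ['Var1', 'Var2', 'Var3']:
    csum = g[col].cumsum()              # running total within each ID
    csum_by_id = csum.groupby(df['ID'])
    for a, b in [(1, 2), (3, 4)]:
        name = '{}_total_lag_in_prev_{}_{}'.format(col, a, b)
        # sum of lags a..b == cumsum shifted by a minus cumsum shifted by b+1
        new_cols[name] = csum_by_id.shift(a).fillna(0) - csum_by_id.shift(b + 1).fillna(0)
df = df.assign(**new_cols)

For short lag ranges the gain is modest, but for wider ranges this replaces b - a + 1 shifted columns (and their concatenation) with two shifts and a subtraction.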

Related

How to create a new df with pandas based on a set of conditions that compares each row by the others within the same df?

I am attempting to use pandas to create a new df based on a set of conditions that compare the rows of the original df against one another. I am new to pandas and am comfortable comparing two dfs and doing basic column comparisons, but for some reason the row-by-row comparison is stumping me. My specific conditions and problem are found below:
Cosine_i_ start_time fid_ Shape_Area
0 0.820108 2022-08-31T10:48:34Z emit20220831t104834_o24307_s000 0.067763
1 0.962301 2022-08-27T12:25:06Z emit20220827t122506_o23908_s000 0.067763
2 0.811369 2022-08-19T15:39:39Z emit20220819t153939_o23110_s000 0.404882
3 0.788322 2023-01-29T13:23:39Z emit20230129t132339_o02909_s000 0.404882
4 0.811369 2022-08-19T15:39:39Z emit20220819t153939_o23110_s000 0.108256
^^Above is my original df that I will be working with.
Goal: I am hoping to create a new df that contains only the FIDs that meet the following conditions: the shape areas are equal, the cosi values differ by more than 0.1, and the start times differ by more than 5 days. This will be applied to a large dataset; the df displayed is just a small sample I made to help write the code.
For example: rows 2 and 3 have the same shape area, their cosi values differ by more than 0.1, and their start times differ by more than 5 days. They meet all the conditions, so I would like to take the FID values of BOTH rows and append them to a new df.
So essentially I want to compare every row with the other rows and that's where I am having trouble.
I am looking for as much guidance as possible on how to set this up as I am very very new to coding and am hoping to get a tutorial of some sort!
Thanks in advance.
Group by Shape_Area and filter each pair by the required conditions (groups with only a single Shape_Area entry are dropped):
# start_time must be a datetime dtype for .dt.days to work
df['start_time'] = pd.to_datetime(df['start_time'])

fids = df.groupby('Shape_Area').filter(
    lambda x: x.index.size > 1
    and x['Cosine_i_'].diff().abs().values[-1] > 0.1
    and x['start_time'].diff().abs().dt.days.values[-1] > 5,
    dropna=True)['fid_'].tolist()
print(fids)

How to efficiently compute a complex function in spark dataframe which dynamically references multiple rows when UDFs are not possible?

I have a dataframe with cols [a, b, c, t], where a, b, c are floats that represent some user data and t is the time step.
I need to compute a fifth column, result, which is a function of the other columns. The problem is that it also needs to reference values from previous rows (which correspond to previous time steps). Column t is strictly increasing.
Essentially it is a complicated expanding window that generates a new value for each row.
In the formula for result, k represents the row number, where 0 is the top row. Since we use values from columns in previous rows, I do not think we can use UDFs. Is there any other convenient / efficient way to compute these values using pyspark dataframes?
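If the per-row value can be expressed with built-in aggregate functions over earlier rows, an expanding window avoids UDFs entirely. A minimal sketch, assuming a placeholder running sum in place of the real (unspecified) function and no partition key:

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    [(0.1, 0.2, 0.3, 1), (0.4, 0.5, 0.6, 2), (0.7, 0.8, 0.9, 3)],
    ['a', 'b', 'c', 't'],
)

# Expanding window: all rows up to and including the current time step t.
w = Window.orderBy('t').rowsBetween(Window.unboundedPreceding, Window.currentRow)

# Illustrative placeholder for the real function: a running sum of a*b + c.
sdf = sdf.withColumn('result', F.sum(F.col('a') * F.col('b') + F.col('c')).over(w))
sdf.show()

If the recurrence cannot be written with built-in aggregates, grouped pandas UDFs (applyInPandas) are the usual fallback, at the cost of moving each group to a single worker.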

LeftAnti join in pyspark is too slow

I am trying to do some operations in pyspark. I have a big dataframe (90 million rows, 23 columns) and another dataframe (30k rows, 1 column).
I have to remove from the first dataframe all rows where the value of a certain column matches any of the values in the second dataframe.
firstdf = firstdf.join(seconddf, on = ["Field"], how = "leftanti")
The problem is that this operation is extremely slow (about 13 mins on databricks). Is there any way to improve the performance of this operation?
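One commonly suggested change (a sketch, not from the original post): since the second dataframe is tiny, broadcasting it lets Spark skip shuffling the 90-million-row side, which is usually where the time goes in a left-anti join:

from pyspark.sql import functions as F

# Broadcast the 30k-row table so each executor filters its partitions locally.
firstdf = firstdf.join(F.broadcast(seconddf), on=["Field"], how="leftanti")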

What is the best way to loop over a large dataset in Python?

I am currently using the code below to loop over a dataset of around 20K records. I created a generator and used it in the for loop, and it took around 10 minutes to complete. Is there a more efficient way to loop over large datasets in Python? Essentially, I am trying to identify whether there are duplicate values in certain columns for each unique value in the 'number' column of the dataframe (df_ir) and, if so, to store the total number of such cases for each column in the dictionary d_cnt.
df_ir is a pandas dataframe with 120k records
df_ir['number'].unique() = 20K records
lst_tk = ['caller_id', 'opened_by', 'made_sla']

d_cnt = {}
for col in lst_tk:
    d_cnt[col] = 0

gen_inc = (i for i in df_ir['number'].unique())
for incnum in gen_inc:
    for col in lst_tk:
        if df_ir[df_ir['number'] == incnum][col].value_counts().count() > 1:
            d_cnt[col] += 1
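A vectorised alternative (a sketch, not from the original post): a single groupby computes, for every 'number', how many distinct values each column of interest holds, which replaces the 20K boolean-mask filters:

# Count distinct values per 'number' group for the columns of interest,
# then count how many groups have more than one distinct value per column.
nunique = df_ir.groupby('number')[lst_tk].nunique()
d_cnt = (nunique > 1).sum().to_dict()

This scans the 120k-row frame once instead of once per unique 'number'.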

How to get the most and second most frequent value in a row?

Let's say I have a dataframe with 1 million rows and 30 columns.
I want to add a column whose value is "the most frequent value of the previous 30 columns", and another with the "second most frequent value of the previous 30 columns".
I know that you can do df.mode(axis=1) for "the most frequent value of the previous 30 columns", but it is so slow.
Is there anyway to vectorize this so it could be fast?
df.mode(axis=1) is already vectorized. However, you may want to consider how it works. It needs to operate on each row independently, which means it would benefit from "row-major order", called C order in NumPy. A Pandas DataFrame stores its data column-major, so gathering the 30 values needed to compute the mode of one row can touch 30 widely separated regions of memory, which is not efficient.
So, try loading your data into a plain NumPy 2D array and see if that helps speed things up. It should.
I tried this on my 1.5 GHz laptop:
import numpy as np
import pandas as pd
import scipy.stats

x = np.random.randint(0, 5, (10000, 30))
df = pd.DataFrame(x)
%timeit df.mode(axis=1)
%timeit scipy.stats.mode(x, axis=1)
The DataFrame way takes 6 seconds (!), whereas the SciPy (row-major) way takes 16 milliseconds for 10k rows. Even SciPy in column-major order is not much slower, which makes me think the Pandas version is less efficient than it could be.
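The above only covers the most frequent value. For the second most frequent one, a sketch that assumes the 30 columns hold small non-negative integers (as in the example data) is to count each candidate value per row and rank the counts:

# counts[i, v] = how many of row i's 30 entries equal v
counts = np.stack([(x == v).sum(axis=1) for v in range(x.max() + 1)], axis=1)
order = np.argsort(-counts, axis=1)   # candidate values sorted by descending frequency
most_frequent = order[:, 0]
second_most_frequent = order[:, 1]

Ties are broken by the smaller value first; for arbitrary (non-integer) values, a per-row np.unique with return_counts=True would be needed instead.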
