Using apply() rather than a for loop - pandas - Python

I am extracting maximum rainfall intensity for different durations using data with 5-minute rainfall totals. The code produces a list of maximum rainfall intensities, one for each duration (DURS). The code works but is slow on data sets with 1,000,000+ rows. I am new to Pandas and I understand the apply() method is much faster than using a for loop, but I do not know how to rewrite a for loop using the apply() method.
Example of dataframe:
                     Value[mm]  State of value
Date_Time
2020-01-01 00:00:00        1.0               5
2020-01-01 00:05:00        0.5               5
2020-01-01 00:10:00        4.0               5
2020-01-01 00:15:00        2.0               5
2020-01-01 00:20:00        2.0               5
2020-01-01 00:25:00        0.5               5
Example of Code:
import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter
import math, numpy, array, glob
import pandas as pd
import numpy as np
pluvi_file = "rain.csv"
DURS = [5,6,10,15,20,25,30,45,60,90,120,180,270,360,540,720,1440,2880,4320]
df = pd.read_csv(pluvi_file, delimiter=',',parse_dates=[['Date','Time']])
df['Date_Time'] = pd.to_datetime(df['Date_Time'], dayfirst=True)
df.index = df['Date_Time']
del df['Date_Time']
lista = []
for DUR in DURS:
    x = str(DUR) + ' Min'
    df1 = df.groupby(pd.Grouper(freq=x)).sum()
    a = df1['Value[mm]'].max() / DUR * 60
    print(a)
    lista.append(a)
Output (Max rainfall intensity for each duration in mm/hr):
5 66.0
6 60.0
10 54.0
15 40.0
20 40.5
25 30.0
30 34.0
45 26.666666666666664
60 26.5
90 20.666666666666668
120 23.0
180 12.166666666666666
270 8.11111111111111
360 9.416666666666666
540 6.444444444444445
720 4.708333333333333
1440 3.8958333333333335
2880 2.7708333333333335
4320 2.1597222222222223
How would I re-write this using the apply() method?

Solution
It looks like apply() doesn't suit here, since the functions you are applying to the groups are already vectorised methods from pandas' Essential Basic Functionality. Removing the for loop also doesn't look like a promising route for performance optimisation: there are not many durations in your DURS list, so the main cost is the grouping operation and the calculations on the groups, and in my opinion there is not much room for optimisation there.
Create artificial data
import pandas as pd
df = pd.DataFrame({'Date_Time': ["2020-01-01 00:00:00",
                                 "2020-01-01 00:05:00",
                                 "2020-01-01 00:10:00",
                                 "2020-01-01 00:15:00",
                                 "2020-01-01 00:20:00",
                                 "2020-01-01 00:25:00"],
                   'Value[mm]': [1.0, 0.5, 4.0, 2.0, 2.0, 0.5],
                   'State of value': [5, 5, 5, 5, 5, 5]})
df = df.sample(3900875, replace=True).reset_index(drop=True)
Now, let's set Date_Time as the index and keep just the series we need for the calculation:
df['Date_Time'] = pd.to_datetime(df['Date_Time'], dayfirst=True)
df = df.set_index('Date_Time', drop = True)
df = df['Value[mm]']
Compare the performance of different approaches
Grouping and looping
%%timeit
lista = []
for DUR in DURS:
    x = str(DUR) + ' Min'
    df1 = df.groupby(pd.Grouper(freq=x)).sum()
    a = df1.max() / DUR * 60
    lista.append(a)
19.6 s ± 439 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Resampling
The time boost here is probably noise, since it looks like the same thing is happening under the hood.
%%timeit
def get_max_by_dur(DUR):
    return df.resample(str(DUR) + "Min").sum().max()

l_a = [get_max_by_dur(dur) / dur * 60 for dur in DURS]
17.2 s ± 559 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Resampling + Dask
Even though there is no way to properly vectorise this, you can still get some parallelisation and optimisation with Dask.
!python3 -m pip install "dask[dataframe]" --upgrade
import dask.dataframe as dd
%%timeit
dd_df = dd.from_pandas(df, npartitions=1)

def get_max_by_dur(DUR):
    return dd_df.resample(str(DUR) + "Min").sum().max()

l_a = [(get_max_by_dur(dur) / dur * 60).compute() for dur in DURS]
2.21 s ± 110 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Few words on apply and optimization
Usually you use apply to apply a function along an axis of a DataFrame. It is the substitute for looping through rows or columns of the DataFrame yourself, but in reality apply is just a glorified loop with some extra functionality. So when performance matters, you usually want to optimise your code in roughly this order:
1. Vectorization or Essential Basic Functionality (as you did)
2. Cython routines or numba
3. List comprehension
4. apply method
5. Iteration
Illustration
Let's say you want to get the product of two columns.
1). Vectorization or basic methods.
Basic methods:
df["product"] = df.prod(axis=1)
162 ms ± 9.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Vectorization:
import numpy as np
def multiply(Value, State):  # you may use a lambda here as well
    return Value * State

%timeit df["new_column"] = np.vectorize(multiply)(df["Value[mm]"], df["State of value"])
853 ms ± 42.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2). Cython or numba
It can be very useful if you have already written some looping: you can often just decorate the function with @numba.jit and get a significant performance boost. It is also very helpful when you want to compute some iterative value that is difficult to vectorise.
Since the function we chose here is a simple multiplication, you will not see any benefit compared to the usual apply.
%%cython
cpdef double cython_multiply(double Value, double State):
    return Value * State

%timeit df["new_column"] = df.apply(lambda row: cython_multiply(row["Value[mm]"], row["State of value"]), axis=1)
1min 38s ± 4 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
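For comparison, here is a minimal numba sketch of the same multiplication (assuming numba is installed; numba_multiply is an illustrative name, not part of the original benchmark). Because the decorated loop runs as compiled code over the underlying arrays, it avoids the per-row overhead of apply:
import numba
import numpy as np

@numba.njit
def numba_multiply(values, states):
    # element-wise product computed inside a compiled loop
    out = np.empty(len(values))
    for i in range(len(values)):
        out[i] = values[i] * states[i]
    return out

df["new_column"] = numba_multiply(df["Value[mm]"].to_numpy(),
                                  df["State of value"].to_numpy(dtype=np.float64))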
3). List comprehension.
It's pythonic and also quite similar to a for loop.
%timeit df["new_column"] = [x*y for x, y in zip(df["Value[mm]"], df["State of value"])]
1.56 s ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4). Apply method
Notice how slow it is.
%timeit df["new_column"] = df.apply(lambda row:row["Value[mm]"]*row["State of value"], axis = 1)
1min 37s ± 4.76 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
5). Looping thru rows
itertuples:
%%timeit
list_a = []
for row in df.itertuples():
    list_a.append(row[2] * row[3])
df['product'] = list_a
9.81 s ± 831 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
iterrows (you probably shouldn't use that):
%%timeit
list_a = []
for row in df.iterrows():
    list_a.append(row[1][1] * row[1][2])
df['product'] = list_a
6min 40s ± 1min 8s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Related

Vectorize eval Fast - Python Numpy Pandas

I need a way to vectorize the eval statement that loops through the reference dataframe (df_ref), whose strings iloc into the source dataframe (df).
Here's a reprex of the problem:
import pandas as pd
dataValues = ['a','b','c']
df = pd.DataFrame({'values': dataValues})
df_list = ['df.iloc[0,0]','df.iloc[2,0]']
df_ref = pd.DataFrame({'ref':df_list})
# looping 10000 times just to replicate the amount of times this operation
# may run in a typical scenario
def help():
    for i in range(10000):
        for index, row in df_ref.iterrows():
            eval(df_ref.ref[index])

%timeit help()
Outputs:
2.84 s ± 366 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Before you respond with an alternative: my problem will always have some dynamic referencing that I have to replicate in Python, so a more straightforward route may not solve my particular problem.
Thanks for the help!
As commented, extract the indexes first, then slice into the whole thing:
def help1():
    indexes = df_ref['ref'].str.extract(r'iloc\[(\d+),\s*(\d+)\]').astype(int)
    return df.to_numpy()[indexes[0], indexes[1]]
and
help1()
> array(['a', 'c'], dtype=object)
%timeit -n 1000 help1()
> 1.11 ms ± 89.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
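If you also want the resolved values stored alongside the reference strings, you can assign the result straight back (a small usage sketch, not part of the original answer; the column name 'resolved' is illustrative):
df_ref['resolved'] = help1()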

pandas frequency of a specific value per group

Suppose I have data for 50K shoppers and the products they bought. I want to count the number of times each user purchased product "a". value_counts seems to be the fastest way to calculate these types of numbers for a grouped pandas data frame. However, I was surprised at how much slower it was to calculate the purchase frequency for just one specific product (e.g., "a") using agg or apply. I could select a specific column from a data frame created using value_counts but that could be rather inefficient on very large data sets with lots of products.
Below a simulated example where each customer purchases 10 times from a set of three products. At this size you already notice speed differences between apply and agg compared to value_counts. Is there a better/faster way to extract information like this from a grouped pandas data frame?
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "col1": [f'c{j}' for i in range(10) for j in range(50000)],
    "col2": np.random.choice(["a", "b", "c"], size=500000, replace=True)
})
dfg = df.groupby("col1")
# value_counts is fast
dfg["col2"].value_counts().unstack()
# apply and agg are (much) slower
dfg["col2"].apply(lambda x: (x == "a").sum())
dfg["col2"].agg(lambda x: (x == "a").sum())
# much faster to do
dfg["col2"].value_counts().unstack()["a"]
EDIT:
Two great responses to this question. Given the starting point of an already grouped data frame, it seems there may not be a better/faster way to count the number of occurrences of a single level in a categorical variable than using (1) apply or agg with a lambda function or (2) using value_counts to get the counts for all levels and then selecting the one you need.
The groupby/size approach is an excellent alternative to value_counts. With a minor edit to Cainã Max Couto-Silva's answer, this would give:
dfg = df.groupby(['col1', 'col2'])
dfg.size().unstack(fill_value=0)["a"]
I assume there would be a trade-off at some point: with many levels, apply/agg or value_counts on an already grouped data frame may become faster than the groupby/size approach, which requires creating a newly grouped data frame. I'll post back when I have some time to look into that.
Thanks for the comments and answers!
This is still faster:
dfg = df.groupby(['col1','col2'])
dfg.size().unstack()
Tests:
%%timeit
pd.crosstab(df.col1, df.col2)
# > 712 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
dfg = df.groupby("col1")
dfg["col2"].value_counts().unstack()
# > 165 ms ± 12.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
dfg = df.groupby(['col1','col2'])
dfg.size().unstack()
# > 131 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
If we expand the dataframe to 5 million rows:
df = pd.concat([df for _ in range(10)])
print(f'df.shape = {df.shape}')
# > df.shape = (5000000, 2)
print(f'{df.shape[0]:,} rows.')
# > 5,000,000 rows.
%%timeit
pd.crosstab(df.col1, df.col2)
# > 1.58 s ± 33.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
dfg = df.groupby("col1")
dfg["col2"].value_counts().unstack()
# > 1.27 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
dfg = df.groupby(['col1','col2'])
dfg.size().unstack()
# > 847 ms ± 53.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Filter before value_counts
df.loc[df.col2=='a','col1'].value_counts()['c0']
Also, I think crosstab is 'faster' than groupby + value_counts:
pd.crosstab(df.col1, df.col2)
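If you do go the crosstab route, pulling out the counts for one specific product afterwards is just a column selection (a small usage sketch, not from the original answer):
pd.crosstab(df.col1, df.col2)["a"]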

Optimization of the given operation, is there a better way?

I am a newbie and I need some insight. Say I have a pandas dataframe as follows:
temp = pd.DataFrame()
temp['A'] = np.random.rand(100)
temp['B'] = np.random.rand(100)
temp['C'] = np.random.rand(100)
I need to write a function where I replace every value in column "C" with 0's if the value of "A" is bigger than 0.5 in the corresponding row. Otherwise I need to multiply A and B in the same row element-wise and write down the output at the corresponding row on column "C".
What I did so far, is:
A=temp.loc[temp['A']<0.5, 'A'].values
B=temp.loc[temp['A']<0.5, 'B'].values
temp['C'] = 0
temp.loc[temp['A']<0.5,'C']=A*B
It works just as I want it to, HOWEVER I am not sure if there is a faster way to implement this. I am especially skeptical about the slicing; it feels redundant to use that many slices, though I couldn't find any other solution since I have to write 0's into C for the rows where A is bigger than 0.5.
Or, is there a way to slice only the part that is needed, perform the calculations, and then somehow remember the indices so you can put the required values back into the original data frame on the corresponding rows?
One way using numpy.where:
temp["C"] = np.where(temp["A"]<0.5, temp["A"] * temp["B"], 0)
Benchmark (about 4x faster on the sample, and the gap keeps increasing with size):
# With given sample of 100 rows
%%timeit
A=temp.loc[temp['A']<0.5, 'A'].values
B=temp.loc[temp['A']<0.5, 'B'].values
temp['C'] = 0
temp.loc[temp['A']<0.5,'C']=A*B
# 819 µs ± 2.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.where(temp["A"]<0.5, temp["A"] * temp["B"], 0)
# 174 µs ± 455 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Benchmark on larger data (about 7x faster)
temp = pd.DataFrame()
temp['A'] = np.random.rand(1000000)
temp['B'] = np.random.rand(1000000)
temp['C'] = np.random.rand(1000000)
%%timeit
A=temp.loc[temp['A']<0.5, 'A'].values
B=temp.loc[temp['A']<0.5, 'B'].values
temp['C'] = 0
temp.loc[temp['A']<0.5,'C']=A*B
# 35.2 ms ± 345 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.where(temp["A"]<0.5, temp["A"] * temp["B"], 0)
# 5.16 ms ± 188 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Validation
A=temp.loc[temp['A']<0.5, 'A'].values
B=temp.loc[temp['A']<0.5, 'B'].values
temp['C'] = 0
temp.loc[temp['A']<0.5,'C']=A*B
np.array_equal(temp["C"], np.where(temp["A"]<0.5, temp["A"] * temp["B"], 0))
# True
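A hedged pandas-only alternative (not part of the original answer) expresses the same logic with Series.where, keeping the product where A < 0.5 and writing 0 elsewhere; whether it beats np.where will depend on your data:
temp["C"] = (temp["A"] * temp["B"]).where(temp["A"] < 0.5, 0)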

Applying group-specific function that returns a single series

I'm trying to figure out an efficient split/apply/combine scheme for the following scenario. Consider the pandas dataframe demoAll defined below:
import datetime
import pandas as pd
demoA = pd.DataFrame({'date': [datetime.date(2010,1,1), datetime.date(2010,1,2), datetime.date(2010,1,3)],
                      'ticker': ['A', 'A', 'A'],
                      'x1': [10, 20, 30],
                      'close': [120, 133, 129]}).set_index('date', drop=True)
demoB = pd.DataFrame({'date': [datetime.date(2010,1,1), datetime.date(2010,1,2), datetime.date(2010,1,3)],
                      'ticker': ['B', 'B', 'B'],
                      'x1': [18, 11, 45],
                      'close': [50, 49, 51]}).set_index('date', drop=True)
demoAll = pd.concat([demoA, demoB])
print(demoAll)
The result is:
           ticker  x1  close
date
2010-01-01      A  10    120
2010-01-02      A  20    133
2010-01-03      A  30    129
2010-01-01      B  18     50
2010-01-02      B  11     49
2010-01-03      B  45     51
I also have a dictionary mapping of tickers to model objects
ticker2model = {'A':model_A, 'B':model_B,...}
where each model has a self.predict(df) method that takes in an entire dataframe and returns a series of the same length.
I now would like to create a new column, demoAll['predictions'], that corresponds to these predictions. What is the cleanest/most-efficient way of doing this? A few things to note:
demoAll was the concatenation of ticker-specific dataframes that were each indexed just by date. Thus the indices of demoAll are not unique. (However, the combination of date/ticker IS unique.)
My thinking has been to do something like the example below, but I am running into issues with indexing, data-type coercions, and slow run times. The real dataset is quite large (in both rows and columns).
demoAll['predictions'] = demoAll.groupby('ticker').apply(
    lambda x: ticker2model[x.name].predict(x)
)
I may have misunderstood what you are passing to the model in order to predict, but if I have understood correctly I would do the following:
pre-allocate the predictions column of demoAll
loop through the unique values of ticker and filter demoAll
drop the ticker column
predict the result using the filtered df
save the results in the correct locations in demoAll['predictions']
An Example using your code:
# get non 'ticker' columns
non_ticker_cols = [col for col in demoAll.columns if col != 'ticker']
# get unique set of tickers
tickers = demoAll.ticker.unique()
# create and prepopulate the predictions column
demoAll['predictions'] = 0
for ticker in tickers:
    # get boolean Series to filter the DataFrames by
    filter_by_ticker = demoAll.ticker == ticker
    # filter, predict and allocate
    demoAll.loc[filter_by_ticker, 'predictions'] = ticker2model[ticker].predict(
        demoAll.loc[filter_by_ticker, non_ticker_cols]
    )
The output would look like:
           ticker  x1  close  predictions
date
2010-01-01      A  10    120         10.0
2010-01-02      A  20    133         10.0
2010-01-03      A  30    129         10.0
2010-01-01      B  18     50        100.0
2010-01-02      B  11     49        100.0
2010-01-03      B  45     51        100.0
Comparison to using apply
We could use apply per row but, as you mentioned, it would be slow. I will compare the two to give an idea of the speedup.
Setup
I will use DummyRegressor from sklearn so that I can call a predict method, and create the dictionary you mention in your question.
import numpy as np
from sklearn.dummy import DummyRegressor

model_a = DummyRegressor(strategy='mean')
model_b = DummyRegressor(strategy='median')
model_a.fit([[10, 14]], y=np.array([10]))
model_b.fit([[200, 200]], [100])
ticker2model = {'A': model_a, 'B': model_b}
Defining both as functions
def predict_by_ticker_filter(df, model_dict):
    # get non 'ticker' columns
    non_ticker_cols = [col for col in df.columns if col != 'ticker']
    # get unique set of tickers
    tickers = df.ticker.unique()
    # create and prepopulate the predictions column
    df['predictions'] = 0
    for ticker in tickers:
        # get boolean Series to filter the DataFrames by
        filter_by_ticker = df.ticker == ticker
        # filter, predict and allocate
        df.loc[filter_by_ticker, 'predictions'] = model_dict[ticker].predict(
            df.loc[filter_by_ticker, non_ticker_cols]
        )
    return df

def model_apply_by_row(df_row, model_dict):
    # includes some conversions to list to allow the predict method to run
    return model_dict[df_row['ticker']].predict([df_row[['x1', 'close']].tolist()])[0]
Performance
I use timeit on the function call to give the following results
On your example demoAll:
model_apply_by_row
%timeit demoAll.apply(model_apply_by_row,model_dict=ticker2model, axis=1)
3.78 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
predict_by_ticker_filter
%timeit predict_by_ticker_filter(demoAll, ticker2model)
6.24 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Increasing the size of demoAll to (606, 3):
model_apply_by_row
%timeit demoAll.apply(model_apply_by_row,model_dict=ticker2model, axis=1)
320 ms ± 28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
predict_by_ticker_filter
%timeit predict_by_ticker_filter(demoAll, ticker2model)
6.1 ms ± 512 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Increasing the size of demoAll to (6006, 3):
model_apply_by_row
%timeit demoAll.apply(model_apply_by_row,model_dict=ticker2model, axis=1)
3.15 s ± 371 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
predict_by_ticker_filter
%timeit predict_by_ticker_filter(demoAll, ticker2model)
9.1 ms ± 767 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Speed up Python .loc function search

I am pulling out a value from a table, searching for the value based on matches in other columns. Right now, because there are hundreds of thousands of grid cells to go through, each call of the function takes a few seconds, but it adds up to hours. Is there a faster way to do this?
data_1 = data.loc[(data['test1'] == test1) & (data['test2'] == X) & (data['Column'] == col1) & (data['Row']== row1)].Value
Sample data
Column  Row  Value  test2  test1
2       3    5      X      0TO4
2       6    10     Y      100UP
2       10   5.64   Y      10TO14
5       2    9.4    Y      15TO19
9       2    6      X      20TO24
13      11   7.54   X      25TO29
25      2    6.222  X      30TO34
It may be worth a quick read-through on the enhancing performance docs to see what best fits your needs.
One option is to drop down to numpy using .values and slicing. Without seeing your actual data or use case, I created the following synthetic data:
data = pd.DataFrame({'column': [np.random.randint(30) for i in range(100000)],
                     'row': [np.random.randint(50) for i in range(100000)],
                     'value': [np.random.randint(100) + np.random.rand() for i in range(100000)],
                     'test1': [np.random.choice(['X', 'Y']) for i in range(100000)],
                     'test2': [np.random.choice(['d', 'e', 'f', 'g', 'h', 'i']) for i in range(100000)]})
data.head()
   column  row      value test1 test2
0       4   30  88.367151     X     e
1       7   10  92.482926     Y     d
2       1   17  11.151060     Y     i
3      27   10  78.707897     Y     g
4      19   35  95.204207     Y     h
Then, using %timeit, I got the following results for .loc indexing, boolean masking, and numpy slicing.
(Note: at this point I realized I missed one of the lookups, so that may affect the total time count, but the ratios should hold true.)
%timeit data_1 = data.loc[(data['test1'] == 'X') & (data['column'] >=12) & (data['row'] > 22)]['value']
13 ms ± 538 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit data_1 = data[(data['test1'] == 'X') & (data['column'] >=12) & (data['row'] > 22)]['value']
13.1 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Now, this next part contains some overhead for converting the dataframe to a numpy array. If you convert it once and then do multiple lookups against it, this will be faster; if not, you will likely end up taking longer for a single convert-and-slice (a sketch of the convert-once pattern follows the timings below).
Without considering conversion time:
d1=data.values
%timeit d1[(d1[:,3]=='X')&(d1[:,0]>=12)&(d1[:,1]>22)][:,2]
8.37 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Approximately 30% improvement
With conversion time:
%timeit d1=data.values;d1[(d1[:,3]=='X')&(d1[:,0]>=12)&(d1[:,1]>22)][:,2]
20.6 ms ± 624 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Approximately 50% worse
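A rough sketch of the convert-once pattern mentioned above: convert to a numpy array a single time and reuse it for every query (the lookups list and its thresholds are illustrative, not from the original answer):
d1 = data.values
lookups = [('X', 12, 22), ('Y', 5, 10)]  # illustrative (test1, column, row) criteria
results = [d1[(d1[:, 3] == t1) & (d1[:, 0] >= c) & (d1[:, 1] > r)][:, 2]
           for t1, c, r in lookups]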
You can index by test1, test2, Column and Row, and then lookup by that index.
Indexing:
data.set_index(["test1", "test2", "Column", "Row"], inplace=True)
and then lookup by doing this:
data_1 = data.loc[(test1, X, col1, row1)].Value
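If you will repeat this lookup many times, it is usually worth sorting the index once up front, since pandas is slower (and may warn) when indexing an unsorted MultiIndex. A small sketch reusing the same variables as above:
data.set_index(["test1", "test2", "Column", "Row"], inplace=True)
data.sort_index(inplace=True)
data_1 = data.loc[(test1, X, col1, row1)].Value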
