I am looking to learn how to improve the performance of code over a large dataframe (10 million rows). My solution loops over multiple dates (2023-01-10, 2023-01-20, 2023-01-30) for different combinations of category_a and category_b.
The working approach is shown below: it iterates over the dates for each pairing of the two categories by first locating the subset for that particular pair. However, I would like to refactor it to see whether there is a more efficient approach.
My input (df) looks like:
     date         category_a  category_b  outflow  open  inflow  max  close  buy  random_str
0    2023-01-10   4           1           1        0     0       10   0      0    a
1    2023-01-20   4           1           2        0     0       20   nan    nan  a
2    2023-01-30   4           1           10       0     0       20   nan    nan  a
3    2023-01-10   4           2           2        0     0       10   0      0    b
4    2023-01-20   4           2           2        0     0       20   nan    nan  b
5    2023-01-30   4           2           0        0     0       20   nan    nan  b
with two pairs, (4, 1) and (4, 2), over the three dates. My expected output (results) looks like this:
     date         category_a  category_b  outflow  open  inflow  max  close  buy  random_str
0    2023-01-10   4           1           1        0     0       10   -1     23   a
1    2023-01-20   4           1           2        -1    23      20   20     10   a
2    2023-01-30   4           1           10       20    10      20   20     nan  a
3    2023-01-10   4           2           2        0     0       10   -2     24   b
4    2023-01-20   4           2           2        -2    24      20   20     0    b
5    2023-01-30   4           2           0        20    0       20   20     nan  b
I have a working solution that uses pandas dataframes to take a subset and then loop over it, but I would like to see how I can improve its performance, perhaps using numpy, numba, pandas-multiprocessing or dask. Another great idea was to rewrite it in BigQuery SQL.
I am not sure what the best solution would be and I would appreciate any help in improving the performance.
Minimum working example
The code below generates the input dataframe.
import pandas as pd
import numpy as np
# prepare the input df
df = pd.DataFrame({
'date' : ['2023-01-10', '2023-01-20','2023-01-30', '2023-01-10', '2023-01-20','2023-01-30'] ,
'category_a' : [4, 4,4,4, 4, 4] ,
'category_b' : [1, 1,1, 2, 2,2] ,
'outflow' : [1.0, 2.0,10.0, 2.0, 2.0, 0.0],
'open' : [0.0, 0.0, 0.0, 0.0, 0.0, 0.0] ,
'inflow' : [0.0, 0.0, 0.0, 0.0, 0.0, 0.0] ,
'max' : [10.0, 20.0, 20.0 , 10.0, 20.0, 20.0] ,
'close' : [0.0, np.nan,np.nan, 0.0, np.nan, np.nan] ,
'buy' : [0.0, np.nan,np.nan, 0.0, np.nan,np.nan],
'random_str' : ['a', 'a', 'a', 'b', 'b', 'b']
})
df['date'] = pd.to_datetime(df['date'])
# get the unique pairs of category_a and category_b as a list of dicts
unique_pairs = (
    df.groupby(['category_a', 'category_b'])
      .size()
      .reset_index()
      .rename(columns={0: 'count'})[['category_a', 'category_b']]
      .to_dict('records')
)
unique_dates = np.sort(df['date'].unique())
Using this input dataframe, the code below is what I am trying to optimize.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

df = df.set_index('date')
day_0 = unique_dates[0]  # first date

# Using a dictionary comprehension to pre-allocate the result container
list_of_numbers = list(range(len(unique_pairs)))
myset = {key: None for key in list_of_numbers}

for count_pair, value in enumerate(unique_pairs):
    # pair of category_a and category_b
    category_a = value['category_a']
    category_b = value['category_b']

    # subset the dataframe for the pair
    df_subset = df.loc[(df['category_a'] == category_a) & (df['category_b'] == category_b)]
    log.info(f"running for {category_a} and {category_b}")

    # day 0
    df_subset.loc[day_0, 'close'] = df_subset.loc[day_0, 'open'] + df_subset.loc[day_0, 'inflow'] - df_subset.loc[day_0, 'outflow']

    # loop over the single pair using the dates
    for count, date in enumerate(unique_dates[1:], start=1):
        previous_date = unique_dates[count - 1]
        df_subset.loc[date, 'open'] = df_subset.loc[previous_date, 'close']
        df_subset.loc[date, 'close'] = df_subset.loc[date, 'open'] + df_subset.loc[date, 'inflow'] - df_subset.loc[date, 'outflow']
        # if the closing value does not reach 'max', set 'buy' to cover the next period's deficit
        if df_subset.loc[date, 'close'] < df_subset.loc[date, 'max']:
            df_subset.loc[previous_date, 'buy'] = df_subset.loc[date, 'max'] - df_subset.loc[date, 'close'] + df_subset.loc[date, 'inflow']
        elif df_subset.loc[date, 'close'] > df_subset.loc[date, 'max']:
            df_subset.loc[previous_date, 'buy'] = 0
        else:
            df_subset.loc[previous_date, 'buy'] = df_subset.loc[date, 'inflow']
        df_subset.loc[date, 'inflow'] = df_subset.loc[previous_date, 'buy']
        df_subset.loc[date, 'close'] = df_subset.loc[date, 'open'] + df_subset.loc[date, 'inflow'] - df_subset.loc[date, 'outflow']

    # store the dataframe for this pair in the container myset
    myset[count_pair] = df_subset

# combine myset into a single dataframe
result = pd.concat(myset.values()).reset_index(drop=False)
result
We can then check that the result matches the expected output.
from pandas.testing import assert_frame_equal

expected = pd.DataFrame({
    'date' : [pd.Timestamp('2023-01-10'), pd.Timestamp('2023-01-20'), pd.Timestamp('2023-01-30'), pd.Timestamp('2023-01-10'), pd.Timestamp('2023-01-20'), pd.Timestamp('2023-01-30')],
    'category_a' : [4, 4, 4, 4, 4, 4],
    'category_b' : [1, 1, 1, 2, 2, 2],
    'outflow' : [1.0, 2.0, 10.0, 2.0, 2.0, 0.0],
    'open' : [0.0, -1.0, 20.0, 0.0, -2.0, 20.0],
    'inflow' : [0.0, 23.0, 10.0, 0.0, 24.0, 0.0],
    'max' : [10.0, 20.0, 20.0, 10.0, 20.0, 20.0],
    'close' : [-1.0, 20.0, 20.0, -2.0, 20.0, 20.0],
    'buy' : [23.0, 10.0, np.nan, 24.0, 0.0, np.nan],
    'random_str' : ['a', 'a', 'a', 'b', 'b', 'b']
})

# check that the result is the same as expected
# (outflow and max are floats here so that the dtypes match the input dataframe)
assert_frame_equal(result, expected)
SQL to create the first table
The solution can also be in SQL; if so, you can use the following code to create the initial table.
I am also busy trying to implement a solution in BigQuery SQL using a user-defined function to carry the same logic, which would be another nice approach to solving the problem.
WITH data AS (
  SELECT DATE '2023-01-10' as date, 4 as category_a, 1 as category_b, 1 as outflow, 0 as open, 0 as inflow, 10 as max, 0 as close, 0 as buy, 'a' as random_str
  UNION ALL
  SELECT DATE '2023-01-20' as date, 4 as category_a, 1 as category_b, 2 as outflow, 0 as open, 0 as inflow, 20 as max, NULL as close, NULL as buy, 'a' as random_str
  UNION ALL
  SELECT DATE '2023-01-30' as date, 4 as category_a, 1 as category_b, 10 as outflow, 0 as open, 0 as inflow, 20 as max, NULL as close, NULL as buy, 'a' as random_str
  UNION ALL
  SELECT DATE '2023-01-10' as date, 4 as category_a, 2 as category_b, 2 as outflow, 0 as open, 0 as inflow, 10 as max, 0 as close, 0 as buy, 'b' as random_str
  UNION ALL
  SELECT DATE '2023-01-20' as date, 4 as category_a, 2 as category_b, 2 as outflow, 0 as open, 0 as inflow, 20 as max, NULL as close, NULL as buy, 'b' as random_str
  UNION ALL
  SELECT DATE '2023-01-30' as date, 4 as category_a, 2 as category_b, 0 as outflow, 0 as open, 0 as inflow, 20 as max, NULL as close, NULL as buy, 'b' as random_str
)
SELECT
  ROW_NUMBER() OVER (ORDER BY date) AS row_num,
  date,
  category_a,
  category_b,
  outflow,
  open,
  inflow,
  max,
  close,
  buy,
  random_str
FROM data
Efficient algorithm
First of all, the complexity of the algorithm can be improved. Indeed, (df['category_a'] == category_a) & (df['category_b'] == category_b) scans the whole dataframe, and this is done for every item of unique_pairs. The running time is therefore O(U R), where U = len(unique_pairs) and R = len(df).
An efficient solution is to perform a groupby, that is, to split the dataframe into U groups, each sharing the same category pair. This operation can be done in O(R) time, where R is the number of rows of the dataframe. In practice, Pandas may implement it with a (comparison-based) sort running in O(R log R) time.
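For example, here is a minimal sketch of the difference, assuming the df and unique_pairs variables from the question's setup code (before the date index is set):
# O(U * R): every pair triggers a full scan of the dataframe
subsets_slow = {
    (p['category_a'], p['category_b']):
        df[(df['category_a'] == p['category_a']) & (df['category_b'] == p['category_b'])]
    for p in unique_pairs
}

# O(R) split done once (or O(R log R) if Pandas sorts), then each lookup is a cheap dict access
groups = dict(list(df.groupby(['category_a', 'category_b'])))
subset_fast = groups[(4, 1)]  # same rows as subsets_slow[(4, 1)]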
Faster access & Conversion to Numpy
Moreover, accessing a dataframe item by item using loc is very slow. Indeed, Pandas needs to locate the column using an internal dictionary, find the row matching the provided date, extract the value at the ith row and jth column, create a new Python object and return it, not to mention the several checks performed along the way (e.g. types and bounds). On top of that, Pandas adds significant overhead, partly because its code is interpreted, typically with CPython.
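As a rough illustration of that overhead, here is a small timing sketch (assuming the df and unique_dates built by the question's setup code; the absolute numbers depend on the machine, only the gap matters):
import timeit

sub = df[(df['category_a'] == 4) & (df['category_b'] == 1)].set_index('date')
col_open = sub['open'].to_numpy()
day = unique_dates[0]

t_loc = timeit.timeit(lambda: sub.loc[day, 'open'], number=10_000)  # label-based scalar access
t_arr = timeit.timeit(lambda: col_open[0], number=10_000)           # plain NumPy indexing
print(f"loc: {t_loc:.4f} s    numpy: {t_arr:.4f} s    (10,000 accesses each)")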
A faster solution is to extract the columns ahead of time and to iterate over the rows using integer indices instead of labels (like dates). The catch is that the order of the sorted dates may not match the row order of a given subset. I guess it does for your input dataframe in practice, but if it does not, you can sort each precomputed group by date. I also assume that every date is present in every subset (again, if this is not the case, you can correct the result of the groupby, as sketched below). Each column can then be converted to Numpy so that accesses are faster. The result is pure-Numpy code that no longer goes through Pandas in the hot loop. Computationally-intensive Numpy code is great because it can often be heavily optimized, especially when the target arrays contain native numerical types.
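For completeness, a hedged sketch of those two safety steps (only needed if your real data is not already sorted within each group, or can have missing dates; it assumes df is already indexed by date as in the implementation below):
groups = {}
for key, g in df.groupby(['category_a', 'category_b']):
    g = g.sort_index()             # ensure chronological order within the group
    # g = g.reindex(unique_dates)  # optional: add (NaN) rows for any missing dates
    groups[key] = g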
Here is the implementation so far:
df = df.set_index('date')
day_0 = unique_dates[0]  # first date

# Using a dictionary comprehension to pre-allocate the result container
list_of_numbers = list(range(len(unique_pairs)))
myset = {key: None for key in list_of_numbers}

# Split the dataframe once per category pair
groups = dict(list(df.groupby(['category_a', 'category_b'])))

for count_pair, value in enumerate(unique_pairs):
    # pair of category_a and category_b
    category_a = value['category_a']
    category_b = value['category_b']

    # subset the dataframe for the pair
    df_subset = groups[(category_a, category_b)]

    # Extraction of the Pandas columns and conversion to Numpy arrays
    col_open = df_subset['open'].to_numpy()
    col_close = df_subset['close'].to_numpy()
    col_inflow = df_subset['inflow'].to_numpy()
    col_outflow = df_subset['outflow'].to_numpy()
    col_max = df_subset['max'].to_numpy()
    col_buy = df_subset['buy'].to_numpy()

    # day 0
    col_close[0] = col_open[0] + col_inflow[0] - col_outflow[0]

    # loop over the single pair using integer indices
    for i in range(1, len(unique_dates)):
        col_open[i] = col_close[i-1]
        col_close[i] = col_open[i] + col_inflow[i] - col_outflow[i]
        # if the closing value does not reach 'max', set 'buy' to cover the next period's deficit
        if col_close[i] < col_max[i]:
            col_buy[i-1] = col_max[i] - col_close[i] + col_inflow[i]
        elif col_close[i] > col_max[i]:
            col_buy[i-1] = 0
        else:
            col_buy[i-1] = col_inflow[i]
        col_inflow[i] = col_buy[i-1]
        col_close[i] = col_open[i] + col_inflow[i] - col_outflow[i]

    # store the dataframe for this pair in the container myset
    myset[count_pair] = df_subset

# combine myset into a single dataframe
result = pd.concat(myset.values()).reset_index(drop=False)
result
This code is not only faster, but also a bit easier to read.
Fast execution using Numba
At this point, the general advice would be to use vectorized functions, but that is really not easy to do efficiently here (if it is possible at all) because of the loop-carried dependencies and the conditionals. A fast solution is to use a JIT compiler like Numba to generate a very fast implementation. Numba is designed to work efficiently on natively-typed Numpy arrays, so this is the perfect use case. Note that Numba needs the input parameters to have well-defined (native) types. Providing the types manually causes Numba to generate the code eagerly (when the function is defined) instead of lazily (on the first call).
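As a generic illustration of the two compilation modes (a small sketch, unrelated to the actual solution code):
import numba as nb

@nb.njit                                  # lazy: compiled on the first call, for the types it sees
def add_lazy(a, b):
    return a + b

@nb.njit('float64(float64, float64)')     # eager: compiled at definition time, for this exact signature
def add_eager(a, b):
    return a + b

add_lazy(1.0, 2.0)   # triggers the lazy compilation
add_eager(1.0, 2.0)  # already compiled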
Here is the final resulting code:
import numba as nb

# eager compilation: the signature below matches the float64 columns of the example dataframe
@nb.njit('(float64[:], float64[:], float64[:], float64[:], float64[:], float64[:], int64)')
def compute(col_open, col_close, col_inflow, col_outflow, col_max, col_buy, n):
    # Important checks to avoid out-of-bounds accesses that are
    # not verified by Numba for the sake of performance.
    # If they do not hold and are not checked, the function can simply crash.
    assert col_open.size == n and col_close.size == n
    assert col_inflow.size == n and col_outflow.size == n
    assert col_max.size == n and col_buy.size == n

    # day 0
    col_close[0] = col_open[0] + col_inflow[0] - col_outflow[0]

    # loop over the single pair using integer indices
    for i in range(1, n):
        col_open[i] = col_close[i-1]
        col_close[i] = col_open[i] + col_inflow[i] - col_outflow[i]
        # if the closing value does not reach 'max', set 'buy' to cover the next period's deficit
        if col_close[i] < col_max[i]:
            col_buy[i-1] = col_max[i] - col_close[i] + col_inflow[i]
        elif col_close[i] > col_max[i]:
            col_buy[i-1] = 0
        else:
            col_buy[i-1] = col_inflow[i]
        col_inflow[i] = col_buy[i-1]
        col_close[i] = col_open[i] + col_inflow[i] - col_outflow[i]


df = df.set_index('date')
day_0 = unique_dates[0]  # first date

# Using a dictionary comprehension to pre-allocate the result container
list_of_numbers = list(range(len(unique_pairs)))
myset = {key: None for key in list_of_numbers}

# Split the dataframe once per category pair
groups = dict(list(df.groupby(['category_a', 'category_b'])))

for count_pair, value in enumerate(unique_pairs):
    # pair of category_a and category_b
    category_a = value['category_a']
    category_b = value['category_b']

    # subset the dataframe for the pair
    df_subset = groups[(category_a, category_b)]

    # Extraction of the Pandas columns and conversion to Numpy arrays
    col_open = df_subset['open'].to_numpy()
    col_close = df_subset['close'].to_numpy()
    col_inflow = df_subset['inflow'].to_numpy()
    col_outflow = df_subset['outflow'].to_numpy()
    col_max = df_subset['max'].to_numpy()
    col_buy = df_subset['buy'].to_numpy()

    # Numba-accelerated computation
    compute(col_open, col_close, col_inflow, col_outflow, col_max, col_buy, len(unique_dates))

    # store the dataframe for this pair in the container myset
    myset[count_pair] = df_subset

# combine myset into a single dataframe
result = pd.concat(myset.values()).reset_index(drop=False)
result
Feel free to change the parameter types if they do not match your real-world data types (e.g. int32 vs int64, or float64 vs int64). Note that you can replace things like float64[:] with float64[::1] if you know the input arrays are contiguous, which is likely the case; this lets Numba generate faster code.
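For example, the eager signature above could be declared like this (an untested sketch; only valid if the arrays really are C-contiguous, which Series.to_numpy() normally returns):
# Same compute() as above, but the signature promises C-contiguous arrays,
# which lets Numba generate slightly faster code.
@nb.njit('(float64[::1], float64[::1], float64[::1], float64[::1], float64[::1], float64[::1], int64)')
def compute_contiguous(col_open, col_close, col_inflow, col_outflow, col_max, col_buy, n):
    pass  # body identical to compute() above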
Also note that myset can simply be a list, since count_pair is an increasing integer. This is simpler and slightly faster, unless the dictionary keys are useful in your real-world code.
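A sketch of that simplification (same loop as above, only the container changes):
results = []  # a plain list is enough since the pairs are processed in order
for value in unique_pairs:
    df_subset = groups[(value['category_a'], value['category_b'])]
    # ... extract the columns and call compute(...) exactly as above ...
    results.append(df_subset)
result = pd.concat(results).reset_index(drop=False)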
Performance results
The Numba function call runs in about 1 µs on my machine, as opposed to 7.1 ms for the initial code, so the hot part of the code is about 7,100 times faster on this tiny example. That being said, Pandas still takes some time to convert the columns to Numpy, to create the groups and to merge the dataframes. The conversion takes a small constant time that is negligible for large arrays. The two latter operations take more time on bigger input dataframes and are actually the main bottleneck on my machine (each takes about 1 ms on the small example). Overall, the whole initial code takes 16.5 ms on my machine for the tiny example dataframe, while the new one takes 3.1 ms, i.e. a 5.3x speed-up just for this small input. On bigger input dataframes the speed-up should be significantly better. Finally, please note that df.groupby(['category_a', 'category_b']) was actually already precomputed in the question, so I am not even sure we should include it in the benchmark ;) .
Imagine I have the following data frame:
Product  Month 1  Month 2  Month 3  Month 4  Total
Stuff A  5        0        3        3        11
Stuff B  10       11       4        8        33
Stuff C  0        0        23       30       53
that can be constructed from:
df = pd.DataFrame({'Product': ['Stuff A', 'Stuff B', 'Stuff C'],
'Month 1': [5, 10, 0],
'Month 2': [0, 11, 0],
'Month 3': [3, 4, 23],
'Month 4': [3, 8, 30],
'Total': [11, 33, 53]})
This data frame shows the amount of units sold per product, per month.
Now, what I want to do is to create a new column called "Average" that calculates the average units sold per month. HOWEVER, notice in this example that Stuff C's values for months 1 and 2 are 0. This product was probably introduced in Month 3, so its average should be calculated based on months 3 and 4 only. Also notice that Stuff A's units sold in Month 2 were 0, but that does not mean the product was introduced in Month 3 since 5 units were sold in Month 1. That is, its average should be calculated based on all four months. Assume that the provided data frame may contain any number of months.
Based on these conditions, I have come up with the following solution in pseudo-code:
months = ["list of index names of months to calculate"]
x = len(months)

if df["Month 1"] != 0:
    df["Average"] = df["Total"] / x
elif df["Month 2"] != 0:
    df["Average"] = df["Total"] / (x - 1)
...
elif df["Month " + str(x)] != 0:
    df["Average"] = df["Total"] / 1
else:
    df["Average"] = 0
That way, the average would be calculated starting from the first month where units sold are different from 0. However, I haven't been able to translate this logical abstraction into actual working code. I couldn't manage to iterate over len(months) while maintaining the elif conditions. Or maybe there is a better, more practical approach.
I would appreciate any help, since I've been trying to crack this problem for a while with no success.
There is a NumPy function, np.trim_zeros, that trims leading and/or trailing zeros. Using a list comprehension, you can iterate over the relevant DataFrame rows, trim the leading zeros and take the mean of what remains for each row.
Note that since 'Month 1' to 'Month 4' are consecutive, you can slice the columns between them using .loc.
import numpy as np
df['Average Sales'] = [np.trim_zeros(row, trim='f').mean() for row in df.loc[:, 'Month 1':'Month 4'].to_numpy()]
Output:
Product Month 1 Month 2 Month 3 Month 4 Total Average Sales
0 Stuff A 5 0 3 3 11 2.75
1 Stuff B 10 11 4 8 33 8.25
2 Stuff C 0 0 23 30 53 26.50
Try:
df = df.set_index(['Product','Total'])
df['Average'] = df.where(df.ne(0).cummax(axis=1)).mean(axis=1)
df_out=df.reset_index()
print(df_out)
Output:
Product Total Month 1 Month 2 Month 3 Month 4 Average
0 Stuff A 11 5 0 3 3 2.75
1 Stuff B 33 10 11 4 8 8.25
2 Stuff C 53 0 0 23 30 26.50
Details:
Move Product and Total into the dataframe index, so the calculation runs only on the month columns.
First create a boolean matrix by comparing the values to zero with ne. Then apply cummax along the rows: once a non-zero value is seen, the result stays True until the end of the row; if a row starts with zeros, it stays False until the first non-zero value and then turns (and remains) True.
Next, use pd.DataFrame.where to keep only the values where that boolean matrix is True; the other values (the leading zeros) become NaN and are not used in the calculation of mean (see the sketch below).
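Here is a small sketch of those intermediate steps, starting again from the original frame in the question:
months = df.set_index(['Product', 'Total'])
mask = months.ne(0).cummax(axis=1)      # False on leading zeros, True from the first non-zero onwards
print(mask)
print(months.where(mask))               # leading zeros become NaN ...
print(months.where(mask).mean(axis=1))  # ... so the row-wise mean ignores them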
If you don't mind it being a little memory-inefficient, you could put your dataframe into a NumPy array. NumPy has built-in functionality to select the non-zero entries of an array, and you can then use the mean function to calculate the average. It could look something like this:
import numpy as np
arr = np.array(Stuff_A_DF)
mean = arr[np.nonzero(arr)].mean()
Alternatively, you could manually extract each row to a list, then loop through it to remove the zeroes.
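A rough sketch of that manual alternative (hypothetical column names matching the example; note that, like the np.nonzero version above, it drops every zero rather than only the leading ones):
month_cols = ['Month 1', 'Month 2', 'Month 3', 'Month 4']

averages = []
for _, row in df[month_cols].iterrows():
    values = [v for v in row.tolist() if v != 0]               # drop the zeros from this row
    averages.append(sum(values) / len(values) if values else 0)

df['Average'] = averages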