I am looking to learn how to improve the performance of code over a large dataframe (10 million rows). My solution loops over multiple dates (2023-01-10, 2023-01-20, 2023-01-30) for different combinations of category_a and category_b.
The working approach is shown below: it iterates over the dates for each pairing of the two categories by first locating the subset for that pair. However, I would like to refactor it to see if there is a more efficient approach.
My input (df) looks like:
   date        category_a  category_b  outflow  open  inflow  max  close  buy  random_str
0  2023-01-10           4           1        1     0       0   10      0    0           a
1  2023-01-20           4           1        2     0       0   20    nan  nan           a
2  2023-01-30           4           1       10     0       0   20    nan  nan           a
3  2023-01-10           4           2        2     0       0   10      0    0           b
4  2023-01-20           4           2        2     0       0   20    nan  nan           b
5  2023-01-30           4           2        0     0       0   20    nan  nan           b
There are 2 pairs, (4, 1) and (4, 2), over the days, and my expected output (results) looks like this:
   date        category_a  category_b  outflow  open  inflow  max  close  buy  random_str
0  2023-01-10           4           1        1     0       0   10     -1   23           a
1  2023-01-20           4           1        2    -1      23   20     20   10           a
2  2023-01-30           4           1       10    20      10   20     20  nan           a
3  2023-01-10           4           2        2     0       0   10     -2   24           b
4  2023-01-20           4           2        2    -2      24   20     20    0           b
5  2023-01-30           4           2        0    20       0   20     20  nan           b
I have a working solution that uses pandas dataframes to take a subset and then loop over it, but I would like to see how I can improve its performance, perhaps using NumPy, Numba, pandas with multiprocessing, or Dask. Another great idea was to rewrite it in BigQuery SQL.
I am not sure what the best solution would be and I would appreciate any help in improving the performance.
Minimum working example
The code below generates the input dataframe.
import pandas as pd
import numpy as np
# prepare the input df
df = pd.DataFrame({
    'date'       : ['2023-01-10', '2023-01-20', '2023-01-30', '2023-01-10', '2023-01-20', '2023-01-30'],
    'category_a' : [4, 4, 4, 4, 4, 4],
    'category_b' : [1, 1, 1, 2, 2, 2],
    'outflow'    : [1.0, 2.0, 10.0, 2.0, 2.0, 0.0],
    'open'       : [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    'inflow'     : [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    'max'        : [10.0, 20.0, 20.0, 10.0, 20.0, 20.0],
    'close'      : [0.0, np.nan, np.nan, 0.0, np.nan, np.nan],
    'buy'        : [0.0, np.nan, np.nan, 0.0, np.nan, np.nan],
    'random_str' : ['a', 'a', 'a', 'b', 'b', 'b']
})
df['date'] = pd.to_datetime(df['date'])

# get the unique pairs of category_a and category_b as a list of records
unique_pairs = (
    df.groupby(['category_a', 'category_b'])
      .size()
      .reset_index()
      .rename(columns={0: 'count'})[['category_a', 'category_b']]
      .to_dict('records')
)
unique_dates = np.sort(df['date'].unique())
Using this input dataframe, the code below is what I am trying to optimize.
df = df.set_index('date')
day_0 = unique_dates[0]  # first date

# Using a dictionary comprehension to prepare the results container
list_of_numbers = list(range(len(unique_pairs)))
myset = {key: None for key in list_of_numbers}

for count_pair, value in enumerate(unique_pairs):
    # pair of category_a and category_b
    category_a = value['category_a']
    category_b = value['category_b']

    # subset the dataframe for the pair
    df_subset = df.loc[(df['category_a'] == category_a) & (df['category_b'] == category_b)]
    print(f"running for {category_a} and {category_b}")

    # day 0
    df_subset.loc[day_0, 'close'] = df_subset.loc[day_0, 'open'] + df_subset.loc[day_0, 'inflow'] - df_subset.loc[day_0, 'outflow']

    # loop over a single pair using the dates
    for count, date in enumerate(unique_dates[1:], start=1):
        previous_date = unique_dates[count - 1]
        df_subset.loc[date, 'open'] = df_subset.loc[previous_date, 'close']
        df_subset.loc[date, 'close'] = df_subset.loc[date, 'open'] + df_subset.loc[date, 'inflow'] - df_subset.loc[date, 'outflow']

        # if the closing value is below the max, buy the deficit for the next period
        if df_subset.loc[date, 'close'] < df_subset.loc[date, 'max']:
            df_subset.loc[previous_date, 'buy'] = df_subset.loc[date, 'max'] - df_subset.loc[date, 'close'] + df_subset.loc[date, 'inflow']
        elif df_subset.loc[date, 'close'] > df_subset.loc[date, 'max']:
            df_subset.loc[previous_date, 'buy'] = 0
        else:
            df_subset.loc[previous_date, 'buy'] = df_subset.loc[date, 'inflow']

        df_subset.loc[date, 'inflow'] = df_subset.loc[previous_date, 'buy']
        df_subset.loc[date, 'close'] = df_subset.loc[date, 'open'] + df_subset.loc[date, 'inflow'] - df_subset.loc[date, 'outflow']

    # store all the dataframes in a container myset
    myset[count_pair] = df_subset

# make myset into a dataframe
result = pd.concat(myset.values()).reset_index(drop=False)
result
We can then check that the result matches the expected output.
from pandas.testing import assert_frame_equal
expected = pd.DataFrame({
    'date'       : [pd.Timestamp('2023-01-10'), pd.Timestamp('2023-01-20'), pd.Timestamp('2023-01-30'),
                    pd.Timestamp('2023-01-10'), pd.Timestamp('2023-01-20'), pd.Timestamp('2023-01-30')],
    'category_a' : [4, 4, 4, 4, 4, 4],
    'category_b' : [1, 1, 1, 2, 2, 2],
    'outflow'    : [1, 2, 10, 2, 2, 0],
    'open'       : [0.0, -1.0, 20.0, 0.0, -2.0, 20.0],
    'inflow'     : [0.0, 23.0, 10.0, 0.0, 24.0, 0.0],
    'max'        : [10, 20, 20, 10, 20, 20],
    'close'      : [-1.0, 20.0, 20.0, -2.0, 20.0, 20.0],
    'buy'        : [23.0, 10.0, np.nan, 24.0, 0.0, np.nan],
    'random_str' : ['a', 'a', 'a', 'b', 'b', 'b']
})
# check that the result is the same as expected
assert_frame_equal(result, expected)
SQL to create the first table
The solution can also be in SQL; in that case you can use the following code to create the initial table.
I am busy trying to implement a solution in BigQuery SQL using a user-defined function to carry the same logic. That would be a nice approach to solving the problem too.
WITH data AS (
  SELECT DATE '2023-01-10' as date, 4 as category_a, 1 as category_b, 1 as outflow, 0 as open, 0 as inflow, 10 as max, 0 as close, 0 as buy, 'a' as random_str
  UNION ALL
  SELECT DATE '2023-01-20' as date, 4 as category_a, 1 as category_b, 2 as outflow, 0 as open, 0 as inflow, 20 as max, NULL as close, NULL as buy, 'a' as random_str
  UNION ALL
  SELECT DATE '2023-01-30' as date, 4 as category_a, 1 as category_b, 10 as outflow, 0 as open, 0 as inflow, 20 as max, NULL as close, NULL as buy, 'a' as random_str
  UNION ALL
  SELECT DATE '2023-01-10' as date, 4 as category_a, 2 as category_b, 2 as outflow, 0 as open, 0 as inflow, 10 as max, 0 as close, 0 as buy, 'b' as random_str
  UNION ALL
  SELECT DATE '2023-01-20' as date, 4 as category_a, 2 as category_b, 2 as outflow, 0 as open, 0 as inflow, 20 as max, NULL as close, NULL as buy, 'b' as random_str
  UNION ALL
  SELECT DATE '2023-01-30' as date, 4 as category_a, 2 as category_b, 0 as outflow, 0 as open, 0 as inflow, 20 as max, NULL as close, NULL as buy, 'b' as random_str
)
SELECT
  ROW_NUMBER() OVER (ORDER BY date) as row_index,
  date,
  category_a,
  category_b,
  outflow,
  open,
  inflow,
  max,
  close,
  buy,
  random_str
FROM data
Efficient algorithm
First of all, the complexity of the algorithm can be improved. Indeed, (df['category_a'] == category_a) & (df['category_b'] == category_b) traverses the whole dataframe, and this is done for each item of unique_pairs, so the running time is O(U R) where U = len(unique_pairs) and R = len(df).
An efficient solution is to perform a groupby, that is, to split the dataframe into U groups, each sharing the same pair of categories. This operation can be done in O(R) time. In practice, Pandas may implement it with a (comparison-based) sort running in O(R log R) time.
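As a rough sketch (reusing df and unique_pairs from the question; the subsets_slow / subsets_fast names are just illustrative), the difference looks like this:
# O(U * R): scans the whole dataframe once per pair
subsets_slow = {
    (p['category_a'], p['category_b']):
        df[(df['category_a'] == p['category_a']) & (df['category_b'] == p['category_b'])]
    for p in unique_pairs
}

# O(R) (or O(R log R) with a sort): splits the dataframe once, one group per pair
subsets_fast = dict(list(df.groupby(['category_a', 'category_b'])))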
Faster access & Conversion to Numpy
Moreover, accessing a dataframe item by item using loc is very slow. Pandas needs to find the column through an internal dictionary, locate the row from the provided date, extract the value at the ith row and jth column, and create and return a new object, not to mention the several checks performed along the way (e.g. type and bounds checks). On top of that, Pandas adds a significant overhead, partly because its code is typically interpreted by CPython.
A faster solution is to extract the columns ahead of time and to iterate over the rows using integer indices instead of labels (like dates). The catch is that the sorted order of the dates may not match the row order of each subset; I assume it does for your input dataframe, but if not, you can sort each precomputed group by date. I also assume every date is present in every subset (again, if that is not the case, you can fix up the result of the groupby). Each column can then be converted to a NumPy array so the inner loop gets faster. The result is pure-NumPy code that no longer uses Pandas in the hot loop. Computationally intensive NumPy code is great because it can often be heavily optimized, especially when the target arrays contain native numerical types.
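If those assumptions do not hold for your data, a minimal sketch of the per-group preprocessing described above (illustrative only, to be run on each df_subset before extracting the columns) could be:
# Illustrative only: make sure each per-pair subset is sorted by date and has a
# row for every date (missing rows become NaN; adapt the fill strategy to your data).
df_subset = df_subset.sort_index()
df_subset = df_subset.reindex(unique_dates)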
Here is the implementation so far:
df = df.set_index('date')
day_0 = unique_dates[0]  # first date

# Using a dictionary comprehension to prepare the results container
list_of_numbers = list(range(len(unique_pairs)))
myset = {key: None for key in list_of_numbers}

groups = dict(list(df.groupby(['category_a', 'category_b'])))

for count_pair, value in enumerate(unique_pairs):
    # pair of category_a and category_b
    category_a = value['category_a']
    category_b = value['category_b']

    # subset the dataframe for the pair
    df_subset = groups[(category_a, category_b)]

    # Extraction of the Pandas columns and conversion to Numpy ones
    col_open = df_subset['open'].to_numpy()
    col_close = df_subset['close'].to_numpy()
    col_inflow = df_subset['inflow'].to_numpy()
    col_outflow = df_subset['outflow'].to_numpy()
    col_max = df_subset['max'].to_numpy()
    col_buy = df_subset['buy'].to_numpy()

    # day 0
    col_close[0] = col_open[0] + col_inflow[0] - col_outflow[0]

    # loop over a single pair using the date index
    for i in range(1, len(unique_dates)):
        col_open[i] = col_close[i-1]
        col_close[i] = col_open[i] + col_inflow[i] - col_outflow[i]

        # if the closing value is below the max, buy the deficit for the next period
        if col_close[i] < col_max[i]:
            col_buy[i-1] = col_max[i] - col_close[i] + col_inflow[i]
        elif col_close[i] > col_max[i]:
            col_buy[i-1] = 0
        else:
            col_buy[i-1] = col_inflow[i]

        col_inflow[i] = col_buy[i-1]
        col_close[i] = col_open[i] + col_inflow[i] - col_outflow[i]

    # store all the dataframes in a container myset
    myset[count_pair] = df_subset

# make myset into a dataframe
result = pd.concat(myset.values()).reset_index(drop=False)
result
This code is not only faster, but also a bit easier to read.
Fast execution using Numba
At this point, the general solution would be to use vectorized functions, but that is really not easy to do efficiently here (if it is even possible) because of the loop-carried dependencies and the conditionals. A fast solution is to use a JIT compiler like Numba to generate a very fast implementation. Numba is designed to work efficiently on natively-typed NumPy arrays, so this is the perfect use case. Note that Numba needs the input parameters to have a well-defined (native) type. Providing the types manually causes Numba to compile the code eagerly (when the function is defined) instead of lazily (on the first call).
Here is the final resulting code:
import numba as nb
@nb.njit('(float64[:], float64[:], float64[:], int64[:], int64[:], float64[:], int64)')
def compute(col_open, col_close, col_inflow, col_outflow, col_max, col_buy, n):
    # Important checks to avoid out-of-bounds accesses that are not
    # performed by Numba for the sake of performance. If they do not
    # hold and are not checked, the function can simply crash.
    assert col_open.size == n and col_close.size == n
    assert col_inflow.size == n and col_outflow.size == n
    assert col_max.size == n and col_buy.size == n

    # day 0
    col_close[0] = col_open[0] + col_inflow[0] - col_outflow[0]

    # loop over a single pair using the date index
    for i in range(1, n):
        col_open[i] = col_close[i-1]
        col_close[i] = col_open[i] + col_inflow[i] - col_outflow[i]

        # if the closing value is below the max, buy the deficit for the next period
        if col_close[i] < col_max[i]:
            col_buy[i-1] = col_max[i] - col_close[i] + col_inflow[i]
        elif col_close[i] > col_max[i]:
            col_buy[i-1] = 0
        else:
            col_buy[i-1] = col_inflow[i]

        col_inflow[i] = col_buy[i-1]
        col_close[i] = col_open[i] + col_inflow[i] - col_outflow[i]
df = df.set_index('date')
day_0 = unique_dates[0]  # first date

# Using a dictionary comprehension to prepare the results container
list_of_numbers = list(range(len(unique_pairs)))
myset = {key: None for key in list_of_numbers}

groups = dict(list(df.groupby(['category_a', 'category_b'])))

for count_pair, value in enumerate(unique_pairs):
    # pair of category_a and category_b
    category_a = value['category_a']
    category_b = value['category_b']

    # subset the dataframe for the pair
    df_subset = groups[(category_a, category_b)]

    # Extraction of the Pandas columns and conversion to Numpy ones
    col_open = df_subset['open'].to_numpy()
    col_close = df_subset['close'].to_numpy()
    col_inflow = df_subset['inflow'].to_numpy()
    col_outflow = df_subset['outflow'].to_numpy()
    col_max = df_subset['max'].to_numpy()
    col_buy = df_subset['buy'].to_numpy()

    # Numba-accelerated computation
    compute(col_open, col_close, col_inflow, col_outflow, col_max, col_buy, len(unique_dates))

    # store all the dataframes in a container myset
    myset[count_pair] = df_subset

# make myset into a dataframe
result = pd.concat(myset.values()).reset_index(drop=False)
result
Feel free to change the types of the parameters if they do not match the real-world input data types (e.g. int32 vs int64, or float64 vs int64). Note that you can replace things like float64[:] with float64[::1] if you know the input arrays are contiguous, which is likely the case; this generates faster code.
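For instance, keeping the same dtypes as in the signature above, the contiguous variant would look like this sketch (the body is elided, so it only shows the signature syntax):
# Same function as above, but declaring contiguous (C-ordered) 1D arrays:
@nb.njit('(float64[::1], float64[::1], float64[::1], int64[::1], int64[::1], float64[::1], int64)')
def compute(col_open, col_close, col_inflow, col_outflow, col_max, col_buy, n):
    # ... same body as in the previous snippet ...
    pass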
Also note that myset can be a list, since count_pair is an increasing integer. That would be simpler and slightly faster, but the dictionary might still be useful in your real-world code.
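As a minimal sketch (reusing groups and unique_pairs from the snippets above, and eliding the per-pair computation), the list-based variant would look like:
results = []
for value in unique_pairs:
    df_subset = groups[(value['category_a'], value['category_b'])]
    # ... extract the columns and call compute() exactly as above ...
    results.append(df_subset)
result = pd.concat(results).reset_index(drop=False)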
Performance results
The Numba function call runs in about 1 µs on my machine, as opposed to 7.1 ms for the initial code. This means the hot part of the code is about 7100 times faster, just on this tiny example. That being said, Pandas takes some time to convert the columns to NumPy arrays, to create the groups and to merge the dataframes. The former takes a small constant time that is negligible for large arrays. The two latter operations take more time on bigger input dataframes and are actually the main bottleneck on my machine (each takes about 1 ms on the small example). Overall, the whole initial code takes 16.5 ms on my machine for the tiny example dataframe, while the new one takes 3.1 ms, i.e. 5.3 times faster just for this small input. On bigger input dataframes the speed-up should be significantly better. Finally, please note that df.groupby(['category_a', 'category_b']) was actually already precomputed, so I am not even sure we should include it in the benchmark ;) .
I have used sklearn to fit and predict a model, but I want to have the top 5 predictions (in terms of probabilities) per item.
So I used predict_proba, which gave me a list of lists like:
probabilities = [[0.8,0.15,0.5,0,0],[0.4,0.6,0,0,0],[0,0,0,0,1]]
What I want to do is loop over this list of lists to get an overview of each prediction made, along with its position in the list (which represents the class).
Using [i for i, j in enumerate(predicted_proba[0]) if j > 0] returns [0], [1], which is what I want for the complete list of lists (and, if possible, with the probability next to each index).
However, when I try to wrap a for-loop around the above code, it returns an IndexError.
Something like this:
probabilities = [[0.8, 0.15, 0.5, 0, 0], [0.4, 0.6, 0, 0, 0], [0, 0, 0, 0, 1]]
for i in range(len(probabilities)):
    print("Iteration_number:", i)
    for index, prob in enumerate(probabilities[i]):
        print("index", index, "=", prob)
Results in:
Iteration_number: 0
index 0 = 0.8
index 1 = 0.15
index 2 = 0.5
index 3 = 0
index 4 = 0
Iteration_number: 1
index 0 = 0.4
index 1 = 0.6
index 2 = 0
index 3 = 0
index 4 = 0
Iteration_number: 2
index 0 = 0
index 1 = 0
index 2 = 0
index 3 = 0
index 4 = 1
for i in predicted_proba:
    for index, value in enumerate(i):
        if value > 0:
            print(index)
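If you also want the probability next to each class index, and only the top 5 per item as asked, here is a minimal sketch in plain Python (the variable names are just illustrative):
probabilities = [[0.8, 0.15, 0.5, 0, 0], [0.4, 0.6, 0, 0, 0], [0, 0, 0, 0, 1]]
for row_number, row in enumerate(probabilities):
    # pair each class index with its probability, sort by probability, keep the top 5
    top5 = sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)[:5]
    print("Iteration_number:", row_number, "->", top5)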
Hope this helps.
I'm pretty sure there's a really simple solution for this and I'm just not realising it. However...
I have a data frame of high-frequency data. Call this data frame A. I also have a separate list of far lower frequency demarcation points, call this B. I would like to append a column to A that would display 1 if A's timestamp column is between B[0] and B[1], 2 if it is between B[1] and B[2], and so on.
As said, it's probably incredibly trivial, and I'm just not realising it at this late an hour.
Here is a quick and dirty approach using a list comprehension.
>>> df = pd.DataFrame({'A': np.arange(1, 3, 0.2)})
>>> A = df.A.values.tolist()
A: [1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8]
>>> B = np.arange(0, 3, 1).tolist()
B: [0, 1, 2]
>>> BA = [k for k in range(0, len(B)-1) for a in A if (B[k]<=a) & (B[k+1]>a) or (a>max(B))]
BA: [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Use searchsorted:
A['group'] = B['timestamp'].searchsorted(A['timestamp'])
For each value in A['timestamp'], an index value is returned. That index indicates where amongst the sorted values in B['timestamp'] that value from A would be inserted into B in order to maintain sorted order.
For example,
import numpy as np
import pandas as pd
np.random.seed(2016)
N = 10
A = pd.DataFrame({'timestamp':np.random.uniform(0, 1, size=N).cumsum()})
B = pd.DataFrame({'timestamp':np.random.uniform(0, 3, size=N).cumsum()})
# timestamp
# 0 1.739869
# 1 2.467790
# 2 2.863659
# 3 3.295505
# 4 5.106419
# 5 6.872791
# 6 7.080834
# 7 9.909320
# 8 11.027117
# 9 12.383085
A['group'] = B['timestamp'].searchsorted(A['timestamp'])
print(A)
yields
timestamp group
0 0.896705 0
1 1.626945 0
2 2.410220 1
3 3.151872 3
4 3.613962 4
5 4.256528 4
6 4.481392 4
7 5.189938 5
8 5.937064 5
9 6.562172 5
Thus, the timestamp 0.896705 is in group 0 because it comes before B['timestamp'][0] (i.e. 1.739869). The timestamp 2.410220 is in group 1 because it is larger than B['timestamp'][0] (i.e. 1.739869) but smaller than B['timestamp'][1] (i.e. 2.467790).
You should also decide what to do if a value in A['timestamp'] is exactly equal to one of the cutoff values in B['timestamp']. Use
B['timestamp'].searchsorted(A['timestamp'], side='left')
if you want a timestamp exactly equal to B['timestamp'][i] to be assigned to group i, and use
B['timestamp'].searchsorted(A['timestamp'], side='right')
if you want it assigned to group i+1. If you don't specify side, then side='left' is used by default.
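For example, here is a small made-up illustration (not your data) of the difference when a value in A exactly equals a cutoff in B:
import pandas as pd

B = pd.Series([1.0, 2.0, 3.0])
A = pd.Series([2.0])  # exactly equal to B[1]

print(B.searchsorted(A, side='left'))   # [1] -> the timestamp lands in group 1
print(B.searchsorted(A, side='right'))  # [2] -> the timestamp lands in group 2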