My problem
I'm having trouble with the performance of the resample function in combination with a groupby. The operation I'm doing is currently taking 8+ seconds on a data sample of 5000 rows, which is totally unreasonable for my requirements.
Sample data (500 rows)
Pastebin with data as dict: https://pastebin.com/RPNdhXsy
The logic
I have data with dates at a quarterly interval, which I want to group by a column and then resample within each group to a monthly frequency.
Input:
isin   report_date  val
SE001  2018-12-31     1
SE001  2018-09-30     2
SE001  2018-06-30     3
US001  2018-10-31     4
US001  2018-07-31     5
Output:
isin   report_date    val
SE001  2018-12-31       1
       2018-11-30     NaN
       2018-10-31     NaN
       2018-09-30       2
       2018-08-31     NaN
       2018-07-31     NaN
       2018-06-30       3
US001  2018-10-31       4
       2018-09-30     NaN
       2018-08-31     NaN
       2018-07-31       5
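For reference, a minimal construction of this sample (my own sketch, with the dates normalized to valid month-ends) that the snippets below can be run against:

import pandas as pd

df = pd.DataFrame({
    'isin': ['SE001', 'SE001', 'SE001', 'US001', 'US001'],
    'report_date': pd.to_datetime(['2018-12-31', '2018-09-30', '2018-06-30',
                                   '2018-10-31', '2018-07-31']),
    'val': [1, 2, 3, 4, 5],
})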
I used to have this operation:
df.groupby('isin').resample('M', on="report_date").first()[::-1]
Since asfreq() seems to have slightly better performance than using on= in resample, I currently do the following instead. It's still slow, though.
I reverse the result since resample sorts the dates ascending with no option to change that.
df.set_index('report_date').groupby('isin').resample('M').asfreq()[::-1]
As stated, with 5000 rows and around 16 columns this takes 15 seconds to run, since I need to do it on two separate dataframes.
With the sample data in the pastebin (500 rows) the operation takes 0.7 s, which is way too long since my final data will have 800k rows.
EDIT: Timing of the different operations
Current way
setindex --- 0.001055002212524414 seconds ---
groupby --- 0.00033092498779296875 seconds ---
resample --- 0.004662036895751953 seconds ---
asfreq --- 0.8990700244903564 seconds ---
[::-1] --- 0.0013098716735839844 seconds ---
= 0.9056s
Old way
groupby --- 0.0005779266357421875 seconds ---
resample --- 0.0044629573822021484 seconds ---
first --- 1.6829369068145752 seconds ---
[::-1] --- 0.001600027084350586 seconds ---
= 1.6894s
Judging by this, it seems that converting from the pandas.core.resample.DatetimeIndexResamplerGroupby object to a DataFrame is what takes so long. Now what?
EDIT2: Using reindex
(df.set_index('report_date')
   .groupby('isin')
   .apply(lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max(), freq='M'),
                              fill_value=0))
   [::-1])
This takes 0.28 s, which is a vast improvement. Still not great, though.
How can I speed this up? Is there another way to do the same thing?
I cut execution time for a 25k-row test data set from 850 ms to 320 ms. I wrapped the reindex logic in a function to make timing easier:
def orig_pipeline(df):
    return (df
            .set_index('report_date')
            .groupby('isin')
            .apply(lambda x: x.reindex(pd.date_range(x.index.min(),
                                                     x.index.max(),
                                                     freq='M'),
                                       fill_value=0))
            [::-1])
Then, I created new functions to make date arithmetic and reindexing faster:
def create_params(df):
    return (df.groupby('isin')['report_date']
              .agg(['min', 'max']).sort_index().reset_index())

def create_multiindex(df, params):
    all_dates = pd.date_range(start='1999-12-31', end='2020-12-31', freq='M')
    midx = (
        (row.isin, d)
        for row in params.itertuples()
        for d in all_dates[(row.min <= all_dates) & (all_dates <= row.max)])
    return pd.MultiIndex.from_tuples(midx, names=['isin', 'report_date'])

def apply_mulitindex(df, midx):
    return df.set_index(['isin', 'report_date']).reindex(midx)

def new_pipeline(df):
    params = create_params(df)
    midx = create_multiindex(df, params)
    return apply_mulitindex(df, midx)
Old and new pipelines give same results (except possibly sort order):
v1 = orig_pipeline(df).drop(columns='isin').sort_index()
v2 = new_pipeline(df).sort_index().fillna(0)
assert (v1 == v2).all().all()
Timing results:
%%timeit
v1 = orig_pipeline(df_big)
854 ms ± 2.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
v2 = new_pipeline(df_big)
322 ms ± 5.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I would like to share the experiments I ran to figure out which solution yields the best performance, and they show that #jsmart's is the fastest.
My dataset looks like the following (sorry for the screenshot, I could not manage to paste a pretty table):
My goal is to have, for each (orgacom, client) pair, the indicators resampled by business day.
Solution 1: groupby / apply asfreq
%%time
sol1 = (
    to_process.groupby(['orgacom', 'client'], observed=True)
              .apply(lambda x: x.asfreq('B', fill_value=np.nan))
)
CPU times: user 4min 6s, sys: 2.91 s, total: 4min 9s
Wall time: 4min 9s
Solution 2: groupby / apply reindex (as of #jokab EDIT2)
%%time
sol2 = (
    to_process.groupby(['orgacom', 'client'], observed=True)
              .apply(lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max(), freq='B'),
                                         fill_value=np.nan))
)
CPU times: user 4min 13s, sys: 2.16 s, total: 4min 15s
Wall time: 4min 15s
Solution 3: recoding resample (as of #jsmart answer)
def create_params(df):
    return (df.reset_index()
              .groupby(['orgacom', 'client'], observed=True)['date']
              .agg(['min', 'max']).sort_index().reset_index())

def create_multiindex(df, params):
    all_dates = pd.date_range(start='2016-12-31', end='2020-12-31', freq='B')
    midx = (
        (row.orgacom, row.client, d)
        for row in params.itertuples()
        for d in all_dates[(row.min <= all_dates) & (all_dates <= row.max)])
    return pd.MultiIndex.from_tuples(midx, names=['orgacom', 'client', 'date'])

def apply_mulitindex(df, midx):
    return df.set_index(['orgacom', 'client', 'date']).reindex(midx)

def new_pipeline(df):
    params = create_params(df)
    midx = create_multiindex(df, params)
    return apply_mulitindex(df, midx)
%%time
sol3 = new_pipeline(to_process.reset_index())
CPU times: user 1min 46s, sys: 4.93 s, total: 1min 51s
Wall time: 1min 51s
Solution 4: groupby / resample asfreq (as of #jokab first solution)
%%time
sol4 = to_process.groupby(['orgacom', 'client']).resample('B').asfreq()
CPU times: user 4min 22s, sys: 8.01 s, total: 4min 30s
Wall time: 4min 30s
I also noticed that resampling on a groupby can be slow. In my case, I used data reshaping for a speed-up:
df.set_index(['isin', 'report_date'])['val'].unstack(0).resample('M')
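A sketch of taking that reshaping idea end to end (the asfreq/stack round-trip is my assumption; note that, unlike the groupby variants, this fills every isin over the union of all dates rather than each group's own min-max range):

wide = df.set_index(['isin', 'report_date'])['val'].unstack(0)  # one column per isin
monthly = wide.resample('M').asfreq()                           # a single resample, no groupby
long_again = (monthly.stack(dropna=False)                       # back to long (report_date, isin) rows
                     .rename('val')
                     .reset_index())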
There is another way of doing this: use itertools.groupby() and a list comprehension.
import time
from itertools import groupby
print(time.time())
data = (
    ('SE001', '2018-12-31', 1),
    ('SE001', '2018-09-30', 2),
    ('SE001', '2018-06-30', 3),
    ('US001', '2018-10-31', 4),
    ('US001', '2018-07-31', 5),
)
aggr = [(key, sum([g[2] for g in grp])) for key, grp in groupby(sorted(data), key=lambda x: x[0])]
print(aggr)
print(time.time())
# 100,000 records
# 2.5 seconds
Related
I have a df with over hundreds of millions of rows.
latitude longitude time VAL
0 -39.20000076293945312500 140.80000305175781250000 1972-01-19 13:00:00 1.20000004768371582031
1 -39.20000076293945312500 140.80000305175781250000 1972-01-20 13:00:00 0.89999997615814208984
2 -39.20000076293945312500 140.80000305175781250000 1972-01-21 13:00:00 1.50000000000000000000
3 -39.20000076293945312500 140.80000305175781250000 1972-01-22 13:00:00 1.60000002384185791016
4 -39.20000076293945312500 140.80000305175781250000 1972-01-23 13:00:00 1.20000004768371582031
... ...
It contains a time column of type datetime64 in UTC. The following code creates a new column, isInDST, indicating whether each time falls in the daylight saving period of a local time zone.
df['isInDST'] = pd.DatetimeIndex(df['time']).tz_localize('UTC').tz_convert('Australia/Victoria').map(lambda x : x.dst().total_seconds()!=0)
It takes about 400 seconds to process 15,223,160 rows.
Is there a better-performing approach to achieve this? Is vectorizing a better way?
All results are calculated on 1M datapoints.
Cython + np.vectorize
7.2 times faster than the original code
%%cython
from cpython.datetime cimport datetime

cpdef bint c_is_in_dst(datetime dt):
    return dt.dst().total_seconds() != 0
%%timeit
df['isInDST'] = np.vectorize(c_is_in_dst)(df['time'].dt.tz_localize('UTC').dt.tz_convert('Australia/Victoria').dt.to_pydatetime())
1.08 s ± 10.2 ms per loop
np.vectorize
6.5 times faster than the original code
def is_in_dst(dt):
    return dt.dst().total_seconds() != 0
%%timeit
df['isInDST'] = np.vectorize(is_in_dst)(df['time'].dt.tz_localize('UTC').dt.tz_convert('Australia/Victoria').dt.to_pydatetime())
1.2 s ± 29.3 ms per loop
Based on the documentation ("The implementation is essentially a for loop"), I expected the result to be the same as for the list comprehension, but it is consistently a little better than the list comprehension.
List comprehension
5.9 times faster than the original code
%%timeit
df['isInDST'] = [x.dst().total_seconds()!=0 for x in pd.DatetimeIndex(df['time']).tz_localize('UTC').tz_convert('Australia/Victoria')]
1.33 s ± 48.4 ms per loop
This result shows that pandas map/apply is very slow; it adds overhead that can be eliminated by just using a plain Python loop.
Original approach (map on pandas DatetimeIndex)
%%timeit
df['isInDST'] = pd.DatetimeIndex(df['time']).tz_localize('UTC').tz_convert('Australia/Victoria').map(lambda x : x.dst().total_seconds()!=0)
7.82 s ± 84.3 ms per loop
Tested on 1M rows of dummy data:
import datetime
import random

import pandas as pd

N = 1_000_000
df = pd.DataFrame({"time": [datetime.datetime.now().replace(hour=random.randint(0, 23),
                                                            minute=random.randint(0, 59))
                            for _ in range(N)]})
I also ran the code on 100K and 10M rows; the results scale linearly with the number of rows.
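For comparison, a fully vectorized variant is also possible by computing each row's UTC offset with plain datetime arithmetic instead of calling .dst() per element. A sketch (not benchmarked here; the hard-coded +10:00 standard offset for Australia/Victoria is my assumption):

utc = df['time'].dt.tz_localize('UTC')
local = utc.dt.tz_convert('Australia/Victoria')
# Local wall-clock time minus the naive UTC time gives each row's UTC offset.
offset = local.dt.tz_localize(None) - df['time']
# Standard time (AEST) is UTC+10; a larger offset means DST (AEDT) is in effect.
df['isInDST'] = offset != pd.Timedelta(hours=10)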
Appreciate any help from the community on this. I've been toying with it for a few days now.
I have 2 dataframes, df1 & df2. The first dataframe will always be 1-minute data, about 20-30 thousand rows. The second dataframe contains random times with associated relevant data and will always be relatively small (1000-4000 rows x 4 or 5 columns). I'm working through df1 with itertuples in order to perform a time-specific (trailing) slice of df2. This process gets repeated thousands of times, and the single slice line below (df3 = df2...) causes over 50% of the runtime. Simply adding a couple of slicing criteria to that single line can add 30+% to the final runtime, which already runs hours long!
I've considered trying pandas query, but have read that it really only helps on larger dataframes. My thought is that it may be better to reduce df2 to a numpy array, a plain Python list, or something else, since it is always fairly short, though I think I'll need it back in a dataframe for the subsequent sorting, summations, and vector multiplications that come afterward in the primary code. I did succeed in using concurrent futures on a 12-core setup, which sped up my overall application about 5x, though I'm still talking hours of runtime.
Any help or suggestions would be appreciated.
Example code illustrating the issue:
import pandas as pd
import numpy as np
import random
from datetime import datetime as dt
from datetime import timedelta, timezone
def random_dates(start, end, n=10):
    start_u = start.value // 10**9
    end_u = end.value // 10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
dfsize = 34000
df1 = pd.DataFrame({'datetime': pd.date_range('2010-01-01', periods=dfsize, freq='1min'),
                    'val': np.random.uniform(10, 100, size=dfsize)})

sizedf = 3000
start = pd.to_datetime('2010-01-01')
end = pd.to_datetime('2010-01-24')
test_list = [5, 30]
df2 = pd.DataFrame({'datetime': random_dates(start, end, sizedf),
                    'a': np.random.uniform(10, 100, size=sizedf),
                    'b': np.random.choice(test_list, sizedf),
                    'c': np.random.uniform(10, 100, size=sizedf),
                    'd': np.random.uniform(10, 100, size=sizedf),
                    'e': np.random.uniform(10, 100, size=sizedf)})
df2.set_index('datetime', inplace=True)
daysback5 = 3
daysback30 = 8
#%%timeit -r1  # time this section here:
# Slow portion here - performing ~4000+ slices on a dataframe (df2) of ~1000 to 3000 rows.
# Some slowdown is due to itertuples, which I don't think is avoidable.
for line, row in enumerate(df1.itertuples(index=False), 0):
    if row.datetime.minute % 5 == 0:
        # Lion's share of the slowdown:
        df3 = df2[(df2['a'] <= row.val * 1.25) & (df2['a'] >= row.val * .75)
                  & (df2.index <= row.datetime)
                  & (((df2.index >= row.datetime - timedelta(days=daysback30)) & (df2['b'] == 30))
                     | ((df2.index >= row.datetime - timedelta(days=daysback5)) & (df2['b'] == 5)))
                  ].reset_index(drop=True).copy()
Time of slow part:
8.53 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
df1:
datetime val
0 2010-01-01 00:00:00 58.990147
1 2010-01-01 00:01:00 27.457308
2 2010-01-01 00:02:00 20.657251
3 2010-01-01 00:03:00 36.416561
4 2010-01-01 00:04:00 71.398897
... ... ...
33995 2010-01-24 14:35:00 77.763085
33996 2010-01-24 14:36:00 21.151239
33997 2010-01-24 14:37:00 83.741844
33998 2010-01-24 14:38:00 93.370216
33999 2010-01-24 14:39:00 99.720858
34000 rows × 2 columns
df2:
a b c d e
datetime
2010-01-03 23:38:13 22.363251 30 81.158073 21.806457 11.116421
2010-01-09 16:27:32 78.952070 5 27.045279 29.471537 29.559228
2010-01-13 04:49:57 85.985935 30 79.206437 29.711683 74.454446
2010-01-07 22:29:22 36.009752 30 43.072552 77.646257 57.208626
2010-01-15 09:33:02 13.653679 5 87.987849 37.433810 53.768334
... ... ... ... ... ...
2010-01-12 07:36:42 30.328512 5 81.281791 14.046032 38.288534
2010-01-08 20:26:31 80.911904 30 32.524414 80.571806 26.234552
2010-01-14 08:32:01 12.198825 5 94.270709 27.255914 87.054685
2010-01-06 03:25:09 82.591519 5 91.160917 79.042083 17.831732
2010-01-07 14:32:47 38.337405 30 10.619032 32.557640 87.890791
3000 rows × 5 columns
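Following up on the numpy-array idea floated in the question, a rough sketch (not benchmarked; column names taken from the example above):

# Pull df2's columns out once as plain arrays and evaluate the boolean mask in numpy;
# only the surviving rows are turned back into a DataFrame.
a = df2['a'].to_numpy()
b = df2['b'].to_numpy()
ts = df2.index.values                      # datetime64[ns] array
vals = df2.to_numpy()                      # note: dtypes collapse to float64 here

d30 = np.timedelta64(daysback30, 'D')
d5 = np.timedelta64(daysback5, 'D')

for row in df1.itertuples(index=False):
    if row.datetime.minute % 5 == 0:
        t = row.datetime.to_datetime64()
        mask = ((a <= row.val * 1.25) & (a >= row.val * 0.75) & (ts <= t)
                & (((ts >= t - d30) & (b == 30)) | ((ts >= t - d5) & (b == 5))))
        df3 = pd.DataFrame(vals[mask], columns=df2.columns)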
Actually, a cross merge plus query works pretty well for your data size:
(df1[df1.datetime.dt.minute % 5 == 0].assign(dummy=1)
     .merge(df2.reset_index().assign(dummy=1),
            on='dummy', suffixes=['_1', '_2'])
     .query('val*1.25 >= a >= val*.75 and datetime_2 <= datetime_1')
     # the lookback windows must be timedeltas for the datetime comparison to work
     .loc[lambda x: ((x.datetime_2 >= x.datetime_1 - pd.Timedelta(days=daysback30)) & x['b'].eq(30))
                    | ((x.datetime_2 >= x.datetime_1 - pd.Timedelta(days=daysback5)) & x['b'].eq(5))]
)
which on my system takes about:
2.05 s ± 60.4 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)
whereas your code runs for about 10 s.
I am calculating 48 derived pandas columns by iterating and computing one column at a time, but I need to speed up the process. What is the best way to make this faster and more efficient? Each column expresses the closing price as a fraction of the period's (T, T-1, T-2, etc.) high-low range.
The code I am currently using is:
# get last x closes as percentage of period high and low
for i in range(1, 49, 1):
    df.loc[:, 'Close_T_period_' + str(i)] = ((df['BidClose'].shift(i).values
                                              - df['BidLow'].shift(i).values)
                                             / (df['BidHigh'].shift(i).values
                                                - df['BidLow'].shift(i).values))
Input dataframe sample:
BidOpen BidHigh BidLow BidClose AskOpen AskHigh AskLow AskClose Volume
Date
2019-09-27 09:00:00 1.22841 1.22919 1.22768 1.22893 1.22850 1.22927 1.22777 1.22900 12075.0
2019-09-27 10:00:00 1.22893 1.23101 1.22861 1.23058 1.22900 1.23110 1.22870 1.23068 16291.0
2019-09-27 11:00:00 1.23058 1.23109 1.22971 1.23076 1.23068 1.23119 1.22979 1.23087 10979.0
2019-09-27 12:00:00 1.23076 1.23308 1.23052 1.23232 1.23087 1.23314 1.23062 1.23241 16528.0
2019-09-27 13:00:00 1.23232 1.23247 1.23163 1.23217 1.23241 1.23256 1.23172 1.23228 14106.0
Output dataframe sample:
BidOpen BidHigh BidLow BidClose ... Close_T_period_45 Close_T_period_46 Close_T_period_47 Close_T_period_48
Date ...
2019-09-27 09:00:00 1.22841 1.22919 1.22768 1.22893 ... 0.682635 0.070796 0.128940 0.794521
2019-09-27 10:00:00 1.22893 1.23101 1.22861 1.23058 ... 0.506024 0.682635 0.070796 0.128940
2019-09-27 11:00:00 1.23058 1.23109 1.22971 1.23076 ... 0.774920 0.506024 0.682635 0.070796
2019-09-27 12:00:00 1.23076 1.23308 1.23052 1.23232 ... 0.212500 0.774920 0.506024 0.682635
2019-09-27 13:00:00 1.23232 1.23247 1.23163 1.23217 ... 0.378882 0.212500 0.774920 0.506024
Short Answer (faster implementation)
The following code is about 6x faster:
import numpy as np

def my_shift(x, i):
    first = np.array([np.nan] * i)
    return np.append(first, x[:-i])

result = ((df2['BidClose'].values - df2['BidLow'].values)
          / (df2['BidHigh'].values - df2['BidLow'].values))

for i in range(1, 49, 1):
    df2.loc[:, 'Close_T_period_' + str(i)] = my_shift(result, i)
Long Answer (explanation)
The two main bottlenecks in your code are:
1. In every iteration you recalculate the same values; the only difference is that each time they are shifted differently.
2. The pandas shift operation is very slow for this purpose.
So my code simply addresses these two issues. I calculate result just once and use the loop only for shifting (issue #1), and I implemented my own shift function, which prepends i NaN values to the front of the original array and cuts off the last i values (issue #2).
Execution time
With a dataframe of 5000 rows, the benchmark of the original code gives:
42 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
with my solution I obtained:
7.62 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
UPDATE
I tried to implement a solution with apply:
result = ((df2['BidClose'].values - df2['BidLow'].values)
          / (df2['BidHigh'].values - df2['BidLow'].values))
df3 = df.reindex(df2.columns.tolist() + [f'Close_T_period_{i}' for i in range(1, 2000)], axis=1)
df3.iloc[:, 9:] = df3.iloc[:, 9:].apply(lambda row: my_shift(result, int(row.name.split('_')[-1])))
In my tests this solution seems slightly slower than the first one.
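One more variant worth noting (a sketch, not benchmarked against the timings above): build all 48 shifted columns in a dict comprehension and attach them with a single concat, which also avoids inserting columns one at a time:

result = (df['BidClose'] - df['BidLow']) / (df['BidHigh'] - df['BidLow'])
shifted = pd.concat({f'Close_T_period_{i}': result.shift(i) for i in range(1, 49)},
                    axis=1)
df = pd.concat([df, shifted], axis=1)   # one concat instead of 48 separate column inserts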
I've inherited some pandas code that I'm trying to optimize. One DataFrame, results, is created with:
results = pd.DataFrame(columns=['plan', 'volume', 'avg_denial_increase', 'std_dev_impact',
                                'avg_idr_increase', 'std_dev_idr_increase'])

for plan in my_df['plan_name'].unique():
    df1 = df[df['plan_name'] == plan]
    df1['volume'].fillna(0, inplace=True)
    df1['change'] = df1['idr'] - df1['idr'].shift(1)
    df1['change'].fillna(0, inplace=True)
    df1['impact'] = df1['change'] * df1['volume']
    describe_impact = df1['impact'].describe()
    describe_change = df1['change'].describe()
    results = results.append({'plan': plan,
                              'volume': df1['volume'].mean(),
                              'avg_denial_increase': describe_impact['mean'],
                              'std_dev_impact': describe_impact['std'],
                              'avg_idr_increase': describe_change['mean'],
                              'std_dev_idr_increase': describe_change['std']},
                             ignore_index=True)
My first thought was to move everything from under the for-loop into a separate function, get_results_for_plan, and use pandas' groupby() and apply() methods. But this has proven to be even slower. Running
%lprun -f get_results_for_plan my_df.groupby('plan_name', sort=False, as_index=False).apply(get_results_for_plan)
returns
Timer unit: 1e-06 s
Total time: 0.77167 s
File: <ipython-input-46-7c36b3902812>
Function: get_results_for_plan at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def get_results_for_plan(plan_df):
2 94 33221.0 353.4 4.3 plan = plan_df.iloc[0]['plan_name']
3 94 25901.0 275.5 3.4 plan_df['volume'].fillna(0, inplace=True)
4 94 75765.0 806.0 9.8 plan_df['change'] = plan_df['idr'] - plan_df['idr'].shift(1)
5 93 38653.0 415.6 5.0 plan_df['change'].fillna(0, inplace=True)
6 93 57088.0 613.8 7.4 plan_df['impact'] = plan_df['change'] * plan_df['volume']
7 93 204828.0 2202.5 26.5 describe_impact = plan_df['impact'].describe()
8 93 201127.0 2162.7 26.1 describe_change = plan_df['change'].describe()
9 93 129.0 1.4 0.0 return pd.DataFrame({'plan': plan,
10 93 21703.0 233.4 2.8 'volume': plan_df['volume'].mean(),
11 93 4291.0 46.1 0.6 'avg_denial_increase': describe_impact['mean'],
12 93 1957.0 21.0 0.3 'std_dev_impact': describe_impact['std'],
13 93 2912.0 31.3 0.4 'avg_idr_increase': describe_change['mean'],
14 93 1783.0 19.2 0.2 'std_dev_idr_increase': describe_change['std']},
15 93 102312.0 1100.1 13.3 index=[0])
The most glaring issue I see is the number of hits each line has. The number of groups, as counted by
len(my_df.groupby('plan_name', sort=False, as_index=False).groups)
is 72. So why are these lines being hit 94 or 93 times each? (This may be related to this issue, but in that case I'd expect the hit count to be num_groups + 1)
Update: In the %lprun call to groupby() above, removing sort=False reduces the line hits to 80 for lines 2-6 and 79 for the rest. Still more hits than I'd think there should be, but a bit better.
Secondary question: are there better ways to optimize this particular code?
Here's a rough draft of what I mean in my comment:
def append_to_list():
    l = []
    for _ in range(10000):
        l.append(np.random.random(4))
    return pd.DataFrame(l, columns=list('abcd'))

def append_to_df():
    cols = list('abcd')
    df = pd.DataFrame(columns=cols)
    for _ in range(10000):
        df = df.append({k: v for k, v in zip(cols, np.random.random(4))},
                       ignore_index=True)
    return df
%timeit append_to_list()
# 31.5 ms ± 925 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit append_to_df()
# 9.05 s ± 337 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So probably the biggest benefit to your code would be this:
results = []
for plan in my_df['plan_name'].unique():
    df1 = df[df['plan_name'] == plan]
    df1['volume'].fillna(0, inplace=True)
    df1['change'] = df1['idr'] - df1['idr'].shift(1)
    df1['change'].fillna(0, inplace=True)
    df1['impact'] = df1['change'] * df1['volume']
    describe_impact = df1['impact'].describe()
    describe_change = df1['change'].describe()
    results.append((plan,
                    df1['volume'].mean(),
                    describe_impact['mean'],
                    describe_impact['std'],
                    describe_change['mean'],
                    describe_change['std']))

results = pd.DataFrame(results, columns=['plan', 'volume', 'avg_denial_increase',
                                         'std_dev_impact', 'avg_idr_increase',
                                         'std_dev_idr_increase'])
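If rewriting the per-plan logic itself is an option, a groupby-based version removes the Python loop entirely. This is a sketch that assumes the column names from the question, that df and my_df refer to the same frame, and named aggregation (pandas 0.25+):

tmp = df.copy()
tmp['volume'] = tmp['volume'].fillna(0)
# diff() within each plan reproduces idr - idr.shift(1) from the loop
tmp['change'] = tmp.groupby('plan_name')['idr'].diff().fillna(0)
tmp['impact'] = tmp['change'] * tmp['volume']

results = (tmp.groupby('plan_name')
              .agg(volume=('volume', 'mean'),
                   avg_denial_increase=('impact', 'mean'),
                   std_dev_impact=('impact', 'std'),
                   avg_idr_increase=('change', 'mean'),
                   std_dev_idr_increase=('change', 'std'))
              .rename_axis('plan')
              .reset_index())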
I would like to know if there is a faster way to run a cumsum in pandas.
For example:
import numpy as np
import pandas as pd
n = 10000000
values = np.random.randint(1, 100000, n)
ids = values.astype("S10")
df = pd.DataFrame({"ids": ids, "val": values})
Now, I want to group by ids and get some stats.
The max, for example, is pretty fast:
time df.groupby("ids").val.max()
CPU times: user 5.08 s, sys: 131 ms, total: 5.21 s
Wall time: 5.22 s
However, the cumsum is very slow:
time df.groupby("ids").val.cumsum()
CPU times: user 26.8 s, sys: 707 ms, total: 27.5 s
Wall time: 27.6 s
My problem is that I need the cumsum grouped by a key in a large dataset, almost as shown here, but it takes minutes. Is there a way to make it faster?
Thanks!
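One direction worth trying (a rough numpy sketch on my part, not a verified answer): sort by id once, take a single cumulative sum, and subtract each group's starting offset before restoring the original row order.

import numpy as np

order = np.argsort(ids, kind='stable')            # stable sort keeps within-group order
sorted_ids = ids[order]
v = values[order].astype(np.int64)
csum = v.cumsum()

starts = np.flatnonzero(np.r_[True, sorted_ids[1:] != sorted_ids[:-1]])
sizes = np.diff(np.r_[starts, len(v)])
offsets = np.repeat(csum[starts] - v[starts], sizes)  # sum accumulated before each group starts

out = np.empty_like(csum)
out[order] = csum - offsets                        # within-group cumulative sums
df["val_cumsum"] = out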