Appreciate any help from the community on this. I've been toying with it for a few days now.
I have 2 dataframes, df1 and df2. The first dataframe will always contain 1-minute data, about 20-30 thousand rows. The second dataframe will contain random times with associated relevant data and will always be relatively small (1000-4000 rows x 4 or 5 columns). I'm working through df1 with itertuples in order to perform a time-specific (trailing) slice. This process gets repeated thousands of times, and the single slice line below (df3 = df2...) accounts for over 50% of the runtime. Simply adding a couple of slicing criteria to that one line can increase the final runtime, which already runs for hours, by 30% or more!
I've considered trying pandas 'query', but have read that it really only helps on larger dataframes. My thought is that it may be better to reduce df2 to a numpy array, a plain Python list, or another structure, since it is always fairly short, though I think I'll need it back as a dataframe for the subsequent sorting, summations, and vector multiplications that come afterward in the primary code. I did succeed in using concurrent futures on a 12-core setup, which sped up my overall application about 5x, but I'm still talking hours of runtime.
Any help or suggestions would be appreciated.
Example code illustrating the issue:
import pandas as pd
import numpy as np
import random
from datetime import datetime as dt
from datetime import timedelta, timezone
def random_dates(start, end, n=10):
    start_u = start.value // 10**9
    end_u = end.value // 10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
dfsize = 34000
df1 = pd.DataFrame({'datetime': pd.date_range('2010-01-01', periods=dfsize, freq='1min'), 'val':np.random.uniform(10, 100, size=dfsize)})
sizedf = 3000
start = pd.to_datetime('2010-01-01')
end = pd.to_datetime('2010-01-24')
test_list = [5, 30]
df2 = pd.DataFrame({'datetime':random_dates(start,end, sizedf), 'a':np.random.uniform(10, 100, size=sizedf), 'b':np.random.choice(test_list, sizedf), 'c':np.random.uniform(10, 100, size=sizedf), 'd':np.random.uniform(10, 100, size=sizedf), 'e':np.random.uniform(10, 100, size=sizedf)})
df2.set_index('datetime', inplace=True)
daysback5 = 3
daysback30 = 8
#%%timeit -r1 #time this section here:
#Slow portion here - Performing ~4000+ slices on a dataframe (df2) which is ~1000 to 3000 rows -- Some slowdown due to itertuples, which I don't think is avoidable
for line, row in enumerate(df1.itertuples(index=False), 0):
    if row.datetime.minute % 5 == 0:
        #Lion's share of the slowdown:
        df3 = df2[(df2['a'] <= row.val*1.25) & (df2['a'] >= row.val*.75) & (df2.index <= row.datetime)
                  & (((df2.index >= row.datetime - timedelta(days=daysback30)) & (df2['b'] == 30))
                     | ((df2.index >= row.datetime - timedelta(days=daysback5)) & (df2['b'] == 5)))
                  ].reset_index(drop=True).copy()
Time of slow part:
8.53 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
df1:
datetime val
0 2010-01-01 00:00:00 58.990147
1 2010-01-01 00:01:00 27.457308
2 2010-01-01 00:02:00 20.657251
3 2010-01-01 00:03:00 36.416561
4 2010-01-01 00:04:00 71.398897
... ... ...
33995 2010-01-24 14:35:00 77.763085
33996 2010-01-24 14:36:00 21.151239
33997 2010-01-24 14:37:00 83.741844
33998 2010-01-24 14:38:00 93.370216
33999 2010-01-24 14:39:00 99.720858
34000 rows × 2 columns
df2:
a b c d e
datetime
2010-01-03 23:38:13 22.363251 30 81.158073 21.806457 11.116421
2010-01-09 16:27:32 78.952070 5 27.045279 29.471537 29.559228
2010-01-13 04:49:57 85.985935 30 79.206437 29.711683 74.454446
2010-01-07 22:29:22 36.009752 30 43.072552 77.646257 57.208626
2010-01-15 09:33:02 13.653679 5 87.987849 37.433810 53.768334
... ... ... ... ... ...
2010-01-12 07:36:42 30.328512 5 81.281791 14.046032 38.288534
2010-01-08 20:26:31 80.911904 30 32.524414 80.571806 26.234552
2010-01-14 08:32:01 12.198825 5 94.270709 27.255914 87.054685
2010-01-06 03:25:09 82.591519 5 91.160917 79.042083 17.831732
2010-01-07 14:32:47 38.337405 30 10.619032 32.557640 87.890791
3000 rows × 5 columns
Actually, cross merge and query work pretty well for your data size:
(df1[df1.datetime.dt.minute % 5 == 0].assign(dummy=1)
 .merge(df2.reset_index().assign(dummy=1),
        on='dummy', suffixes=['_1', '_2'])
 .query('val*1.25 >= a >= val*.75 and datetime_2 <= datetime_1')
 .loc[lambda x: ((x.datetime_2 >= x.datetime_1 - timedelta(days=daysback30)) & x['b'].eq(30))
                | ((x.datetime_2 >= x.datetime_1 - timedelta(days=daysback5)) & (x['b'] == 5))]
)
which on my system takes about:
2.05 s ± 60.4 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)
whereas your code runs for about 10 s.
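As a side note (my addition, not part of the original answer): on pandas 1.2 or newer, merge supports how='cross' directly, so the dummy column should not be needed. A minimal sketch, assuming the same df1, df2, daysback5 and daysback30 as defined in the question:
from datetime import timedelta

# Sketch assuming pandas >= 1.2, where DataFrame.merge accepts how='cross'.
pairs = (df1[df1.datetime.dt.minute % 5 == 0]
         .merge(df2.reset_index(), how='cross', suffixes=['_1', '_2']))

out = (pairs
       .query('val*1.25 >= a >= val*.75 and datetime_2 <= datetime_1')
       .loc[lambda x: ((x.datetime_2 >= x.datetime_1 - timedelta(days=daysback30)) & (x['b'] == 30))
                      | ((x.datetime_2 >= x.datetime_1 - timedelta(days=daysback5)) & (x['b'] == 5))])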
Related
I was looking through the pandas.query documentation but couldn't find anything specific about this.
Is it possible to perform a query on a date based on the closest date to the one given, instead of a specific date?
For example, let's say we use the wine dataset and create some random dates.
import pandas as pd
import numpy as np
from sklearn import datasets
dir(datasets)
df = pd.DataFrame(datasets.load_wine().data)
df.columns = datasets.load_wine().feature_names
df.columns=df.columns.str.strip()
def random_dates(start, end, n, unit='D'):
    ndays = (end - start).days + 1
    return pd.to_timedelta(np.random.rand(n) * ndays, unit=unit) + start
np.random.seed(0)
start = pd.to_datetime('2015-01-01')
end = pd.to_datetime('2022-01-01')
datelist=random_dates(start, end, 178)
df['Dates'] = datelist
if you perform a simple query on hue
df.query('hue == 0.6')
you'll receive three rows with three random dates. Is it possible to pick the query result that's closest to let's say 2017-1-1?
so something like
df.query('hue==0.6').query('Date ~2017-1-1')
I hope this makes sense!
You can use something like:
df.query("('2018-01-01' < Dates) & (Dates < '2018-01-31')")
# Output
alcohol malic_acid ... proline Dates
6 14.39 1.87 ... 1290.0 2018-01-24 08:21:14.665824000
41 13.41 3.84 ... 1035.0 2018-01-22 22:15:56.547561600
51 13.83 1.65 ... 1265.0 2018-01-26 22:37:26.812156800
131 12.88 2.99 ... 530.0 2018-01-01 18:58:05.118441600
139 12.84 2.96 ... 590.0 2018-01-08 13:38:26.117376000
142 13.52 3.17 ... 520.0 2018-01-19 22:37:10.170825600
[6 rows x 14 columns]
Or using @variables:
date = pd.to_datetime('2018-01-01')
offset = pd.DateOffset(days=10)
start = date - offset
end = date + offset
df.query("Dates.between(#start, #end)")
# Output
alcohol malic_acid ... proline Dates
131 12.88 2.99 ... 530.0 2018-01-01 18:58:05.118441600
139 12.84 2.96 ... 590.0 2018-01-08 13:38:26.117376000
Given a series, find the entries closest to a given date:
def closest_to_date(series, date, n=5):
    date = pd.to_datetime(date)
    return abs(series - date).nsmallest(n)
Then we can use the index of the returned series to select further rows (or you can change the API to suit you):
(df.loc[df.hue == 0.6]
.loc[lambda df_: closest_to_date(df_.Dates, "2017-1-1", n=1).index]
)
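A compact alternative sketch (my addition, not part of the answer above): take idxmin of the absolute time difference on the filtered subset, assuming the Dates column from the question:
target = pd.to_datetime("2017-01-01")
subset = df.loc[df.hue == 0.6]

# Row label of the hue == 0.6 entry whose date is nearest to the target
closest_label = (subset.Dates - target).abs().idxmin()
closest_row = df.loc[[closest_label]]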
I'm not sure if you have to use query, but this will give you the results you are looking for:
df['Count'] = (df[df['hue'] == .6].sort_values(['Dates'], ascending=True)).groupby(['hue']).cumcount() + 1
df.loc[df['Count'] == 1]
I have a df with hundreds of millions of rows.
latitude longitude time VAL
0 -39.20000076293945312500 140.80000305175781250000 1972-01-19 13:00:00 1.20000004768371582031
1 -39.20000076293945312500 140.80000305175781250000 1972-01-20 13:00:00 0.89999997615814208984
2 -39.20000076293945312500 140.80000305175781250000 1972-01-21 13:00:00 1.50000000000000000000
3 -39.20000076293945312500 140.80000305175781250000 1972-01-22 13:00:00 1.60000002384185791016
4 -39.20000076293945312500 140.80000305175781250000 1972-01-23 13:00:00 1.20000004768371582031
... ...
It contains a time column of type datetime64 holding UTC times. The following code creates a new column isInDST indicating whether each time falls in the daylight saving period of a local time zone.
df['isInDST'] = pd.DatetimeIndex(df['time']).tz_localize('UTC').tz_convert('Australia/Victoria').map(lambda x : x.dst().total_seconds()!=0)
It takes about 400 seconds to process 15,223,160 rows.
Is there a better approach to achieve this with better performance? Is vectorize a better way?
All results are calculated on 1M datapoints.
Cython + np.vectorize
7.2 times faster than the original code
%%cython
from cpython.datetime cimport datetime

cpdef bint c_is_in_dst(datetime dt):
    return dt.dst().total_seconds() != 0
%%timeit
df['isInDST'] = np.vectorize(c_is_in_dst)(df['time'].dt.tz_localize('UTC').dt.tz_convert('Australia/Victoria').dt.to_pydatetime())
1.08 s ± 10.2 ms per loop
np.vectorize
6.5 times faster than the original code
def is_in_dst(dt):
    return dt.dst().total_seconds() != 0
%%timeit
df['isInDST'] = np.vectorize(is_in_dst)(df['time'].dt.tz_localize('UTC').dt.tz_convert('Australia/Victoria').dt.to_pydatetime())
1.2 s ± 29.3 ms per loop
Based on the documentation ("The implementation is essentially a for loop"), I expected the result to be the same as the list comprehension, but it is consistently a little better.
List comprehension
5.9 times faster than the original code
%%timeit
df['isInDST'] = [x.dst().total_seconds()!=0 for x in pd.DatetimeIndex(df['time']).tz_localize('UTC').tz_convert('Australia/Victoria')]
1.33 s ± 48.4 ms per loop
This result shows that pandas map/apply is very slow; it adds overhead that can be eliminated by just using a plain Python loop.
Original approach (map on pandas DatetimeIndex)
%%timeit
df['isInDST'] = pd.DatetimeIndex(df['time']).tz_localize('UTC').tz_convert('Australia/Victoria').map(lambda x : x.dst().total_seconds()!=0)
7.82 s ± 84.3 ms per loop
Tested on 1M rows of dummy data
import datetime
import random

N = 1_000_000
df = pd.DataFrame({"time": [datetime.datetime.now().replace(hour=random.randint(0, 23), minute=random.randint(0, 59)) for _ in range(N)]})
I also ran the code on 100K and 10M rows; the results scale linearly with the number of rows.
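A fully vectorized alternative is also possible (my sketch, not part of the answer above): derive the UTC offset arithmetically and compare it with the zone's standard offset, which I assume to be +10:00 for Australia/Victoria.
# Wall-clock local time minus the underlying UTC time gives the UTC offset;
# anything above the assumed standard offset of +10:00 is DST.
utc_naive = pd.DatetimeIndex(df['time'])  # naive datetime64 in UTC, as in the question
local_wall = utc_naive.tz_localize('UTC').tz_convert('Australia/Victoria').tz_localize(None)
df['isInDST'] = (local_wall - utc_naive) > pd.Timedelta(hours=10)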
I have a dataframe like this
df = pd.DataFrame({'id': [205,205,205, 211, 211, 211]
, 'date': pd.to_datetime(['2019-12-01','2020-01-01', '2020-02-01'
,'2019-12-01' ,'2020-01-01', '2020-03-01'])})
df
id date
0 205 2019-12-01
1 205 2020-01-01
2 205 2020-02-01
3 211 2019-12-01
4 211 2020-01-01
5 211 2020-03-01
where the date column contains consecutive months for id 205 but not for id 211.
I want to keep only the observations (id) for which I have monthly data without jumps. In this example I want:
id date
0 205 2019-12-01
1 205 2020-01-01
2 205 2020-02-01
Here I am collecting the id to keep:
keep_id = []
for num in pd.unique(df['id']):
    temp = (df.loc[df['id']==num,'date'].dt.year - df.loc[df['id']==num,'date'].shift(1).dt.year) * 12 + df.loc[df['id']==num,'date'].dt.month - df.loc[df['id']==num,'date'].shift(1).dt.month
    temp.values[0] = 1.0  # here I correct the first entry
    if (temp == 1.).all():
        keep_id.append(num)
where I am using (df.loc[df['id']==num,'date'].dt.year - df.loc[df['id']==num,'date'].shift(1).dt.year) * 12 + df.loc[df['id']==num,'date'].dt.month - df.loc[df['id']==num,'date'].shift(1).dt.month to compute the difference in months from the previous date for every id.
This seems to work when tested on a small portion of df, but I'm sure there is a better way of doing this, maybe using the .groupby() method.
Since df consists of millions of observations, my code takes too much time (and I'd like to learn a more efficient and Pythonic way of doing this).
What you want to do is use groupby-filter rather than a groupby apply.
df.groupby('id').filter(lambda x: not (x.date.diff() > pd.Timedelta(days=32)).any())
provides exactly:
id date
0 205 2019-12-01
1 205 2020-01-01
2 205 2020-02-01
And indeed, I would keep the index unique; there are too many useful characteristics to retain.
Both this response and Michael's above are correct in terms of output. In terms of performance, they are very similar as well:
%timeit df.groupby('id').filter(lambda x: not (x.date.diff() > pd.Timedelta(days=32)).any())
1.48 ms ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
and
%timeit df[df.groupby('id')['date'].transform(lambda x: x.diff().max() < pd.Timedelta(days=32))]
1.7 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
For most operations, this difference is negligible.
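If the Python-level lambda itself becomes the bottleneck on millions of rows, a fully vectorized variant is possible. A sketch (my addition, assuming dates are sorted ascending within each id, as in the example):
# Month index: consecutive calendar months differ by exactly 1.
months = df['date'].dt.year * 12 + df['date'].dt.month

# Gap to the previous row within each id; the first row of a group gets 1.
gaps = months.groupby(df['id']).diff().fillna(1)

# Keep only the ids whose gaps are all exactly one month.
keep = gaps.eq(1).groupby(df['id']).transform('all')
result = df[keep]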
You can use the following approach. Only ~3x faster in my tests.
df[df.groupby('id')['date'].transform(lambda x: x.diff().max() < pd.Timedelta(days=32))]
Out:
date
id
205 2019-12-01
205 2020-01-01
205 2020-02-01
I am calculating 48 derived pandas columns by iterating and calculating one column at a time, but I need to speed up the process. What is the best way to make this faster and more efficient? Each column calculates the closing price as a percentage of the period's (T, T-1, T-2, etc.) high-low range.
The code I am currently using is:
#get last x closes as percentage of period high and low
for i in range(1, 49, 1):
    df.loc[:, 'Close_T_period_'+str(i)] = ((df['BidClose'].shift(i).values
                                            - df['BidLow'].shift(i).values) /
                                           (df['BidHigh'].shift(i).values - df['BidLow'].shift(i).values))
Input dataframe sample:
BidOpen BidHigh BidLow BidClose AskOpen AskHigh AskLow AskClose Volume
Date
2019-09-27 09:00:00 1.22841 1.22919 1.22768 1.22893 1.22850 1.22927 1.22777 1.22900 12075.0
2019-09-27 10:00:00 1.22893 1.23101 1.22861 1.23058 1.22900 1.23110 1.22870 1.23068 16291.0
2019-09-27 11:00:00 1.23058 1.23109 1.22971 1.23076 1.23068 1.23119 1.22979 1.23087 10979.0
2019-09-27 12:00:00 1.23076 1.23308 1.23052 1.23232 1.23087 1.23314 1.23062 1.23241 16528.0
2019-09-27 13:00:00 1.23232 1.23247 1.23163 1.23217 1.23241 1.23256 1.23172 1.23228 14106.0
Output dataframe sample:
BidOpen BidHigh BidLow BidClose ... Close_T_period_45 Close_T_period_46 Close_T_period_47 Close_T_period_48
Date ...
2019-09-27 09:00:00 1.22841 1.22919 1.22768 1.22893 ... 0.682635 0.070796 0.128940 0.794521
2019-09-27 10:00:00 1.22893 1.23101 1.22861 1.23058 ... 0.506024 0.682635 0.070796 0.128940
2019-09-27 11:00:00 1.23058 1.23109 1.22971 1.23076 ... 0.774920 0.506024 0.682635 0.070796
2019-09-27 12:00:00 1.23076 1.23308 1.23052 1.23232 ... 0.212500 0.774920 0.506024 0.682635
2019-09-27 13:00:00 1.23232 1.23247 1.23163 1.23217 ... 0.378882 0.212500 0.774920 0.506024
Short Answer (faster implementation)
The following code is about 6x faster:
import numpy as np
def my_shift(x, i):
    first = np.array([np.nan] * i)
    return np.append(first, x[:-i])

result = ((df2['BidClose'].values - df2['BidLow'].values) / (df2['BidHigh'].values - df2['BidLow'].values))

for i in range(1, 49, 1):
    df2.loc[:, 'Close_T_period_'+str(i)] = my_shift(result, i)
Long Answer (explanation)
The two main bottleneck issues in your code are:
1. In every iteration you recalculate the same values; the only difference is that each time they are shifted differently.
2. The pandas shift operation is very slow for your purpose.

My code simply addresses these two issues: I calculate the result just once and use the loop only for shifting (issue 1), and I implemented my own shift function, which prepends i NaN values to the front of the original array and drops the last i values (issue 2).
Execution time
With a dataframe of 5000 rows, the time benchmark of your code gives:
42 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
with my solution I obtained:
7.62 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
UPDATE
I tried to implement a solution with apply:
result = ((df2['BidClose'].values - df2['BidLow'].values)/(df2['BidHigh'].values - df2['BidLow'].values))
df3 = df.reindex(df2.columns.tolist() +[f'Close_T_period_{i}' for i in range(1, 2000)], axis=1)
df3.iloc[:, 9:] = df3.iloc[:, 9:].apply(lambda row: my_shift(result, int(row.name.split('_')[-1])))
In my tests this solution seems slightly slower than the first one.
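Another possible variant (my sketch, not benchmarked here): compute the ratio once as a Series and build all 48 shifted columns in a single concat, which also avoids inserting columns into the frame one at a time.
ratio = (df['BidClose'] - df['BidLow']) / (df['BidHigh'] - df['BidLow'])

# Series.shift on the precomputed ratio plays the role of my_shift above.
shifted = pd.concat(
    {f'Close_T_period_{i}': ratio.shift(i) for i in range(1, 49)},
    axis=1,
)
df = pd.concat([df, shifted], axis=1)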
My problem
I'm having trouble with the performance of the resample function in combination with a groupby. The operation I'm doing currently takes 8+ seconds on a data sample of 5000 rows, which is totally unreasonable for my requirements.
Sample data (500 rows)
Pastebin with data as dict: https://pastebin.com/RPNdhXsy
The logic
I have data with dates in a quarterly interval which I want to group by a column and then resample the dates within the groups on a monthly basis.
Input:
isin report_date val
SE001 2018-12-31 1
SE001 2018-09-30 2
SE001 2018-06-30 3
US001 2018-10-31 4
US001 2018-07-31 5
Output:
isin report_date val
SE001 2018-12-31 1
2018-11-30 NaN
2018-10-31 NaN
2018-09-30 2
2018-08-31 NaN
2018-07-31 NaN
2018-06-30 3
US001 2018-10-31 4
2018-09-30 NaN
2018-08-31 NaN
2018-07-31 5
I used to have this operation:
df.groupby('isin').resample('M', on="report_date").first()[::-1]
Since it seems that asfreq() has slightly better performance than using on= in resample, I instead do the following currently. It's still slow though.
I reverse the result since resample seems to sort dates ascending with no option to change that.
df.set_index('report_date').groupby('isin').resample('M').asfreq()[::-1]
As stated, with 5000 rows and around 16 columns this takes 15 seconds to run since I need to do it on two separate dataframes.
With the sample data in the pastebin (500 rows) the operation takes me 0.7s which is way too long for me since my final data will have 800k rows.
EDIT: Timing of the different operations
Current way
setindex --- 0.001055002212524414 seconds ---
groupby --- 0.00033092498779296875 seconds ---
resample --- 0.004662036895751953 seconds ---
asfreq --- 0.8990700244903564 seconds ---
[::-1] --- 0.0013098716735839844 seconds ---
= 0.9056s
Old way
groupby --- 0.0005779266357421875 seconds ---
resample --- 0.0044629573822021484 seconds ---
first --- 1.6829369068145752 seconds ---
[::-1] --- 0.001600027084350586 seconds ---
= 1.6894s
Judging by this, it seems that converting from the pandas.core.resample.DatetimeIndexResamplerGroupby to a df is taking very long. Now what?
EDIT2: Using reindex
df.set_index('report_date').groupby('isin').apply(lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max(), freq='M'), fill_value=0))[::-1]
This takes 0.28s which is a vast improvement. Still not very good though.
How can I speed this up? Is there another way to do the same thing?
I cut execution time for a 25k row test data set from 850 ms to 320 ms. I wrapped the reindex logic in a function, to make timing easier:
def orig_pipeline(df):
    return (df
            .set_index('report_date')
            .groupby('isin')
            .apply(lambda x: x.reindex(pd.date_range(x.index.min(),
                                                     x.index.max(),
                                                     freq='M'),
                                       fill_value=0))
            [::-1])
Then, I created new functions to make date arithmetic and reindexing faster:
def create_params(df):
    return (df.groupby('isin')['report_date']
            .agg(['min', 'max']).sort_index().reset_index())

def create_multiindex(df, params):
    all_dates = pd.date_range(start='1999-12-31', end='2020-12-31', freq='M')
    midx = (
        (row.isin, d)
        for row in params.itertuples()
        for d in all_dates[(row.min <= all_dates) & (all_dates <= row.max)])
    return pd.MultiIndex.from_tuples(midx, names=['isin', 'report_date'])

def apply_mulitindex(df, midx):
    return df.set_index(['isin', 'report_date']).reindex(midx)

def new_pipeline(df):
    params = create_params(df)
    midx = create_multiindex(df, params)
    return apply_mulitindex(df, midx)
The old and new pipelines give the same results (except possibly for sort order):
v1 = orig_pipeline(df).drop(columns='isin').sort_index()
v2 = new_pipeline(df).sort_index().fillna(0)
assert(v1 == v2).all().all()
Timing results:
%%timeit
v1 = orig_pipeline(df_big)
854 ms ± 2.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
v2 = new_pipeline(df_big)
322 ms ± 5.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I would like to share the experiments I ran to figure out which solution yields the best performance; they show that @jsmart's is the fastest.
My dataset looks like the following screenshot (sorry, I could not manage to paste a pretty table):
My goal is to have for each (orgacom, client) couple the indicators resampled by business day.
Solution 1: groupby / apply asfreq
%%time
sol1 = (
    to_process.groupby(['orgacom', 'client'], observed=True, )
    .apply(lambda x: x.asfreq('B', fill_value=np.nan))
)
CPU times: user 4min 6s, sys: 2.91 s, total: 4min 9s
Wall time: 4min 9s
Solution 2: groupby / apply reindex (as of @jokab's EDIT2)
%%time
sol2 = (
    to_process.groupby(['orgacom', 'client'], observed=True, )
    .apply(lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max(), freq='B'), fill_value=np.nan))
)
CPU times: user 4min 13s, sys: 2.16 s, total: 4min 15s
Wall time: 4min 15s
Solution 3: recoding resample (as of @jsmart's answer)
def create_params(df):
    return (df.reset_index().groupby(['orgacom', 'client'], observed=True, )['date']
            .agg(['min', 'max']).sort_index().reset_index())

def create_multiindex(df, params):
    all_dates = pd.date_range(start='2016-12-31', end='2020-12-31', freq='B')
    midx = (
        (row.orgacom, row.client, d)
        for row in params.itertuples()
        for d in all_dates[(row.min <= all_dates) & (all_dates <= row.max)])
    return pd.MultiIndex.from_tuples(midx, names=['orgacom', 'client', 'date'])

def apply_mulitindex(df, midx):
    return df.set_index(['orgacom', 'client', 'date']).reindex(midx)

def new_pipeline(df):
    params = create_params(df)
    midx = create_multiindex(df, params)
    return apply_mulitindex(df, midx)
%%time
sol3 = new_pipeline(to_process.reset_index())
CPU times: user 1min 46s, sys: 4.93 s, total: 1min 51s
Wall time: 1min 51s
Solution 4: groupby / resample asfreq (as of @jokab's first solution)
%%time
sol4 = to_process.groupby(['orgacom', 'client']).resample('B').asfreq()
CPU times: user 4min 22s, sys: 8.01 s, total: 4min 30s
Wall time: 4min 30s
I also noticed that resampling on a groupby can be slow. In my case, I used data reshaping to speed it up:
df.set_index(['isin', 'report_date'])['val'].unstack(0).resample('M')
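A hedged sketch of what the full round trip might look like (my completion, assuming a single value column 'val' as in the question's example):
# Pivot ids to columns so only one DatetimeIndex has to be resampled,
# then stack back to the long (isin, report_date) layout.
wide = df.set_index(['isin', 'report_date'])['val'].unstack(0)
monthly = wide.resample('M').asfreq()

# dropna=False keeps the NaN months that asfreq introduced
out = monthly.stack(dropna=False).swaplevel().sort_index()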
There is another way of doing this: use itertools.groupby() and a list comprehension.
import time
from itertools import groupby
print(time.time())
data = (
('SE001', '2018-12-31', 1),
('SE001', '2018-09-30', 2),
('SE001', '2018-06-30', 3),
('US001', '2018-10-31', 4),
('US001', '2018-07-31', 5),
)
aggr = [(key, sum([g[2] for g in grp])) for key, grp in groupby(sorted(data), key=lambda x: x[0])]
print(aggr)
print(time.time())
# 100,000 records
# 2.5 seconds