What is the best way to vectorize/optimize this python code? - python

I am calculating 48 derived pandas columns by iterating and calculating one column at a time, but I need to speed up the process. What is the best way to make this faster and more efficient? Each column calculates the closing price as a percentage of the period's (T, T-1, T-2, etc.) high-low range.
The code I am currently using is:
# get last x closes as percentage of period high and low
for i in range(1, 49, 1):
    df.loc[:, 'Close_T_period_' + str(i)] = ((df['BidClose'].shift(i).values
                                              - df['BidLow'].shift(i).values)
                                             / (df['BidHigh'].shift(i).values
                                                - df['BidLow'].shift(i).values))
Input dataframe sample:
BidOpen BidHigh BidLow BidClose AskOpen AskHigh AskLow AskClose Volume
Date
2019-09-27 09:00:00 1.22841 1.22919 1.22768 1.22893 1.22850 1.22927 1.22777 1.22900 12075.0
2019-09-27 10:00:00 1.22893 1.23101 1.22861 1.23058 1.22900 1.23110 1.22870 1.23068 16291.0
2019-09-27 11:00:00 1.23058 1.23109 1.22971 1.23076 1.23068 1.23119 1.22979 1.23087 10979.0
2019-09-27 12:00:00 1.23076 1.23308 1.23052 1.23232 1.23087 1.23314 1.23062 1.23241 16528.0
2019-09-27 13:00:00 1.23232 1.23247 1.23163 1.23217 1.23241 1.23256 1.23172 1.23228 14106.0
Output dataframe sample:
BidOpen BidHigh BidLow BidClose ... Close_T_period_45 Close_T_period_46 Close_T_period_47 Close_T_period_48
Date ...
2019-09-27 09:00:00 1.22841 1.22919 1.22768 1.22893 ... 0.682635 0.070796 0.128940 0.794521
2019-09-27 10:00:00 1.22893 1.23101 1.22861 1.23058 ... 0.506024 0.682635 0.070796 0.128940
2019-09-27 11:00:00 1.23058 1.23109 1.22971 1.23076 ... 0.774920 0.506024 0.682635 0.070796
2019-09-27 12:00:00 1.23076 1.23308 1.23052 1.23232 ... 0.212500 0.774920 0.506024 0.682635
2019-09-27 13:00:00 1.23232 1.23247 1.23163 1.23217 ... 0.378882 0.212500 0.774920 0.506024

Short Answer (faster implementation)
The following code is about 6x faster:
import numpy as np

def my_shift(x, i):
    first = np.array([np.nan] * i)
    return np.append(first, x[:-i])

result = ((df2['BidClose'].values - df2['BidLow'].values)
          / (df2['BidHigh'].values - df2['BidLow'].values))

for i in range(1, 49, 1):
    df2.loc[:, 'Close_T_period_' + str(i)] = my_shift(result, i)
Long Answer (explanation)
The two main bottlenecks in your code are:
1. in every iteration you recalculate the same values; the only difference is that, each time, they are shifted differently;
2. pandas' shift operation is very slow for your purpose.
So my code simply addresses these two issues. Basically, I calculate the result just once and use the loop only for the shifting (issue #1), and I implemented my own shift function that appends i NaN values to the front of the original array and cuts off the last i values (issue #2).
Execution time
With a dataframe of 5000 rows, the benchmark of your code gives:
42 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
while with my solution I obtained:
7.62 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
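A further variant of the same idea (a sketch, not benchmarked here; ratio and shifted are just local names) computes the ratio once and builds all 48 shifted columns with a single concat, so the columns are not inserted into the frame one at a time:
import pandas as pd

# Sketch: compute the close-in-range ratio once, then concatenate all 48
# shifted copies as new columns in one go.
ratio = (df2['BidClose'] - df2['BidLow']) / (df2['BidHigh'] - df2['BidLow'])
shifted = pd.concat({'Close_T_period_' + str(i): ratio.shift(i) for i in range(1, 49)}, axis=1)
df2 = pd.concat([df2, shifted], axis=1)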
 UPDATE
I tried to implement a solution with apply:
result = ((df2['BidClose'].values - df2['BidLow'].values)
          / (df2['BidHigh'].values - df2['BidLow'].values))
df3 = df2.reindex(df2.columns.tolist() + [f'Close_T_period_{i}' for i in range(1, 2000)], axis=1)
df3.iloc[:, 9:] = df3.iloc[:, 9:].apply(lambda col: my_shift(result, int(col.name.split('_')[-1])))
In my tests this solution seems slightly slower than the first one.

Related

How to improve performance on a lambda function on a massive dataframe

I have a df with hundreds of millions of rows.
latitude longitude time VAL
0 -39.20000076293945312500 140.80000305175781250000 1972-01-19 13:00:00 1.20000004768371582031
1 -39.20000076293945312500 140.80000305175781250000 1972-01-20 13:00:00 0.89999997615814208984
2 -39.20000076293945312500 140.80000305175781250000 1972-01-21 13:00:00 1.50000000000000000000
3 -39.20000076293945312500 140.80000305175781250000 1972-01-22 13:00:00 1.60000002384185791016
4 -39.20000076293945312500 140.80000305175781250000 1972-01-23 13:00:00 1.20000004768371582031
... ...
It contains a time column of type datetime64 in UTC. The following code creates a new column, isInDST, to indicate whether the time falls within the daylight saving period of a local time zone.
df['isInDST'] = pd.DatetimeIndex(df['time']).tz_localize('UTC').tz_convert('Australia/Victoria').map(lambda x : x.dst().total_seconds()!=0)
It takes about 400 seconds to process 15,223,160 rows.
Is there a better approach to achieve this with better performance? Is vectorization a better way?
All results are calculated on 1M datapoints.
Cython + np.vectorize
7.2 times faster than the original code
%%cython
from cpython.datetime cimport datetime
cpdef bint c_is_in_dst(datetime dt):
    return dt.dst().total_seconds() != 0
%%timeit
df['isInDST'] = np.vectorize(c_is_in_dst)(df['time'].dt.tz_localize('UTC').dt.tz_convert('Australia/Victoria').dt.to_pydatetime())
1.08 s ± 10.2 ms per loop
np.vectorize
6.5 times faster than the original code
def is_in_dst(dt):
    return dt.dst().total_seconds() != 0
%%timeit
df['isInDST'] = np.vectorize(is_in_dst)(df['time'].dt.tz_localize('UTC').dt.tz_convert('Australia/Victoria').dt.to_pydatetime())
1.2 s ± 29.3 ms per loop
Based on the documentation ("The implementation is essentially a for loop"), I expected the result to be the same as the list comprehension, but it is consistently a little bit better.
List comprehension
5.9 times faster than the original code
%%timeit
df['isInDST'] = [x.dst().total_seconds()!=0 for x in pd.DatetimeIndex(df['time']).tz_localize('UTC').tz_convert('Australia/Victoria')]
1.33 s ± 48.4 ms per loop
This result shows that pandas map/apply is very slow; it adds overhead that can be eliminated by just using a plain Python loop.
Original approach (map on pandas DatetimeIndex)
%%timeit
df['isInDST'] = pd.DatetimeIndex(df['time']).tz_localize('UTC').tz_convert('Australia/Victoria').map(lambda x : x.dst().total_seconds()!=0)
7.82 s ± 84.3 ms per loop
Tested on 1M rows of dummy data
import datetime
import random

import pandas as pd

N = 1_000_000
df = pd.DataFrame({"time": [datetime.datetime.now().replace(hour=random.randint(0, 23),
                                                            minute=random.randint(0, 59))
                            for _ in range(N)]})
I also ran the code on 100K and 10M rows; the results scale linearly with the number of rows.
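One more idea, not benchmarked above and only helpful when many rows share the same timestamp (an assumption about the real data): dst() depends only on the local time, so it can be computed once per unique timestamp and mapped back:
import pandas as pd

# Sketch: compute dst() once per unique local timestamp, then map the results
# back onto the full column instead of calling dst() on every row.
local = pd.DatetimeIndex(df['time']).tz_localize('UTC').tz_convert('Australia/Victoria')
dst_by_ts = {ts: ts.dst().total_seconds() != 0 for ts in local.unique()}
df['isInDST'] = [dst_by_ts[ts] for ts in local]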

pandas operation by group

I have a dataframe like this
df = pd.DataFrame({'id': [205, 205, 205, 211, 211, 211],
                   'date': pd.to_datetime(['2019-12-01', '2020-01-01', '2020-02-01',
                                           '2019-12-01', '2020-01-01', '2020-03-01'])})
df
id date
0 205 2019-12-01
1 205 2020-01-01
2 205 2020-02-01
3 211 2019-12-01
4 211 2020-01-01
5 211 2020-03-01
where the date column consists of consecutive months for id 205 but not for id 211.
I want to keep only the observations (id) for which I have monthly data without jumps. In this example I want:
id date
0 205 2019-12-01
1 205 2020-01-01
2 205 2020-02-01
Here I am collecting the ids to keep:
keep_id = []
for num in pd.unique(df['id']):
    temp = ((df.loc[df['id'] == num, 'date'].dt.year
             - df.loc[df['id'] == num, 'date'].shift(1).dt.year) * 12
            + df.loc[df['id'] == num, 'date'].dt.month
            - df.loc[df['id'] == num, 'date'].shift(1).dt.month)
    temp.values[0] = 1.0  # here I correct the first entry (NaN after the shift)
    if (temp == 1.).all():
        keep_id.append(num)
where, for every id, I compute the difference in months from the previous date as (year difference) * 12 + (month difference).
This seems to work when tested on a small portion of df, but I'm sure there is a better way of doing this, maybe using the .groupby() method.
Since df contains millions of observations, my code takes too much time (and I'd like to learn a more efficient and Pythonic way of doing this).
What you want to do is use groupby-filter rather than a groupby apply.
df.groupby('id').filter(lambda x: not (x.date.diff() > pd.Timedelta(days=32)).any())
provides exactly:
id date
0 205 2019-12-01
1 205 2020-01-01
2 205 2020-02-01
And indeed, I would keep the index unique; there are too many useful characteristics to retain.
Both this response and Michael's are correct in terms of output. In terms of performance, they are very similar as well:
%timeit df.groupby('id').filter(lambda x: not (x.date.diff() > pd.Timedelta(days=32)).any())
1.48 ms ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
and
%timeit df[df.groupby('id')['date'].transform(lambda x: x.diff().max() < pd.Timedelta(days=32))]
1.7 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
For most operations, this difference is negligible.
You can use the following approach. Only ~3x faster in my tests.
df[df.groupby('id')['date'].transform(lambda x: x.diff().max() < pd.Timedelta(days=32))]
Out:
date
id
205 2019-12-01
205 2020-01-01
205 2020-02-01
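A lambda-free sketch of the same idea (assuming the 32-day tolerance used above; gaps, ok and good_ids are just local names): compute the per-id gaps once and keep the ids whose gaps never exceed the tolerance:
import pandas as pd

# Sketch: a gap is acceptable if it is missing (first row of each id) or at most
# ~one month long; keep only the ids where every gap is acceptable.
gaps = df.groupby('id')['date'].diff()
ok = gaps.isna() | gaps.le(pd.Timedelta(days=32))
good_ids = ok.groupby(df['id']).all()
result = df[df['id'].map(good_ids)]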

Pandas Efficiency -- Is there a faster way to slice thousands of times?

Appreciate any help from the community on this. I've been toying with it for a few days now.
I have 2 dataframes, df1 and df2. The first dataframe will always be 1-minute data, about 20-30 thousand rows. The second dataframe will contain random times with associated relevant data and will always be relatively small (1000-4000 rows x 4 or 5 columns). I'm working through df1 with itertuples in order to perform a time-specific (trailing) slice. This process gets repeated thousands of times, and the single slice line below (df3 = df2...) is causing over 50% of the runtime. Simply adding a couple of slicing criteria to that single line can add 30+% to the final runtimes, which already run hours long!
I've considered trying pandas' query, but have read it really only helps on larger dataframes. My thought is that it may be better to reduce df2 to a numpy array, a simple Python list, or some other structure, since it is always fairly short, though I think I'll need it back in a dataframe for the subsequent sorting, summations, and vector multiplications that come afterward in the primary code. I did succeed in utilizing concurrent futures on a 12-core setup, which increased speed about 5x for my overall application, though I'm still talking hours of runtime.
Any help or suggestions would be appreciated.
Example code illustrating the issue:
import pandas as pd
import numpy as np
import random
from datetime import datetime as dt
from datetime import timedelta, timezone
def random_dates(start, end, n=10):
    start_u = start.value // 10**9
    end_u = end.value // 10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
dfsize = 34000
df1 = pd.DataFrame({'datetime': pd.date_range('2010-01-01', periods=dfsize, freq='1min'), 'val':np.random.uniform(10, 100, size=dfsize)})
sizedf = 3000
start = pd.to_datetime('2010-01-01')
end = pd.to_datetime('2010-01-24')
test_list = [5, 30]
df2 = pd.DataFrame({'datetime': random_dates(start, end, sizedf),
                    'a': np.random.uniform(10, 100, size=sizedf),
                    'b': np.random.choice(test_list, sizedf),
                    'c': np.random.uniform(10, 100, size=sizedf),
                    'd': np.random.uniform(10, 100, size=sizedf),
                    'e': np.random.uniform(10, 100, size=sizedf)})
df2.set_index('datetime', inplace=True)
daysback5 = 3
daysback30 = 8
#%%timeit -r1 #time this section here:
# Slow portion here - performing ~4000+ slices on a dataframe (df2) which is ~1000 to 3000 rows.
# Some slowdown is due to itertuples, which I don't think is avoidable.
for line, row in enumerate(df1.itertuples(index=False), 0):
    if row.datetime.minute % 5 == 0:
        # Lion's share of the slowdown:
        df3 = df2[(df2['a'] <= row.val * 1.25) & (df2['a'] >= row.val * .75)
                  & (df2.index <= row.datetime)
                  & (((df2.index >= row.datetime - timedelta(days=daysback30)) & (df2['b'] == 30))
                     | ((df2.index >= row.datetime - timedelta(days=daysback5)) & (df2['b'] == 5)))
                  ].reset_index(drop=True).copy()
Time of slow part:
8.53 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
df1:
datetime val
0 2010-01-01 00:00:00 58.990147
1 2010-01-01 00:01:00 27.457308
2 2010-01-01 00:02:00 20.657251
3 2010-01-01 00:03:00 36.416561
4 2010-01-01 00:04:00 71.398897
... ... ...
33995 2010-01-24 14:35:00 77.763085
33996 2010-01-24 14:36:00 21.151239
33997 2010-01-24 14:37:00 83.741844
33998 2010-01-24 14:38:00 93.370216
33999 2010-01-24 14:39:00 99.720858
34000 rows × 2 columns
df2:
a b c d e
datetime
2010-01-03 23:38:13 22.363251 30 81.158073 21.806457 11.116421
2010-01-09 16:27:32 78.952070 5 27.045279 29.471537 29.559228
2010-01-13 04:49:57 85.985935 30 79.206437 29.711683 74.454446
2010-01-07 22:29:22 36.009752 30 43.072552 77.646257 57.208626
2010-01-15 09:33:02 13.653679 5 87.987849 37.433810 53.768334
... ... ... ... ... ...
2010-01-12 07:36:42 30.328512 5 81.281791 14.046032 38.288534
2010-01-08 20:26:31 80.911904 30 32.524414 80.571806 26.234552
2010-01-14 08:32:01 12.198825 5 94.270709 27.255914 87.054685
2010-01-06 03:25:09 82.591519 5 91.160917 79.042083 17.831732
2010-01-07 14:32:47 38.337405 30 10.619032 32.557640 87.890791
3000 rows × 5 columns
Actually, a cross merge and query work pretty well for your data size:
(df1[df1.datetime.dt.minute % 5 == 0].assign(dummy=1)
    .merge(df2.reset_index().assign(dummy=1),
           on='dummy', suffixes=['_1', '_2'])
    .query('val*1.25 >= a >= val*.75 and datetime_2 <= datetime_1')
    .loc[lambda x: ((x.datetime_2 >= x.datetime_1 - pd.Timedelta(days=daysback30)) & x['b'].eq(30))
                 | ((x.datetime_2 >= x.datetime_1 - pd.Timedelta(days=daysback5)) & x['b'].eq(5))]
)
which on my system takes about:
2.05 s ± 60.4 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)
where your code runs for about 10s.
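As a side note, on pandas >= 1.2 (an assumption about your version) the dummy-column trick can be replaced by an explicit cross join; the rest of the chain stays the same:
# Sketch: explicit cross join instead of merging on a dummy column.
pairs = (df1[df1.datetime.dt.minute % 5 == 0]
         .merge(df2.reset_index(), how='cross', suffixes=['_1', '_2']))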

What's the most efficient way to convert a time-series data into a cross-sectional one?

Here's the thing, I have the dataset below where date is the index:
date value
2020-01-01 100
2020-02-01 140
2020-03-01 156
2020-04-01 161
2020-05-01 170
.
.
.
And I want to transform it in this other dataset:
value_t0 value_t1 value_t2 value_t3 value_t4 ...
100 NaN NaN NaN NaN ...
140 100 NaN NaN NaN ...
156 140 100 NaN NaN ...
161 156 140 100 NaN ...
170 161 156 140 100 ...
First I thought about using pandas.pivot_table to do something, but that would just provide a different layout grouped by some column, which is not exactly what I want. Later, I thought about using pandasql and apply 'case when', but that wouldn't work because I would have to type dozens of lines of code. So I'm stuck here.
try this:
new_df = pd.DataFrame({f"value_t{i}": df['value'].shift(i) for i in range(len(df))})
The Series .shift(n) method can get you a single column of your desired output by shifting everything down and filling in NaNs above. So we're building a new dataframe by feeding it a dictionary of the form {column name: column data, ...}, using a dictionary comprehension to iterate through your original dataframe.
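For a quick feel of what .shift() does on a tiny series:
import pandas as pd

s = pd.Series([100, 140, 156])
s.shift(1)  # NaN, 100.0, 140.0 -- values move down one row, NaN fills the top
s.shift(2)  # NaN, NaN, 100.0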
I think the best option is to use numpy:
values = np.asarray(df['value'].astype(float))
new_values = np.tril(np.repeat([values], values.shape[0], axis=0).T)
new_values[np.triu_indices(new_values.shape[0], 1)] = np.nan
new_df = pd.DataFrame(new_values).add_prefix('value_t')
Times for 5000 rows
%%timeit
values = np.asarray(df['value'].astype(float))
new_values = np.tril(np.repeat([values], values.shape[0], axis=0).T)
new_values[np.triu_indices(new_values.shape[0],1)] = np.nan
new_df = pd.DataFrame(new_values).add_prefix('value_t')
556 ms ± 35.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
new_df = pd.DataFrame({f"value_t{i}": df['value'].shift(i) for i in range(len(df))})
1.31 s ± 36.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Time without add_prefix
%%timeit
values = np.asarray(df['value'].astype(float))
new_values = np.tril(np.repeat([values], values.shape[0], axis=0).T)
new_values[np.triu_indices(new_values.shape[0],1)] = np.nan
new_df = pd.DataFrame(new_values)
357 ms ± 8.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Resample Pandas Dataframe Without Filling in Missing Times

Resampling a dataframe can take it to either a higher or a lower temporal resolution. Most of the time this is used to go to a lower resolution (e.g. resample 1-minute data to monthly values). When the dataset is sparse (for example, no data were collected in Feb-2020), the Feb-2020 row in the resampled dataframe will be filled with NaNs. The problem is that when the data record is long AND sparse there are a lot of NaN rows, which makes the dataframe unnecessarily large and takes a lot of CPU time. For example, consider this dataframe and resample operation:
import numpy as np
import pandas as pd
freq1 = pd.date_range("20000101", periods=10, freq="S")
freq2 = pd.date_range("20200101", periods=10, freq="S")
index = np.hstack([freq1.values, freq2.values])
data = np.random.randint(0, 100, (20, 10))
cols = list("ABCDEFGHIJ")
df = pd.DataFrame(index=index, data=data, columns=cols)
# now resample to daily average
df = df.resample(rule="1D").mean()
Most of the data in this dataframe is useless and can be removed via:
df.dropna(how="all", axis=0, inplace=True)
however, this is sloppy. Is there another method to resample the dataframe that does not fill all of the data gaps with NaN (i.e. in the example above, the resultant dataframe would have only two rows)?
Updating my original answer with what (I think) is an improvement, plus updated times.
Use groupby
There are a couple of ways you can use groupby instead of resample. In the case of daily ("1D") resampling, you can just use the date property of the DatetimeIndex:
df = df.groupby(df.index.date).mean()
This is in fact faster than the resample for your data:
%%timeit
df.resample(rule='1D').mean().dropna()
# 2.08 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df.groupby(df.index.date).mean()
# 666 µs ± 15.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
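A small aside (a sketch, not benchmarked here): df.index.date materializes Python date objects, while df.index.normalize() keeps the keys as datetime64 midnight timestamps and gives an equivalent daily grouping:
# Equivalent daily grouping with datetime64 keys instead of Python date objects.
df.groupby(df.index.normalize()).mean()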
The more general approach would be to use the floor of the timestamps to do the groupby operation:
rule = '1D'
f = df.index.floor(rule)
df.groupby(f).mean()
# A B C D E F G H I J
# 2000-01-01 50.5 33.5 62.7 42.4 46.7 49.2 64.0 53.3 71.0 38.0
# 2020-01-01 50.4 56.3 57.4 46.2 55.0 60.2 60.3 57.8 63.5 47.3
This will work with more irregular frequencies as well. The main snag here is that by default, it seems like the floor is calculated in reference to some initial date, which can cause weird results (see my post):
rule = '7D'
f = df.index.floor(rule)
df.groupby(f).mean()
# A B C D E F G H I J
# 1999-12-30 50.5 33.5 62.7 42.4 46.7 49.2 64.0 53.3 71.0 38.0
# 2019-12-26 50.4 56.3 57.4 46.2 55.0 60.2 60.3 57.8 63.5 47.3
The major issue is that the resampling doesn't start on the earliest timestamp within your data. However, it is fixable using this solution to the above post:
# custom function for flooring relative to a start date
def floor(x, freq):
offset = x[0].ceil(freq) - x[0]
return (x + offset).floor(freq) - offset
rule = '7D'
f = floor(df.index, rule)
df.groupby(f).mean()
# A B C D E F G H I J
# 2000-01-01 50.5 33.5 62.7 42.4 46.7 49.2 64.0 53.3 71.0 38.0
# 2019-12-28 50.4 56.3 57.4 46.2 55.0 60.2 60.3 57.8 63.5 47.3
# the cycle of 7 days is now starting from 01-01-2000
Just note here that the function floor() is relatively slow compared to pandas.Series.dt.floor(). So it is best to use the latter if you can, but both are better than the original resample (in your example):
%%timeit
df.groupby(df.index.floor('1D')).mean()
# 1.06 ms ± 6.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
df.groupby(floor(df.index, '1D')).mean()
# 1.42 ms ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
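If you are on pandas >= 1.1 (an assumption about your version), resample also accepts an origin argument that anchors the bins at the first timestamp, much like the custom floor() above, although empty bins still come out as NaN rows:
# Assumption: pandas >= 1.1, which added the `origin` argument to resample.
# Bins start at the first timestamp, but sparse data still needs the dropna.
df.resample('7D', origin='start').mean().dropna(how='all')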
