How to efficiently do operation on pandas each group - python

So I have a data frame like this--
import pandas as pd
import numpy as np
df = pd.DataFrame([[1,22], [1,23], [1,44], [2, 33], [2, 55]], columns=['id', 'delay'])
id delay
0 1 22
1 1 23
2 1 44
3 2 33
4 2 55
What I am doing is grouping by id and doing rolling operation on the delay column like below--
k = [0.1, 0.5, 1]
def f(d):
d['new_delay'] = pd.Series([0,0]).append(d['delay']).rolling(window=3).apply(lambda x: np.sum(x*k)).iloc[2:]
return d
df.groupby(['id']).apply(f)
id delay new_delay
0 1 22 22.0
1 1 23 34.0
2 1 44 57.7
3 2 33 33.0
4 2 55 71.5
It is working just fine but I am curious whether .apply on grouped data frame is vectorized or not. Since my dataset is huge, is there a better-vectorized way to do this kind of operation? Also I am curious if Python is single-threaded and I am running on CPU how pandas, numpy achieve vectorized calculation.

You can use strides for vectorized rolling with GroupBy.transform:
k = [0.1, 0.5, 1]
def rolling_window(a, window):
shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
strides = a.strides + (a.strides[-1],)
return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
def f(d):
return np.sum(rolling_window(np.append([0,0],d.to_numpy()), 3) * k, axis=1)
df['new_delay'] = df.groupby('id')['delay'].transform(f)
print (df)
id delay new_delay
0 1 22 22.0
1 1 23 34.0
2 1 44 57.7
3 2 33 33.0
4 2 55 71.5

Another option will be to use np.convolve() instead:
# Our function
f = lambda x: np.convolve(np.array([1,0.5,0.1]),x)[:len(x)]
# Groupby + Transform
df['new_delay'] = df.groupby('id')['delay'].transform(f)
Don't know if it's faster or not.

Here is one approach with groupby + rolling and apply a custom function compiled using numba
def func(v):
k = np.array([0.1, 0.5, 1])
return np.sum(v * k[len(k) - len(v):])
(
df.groupby('id')['delay']
.rolling(3, min_periods=1)
.apply(func, raw=True, engine='numba')
.droplevel(0)
)
0 22.0
1 34.0
2 57.7
3 33.0
4 71.5
Name: delay, dtype: float64

Related

Compare two columns based on last N rows in a pandas DataFrame

I want to groupby "ts_code" and calculate percentage between one column max and min value from another column after max based on last N rows for each group. Specifically,
df
ts_code high low
0 A 20 10
1 A 30 5
2 A 40 20
3 A 50 10
4 A 20 30
5 B 20 10
6 B 30 5
7 B 40 20
8 B 50 10
9 B 20 30
Goal
Below is my expected result
ts_code high low l3_high_low_pct_chg l4_high_low_pct_chg
0 A 20 10 NA NA
1 A 30 5 NA NA
2 A 40 20 0.5 NA
3 A 50 10 0.8 0.8
4 A 20 30 0.8 0.8
5 B 50 10 NA NA
6 B 30 5 NA NA
7 B 40 20 0.9 NA
8 B 10 10 0.75 0.9
9 B 20 30 0.75 0.75
ln_high_low_pct_chg(such as l3_high_low_pct_chg)= 1-(the min value of the low column after the peak)/(the max value of high column),on last N rows for each group and each row.
Try and problem
df['l3_highest']=df.groupby('ts_code')['high'].transform(lambda x: x.rolling(3).max())
df['l3_lowest']=df.groupby('ts_code')['low'].transform(lambda x: x.rolling(3).min())
df['l3_high_low_pct_chg']=1-df['l3_lowest']/df['l3_highest']
But it fails such that for second row, the l3_lowest would be 5 not 20. I don't know how to calculate percentage after peak.
For last 4 rows, at index=8, low=10,high=50,low=5, l4_high_low_pct_chg=0.9
, at index=9, high=40, low=10, l4_high_low_pct_chg=0.75
Another test data
If the rolling window is 52, for hy_code 880912 group and index 1252, l52_high_low_pct_chg would be 0.281131 and 880301 group and index 1251, l52_high_low_pct_chg would be 0.321471.
Grouping by 'ts_code' is just a trivial groupby() function. DataFrame.rolling() function is for single columns, so it's a tricky to apply it if you need data from multiple columns. You can use "from numpy_ext import rolling_apply as rolling_apply_ext" as in this example: Pandas rolling apply using multiple columns. However, I just created a function that manually groups the dataframe into n length sub-dataframes, then applies the function to calculate the value. idxmax() finds the index value of the peak of the low column, then we find the min() of the values that follow. The rest is pretty straightforward.
import numpy as np
import pandas as pd
df = pd.DataFrame([['A', 20, 10],
['A', 30, 5],
['A', 40, 20],
['A', 50, 10],
['A', 20, 30],
['B', 50, 10],
['B', 30, 5],
['B', 40, 20],
['B', 10, 10],
['B', 20, 30]],
columns=['ts_code', 'high', 'low']
)
def custom_f(df, n):
s = pd.Series(np.nan, index=df.index)
def sub_f(df_):
high_peak_idx = df_['high'].idxmax()
min_low_after_peak = df_.loc[high_peak_idx:]['low'].min()
max_high = df_['high'].max()
return 1 - min_low_after_peak / max_high
for i in range(df.shape[0] - n + 1):
df_ = df.iloc[i:i + n]
s.iloc[i + n - 1] = sub_f(df_)
return s
df['l3_high_low_pct_chg'] = df.groupby("ts_code").apply(custom_f, 3).values
df['l4_high_low_pct_chg'] = df.groupby("ts_code").apply(custom_f, 4).values
print(df)
If you prefer to use the rolling function, this method gives the same output:
def rolling_f(rolling_df):
df_ = df.loc[rolling_df.index]
high_peak_idx = df_['high'].idxmax()
min_low_after_peak = df_.loc[high_peak_idx:]["low"].min()
max_high = df_['high'].max()
return 1 - min_low_after_peak / max_high
df['l3_high_low_pct_chg'] = df.groupby("ts_code").rolling(3).apply(rolling_f).values[:, 0]
df['l4_high_low_pct_chg'] = df.groupby("ts_code").rolling(4).apply(rolling_f).values[:, 0]
print(df)
Finally, if you want to do a true rolling window calculation that avoids any index lookup, you can use the numpy_ext (https://pypi.org/project/numpy-ext/)
from numpy_ext import rolling_apply
def np_ext_f(rolling_df, n):
def rolling_apply_f(high, low):
return 1 - low[np.argmax(high):].min() / high.max()
try:
return pd.Series(rolling_apply(rolling_apply_f, n, rolling_df['high'].values, rolling_df['low'].values), index=rolling_df.index)
except ValueError:
return pd.Series(np.nan, index=rolling_df.index)
df['l3_high_low_pct_chg'] = df.groupby('ts_code').apply(np_ext_f, n=3).sort_index(level=1).values
df['l4_high_low_pct_chg'] = df.groupby('ts_code').apply(np_ext_f, n=4).sort_index(level=1).values
print(df)
output:
ts_code high low l3_high_low_pct_chg l4_high_low_pct_chg
0 A 20 10 NaN NaN
1 A 30 5 NaN NaN
2 A 40 20 0.50 NaN
3 A 50 10 0.80 0.80
4 A 20 30 0.80 0.80
5 B 50 10 NaN NaN
6 B 30 5 NaN NaN
7 B 40 20 0.90 NaN
8 B 10 10 0.75 0.90
9 B 20 30 0.75 0.75
For large datasets, the speed of these operations becomes an issue. So, to compare the speed of these different methods, I created a timing function:
import time
def timeit(f):
def timed(*args, **kw):
ts = time.time()
result = f(*args, **kw)
te = time.time()
print ('func:%r took: %2.4f sec' % \
(f.__name__, te-ts))
return result
return timed
Next, let's make a large DataFrame, just by copying the existing dataframe 500 times:
df = pd.concat([df for x in range(500)], axis=0)
df = df.reset_index()
Finally, we run the three tests under a timing function:
#timeit
def method_1():
df['l52_high_low_pct_chg'] = df.groupby("ts_code").apply(custom_f, 52).values
method_1()
#timeit
def method_2():
df['l52_high_low_pct_chg'] = df.groupby("ts_code").rolling(52).apply(rolling_f).values[:, 0]
method_2()
#timeit
def method_3():
df['l52_high_low_pct_chg'] = df.groupby('ts_code').apply(np_ext_f, n=52).sort_index(level=1).values
method_3()
Which gives us this output:
func:'method_1' took: 2.5650 sec
func:'method_2' took: 15.1233 sec
func:'method_3' took: 0.1084 sec
So, the fastest method is to use the numpy_ext, which makes sense because that's optimized for vectorized calculations. The second fastest method is the custom function I wrote, which is somewhat efficient because it does some vectorized calculations while also doing some Pandas lookups. The slowest method by far is using Pandas rolling function.
For my solution, we'll use .groupby("ts_code") then .rolling to process groups of certain size and a custom_function. This custom function will take each group, and instead of applying a function directly on the received values, we'll use those values to query the original dataframe. Then, we can calculate the values as you expect by finding the row where the "high" peak is, then look the following rows to find the minimum "low" value and finally calculate the result using your formula:
def custom_function(group, df):
# Query the original dataframe using the group values
group = df.loc[group.values]
# Calculate your formula
high_peak_row = group["high"].idxmax()
min_low_after_peak = group.loc[high_peak_row:, "low"].min()
return 1 - min_low_after_peak / group.loc[high_peak_row, "high"]
# Reset the index to roll over that column and be able query the original dataframe
df["l3_high_low_pct_chg"] = df.reset_index().groupby("ts_code")["index"].rolling(3).apply(custom_function, args=(df,)).values
df["l4_high_low_pct_chg"] = df.reset_index().groupby("ts_code")["index"].rolling(4).apply(custom_function, args=(df,)).values
Output:
ts_code high low l3_high_low_pct_chg l4_high_low_pct_chg
0 A 20 10 NaN NaN
1 A 30 5 NaN NaN
2 A 40 20 0.50 NaN
3 A 50 10 0.80 0.80
4 A 20 30 0.80 0.80
5 B 50 10 NaN NaN
6 B 30 5 NaN NaN
7 B 40 20 0.90 NaN
8 B 10 10 0.75 0.90
9 B 20 30 0.75 0.75
We can take this idea further an only group once:
groups = df.reset_index().groupby("ts_code")["index"]
for n in [3, 4]:
df[f"l{n}_high_low_pct_chg"] = groups.rolling(n).apply(custom_function, args=(df,)).values

Write if else loop in simple way to reduce time complexity

I have the below code
for i in range(index, len(df_2)+1):
if df_2.loc[i, 'Duration'] == 0:
df_2.loc[i, 'Duration'] = df_2.loc[i, "idle_hrs"] + df_2.loc[i - 1, "Duration"]
How can i write this in simple way to reduce time complexity? is there a war to write it in list comprehension style?
You can use shift for the accumulation.
import pandas as pd
index = 2
df = pd.DataFrame({'Duration': [1,0,2,0], 'idle_hrs': [10,20,30,40]})
df
Duration idle_hrs
0 1 10
1 0 20
2 2 30
3 0 40
start_df = df[:index]
df.loc[df['Duration'] == 0, 'Duration'] = df['idle_hrs'] + df['Duration'].shift(1)
df.iloc[:index] = start_df
df
Duration idle_hrs
0 1.0 10
1 0.0 20
2 2.0 30
3 42.0 40

Calculating rolling beta in Pandas

I am trying to calculating a rolling beta between two Series in Pandas.
My understanding is that to get the beta, I need to get the covariance matrix and then divide the cells (0, 1) by (1, 1)
So I created a function:
def calc_beta (A, B) :
covariance = np.cov (A, B)
beta = covariance[0, 1] / covariance[1, 1]
return beta
If I just wanted to run it for the entire series, I would do:
calc_beta(A, B)
But I'm not sure how to do that on a rolling basis, I tried A.rolling(10).apply(calc_beta, raw=False, B) unsuccessfully.
Then I just tried calculating the cov matrix on a rolling basis, which I can do:
A = pd.Series(np.random.randint(1,101,50))
B = pd.Series(np.random.randint(1,101,50))
df = pd.DataFrame([A, B]).transpose()
df.rolling(10).cov(df, pairwise=True)
Now I have a covariance matrix but how do I perform the beta calc, i.e. (covariance[0,1]/covariance[1,1]) on a rolling basis (and then get the mean).
It might not be the best answer (read, the most compact) but Ithink this could do the trick. You were actually on the right track to begin with. So, assume you have the two series you gave and make them into a df
A = pd.Series(np.random.randint(1,101,50))
B = pd.Series(np.random.randint(1,101,50))
df = pd.concat([A, B], axis=1)
Define the beta and the rolling in the following way:
def calc_beta(df):
np_array = df.values
s = np_array[:,0]
m = np_array[:,1]
covariance = np.cov(s,m)
beta = covariance[0,1]/covariance[1,1]
return beta
def rolling(df, period, function , min_periods=None):
if min_periods is None:
min_periods = period
result = pd.Series(np.nan, index=df.index)
for i in range(1, len(df)+1):
df2 = df.iloc[max(i-period, 0):i,:] #I edited here
if len(df2) >= min_periods:
idx = df2.index[-1]
result[idx] = function(df2)
return result
And do the following:
calc_beta(df)
which return 0.15350171576854774
and
rolling(df, 12,calc_beta, min_periods=None)
(Of course, you can choose any period)
which gives
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 0.034478
12 0.019883
13 -0.093483
14 0.140603
15 0.137694
16 -0.004115
17 -0.144355
18 -0.079803
19 -0.023759
20 0.099539
21 0.186670
22 0.199526
23 0.113457
24 0.152232
25 0.149928
26 0.079760
27 0.032097
28 0.056294
29 0.070176
30 0.076560
31 0.013778
32 0.080279
33 0.058864
34 0.006916
35 0.303566
36 0.133580
37 0.238668
38 0.312243
39 0.406835
40 0.337503
41 0.370470
42 0.237132
43 0.253779
44 0.160348
45 0.103425
46 0.261430
47 0.130407
48 0.314028
49 0.322890
dtype: float64
so I appreciate the answer #Serge but I felt like I could do it in a slightly cleaner way. This is what I've come up with which works for me. Let me know if you have any comments on it. Thanks again.
A = pd.Series(np.random.randint(1,101,50))
B = pd.Series(np.random.randint(1,101,50))
df = pd.DataFrame({'A' : A, 'B' : B})
df.rolling(10).cov(df, pairwise=True).drop(['A'], axis=1) \
.unstack(1) \
.droplevel(0, axis=1) \
.apply(lambda row: row['A'] / row['B'], axis=1) \
.mean()

pandas.Series : how to get rate of next value

About pandas, I want to know how to get rate of next value. Below series is a sample.
import pandas as pd
s = pd.Series([1,2,1,1,1,3])
>>> s
0 1
1 2
2 1
3 1
4 1
5 3
# What I wanna get are below rates.
# 1 to 2 : 1/5(0.2)
# 2 to 1 : 1/5(0.2)
# 1 to 1 : 2/5(0.4)
# 1 to 3 : 1/5(0.2)
Sorry for bad description, but is there anyone who know how to do this?
One possible solution with strides, aggregate count by GroupBy.size and division by length of DataFrame:
import pandas as pd
import numpy as np
s = pd.Series([1,2,1,1,1,3])
def rolling_window(a, window):
shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
strides = a.strides + (a.strides[-1],)
return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
df1 = pd.DataFrame(rolling_window(s.values, 2), columns=['from','to'])
df1 = df1.groupby(['from','to'], sort=False).size().div(len(df1)).reset_index(name='rate')
print (df1)
from to rate
0 1 2 0.2
1 2 1 0.2
2 1 1 0.4
3 1 3 0.2

How to create a column using a function based of previous values in the column in python

My Problem
I have a loop that creates a value for x in time period t based on x in time period t-1. The loop is really slow so i wanted to try and turn it into a function. I tried to use np.where with shift() but I had no joy. Any idea how i might be able to get around this problem?
Thanks!
My Code
import numpy as np
import pandas as pd
csv1 = pd.read_csv('y_list.csv', delimiter = ',')
df = pd.DataFrame(csv1)
df.loc[df.index[0], 'var'] = 0
for x in range(1,len(df.index)):
if df["LAST"].iloc[x] > 0:
df["var"].iloc[x] = ((df["var"].iloc[x - 1] * 2) + df["LAST"].iloc[x]) / 3
else:
df["var"].iloc[x] = (df["var"].iloc[x - 1] * 2) / 3
df
Input Data
Dates,LAST
03/09/2018,-7
04/09/2018,5
05/09/2018,-4
06/09/2018,5
07/09/2018,-6
10/09/2018,6
11/09/2018,-7
12/09/2018,7
13/09/2018,-9
Output
Dates,LAST,var
03/09/2018,-7,0.000000
04/09/2018,5,1.666667
05/09/2018,-4,1.111111
06/09/2018,5,2.407407
07/09/2018,-6,1.604938
10/09/2018,6,3.069959
11/09/2018,-7,2.046639
12/09/2018,7,3.697759
13/09/2018,-9,2.465173
You are looking at ewm:
arg = df.LAST.clip(lower=0)
arg.iloc[0] = 0
arg.ewm(alpha=1/3, adjust=False).mean()
Output:
0 0.000000
1 1.666667
2 1.111111
3 2.407407
4 1.604938
5 3.069959
6 2.046639
7 3.697759
8 2.465173
Name: LAST, dtype: float64
You can use df.shift to shift the dataframe be a default of 1 row, and convert the if-else block in to a vectorized np.where:
In [36]: df
Out[36]:
Dates LAST var
0 03/09/2018 -7 0.0
1 04/09/2018 5 1.7
2 05/09/2018 -4 1.1
3 06/09/2018 5 2.4
4 07/09/2018 -6 1.6
5 10/09/2018 6 3.1
6 11/09/2018 -7 2.0
7 12/09/2018 7 3.7
8 13/09/2018 -9 2.5
In [37]: (df.shift(1)['var']*2 + np.where(df['LAST']>0, df['LAST'], 0)) / 3
Out[37]:
0 NaN
1 1.666667
2 1.133333
3 2.400000
4 1.600000
5 3.066667
6 2.066667
7 3.666667
8 2.466667
Name: var, dtype: float64

Categories

Resources