I am trying to calculate a rolling beta between two Series in pandas.
My understanding is that to get the beta, I need the covariance matrix and then to divide cell (0, 1) by cell (1, 1).
So I created a function:
def calc_beta(A, B):
    covariance = np.cov(A, B)
    beta = covariance[0, 1] / covariance[1, 1]
    return beta
If I just wanted to run it for the entire series, I would do:
calc_beta(A, B)
But I'm not sure how to do that on a rolling basis. I tried A.rolling(10).apply(calc_beta, raw=False, B), without success.
Then I just tried calculating the cov matrix on a rolling basis, which I can do:
A = pd.Series(np.random.randint(1,101,50))
B = pd.Series(np.random.randint(1,101,50))
df = pd.DataFrame([A, B]).transpose()
df.rolling(10).cov(df, pairwise=True)
Now I have a rolling covariance matrix, but how do I perform the beta calculation, i.e. covariance[0, 1] / covariance[1, 1], on a rolling basis (and then take the mean)?
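One possible way (a minimal sketch, assuming both Series share the same index) is to roll over A with raw=False, so each window arrives as a Series that still carries its index labels, and use those labels to slice the matching piece of B inside the lambda:
import numpy as np
import pandas as pd

A = pd.Series(np.random.randint(1, 101, 50))
B = pd.Series(np.random.randint(1, 101, 50))

def calc_beta(a, b):
    covariance = np.cov(a, b)
    return covariance[0, 1] / covariance[1, 1]

# raw=False hands each window to the lambda as a Series, so its index can align B
rolling_beta = A.rolling(10).apply(lambda a: calc_beta(a, B.loc[a.index]), raw=False)
rolling_beta.mean()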
It might not be the best answer (read: the most compact), but I think this could do the trick. You were actually on the right track to begin with. So, assuming you have the two series you gave, make them into a DataFrame:
A = pd.Series(np.random.randint(1,101,50))
B = pd.Series(np.random.randint(1,101,50))
df = pd.concat([A, B], axis=1)
Define the beta and the rolling in the following way:
def calc_beta(df):
    np_array = df.values
    s = np_array[:, 0]  # first column
    m = np_array[:, 1]  # second column
    covariance = np.cov(s, m)
    beta = covariance[0, 1] / covariance[1, 1]
    return beta
def rolling(df, period, function, min_periods=None):
    if min_periods is None:
        min_periods = period
    result = pd.Series(np.nan, index=df.index)
    for i in range(1, len(df) + 1):
        # window ending at row i-1, clamped at the start of the frame
        df2 = df.iloc[max(i - period, 0):i, :]
        if len(df2) >= min_periods:
            idx = df2.index[-1]
            result[idx] = function(df2)
    return result
And do the following:
calc_beta(df)
which returns 0.15350171576854774
and
rolling(df, 12, calc_beta, min_periods=None)
(Of course, you can choose any period)
which gives
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 0.034478
12 0.019883
13 -0.093483
14 0.140603
15 0.137694
16 -0.004115
17 -0.144355
18 -0.079803
19 -0.023759
20 0.099539
21 0.186670
22 0.199526
23 0.113457
24 0.152232
25 0.149928
26 0.079760
27 0.032097
28 0.056294
29 0.070176
30 0.076560
31 0.013778
32 0.080279
33 0.058864
34 0.006916
35 0.303566
36 0.133580
37 0.238668
38 0.312243
39 0.406835
40 0.337503
41 0.370470
42 0.237132
43 0.253779
44 0.160348
45 0.103425
46 0.261430
47 0.130407
48 0.314028
49 0.322890
dtype: float64
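Since the question also asks for the mean of the rolling betas, it can be taken straight from the Series that rolling() returns (pandas skips the leading NaN windows by default):
rolling(df, 12, calc_beta, min_periods=None).mean()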
I appreciate the answer, @Serge, but I felt like I could do it in a slightly cleaner way. This is what I've come up with, and it works for me. Let me know if you have any comments on it. Thanks again.
A = pd.Series(np.random.randint(1,101,50))
B = pd.Series(np.random.randint(1,101,50))
df = pd.DataFrame({'A' : A, 'B' : B})
df.rolling(10).cov(df, pairwise=True).drop(['A'], axis=1) \
    .unstack(1) \
    .droplevel(0, axis=1) \
    .apply(lambda row: row['A'] / row['B'], axis=1) \
    .mean()
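For reference, a slightly more direct sketch of the same idea (assuming the MultiIndex layout produced by pairwise rolling cov, with the paired column on the inner column level after unstacking): divide the rolling cov(A, B) column by the rolling var(B) column and take the mean.
cov = df.rolling(10).cov(df, pairwise=True).unstack(1)
rolling_beta = cov[('B', 'A')] / cov[('B', 'B')]  # cov(A, B) / var(B)
rolling_beta.mean()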
Related
I want to group by "ts_code" and, over the last N rows for each group and each row, calculate a percentage from the max of one column and the min of another column taken after that max. Specifically,
df
ts_code high low
0 A 20 10
1 A 30 5
2 A 40 20
3 A 50 10
4 A 20 30
5 B 50 10
6 B 30 5
7 B 40 20
8 B 10 10
9 B 20 30
Goal
Below is my expected result
ts_code high low l3_high_low_pct_chg l4_high_low_pct_chg
0 A 20 10 NA NA
1 A 30 5 NA NA
2 A 40 20 0.5 NA
3 A 50 10 0.8 0.8
4 A 20 30 0.8 0.8
5 B 50 10 NA NA
6 B 30 5 NA NA
7 B 40 20 0.9 NA
8 B 10 10 0.75 0.9
9 B 20 30 0.75 0.75
lN_high_low_pct_chg (such as l3_high_low_pct_chg) = 1 - (the min value of the low column after the peak) / (the max value of the high column), computed over the last N rows for each group and each row.
Attempt and problem
df['l3_highest']=df.groupby('ts_code')['high'].transform(lambda x: x.rolling(3).max())
df['l3_lowest']=df.groupby('ts_code')['low'].transform(lambda x: x.rolling(3).min())
df['l3_high_low_pct_chg']=1-df['l3_lowest']/df['l3_highest']
But it fails: for group A at index 2, l3_lowest comes out as 5, not 20, because the rolling min is taken over the whole window rather than only after the peak of high. I don't know how to calculate the percentage after the peak.
For the last 4 rows: at index=8 (group B), the max high is 50 and the min low after that peak is 5, so l4_high_low_pct_chg=0.9; at index=9, the max high is 40 and the min low after the peak is 10, so l4_high_low_pct_chg=0.75.
Another test case:
If the rolling window is 52, then for the hy_code 880912 group at index 1252, l52_high_low_pct_chg would be 0.281131, and for the 880301 group at index 1251, l52_high_low_pct_chg would be 0.321471.
Grouping by 'ts_code' is just an ordinary groupby(). The DataFrame.rolling() function works on single columns, so it's tricky to apply when you need data from multiple columns. You could use "from numpy_ext import rolling_apply as rolling_apply_ext" as in this example: Pandas rolling apply using multiple columns. However, I just created a function that manually splits the dataframe into sub-dataframes of length n, then applies a function to calculate the value. idxmax() finds the index value of the peak of the high column, then we take the min() of the low values that follow. The rest is pretty straightforward.
import numpy as np
import pandas as pd
df = pd.DataFrame([['A', 20, 10],
                   ['A', 30, 5],
                   ['A', 40, 20],
                   ['A', 50, 10],
                   ['A', 20, 30],
                   ['B', 50, 10],
                   ['B', 30, 5],
                   ['B', 40, 20],
                   ['B', 10, 10],
                   ['B', 20, 30]],
                  columns=['ts_code', 'high', 'low'])
def custom_f(df, n):
    s = pd.Series(np.nan, index=df.index)

    def sub_f(df_):
        # index of the peak of 'high', then the lowest 'low' at or after that peak
        high_peak_idx = df_['high'].idxmax()
        min_low_after_peak = df_.loc[high_peak_idx:]['low'].min()
        max_high = df_['high'].max()
        return 1 - min_low_after_peak / max_high

    # slide a window of length n over the group, writing the result at the window's last row
    for i in range(df.shape[0] - n + 1):
        df_ = df.iloc[i:i + n]
        s.iloc[i + n - 1] = sub_f(df_)
    return s
df['l3_high_low_pct_chg'] = df.groupby("ts_code").apply(custom_f, 3).values
df['l4_high_low_pct_chg'] = df.groupby("ts_code").apply(custom_f, 4).values
print(df)
If you prefer to use the rolling function, this method gives the same output:
def rolling_f(rolling_df):
    # rolling() passes one column at a time; use its index to look up the full rows
    df_ = df.loc[rolling_df.index]
    high_peak_idx = df_['high'].idxmax()
    min_low_after_peak = df_.loc[high_peak_idx:]["low"].min()
    max_high = df_['high'].max()
    return 1 - min_low_after_peak / max_high

# groupby().rolling().apply() runs per column; every column holds the same result,
# so take the first one
df['l3_high_low_pct_chg'] = df.groupby("ts_code").rolling(3).apply(rolling_f).values[:, 0]
df['l4_high_low_pct_chg'] = df.groupby("ts_code").rolling(4).apply(rolling_f).values[:, 0]
print(df)
Finally, if you want to do a true rolling window calculation that avoids any index lookup, you can use the numpy_ext package (https://pypi.org/project/numpy-ext/):
from numpy_ext import rolling_apply

def np_ext_f(rolling_df, n):
    def rolling_apply_f(high, low):
        return 1 - low[np.argmax(high):].min() / high.max()
    try:
        return pd.Series(
            rolling_apply(rolling_apply_f, n, rolling_df['high'].values, rolling_df['low'].values),
            index=rolling_df.index,
        )
    except ValueError:
        return pd.Series(np.nan, index=rolling_df.index)

df['l3_high_low_pct_chg'] = df.groupby('ts_code').apply(np_ext_f, n=3).sort_index(level=1).values
df['l4_high_low_pct_chg'] = df.groupby('ts_code').apply(np_ext_f, n=4).sort_index(level=1).values
print(df)
output:
ts_code high low l3_high_low_pct_chg l4_high_low_pct_chg
0 A 20 10 NaN NaN
1 A 30 5 NaN NaN
2 A 40 20 0.50 NaN
3 A 50 10 0.80 0.80
4 A 20 30 0.80 0.80
5 B 50 10 NaN NaN
6 B 30 5 NaN NaN
7 B 40 20 0.90 NaN
8 B 10 10 0.75 0.90
9 B 20 30 0.75 0.75
For large datasets, the speed of these operations becomes an issue. So, to compare the speed of these different methods, I created a timing function:
import time

def timeit(f):
    def timed(*args, **kw):
        ts = time.time()
        result = f(*args, **kw)
        te = time.time()
        print('func:%r took: %2.4f sec' % (f.__name__, te - ts))
        return result
    return timed
Next, let's make a large DataFrame, just by copying the existing dataframe 500 times:
df = pd.concat([df for x in range(500)], axis=0)
df = df.reset_index()
Finally, we run the three tests under a timing function:
@timeit
def method_1():
    df['l52_high_low_pct_chg'] = df.groupby("ts_code").apply(custom_f, 52).values
method_1()

@timeit
def method_2():
    df['l52_high_low_pct_chg'] = df.groupby("ts_code").rolling(52).apply(rolling_f).values[:, 0]
method_2()

@timeit
def method_3():
    df['l52_high_low_pct_chg'] = df.groupby('ts_code').apply(np_ext_f, n=52).sort_index(level=1).values
method_3()
Which gives us this output:
func:'method_1' took: 2.5650 sec
func:'method_2' took: 15.1233 sec
func:'method_3' took: 0.1084 sec
So, the fastest method is to use the numpy_ext, which makes sense because that's optimized for vectorized calculations. The second fastest method is the custom function I wrote, which is somewhat efficient because it does some vectorized calculations while also doing some Pandas lookups. The slowest method by far is using Pandas rolling function.
For my solution, we'll use .groupby("ts_code") and then .rolling to process groups of a certain size, together with a custom function. This custom function takes each group and, instead of applying a function directly to the received values, uses those values to query the original dataframe. Then we can calculate the values as you expect: find the row where the "high" peak is, look at the following rows to find the minimum "low" value, and finally calculate the result using your formula:
def custom_function(group, df):
    # Query the original dataframe using the group values
    group = df.loc[group.values]
    # Calculate your formula
    high_peak_row = group["high"].idxmax()
    min_low_after_peak = group.loc[high_peak_row:, "low"].min()
    return 1 - min_low_after_peak / group.loc[high_peak_row, "high"]

# Reset the index to roll over that column and be able to query the original dataframe
df["l3_high_low_pct_chg"] = df.reset_index().groupby("ts_code")["index"].rolling(3).apply(custom_function, args=(df,)).values
df["l4_high_low_pct_chg"] = df.reset_index().groupby("ts_code")["index"].rolling(4).apply(custom_function, args=(df,)).values
Output:
ts_code high low l3_high_low_pct_chg l4_high_low_pct_chg
0 A 20 10 NaN NaN
1 A 30 5 NaN NaN
2 A 40 20 0.50 NaN
3 A 50 10 0.80 0.80
4 A 20 30 0.80 0.80
5 B 50 10 NaN NaN
6 B 30 5 NaN NaN
7 B 40 20 0.90 NaN
8 B 10 10 0.75 0.90
9 B 20 30 0.75 0.75
We can take this idea further and only group once:
groups = df.reset_index().groupby("ts_code")["index"]
for n in [3, 4]:
    df[f"l{n}_high_low_pct_chg"] = groups.rolling(n).apply(custom_function, args=(df,)).values
So I have a data frame like this--
import pandas as pd
import numpy as np
df = pd.DataFrame([[1,22], [1,23], [1,44], [2, 33], [2, 55]], columns=['id', 'delay'])
id delay
0 1 22
1 1 23
2 1 44
3 2 33
4 2 55
What I am doing is grouping by id and performing a rolling operation on the delay column, like below:
k = [0.1, 0.5, 1]
def f(d):
    d['new_delay'] = pd.Series([0, 0]).append(d['delay']).rolling(window=3).apply(lambda x: np.sum(x * k)).iloc[2:]
    return d
df.groupby(['id']).apply(f)
id delay new_delay
0 1 22 22.0
1 1 23 34.0
2 1 44 57.7
3 2 33 33.0
4 2 55 71.5
It is working just fine, but I am curious whether .apply on a grouped data frame is vectorized or not. Since my dataset is huge, is there a better, vectorized way to do this kind of operation? I am also curious how pandas and numpy achieve vectorized calculation, given that Python is single-threaded and I am running on a CPU.
You can use strides for vectorized rolling with GroupBy.transform:
k = [0.1, 0.5, 1]
def rolling_window(a, window):
    # build a strided (no-copy) view whose rows are the length-`window` windows of `a`
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def f(d):
    # pad with two leading zeros so the first windows exist, then weight each window by k
    return np.sum(rolling_window(np.append([0, 0], d.to_numpy()), 3) * k, axis=1)

df['new_delay'] = df.groupby('id')['delay'].transform(f)
print(df)
id delay new_delay
0 1 22 22.0
1 1 23 34.0
2 1 44 57.7
3 2 33 33.0
4 2 55 71.5
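To make the strides trick concrete, here is a small illustration (reusing rolling_window from above, with the same zero padding and window length) of the view it builds: each row is one length-3 window, and no data is copied.
a = np.append([0, 0], [22, 23, 44])
rolling_window(a, 3)
# array([[ 0,  0, 22],
#        [ 0, 22, 23],
#        [22, 23, 44]])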
Another option would be to use np.convolve() instead:
# Our function
f = lambda x: np.convolve(np.array([1, 0.5, 0.1]), x)[:len(x)]

# Groupby + Transform
df['new_delay'] = df.groupby('id')['delay'].transform(f)
Don't know if it's faster or not.
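As a quick sanity check with the id=1 delays from the question, the truncated convolution reproduces the expected weighted sums (1 times the current delay, plus 0.5 times the previous one, plus 0.1 times the one before that):
import numpy as np
x = np.array([22, 23, 44])
np.convolve(np.array([1, 0.5, 0.1]), x)[:len(x)]
# array([22. , 34. , 57.7])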
Here is one approach with groupby + rolling, applying a custom function compiled using numba:
def func(v):
    k = np.array([0.1, 0.5, 1])
    return np.sum(v * k[len(k) - len(v):])

(
    df.groupby('id')['delay']
      .rolling(3, min_periods=1)
      .apply(func, raw=True, engine='numba')
      .droplevel(0)
)
0 22.0
1 34.0
2 57.7
3 33.0
4 71.5
Name: delay, dtype: float64
I have two dataframes as such:
df_pos = pd.DataFrame(
    data=[[5, 4, 3, 6, 0, 7, 1, 2], [2, 5, 3, 6, 4, 7, 1, 0]]
)
df_value = pd.DataFrame(
    data=[np.arange(10 + i, 50 + i, 5) for i in range(0, 2)]
)
and I want a new dataframe df_final where df_pos gives the position and df_value the corresponding value.
I can do it like this:
df_value_copy = df_value.copy()
for i in range(len(df_pos)):
    df_value_copy.iloc[i, df_pos.iloc[i, :]] = df_value.iloc[i].values
df_final = df_value_copy
However, my dataframes are very large, so this would be way too slow. Therefore I want to know whether there is a smarter way to do it.
We can also try np.put_along_axis to place df_value into df_final based on the df_pos:
df_final = df_value.copy()
np.put_along_axis(
arr=df_final.values, # Destination Arr
indices=df_pos.values, # Indices
values=df_value.values, # Source Values
axis=1 # Along Axis
)
The arguments do not need to be keyword arguments; they can be positional:
df_final = df_value.copy()
np.put_along_axis(df_final.values, df_pos.values, df_value.values, 1)
df_final:
0 1 2 3 4 5 6 7
0 30 40 45 20 15 10 25 35
1 46 41 11 21 31 16 26 36
You can try setting values with numpy advanced indexing:
df_final = df_value.copy()
# broadcast the row indices against each row's column positions
df_final.values[np.arange(len(df_pos))[:, None], df_pos.values] = df_value.values
df_final
0 1 2 3 4 5 6 7
0 30 40 45 20 15 10 25 35
1 46 41 11 21 31 16 26 36
Let's say I have the following dataframe:
df = pd.DataFrame({"quantity": [101, 102, 103], "price":[12, 33, 44]})
price quantity
0 12 101
1 33 102
2 44 103
I have been struggling to figure out how to apply a complex rolling function to it.
For simplicity, let's assume this function f is just the product of quantity and price. In this case, how do I apply this function on a rolling window of size 1, with a scaling parameter, say:
scaling = 10
such that the resulting dataframe would be:
price quantity value
0 12 101 NaN
1 33 102 12120.0
2 44 103 33660.0
with value[i] = price[i-1]*quantity[i-1]*scaling
I have tried:
def f(x, scaling):
    return x['quantity'] * x['price'] * scaling

df.rolling(window=1).apply(lambda x: f(x, scaling))
and
def f(quantity, price, scaling):
    return quantity * price * scaling

df.rolling(window=1).apply(lambda x: f(x['quantity'], x['price'], scaling))
Could you please help me fix this without doing a simple:
df['value'] = df['quantity'].shift(1)*df['price'].shift(1)*scaling
?
Assuming what you want is indeed value[i] = price[i-1] * quantity[i-1] * scaling, then:
scaling = 10
df['value'] = df.shift(1).apply(lambda x: x['quantity'] * x['price'] * scaling, axis=1)
df
quantity price value
0 101 12 NaN
1 102 33 12120.0
2 103 44 33660.0
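If a rolling formulation is preferred over shift, one possible sketch (a variant using the df and scaling defined above) precomputes the product once and rolls a window of 2 over that single column, taking the window's first element, i.e. the previous row's product:
# with raw=False each window is a Series; its first element is the previous row's product
df['value'] = (df['quantity'] * df['price']).rolling(2).apply(lambda w: w.iloc[0], raw=False) * scaling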
I have a data frame as below.
df = pd.DataFrame({'var1': list('a' * 3) + list('b' * 2) + list('c' * 4),
                   'var2': [i for i in range(9)],
                   'var3': [20, 40, 100, 10, 80, 12, 24, 53, 90]})
The end result that I want is the following:
var1 var2 var3 var3_lt_50
0 a 0 20 60
1 a 1 40 60
2 a 2 100 60
3 b 3 10 10
4 b 4 80 10
5 c 5 12 36
6 c 6 24 36
7 c 7 53 36
8 c 8 90 36
I get this result in two steps, through a group-by and a merge, according to the code below:
df = df.merge(df[df.var3 < 50][['var1', 'var3']].groupby('var1', as_index=False).sum().rename(columns={'var3': 'var3_lt_50'}),
              how='left',
              left_on='var1',
              right_on='var1')
Can someone show me a way of doing this kind of boolean filter plus broadcasting of a per-group scalar without the "groupby" + "merge" steps I'm doing today? I want a smoother line of code.
Thanks in advance for input,
/Swepab
You can use groupby.transform, which keeps the shape and index of the transformed variable, so you can just assign the result back to the data frame:
df['var3_lt_50'] = df.groupby('var1').var3.transform(lambda g: g[g < 50].sum())
df
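An equivalent one-liner (a variant of the same idea) masks the values that fail the condition with where and then uses the built-in 'sum' transform, avoiding the Python-level lambda:
# NaNs produced by where() are ignored by the group sums
df['var3_lt_50'] = df['var3'].where(df['var3'] < 50).groupby(df['var1']).transform('sum')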