Compare two columns based on last N rows in a pandas DataFrame - python

I want to group by "ts_code" and, over the last N rows of each group, calculate the percentage change between the max value of one column and the min value of another column after that max. Specifically,
df
ts_code high low
0 A 20 10
1 A 30 5
2 A 40 20
3 A 50 10
4 A 20 30
5 B 20 10
6 B 30 5
7 B 40 20
8 B 50 10
9 B 20 30
Goal
Below is my expected result
ts_code high low l3_high_low_pct_chg l4_high_low_pct_chg
0 A 20 10 NA NA
1 A 30 5 NA NA
2 A 40 20 0.5 NA
3 A 50 10 0.8 0.8
4 A 20 30 0.8 0.8
5 B 50 10 NA NA
6 B 30 5 NA NA
7 B 40 20 0.9 NA
8 B 10 10 0.75 0.9
9 B 20 30 0.75 0.75
ln_high_low_pct_chg (for example l3_high_low_pct_chg) = 1 - (the min value of the low column after the peak) / (the max value of the high column), computed over the last N rows for each group and each row.
Try and problem
df['l3_highest']=df.groupby('ts_code')['high'].transform(lambda x: x.rolling(3).max())
df['l3_lowest']=df.groupby('ts_code')['low'].transform(lambda x: x.rolling(3).min())
df['l3_high_low_pct_chg']=1-df['l3_lowest']/df['l3_highest']
But it fails: for the row at index 2 (the first complete 3-row window), l3_lowest would be 5, not 20. I don't know how to calculate the percentage after the peak.
For the last 4 rows: at index=8 (low=10), the window has high=50 and low=5 after the peak, so l4_high_low_pct_chg=0.9;
at index=9, the window has high=40 and low=10 after the peak, so l4_high_low_pct_chg=0.75.
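The two hand-computed values above can be reproduced with a small helper. This is only a sketch of the formula for a single window: the function name pct_chg_after_peak is made up for illustration, and the inputs are plain lists holding one window of "high" and "low" values taken from the expected-result table (group B).

```python
def pct_chg_after_peak(high, low):
    """1 - (min of low from the peak row onward) / (max of high), for one window."""
    # Position of the first maximum of `high` inside the window
    peak = max(range(len(high)), key=high.__getitem__)
    # The peak row itself is included when searching for the minimum low
    return 1 - min(low[peak:]) / high[peak]


# Window ending at index 8 (rows 5..8 of group B in the expected result)
at_index_8 = pct_chg_after_peak([50, 30, 40, 10], [10, 5, 20, 10])
# Window ending at index 9 (rows 6..9 of group B in the expected result)
at_index_9 = pct_chg_after_peak([30, 40, 10, 20], [5, 20, 10, 30])
print(at_index_8, at_index_9)
```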
Another test data
If the rolling window is 52, for hy_code 880912 group and index 1252, l52_high_low_pct_chg would be 0.281131 and 880301 group and index 1251, l52_high_low_pct_chg would be 0.321471.

Grouping by 'ts_code' is just a trivial groupby() call. DataFrame.rolling() works on single columns, so it's tricky to apply when you need data from multiple columns. You can use "from numpy_ext import rolling_apply as rolling_apply_ext" as in this example: Pandas rolling apply using multiple columns. However, I just created a function that manually slices the dataframe into n-length sub-dataframes, then applies a function to calculate the value. idxmax() finds the index of the peak of the high column, then we take the min() of the low values from the peak row onward. The rest is pretty straightforward.
import numpy as np
import pandas as pd
df = pd.DataFrame([['A', 20, 10],
['A', 30, 5],
['A', 40, 20],
['A', 50, 10],
['A', 20, 30],
['B', 50, 10],
['B', 30, 5],
['B', 40, 20],
['B', 10, 10],
['B', 20, 30]],
columns=['ts_code', 'high', 'low']
)
def custom_f(df, n):
    s = pd.Series(np.nan, index=df.index)
    def sub_f(df_):
        high_peak_idx = df_['high'].idxmax()
        min_low_after_peak = df_.loc[high_peak_idx:]['low'].min()
        max_high = df_['high'].max()
        return 1 - min_low_after_peak / max_high
    for i in range(df.shape[0] - n + 1):
        df_ = df.iloc[i:i + n]
        s.iloc[i + n - 1] = sub_f(df_)
    return s
df['l3_high_low_pct_chg'] = df.groupby("ts_code").apply(custom_f, 3).values
df['l4_high_low_pct_chg'] = df.groupby("ts_code").apply(custom_f, 4).values
print(df)
If you prefer to use the rolling function, this method gives the same output:
def rolling_f(rolling_df):
    df_ = df.loc[rolling_df.index]
    high_peak_idx = df_['high'].idxmax()
    min_low_after_peak = df_.loc[high_peak_idx:]["low"].min()
    max_high = df_['high'].max()
    return 1 - min_low_after_peak / max_high
df['l3_high_low_pct_chg'] = df.groupby("ts_code").rolling(3).apply(rolling_f).values[:, 0]
df['l4_high_low_pct_chg'] = df.groupby("ts_code").rolling(4).apply(rolling_f).values[:, 0]
print(df)
Finally, if you want to do a true rolling window calculation that avoids any index lookup, you can use the numpy_ext package (https://pypi.org/project/numpy-ext/):
from numpy_ext import rolling_apply
def np_ext_f(rolling_df, n):
    def rolling_apply_f(high, low):
        return 1 - low[np.argmax(high):].min() / high.max()
    try:
        return pd.Series(rolling_apply(rolling_apply_f, n, rolling_df['high'].values, rolling_df['low'].values), index=rolling_df.index)
    except ValueError:
        return pd.Series(np.nan, index=rolling_df.index)
df['l3_high_low_pct_chg'] = df.groupby('ts_code').apply(np_ext_f, n=3).sort_index(level=1).values
df['l4_high_low_pct_chg'] = df.groupby('ts_code').apply(np_ext_f, n=4).sort_index(level=1).values
print(df)
output:
ts_code high low l3_high_low_pct_chg l4_high_low_pct_chg
0 A 20 10 NaN NaN
1 A 30 5 NaN NaN
2 A 40 20 0.50 NaN
3 A 50 10 0.80 0.80
4 A 20 30 0.80 0.80
5 B 50 10 NaN NaN
6 B 30 5 NaN NaN
7 B 40 20 0.90 NaN
8 B 10 10 0.75 0.90
9 B 20 30 0.75 0.75
For large datasets, the speed of these operations becomes an issue. So, to compare the speed of these different methods, I created a timing function:
import time
def timeit(f):
    def timed(*args, **kw):
        ts = time.time()
        result = f(*args, **kw)
        te = time.time()
        print('func:%r took: %2.4f sec' % (f.__name__, te - ts))
        return result
    return timed
Next, let's make a large DataFrame, just by copying the existing dataframe 500 times:
df = pd.concat([df for x in range(500)], axis=0)
df = df.reset_index()
Finally, we run the three tests under a timing function:
@timeit
def method_1():
    df['l52_high_low_pct_chg'] = df.groupby("ts_code").apply(custom_f, 52).values
method_1()
@timeit
def method_2():
    df['l52_high_low_pct_chg'] = df.groupby("ts_code").rolling(52).apply(rolling_f).values[:, 0]
method_2()
@timeit
def method_3():
    df['l52_high_low_pct_chg'] = df.groupby('ts_code').apply(np_ext_f, n=52).sort_index(level=1).values
method_3()
Which gives us this output:
func:'method_1' took: 2.5650 sec
func:'method_2' took: 15.1233 sec
func:'method_3' took: 0.1084 sec
So, the fastest method is to use the numpy_ext, which makes sense because that's optimized for vectorized calculations. The second fastest method is the custom function I wrote, which is somewhat efficient because it does some vectorized calculations while also doing some Pandas lookups. The slowest method by far is using Pandas rolling function.
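As a further option that was not part of the timing comparison above, the same per-window formula can be sketched with NumPy's sliding_window_view (available in NumPy 1.20 and later), which avoids the extra numpy_ext dependency. The function name drawdown_after_peak and the masking approach are assumptions of this sketch, and no speed claim is made for it.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def drawdown_after_peak(group, n):
    """Per-row 1 - min(low from peak onward) / max(high) over the last n rows."""
    out = pd.Series(np.nan, index=group.index)
    if len(group) < n:
        return out  # not enough rows for a single full window
    # One row per window, one column per position inside the window
    high = sliding_window_view(group["high"].to_numpy(dtype=float), n)
    low = sliding_window_view(group["low"].to_numpy(dtype=float), n)
    peak_pos = high.argmax(axis=1)  # first maximum, like idxmax()
    # Hide the lows that come before the peak by replacing them with +inf
    before_peak = np.arange(n) < peak_pos[:, None]
    low_after = np.where(before_peak, np.inf, low).min(axis=1)
    out.iloc[n - 1:] = 1 - low_after / high.max(axis=1)
    return out


df = pd.DataFrame({
    "ts_code": list("AAAAABBBBB"),
    "high": [20, 30, 40, 50, 20, 50, 30, 40, 10, 20],
    "low": [10, 5, 20, 10, 30, 10, 5, 20, 10, 30],
})
for n in (3, 4):
    parts = [drawdown_after_peak(g, n) for _, g in df.groupby("ts_code")]
    # Assignment aligns on the original row index
    df[f"l{n}_high_low_pct_chg"] = pd.concat(parts)
print(df)
```

Because each group is processed separately, windows never span two ts_code values, which matches the NaN rows at the start of every group in the output shown earlier.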

For my solution, we'll use .groupby("ts_code") and then .rolling to process windows of a given size with a custom function. Instead of applying a function directly to the received values, this custom function uses those values to query the original dataframe. Then we can calculate the values as you expect: find the row where the "high" peak is, look at that row and the following rows to find the minimum "low" value, and finally calculate the result using your formula:
def custom_function(group, df):
    # Query the original dataframe using the group values
    group = df.loc[group.values]
    # Calculate your formula
    high_peak_row = group["high"].idxmax()
    min_low_after_peak = group.loc[high_peak_row:, "low"].min()
    return 1 - min_low_after_peak / group.loc[high_peak_row, "high"]
# Reset the index to roll over that column and be able query the original dataframe
df["l3_high_low_pct_chg"] = df.reset_index().groupby("ts_code")["index"].rolling(3).apply(custom_function, args=(df,)).values
df["l4_high_low_pct_chg"] = df.reset_index().groupby("ts_code")["index"].rolling(4).apply(custom_function, args=(df,)).values
Output:
ts_code high low l3_high_low_pct_chg l4_high_low_pct_chg
0 A 20 10 NaN NaN
1 A 30 5 NaN NaN
2 A 40 20 0.50 NaN
3 A 50 10 0.80 0.80
4 A 20 30 0.80 0.80
5 B 50 10 NaN NaN
6 B 30 5 NaN NaN
7 B 40 20 0.90 NaN
8 B 10 10 0.75 0.90
9 B 20 30 0.75 0.75
We can take this idea further and only group once:
groups = df.reset_index().groupby("ts_code")["index"]
for n in [3, 4]:
    df[f"l{n}_high_low_pct_chg"] = groups.rolling(n).apply(custom_function, args=(df,)).values

Related

How to efficiently combine dataframe rows based on conditions?

I have the following dataset, which contains a column with the cluster number, the number of observations in that cluster and the maximum value of another variable x grouped by that cluster.
clust = np.arange(0, 10)
obs = np.array([1041, 544, 310, 1648, 1862, 2120, 2916, 5148, 12733, 1])
x_max = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
df = pd.DataFrame(np.c_[clust, obs, x_max], columns=['clust', 'obs', 'x_max'])
clust obs x_max
0 0 1041 10
1 1 544 20
2 2 310 30
3 3 1648 40
4 4 1862 50
5 5 2120 60
6 6 2916 70
7 7 5148 80
8 8 12733 90
9 9 1 100
My task is to combine the clust row values with adjacent rows, so that each cluster contains at least 1000 observations.
My current attempt gets stuck in an infinite loop because the last cluster has only 1 observation.
condition = True
while (condition):
    condition = False
    for i in np.arange(0, len(df) + 1):
        if df.loc[i, 'x'] < 1000:
            df.loc[i, 'id'] = df.loc[i, 'id'] + 1
            df = df.groupby('id', as_index=False).agg({'x': 'sum', 'y': 'max'})
            condition = True
            break
Is there perhaps a more efficient way of doing this? I come from a background in SAS, where such situations would be solved with the if last.row condition, but it seems there is no such condition in Python.
The resulting table should look like this
clust obs x_max
0 1041 10
1 2502 40
2 1862 50
3 2120 60
4 2916 70
5 5148 80
6 12734 100
Here is another way. A vectorized approach is difficult to implement here, but a for loop over an array (or a list) will be faster than using loc at each iteration. Also, it is not good practice to change df within the loop; it can only cause problems.
# define variables
s = 0 #for the sum of observations
gr = [] #for the final grouping values
i = 0 #for the group indices
# loop over observations from an array
for obs in df['obs'].to_numpy():
    s += obs
    gr.append(i)
    # check that the size of the group is big enough (at least 1000)
    if s >= 1000:
        s = 0
        i += 1
# condition to deal with last rows if last group not big enough:
# merge the leftover rows into the previous group
if s != 0:
    gr = [i-1 if val == i else val for val in gr]
# now create your new df
new_df = (
    df.groupby(gr).agg({'obs': 'sum', 'x_max': 'max'})
    .reset_index().rename(columns={'index': 'cluster'})
)
print(new_df)
# cluster obs x_max
# 0 0 1041 10
# 1 1 2502 40
# 2 2 1862 50
# 3 3 2120 60
# 4 4 2916 70
# 5 5 5148 80
# 6 6 12734 100

How to build a pandas dataframe in a recursive function?

I am trying to implement the 'Bottom-Up Computation' algorithm in data mining (https://www.aaai.org/Papers/FLAIRS/2003/Flairs03-050.pdf).
I need to use the 'pandas' library to create a dataframe and provide it to a recursive function, which should also return a dataframe as output. I am only able to return the final column as output, because I am unable to figure out how to dynamically build a data frame.
Here is the python program:
import pandas as pd
def project_data(df, d):
    return df.iloc[:, d]
def select_data(df, d, val):
    col_name = df.columns[d]
    return df[df[col_name] == val]
def remove_first_dim(df):
    return df.iloc[:, 1:]
def slice_data_dim0(df, v):
    df_temp = select_data(df, 0, v)
    return remove_first_dim(df_temp)
def buc(df):
    dims = df.shape[1]
    if dims == 1:
        input_sum = sum(project_data(df, 0) )
        print(input_sum)
    else:
        dim_vals = set(project_data(df, 0).values)
        for dim_val in dim_vals:
            sub_data = slice_data_dim0(df, dim_val)
            buc(sub_data)
        sub_data = remove_first_dim(df)
        buc(sub_data)
data = {'A':[1,1,1,1,2],
'B':[1,1,2,3,1],
'M':[10,20,30,40,50]
}
df = pd.DataFrame(data, columns = ['A','B','M'])
buc(df)
I get the following output:
30
30
40
100
50
50
80
30
40
But what I need is a dataframe, like this (not necessarily formatted, but a data frame):
A B M
0 1 1 30
1 1 2 30
2 1 3 40
3 1 ALL 100
4 2 1 50
5 2 ALL 50
6 ALL 1 80
7 ALL 2 30
8 ALL 3 40
9 ALL ALL 150
How do I achieve this?
Unfortunately pandas doesn't have functionality to do subtotals - so the trick is to just calculate them on the side and concatenate together with original dataframe.
from itertools import combinations
import numpy as np
dim = ['A', 'B']
vals = ['M']
df = pd.concat(
    [df]
    # subtotals:
    + [df.groupby(list(gr), as_index=False)[vals].sum() for r in range(len(dim)-1) for gr in combinations(dim, r+1)]
    # total:
    + [df.groupby(np.zeros(len(df)))[vals].sum()]
)\
    .sort_values(dim)\
    .reset_index(drop=True)\
    .fillna("ALL")
Output:
A B M
0 1 1 10
1 1 1 20
2 1 2 30
3 1 3 40
4 1 ALL 100
5 2 1 50
6 2 ALL 50
7 ALL 1 80
8 ALL 2 30
9 ALL 3 40
10 ALL ALL 150
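The output above still contains the two unaggregated (A=1, B=1) rows, whereas the table asked for in the question has a single row with M=30 for that combination. One possible variation, sketched below under the assumption that the dimension values are integers and that sum is the only aggregate needed, is to aggregate the full (A, B) level as well instead of concatenating the raw frame. The variable names dims, pieces and cube are illustrative.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 1, 2],
                   "B": [1, 1, 2, 3, 1],
                   "M": [10, 20, 30, 40, 50]})
dims = ["A", "B"]

pieces = []
# Aggregate every non-empty subset of the dimensions, starting with the full
# (A, B) level so the combined frame keeps the column order A, B, M
for r in range(len(dims), 0, -1):
    for keys in combinations(dims, r):
        pieces.append(df.groupby(list(keys), as_index=False)["M"].sum())
# Grand total: no dimension columns at all
pieces.append(pd.DataFrame({"M": [df["M"].sum()]}))

# Missing dimensions become NaN, which sort_values places last
cube = pd.concat(pieces, ignore_index=True).sort_values(dims).reset_index(drop=True)
# Replace the NaN placeholders by "ALL" (assumes integer dimension values)
for d in dims:
    cube[d] = [int(v) if pd.notna(v) else "ALL" for v in cube[d]]
print(cube)
```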

Calculating rolling beta in Pandas

I am trying to calculating a rolling beta between two Series in Pandas.
My understanding is that to get the beta, I need to get the covariance matrix and then divide the cells (0, 1) by (1, 1)
So I created a function:
def calc_beta(A, B):
    covariance = np.cov(A, B)
    beta = covariance[0, 1] / covariance[1, 1]
    return beta
If I just wanted to run it for the entire series, I would do:
calc_beta(A, B)
But I'm not sure how to do that on a rolling basis, I tried A.rolling(10).apply(calc_beta, raw=False, B) unsuccessfully.
Then I just tried calculating the cov matrix on a rolling basis, which I can do:
A = pd.Series(np.random.randint(1,101,50))
B = pd.Series(np.random.randint(1,101,50))
df = pd.DataFrame([A, B]).transpose()
df.rolling(10).cov(df, pairwise=True)
Now I have a covariance matrix but how do I perform the beta calc, i.e. (covariance[0,1]/covariance[1,1]) on a rolling basis (and then get the mean).
It might not be the best answer (read: the most compact), but I think this could do the trick. You were actually on the right track to begin with. So, assume you have the two series you gave and make them into a df:
A = pd.Series(np.random.randint(1,101,50))
B = pd.Series(np.random.randint(1,101,50))
df = pd.concat([A, B], axis=1)
Define the beta and the rolling in the following way:
def calc_beta(df):
    np_array = df.values
    s = np_array[:,0]
    m = np_array[:,1]
    covariance = np.cov(s,m)
    beta = covariance[0,1]/covariance[1,1]
    return beta
def rolling(df, period, function, min_periods=None):
    if min_periods is None:
        min_periods = period
    result = pd.Series(np.nan, index=df.index)
    for i in range(1, len(df)+1):
        # rows of the window that ends at position i-1
        df2 = df.iloc[max(i-period, 0):i,:]
        if len(df2) >= min_periods:
            idx = df2.index[-1]
            result[idx] = function(df2)
    return result
And do the following:
calc_beta(df)
which returns 0.15350171576854774
and
rolling(df, 12,calc_beta, min_periods=None)
(Of course, you can choose any period)
which gives
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 0.034478
12 0.019883
13 -0.093483
14 0.140603
15 0.137694
16 -0.004115
17 -0.144355
18 -0.079803
19 -0.023759
20 0.099539
21 0.186670
22 0.199526
23 0.113457
24 0.152232
25 0.149928
26 0.079760
27 0.032097
28 0.056294
29 0.070176
30 0.076560
31 0.013778
32 0.080279
33 0.058864
34 0.006916
35 0.303566
36 0.133580
37 0.238668
38 0.312243
39 0.406835
40 0.337503
41 0.370470
42 0.237132
43 0.253779
44 0.160348
45 0.103425
46 0.261430
47 0.130407
48 0.314028
49 0.322890
dtype: float64
So I appreciate the answer, @Serge, but I felt like I could do it in a slightly cleaner way. This is what I've come up with, which works for me. Let me know if you have any comments on it. Thanks again.
A = pd.Series(np.random.randint(1,101,50))
B = pd.Series(np.random.randint(1,101,50))
df = pd.DataFrame({'A' : A, 'B' : B})
df.rolling(10).cov(df, pairwise=True).drop(['A'], axis=1) \
.unstack(1) \
.droplevel(0, axis=1) \
.apply(lambda row: row['A'] / row['B'], axis=1) \
.mean()
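Since the beta defined in the question is cov(A, B) / var(B), another way to sketch the same rolling calculation is to divide a rolling covariance by a rolling variance. This is only an illustration: the seed and the window length of 10 are arbitrary choices, and it relies on both Rolling.cov and Rolling.var defaulting to ddof=1, which matches np.cov.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the example is repeatable
A = pd.Series(rng.integers(1, 101, 50))
B = pd.Series(rng.integers(1, 101, 50))

window = 10
# beta = cov(A, B) / var(B), evaluated on each rolling window
rolling_beta = A.rolling(window).cov(B) / B.rolling(window).var()
mean_beta = rolling_beta.mean()  # NaN rows of incomplete windows are skipped
print(rolling_beta.tail())
print(mean_beta)
```

The last value can be compared against the calc_beta idea from the question by running np.cov on the final 10 rows of A and B.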

index counter for if conditions python pandas

I want to generate some sort of cycle counter for my DataFrame. One cycle in the example below has a length of 4. The last column shows how it is supposed to look; the rest are my attempts.
My current code looks like this:
import pandas as pd
import numpy as np
l = list(np.linspace(0,10,12))
data = [
('time',l),
('A',[0,5,0.6,-4.8,-0.3,4.9,0.2,-4.7,0.5,5,0.1,-4.6]),
('B',[ 0,300,20,-280,-25,290,30,-270,40,300,-10,-260]),
]
df = pd.DataFrame.from_dict(dict(data))
length = len(df)
df.loc[0,'cycle']=1
df['cycle'] = length/4 +df.loc[0,'cycle']
i = 0
for i in range(0,length):
    df.loc[i,'new_cycle']=i+1
df['want_cycle']= [1,1,1,1,2,2,2,2,3,3,3,3]
print(length)
print(df)
I also need an if condition in the code, so that the value of df['new_cycle'] only increases when the index counter reaches, for example, 4. But so far I have failed to find a proper way to implement such a condition.
Try this with the default range index. Because your dataframe's row index is the default one, a range starting at 0, you can use floor division to calculate your cycle:
df['cycle'] = df.index//4 + 1
Output:
time A B cycle
0 0.000000 0.0 0 1
1 0.909091 5.0 300 1
2 1.818182 0.6 20 1
3 2.727273 -4.8 -280 1
4 3.636364 -0.3 -25 2
5 4.545455 4.9 290 2
6 5.454545 0.2 30 2
7 6.363636 -4.7 -270 2
8 7.272727 0.5 40 3
9 8.181818 5.0 300 3
10 9.090909 0.1 -10 3
11 10.000000 -4.6 -260 3
Now, if your dataframe index isn't the default, then you can use something like this:
df['cycle'] = [df.index.get_loc(i) // 4 + 1 for i in df.index]
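For a non-default index, a positional counter built with np.arange gives the same numbering without looking up each label. The sketch below uses a made-up string index and a cycle length of 4 purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index is not the default 0..n-1 range
df = pd.DataFrame({"A": range(12)}, index=list("abcdefghijkl"))

cycle_length = 4
# Row positions 0..n-1, floor-divided by the cycle length, numbered from 1
df["cycle"] = np.arange(len(df)) // cycle_length + 1
print(df)
```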
I've added just one thing for you: a new variable called new_cycle, which keeps the count you're after.
In the for loop we check whether i is divisible by 4 without a remainder; if it is, we add 1 to the new variable, and then fill the data frame with this value the same way you did.
import pandas as pd
import numpy as np
l = list(np.linspace(0,10,12))
data = [
('time',l),
('A',[0,5,0.6,-4.8,-0.3,4.9,0.2,-4.7,0.5,5,0.1,-4.6]),
('B',[ 0,300,20,-280,-25,290,30,-270,40,300,-10,-260]),
]
df = pd.DataFrame.from_dict(dict(data))
length = len(df)
df.loc[0,'cycle']=1
df['cycle'] = length/4 +df.loc[0,'cycle']
new_cycle = 0
for i in range(0,length):
    if i % 4 == 0:
        new_cycle += 1
    df.loc[i,'new_cycle']= new_cycle
df['want_cycle'] = [1,1,1,1,2,2,2,2,3,3,3,3]
print(length)
print(df)

Pandas sequentially apply function using output of previous value

I want to compute the "carryover" of a series. This computes a value for each row and then adds it to the previously computed value (for the previous row).
How do I do this in pandas?
decay = 0.5
test = pd.DataFrame(np.random.randint(1,10,12),columns = ['val'])
test
val
0 4
1 5
2 7
3 9
4 1
5 1
6 8
7 7
8 3
9 9
10 7
11 2
decayed = []
for i, v in test.iterrows():
    if i == 0:
        decayed.append(v.val)
        continue
    d = decayed[i-1] + v.val*decay
    decayed.append(d)
test['loop_decay'] = decayed
test.head()
val loop_decay
0 4 4.0
1 5 6.5
2 7 10.0
3 9 14.5
4 1 15.0
Consider a vectorized version with cumsum(), where you cumulatively sum (val * decay) and add the very first val.
However, you then need to subtract the very first (val * decay), since cumsum() already includes it:
test['loop_decay'] = test.loc[0, 'val'] + (test['val'] * decay).cumsum() - test.loc[0, 'val'] * decay
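As a quick check of the cumsum() idea, the sketch below rebuilds the sample column from the question and compares the vectorized result with the explicit loop. The values are copied from the printed frame in the question; the column name vec_decay is only illustrative.

```python
import pandas as pd

decay = 0.5
test = pd.DataFrame({"val": [4, 5, 7, 9, 1, 1, 8, 7, 3, 9, 7, 2]})

# Loop from the question: first value as-is, then previous result + val * decay
decayed = []
for i, v in enumerate(test["val"]):
    decayed.append(v if i == 0 else decayed[i - 1] + v * decay)
test["loop_decay"] = decayed

# Vectorized form: cumulative sum of val * decay, corrected for the first row,
# which enters at full weight instead of being multiplied by decay
first = test["val"].iloc[0]
test["vec_decay"] = (test["val"] * decay).cumsum() + first * (1 - decay)

print(test.head())
```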
You can use pd.Series.shift() to create a dataframe with val[i] and val[i-1] and then apply your function along a single axis (1 in this case). Note that this combines each value with the previous raw value only, not with the previously computed result:
# Create a series that shifts the rows by 1
test['val2'] = test.val.shift()
# Set the first row on the shifted series to 0
test.loc[0, 'val2'] = 0
# Apply the decay formula:
test['loop_decay'] = test.apply(lambda x: x['val'] + x['val2'] * 0.5, axis=1)
