Days since rolling 52-week high in pandas DataFrame - python

EDIT: Just when I gave up, I found the answer here:
rmlag = lambda xs: np.argmax(xs[::-1])
df['Open'].rolling(window=5).apply(func=rmlag, raw=True)
I'm wrestling with the following issue: How can I add a column to a DataFrame that, for each row, calculates the number of days (periods) since an n-period high was reached?
Below is a sample DataFrame I'm working with. I've calculated the rolling 5-day high as
df['Rolling 5 Day High'] = df['Open'].rolling(5).max()
How can I calculate, for each row, the number of days since the respective 5-day high was reached? For example, the "Number of Days Since" for the row indexed at 2012-03-16 should be 4 since this row's corresponding rolling 5-day high of 14.88 was reached on 2012-03-12. For the next row at index 2012-03-19, the value should be 3 given this row's rolling 5-day high of 14.79 was reached on 2012-03-14.
Open Rolling 5 Day High
Date
2012-03-12 14.88 NaN
2012-03-13 14.65 NaN
2012-03-14 14.79 NaN
2012-03-15 14.41 NaN
2012-03-16 14.59 14.88
2012-03-19 14.68 14.79
2012-03-20 14.56 14.79
2012-03-21 14.40 14.68
2012-03-22 14.35 14.68
2012-03-23 14.40 14.68
2012-03-26 14.69 14.69
2012-03-27 14.78 14.78
2012-03-28 15.01 15.01
2012-03-29 15.14 15.14
2012-03-30 15.36 15.36
2012-04-02 15.36 15.36
2012-04-03 15.44 15.44
2012-04-04 14.85 15.44
2012-04-05 14.67 15.44
2012-04-09 14.40 15.44
2012-04-10 14.38 15.44
2012-04-11 14.35 14.85
2012-04-12 14.36 14.67
2012-04-13 14.55 14.55
2012-04-16 14.26 14.55
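As a worked illustration of the rolling-apply approach from the EDIT, here is a minimal sketch on the first seven rows of the sample data. The helper name periods_since_high and the column name "Periods Since High" are made up for this example. Reversing each window puts the current row at position 0, so np.argmax on the reversed window gives the number of periods back to the high (with ties, the most recent occurrence wins).

```python
import numpy as np
import pandas as pd


def periods_since_high(window: np.ndarray) -> float:
    # Reverse the window so position 0 is the current row; argmax then
    # counts how many periods back the window maximum sits.
    return float(np.argmax(window[::-1]))


# First seven rows of the sample data from the question.
dates = pd.to_datetime([
    "2012-03-12", "2012-03-13", "2012-03-14", "2012-03-15",
    "2012-03-16", "2012-03-19", "2012-03-20",
])
df = pd.DataFrame(
    {"Open": [14.88, 14.65, 14.79, 14.41, 14.59, 14.68, 14.56]},
    index=dates,
)

df["Rolling 5 Day High"] = df["Open"].rolling(5).max()
# raw=True passes each window as a NumPy array instead of a Series.
df["Periods Since High"] = df["Open"].rolling(5).apply(periods_since_high, raw=True)
print(df)
```

The first four rows are NaN because the 5-period window is not yet full. For a 52-week window on daily data, the same call should work with a larger window (for example 252 trading days, which is an assumption about the calendar being used).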

Related

Is there any way to use Groupby and Rolling together?

I have the following dataframe with daily data:
day value
2017-08-04 0.832
2017-08-05 0.892
2017-08-06 0.847
2017-08-07 0.808
2017-08-08 0.922
2017-08-09 0.894
2017-08-10 2.332
2017-08-11 0.886
2017-08-12 0.973
2017-08-13 0.980
... ...
2022-03-21 0.821
2022-03-22 1.121
2022-03-23 1.064
2022-03-24 1.058
2022-03-25 0.891
2022-03-26 1.010
2022-03-27 1.023
2022-03-28 1.393
2022-03-29 2.013
2022-03-30 3.872
[1700 rows x 1 columns]
I need to generate pooled averages using moving windows. I'll explain it group by group:
The first group must contain the data from 2017-08-04 to 2017-08-08, but also the data from 2018-08-04 to 2018-08-08, and so on until the last year. As shown below:
2017-08-04 0.832
2017-08-05 0.892
2017-08-06 0.847
2017-08-07 0.808
2017-08-08 0.922
---------- -----
2018-08-04 2.125
2018-08-05 2.200
2018-08-06 2.339
2018-08-07 2.035
2018-08-08 1.953
... ...
2020-08-04 0.965
2020-08-05 0.941
2020-08-06 0.917
2020-08-07 0.922
2020-08-08 0.909
---------- -----
2021-08-04 1.348
2021-08-05 1.302
2021-08-06 1.272
2021-08-07 1.258
2021-08-08 1.281
The second group shifts the time window forward by one day. That is, data from 2017-08-05 to 2017-08-09, from 2018-08-05 to 2018-08-09, and so on until the last year. As shown below:
2017-08-05 0.892
2017-08-06 0.847
2017-08-07 0.808
2017-08-08 0.922
2017-08-09 1.823
---------- -----
2018-08-05 2.200
2018-08-06 2.339
2018-08-07 2.035
2018-08-08 1.953
2018-08-09 2.009
... ...
2020-08-05 0.941
2020-08-06 0.917
2020-08-07 0.922
2020-08-08 0.909
2020-08-09 1.934
---------- -----
2021-08-05 1.302
2021-08-06 1.272
2021-08-07 1.258
2021-08-08 1.281
2021-08-09 2.348
The remaining groups follow the same pattern. Finally, I need to form a DataFrame where the index is the central date of each window (so the DataFrame will have one row per day of the year, 365 in total) and the values are the averages of each of the groups described above.
I have been trying to combine Groupby and Rolling, but any solution based on other pandas methods is welcome.
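One possible way to get these pooled averages, sketched here on random stand-in data (the real values are not reproduced), is to group by calendar day first and then roll over the per-day sums and counts. Dividing the rolled sum by the rolled count gives the mean over every observation in the window across all years. The names per_day, rolled and pooled_mean are made up for this sketch, and it makes two simplifying assumptions: windows do not wrap around the year end, and Feb 29 is kept as its own row.

```python
import numpy as np
import pandas as pd

# Stand-in data with the same date range as the question (values are random).
idx = pd.date_range("2017-08-04", "2022-03-30", freq="D")
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.random(len(idx))}, index=idx)

# Pool all years by calendar day ("MM-DD"), keeping sum and count so the
# windowed mean is a true pooled mean rather than a mean of means.
key = df.index.strftime("%m-%d")
per_day = df.groupby(key)["value"].agg(["sum", "count"]).sort_index()

# Centered 5-day window over calendar days; edges stay NaN (no wrap-around).
rolled = per_day.rolling(window=5, center=True).sum()
pooled_mean = rolled["sum"] / rolled["count"]
print(pooled_mean.loc["08-04":"08-10"])
```

The value at "08-06" is then the average of every observation dated 08-04 through 08-08 in any year, which matches the first group described in the question.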

Pandas DataFrame column without labels

I have the following data set in a pandas DataFrame:
print(data)
Result:
Open High Low Close Adj Close Volume
Date
2018-05-25 12.70 12.73 12.48 12.61 12.610000 1469800
2018-05-24 12.99 13.08 12.93 12.98 12.980000 814800
2018-05-23 13.19 13.30 13.06 13.12 13.120000 1417500
2018-05-22 13.46 13.57 13.25 13.27 13.270000 1189000
2018-05-18 13.41 13.44 13.36 13.38 13.380000 986300
2018-05-17 13.19 13.42 13.19 13.40 13.400000 1056200
2018-05-16 13.01 13.14 13.01 13.12 13.120000 481300
If I print just a single column, it is shown together with the date index:
print(data.Low)
Result:
Date
2018-05-25 12.48
2018-05-24 12.93
2018-05-23 13.06
2018-05-22 13.25
2018-05-18 13.36
2018-05-17 13.19
2018-05-16 13.01
Is there a way to slice/print just the prices, without the date index? The output should look like:
12.48
12.93
13.06
13.25
13.36
13.19
13.01
In pandas, a Series or DataFrame always has an index.
You can replace it with the default RangeIndex:
print(data.reset_index(drop=True).Low)
To write only the values to a file, as a column with no index and no header:
data.Low.to_csv(file, index=False, header=False)
To convert the column to a list:
print(data.Low.tolist())
[12.48, 12.93, 13.06, 13.25, 13.36, 13.19, 13.01]
And for a 1-D NumPy array:
print(data.Low.values)
[12.48 12.93 13.06 13.25 13.36 13.19 13.01]
If you want an Mx1 array (a single column):
print(data[['Low']].values)
[[12.48]
[12.93]
[13.06]
[13.25]
[13.36]
[13.19]
[13.01]]

Filter a timeseries with some predefined dates in Pandas

I have this code:
close1 = 'Close'
start = '12/18/2015 00:00:00'
end = '3/1/2016 00:00:00'
freq = '1d0h00min'
datefilter = pd.date_range(start=start, end=end, freq=freq).values
close[close['Datetime'].isin(datefilter)]  # Only dates in the range
But, strangely, some columns come back with NaN:
Datetime ENTA KITE BSTC SAGE AGEN MGNX ESPR FPRX
2015-12-18 31.73 63.38 16.34 56.88 12.24 NaN NaN 38.72
2015-12-21 32.04 63.60 16.26 56.75 12.18 NaN NaN 42.52
What is the reason, and how can this be fixed?
Original :
Datetime ENTA KITE BSTC SAGE AGEN MGNX ESPR FPRX
0 2013-03-21 17.18 29.0 20.75 30.1 11.52 11.52 38.72
1 2013-03-22 16.81 30.53 21.25 30.0 11.64 11.52 39.42
2 2013-03-25 16.83 32.15 20.8 27.59 11.7 11.52 42.52
3 2013-03-26 17.09 29.55 20.6 27.5 11.76 11.52 11.52
EDIT:
It seems related to the hh:mm:ss part of the datetimes when filtering.
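If the time-of-day component is indeed the issue, one thing to try is comparing on the calendar date only. The sketch below uses made-up timestamps (16:00:00) and one made-up row outside the range. It shows that isin matches exact timestamps, so rows with a non-midnight time are dropped unless the column is normalized first. It addresses only the hh:mm:ss mismatch, not necessarily the NaN columns.

```python
import pandas as pd

# Hypothetical frame: same dates as the question, with a 16:00 time part.
close = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2015-12-18 16:00:00", "2015-12-21 16:00:00", "2016-03-02 16:00:00",
    ]),
    "ENTA": [31.73, 32.04, 30.00],  # last value is invented for the example
})

# Daily filter dates, all at midnight.
datefilter = pd.date_range(start="12/18/2015", end="3/1/2016", freq="D")

# Exact-timestamp comparison: 16:00 never equals 00:00, so nothing matches.
exact = close[close["Datetime"].isin(datefilter)]

# Normalizing resets the time to midnight, so the match is by calendar day.
by_day = close[close["Datetime"].dt.normalize().isin(datefilter)]
print(by_day)
```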

How to calculate daily 52-weeks high/low in pandas?

I have a simple dataframe with typical OHLC values. I want to calculate the daily 52-week high/low (or the high/low over some other time range) from it and put the result into a dataframe, so that I can track the daily movement of all record highs/lows.
For example, if the time range is just 3-day, the 3-day high/low would be:
(3-Day High: Maximum 'High' value in the last 3 days)
Out[21]:
Open High Low Close Volume 3-Day-High 3-Day-Low
Date
2015-07-01 273.6 273.6 273.6 273.6 0 273.6 273.6
2015-07-02 276.0 276.0 267.0 268.6 15808300 276.0 267.0
2015-07-03 268.8 269.0 256.6 259.8 20255200 276.0 256.6
2015-07-06 261.0 261.8 223.0 235.0 53285100 276.0 223.0
2015-07-07 237.2 237.8 218.4 222.0 38001700 269.0 218.4
2015-07-08 207.0 219.4 196.0 203.4 48558100 261.8 196.0
2015-07-09 207.4 233.8 204.2 233.6 37835900 237.8 196.0
2015-07-10 235.4 244.8 233.8 239.2 23299900 244.8 196.0
Is there any simple way to do it and how? Thanks guys!
You can use rolling_max and rolling_min (available in older pandas versions):
>>> df["3-Day-High"] = pd.rolling_max(df.High, window=3, min_periods=1)
>>> df["3-Day-Low"] = pd.rolling_min(df.Low, window=3, min_periods=1)
>>> df
Open High Low Close Volume 3-Day-High 3-Day-Low
Date
2015-07-01 273.6 273.6 273.6 273.6 0 273.6 273.6
2015-07-02 276.0 276.0 267.0 268.6 15808300 276.0 267.0
2015-07-03 268.8 269.0 256.6 259.8 20255200 276.0 256.6
2015-07-06 261.0 261.8 223.0 235.0 53285100 276.0 223.0
2015-07-07 237.2 237.8 218.4 222.0 38001700 269.0 218.4
2015-07-08 207.0 219.4 196.0 203.4 48558100 261.8 196.0
2015-07-09 207.4 233.8 204.2 233.6 37835900 237.8 196.0
2015-07-10 235.4 244.8 233.8 239.2 23299900 244.8 196.0
Note that in agreement with your example, this uses the last three recorded days, regardless of the size of any gap between those rows (such as between 07-03 and 07-06).
The above functions have been replaced in recent versions of pandas.
Use the rolling method instead (252 is roughly the number of trading days in 52 weeks):
Series.rolling(min_periods=1, window=252, center=False).max()
You can try this:
three_days=df.index[-3:]
maxHigh=max(df['High'][three_days])
minLow=min(df['Low'][three_days])
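For reference, here is a minimal sketch of the 3-day example using the rolling method available in current pandas, on the High and Low columns from the question. With min_periods=1 the first rows use whatever data is available, which reproduces the table in the question.

```python
import pandas as pd

# High and Low columns from the question's sample data.
df = pd.DataFrame(
    {
        "High": [273.6, 276.0, 269.0, 261.8, 237.8, 219.4, 233.8, 244.8],
        "Low": [273.6, 267.0, 256.6, 223.0, 218.4, 196.0, 204.2, 233.8],
    },
    index=pd.DatetimeIndex(
        ["2015-07-01", "2015-07-02", "2015-07-03", "2015-07-06",
         "2015-07-07", "2015-07-08", "2015-07-09", "2015-07-10"],
        name="Date",
    ),
)

# Row-count window: the last 3 recorded sessions, regardless of date gaps.
df["3-Day-High"] = df["High"].rolling(window=3, min_periods=1).max()
df["3-Day-Low"] = df["Low"].rolling(window=3, min_periods=1).min()
print(df)
```

For a 52-week range, window=252 is a common approximation in trading days. With a sorted DatetimeIndex, a time-based window such as rolling("365D") is another option, though it counts calendar days rather than rows and would give different results across gaps.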

vectorize for-loop to fill Pandas DataFrame

For a financial application, I'm trying to create a DataFrame where each row is a session date value for a particular equity. To get the data, I'm using Pandas Remote Data. So, for example, the features I'm trying to create might be the adjusted closes for the preceding 32 sessions.
This is easy to do in a for-loop, but it takes quite a long time for large feature sets (like going back to 1960 on "ge" and making each row contain the preceding 256 session values). Does anyone see a good way to vectorize this code?
import pandas as pd

def featurize(equity_data, n_sessions, col_label='Adj Close'):
    """
    Generate a raw (unnormalized) feature set from the input data.
    The value at col_label on the given date is taken
    as a feature, and each row contains values for n_sessions
    """
    features = pd.DataFrame(index=equity_data.index[(n_sessions - 1):],
                            columns=range((-n_sessions + 1), 1))
    for i in range(len(features.index)):
        features.iloc[i, :] = equity_data[i:(n_sessions + i)][col_label].values
    return features
I could alternatively just multi-thread this easily, but I'm guessing that pandas does that automatically if I can vectorize it. I mention that mainly because my primary concern is performance. So, if multi-threading is likely to outperform vectorization in any significant way, then I'd prefer that.
Short example of input and output:
>>> eq_data
Open High Low Close Volume Adj Close
Date
2014-01-02 15.42 15.45 15.28 15.44 31528500 14.96
2014-01-03 15.52 15.64 15.30 15.51 46122300 15.02
2014-01-06 15.72 15.76 15.52 15.58 42657600 15.09
2014-01-07 15.73 15.74 15.35 15.38 54476300 14.90
2014-01-08 15.60 15.71 15.51 15.54 48448300 15.05
2014-01-09 15.83 16.02 15.77 15.84 67836500 15.34
2014-01-10 16.01 16.11 15.94 16.07 44984000 15.57
2014-01-13 16.37 16.53 16.08 16.11 57566400 15.61
2014-01-14 16.31 16.43 16.17 16.40 44039200 15.89
2014-01-15 16.37 16.73 16.35 16.70 64118200 16.18
2014-01-16 16.67 16.76 16.56 16.73 38410800 16.21
2014-01-17 16.78 16.78 16.45 16.52 37152100 16.00
2014-01-21 16.64 16.68 16.36 16.41 35597200 15.90
2014-01-22 16.44 16.62 16.37 16.55 28741900 16.03
2014-01-23 16.49 16.53 16.31 16.43 37860800 15.92
2014-01-24 16.19 16.21 15.78 15.83 66023500 15.33
2014-01-27 15.90 15.91 15.52 15.71 51218700 15.22
2014-01-28 15.97 16.01 15.51 15.72 57677500 15.23
2014-01-29 15.48 15.53 15.20 15.26 52241500 14.90
2014-01-30 15.43 15.45 15.18 15.25 32654100 14.89
2014-01-31 15.09 15.10 14.90 14.96 64132600 14.61
>>> features = data.featurize(eq_data, 3)
>>> features
-2 -1 0
Date
2014-01-06 14.96 15.02 15.09
2014-01-07 15.02 15.09 14.9
2014-01-08 15.09 14.9 15.05
2014-01-09 14.9 15.05 15.34
2014-01-10 15.05 15.34 15.57
2014-01-13 15.34 15.57 15.61
2014-01-14 15.57 15.61 15.89
2014-01-15 15.61 15.89 16.18
2014-01-16 15.89 16.18 16.21
2014-01-17 16.18 16.21 16
2014-01-21 16.21 16 15.9
2014-01-22 16 15.9 16.03
2014-01-23 15.9 16.03 15.92
2014-01-24 16.03 15.92 15.33
2014-01-27 15.92 15.33 15.22
2014-01-28 15.33 15.22 15.23
2014-01-29 15.22 15.23 14.9
2014-01-30 15.23 14.9 14.89
2014-01-31 14.9 14.89 14.61
So each row of features is a series of 3 (n_sessions) successive values from the 'Adj Close' column of the eq_data DataFrame.
====================
Improved version based on Primer's answer below:
def featurize(equity_data, n_sessions, column='Adj Close'):
    """
    Generate a raw (unnormalized) feature set from the input data.
    The value at column on the given date is taken
    as a feature, and each row contains values for n_sessions
    >>> timeit.timeit('data.featurize(data.get("ge", dt.date(1960, 1, 1),
    dt.date(2014, 12, 31)), 256)', setup=s, number=1)
    1.6771750450134277
    """
    features = pd.DataFrame(index=equity_data.index[(n_sessions - 1):],
                            columns=list(map(str, range((-n_sessions + 1), 1))),
                            dtype='float64')
    values = equity_data[column].values
    # Fill one column (one lag) at a time instead of one row at a time.
    for i in range(n_sessions - 1):
        features.iloc[:, i] = values[i:(-n_sessions + i + 1)]
    features.iloc[:, n_sessions - 1] = values[(n_sessions - 1):]
    return features
It looks like shift is your friend here and something like this will do:
import numpy as np
import pandas as pd
df = pd.DataFrame({'adj close': np.random.random(10) + 15}, index=pd.date_range(start='2014-01-02', periods=10, freq='B'))
df.index.name = 'date'
df
adj close
date
2014-01-02 15.650
2014-01-03 15.775
2014-01-06 15.750
2014-01-07 15.464
2014-01-08 15.966
2014-01-09 15.475
2014-01-10 15.164
2014-01-13 15.281
2014-01-14 15.568
2014-01-15 15.648
features = pd.DataFrame(data=df['adj close'], index=df.index)
features.columns = ['0']
features['-1'] = df['adj close'].shift()
features['-2'] = df['adj close'].shift(2)
features.dropna(inplace=True)
features
0 -1 -2
date
2014-01-06 15.750 15.775 15.650
2014-01-07 15.464 15.750 15.775
2014-01-08 15.966 15.464 15.750
2014-01-09 15.475 15.966 15.464
2014-01-10 15.164 15.475 15.966
2014-01-13 15.281 15.164 15.475
2014-01-14 15.568 15.281 15.164
2014-01-15 15.648 15.568 15.281
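Another option, not taken from the answers above, is to build all windows at once with NumPy's sliding_window_view (available in NumPy 1.20 and later). This is only a sketch: the function name featurize_windows is made up for this example, and it has not been timed against the versions above.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def featurize_windows(equity_data, n_sessions, column="Adj Close"):
    # Each row of `windows` holds n_sessions consecutive values, oldest first.
    values = equity_data[column].to_numpy()
    windows = sliding_window_view(values, n_sessions)
    # Copy so the result owns its data instead of being a read-only view.
    return pd.DataFrame(
        windows.copy(),
        index=equity_data.index[n_sessions - 1:],
        columns=range(-n_sessions + 1, 1),
    )


# First five 'Adj Close' values from the question's sample input.
eq_data = pd.DataFrame(
    {"Adj Close": [14.96, 15.02, 15.09, 14.90, 15.05]},
    index=pd.DatetimeIndex(
        ["2014-01-02", "2014-01-03", "2014-01-06", "2014-01-07", "2014-01-08"],
        name="Date",
    ),
)
features = featurize_windows(eq_data, 3)
print(features)
```

The column labels run from -(n_sessions - 1) to 0, matching the layout of the expected output in the question.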
