Python dataframe: Standard deviation of last one year of data

I have a dataframe df with 10 years of daily stock market data, with columns Date, Open, Close.
I want to calculate the daily standard deviation of the close price. The procedure is:
Step 1: Calculate the daily interday change of the Close.
Step 2: Calculate the standard deviation of the daily interday change (from Step 1) over the last 1 year of data.
Presently, I have figured out Step 1 as per the code below. The column Interday_Close_change holds the difference between each row's Close and the value one day before.
import pandas as pd

df = pd.DataFrame(data, columns=columns)
df['Close_float'] = df['Close'].astype(float)
df['Interday_Close_change'] = df['Close_float'].diff()  # change vs. previous day
df.fillna('', inplace=True)
Questions:
(a) How do I obtain a column Daily_SD holding the standard deviation of the last 252 days (1 year of trading days)? In Excel, the formula STDEV.S() does this.
(b) The Daily_SD should begin on the 252nd row of the data, since that is the first row with 252 data points to calculate from. How do I realize this?

It looks like you are trying to calculate a rolling standard deviation, with the rolling window consisting of the previous 252 rows.
Pandas has many .rolling() methods, including one for standard deviation:
df['Daily_SD'] = df['Interday_Close_change'].rolling(252).std().shift()
If there are fewer than 252 rows available from which to calculate the standard deviation, the result for the row will be a null value (NaN). Think about whether you really want to apply the .fillna('') method as you are doing: it converts the entire column from a numeric (float) dtype to object dtype.
Without the .shift() method, the current row's value would be included in the calculation. The .shift() method shifts all rolling standard deviation values down by 1 row, so each row's result is the standard deviation of the previous 252 rows, as you want.
With pandas version >= 1.2 you can use this instead:
df['Daily_SD'] = df['Interday_Close_change'].rolling(252, closed='left').std()
The closed='left' parameter excludes the last point in the window from the calculation.
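As a minimal, self-contained sketch (using a hypothetical synthetic price series, since the original data is not shown), both forms produce the same result:

import numpy as np
import pandas as pd

# Hypothetical data: 300 trading days of random-walk closing prices
rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 1, 300).cumsum(), name='Close')

change = close.diff()                                  # Step 1: interday change
sd_shift = change.rolling(252).std().shift()           # std of the previous 252 rows
sd_closed = change.rolling(252, closed='left').std()   # same thing, pandas >= 1.2

# the early rows are NaN in both; the computed values agree afterwards
print(np.allclose(sd_shift, sd_closed, equal_nan=True))  # True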

Related

Pandas rolling quantile using 3 change values for each observation - OHLC "stock" data

I want to calculate rolling (12-period) quantiles from my return values. The problem is that I need to use 3 different return values for each observation in the window of 12 (returns calculated using close-close/close, high-close/close and low-close/close). So for any 12-period rolling window I need to calculate from 36 data points, and I am not sure how to do this within pandas.
The function takes OHLC (open, high, low, close) data as well as the crypto pair and date.
My code right now aggregates the 3 returns and then takes the rolling 12-period quantiles of that aggregate, but ideally I would use all 36 data points rather than averaging my 3 data points per observation and then taking the rolling quantiles; one possible approach is sketched after the code.
import numpy as np
import pandas as pd

def Return_Dist(data, num_hours, periods, date_column="date_time", Pair='pair', columns=['low', 'high', 'close']):
    result = pd.DataFrame()
    obs = num_hours * periods
    data = pd.DataFrame(data)
    for pair in set(data[Pair]):
        data.sort_values(by=[Pair, date_column], inplace=True, ignore_index=True, ascending=True)
        # returns are measured against the first close in the sorted data
        data['Returns_close'] = (data['close'] - data['close'][0]) / data['close'][0]
        data['Returns_high'] = (data['high'] - data['close'][0]) / data['close'][0]
        data['Returns_low'] = (data['low'] - data['close'][0]) / data['close'][0]
        Return_DF = data[['Returns_close', 'Returns_high', 'Returns_low']]
        temp = data.head(data.shape[0] - obs)[columns]
        # pre-create the output columns (the original set each one to NaN individually)
        for col in ['low', 'high', 'close', 'pair', 'date_time',
                    'P_1', 'P_5', 'P_10', 'Median', 'P_90', 'P_95', 'P_99', 'mean', 'std']:
            temp[col] = np.nan
        temp['RC'] = data['Returns_close']
        temp['RH'] = data['Returns_high']
        temp['RL'] = data['Returns_low']  # the original assigned this to 'RH' a second time
        temp['avg_close'] = (temp['RC'] + temp['RH'] + temp['RL']) / 3
        temp['low'] = data['low']
        temp['high'] = data['high']
        temp['close'] = data['close']
        temp['pair'] = data['pair']
        temp['date_time'] = data[date_column]
        temp['P_1'] = temp['avg_close'].rolling(periods).quantile(.01)
        temp['P_5'] = temp['avg_close'].rolling(periods).quantile(.05)
        temp['P_10'] = temp['avg_close'].rolling(periods).quantile(.1)
        temp['Median'] = temp['avg_close'].rolling(periods).quantile(.5)
        temp['P_90'] = temp['avg_close'].rolling(periods).quantile(.9)
        temp['P_95'] = temp['avg_close'].rolling(periods).quantile(.95)
        temp['P_99'] = temp['avg_close'].rolling(periods).quantile(.99)
        temp['mean'] = temp['avg_close'].rolling(periods).mean()
        temp['std'] = temp['avg_close'].rolling(periods).std()
    return temp  # the original snippet ended without an explicit return
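To use all 36 points per window as asked, one option is to keep the three return series as separate columns and flatten each 12-row window before taking the quantile. A minimal sketch, assuming a DataFrame rets with hypothetical columns RC, RH and RL (the names are illustrative, not from the original code):

import numpy as np
import pandas as pd

def rolling_quantile_all(rets, window, q):
    # quantile over window*n_cols values, e.g. 12 rows x 3 return columns = 36 points
    vals = rets.to_numpy()
    out = np.full(len(rets), np.nan)
    for i in range(window - 1, len(rets)):
        out[i] = np.quantile(vals[i - window + 1:i + 1].ravel(), q)
    return pd.Series(out, index=rets.index)

# e.g. the 5th percentile over each 12-row window of RC, RH and RL together:
# p5 = rolling_quantile_all(rets[['RC', 'RH', 'RL']], window=12, q=0.05)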

Calculating the annualized average returns with resample('Y') and without

I'm trying to calculate the annualized return of Amazon stock and can't figure out the main difference between the following approaches:
df = pdr.get_data_yahoo('amzn', datetime(2015, 1, 1), datetime(2019, 12, 31))['Adj Close']
1) df.pct_change().mean()*252
Result = 0.400
2) df.resample('Y').last().pct_change().mean()
Result = 0.472
Why is there a difference of about 7%?
After reading the docs for these functions, I'd like to go through an example of resampling time series data for a better understanding.
With the resample method, the price column of the DataFrame is grouped by a certain time span; here 'Y' indicates resampling by year, and with last() we get the price value at the end of each year.
data.resample('Y').last()
(output screenshot: Step 1)
Next, with pct_change() we calculate the percentage change between each row and the previous row, i.e. between the year-end prices we got before.
data.resample('Y').last().pct_change()
(output screenshot: Step 2)
Finally, we calculate the mean percentage change over the entire time period using the mean() method:
data.resample('Y').last().pct_change().mean()
(output screenshot: Step 3)
As @itprorh66 already wrote, the main difference between the two approaches is simply when the mean is taken: the first averages daily returns and scales them by 252 trading days, while the second averages year-over-year returns, and the two differ because returns compound.
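A minimal sketch of the two computations on a hypothetical synthetic price series (standing in for the Yahoo download, which is not reproducible here):

import numpy as np
import pandas as pd

# Hypothetical daily prices over 5 years
idx = pd.bdate_range('2015-01-01', '2019-12-31')
rng = np.random.default_rng(1)
prices = pd.Series(100 * np.exp(rng.normal(0.0005, 0.02, len(idx)).cumsum()), index=idx)

ann_from_daily = prices.pct_change().mean() * 252                   # mean daily return, scaled
ann_from_yearly = prices.resample('Y').last().pct_change().mean()   # mean of year-end returns
print(ann_from_daily, ann_from_yearly)  # generally unequal, due to compounding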

Calculate the percentage difference between two specific rows in Python pandas

The problem is that, for a specific row I choose, I want to calculate what percentage that row's value is away from the intended output's mean (already calculated from another column), i.e. what percentage it deviates from that mean.
I want to run each item individually, like so:
Below I made a dataframe column to store the result
df['pct difference'] = ((df['tertiary_tag']['price'] - df['ab roller']['mean'])/df['ab roller']['mean']) * 100
For example, say the mean is 10 and I know the item is 8 dollars; figure out what percentage away from the mean that product is, and return that number for each item of the dataset.
Keep in mind, the problem is not solved by a loop; I am sure pandas has something more practical than pct_change to calculate this percentage difference.
I also thought about making a column serve as an index, so I can access any row within the columns through that index and perform whatever operation I want, for example calculating the percentage difference between two rows. Maybe by indexing on the price column?
df = df.set_index(['price'])
df.index = pd.to_datetime(df.index)
def percent_diff(df, row1, row2):
    """
    Calculate the percentage difference between two specific rows in a dataframe.
    """
    return (df.loc[row1, 'value'] - df.loc[row2, 'value']) / df.loc[row2, 'value'] * 100
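A hypothetical usage example (the 'value' column name and the integer row labels are illustrative):

import pandas as pd

df = pd.DataFrame({'value': [8, 10]})
print(percent_diff(df, 0, 1))  # (8 - 10) / 10 * 100 = -20.0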

How to compare elements of one dataframe to another?

I have a dataframe, called PORResult, of daily temperatures where rows are years and each column is a day (121 rows x 365 columns). I also have an array, called Percentile_90, of a threshold temperature for each day (length=365). For every day of every year in the PORResult dataframe, I want to find out whether the value for that day is higher than the value for that day in the Percentile_90 array, and store the results in a new dataframe, called Count (121 rows x 365 columns). To start, the Count dataframe is full of zeros; if the daily value in PORResult is greater than the daily value in Percentile_90, I want to change the daily value in Count to 1.
This is what I'm starting with:
for i in range(len(PORResult)):
    if PORResult.loc[i] > Percentile_90[i]:
        CountResult[i] += 1
But when I try this I get KeyError: 0. What else can I try?
Depending on your data structure, I think
CountResult = PORResult.gt(Percentile_90, axis=1).astype(int)
should do the trick; axis=1 aligns the 365-long threshold array with the day columns. Generally, the toolset provided in pandas is sufficient that for-looping over a dataframe is unnecessary (as well as remarkably inefficient).
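A minimal sketch with hypothetical toy data (3 years x 4 days rather than 121 x 365):

import numpy as np
import pandas as pd

PORResult = pd.DataFrame([[10, 20, 15, 30],
                          [12, 18, 22, 28],
                          [ 9, 25, 14, 35]])   # rows = years, columns = days
Percentile_90 = np.array([11, 19, 16, 31])     # one threshold per day

# 1 where a day's value exceeds that day's threshold, else 0
CountResult = PORResult.gt(Percentile_90, axis=1).astype(int)
print(CountResult)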

Difference of 2 columns in pandas dataframe with some given conditions

I have a sheet like this. I need to calculate the absolute value of "CURRENT HIGH" - "PREVIOUS DAY CLOSE PRICE" for a particular "INSTRUMENT" and "SYMBOL".
So I used the .shift(1) function of the pandas dataframe to create a lagged close column, and then I subtract the lagged close from the current HIGH; but that also subtracts across 2 different "INSTRUMENT" and "SYMBOL" values. When a new SYMBOL or INSTRUMENT appears, I want the first row to be NULL instead of the difference between the current HIGH and the lagged close.
What should I do?
I believe you need this, if all days are consecutive per group (with .abs() added for the absolute difference the question asks for):
df['new'] = df['HIGH'].sub(df.groupby(['INSTRUMENT','SYMBOL'])['CLOSE'].shift()).abs()
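A minimal sketch with hypothetical toy data, showing that the first row of each (INSTRUMENT, SYMBOL) group stays NaN:

import pandas as pd

df = pd.DataFrame({
    'INSTRUMENT': ['FUT', 'FUT', 'FUT', 'OPT'],
    'SYMBOL':     ['AAA', 'AAA', 'BBB', 'AAA'],
    'HIGH':       [105.0, 108.0, 52.0, 11.0],
    'CLOSE':      [102.0, 107.0, 51.0, 10.5],
})

# the lagged close restarts within each group, so each group's first row is NaN
df['new'] = df['HIGH'].sub(df.groupby(['INSTRUMENT', 'SYMBOL'])['CLOSE'].shift()).abs()
print(df)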
