Pandas pct_change with moving average - python

I would like to use pandas' pct_change to compute the rate of change between each value and the previous rolling average (before that value). Here is what I mean:
If I have:
import pandas as pd
df = pd.DataFrame({'data': [1, 2, 3, 7]})
I would expect to get, for window size of 2:
0 NaN
1 NaN
2 1
3 1.8
, because roc(3, avg(1, 2)) = (3-1.5)/1.5 = 1 and same calculation goes for 1.8. using pct_change with periods parameter just skips previous nth entries, it doesn't do the job.
Any ideas on how to do this in an elegant pandas way for any window size?

here is one way to do it, using rolling and shift
df['avg']=df.rolling(2).mean()
df['poc'] = (df['data'] - df['avg'].shift(+1))/ df['avg'].shift(+1)
df.drop(columns='avg')
data poc
0 1 NaN
1 2 NaN
2 3 1.0
3 7 1.8

Related

How can I forward fill a dataframe column where the limit of rows filled is based on the value of a cell in another column?

So I am trying to forward fill a column with the limit being the value in another column. This is the code I run and I get this error message.
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['NM'] = [0, 0, 1, np.nan, np.nan, np.nan, 0]
df['length'] = [0, 0, 2, 0, 0, 0, 0]
print(df)
NM length
0 0.0 0
1 0.0 0
2 1.0 2
3 NaN 0
4 NaN 0
5 NaN 0
6 0.0 0
df['NM'] = df['NM'].fillna(method='ffill', limit=df['length'])
print(df)
ValueError: Limit must be an integer
The dataframe I want looks like this:
NM length
0 0.0 0
1 0.0 0
2 1.0 2
3 1.0 0
4 1.0 0
5 NaN 0
6 0.0 0
Thanks in advance for any help you can provide!
I do not think you want to use ffill for this instance.
Rather I would recommend filtering to where length is greater than 0, then iterating through those rows to enter the NM value from that row in the next n+length rows.
for row in df.loc[df.length.gt(0)].reset_index().to_dict(orient='records'):
df.loc[row['index']+1:row['index']+row['length'], 'NM'] = row['NM']
To better break this down:
Get rows containing change information be sure to include the index.
df.loc[df.length.gt(0)].reset_index().to_dict(orient='records')
iterate through them... I prefer to_dict for performance reasons on large datasets. It is a habit.
sets NM rows to the NM value of your row with the defined length.
You can first group the dataframe by the length column before filling. Only issue is that for the first group in your example limit would be 0 which causes an error, so we can make sure it's at least 1 with max. This might cause unexpected results if there are nan values before the first non-zero value in length but from the given data it's not clear if that can happen.
# make groups
m = df.length.gt(0).cumsum()
# fill the column
df["NM"] = df.groupby(m).apply(
lambda f: f.NM.fillna(
method="ffill",
limit=max(f.length.iloc[0], 1))
).values

std composed with rolling and shift is not working on pandas (Python)

Consider the following dataframe
df = pd.DataFrame()
df['Amount'] = [13,17,31,48]
I want to calculate for each row the std of the previous 2 values of the column "Amount". For example:
For the third row, the value should be the std of 17 and 13 (which is 2).
For the fourth row, the value should be the std of 31 and 17 (which is 7).
This is what I did:
df['std previous 2 weeks'] = df['Amount'].shift(1).rolling(2).std()
But this is not working. I thought that my problem was an index problem. But this works perfectly with the sum method.
df['total amount of previous 2 weeks'] = df['Amount'].shift(1).rolling(2).sum()
PD : I know that this can be done in some other ways but I want to know the reason for why this does not work (and how to fix it).
You could shift after rolling.std. Also the degrees of freedom is 1 by default, it seems you want it to be 0.
df['Stdev'] = df['Amount'].rolling(2).std(ddof=0).shift()
Output:
Amount Stdev
0 13 NaN
1 17 NaN
2 31 2.0
3 48 7.0

how to get a continuous rolling mean in pandas?

Looking to get a continuous rolling mean of a dataframe.
df looks like this
index price
0 4
1 6
2 10
3 12
looking to get a continuous rolling of price
the goal is to have it look this a moving mean of all the prices.
index price mean
0 4 4
1 6 5
2 10 6.67
3 12 8
thank you in advance!
you can use expanding:
df['mean'] = df.price.expanding().mean()
df
index price mean
0 4 4.000000
1 6 5.000000
2 10 6.666667
3 12 8.000000
Welcome to SO: Hopefully people will soon remember you from prior SO posts, such as this one.
From your example, it seems that #Allen has given you code that produces the answer in your table. That said, this isn't exactly the same as a "rolling" mean. The expanding() function Allen uses is taking the mean of the first row divided by n (which is 1), then adding rows 1 and 2 and dividing by n (which is now 2), and so on, so that the last row is (4+6+10+12)/4 = 8.
This last number could be the answer if the window you want for the rolling mean is 4, since that would indicate that you want a mean of 4 observations. However, if you keep moving forward with a window size 4, and start including rows 5, 6, 7... then the answer from expanding() might differ from what you want. In effect, expanding() is recording the mean of the entire series (price in this case) as though it were receiving a new piece of data at each row. "Rolling", on the other hand, gives you a result from an aggregation of some window size.
Here's another option for doing rolling calculations: the rolling() method in a pandas.dataframe.
In your case, you would do:
df['rolling_mean'] = df.price.rolling(4).mean()
df
index price rolling_mean
0 4 nan
1 6 nan
2 10 nan
3 12 8.000000
Those nans are a result of the windowing: until there are enough rows to calculate the mean, the result is nan. You could set a smaller window:
df['rolling_mean'] = df.price.rolling(2).mean()
df
index price rolling_mean
0 4 nan
1 6 5.000000
2 10 8.000000
3 12 11.00000
This shows the reduction in the nan entries as well as the rolling function: it 's only averaging within the size-two window you provided. That results in a different df['rolling_mean'] value than when using df.price.expanding().
Note: you can get rid of the nan by using .rolling(2, min_periods = 1), which tells the function the minimum number of defined values within a window that have to be present to calculate a result.

Pandas: Dynamically replace NaN values with the average of previous and next non-missing values

I have a dataframe df with NaN values and I want to dynamically replace them with the average values of previous and next non-missing values.
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 NaN -2.027325 1.533582
4 NaN NaN 0.461821
5 -0.788073 NaN NaN
6 -0.916080 -0.612343 NaN
7 -0.887858 1.033826 NaN
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
For example, A[3] is NaN so its value should be (-0.120211-0.788073)/2 = -0.454142. A[4] then should be (-0.454142-0.788073)/2 = -0.621108.
Therefore, the result dataframe should look like:
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.454142 -2.027325 1.533582
4 -0.621108 -1.319834 0.461821
5 -0.788073 -0.966089 -1.260202
6 -0.916080 -0.612343 -2.121213
7 -0.887858 1.033826 -2.551718
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
Is this a good way to deal with the missing values? I can't simply replace them by the average values of each column because my data is time-series and tends to increase over time. (The initial value may be $0 and final value might be $100000, so the average is $50000 which can be much bigger/smaller than the NaN values).
You can try to understand your logic behind the average that is Geometric progression
s=df.isnull().cumsum()
t1=df[(s==1).shift(-1).fillna(False)].stack().reset_index(level=0,drop=True)
t2=df.lookup(s.idxmax()+1,s.idxmax().index)
df.fillna(t1/(2**s)+t2*(1-0.5**s)*2/2)
Out[212]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.454142 -2.027325 1.533582
4 -0.621107 -1.319834 0.461821
5 -0.788073 -0.966089 -1.260201
6 -0.916080 -0.612343 -2.121213
7 -0.887858 1.033826 -2.551718
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
Explanation:
1st NaN x/2+y/2=1st
2nd NaN 1st/2+y/2=2nd
3rd NaN 2nd/2+y/2+3rd
Then x/(2**n)+y(1-(1/2)**n)/(1-1/2), this is the key
Got a simular Problem.
The following code worked for me.
def fill_nan_with_mean_from_prev_and_next(df):
NANrows = pd.isnull(df).any(1).nonzero()[0]
null_df = df.isnull()
for row in NANrows :
for colum in range(0,df.shape[1]):
if(null_df.iloc[row][colum]):
df.iloc[row][colum] = (df.iloc[row-1][colum]+df.iloc[row-1][colum])/2
return df
maybe it is helps someone too.
as Ben.T has mentioned above
if you have another group of NaN in the same column
you can consider this lazy solution :)
for column in df:
for ind,row in df[[column]].iterrows():
if ~np.isnan(row[column]):
previous = row[column]
else:
indx = ind + 1
while np.isnan(df.loc[indx,column]):
indx += 1
next = df.loc[indx,column]
previous = df[column][ind] = (previous + next)/2

Rolling mean with customized window with Pandas

Is there a way to customize the window of the rolling_mean function?
data
1
2
3
4
5
6
7
8
Let's say the window is set to 2, that is to calculate the average of 2 datapoints before and after the obervation including the observation. Say the 3rd observation. In this case, we will have (1+2+3+4+5)/5 = 3. So on and so forth.
Compute the usual rolling mean with a forward (or backward) window and then use the shift method to re-center it as you wish.
data_mean = pd.rolling_mean(data, window=5).shift(-2)
If you want to average over 2 datapoints before and after the observation (for a total of 5 datapoints) then make the window=5.
For example,
import pandas as pd
data = pd.Series(range(1, 9))
data_mean = pd.rolling_mean(data, window=5).shift(-2)
print(data_mean)
yields
0 NaN
1 NaN
2 3
3 4
4 5
5 6
6 NaN
7 NaN
dtype: float64
As kadee points out, if you wish to center the rolling mean, then use
pd.rolling_mean(data, window=5, center=True)
For more current version of Pandas (please see 0.23.4 documentation https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html), you don't have rolling_mean anymore. Instead, you will use
DataFrame.rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None)
For your example, it will be:
df.rolling(5,center=True).mean()

Categories

Resources