Looking to get a continuous rolling mean of a dataframe.
df looks like this
index price
0 4
1 6
2 10
3 12
looking to get a continuous rolling of price
the goal is to have it look this a moving mean of all the prices.
index price mean
0 4 4
1 6 5
2 10 6.67
3 12 8
thank you in advance!
you can use expanding:
df['mean'] = df.price.expanding().mean()
df
index price mean
0 4 4.000000
1 6 5.000000
2 10 6.666667
3 12 8.000000
Welcome to SO: Hopefully people will soon remember you from prior SO posts, such as this one.
From your example, it seems that #Allen has given you code that produces the answer in your table. That said, this isn't exactly the same as a "rolling" mean. The expanding() function Allen uses is taking the mean of the first row divided by n (which is 1), then adding rows 1 and 2 and dividing by n (which is now 2), and so on, so that the last row is (4+6+10+12)/4 = 8.
This last number could be the answer if the window you want for the rolling mean is 4, since that would indicate that you want a mean of 4 observations. However, if you keep moving forward with a window size 4, and start including rows 5, 6, 7... then the answer from expanding() might differ from what you want. In effect, expanding() is recording the mean of the entire series (price in this case) as though it were receiving a new piece of data at each row. "Rolling", on the other hand, gives you a result from an aggregation of some window size.
Here's another option for doing rolling calculations: the rolling() method in a pandas.dataframe.
In your case, you would do:
df['rolling_mean'] = df.price.rolling(4).mean()
df
index price rolling_mean
0 4 nan
1 6 nan
2 10 nan
3 12 8.000000
Those nans are a result of the windowing: until there are enough rows to calculate the mean, the result is nan. You could set a smaller window:
df['rolling_mean'] = df.price.rolling(2).mean()
df
index price rolling_mean
0 4 nan
1 6 5.000000
2 10 8.000000
3 12 11.00000
This shows the reduction in the nan entries as well as the rolling function: it 's only averaging within the size-two window you provided. That results in a different df['rolling_mean'] value than when using df.price.expanding().
Note: you can get rid of the nan by using .rolling(2, min_periods = 1), which tells the function the minimum number of defined values within a window that have to be present to calculate a result.
Related
I would like to use pandas' pct_change to compute the rate of change between each value and the previous rolling average (before that value). Here is what I mean:
If I have:
import pandas as pd
df = pd.DataFrame({'data': [1, 2, 3, 7]})
I would expect to get, for window size of 2:
0 NaN
1 NaN
2 1
3 1.8
, because roc(3, avg(1, 2)) = (3-1.5)/1.5 = 1 and same calculation goes for 1.8. using pct_change with periods parameter just skips previous nth entries, it doesn't do the job.
Any ideas on how to do this in an elegant pandas way for any window size?
here is one way to do it, using rolling and shift
df['avg']=df.rolling(2).mean()
df['poc'] = (df['data'] - df['avg'].shift(+1))/ df['avg'].shift(+1)
df.drop(columns='avg')
data poc
0 1 NaN
1 2 NaN
2 3 1.0
3 7 1.8
Consider the following dataframe
df = pd.DataFrame()
df['Amount'] = [13,17,31,48]
I want to calculate for each row the std of the previous 2 values of the column "Amount". For example:
For the third row, the value should be the std of 17 and 13 (which is 2).
For the fourth row, the value should be the std of 31 and 17 (which is 7).
This is what I did:
df['std previous 2 weeks'] = df['Amount'].shift(1).rolling(2).std()
But this is not working. I thought that my problem was an index problem. But this works perfectly with the sum method.
df['total amount of previous 2 weeks'] = df['Amount'].shift(1).rolling(2).sum()
PD : I know that this can be done in some other ways but I want to know the reason for why this does not work (and how to fix it).
You could shift after rolling.std. Also the degrees of freedom is 1 by default, it seems you want it to be 0.
df['Stdev'] = df['Amount'].rolling(2).std(ddof=0).shift()
Output:
Amount Stdev
0 13 NaN
1 17 NaN
2 31 2.0
3 48 7.0
I need to calculate the percentile using a specific algorithm that is not available using either pandas.rank() or numpy.rank().
The ranking algorithm is calculated as follows for a series:
rank[i] = (# of values in series less than i + # of values equal to
i*0.5)/total # of values
so if I had the following series
s=pd.Series(data=[5,3,8,1,9,4,14,12,6,1,1,4,15])
For the first element, 5 there are 6 values less than 5 and no other values = to 5. The rank would be (6+0x0.5)/13 or 6/13.
For the fourth element (1) it would be (0+ 2x0.5)/13 or 1/13.
How could I calculate this without using a loop? I assume a combination of s.apply and/or s.where() but can't figure it out and have tried searching. I am looking to apply to the entire series at once, with the result being a series with the percentile ranks.
You could use numpy broadcasting. First convert s to a numpy column array. Then use numpy broadcasting to count the number of items less than i for each i. Then count the number of items equal to i for each i (note that we need to subract 1 since, i is equal to i itself). Finally add them and build a Series:
tmp = s.to_numpy()
s_col = tmp[:, None]
less_than_i_count = (s_col>tmp).sum(axis=1)
eq_to_i_count = ((s_col==tmp).sum(axis=1) - 1) * 0.5
ranks = pd.Series((less_than_i_count + eq_to_i_count) / len(s), index=s.index)
Output:
0 0.461538
1 0.230769
2 0.615385
3 0.076923
4 0.692308
5 0.346154
6 0.846154
7 0.769231
8 0.538462
9 0.076923
10 0.076923
11 0.346154
12 0.923077
dtype: float64
I wanna implement a calculate method like a simple scenario:
value computed as the sum of daily data during the previous N days (set N = 3 in the following example)
Dataframe df: (df.index is 'date')
date value
20140718 1
20140721 2
20140722 3
20140723 4
20140724 5
20140725 6
20140728 7
......
to do calculating like:
date value new
20140718 1 0
20140721 2 0
20140722 3 0
20140723 4 6 (3+2+1)
20140724 5 9 (4+3+2)
20140725 6 12 (5+4+3)
20140728 7 15 (6+5+4)
......
Now I have done this using for cycle like:
df['value']=[0]*len(df)
for idx in df.index
loc=df.index.get_loc(idx)
if((loc-N)>=0):
tmp=df.ix[df.index[loc-3]:df.index[loc-1]]
sum=tmp['value'].sum()
else:
sum=0
df['new'].ix(idx)=sum
But, when the length of dataframe or the value of N is very long / big, these calculating will be very slow....How I can implement this faster using a function or by other ways?
Besides, if the scenario is more complex? how ? Thanks.
Since you want the sum of the previous three excluding the current one, you can use rolling_apply over the a window of four and sum up all but the last value.
new = rolling_apply(df, 4, lambda x:sum(x[:-1]), min_periods=4)
This is the same as shifting afterwards with a window of three:
new = rolling_apply(df, 3, sum, min_periods=3).shift()
Then
df["new"] = new["value"].fillna(0)
Is there a way to customize the window of the rolling_mean function?
data
1
2
3
4
5
6
7
8
Let's say the window is set to 2, that is to calculate the average of 2 datapoints before and after the obervation including the observation. Say the 3rd observation. In this case, we will have (1+2+3+4+5)/5 = 3. So on and so forth.
Compute the usual rolling mean with a forward (or backward) window and then use the shift method to re-center it as you wish.
data_mean = pd.rolling_mean(data, window=5).shift(-2)
If you want to average over 2 datapoints before and after the observation (for a total of 5 datapoints) then make the window=5.
For example,
import pandas as pd
data = pd.Series(range(1, 9))
data_mean = pd.rolling_mean(data, window=5).shift(-2)
print(data_mean)
yields
0 NaN
1 NaN
2 3
3 4
4 5
5 6
6 NaN
7 NaN
dtype: float64
As kadee points out, if you wish to center the rolling mean, then use
pd.rolling_mean(data, window=5, center=True)
For more current version of Pandas (please see 0.23.4 documentation https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html), you don't have rolling_mean anymore. Instead, you will use
DataFrame.rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None)
For your example, it will be:
df.rolling(5,center=True).mean()