Calculate a rolling window weighted average on a Pandas column - python

I'm relatively new to python, and have been trying to calculate some simple rolling weighted averages across rows in a pandas data frame. I have a dataframe of observations df and a dataframe of weights w. I create a new dataframe to hold the inner-product between these two sets of values, dot.
As w is of smaller dimension, I use a for loop to calculate the weighted average by row, of the leading rows equal to the length of w.
More clearly, my set-up is as follows:
import pandas as pd
df = pd.DataFrame([0,1,2,3,4,5,6,7,8], index = range(0,9))
w = pd.DataFrame([0.1,0.25,0.5], index = range(0,3))
dot = pd.DataFrame(0, columns = ['dot'], index = df.index)
for i in range(0, len(df)):
    dot.loc[i] = sum(df.iloc[max(1,(i-3)):i].values * w.iloc[-min(3,(i-1)):4].values)
I would expect the result to be as follows (i.e. when i = 4)
dot.loc[4] = sum(df.iloc[max(1,(4-3)):4].values * w.iloc[-min(3,(4-1)):4].values)
print(dot.loc[4])  # 2.1
However, when running the for loop above, I receive the error:
ValueError: operands could not be broadcast together with shapes (0,1) (2,1)
Which is where I get confused - I think it must have to do with how I call i into iloc, as I don't receive shape errors when I manually calculate it, as in the example with 4 above. However, looking at other examples and documentation, I don't see why that's the case... Any help is appreciated.

Your first problem is that you are trying to multiply arrays of two different sizes. For example, when i=0 the different parts of your for loop return
df.iloc[max(1,(0-3)):0].values.shape
# (0,1)
w.iloc[-min(3,(0-1)):4].values.shape
# (2,1)
That is exactly the mismatch in the error you are getting. The easiest way I can think of to make the arrays compatible is to pad your dataframe with leading zeros, using concatenation:
df2 = pd.concat([pd.Series([0,0]),df], ignore_index=True)
df2
0
0 0
1 0
2 0
3 1
4 2
5 3
6 4
7 5
8 6
9 7
10 8
You can now use your for loop, with some minor tweaking:
for i in range(len(df)):
    dot.loc[i] = sum(df2.iloc[i:i+3].values * w.values)
A nicer way might be the one JohnE suggested: use the rolling and apply functions built into pandas, thereby getting rid of the for loop:
import numpy as np
df2.rolling(3, min_periods=3).apply(lambda x: np.dot(x, w[0]))
0
0 NaN
1 NaN
2 0.00
3 0.50
4 1.25
5 2.10
6 2.95
7 3.80
8 4.65
9 5.50
10 6.35
You can also drop the first two padding rows and reset the index
df2.rolling(3,min_periods=3).apply(lambda x: np.dot(x,w)).drop([0,1]).reset_index(drop=True)
0
0 0.00
1 0.50
2 1.25
3 2.10
4 2.95
5 3.80
6 4.65
7 5.50
8 6.35
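A side note for newer pandas versions: rolling.apply can hand each window to the function as a raw numpy array, which is usually faster, and passing the weights as a flat array keeps np.dot returning a scalar. A minimal sketch, assuming the df2 and w defined above:
import numpy as np
# raw=True passes a plain ndarray instead of a Series (pandas 0.23+),
# and w[0].values is the flat weight vector, so np.dot returns a scalar
df2.rolling(3).apply(lambda x: np.dot(x, w[0].values), raw=True)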

Related

Create a custom percentile rank for a pandas series

I need to calculate the percentile using a specific algorithm that is not available using either pandas.rank() or numpy.rank().
The ranking algorithm is calculated as follows for a series:
rank[i] = (# of values less than i + 0.5 x # of other values equal to i) / (total # of values)
so if I had the following series
s=pd.Series(data=[5,3,8,1,9,4,14,12,6,1,1,4,15])
For the first element, 5, there are 6 values less than 5 and no other values equal to 5, so the rank would be (6 + 0x0.5)/13, or 6/13.
For the fourth element, 1, it would be (0 + 2x0.5)/13, or 1/13.
How could I calculate this without using a loop? I assume a combination of s.apply and/or s.where() but can't figure it out and have tried searching. I am looking to apply to the entire series at once, with the result being a series with the percentile ranks.
You could use numpy broadcasting. First convert s to a numpy column array. Then use broadcasting to count, for each value i, the number of items less than i, and then the number of items equal to i (note that we need to subtract 1, since i is always equal to itself). Finally, add the counts, divide by the length, and build a Series:
tmp = s.to_numpy()
s_col = tmp[:, None]                                      # column vector for broadcasting
less_than_i_count = (s_col > tmp).sum(axis=1)             # values strictly less than i
eq_to_i_count = ((s_col == tmp).sum(axis=1) - 1) * 0.5    # ties, excluding i itself
ranks = pd.Series((less_than_i_count + eq_to_i_count) / len(s), index=s.index)
Output:
0 0.461538
1 0.230769
2 0.615385
3 0.076923
4 0.692308
5 0.346154
6 0.846154
7 0.769231
8 0.538462
9 0.076923
10 0.076923
11 0.346154
12 0.923077
dtype: float64
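For this particular definition you can also cross-check against pandas' built-in rank: the 'average' method assigns each value (# of values less than it) + 0.5 x (# of ties, including itself) + 0.5, so subtracting 1 and dividing by the length should reproduce the formula above. A small sketch to verify:
# average rank minus 1 equals (# less than i) + 0.5 * (# of other ties)
ranks_alt = (s.rank(method='average') - 1) / len(s)
print(ranks_alt.equals(ranks))  # expected: True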

I need to create a dataframe where values reference previous rows

I am just starting to use python and I'm trying to learn some of the general things about it. As I was playing around with it, I wanted to see if I could make a dataframe that shows a starting number which is compounded by a return. Sorry if this description doesn't make much sense, but I basically want a dataframe x rows long that shows me:
number * (1 + return)^(row number) in each row
so for example, say the number is 10 and the return is 10%; I would like the dataframe to give me the series
1 11
2 12.1
3 13.3
4 14.6
5 ...
6 ...
Thanks so much in advance!
Let us try:
import pandas as pd
import numpy as np
val = 10
det = 0.1
n = 4
out = val * ((1 + det) ** np.arange(n))
s = pd.Series(out)
s
Out:
0 10.00
1 11.00
2 12.10
3 13.31
dtype: float64
Notice that here I am using the index from 0, since 1.1**0 yields the original value.
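If you want the series to start at 11, as in the question's expected output, shift the exponent range by one, e.g.:
# begin at period 1, so the first row is val*(1+det)**1 == 11.0
out = val * ((1 + det) ** np.arange(1, n + 1))
s = pd.Series(out, index=range(1, n + 1))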
I think this does what you want:
df = pd.DataFrame({'returns': list(range(1, 10))})
df.index = df.index + 1
df.returns = df.returns.apply(lambda x: (10 * (1.1**x)))
print(df)
Out:
returns
1 11.000000
2 12.100000
3 13.310000
4 14.641000
5 16.105100
6 17.715610
7 19.487171
8 21.435888
9 23.579477
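Since the whole column is just a power of 1.1, the same numbers can also be produced in one vectorized step, skipping the row-wise apply; a sketch of the equivalent:
import numpy as np
# same values as the apply version above, computed directly
df = pd.DataFrame({'returns': 10 * 1.1 ** np.arange(1, 10)}, index=range(1, 10))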

Populate column in dataframe based on iat values

import pandas as pd
lookup={'Tier':[1,2,3,4],'Terr.1':[0.88,0.83,1.04,1.33],'Terr.2':[0.78,0.82,0.91,1.15],'Terr.3':[0.92,0.98,1.09,1.33],'Terr.4':[1.39,1.49,1.66,1.96],'Terr.5':[1.17,1.24,1.39,1.68]}
df={'Tier':[1,1,2,2,3,2,4,4,4,1],'Territory':[1,3,4,5,4,4,2,1,1,2]}
df=pd.DataFrame(df)
lookup=pd.DataFrame(lookup)
lookup contains the lookup values, and df contains the data being fed into iat.
I get the correct values when I print(lookup.iat[tier,terr]). However, when I try to set those values in a new column, it either runs endlessly or, in this simple test case, just copies one value 10 times.
for i in df["Tier"]:
tier=i-1
for j in df["Territory"]:
terr=j
#print(lookup.iat[tier,terr])
df["Rate"]=lookup.iat[tier,terr]
Any thoughts on a possible better solution?
You can use apply() after some modification to your lookup dataframe:
lookup = lookup.rename(columns={i: i.split('.')[-1] for i in lookup.columns}).set_index('Tier')
lookup.columns = lookup.columns.astype(int)
df['Rate'] = df.apply(lambda x: lookup.loc[x['Tier'],x['Territory']], axis=1)
Returns:
Tier Territory Rate
0 1 1 0.88
1 1 3 0.92
2 2 4 1.49
3 2 5 1.24
4 3 4 1.66
5 2 4 1.49
6 4 2 1.15
7 4 1 1.33
8 4 1 1.33
9 1 2 0.78
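Because Tier and Territory are small consecutive integers, another option is plain positional NumPy indexing into the reshaped lookup (a sketch, assuming the lookup modified as above, with Tier as the index and integer columns 1 through 5):
# row Tier-1, column Territory-1 in the underlying (4, 5) array
df['Rate'] = lookup.to_numpy()[df['Tier'] - 1, df['Territory'] - 1]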
Once lookup is modified a bit, the same way as @rahlf23 did, plus using stack, you can merge both dataframes, such as:
df['Rate'] = df.merge(
    lookup.rename(columns={i: int(i.split('.')[-1])
                           for i in lookup.columns if 'Terr' in i})
          .set_index('Tier').stack()
          .reset_index().rename(columns={'level_1': 'Territory'}),
    how='left')[0]
If you have a big dataframe df, this should be faster than using apply with loc.
Also, if a (Tier, Territory) pair in df does not exist in lookup, this method won't throw an error; the Rate will simply be NaN.

Replace a column values with its mean of groups in dataframe

I have a DataFrame as
Page Line y
1 2 3.2
1 2 6.1
1 3 7.1
2 4 8.5
2 4 9.1
I have to replace column y with values of its mean in groups. I can do that grouping using one column using this code.
df['y'] = df['y'].groupby(df['Page'], group_keys=False).transform('mean')
I am trying to replace the values of y by mean of groups by 'Page' and 'Line'. Something like this,
Page Line y
1 2 4.65
1 2 4.65
1 3 7.1
2 4 8.8
2 4 8.8
I have searched through a lot of answers on this site but couldn't find this application. Using python3 with pandas.
You need to pass a list of column names to groupby's by parameter:
by : mapping, function, label, or list of labels
Used to determine the groups for the groupby. If by is a function, it's called on each value of the object's index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series' values are first aligned; see .align() method). If an ndarray is passed, the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted as a (single) key.
df['y'] = df.groupby(['Page', 'Line'])['y'].transform('mean')
print (df)
Page Line y
0 1 2 4.65
1 1 2 4.65
2 1 3 7.10
3 2 4 8.80
4 2 4 8.80
Your solution should be changed to this syntactic sugar - pass a list of Series:
df['y'] = df['y'].groupby([df['Page'], df['Line']]).transform('mean')
So you want this:
df['y'] = df.groupby(['Page', 'Line']).transform('mean')
@jezrael's approach is idiomatic. Use that approach!
np.bincount and pd.factorize
This should be pretty fast. However, this is a specialized solution to a specific problem and doesn't do well if you want to generalize. Also, if you need to deal with np.nan, you'd have to incorporate more logic.
f, u = pd.factorize(list(zip(df.Page, df.Line)))
df.assign(y=(np.bincount(f, df.y) / np.bincount(f))[f])
Page Line y
0 1 2 4.65
1 1 2 4.65
2 1 3 7.10
3 2 4 8.80
4 2 4 8.80
What this is doing is:
pd.factorize identifies the groups
np.bincount(f) is counting how many items in each group
np.bincount(f, df.y) is summing the values of column y within each group
(np.bincount(f, df.y) / np.bincount(f)) finds the mean
(np.bincount(f, df.y) / np.bincount(f))[f] slices to present the same length as the original array
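To make those steps concrete on this example, here is a quick trace of the intermediates (values worked out from the small frame above):
f, u = pd.factorize(list(zip(df.Page, df.Line)))
print(f)                                   # [0 0 1 2 2]: group id per row
print(np.bincount(f))                      # [2 1 2]: rows per group
print(np.bincount(f, df.y))                # [ 9.3  7.1 17.6]: sum of y per group
print((np.bincount(f, df.y) / np.bincount(f))[f])  # [4.65 4.65 7.1 8.8 8.8]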
set_index and map
This is me being silly. Don't use this.
cols = ['Page', 'Line']
df.assign(y=df.set_index(cols).index.map(df.groupby(cols).y.mean()))
Page Line y
0 1 2 4.65
1 1 2 4.65
2 1 3 7.10
3 2 4 8.80
4 2 4 8.80
Use groupby (without transform) to get a mapping of tuple -> mean
Use set_index as a convenient way to make pandas produce the tuples
Index objects have a map method, so we'll use that

Rolling mean with customized window with Pandas

Is there a way to customize the window of the rolling_mean function?
data
1
2
3
4
5
6
7
8
Let's say the window is set to 2, that is, we calculate the average of the 2 datapoints before and after the observation, including the observation itself. Take the 3rd observation: in this case, we will have (1+2+3+4+5)/5 = 3. And so on.
Compute the usual rolling mean with a forward (or backward) window and then use the shift method to re-center it as you wish.
data_mean = pd.rolling_mean(data, window=5).shift(-2)
If you want to average over 2 datapoints before and after the observation (for a total of 5 datapoints) then make the window=5.
For example,
import pandas as pd
data = pd.Series(range(1, 9))
data_mean = pd.rolling_mean(data, window=5).shift(-2)
print(data_mean)
yields
0 NaN
1 NaN
2 3.0
3 4.0
4 5.0
5 6.0
6 NaN
7 NaN
dtype: float64
As kadee points out, if you wish to center the rolling mean, then use
pd.rolling_mean(data, window=5, center=True)
For more current versions of pandas (see the 0.23.4 documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html), rolling_mean no longer exists. Instead, you will use
DataFrame.rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None)
For your example, it will be:
df.rolling(5,center=True).mean()
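If you would rather get a value at the edges (computed from however many points are available) instead of NaN, min_periods relaxes the window requirement, e.g.:
# allow partial windows at both ends, so no NaNs are produced
df.rolling(5, center=True, min_periods=1).mean()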
