Pandas rolling mean with offset by (not continuously available) date - python

given the following example table
Index
Date
Weekday
Value
1
05/12/2022
2
10
2
06/12/2022
3
20
3
07/12/2022
4
40
4
09/12/2022
6
10
5
10/12/2022
7
60
6
11/12/2022
1
30
7
12/12/2022
2
40
8
13/12/2022
3
50
9
14/12/2022
4
60
10
16/12/2022
6
20
11
17/12/2022
7
50
12
18/12/2022
1
10
13
20/12/2022
3
20
14
21/12/2022
4
10
15
22/12/2022
5
40
I want to calculate a rolling average of the last three observations (at least) a week ago. I cannot use .shift as some dates are randomly missing, and .shift would therefore not produce a reliable output.
Desired output example for last three rows in the example dataset:
Index 13: Avg of indices 8, 7, 6 = (30+40+50) / 3 = 40
Index 14: Avg of indices 9, 8, 7 = (40+50+60) / 3 = 50
Index 15: Avg of indices 9, 8, 7 = (40+50+60) / 3 = 50
What would be a working solution for this? Thanks!
Thanks!

MOSTLY inspired from #Aidis you could, make his solution an apply:
df['mean']=df.apply(lambda y: df["Value"][df['Date'] <= y['Date'] - pd.Timedelta(1, "W")].tail(3).mean(), axis=1)
or spliting the data at each call which may run faster if you have lots of data (to be tested):
df['mean']=df.apply(lambda y: df.loc[:y.name, "Value"][ df.loc[:y.name,'Date'] <= y['Date'] - pd.Timedelta(1, "W")].tail(3).mean(), axis=1)
which returns:
Index Date Weekday Value mean
0 1 2022-12-05 2 10 NaN
1 2 2022-12-06 3 20 NaN
2 3 2022-12-07 4 40 NaN
3 4 2022-12-09 6 10 NaN
4 5 2022-12-10 7 60 NaN
5 6 2022-12-11 1 30 NaN
6 7 2022-12-12 2 40 10.000000
7 8 2022-12-13 3 50 15.000000
8 9 2022-12-14 4 60 23.333333
9 10 2022-12-16 6 20 23.333333
10 11 2022-12-17 7 50 36.666667
11 12 2022-12-18 1 10 33.333333
12 13 2022-12-20 3 20 40.000000
13 14 2022-12-21 4 10 50.000000
14 15 2022-12-22 5 40 50.000000

I apologize for this ugly code. But it seems to work:
df = df.set_index("Index")
df['Date'] = df['Date'].astype("datetime64")
for id in df.index:
dfs = df.loc[:id]
mean = dfs["Value"][dfs['Date'] <= dfs.iloc[-1]['Date'] - pd.Timedelta(1, "W")].tail(3).mean()
print(id, mean)
Result:
1 nan
2 10.0
3 15.0
4 23.333333333333332
5 23.333333333333332
6 36.666666666666664
7 33.333333333333336
8 33.333333333333336
9 33.333333333333336
10 33.333333333333336
11 33.333333333333336
12 33.333333333333336
13 40.0
14 50.0
15 50.0

Related

Smoothing the values of the column with the mean or median of the members belong to the same bin

Dataframe df is given using df = pd.DataFrame({'A':[10, 15, 12, 19, 11, 20, 25]}) as:
A
0 10
1 15
2 12
3 19
4 11
5 20
6 25
The result of equal-frequency binning of the column A by using df['B'] = pd.cut(df['A'], bins = 2)
is the following:
A B
0 10 (9.985, 17.5]
1 15 (9.985, 17.5]
2 12 (9.985, 17.5]
3 19 (17.5, 25.0]
4 11 (9.985, 17.5]
5 20 (17.5, 25.0]
6 25 (17.5, 25.0]
How can we simply have the mean (or median) of the elements of the column A which belong to the same bin instead of the intervals in column B? That is, how to have the column B as the following:
A B
0 10 12
1 15 12
2 12 12
3 19 21.33
4 11 12
5 20 21.33
6 25 21.33
You could group by the bins and transform to the mean or median, depending on what you want:
>>> df.groupby(pd.cut(df['A'], bins=2)).transform('mean')
A
0 12.000000
1 12.000000
2 12.000000
3 21.333333
4 12.000000
5 21.333333
6 21.333333
>>> df.groupby(pd.cut(df['A'], bins=2)).transform('median')
A
0 11.5
1 11.5
2 11.5
3 20.0
4 11.5
5 20.0
6 20.0
For the interval midpoint check #sacuL’s (now deleted?) answer. It was pd.cut(df['A'], bins=2).map(lambda itv: itv.mid), or the slightly faster:
>>> df['B'] = pd.IntervalIndex(pd.cut(df['A'], bins=2)).mid
>>> df
A B
0 10 13.7425
1 15 13.7425
2 12 13.7425
3 19 21.2500
4 11 13.7425
5 20 21.2500
6 25 21.2500

Is it possible to use pandas.DataFrame.rolling window period 5 with skipping today's value in that

need to get output in column <5_Days_Up> like the image.
Date price 5_Days_Up
20-May-21 1
21-May-21 2
22-May-21 4
23-May-21 5
24-May-21 6 5
25-May-21 7 6
26-May-21 8 7
27-May-21 9 8
28-May-21 10 9
29-May-21 11 10
30-May-21 12 11
31-May-21 13 12
1-Jun-21 14 13
2-Jun-21 15 14
But, got the output like this.
Date price 5_Days_Up
20-May-21 1
21-May-21 2
22-May-21 4
23-May-21 5
24-May-21 6 6
25-May-21 7 7
26-May-21 8 8
27-May-21 9 9
28-May-21 10 10
29-May-21 11 11
30-May-21 12 12
31-May-21 13 13
1-Jun-21 14 14
2-Jun-21 15 15
Here, in python pandas, I am using
df['5_Days_Up'] = df['price'].rolling(window=5).max()
is there a way to get the maximum value of the last 5 periods after skipping the today's price using the same rolling() or any other?
Your data has only 4 (instead of 5) previous entries before the entry on date 24-May-21 with price equals 6 (owing to there is no price equals 3 in the data sample.) Therefore, your first entry to show non-NaN value will start from the date 25-May-21 with price equals 7.
To include up to the previous entry (exclude current entry), you can use the parameter closed='left' to achieve this:
df['5_Days_Up'] = df['price'].rolling(window=5, closed='left').max()
Result:
Date price 5_Days_Up
0 20-May-21 1 NaN
1 21-May-21 2 NaN
2 22-May-21 4 NaN
3 23-May-21 5 NaN
4 24-May-21 6 NaN
5 25-May-21 7 6.0
6 26-May-21 8 7.0
7 27-May-21 9 8.0
8 28-May-21 10 9.0
9 29-May-21 11 10.0
10 30-May-21 12 11.0
11 31-May-21 13 12.0
12 1-Jun-21 14 13.0
13 2-Jun-21 15 14.0

Calculate 14-day rolling average on data with two hierarchies

I am trying to calculate the 14 day rolling average for retail data with multiple different hierarchies. The 'Store' dataframe looks like this:
Store | Inventory-Small | Inventory-Medium | Date | Purchases-Small | Purchases-Medium
-----------------------------------------------------------------------------------------------------
A 12 14 4/1/20 2 4
B 13 16 4/1/20 4 5
A 15 10 4/2/20 2 6
C 20 15 4/1/20 4 5
A 16 8 4/3/20 2 4
A 16 10 4/4/20 4 5
A 15 12 4/5/20 1 3
C 18 14 4/2/20 2 3
C 19 12 4/3/20 6 9
B 14 14 4/2/20 3 8
What I am trying to do is create a rolling 14 day average for the purchases column for each store. The data extends well past 14 day (over 8 months), and I would like the first 14 days of each store to be a simple average. My issue is that while I can group by 'Store' and create a column, I don't know how to also group by dates. I've tried:
Store.sort_values(['Store','Date'],ascending=(False,False))
Store['Rolling_Purchase_S'] = Store.groupby(['Store','Date'], as_index=False)['Purchases-Small'].transform(lambda x: x.rolling(14, 1).mean())
and also:
Store['Rolling_Purchase_S'] = Store.groupby('Store')['Purchases-Small'].transform(lambda x: x.rolling(14, 1).mean())
The first one doesn't seem to have any effect while the second one doesn't group by dates so I end up with a rolling average in the wrong order. Any advice would be much appreciated!
Edit: The following lines worked, thanks to all for the feedback.
Store.sort_values(['Store','Date'],ascending=(False,True),inplace=True)
Store['Rolling_Purchase_S'] = Store.groupby('Store')['Purchases-Small'].transform(lambda x: x.rolling(14, 1).mean())
I believe it's working fine as long as you sort inplace and remove 'Date' from groupby:
Store.sort_values(['Store','Date'], ascending=(False,False), inplace=True)
Store['Rolling_Purchase_S'] = Store.groupby(['Store'])['Purchases-Small'].transform(lambda x: x.rolling(14, 1).mean())
Output:
print(Store[['Store', 'Date', 'Purchases-Small', 'Rolling_Purchase_S']])
Store Date Purchases-Small Rolling_Purchase_S
8 C 2020-04-03 6 6.000000
7 C 2020-04-02 2 4.000000
3 C 2020-04-01 4 4.000000
9 B 2020-04-02 3 3.000000
1 B 2020-04-01 4 3.500000
6 A 2020-04-05 1 1.000000
5 A 2020-04-04 4 2.500000
4 A 2020-04-03 2 2.333333
2 A 2020-04-02 2 2.250000
0 A 2020-04-01 2 2.200000
There's a couple things to point out but you're very close. I changed a couple items to make it work. To illustrate the sorting, I modified the first date for each store to 4/10
If your dates aren't datetime, the sorting may not work as expected. also, you need inplace=True to make the change permanent
input = '''Store Inventory-Small Inventory-Medium Date Purchases-Small Purchases-Medium
A 12 14 4/10/20 2 4
B 13 16 4/10/20 4 5
A 15 10 4/2/20 2 6
C 20 15 4/10/20 4 5
A 16 8 4/3/20 2 4
A 16 10 4/4/20 4 5
A 15 12 4/5/20 1 3
C 18 14 4/2/20 2 3
C 19 12 4/3/20 6 9
B 14 14 4/2/20 3 8'''
Store = pd.read_csv(io.StringIO(input), sep=' ')
Store['Date'] = pd.to_datetime(Store['Date'])
Store.sort_values(['Store', 'Date'],ascending=(True, True), inplace=True)
Store['Rolling_Purchase_S'] = Store.groupby('Store')['Purchases-Small'].transform(lambda x: x.rolling(2, 1).mean())
Also, I changed the timeperiod to 2 because I had little data to work with, it'll need to go back to 14 for your dataset
Output:
In [137]: Store
Out[137]:
Store Inventory-Small Inventory-Medium Date Purchases-Small Purchases-Medium Rolling_Purchase_S
2 A 15 10 2020-04-02 2 6 2.000
4 A 16 8 2020-04-03 2 4 2.000
5 A 16 10 2020-04-04 4 5 3.000
6 A 15 12 2020-04-05 1 3 2.500
0 A 12 14 2020-04-10 2 4 1.500
9 B 14 14 2020-04-02 3 8 3.000
1 B 13 16 2020-04-10 4 5 3.500
7 C 18 14 2020-04-02 2 3 2.000
8 C 19 12 2020-04-03 6 9 4.000
3 C 20 15 2020-04-10 4 5 5.000

Change rolling window size as it rolls

I have a pandas data frame like this;
>df
leg speed
1 10
1 11
1 12
1 13
1 12
1 15
1 19
1 12
2 10
2 10
2 12
2 15
2 19
2 11
: :
I want to make a new column roll_speed where it takes a rolling average speed of the last 5 positions. But I wanna put more detailed condition in it.
Groupby leg(it doesn't take into account the speed of the rows in different leg.
I want the rolling window to be changed from 1 to 5 maximum according to the available rows. For example in leg == 1, in the first row there is only one row to calculate, so the rolling speed should be 10/1 = 10. For the second row, there are only two rows available for calculation, the rolling speed should be (10+11)/2 = 10.5.
leg speed roll_speed
1 10 10 # 10/1
1 11 10.5 # (10+11)/2
1 12 11 # (10+11+12)/3
1 13 11.5 # (10+11+12+13)/4
1 12 11.6 # (10+11+12+13+12)/5
1 15 12.6 # (11+12+13+12+15)/5
1 19 14.2 # (12+13+12+15+19)/5
1 12 14.2 # (13+12+15+19+12)/5
2 10 10 # 10/1
2 10 10 # (10+10)/2
2 12 10.7 # (10+10+12)/3
2 15 11.8 # (10+10+12+15)/4
2 19 13.2 # (10+10+12+15+19)/5
2 11 13.4 # (10+12+15+19+11)/5
: :
My attempt:
df['roll_speed'] = df.speed.rolling(5).mean()
But it just returns NA for rows where less than five rows are available for calculation. How should I solve this problem? Thank you for any help!
Set the parameter min_periods to 1
df['roll_speed'] = df.groupby('leg').speed.rolling(5, min_periods = 1).mean()\
.round(1).reset_index(drop = True)
leg speed roll_speed
0 1 10 10.0
1 1 11 10.5
2 1 12 11.0
3 1 13 11.5
4 1 12 11.6
5 1 15 12.6
6 1 19 14.2
7 1 12 14.2
8 2 10 10.0
9 2 10 10.0
10 2 12 10.7
11 2 15 11.8
12 2 19 13.2
13 2 11 13.4
Using rolling(5) will get you your results for all but the first 4 occurences of each group. We can fill the remaining values with the expanding mean:
(df.groupby('leg').speed.rolling(5)
.mean().fillna(df.groupby('leg').speed.expanding().mean())
).reset_index(drop=True)
0 10.000000
1 10.500000
2 11.000000
3 11.500000
4 11.600000
5 12.600000
6 14.200000
7 14.200000
8 10.000000
9 10.000000
10 10.666667
11 11.750000
12 13.200000
13 13.400000
Name: speed, dtype: float64

Call a Nan Value and change to a number in python

I have a DataFrame, say df, which looks like this:
id property_type1 property_type pro
1 Condominium 2 2
2 Farm 14 14
3 House 7 7
4 Lots/Land 15 15
5 Mobile/Manufactured Home 13 13
6 Multi-Family 8 8
7 Townhouse 11 11
8 Single Family 10 10
9 Apt/Condo 1 1
10 Home 7 7
11 NaN 29 NaN
Now, I need the pro column to have the same value as the property_type column, whenever the property_type1 column has a NaN value. This is how it should be:
id property_type1 property_type pro
1 Condominium 2 2
2 Farm 14 14
3 House 7 7
4 Lots/Land 15 15
5 Mobile/Manufactured Home 13 13
6 Multi-Family 8 8
7 Townhouse 11 11
8 Single Family 10 10
9 Apt/Condo 1 1
10 Home 7 7
11 NaN 29 29
That is, in line 11, where property_type1 is NaN, the value of the pro column becomes 29, which is the value of property_type. How can I do this?
ix is deprecated, don't use it.
Option 1
I'd do this with np.where -
df = df.assign(pro=np.where(df.pro.isnull(), df.property_type, df.pro))
df
id property_type1 property_type pro
0 1 Condominium 2 2.0
1 2 Farm 14 14.0
2 3 House 7 7.0
3 4 Lots/Land 15 15.0
4 5 Mobile/Manufactured Home 13 13.0
5 6 Multi-Family 8 8.0
6 7 Townhouse 11 11.0
7 8 Single Family 10 10.0
8 9 Apt/Condo 1 1.0
9 10 Home 7 7.0
10 11 NaN 29 29.0
Option 2
If you want to perform in-place assignment, use loc -
m = df.pro.isnull()
df.loc[m, 'pro'] = df.loc[m, 'property_type']
df
id property_type1 property_type pro
0 1 Condominium 2 2.0
1 2 Farm 14 14.0
2 3 House 7 7.0
3 4 Lots/Land 15 15.0
4 5 Mobile/Manufactured Home 13 13.0
5 6 Multi-Family 8 8.0
6 7 Townhouse 11 11.0
7 8 Single Family 10 10.0
8 9 Apt/Condo 1 1.0
9 10 Home 7 7.0
10 11 NaN 29 29.0
Compute the mask just once, and use it to index multiple times, which should be more efficient than computing it twice.
Find the rows where property_type1 column is NaN, and for those rows: assign the property_type values to the pro column.
df.ix[df.property_type1.isnull(), 'pro'] = df.ix[df.property_type1.isnull(), 'property_type']

Categories

Resources