Given the following example table:
Index  Date        Weekday  Value
1      05/12/2022  2        10
2      06/12/2022  3        20
3      07/12/2022  4        40
4      09/12/2022  6        10
5      10/12/2022  7        60
6      11/12/2022  1        30
7      12/12/2022  2        40
8      13/12/2022  3        50
9      14/12/2022  4        60
10     16/12/2022  6        20
11     17/12/2022  7        50
12     18/12/2022  1        10
13     20/12/2022  3        20
14     21/12/2022  4        10
15     22/12/2022  5        40
I want to calculate a rolling average of the last three observations that are at least a week old. I cannot use .shift, as some dates are randomly missing, so .shift would not produce reliable output.
Desired output example for the last three rows in the example dataset:
Index 13: Avg of indices 8, 7, 6 = (30+40+50) / 3 = 40
Index 14: Avg of indices 9, 8, 7 = (40+50+60) / 3 = 50
Index 15: Avg of indices 9, 8, 7 = (40+50+60) / 3 = 50
What would be a working solution for this? Thanks!
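For reference, a minimal sketch that rebuilds the example table as a DataFrame, assuming day-first dates (consistent with the Weekday column); the answers below work against this df:

import pandas as pd

# hypothetical reconstruction of the sample table; dates are day-first (05/12/2022 = 5 Dec 2022)
df = pd.DataFrame({
    'Index': range(1, 16),
    'Date': pd.to_datetime([
        '05/12/2022', '06/12/2022', '07/12/2022', '09/12/2022', '10/12/2022',
        '11/12/2022', '12/12/2022', '13/12/2022', '14/12/2022', '16/12/2022',
        '17/12/2022', '18/12/2022', '20/12/2022', '21/12/2022', '22/12/2022',
    ], dayfirst=True),
    'Weekday': [2, 3, 4, 6, 7, 1, 2, 3, 4, 6, 7, 1, 3, 4, 5],
    'Value': [10, 20, 40, 10, 60, 30, 40, 50, 60, 20, 50, 10, 20, 10, 40],
})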
Mostly inspired by #Aidis's answer, you could turn his solution into an apply:
df['mean']=df.apply(lambda y: df["Value"][df['Date'] <= y['Date'] - pd.Timedelta(1, "W")].tail(3).mean(), axis=1)
or, splitting the data at each call, which may run faster if you have lots of data (to be tested):
df['mean']=df.apply(lambda y: df.loc[:y.name, "Value"][ df.loc[:y.name,'Date'] <= y['Date'] - pd.Timedelta(1, "W")].tail(3).mean(), axis=1)
which returns:
Index Date Weekday Value mean
0 1 2022-12-05 2 10 NaN
1 2 2022-12-06 3 20 NaN
2 3 2022-12-07 4 40 NaN
3 4 2022-12-09 6 10 NaN
4 5 2022-12-10 7 60 NaN
5 6 2022-12-11 1 30 NaN
6 7 2022-12-12 2 40 10.000000
7 8 2022-12-13 3 50 15.000000
8 9 2022-12-14 4 60 23.333333
9 10 2022-12-16 6 20 23.333333
10 11 2022-12-17 7 50 36.666667
11 12 2022-12-18 1 10 33.333333
12 13 2022-12-20 3 20 40.000000
13 14 2022-12-21 4 10 50.000000
14 15 2022-12-22 5 40 50.000000
I apologize for this ugly code. But it seems to work:
df = df.set_index("Index")
df['Date'] = pd.to_datetime(df['Date'])  # astype("datetime64") no longer works in recent pandas
for id in df.index:
    dfs = df.loc[:id]
    mean = dfs["Value"][dfs['Date'] <= dfs.iloc[-1]['Date'] - pd.Timedelta(1, "W")].tail(3).mean()
    print(id, mean)
Result:
1 nan
2 10.0
3 15.0
4 23.333333333333332
5 23.333333333333332
6 36.666666666666664
7 33.333333333333336
8 33.333333333333336
9 33.333333333333336
10 33.333333333333336
11 33.333333333333336
12 33.333333333333336
13 40.0
14 50.0
15 50.0
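Both answers above rescan the frame once per row. For larger data, a vectorized sketch gets the same numbers by precomputing the trailing 3-row mean and looking it up as of one week before each date; it assumes Date is parsed as datetime and sorted ascending, as in the sample:

import numpy as np

# mean of the last (up to) 3 observations at each position
roll = df['Value'].rolling(3, min_periods=1).mean().to_numpy()
# position of the last row dated at least one week before each row
pos = df['Date'].searchsorted(df['Date'] - pd.Timedelta(1, "W"), side="right") - 1
df['mean'] = np.where(pos >= 0, roll[pos], np.nan)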
The DataFrame df is given using df = pd.DataFrame({'A': [10, 15, 12, 19, 11, 20, 25]}) as:
A
0 10
1 15
2 12
3 19
4 11
5 20
6 25
The result of equal-width binning of column A using df['B'] = pd.cut(df['A'], bins=2) is the following:
A B
0 10 (9.985, 17.5]
1 15 (9.985, 17.5]
2 12 (9.985, 17.5]
3 19 (17.5, 25.0]
4 11 (9.985, 17.5]
5 20 (17.5, 25.0]
6 25 (17.5, 25.0]
How can we simply have the mean (or median) of the elements of column A that belong to the same bin, instead of the intervals, in column B? That is, how can we get column B as the following:
A B
0 10 12
1 15 12
2 12 12
3 19 21.33
4 11 12
5 20 21.33
6 25 21.33
You could group by the bins and transform to the mean or median, depending on what you want:
>>> df.groupby(pd.cut(df['A'], bins=2)).transform('mean')
A
0 12.000000
1 12.000000
2 12.000000
3 21.333333
4 12.000000
5 21.333333
6 21.333333
>>> df.groupby(pd.cut(df['A'], bins=2)).transform('median')
A
0 11.5
1 11.5
2 11.5
3 20.0
4 11.5
5 20.0
6 20.0
For the interval midpoint check #sacuL’s (now deleted?) answer. It was pd.cut(df['A'], bins=2).map(lambda itv: itv.mid), or the slightly faster:
>>> df['B'] = pd.IntervalIndex(pd.cut(df['A'], bins=2)).mid
>>> df
A B
0 10 13.7425
1 15 13.7425
2 12 13.7425
3 19 21.2500
4 11 13.7425
5 20 21.2500
6 25 21.2500
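As an aside, pd.cut as used above gives equal-width bins; if equal-frequency bins (roughly the same number of rows per bin) are what's actually wanted, pd.qcut is the matching tool, and the same groupby/transform pattern applies (for this particular data it happens to produce the same grouping):

df['B'] = df.groupby(pd.qcut(df['A'], q=2))['A'].transform('mean')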
I need to get output in the 5_Days_Up column as shown below:
Date price 5_Days_Up
20-May-21 1
21-May-21 2
22-May-21 4
23-May-21 5
24-May-21 6 5
25-May-21 7 6
26-May-21 8 7
27-May-21 9 8
28-May-21 10 9
29-May-21 11 10
30-May-21 12 11
31-May-21 13 12
1-Jun-21 14 13
2-Jun-21 15 14
But I got this output instead:
Date price 5_Days_Up
20-May-21 1
21-May-21 2
22-May-21 4
23-May-21 5
24-May-21 6 6
25-May-21 7 7
26-May-21 8 8
27-May-21 9 9
28-May-21 10 10
29-May-21 11 11
30-May-21 12 12
31-May-21 13 13
1-Jun-21 14 14
2-Jun-21 15 15
Here, in pandas, I am using:
df['5_Days_Up'] = df['price'].rolling(window=5).max()
Is there a way to get the maximum value of the last 5 periods, skipping today's price, using the same rolling() or another method?
Your data has only 4 (instead of 5) previous entries before the 24-May-21 entry with price 6 (because there is no entry with price 3 in the data sample). Therefore, the first entry to show a non-NaN value will be 25-May-21 with price 7.
To include everything up to the previous entry (excluding the current one), you can use the parameter closed='left':
df['5_Days_Up'] = df['price'].rolling(window=5, closed='left').max()
Result:
Date price 5_Days_Up
0 20-May-21 1 NaN
1 21-May-21 2 NaN
2 22-May-21 4 NaN
3 23-May-21 5 NaN
4 24-May-21 6 NaN
5 25-May-21 7 6.0
6 26-May-21 8 7.0
7 27-May-21 9 8.0
8 28-May-21 10 9.0
9 29-May-21 11 10.0
10 30-May-21 12 11.0
11 31-May-21 13 12.0
12 1-Jun-21 14 13.0
13 2-Jun-21 15 14.0
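If your pandas predates support for closed='left' on fixed windows (added around pandas 1.2), shifting the series by one row first is an equivalent sketch:

df['5_Days_Up'] = df['price'].shift(1).rolling(window=5).max()  # the 5 rows before, excluding today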
I am trying to calculate the 14-day rolling average for retail data with multiple different hierarchies. The 'Store' dataframe looks like this:
Store | Inventory-Small | Inventory-Medium | Date | Purchases-Small | Purchases-Medium
-----------------------------------------------------------------------------------------------------
A 12 14 4/1/20 2 4
B 13 16 4/1/20 4 5
A 15 10 4/2/20 2 6
C 20 15 4/1/20 4 5
A 16 8 4/3/20 2 4
A 16 10 4/4/20 4 5
A 15 12 4/5/20 1 3
C 18 14 4/2/20 2 3
C 19 12 4/3/20 6 9
B 14 14 4/2/20 3 8
What I am trying to do is create a rolling 14-day average of the purchases column for each store. The data extends well past 14 days (over 8 months), and I would like the first 14 days of each store to be a simple average. My issue is that while I can group by 'Store' and create a column, I don't know how to also group by dates. I've tried:
Store.sort_values(['Store','Date'],ascending=(False,False))
Store['Rolling_Purchase_S'] = Store.groupby(['Store','Date'], as_index=False)['Purchases-Small'].transform(lambda x: x.rolling(14, 1).mean())
and also:
Store['Rolling_Purchase_S'] = Store.groupby('Store')['Purchases-Small'].transform(lambda x: x.rolling(14, 1).mean())
The first one doesn't seem to have any effect, while the second one doesn't group by dates, so I end up with a rolling average in the wrong order. Any advice would be much appreciated!
Edit: The following lines worked, thanks to all for the feedback.
Store.sort_values(['Store','Date'],ascending=(False,True),inplace=True)
Store['Rolling_Purchase_S'] = Store.groupby('Store')['Purchases-Small'].transform(lambda x: x.rolling(14, 1).mean())
I believe it's working fine as long as you sort inplace and remove 'Date' from groupby:
Store.sort_values(['Store','Date'], ascending=(False,False), inplace=True)
Store['Rolling_Purchase_S'] = Store.groupby(['Store'])['Purchases-Small'].transform(lambda x: x.rolling(14, 1).mean())
Output:
print(Store[['Store', 'Date', 'Purchases-Small', 'Rolling_Purchase_S']])
Store Date Purchases-Small Rolling_Purchase_S
8 C 2020-04-03 6 6.000000
7 C 2020-04-02 2 4.000000
3 C 2020-04-01 4 4.000000
9 B 2020-04-02 3 3.000000
1 B 2020-04-01 4 3.500000
6 A 2020-04-05 1 1.000000
5 A 2020-04-04 4 2.500000
4 A 2020-04-03 2 2.333333
2 A 2020-04-02 2 2.250000
0 A 2020-04-01 2 2.200000
There are a couple of things to point out, but you're very close; I changed a couple of items to make it work. To illustrate the sorting, I modified the first date for each store to 4/10.
If your dates aren't datetime, the sorting may not work as expected. Also, you need inplace=True to make the change permanent:
import io
import pandas as pd

data = '''Store Inventory-Small Inventory-Medium Date Purchases-Small Purchases-Medium
A 12 14 4/10/20 2 4
B 13 16 4/10/20 4 5
A 15 10 4/2/20 2 6
C 20 15 4/10/20 4 5
A 16 8 4/3/20 2 4
A 16 10 4/4/20 4 5
A 15 12 4/5/20 1 3
C 18 14 4/2/20 2 3
C 19 12 4/3/20 6 9
B 14 14 4/2/20 3 8'''
Store = pd.read_csv(io.StringIO(data), sep=r'\s+')  # \s+ tolerates aligned spacing; renamed from 'input' to avoid shadowing the builtin
Store['Date'] = pd.to_datetime(Store['Date'])
Store.sort_values(['Store', 'Date'], ascending=(True, True), inplace=True)
Store['Rolling_Purchase_S'] = Store.groupby('Store')['Purchases-Small'].transform(lambda x: x.rolling(2, 1).mean())
Also, I changed the window to 2 because I had little data to work with; it will need to go back to 14 for your dataset.
Output:
In [137]: Store
Out[137]:
Store Inventory-Small Inventory-Medium Date Purchases-Small Purchases-Medium Rolling_Purchase_S
2 A 15 10 2020-04-02 2 6 2.000
4 A 16 8 2020-04-03 2 4 2.000
5 A 16 10 2020-04-04 4 5 3.000
6 A 15 12 2020-04-05 1 3 2.500
0 A 12 14 2020-04-10 2 4 1.500
9 B 14 14 2020-04-02 3 8 3.000
1 B 13 16 2020-04-10 4 5 3.500
7 C 18 14 2020-04-02 2 3 2.000
8 C 19 12 2020-04-03 6 9 4.000
3 C 20 15 2020-04-10 4 5 5.000
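One caveat on both answers: rolling(14, 1) is a 14-row window, not 14 calendar days, so missing dates silently stretch the window. If a true 14-day window is wanted, a time-based sketch (assuming Date is already datetime, with one row per store per date) could look like this:

Store = Store.sort_values(['Store', 'Date'])
roll = Store.groupby('Store').rolling('14D', on='Date')['Purchases-Small'].mean()
Store['Rolling_Purchase_S'] = roll.droplevel(0)  # drop the Store level so values align on the row index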
I have a pandas data frame like this:
>df
leg speed
1 10
1 11
1 12
1 13
1 12
1 15
1 19
1 12
2 10
2 10
2 12
2 15
2 19
2 11
: :
I want to make a new column roll_speed that takes a rolling average of speed over the last 5 rows, but with two more detailed conditions:
Group by leg (the calculation must not take into account the speed of rows in a different leg).
The rolling window should grow from 1 up to a maximum of 5 according to the rows available. For example, in leg == 1, the first row has only one row to calculate with, so the rolling speed should be 10/1 = 10. For the second row, only two rows are available, so the rolling speed should be (10+11)/2 = 10.5.
leg speed roll_speed
1 10 10 # 10/1
1 11 10.5 # (10+11)/2
1 12 11 # (10+11+12)/3
1 13 11.5 # (10+11+12+13)/4
1 12 11.6 # (10+11+12+13+12)/5
1 15 12.6 # (11+12+13+12+15)/5
1 19 14.2 # (12+13+12+15+19)/5
1 12 14.2 # (13+12+15+19+12)/5
2 10 10 # 10/1
2 10 10 # (10+10)/2
2 12 10.7 # (10+10+12)/3
2 15 11.8 # (10+10+12+15)/4
2 19 13.2 # (10+10+12+15+19)/5
2 11 13.4 # (10+12+15+19+11)/5
: :
My attempt:
df['roll_speed'] = df.speed.rolling(5).mean()
But it just returns NA for rows where fewer than five rows are available for the calculation. How should I solve this problem? Thank you for any help!
Set the parameter min_periods to 1:
df['roll_speed'] = df.groupby('leg').speed.rolling(5, min_periods = 1).mean()\
.round(1).reset_index(drop = True)
leg speed roll_speed
0 1 10 10.0
1 1 11 10.5
2 1 12 11.0
3 1 13 11.5
4 1 12 11.6
5 1 15 12.6
6 1 19 14.2
7 1 12 14.2
8 2 10 10.0
9 2 10 10.0
10 2 12 10.7
11 2 15 11.8
12 2 19 13.2
13 2 11 13.4
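Because groupby(...).rolling(...) returns a result indexed by (leg, original index), the reset_index(drop=True) above relies on row order. Using transform keeps the original index aligned automatically; the same computation as a sketch:

df['roll_speed'] = df.groupby('leg')['speed'].transform(lambda s: s.rolling(5, min_periods=1).mean())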
Using rolling(5) will get you your results for all but the first 4 occurrences of each group. We can fill the remaining values with the expanding mean:
(df.groupby('leg').speed.rolling(5)
.mean().fillna(df.groupby('leg').speed.expanding().mean())
).reset_index(drop=True)
0 10.000000
1 10.500000
2 11.000000
3 11.500000
4 11.600000
5 12.600000
6 14.200000
7 14.200000
8 10.000000
9 10.000000
10 10.666667
11 11.750000
12 13.200000
13 13.400000
Name: speed, dtype: float64
I have a DataFrame, say df, which looks like this:
id property_type1 property_type pro
1 Condominium 2 2
2 Farm 14 14
3 House 7 7
4 Lots/Land 15 15
5 Mobile/Manufactured Home 13 13
6 Multi-Family 8 8
7 Townhouse 11 11
8 Single Family 10 10
9 Apt/Condo 1 1
10 Home 7 7
11 NaN 29 NaN
Now, I need the pro column to have the same value as the property_type column, whenever the property_type1 column has a NaN value. This is how it should be:
id property_type1 property_type pro
1 Condominium 2 2
2 Farm 14 14
3 House 7 7
4 Lots/Land 15 15
5 Mobile/Manufactured Home 13 13
6 Multi-Family 8 8
7 Townhouse 11 11
8 Single Family 10 10
9 Apt/Condo 1 1
10 Home 7 7
11 NaN 29 29
That is, in line 11, where property_type1 is NaN, the value of the pro column becomes 29, which is the value of property_type. How can I do this?
ix is deprecated, don't use it.
Option 1
I'd do this with np.where -
df = df.assign(pro=np.where(df.pro.isnull(), df.property_type, df.pro))
df
id property_type1 property_type pro
0 1 Condominium 2 2.0
1 2 Farm 14 14.0
2 3 House 7 7.0
3 4 Lots/Land 15 15.0
4 5 Mobile/Manufactured Home 13 13.0
5 6 Multi-Family 8 8.0
6 7 Townhouse 11 11.0
7 8 Single Family 10 10.0
8 9 Apt/Condo 1 1.0
9 10 Home 7 7.0
10 11 NaN 29 29.0
Option 2
If you want to perform in-place assignment, use loc -
m = df.pro.isnull()
df.loc[m, 'pro'] = df.loc[m, 'property_type']
df
id property_type1 property_type pro
0 1 Condominium 2 2.0
1 2 Farm 14 14.0
2 3 House 7 7.0
3 4 Lots/Land 15 15.0
4 5 Mobile/Manufactured Home 13 13.0
5 6 Multi-Family 8 8.0
6 7 Townhouse 11 11.0
7 8 Single Family 10 10.0
8 9 Apt/Condo 1 1.0
9 10 Home 7 7.0
10 11 NaN 29 29.0
Compute the mask just once, and use it to index multiple times, which should be more efficient than computing it twice.
Find the rows where the property_type1 column is NaN, and for those rows assign the property_type values to the pro column. (The original used df.ix, which has since been removed from pandas; .loc does the same job.)
df.loc[df.property_type1.isnull(), 'pro'] = df.loc[df.property_type1.isnull(), 'property_type']
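If, as in the sample data, pro is missing exactly where property_type1 is, plain fillna is arguably the simplest option:

df['pro'] = df['pro'].fillna(df['property_type'])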