Calculate 14-day rolling average on data with two hierarchies - python

I am trying to calculate the 14 day rolling average for retail data with multiple different hierarchies. The 'Store' dataframe looks like this:
Store  Inventory-Small  Inventory-Medium  Date    Purchases-Small  Purchases-Medium
A      12               14                4/1/20  2                4
B      13               16                4/1/20  4                5
A      15               10                4/2/20  2                6
C      20               15                4/1/20  4                5
A      16               8                 4/3/20  2                4
A      16               10                4/4/20  4                5
A      15               12                4/5/20  1                3
C      18               14                4/2/20  2                3
C      19               12                4/3/20  6                9
B      14               14                4/2/20  3                8
What I am trying to do is create a rolling 14-day average of the purchases column for each store. The data extends well past 14 days (over 8 months), and I would like the first 14 days of each store to be a simple average. My issue is that while I can group by 'Store' and create a column, I don't know how to also take the dates into account. I've tried:
Store.sort_values(['Store','Date'],ascending=(False,False))
Store['Rolling_Purchase_S'] = Store.groupby(['Store','Date'], as_index=False)['Purchases-Small'].transform(lambda x: x.rolling(14, 1).mean())
and also:
Store['Rolling_Purchase_S'] = Store.groupby('Store')['Purchases-Small'].transform(lambda x: x.rolling(14, 1).mean())
The first one doesn't seem to have any effect, while the second one doesn't respect the date order, so I end up with a rolling average computed in the wrong order. Any advice would be much appreciated!
Edit: The following lines worked, thanks to all for the feedback.
Store.sort_values(['Store','Date'],ascending=(False,True),inplace=True)
Store['Rolling_Purchase_S'] = Store.groupby('Store')['Purchases-Small'].transform(lambda x: x.rolling(14, 1).mean())
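Note that rolling(14, 1) spans the previous 14 rows, not 14 calendar days, so any gaps in the dates stretch the window. If that matters, a date-based window is one option; a minimal sketch (my variation, not from the thread), assuming Date is already a datetime column:

# Sort so each store's rows are in chronological order.
Store = Store.sort_values(['Store', 'Date'])

# '14D' makes the window cover the previous 14 calendar days; min_periods
# defaults to 1, so the first days of each store get a partial average.
Store['Rolling_Purchase_S'] = (
    Store.groupby('Store', group_keys=False)
         .apply(lambda g: g.rolling('14D', on='Date')['Purchases-Small'].mean())
)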

I believe it works fine as long as you sort in place and remove 'Date' from the groupby:
Store.sort_values(['Store','Date'], ascending=(False,False), inplace=True)
Store['Rolling_Purchase_S'] = Store.groupby(['Store'])['Purchases-Small'].transform(lambda x: x.rolling(14, 1).mean())
Output:
print(Store[['Store', 'Date', 'Purchases-Small', 'Rolling_Purchase_S']])
Store Date Purchases-Small Rolling_Purchase_S
8 C 2020-04-03 6 6.000000
7 C 2020-04-02 2 4.000000
3 C 2020-04-01 4 4.000000
9 B 2020-04-02 3 3.000000
1 B 2020-04-01 4 3.500000
6 A 2020-04-05 1 1.000000
5 A 2020-04-04 4 2.500000
4 A 2020-04-03 2 2.333333
2 A 2020-04-02 2 2.250000
0 A 2020-04-01 2 2.200000

There are a couple of things to point out, but you're very close; I changed a couple of items to make it work. To illustrate the sorting, I modified the first date for each store to 4/10.
If your dates aren't datetimes, the sorting may not work as expected. Also, you need inplace=True to make the change permanent.
import io
import pandas as pd

data = '''Store Inventory-Small Inventory-Medium Date Purchases-Small Purchases-Medium
A 12 14 4/10/20 2 4
B 13 16 4/10/20 4 5
A 15 10 4/2/20 2 6
C 20 15 4/10/20 4 5
A 16 8 4/3/20 2 4
A 16 10 4/4/20 4 5
A 15 12 4/5/20 1 3
C 18 14 4/2/20 2 3
C 19 12 4/3/20 6 9
B 14 14 4/2/20 3 8'''
Store = pd.read_csv(io.StringIO(data), sep=' ')
Store['Date'] = pd.to_datetime(Store['Date'])
Store.sort_values(['Store', 'Date'], ascending=(True, True), inplace=True)
Store['Rolling_Purchase_S'] = Store.groupby('Store')['Purchases-Small'].transform(lambda x: x.rolling(2, 1).mean())
Also, I changed the window to 2 because I had little data to work with; it will need to go back to 14 for your dataset.
Output:
In [137]: Store
Out[137]:
Store Inventory-Small Inventory-Medium Date Purchases-Small Purchases-Medium Rolling_Purchase_S
2 A 15 10 2020-04-02 2 6 2.000
4 A 16 8 2020-04-03 2 4 2.000
5 A 16 10 2020-04-04 4 5 3.000
6 A 15 12 2020-04-05 1 3 2.500
0 A 12 14 2020-04-10 2 4 1.500
9 B 14 14 2020-04-02 3 8 3.000
1 B 13 16 2020-04-10 4 5 3.500
7 C 18 14 2020-04-02 2 3 2.000
8 C 19 12 2020-04-03 6 9 4.000
3 C 20 15 2020-04-10 4 5 5.000

Related

Pandas rolling mean with offset by (not continuously available) date

Given the following example table:
Index  Date        Weekday  Value
1      05/12/2022  2        10
2      06/12/2022  3        20
3      07/12/2022  4        40
4      09/12/2022  6        10
5      10/12/2022  7        60
6      11/12/2022  1        30
7      12/12/2022  2        40
8      13/12/2022  3        50
9      14/12/2022  4        60
10     16/12/2022  6        20
11     17/12/2022  7        50
12     18/12/2022  1        10
13     20/12/2022  3        20
14     21/12/2022  4        10
15     22/12/2022  5        40
I want to calculate a rolling average of the last three observations dated at least a week earlier. I cannot use .shift, as some dates are randomly missing, so .shift would not produce a reliable output.
Desired output example for last three rows in the example dataset:
Index 13: Avg of indices 8, 7, 6 = (30+40+50) / 3 = 40
Index 14: Avg of indices 9, 8, 7 = (40+50+60) / 3 = 50
Index 15: Avg of indices 9, 8, 7 = (40+50+60) / 3 = 50
What would be a working solution for this? Thanks!
Mostly inspired by @Aidis's answer, you could turn his solution into an apply:
df['mean'] = df.apply(lambda y: df["Value"][df['Date'] <= y['Date'] - pd.Timedelta(1, "W")].tail(3).mean(), axis=1)
or, splitting the data at each call, which may run faster if you have lots of data (to be tested):
df['mean'] = df.apply(lambda y: df.loc[:y.name, "Value"][df.loc[:y.name, 'Date'] <= y['Date'] - pd.Timedelta(1, "W")].tail(3).mean(), axis=1)
which returns:
Index Date Weekday Value mean
0 1 2022-12-05 2 10 NaN
1 2 2022-12-06 3 20 NaN
2 3 2022-12-07 4 40 NaN
3 4 2022-12-09 6 10 NaN
4 5 2022-12-10 7 60 NaN
5 6 2022-12-11 1 30 NaN
6 7 2022-12-12 2 40 10.000000
7 8 2022-12-13 3 50 15.000000
8 9 2022-12-14 4 60 23.333333
9 10 2022-12-16 6 20 23.333333
10 11 2022-12-17 7 50 36.666667
11 12 2022-12-18 1 10 33.333333
12 13 2022-12-20 3 20 40.000000
13 14 2022-12-21 4 10 50.000000
14 15 2022-12-22 5 40 50.000000
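A vectorized alternative (my own sketch, not from the answers here) is to precompute the three-observation rolling mean once and attach it with pandas.merge_asof, which looks up the last row at least a week older than each date. It assumes df is sorted by Date:

import pandas as pd

# Rolling mean of the last (up to) three observations at every row;
# min_periods=1 reproduces the partial averages in the output above.
right = df[['Date']].copy()
right['roll3'] = df['Value'].rolling(3, min_periods=1).mean()

# For each row, find the latest row whose Date <= Date - 7 days and
# take its precomputed rolling mean.
left = df.copy()
left['cutoff'] = left['Date'] - pd.Timedelta(days=7)
out = pd.merge_asof(left, right, left_on='cutoff', right_on='Date',
                    direction='backward', suffixes=('', '_src'))
df['mean'] = out['roll3'].to_numpy()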
I apologize for this ugly code, but it seems to work:
df = df.set_index("Index")
df['Date'] = df['Date'].astype("datetime64[ns]")
for idx in df.index:
    dfs = df.loc[:idx]
    mean = dfs["Value"][dfs['Date'] <= dfs.iloc[-1]['Date'] - pd.Timedelta(1, "W")].tail(3).mean()
    print(idx, mean)
Result:
1 nan
2 10.0
3 15.0
4 23.333333333333332
5 23.333333333333332
6 36.666666666666664
7 33.333333333333336
8 33.333333333333336
9 33.333333333333336
10 33.333333333333336
11 33.333333333333336
12 33.333333333333336
13 40.0
14 50.0
15 50.0

Is it possible to use pandas.DataFrame.rolling with a window of 5 that skips today's value?

I need to get output in the 5_Days_Up column like this:
Date price 5_Days_Up
20-May-21 1
21-May-21 2
22-May-21 4
23-May-21 5
24-May-21 6 5
25-May-21 7 6
26-May-21 8 7
27-May-21 9 8
28-May-21 10 9
29-May-21 11 10
30-May-21 12 11
31-May-21 13 12
1-Jun-21 14 13
2-Jun-21 15 14
But I got this output instead:
Date price 5_Days_Up
20-May-21 1
21-May-21 2
22-May-21 4
23-May-21 5
24-May-21 6 6
25-May-21 7 7
26-May-21 8 8
27-May-21 9 9
28-May-21 10 10
29-May-21 11 11
30-May-21 12 12
31-May-21 13 13
1-Jun-21 14 14
2-Jun-21 15 15
Here, in pandas, I am using
df['5_Days_Up'] = df['price'].rolling(window=5).max()
Is there a way to get the maximum value of the last 5 periods, skipping today's price, using the same rolling() or anything else?
Your data has only 4 (instead of 5) entries before the 24-May-21 row with price 6 (since there is no row with price 3 in the sample), so the first non-NaN value will appear on 25-May-21, where the price is 7.
To include up to the previous entry (exclude current entry), you can use the parameter closed='left' to achieve this:
df['5_Days_Up'] = df['price'].rolling(window=5, closed='left').max()
Result:
Date price 5_Days_Up
0 20-May-21 1 NaN
1 21-May-21 2 NaN
2 22-May-21 4 NaN
3 23-May-21 5 NaN
4 24-May-21 6 NaN
5 25-May-21 7 6.0
6 26-May-21 8 7.0
7 27-May-21 9 8.0
8 28-May-21 10 9.0
9 29-May-21 11 10.0
10 30-May-21 12 11.0
11 31-May-21 13 12.0
12 1-Jun-21 14 13.0
13 2-Jun-21 15 14.0
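On older pandas (closed= for fixed windows needs pandas >= 1.2), the same result should be obtainable by shifting, since a left-closed 5-row window ending at row i covers rows i-5 through i-1:

# Max of the previous 5 rows: ordinary rolling max, shifted down one row.
df['5_Days_Up'] = df['price'].rolling(window=5).max().shift(1)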

Pandas GroupingBy and finding repetition by unique IDs

I have a dataframe like this:
userId date new doa
67 23 2018-07-02 1 2
68 23 2018-07-03 1 3
69 23 2018-07-04 1 4
70 23 2018-07-06 1 6
71 23 2018-07-07 1 7
72 23 2018-07-10 1 10
73 23 2018-07-11 1 11
74 23 2018-07-13 1 13
75 23 2018-07-15 1 15
76 23 2018-07-16 1 16
77 23 2018-07-17 1 17
......
194605 448053 2018-08-11 1 11
194606 448054 2018-08-11 1 11
194607 448065 2018-08-11 1 11
df['doa'] stands for day of appearance.
Now I want to find out which unique userIds have appeared on a daily basis: which userIds appear on day 1, day 2, day 3, and so on. How exactly do I group them? I also want to find the average number of days per month that unique users open the app.
And finally, I want to find which users have appeared at least once every day throughout the month.
I want some thing like this:
userId week_no ndays
23 1 2
23 2 5
23 3 6
.....
1533 1 0
1534 2 1
1534 3 4
1534 4 1
1553 1 1
1553 2 0
1553 3 0
1553 4 0
And so on. ndays means the number of days the user appeared in that week.
You're asking several different questions, and none of them are particularly difficult; they just require a couple of groupbys and aggregation operations.
Setup
import pandas as pd

df = pd.DataFrame({
    'userId': [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3],
    'date': ['2018-07-02', '2018-07-03', '2018-08-04', '2018-08-05', '2018-08-06',
             '2018-07-02', '2018-07-03', '2018-08-04', '2018-08-05', '2018-07-02',
             '2018-07-03', '2018-07-04', '2018-07-05', '2018-08-06']
})
df.date = pd.to_datetime(df.date)
df['doa'] = df.date.dt.day
userId date doa
0 1 2018-07-02 2
1 1 2018-07-03 3
2 1 2018-08-04 4
3 1 2018-08-05 5
4 1 2018-08-06 6
5 2 2018-07-02 2
6 2 2018-07-03 3
7 2 2018-08-04 4
8 2 2018-08-05 5
9 3 2018-07-02 2
10 3 2018-07-03 3
11 3 2018-07-04 4
12 3 2018-07-05 5
13 3 2018-08-06 6
Questions
How do I find the unique visitors per day?
You may use groupby and unique:
df.groupby([df.date.dt.month, 'doa']).userId.unique()
date doa
7 2 [1, 2, 3]
3 [1, 2, 3]
4 [3]
5 [3]
8 4 [1, 2]
5 [1, 2]
6 [1, 3]
Name: userId, dtype: object
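If you only need how many distinct users appeared each day rather than their IDs, nunique is the natural variant:

# Count of distinct users per (month, day-of-appearance).
df.groupby([df.date.dt.month, 'doa']).userId.nunique()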
How do I find the average number of days per month users open the app?
Using groupby and size:
df.groupby(['userId', df.date.dt.month]).size()
userId date
1 7 2
8 3
2 7 2
8 2
3 7 4
8 1
dtype: int64
This will give you the number of times per month each unique visitor has visited. If you want the average of this, simply apply mean:
df.groupby(['userId', df.date.dt.month]).size().groupby('date').mean()
date
7 2.666667
8 2.000000
dtype: float64
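The last part of the question (users who appeared at least once every day throughout the month) isn't covered above; one possible sketch is to compare each user's count of distinct days against the month's length:

# Distinct days seen per (userId, month), alongside that month's length.
per_month = df.groupby(['userId', df.date.dt.month]).agg(
    ndays=('doa', 'nunique'),
    month_len=('date', lambda s: s.dt.days_in_month.iloc[0]),
)

# Users whose distinct-day count equals the number of days in the month.
everyday_users = per_month[per_month['ndays'] == per_month['month_len']]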
This one was a bit more unclear, but it seems that you want the number of days a user was seen per week:
You can groupby userId, as well as a variation on your date column to create continuous weeks, starting at the minimum date, then use size:
(df.groupby(['userId',
             (df.date.dt.week.sub(df.date.dt.week.min()) + 1).rename('week_no')])
   .size()
   .reset_index(name='ndays'))
userId week_no ndays
0 1 1 2
1 1 5 2
2 1 6 1
3 2 1 2
4 2 5 2
5 3 1 4
6 3 6 1
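One maintenance note: Series.dt.week was deprecated in pandas 1.1 and later removed; on current pandas the same grouping can be written with isocalendar() (keeping in mind ISO weeks restart at year boundaries, so this matches only within a single year):

# ISO week number as a plain int Series, replacing the removed .dt.week.
week = df.date.dt.isocalendar().week.astype(int)

(df.groupby(['userId', (week - week.min() + 1).rename('week_no')])
   .size()
   .reset_index(name='ndays'))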

Calculate average of every 7 instances in a dataframe column

I have this pandas dataframe with daily asset prices:
[Picture of the head of the DataFrame]
I would like to create a pandas Series (it could also be an additional column in the dataframe or some other data structure) with the weekly average asset prices. This means I need to calculate the average over every 7 consecutive instances in the column and save it into a Series.
[Picture of how the result should look]
As I am a complete newbie to Python (and programming in general, for that matter), I really have no idea how to start.
I am very grateful for every tip!
I believe you need GroupBy.transform, grouping by the integer division of a NumPy array created by numpy.arange; this is a general solution that also works with any index (e.g. a DatetimeIndex):
import numpy as np
import pandas as pd

np.random.seed(2018)
rng = pd.date_range('2018-04-19', periods=20)
df = pd.DataFrame({'Date': rng[::-1],
                   'ClosingPrice': np.random.randint(4, size=20)})
df['weekly'] = df['ClosingPrice'].groupby(np.arange(len(df)) // 7).transform('mean')
print(df)
ClosingPrice Date weekly
0 2 2018-05-08 1.142857
1 2 2018-05-07 1.142857
2 2 2018-05-06 1.142857
3 1 2018-05-05 1.142857
4 1 2018-05-04 1.142857
5 0 2018-05-03 1.142857
6 0 2018-05-02 1.142857
7 2 2018-05-01 2.285714
8 1 2018-04-30 2.285714
9 1 2018-04-29 2.285714
10 3 2018-04-28 2.285714
11 3 2018-04-27 2.285714
12 3 2018-04-26 2.285714
13 3 2018-04-25 2.285714
14 1 2018-04-24 1.666667
15 0 2018-04-23 1.666667
16 3 2018-04-22 1.666667
17 2 2018-04-21 1.666667
18 2 2018-04-20 1.666667
19 2 2018-04-19 1.666667
Detail:
print (np.arange(len(df)) // 7)
[0 0 0 0 0 0 0 1 1 1 1 1 1 1 2 2 2 2 2 2]
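If "weekly" should follow the calendar rather than simply every 7 rows, pd.Grouper is a possible alternative, since the position-based trick above ignores the dates entirely; a sketch assuming Date is a datetime column:

# Bin rows into 7-day calendar windows starting from the first date,
# then broadcast each bin's mean back onto its rows.
df = df.sort_values('Date')
df['weekly'] = (df.groupby(pd.Grouper(key='Date', freq='7D'))['ClosingPrice']
                  .transform('mean'))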

Call a NaN value and change it to a number in Python

I have a DataFrame, say df, which looks like this:
id property_type1 property_type pro
1 Condominium 2 2
2 Farm 14 14
3 House 7 7
4 Lots/Land 15 15
5 Mobile/Manufactured Home 13 13
6 Multi-Family 8 8
7 Townhouse 11 11
8 Single Family 10 10
9 Apt/Condo 1 1
10 Home 7 7
11 NaN 29 NaN
Now, I need the pro column to have the same value as the property_type column, whenever the property_type1 column has a NaN value. This is how it should be:
id property_type1 property_type pro
1 Condominium 2 2
2 Farm 14 14
3 House 7 7
4 Lots/Land 15 15
5 Mobile/Manufactured Home 13 13
6 Multi-Family 8 8
7 Townhouse 11 11
8 Single Family 10 10
9 Apt/Condo 1 1
10 Home 7 7
11 NaN 29 29
That is, in line 11, where property_type1 is NaN, the value of the pro column becomes 29, which is the value of property_type. How can I do this?
ix is deprecated, don't use it.
Option 1
I'd do this with np.where -
import numpy as np

df = df.assign(pro=np.where(df.pro.isnull(), df.property_type, df.pro))
df
id property_type1 property_type pro
0 1 Condominium 2 2.0
1 2 Farm 14 14.0
2 3 House 7 7.0
3 4 Lots/Land 15 15.0
4 5 Mobile/Manufactured Home 13 13.0
5 6 Multi-Family 8 8.0
6 7 Townhouse 11 11.0
7 8 Single Family 10 10.0
8 9 Apt/Condo 1 1.0
9 10 Home 7 7.0
10 11 NaN 29 29.0
Option 2
If you want to perform in-place assignment, use loc -
m = df.pro.isnull()
df.loc[m, 'pro'] = df.loc[m, 'property_type']
df
id property_type1 property_type pro
0 1 Condominium 2 2.0
1 2 Farm 14 14.0
2 3 House 7 7.0
3 4 Lots/Land 15 15.0
4 5 Mobile/Manufactured Home 13 13.0
5 6 Multi-Family 8 8.0
6 7 Townhouse 11 11.0
7 8 Single Family 10 10.0
8 9 Apt/Condo 1 1.0
9 10 Home 7 7.0
10 11 NaN 29 29.0
Compute the mask just once, and use it to index multiple times, which should be more efficient than computing it twice.
Find the rows where the property_type1 column is NaN, and for those rows assign the property_type values to the pro column:
df.ix[df.property_type1.isnull(), 'pro'] = df.ix[df.property_type1.isnull(), 'property_type']
(As noted above, .ix is deprecated and has since been removed; the .loc version above is the modern equivalent.)
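A third option, equivalent to the np.where version above (both key off pro being NaN, which in this data coincides with property_type1 being NaN): fill the gaps directly with fillna:

# Fill missing pro values from property_type.
df['pro'] = df['pro'].fillna(df['property_type'])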
