(Python) How to calculate the average over a time period? - python

I have a dataFrame and I am trying to add a new column that calculates the average amount spent with a card over the last 3 days.
I have tried using df['avg_card_3days'] = df.groupby('card')['amount'].resample('3D', on = 'date').mean()
The dataFrame currently looks like:
card date amount
1 2/1/10 50
2 2/1/10 40
3 2/1/10 10
1 2/2/10 20
2 2/2/10 30
3 2/2/10 30
1 2/3/10 10
2 2/3/10 30
3 2/3/10 20
...
But I am looking for this result:
card date amount avg_card_3days
1 2/1/10 50 NaN
2 2/1/10 40 NaN
3 2/1/10 10 NaN
1 2/2/10 20 NaN
2 2/2/10 30 NaN
3 2/2/10 30 NaN
1 2/3/10 10 26.67
2 2/3/10 30 33.33
3 2/3/10 20 20.00
...
Any help would be greatly appreciated!

df['date'] = pd.to_datetime(df.date, format='%m/%d/%y')
df = df.set_index('date')
df['avg_card_3days'] = df.groupby('card').expanding(3).amount.agg('mean').droplevel(0).sort_index()
df = df.reset_index()
df
Output
date card amount avg_card_3days
0 2010-02-01 1 50 NaN
1 2010-02-01 2 40 NaN
2 2010-02-01 3 10 NaN
3 2010-02-02 1 20 NaN
4 2010-02-02 2 30 NaN
5 2010-02-02 3 30 NaN
6 2010-02-03 1 10 26.666667
7 2010-02-03 2 30 33.333333
8 2010-02-03 3 20 20.000000
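Explanation
Convert the date column to datetime type and set it as the index.
Group the df by card, compute the mean per card, and assign it to the new column. Note that expanding(3) is a cumulative mean that needs at least 3 observations; it matches a 3-day rolling mean only while each card has at most 3 rows. For a true 3-day window, use a time-based rolling('3D', min_periods=3) instead.
Reset the index to get the required output.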
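The time-based window mentioned in the question can be sketched as follows. This is a minimal example, not the answer's own code: it assumes one row per card per day, and the extra 2/4/10 row for card 1 is hypothetical data added only to show the window sliding forward. Each card is rolled separately and the result is mapped back through the original row labels, so it does not depend on how duplicate dates are ordered.

```python
import pandas as pd

# Sample data from the question, plus one hypothetical extra row (card 1 on 2/4/10)
df = pd.DataFrame({
    "card": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
    "date": ["2/1/10", "2/1/10", "2/1/10", "2/2/10", "2/2/10", "2/2/10",
             "2/3/10", "2/3/10", "2/3/10", "2/4/10"],
    "amount": [50, 40, 10, 20, 30, 30, 10, 30, 20, 40],
})
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%y")

parts = []
for _, group in df.groupby("card"):
    # A time-based window needs a sorted DatetimeIndex
    group = group.sort_values("date")
    rolled = (
        group.set_index("date")["amount"]
             .rolling("3D", min_periods=3)  # NaN until 3 observations fall in the window
             .mean()
    )
    # Re-attach the original row labels so the values align on assignment
    parts.append(pd.Series(rolled.to_numpy(), index=group.index))

df["avg_card_3days"] = pd.concat(parts)
print(df)
```

Unlike expanding(3), the window here drops observations older than 3 days, so the hypothetical fourth row for card 1 averages only 2/2, 2/3 and 2/4.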

Related

Pandas rolling mean with offset by (not continuously available) date

Given the following example table:
Index Date Weekday Value
1 05/12/2022 2 10
2 06/12/2022 3 20
3 07/12/2022 4 40
4 09/12/2022 6 10
5 10/12/2022 7 60
6 11/12/2022 1 30
7 12/12/2022 2 40
8 13/12/2022 3 50
9 14/12/2022 4 60
10 16/12/2022 6 20
11 17/12/2022 7 50
12 18/12/2022 1 10
13 20/12/2022 3 20
14 21/12/2022 4 10
15 22/12/2022 5 40
I want to calculate, for each row, the average of the last three observations that are at least one week old. I cannot use .shift because some dates are randomly missing, so a fixed row offset would not reliably correspond to one week.
Desired output example for last three rows in the example dataset:
Index 13: Avg of indices 8, 7, 6 = (30+40+50) / 3 = 40
Index 14: Avg of indices 9, 8, 7 = (40+50+60) / 3 = 50
Index 15: Avg of indices 9, 8, 7 = (40+50+60) / 3 = 50
What would be a working solution for this? Thanks!
Mostly inspired by @Aidis: you could turn his solution into an apply:
df['mean']=df.apply(lambda y: df["Value"][df['Date'] <= y['Date'] - pd.Timedelta(1, "W")].tail(3).mean(), axis=1)
or split the data at each call, which may run faster if you have lots of data (to be tested):
df['mean']=df.apply(lambda y: df.loc[:y.name, "Value"][ df.loc[:y.name,'Date'] <= y['Date'] - pd.Timedelta(1, "W")].tail(3).mean(), axis=1)
which returns:
Index Date Weekday Value mean
0 1 2022-12-05 2 10 NaN
1 2 2022-12-06 3 20 NaN
2 3 2022-12-07 4 40 NaN
3 4 2022-12-09 6 10 NaN
4 5 2022-12-10 7 60 NaN
5 6 2022-12-11 1 30 NaN
6 7 2022-12-12 2 40 10.000000
7 8 2022-12-13 3 50 15.000000
8 9 2022-12-14 4 60 23.333333
9 10 2022-12-16 6 20 23.333333
10 11 2022-12-17 7 50 36.666667
11 12 2022-12-18 1 10 33.333333
12 13 2022-12-20 3 20 40.000000
13 14 2022-12-21 4 10 50.000000
14 15 2022-12-22 5 40 50.000000
I apologize for this ugly code. But it seems to work:
import pandas as pd

df = df.set_index("Index")
# The dates are day-first (dd/mm/yyyy), so parse them with an explicit format
df['Date'] = pd.to_datetime(df['Date'], format="%d/%m/%Y")
for idx in df.index:
    dfs = df.loc[:idx]  # all rows up to and including the current one
    # last three values that are at least one week older than the current row
    mean = dfs["Value"][dfs['Date'] <= dfs.iloc[-1]['Date'] - pd.Timedelta(1, "W")].tail(3).mean()
    print(idx, mean)
Result:
1 nan
2 nan
3 nan
4 nan
5 nan
6 nan
7 10.0
8 15.0
9 23.333333333333332
10 23.333333333333332
11 36.666666666666664
12 33.333333333333336
13 40.0
14 50.0
15 50.0
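Both answers above rescan the frame for every row. A vectorized variant is sketched below, assuming the frame is sorted by date; it is an alternative technique (numpy.searchsorted on the date column), not the approach either answer used. The idea is to precompute the mean of each row with its up to two predecessors, then look up, for each row, the last position whose date is at least one week older.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Date": ["05/12/2022", "06/12/2022", "07/12/2022", "09/12/2022",
             "10/12/2022", "11/12/2022", "12/12/2022", "13/12/2022",
             "14/12/2022", "16/12/2022", "17/12/2022", "18/12/2022",
             "20/12/2022", "21/12/2022", "22/12/2022"],
    "Value": [10, 20, 40, 10, 60, 30, 40, 50, 60, 20, 50, 10, 20, 10, 40],
})
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
df = df.sort_values("Date").reset_index(drop=True)

# Mean of each row together with up to two preceding rows (same as tail(3).mean())
trailing_mean = df["Value"].rolling(3, min_periods=1).mean().to_numpy()

dates = df["Date"].to_numpy()
cutoff = dates - np.timedelta64(7, "D")
# Position of the last row whose date is <= cutoff; -1 means there is none
pos = np.searchsorted(dates, cutoff, side="right") - 1

df["mean"] = np.where(pos >= 0, trailing_mean[np.clip(pos, 0, None)], np.nan)
print(df)
```

On the sample data this gives the same values as the apply-based version, with NaN for the first six rows.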

Interpolating in multiindex pandas

I have data which looks like this:
month day
1 1 NaN
2 NaN
3 39.529999
4 40.570000
5 40.099998
...
12 27 NaN
28 NaN
29 NaN
30 NaN
31 39.049999
df55.iloc[df55.index.get_level_values('month') == 3]
month day
3 1 37.099998
2 38.060001
3 37.939999
4 37.230000
5 NaN
6 NaN
7 35.869999
8 35.660000
9 36.970001
10 36.660000
11 36.400002
12 NaN
13 NaN
14 36.860001
15 37.380001
16 38.430000
17 38.910000
18 39.000000
19 NaN
20 NaN
21 38.810001
22 39.439999
23 38.709999
24 39.020000
25 39.520000
26 NaN
27 NaN
28 NaN
29 NaN
30 NaN
31 NaN
I want to interpolate() the missing data, but only from month 1 day 1 up to today (month 3, day 26), leaving all later NaN values as they are. Could you please advise how to restrict interpolate() to that range?
Your idea to use iloc is good, but you can use dayofyear to slice your dataframe, because I assume your dataframe is well ordered (one row per calendar day, starting on January 1).
today = pd.to_datetime('today')
df.iloc[:today.dayofyear] = df.iloc[:today.dayofyear].interpolate()
It seems easiest to temporarily reset the index so you can use a query:
today = pd.to_datetime('today')
idx = df.reset_index().query('month in [1,2] or (month == @today.month and day < @today.day)').index.max()
df.iloc[:idx + 1] = df.iloc[:idx + 1].interpolate()
Now all NaN values from 1-1 to 3-25 (inclusive) that have a valid value before them will be filled; leading NaNs stay NaN, because interpolate() only fills forward by default.
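The same restriction can also be expressed as a boolean mask built from the index levels, which avoids positional arithmetic. The sketch below uses a small made-up Series (the values and the fixed cutoff of month 3, day 26 are illustrative assumptions, not the asker's data).

```python
import numpy as np
import pandas as pd

# Hypothetical sample: month 3, days 20 to 31
index = pd.MultiIndex.from_tuples(
    [(3, day) for day in range(20, 32)], names=["month", "day"]
)
s = pd.Series(
    [np.nan, 38.0, np.nan, 39.0, 40.0, np.nan, np.nan,
     np.nan, np.nan, np.nan, np.nan, np.nan],
    index=index,
)

cut_month, cut_day = 3, 26  # assumed cutoff; replace with today's month and day
month = s.index.get_level_values("month")
day = s.index.get_level_values("day")

# True for every row on or before the cutoff date
mask = (month < cut_month) | ((month == cut_month) & (day <= cut_day))

out = s.copy()
# Interpolate only the masked rows; rows after the cutoff are left untouched
out.loc[mask] = s[mask].interpolate().to_numpy()
print(out)
```

Rows after the cutoff keep their NaN values, and a NaN before the first valid observation is also left as is.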

Get sum of values from last nth row by group id

I just want to know how to get, for every row, the sum of the previous 5 values with the same id.
df:
id values
-----------------
a 5
a 10
a 10
b 2
c 2
d 2
a 5
a 10
a 20
a 10
a 15
a 20
expected df:
id values sum(x.tail(5))
-------------------------------------
a 5 NaN
a 10 NaN
a 10 NaN
b 2 NaN
c 2 NaN
d 2 NaN
a 5 NaN
a 10 NaN
a 20 40
a 10 55
a 15 55
a 20 60
For simplicity, I'm only showing the sum of the previous 5 values for the rows with id a.
I tried df.apply(lambda x: x.tail(5)), but that only showed me the last 5 rows of the entire df. I want the sum of the previous n rows for each and every row. Basically it's like a rolling sum for time series data.
You can calculate the sum of the last 5 like this:
df["rolling As"] = df.loc[df['id'] == 'a', 'values'].rolling(window=5).sum()
(This includes the current row as one of the 5; I'm not sure if that is what you want.)
id values rolling As
0 a 5 NaN
1 a 10 NaN
2 a 10 NaN
3 b 2 NaN
4 c 2 NaN
5 d 5 NaN
6 a 10 NaN
7 a 20 55.0
8 a 10 60.0
9 a 10 60.0
10 a 15 65.0
11 a 20 75.0
If you don't want it included, you can shift:
df["rolling"] = df.loc[df['id'] == 'a', 'values'].rolling(window=5).sum().shift()
to give:
id values rolling
0 a 5 NaN
1 a 10 NaN
2 a 10 NaN
3 b 2 NaN
4 c 2 NaN
5 d 5 NaN
6 a 10 NaN
7 a 20 NaN
8 a 10 55.0
9 a 10 60.0
10 a 15 60.0
11 a 20 65.0
Try using groupby, transform, and rolling:
df['sum(x.tail(5))'] = df.groupby('id')['values']\
    .transform(lambda x: x.rolling(5, min_periods=5).sum().shift())
Output:
id values sum(x.tail(5))
1 a 5 NaN
2 a 10 NaN
3 a 10 NaN
4 b 2 NaN
5 c 2 NaN
6 d 2 NaN
7 a 5 NaN
8 a 10 NaN
9 a 20 40.0
10 a 10 55.0
11 a 15 55.0
12 a 20 60.0
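For reference, a self-contained version of the groupby/transform approach on the question's data might look like this. The column name prev5_sum is an illustrative choice, and the shift is applied before the rolling sum, which gives the same result as shifting afterwards.

```python
import pandas as pd

df = pd.DataFrame({
    "id": list("aaabcdaaaaaa"),
    "values": [5, 10, 10, 2, 2, 2, 5, 10, 20, 10, 15, 20],
})

# Within each id: drop the current row (shift), then sum the 5 rows before it
df["prev5_sum"] = df.groupby("id")["values"].transform(
    lambda s: s.shift().rolling(5).sum()
)
print(df)
```

Groups with fewer than six rows (b, c and d here) never accumulate five previous values, so they stay NaN throughout.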

Calculate average of every 7 instances in a dataframe column

I have this pandas dataframe with daily asset prices:
Picture of head of Dataframe
I would like to create a pandas Series (it could also be an additional column in the dataframe or some other data structure) with the weekly average asset prices. This means I need to calculate the average of every 7 consecutive instances in the column and save it into a Series.
Picture of how result should look like
As I am a complete newbie to python (and programming in general, for that matter), I really have no idea how to start.
I am very grateful for every tip!
I believe you need GroupBy.transform, grouping by the integer (floor) division of a NumPy array created by numpy.arange. This is a general solution that works with any index (e.g. a DatetimeIndex):
import numpy as np
import pandas as pd

np.random.seed(2018)
rng = pd.date_range('2018-04-19', periods=20)
df = pd.DataFrame({'Date': rng[::-1],
                   'ClosingPrice': np.random.randint(4, size=20)})
#print (df)
# rows 0-6 form group 0, rows 7-13 group 1, and so on
df['weekly'] = df['ClosingPrice'].groupby(np.arange(len(df)) // 7).transform('mean')
print (df)
ClosingPrice Date weekly
0 2 2018-05-08 1.142857
1 2 2018-05-07 1.142857
2 2 2018-05-06 1.142857
3 1 2018-05-05 1.142857
4 1 2018-05-04 1.142857
5 0 2018-05-03 1.142857
6 0 2018-05-02 1.142857
7 2 2018-05-01 2.285714
8 1 2018-04-30 2.285714
9 1 2018-04-29 2.285714
10 3 2018-04-28 2.285714
11 3 2018-04-27 2.285714
12 3 2018-04-26 2.285714
13 3 2018-04-25 2.285714
14 1 2018-04-24 1.666667
15 0 2018-04-23 1.666667
16 3 2018-04-22 1.666667
17 2 2018-04-21 1.666667
18 2 2018-04-20 1.666667
19 2 2018-04-19 1.666667
Detail:
print (np.arange(len(df)) // 7)
[0 0 0 0 0 0 0 1 1 1 1 1 1 1 2 2 2 2 2 2]
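If a separate Series with one value per 7-row block is wanted (as the question asks) rather than a repeated column, the same grouping key can be aggregated with mean() instead of transform. A minimal sketch with made-up prices:

```python
import numpy as np
import pandas as pd

# Hypothetical daily prices: 1.0, 2.0, ..., 16.0
prices = pd.Series(np.arange(1, 17, dtype=float))

# One group label per block of 7 consecutive rows
block = np.arange(len(prices)) // 7

# One mean per block; the last block may hold fewer than 7 rows
weekly = prices.groupby(block).mean()
print(weekly)
```

The result is indexed by block number (0, 1, 2, ...), and the final, incomplete block is averaged over the rows it actually contains.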

pandas DataFrame cumulative value

I have the following pandas dataframe:
>>> df
Category Year Costs
0 A 1 20.00
1 A 2 30.00
2 A 3 40.00
3 B 1 15.00
4 B 2 25.00
5 B 3 35.00
How do I add a cumulative cost column that adds up the costs for the same category over the current and previous years? Example of the extra column with the previous df:
>>> new_df
Category Year Costs Cumulative Costs
0 A 1 20.00 20.00
1 A 2 30.00 50.00
2 A 3 40.00 90.00
3 B 1 15.00 15.00
4 B 2 25.00 40.00
5 B 3 35.00 75.00
Suggestions?
This works in pandas 0.17.0. Thanks to @DSM in the comments for the terser solution.
df['Cumulative Costs'] = df.groupby(['Category'])['Costs'].cumsum()
>>> df
Category Year Costs Cumulative Costs
0 A 1 20 20
1 A 2 30 50
2 A 3 40 90
3 B 1 15 15
4 B 2 25 40
5 B 3 35 75
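Note that cumsum accumulates in row order, so the result above relies on the rows already being sorted by Year within each Category. If that is not guaranteed, a sort first is a cheap safeguard. A small sketch with the same values in shuffled order (the shuffling is an assumption made for illustration):

```python
import pandas as pd

# Same values as in the question, deliberately out of order
df = pd.DataFrame({
    "Category": ["A", "B", "A", "B", "A", "B"],
    "Year": [3, 1, 1, 3, 2, 2],
    "Costs": [40.0, 15.0, 20.0, 35.0, 30.0, 25.0],
})

# Sort so that each category's years run in ascending order
df = df.sort_values(["Category", "Year"]).reset_index(drop=True)
df["Cumulative Costs"] = df.groupby("Category")["Costs"].cumsum()
print(df)
```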
