I have data about how many messages each account sends, aggregated to an hourly level. For each row, I would like to add a column with the sum of the previous 7 days' messages. I know I can group by account and date and aggregate the number of messages to the daily level, but I'm having a hard time calculating the rolling value, because there isn't a row in the data if the account didn't send any messages that day (and I'd like to avoid ballooning my data by adding those rows in, if at all possible). If I could figure out a way to calculate the rolling 7-day average for each day that each account sent messages, I could then re-join that number back to the hourly data (is my hope). Any suggestions?
Note: For any day not in the data, assume 0 messages sent.
Raw Data:
Account | Messages | Date       | Hour
12      | 5        | 2022-07-11 | 09:00:00
12      | 6        | 2022-07-13 | 10:00:00
12      | 10       | 2022-07-13 | 11:00:00
12      | 9        | 2022-07-15 | 16:00:00
12      | 1        | 2022-07-19 | 13:00:00
15      | 2        | 2022-07-12 | 10:00:00
15      | 13       | 2022-07-13 | 11:00:00
15      | 3        | 2022-07-17 | 16:00:00
15      | 4        | 2022-07-22 | 13:00:00
Desired Output:
Account | Messages | Date       | Hour     | Rolling Previous 7 Day Average
12      | 5        | 2022-07-11 | 09:00:00 | 0
12      | 6        | 2022-07-13 | 10:00:00 | 0.714
12      | 10       | 2022-07-13 | 11:00:00 | 0.714
12      | 9        | 2022-07-15 | 16:00:00 | 3
12      | 1        | 2022-07-19 | 13:00:00 | 3.571
15      | 2        | 2022-07-12 | 10:00:00 | 0
15      | 13       | 2022-07-13 | 11:00:00 | 0.286
15      | 3        | 2022-07-17 | 16:00:00 | 2.143
15      | 4        | 2022-07-22 | 13:00:00 | 0.429
I hope I've understood your question right:
import pandas as pd

df["Date"] = pd.to_datetime(df["Date"])

# Daily total per account, broadcast back onto every hourly row
df["Messages_tmp"] = df.groupby(["Account", "Date"])["Messages"].transform("sum")

df["Rolling Previous 7 Day Average"] = (
    df.set_index("Date")
    .groupby("Account")["Messages_tmp"]
    .rolling("7D")
    # Inside each window: keep one row per day, shift() to drop the current
    # (last) day, and divide by 7 so absent days count as zero
    .apply(lambda x: x.loc[~x.index.duplicated()].shift().sum() / 7)
).values
df = df.drop(columns="Messages_tmp")
print(df)
Prints:
Account Messages Date Hour Rolling Previous 7 Day Average
0 12 5 2022-07-11 09:00:00 0.000000
1 12 6 2022-07-13 10:00:00 0.714286
2 12 10 2022-07-13 11:00:00 0.714286
3 12 9 2022-07-15 16:00:00 3.000000
4 12 1 2022-07-19 13:00:00 3.571429
5 15 2 2022-07-12 10:00:00 0.000000
6 15 13 2022-07-13 11:00:00 0.285714
7 15 3 2022-07-17 16:00:00 2.142857
8 15 4 2022-07-22 13:00:00 0.428571
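Alternatively, following the plan sketched in the question (aggregate to daily, roll, re-join), a time-based window with closed="left" excludes the current day without inserting zero-filled rows. A minimal self-contained sketch of that approach, rebuilding the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "Account": [12, 12, 12, 12, 12, 15, 15, 15, 15],
    "Messages": [5, 6, 10, 9, 1, 2, 13, 3, 4],
    "Date": pd.to_datetime([
        "2022-07-11", "2022-07-13", "2022-07-13", "2022-07-15", "2022-07-19",
        "2022-07-12", "2022-07-13", "2022-07-17", "2022-07-22",
    ]),
})

# 1) collapse to one row per account/day
daily = df.groupby(["Account", "Date"], as_index=False)["Messages"].sum()

# 2) time-based window: closed="left" spans the 7 days before (and excluding)
#    each date, and dividing by 7 treats absent days as zero
avg = (daily.set_index("Date")
            .groupby("Account")["Messages"]
            .rolling("7D", closed="left")
            .sum()
            .div(7)
            .fillna(0)  # first day per account has an empty window
            .reset_index(name="Rolling Previous 7 Day Average"))

# 3) re-join to the hourly rows on account + day
out = df.merge(avg, on=["Account", "Date"], how="left")
print(out)
```

This avoids materializing rows for the zero-message days entirely; the merge broadcasts the daily rolling value to every hourly row for that account/day.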
I'm trying to merge two DataFrames: one has a datetime column, and the other has just a date column. My application for this is to find yesterday's high price using an OHLC dataset. I've attached some starter code below, but I'll describe what I'm looking for.
Given this intraday dataset:
time current_intraday_high
0 2022-02-11 09:00:00 1
1 2022-02-11 10:00:00 2
2 2022-02-11 11:00:00 3
3 2022-02-11 12:00:00 4
4 2022-02-11 13:00:00 5
5 2022-02-14 09:00:00 6
6 2022-02-14 10:00:00 7
7 2022-02-14 11:00:00 8
8 2022-02-14 12:00:00 9
9 2022-02-14 13:00:00 10
10 2022-02-15 09:00:00 11
11 2022-02-15 10:00:00 12
12 2022-02-15 11:00:00 13
13 2022-02-15 12:00:00 14
14 2022-02-15 13:00:00 15
15 2022-02-16 09:00:00 16
16 2022-02-16 10:00:00 17
17 2022-02-16 11:00:00 18
18 2022-02-16 12:00:00 19
19 2022-02-16 13:00:00 20
...and this daily dataframe:
time daily_high
0 2022-02-11 5
1 2022-02-14 10
2 2022-02-15 15
3 2022-02-16 20
...how can I merge them together, and have each row of the intraday dataframe contain the previous (business) day's high price, like so:
time current_intraday_high yesterdays_high
0 2022-02-11 09:00:00 1 NaN
1 2022-02-11 10:00:00 2 NaN
2 2022-02-11 11:00:00 3 NaN
3 2022-02-11 12:00:00 4 NaN
4 2022-02-11 13:00:00 5 NaN
5 2022-02-14 09:00:00 6 5.0
6 2022-02-14 10:00:00 7 5.0
7 2022-02-14 11:00:00 8 5.0
8 2022-02-14 12:00:00 9 5.0
9 2022-02-14 13:00:00 10 5.0
10 2022-02-15 09:00:00 11 10.0
11 2022-02-15 10:00:00 12 10.0
12 2022-02-15 11:00:00 13 10.0
13 2022-02-15 12:00:00 14 10.0
14 2022-02-15 13:00:00 15 10.0
15 2022-02-16 09:00:00 16 15.0
16 2022-02-16 10:00:00 17 15.0
17 2022-02-16 11:00:00 18 15.0
18 2022-02-16 12:00:00 19 15.0
19 2022-02-16 13:00:00 20 15.0
(Note the NaNs at the top, because we don't have any data for Feb 10, 2022 in the intraday dataset, and see how each row contains the intraday data plus the PREVIOUS day's max "high" price.)
Minimal reproducible example code below:
import pandas as pd
###################################################
# CREATE MOCK INTRADAY DATAFRAME
###################################################
intraday_date_time = [
"2022-02-11 09:00:00",
"2022-02-11 10:00:00",
"2022-02-11 11:00:00",
"2022-02-11 12:00:00",
"2022-02-11 13:00:00",
"2022-02-14 09:00:00",
"2022-02-14 10:00:00",
"2022-02-14 11:00:00",
"2022-02-14 12:00:00",
"2022-02-14 13:00:00",
"2022-02-15 09:00:00",
"2022-02-15 10:00:00",
"2022-02-15 11:00:00",
"2022-02-15 12:00:00",
"2022-02-15 13:00:00",
"2022-02-16 09:00:00",
"2022-02-16 10:00:00",
"2022-02-16 11:00:00",
"2022-02-16 12:00:00",
"2022-02-16 13:00:00",
]
intraday_date_time = pd.to_datetime(intraday_date_time)
intraday_df = pd.DataFrame(
{
"time": intraday_date_time,
"current_intraday_high": [x for x in range(1, 21)],
},
)
print(intraday_df)
# intraday_df.to_csv('intradayTEST.csv', index=True)
###################################################
# AGGREGATE/UPSAMPLE TO DAILY DATAFRAME
###################################################
# Aggregate to business days using intraday_df
agg_dict = {'current_intraday_high': 'max'}
daily_df = intraday_df.set_index('time').resample('B').agg(agg_dict).reset_index()
daily_df.rename(columns={"current_intraday_high": "daily_high"}, inplace=True)
print(daily_df)
# daily_df.to_csv('dailyTEST.csv', index=True)
###################################################
# MERGE THE TWO DATAFRAMES
###################################################
# Need to merge the daily dataset to the intraday dataset, such that,
# any row on the newly merged/joined/concat'd dataset will have:
# 1. The current intraday datetime in the 'time' column
# 2. The current 'intraday_high' value
# 3. The PREVIOUS DAY's 'daily_high' value
# This doesn't work as the daily_df just gets appended to the bottom
# of the intraday_df due to the datetimes/dates merging
merged_df = pd.merge(intraday_df, daily_df, how='outer', on='time')
print(merged_df)
pd.merge_asof allows you to easily do a merge like this.
# Daily high per business day, shifted forward one day = "yesterday's high"
yesterdays_high = (intraday_df.resample('B', on='time')['current_intraday_high'].max()
                   .shift()
                   .rename('yesterdays_high')
                   .reset_index())
# Backward asof-merge: each intraday row picks up the most recent daily row
# at or before its timestamp
merged_df = pd.merge_asof(intraday_df, yesterdays_high, on='time')
print(merged_df)
Output:
time current_intraday_high yesterdays_high
0 2022-02-11 09:00:00 1 NaN
1 2022-02-11 10:00:00 2 NaN
2 2022-02-11 11:00:00 3 NaN
3 2022-02-11 12:00:00 4 NaN
4 2022-02-11 13:00:00 5 NaN
5 2022-02-14 09:00:00 6 5.0
6 2022-02-14 10:00:00 7 5.0
7 2022-02-14 11:00:00 8 5.0
8 2022-02-14 12:00:00 9 5.0
9 2022-02-14 13:00:00 10 5.0
10 2022-02-15 09:00:00 11 10.0
11 2022-02-15 10:00:00 12 10.0
12 2022-02-15 11:00:00 13 10.0
13 2022-02-15 12:00:00 14 10.0
14 2022-02-15 13:00:00 15 10.0
15 2022-02-16 09:00:00 16 15.0
16 2022-02-16 10:00:00 17 15.0
17 2022-02-16 11:00:00 18 15.0
18 2022-02-16 12:00:00 19 15.0
19 2022-02-16 13:00:00 20 15.0
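One caveat: merge_asof matches backward without limit by default, so if a daily bar were ever missing it would silently reach further back in time. The tolerance parameter guards against that. A hedged, self-contained sketch using compact stand-in data (not the full MRE above):

```python
import pandas as pd

# Compact stand-in for the intraday frame in the question
intraday_df = pd.DataFrame({
    "time": pd.date_range("2022-02-11 09:00", periods=5, freq="h").append(
        pd.date_range("2022-02-14 09:00", periods=5, freq="h")),
    "current_intraday_high": range(1, 11),
})

yesterdays_high = (intraday_df.resample("B", on="time")["current_intraday_high"]
                   .max()
                   .shift()
                   .rename("yesterdays_high")
                   .reset_index())

# tolerance: only accept a daily row at most one day older than the intraday row;
# anything staler stays NaN instead of silently matching an old bar
merged_df = pd.merge_asof(intraday_df, yesterdays_high, on="time",
                          tolerance=pd.Timedelta("1D"))
print(merged_df)
```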
Given your already existing code, you can map the shifted values:
intraday_df['yesterdays_high'] = (intraday_df['time']
.dt.date
.map(daily_df['daily_high']
.set_axis(daily_df['time'].shift(-1)))
)
If you don't have all days and really want to map the real previous business day:
intraday_df['yesterdays_high'] = (intraday_df['time']
.dt.date
.map(daily_df['daily_high']
.set_axis(daily_df['time'].add(pd.offsets.BusinessDay())))
)
Output:
time current_intraday_high yesterdays_high
0 2022-02-11 09:00:00 1 NaN
1 2022-02-11 10:00:00 2 NaN
2 2022-02-11 11:00:00 3 NaN
3 2022-02-11 12:00:00 4 NaN
4 2022-02-11 13:00:00 5 NaN
5 2022-02-14 09:00:00 6 5.0
6 2022-02-14 10:00:00 7 5.0
7 2022-02-14 11:00:00 8 5.0
8 2022-02-14 12:00:00 9 5.0
9 2022-02-14 13:00:00 10 5.0
10 2022-02-15 09:00:00 11 10.0
11 2022-02-15 10:00:00 12 10.0
12 2022-02-15 11:00:00 13 10.0
13 2022-02-15 12:00:00 14 10.0
14 2022-02-15 13:00:00 15 10.0
15 2022-02-16 09:00:00 16 15.0
16 2022-02-16 10:00:00 17 15.0
17 2022-02-16 11:00:00 18 15.0
18 2022-02-16 12:00:00 19 15.0
19 2022-02-16 13:00:00 20 15.0
We can use .dt.date as an index to join the two frames on matching days. For the previous day's high price, we can apply shift on daily_df:
intra_date = intraday_df['time'].dt.date
daily_date = daily_df['time'].dt.date
answer = intraday_df.set_index(intra_date).join(
daily_df.set_index(daily_date)['daily_high'].shift()
).reset_index(drop=True)
I have a column which I have converted to datetime:
df['date'] = pd.to_datetime(df['date'], errors='coerce')
date
2021-10-21 00:00:00
2021-10-24 00:00:00
2021-10-25 00:00:00
2021-10-26 00:00:00
And I need to add 1 year to this time based on a conditional:
df.loc[df['quarter'] == "Q4_", 'date'] + pd.offsets.DateOffset(years=1)
but it's not working....
date
2021-10-21 00:00:00
2021-10-24 00:00:00
2021-10-25 00:00:00
2021-10-26 00:00:00
I have tried converting it to period since I only need the year to be used in a concatenation later:
df['year'] = df['date'].dt.to_period('Y')
but I cannot add any number to a period.
This appears to be working for me:
import pandas as pd
df = pd.DataFrame({'date':pd.date_range('1/1/2021', periods=50, freq='M')})
print(df.head(24))
Input:
date
0 2021-01-31
1 2021-02-28
2 2021-03-31
3 2021-04-30
4 2021-05-31
5 2021-06-30
6 2021-07-31
7 2021-08-31
8 2021-09-30
9 2021-10-31
10 2021-11-30
11 2021-12-31
12 2022-01-31
13 2022-02-28
14 2022-03-31
15 2022-04-30
16 2022-05-31
17 2022-06-30
18 2022-07-31
19 2022-08-31
20 2022-09-30
21 2022-10-31
22 2022-11-30
23 2022-12-31
Add a year:
df.loc[df['date'].dt.quarter == 4, 'date'] += pd.offsets.DateOffset(years=1)
print(df.head(24))
Note that per your logic, the year increases starting in October. Also note the +=: your original snippet computed the offset but never assigned the result back to the DataFrame, which is why nothing appeared to change.
Output:
date
0 2021-01-31
1 2021-02-28
2 2021-03-31
3 2021-04-30
4 2021-05-31
5 2021-06-30
6 2021-07-31
7 2021-08-31
8 2021-09-30
9 2022-10-31
10 2022-11-30
11 2022-12-31
12 2022-01-31
13 2022-02-28
14 2022-03-31
15 2022-04-30
16 2022-05-31
17 2022-06-30
18 2022-07-31
19 2022-08-31
20 2022-09-30
21 2023-10-31
22 2023-11-30
23 2023-12-31
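As an aside on the period attempt in the question: integer arithmetic on Period dtype does work, which may be handy since only the year is needed for the concatenation later. A quick check with a hypothetical minimal frame:

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2021-10-21", "2021-07-01"])})

years = df["date"].dt.to_period("Y")
# Adding an integer shifts each period by one step of its frequency (one year here)
print((years + 1).astype(str).tolist())
```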
I have got a dataframe loc_df where all the values are datetime and some of them are NaT. This is what loc_df looks like:
loc_df = pd.DataFrame({'10101':['2020-01-03','2019-11-06','2019-10-09','2019-09-26','2019-09-19','2019-08-19','2019-08-08','2019-07-05','2019-07-04','2019-06-27','2019-05-21','2019-04-21','2019-04-15','2019-04-06','2019-03-28','2019-02-28'], '10102':['2020-01-03','2019-11-15','2019-11-11','2019-10-23','2019-10-10','2019-10-06','2019-09-26','2019-07-14','2019-05-21','2019-03-15','2019-03-11','2019-02-27','2019-02-25',None,None,None], '10103':['2019-08-27','2019-07-14','2019-06-24','2019-05-21','2019-04-11','2019-03-06','2019-02-11',None,None,None,None,None,None,None,None,None]})
loc_df = loc_df.apply(pd.to_datetime)
print(loc_df)
10101 10102 10103
0 2020-01-03 2020-01-03 2019-08-27
1 2019-11-06 2019-11-15 2019-07-14
2 2019-10-09 2019-11-11 2019-06-24
3 2019-09-26 2019-10-23 2019-05-21
4 2019-09-19 2019-10-10 2019-04-11
5 2019-08-19 2019-10-06 2019-03-06
6 2019-08-08 2019-09-26 2019-02-11
7 2019-07-05 2019-07-14 NaT
8 2019-07-04 2019-05-21 NaT
9 2019-06-27 2019-03-15 NaT
10 2019-05-21 2019-03-11 NaT
11 2019-04-21 2019-02-27 NaT
12 2019-04-15 2019-02-25 NaT
13 2019-04-06 NaT NaT
14 2019-03-28 NaT NaT
15 2019-02-28 NaT NaT
I want to know the days between the dates for each column, so I have used:
loc_df = loc_df.diff(periods = -1)
The result was:
print(loc_df)
10101 10102 10103
0 58 days 49 days 00:00:00 44 days 00:00:00
1 28 days 4 days 00:00:00 20 days 00:00:00
2 13 days 19 days 00:00:00 34 days 00:00:00
3 7 days 13 days 00:00:00 40 days 00:00:00
4 31 days 4 days 00:00:00 36 days 00:00:00
5 11 days 10 days 00:00:00 23 days 00:00:00
6 34 days 74 days 00:00:00 -88814 days +00:12:43.145224
7 1 days 54 days 00:00:00 0 days 00:00:00
8 7 days 67 days 00:00:00 0 days 00:00:00
9 37 days 4 days 00:00:00 0 days 00:00:00
10 30 days 12 days 00:00:00 0 days 00:00:00
11 6 days 2 days 00:00:00 0 days 00:00:00
12 9 days -88800 days +00:12:43.145224 0 days 00:00:00
13 9 days 0 days 00:00:00 0 days 00:00:00
14 28 days 0 days 00:00:00 0 days 00:00:00
15 NaT NaT NaT
Do you know why I get these huge values at the end of each column? I guess it has something to do with subtracting a NaT from a datetime.
Is there an alternative to my code that prevents this?
Thanks in advance
If you have some initial data:
print(loc_df)
10101 10102 10103
0 2020-01-03 2020-01-03 2019-08-27
1 2019-11-06 2019-11-15 2019-07-14
2 2019-10-09 2019-11-11 2019-06-24
3 2019-09-26 2019-10-23 2019-05-21
4 2019-09-19 2019-10-10 2019-04-11
5 2019-08-19 2019-10-06 2019-03-06
6 2019-08-08 2019-09-26 2019-02-11
7 2019-07-05 2019-07-14 NaT
8 2019-07-04 2019-05-21 NaT
9 2019-06-27 2019-03-15 NaT
10 2019-05-21 2019-03-11 NaT
11 2019-04-21 2019-02-27 NaT
12 2019-04-15 2019-02-25 NaT
13 2019-04-06 NaT NaT
14 2019-03-28 NaT NaT
15 2019-02-28 NaT NaT
You could use DataFrame.ffill to fill in the NaT values before you use diff():
loc_df = loc_df.ffill()
loc_df = loc_df.diff(periods=-1)
print(loc_df)
10101 10102 10103
0 58 days 49 days 44 days
1 28 days 4 days 20 days
2 13 days 19 days 34 days
3 7 days 13 days 40 days
4 31 days 4 days 36 days
5 11 days 10 days 23 days
6 34 days 74 days 0 days
7 1 days 54 days 0 days
8 7 days 67 days 0 days
9 37 days 4 days 0 days
10 30 days 12 days 0 days
11 6 days 2 days 0 days
12 9 days 0 days 0 days
13 9 days 0 days 0 days
14 28 days 0 days 0 days
15 NaT NaT NaT
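If you would rather keep the trailing gaps as missing instead of the 0-day fills that ffill produces, an alternative sketch (assuming a loc_df shaped like the one above, shown here with a smaller stand-in) is to drop each column's NaT values before diffing; DataFrame.apply re-aligns the per-column results, leaving NaT wherever there was no pair of real dates:

```python
import pandas as pd

loc_df = pd.DataFrame({
    "10101": pd.to_datetime(["2020-01-03", "2019-11-06", "2019-10-09"]),
    "10103": pd.to_datetime(["2019-08-27", "2019-07-14", None]),
})

# Diff only the real dates in each column; dropped slots come back as NaT
diffs = loc_df.apply(lambda col: col.dropna().diff(periods=-1))
print(diffs)
```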
I have dataframe like:
Timestamp Sold
10.01.2017 10:00:20 10
10.01.2017 10:01:55 20
10.01.2017 11:02:11 15
11.01.2017 11:04:30 10
11.01.2017 11:15:35 35
12.01.2017 10:02:01 22
How can I resample it by hour? An ordinary resample covers every hour from the first row to the last, but what I need is to restrict it to the 10:00–11:00 time frame and resample within that window.
Last df should be like this:
Timestamp Sold
10.01.2017 10:00:00 30
10.01.2017 11:00:00 15
11.01.2017 10:00:00 NAN
11.01.2017 11:00:00 45
12.01.2017 10:00:00 22
12.01.2017 11:00:00 NAN
You could do something like this:
df_out = df.groupby(df.Timestamp.dt.floor('H')).sum()
df_out.reset_index()
Output:
Timestamp Sold
0 2017-10-01 10:00:00 30
1 2017-10-01 11:00:00 15
2 2017-11-01 11:00:00 45
3 2017-12-01 10:00:00 22
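The desired output also lists NaN rows for hours inside the 10:00–11:00 window that have no data (e.g. 11.01 10:00), which the groupby alone doesn't produce. One way to get those, sketched here with the sample frame rebuilt (dayfirst parsing assumed for the DD.MM.YYYY timestamps), is to reindex the grouped result on a full day × hour grid:

```python
import pandas as pd

df = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "10.01.2017 10:00:20", "10.01.2017 10:01:55", "10.01.2017 11:02:11",
        "11.01.2017 11:04:30", "11.01.2017 11:15:35", "12.01.2017 10:02:01",
    ], dayfirst=True),
    "Sold": [10, 20, 15, 10, 35, 22],
})

hourly = df.groupby(df["Timestamp"].dt.floor("h"))["Sold"].sum()

# Full grid: every day present in the data x the 10:00 and 11:00 slots
days = hourly.index.normalize().unique()
grid = pd.DatetimeIndex([d + pd.Timedelta(hours=h) for d in days for h in (10, 11)])

# Reindexing fills the empty hour slots with NaN
out = hourly.reindex(grid).rename_axis("Timestamp").reset_index()
print(out)
```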
I have a dataframe with period_start_time at 15-minute intervals, and I need to aggregate to 1 hour and calculate the sum and average of almost every column in the dataframe (it has about 20 columns):
PERIOD_START_TIME ID val1 val2
06.21.2017 22:15:00 12 3 0
06.21.2017 22:30:00 12 5 6
06.21.2017 22:45:00 12 0 3
06.21.2017 23:00:00 12 5 2
...
06.21.2017 22:15:00 15 9 2
06.21.2017 22:30:00 15 0 2
06.21.2017 22:45:00 15 1 5
06.21.2017 23:00:00 15 0 1
...
Desired output:
PERIOD_START_TIME ID val1(avg) val1(sum) val1(max) ...
06.21.2017 22:00:00 12 3.25 13 5
...
06.21.2017 23:00:00 15 2.25 10 9 ...
And the same for column val2, and for every other column in the dataframe.
I have no idea how to group by period start time for every hour (rather than for the whole day), so I don't know where to start.
I believe you need Series.dt.floor to truncate the timestamps to hours, and then to aggregate by agg:
df = df.groupby([df['PERIOD_START_TIME'].dt.floor('H'),'ID']).agg(['mean','sum', 'max'])
#for columns from MultiIndex
df.columns = df.columns.map('_'.join)
print (df)
val1_mean val1_sum val1_max val2_mean val2_sum \
PERIOD_START_TIME ID
2017-06-21 22:00:00 12 2.666667 8 5 3 9
15 3.333333 10 9 3 9
2017-06-21 23:00:00 12 5.000000 5 5 2 2
15 0.000000 0 0 1 1
val2_max
PERIOD_START_TIME ID
2017-06-21 22:00:00 12 6
15 5
2017-06-21 23:00:00 12 2
15 1
df = df.reset_index()
print (df)
PERIOD_START_TIME ID val1_mean val1_sum val1_max val2_mean val2_sum \
0 2017-06-21 22:00 12 2.666667 8 5 3 9
1 2017-06-21 22:00 15 3.333333 10 9 3 9
2 2017-06-21 23:00 12 5.000000 5 5 2 2
3 2017-06-21 23:00 15 0.000000 0 0 1 1
val2_max
0 6
1 5
2 2
3 1
Very similarly, you can convert PERIOD_START_TIME to a pandas Period:
df['PERIOD_START_TIME'] = df['PERIOD_START_TIME'].dt.to_period('H')
df.groupby(['PERIOD_START_TIME', 'ID']).agg(['max', 'min', 'mean']).reset_index()
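If the flattened column names matter, pandas' named aggregation can produce them directly, without the columns.map('_'.join) step. A sketch with hypothetical sample data mirroring the question (val1 only, for brevity):

```python
import pandas as pd

df = pd.DataFrame({
    "PERIOD_START_TIME": pd.to_datetime([
        "2017-06-21 22:15", "2017-06-21 22:30",
        "2017-06-21 22:45", "2017-06-21 23:00",
    ]),
    "ID": [12, 12, 12, 12],
    "val1": [3, 5, 0, 5],
})

# Named aggregation: output_column=(input_column, function)
out = (df.groupby([df["PERIOD_START_TIME"].dt.floor("h"), "ID"])
         .agg(val1_mean=("val1", "mean"),
              val1_sum=("val1", "sum"),
              val1_max=("val1", "max"))
         .reset_index())
print(out)
```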