Python: Aggregate & averaging rows based on Multiple Conditions - python

I am trying to find the average travel time for an average work day (5 days) and an average weekend day (2 days).
I'm trying to aggregate and find the average for the veh-time for all the rows with the same 'Time' and same 'Day_type'. Because 'Time' values include seconds as well, I'm finding some trouble matching all the veh-times that belong to the same 'Time'.
My dataframe is set-up in the following way:
veh-time distance Date Time Day_of_week Day_type
0 72 379.0 2018-10-18 22:15:21 Thursday Weekday
1 72 379.0 2018-10-18 22:30:21 Friday Weekend
2 72 379.0 2018-10-18 22:45:22 Saturday Weekend
3 72 379.0 2018-10-18 23:00:20 Sunday Weekday
4 72 379.0 2018-10-18 23:15:21 Monday Weekday
5 72 379.0 2018-10-18 23:15:21 Tuesday Weekday
6 72 379.0 2018-10-18 23:15:21 Wednesday Weekday
7 72 379.0 2018-10-18 22:15:21 Thursday Weekday
8 72 379.0 2018-10-18 22:30:21 Friday Weekend
9 72 379.0 2018-10-18 22:45:22 Saturday Weekend
10 72 379.0 2018-10-18 23:00:20 Sunday Weekday
11 72 379.0 2018-10-18 23:15:21 Monday Weekday
12 72 379.0 2018-10-18 23:15:21 Tuesday Weekday
13 72 379.0 2018-10-18 23:15:21 Wednesday Weekday
I'm guessing the process would look like this:
STEP 1:
split the 'Time' column so it'll only show HH:MM. Maybe use regex or str.split()
STEP 2:
group all veh-time rows with matching 'Time' AND 'Day_type' - e.g. all rows with time 22:15 and day type Weekday
STEP 3:
add a new column: 'avg_vt' after finding the average for the grouped rows in step 2.
avg_vt = veh-time + veh-time etc. / # of Day_type instance identified
Thanks,
R

Use transform to fill a new column with the aggregated values. To extract HH:MM, either use rsplit with n=1, which splits from the right at the last ':' only, or convert to datetimes and format back to an HH:MM string with strftime:
df['avg_vt'] = df.groupby([df['Time'].str.rsplit(':', n=1).str[0],
'Day_type'])['veh-time'].transform('mean')
Alternative:
df['avg_vt'] = df.groupby([pd.to_datetime(df['Time']).dt.strftime('%H:%M'),
'Day_type'])['veh-time'].transform('mean')
print (df)
veh-time distance Date Time Day_of_week Day_type avg_vt
0 72 379.0 2018-10-18 22:15:21 Thursday Weekday 72
1 72 379.0 2018-10-18 22:30:21 Friday Weekend 72
2 72 379.0 2018-10-18 22:45:22 Saturday Weekend 72
3 72 379.0 2018-10-18 23:00:20 Sunday Weekday 72
4 72 379.0 2018-10-18 23:15:21 Monday Weekday 72
5 72 379.0 2018-10-18 23:15:21 Tuesday Weekday 72
6 72 379.0 2018-10-18 23:15:21 Wednesday Weekday 72
7 72 379.0 2018-10-18 22:15:21 Thursday Weekday 72
8 72 379.0 2018-10-18 22:30:21 Friday Weekend 72
9 72 379.0 2018-10-18 22:45:22 Saturday Weekend 72
10 72 379.0 2018-10-18 23:00:20 Sunday Weekday 72
11 72 379.0 2018-10-18 23:15:21 Monday Weekday 72
12 72 379.0 2018-10-18 23:15:21 Tuesday Weekday 72
13 72 379.0 2018-10-18 23:15:21 Wednesday Weekday 72
Detail:
print (df['Time'].str.rsplit(':', n=1).str[0])
0 22:15
1 22:30
2 22:45
3 23:00
4 23:15
5 23:15
...
Name: Time, dtype: object
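Because every veh-time in the sample data is 72, the effect of the grouping is hard to see. Here is a minimal sketch on made-up values (the four rows below are hypothetical, not the asker's data) showing both the per-row transform and the collapsed one-row-per-group form:

```python
import pandas as pd

# Hypothetical sample: readings in the same minute differ only in seconds.
df = pd.DataFrame({
    "veh-time": [70, 74, 60, 80],
    "Time": ["22:15:21", "22:15:48", "22:30:05", "22:30:59"],
    "Day_type": ["Weekday", "Weekday", "Weekend", "Weekday"],
})

# Drop the seconds so rows in the same minute share a grouping key.
hhmm = df["Time"].str.rsplit(":", n=1).str[0]

# transform('mean') broadcasts each group's mean back onto its rows.
df["avg_vt"] = df.groupby([hhmm, "Day_type"])["veh-time"].transform("mean")

# Plain aggregation instead yields one row per (HH:MM, Day_type) pair.
summary = (
    df.groupby([hhmm.rename("hhmm"), "Day_type"])["veh-time"]
      .mean()
      .reset_index(name="avg_vt")
)

print(df)
print(summary)
```

Use the transform form when the average has to sit next to each original row, and the aggregated form when only the averages per time slot are needed.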

Related

Calculate Weekly Percent Change in Timeseries data using Pandas for Missing Data in Dataframe

I have financial data and am trying to calculate the percent change in values between two consecutive Thursdays. Sometimes, because of a holiday on Thursday, that week's Thursday data is absent, so for that week I want to calculate pct_change from Thursday to Wednesday, since the Thursday row is not present in the dataframe.
Reproducible Code-
# !pip install investpy
import numpy as np
import pandas as pd
import investpy
from datetime import datetime
df = investpy.get_index_historical_data(index="Nifty 50",country="India",from_date=("23/03/2022"),to_date= "23/04/2022")
df['weekday'] = df.index.day_name()
df = df.loc[:, ['Close', 'weekday']]
df.tail(10)
Output-
Close weekday
Date
2022-04-07 17639.55 Thursday
2022-04-08 17784.35 Friday
2022-04-11 17674.95 Monday
2022-04-12 17530.30 Tuesday
2022-04-13 17475.65 Wednesday
2022-04-18 17173.65 Monday
2022-04-19 16958.65 Tuesday
2022-04-20 17136.55 Wednesday
2022-04-21 17392.60 Thursday
2022-04-22 17171.95 Friday
In df.tail(10), 14-Apr-2022 is missing because it was a holiday, so in that case I want to calculate pct_change from Thursday to Wednesday.
Code I used previously to calculate pct_returns
weekly_pct_change = df.loc[df['weekday'] == 'Thursday']
weekly_pct_change['pct_change']= np.log(1+weekly_pct_change['Close'].pct_change())*100
weekly_pct_change
Output-
Close weekday pct_change
Date
2022-03-24 17222.75 Thursday NaN
2022-03-31 17464.75 Thursday 1.395338
2022-04-07 17639.55 Thursday 0.995898
2022-04-21 17392.60 Thursday -1.409871
One idea is to add the missing datetimes with DataFrame.asfreq and method='ffill', then reassign the day names with DatetimeIndex.day_name:
df1 = df.asfreq('B', method='ffill')
df1['weekday'] = df1.index.day_name()
print (df1.tail(10))
Close weekday
Date
2022-04-11 17674.95 Monday
2022-04-12 17530.30 Tuesday
2022-04-13 17475.65 Wednesday
2022-04-14 17475.65 Thursday
2022-04-15 17475.65 Friday
2022-04-18 17173.65 Monday
2022-04-19 16958.65 Tuesday
2022-04-20 17136.55 Wednesday
2022-04-21 17392.60 Thursday
2022-04-22 17171.95 Friday
weekly_pct_change = df1.loc[df1['weekday'] == 'Thursday'].copy()
weekly_pct_change['pct_change']= np.log(1+weekly_pct_change['Close'].pct_change())*100
print(weekly_pct_change)
Close weekday pct_change
Date
2022-03-24 17222.75 Thursday NaN
2022-03-31 17464.75 Thursday 1.395338
2022-04-07 17639.55 Thursday 0.995898
2022-04-14 17475.65 Thursday -0.933506
2022-04-21 17392.60 Thursday -0.476366
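The same idea can be tried without downloading anything. This is a minimal sketch on made-up closing prices (the values and the `close` Series are hypothetical), with 2022-04-14 left out on purpose so the forward fill has a gap to cover:

```python
import numpy as np
import pandas as pd

# Hypothetical closes; Thursday 2022-04-14 is deliberately missing.
dates = pd.to_datetime([
    "2022-04-07", "2022-04-08", "2022-04-11", "2022-04-12", "2022-04-13",
    "2022-04-18", "2022-04-19", "2022-04-20", "2022-04-21",
])
close = pd.Series(
    [100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0, 110.0],
    index=dates, name="Close",
)

# Rebuild a business-day index; missing days copy the previous close.
filled = close.asfreq("B", method="ffill")

# Keep Thursdays only; the filled Thursday carries Wednesday's close.
thursdays = filled[filled.index.day_name() == "Thursday"]

# Weekly log return in percent, same as np.log(1 + pct_change()) * 100.
log_ret = np.log(thursdays / thursdays.shift(1)) * 100

print(thursdays)
print(log_ret)
```

Note that the forward fill also fills any other missing business day (here Friday 2022-04-15), which is harmless when only Thursdays are selected afterwards.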

Business Days Resample with Offset

I'm trying to resample daily frequency data to business days using the Pandas resample function with an offset so the last day of the week becomes Thursday and the beginning Sunday.
This is the code so far:
import pandas as pd
resampled_data = df.resample('B', base=-1)
But it keeps resampling so Friday is being used in the resample and Sunday is excluded. I tried many different values for base and loffset but it's not affecting the resampling.
Please note: The raw data is using UTC timestamps. Timezone is Eastern Daylight Time. Sunday UTC 21:00 - Thursday UTC 21:00.
Use a CustomBusinessDay(). I've resampled the whole of Jan which includes Fri / Sat and also included day_name() and dayofweek to show it has worked.
import datetime as dt
import pandas as pd
df = pd.DataFrame(index=pd.date_range(dt.datetime(2020,1,1), dt.datetime(2020,2,1)))
bd = pd.tseries.offsets.CustomBusinessDay(n=1,
weekmask="Sun Mon Tue Wed Thu")
df = df.resample(rule=bd).first().assign(
day=lambda dfa: dfa.index.day_name(),
dn=lambda dfa: dfa.index.dayofweek
)
output
day dn
2020-01-01 Wednesday 2
2020-01-02 Thursday 3
2020-01-05 Sunday 6
2020-01-06 Monday 0
2020-01-07 Tuesday 1
2020-01-08 Wednesday 2
2020-01-09 Thursday 3
2020-01-12 Sunday 6
2020-01-13 Monday 0
2020-01-14 Tuesday 1
2020-01-15 Wednesday 2
2020-01-16 Thursday 3
2020-01-19 Sunday 6
2020-01-20 Monday 0
2020-01-21 Tuesday 1
2020-01-22 Wednesday 2
2020-01-23 Thursday 3
2020-01-26 Sunday 6
2020-01-27 Monday 0
2020-01-28 Tuesday 1
2020-01-29 Wednesday 2
2020-01-30 Thursday 3
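As a complement to the resample call, the same offset can also be used to build the Sunday to Thursday calendar directly and select rows from a daily series with it. This is only a sketch on a made-up series (`daily` and its values are hypothetical); it selects days rather than aggregating them:

```python
import pandas as pd
from pandas.tseries.offsets import CustomBusinessDay

# Working week runs Sunday through Thursday.
sun_thu = CustomBusinessDay(weekmask="Sun Mon Tue Wed Thu")

# Hypothetical daily data: 0 for Jan 1, 1 for Jan 2, and so on.
daily = pd.Series(
    range(11),
    index=pd.date_range("2020-01-01", "2020-01-11", freq="D"),
)

# date_range accepts the custom offset and skips Fridays and Saturdays.
calendar = pd.date_range("2020-01-01", "2020-01-11", freq=sun_thu)

# Keep only the rows that fall on the custom business days.
kept = daily.reindex(calendar)

print(kept)
```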

Average of consecutive days in dataframe

I have a pandas dataframe df as:
Date Val WD
1/3/2019 2.65 Thursday
1/4/2019 2.51 Friday
1/5/2019 2.95 Saturday
1/6/2019 3.39 Sunday
1/7/2019 3.39 Monday
1/12/2019 2.23 Saturday
1/13/2019 2.50 Sunday
1/14/2019 3.62 Monday
1/15/2019 3.81 Tuesday
1/16/2019 3.75 Wednesday
1/17/2019 3.69 Thursday
1/18/2019 3.47 Friday
I need to get the following df2 from above:
Date Val WD
1/3/2019 2.65 Thursday
1/4/2019 2.51 Friday
1/5/2019 3.24 Saturday
1/6/2019 3.24 Sunday
1/7/2019 3.24 Monday
1/12/2019 2.78 Saturday
1/13/2019 2.78 Sunday
1/14/2019 2.78 Monday
1/15/2019 3.81 Tuesday
1/16/2019 3.75 Wednesday
1/17/2019 3.69 Thursday
1/18/2019 3.47 Friday
Where the df2 values are updated to have average of consecutive Sat, Sun and Mon values.
i.e. average of 2.95, 3.39, 3.39 for dates 1/5/2019, 1/6/2019, 1/7/2019 in df is 3.24 and hence in df2 I have replaced the 1/5/2019, 1/6/2019, 1/7/2019 values with 3.24.
The trick has been finding the consecutive Saturday, Sunday and Monday. Not sure how to approach this.
You can use CustomBusinessDay with pd.Grouper to create a group col:
# if you want to only find the mean if all three days are found
from pandas.tseries.offsets import CustomBusinessDay
days = CustomBusinessDay(weekmask='Tue Wed Thu Fri Sat')
df['group_col'] = df.groupby(pd.Grouper(key='Date', freq=days)).ngroup()
df.update(df[df.groupby('group_col')['Val'].transform('size').eq(3)].groupby('group_col').transform('mean'))
Date Val WD group_col
0 2019-01-03 2.650000 Thursday 0
1 2019-01-04 2.510000 Friday 1
2 2019-01-05 3.243333 Saturday 2
3 2019-01-06 3.243333 Sunday 2
4 2019-01-07 3.243333 Monday 2
5 2019-01-12 2.783333 Saturday 7
6 2019-01-13 2.783333 Sunday 7
7 2019-01-14 2.783333 Monday 7
8 2019-01-15 3.810000 Tuesday 8
9 2019-01-16 3.750000 Wednesday 9
10 2019-01-17 3.690000 Thursday 10
11 2019-01-18 3.470000 Friday 11
or if you want to find the mean of any combination of sat sun mon in the same week
days = CustomBusinessDay(weekmask='Tue Wed Thu Fri Sat')
df['group_col'] = df.groupby(pd.Grouper(key='Date', freq=days)).ngroup()
df['Val'] = df.groupby('group_col')['Val'].transform('mean')
This logic creates a Series that assigns a unique ID to groups of consecutive Sat/Sun/Mon rows in your DataFrame. Then ensure there are 3 of them (not just Sat/Sun or Sun/Mon), and transform those values with the mean:
import pandas as pd
#df['Date'] = pd.to_datetime(df.Date)
s = (~(df.Date.dt.dayofweek.isin([0,6])
& (df.Date - df.Date.shift(1)).dt.days.eq(1))).cumsum()
to_trans = s[s.groupby(s).transform('size').eq(3)]
df.loc[to_trans.index, 'Val'] = df.loc[to_trans.index].groupby(to_trans).Val.transform('mean')
Output:
Date Val WD
0 2019-01-03 2.650000 Thursday
1 2019-01-04 2.510000 Friday
2 2019-01-05 3.243333 Saturday
3 2019-01-06 3.243333 Sunday
4 2019-01-07 3.243333 Monday
5 2019-01-12 2.783333 Saturday
6 2019-01-13 2.783333 Sunday
7 2019-01-14 2.783333 Monday
8 2019-01-15 3.810000 Tuesday
9 2019-01-16 3.750000 Wednesday
10 2019-01-17 3.690000 Thursday
11 2019-01-18 3.470000 Friday
12 2019-01-19 3.250000 Saturday
13 2019-01-20 3.250000 Sunday
14 2019-01-21 3.250000 Monday
15 2019-01-22 5.000000 Tuesday
16 2019-01-27 2.000000 Sunday
17 2019-01-28 4.000000 Monday
18 2019-01-29 6.000000 Tuesday
19 2019-02-05 7.000000 Tuesday
20 2019-02-07 6.000000 Thursday
21 2019-02-12 9.000000 Tuesday
Extended Input Data
Date Val WD
1/3/2019 2.65 Thursday
1/4/2019 2.51 Friday
1/5/2019 2.95 Saturday
1/6/2019 3.39 Sunday
1/7/2019 3.39 Monday
1/12/2019 2.23 Saturday
1/13/2019 2.50 Sunday
1/14/2019 3.62 Monday
1/15/2019 3.81 Tuesday
1/16/2019 3.75 Wednesday
1/17/2019 3.69 Thursday
1/18/2019 3.47 Friday
1/19/2019 3.75 Saturday
1/20/2019 2.00 Sunday
1/21/2019 4.00 Monday
1/22/2019 5.00 Tuesday
1/27/2019 2.00 Sunday
1/28/2019 4.00 Monday
1/29/2019 6.00 Tuesday
2/5/2019 7.00 Tuesday
2/7/2019 6.00 Thursday
2/12/2019 9.00 Tuesday
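The run-labelling step in the cumsum approach is easier to follow on a smaller frame. Here is a minimal sketch with made-up values (the seven rows and the names `continues_run`, `run_id`, `run_size` and `run_mean` are hypothetical): a row continues the previous run only if it is a Sunday or Monday exactly one day after the previous row, so every other row starts a new run.

```python
import pandas as pd

# Hypothetical data: one full Sat/Sun/Mon run and one Sun/Mon pair.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2019-01-04", "2019-01-05", "2019-01-06", "2019-01-07",
        "2019-01-08", "2019-01-13", "2019-01-14",
    ]),
    "Val": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0],
})

# True where a Sunday (6) or Monday (0) directly follows the previous row.
continues_run = (
    df["Date"].dt.dayofweek.isin([0, 6])
    & df["Date"].diff().dt.days.eq(1)
)

# Every row that does not continue a run opens a new one.
run_id = (~continues_run).cumsum().rename("run_id")

# Only runs of exactly three rows can be Sat + Sun + Mon.
run_size = run_id.groupby(run_id).transform("size")
run_mean = df.groupby(run_id)["Val"].transform("mean")

# Keep the original value unless the row sits in a three-row run.
df["Val"] = df["Val"].where(run_size.ne(3), run_mean)

print(df.assign(run_id=run_id))
```

The Sunday/Monday pair at the end keeps its own values because its run has only two rows.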
One approach is to calculate a week number, then use groupby to calculate means across specific days and map this back to your original dataframe.
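import numpy as np
import pandas as pd
df['Date'] = pd.to_datetime(df['Date'])
# consider Monday to belong to previous week
# (dt.isocalendar().week replaces dt.week, which was removed in pandas 2.0)
week, weekday = df['Date'].dt.isocalendar().week.astype(int), df['Date'].dt.weekday
df['Week'] = np.where(weekday.eq(0), week - 1, week)
# take means of Sat, Sun, Mon, then map back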
mask = weekday.isin([5, 6, 0])
week_val_map = df[mask].groupby('Week')['Val'].mean()
df.loc[mask, 'Val'] = df['Week'].map(week_val_map)
print(df)
Date Val WD Week
0 2019-01-03 2.650000 Thursday 1
1 2019-01-04 2.510000 Friday 1
2 2019-01-05 3.243333 Saturday 1
3 2019-01-06 3.243333 Sunday 1
4 2019-01-07 3.243333 Monday 1
5 2019-01-12 2.783333 Saturday 2
6 2019-01-13 2.783333 Sunday 2
7 2019-01-14 2.783333 Monday 2
8 2019-01-15 3.810000 Tuesday 3
9 2019-01-16 3.750000 Wednesday 3
10 2019-01-17 3.690000 Thursday 3
11 2019-01-18 3.470000 Friday 3

Pandas: Check whether the particular day is in index at regular interval and if not mark the one before entry as something?

I am still very new to pandas and just figured out I have made a mistake in the process I was following earlier.
df_date
Date day
0 2016-05-26 Thursday
1 2016-05-27 Friday
2 2016-05-30 Monday
3 2016-05-31 Tuesday
4 2016-06-01 Wednesday
5 2016-06-02 Thursday
6 2016-06-03 Friday
7 2016-06-06 Monday
8 2016-06-07 Tuesday
9 2016-06-08 Wednesday
10 2016-06-09 Thursday
11 2016-06-10 Friday
12 2016-06-13 Monday
13 2016-06-14 Tuesday
14 2016-06-15 Wednesday
15 2016-06-16 Thursday
16 2016-06-17 Friday
17 2016-06-20 Monday
18 2016-06-21 Tuesday
19 2016-06-22 Wednesday
20 2016-06-24 Friday
21 2016-06-27 Monday
22 2016-06-28 Tuesday
23 2016-06-29 Wednesday
There are about 600+ rows.
What I want to do
Make a column 'Exit' where, if Thursday is not in the week, Wednesday becomes E, and if Wednesday is not there either, then Tuesday.
I tried a for loop and I just can't seem to get this right.
Expected Output:
df_date
Date day Exit
0 2016-05-26 Thursday E
1 2016-05-27 Friday
2 2016-05-30 Monday
3 2016-05-31 Tuesday
4 2016-06-01 Wednesday
5 2016-06-02 Thursday E
6 2016-06-03 Friday
7 2016-06-06 Monday
8 2016-06-07 Tuesday
9 2016-06-08 Wednesday
10 2016-06-09 Thursday E
11 2016-06-10 Friday
12 2016-06-13 Monday
13 2016-06-14 Tuesday
14 2016-06-15 Wednesday
15 2016-06-16 Thursday E
16 2016-06-17 Friday
17 2016-06-20 Monday
18 2016-06-21 Tuesday
19 2016-06-22 Wednesday E
20 2016-06-24 Friday
21 2016-06-27 Monday
22 2016-06-28 Tuesday
23 2016-06-29 Wednesday E
I added this in comments but should be here as well:
If Thursday is not present then the record just before it.
So if Wednesday is also not present in the week, then Tuesday
If Tuesday is also not present then Monday, and if Monday is not present then Friday. Saturday and Sunday will never have a record.
Here's a solution:
ix = (df.groupby(pd.Grouper(key='Date', freq='W')).Date
        .apply(lambda x: (x.dt.dayofweek <= 3)[::-1].idxmax()).values)
df.loc[ix,'Exit'] = 'E'
df.fillna('')
Date day Exit
0 2016-05-26 Thursday E
1 2016-05-27 Friday
2 2016-05-30 Monday
3 2016-05-31 Tuesday
4 2016-06-01 Wednesday
5 2016-06-02 Thursday E
6 2016-06-03 Friday
7 2016-06-06 Monday
8 2016-06-07 Tuesday
9 2016-06-08 Wednesday
10 2016-06-09 Thursday E
11 2016-06-10 Friday
12 2016-06-13 Monday
13 2016-06-14 Tuesday
14 2016-06-15 Wednesday
15 2016-06-16 Thursday E
16 2016-06-17 Friday
17 2016-06-20 Monday
18 2016-06-21 Tuesday
19 2016-06-22 Wednesday
20 2016-06-23 Thursday E
21 2016-06-24 Friday
22 2016-06-27 Monday
23 2016-06-28 Tuesday
24 2016-06-29 Wednesday E
You can use the ISO week number and the dt.weekday property of your datetime series. Then use groupby + max for your required logic. This is likely to be more efficient than sequential equality checks.
import numpy as np
import pandas as pd
df['Date'] = pd.to_datetime(df['Date'])
# add week and weekday series
# (dt.isocalendar().week replaces dt.week, which was removed in pandas 2.0)
df['Week'] = df['Date'].dt.isocalendar().week
df['Weekday'] = df['Date'].dt.weekday.where(df['Date'].dt.weekday.isin([1, 2, 3]))
df['Exit'] = np.where(df['Weekday'] == df.groupby('Week')['Weekday'].transform('max'),
'E', '')
Result
I have left the helper columns so the way the solution works is clear. These can easily be removed.
print(df)
Date day Week Weekday Exit
0 2016-05-26 Thursday 21 3.0 E
1 2016-05-27 Friday 21 NaN
2 2016-05-30 Monday 22 NaN
3 2016-05-31 Tuesday 22 1.0
4 2016-06-01 Wednesday 22 2.0
5 2016-06-02 Thursday 22 3.0 E
6 2016-06-03 Friday 22 NaN
7 2016-06-06 Monday 23 NaN
8 2016-06-07 Tuesday 23 1.0
9 2016-06-08 Wednesday 23 2.0
10 2016-06-09 Thursday 23 3.0 E
11 2016-06-10 Friday 23 NaN
12 2016-06-13 Monday 24 NaN
13 2016-06-14 Tuesday 24 1.0
14 2016-06-15 Wednesday 24 2.0
15 2016-06-16 Thursday 24 3.0 E
16 2016-06-17 Friday 24 NaN
17 2016-06-20 Monday 25 NaN
18 2016-06-21 Tuesday 25 1.0
19 2016-06-22 Wednesday 25 2.0 E
20 2016-06-24 Friday 25 NaN
21 2016-06-27 Monday 26 NaN
22 2016-06-28 Tuesday 26 1.0
23 2016-06-29 Wednesday 26 2.0 E
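The "latest row on or before Thursday" rule can also be sketched by filtering first and taking the last row of each ISO week. This is an illustration on a shortened, made-up slice of the dates (the names `candidates` and `last_idx` are hypothetical), and it does not cover the fallback to Friday when Monday through Thursday are all missing:

```python
import pandas as pd

# Shortened sample: the week of 2016-06-20 has no Thursday row.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2016-06-16", "2016-06-17", "2016-06-20", "2016-06-21",
        "2016-06-22", "2016-06-24", "2016-06-27", "2016-06-28",
        "2016-06-29",
    ]),
})
df["day"] = df["Date"].dt.day_name()

# Rows from Monday (0) through Thursday (3), in date order.
candidates = df[df["Date"].dt.dayofweek <= 3].sort_values("Date")

# Group by ISO year and week so weeks from different years never mix.
iso = candidates["Date"].dt.isocalendar()
last_idx = candidates.groupby([iso["year"], iso["week"]]).tail(1).index

# Mark the last qualifying row of each week.
df["Exit"] = ""
df.loc[last_idx, "Exit"] = "E"

print(df)
```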

Creating week variable in pandas with custom start day for week

I have the following pandas dataframe indexed to a Time_Stamp:
df = pd.DataFrame(index = pd.date_range('4/1/2017', freq='3D', periods=10))
df['weekday'] = df.index.day_name()
Data looks like this:
weekday
2017-04-01 Saturday
2017-04-04 Tuesday
2017-04-07 Friday
2017-04-10 Monday
2017-04-13 Thursday
2017-04-16 Sunday
2017-04-19 Wednesday
2017-04-22 Saturday
2017-04-25 Tuesday
2017-04-28 Friday
I want to create a new column 'week' that gives the week ordinal of the year, but with a custom weekday as the start of the week.
I know I can just do this:
df['week_sun'] = df.index.isocalendar().week
Except I want the first day of the week to be something besides Sunday. For this question, lets say I need it to be Wednesday so that the resulting dataframe would be like so:
weekday week_sun week_wed
2017-04-01 Saturday 13 13
2017-04-04 Tuesday 14 13
2017-04-07 Friday 14 14
2017-04-10 Monday 15 14
2017-04-13 Thursday 15 15
2017-04-16 Sunday 15 15
2017-04-19 Wednesday 16 16
2017-04-22 Saturday 16 16
2017-04-25 Tuesday 17 16
2017-04-28 Friday 17 17
I'm at a loss to how to achieve this. Thanks!
Given your requirements, you only need to subtract 1 from the week number when the day of the week is "before" the reference day (Wednesday in your example).
In [162]: df
Out[162]:
weekday week_sun
2017-04-01 Saturday 13
2017-04-04 Tuesday 14
2017-04-07 Friday 14
2017-04-10 Monday 15
2017-04-13 Thursday 15
2017-04-16 Sunday 15
2017-04-19 Wednesday 16
2017-04-22 Saturday 16
2017-04-25 Tuesday 17
2017-04-28 Friday 17
In [163]: df['week_wed'] = df['week_sun']
Let's now shift the value where needed, meaning when the weekday is before Wednesday, hence df.index.dayofweek < 2.
In [164]: df.loc[df.index.dayofweek < 2, 'week_wed'] = (df[df.index.dayofweek < 2]['week_sun'] - 2) % 52 + 1
In [165]: df
Out[165]:
weekday week_sun week_wed
2017-04-01 Saturday 13 13
2017-04-04 Tuesday 14 13
2017-04-07 Friday 14 14
2017-04-10 Monday 15 14
2017-04-13 Thursday 15 15
2017-04-16 Sunday 15 15
2017-04-19 Wednesday 16 16
2017-04-22 Saturday 16 16
2017-04-25 Tuesday 17 16
2017-04-28 Friday 17 17
I didn't exactly subtract 1, but instead used a modulo operation ((X - 2) % 52 + 1) so I can convert week 1 to week 52 from the previous year if need be:
weekday week_sun week_wed
2017-12-27 Wednesday 52 52
2017-12-30 Saturday 52 52
2018-01-02 Tuesday 1 52
2018-01-05 Friday 1 1
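A different way to get the same column, offered only as a sketch and not as the answer's method, is to shift each date back by the number of days between Monday and the desired start day and then take the ISO week of the shifted date. ISO weeks start on Monday, so moving Wednesday (dayofweek 2) back two days makes it behave like a Monday. The name `shifted` is hypothetical:

```python
import pandas as pd

df = pd.DataFrame(index=pd.date_range("2017-04-01", freq="3D", periods=10))
df["weekday"] = df.index.day_name()

# Wednesday is dayofweek 2, so shift every date back by 2 days.
shifted = df.index - pd.Timedelta(days=2)

# ISO week of the shifted date is the Wednesday-based week number.
# to_numpy() avoids index alignment against the shifted dates.
df["week_wed"] = shifted.isocalendar().week.astype(int).to_numpy()

print(df)
```

Because the week number comes from the ISO calendar of the shifted date, dates early in January fall into the last week of the previous year without a separate modulo step.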
