I have a dataframe containing hourly data, and I want to get the max for each week of the year, so I used resample to group the data by week:
weeks = data.resample("W").max()
The problem is that the weekly max is calculated for weeks starting the first Monday of the year, while I want it calculated for weeks starting the first day of the year.
I obtain the following result, where you can notice that there are 53 weeks and the last week is labelled in the next year, even though 2017 doesn't exist in the data:
Date dots
2016-01-03 0.647786
2016-01-10 0.917071
2016-01-17 0.667857
2016-01-24 0.669286
2016-01-31 0.645357
and the tail:
Date dots
2016-12-04 0.646786
2016-12-11 0.857714
2016-12-18 0.670000
2016-12-25 0.674571
2017-01-01 0.654571
Is there a way to calculate weeks for a pandas dataframe starting from the first day of the year?
Find the starting day of the year; in your data, 2016 starts on a Friday. You can then pass an anchoring suffix to resample so that weeks are calculated starting from the first day of the year. Note that the suffix names the day a weekly bin ends on, so for weeks starting on Friday you anchor on Thursday:
weeks = data.resample("W-THU").max()
One quick remedy: given that your data is within one year, you can group it by day first, then take groups of 7 days:
new_df = (df.resample("D", on='Date')['dots']
          .max()
          .reset_index())
weekly = new_df.groupby(new_df.index // 7).agg({'Date': 'min', 'dots': 'max'})
weekly.head(7)
Output:
Date dots
0 2016-01-01 0.996387
1 2016-01-08 0.999775
2 2016-01-15 0.997612
3 2016-01-22 0.979376
4 2016-01-29 0.998240
5 2016-02-05 0.995030
6 2016-02-12 0.987500
and tail:
Date dots
48 2016-12-02 0.999910
49 2016-12-09 0.992910
50 2016-12-16 0.996877
51 2016-12-23 0.992986
52 2016-12-30 0.960348
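Alternatively, a sketch assuming pandas >= 1.1 (where resample accepts an origin argument): fixed 7-day bins anchored at the start of the data avoid the weekday anchoring entirely:
weeks = data.resample('7D', origin='start_day').max()
# '7D' bins are labelled by their left edge, so each label is the first day
# of its 7-day window, starting at midnight of the first day in the data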
Related
My dataframe (df) contains 12 months of data and consists of 5M rows. One of the columns is day_of_week, whose values are Monday to Sunday. This df also has a unique key, which is the ride_id column. I want to calculate the average number of rides per day_of_week. I have calculated the number of rides per day_of_week using
copydf.groupby(['day_of_week']).agg(number_of_rides=('day_of_week', 'count'))
However, I find it hard to calculate the mean/average for each day of week. I have tried:
copydf.groupby(['day_of_week']).agg(number_of_rides=('ride_id', 'count')).mean()
and
avg_days = copydf.groupby(['day_of_week']).agg(number_of_rides=('ride_id', 'count'))
avg_days.groupby(['day_of_week']).agg('number_of_rides', 'mean')
They didn't work. I want the output to be in three columns (day_of_week, number_of_rides, and avg_num_of_rides), or in two columns (day_of_week or weekday_num, and avg_num_of_rides).
This is my df. Kindly note that the code block has wrapped some lines due to the long column names.
ride_id rideable_type started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng end_lat end_lng member_or_casual ride_length year month day_of_week hour weekday_num
0 9DC7B962304CBFD8 electric_bike 2021-09-28 16:07:10 2021-09-28 16:09:54 Streeter Dr & Grand Ave 13022 Streeter Dr & Grand Ave 13022 41.89 -87.68 41.89 -87.67 casual 2 2021 September Tuesday 16 1
1 F930E2C6872D6B32 electric_bike 2021-09-28 14:24:51 2021-09-28 14:40:05 Streeter Dr & Grand Ave 13022 Streeter Dr & Grand Ave 13022 41.94 -87.64 41.98 -87.67 casual 15 2021 September Tuesday 14 1
2 6EF72137900BB910 electric_bike 2021-09-28 00:20:16 2021-09-28 00:23:57 Streeter Dr & Grand Ave 13022 Streeter Dr & Grand Ave 13022 41.81 -87.72 41.80 -87.72 casual 3 2021 September Tuesday 0 1
This is the output I desire
number_of_rides average_number_of_rides
day_of_week
Saturday 964079 50.4
Sunday 841919 70.9
Wednesday 840272 90.2
Thursday 836973 77.2
Friday 818205 34.4
Tuesday 814496 34.4
Monday 767002 200.3
Again, I have already calculated the number of rides per day_of_week; what I want is just to add the third column, or better still, to have average rides per weekday (Monday or 0, Tuesday or 1, Wednesday or 2) in its own output df.
Thanks
To get the average number of rides per weekday, you need the total rides on that weekday and the number of weeks.
You can compute the week number from date:
df["week_number"] = df["started_at"].dt.isocalendar().week
>> ride_id started_at day_of_week week_number
>> 0 1 2021-09-20 Monday 38
>> 1 2 2021-09-21 Tuesday 38
>> 2 3 2021-09-20 Monday 38
>> 3 4 2021-09-21 Tuesday 38
>> 4 5 2021-09-27 Monday 39
>> 5 6 2021-09-28 Tuesday 39
Then group by day_of_week and week_number to compute an aggregate dataframe:
week_number_group_df = df.groupby(["day_of_week", "week_number"]).agg(number_of_rides_on_day=("ride_id", "count"))
>> number_of_rides_on_day
>> day_of_week week_number
>> Monday 38 2
>> 39 1
>> Tuesday 38 2
>> 39 1
Use the aggregated dataframe to get the final results:
week_number_group_df.groupby("day_of_week").agg(number_of_rides=("number_of_rides_on_day", "sum"), average_number_of_rides=("number_of_rides_on_day", "mean"))
>> number_of_rides average_number_of_rides
>> day_of_week
>> Monday 3 1.5000
>> Tuesday 3 1.5000
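Putting both steps together in one chain (a sketch; it assumes started_at is already a datetime64 column):
out = (
    df.assign(week_number=df["started_at"].dt.isocalendar().week)
      .groupby(["day_of_week", "week_number"])["ride_id"].count()
      .groupby("day_of_week")
      .agg(number_of_rides="sum", average_number_of_rides="mean")
)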
As far as I understand, you're not trying to compute the average over a field in your grouped data (as @Azhar Khan pointed out), but an averaged count of rides per weekday over your original 12-month period.
Basically, you need two elements:
First, the count of rides per weekday you observe in your dataframe. That's exactly what you get with copydf.groupby(['day_of_week']).agg(number_of_rides=('ride_id', 'count'))
Secondly, the count of each weekday in your period. Let's imagine you're considering the year 2022 as an example; you can get such data with the next code snippet:
df_year = pd.DataFrame(data=pd.date_range(start=pd.to_datetime('2022-01-01'),
                                          end=pd.to_datetime('2022-12-31'),
                                          freq='1D'),
                       columns=['date'])
df_year["day_of_week"] = df_year["date"].dt.weekday
nb_weekdays_in_year = df_year.groupby('day_of_week').agg(nb_days=('date', 'count'))
This gives a dataframe with 52 occurrences of each weekday, except 53 for Saturday (2022 has 365 days and both starts and ends on a Saturday).
Once you have both these dataframes, you can simply join them, e.g. with
nb_weekdays_in_year.join(nb_rides_per_day), and then take the ratio of the two columns to get your average.
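For instance (a sketch; nb_rides_per_day stands for the grouped ride count from the first step, and it assumes both frames are indexed by the same day_of_week key, e.g. weekday numbers on both sides):
nb_rides_per_day = copydf.groupby(['day_of_week']).agg(number_of_rides=('ride_id', 'count'))
avg_df = nb_weekdays_in_year.join(nb_rides_per_day)
avg_df['avg_num_of_rides'] = avg_df['number_of_rides'] / avg_df['nb_days']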
The difficulty here lies in the fact that you need the total number of weekdays of each type over your period, which you cannot reliably get from your observations directly (what if some days have no rides at all?). Also, note that you're not trying to compute an intra-group average, so you cannot use a simple agg function like 'mean' directly.
Using pivot we can solve this.
import pandas as pd
import numpy as np
df = pd.read_csv('/content/test.csv')
df.head()
# sample df
date rides
0 2019-10-01 1
1 2019-10-02 2
2 2019-10-03 5
3 2019-10-04 3
4 2019-10-05 2
df['date'] = pd.to_datetime(df['date'])
# extracting the week number
df['weekNo'] = df['date'].dt.isocalendar().week
date rides weekNo
0 2019-10-01 1 40
1 2019-10-02 2 40
2 2019-10-03 5 40
Method 1: Use Pivot table
df.pivot_table(values='rides',index='weekNo',aggfunc='mean')
output
rides
weekNo
40 2.833333
41 2.571429
42 4.000000
Method 2: Use groupby.mean()
df.groupby('weekNo')['rides'].mean()
I was trying to calculate the week number starting from the first Monday of October. Are there any functions in pandas or datetime to do this calculation efficiently?
MWE
import pandas as pd
from datetime import date,datetime
df = pd.DataFrame({'date': pd.date_range('2021-10-01','2022-11-01',freq='1D')})
df['day_name'] = df['date'].dt.day_name()
df2 = df.head().copy()
df2['Fiscal_Week'] = [52, 52, 52, 1, 1]  # 2021-10-04 is a Monday, so for Oct 4 the week is 1
# How to do it programmatically for any year?
df2
date day_name Fiscal_Week
0 2021-10-01 Friday 52
1 2021-10-02 Saturday 52
2 2021-10-03 Sunday 52
3 2021-10-04 Monday 1
4 2021-10-05 Tuesday 1
Shift the dates by the number of days to the new year, then use the standard Monday-based week number format (%W):
df['Fiscal_Week'] = (
    (df['date'] + pd.DateOffset(days=91))
    .dt.strftime('%W')
    .astype(int)
    .replace({0: None})
    .ffill()
)
The offset in days can be calculated manually (I assume the fiscal year start is fixed).
The replace part is needed because leftovers from the previous year are counted as week 0. The previous week might be 52 or 53, so replace week 0 with NA and then fill forward.
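A minimal sketch of that manual calculation (assuming the fiscal year always starts on 1 October):
import pandas as pd

# days from the fiscal year start to the end of the calendar year;
# shifting by this amount makes the first fiscal Monday land in %W week 1
fy_start = pd.Timestamp('2021-10-01')
offset_days = (pd.Timestamp(fy_start.year, 12, 31) - fy_start).days  # 91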
As the title implies:
IN[1] :
dates = pd.date_range('10/10/2018', periods=11, freq='D')
close_prices = np.arange(len(dates))
close = pd.Series(close_prices, dates)
close
OUT[1]:
2018-10-10 0
2018-10-11 1
2018-10-12 2
2018-10-13 3
2018-10-14 4
2018-10-15 5
2018-10-16 6
2018-10-17 7
2018-10-18 8
2018-10-19 9
2018-10-20 10
IN[2] : close.resample('W').first()
OUT[2] :
2018-10-14 0
2018-10-21 5
Freq: W-SUN, dtype: int64
First, what do resample and first do?
And why do we get the date 2018-10-21, which did not exist in the series? And on what basis do we get the 0 and 5?
Thanks
You have resampled your data by week. '2018-10-14' and '2018-10-21' are the last dates of each resampled week (each a Sunday): by resampling, you have aggregated your data into weekly bins labelled by the Sunday on which each bin ends. The 0 and 5 are the first values present in each respective bin: the first bin covers 10-08 through 10-14 and its earliest data point is 10-10 (value 0); the second covers 10-15 through 10-21 and its earliest data point is 10-15 (value 5).
resample('W') groups the dates into full-week bins.
first() returns the first value within each bin.
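To see what each weekly bin actually contains, here is a quick sketch using the groupby equivalent of resample:
close.groupby(pd.Grouper(freq='W')).agg(list)
# 2018-10-14       [0, 1, 2, 3, 4]     <- 2018-10-10 .. 2018-10-14
# 2018-10-21    [5, 6, 7, 8, 9, 10]    <- 2018-10-15 .. 2018-10-20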
I have a dataframe with a '%Y/%U' date column:
Count YW Date
0 2 2017/19 2017-05-13
1 2 2017/20 2017-05-19
2 24 2017/22 2017-06-03
3 35 2017/23 2017-06-10
4 41 2017/24 2017-06-17
.. ... ... ...
126 51 2020/05 2020-02-06
127 26 2020/06 2020-02-15
128 30 2020/07 2020-02-22
129 26 2020/08 2020-02-29
130 18 2020/09 2020-03-04
I'm trying to add the missing weeks, like 2017/21 with 0 Count values, so I created this index:
idx = pd.date_range(df['Date'].min(), df['Date'].max(), freq='W').floor('d')
Which yields:
DatetimeIndex(['2017-05-14', '2017-05-21', '2017-05-28', '2017-06-04',
'2017-06-11', '2017-06-18', '2017-06-25', '2017-07-02',
'2017-07-09', '2017-07-16',
...
'2019-12-29', '2020-01-05', '2020-01-12', '2020-01-19',
'2020-01-26', '2020-02-02', '2020-02-09', '2020-02-16',
'2020-02-23', '2020-03-01'],
dtype='datetime64[ns]', length=147, freq=None)
Almost there, converting to '%Y/%U' again:
idx = idx.strftime('%Y/%U')
But this yields:
Index(['2017/20', '2017/21', '2017/22', '2017/23', '2017/24', '2017/25',
'2017/26', '2017/27', '2017/28', '2017/29',
...
'2019/52', '2020/01', '2020/02', '2020/03', '2020/04', '2020/05',
'2020/06', '2020/07', '2020/08', '2020/09'],
dtype='object', length=147)
I'm not sure yet whether it is a problem with reindexing, but I've noticed that the first year/week pair is now 2017/20 instead of 2017/19. This is because the freq='W' offset converts every date to the corresponding week's starting day, the default being the 'W-SUN' anchored offset. Indeed, 2017-05-14 is a Sunday.
The problem is that the converted date now returns the next week number: 2017-05-13 was converted to 2017-05-14, and although the %U strftime code also starts weeks on Sunday, that Sunday is counted as the start of the next week. Using 'W-SAT' (as 2017-05-13 was a Saturday) fixes the start, but then the end will be wrong in this case.
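The off-by-one can be checked directly (a quick sketch):
import pandas as pd
pd.Timestamp('2017-05-13').strftime('%Y/%U')  # '2017/19' (a Saturday)
pd.Timestamp('2017-05-14').strftime('%Y/%U')  # '2017/20' (the snapped Sunday)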
Is there any dynamic solution so date_range would start and end with the proper weeks?
I have a DataFrame df with sporadic daily business day rows (i.e., there is not always a row for every business day.)
For each row in df I want to create a historical resampled mean dfm going back one month at a time. For example, if I have a row for 2018-02-22 then I want rolling means for rows in the following date ranges:
2018-01-23 : 2018-02-22
2017-12-23 : 2018-01-22
2017-11-23 : 2017-12-22
etc.
But I can't see a way to keep this pegged to the particular day of the month using conventional offsets. For example, if I do:
dfm = df.resample('30D').mean()
Then we see two problems:
It references the beginning of the DataFrame. In fact, I can't find a way to force .resample() to peg itself to the end of the DataFrame – even if I have it operate on df_reversed = df.loc[:'2018-02-22'].iloc[::-1]. Is there a way to "peg" the resampling to something other than the earliest date in the DataFrame? (And ideally pegged to each particular row as I run some lambda on the associated historical resampling from each row's date?)
It will drift over time, because not every month is 30 days long. So as I go back in time I will find that the interval 12 "months" prior ends 2017-02-27, not 2017-02-22 like I want.
Knowing that I want to resample by non-overlapping "months," the second problem can be well-defined for month days 29-31: For example, if I ask to resample for '2018-03-31' then the date ranges would end at the end of each preceding month:
2018-03-01 : 2018-03-31
2018-02-01 : 2018-02-28
2018-01-01 : 2018-01-31
etc.
Though again, I don't know: is there a good or easy way to do this in pandas?
tl;dr:
Given something like the following:
someperiods = 20 # this can be a number of days covering many years
somefrequency = '8D' # this can vary from 1D to maybe 10D
rng = pd.date_range('2017-01-03', periods=someperiods, freq=somefrequency)
df = pd.DataFrame({'x': rng.day}, index=rng) # x in practice is exogenous data
df['MonthPrior'] = df.index - pd.DateOffset(months=1)
Now:
For each row in df: calculate df['PreviousMonthMean'] = rolling average of all df.x in range [df.MonthPrior, df.index). In this example the resulting DataFrame would be:
Index x MonthPrior PreviousMonthMean
2017-01-03 3 2016-12-03 NaN
2017-01-11 11 2016-12-11 3
2017-01-19 19 2016-12-19 7
2017-01-27 27 2016-12-27 11
2017-02-04 4 2017-01-04 19
2017-02-12 12 2017-01-12 16.66666667
2017-02-20 20 2017-01-20 14.33333333
2017-02-28 28 2017-01-28 12
2017-03-08 8 2017-02-08 20
2017-03-16 16 2017-02-16 18.66666667
2017-03-24 24 2017-02-24 17.33333333
2017-04-01 1 2017-03-01 16
2017-04-09 9 2017-03-09 13.66666667
2017-04-17 17 2017-03-17 11.33333333
2017-04-25 25 2017-03-25 9
2017-05-03 3 2017-04-03 17
2017-05-11 11 2017-04-11 15
2017-05-19 19 2017-04-19 13
2017-05-27 27 2017-04-27 11
2017-06-04 4 2017-05-04 19
If we can get that far, then I need an efficient way to iterate this, so that for each row in df I can aggregate consecutive but non-overlapping df['PreviousMonthMean'] values, going back one calendar month at a time from the given DatetimeIndex.
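A minimal sketch for the first part (no claim of efficiency; it assumes df and MonthPrior as built above): for each row, average x over the half-open window [MonthPrior, date):
df['PreviousMonthMean'] = [
    df.loc[(df.index >= lo) & (df.index < hi), 'x'].mean()
    for lo, hi in zip(df['MonthPrior'], df.index)
]
# empty windows produce NaN, matching the first row of the table above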