Hi, I have a dataframe of IT ticket arrival times that looks like this:
      Feedback time                time  hour    Shift
0     12/2/20 23:58 2020-12-02 23:58:00    23  Shift 3
1     12/2/20 10:20 2020-12-02 10:20:00    10  Shift 1
2     12/2/20 11:40 2020-12-02 11:40:00    11  Shift 1
...
458  11/18/20 16:01 2020-11-18 16:01:00    16  Shift 2
The earliest date here is 11/18/2020 and the latest is 12/2/2020.
And "Shift" is defined as
conditions = [(df['hour'] >= 7) & (df['hour'] < 15),
              (df['hour'] >= 15) & (df['hour'] < 23),
              (df['hour'] < 7) | (df['hour'] >= 23)]
# applied with np.select (labels inferred from the data above):
df['Shift'] = np.select(conditions, ['Shift 1', 'Shift 2', 'Shift 3'])
Expected output:
             11/18-11/25  11/26-12/2
Shift   Hour
Shift 1    7          xx
           8
           9
         ...
          14
Shift 2   15
          16
         ...
          22
Shift 3   23
           0
           1
         ...
           6
Basically, the columns should be grouped by week, and the index should be Shift and then hour of the day.
For xx, I want to count how many tickets came in during the 7am hour, sum that over the whole week, and then take the average for that week (11/18-11/25 in this case). I also need all the empty cells populated.
I have tried this:
s = pd.pivot_table(df, index=['Shift', 'hour'], columns=['time'], values=['Feedback time'], aggfunc='count')
but the columns show every individual day, not aggregated by week.
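For reference, a rough sketch of the weekly bucketing I have in mind (untested; it assumes the two date windows from the expected output, and divides the counts by the number of days in each window):

import pandas as pd

# bucket each ticket into one of the two date ranges (boundaries assumed
# from the expected output: 11/18-11/25 and 11/26-12/2)
bins = [pd.Timestamp('2020-11-18'), pd.Timestamp('2020-11-26'), pd.Timestamp('2020-12-03')]
df['week'] = pd.cut(df['time'], bins=bins, right=False,
                    labels=['11/18-11/25', '11/26-12/2'])

# count tickets per (Shift, hour, week), fill empty cells with 0,
# then divide by the days in each window for a daily average
counts = df.pivot_table(index=['Shift', 'hour'], columns='week',
                        values='time', aggfunc='count', fill_value=0)
avg = counts / [8, 7]  # 8 days in the first window, 7 in the second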
Thank you so much!
Related
My dataframe (df) holds 12 months of data and consists of 5M rows. One of the columns is day_of_week, with values Monday to Sunday. This df also has a unique key, the ride_id column. I want to calculate the average number of rides per day_of_week. I have calculated the number of rides per day_of_week using
copydf.groupby(['day_of_week']).agg(number_of_rides=('day_of_week', 'count'))
However, I find it hard to calculate the mean/average for each day of week. I have tried:
copydf.groupby(['day_of_week']).agg(number_of_rides=('ride_id', 'count')).mean()
and
avg_days = copydf.groupby(['day_of_week']).agg(number_of_rides=('ride_id', 'count'))
avg_days.groupby(['day_of_week']).agg('number_of_rides', 'mean')
They didn't work. I want the output to have three columns (day_of_week, number_of_rides, and avg_num_of_rides), or two columns (day_of_week or weekday_num, and avg_num_of_rides).
This is my df. Kindly note that the code block has mangled some column alignment due to the long column names.
ride_id rideable_type started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng end_lat end_lng member_or_casual ride_length year month day_of_week hour weekday_num
0 9DC7B962304CBFD8 electric_bike 2021-09-28 16:07:10 2021-09-28 16:09:54 Streeter Dr & Grand Ave 13022 Streeter Dr & Grand Ave 13022 41.89 -87.68 41.89 -87.67 casual 2 2021 September Tuesday 16 1
1 F930E2C6872D6B32 electric_bike 2021-09-28 14:24:51 2021-09-28 14:40:05 Streeter Dr & Grand Ave 13022 Streeter Dr & Grand Ave 13022 41.94 -87.64 41.98 -87.67 casual 15 2021 September Tuesday 14 1
2 6EF72137900BB910 electric_bike 2021-09-28 00:20:16 2021-09-28 00:23:57 Streeter Dr & Grand Ave 13022 Streeter Dr & Grand Ave 13022 41.81 -87.72 41.80 -87.72 casual 3 2021 September Tuesday 0 1
This is the output I desire
number_of_rides average_number_of_rides
day_of_week
Saturday 964079 50.4
Sunday 841919 70.9
Wednesday 840272 90.2
Thursday 836973 77.2
Friday 818205 34.4
Tuesday 814496 34.4
Monday 767002 200.3
Again, I have already calculated the number of rides per day_of_week; what I want is just to add the third column, or better still, to have average rides per weekday (Monday or 0, Tuesday or 1, Wednesday or 2) in its own output df.
Thanks
To get the average number of rides per weekday, you need the total rides on that weekday and the number of weeks.
You can compute the week number from the date:
df["week_number"] = df["started_at"].dt.isocalendar().week
>> ride_id started_at day_of_week week_number
>> 0 1 2021-09-20 Monday 38
>> 1 2 2021-09-21 Tuesday 38
>> 2 3 2021-09-20 Monday 38
>> 3 4 2021-09-21 Tuesday 38
>> 4 5 2021-09-27 Monday 39
>> 5 6 2021-09-28 Tuesday 39
Then group by day_of_week and week_number to compute an aggregate dataframe:
week_number_group_df = df.groupby(["day_of_week", "week_number"]).agg(number_of_rides_on_day=("ride_id", "count"))
>> number_of_rides_on_day
>> day_of_week week_number
>> Monday 38 2
>> 39 1
>> Tuesday 38 2
>> 39 1
Use the aggregated dataframe to get the final results:
week_number_group_df.groupby("day_of_week").agg(number_of_rides=("number_of_rides_on_day", "sum"), average_number_of_rides=("number_of_rides_on_day", "mean"))
>> number_of_rides average_number_of_rides
>> day_of_week
>> Monday 3 1.5000
>> Tuesday 3 1.5000
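To match the row ordering in the desired output (largest counts first), you could chain .sort_values("number_of_rides", ascending=False) onto the aggregation above.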
As far as I understand, you're not trying to compute the average over a field in your grouped data (as @Azhar Khan pointed out), but an averaged count of rides per weekday over your original 12-month period.
Basically, you need two elements:
First, the count of rides per weekday observed in your dataframe. That's exactly what you get with copydf.groupby(['day_of_week']).agg(number_of_rides=('ride_id', 'count')).
Let's say you get counts like the ones shown in your question.
Secondly, the count of each weekday in your period. Imagine you're considering the year 2022 as an example; you can get such data with the next code snippet:
df_year = pd.DataFrame(data=pd.date_range(start='2022-01-01',
                                          end='2022-12-31',
                                          freq='D'),
                       columns=['date'])
# use day names (not dt.weekday numbers) so the index matches the
# question's day_of_week column for the join below
df_year["day_of_week"] = df_year["date"].dt.day_name()
nb_weekdays_in_year = df_year.groupby('day_of_week').agg(nb_days=('date', 'count'))
This gives a dataframe with the number of times each weekday occurs in the year.
Once you have both these dataframes, you can simply join them, with
nb_weekdays_in_year.join(nb_rides_per_day) for instance, and then take the ratio of the two columns to get your average.
The difficulty here lies in the fact that you need the total number of weekdays of each type over your period, which I guess you cannot get directly from your observations (what if there are missing values?). Plus, let's underline that you're not trying to get an intra-group average, so you cannot use a simple agg function like 'mean' directly.
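A minimal sketch of that join-and-ratio step, assuming the names used above (nb_rides_per_day is just the question's grouped count, renamed for illustration):

# rides per weekday, as computed in the question
nb_rides_per_day = copydf.groupby('day_of_week').agg(number_of_rides=('ride_id', 'count'))

# join on the shared day_of_week index and take the ratio
result = nb_weekdays_in_year.join(nb_rides_per_day)
result['avg_num_of_rides'] = result['number_of_rides'] / result['nb_days']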
We can solve this using a pivot table.
import pandas as pd
import numpy as np
df = pd.read_csv('/content/test.csv')
df.head()
# sample df
date rides
0 2019-10-01 1
1 2019-10-02 2
2 2019-10-03 5
3 2019-10-04 3
4 2019-10-05 2
df['date'] = pd.to_datetime(df['date'])
# extracting the week number
df['weekNo'] = df['date'].dt.isocalendar().week
date rides weekNo
0 2019-10-01 1 40
1 2019-10-02 2 40
2 2019-10-03 5 40
Method 1: Use Pivot table
df.pivot_table(values='rides',index='weekNo',aggfunc='mean')
output
rides
weekNo
40 2.833333
41 2.571429
42 4.000000
Method 2: Use groupby.mean()
df.groupby('weekNo')['rides'].mean()
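This returns the same weekly means as the pivot table, just as a Series:
weekNo
40    2.833333
41    2.571429
42    4.000000
Name: rides, dtype: float64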
I have 5 years of daily volume data. I want to create a new column in a pandas dataframe holding the YoY growth value for that particular day, e.g. comparing 2018-01-01 with 2017-01-01, and 2019-01-01 with 2018-01-01.
I have 364 records for each year (except for the year 2020, which has 365 days).
How can I create the YoY_Growth column in the pandas dataframe, as below?
# It's more convenient to index the dataframe by Date for this algorithm
df = df.set_index("Date")
is_leap_day = (df.index.month == 2) & (df.index.day == 29)
# Leap day is an edge case, since you can't find Feb 29 of the previous year.
# pandas handles this by shifting to Feb 28 of the previous year:
# 2020-02-29 -> 2019-02-28
# 2020-02-28 -> 2019-02-28
# This creates a duplicate for Feb 28. So we need to handle leap day separately.
volume_last_year = df.loc[~is_leap_day, "Volume"].shift(freq=pd.DateOffset(years=1))
# For non leap days
df["YoY_Growth"] = df["Volume"] / volume_last_year - 1
# For leap days
df.loc[is_leap_day, "YoY_Growth"] = (
df.loc[is_leap_day, "Volume"] / volume_last_year.shift(freq=pd.DateOffset(days=1))
- 1
)
Result (Volume was randomly generated):
df.loc[["2019-01-01", "2019-02-28", "2020-01-01", "2020-02-28", "2020-02-29"], :]
Volume YoY_Growth
Date
2019-01-01 45 NaN
2019-02-28 23 NaN
2020-01-01 10 -0.777778 # = 10 / 45 - 1
2020-02-28 34 0.478261 # = 34 / 23 - 1
2020-02-29 76 2.304348 # = 76 / 23 - 1
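Note the use of shift(freq=pd.DateOffset(years=1)): shifting by a calendar offset moves each observation's index forward one year instead of shifting by row position, so the division aligns on dates, and any missing dates simply produce NaN rather than misaligned ratios.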
I am trying to create 2 columns based on a column that contains numerical values.
Value
0
4
10
24
null
49
Expected Output:
Value Day Hour
0 Sunday 12:00am
4 Sunday 4:00am
10 Sunday 10:00am
24 Monday 12:00am
null No Day No Time
49 Tuesday 1:00am
Continued.....
Code I am trying out:
value = df.Value.unique()
Sunday_Starting_Point = pd.to_datetime('Sunday 2015')
(Sunday_Starting_Point + pd.to_timedelta(value, 'h')).strftime('%A %I:%M%p')
Thanks for looking!
I think the unique values are not necessary; you can use dt.strftime twice for the 2 columns, with replace for the NaT values:
Sunday_Starting_Point = pd.to_datetime('Sunday 2015')
x = pd.to_numeric(df.Value, errors='coerce')
s = Sunday_Starting_Point + pd.to_timedelta(x, unit='h')
df['Day'] = s.dt.strftime('%A').replace('NaT','No Day')
df['Hour'] = s.dt.strftime('%I:%M%p').replace('NaT','No Time')
print (df)
Value Day Hour
0 0.0 Sunday 12:00AM
1 4.0 Sunday 04:00AM
2 10.0 Sunday 10:00AM
3 24.0 Monday 12:00AM
4 NaN No Day No Time
5 49.0 Tuesday 01:00AM
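The key step is pd.to_numeric(df.Value, errors='coerce'), which turns the non-numeric 'null' entries into NaN; adding a NaN timedelta yields NaT, whose strftime output is then replaced with the 'No Day' / 'No Time' placeholders.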
I have a dataframe df_pct_Max with the following shape:
Date Value1 Value2
01.01.2015 5 6
08.01.2015 3 2
... ... ...
28.01.2017 7 8
and I would like to calculate the average per calendar week and subtract it from the actual values for the calendar week.
I have created a dataframe with the average per calendar week as follows:
df_weekly_avg_Max = df_pct_Max.groupby(df_pct_Max.index.isocalendar().week).mean()
This results in a dataframe df_weekly_avg_Max:
KW Value1 Value2
1 3.5 4.3
2 4 3
… … …
52 8.33 6.2
Now I am trying to subtract df_weekly_avg_Max from df_pct_Max, and would like to do this by calendar week.
I have tried adding a column 'KW' and then
dfresult = df_pct_Max.sub(df_weekly_avg_Max, axis='KW')
but I am getting errors there.
Is there also a way of doing this on a rolling basis (subtracting the average of calendar week 1 over the past 3 years from calendar week 1 of 2015, then of 2016, ...)?
Could someone please help with this issue?
I have found a solution for the whole dataframe.
I added a column 'KW' for the calendar week and then performed a groupby on it with a lambda function that subtracts each calendar week's mean from the current value of that calendar week:
df_pct_Max['KW'] = df_pct_Max.index.isocalendar().week
dfresult = df_pct_Max.groupby(by='KW').transform(lambda x: x - x.mean())
This works for me.
It would have been nicer to be able to adjust the timeframe of the mean, e.g. subtracting from the current calendar week 1 value the mean of calendar week 1 over the past 3 years or so. But this seems rather complicated, and this solution works for the current analysis (a sketch of the rolling idea follows below).
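For completeness, a hedged sketch of that rolling variant (untested; assumes a DatetimeIndex spanning several years and the Value1/Value2 columns above): average each calendar week per year, then subtract a per-week rolling mean over the previous 3 years.

df_pct_Max['KW'] = df_pct_Max.index.isocalendar().week
df_pct_Max['year'] = df_pct_Max.index.year

# one mean per (calendar week, year)
yearly = df_pct_Max.groupby(['KW', 'year'])[['Value1', 'Value2']].mean()

# rolling 3-year mean of each calendar week, shifted so each year
# only sees earlier years
rolling = (yearly.groupby(level='KW')
                 .transform(lambda x: x.rolling(3, min_periods=1).mean().shift()))
deviation = yearly - rolling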
This answer isn't clean, as it doesn't make good use of pandas, but I also don't think it will be slow (it depends on how large your dataframe is). The basic idea is to build up a list of the means, repeated once for each day in the week, so you can subtract simply (the subtraction itself is sketched after the output).
CODE:
from collections import Counter
import pandas as pd
import numpy as np
#Build up example data frame
num_days = 15
dates = pd.date_range('1/1/2015', periods=num_days, freq='D')
val1s = np.random.randint(1, 31, num_days)
val2s = np.random.randint(1, 31, num_days)
df_pct_MAX = pd.DataFrame({'Date':dates, 'Value1':val1s, 'Value2':val2s})
df_pct_MAX['Day'] = df_pct_MAX['Date'].dt.day_name()
df_pct_MAX['Week'] = df_pct_MAX['Date'].dt.isocalendar().week
#OPs logic to get means
df_weekly_avg_Max = df_pct_MAX.groupby(df_pct_MAX['Week']).mean(numeric_only=True)
#Build up a list of the means repeated once for each day in that week
mean_fields = ['Value1','Value2'] #<-- only hardcoded portion
means_dict = {k:list(df_weekly_avg_Max[k]) for k in mean_fields} #<-- convert means into lists keyed by field
week_counts = Counter(df_pct_MAX['Week']).values() #<-- count how many days are represented in each week
#Build up a dict keyed by field with the means repeated the correct number of times
means = {k:[means_dict[k][i] for i,count in enumerate(week_counts)
for x in range(count)] for k in mean_fields}
#Assign a new column to the means for each field (not necessary, just to show done correctly)
for k in mean_fields:
df_pct_MAX[k+'Mean'] = means[k]
print(df_pct_MAX)
OUTPUT:
Date Value1 Value2 Day Week Value1Mean Value2Mean
0 2015-01-01 12 19 Thursday 1 9.000000 19.250000
1 2015-01-02 15 27 Friday 1 9.000000 19.250000
2 2015-01-03 2 30 Saturday 1 9.000000 19.250000
3 2015-01-04 7 1 Sunday 1 9.000000 19.250000
4 2015-01-05 6 20 Monday 2 17.571429 14.142857
5 2015-01-06 9 24 Tuesday 2 17.571429 14.142857
6 2015-01-07 25 17 Wednesday 2 17.571429 14.142857
7 2015-01-08 22 8 Thursday 2 17.571429 14.142857
8 2015-01-09 30 7 Friday 2 17.571429 14.142857
9 2015-01-10 10 1 Saturday 2 17.571429 14.142857
10 2015-01-11 21 22 Sunday 2 17.571429 14.142857
11 2015-01-12 23 29 Monday 3 23.750000 19.750000
12 2015-01-13 23 16 Tuesday 3 23.750000 19.750000
13 2015-01-14 21 17 Wednesday 3 23.750000 19.750000
14 2015-01-15 28 17 Thursday 3 23.750000 19.750000
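With the mean columns in place, the subtraction the question asks for is then a one-liner per field (a small completeness addition, using the names built above):

for k in mean_fields:
    df_pct_MAX[k + 'Diff'] = df_pct_MAX[k] - df_pct_MAX[k + 'Mean']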
I have a pandas dataframe with the columns 'Date' and 'Skew' (a float). I want to average the Skew values between every Tuesday and then store them in a list or dataframe. I tried using a lambda as given in this question, Pandas, groupby and summing over specific months, but it only helps to sum over a particular week; I cannot go across weeks, i.e. from one Tuesday to another. Can you show how to do the same?
Here's an example with random data:
import numpy as np

df = pd.DataFrame({'Date': pd.date_range('20130101', periods=100),
                   'Skew': 10 + np.random.randn(100)})
min_date = df.Date.min()
start = min_date.dayofweek
if start < 1:
    min_date = min_date - np.timedelta64(6 + start, 'D')
elif start > 1:
    min_date = min_date - np.timedelta64(start - 1, 'D')
df.groupby((df.Date - min_date).dt.days // 7).mean(numeric_only=True)
Input:
>>> df
Date Skew
0 2013-01-01 10.082080
1 2013-01-02 10.907402
2 2013-01-03 8.485768
3 2013-01-04 9.221740
4 2013-01-05 10.137910
5 2013-01-06 9.084963
6 2013-01-07 9.457736
7 2013-01-08 10.092777
Output:
Skew
Date
0 9.625371
1 9.993275
2 10.041077
3 9.837709
4 9.901311
5 9.985390
6 10.123757
7 9.782892
8 9.889291
9 9.853204
10 10.190098
11 10.594125
12 10.012265
13 9.278008
14 10.530251
Logic: compute each row's week number relative to the first week's Tuesday, then group by that week number and take each group's (i.e. each week's) mean.
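For what it's worth, a more idiomatic route today might be resample with a weekly anchor; 'W-MON' builds weeks that end on Monday, i.e. run from one Tuesday to the next (a hedged alternative, not the method above):

# weekly mean of Skew, weeks running Tuesday through Monday
df.set_index('Date')['Skew'].resample('W-MON').mean()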