Calculate mean based on time elapsed in Pandas - python

I tried to ask this question previously, but it was too ambiguous so here goes again. I am new to programming, so I am still learning how to ask questions in a useful way.
In summary, I have a pandas dataframe that resembles "INPUT DATA" that I would like to convert to "DESIRED OUTPUT", as shown below.
Each row contains an ID, a DateTime, and a Value. For each unique ID, the first row corresponds to timepoint 'zero', and each subsequent row contains a value 5 minutes following the previous row and so on.
I would like to calculate the mean of all the IDs for every 'time elapsed' timepoint. For example, in "DESIRED OUTPUT" Time Elapsed=0.0 would have the value 128.3 (100+105+180/3); Time Elapsed=5.0 would have the value 150.0 (150+110+190/3); Time Elapsed=10.0 would have the value 133.3 (125+90+185/3) and so on for Time Elapsed=15,20,25 etc.
I'm not sure how to create a new column which has the value for the time elapsed for each ID (e.g. 0.0, 5.0, 10.0 etc). I think that once I know how to do that, then I can use the groupby function to calculate the means for each time elapsed.
INPUT DATA
ID DateTime Value
1 2018-01-01 15:00:00 100
1 2018-01-01 15:05:00 150
1 2018-01-01 15:10:00 125
2 2018-02-02 13:15:00 105
2 2018-02-02 13:20:00 110
2 2018-02-02 13:25:00 90
3 2019-03-03 05:05:00 180
3 2019-03-03 05:10:00 190
3 2019-03-03 05:15:00 185
DESIRED OUTPUT
Time Elapsed Mean Value
0.0 128.3
5.0 150.0
10.0 133.3

Here is one way , using transform with groupby get the group key 'Time Elapsed', then just groupby it get the mean
df['Time Elapsed']=df.DateTime-df.groupby('ID').DateTime.transform('first')
df.groupby('Time Elapsed').Value.mean()
Out[998]:
Time Elapsed
00:00:00 128.333333
00:05:00 150.000000
00:10:00 133.333333
Name: Value, dtype: float64

You can do this explicitly by taking advantage of the datetime attributes of the DateTime column in your DataFrame
First get the year, month and day for each DateTime since they are all changing in your data
df['month'] = df['DateTime'].dt.month
df['day'] = df['DateTime'].dt.day
df['year'] = df['DateTime'].dt.year
print(df)
ID DateTime Value month day year
1 1 2018-01-01 15:00:00 100 1 1 2018
1 1 2018-01-01 15:05:00 150 1 1 2018
1 1 2018-01-01 15:10:00 125 1 1 2018
2 2 2018-02-02 13:15:00 105 2 2 2018
2 2 2018-02-02 13:20:00 110 2 2 2018
2 2 2018-02-02 13:25:00 90 2 2 2018
3 3 2019-03-03 05:05:00 180 3 3 2019
3 3 2019-03-03 05:10:00 190 3 3 2019
3 3 2019-03-03 05:15:00 185 3 3 2019
Then append a sequential DateTime counter column (per this SO post)
the counter is computed within (1) each year, (2) then each month and then (3) each day
since the data are in multiples of 5 minutes, use this to scale the counter values (i.e. the counter will be in multiples of 5 minutes, rather than a sequence of increasing integers)
df['Time Elapsed'] = df.groupby(['year', 'month', 'day']).cumcount() + 1
df['Time Elapsed'] *= 5
print(df)
ID DateTime Value month day year cumulative_record
1 1 2018-01-01 15:00:00 100 1 1 2018 5
1 1 2018-01-01 15:05:00 150 1 1 2018 10
1 1 2018-01-01 15:10:00 125 1 1 2018 15
2 2 2018-02-02 13:15:00 105 2 2 2018 5
2 2 2018-02-02 13:20:00 110 2 2 2018 10
2 2 2018-02-02 13:25:00 90 2 2 2018 15
3 3 2019-03-03 05:05:00 180 3 3 2019 5
3 3 2019-03-03 05:10:00 190 3 3 2019 10
3 3 2019-03-03 05:15:00 185 3 3 2019 15
Perform the groupby over the newly appended counter column
dfg = df.groupby('Time Elapsed')['Value'].mean()
print(dfg)
Time Elapsed
5 128.333333
10 150.000000
15 133.333333
Name: Value, dtype: float64

Related

Create Bi-weekly Variable Based On A Custom Start-date

I've done some light searching using "get biweekly variable in python" but havent been able to find many useful posts so I thought to post my question here.
I have a dataframe with tens of thousands of records. The dataframe contains records for the entire fiscal year. Each records has a datetime variable CHECKIN_DATE_TIME. I would like to create a biweekly variable beginning with the date June 30 2019.
ID CHECKIN_DATE_TIME
1 2019-06-30 13:36:00
2 2019-06-30 14:26:00
3 2019-06-30 20:10:00
4 2019-06-30 21:27:00
....
51 2019-07-10 13:36:00
52 2019-07-10 10:26:00
53 2019-07-10 10:10:00
54 2019-07-10 23:27:00
....
I would like a new dataframe to look like this where 6/30/2019 - 7/13/2019 would be week 1, 7/14/2019 to 7/27/2019 would be week 2, and so on until the end date of 6/28/2020. Thus there will be 26 weeks within the Week variable and each week represent a 2 week time frame.
EDIT and to have the last day in the week range assigned to the week number.
ID CHECKIN_DATE_TIME Week Date
1 2019-06-30 13:36:00 1 7/13/2019
2 2019-06-30 14:26:00 1 7/13/2019
3 2019-06-30 20:10:00 1 7/13/2019
4 2019-06-30 21:27:00 1 7/13/2019
....
51 2019-07-20 13:36:00 2 7/27/2019
52 2019-07-20 10:26:00 2 7/27/2019
53 2019-07-20 10:10:00 2 7/27/2019
54 2019-07-20 23:27:00 2
....
You can do so by determining the number of days between the check-in date and 2019-06-30 and then doing a floor division by 14.
df['CHECKIN_DATE_TIME'] = pd.to_datetime(df.CHECKIN_DATE_TIME)
df['week'] = (df.CHECKIN_DATE_TIME - pd.datetime(2019, 6, 30)).dt.days // 14 + 1
df['last_week_day'] = (pd.to_timedelta(-((df.CHECKIN_DATE_TIME - pd.datetime(2019,6,30)).dt.days % 14) + 13 ,'d') + df.CHECKIN_DATE_TIME).dt.date
# note I've created my own test set.
ID CHECKIN_DATE_TIME week last_week_day
0 1 2019-06-30 13:36:00 1 2019-07-13
1 2 2019-07-10 10:36:00 1 2019-07-13
2 3 2019-07-12 02:36:00 1 2019-07-13
3 4 2019-07-18 18:36:00 2 2019-07-27
4 5 2019-07-30 11:36:00 3 2019-08-10
5 6 2019-08-01 20:36:00 3 2019-08-10
Edit: added last_week_day as per request in comments. This is done by calculating the required number of days to the CHECKIN_DATE_TIME columns using modulo operator %.
Using Pandas date_range function it is very easy and efficient way to generate date list for weekly, bi-weekly or monthly
import pandas as pd
from datetime import date,datetime,timedelta
date_rng=pd.date_range(start=date.today()-timedelta(weeks=53),end=date.today(), freq="2W-SAT")
for i in date_rng:
print(i)

How can I parse a field in a DF into Month, Day, Year, Hour, and Weekday?

I have data that looks like this.
VendorID lpep_pickup_datetime lpep_dropoff_datetime store_and_fwd_flag
2 1/1/2018 0:18:50 1/1/2018 12:24:39 AM N
2 1/1/2018 0:30:26 1/1/2018 12:46:42 AM N
2 1/1/2018 0:07:25 1/1/2018 12:19:45 AM N
2 1/1/2018 0:32:40 1/1/2018 12:33:41 AM N
2 1/1/2018 0:32:40 1/1/2018 12:33:41 AM N
2 1/1/2018 0:38:35 1/1/2018 1:08:50 AM N
2 1/1/2018 0:18:41 1/1/2018 12:28:22 AM N
2 1/1/2018 0:38:02 1/1/2018 12:55:02 AM N
2 1/1/2018 0:05:02 1/1/2018 12:18:35 AM N
2 1/1/2018 0:35:23 1/1/2018 12:42:07 AM N
So, I converted df.lpep_pickup_datetime to datetime, but originally it comes in as a string. I'm not sure which one is easier to work with. I want to append 5 fields onto my current dataframe: year, month, day, weekday, and hour.
I tried this:
df['Year']=[d.split('-')[0] for d in df.lpep_pickup_datetime]
df['Month']=[d.split('-')[1] for d in df.lpep_pickup_datetime]
df['Day']=[d.split('-')[2] for d in df.lpep_pickup_datetime]
That gives me this error: AttributeError: 'Timestamp' object has no attribute 'split'
I tried this:
df2 = pd.DataFrame(df.lpep_pickup_datetime.dt.strftime('%m-%d-%Y-%H').str.split('/').tolist(),
columns=['Month', 'Day', 'Year', 'Hour'],dtype=int)
df = pd.concat((df,df2),axis=1)
That gives me this error: AssertionError: 4 columns passed, passed data had 1 columns
Basically, I want to parse df.lpep_pickup_datetime into year, month, day, weekday, and hour, appending each to the same dataframe. How can I do that?
Thanks!!
Here you go, first I'm creating a random dataset and then renaming the column date to the name you want, so you can just copy the code. Pandas has a big section of time-series series manipulation, you don't actually need to import datetime. Here you can find a lot more information about it:
import pandas as pd
date_rng = pd.date_range(start='1/1/2018', end='4/01/2018', freq='H')
df = pd.DataFrame(date_rng, columns=['date'])
df['lpep_pickup_datetime'] = df['date']
df['year'] = df['lpep_pickup_datetime'].dt.year
df['year'] = df['lpep_pickup_datetime'].dt.month
df['weekday'] = df['lpep_pickup_datetime'].dt.weekday
df['day'] = df['lpep_pickup_datetime'].dt.day
df['hour'] = df['lpep_pickup_datetime'].dt.hour
print(df)
Output:
date lpep_pickup_datetime year weekday day hour
0 2018-01-01 00:00:00 2018-01-01 00:00:00 1 0 1 0
1 2018-01-01 01:00:00 2018-01-01 01:00:00 1 0 1 1
2 2018-01-01 02:00:00 2018-01-01 02:00:00 1 0 1 2
3 2018-01-01 03:00:00 2018-01-01 03:00:00 1 0 1 3
4 2018-01-01 04:00:00 2018-01-01 04:00:00 1 0 1 4
... ... ... ... ... ... ...
2156 2018-03-31 20:00:00 2018-03-31 20:00:00 3 5 31 20
2157 2018-03-31 21:00:00 2018-03-31 21:00:00 3 5 31 21
2158 2018-03-31 22:00:00 2018-03-31 22:00:00 3 5 31 22
2159 2018-03-31 23:00:00 2018-03-31 23:00:00 3 5 31 23
2160 2018-04-01 00:00:00 2018-04-01 00:00:00 4 6 1 0
EDIT: Since this is not working (As stated in the comments in this answer), I believe your data is formated incorrectly. Try this before applying anything:
df['lpep_pickup_datetime'] = pd.to_datetime(df['lpep_pickup_datetime'], format='%d/%m/%y %H:%M:%S')
If this format is recognized properly, then you should have no trouble using dt.year,dt.month,dt.hour,dt.day,dt.weekday.
Give this a go. Since your dates are in the datetime dtype already, just use the datetime properties to extract each part.
import pandas as pd
from datetime import datetime as dt
# Creating a fake dataset of dates.
dates = [dt.now().strftime('%d/%m/%Y %H:%M:%S') for i in range(10)]
df = pd.DataFrame({'lpep_pickup_datetime': dates})
df['lpep_pickup_datetime'] = pd.to_datetime(df['lpep_pickup_datetime'])
# Parse each date into its parts and store as a new column.
df['month'] = df['lpep_pickup_datetime'].dt.month
df['day'] = df['lpep_pickup_datetime'].dt.day
df['year'] = df['lpep_pickup_datetime'].dt.year
# ... and so on ...
Output:
lpep_pickup_datetime month day year
0 2019-09-24 16:46:10 9 24 2019
1 2019-09-24 16:46:10 9 24 2019
2 2019-09-24 16:46:10 9 24 2019
3 2019-09-24 16:46:10 9 24 2019
4 2019-09-24 16:46:10 9 24 2019
5 2019-09-24 16:46:10 9 24 2019
6 2019-09-24 16:46:10 9 24 2019
7 2019-09-24 16:46:10 9 24 2019
8 2019-09-24 16:46:10 9 24 2019
9 2019-09-24 16:46:10 9 24 2019

Changing datetime column to integer number without loop

I have a pandas dataset like this:
user_id datetime
1 13 days 21:50:00
2 0 days 02:05:00
5 10 days 00:10:00
7 2 days 01:20:00
1 3 days 11:50:00
2 1 days 02:30:00
I want to have a column that contains the mintues, So in this case the result can be :
user_id datetime minutes
1 13 days 21:50:00 20030
2 0 days 02:05:00 125
5 10 days 00:10:00 14402
7 2 days 01:20:00 2960
1 3 days 11:50:00 5030
2 1 days 02:30:00 1590
Is there any way to do that without loop?
Yes, there is a special dt accessor for date/time series:
df['minutes'] = df['datetime'].dt.total_seconds() / 60
If you only want whole minutes, cast the result using .astype(int).
Here is a way with pd.Timedelta:
df['minutes'] = pd.to_timedelta(df.datetime) / pd.Timedelta(1, 'm')
>>> df
user_id datetime minutes
0 1 13 days 21:50:00 20030.0
1 2 0 days 02:05:00 125.0
2 5 10 days 00:10:00 14410.0
3 7 2 days 01:20:00 2960.0
4 1 3 days 11:50:00 5030.0
5 2 1 days 02:30:00 1590.0
if your datetime column is already of dtype timedelta, you can omit the explicit casting and just use:
df['minutes'] = df.datetime / pd.Timedelta(1, 'm')

How to add missing rows in dataframe by comparing values in python

Hi I am using python pandas for dataframes,I have data something like as followed:
Employee-ID Time-slot Calls-received Prod-sold
1 14:30:00 10 1
1 15:00:00 15 3
1 15:30:00 10 2
1
16:00:00 8 2
1 16:30:00 10 0
2 14:30:00 10 2
2
15:00:00 15 3
2 16:30:00 10 2
2 17:00:00 10 0
I have 10,000 employee and ideally there should be 16 time slots for each employee but time slots are missing for some employees, like employee 2 time slot 15:30:00 and 16:00:00 is missing I wish to add new rows and with missing time slots and zero values for 'calls-received' and prod-sold. Something like that:
2 14:30:00 10 2
2 15:00:00 15 3
2 15:30:00 0 0
2 16:00:00 0 0
2 16:30:00 10 2
2 17:00:00 10 0

Aggregate to 15min based timestamp to hour and find sum, avg and max for multiple columns in pandas

I have a dataframe with period_start_time by every 15 minutes and now I need to aggregate to 1 hour and calculate sum and avg for almost every column in dataframe (it has about 20 columns) and
PERIOD_START_TIME ID val1 val2
06.21.2017 22:15:00 12 3 0
06.21.2017 22:30:00 12 5 6
06.21.2017 22:45:00 12 0 3
06.21.2017 23:00:00 12 5 2
...
06.21.2017 22:15:00 15 9 2
06.21.2017 22:30:00 15 0 2
06.21.2017 22:45:00 15 1 5
06.21.2017 23:00:00 15 0 1
...
Desired output:
PERIOD_START_TIME ID val1(avg) val1(sum) val1(max) ...
06.21.2017 22:00:00 12 3.25 13 5
...
06.21.2017 23:00:00 15 2.25 10 9 ...
And for columns val2 too, and for every other column in dataframe.
I have no idea how to group by period start time for every hour, not for the whole day, no idea how to start.
I believe you need Series.dt.floor for Hours and then aggregate by agg:
df = df.groupby([df['PERIOD_START_TIME'].dt.floor('H'),'ID']).agg(['mean','sum', 'max'])
#for columns from MultiIndex
df.columns = df.columns.map('_'.join)
print (df)
val1_mean val1_sum val1_max val2_mean val2_sum \
PERIOD_START_TIME ID
2017-06-21 22:00:00 12 2.666667 8 5 3 9
15 3.333333 10 9 3 9
2017-06-21 23:00:00 12 5.000000 5 5 2 2
15 0.000000 0 0 1 1
val2_max
PERIOD_START_TIME ID
2017-06-21 22:00:00 12 6
15 5
2017-06-21 23:00:00 12 2
15 1
df = df.reset_index()
print (df)
PERIOD_START_TIME ID val1_mean val1_sum val1_max val2_mean val2_sum \
0 2017-06-21 22:00 12 2.666667 8 5 3 9
1 2017-06-21 22:00 15 3.333333 10 9 3 9
2 2017-06-21 23:00 12 5.000000 5 5 2 2
3 2017-06-21 23:00 15 0.000000 0 0 1 1
val2_max
0 6
1 5
2 2
3 1
Very similarly you can convert PERIOD_START_TIME to a pandas Period.
df['PERIOD_START_TIME'] = df['PERIOD_START_TIME'].dt.to_period('H')
df.groupby(['PERIOD_START_TIME', 'ID']).agg(['max', 'min', 'mean']).reset_index()

Categories

Resources