Counting number of events per day & their duration - python

I have a time series sampled at minute level in which I log certain events. To simplify, I use a binary flag here: either there is an event (1) or there is not (0). I want to get some daily statistics from it.
I tried to explain what I have and what I want to get in the figure below.
In summary, I would like to detect all events (1) together with their durations.
What would be the easiest way of doing this in Python with pandas?
Here is an extract from the dataframe:
Time Event
2020-01-27 09:26:00 0
2020-01-27 09:28:00 0
2020-01-27 09:30:00 0
2020-01-27 09:32:00 0
2020-01-27 09:34:00 0
2020-01-27 09:36:00 0
2020-01-27 09:38:00 0
2020-01-27 09:40:00 0
2020-01-27 09:42:00 0
2020-01-27 09:44:00 0
2020-01-27 09:46:00 0
2020-01-27 09:48:00 1
2020-01-27 09:50:00 1
2020-01-27 09:52:00 1
2020-01-27 09:54:00 1
2020-01-27 09:56:00 1
2020-01-27 09:58:00 1
2020-01-27 10:00:00 1
2020-01-27 10:02:00 1
2020-01-27 10:04:00 1
2020-01-27 10:06:00 1
2020-01-27 10:08:00 1
2020-01-27 10:10:00 1
2020-01-27 10:12:00 1
2020-01-27 10:14:00 1
2020-01-27 10:16:00 1
2020-01-27 10:18:00 1
2020-01-27 10:20:00 1
2020-01-27 10:22:00 1
2020-01-27 10:24:00 1

I solved it myself. Here is how:
I took the difference between each sample and the previous one. A value of 1 marks the start of an event and -1 marks its end, so counting the starts gives the number of events, and the time between each start and the next end gives the duration.
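import pandas as pd

# Diff is 1 where an event starts (0 -> 1) and -1 where it ends (1 -> 0)
Sessions['Diff']=Sessions['Event'].diff()
begin = Sessions.loc[Sessions['Diff'] == 1].index
cutoffs = Sessions.loc[Sessions['Diff'] == -1].index
# For each start, find the position of the first end that comes after it
idx = cutoffs.searchsorted(begin)
# Drop starts that have no matching end (event still running when the data ends)
mask = idx < len(cutoffs)
idx = idx[mask]
begin = begin[mask]
end = cutoffs[idx]
result = pd.DataFrame({'begin':begin, 'end':end})
# Duration of each event in minutes
result['dT']=result['end']-result['begin']
result['dT']=result['dT'].dt.total_seconds().div(60)
result=result.set_index('begin',drop=False)
result=result.rename(columns={'begin':'sessions','dT':'avg_dT'})
result['tot_dT']=result['avg_dT']
# Daily statistics: number of events, mean duration and total duration
Sessions_daily=result.resample('D').apply({'sessions':'count','avg_dT':'mean','tot_dT':'sum'})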
Here Sessions is a dataframe holding the time series (minute sampling), with an Event column that is 1 when there is an event and 0 otherwise.
This resulted in
sessions avg_dT tot_dT
begin
2020-01-03 5 31.200000 156.0
2020-01-04 0 NaN 0.0
2020-01-05 0 NaN 0.0
2020-01-06 0 NaN 0.0
2020-01-07 9 39.333333 354.0
2020-01-08 8 38.000000 304.0
2020-01-09 8 33.000000 264.0
2020-01-10 8 39.250000 314.0
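A run-based variant of the same idea can be sketched as follows. This is not the code above: it labels each run of identical values with a cumulative sum and, as an assumption, measures each event from its first 1 to its last 1 (the diff-based version measures up to the first 0 after the event), so it also keeps an event that is still running when the data ends. The function name event_runs, the column name minutes and the toy series are made up for illustration.

```python
import pandas as pd


def event_runs(events: pd.Series) -> pd.DataFrame:
    """Return one row per run of consecutive 1s: begin, end, minutes."""
    is_event = events.eq(1)
    # A new run starts wherever the flag differs from the previous sample.
    run_id = is_event.ne(is_event.shift()).cumsum()
    times = events.index.to_series()
    # Keep only the samples inside events and take first/last time per run.
    runs = times[is_event].groupby(run_id[is_event]).agg(begin="min", end="max")
    runs["minutes"] = (runs["end"] - runs["begin"]).dt.total_seconds() / 60
    return runs.reset_index(drop=True)


# Toy data (hypothetical): two events on the same day, sampled every 2 minutes.
idx = pd.date_range("2020-01-27 09:40", periods=8, freq="2min")
events = pd.Series([0, 1, 1, 1, 0, 0, 1, 1], index=idx, name="Event")

runs = event_runs(events)
# Daily statistics: number of events, mean duration, total duration.
daily = runs.set_index("begin")["minutes"].resample("D").agg(["count", "mean", "sum"])
print(runs)
print(daily)
```

Whether a duration should run to the last 1 or to the first 0 after it depends on how the samples are meant to be read, so the choice made here is only one option.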

Related

How to find occurrence of consecutive events in python timeseries data frame?

I have got a time series of meteorological observations with date and value columns:
df = pd.DataFrame({'date':['11/10/2017 0:00','11/10/2017 03:00','11/10/2017 06:00','11/10/2017 09:00','11/10/2017 12:00',
'11/11/2017 0:00','11/11/2017 03:00','11/11/2017 06:00','11/11/2017 09:00','11/11/2017 12:00',
'11/12/2017 00:00','11/12/2017 03:00','11/12/2017 06:00','11/12/2017 09:00','11/12/2017 12:00'],
'value':[850,np.nan,np.nan,np.nan,np.nan,500,650,780,np.nan,800,350,690,780,np.nan,np.nan],
'consecutive_hour': [ 3,0,0,0,0,3,6,9,0,3,3,6,9,0,0]})
From this DataFrame I want to build the third column, consecutive_hour: if the value at a timestamp is less than 1000, that row gets 3 (hours), and each further consecutive row that also satisfies the condition adds another 3, giving 6, 9 and so on, as shown above.
Lastly, I want to summarize the table by counting, for each consecutive-hours value, the number of days on which it occurs, so that the summary table looks like:
df_summary = pd.DataFrame({'consecutive_hours':[3,6,9,12],
'number_of_day':[2,0,2,0]})
I tried several online solutions and methods like shift(), diff() etc., as mentioned in "How to groupby consecutive values in pandas DataFrame" and elsewhere, and spent several days on it, but no luck yet.
I would highly appreciate help on this issue.
Thanks!
Input data:
>>> df
date value
0 2017-11-10 00:00:00 850.0
1 2017-11-10 03:00:00 NaN
2 2017-11-10 06:00:00 NaN
3 2017-11-10 09:00:00 NaN
4 2017-11-10 12:00:00 NaN
5 2017-11-11 00:00:00 500.0
6 2017-11-11 03:00:00 650.0
7 2017-11-11 06:00:00 780.0
8 2017-11-11 09:00:00 NaN
9 2017-11-11 12:00:00 800.0
10 2017-11-12 00:00:00 350.0
11 2017-11-12 03:00:00 690.0
12 2017-11-12 06:00:00 780.0
13 2017-11-12 09:00:00 NaN
14 2017-11-12 12:00:00 NaN
The cumcount_reset function is adapted from this answer by @jezrael:
Python pandas cumsum with reset everytime there is a 0
cumcount_reset = \
lambda b: b.cumsum().sub(b.cumsum().where(~b).ffill().fillna(0)).astype(int)
df["consecutive_hour"] = (df.set_index("date")["value"] < 1000) \
.groupby(pd.Grouper(freq="D")) \
.apply(lambda b: cumcount_reset(b)).mul(3) \
.reset_index(drop=True)
Output result:
>>> df
date value consecutive_hour
0 2017-11-10 00:00:00 850.0 3
1 2017-11-10 03:00:00 NaN 0
2 2017-11-10 06:00:00 NaN 0
3 2017-11-10 09:00:00 NaN 0
4 2017-11-10 12:00:00 NaN 0
5 2017-11-11 00:00:00 500.0 3
6 2017-11-11 03:00:00 650.0 6
7 2017-11-11 06:00:00 780.0 9
8 2017-11-11 09:00:00 NaN 0
9 2017-11-11 12:00:00 800.0 3
10 2017-11-12 00:00:00 350.0 3
11 2017-11-12 03:00:00 690.0 6
12 2017-11-12 06:00:00 780.0 9
13 2017-11-12 09:00:00 NaN 0
14 2017-11-12 12:00:00 NaN 0
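To see why the cumsum-with-reset trick works, here is a minimal sketch of the same expression on a toy boolean series, written out step by step. The function name consecutive_true_count and the sample values are made up for illustration.

```python
import pandas as pd


def consecutive_true_count(flags: pd.Series) -> pd.Series:
    """Count consecutive True values, restarting from 0 at every False."""
    total = flags.astype(int).cumsum()
    # Running total as it stood at the most recent False, carried forward;
    # before the first False there is nothing to subtract, hence fillna(0).
    baseline = total.where(~flags).ffill().fillna(0)
    return (total - baseline).astype(int)


flags = pd.Series([True, True, False, True, True, True, False])
counts = consecutive_true_count(flags)
print(counts.tolist())  # [1, 2, 0, 1, 2, 3, 0]
```

Multiplying the counts by 3 then gives the consecutive hours for 3-hourly observations.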
Summary table
df_summary = df.loc[df.groupby(pd.Grouper(key="date", freq="D"))["consecutive_hour"] \
.apply(lambda h: (h - h.shift(-1).fillna(0)) > 0),
"consecutive_hour"] \
.value_counts().reindex([3, 6, 9, 12], fill_value=0) \
.rename("number_of_day") \
.rename_axis("consecutive_hour") \
.reset_index()
>>> df_summary
consecutive_hour number_of_day
0 3 2
1 6 0
2 9 2
3 12 0

Find whether an event (with a start & end time) goes beyond a certain time (e.g. 6pm) in dataframe using Python (pandas, datetime)

I'm creating a Python program using pandas and datetime libraries that will calculate the pay from my casual job each week, so I can cross reference my bank statement instead of looking through payslips.
The data that I am analysing is from the Google Calendar API that is synced with my work schedule. It prints the events in that particular calendar to a csv file in this format:
   Start             End               Title  Hours
0  02.12.2020 07:00  02.12.2020 16:00  Shift  9.0
1  04.12.2020 18:00  04.12.2020 21:00  Shift  3.0
2  05.12.2020 07:00  05.12.2020 12:00  Shift  5.0
3  06.12.2020 09:00  06.12.2020 18:00  Shift  9.0
4  07.12.2020 19:00  07.12.2020 23:00  Shift  4.0
5  08.12.2020 19:00  08.12.2020 23:00  Shift  4.0
6  09.12.2020 10:00  09.12.2020 15:00  Shift  5.0
As I am a casual at this job, I have to take a few things into consideration, like penalty rates (the base rate, after 6pm on Monday to Friday, Saturday, and Sunday all have different rates). I'm wondering if I can analyse this csv using datetime and calculate how many hours are before 6pm and how many are after 6pm. So, using this as an example, the output would be like:
   Start             End               Title  Hours
1  04.12.2020 15:00  04.12.2020 21:00  Shift  6.0
   Start             End               Title  Total Hours  Hours before 6pm  Hours after 6pm
1  04.12.2020 15:00  04.12.2020 21:00  Shift  6.0          3.0               3.0
I can use this to get the day of the week but I'm just not sure how to analyse certain bits of time for penalty rates:
df['day_of_week'] = df['Start'].dt.day_name()
I appreciate any help in Python or even other coding languages/techniques this can be applied to:)
Edit:
This is how my dataframe is looking at the moment
   Start                End                  Title  Hours  day_of_week  Pay     week_of_year
0  2020-12-02 07:00:00  2020-12-02 16:00:00  Shift  9.0    Wednesday    337.30  49
EDIT
In response to David Erickson's comment.
    value                variable  bool
0   2020-12-02 07:00:00  Start     False
1   2020-12-02 08:00:00  Start     False
2   2020-12-02 09:00:00  Start     False
3   2020-12-02 10:00:00  Start     False
4   2020-12-02 11:00:00  Start     False
5   2020-12-02 12:00:00  Start     False
6   2020-12-02 13:00:00  Start     False
7   2020-12-02 14:00:00  Start     False
8   2020-12-02 15:00:00  Start     False
9   2020-12-02 16:00:00  End       False
10  2020-12-04 18:00:00  Start     False
11  2020-12-04 19:00:00  Start     True
12  2020-12-04 20:00:00  Start     True
13  2020-12-04 21:00:00  End       True
14  2020-12-05 07:00:00  Start     False
15  2020-12-05 08:00:00  Start     False
16  2020-12-05 09:00:00  Start     False
17  2020-12-05 10:00:00  Start     False
18  2020-12-05 11:00:00  Start     False
19  2020-12-05 12:00:00  End       False
20  2020-12-06 09:00:00  Start     False
21  2020-12-06 10:00:00  Start     False
22  2020-12-06 11:00:00  Start     False
23  2020-12-06 12:00:00  Start     False
24  2020-12-06 13:00:00  Start     False
25  2020-12-06 14:00:00  Start     False
26  2020-12-06 15:00:00  Start     False
27  2020-12-06 16:00:00  Start     False
28  2020-12-06 17:00:00  Start     False
29  2020-12-06 18:00:00  End       False
30  2020-12-07 19:00:00  Start     False
31  2020-12-07 20:00:00  Start     True
32  2020-12-07 21:00:00  Start     True
33  2020-12-07 22:00:00  Start     True
34  2020-12-07 23:00:00  End       True
35  2020-12-08 19:00:00  Start     False
36  2020-12-08 20:00:00  Start     True
37  2020-12-08 21:00:00  Start     True
38  2020-12-08 22:00:00  Start     True
39  2020-12-08 23:00:00  End       True
40  2020-12-09 10:00:00  Start     False
41  2020-12-09 11:00:00  Start     False
42  2020-12-09 12:00:00  Start     False
43  2020-12-09 13:00:00  Start     False
44  2020-12-09 14:00:00  Start     False
45  2020-12-09 15:00:00  End       False
46  2020-12-11 19:00:00  Start     False
47  2020-12-11 20:00:00  Start     True
48  2020-12-11 21:00:00  Start     True
49  2020-12-11 22:00:00  Start     True
UPDATE: (2020-12-19)
I have simply filtered out the Start rows; you were correct that an extra row was being calculated. I also passed dayfirst=True to pd.to_datetime() so that the dates are converted correctly, and I cleaned up the output with some extra columns.
higher_pay = 40
lower_pay = 30
df['Start'], df['End'] = pd.to_datetime(df['Start'], dayfirst=True), pd.to_datetime(df['End'], dayfirst=True)
start = df['Start']
df1 = df[['Start', 'End']].melt(value_name='Date').set_index('Date')
s = df1.groupby('variable').cumcount()
df1 = df1.groupby(s, group_keys=False).resample('1H').asfreq().join(s.rename('Shift').to_frame()).ffill().reset_index()
df1 = df1[~df1['Date'].isin(start)]
df1['Day'] = df1['Date'].dt.day_name()
df1['Week'] = df1['Date'].dt.isocalendar().week
m = (df1['Date'].dt.hour > 18) | (df1['Day'].isin(['Saturday', 'Sunday']))
df1['Higher Pay Hours'] = np.where(m, 1, 0)
df1['Lower Pay Hours'] = np.where(m, 0, 1)
df1['Pay'] = np.where(m, higher_pay, lower_pay)
df1 = df1.groupby(['Shift', 'Day', 'Week']).sum().reset_index()
df2 = df.merge(df1, how='left', left_index=True, right_on='Shift').drop('Shift', axis=1)
df2
Out[1]:
Start End Title Hours Day Week \
0 2020-12-02 07:00:00 2020-12-02 16:00:00 Shift 9.0 Wednesday 49
1 2020-12-04 18:00:00 2020-12-04 21:00:00 Shift 3.0 Friday 49
2 2020-12-05 07:00:00 2020-12-05 12:00:00 Shift 5.0 Saturday 49
3 2020-12-06 09:00:00 2020-12-06 18:00:00 Shift 9.0 Sunday 49
4 2020-12-07 19:00:00 2020-12-07 23:00:00 Shift 4.0 Monday 50
5 2020-12-08 19:00:00 2020-12-08 23:00:00 Shift 4.0 Tuesday 50
6 2020-12-09 10:00:00 2020-12-09 15:00:00 Shift 5.0 Wednesday 50
Higher Pay Hours Lower Pay Hours Pay
0 0 9 270
1 3 0 120
2 5 0 200
3 9 0 360
4 4 0 160
5 4 0 160
6 0 5 150
There are probably more concise ways to do this, but I thought resampling the dataframe and then counting the hours would be a clean approach. You can melt the dataframe to have Start and End in the same column, then fill in the gap hours with resample, making sure to group together the 'Start' and 'End' values that were initially on the same row. The easiest way to figure out which rows were initially together is to take the cumulative count (cumcount) of the rows in the new dataframe, grouped by the 'variable' column (whose values are 'Start' and 'End'). I'll show you how this works later in the answer.
Full Code:
df['Start'], df['End'] = pd.to_datetime(df['Start']), pd.to_datetime(df['End'])
df = df[['Start', 'End']].melt().set_index('value')
df = df.groupby(df.groupby('variable').cumcount(), group_keys=False).resample('1H').asfreq().ffill().reset_index()
m = (df['value'].dt.hour > 18) | (df['value'].dt.day_name().isin(['Saturday', 'Sunday']))
print('Higher Rate No. of Hours', df[m].shape[0])
print('Normal Rate No. of Hours', df[~m].shape[0])
Higher Rate No. of Hours 20
Normal Rate No. of Hours 26
Adding some more details...
Step 1: Melt the dataframe: You only need two columns 'Start' and 'End' to get your desired output
df = df[['Start', 'End']].melt().set_index('value')
df
Out[1]:
variable
value
2020-02-12 07:00:00 Start
2020-04-12 18:00:00 Start
2020-05-12 07:00:00 Start
2020-06-12 09:00:00 Start
2020-07-12 19:00:00 Start
2020-08-12 19:00:00 Start
2020-09-12 10:00:00 Start
2020-02-12 16:00:00 End
2020-04-12 21:00:00 End
2020-05-12 12:00:00 End
2020-06-12 18:00:00 End
2020-07-12 23:00:00 End
2020-08-12 23:00:00 End
2020-09-12 15:00:00 End
Step 2: Create groups in preparation for resample. As you can see, groups 0-6 line up with each other, pairing each 'Start' with the 'End' that was originally on the same row.
df.groupby('variable').cumcount()
Out[2]:
value
2020-02-12 07:00:00 0
2020-04-12 18:00:00 1
2020-05-12 07:00:00 2
2020-06-12 09:00:00 3
2020-07-12 19:00:00 4
2020-08-12 19:00:00 5
2020-09-12 10:00:00 6
2020-02-12 16:00:00 0
2020-04-12 21:00:00 1
2020-05-12 12:00:00 2
2020-06-12 18:00:00 3
2020-07-12 23:00:00 4
2020-08-12 23:00:00 5
2020-09-12 15:00:00 6
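The pairing in Steps 1 and 2 can be sketched on two made-up shifts. The frame name shifts and the column names Date and Shift are chosen for illustration; melt names the label column variable by default.

```python
import pandas as pd

# Two hypothetical shifts, one per row.
shifts = pd.DataFrame({
    "Start": pd.to_datetime(["2020-12-02 07:00", "2020-12-04 18:00"]),
    "End": pd.to_datetime(["2020-12-02 16:00", "2020-12-04 21:00"]),
})

# Stack Start and End into a single column; 'variable' says which one it was.
long = shifts.melt(value_name="Date")

# Within each of the Start and End blocks, number the rows 0, 1, 2, ...
# Rows with the same number came from the same original shift.
long["Shift"] = long.groupby("variable").cumcount()
print(long)
```

This relies on melt keeping the rows of each column in their original order, so the n-th Start and the n-th End belong together.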
Step 3: Resample the data per group by hour to fill in the gaps for each group:
df.groupby(df.groupby('variable').cumcount(), group_keys=False).resample('1H').asfreq().ffill().reset_index()
Out[3]:
value variable
0 2020-02-12 07:00:00 Start
1 2020-02-12 08:00:00 Start
2 2020-02-12 09:00:00 Start
3 2020-02-12 10:00:00 Start
4 2020-02-12 11:00:00 Start
5 2020-02-12 12:00:00 Start
6 2020-02-12 13:00:00 Start
7 2020-02-12 14:00:00 Start
8 2020-02-12 15:00:00 Start
9 2020-02-12 16:00:00 End
10 2020-04-12 18:00:00 Start
11 2020-04-12 19:00:00 Start
12 2020-04-12 20:00:00 Start
13 2020-04-12 21:00:00 End
14 2020-05-12 07:00:00 Start
15 2020-05-12 08:00:00 Start
16 2020-05-12 09:00:00 Start
17 2020-05-12 10:00:00 Start
18 2020-05-12 11:00:00 Start
19 2020-05-12 12:00:00 End
20 2020-06-12 09:00:00 Start
21 2020-06-12 10:00:00 Start
22 2020-06-12 11:00:00 Start
23 2020-06-12 12:00:00 Start
24 2020-06-12 13:00:00 Start
25 2020-06-12 14:00:00 Start
26 2020-06-12 15:00:00 Start
27 2020-06-12 16:00:00 Start
28 2020-06-12 17:00:00 Start
29 2020-06-12 18:00:00 End
30 2020-07-12 19:00:00 Start
31 2020-07-12 20:00:00 Start
32 2020-07-12 21:00:00 Start
33 2020-07-12 22:00:00 Start
34 2020-07-12 23:00:00 End
35 2020-08-12 19:00:00 Start
36 2020-08-12 20:00:00 Start
37 2020-08-12 21:00:00 Start
38 2020-08-12 22:00:00 Start
39 2020-08-12 23:00:00 End
40 2020-09-12 10:00:00 Start
41 2020-09-12 11:00:00 Start
42 2020-09-12 12:00:00 Start
43 2020-09-12 13:00:00 Start
44 2020-09-12 14:00:00 Start
45 2020-09-12 15:00:00 End
Step 4: From there, you can calculate the boolean series I have called m. True values represent rows where the conditions for the "Higher Rate" are met.
m = (df['value'].dt.hour > 18) | (df['value'].dt.day_name().isin(['Saturday', 'Sunday']))
m
Out[4]:
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 True
11 True
12 True
13 True
14 False
15 False
16 False
17 False
18 False
19 False
20 False
21 False
22 False
23 False
24 False
25 False
26 False
27 False
28 False
29 False
30 True
31 True
32 True
33 True
34 True
35 True
36 True
37 True
38 True
39 True
40 True
41 True
42 True
43 True
44 True
45 True
Step 5: Filter the dataframe by True or False to count total hours for the Normal Rate and Higher Rate and print values.
print('Higher Rate No. of Hours', df[m].shape[0])
print('Normal Rate No. of Hours', df[~m].shape[0])
Higher Rate No. of Hours 20
Normal Rate No. of Hours 26
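For the narrower question of how many hours of a shift fall before and after 6pm, a sketch without resampling is also possible, using plain interval arithmetic. This is a different technique from the answer above and rests on two assumptions: every shift starts and ends on the same calendar day, and the cutoff is a fixed hour of that day. The function name split_at_cutoff and the output column names are made up for illustration.

```python
import pandas as pd


def split_at_cutoff(start: pd.Series, end: pd.Series, cutoff_hour: int = 18):
    """Return (hours_before, hours_after) the cutoff for same-day shifts."""
    cutoff = start.dt.normalize() + pd.Timedelta(hours=cutoff_hour)
    # Part before the cutoff: from start to min(end, cutoff).
    before = (end.where(end < cutoff, cutoff) - start).dt.total_seconds() / 3600
    # Part after the cutoff: from max(start, cutoff) to end.
    after = (end - start.where(start > cutoff, cutoff)).dt.total_seconds() / 3600
    # A shift entirely on one side gives a negative length on the other side.
    return before.clip(lower=0), after.clip(lower=0)


shifts = pd.DataFrame({
    "Start": ["02.12.2020 07:00", "04.12.2020 15:00", "07.12.2020 19:00"],
    "End": ["02.12.2020 16:00", "04.12.2020 21:00", "07.12.2020 23:00"],
})
for col in ("Start", "End"):
    shifts[col] = pd.to_datetime(shifts[col], format="%d.%m.%Y %H:%M")

before, after = split_at_cutoff(shifts["Start"], shifts["End"])
shifts["Hours before 6pm"] = before
shifts["Hours after 6pm"] = after
print(shifts)
```

Unlike the hourly resample, this also handles shifts that do not start or end on the hour, but weekend rates would still need a separate check on the day name.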

What I am trying to do here is run a loop over rows; at index 7 it should have shown "Sell", but my 2nd if condition is not working

What I am trying to do here is run a loop over the rows and, if a condition is satisfied, set the decision accordingly. At index 7 it should have shown "Sell", but my 2nd if condition is not working. What am I doing wrong?
for i in range(n_steps, len(extended_stock_data_new) - 1):
    if extended_stock_data_new["Close"][i] <= extended_stock_data_new["Prediction"][i+1]:
        extended_stock_data_new.loc[[i], "Decision"] = "Buy"
        if extended_stock_data_new["Low"][i+1] < extended_stock_data_new["Prediction"][i+1] <= extended_stock_data_new["High"][i+1]:
            extended_stock_data_new.loc[[i+1], "Decision"] = "Sell"
    else:
        extended_stock_data_new.loc[[i], "Decision"] = "--"
extended_stock_data_new.head(50)
output:
0 2020-01-25 08:00:00 3295.26 3298.26 3291.30 3291.75 NaN NaN
1 2020-01-27 10:00:00 3267.88 3269.01 3253.26 3259.76 NaN NaN
2 2020-01-27 11:00:00 3259.51 3269.51 3258.26 3269.51 NaN NaN
3 2020-01-27 12:00:00 3269.76 3269.76 3265.26 3267.26 NaN NaN
4 2020-01-27 13:00:00 3267.13 3267.26 3258.76 3260.26 NaN NaN
5 2020-01-27 14:00:00 3260.51 3266.76 3260.51 3265.26 NaN NaN
6 2020-01-27 15:00:00 3265.38 3266.01 3262.76 3263.01 3264.800049 Buy
7 2020-01-27 16:00:00 3263.26 3264.26 3260.01 3260.26 3263.800049 Buy
8 2020-01-27 17:00:00 3260.51 3263.13 3259.26 3261.51 3260.699951 Buy
9 2020-01-27 18:00:00 3261.26 3264.01 3259.51 3261.76 3261.600098 Buy
10 2020-01-27 19:00:00 3262.26 3267.26 3257.76 3262.76 3262.100098 Buy
11 2020-01-27 20:00:00 3262.51 3263.01 3250.26 3254.01 3263.300049 Buy
12 2020-01-27 21:00:00 3253.76 3253.76 3240.26 3240.26 3254.800049 Buy
So what's happening is that you are overwriting yourself:
when i == 6:
you assign "Buy" to row i
you assign "Sell" to row i + 1
when i == 7:
you assign "Buy" to row i, overwriting your previous answer.
If you don't want to overwrite yourself, you need to add a check to your first condition to see if a "Decision" value already exists.
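A minimal sketch of such a check, on made-up prices, could look like this. It assumes the "Sell" test is nested inside the "Buy" branch and that a row which already has a decision should simply be skipped; the frame name and all numbers are hypothetical.

```python
import pandas as pd

# Hypothetical prices; Prediction for row i+1 is compared with Close of row i.
frame = pd.DataFrame({
    "Close": [10.0, 11.0, 12.0, 13.0],
    "Low": [9.0, 10.0, 11.0, 12.0],
    "High": [11.0, 12.0, 13.0, 14.0],
    "Prediction": [float("nan"), 10.5, 11.5, 12.0],
})
frame["Decision"] = None  # nothing decided yet

for i in range(len(frame) - 1):
    # Guard: keep a decision written by an earlier iteration (e.g. "Sell").
    if pd.notna(frame.at[i, "Decision"]):
        continue
    if frame.at[i, "Close"] <= frame.at[i + 1, "Prediction"]:
        frame.at[i, "Decision"] = "Buy"
        if frame.at[i + 1, "Low"] < frame.at[i + 1, "Prediction"] <= frame.at[i + 1, "High"]:
            frame.at[i + 1, "Decision"] = "Sell"
    else:
        frame.at[i, "Decision"] = "--"

print(frame)
```

Here row 1 keeps the "Sell" assigned while processing row 0, because the guard skips it on the next pass.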

compare dates within a dataframe and assign a value to another variable

I have two dataframes (df and df1) as shown below
df = pd.DataFrame({'person_id': [101,101,101,101,202,202,202],
'start_date':['5/7/2013 09:27:00 AM','09/08/2013 11:21:00 AM','06/06/2014 08:00:00 AM', '06/06/2014 05:00:00 AM','12/11/2011 10:00:00 AM','13/10/2012 12:00:00 AM','13/12/2012 11:45:00 AM']})
df.start_date = pd.to_datetime(df.start_date)
df['end_date'] = df.start_date + timedelta(days=5)
df['enc_id'] = ['ABC1','ABC2','ABC3','ABC4','DEF1','DEF2','DEF3']
df1 = pd.DataFrame({'person_id': [101,101,101,101,101,101,101,202,202,202,202,202,202,202,202],'date_1':['07/07/2013 11:20:00 AM','05/07/2013 02:30:00 PM','06/07/2013 02:40:00 PM','08/06/2014 12:00:00 AM','11/06/2014 12:00:00 AM','02/03/2013 12:30:00 PM','13/06/2014 12:00:00 AM','12/11/2011 12:00:00 AM','13/10/2012 07:00:00 AM','13/12/2015 12:00:00 AM','13/12/2012 12:00:00 AM','13/12/2012 06:30:00 PM','13/07/2011 10:00:00 AM','18/12/2012 10:00:00 AM', '19/12/2013 11:00:00 AM']})
df1['date_1'] = pd.to_datetime(df1['date_1'])
df1['within_id'] = ['ABC','ABC','ABC','ABC','ABC','ABC','ABC','DEF','DEF','DEF','DEF','DEF','DEF','DEF',np.nan]
What I would like to do is
a) Pick each person from df1 who doesn't have NA in the 'within_id' column and check whether their date_1 is between (df.start_date - 1) and (df.end_date + 1) of the same person in df, for the same within_id or enc_id
ex: for subject = 101 and within_id = ABC, we have date_1 is 7/7/2013, you check whether they are between 4/7/2013 (df.start_date - 1) and 11/7/2013 (df.end_date + 1).
As the first-row comparison itself gave us the result, we don't have to compare our date_1 with rest of the records in df for subject 101. If not, we need to find/scan until we find the interval within which date_1 falls.
b) If date interval found, then assign the corresponding enc_id from df to the within_id in df1
c) If not then assign, "Out of Range"
I tried the below
t1 = df.groupby('person_id').apply(pd.DataFrame.sort_values, 'start_date')
t2 = df1.groupby('person_id').apply(pd.DataFrame.sort_values, 'date_1')
t3= pd.concat([t1, t2], axis=1)
t3['within_id'] = np.where((t3['date_1'] >= t3['start_date'] && t3['person_id'] == t3['person_id_x'] && t3['date_2'] >= t3['end_date']),enc_id]
I expect my output (also see 14th row at the bottom of my screenshot) to be as shown below. As I intend to apply the solution on big data (4/5 million records and there might be 5000-6000 unique person_ids), any efficient and elegant solution is helpful
14 202 2012-12-13 11:00:00 NA
Let's do:
d = df1.merge(df.assign(within_id=df['enc_id'].str[:3]),
on=['person_id', 'within_id'], how='left', indicator=True)
m = d['date_1'].between(d['start_date'] - pd.Timedelta(days=1),
d['end_date'] + pd.Timedelta(days=1))
d = df1.merge(d[m | d['_merge'].ne('both')], on=['person_id', 'date_1'], how='left')
d['within_id'] = d['enc_id'].fillna('out of range').mask(d['_merge'].eq('left_only'))
d = d[df1.columns]
Details:
Left merge the dataframe df1 with df on person_id and within_id:
print(d)
person_id date_1 within_id start_date end_date enc_id _merge
0 101 2013-07-07 11:20:00 ABC 2013-05-07 09:27:00 2013-05-12 09:27:00 ABC1 both
1 101 2013-07-07 11:20:00 ABC 2013-09-08 11:21:00 2013-09-13 11:21:00 ABC2 both
2 101 2013-07-07 11:20:00 ABC 2014-06-06 08:00:00 2014-06-11 08:00:00 ABC3 both
3 101 2013-07-07 11:20:00 ABC 2014-06-06 05:00:00 2014-06-11 10:00:00 DEF1 both
....
47 202 2012-12-18 10:00:00 DEF 2012-10-13 00:00:00 2012-10-18 00:00:00 DEF2 both
48 202 2012-12-18 10:00:00 DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
49 202 2013-12-19 11:00:00 NaN NaT NaT NaN left_only
Create a boolean mask m to represent the condition where date_1 is between df.start_date - 1 days and df.end_date + 1 days:
print(m)
0 False
1 False
2 False
3 False
...
47 False
48 True
49 False
dtype: bool
Again left merge the dataframe df1 with the dataframe filtered using mask m on columns person_id and date_1:
print(d)
person_id date_1 within_id_x within_id_y start_date end_date enc_id _merge
0 101 2013-07-07 11:20:00 ABC NaN NaT NaT NaN NaN
1 101 2013-05-07 14:30:00 ABC ABC 2013-05-07 09:27:00 2013-05-12 09:27:00 ABC1 both
2 101 2013-06-07 14:40:00 ABC NaN NaT NaT NaN NaN
3 101 2014-08-06 00:00:00 ABC NaN NaT NaT NaN NaN
4 101 2014-11-06 00:00:00 ABC NaN NaT NaT NaN NaN
5 101 2013-02-03 12:30:00 ABC NaN NaT NaT NaN NaN
6 101 2014-06-13 00:00:00 ABC NaN NaT NaT NaN NaN
7 202 2011-12-11 00:00:00 DEF DEF 2011-12-11 10:00:00 2011-12-16 10:00:00 DEF1 both
8 202 2012-10-13 07:00:00 DEF DEF 2012-10-13 00:00:00 2012-10-18 00:00:00 DEF2 both
9 202 2015-12-13 00:00:00 DEF NaN NaT NaT NaN NaN
10 202 2012-12-13 00:00:00 DEF DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
11 202 2012-12-13 18:30:00 DEF DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
12 202 2011-07-13 10:00:00 DEF NaN NaT NaT NaN NaN
13 202 2012-12-18 10:00:00 DEF DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
14 202 2013-12-19 11:00:00 NaN NaN NaT NaT NaN left_only
Populate the within_id column from enc_id, use Series.fillna to fill the NaN values with out of range (excluding the rows that had no match in df, which stay NaN), and finally filter the columns to get the result:
print(d)
person_id date_1 within_id
0 101 2013-07-07 11:20:00 out of range
1 101 2013-05-07 14:30:00 ABC1
2 101 2013-06-07 14:40:00 out of range
3 101 2014-08-06 00:00:00 out of range
4 101 2014-11-06 00:00:00 out of range
5 101 2013-02-03 12:30:00 out of range
6 101 2014-06-13 00:00:00 out of range
7 202 2011-12-11 00:00:00 DEF1
8 202 2012-10-13 07:00:00 DEF2
9 202 2015-12-13 00:00:00 out of range
10 202 2012-12-13 00:00:00 DEF3
11 202 2012-12-13 18:30:00 DEF3
12 202 2011-07-13 10:00:00 out of range
13 202 2012-12-18 10:00:00 DEF3
14 202 2013-12-19 11:00:00 NaN
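The core of this approach, pairing every event with every interval of the same person and then keeping the pairs where the date falls inside the widened interval, can be sketched on a smaller made-up example. This is a simplification, not the answer's code: it matches on person_id only, takes the first matching interval, and labels everything else "out of range". The frame names visits and events and all values are hypothetical.

```python
import pandas as pd

# Hypothetical intervals: each visit lasts 5 days.
visits = pd.DataFrame({
    "person_id": [1, 1],
    "start_date": pd.to_datetime(["2013-07-05 09:27", "2013-08-09 11:21"]),
    "enc_id": ["A1", "A2"],
})
visits["end_date"] = visits["start_date"] + pd.Timedelta(days=5)

# Hypothetical events to be assigned to a visit.
events = pd.DataFrame({
    "person_id": [1, 1, 2],
    "date_1": pd.to_datetime(["2013-07-07 11:20", "2013-09-01 00:00", "2013-07-07 11:20"]),
})

# Pair each event with every visit of the same person, remembering the event row.
pairs = events.reset_index().merge(visits, on="person_id", how="left")

# Keep pairs where date_1 lies within [start_date - 1 day, end_date + 1 day].
one_day = pd.Timedelta(days=1)
inside = pairs["date_1"].between(pairs["start_date"] - one_day,
                                 pairs["end_date"] + one_day)

# First matching visit per event; events without a match get the fallback label.
first_match = pairs[inside].drop_duplicates("index").set_index("index")["enc_id"]
events["within_id"] = first_match.reindex(events.index).fillna("out of range")
print(events)
```

Because the merge builds every event-visit pair per person, memory grows with the product of both counts for each person, which is worth checking before running it on millions of rows.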
I used df and df1 as provided above.
The basic approach is to iterate over df1 and extract the matching values of enc_id.
I added a 'rule' column, to show how each value got populated.
Unfortunately, I was not able to reproduce the expected results. Perhaps the general approach will be useful.
df1['rule'] = 0
for t in df1.itertuples():
    person = (t.person_id == df.person_id)
    b = (t.date_1 >= df.start_date) & (t.date_2 <= df.end_date)
    c = (t.date_1 >= df.start_date) & (t.date_2 >= df.end_date)
    d = (t.date_1 <= df.start_date) & (t.date_2 <= df.end_date)
    e = (t.date_1 <= df.start_date) & (t.date_2 <= df.start_date) # start_date at BOTH ends
    if (m := person & b).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 1
    elif (m := person & c).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 10
    elif (m := person & d).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 100
    elif (m := person & e).any():
        df1.at[t.Index, 'within_id'] = 'out of range'
        df1.at[t.Index, 'rule'] += 1_000
    else:
        df1.at[t.Index, 'within_id'] = 'impossible!'
        df1.at[t.Index, 'rule'] += 10_000
df1['within_id'] = df1['within_id'].astype('Int64')
The results are:
print(df1)
person_id date_1 date_2 within_id rule
0 11 1961-12-30 00:00:00 1962-01-01 00:00:00 11345678901 1
1 11 1962-01-30 00:00:00 1962-02-01 00:00:00 11345678902 1
2 12 1962-02-28 00:00:00 1962-03-02 00:00:00 34567892101 100
3 12 1989-07-29 00:00:00 1989-07-31 00:00:00 34567892101 1
4 12 1989-09-03 00:00:00 1989-09-05 00:00:00 34567892101 10
5 12 1989-10-02 00:00:00 1989-10-04 00:00:00 34567892103 1
6 12 1989-10-01 00:00:00 1989-10-03 00:00:00 34567892103 1
7 13 1999-03-29 00:00:00 1999-03-31 00:00:00 56432718901 1
8 13 1999-04-20 00:00:00 1999-04-22 00:00:00 56432718901 10
9 13 1999-06-02 00:00:00 1999-06-04 00:00:00 56432718904 1
10 13 1999-06-03 00:00:00 1999-06-05 00:00:00 56432718904 1
11 13 1999-07-29 00:00:00 1999-07-31 00:00:00 56432718905 1
12 14 2002-02-03 10:00:00 2002-02-05 10:00:00 24680135791 1
13 14 2002-02-03 10:00:00 2002-02-05 10:00:00 24680135791 1

How to restrict time difference to same day?

I have a dataframe as shown below
df1 = pd.DataFrame({
    'subject_id':[1,1,1,1,1,1,1,1,1,1,1],
    'time_1' :['2173-04-03 12:35:00','2173-04-03 12:50:00','2173-04-03 12:59:00',
               '2173-04-03 13:14:00','2173-04-03 13:37:00','2173-04-04 11:30:00',
               '2173-04-05 16:00:00','2173-04-05 22:00:00','2173-04-06 04:00:00',
               '2173-04-06 04:30:00','2173-04-06 08:00:00']
})
I would like to create another column called tdiff to calculate the time difference
This is what I tried
df1['time_1'] = pd.to_datetime(df1['time_1'])
df['time_2'] = df['time_1'].shift(-1)
df['tdiff'] = (df['time_2'] - df['time_1']).dt.total_seconds() / 3600
But this produces the output shown below. As you can see, it subtracts from the next date. Instead, I would like to restrict the time difference to the same day. For example, if 20:00:00 on Jan 15th is the last record for that day, then I expect tdiff to be 4:00:00 (24:00:00 - 20:00:00).
I understand it is happening because I am shifting the values of time to subtract and it's obvious that the highlighted rows are picking records from next date. But is there a way to avoid this but calculate the time difference between records in a same day?
I expect my output to be like this. Here NaN should be replaced by the end of the current date (23:59:00). If you check the differences, you will get the idea.
Is there any existing method or pandas function that can help us do this datewise timedelta? How can I shift the values datewise?
IIUC, you can use:
s=pd.to_timedelta(24,unit='h')-(df1.time_1-df1.time_1.dt.normalize())
df1['tdiff']=df1.groupby(df1.time_1.dt.date).time_1.diff().shift(-1).fillna(s)
#df1.groupby(df1.time_1.dt.date).time_1.diff().shift(-1).fillna(s).dt.total_seconds()/3600
subject_id time_1 tdiff
0 1 2173-04-03 12:35:00 00:15:00
1 1 2173-04-03 12:50:00 00:09:00
2 1 2173-04-03 12:59:00 00:15:00
3 1 2173-04-03 13:14:00 00:23:00
4 1 2173-04-03 13:37:00 10:23:00
5 1 2173-04-04 11:30:00 12:30:00
6 1 2173-04-05 16:00:00 06:00:00
7 1 2173-04-05 22:00:00 02:00:00
8 1 2173-04-06 04:00:00 00:30:00
9 1 2173-04-06 04:30:00 03:30:00
10 1 2173-04-06 08:00:00 16:00:00
You could use Series.where and Series.dt.ceil to decide whether to subtract from time_2 or from the midnight that follows time_1:
sameDayOrMidnight = df.time_2.where(df.time_1.dt.date==df.time_2.dt.date, df.time_1.dt.ceil(freq='1d'))
df['tdiff'] = (sameDayOrMidnight - df.time_1).dt.total_seconds() / 3600
result:
subject_id time_1 time_2 tdiff
0 1 2173-04-03 12:35:00 2173-04-03 12:50:00 0.250000
1 1 2173-04-03 12:50:00 2173-04-03 12:59:00 0.150000
2 1 2173-04-03 12:59:00 2173-04-03 13:14:00 0.250000
3 1 2173-04-03 13:14:00 2173-04-03 13:37:00 0.383333
4 1 2173-04-03 13:37:00 2173-04-04 11:30:00 10.383333
5 1 2173-04-04 11:30:00 2173-04-05 16:00:00 12.500000
6 1 2173-04-05 16:00:00 2173-04-05 22:00:00 6.000000
7 1 2173-04-05 22:00:00 2173-04-06 04:00:00 2.000000
8 1 2173-04-06 04:00:00 2173-04-06 04:30:00 0.500000
9 1 2173-04-06 04:30:00 2173-04-06 08:00:00 3.500000
10 1 2173-04-06 08:00:00 NaT 16.000000
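Both answers come down to the same rule: use the next timestamp if it is on the same day, otherwise use the following midnight. A compact sketch of that rule with a grouped shift, on three made-up timestamps, might look like this. The variable names are hypothetical, and the result is expressed in hours.

```python
import pandas as pd

times = pd.Series(pd.to_datetime([
    "2173-04-03 12:35:00",
    "2173-04-03 12:50:00",
    "2173-04-04 11:30:00",
]))

day = times.dt.normalize()  # midnight at the start of each record's day

# Next timestamp within the same day; NaT for the last record of each day.
next_same_day = times.groupby(day).shift(-1)

# Midnight at the end of the record's day, used where there is no next record.
end_of_day = day + pd.Timedelta(days=1)

tdiff_hours = (next_same_day.fillna(end_of_day) - times).dt.total_seconds() / 3600
print(tdiff_hours)
```

Building the end of day as normalize() plus one day, rather than with dt.ceil, avoids a zero difference for a record that sits exactly at midnight, since ceil leaves such a timestamp unchanged.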
