I have the following dataframe:
Group Deadline Time Deadline Date Task Completed Date Task Completed Time
Group 1 20:00:00 17-07-2012 17-07-2012 20:34:00
Group 2 20:15:00 17-07-2012 17-07-2012 20:39:00
Group 3 22:00:00 17-07-2012 17-07-2012 22:21:00
Group 4 23:50:00 17-07-2012 18-07-2012 00:09:00
Group 5 20:00:00 18-07-2012 18-07-2012 20:37:00
Group 6 20:15:00 18-07-2012 18-07-2012 21:13:00
Group 7 22:00:00 18-07-2012 18-07-2012 22:56:00
Group 8 23:50:00 18-07-2012 19-07-2012 00:01:00
Group 9 20:15:00 19-07-2012 19-07-2012 20:34:00
Group 10 20:00:00 19-07-2012 19-07-2012 20:24:00
How do I calculate the time delay as:
Time Delay (mins)
00:34:00
00:24:00
00:21:00
00:19:00
00:37:00
00:58:00
00:56:00
00:11:00
00:19:00
00:24:00
I have tried without success:
Combining the 'Deadline' 'date' & 'time' columns and 'Task Completed' 'date' & 'time' columns and
Finding the difference as 'Task Completed' - 'Deadline' time.
Combine them as strings (string "addition" with + works), convert them to datetime, and then subtract; the result is a Series of timedelta dtype.
In [14]: deadline = pd.to_datetime(df['Deadline Date'] + ' ' + df['Deadline Time'])
In [15]: completed = pd.to_datetime(df['Task Completed Date'] + ' ' + df['Task Completed Time'])
In [16]: completed - deadline
Out[16]:
0 00:34:00
1 00:24:00
2 00:21:00
3 00:19:00
4 00:37:00
5 00:58:00
6 00:56:00
7 00:11:00
8 00:19:00
9 00:24:00
dtype: timedelta64[ns]
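Since the question asks for minutes, you can also get a numeric column instead of a timedelta via total_seconds (and with day-first dates like these, passing dayfirst=True to pd.to_datetime is safer):
df['Time Delay (mins)'] = (completed - deadline).dt.total_seconds() / 60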
I have cleaned up a data set to get it into this format. The assigned_pat_loc represents a room number, so I am trying to identify when two different patients (patient_id) are in the same room at the same time, i.e., overlapping start_time and end_time between rows with the same assigned_pat_loc but different patient_id's. The start_time and end_time represent the times that a particular patient was in that room, so if those times overlap between two patients in the same room, they shared the room. That is what I'm ultimately looking for. Here is the base data set from which I want to construct these changes:
patient_id assigned_pat_loc start_time end_time
0 19035648 SICU^6108 2009-01-10 18:27:48 2009-02-25 15:45:54
1 19039244 85^8520 2009-01-02 06:27:25 2009-01-05 10:38:41
2 19039507 55^5514 2009-01-01 13:25:45 2009-01-01 13:25:45
3 19039555 EIAB^EIAB 2009-01-15 01:56:48 2009-02-23 11:36:34
4 19039559 EIAB^EIAB 2009-01-16 11:24:18 2009-01-19 18:41:33
... ... ... ... ...
140906 46851413 EIAB^EIAB 2011-12-31 22:28:38 2011-12-31 23:15:49
140907 46851422 EIAB^EIAB 2011-12-31 21:52:44 2011-12-31 22:50:08
140908 46851430 4LD^4LDX 2011-12-31 22:41:10 2011-12-31 22:44:48
140909 46851434 EIC^EIC 2011-12-31 23:45:22 2011-12-31 23:45:22
140910 46851437 EIAB^EIAB 2011-12-31 22:54:40 2011-12-31 23:30:10
I am thinking I should approach this with a groupby of some sort, but I'm not sure exactly how to implement it. I would show an attempt, but it took me about 6 hours even to get to this point, so I would appreciate even just some thoughts.
EDIT
Example of original data:
id Date Time assigned_pat_loc prior_pat_loc Activity
1 May/31/11 8:00 EIAB^EIAB^6 Admission
1 May/31/11 9:00 8w^201 EIAB^EIAB^6 Transfer
1 Jun/8/11 15:00 8w^201 Discharge
2 May/31/11 5:00 EIAB^EIAB^4 Admission
2 May/31/11 7:00 10E^45 EIAB^EIAB^4 Transfer
2 Jun/1/11 1:00 8w^201 10E^45 Transfer
2 Jun/1/11 8:00 8w^201 Discharge
3 May/31/11 9:00 EIAB^EIAB^2 Admission
3 Jun/1/11 9:00 8w^201 EIAB^EIAB^2 Transfer
3 Jun/5/11 9:00 8w^201 Discharge
4 May/31/11 9:00 EIAB^EIAB^9 Admission
4 May/31/11 7:00 10E^45 EIAB^EIAB^9 Transfer
4 Jun/1/11 8:00 10E^45 Death
Example of desired output:
id r_id start_date start_time end_date end_time length location
1 2 Jun/1/11 1:00 Jun/1/11 8:00 7 8w^201
1 3 Jun/1/11 9:00 Jun/5/11 9:00 96 8w^201
2 4 May/31/11 7:00 Jun/1/11 1:00 18 10E^45
2 1 Jun/1/11 1:00 Jun/1/11 8:00 7 8w^201
3 1 Jun/1/11 9:00 Jun/5/11 9:00 96 8w^201
Where r_id is the "other" patient who is sharing the same room, and length is the amount of time, in hours, that the room was shared.
In this example:
r_id is the name of the variable you will generate for the id of the other patient.
patient 1 had two room-sharing episodes, both in 8w^201 (room 201 of unit 8w); he shared the room with patient 2 for 7 hours (1 am to 8 am on June 1) and with patient 3 for 96 hours (9 am on June 1 to 9 am on June 5).
Patient 2 also had two room-sharing episodes. The first one was with patient 4 in 10E^45 (room 45 of unit 10E) and lasted 18 hours (7 am May 31 to 1 am June 1); the second one is the 7-hour episode with patient 1 in 8w^201.
Patient 3 had only one room-sharing episode with patient 1 in room 8w^201, lasting 96 hours.
Patient 4, also, had only one room-sharing episode, with patient 2 in room 10E^45, lasting 18 hours.
Note: room-sharing episodes are listed twice, once for each patient.
Another option.
I'm starting with the original data after the EDIT, but I have changed this row
4 May/31/11 9:00 EIAB^EIAB^9 Admission
to
4 May/31/11 6:00 EIAB^EIAB^9 Admission
because I think the admission time should be before the transfer time?
The first step is essentially to get a dataframe similar to the one you're starting out with:
df = (
    df.assign(start_time=pd.to_datetime(df["Date"] + " " + df["Time"]))
    .sort_values(["id", "start_time"])
    # time until the patient's next event (diff(-1) is current - next, hence the minus)
    .assign(duration=lambda df: -df.groupby("id")["start_time"].diff(-1))
    # the last event per patient (Discharge/Death) has no successor, so drop it
    .loc[lambda df: df["duration"].notna()]
    .assign(end_time=lambda df: df["start_time"] + df["duration"])
    .rename(columns={"assigned_pat_loc": "location"})
    [["id", "location", "start_time", "end_time"]]
)
Result for the sample:
id location start_time end_time
0 1 EIAB^EIAB^6 2011-05-31 08:00:00 2011-05-31 09:00:00
1 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00
3 2 EIAB^EIAB^4 2011-05-31 05:00:00 2011-05-31 07:00:00
4 2 10E^45 2011-05-31 07:00:00 2011-06-01 01:00:00
5 2 8w^201 2011-06-01 01:00:00 2011-06-01 08:00:00
7 3 EIAB^EIAB^2 2011-05-31 09:00:00 2011-06-01 09:00:00
8 3 8w^201 2011-06-01 09:00:00 2011-06-05 09:00:00
10 4 EIAB^EIAB^9 2011-05-31 06:00:00 2011-05-31 07:00:00
11 4 10E^45 2011-05-31 07:00:00 2011-06-01 08:00:00
The next step is merging df with itself on the location column and eliminating the rows where id is the same as r_id:
df = (
df.merge(df, on="location")
.rename(columns={"id_x": "id", "id_y": "r_id"})
.loc[lambda df: df["id"] != df["r_id"]]
)
Then finally get the rows with an actual overlap via m, calculate the duration of the overlap, and bring the dataframe in the form you are looking for:
m = (
(df["start_time_x"].le(df["start_time_y"])
& df["start_time_y"].le(df["end_time_x"]))
| (df["start_time_y"].le(df["start_time_x"])
& df["start_time_x"].le(df["end_time_y"]))
)
df = (
df[m]
.assign(
start_time=lambda df: df[["start_time_x", "start_time_y"]].max(axis=1),
end_time=lambda df: df[["end_time_x", "end_time_y"]].min(axis=1),
duration=lambda df: df["end_time"] - df["start_time"]
)
.assign(
start_date=lambda df: df["start_time"].dt.date,
start_time=lambda df: df["start_time"].dt.time,
end_date=lambda df: df["end_time"].dt.date,
end_time=lambda df: df["end_time"].dt.time
)
[[
"id", "r_id",
"start_date", "start_time", "end_date", "end_time",
"duration", "location"
]]
.sort_values(["id", "r_id"]).reset_index(drop=True)
)
Result for the sample:
id r_id start_date start_time end_date end_time duration \
0 1 2 2011-06-01 01:00:00 2011-06-01 08:00:00 0 days 07:00:00
1 1 3 2011-06-01 09:00:00 2011-06-05 09:00:00 4 days 00:00:00
2 2 1 2011-06-01 01:00:00 2011-06-01 08:00:00 0 days 07:00:00
3 2 4 2011-05-31 07:00:00 2011-06-01 01:00:00 0 days 18:00:00
4 3 1 2011-06-01 09:00:00 2011-06-05 09:00:00 4 days 00:00:00
5 4 2 2011-05-31 07:00:00 2011-06-01 01:00:00 0 days 18:00:00
location
0 8w^201
1 8w^201
2 8w^201
3 10E^45
4 8w^201
5 10E^45
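If you want length as a number of hours, like the desired output, rather than a timedelta, a small follow-up on the final dataframe above:
df['length'] = df['duration'].dt.total_seconds() / 3600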
numpy broadcasting is perfect for this. It allows you to compare every record (patient-room) against every other record in the dataframe. The downside is that it's memory intensive, as it requires n^2 * 8 bytes to store the comparison matrix. With your ~141k rows, that's 141,000^2 * 8 bytes, about 148GB of memory!
We need to chunk the dataframe so the memory requirement is reduced to chunk_size * n * 8 bytes.
# Don't keep date and time separately, they are hard to
# perform calculations on. Instead, combine them into a
# single column and keep it as pd.Timestamp
df["start_time"] = pd.to_datetime(df["Date"] + " " + df["Time"])
# I don't know how you determine when a patient vacates a
# room. My logic here is
# - If Activity = Discharge or Death, end_time = start_time
# - Otherwise, end_time = start_time of the next room
# You can implement your own logic. This part is not
# essential to the problem at hand.
df["end_time"] = np.where(
df["Activity"].isin(["Discharge", "Death"]),
df["start_time"],
df.groupby("id")["start_time"].shift(-1),
)
# ------------------------------------------------------------------------------
# Extract all the columns to numpy arrays
patient_id, assigned_pat_loc, start_time, end_time = (
df[["id", "assigned_pat_loc", "start_time", "end_time"]].to_numpy().T
)
chunk_size = 1000 # experiment to find a size that suits you
idx_left = []
idx_right = []
for offset in range(0, len(df), chunk_size):
chunk = slice(offset, offset + chunk_size)
# Get a chunk of each array. The [:, None] part is to
# raise the chunk up one dimension to prepare for numpy
# broadcasting
patient_id_chunk, assigned_pat_loc_chunk, start_time_chunk, end_time_chunk = [
arr[chunk][:, None] for arr in (patient_id, assigned_pat_loc, start_time, end_time)
]
# `mask` is a matrix. If mask[i, j] == True, the patient
# in row i is sharing the room with the patient in row j
mask = (
# patient_id values are different
(patient_id_chunk != patient_id)
# in the same room
& (assigned_pat_loc_chunk == assigned_pat_loc)
# start_time and end_time overlap
& (start_time_chunk < end_time)
& (start_time < end_time_chunk)
)
idx = mask.nonzero()
idx_left.extend(idx[0] + offset)
idx_right.extend(idx[1])
result = pd.concat(
[
df[["id", "assigned_pat_loc", "start_time", "end_time"]]
.iloc[idx]
.reset_index(drop=True)
for idx in [idx_left, idx_right]
],
axis=1,
keys=["patient_1", "patient_2"],
)
Result:
patient_1 patient_2
id assigned_pat_loc start_time end_time id assigned_pat_loc start_time end_time
0 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00 2 8w^201 2011-06-01 01:00:00 2011-06-01 08:00:00
1 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00 2 8w^201 2011-06-01 08:00:00 2011-06-01 08:00:00
2 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00 3 8w^201 2011-06-01 09:00:00 2011-06-05 09:00:00
3 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00 3 8w^201 2011-06-05 09:00:00 2011-06-05 09:00:00
4 2 10E^45 2011-05-31 07:00:00 2011-06-01 01:00:00 4 10E^45 2011-05-31 07:00:00 2011-06-01 08:00:00
5 2 8w^201 2011-06-01 01:00:00 2011-06-01 08:00:00 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00
6 2 8w^201 2011-06-01 08:00:00 2011-06-01 08:00:00 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00
7 3 8w^201 2011-06-01 09:00:00 2011-06-05 09:00:00 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00
8 3 8w^201 2011-06-05 09:00:00 2011-06-05 09:00:00 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00
9 4 10E^45 2011-05-31 07:00:00 2011-06-01 08:00:00 2 10E^45 2011-05-31 07:00:00 2011-06-01 01:00:00
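If you also need the shared duration for each pair, one way is to take the later start to the earlier end on the result above (a sketch; overlap_hours is just an illustrative name):
import numpy as np

# the shared window runs from the later start to the earlier end
start = np.maximum(result[('patient_1', 'start_time')], result[('patient_2', 'start_time')])
end = np.minimum(result[('patient_1', 'end_time')], result[('patient_2', 'end_time')])
overlap_hours = (end - start).dt.total_seconds() / 3600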
I have this df:
Index Dates
0 2017-01-01 23:30:00
1 2017-01-12 22:30:00
2 2017-01-20 13:35:00
3 2017-01-21 14:25:00
4 2017-01-28 22:30:00
5 2017-08-01 13:00:00
6 2017-09-26 09:39:00
7 2017-10-08 06:40:00
8 2017-10-04 07:30:00
9 2017-12-13 07:40:00
10 2017-12-31 14:55:00
The goal was that for times in the range 5:00 to 11:59, a new df would be created with data saying "morning". To achieve this I converted those hours to booleans:
hour_morning=(pd.to_datetime(df['Dates']).dt.strftime('%H:%M:%S').between('05:00:00','11:59:00'))
and then passed them into a list as "morning" strings:
text_morning=[str('morning') for x in hour_morning if x==True]
I have the error in the last line because it only returns 'morning' string values; it is as if the 'x' ignored the 'if' condition. Why is this happening and how do I fix it?
In a list comprehension, a trailing if filters items out rather than choosing between values, so the False entries are simply skipped. Put the condition in front as a conditional expression instead:
text_morning = ['morning' if x else 'not_morning' for x in hour_morning]
You can also use np.where:
text_morning = np.where(hour_morning, 'morning', 'not morning')
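Either way, you can attach the labels back onto the dataframe (the column name 'period' here is just an illustration):
df['period'] = text_morning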
Given:
Dates values
0 2017-01-01 23:30:00 0
1 2017-01-12 22:30:00 1
2 2017-01-20 13:35:00 2
3 2017-01-21 14:25:00 3
4 2017-01-28 22:30:00 4
5 2017-08-01 13:00:00 5
6 2017-09-26 09:39:00 6
7 2017-10-08 06:40:00 7
8 2017-10-04 07:30:00 8
9 2017-12-13 07:40:00 9
10 2017-12-31 14:55:00 10
Doing:
# df.Dates = pd.to_datetime(df.Dates)
df = df.set_index("Dates")
Now we can use pd.DataFrame.between_time:
new_df = df.between_time('05:00:00','11:59:00')
print(new_df)
Output:
values
Dates
2017-09-26 09:39:00 6
2017-10-08 06:40:00 7
2017-10-04 07:30:00 8
2017-12-13 07:40:00 9
Or use it to update the original dataframe:
df.loc[df.between_time('05:00:00','11:59:00').index, 'morning'] = 'morning'
# Output:
values morning
Dates
2017-01-01 23:30:00 0 NaN
2017-01-12 22:30:00 1 NaN
2017-01-20 13:35:00 2 NaN
2017-01-21 14:25:00 3 NaN
2017-01-28 22:30:00 4 NaN
2017-08-01 13:00:00 5 NaN
2017-09-26 09:39:00 6 morning
2017-10-08 06:40:00 7 morning
2017-10-04 07:30:00 8 morning
2017-12-13 07:40:00 9 morning
2017-12-31 14:55:00 10 NaN
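If you want the remaining NaN rows labeled as well, a small follow-up:
df['morning'] = df['morning'].fillna('not morning')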
I'm creating a Python program using pandas and datetime libraries that will calculate the pay from my casual job each week, so I can cross reference my bank statement instead of looking through payslips.
The data that I am analysing is from the Google Calendar API that is synced with my work schedule. It prints the events in that particular calendar to a csv file in this format:
   Start             End               Title  Hours
0  02.12.2020 07:00  02.12.2020 16:00  Shift    9.0
1  04.12.2020 18:00  04.12.2020 21:00  Shift    3.0
2  05.12.2020 07:00  05.12.2020 12:00  Shift    5.0
3  06.12.2020 09:00  06.12.2020 18:00  Shift    9.0
4  07.12.2020 19:00  07.12.2020 23:00  Shift    4.0
5  08.12.2020 19:00  08.12.2020 23:00  Shift    4.0
6  09.12.2020 10:00  09.12.2020 15:00  Shift    5.0
As I am a casual at this job I have to take a few things into consideration, like penalty rates (base rate; after 6pm on Monday - Friday, Saturday, and Sunday all have different rates). I'm wondering if I can analyse this csv using datetime and calculate how many hours are before 6pm and how many are after 6pm. So, using this shift as an example, the input and desired output would look like:
   Start             End               Title  Hours
1  04.12.2020 15:00  04.12.2020 21:00  Shift    6.0

   Start             End               Title  Total Hours  Hours before 6pm  Hours after 6pm
1  04.12.2020 15:00  04.12.2020 21:00  Shift          6.0               3.0              3.0
I can use this to get the day of the week but I'm just not sure how to analyse certain bits of time for penalty rates:
df['day_of_week'] = df['Start'].dt.day_name()
I appreciate any help in Python, or even other coding languages/techniques this can be applied to. :)
Edit:
This is how my dataframe is looking at the moment
   Start                End                  Title  Hours  day_of_week     Pay  week_of_year
0  2020-12-02 07:00:00  2020-12-02 16:00:00  Shift    9.0    Wednesday  337.30            49
EDIT
In response to David Erickson's comment.
    value                variable   bool
0   2020-12-02 07:00:00  Start     False
1   2020-12-02 08:00:00  Start     False
2   2020-12-02 09:00:00  Start     False
3   2020-12-02 10:00:00  Start     False
4   2020-12-02 11:00:00  Start     False
5   2020-12-02 12:00:00  Start     False
6   2020-12-02 13:00:00  Start     False
7   2020-12-02 14:00:00  Start     False
8   2020-12-02 15:00:00  Start     False
9   2020-12-02 16:00:00  End       False
10  2020-12-04 18:00:00  Start     False
11  2020-12-04 19:00:00  Start      True
12  2020-12-04 20:00:00  Start      True
13  2020-12-04 21:00:00  End        True
14  2020-12-05 07:00:00  Start     False
15  2020-12-05 08:00:00  Start     False
16  2020-12-05 09:00:00  Start     False
17  2020-12-05 10:00:00  Start     False
18  2020-12-05 11:00:00  Start     False
19  2020-12-05 12:00:00  End       False
20  2020-12-06 09:00:00  Start     False
21  2020-12-06 10:00:00  Start     False
22  2020-12-06 11:00:00  Start     False
23  2020-12-06 12:00:00  Start     False
24  2020-12-06 13:00:00  Start     False
25  2020-12-06 14:00:00  Start     False
26  2020-12-06 15:00:00  Start     False
27  2020-12-06 16:00:00  Start     False
28  2020-12-06 17:00:00  Start     False
29  2020-12-06 18:00:00  End       False
30  2020-12-07 19:00:00  Start     False
31  2020-12-07 20:00:00  Start      True
32  2020-12-07 21:00:00  Start      True
33  2020-12-07 22:00:00  Start      True
34  2020-12-07 23:00:00  End        True
35  2020-12-08 19:00:00  Start     False
36  2020-12-08 20:00:00  Start      True
37  2020-12-08 21:00:00  Start      True
38  2020-12-08 22:00:00  Start      True
39  2020-12-08 23:00:00  End        True
40  2020-12-09 10:00:00  Start     False
41  2020-12-09 11:00:00  Start     False
42  2020-12-09 12:00:00  Start     False
43  2020-12-09 13:00:00  Start     False
44  2020-12-09 14:00:00  Start     False
45  2020-12-09 15:00:00  End       False
46  2020-12-11 19:00:00  Start     False
47  2020-12-11 20:00:00  Start      True
48  2020-12-11 21:00:00  Start      True
49  2020-12-11 22:00:00  Start      True
UPDATE: (2020-12-19)
I have simply filtered out the Start rows, as you were correct that an extra row was being calculated. Also, I passed dayfirst=True to pd.to_datetime() to convert the dates correctly. I have also made the output clean with some extra columns.
higher_pay = 40
lower_pay = 30
df['Start'], df['End'] = pd.to_datetime(df['Start'], dayfirst=True), pd.to_datetime(df['End'], dayfirst=True)
start = df['Start']
df1 = df[['Start', 'End']].melt(value_name='Date').set_index('Date')
s = df1.groupby('variable').cumcount()
df1 = df1.groupby(s, group_keys=False).resample('1H').asfreq().join(s.rename('Shift').to_frame()).ffill().reset_index()
df1 = df1[~df1['Date'].isin(start)]
df1['Day'] = df1['Date'].dt.day_name()
df1['Week'] = df1['Date'].dt.isocalendar().week
m = (df1['Date'].dt.hour > 18) | (df1['Day'].isin(['Saturday', 'Sunday']))
df1['Higher Pay Hours'] = np.where(m, 1, 0)
df1['Lower Pay Hours'] = np.where(m, 0, 1)
df1['Pay'] = np.where(m, higher_pay, lower_pay)
df1 = df1.groupby(['Shift', 'Day', 'Week']).sum().reset_index()
df2 = df.merge(df1, how='left', left_index=True, right_on='Shift').drop('Shift', axis=1)
df2
Out[1]:
Start End Title Hours Day Week \
0 2020-12-02 07:00:00 2020-12-02 16:00:00 Shift 9.0 Wednesday 49
1 2020-12-04 18:00:00 2020-12-04 21:00:00 Shift 3.0 Friday 49
2 2020-12-05 07:00:00 2020-12-05 12:00:00 Shift 5.0 Saturday 49
3 2020-12-06 09:00:00 2020-12-06 18:00:00 Shift 9.0 Sunday 49
4 2020-12-07 19:00:00 2020-12-07 23:00:00 Shift 4.0 Monday 50
5 2020-12-08 19:00:00 2020-12-08 23:00:00 Shift 4.0 Tuesday 50
6 2020-12-09 10:00:00 2020-12-09 15:00:00 Shift 5.0 Wednesday 50
Higher Pay Hours Lower Pay Hours Pay
0 0 9 270
1 3 0 120
2 5 0 200
3 9 0 360
4 4 0 160
5 4 0 160
6 0 5 150
There are probably more concise ways to do this, but I thought resampling the dataframe and then counting the hours would be a clean approach. You can melt the dataframe to have Start and End in the same column, and fill in the gap hours with resample, making sure to group by the 'Start' and 'End' values that were initially on the same row. The easiest way to figure out which rows were initially together is to get the cumulative count with cumcount of the values in the new dataframe grouped by 'Start' and 'End'. I'll show you how this works later in the answer.
Full Code:
df['Start'], df['End'] = pd.to_datetime(df['Start']), pd.to_datetime(df['End'])  # note: pass dayfirst=True for DD.MM.YYYY input (see the update above)
df = df[['Start', 'End']].melt().set_index('value')
df = df.groupby(df.groupby('variable').cumcount(), group_keys=False).resample('1H').asfreq().ffill().reset_index()
m = (df['value'].dt.hour > 18) | (df['value'].dt.day_name().isin(['Saturday', 'Sunday']))
print('Higher Rate No. of Hours', df[m].shape[0])
print('Normal Rate No. of Hours', df[~m].shape[0])
Higher Rate No. of Hours 20
Normal Rate No. of Hours 26
Adding some more details...
Step 1: Melt the dataframe: You only need two columns 'Start' and 'End' to get your desired output
df = df[['Start', 'End']].melt().set_index('value')
df
Out[1]:
variable
value
2020-02-12 07:00:00 Start
2020-04-12 18:00:00 Start
2020-05-12 07:00:00 Start
2020-06-12 09:00:00 Start
2020-07-12 19:00:00 Start
2020-08-12 19:00:00 Start
2020-09-12 10:00:00 Start
2020-02-12 16:00:00 End
2020-04-12 21:00:00 End
2020-05-12 12:00:00 End
2020-06-12 18:00:00 End
2020-07-12 23:00:00 End
2020-08-12 23:00:00 End
2020-09-12 15:00:00 End
Step 2: Create groups in preparation for resample. As you can see, groups 0-6 line up with each other, pairing each 'Start' with its 'End' as they were together previously.
df.groupby('variable').cumcount()
Out[2]:
value
2020-02-12 07:00:00 0
2020-04-12 18:00:00 1
2020-05-12 07:00:00 2
2020-06-12 09:00:00 3
2020-07-12 19:00:00 4
2020-08-12 19:00:00 5
2020-09-12 10:00:00 6
2020-02-12 16:00:00 0
2020-04-12 21:00:00 1
2020-05-12 12:00:00 2
2020-06-12 18:00:00 3
2020-07-12 23:00:00 4
2020-08-12 23:00:00 5
2020-09-12 15:00:00 6
Step 3: Resample the data per group by hour to fill in the gaps for each group:
df.groupby(df.groupby('variable').cumcount(), group_keys=False).resample('1H').asfreq().ffill().reset_index()
Out[3]:
value variable
0 2020-02-12 07:00:00 Start
1 2020-02-12 08:00:00 Start
2 2020-02-12 09:00:00 Start
3 2020-02-12 10:00:00 Start
4 2020-02-12 11:00:00 Start
5 2020-02-12 12:00:00 Start
6 2020-02-12 13:00:00 Start
7 2020-02-12 14:00:00 Start
8 2020-02-12 15:00:00 Start
9 2020-02-12 16:00:00 End
10 2020-04-12 18:00:00 Start
11 2020-04-12 19:00:00 Start
12 2020-04-12 20:00:00 Start
13 2020-04-12 21:00:00 End
14 2020-05-12 07:00:00 Start
15 2020-05-12 08:00:00 Start
16 2020-05-12 09:00:00 Start
17 2020-05-12 10:00:00 Start
18 2020-05-12 11:00:00 Start
19 2020-05-12 12:00:00 End
20 2020-06-12 09:00:00 Start
21 2020-06-12 10:00:00 Start
22 2020-06-12 11:00:00 Start
23 2020-06-12 12:00:00 Start
24 2020-06-12 13:00:00 Start
25 2020-06-12 14:00:00 Start
26 2020-06-12 15:00:00 Start
27 2020-06-12 16:00:00 Start
28 2020-06-12 17:00:00 Start
29 2020-06-12 18:00:00 End
30 2020-07-12 19:00:00 Start
31 2020-07-12 20:00:00 Start
32 2020-07-12 21:00:00 Start
33 2020-07-12 22:00:00 Start
34 2020-07-12 23:00:00 End
35 2020-08-12 19:00:00 Start
36 2020-08-12 20:00:00 Start
37 2020-08-12 21:00:00 Start
38 2020-08-12 22:00:00 Start
39 2020-08-12 23:00:00 End
40 2020-09-12 10:00:00 Start
41 2020-09-12 11:00:00 Start
42 2020-09-12 12:00:00 Start
43 2020-09-12 13:00:00 Start
44 2020-09-12 14:00:00 Start
45 2020-09-12 15:00:00 End
Step 4: From there, you can calculate the boolean series I have called m. True values represent conditions met for the "Higher Rate".
m = (df['value'].dt.hour > 18) | (df['value'].dt.day_name().isin(['Saturday', 'Sunday']))
m
Out[4]:
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 True
11 True
12 True
13 True
14 False
15 False
16 False
17 False
18 False
19 False
20 False
21 False
22 False
23 False
24 False
25 False
26 False
27 False
28 False
29 False
30 True
31 True
32 True
33 True
34 True
35 True
36 True
37 True
38 True
39 True
40 True
41 True
42 True
43 True
44 True
45 True
Step 5: Filter the dataframe by True or False to count total hours for the Normal Rate and Higher Rate and print values.
print('Higher Rate No. of Hours', df[m].shape[0])
print('Normal Rate No. of Hours', df[~m].shape[0])
Higher Rate No. of Hours 20
Normal Rate No. of Hours 26
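From those counts, turning hours into pay is a one-liner; a sketch reusing the placeholder rates from the update above (higher_pay = 40, lower_pay = 30):
higher_pay, lower_pay = 40, 30
total_pay = df[m].shape[0] * higher_pay + df[~m].shape[0] * lower_pay
print('Total pay', total_pay)  # 20 * 40 + 26 * 30 = 1580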
I have a dataframe like the one shown below:
df1 = pd.DataFrame({
    'subject_id': [1,1,1,1,1,1,1,1,1,1,1],
    'time_1': ['2173-04-03 12:35:00', '2173-04-03 12:50:00', '2173-04-03 12:59:00',
               '2173-04-03 13:14:00', '2173-04-03 13:37:00', '2173-04-04 11:30:00',
               '2173-04-05 16:00:00', '2173-04-05 22:00:00', '2173-04-06 04:00:00',
               '2173-04-06 04:30:00', '2173-04-06 08:00:00']
})
I would like to create another column called tdiff to calculate the time difference.
This is what I tried:
df1['time_1'] = pd.to_datetime(df1['time_1'])
df1['time_2'] = df1['time_1'].shift(-1)
df1['tdiff'] = (df1['time_2'] - df1['time_1']).dt.total_seconds() / 3600
But this produces an output like the one shown below. As you can see, it subtracts from the next date. Instead I would like to restrict the time difference to the same day. Ex: if Jan 15th 20:00:00 is the last record for that day, then I expect the tdiff to be 4:00:00 (24:00:00 - 20:00:00).
I understand this is happening because I am shifting the time values to subtract, so those rows pick up records from the next date. But is there a way to avoid this and calculate the time difference only between records on the same day?
I expect my output to be like this. Here NaN should be replaced by the end of the current day (23:59:00); if you check the difference, you will get the idea.
Is there any existing method or pandas function that can help us do this datewise timedelta? How can I shift the values datewise?
IIUC, you can use:
# time remaining from each timestamp until midnight of that day
s = pd.to_timedelta(24, unit='h') - (df1.time_1 - df1.time_1.dt.normalize())
# diff within each calendar day; the last record of a day has no successor, so fill with s
df1['tdiff'] = df1.groupby(df1.time_1.dt.date).time_1.diff().shift(-1).fillna(s)
# as hours: df1.groupby(df1.time_1.dt.date).time_1.diff().shift(-1).fillna(s).dt.total_seconds()/3600
subject_id time_1 tdiff
0 1 2173-04-03 12:35:00 00:15:00
1 1 2173-04-03 12:50:00 00:09:00
2 1 2173-04-03 12:59:00 00:15:00
3 1 2173-04-03 13:14:00 00:23:00
4 1 2173-04-03 13:37:00 10:23:00
5 1 2173-04-04 11:30:00 12:30:00
6 1 2173-04-05 16:00:00 06:00:00
7 1 2173-04-05 22:00:00 02:00:00
8 1 2173-04-06 04:00:00 00:30:00
9 1 2173-04-06 04:30:00 03:30:00
10 1 2173-04-06 08:00:00 16:00:00
You could use where and dt.ceil to decide whether to subtract time_1 from time_2 or from the midnight following time_1:
sameDayOrMidnight = df.time_2.where(df.time_1.dt.date==df.time_2.dt.date, df.time_1.dt.ceil(freq='1d'))
df['tdiff'] = (sameDayOrMidnight - df.time_1).dt.total_seconds() / 3600
result:
subject_id time_1 time_2 tdiff
0 1 2173-04-03 12:35:00 2173-04-03 12:50:00 0.250000
1 1 2173-04-03 12:50:00 2173-04-03 12:59:00 0.150000
2 1 2173-04-03 12:59:00 2173-04-03 13:14:00 0.250000
3 1 2173-04-03 13:14:00 2173-04-03 13:37:00 0.383333
4 1 2173-04-03 13:37:00 2173-04-04 11:30:00 10.383333
5 1 2173-04-04 11:30:00 2173-04-05 16:00:00 12.500000
6 1 2173-04-05 16:00:00 2173-04-05 22:00:00 6.000000
7 1 2173-04-05 22:00:00 2173-04-06 04:00:00 2.000000
8 1 2173-04-06 04:00:00 2173-04-06 04:30:00 0.500000
9 1 2173-04-06 04:30:00 2173-04-06 08:00:00 3.500000
10 1 2173-04-06 08:00:00 NaT 16.000000
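One caveat, in case your data can contain a record at exactly midnight: dt.ceil(freq='1d') leaves such a timestamp unchanged, so a last-of-day record at 00:00:00 would get a tdiff of 0 rather than 24 hours. If that matters, normalize-plus-one-day avoids it (a sketch):
nextMidnight = df.time_1.dt.normalize() + pd.Timedelta(days=1)
sameDayOrMidnight = df.time_2.where(df.time_1.dt.date == df.time_2.dt.date, nextMidnight)
df['tdiff'] = (sameDayOrMidnight - df.time_1).dt.total_seconds() / 3600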
I am working on a dataframe in pandas with four columns of user_id, time_stamp1, time_stamp2, and interval. Time_stamp1 and time_stamp2 are of type datetime64[ns] and interval is of type timedelta64[ns].
I want to sum up interval values for each user_id in the dataframe and I tried to calculate it in many ways as:
1)df["duration"]= df.groupby('user_id')['interval'].apply (lambda x: x.sum())
2)df ["duration"]= df.groupby('user_id').aggregate (np.sum)
3)df ["duration"]= df.groupby('user_id').agg (np.sum)
but none of them work and the value of the duration will be NaT after running the codes.
UPDATE: you can use the transform() method, which broadcasts the per-group sum back onto the original row index so the assignment aligns:
In [291]: df['duration'] = df.groupby('user_id')['interval'].transform('sum')
In [292]: df
Out[292]:
a user_id b interval duration
0 2016-01-01 00:00:00 0.01 2015-11-11 00:00:00 51 days 00:00:00 838 days 08:00:00
1 2016-03-10 10:39:00 0.01 2015-12-08 18:39:00 NaT 838 days 08:00:00
2 2016-05-18 21:18:00 0.01 2016-01-05 13:18:00 134 days 08:00:00 838 days 08:00:00
3 2016-07-27 07:57:00 0.01 2016-02-02 07:57:00 176 days 00:00:00 838 days 08:00:00
4 2016-10-04 18:36:00 0.01 2016-03-01 02:36:00 217 days 16:00:00 838 days 08:00:00
5 2016-12-13 05:15:00 0.01 2016-03-28 21:15:00 259 days 08:00:00 838 days 08:00:00
6 2017-02-20 15:54:00 0.02 2016-04-25 15:54:00 301 days 00:00:00 1454 days 00:00:00
7 2017-05-01 02:33:00 0.02 2016-05-23 10:33:00 342 days 16:00:00 1454 days 00:00:00
8 2017-07-09 13:12:00 0.02 2016-06-20 05:12:00 384 days 08:00:00 1454 days 00:00:00
9 2017-09-16 23:51:00 0.02 2016-07-17 23:51:00 426 days 00:00:00 1454 days 00:00:00
OLD answer:
Demo:
In [260]: df
Out[260]:
a b interval user_id
0 2016-01-01 00:00:00 2015-11-11 00:00:00 51 days 00:00:00 1
1 2016-03-10 10:39:00 2015-12-08 18:39:00 NaT 1
2 2016-05-18 21:18:00 2016-01-05 13:18:00 134 days 08:00:00 1
3 2016-07-27 07:57:00 2016-02-02 07:57:00 176 days 00:00:00 1
4 2016-10-04 18:36:00 2016-03-01 02:36:00 217 days 16:00:00 1
5 2016-12-13 05:15:00 2016-03-28 21:15:00 259 days 08:00:00 1
6 2017-02-20 15:54:00 2016-04-25 15:54:00 301 days 00:00:00 2
7 2017-05-01 02:33:00 2016-05-23 10:33:00 342 days 16:00:00 2
8 2017-07-09 13:12:00 2016-06-20 05:12:00 384 days 08:00:00 2
9 2017-09-16 23:51:00 2016-07-17 23:51:00 426 days 00:00:00 2
In [261]: df.dtypes
Out[261]:
a datetime64[ns]
b datetime64[ns]
interval timedelta64[ns]
user_id int64
dtype: object
In [262]: df.groupby('user_id')['interval'].sum()
Out[262]:
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
Name: interval, dtype: timedelta64[ns]
In [263]: df.groupby('user_id')['interval'].apply(lambda x: x.sum())
Out[263]:
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
Name: interval, dtype: timedelta64[ns]
In [264]: df.groupby('user_id').agg(np.sum)
Out[264]:
interval
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
So check your data...
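For what it's worth, the NaT values in the original attempts are most likely index misalignment rather than bad data: the grouped sum is indexed by user_id, not by row position, so assigning it straight back to df aligns on the wrong index. A minimal sketch of the difference:
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 2],
    'interval': pd.to_timedelta(['1 days', '2 days', '3 days']),
})

# the grouped sum is indexed by user_id (1, 2), while df is indexed 0..2,
# so alignment leaves non-matching rows as NaT
df['bad'] = df.groupby('user_id')['interval'].sum()

# transform() broadcasts each group's sum back onto the original row index
df['good'] = df.groupby('user_id')['interval'].transform('sum')
print(df)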