I have cleaned up a data set to get it into this format. The assigned_pat_loc represents a room number, so I am trying to identify when two different patients (patient_id) are in the same room at the same time, i.e., overlapping start_time and end_time between rows with the same assigned_pat_loc but different patient_ids. The start_time and end_time represent the times that a particular patient was in that room, so if those times overlap between two patients in the same room, it means they shared the room. This is what I'm ultimately looking for. Here is the base data set from which I want to construct these changes:
patient_id assigned_pat_loc start_time end_time
0 19035648 SICU^6108 2009-01-10 18:27:48 2009-02-25 15:45:54
1 19039244 85^8520 2009-01-02 06:27:25 2009-01-05 10:38:41
2 19039507 55^5514 2009-01-01 13:25:45 2009-01-01 13:25:45
3 19039555 EIAB^EIAB 2009-01-15 01:56:48 2009-02-23 11:36:34
4 19039559 EIAB^EIAB 2009-01-16 11:24:18 2009-01-19 18:41:33
... ... ... ... ...
140906 46851413 EIAB^EIAB 2011-12-31 22:28:38 2011-12-31 23:15:49
140907 46851422 EIAB^EIAB 2011-12-31 21:52:44 2011-12-31 22:50:08
140908 46851430 4LD^4LDX 2011-12-31 22:41:10 2011-12-31 22:44:48
140909 46851434 EIC^EIC 2011-12-31 23:45:22 2011-12-31 23:45:22
140910 46851437 EIAB^EIAB 2011-12-31 22:54:40 2011-12-31 23:30:10
I am thinking I should approach this with a groupby of some sort, but I'm not sure exactly how to implement it. I would show an attempt, but it took me about 6 hours to even get to this point, so I would appreciate even just some thoughts.
EDIT
Example of original data:
id  Date       Time   assigned_pat_loc  prior_pat_loc  Activity
1   May/31/11  8:00   EIAB^EIAB^6                      Admission
1   May/31/11  9:00   8w^201            EIAB^EIAB^6    Transfer
1   Jun/8/11   15:00  8w^201                           Discharge
2   May/31/11  5:00   EIAB^EIAB^4                      Admission
2   May/31/11  7:00   10E^45            EIAB^EIAB^4    Transfer
2   Jun/1/11   1:00   8w^201            10E^45         Transfer
2   Jun/1/11   8:00   8w^201                           Discharge
3   May/31/11  9:00   EIAB^EIAB^2                      Admission
3   Jun/1/11   9:00   8w^201            EIAB^EIAB^2    Transfer
3   Jun/5/11   9:00   8w^201                           Discharge
4   May/31/11  9:00   EIAB^EIAB^9                      Admission
4   May/31/11  7:00   10E^45            EIAB^EIAB^9    Transfer
4   Jun/1/11   8:00   10E^45                           Death
Example of desired output:
id r_id start_date start_time end_date end_time length location
1 2 Jun/1/11 1:00 Jun/1/11 8:00 7 8w^201
1 3 Jun/1/11 9:00 Jun/5/11 9:00 96 8w^201
2 4 May/31/11 7:00 Jun/1/11 1:00 18 10E^45
2 1 Jun/1/11 1:00 Jun/1/11 8:00 7 8w^201
3 1 Jun/1/11 9:00 Jun/5/11 9:00 96 8w^201
Where r_id is the "other" patient who is sharing the same room, and length is the amount of time, in hours, that the room was shared.
In this example:
r_id is the name of the variable you will generate for the id of the other patient.
patient 1 had two room-sharing episodes, both in 8w^201 (room 201 of unit 8w); he shared the room with patient 2 for 7 hours (1 am to 8 am on June 1) and with patient 3 for 96 hours (9 am on June 1 to 9 am on June 5).
Patient 2 also had two room-sharing episodes. The first one was with patient 4 in 10E^45 (room 45 of unit 10E) and lasted 18 hours (7 am May 31 to 1 am June 1); the second one is the 7-hour episode with patient 1 in 8w^201.
Patient 3 had only one room-sharing episode with patient 1 in room 8w^201, lasting 96 hours.
Patient 4, also, had only one room-sharing episode, with patient 2 in room 10E^45, lasting 18 hours.
Note: room-sharing episodes are listed twice, once for each patient.
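In other words, two stays [s1, e1] and [s2, e2] in the same room overlap exactly when max(s1, s2) < min(e1, e2), and the shared length is the difference between those two bounds. A tiny sketch of just that formula, with values taken from the 7-hour episode between patients 1 and 2 above:
import pandas as pd

# patient 2's and patient 1's stays in 8w^201, from the example above
s1, e1 = pd.Timestamp("2011-06-01 01:00"), pd.Timestamp("2011-06-01 08:00")
s2, e2 = pd.Timestamp("2011-05-31 09:00"), pd.Timestamp("2011-06-08 15:00")

start, end = max(s1, s2), min(e1, e2)
if start < end:                                   # the stays overlap
    print((end - start) / pd.Timedelta(hours=1))  # 7.0 (hours shared)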
Another option.
I'm starting with the original data after the EDIT, but I have changed this row
4 May/31/11 9:00 EIAB^EIAB^9 Admission
to
4 May/31/11 6:00 EIAB^EIAB^9 Admission
because I think the admission time should come before the transfer time.
The first step is essentially to get a dataframe similar to the one you're starting out with (the final Discharge/Death row of each patient only marks the end of the previous stay, so it is dropped where duration is NaN):
import pandas as pd

df = (
    df.assign(start_time=pd.to_datetime(df["Date"] + " " + df["Time"]))
    .sort_values(["id", "start_time"])
    # duration = time until the patient's next event; NaN for the last event of each id
    .assign(duration=lambda df: -df.groupby("id")["start_time"].diff(-1))
    .loc[lambda df: df["duration"].notna()]
    .assign(end_time=lambda df: df["start_time"] + df["duration"])
    .rename(columns={"assigned_pat_loc": "location"})
    [["id", "location", "start_time", "end_time"]]
)
Result for the sample:
id location start_time end_time
0 1 EIAB^EIAB^6 2011-05-31 08:00:00 2011-05-31 09:00:00
1 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00
3 2 EIAB^EIAB^4 2011-05-31 05:00:00 2011-05-31 07:00:00
4 2 10E^45 2011-05-31 07:00:00 2011-06-01 01:00:00
5 2 8w^201 2011-06-01 01:00:00 2011-06-01 08:00:00
7 3 EIAB^EIAB^2 2011-05-31 09:00:00 2011-06-01 09:00:00
8 3 8w^201 2011-06-01 09:00:00 2011-06-05 09:00:00
10 4 EIAB^EIAB^9 2011-05-31 06:00:00 2011-05-31 07:00:00
11 4 10E^45 2011-05-31 07:00:00 2011-06-01 08:00:00
The next step is merging df with itself on the location column (the duplicated columns get pandas' default _x/_y suffixes) and eliminating the rows where id is the same as r_id:
df = (
    df.merge(df, on="location")
    .rename(columns={"id_x": "id", "id_y": "r_id"})
    .loc[lambda df: df["id"] != df["r_id"]]
)
Then finally get the rows with an actual overlap via the boolean mask m, calculate the duration of the overlap, and bring the dataframe into the form you are looking for:
m = (
    (df["start_time_x"].le(df["start_time_y"])
     & df["start_time_y"].le(df["end_time_x"]))
    | (df["start_time_y"].le(df["start_time_x"])
       & df["start_time_x"].le(df["end_time_y"]))
)
df = (
    df[m]
    .assign(
        start_time=lambda df: df[["start_time_x", "start_time_y"]].max(axis=1),
        end_time=lambda df: df[["end_time_x", "end_time_y"]].min(axis=1),
        duration=lambda df: df["end_time"] - df["start_time"]
    )
    .assign(
        start_date=lambda df: df["start_time"].dt.date,
        start_time=lambda df: df["start_time"].dt.time,
        end_date=lambda df: df["end_time"].dt.date,
        end_time=lambda df: df["end_time"].dt.time
    )
    [[
        "id", "r_id",
        "start_date", "start_time", "end_date", "end_time",
        "duration", "location"
    ]]
    .sort_values(["id", "r_id"]).reset_index(drop=True)
)
Result for the sample:
id r_id start_date start_time end_date end_time duration \
0 1 2 2011-06-01 01:00:00 2011-06-01 08:00:00 0 days 07:00:00
1 1 3 2011-06-01 09:00:00 2011-06-05 09:00:00 4 days 00:00:00
2 2 1 2011-06-01 01:00:00 2011-06-01 08:00:00 0 days 07:00:00
3 2 4 2011-05-31 07:00:00 2011-06-01 01:00:00 0 days 18:00:00
4 3 1 2011-06-01 09:00:00 2011-06-05 09:00:00 4 days 00:00:00
5 4 2 2011-05-31 07:00:00 2011-06-01 01:00:00 0 days 18:00:00
location
0 8w^201
1 8w^201
2 8w^201
3 10E^45
4 8w^201
5 10E^45
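As an aside, the overlap test m can be written more compactly: two intervals intersect exactly when the later start is no later than the earlier end. An equivalent sketch (assuming start_time <= end_time holds in every row):
m = (df[["start_time_x", "start_time_y"]].max(axis=1)
     <= df[["end_time_x", "end_time_y"]].min(axis=1))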
numpy broadcasting is perfect for this. It allows you to compare every record (patient-room) against every other record in the dataframe. The downside is that it's memory intensive, as it requires n^2 * 8 bytes to store the comparison matrix. Glancing over your data with ~141k rows, it will require 148GB of memory!
We need to chunk the dataframe so the memory requirement is reduced to chunk_size * n * 8 bytes.
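For concreteness, here is the arithmetic behind that estimate, using the 8-bytes-per-cell figure from above:
n = 141_000
print(f"{n * n * 8 / 2**30:.0f} GiB")  # ~148 GiB for the full n x n comparison matrix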
import numpy as np
import pandas as pd

# Don't keep date and time separately, they are hard to
# perform calculations on. Instead, combine them into a
# single column and keep it as pd.Timestamp
df["start_time"] = pd.to_datetime(df["Date"] + " " + df["Time"])

# I don't know how you determine when a patient vacates a
# room. My logic here is:
# - If Activity = Discharge or Death, end_time = start_time
# - Otherwise, end_time = start_time of the next room
# You can implement your own logic. This part is not
# essential to the problem at hand.
df["end_time"] = np.where(
    df["Activity"].isin(["Discharge", "Death"]),
    df["start_time"],
    df.groupby("id")["start_time"].shift(-1),
)
# ------------------------------------------------------------------------------
# Extract all the columns to numpy arrays
patient_id, assigned_pat_loc, start_time, end_time = (
df[["id", "assigned_pat_loc", "start_time", "end_time"]].to_numpy().T
)
chunk_size = 1000  # experiment to find a size that suits you
idx_left = []
idx_right = []

for offset in range(0, len(df), chunk_size):
    chunk = slice(offset, offset + chunk_size)

    # Get a chunk of each array. The [:, None] part is to
    # raise the chunk up one dimension to prepare for numpy
    # broadcasting
    patient_id_chunk, assigned_pat_loc_chunk, start_time_chunk, end_time_chunk = [
        arr[chunk][:, None] for arr in (patient_id, assigned_pat_loc, start_time, end_time)
    ]

    # `mask` is a matrix. If mask[i, j] == True, the patient
    # in row i is sharing the room with the patient in row j
    mask = (
        # patient_ids are different
        (patient_id_chunk != patient_id)
        # in the same room
        & (assigned_pat_loc_chunk == assigned_pat_loc)
        # start_time and end_time overlap
        & (start_time_chunk < end_time)
        & (start_time < end_time_chunk)
    )

    idx = mask.nonzero()
    idx_left.extend(idx[0] + offset)
    idx_right.extend(idx[1])
result = pd.concat(
    [
        df[["id", "assigned_pat_loc", "start_time", "end_time"]]
        .iloc[idx]
        .reset_index(drop=True)
        for idx in [idx_left, idx_right]
    ],
    axis=1,
    keys=["patient_1", "patient_2"],
)
Result:
patient_1 patient_2
id assigned_pat_loc start_time end_time id assigned_pat_loc start_time end_time
0 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00 2 8w^201 2011-06-01 01:00:00 2011-06-01 08:00:00
1 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00 2 8w^201 2011-06-01 08:00:00 2011-06-01 08:00:00
2 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00 3 8w^201 2011-06-01 09:00:00 2011-06-05 09:00:00
3 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00 3 8w^201 2011-06-05 09:00:00 2011-06-05 09:00:00
4 2 10E^45 2011-05-31 07:00:00 2011-06-01 01:00:00 4 10E^45 2011-05-31 07:00:00 2011-06-01 08:00:00
5 2 8w^201 2011-06-01 01:00:00 2011-06-01 08:00:00 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00
6 2 8w^201 2011-06-01 08:00:00 2011-06-01 08:00:00 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00
7 3 8w^201 2011-06-01 09:00:00 2011-06-05 09:00:00 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00
8 3 8w^201 2011-06-05 09:00:00 2011-06-05 09:00:00 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00
9 4 10E^45 2011-05-31 07:00:00 2011-06-01 08:00:00 2 10E^45 2011-05-31 07:00:00 2011-06-01 01:00:00
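If you also want the shared window and its length, as in the first answer, here is a possible post-processing sketch on top of result (my addition, not part of the answer above): the max of the two starts and the min of the two ends bound the overlap.
shared_start = result[[("patient_1", "start_time"), ("patient_2", "start_time")]].max(axis=1)
shared_end = result[[("patient_1", "end_time"), ("patient_2", "end_time")]].min(axis=1)
result[("shared", "duration")] = shared_end - shared_start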
Related
I have the following dataframe with two groups:
start_time          end_time            ID
10/10/2021 13:38    10/10/2021 14:30    A
31/10/2021 14:00    31/10/2021 15:00    A
21/10/2021 14:47    21/10/2021 15:30    B
23/10/2021 14:00    23/10/2021 15:30    B
I will ignore the date and only preserve the time for counting.
And I would like to create 30-minute intervals as rows for each group first and then count, which should be similar to this:
start_interval    end_interval    count    ID
13:00             13:30           0        A
13:30             14:00           1        A
14:00             14:30           2        A
14:30             15:00           1        A
13:00             13:30           0        B
13:30             14:00           0        B
14:00             14:30           1        B
14:30             15:00           2        B
15:00             15:30           2        B
Use:
#normalize all datetimes for 30 minutes
f = lambda x: pd.to_datetime(x).dt.floor('30Min')
df[["start_time", "end_time"]] = df[["start_time", "end_time"]].apply(f)
#get difference of 30 minutes
df['diff'] = df['end_time'].sub(df['start_time']).dt.total_seconds().div(1800).astype(int)
df['start_time'] = df['start_time'].sub(df['start_time'].dt.floor('d'))
#repeat by 30 minutes
df = df.loc[df.index.repeat(df['diff'])]
df['start_time'] += pd.to_timedelta(df.groupby(level=0).cumcount().mul(30), unit='Min')
print (df)
start_time end_time ID diff
0 0 days 13:30:00 2021-10-10 14:30:00 A 2
0 0 days 14:00:00 2021-10-10 14:30:00 A 2
1 0 days 14:00:00 2021-10-31 15:00:00 A 2
1 0 days 14:30:00 2021-10-31 15:00:00 A 2
2 0 days 14:30:00 2021-10-21 15:30:00 B 2
2 0 days 15:00:00 2021-10-21 15:30:00 B 2
3 0 days 14:00:00 2021-10-23 15:30:00 B 3
3 0 days 14:30:00 2021-10-23 15:30:00 B 3
3 0 days 15:00:00 2021-10-23 15:30:00 B 3
#add starting dates - here 12:00
df1 = pd.DataFrame({'ID':df['ID'].unique(), 'start_time': pd.Timedelta(12, unit='H')})
print (df1)
ID start_time
0 A 0 days 12:00:00
1 B 0 days 12:00:00
df = pd.concat([df, df1])
#count per 30 minutes
df = df.set_index('start_time').groupby('ID').resample('30Min')['end_time'].count().reset_index(name='count')
#add end column
df['end_interval'] = df['start_time'] + pd.Timedelta(30, unit='Min')
df = df.rename(columns={'start_time':'start_interval'})[['start_interval','end_interval','count','ID']]
print (df)
start_interval end_interval count ID
0 0 days 12:00:00 0 days 12:30:00 0 A
1 0 days 12:30:00 0 days 13:00:00 0 A
2 0 days 13:00:00 0 days 13:30:00 0 A
3 0 days 13:30:00 0 days 14:00:00 1 A
4 0 days 14:00:00 0 days 14:30:00 2 A
5 0 days 14:30:00 0 days 15:00:00 1 A
6 0 days 12:00:00 0 days 12:30:00 0 B
7 0 days 12:30:00 0 days 13:00:00 0 B
8 0 days 13:00:00 0 days 13:30:00 0 B
9 0 days 13:30:00 0 days 14:00:00 0 B
10 0 days 14:00:00 0 days 14:30:00 1 B
11 0 days 14:30:00 0 days 15:00:00 2 B
12 0 days 15:00:00 0 days 15:30:00 2 B
EDIT:
def f(x):
    ts = x.total_seconds()
    hours, remainder = divmod(ts, 3600)
    minutes, seconds = divmod(remainder, 60)
    return '{:02d}:{:02d}:{:02d}'.format(int(hours), int(minutes), int(seconds))

df[['start_interval','end_interval']] = df[['start_interval','end_interval']].applymap(f)
print (df)
start_interval end_interval count ID
0 12:00:00 12:30:00 0 A
1 12:30:00 13:00:00 0 A
2 13:00:00 13:30:00 0 A
3 13:30:00 14:00:00 1 A
4 14:00:00 14:30:00 2 A
5 14:30:00 15:00:00 1 A
6 12:00:00 12:30:00 0 B
7 12:30:00 13:00:00 0 B
8 13:00:00 13:30:00 0 B
9 13:30:00 14:00:00 0 B
10 14:00:00 14:30:00 1 B
11 14:30:00 15:00:00 2 B
12 15:00:00 15:30:00 2 B
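A side note: applymap is deprecated on pandas 2.1+, where DataFrame.map is the drop-in replacement, so the formatting step there becomes:
df[['start_interval','end_interval']] = df[['start_interval','end_interval']].map(f)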
The input dataframe has start and end times. The resultant dataframe is a series of timestamps with a 30-minute interval between them.
Here it is:
# Import libs
import pandas as pd
from datetime import timedelta
# Sample Dataframe
df = pd.DataFrame(
    [
        ["10/10/2021 13:40", "10/10/2021 14:30", "A"],
        ["31/10/2021 14:00", "31/10/2021 15:00", "A"],
        ["21/10/2021 14:40", "21/10/2021 15:30", "B"],
        ["23/10/2021 14:00", "23/10/2021 15:30", "B"],
    ],
    columns=["start_time", "end_time", "ID"],
)
# convert to timedelta
df[["start_time", "end_time"]] = df[["start_time", "end_time"]].apply(
    lambda x: pd.to_datetime(x) - pd.to_datetime(x).dt.normalize()
)
# Extract seconds elapsed
df[["start_secs", "end_secs"]] = df[["start_time", "end_time"]].applymap(
    lambda x: x.seconds
)
# OUTPUT
# start_time end_time ID start_secs end_secs
# 0 0 days 13:40:00 0 days 14:30:00 A 49200 52200
# 1 0 days 14:00:00 0 days 15:00:00 A 50400 54000
# 2 0 days 14:40:00 0 days 15:30:00 B 52800 55800
# 3 0 days 14:00:00 0 days 15:30:00 B 50400 55800
# Get rounded min and max time in secs of the dataframe
min_t, max_t = (df["start_secs"].min() // 3600) * 3600, (
    df["end_secs"].max() // 3600
) * 3600 + 3600

# Create interval dataframe with 30min bins
interval_df = pd.DataFrame(
    map(lambda x: [x, x + 30 * 60], range(min_t, max_t, 30 * 60)),
    columns=["start_interval", "end_interval"],
)
# OUTPUT
# start_interval end_interval
# 0 46800 48600
# 1 48600 50400
# 2 50400 52200
# 3 52200 54000
# 4 54000 55800
# 5 55800 57600
# Find whether each bin interval overlaps an actual timeline,
# then count the overlapping timelines per ID.
interval_df[["A", "B"]] = (
    df.groupby(["ID"])
    .apply(
        lambda x: x.apply(
            lambda y: ~(
                ((interval_df["end_interval"] - y["start_secs"]) <= 0)
                | ((interval_df["start_interval"] - y["end_secs"]) >= 0)
            ),
            axis=1,
        ).sum(axis=0)
    )
    .T
)
# OUTPUT
# start_interval end_interval A B
# 0 46800 48600 0 0
# 1 48600 50400 1 0
# 2 50400 52200 2 1
# 3 52200 54000 1 2
# 4 54000 55800 0 2
# 5 55800 57600 0 0
# Convert seconds to time
interval_df[["start_interval", "end_interval"]] = interval_df[
    ["start_interval", "end_interval"]
].applymap(lambda x: str(timedelta(seconds=x)))

# Stack counts of A and B into one single column
interval_df.melt(["start_interval", "end_interval"])
# OUTPUT
# start_interval end_interval variable value
# 0 13:00:00 13:30:00 A 0
# 1 13:30:00 14:00:00 A 1
# 2 14:00:00 14:30:00 A 2
# 3 14:30:00 15:00:00 A 1
# 4 15:00:00 15:30:00 A 0
# 5 15:30:00 16:00:00 A 0
# 6 13:00:00 13:30:00 B 0
# 7 13:30:00 14:00:00 B 0
# 8 14:00:00 14:30:00 B 1
# 9 14:30:00 15:00:00 B 2
# 10 15:00:00 15:30:00 B 2
# 11 15:30:00 16:00:00 B 0
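If the column names should match the desired output exactly (ID and count rather than melt's defaults variable and value), melt accepts names for both:
interval_df.melt(["start_interval", "end_interval"], var_name="ID", value_name="count")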
I have got a time series of meteorological observations with date and value columns:
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['11/10/2017 0:00','11/10/2017 03:00','11/10/2017 06:00','11/10/2017 09:00','11/10/2017 12:00',
                            '11/11/2017 0:00','11/11/2017 03:00','11/11/2017 06:00','11/11/2017 09:00','11/11/2017 12:00',
                            '11/12/2017 00:00','11/12/2017 03:00','11/12/2017 06:00','11/12/2017 09:00','11/12/2017 12:00'],
                   'value': [850,np.nan,np.nan,np.nan,np.nan,500,650,780,np.nan,800,350,690,780,np.nan,np.nan],
                   'consecutive_hour': [3,0,0,0,0,3,6,9,0,3,3,6,9,0,0]})
With this DataFrame, I want a third column consecutive_hour such that, if the value at a particular timestamp is less than 1000, that row gets a consecutive_hour value of 3 hours, and consecutive such occurrences accumulate to 6, 9, and so on, as shown above.
Lastly, I want to summarize the table by counting, for each consecutive-hours value, the number of days on which it occurs, so that the summary table looks like:
df_summary = pd.DataFrame({'consecutive_hours':[3,6,9,12],
'number_of_day':[2,0,2,0]})
I tried several online solutions and methods like shift(), diff(), etc., as mentioned in How to groupby consecutive values in pandas DataFrame and more. I spent several days on this but no luck yet.
I would highly appreciate help on this issue.
Thanks!
Input data:
>>> df
date value
0 2017-11-10 00:00:00 850.0
1 2017-11-10 03:00:00 NaN
2 2017-11-10 06:00:00 NaN
3 2017-11-10 09:00:00 NaN
4 2017-11-10 12:00:00 NaN
5 2017-11-11 00:00:00 500.0
6 2017-11-11 03:00:00 650.0
7 2017-11-11 06:00:00 780.0
8 2017-11-11 09:00:00 NaN
9 2017-11-11 12:00:00 800.0
10 2017-11-12 00:00:00 350.0
11 2017-11-12 03:00:00 690.0
12 2017-11-12 06:00:00 780.0
13 2017-11-12 09:00:00 NaN
14 2017-11-12 12:00:00 NaN
The cumcount_reset function is adapted from this answer by jezrael:
Python pandas cumsum with reset everytime there is a 0
cumcount_reset = \
    lambda b: b.cumsum().sub(b.cumsum().where(~b).ffill().fillna(0)).astype(int)

df["consecutive_hour"] = (df.set_index("date")["value"] < 1000) \
    .groupby(pd.Grouper(freq="D")) \
    .apply(lambda b: cumcount_reset(b)).mul(3) \
    .reset_index(drop=True)
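For intuition, cumcount_reset turns a boolean series into a running count of consecutive True values that resets at each False; a minimal demo (given the definition above):
b = pd.Series([True, True, False, True, True, True])
print(cumcount_reset(b).tolist())  # [1, 2, 0, 1, 2, 3]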
Output result:
>>> df
date value consecutive_hour
0 2017-11-10 00:00:00 850.0 3
1 2017-11-10 03:00:00 NaN 0
2 2017-11-10 06:00:00 NaN 0
3 2017-11-10 09:00:00 NaN 0
4 2017-11-10 12:00:00 NaN 0
5 2017-11-11 00:00:00 500.0 3
6 2017-11-11 03:00:00 650.0 6
7 2017-11-11 06:00:00 780.0 9
8 2017-11-11 09:00:00 NaN 0
9 2017-11-11 12:00:00 800.0 3
10 2017-11-12 00:00:00 350.0 3
11 2017-11-12 03:00:00 690.0 6
12 2017-11-12 06:00:00 780.0 9
13 2017-11-12 09:00:00 NaN 0
14 2017-11-12 12:00:00 NaN 0
Summary table: keep only the last value of each consecutive run within a day (rows where the count drops immediately afterwards), then tally how many days each run length occurs:
df_summary = df.loc[df.groupby(pd.Grouper(key="date", freq="D"))["consecutive_hour"]
                      .apply(lambda h: (h - h.shift(-1).fillna(0)) > 0),
                    "consecutive_hour"] \
    .value_counts().reindex([3, 6, 9, 12], fill_value=0) \
    .rename("number_of_day") \
    .rename_axis("consecutive_hour") \
    .reset_index()
>>> df_summary
consecutive_hour number_of_day
0 3 2
1 6 0
2 9 2
3 12 0
I have a dataframe like the one shown below:
import pandas as pd

df1 = pd.DataFrame({
    'subject_id': [1,1,1,1,1,1,1,1,1,1,1],
    'time_1': ['2173-04-03 12:35:00','2173-04-03 12:50:00','2173-04-03 12:59:00',
               '2173-04-03 13:14:00','2173-04-03 13:37:00','2173-04-04 11:30:00',
               '2173-04-05 16:00:00','2173-04-05 22:00:00','2173-04-06 04:00:00',
               '2173-04-06 04:30:00','2173-04-06 08:00:00']
})
I would like to create another column called tdiff to calculate the time difference. This is what I tried:
df1['time_1'] = pd.to_datetime(df1['time_1'])
df1['time_2'] = df1['time_1'].shift(-1)
df1['tdiff'] = (df1['time_2'] - df1['time_1']).dt.total_seconds() / 3600
But this subtracts across days: the last record of each day picks up the first record of the following day. Instead, I would like to restrict the time difference to the same day. For example, if 20:00:00 on Jan 15th is the last record for that day, then I expect tdiff to be 4:00:00 (24:00:00 - 20:00:00).
I understand this is happening because I am shifting the time values to subtract, so those rows pick up records from the next date. But is there a way to avoid this and calculate the time difference only between records on the same day?
In my expected output, the NaN for the last record of a day should be replaced using the end of the current date (23:59:00); if you check the differences, you will get the idea.
Is there any existing method or pandas function that can help do this datewise timedelta? How can I shift the values datewise?
IIUC, you can use the following, where s is the fallback for the last record of each day (24:00 minus the time of day of time_1), used to fill the trailing NaN of each day's diff:
s=pd.to_timedelta(24,unit='h')-(df1.time_1-df1.time_1.dt.normalize())
df1['tdiff']=df1.groupby(df1.time_1.dt.date).time_1.diff().shift(-1).fillna(s)
#df1.groupby(df1.time_1.dt.date).time_1.diff().shift(-1).fillna(s).dt.total_seconds()/3600
subject_id time_1 tdiff
0 1 2173-04-03 12:35:00 00:15:00
1 1 2173-04-03 12:50:00 00:09:00
2 1 2173-04-03 12:59:00 00:15:00
3 1 2173-04-03 13:14:00 00:23:00
4 1 2173-04-03 13:37:00 10:23:00
5 1 2173-04-04 11:30:00 12:30:00
6 1 2173-04-05 16:00:00 06:00:00
7 1 2173-04-05 22:00:00 02:00:00
8 1 2173-04-06 04:00:00 00:30:00
9 1 2173-04-06 04:30:00 03:30:00
10 1 2173-04-06 08:00:00 16:00:00
You could use Series.where and dt.ceil to decide whether to subtract time_1 from time_2 or from the midnight following time_1:
sameDayOrMidnight = df.time_2.where(df.time_1.dt.date==df.time_2.dt.date, df.time_1.dt.ceil(freq='1d'))
df['tdiff'] = (sameDayOrMidnight - df.time_1).dt.total_seconds() / 3600
result:
subject_id time_1 time_2 tdiff
0 1 2173-04-03 12:35:00 2173-04-03 12:50:00 0.250000
1 1 2173-04-03 12:50:00 2173-04-03 12:59:00 0.150000
2 1 2173-04-03 12:59:00 2173-04-03 13:14:00 0.250000
3 1 2173-04-03 13:14:00 2173-04-03 13:37:00 0.383333
4 1 2173-04-03 13:37:00 2173-04-04 11:30:00 10.383333
5 1 2173-04-04 11:30:00 2173-04-05 16:00:00 12.500000
6 1 2173-04-05 16:00:00 2173-04-05 22:00:00 6.000000
7 1 2173-04-05 22:00:00 2173-04-06 04:00:00 2.000000
8 1 2173-04-06 04:00:00 2173-04-06 04:30:00 0.500000
9 1 2173-04-06 04:30:00 2173-04-06 08:00:00 3.500000
10 1 2173-04-06 08:00:00 NaT 16.000000
I have a pandas column that contains timestamps that are unordered. When I sort them, it works fine except for the values with single-digit hours (H:MM:SS).
import pandas as pd

d = {'A': ['8:00:00','9:00:00','10:00:00','20:00:00','24:00:00','26:20:00']}
df = pd.DataFrame(data=d)
df = df.sort_values(by='A', ascending=True)
Out:
A
2 10:00:00
3 20:00:00
4 24:00:00
5 26:20:00
0 8:00:00
1 9:00:00
Ideally, I'd like to add a zero before the single-digit hours (see the sketch after the intended output below). If I convert them all to timedelta, it converts the times after midnight into 1 day plus n hours, e.g.:
df['A'] = pd.to_timedelta(df['A'])
A
0 0 days 08:00:00
1 0 days 09:00:00
2 0 days 10:00:00
3 0 days 20:00:00
4 1 days 00:00:00
5 1 days 02:20:00
Intended Output:
A
0 08:00:00
1 09:00:00
2 10:00:00
3 20:00:00
4 24:00:00
5 26:20:00
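A small sketch of the padding idea from the question (zero-padding single-digit hours so that plain string sorting matches chronological order; the regex is one way to express "add a zero"):
df['A'] = df['A'].str.replace(r'^(\d):', r'0\1:', regex=True)
df = df.sort_values(by='A')  # all values now have equal width, so lexicographic order is chronological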
If you only need to sort by the column as timedelta, you can convert the column to timedelta and use argsort on it to obtain the row order for the data frame:
df.iloc[pd.to_timedelta(df.A).argsort()]
# A
#0 8:00:00
#1 9:00:00
#2 10:00:00
#3 20:00:00
#4 24:00:00
#5 26:20:00
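On pandas 1.1+ there is also the key parameter of sort_values, which sorts by the converted values while keeping the original strings, an alternative to argsort with the same result:
df = df.sort_values(by='A', key=pd.to_timedelta)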
I have the above dataframe (snippet) and want to create a new dataframe which is a conditional selection where I keep only the rows that are timestamped with a time before 15:00:00.
I'm still somewhat new to Pandas / python and have been stuck on this for a while :(
You can use DataFrame.between_time:
import pandas as pd

start = pd.to_datetime('2015-02-24 11:00')
rng = pd.date_range(start, periods=10, freq='14h')
df = pd.DataFrame({'Date': rng, 'a': range(10)})
print (df)
Date a
0 2015-02-24 11:00:00 0
1 2015-02-25 01:00:00 1
2 2015-02-25 15:00:00 2
3 2015-02-26 05:00:00 3
4 2015-02-26 19:00:00 4
5 2015-02-27 09:00:00 5
6 2015-02-27 23:00:00 6
7 2015-02-28 13:00:00 7
8 2015-03-01 03:00:00 8
9 2015-03-01 17:00:00 9
df = df.set_index('Date').between_time('00:00:00', '15:00:00')
print (df)
a
Date
2015-02-24 11:00:00 0
2015-02-25 01:00:00 1
2015-02-25 15:00:00 2
2015-02-26 05:00:00 3
2015-02-27 09:00:00 5
2015-02-28 13:00:00 7
2015-03-01 03:00:00 8
If you need to exclude 15:00:00, add the parameter include_end=False:
df = df.set_index('Date').between_time('00:00:00', '15:00:00', include_end=False)
print (df)
a
Date
2015-02-24 11:00:00 0
2015-02-25 01:00:00 1
2015-02-26 05:00:00 3
2015-02-27 09:00:00 5
2015-02-28 13:00:00 7
2015-03-01 03:00:00 8
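On pandas 1.4+, include_start/include_end are deprecated in favour of the inclusive parameter; the equivalent call would be:
df = df.set_index('Date').between_time('00:00:00', '15:00:00', inclusive='left')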
You can check the hours of the date column and use it for subsetting:
df['date'] = pd.to_datetime(df['date']) # optional if the date column is of datetime type
df[df.date.dt.hour < 15]
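A minimal, self-contained demo of that idea (the sample values are made up, since the question's dataframe isn't reproduced above):
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(
    ['2015-02-24 11:00', '2015-02-25 15:00', '2015-02-26 05:00'])})
print(df[df['date'].dt.hour < 15])  # keeps 11:00 and 05:00, drops the 15:00 row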