Resample pandas time series that contains elapsed time values - python

I have time series data in the format shown at the bottom of this post.
I want to resample the data to 30-minute intervals, but I need the Time in State values to be split across the intervals they fall into (these values are expressed in whole seconds).
Now imagine that for a certain row the Time in State is 2342 seconds (more than 30 minutes), and say the start time is 08:22:00.
User Start Date Start Time State Time in State (secs)
J.Doe 03-02-2014 08:22:00 A 2342
When the resample is done, I need the Time in State to be split across the periods it overflows into, like this:
User Start Date Time Period State Time in State (secs)
J.Doe 03-02-2014 08:00:00 A 480
J.Doe 03-02-2014 08:30:00 A 1800
J.Doe 03-02-2014 09:00:00 A 62
480+1800+62 = 2342
I'm completely lost on how to achieve this in pandas...I would appreciate any help :-)
Source data format:
User Start Date Start Time State Time in State (secs)
J.Doe 03-02-2014 07:58:00 A 36
J.Doe 03-02-2014 07:59:00 A 43
J.Doe 03-02-2014 08:00:00 A 59
J.Doe 03-02-2014 08:01:00 A 32
J.Doe 03-02-2014 08:21:00 A 15
J.Doe 03-02-2014 08:22:00 B 3
J.Doe 03-02-2014 08:22:00 A 2342
J.Doe 03-02-2014 09:01:00 B 1
J.Doe 03-02-2014 09:01:00 A 375
J.Doe 03-02-2014 09:07:00 B 3
J.Doe 03-02-2014 09:07:00 A 6408
J.Doe 03-02-2014 10:54:00 B 2
J.Doe 03-02-2014 10:54:00 A 116
J.Doe 03-02-2014 10:58:00 B 2
J.Doe 03-02-2014 10:58:00 A 122
J.Doe 03-02-2014 10:58:00 A 12
J.Doe 03-02-2014 11:00:00 B 2
J.Doe 03-02-2014 11:00:00 A 3417
J.Doe 03-02-2014 11:57:00 B 3
J.Doe 03-02-2014 11:57:00 A 120
J.Doe 03-02-2014 11:59:00 C 165
J.Doe 03-02-2014 12:02:00 B 3
J.Doe 03-02-2014 12:02:00 A 7254

I would first create Start and End columns (as datetime64 objects):
In [11]: df['Start'] = pd.to_datetime(df['Start Date'] + ' ' + df['Start Time'])
In [12]: df['End'] = df['Start'] + df['Time in State (secs)'].apply(pd.offsets.Second)
In [13]: row = df.iloc[6, :]
In [14]: row
Out[14]:
User J.Doe
Start Date 03-02-2014
Start Time 08:22:00
State A
Time in State (secs) 2342
Start 2014-03-02 08:22:00
End 2014-03-02 09:01:02
Name: 6, dtype: object
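As a side note, on recent pandas versions the End column can also be built in a vectorized way with pd.to_timedelta rather than applying per-row offsets; a minimal sketch, assuming the same column names:
# vectorized alternative to the .apply(pd.offsets.Second) step above
df['End'] = df['Start'] + pd.to_timedelta(df['Time in State (secs)'], unit='s')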
One way to get the split times is to resample from Start and End, merge, and use diff:
# note: this targets the pandas 0.13-era API, where .resample() aggregated
# directly; it also assumes numpy is imported as np
def split_times(row):
    y = pd.Series(0, [row['Start'], row['End']])
    splits = y.resample('30min').index + y.index  # this fills in the middle and sorts too
    res = -splits.to_series().diff(-1)
    if len(res) > 2:
        res = res[1:-1]
    elif len(res) == 2:
        res = res[1:]
    return res.astype(int).resample('30min').astype(np.timedelta64)  # hack to resample again
In [16]: split_times(row)
Out[16]:
2014-03-02 08:22:00 00:08:00
2014-03-02 08:30:00 00:30:00
2014-03-02 09:00:00 00:01:02
dtype: timedelta64[ns]
In [17]: df.apply(split_times, 1)
Out[17]:
2014-03-02 07:30:00 2014-03-02 08:00:00 2014-03-02 08:30:00 2014-03-02 09:00:00 2014-03-02 09:30:00 2014-03-02 10:00:00 2014-03-02 10:30:00 2014-03-02 11:00:00 2014-03-02 11:30:00 2014-03-02 12:00:00 2014-03-02 12:30:00 2014-03-02 13:00:00 2014-03-02 13:30:00 2014-03-02 14:00:00
0 00:00:36 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
1 00:00:43 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
2 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
3 NaT 00:00:32 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
4 NaT 00:00:15 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
5 NaT 00:00:03 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
6 NaT 00:08:00 00:30:00 00:01:02 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
7 NaT NaT NaT 00:00:01 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
8 NaT NaT NaT 00:06:15 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
9 NaT NaT NaT 00:00:03 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
10 NaT NaT NaT 00:23:00 00:30:00 00:30:00 00:23:48 NaT NaT NaT NaT NaT NaT NaT
11 NaT NaT NaT NaT NaT NaT 00:00:02 NaT NaT NaT NaT NaT NaT NaT
12 NaT NaT NaT NaT NaT NaT 00:01:56 NaT NaT NaT NaT NaT NaT NaT
13 NaT NaT NaT NaT NaT NaT 00:00:02 NaT NaT NaT NaT NaT NaT NaT
14 NaT NaT NaT NaT NaT NaT 00:02:00 00:00:02 NaT NaT NaT NaT NaT NaT
15 NaT NaT NaT NaT NaT NaT 00:00:12 NaT NaT NaT NaT NaT NaT NaT
16 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
17 NaT NaT NaT NaT NaT NaT NaT NaT 00:26:57 NaT NaT NaT NaT NaT
18 NaT NaT NaT NaT NaT NaT NaT NaT 00:00:03 NaT NaT NaT NaT NaT
19 NaT NaT NaT NaT NaT NaT NaT NaT 00:02:00 NaT NaT NaT NaT NaT
20 NaT NaT NaT NaT NaT NaT NaT NaT 00:01:00 00:01:45 NaT NaT NaT NaT
21 NaT NaT NaT NaT NaT NaT NaT NaT NaT 00:00:03 NaT NaT NaT NaT
22 NaT NaT NaT NaT NaT NaT NaT NaT NaT 00:28:00 00:30:00 00:30:00 00:30:00 00:02:54
To replace the NaTs with 0, it looks like you have to do some fiddling in 0.13.1 (this may already be fixed up in master; otherwise it is a bug):
res2 = df.apply(split_times, 1).astype(int)
# hack to replace NaTs with 0 (-9223372036854775808 is the int64 representation of NaT)
res2.where(res2 != -9223372036854775808, 0).astype(np.timedelta64)
# to just get the seconds
seconds = res2.where(res2 != -9223372036854775808, 0) / 10 ** 9
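On modern pandas, the resample hacks above no longer work, but the same splitting can be done by clipping each interval against a 30-minute grid. A sketch under the assumption that Start and End already exist as datetime64 columns (split_row is a hypothetical helper, not part of the original answer):
import pandas as pd

def split_row(row, freq='30min'):
    # grid of bin edges covering the whole [Start, End] interval
    edges = pd.date_range(row['Start'].floor(freq), row['End'].ceil(freq), freq=freq)
    lower = edges[:-1].where(edges[:-1] >= row['Start'], row['Start'])  # clip bin starts
    upper = edges[1:].where(edges[1:] <= row['End'], row['End'])        # clip bin ends
    # seconds spent inside each 30-minute bin, indexed by the bin start
    return pd.Series((upper - lower).total_seconds(), index=edges[:-1])

splits = df.apply(split_row, axis=1).fillna(0)  # rows x 30-minute bins, in seconds
For the example row (Start 08:22:00, End 09:01:02) this yields 480, 1800 and 62 seconds in the 08:00, 08:30 and 09:00 bins respectively.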

Related

All month ends until the end date

I have a df with two columns:
index start_date end_date
0 2000-01-03 2000-01-20
1 2000-01-04 2000-01-31
2 2000-01-05 2000-02-02
3 2000-01-05 2000-02-17
...
5100 2020-12-29 2021-01-11
5111 2020-12-30 2021-03-15
I would like to add columns with all the month-end dates between the start and end date, so that if the end_date falls in the middle of a month, the end of that month is still taken into account.
So, my df would look like this:
index start_date end_date first_monthend second_monthend third_monthend fourth_monthend
0 2000-01-03 2000-01-20 2000-01-31 0 0 0
1 2000-01-04 2000-01-31 2000-01-31 0 0 0
2 2000-01-05 2000-02-02 2000-01-31 2000-02-28 0 0
3 2000-01-05 2000-02-17 2000-01-31 2000-02-28 0 0
... ... ... ... ... ...
5100 2020-12-29 2021-02-11 2020-12-31 2021-01-31 2021-02-28 0
5111 2020-12-30 2021-03-15 2020-12-31 2021-01-31 2021-02-28 2021-03-31
I would be very grateful if you could help me
If you need to parse the months between the start and end datetimes and add the last day of each month, use a custom function with period_range:
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
def f(x):
    r = pd.period_range(x['start_date'],
                        x['end_date'], freq='m').to_timestamp(how='end').normalize()
    return pd.Series(r)

df = df.join(df.apply(f, axis=1).fillna(0).add_suffix('_monthend'))
print(df)
start_date end_date 0_monthend 1_monthend \
0 2000-01-03 2000-01-20 2000-01-31 0
1 2000-01-04 2000-01-31 2000-01-31 0
2 2000-01-05 2000-02-02 2000-01-31 2000-02-29 00:00:00
3 2000-01-05 2000-02-17 2000-01-31 2000-02-29 00:00:00
5100 2020-12-29 2021-01-11 2020-12-31 2021-01-31 00:00:00
5111 2020-12-30 2021-03-15 2020-12-31 2021-01-31 00:00:00
2_monthend 3_monthend
0 0 0
1 0 0
2 0 0
3 0 0
5100 0 0
5111 2021-02-28 00:00:00 2021-03-31 00:00:00
If you do not want to replace missing values with 0:
df = df.join(df.apply(f, axis=1).add_suffix('_monthend'))
print (df)
start_date end_date 0_monthend 1_monthend 2_monthend 3_monthend
0 2000-01-03 2000-01-20 2000-01-31 NaT NaT NaT
1 2000-01-04 2000-01-31 2000-01-31 NaT NaT NaT
2 2000-01-05 2000-02-02 2000-01-31 2000-02-29 NaT NaT
3 2000-01-05 2000-02-17 2000-01-31 2000-02-29 NaT NaT
5100 2020-12-29 2021-01-11 2020-12-31 2021-01-31 NaT NaT
5111 2020-12-30 2021-03-15 2020-12-31 2021-01-31 2021-02-28 2021-03-31
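If you also want the ordinal column names shown in the desired output (first_monthend, second_monthend, ...), one option is to rename the positional suffixes afterwards; a small sketch, assuming at most four month ends:
ordinals = ['first', 'second', 'third', 'fourth']
df = df.rename(columns={f'{i}_monthend': f'{name}_monthend'
                        for i, name in enumerate(ordinals)})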

Split up duration while upsampling dataframe

How do I split up a duration while upsampling a dataframe, as in the example below?
And can I replace the for loop with, e.g., the groupby function?
I want to use pandas to transform data like this:
activity name time started time ended
0 Bedtime 2021-10-25 00:00:00 2021-10-25 08:25:42
1 videos 2021-10-25 08:25:42 2021-10-25 08:51:54
2 Commute 2021-10-25 08:51:54 2021-10-25 09:29:34
into this:
time started Bedtime videos Commute
2021-10-25 00:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 01:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 02:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 03:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 04:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 05:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 06:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 07:00:00 0 days 01:00:00 0 days 00:00:00 0 days
2021-10-25 08:00:00 0 days 00:25:42 0 days 00:26:12 0 days 00:08:06
...
And I got this far:
import pandas as pd

df = pd.DataFrame({'activity name': ['Bedtime', 'videos', 'Commute'],
                   'time started': ["2021-10-25 00:00:00", "2021-10-25 08:25:42", "2021-10-25 08:51:54"],
                   'time ended': ["2021-10-25 08:25:42", "2021-10-25 08:51:54", "2021-10-25 09:29:34"]})
# converting strings to datetime
df['time ended'] = pd.to_datetime(df['time ended'])
df['time started'] = pd.to_datetime(df['time started'])
# calculating the duration
df['duration'] = df['time ended'] - df['time started']
# changing index
df.index = df['time started']
df = df.drop(columns=['time started', 'time ended'])
for a in df['activity name'].unique():
    df[a] = (df['activity name'] == a) * df['duration']
df = df.drop(columns=['activity name', 'duration'])
df.resample('H').first()
time started
2021-10-25 00:00:00 0 days 08:25:42 0 days 00:00:00 0 days
2021-10-25 01:00:00 NaT NaT NaT
2021-10-25 02:00:00 NaT NaT NaT
2021-10-25 03:00:00 NaT NaT NaT
2021-10-25 04:00:00 NaT NaT NaT
2021-10-25 05:00:00 NaT NaT NaT
2021-10-25 06:00:00 NaT NaT NaT
2021-10-25 07:00:00 NaT NaT NaT
2021-10-25 08:00:00 0 days 00:00:00 0 days 00:26:12 0 days
Try this:
import pandas as pd
from io import StringIO
txtfile = StringIO(
""" activity name time started time ended
0 Bedtime 2021-10-25 00:00:00 2021-10-25 08:25:42
1 videos 2021-10-25 08:25:42 2021-10-25 08:51:54
2 Commute 2021-10-25 08:51:54 2021-10-25 09:29:34"""
)
df = pd.read_csv(txtfile, sep=r"\s\s+", engine="python")
df[["time started", "time ended"]] = df[["time started", "time ended"]].apply(
pd.to_datetime
)
df_e = df.assign(
date=[
pd.date_range(s, e, freq="s")
for s, e in zip(df["time started"], df["time ended"])
]
).explode("date")
df_out = (
df_e.groupby(["activity name", pd.Grouper(key="date", freq="H")])["activity name"]
.count()
.unstack(0)
.apply(pd.to_timedelta, unit="s")
)
print(df_out)
Output:
activity name Bedtime Commute videos
date
2021-10-25 00:00:00 0 days 01:00:00 NaT NaT
2021-10-25 01:00:00 0 days 01:00:00 NaT NaT
2021-10-25 02:00:00 0 days 01:00:00 NaT NaT
2021-10-25 03:00:00 0 days 01:00:00 NaT NaT
2021-10-25 04:00:00 0 days 01:00:00 NaT NaT
2021-10-25 05:00:00 0 days 01:00:00 NaT NaT
2021-10-25 06:00:00 0 days 01:00:00 NaT NaT
2021-10-25 07:00:00 0 days 01:00:00 NaT NaT
2021-10-25 08:00:00 0 days 00:25:43 0 days 00:08:06 0 days 00:26:13
2021-10-25 09:00:00 NaT 0 days 00:29:35 NaT
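Note that the per-hour totals come out one second high (00:25:43 instead of 00:25:42) because pd.date_range includes both endpoints of each interval. If that matters, one option (assuming pandas >= 1.4, where the inclusive parameter exists) is to drop the right endpoint of each interval:
# in the df.assign(...) step above, exclude each interval's end second
date=[
    pd.date_range(s, e, freq="s", inclusive="left")
    for s, e in zip(df["time started"], df["time ended"])
]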
To address @DerekO's comment:
import pandas as pd
from io import StringIO
txtfile = StringIO(
""" activity name time started time ended
0 Bedtime 2021-10-25 00:00:00 2021-10-25 08:25:42
1 videos 2021-10-25 08:25:42 2021-10-25 08:51:54
2 Commute 2021-10-25 08:51:54 2021-10-25 09:29:34
3 Bedtime 2021-10-25 11:00:00 2021-10-25 13:04:31"""
)
df = pd.read_csv(txtfile, sep=r"\s\s+", engine="python")
df[["time started", "time ended"]] = df[["time started", "time ended"]].apply(
pd.to_datetime
)
df_e = df.assign(
date=[
pd.date_range(s, e, freq="s")
for s, e in zip(df["time started"], df["time ended"])
]
).explode("date")
df_out = (
df_e.groupby(["activity name", pd.Grouper(key="date", freq="H")])["activity name"]
.count()
.unstack(0)
.apply(pd.to_timedelta, unit="s")
.sort_index()
)
print(df_out)
Output:
activity name Bedtime Commute videos
date
2021-10-25 00:00:00 0 days 01:00:00 NaT NaT
2021-10-25 01:00:00 0 days 01:00:00 NaT NaT
2021-10-25 02:00:00 0 days 01:00:00 NaT NaT
2021-10-25 03:00:00 0 days 01:00:00 NaT NaT
2021-10-25 04:00:00 0 days 01:00:00 NaT NaT
2021-10-25 05:00:00 0 days 01:00:00 NaT NaT
2021-10-25 06:00:00 0 days 01:00:00 NaT NaT
2021-10-25 07:00:00 0 days 01:00:00 NaT NaT
2021-10-25 08:00:00 0 days 00:25:43 0 days 00:08:06 0 days 00:26:13
2021-10-25 09:00:00 NaT 0 days 00:29:35 NaT
2021-10-25 11:00:00 0 days 01:00:00 NaT NaT
2021-10-25 12:00:00 0 days 01:00:00 NaT NaT
2021-10-25 13:00:00 0 days 00:04:32 NaT NaT
Although I agree that using groupby and resample would be best, I couldn't make such a solution work. You can instead brute force the problem by creating a new DataFrame for every row of your original DataFrame, and concatenating them together.
The way it works is that we use pd.date_range to create a DatetimeIndex between the floor of the start and end times, with the start and end times themselves inserted into the DatetimeIndex as well. Then the differences between consecutive datetimes in this DatetimeIndex are the values of your new DataFrame.
To try to make my solution as robust as possible, I added two additional rows to your original DataFrame with a repeated category, and tested situations where the starting time falls exactly on the hour versus ahead of the hour.
import pandas as pd

df = pd.DataFrame({
    'activity name': ['Bedtime', 'videos', 'Commute', 'Work', 'Commute'],
    'time started': ["2021-10-25 00:00:00", "2021-10-25 08:25:42", "2021-10-25 08:51:54", "2021-10-25 09:29:34", "2021-10-25 17:00:00"],
    'time ended': ["2021-10-25 08:25:42", "2021-10-25 08:51:54", "2021-10-25 09:29:34", "2021-10-25 17:00:00", "2021-10-25 18:01:00"]})
# converting strings to datetime
df['time ended'] = pd.to_datetime(df['time ended'])
df['time started'] = pd.to_datetime(df['time started'])
## column names with spaces can't be accessed by name when using itertuples to iterate through the df
df.columns = [col.replace(" ", "_") for col in df.columns]
Starting df:
>>> df
activity_name time_started time_ended
0 Bedtime 2021-10-25 00:00:00 2021-10-25 08:25:42
1 videos 2021-10-25 08:25:42 2021-10-25 08:51:54
2 Commute 2021-10-25 08:51:54 2021-10-25 09:29:34
3 Work 2021-10-25 09:29:34 2021-10-25 17:00:00
4 Commute 2021-10-25 17:00:00 2021-10-25 18:01:00
## we use the start and end times to determine what date range we create
start_time = df['time_started'].min().floor('h')
end_time = df['time_started'].max().ceil('h')
## set up an empty DataFrame to hold the final result
new_columns = list(df.activity_name.unique())
df_new = pd.DataFrame(columns=new_columns)
for row in df.itertuples(index=True):
    new_row = {}
    daterange_start = row.time_started.floor('1h')
    daterange_end = row.time_ended.floor('1h')
    datetimes_index = pd.date_range(daterange_start, daterange_end, freq='1h')
    all_datetimes = datetimes_index.union([row.time_started, row.time_ended])
    ## take the difference and shift by -1 to drop the first NaT
    new_row[row.activity_name] = all_datetimes.to_series().diff().shift(-1)
    ## if the first row starts in the middle of an hour, we don't want the difference
    ## between the beginning of the hour and the time in that row
    ## (note: DataFrame.append was removed in pandas 2.0; on newer versions use
    ## df_new = pd.concat([df_new, pd.DataFrame(new_row)]) instead)
    if (row.Index == 0) & (row.time_started > daterange_start):
        df_new = df_new.append(pd.DataFrame(new_row))[1:]
    else:
        df_new = df_new.append(pd.DataFrame(new_row))
df_new.index.name = 'time_started'
df_new.reset_index(inplace=True)
Result:
>>> df_new
time_started Bedtime videos Commute Work
0 2021-10-25 00:00:00 0 days 01:00:00 NaT NaT NaT
1 2021-10-25 01:00:00 0 days 01:00:00 NaT NaT NaT
2 2021-10-25 02:00:00 0 days 01:00:00 NaT NaT NaT
3 2021-10-25 03:00:00 0 days 01:00:00 NaT NaT NaT
4 2021-10-25 04:00:00 0 days 01:00:00 NaT NaT NaT
5 2021-10-25 05:00:00 0 days 01:00:00 NaT NaT NaT
6 2021-10-25 06:00:00 0 days 01:00:00 NaT NaT NaT
7 2021-10-25 07:00:00 0 days 01:00:00 NaT NaT NaT
8 2021-10-25 08:00:00 0 days 00:25:42 NaT NaT NaT
9 2021-10-25 08:25:42 NaT NaT NaT NaT
10 2021-10-25 08:00:00 NaT 0 days 00:25:42 NaT NaT
11 2021-10-25 08:25:42 NaT 0 days 00:26:12 NaT NaT
12 2021-10-25 08:51:54 NaT NaT NaT NaT
13 2021-10-25 08:00:00 NaT NaT 0 days 00:51:54 NaT
14 2021-10-25 08:51:54 NaT NaT 0 days 00:08:06 NaT
15 2021-10-25 09:00:00 NaT NaT 0 days 00:29:34 NaT
16 2021-10-25 09:29:34 NaT NaT NaT NaT
17 2021-10-25 09:00:00 NaT NaT NaT 0 days 00:29:34
18 2021-10-25 09:29:34 NaT NaT NaT 0 days 00:30:26
19 2021-10-25 10:00:00 NaT NaT NaT 0 days 01:00:00
20 2021-10-25 11:00:00 NaT NaT NaT 0 days 01:00:00
21 2021-10-25 12:00:00 NaT NaT NaT 0 days 01:00:00
22 2021-10-25 13:00:00 NaT NaT NaT 0 days 01:00:00
23 2021-10-25 14:00:00 NaT NaT NaT 0 days 01:00:00
24 2021-10-25 15:00:00 NaT NaT NaT 0 days 01:00:00
25 2021-10-25 16:00:00 NaT NaT NaT 0 days 01:00:00
26 2021-10-25 17:00:00 NaT NaT NaT NaT
27 2021-10-25 17:00:00 NaT NaT 0 days 01:00:00 NaT
28 2021-10-25 18:00:00 NaT NaT 0 days 00:01:00 NaT
29 2021-10-25 18:01:00 NaT NaT NaT NaT
For each activity we created a new DataFrame, obtaining the differences between times with all_datetimes.to_series().diff().shift(-1), which means there is a NaT between each change in activity. These aren't useful, so we will drop any rows where the activities are all NaT.
We then drop duplicate timestamps in the time_started column and keep the first value of these duplicates, and take the floor of all timestamps in the time_started column:
df_new = df_new.dropna(subset=new_columns, how='all').drop_duplicates(subset=['time_started'], keep='first')
df_new['time_started'] = df_new['time_started'].apply(lambda x: x.floor('1h'))
Result:
>>> df_new
time_started Bedtime videos Commute Work
0 2021-10-25 00:00:00 0 days 01:00:00 NaT NaT NaT
1 2021-10-25 01:00:00 0 days 01:00:00 NaT NaT NaT
2 2021-10-25 02:00:00 0 days 01:00:00 NaT NaT NaT
3 2021-10-25 03:00:00 0 days 01:00:00 NaT NaT NaT
4 2021-10-25 04:00:00 0 days 01:00:00 NaT NaT NaT
5 2021-10-25 05:00:00 0 days 01:00:00 NaT NaT NaT
6 2021-10-25 06:00:00 0 days 01:00:00 NaT NaT NaT
7 2021-10-25 07:00:00 0 days 01:00:00 NaT NaT NaT
8 2021-10-25 08:00:00 0 days 00:25:42 NaT NaT NaT
11 2021-10-25 08:00:00 NaT 0 days 00:26:12 NaT NaT
14 2021-10-25 08:00:00 NaT NaT 0 days 00:08:06 NaT
15 2021-10-25 09:00:00 NaT NaT 0 days 00:29:34 NaT
18 2021-10-25 09:00:00 NaT NaT NaT 0 days 00:30:26
19 2021-10-25 10:00:00 NaT NaT NaT 0 days 01:00:00
20 2021-10-25 11:00:00 NaT NaT NaT 0 days 01:00:00
21 2021-10-25 12:00:00 NaT NaT NaT 0 days 01:00:00
22 2021-10-25 13:00:00 NaT NaT NaT 0 days 01:00:00
23 2021-10-25 14:00:00 NaT NaT NaT 0 days 01:00:00
24 2021-10-25 15:00:00 NaT NaT NaT 0 days 01:00:00
25 2021-10-25 16:00:00 NaT NaT NaT 0 days 01:00:00
27 2021-10-25 17:00:00 NaT NaT 0 days 01:00:00 NaT
28 2021-10-25 18:00:00 NaT NaT 0 days 00:01:00 NaT
Now we fill all NaT with pd.Timedelta("0s"), then we can groupby values in the time_started column and sum them together:
df_new = df_new.fillna(pd.Timedelta(0)).groupby("time_started").sum().reset_index()
Final result:
>>> df_new
time_started Bedtime videos Commute Work
0 2021-10-25 00:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
1 2021-10-25 01:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
2 2021-10-25 02:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
3 2021-10-25 03:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
4 2021-10-25 04:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
5 2021-10-25 05:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
6 2021-10-25 06:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
7 2021-10-25 07:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
8 2021-10-25 08:00:00 0 days 00:25:42 0 days 00:26:12 0 days 00:08:06 0 days 00:00:00
9 2021-10-25 09:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:29:34 0 days 00:30:26
10 2021-10-25 10:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
11 2021-10-25 11:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
12 2021-10-25 12:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
13 2021-10-25 13:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
14 2021-10-25 14:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
15 2021-10-25 15:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
16 2021-10-25 16:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
17 2021-10-25 17:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00 0 days 00:00:00
18 2021-10-25 18:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:01:00 0 days 00:00:00

Assigning first element of groupby to a column yields NaN

Why does this not work out?
I get the right results if I just print it, but if I use the same expression to assign to the df column, I get NaN values...
print(df.groupby('cumsum').first()['Date'])
cumsum
1 2021-01-05 11:00:00
2 2021-01-06 08:00:00
3 2021-01-06 10:00:00
4 2021-01-06 13:00:00
5 2021-01-06 14:00:00
...
557 2021-08-08 08:00:00
558 2021-08-08 09:00:00
559 2021-08-08 11:00:00
560 2021-08-08 13:00:00
561 2021-08-08 18:00:00
Name: Date, Length: 561, dtype: datetime64[ns]
vs
df["Date_First"] = df.groupby('cumsum').first()['Date']
Date
2021-01-01 00:00:00 NaT
2021-01-01 01:00:00 NaT
2021-01-01 02:00:00 NaT
2021-01-01 03:00:00 NaT
2021-01-01 04:00:00 NaT
..
2021-08-08 14:00:00 NaT
2021-08-08 15:00:00 NaT
2021-08-08 16:00:00 NaT
2021-08-08 17:00:00 NaT
2021-08-08 18:00:00 NaT
Name: Date_Last, Length: 5268, dtype: datetime64[ns]
What happens here?
I used an example from here, but want to get the first elements.
https://www.codeforests.com/2021/03/30/group-consecutive-rows-in-pandas/
What happens here?
If you use:
print(df.groupby('cumsum')['Date'].first())
#print(df.groupby('cumsum').first()['Date'])
the output is the values aggregated by the cumsum column with the aggregation function first.
So the index contains the unique cumsum values; if you assign that to a new column, there is a mismatch with the original index, and the output is NaNs.
The solution is to use GroupBy.transform, which broadcasts the aggregated values back to a Series (column) the same size as the original DataFrame, so the index matches the original and the assignment works perfectly:
df["Date_First"] = df.groupby('cumsum')['Date'].transform("first")

How to create a time matrix full of NaT in python?

I would like to create an empty 3D time matrix (with known size) that I will later populate in a loop with either a pd.DatetimeIndex or a list of pd.Timestamp. Is there a simple method?
This does not work:
timeMatrix = np.empty( shape=(100, 1000, 2) )
timeMatrix[:] = pd.NaT
I can do without the second line, but then the values in timeMatrix are meaningless numbers on the order of 10^18.
timeMatrix = np.empty( shape=(100, 1000, 2) )
for pressureLevel in levels:
    timeMatrix[ i_airport, 0:varyingNumberBelow1000, pressureLevel ] = dates_datetimeindex
Thank you
df = pd.DataFrame(index=range(10), columns=range(10), dtype="datetime64[ns]")
print(df)
Prints:
0 1 2 3 4 5 6 7 8 9
0 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
1 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
2 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
3 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
4 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
5 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
6 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
7 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
8 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
9 NaT NaT NaT NaT NaT NaT NaT NaT NaT NaT
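If a true 3D numpy array is needed rather than a DataFrame, a datetime64 array can be pre-filled with NaT directly; a minimal sketch:
import numpy as np

timeMatrix = np.full(shape=(100, 1000, 2), fill_value=np.datetime64('NaT'),
                     dtype='datetime64[ns]')
print(timeMatrix[0, 0, 0])  # NaT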

Iterating through datetime64 columns in pandas dataframe [duplicate]

I have two columns in a Pandas data frame that are dates.
I am looking to subtract one column from the other, with the result being the difference in number of days as an integer.
A peek at the data:
df_test.head(10)
Out[20]:
First_Date Second Date
0 2016-02-09 2015-11-19
1 2016-01-06 2015-11-30
2 NaT 2015-12-04
3 2016-01-06 2015-12-08
4 NaT 2015-12-09
5 2016-01-07 2015-12-11
6 NaT 2015-12-12
7 NaT 2015-12-14
8 2016-01-06 2015-12-14
9 NaT 2015-12-15
I have created a new column successfully with the difference:
df_test['Difference'] = df_test['First_Date'].sub(df_test['Second Date'], axis=0)
df_test.head()
Out[22]:
First_Date Second Date Difference
0 2016-02-09 2015-11-19 82 days
1 2016-01-06 2015-11-30 37 days
2 NaT 2015-12-04 NaT
3 2016-01-06 2015-12-08 29 days
4 NaT 2015-12-09 NaT
However I am unable to get a numeric version of the result:
df_test['Difference'] = df_test[['Difference']].apply(pd.to_numeric)
df_test.head()
Out[25]:
First_Date Second Date Difference
0 2016-02-09 2015-11-19 7.084800e+15
1 2016-01-06 2015-11-30 3.196800e+15
2 NaT 2015-12-04 NaN
3 2016-01-06 2015-12-08 2.505600e+15
4 NaT 2015-12-09 NaN
How about:
df_test['Difference'] = (df_test['First_Date'] - df_test['Second Date']).dt.days
This will return the difference as int if there are no missing values (NaT) and as float if there are.
Pandas has rich documentation on Time series / date functionality and Time deltas.
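If an integer dtype is needed even in the presence of missing values, recent pandas versions (1.0+) also offer the nullable Int64 dtype; a small sketch:
# missing differences become <NA> instead of forcing the column to float
df_test['Difference'] = (df_test['First_Date'] - df_test['Second Date']).dt.days.astype('Int64')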
You can divide a column of dtype timedelta by np.timedelta64(1, 'D'), but the output is not int but float, because of the NaN values:
df_test['Difference'] = df_test['Difference'] / np.timedelta64(1, 'D')
print (df_test)
First_Date Second Date Difference
0 2016-02-09 2015-11-19 82.0
1 2016-01-06 2015-11-30 37.0
2 NaT 2015-12-04 NaN
3 2016-01-06 2015-12-08 29.0
4 NaT 2015-12-09 NaN
5 2016-01-07 2015-12-11 27.0
6 NaT 2015-12-12 NaN
7 NaT 2015-12-14 NaN
8 2016-01-06 2015-12-14 23.0
9 NaT 2015-12-15 NaN
You can use the datetime module to help here. Also, as a side note, a simple date subtraction should work as below:
import datetime as dt
import numpy as np
import pandas as pd
#Assume we have df_test:
In [222]: df_test
Out[222]:
first_date second_date
0 2016-01-31 2015-11-19
1 2016-02-29 2015-11-20
2 2016-03-31 2015-11-21
3 2016-04-30 2015-11-22
4 2016-05-31 2015-11-23
5 2016-06-30 2015-11-24
6 NaT 2015-11-25
7 NaT 2015-11-26
8 2016-01-31 2015-11-27
9 NaT 2015-11-28
10 NaT 2015-11-29
11 NaT 2015-11-30
12 2016-04-30 2015-12-01
13 NaT 2015-12-02
14 NaT 2015-12-03
15 2016-04-30 2015-12-04
16 NaT 2015-12-05
17 NaT 2015-12-06
In [223]: df_test['Difference'] = df_test['first_date'] - df_test['second_date']
In [224]: df_test
Out[224]:
first_date second_date Difference
0 2016-01-31 2015-11-19 73 days
1 2016-02-29 2015-11-20 101 days
2 2016-03-31 2015-11-21 131 days
3 2016-04-30 2015-11-22 160 days
4 2016-05-31 2015-11-23 190 days
5 2016-06-30 2015-11-24 219 days
6 NaT 2015-11-25 NaT
7 NaT 2015-11-26 NaT
8 2016-01-31 2015-11-27 65 days
9 NaT 2015-11-28 NaT
10 NaT 2015-11-29 NaT
11 NaT 2015-11-30 NaT
12 2016-04-30 2015-12-01 151 days
13 NaT 2015-12-02 NaT
14 NaT 2015-12-03 NaT
15 2016-04-30 2015-12-04 148 days
16 NaT 2015-12-05 NaT
17 NaT 2015-12-06 NaT
Now, change type to datetime.timedelta, and then use the .days method on valid timedelta objects.
In [226]: df_test['Difference_days'] = df_test['Difference'].astype(dt.timedelta).map(lambda x: np.nan if pd.isnull(x) else x.days)
In [227]: df_test
Out[227]:
first_date second_date Difference Difference_days
0 2016-01-31 2015-11-19 73 days 73
1 2016-02-29 2015-11-20 101 days 101
2 2016-03-31 2015-11-21 131 days 131
3 2016-04-30 2015-11-22 160 days 160
4 2016-05-31 2015-11-23 190 days 190
5 2016-06-30 2015-11-24 219 days 219
6 NaT 2015-11-25 NaT NaN
7 NaT 2015-11-26 NaT NaN
8 2016-01-31 2015-11-27 65 days 65
9 NaT 2015-11-28 NaT NaN
10 NaT 2015-11-29 NaT NaN
11 NaT 2015-11-30 NaT NaN
12 2016-04-30 2015-12-01 151 days 151
13 NaT 2015-12-02 NaT NaN
14 NaT 2015-12-03 NaT NaN
15 2016-04-30 2015-12-04 148 days 148
16 NaT 2015-12-05 NaT NaN
17 NaT 2015-12-06 NaT NaN
Hope that helps.
I feel that the answers above do not handle dates that 'wrap' around a year. This is useful when proximity to a date, measured by day of year, matters. To do these row operations, I did the following (I used this in a business setting for renewing customer subscriptions).
def get_date_difference(row, x, y):
    try:
        # Calculating the smallest date difference between the start and the close date.
        # There's some tricky logic in here for determining the date difference
        # the other way around (Dec -> Jan is 1 month rather than 11)
        sub_start_date = int(row[x].strftime('%j'))  # day of year (1-366)
        close_date = int(row[y].strftime('%j'))      # day of year (1-366)
        later_date_of_year = max(sub_start_date, close_date)
        earlier_date_of_year = min(sub_start_date, close_date)
        days_diff = later_date_of_year - earlier_date_of_year
        # Calculates the difference going across the next year (December -> Jan)
        days_diff_reversed = (365 - later_date_of_year) + earlier_date_of_year
        return min(days_diff, days_diff_reversed)
    except ValueError:
        return None
The function can then be applied as:
dfAC_Renew['date_difference'] = dfAC_Renew.apply(get_date_difference, x = 'customer_since_date', y = 'renewal_date', axis = 1)
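The same wrap-around logic can also be vectorized with .dt.dayofyear instead of apply; a sketch, assuming the same column names as above:
import numpy as np

doy_start = dfAC_Renew['customer_since_date'].dt.dayofyear
doy_close = dfAC_Renew['renewal_date'].dt.dayofyear
plain_diff = (doy_start - doy_close).abs()
# take the shorter way around the year (Dec -> Jan is 1 day apart, not 364)
dfAC_Renew['date_difference'] = np.minimum(plain_diff, 365 - plain_diff)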
Create a vectorized method
import numpy as np
from pandas.tseries.frequencies import to_offset  # needed for to_offset below

def calc_xb_minus_xa(df):
    time_dict = {
        '<Minute>': 'm',
        '<Hour>': 'h',
        '<Day>': 'D',
        '<Week>': 'W',
        '<Month>': 'M',
        '<Year>': 'Y'
    }
    time_delta = df.at[df.index[0], 'end_time'] - df.at[df.index[0], 'open_time']
    offset_base_name = str(to_offset(time_delta).base)
    time_term = time_dict.get(offset_base_name)
    result = (df.end_time - df.open_time) / np.timedelta64(1, time_term)
    return result
Then in your df do:
df['x'] = calc_xb_minus_xa(df)
This will work for minutes, hours, days, weeks, months and years.
open_time and end_time need to change according to your df.
